Preface |
|
xi | |
Acknowledgments |
|
xvii | |
Dedication |
|
xix | |
|
|
1 | (20) |
|
GPUs as Parallel Computers |
|
|
2 | (6) |
|
Architecture of a Modern GPU |
|
|
8 | (2) |
|
Why More Speed or Parallelism? |
|
|
10 | (3) |
|
Parallel Programming Languages and Models |
|
|
13 | (2) |
|
|
15 | (1) |
|
|
16 | (5) |
|
|
21 | (18) |
|
Evolution of Graphics Pipelines |
|
|
21 | (11) |
|
The Era of Fixed-Function Graphics Pipelines |
|
|
22 | (4) |
|
Evolution of Programmable Real-Time Graphics |
|
|
26 | (3) |
|
Unified Graphics and Computing Processors |
|
|
29 | (2) |
|
GPGPU: An Intermediate Step |
|
|
31 | (1) |
|
|
32 | (2) |
|
|
33 | (1) |
|
|
34 | (1) |
|
|
34 | (5) |
|
|
39 | (20) |
|
|
39 | (2) |
|
|
41 | (1) |
|
A Matrix-Matrix Multiplication Example |
|
|
42 | (4) |
|
Device Memories and Data Transfer |
|
|
46 | (5) |
|
Kernel Functions and Threading |
|
|
51 | (5) |
|
|
56 | (3) |
|
|
56 | (1) |
|
|
56 | (1) |
|
|
56 | (1) |
|
|
57 | (2) |
|
|
59 | (18) |
|
|
59 | (5) |
|
Using blockIdx and threadIdx |
|
|
64 | (4) |
|
Synchronization and Transparent Scalability |
|
|
68 | (2) |
|
|
70 | (1) |
|
Thread Scheduling and Latency Tolerance |
|
|
71 | (3) |
|
|
74 | (1) |
|
|
74 | (3) |
|
|
77 | (18) |
|
Importance of Memory Access Efficiency |
|
|
78 | (1) |
|
|
79 | (4) |
|
A Strategy for Reducing Global Memory Traffic |
|
|
83 | (7) |
|
Memory as a Limiting Factor to Parallelism |
|
|
90 | (2) |
|
|
92 | (1) |
|
|
93 | (2) |
|
Performance Considerations |
|
|
95 | (30) |
|
|
96 | (7) |
|
|
103 | (8) |
|
Dynamic Partitioning of SM Resources |
|
|
111 | (2) |
|
|
113 | (2) |
|
|
115 | (1) |
|
|
116 | (2) |
|
Measured Performance and Summary |
|
|
118 | (2) |
|
|
120 | (5) |
|
Floating Point Considerations |
|
|
125 | (16) |
|
|
126 | (3) |
|
Normalized Representation of M |
|
|
126 | (1) |
|
|
127 | (2) |
|
|
129 | (5) |
|
Special Bit Patterns and Precision |
|
|
134 | (1) |
|
Arithmetic Accuracy and Rounding |
|
|
135 | (1) |
|
|
136 | (2) |
|
|
138 | (1) |
|
|
138 | (3) |
|
Application Case Study: Advanced MRI Reconstruction |
|
|
141 | (32) |
|
|
142 | (2) |
|
|
144 | (4) |
|
|
148 | (19) |
|
Determine the Kernel Parallelism Structure |
|
|
149 | (7) |
|
Getting Around the Memory Bandwidth Limitation |
|
|
156 | (7) |
|
Using Hardware Trigonometry Functions |
|
|
163 | (3) |
|
Experimental Performance Tuning |
|
|
166 | (1) |
|
|
167 | (3) |
|
|
170 | (3) |
|
Application Case Study: Molecular Visualization and Analysis |
|
|
173 | (18) |
|
|
174 | (2) |
|
A Simple Kernel Implementation |
|
|
176 | (4) |
|
Instruction Execution Efficiency |
|
|
180 | (2) |
|
|
182 | (3) |
|
Additional Performance Comparisons |
|
|
185 | (2) |
|
|
187 | (1) |
|
|
188 | (3) |
|
Parallel Programming and Computational Thinking |
|
|
191 | (14) |
|
Goals of Parallel Programming |
|
|
192 | (1) |
|
|
193 | (3) |
|
|
196 | (6) |
|
|
202 | (2) |
|
|
204 | (1) |
|
A Brief Introduction to OpenCL™ |
|
|
205 | (16) |
|
|
205 | (2) |
|
|
207 | (2) |
|
|
209 | (2) |
|
|
211 | (1) |
|
Device Management and Kernel Launch |
|
|
212 | (2) |
|
Electrostatic Potential Map in OpenCL |
|
|
214 | (5) |
|
|
219 | (1) |
|
|
220 | (1) |
|
Conclusion and Future Outlook |
|
|
221 | (12) |
|
|
221 | (2) |
|
Memory Architecture Evolution |
|
|
223 | (4) |
|
Large Virtual and Physical Address Spaces |
|
|
223 | (1) |
|
Unified Device Memory Space |
|
|
224 | (1) |
|
Configurable Caching and Scratch Pad |
|
|
225 | (1) |
|
Enhanced Atomic Operations |
|
|
226 | (1) |
|
Enhanced Global Memory Access |
|
|
226 | (1) |
|
Kernel Execution Control Evolution |
|
|
227 | (2) |
|
Function Calls within Kernel Functions |
|
|
227 | (1) |
|
Exception Handling in Kernel Functions |
|
|
227 | (1) |
|
Simultaneous Execution of Multiple Kernels |
|
|
228 | (1) |
|
|
228 | (1) |
|
|
229 | (1) |
|
|
229 | (1) |
|
Better Control Flow Efficiency |
|
|
229 | (1) |
|
|
230 | (1) |
|
|
230 | (3) |
|
APPENDIX A MATRIX MULTIPLICATION HOST-ONLY VERSION SOURCE CODE |
|
|
233 | (12) |
|
|
233 | (4) |
|
|
237 | (1) |
|
|
238 | (1) |
|
|
239 | (4) |
|
|
243 | (2) |
|
APPENDIX B GPU COMPUTE CAPABILITIES |
|
|
245 | (6) |
|
GPU Compute Capability Tables |
|
|
245 | (1) |
|
Memory Coalescing Variations |
|
|
246 | (5) |
Index |
|
251 | |