Preface  xv
Acknowledgements  xxi
|
|
Chapter 1 Introduction  1
1.1 Heterogeneous Parallel Computing  2
1.2 Architecture of a Modern GPU  6
1.3 Why More Speed or Parallelism?  8
1.4 Speeding Up Real Applications  10
1.5 Challenges in Parallel Programming  12
1.6 Parallel Programming Languages and Models  12
1.7 Overarching Goals  14
1.8 Organization of the Book  15
References  18
|
Chapter 2 Data Parallel Computing  19
2.1 Data Parallelism  20
2.2 CUDA C Program Structure  22
2.3 A Vector Addition Kernel  25
2.4 Device Global Memory and Data Transfer  27
2.5 Kernel Functions and Threading  32
2.6 Kernel Launch  37
2.7 Summary  38
    Function Declarations  38
    Kernel Launch  38
    Built-in (Predefined) Variables  39
    Run-time API  39
2.8 Exercises  39
References  41
|
Chapter 3 Scalable Parallel Execution  43
3.1 CUDA Thread Organization  43
3.2 Mapping Threads to Multidimensional Data  47
3.3 Image Blur: A More Complex Kernel  54
3.4 Synchronization and Transparent Scalability  58
3.5 Resource Assignment  60
3.6 Querying Device Properties  61
3.7 Thread Scheduling and Latency Tolerance  64
3.8 Summary  67
3.9 Exercises  67
|
Chapter 4 Memory and Data Locality  71
4.1 Importance of Memory Access Efficiency  72
4.2 Matrix Multiplication  73
4.3 CUDA Memory Types  77
4.4 Tiling for Reduced Memory Traffic  84
4.5 A Tiled Matrix Multiplication Kernel  90
4.6 Boundary Checks  94
4.7 Memory as a Limiting Factor to Parallelism  97
4.8 Summary  99
4.9 Exercises  100
|
Chapter 5 Performance Considerations  103
5.1 Global Memory Bandwidth  104
5.2 More on Memory Parallelism  112
5.3 Warps and SIMD Hardware  117
5.4 Dynamic Partitioning of Resources  125
5.5 Thread Granularity  127
5.6 Summary  128
5.7 Exercises  128
References  130
|
Chapter 6 Numerical Considerations  131
6.1 Floating-Point Data Representation  132
    Normalized Representation of M  132
    Excess Encoding of E  133
6.2 Representable Numbers  134
6.3 Special Bit Patterns and Precision in IEEE Format  138
6.4 Arithmetic Accuracy and Rounding  139
6.5 Algorithm Considerations  140
6.6 Linear Solvers and Numerical Stability  142
6.7 Summary  146
6.8 Exercises  147
References  147
|
Chapter 7 Parallel Patterns: Convolution  149
7.1 Background  150
7.2 1D Parallel Convolution---A Basic Algorithm  153
7.3 Constant Memory and Caching  156
7.4 Tiled 1D Convolution with Halo Cells  160
7.5 A Simpler Tiled 1D Convolution---General Caching  165
7.6 Tiled 2D Convolution with Halo Cells  166
7.7 Summary  172
7.8 Exercises  173
|
Chapter 8 Parallel Patterns: Prefix Sum  175
8.1 Background  176
8.2 A Simple Parallel Scan  177
8.3 Speed and Work Efficiency  181
8.4 A More Work-Efficient Parallel Scan  183
8.5 An Even More Work-Efficient Parallel Scan  187
8.6 Hierarchical Parallel Scan for Arbitrary-Length Inputs  189
8.7 Single-Pass Scan for Memory Access Efficiency  192
8.8 Summary  195
8.9 Exercises  195
References  196
|
Chapter 9 Parallel Patterns---Parallel Histogram Computation  199
9.1 Background  200
9.2 Use of Atomic Operations  202
9.3 Block versus Interleaved Partitioning  206
9.4 Latency versus Throughput of Atomic Operations  207
9.5 Atomic Operation in Cache Memory  210
9.6 Privatization  210
9.7 Aggregation  211
9.8 Summary  213
9.9 Exercises  213
References  214
|
Chapter 10 Parallel Patterns: Sparse Matrix Computation  215
10.1 Background  216
10.2 Parallel SpMV Using CSR  219
10.3 Padding and Transposition  221
10.4 Using a Hybrid Approach to Regulate Padding  224
10.5 Sorting and Partitioning for Regularization  227
10.6 Summary  229
10.7 Exercises  229
References  230
|
Chapter 11 Parallel Patterns: Merge Sort  231
11.1 Background  231
11.2 A Sequential Merge Algorithm  233
11.3 A Parallelization Approach  234
11.4 Co-Rank Function Implementation  236
11.5 A Basic Parallel Merge Kernel  241
11.6 A Tiled Merge Kernel  242
11.7 A Circular-Buffer Merge Kernel  249
11.8 Summary  256
11.9 Exercises  256
References  256
|
Chapter 12 Parallel Patterns: Graph Search  257
12.1 Background  258
12.2 Breadth-First Search  260
12.3 A Sequential BFS Function  262
12.4 A Parallel BFS Function  265
12.5 Optimizations  270
|
|
    Memory Bandwidth  270
    Hierarchical Queues  271
    Kernel Launch Overhead  272
    Load Balance  273
|
|
12.6 Summary  273
12.7 Exercises  273
References  274
|
Chapter 13 CUDA Dynamic Parallelism  275
13.1 Background  276
13.2 Dynamic Parallelism Overview  278
13.3 A Simple Example  279
13.4 Memory Data Visibility  281
    Global Memory  281
    Zero-Copy Memory  282
    Constant Memory  282
    Local Memory  282
    Shared Memory  283
    Texture Memory  283
13.5 Configurations and Memory Management  283
    Launch Environment Configuration  283
    Memory Allocation and Lifetime  283
    Nesting Depth  284
    Pending Launch Pool Configuration  284
    Errors and Launch Failures  284
13.6 Synchronization, Streams, and Events  285
|
|
    Synchronization Scope  285
    Synchronization Depth  285
    Streams  286
    Events  287
|
13.7 A More Complex Example  287
    Linear Bezier Curves  288
    Quadratic Bezier Curves  288
    Bezier Curve Calculation (Without Dynamic Parallelism)  288
    Bezier Curve Calculation (With Dynamic Parallelism)  290
    Launch Pool Size  292
    Streams  292
|
|
13.8 A Recursive Example  293
13.9 Summary  297
13.10 Exercises  299
References  301
A13.1 Code Appendix  301
|
Chapter 14 Application Case Study---Non-Cartesian Magnetic Resonance Imaging  305
14.1 Background  306
14.2 Iterative Reconstruction  308
14.3 Computing FHD  310
    Step 1: Determine the Kernel Parallelism Structure  312
    Step 2: Getting Around the Memory Bandwidth Limitation  317
    Step 3: Using Hardware Trigonometry Functions  323
    Step 4: Experimental Performance Tuning  326
14.4 Final Evaluation  327
14.5 Exercises  328
References  329
|
Chapter 15 Application Case Study---Molecular Visualization and Analysis  331
15.1 Background  332
15.2 A Simple Kernel Implementation  333
15.3 Thread Granularity Adjustment  337
15.4 Memory Coalescing  338
15.5 Summary  342
15.6 Exercises  343
References  344
|
Chapter 16 Application Case Study---Machine Learning  345
16.1 Background  346
16.2 Convolutional Neural Networks  347
    ConvNets: Basic Layers  348
    ConvNets: Backpropagation  351
16.3 Convolutional Layer: A Basic CUDA Implementation of Forward Propagation  355
16.4 Reduction of Convolutional Layer to Matrix Multiplication  359
16.5 cuDNN Library  364
16.6 Exercises  366
References  367
|
Chapter 17 Parallel Programming and Computational Thinking  369
17.1 Goals of Parallel Computing  370
17.2 Problem Decomposition  371
17.3 Algorithm Selection  374
17.4 Computational Thinking  379
17.5 Single Program, Multiple Data, Shared Memory and Locality  380
17.6 Strategies for Computational Thinking  382
17.7 A Hypothetical Example: Sodium Map of the Brain  383
17.8 Summary  386
17.9 Exercises  386
References  386
|
Chapter 18 Programming a Heterogeneous Computing Cluster  387
18.1 Background  388
18.2 A Running Example  388
18.3 Message Passing Interface Basics  391
18.4 Message Passing Interface Point-to-Point Communication  393
18.5 Overlapping Computation and Communication  400
18.6 Message Passing Interface Collective Communication  408
18.7 CUDA-Aware Message Passing Interface  409
18.8 Summary  410
18.9 Exercises  410
References  411
|
Chapter 19 Parallel Programming with OpenACC  413
19.1 The OpenACC Execution Model  414
19.2 OpenACC Directive Format  416
|
|
19.3 OpenACC by Example  418
    The OpenACC Kernels Directive  419
    The OpenACC Parallel Directive  422
    Comparison of Kernels and Parallel Directives  424
    OpenACC Data Directives  425
    OpenACC Loop Optimizations  430
    OpenACC Routine Directive  432
    Asynchronous Computation and Data  434
19.4 Comparing OpenACC and CUDA  435
    Portability  435
    Performance  436
    Simplicity  436
19.5 Interoperability with CUDA and Libraries  437
    Calling CUDA or Libraries with OpenACC Arrays  437
    Using CUDA Pointers in OpenACC  438
    Calling CUDA Device Kernels from OpenACC  439
19.6 The Future of OpenACC  440
19.7 Exercises  441
|
Chapter 20 More on CUDA and Graphics Processing Unit Computing  443
20.1 Model of Host/Device Interaction  444
20.2 Kernel Execution Control  449
20.3 Memory Bandwidth and Compute Throughput  451
20.4 Programming Environment  453
20.5 Future Outlook  455
References  456
|
Chapter 21 Conclusion and Outlook  457
21.1 Goals Revisited  457
21.2 Future Outlook  458
Appendix A An Introduction to OpenCL  461
Appendix B Thrust: A Productivity-Oriented Library for CUDA  475
Appendix C CUDA Fortran  493
Appendix D An Introduction to C++ AMP  515
Index  535