Foreword  xv
Preface  xvii
Acknowledgments  xxvii

Chapter 1  Introduction  1
    1.1 Heterogeneous parallel computing  3
    1.2 Why more speed or parallelism?  7
    1.3 Speeding up real applications  9
    1.4 Challenges in parallel programming  11
    1.5 Related parallel programming interfaces  13
    14
    1.7 Organization of the book  15
    19
|
Part I Fundamental Concepts |
|
|
|
Chapter 2  Heterogeneous data parallel computing  23
    23
    2.2 CUDA C program structure  27
    2.3 A vector addition kernel  28
    2.4 Device global memory and data transfer  31
    2.5 Kernel functions and threading  35
    2.6 Calling kernel functions  40
    42
    43
    44
    46
|
Chapter 3  Multidimensional grids and data  47
    3.1 Multidimensional grid organization  47
    3.2 Mapping threads to multidimensional data  51
    3.3 Image blur: a more complex kernel  58
    3.4 Matrix multiplication  62
    66
    67
|
Chapter 4  Compute architecture and scheduling  69
    4.1 Architecture of a modern GPU  70
    70
    4.3 Synchronization and transparent scalability  71
    4.4 Warps and SIMD hardware  74
    79
    4.6 Warp scheduling and latency tolerance  83
    4.7 Resource partitioning and occupancy  85
    4.8 Querying device properties  87
    90
    90
    92
|
Chapter 5  Memory architecture and data locality  93
    5.1 Importance of memory access efficiency  94
    96
    5.3 Tiling for reduced memory traffic  103
    5.4 A tiled matrix multiplication kernel  107
    112
    5.6 Impact of memory usage on occupancy  115
    118
    119
|
Chapter 6  Performance considerations  123
    124
    6.2 Hiding memory latency  133
    138
    6.4 A checklist of optimizations  141
    6.5 Knowing your computation's bottleneck  145
    146
    146
    147
|
Part II Parallel Patterns |
|
|
|
Chapter 7  Convolution: An introduction to constant memory and caching  151
    152
    7.2 Parallel convolution: a basic algorithm  156
    7.3 Constant memory and caching  159
    7.4 Tiled convolution with halo cells  163
    7.5 Tiled convolution using caches for halo cells  168
    170
    171
|
|
Chapter 8  Stencil  173
    174
    8.2 Parallel stencil: a basic algorithm  178
    8.3 Shared memory tiling for stencil sweep  179
    183
    186
    188
    188
|
Chapter 9  Parallel histogram  191
    192
    9.2 Atomic operations and a basic histogram kernel  194
    9.3 Latency and throughput of atomic operations  198
    200
    203
    206
    208
    209
    210
|
Chapter 10  Reduction: And minimizing divergence  211
    211
    213
    10.3 A simple reduction kernel  217
    10.4 Minimizing control divergence  219
    10.5 Minimizing memory divergence  223
    10.6 Minimizing global memory accesses  225
    10.7 Hierarchical reduction for arbitrary input length  226
    10.8 Thread coarsening for reduced overhead  228
    231
    232
|
Chapter 11  Prefix sum (scan): An introduction to work efficiency in parallel algorithms  235
    236
    11.2 Parallel scan with the Kogge-Stone algorithm  238
    11.3 Speed and work efficiency consideration  244
    11.4 Parallel scan with the Brent-Kung algorithm  246
    11.5 Coarsening for even more work efficiency  251
    11.6 Segmented parallel scan for arbitrary-length inputs  253
    11.7 Single-pass scan for memory access efficiency  256
    259
    260
    261
|
Chapter 12  Merge: An introduction to dynamic input data identification  263
    263
    12.2 A sequential merge algorithm  265
    12.3 A parallelization approach  266
    12.4 Co-rank function implementation  268
    12.5 A basic parallel merge kernel  273
    12.6 A tiled merge kernel to improve coalescing  275
    12.7 A circular buffer merge kernel  282
    12.8 Thread coarsening for merge  288
    288
    289
    289
|
Part III Advanced Patterns and Applications |
|
|
|
|
Chapter 13  Sorting  293
    294
    295
    296
    13.4 Optimizing for memory coalescing  300
    13.5 Choice of radix value  302
    13.6 Thread coarsening to improve coalescing  305
    306
    13.8 Other parallel sort methods  308
    309
    310
    310
|
Chapter 14  Sparse matrix computation  311
    312
    14.2 A simple SpMV kernel with the COO format  314
    14.3 Grouping row nonzeros with the CSR format  317
    14.4 Improving memory coalescing with the ELL format  320
    14.5 Regulating padding with the hybrid ELL-COO format  324
    14.6 Reducing control divergence with the JDS format  325
    328
    329
    329
|
Chapter 15  Graph traversal  331
    332
    15.2 Breadth-first search  335
    15.3 Vertex-centric parallelization of breadth-first search  338
    15.4 Edge-centric parallelization of breadth-first search  343
    15.5 Improving efficiency with frontiers  345
    15.6 Reducing contention with privatization  348
    350
    352
    353
    354
|
|
Chapter 16  Deep learning  355
    356
    16.2 Convolutional neural networks  366
    16.3 Convolutional layer: a CUDA inference kernel  376
    16.4 Formulating a convolutional layer as GEMM  379
    385
    387
    388
    388
|
Chapter 17  Iterative magnetic resonance imaging reconstruction  391
    391
    17.2 Iterative reconstruction  394
    396
    412
    413
    414
|
Chapter 18  Electrostatic potential map  415
    415
    18.2 Scatter versus gather in kernel design  417
    422
    424
    18.5 Cutoff binning for data size scalability  425
    430
    431
    431
|
Chapter 19  Parallel programming and computational thinking  433
    19.1 Goals of parallel computing  433
    436
    19.3 Problem decomposition  440
    19.4 Computational thinking  444
    446
    446
|
Part IV Advanced Practices |
|
|
|
Chapter 20  Programming a heterogeneous computing cluster: An introduction to CUDA streams  449
    449
    450
    20.3 Message passing interface basics  452
    20.4 Message passing interface point-to-point communication  455
    20.5 Overlapping computation and communication  462
    20.6 Message passing interface collective communication  470
    20.7 CUDA-aware message passing interface  471
    472
    472
    473
|
Chapter 21  CUDA dynamic parallelism  475
    476
    21.2 Dynamic parallelism overview  478
    21.3 An example: Bezier curves  481
    21.4 A recursive example: quadtrees  484
    21.5 Important considerations  490
    492
    493
    A21.1 Support code for quadtree example  495
    497
|
Chapter 22  Advanced practices and future evolution  499
    22.1 Model of host/device interaction  500
    22.2 Kernel execution control  505
    22.3 Memory bandwidth and compute throughput  508
    22.4 Programming environment  510
    513
    513
|
Chapter 23  Conclusion and outlook  515
    515
    516

Appendix A  Numerical considerations  519
Index  537