Title | Start page | (Length in pages)
List of Figures | x
List of Tables | xiii
List of Examples | xv
Preface | xix
1 Introduction to GPU Kernels and Hardware | 1 | (21)
  | 1 | (1)
  | 2 | (8)
  | 10 | (1)
  | 11 | (1)
  1.5 CPU Memory Management: Latency Hiding Using Caches | 12 | (1)
  1.6 CPU: Parallel Instruction Set | 13 | (1)
  | 14 | (1)
  | 15 | (1)
  | 16 | (2)
  | 18 | (1)
  | 19 | (1)
  | 20 | (2)
2 Thinking and Coding in Parallel | 22 | (50)
  | 22 | (8)
  | 30 | (1)
  | 31 | (6)
  2.4 Latency Hiding and Occupancy | 37 | (2)
  | 39 | (1)
  | 40 | (11)
  | 51 | (2)
  2.8 Matrix Multiplication | 53 | (8)
  2.9 Tiled Matrix Multiplication | 61 | (4)
  | 65 | (7)
3 Warps and Cooperative Groups | 72 | (34)
  3.1 CUDA Objects in Cooperative Groups | 75 | (5)
  | 80 | (5)
  | 85 | (4)
  3.4 Warp-Level Intrinsic Functions and Sub-warps | 89 | (1)
  3.5 Thread Divergence and Synchronisation | 90 | (2)
  | 92 | (4)
  | 96 | (7)
  | 103 | (3)
4 Parallel Stencils | 106 | (36)
  | 106 | (12)
  4.2 Cascaded Calculation of 2D Stencils | 118 | (5)
  | 123 | (3)
  4.4 Digital Image Processing | 126 | (8)
  | 134 | (1)
  | 135 | (7)
5 Textures | 142 | (36)
  | 143 | (1)
  | 144 | (2)
  | 146 | (1)
  | 147 | (4)
  | 151 | (5)
  | 156 | (1)
  | 157 | (4)
  5.8 Affine Transformations of Volumetric Images | 161 | (6)
  5.9 3D Image Registration | 167 | (8)
  5.10 Image Registration Results | 175 | (3)
6 Monte Carlo Applications | 178 | (31)
  | 178 | (7)
  | 185 | (11)
  6.3 Generating Other Distributions | 196 | (2)
  | 198 | (11)
7 Concurrency Using CUDA Streams and Events | 209 | (30)
  7.1 Concurrent Kernel Execution | 209 | (2)
  7.2 CUDA Pipeline Example | 211 | (4)
  7.3 Thrust and cudaDeviceReset | 215 | (1)
  7.4 Results from the Pipeline Example | 216 | (2)
  | 218 | (7)
  | 225 | (8)
  | 233 | (6)
8 Application to PET Scanners | 239 | (54)
  | 239 | (2)
  8.2 Data Storage and Definition of Scanner Geometry | 241 | (6)
  8.3 Simulating a PET Scanner | 247 | (12)
  8.4 Building the System Matrix | 259 | (3)
  | 262 | (4)
  | 266 | (2)
  8.7 Implementation of OSEM | 268 | (2)
  8.8 Depth of Interaction (DOI) | 270 | (3)
  8.9 PET Results Using DOI | 273 | (1)
  | 274 | (12)
  8.11 Richardson-Lucy Image Deblurring | 286 | (7)
9 Scaling Up | 293 | (32)
  | 295 | (3)
  9.2 CUDA Unified Virtual Addressing (UVA) | 298 | (1)
  9.3 Peer-to-Peer Access in CUDA | 299 | (2)
  9.4 CUDA Zero-Copy Memory | 301 | (1)
  | 302 | (11)
  9.6 A Brief Introduction to MPI | 313 | (12)
10 Tools for Profiling and Debugging | 325 | (33)
  | 325 | (5)
  10.2 Profiling with nvprof | 330 | (3)
  10.3 Profiling with the NVIDIA Visual Profiler (NVVP) | 333 | (3)
  | 336 | (2)
  | 338 | (1)
  10.6 Nsight Compute Sections | 339 | (8)
  10.7 Debugging with Printf | 347 | (2)
  10.8 Debugging with Microsoft Visual Studio | 349 | (3)
  10.9 Debugging Kernel Code | 352 | (2)
  | 354 | (4)
11 Tensor Cores | 358 | (15)
  11.1 Tensor Cores and FP16 | 358 | (2)
  11.2 Warp Matrix Functions | 360 | (5)
  11.3 Supported Data Types | 365 | (1)
  11.4 Tensor Core Reduction | 366 | (5)
  | 371 | (2)
Appendix A A Brief History of CUDA | 373 | (9)
Appendix B Atomic Operations | 382 | (5)
Appendix C The NVCC Compiler | 387 | (6)
Appendix D AVX and the Intel Compiler | 393 | (9)
Appendix E Number Formats | 402 | (4)
Appendix F CUDA Documentation and Libraries | 406 | (4)
Appendix G The CX Header Files | 410 | (25)
Appendix H AI and Python | 435 | (3)
Appendix I Topics in C++ | 438 | (10)
Index | 448