Preface    xxi
Acknowledgments    xxiii
About the Author    xxv
|
|
PART I    1

Chapter 1  Background    3
  1.1  Our Approach    5
  1.2  Code    6
    1.2.1  Microbenchmarks    6
    1.2.2  Microdemos    7
    1.2.3  Optimization Journeys    7
  1.3  Administrative Items    7
    1.3.1  Open Source    7
    1.3.2  CUDA Handbook Library (chLib)    8
    1.3.3  Coding Style    8
    1.3.4  CUDA SDK    8
  1.4  Road Map    8
|
Chapter 2  Hardware Architecture    11
  2.1  CPU Configurations    11
    2.1.1  Front-Side Bus    12
    2.1.2  Symmetric Multiprocessors    13
    2.1.3  Nonuniform Memory Access    14
    2.1.4  PCI Express Integration    17
  2.2  Integrated GPUs    17
  2.3  Multiple GPUs    19
  2.4  Address Spaces in CUDA    22
    2.4.1  Virtual Addressing: A Brief History    22
    2.4.2  Disjoint Address Spaces    26
    2.4.3  Mapped Pinned Memory    28
    2.4.4  Portable Pinned Memory    29
    2.4.5  Unified Addressing    30
    2.4.6  Peer-to-Peer Mappings    31
  2.5  CPU/GPU Interactions    32
    2.5.1  Pinned Host Memory and Command Buffers    32
    2.5.2  CPU/GPU Concurrency    35
    2.5.3  The Host Interface and Intra-GPU Synchronization    39
    2.5.4  Inter-GPU Synchronization    41
  2.6  GPU Architecture    41
    2.6.1  Overview    42
    2.6.2  Streaming Multiprocessors    46
  2.7  Further Reading    50
|
Chapter 3  Software Architecture    51
  3.1  Software Layers    51
    3.1.1  CUDA Runtime and Driver    53
    3.1.2  Driver Models    54
    3.1.3  NVCC, PTX, and Microcode    57
  3.2  Devices and Initialization    59
    3.2.1  Device Count    60
    3.2.2  Device Attributes    60
    3.2.3  When CUDA Is Not Present    63
  3.3  Contexts    67
    3.3.1  Lifetime and Scoping    68
    3.3.2  Preallocation of Resources    68
    3.3.3  Address Space    69
    3.3.4  Current Context Stack    69
    3.3.5  Context State    71
  3.4  Modules and Functions    71
  3.5  Kernels (Functions)    73
  3.6  Device Memory    75
  3.7  Streams and Events    76
    3.7.1  Software Pipelining    76
    3.7.2  Stream Callbacks    77
    3.7.3  The NULL Stream    77
    3.7.4  Events    78
  3.8  Host Memory    79
    3.8.1  Pinned Host Memory    80
    3.8.2  Portable Pinned Memory    81
    3.8.3  Mapped Pinned Memory    81
    3.8.4  Host Memory Registration    81
  3.9  CUDA Arrays and Texturing    82
    3.9.1  Texture References    82
    3.9.2  Surface References    85
  3.10  Graphics Interoperability    86
  3.11  The CUDA Runtime and CUDA Driver API    87
|
Chapter 4  Software Environment    93
  4.1  nvcc: CUDA Compiler Driver    93
  4.2  ptxas: the PTX Assembler    100
  4.3  cuobjdump    105
  4.4  nvidia-smi    106
  4.5  Amazon Web Services    109
    4.5.1  Command-Line Tools    110
    4.5.2  EC2 and Virtualization    110
    4.5.3  Key Pairs    111
    4.5.4  Availability Zones (AZs) and Regions    112
    4.5.5  S3    112
    4.5.6  EBS    113
    4.5.7  AMIs    113
    4.5.8  Linux on EC2    114
    4.5.9  Windows on EC2    115
|
PART II    119

Chapter 5  Memory    121
  5.1  Host Memory    122
    5.1.1  Allocating Pinned Memory    122
    5.1.2  Portable Pinned Memory    123
    5.1.3  Mapped Pinned Memory    124
    5.1.4  Write-Combined Pinned Memory    124
    5.1.5  Registering Pinned Memory    125
    5.1.6  Pinned Memory and UVA    126
    5.1.7  Mapped Pinned Memory Usage    127
    5.1.8  NUMA, Thread Affinity, and Pinned Memory    128
  5.2  Global Memory    130
    5.2.1  Pointers    131
    5.2.2  Dynamic Allocations    132
    5.2.3  Querying the Amount of Global Memory    137
    5.2.4  Static Allocations    138
    5.2.5  Memset APIs    139
    5.2.6  Pointer Queries    140
    5.2.7  Peer-to-Peer Access    143
    5.2.8  Reading and Writing Global Memory    143
    5.2.9  Coalescing Constraints    143
    5.2.10  Microbenchmarks: Peak Memory Bandwidth    147
    5.2.11  Atomic Operations    152
    5.2.12  Texturing from Global Memory    155
    5.2.13  ECC (Error Correcting Codes)    155
  5.3  Constant Memory    156
    5.3.1  Host and Device __constant__ Memory    157
    5.3.2  Accessing __constant__ Memory    157
  5.4  Local Memory    158
  5.5  Texture Memory    162
  5.6  Shared Memory    162
    5.6.1  Unsized Shared Memory Declarations    163
    5.6.2  Warp-Synchronous Coding    164
    5.6.3  Pointers to Shared Memory    164
  5.7  Memory Copy    164
    5.7.1  Synchronous versus Asynchronous Memcpy    165
    5.7.2  Unified Virtual Addressing    166
    5.7.3  CUDA Runtime    166
    5.7.4  Driver API    169
|
Chapter 6  Streams and Events    173
  6.1  CPU/GPU Concurrency: Covering Driver Overhead    174
    6.1.1  Kernel Launches    174
  6.2  Asynchronous Memcpy    178
    6.2.1  Asynchronous Memcpy: Host→Device    179
    6.2.2  Asynchronous Memcpy: Device→Host    181
    6.2.3  The NULL Stream and Concurrency Breaks    181
  6.3  CUDA Events: CPU/GPU Synchronization    183
    6.3.1  Blocking Events    186
    6.3.2  Queries    186
  6.4  CUDA Events: Timing    186
  6.5  Concurrent Copying and Kernel Processing    187
    6.5.1  concurrencyMemcpyKernel.cu    189
    6.5.2  Performance Results    194
    6.5.3  Breaking Interengine Concurrency    196
  6.6  Mapped Pinned Memory    197
  6.7  Concurrent Kernel Processing    199
  6.8  GPU/GPU Synchronization: cudaStreamWaitEvent()    202
    6.8.1  Streams and Events on Multi-GPU: Notes and Limitations    202
  6.9  Source Code Reference    202
|
Chapter 7  Kernel Execution    205
  7.1  Overview    205
  7.2  Syntax    206
    7.2.1  Limitations    208
    7.2.2  Caches and Coherency    209
    7.2.3  Asynchrony and Error Handling    209
    7.2.4  Timeouts    210
    7.2.5  Local Memory    210
    7.2.6  Shared Memory    211
  7.3  Blocks, Threads, Warps, and Lanes    211
    7.3.1  Grids of Blocks    211
    7.3.2  Execution Guarantees    215
    7.3.3  Block and Thread IDs    216
  7.4  Occupancy    220
  7.5  Dynamic Parallelism    222
    7.5.1  Scoping and Synchronization    223
    7.5.2  Memory Model    224
    7.5.3  Streams and Events    225
    7.5.4  Error Handling    225
    7.5.5  Compiling and Linking    226
    7.5.6  Resource Management    226
    7.5.7  Summary    228
|
Chapter 8  Streaming Multiprocessors    231
  8.1  Memory    233
    8.1.1  Registers    233
    8.1.2  Local Memory    234
    8.1.3  Global Memory    235
    8.1.4  Constant Memory    237
    8.1.5  Shared Memory    237
    8.1.6  Barriers and Coherency    240
  8.2  Integer Support    241
    8.2.1  Multiplication    241
    8.2.2  Miscellaneous (Bit Manipulation)    242
    8.2.3  Funnel Shift (SM 3.5)    243
  8.3  Floating-Point Support    244
    8.3.1  Formats    244
    8.3.2  Single Precision (32-Bit)    250
    8.3.3  Double Precision (64-Bit)    253
    8.3.4  Half Precision (16-Bit)    253
    8.3.5  Case Study: float→half Conversion    253
    8.3.6  Math Library    258
    8.3.7  Additional Reading    266
  8.4  Conditional Code    267
    8.4.1  Predication    267
    8.4.2  Divergence and Convergence    268
    8.4.3  Special Cases: Min, Max and Absolute Value    269
  8.5  Textures and Surfaces    269
  8.6  Miscellaneous Instructions    270
    8.6.1  Warp-Level Primitives    270
    8.6.2  Block-Level Primitives    272
    8.6.3  Performance Counter    272
    8.6.4  Video Instructions    272
    8.6.5  Special Registers    275
  8.7  Instruction Sets    275
|
|
Chapter 9  Multiple GPUs    287
  9.1  Overview    287
  9.2  Peer-to-Peer    288
    9.2.1  Peer-to-Peer Memcpy    288
    9.2.2  Peer-to-Peer Addressing    289
  9.3  UVA: Inferring Device from Address    291
  9.4  Inter-GPU Synchronization    292
  9.5  Single-Threaded Multi-GPU    294
    9.5.1  Current Context Stack    294
    9.5.2  N-Body    296
  9.6  Multithreaded Multi-GPU    299
|
|
Chapter 10  Texturing    305
  10.1  Overview    305
    10.1.1  Two Use Cases    306
  10.2  Texture Memory    306
    10.2.1  Device Memory    307
    10.2.2  CUDA Arrays and Block Linear Addressing    308
    10.2.3  Device Memory versus CUDA Arrays    313
  10.3  1D Texturing    314
    10.3.1  Texture Setup    314
  10.4  Texture as a Read Path    317
    10.4.1  Increasing Effective Address Coverage    318
    10.4.2  Texturing from Host Memory    321
  10.5  Texturing with Unnormalized Coordinates    323
  10.6  Texturing with Normalized Coordinates    331
  10.7  1D Surface Read/Write    333
  10.8  2D Texturing    335
    10.8.1  Microdemo: tex2d_opengl.cu    335
  10.9  2D Texturing: Copy Avoidance    338
    10.9.1  2D Texturing from Device Memory    338
    10.9.2  2D Surface Read/Write    340
  10.10  3D Texturing    340
  10.11  Layered Textures    342
    10.11.1  1D Layered Textures    343
    10.11.2  2D Layered Textures    343
  10.12  Optimal Block Sizing and Performance    343
    10.12.1  Results    344
  10.13  Texturing Quick References    345
    10.13.1  Hardware Capabilities    345
    10.13.2  CUDA Runtime    347
    10.13.3  Driver API    349
|
|
PART III    351

Chapter 11  Streaming Workloads    353
  11.1  Device Memory    355
  11.2  Asynchronous Memcpy    358
  11.3  Streams    359
  11.4  Mapped Pinned Memory    361
  11.5  Performance and Summary    362
|
|
Chapter 12  Reduction    365
  12.1  Overview    365
  12.2  Two-Pass Reduction    367
  12.3  Single-Pass Reduction    373
  12.4  Reduction with Atomics    376
  12.5  Arbitrary Block Sizes    377
  12.6  Reduction Using Arbitrary Data Types    378
  12.7  Predicate Reduction    382
  12.8  Warp Reduction with Shuffle    382
|
|
Chapter 13  Scan    385
  13.1  Definition and Variations    385
  13.2  Overview    387
  13.3  Scan and Circuit Design    390
  13.4  CUDA Implementations    394
    13.4.1  Scan-Then-Fan    394
    13.4.2  Reduce-Then-Scan (Recursive)    400
    13.4.3  Reduce-Then-Scan (Two Pass)    403
  13.5  Warp Scans    407
    13.5.1  Zero Padding    408
    13.5.2  Templated Formulations    409
    13.5.3  Warp Shuffle    410
    13.5.4  Instruction Counts    412
  13.6  Stream Compaction    414
  13.7  References (Parallel Scan Algorithms)    418
  13.8  Further Reading (Parallel Prefix Sum Circuits)    419
|
|
Chapter 14  N-Body    421
  14.1  Introduction    423
    14.1.1  A Matrix of Forces    424
  14.2  Naive Implementation    428
  14.3  Shared Memory    432
  14.4  Constant Memory    434
  14.5  Warp Shuffle    436
  14.6  Multiple GPUs and Scalability    438
  14.7  CPU Optimizations    439
  14.8  Conclusion    444
  14.9  References and Further Reading    446
|
Chapter 15  Image Processing: Normalized Correlation    449
  15.1  Overview    449
  15.2  Naive Texture-Texture Implementation    452
  15.3  Template in Constant Memory    456
  15.4  Image in Shared Memory    459
  15.5  Further Optimizations    463
    15.5.1  SM-Aware Coding    463
    15.5.2  Loop Unrolling    464
  15.6  Source Code    465
  15.7  Performance and Further Reading    466
|
  15.8  Further Reading    469
|
Appendix A  The CUDA Handbook Library    471
  A.1  Timing    471
  A.2  Threading    472
  A.3  Driver API Facilities    474
  A.4  Shmoos    475
  A.5  Command Line Parsing    476
  A.6  Error Handling    477

Glossary / TLA Decoder    481

Index    487