The CUDA Handbook: A Comprehensive Guide to GPU Programming [Paperback]

  • Format: Paperback / softback, 528 pages, height x width x thickness: 231x187x28 mm, weight: 846 g
  • Publication date: 27-Jun-2013
  • Publisher: Addison-Wesley Educational Publishers Inc
  • ISBN-10: 0321809467
  • ISBN-13: 9780321809469

The CUDA Handbook begins where CUDA by Example (Addison-Wesley, 2011) leaves off, discussing CUDA hardware and software in greater detail and covering both CUDA 5.0 and Kepler. Every CUDA developer, from the casual to the most sophisticated, will find something here of interest and immediate usefulness. Newer CUDA developers will see how the hardware processes commands and how the driver checks progress; more experienced CUDA developers will appreciate the expert coverage of topics such as the driver API and context migration, as well as the guidance on how best to structure CPU/GPU data interchange and synchronization.

The accompanying open source code (more than 25,000 lines of it, freely available at www.cudahandbook.com) is specifically intended to be reused and repurposed by developers.

Designed to be both a comprehensive reference and a practical cookbook, the text is divided into the following three parts:

Part I, Overview, gives high-level descriptions of the hardware and software that make CUDA possible.


Part II, Details, provides thorough descriptions of every aspect of CUDA, including

  • Memory
  • Streams and events
  • Models of execution, including the dynamic parallelism feature, new with CUDA 5.0 and SM 3.5
  • The streaming multiprocessors, including descriptions of all features through SM 3.5
  • Programming multiple GPUs
  • Texturing

The source code accompanying Part II is presented as reusable microbenchmarks and microdemos, designed to expose specific hardware characteristics or highlight specific use cases.


Part III, Select Applications, details specific families of CUDA applications and key parallel algorithms, including

  • Streaming workloads
  • Reduction
  • Parallel prefix sum (Scan)
  • N-body
  • Image Processing

Together, these algorithms illustrate the full range of potential CUDA applications.

Table of Contents

Preface
Acknowledgments
About the Author
PART I: OVERVIEW
Chapter 1 Background
1.1 Our Approach
1.2 Code
1.2.1 Microbenchmarks
1.2.2 Microdemos
1.2.3 Optimization Journeys
1.3 Administrative Items
1.3.1 Open Source
1.3.2 CUDA Handbook Library (chLib)
1.3.3 Coding Style
1.3.4 CUDA SDK
1.4 Road Map
Chapter 2 Hardware Architecture
2.1 CPU Configurations
2.1.1 Front-Side Bus
2.1.2 Symmetric Multiprocessors
2.1.3 Nonuniform Memory Access
2.1.4 PCI Express Integration
2.2 Integrated GPUs
2.3 Multiple GPUs
2.4 Address Spaces in CUDA
2.4.1 Virtual Addressing: A Brief History
2.4.2 Disjoint Address Spaces
2.4.3 Mapped Pinned Memory
2.4.4 Portable Pinned Memory
2.4.5 Unified Addressing
2.4.6 Peer-to-Peer Mappings
2.5 CPU/GPU Interactions
2.5.1 Pinned Host Memory and Command Buffers
2.5.2 CPU/GPU Concurrency
2.5.3 The Host Interface and Intra-GPU Synchronization
2.5.4 Inter-GPU Synchronization
2.6 GPU Architecture
2.6.1 Overview
2.6.2 Streaming Multiprocessors
2.7 Further Reading
Chapter 3 Software Architecture
3.1 Software Layers
3.1.1 CUDA Runtime and Driver
3.1.2 Driver Models
3.1.3 NVCC, PTX, and Microcode
3.2 Devices and Initialization
3.2.1 Device Count
3.2.2 Device Attributes
3.2.3 When CUDA Is Not Present
3.3 Contexts
3.3.1 Lifetime and Scoping
3.3.2 Preallocation of Resources
3.3.3 Address Space
3.3.4 Current Context Stack
3.3.5 Context State
3.4 Modules and Functions
3.5 Kernels (Functions)
3.6 Device Memory
3.7 Streams and Events
3.7.1 Software Pipelining
3.7.2 Stream Callbacks
3.7.3 The NULL Stream
3.7.4 Events
3.8 Host Memory
3.8.1 Pinned Host Memory
3.8.2 Portable Pinned Memory
3.8.3 Mapped Pinned Memory
3.8.4 Host Memory Registration
3.9 CUDA Arrays and Texturing
3.9.1 Texture References
3.9.2 Surface References
3.10 Graphics Interoperability
3.11 The CUDA Runtime and CUDA Driver API
Chapter 4 Software Environment
4.1 nvcc – CUDA Compiler Driver
4.2 ptxas – the PTX Assembler
4.3 cuobjdump
4.4 nvidia-smi
4.5 Amazon Web Services
4.5.1 Command-Line Tools
4.5.2 EC2 and Virtualization
4.5.3 Key Pairs
4.5.4 Availability Zones (AZs) and Regions
4.5.5 S3
4.5.6 EBS
4.5.7 AMIs
4.5.8 Linux on EC2
4.5.9 Windows on EC2
PART II: DETAILS
Chapter 5 Memory
5.1 Host Memory
5.1.1 Allocating Pinned Memory
5.1.2 Portable Pinned Memory
5.1.3 Mapped Pinned Memory
5.1.4 Write-Combined Pinned Memory
5.1.5 Registering Pinned Memory
5.1.6 Pinned Memory and UVA
5.1.7 Mapped Pinned Memory Usage
5.1.8 NUMA, Thread Affinity, and Pinned Memory
5.2 Global Memory
5.2.1 Pointers
5.2.2 Dynamic Allocations
5.2.3 Querying the Amount of Global Memory
5.2.4 Static Allocations
5.2.5 Memset APIs
5.2.6 Pointer Queries
5.2.7 Peer-to-Peer Access
5.2.8 Reading and Writing Global Memory
5.2.9 Coalescing Constraints
5.2.10 Microbenchmarks: Peak Memory Bandwidth
5.2.11 Atomic Operations
5.2.12 Texturing from Global Memory
5.2.13 ECC (Error Correcting Codes)
5.3 Constant Memory
5.3.1 Host and Device __constant__ Memory
5.3.2 Accessing __constant__ Memory
5.4 Local Memory
5.5 Texture Memory
5.6 Shared Memory
5.6.1 Unsized Shared Memory Declarations
5.6.2 Warp-Synchronous Coding
5.6.3 Pointers to Shared Memory
5.7 Memory Copy
5.7.1 Synchronous versus Asynchronous Memcpy
5.7.2 Unified Virtual Addressing
5.7.3 CUDA Runtime
5.7.4 Driver API
Chapter 6 Streams and Events
6.1 CPU/GPU Concurrency: Covering Driver Overhead
6.1.1 Kernel Launches
6.2 Asynchronous Memcpy
6.2.1 Asynchronous Memcpy: Host→Device
6.2.2 Asynchronous Memcpy: Device→Host
6.2.3 The NULL Stream and Concurrency Breaks
6.3 CUDA Events: CPU/GPU Synchronization
6.3.1 Blocking Events
6.3.2 Queries
6.4 CUDA Events: Timing
6.5 Concurrent Copying and Kernel Processing
6.5.1 concurrencyMemcpyKernel.cu
6.5.2 Performance Results
6.5.3 Breaking Interengine Concurrency
6.6 Mapped Pinned Memory
6.7 Concurrent Kernel Processing
6.8 GPU/GPU Synchronization: cudaStreamWaitEvent()
6.8.1 Streams and Events on Multi-GPU: Notes and Limitations
6.9 Source Code Reference
Chapter 7 Kernel Execution
7.1 Overview
7.2 Syntax
7.2.1 Limitations
7.2.2 Caches and Coherency
7.2.3 Asynchrony and Error Handling
7.2.4 Timeouts
7.2.5 Local Memory
7.2.6 Shared Memory
7.3 Blocks, Threads, Warps, and Lanes
7.3.1 Grids of Blocks
7.3.2 Execution Guarantees
7.3.3 Block and Thread IDs
7.4 Occupancy
7.5 Dynamic Parallelism
7.5.1 Scoping and Synchronization
7.5.2 Memory Model
7.5.3 Streams and Events
7.5.4 Error Handling
7.5.5 Compiling and Linking
7.5.6 Resource Management
7.5.7 Summary
Chapter 8 Streaming Multiprocessors
8.1 Memory
8.1.1 Registers
8.1.2 Local Memory
8.1.3 Global Memory
8.1.4 Constant Memory
8.1.5 Shared Memory
8.1.6 Barriers and Coherency
8.2 Integer Support
8.2.1 Multiplication
8.2.2 Miscellaneous (Bit Manipulation)
8.2.3 Funnel Shift (SM 3.5)
8.3 Floating-Point Support
8.3.1 Formats
8.3.2 Single Precision (32-Bit)
8.3.3 Double Precision (64-Bit)
8.3.4 Half Precision (16-Bit)
8.3.5 Case Study: float→half Conversion
8.3.6 Math Library
8.3.7 Additional Reading
8.4 Conditional Code
8.4.1 Predication
8.4.2 Divergence and Convergence
8.4.3 Special Cases: Min, Max and Absolute Value
8.5 Textures and Surfaces
8.6 Miscellaneous Instructions
8.6.1 Warp-Level Primitives
8.6.2 Block-Level Primitives
8.6.3 Performance Counter
8.6.4 Video Instructions
8.6.5 Special Registers
8.7 Instruction Sets
Chapter 9 Multiple GPUs
9.1 Overview
9.2 Peer-to-Peer
9.2.1 Peer-to-Peer Memcpy
9.2.2 Peer-to-Peer Addressing
9.3 UVA: Inferring Device from Address
9.4 Inter-GPU Synchronization
9.5 Single-Threaded Multi-GPU
9.5.1 Current Context Stack
9.5.2 N-Body
9.6 Multithreaded Multi-GPU
Chapter 10 Texturing
10.1 Overview
10.1.1 Two Use Cases
10.2 Texture Memory
10.2.1 Device Memory
10.2.2 CUDA Arrays and Block Linear Addressing
10.2.3 Device Memory versus CUDA Arrays
10.3 1D Texturing
10.3.1 Texture Setup
10.4 Texture as a Read Path
10.4.1 Increasing Effective Address Coverage
10.4.2 Texturing from Host Memory
10.5 Texturing with Unnormalized Coordinates
10.6 Texturing with Normalized Coordinates
10.7 1D Surface Read/Write
10.8 2D Texturing
10.8.1 Microdemo: tex2d_opengl.cu
10.9 2D Texturing: Copy Avoidance
10.9.1 2D Texturing from Device Memory
10.9.2 2D Surface Read/Write
10.10 3D Texturing
10.11 Layered Textures
10.11.1 1D Layered Textures
10.11.2 2D Layered Textures
10.12 Optimal Block Sizing and Performance
10.12.1 Results
10.13 Texturing Quick References
10.13.1 Hardware Capabilities
10.13.2 CUDA Runtime
10.13.3 Driver API
PART III: SELECT APPLICATIONS
Chapter 11 Streaming Workloads
11.1 Device Memory
11.2 Asynchronous Memcpy
11.3 Streams
11.4 Mapped Pinned Memory
11.5 Performance and Summary
Chapter 12 Reduction
12.1 Overview
12.2 Two-Pass Reduction
12.3 Single-Pass Reduction
12.4 Reduction with Atomics
12.5 Arbitrary Block Sizes
12.6 Reduction Using Arbitrary Data Types
12.7 Predicate Reduction
12.8 Warp Reduction with Shuffle
Chapter 13 Scan
13.1 Definition and Variations
13.2 Overview
13.3 Scan and Circuit Design
13.4 CUDA Implementations
13.4.1 Scan-Then-Fan
13.4.2 Reduce-Then-Scan (Recursive)
13.4.3 Reduce-Then-Scan (Two Pass)
13.5 Warp Scans
13.5.1 Zero Padding
13.5.2 Templated Formulations
13.5.3 Warp Shuffle
13.5.4 Instruction Counts
13.6 Stream Compaction
13.7 References (Parallel Scan Algorithms)
13.8 Further Reading (Parallel Prefix Sum Circuits)
Chapter 14 N-Body
14.1 Introduction
14.1.1 A Matrix of Forces
14.2 Naive Implementation
14.3 Shared Memory
14.4 Constant Memory
14.5 Warp Shuffle
14.6 Multiple GPUs and Scalability
14.7 CPU Optimizations
14.8 Conclusion
14.9 References and Further Reading
Chapter 15 Image Processing: Normalized Correlation
15.1 Overview
15.2 Naive Texture-Texture Implementation
15.3 Template in Constant Memory
15.4 Image in Shared Memory
15.5 Further Optimizations
15.5.1 SM-Aware Coding
15.5.2 Loop Unrolling
15.6 Source Code
15.7 Performance and Further Reading
15.8 Further Reading
Appendix A The CUDA Handbook Library
A.1 Timing
A.2 Threading
A.3 Driver API Facilities
A.4 Shmoos
A.5 Command Line Parsing
A.6 Error Handling
Glossary / TLA Decoder
Index

Nicholas Wilt has been programming professionally for more than twenty-five years in a variety of areas, including industrial machine vision, graphics, and low-level multimedia software. While at Microsoft, he served as the development lead for Direct3D 5.0 and 6.0, built the prototype for the Desktop Window Manager, and did early GPU computing work. At NVIDIA, he worked on CUDA from its inception, designing and often implementing most of CUDA's low-level abstractions. Now at Amazon, Mr. Wilt is working on cloud computing technologies relating to GPUs.