List of Figures  xxiii
List of Tables  xxix
Preface  xxxiii
About the Author  xxxv
Part I  Understanding CPU Parallelism

Chapter 1 Introduction to CPU Parallel Programming  3 (24)
1.1 Evolution Of Parallel Programming  3 (1)
1.2 More Cores, More Parallelism  4 (1)
1.3 …  5 (3)
1.3.1 More Threads or More Cores to Parallelize?  5 (2)
1.3.2 Influence of Core Resource Sharing  7 (1)
1.3.3 Influence of Memory Resource Sharing  7 (1)
1.4 Our First Serial Program  8 (5)
1.4.1 Understanding Data Transfer Speeds  8 (2)
1.4.2 The main() Function in imflip.c  10 (1)
1.4.3 Flipping Rows Vertically: FlipImageV()  11 (1)
1.4.4 Flipping Columns Horizontally: FlipImageH()  12 (1)
1.5 Writing, Compiling, Running Our Programs  13 (2)
1.5.1 Choosing an Editor and a Compiler  13 (1)
1.5.2 Developing in Windows 7, 8, and Windows 10 Platforms  13 (2)
1.5.3 Developing in a Mac Platform  15 (1)
1.5.4 Developing in a Unix Platform  15 (1)
1.6 …  15 (4)
1.6.1 Unix Directory-Related Commands  15 (1)
1.6.2 Unix File-Related Commands  16 (3)
1.7 Debugging Your Programs  19 (4)
1.7.1 …  20 (1)
1.7.2 Old School Debugging  21 (1)
1.7.3 …  22 (1)
1.8 Performance Of Our First Serial Program  23 (4)
1.8.1 Can We Estimate the Execution Time?  24 (1)
1.8.2 What Does the OS Do When Our Code Is Executing?  24 (1)
1.8.3 How Do We Parallelize It?  25 (1)
1.8.4 Thinking About the Resources  25 (2)
|
Chapter 2 Developing Our First Parallel CPU Program  27 (26)
2.1 Our First Parallel Program  27 (10)
2.1.1 The main() Function in imflipP.c  28 (1)
2.1.2 Timing the Execution  29 (1)
2.1.3 Split Code Listing for main() in imflipP.c  29 (3)
2.1.4 Thread Initialization  32 (1)
2.1.5 …  32 (2)
2.1.6 Thread Launch/Execution  34 (1)
2.1.7 Thread Termination (Join)  35 (1)
2.1.8 Thread Task and Data Splitting  35 (2)
2.2 Working With Bitmap (BMP) Files  37 (5)
2.2.1 BMP is a Non-Lossy/Uncompressed File Format  37 (1)
2.2.2 BMP Image File Format  38 (1)
2.2.3 Header File ImageStuff.h  39 (1)
2.2.4 Image Manipulation Routines in ImageStuff.c  40 (2)
2.3 Task Execution By Threads  42 (9)
2.3.1 …  43 (2)
2.3.2 Multithreaded Vertical Flip: MTFlipV()  45 (3)
2.3.3 Comparing FlipImageV() and MTFlipV()  48 (2)
2.3.4 Multithreaded Horizontal Flip: MTFlipH()  50 (1)
2.4 Testing/Timing The Multithreaded Code  51 (2)
|
Chapter 3 Improving Our First Parallel CPU Program  53 (26)
3.1 Effect Of The "Programmer" On Performance  53 (1)
3.2 Effect Of The "CPU" On Performance  54 (3)
3.2.1 In-Order versus Out-Of-Order Cores  55 (2)
3.2.2 Thin versus Thick Threads  57 (1)
3.3 Performance Of IMFLIPP  57 (1)
3.4 Effect Of The "OS" On Performance  58 (5)
3.4.1 …  59 (1)
3.4.2 Thread Launch and Execution  59 (1)
3.4.3 …  60 (1)
3.4.4 Mapping Software Threads to Hardware Threads  61 (1)
3.4.5 Program Performance versus Launched Pthreads  62 (1)
3.5 …  63 (4)
3.5.1 Analyzing Memory Access Patterns in MTFlipH()  64 (1)
3.5.2 Multithreaded Memory Access of MTFlipH()  64 (2)
3.5.3 DRAM Access Rules of Thumb  66 (1)
3.6 IMFLIPPM: Obeying DRAM Rules Of Thumb  67 (5)
3.6.1 Chaotic Memory Access Patterns of imflipP  67 (1)
3.6.2 Improving Memory Access Patterns of imflipP  68 (1)
3.6.3 MTFlipHM(): The Memory Friendly MTFlipH()  69 (2)
3.6.4 MTFlipVM(): The Memory Friendly MTFlipV()  71 (1)
3.7 Performance Of IMFLIPPM.C  72 (2)
3.7.1 Comparing Performances of imflipP.c and imflipPM.c  72 (1)
3.7.2 Speed Improvement: MTFlipV() versus MTFlipVM()  73 (1)
3.7.3 Speed Improvement: MTFlipH() versus MTFlipHM()  73 (1)
3.7.4 Understanding the Speedup: MTFlipH() versus MTFlipHM()  73 (1)
3.8 …  74 (2)
3.9 Intel MIC Architecture: Xeon Phi  76 (1)
3.10 …  77 (1)
3.11 …  78 (1)
|
Chapter 4 Understanding the Cores and Memory  79 (28)
4.1 Once Upon A Time...INTEL...  79 (1)
4.2 CPU And Memory Manufacturers  80 (1)
4.3 Dynamic (DRAM) Versus Static (SRAM) Memory  81 (2)
4.3.1 Static Random Access Memory (SRAM)  81 (1)
4.3.2 Dynamic Random Access Memory (DRAM)  81 (1)
4.3.3 DRAM Interface Standards  81 (1)
4.3.4 Influence of DRAM on our Program Performance  82 (1)
4.3.5 Influence of SRAM (Cache) on our Program Performance  83 (1)
4.4 Image Rotation Program: IMROTATE.C  83 (6)
4.4.1 Description of the imrotate.c  84 (1)
4.4.2 imrotate.c: Parametric Restrictions and Simplifications  84 (1)
4.4.3 imrotate.c: Theory of Operation  85 (4)
4.5 Performance Of Imrotate  89 (2)
4.5.1 Qualitative Analysis of Threading Efficiency  89 (1)
4.5.2 Quantitative Analysis: Defining Threading Efficiency  89 (2)
4.6 The Architecture Of The Computer  91 (6)
4.6.1 The Cores, L1$ and L2$  91 (1)
4.6.2 Internal Core Resources  92 (2)
4.6.3 The Shared L3 Cache Memory (L3$)  94 (1)
4.6.4 The Memory Controller  94 (1)
4.6.5 …  95 (1)
4.6.6 Queue, Uncore, and I/O  96 (1)
4.7 IMROTATEMC: Making Imrotate More Efficient  97 (9)
4.7.1 Rotate2(): How Bad is Square Root and FP Division?  99 (1)
4.7.2 Rotate3() and Rotate4(): How Bad Is sin() and cos()?  100 (2)
4.7.3 Rotate5(): How Bad Is Integer Division/Multiplication?  102 (1)
4.7.4 Rotate6(): Consolidating Computations  102 (2)
4.7.5 Rotate7(): Consolidating More Computations  104 (1)
4.7.6 Overall Performance of imrotateMC  104 (2)
4.8 …  106 (1)
|
Chapter 5 Thread Management and Synchronization  107 (30)
5.1 Edge Detection Program: IMEDGE.C  107 (4)
5.1.1 Description of the imedge.c  108 (1)
5.1.2 imedge.c: Parametric Restrictions and Simplifications  108 (1)
5.1.3 imedge.c: Theory of Operation  109 (2)
5.2 IMEDGE.C: Implementation  111 (7)
5.2.1 Initialization and Time-Stamping  112 (1)
5.2.2 Initialization Functions for Different Image Representations  113 (1)
5.2.3 Launching and Terminating Threads  114 (1)
5.2.4 …  115 (1)
5.2.5 …  116 (1)
5.2.6 …  117 (1)
5.3 Performance Of IMEDGE  118 (1)
5.4 IMEDGEMC: Making Imedge More Efficient  118 (8)
5.4.1 Using Precomputation to Reduce Bandwidth  119 (1)
5.4.2 Storing the Precomputed Pixel Values  120 (1)
5.4.3 Precomputing Pixel Values  121 (1)
5.4.4 Reading the Image and Precomputing Pixel Values  122 (1)
5.4.5 …  123 (1)
5.4.6 …  124 (1)
5.4.7 …  125 (1)
5.5 Performance Of IMEDGEMC  126 (1)
5.6 IMEDGEMCT: Synchronizing Threads Efficiently  127 (3)
5.6.1 Barrier Synchronization  128 (1)
5.6.2 MUTEX Structure for Data Sharing  129 (1)
5.7 IMEDGEMCT: Implementation  130 (4)
5.7.1 Using a MUTEX: Read Image, Precompute  132 (1)
5.7.2 Precomputing One Row at a Time  133 (1)
5.8 Performance Of IMEDGEMCT  134 (3)
Part II  GPU Programming Using CUDA

Chapter 6 Introduction to GPU Parallelism and CUDA  137 (48)
6.1 Once Upon A Time...NVIDIA  137 (6)
6.1.1 The Birth of the GPU  137 (1)
6.1.2 Early GPU Architectures  138 (2)
6.1.3 The Birth of the GPGPU  140 (1)
6.1.4 Nvidia, ATI Technologies, and Intel  141 (2)
6.2 Compute-Unified Device Architecture (CUDA)  143 (1)
6.2.1 CUDA, OpenCL, and Other GPU Languages  143 (1)
6.2.2 Device Side versus Host Side Code  143 (1)
6.3 Understanding GPU Parallelism  144 (3)
6.3.1 How Does the GPU Achieve High Performance?  145 (1)
6.3.2 CPU versus GPU Architectural Differences  146 (1)
6.4 CUDA Version Of The Image Flipper: IMFLIPG.CU  147 (23)
6.4.1 imflipG.cu: Read the Image into a CPU-Side Array  149 (2)
6.4.2 Initialize and Query the GPUs  151 (2)
6.4.3 GPU-Side Time-Stamping  153 (2)
6.4.4 GPU-Side Memory Allocation  155 (1)
6.4.5 GPU Drivers and Nvidia Runtime Engine  155 (1)
6.4.6 CPU→GPU Data Transfer  156 (1)
6.4.7 Error Reporting Using Wrapper Functions  157 (1)
6.4.8 GPU Kernel Execution  157 (3)
6.4.9 Finish Executing the GPU Kernel  160 (1)
6.4.10 Transfer GPU Results Back to the CPU  161 (1)
6.4.11 Complete Time-Stamping  161 (1)
6.4.12 Report the Results and Cleanup  162 (1)
6.4.13 Reading and Writing the BMP File  163 (1)
6.4.14 Vflip(): The GPU Kernel for Vertical Flipping  164 (2)
6.4.15 What Is My Thread ID, Block ID, and Block Dimension?  166 (3)
6.4.16 Hflip(): The GPU Kernel for Horizontal Flipping  169 (1)
6.4.17 Hardware Parameters: threadIdx.x, blockIdx.x, blockDim.x  169 (1)
6.4.18 PixCopy(): The GPU Kernel for Copying an Image  169 (1)
6.4.19 …  170 (1)
6.5 CUDA Program Development In Windows  170 (9)
6.5.1 Installing MS Visual Studio 2015 and CUDA Toolkit 8.0  171 (1)
6.5.2 Creating Project imflipG.cu in Visual Studio 2015  172 (2)
6.5.3 Compiling Project imflipG.cu in Visual Studio 2015  174 (3)
6.5.4 Running Our First CUDA Application: imflipG.exe  177 (1)
6.5.5 Ensuring Your Program's Correctness  178 (1)
6.6 CUDA Program Development On A Mac Platform  179 (2)
6.6.1 Installing XCode on Your Mac  179 (1)
6.6.2 Installing the CUDA Driver and CUDA Toolkit  180 (1)
6.6.3 Compiling and Running CUDA Applications on a Mac  180 (1)
6.7 CUDA Program Development In A Unix Platform  181 (4)
6.7.1 Installing Eclipse and CUDA Toolkit  181 (1)
6.7.2 …  182 (1)
6.7.3 Compiling and Executing Your CUDA Code  182 (3)
|
Chapter 7 CUDA Host/Device Programming Model  185 (40)
7.1 Designing Your Program's Parallelism  185 (4)
7.1.1 Conceptually Parallelizing a Task  186 (1)
7.1.2 What Is a Good Block Size for Vflip()?  187 (1)
7.1.3 imflipG.cu: Interpreting the Program Output  187 (1)
7.1.4 imflipG.cu: Performance Impact of Block and Image Size  188 (1)
7.2 Kernel Launch Components  189 (4)
7.2.1 …  189 (1)
7.2.2 …  190 (1)
7.2.3 …  191 (1)
7.2.4 …  192 (1)
7.3 IMFLIPG.CU: Understanding The Kernel Details  193 (6)
7.3.1 Launching Kernels in main() and Passing Arguments to Them  193 (1)
7.3.2 Thread Execution Steps  194 (1)
7.3.3 Vflip() Kernel Details  195 (1)
7.3.4 Comparing Vflip() and MTFlipV()  196 (1)
7.3.5 Hflip() Kernel Details  197 (1)
7.3.6 PixCopy() Kernel Details  197 (2)
7.4 Dependence Of PCI Express Speed On The CPU  199 (1)
7.5 Performance Impact Of PCI Express Bus  200 (4)
7.5.1 Data Transfer Time, Speed, Latency, Throughput, and Bandwidth  200 (1)
7.5.2 PCIe Throughput Achieved with imflipG.cu  201 (3)
7.6 Performance Impact Of Global Memory Bus  204 (2)
7.7 Performance Impact Of Compute Capability  206 (8)
7.7.1 Fermi, Kepler, Maxwell, Pascal, and Volta Families  207 (1)
7.7.2 Relative Bandwidth Achieved in Different Families  207 (1)
7.7.3 imflipG2.cu: Compute Capability 2.0 Version of imflipG.cu  208 (2)
7.7.4 imflipG2.cu: Changes in main()  210 (1)
7.7.5 The PxCC20() Kernel  211 (1)
7.7.6 The VfCC20() Kernel  212 (2)
7.8 Performance Of IMFLIPG2.CU  214 (1)
7.9 Old-School CUDA Debugging  214 (7)
7.9.1 …  216 (2)
7.9.2 …  218 (2)
7.9.3 Comment-Based Debugging  220 (1)
7.9.4 …  220 (1)
7.10 Biological Reasons For Software Bugs  221 (4)
7.10.1 How Is Our Brain Involved in Writing/Debugging Code?  222 (1)
7.10.2 Do We Write Buggy Code When We Are Tired?  222 (4)
7.10.2.1 …  223 (1)
7.10.2.2 Physical Tiredness  223 (1)
7.10.2.3 Tiredness Due to Heavy Physical Activity  223 (1)
7.10.2.4 Tiredness Due to Needing Sleep  223 (1)
7.10.2.5 Mental Tiredness  224 (1)
|
Chapter 8 Understanding GPU Hardware Architecture  225 (38)
8.1 GPU Hardware Architecture  226 (1)
8.2 GPU Hardware Components  226 (4)
8.2.1 SM: Streaming Multiprocessor  226 (1)
8.2.2 …  227 (1)
8.2.3 Giga-Thread Scheduler  227 (2)
8.2.4 …  229 (1)
8.2.5 Shared Cache Memory (L2$)  229 (1)
8.2.6 …  229 (1)
8.3 NVIDIA GPU Architectures  230 (3)
8.3.1 …  231 (1)
8.3.2 GT, GTX, and Compute Accelerators  231 (1)
8.3.3 Kepler Architecture  232 (1)
8.3.4 Maxwell Architecture  232 (1)
8.3.5 Pascal Architecture and NVLink  233 (1)
8.4 CUDA Edge Detection: IMEDGEG.CU  233 (9)
8.4.1 Variables to Store the Image in CPU, GPU Memory  233 (2)
8.4.1.1 TheImage and CopyImage  233 (1)
8.4.1.2 …  234 (1)
8.4.1.3 …  234 (1)
8.4.1.4 …  234 (1)
8.4.1.5 GPUGradient and GPUTheta  234 (1)
8.4.1.6 …  235 (1)
8.4.2 Allocating Memory for the GPU Variables  235 (3)
8.4.3 Calling the Kernels and Time-Stamping Their Execution  238 (1)
8.4.4 Computing the Kernel Performance  239 (1)
8.4.5 Computing the Amount of Kernel Data Movement  239 (3)
8.4.6 Reporting the Kernel Performance  242 (1)
8.5 …  242 (7)
8.5.1 …  242 (2)
8.5.2 …  244 (2)
8.5.3 …  246 (3)
8.5.4 …  249 (1)
8.6 Performance Of IMEDGEG.CU  249 (4)
8.6.1 imedgeG.cu: PCIe Bus Utilization  250 (1)
8.6.2 imedgeG.cu: Runtime Results  250 (2)
8.6.3 imedgeG.cu: Kernel Performance Comparison  252 (1)
8.7 GPU Code: Compile Time  253 (2)
8.7.1 Designing CUDA Code  253 (2)
8.7.2 Compiling CUDA Code  255 (1)
8.7.3 GPU Assembly: PTX, CUBIN  255 (1)
8.8 …  255 (2)
8.8.1 OS Involvement and CUDA DLL File  255 (1)
8.8.2 GPU Graphics Driver  256 (1)
8.8.3 CPU↔GPU Memory Transfers  256 (1)
8.9 GPU Code: Execution (Run Time)  257 (6)
8.9.1 …  257 (1)
8.9.2 Getting the Code and Parameters  257 (1)
8.9.3 Launching Grids of Blocks  258 (1)
8.9.4 Giga Thread Scheduler (GTS)  258 (1)
8.9.5 …  259 (1)
8.9.6 …  260 (1)
8.9.7 Transparent Scalability  261 (2)
|
Chapter 9 Understanding GPU Cores  263 (40)
9.1 GPU Architecture Families  263 (12)
9.1.1 …  263 (1)
9.1.2 …  264 (2)
9.1.3 Kepler Architecture  266 (1)
9.1.4 Kepler SMX Structure  267 (1)
9.1.5 Maxwell Architecture  268 (1)
9.1.6 Maxwell SMM Structure  268 (2)
9.1.7 Pascal GP100 Architecture  270 (1)
9.1.8 Pascal GP100 SM Structure  271 (1)
9.1.9 Family Comparison: Peak GFLOPS and Peak DGFLOPS  272 (1)
9.1.10 …  273 (1)
9.1.11 GPU Power Consumption  274 (1)
9.1.12 Computer Power Supply  274 (1)
9.2 Streaming Multiprocessor (SM) Building Blocks  275 (4)
9.2.1 …  275 (1)
9.2.2 Double Precision Units (DPU)  276 (1)
9.2.3 Special Function Units (SFU)  276 (1)
9.2.4 …  276 (1)
9.2.5 Load/Store Queues (LDST)  277 (1)
9.2.6 L1$ and Texture Cache  277 (1)
9.2.7 …  278 (1)
9.2.8 …  278 (1)
9.2.9 …  278 (1)
9.2.10 Instruction Buffer  278 (1)
9.2.11 …  278 (1)
9.2.12 …  279 (1)
9.3 Parallel Thread Execution (PTX) Data Types  279 (7)
9.3.1 INT8: 8-bit Integer  280 (1)
9.3.2 INT16: 16-bit Integer  280 (1)
9.3.3 …  280 (1)
9.3.4 INT32: 32-bit Integer  281 (1)
9.3.5 Predicate Registers (32-bit)  281 (1)
9.3.6 INT64: 64-bit Integer  282 (1)
9.3.7 …  282 (1)
9.3.8 FP32: Single Precision Floating Point (float)  282 (1)
9.3.9 FP64: Double Precision Floating Point (double)  283 (1)
9.3.10 FP16: Half Precision Floating Point (half)  284 (1)
9.3.11 …  284 (1)
9.3.12 Fused Multiply-Accumulate (FMA) versus Multiply-Add (MAD)  285 (1)
9.3.13 Quad and Octo Precision Floating Point  285 (1)
9.3.14 Pascal GP104 Engine SM Structure  285 (1)
9.4 IMFLIPGC.CU: Core-Friendly IMFLIPG  286 (13)
9.4.1 Hflip2(): Precomputing Kernel Parameters  288 (2)
9.4.2 Vflip2(): Precomputing Kernel Parameters  290 (1)
9.4.3 Computing Image Coordinates by a Thread  290 (1)
9.4.4 Block ID versus Image Row Mapping  291 (1)
9.4.5 Hflip3(): Using a 2D Launch Grid  292 (1)
9.4.6 Vflip3(): Using a 2D Launch Grid  293 (1)
9.4.7 Hflip4(): Computing Two Consecutive Pixels  294 (1)
9.4.8 Vflip4(): Computing Two Consecutive Pixels  295 (1)
9.4.9 Hflip5(): Computing Four Consecutive Pixels  296 (1)
9.4.10 Vflip5(): Computing Four Consecutive Pixels  297 (1)
9.4.11 PixCopy2(), PixCopy3(): Copying 2,4 Consecutive Pixels at a Time  298 (1)
9.5 IMEDGEGC.CU: Core-Friendly IMEDGEG  299 (4)
9.5.1 BWKernel2(): Using Precomputed Values and 2D Blocks  299 (1)
9.5.2 GaussKernel2(): Using Precomputed Values and 2D Blocks  300 (3)
|
Chapter 10 Understanding GPU Memory  303 (42)
10.1 …  303 (1)
10.2 …  304 (1)
10.3 …  304 (1)
10.4 …  305 (2)
10.4.1 Split versus Dedicated Shared Memory  305 (1)
10.4.2 Memory Resources Available Per Core  306 (1)
10.4.3 Using Shared Memory as Software Cache  306 (1)
10.4.4 Allocating Shared Memory in an SM  307 (1)
10.5 …  307 (1)
10.6 …  307 (1)
10.7 IMFLIPGCM.CU: Core And Memory Friendly IMFLIPG  308 (11)
10.7.1 Hflip6(), Vflip6(): Using Shared Memory as Buffer  308 (2)
10.7.2 Hflip7(): Consecutive Swap Operations in Shared Memory  310 (2)
10.7.3 Hflip8(): Using Registers to Swap Four Pixels  312 (2)
10.7.4 Vflip7(): Copying 4 Bytes (int) at a Time  314 (1)
10.7.5 Aligned versus Unaligned Data Access in Memory  314 (1)
10.7.6 Vflip8(): Copying 8 Bytes at a Time  315 (1)
10.7.7 Vflip9(): Using Only Global Memory, 8 Bytes at a Time  316 (1)
10.7.8 PixCopy4(), PixCopy5(): Copying One versus 4 Bytes Using Shared Memory  317 (1)
10.7.9 PixCopy6(), PixCopy7(): Copying One/Two Integers Using Global Memory  318 (1)
10.8 IMEDGEGCM.CU: Core- & Memory-Friendly IMEDGEG  319 (14)
10.8.1 BWKernel3(): Using Byte Manipulation to Extract RGB  319 (2)
10.8.2 GaussKernel3(): Using Constant Memory  321 (1)
10.8.3 Ways to Handle Constant Values  321 (2)
10.8.4 GaussKernel4(): Buffering Neighbors of 1 Pixel in Shared Memory  323 (2)
10.8.5 GaussKernel5(): Buffering Neighbors of 4 Pixels in Shared Memory  325 (2)
10.8.6 GaussKernel6(): Reading 5 Vertical Pixels into Shared Memory  327 (2)
10.8.7 GaussKernel7(): Eliminating the Need to Account for Edge Pixels  329 (2)
10.8.8 GaussKernel8(): Computing 8 Vertical Pixels  331 (2)
10.9 CUDA Occupancy Calculator  333 (12)
10.9.1 Choosing the Optimum Threads/Block  334 (1)
10.9.2 SM-Level Resource Limitations  335 (1)
10.9.3 What is "Occupancy"?  336 (1)
10.9.4 CUDA Occupancy Calculator: Resource Computation  336 (4)
10.9.5 Case Study: GaussKernel7()  340 (3)
10.9.6 Case Study: GaussKernel8()  343 (2)
|
|
Chapter 11 …  345 (38)
11.1 …  347 (2)
11.1.1 Execution Overlapping  347 (1)
11.1.2 Exposed versus Coalesced Runtime  348 (1)
11.2 …  349 (2)
11.2.1 Physical versus Virtual Memory  349 (1)
11.2.2 Physical to Virtual Address Translation  350 (1)
11.2.3 …  350 (1)
11.2.4 Allocating Pinned Memory with cudaMallocHost()  351 (1)
11.3 Fast CPU↔GPU Data Transfers  351 (1)
11.3.1 Synchronous Data Transfers  351 (1)
11.3.2 Asynchronous Data Transfers  351 (1)
11.4 …  352 (4)
11.4.1 CPU→GPU Transfer, Kernel Exec, GPU→CPU Transfer  352 (1)
11.4.2 Implementing Streaming in CUDA  353 (1)
11.4.3 …  353 (1)
11.4.4 Kernel Execution Engine  353 (1)
11.4.5 Concurrent Upstream and Downstream PCIe Transfers  354 (1)
11.4.6 Creating CUDA Streams  355 (1)
11.4.7 Destroying CUDA Streams  355 (1)
11.4.8 Synchronizing CUDA Streams  355 (1)
11.5 IMGSTR.CU: Streaming Image Processing  356 (10)
11.5.1 Reading the Image into Pinned Memory  356 (2)
11.5.2 Synchronous versus Single Stream  358 (1)
11.5.3 …  359 (2)
11.5.4 Data Dependence Across Multiple Streams  361 (10)
11.5.4.1 Horizontal Flip: No Data Dependence  362 (1)
11.5.4.2 Edge Detection: Data Dependence  363 (1)
11.5.4.3 Preprocessing Overlapping Rows Synchronously  363 (1)
11.5.4.4 Asynchronous Processing the Non-Overlapping Rows  364 (2)
11.6 Streaming Horizontal Flip Kernel  366 (1)
11.7 IMGSTR.CU: Streaming Edge Detection  367 (4)
11.8 Performance Comparison: IMGSTR.CU  371 (4)
11.8.1 Synchronous versus Asynchronous Results  371 (1)
11.8.2 Randomness in the Results  372 (1)
11.8.3 …  372 (1)
11.8.4 Best Case Streaming Results  373 (1)
11.8.5 Worst Case Streaming Results  374 (1)
11.9 NVIDIA Visual Profiler: NVVP  375 (8)
11.9.1 Installing nvvp and nvprof  375 (1)
11.9.2 …  376 (1)
11.9.3 …  377 (1)
11.9.4 imGStr Synchronous and Single-Stream Results  377 (1)
11.9.5 imGStr 2- and 4-Stream Results  378 (5)
Part III  More To Know

Chapter 12 CUDA Libraries  383 (14)
12.1 …  383 (7)
12.1.1 …  383 (1)
12.1.2 …  384 (1)
12.1.3 …  385 (1)
12.1.4 Variable Declaration and Initialization  385 (1)
12.1.5 Device Memory Allocation  386 (1)
12.1.6 …  386 (1)
12.1.7 Transferring Data to the Device  386 (1)
12.1.8 Calling cuBLAS Functions  387 (1)
12.1.9 Transfer Data Back to the Host  388 (1)
12.1.10 Deallocating Memory  388 (1)
12.1.11 Example cuBLAS Program: Matrix Scalar  388 (2)
12.2 …  390 (2)
12.2.1 cuFFT Library Characteristics  390 (1)
12.2.2 A Sample Complex-to-Complex Transform  390 (1)
12.2.3 A Sample Real-to-Complex Transform  391 (1)
12.3 NVIDIA Performance Primitives (NPP)  392 (1)
12.4 …  393 (4)
|
Chapter 13 Introduction to OpenCL  397 (16)
13.1 …  397 (1)
13.1.1 …  397 (1)
13.1.2 …  397 (1)
13.2 Image Flip Kernel In OpenCL  398 (1)
13.3 …  399 (7)
13.3.1 Selecting a Device  400 (1)
13.3.2 Running the Kernel  401 (1)
13.3.2.1 Creating a Compute Context  401 (1)
13.3.2.2 Creating a Command Queue  401 (1)
13.3.2.3 Loading Kernel File  402 (1)
13.3.2.4 Setting Up Kernel Invocation  403 (2)
13.3.3 Runtimes of Our OpenCL Program  405 (1)
13.4 Edge Detection In OpenCL  406 (7)
|
Chapter 14 Other GPU Programming Languages  413 (12)
14.1 GPU Programming With Python  413 (7)
14.1.1 PyOpenCL Version of imflip  414 (4)
14.1.2 PyOpenCL Element-Wise Kernel  418 (2)
14.2 …  420 (1)
14.3 OpenGL ES: OpenGL For Embedded Systems  420 (1)
14.4 …  421 (1)
14.5 Microsoft's High-Level Shading Language (HLSL)  421 (1)
…  421 (1)
…  422 (1)
…  422 (1)
14.7 Apple's Swift Programming Language  423 (1)
14.8 …  423 (2)
14.8.1 Installing OpenCV and Face Recognition  423 (1)
14.8.2 Mobile-Cloudlet-Cloud Real-Time Face Recognition  423 (1)
14.8.3 Acceleration as a Service (AXaaS)  423 (2)
|
Chapter 15 Deep Learning Using CUDA  425 (10)
15.1 Artificial Neural Networks (ANNs)  425 (1)
15.1.1 …  425 (1)
15.1.2 Activation Functions  425 (1)
15.2 Fully Connected Neural Networks  425 (2)
15.3 Deep Networks/Convolutional Neural Networks  427 (1)
15.4 …  428 (1)
15.5 cuDNN Library For Deep Learning  428 (4)
15.5.1 …  429 (1)
15.5.2 Creating a Network  430 (1)
15.5.3 Forward Propagation  431 (1)
15.5.4 …  431 (1)
15.5.5 Using cuBLAS in the Network  431 (1)
15.6 …  432 (3)

Bibliography  435 (4)
Index  439