GPU Parallel Program Development Using CUDA [Hardback]

Tolga Soyata (University at Albany, SUNY, USA)
  • Format: Hardback, 476 pages, height x width: 254x178 mm, weight: 1000 g, 57 Tables, black and white; 24 Line drawings, black and white; 30 Halftones, black and white; 54 Illustrations, black and white
  • Series: Chapman & Hall/CRC Computational Science
  • Publication date: 16-Feb-2018
  • Publisher: Chapman & Hall/CRC
  • ISBN-10: 1498750753
  • ISBN-13: 9781498750752
  • Price: 69.93 €*
  • * the price is final, i.e. no further discounts apply
  • Regular price: 93.24 €
  • You save 25%
  • Delivery from the publisher takes approximately 2-4 weeks
GPU Parallel Program Development Using CUDA teaches GPU programming by showing the differences among different families of GPUs, an approach that prepares the reader for the next and future generations of GPUs. The book emphasizes concepts that will remain relevant for a long time rather than platform-specific details, while also providing platform-dependent explanations that are as valuable as the generalized GPU concepts.

The book consists of three parts; it starts by explaining parallelism using CPU multithreading in Part I. A few simple programs are used to demonstrate the concept of dividing a large task into multiple parallel sub-tasks and mapping them to CPU threads. Multiple ways of parallelizing the same task are analyzed, and their pros and cons are studied in terms of both core and memory operation.
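
To give a flavor of the Part I material, here is a minimal Pthreads sketch of the divide-and-map idea. It is illustrative only: the array, the thread count, and the DoubleChunk() helper are assumptions of this page, not code from the book's imflip.c.

    /* Minimal sketch: split one large task across CPU threads (assumed example). */
    #include <pthread.h>
    #include <stdio.h>

    #define NUM_THREADS 4
    #define N 1000000

    static double data[N];

    typedef struct { long start, end; } Range;

    /* Each thread processes only its own slice of the array. */
    void *DoubleChunk(void *arg)
    {
        Range *r = (Range *)arg;
        for (long i = r->start; i < r->end; i++)
            data[i] *= 2.0;
        return NULL;
    }

    int main(void)
    {
        pthread_t th[NUM_THREADS];
        Range     rg[NUM_THREADS];
        long chunk = N / NUM_THREADS;

        for (int t = 0; t < NUM_THREADS; t++) {   /* create the threads */
            rg[t].start = t * chunk;
            rg[t].end   = (t == NUM_THREADS - 1) ? N : (t + 1) * chunk;
            pthread_create(&th[t], NULL, DoubleChunk, &rg[t]);
        }
        for (int t = 0; t < NUM_THREADS; t++)     /* wait for all to finish */
            pthread_join(th[t], NULL);

        printf("all %d threads done\n", NUM_THREADS);
        return 0;
    }

Compiled with, e.g., gcc -pthread; the equal-sized-chunk split shown here is only one of the mappings whose pros and cons the book analyzes.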

Part II of the book introduces GPU massive parallelism. The same programs are parallelized on multiple Nvidia GPU platforms and the same performance analysis is repeated. Because the core and memory structures of CPUs and GPUs are different, the results differ in interesting ways. The end goal is to make programmers aware of both the good and the bad ideas, so they can apply the good ones and avoid the bad ones in their own programs.
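
For comparison, here is a minimal CUDA sketch of the same computation (again an assumption of this page, not the book's imflipG.cu): on the GPU, each of the massively many threads handles a single array element rather than a whole chunk.

    /* Minimal sketch: the same task on a GPU, one thread per element (assumed example). */
    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>

    __global__ void DoubleKernel(double *d, long n)
    {
        long i = (long)blockIdx.x * blockDim.x + threadIdx.x;  /* global thread ID */
        if (i < n) d[i] *= 2.0;                                /* guard excess threads */
    }

    int main(void)
    {
        const long N = 1000000;
        double *h = (double *)malloc(N * sizeof(double));
        double *d;
        for (long i = 0; i < N; i++) h[i] = 1.0;

        cudaMalloc(&d, N * sizeof(double));
        cudaMemcpy(d, h, N * sizeof(double), cudaMemcpyHostToDevice);  /* CPU→GPU */

        int threadsPerBlock = 256;   /* block-size choices like this are studied in Part II */
        int blocks = (int)((N + threadsPerBlock - 1) / threadsPerBlock);
        DoubleKernel<<<blocks, threadsPerBlock>>>(d, N);

        cudaMemcpy(h, d, N * sizeof(double), cudaMemcpyDeviceToHost);  /* GPU→CPU */
        printf("h[0] = %f\n", h[0]);
        cudaFree(d);
        free(h);
        return 0;
    }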

Part III of the book provides pointers for readers who want to expand their horizons. It provides a brief introduction to popular CUDA libraries (such as cuBLAS, cuFFT, NPP, and Thrust), the OpenCL programming language, an overview of GPU programming using other programming languages and API libraries (such as Python, OpenCV, OpenGL, and Apple's Swift and Metal), and the deep learning library cuDNN.
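
As a taste of the library-based style covered in Part III, here is a short Thrust sketch (illustrative only; the Doubler functor is this page's assumption, not an example from the book) that runs the same computation on the GPU without writing an explicit kernel:

    /* Minimal sketch: Thrust hides the kernel launch behind an STL-like call (assumed example). */
    #include <thrust/device_vector.h>
    #include <thrust/transform.h>
    #include <cstdio>

    struct Doubler {
        __host__ __device__ float operator()(float x) const { return 2.0f * x; }
    };

    int main(void)
    {
        thrust::device_vector<float> v(1000000, 1.0f);        /* lives in GPU memory */
        thrust::transform(v.begin(), v.end(), v.begin(), Doubler());
        std::printf("v[0] = %f\n", (float)v[0]);              /* copies one element back */
        return 0;
    }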
List of Figures xxiii
List of Tables xxix
Preface xxxiii
About the Author xxxv
Part I Understanding CPU Parallelism
Chapter 1 Introduction to CPU Parallel Programming
3(24)
1.1 Evolution Of Parallel Programming
3(1)
1.2 More Cores, More Parallelism
4(1)
1.3 Cores Versus Threads
5(3)
1.3.1 More Threads or More Cores to Parallelize?
5(2)
1.3.2 Influence of Core Resource Sharing
7(1)
1.3.3 Influence of Memory Resource Sharing
7(1)
1.4 Our First Serial Program
8(5)
1.4.1 Understanding Data Transfer Speeds
8(2)
1.4.2 The main() Function in imflip.c
10(1)
1.4.3 Flipping Rows Vertically: FlipImageV()
11(1)
1.4.4 Flipping Columns Horizontally: FlipImageH()
12(1)
1.5 Writing, Compiling, Running Our Programs
13(2)
1.5.1 Choosing an Editor and a Compiler
13(1)
1.5.2 Developing in Windows 7, 8, and Windows 10 Platforms
13(2)
1.5.3 Developing in a Mac Platform
15(1)
1.5.4 Developing in a Unix Platform
15(1)
1.6 Crash Course On UNIX
15(4)
1.6.1 Unix Directory-Related Commands
15(1)
1.6.2 Unix File-Related Commands
16(3)
1.7 Debugging Your Programs
19(4)
1.7.1 gdb
20(1)
1.7.2 Old School Debugging
21(1)
1.7.3 valgrind
22(1)
1.8 Performance Of Our First Serial Program
23(4)
1.8.1 Can We Estimate the Execution Time?
24(1)
1.8.2 What Does the OS Do When Our Code Is Executing?
24(1)
1.8.3 How Do We Parallelize It?
25(1)
1.8.4 Thinking About the Resources
25(2)
Chapter 2 Developing Our First Parallel CPU Program
27(26)
2.1 Our First Parallel Program
27(10)
2.1.1 The main() Function in imflipP.c
28(1)
2.1.2 Timing the Execution
29(1)
2.1.3 Split Code Listing for main() in imflipP.c
29(3)
2.1.4 Thread Initialization
32(1)
2.1.5 Thread Creation
32(2)
2.1.6 Thread Launch/Execution
34(1)
2.1.7 Thread Termination (Join)
35(1)
2.1.8 Thread Task and Data Splitting
35(2)
2.2 Working With Bitmap (BMP) Files
37(5)
2.2.1 BMP is a Non-Lossy/Uncompressed File Format
37(1)
2.2.2 BMP Image File Format
38(1)
2.2.3 Header File ImageStuff.h
39(1)
2.2.4 Image Manipulation Routines in ImageStuff.c
40(2)
2.3 Task Execution By Threads
42(9)
2.3.1 Launching a Thread
43(2)
2.3.2 Multithreaded Vertical Flip: MTFlipV()
45(3)
2.3.3 Comparing FlipImageV() and MTFlipV()
48(2)
2.3.4 Multithreaded Horizontal Flip: MTFlipH()
50(1)
2.4 Testing/Timing The Multithreaded Code
51(2)
Chapter 3 Improving Our First Parallel CPU Program
53(26)
3.1 Effect Of The "Programmer" On Performance
53(1)
3.2 Effect Of The "CPU" On Performance
54(3)
3.2.1 In-Order versus Out-Of-Order Cores
55(2)
3.2.2 Thin versus Thick Threads
57(1)
3.3 Performance Of IMFLIPP
57(1)
3.4 Effect Of The "OS" On Performance
58(5)
3.4.1 Thread Creation
59(1)
3.4.2 Thread Launch and Execution
59(1)
3.4.3 Thread Status
60(1)
3.4.4 Mapping Software Threads to Hardware Threads
61(1)
3.4.5 Program Performance versus Launched Pthreads
62(1)
3.5 Improving IMFLIPP
63(4)
3.5.1 Analyzing Memory Access Patterns in MTFlipH()
64(1)
3.5.2 Multithreaded Memory Access of MTFlipH()
64(2)
3.5.3 DRAM Access Rules of Thumb
66(1)
3.6 IMFLIPPM: Obeying DRAM Rules Of Thumb
67(5)
3.6.1 Chaotic Memory Access Patterns of imflipP
67(1)
3.6.2 Improving Memory Access Patterns of imflipP
68(1)
3.6.3 MTFlipHM(): The Memory Friendly MTFlipH()
69(2)
3.6.4 MTFlipVM(): The Memory Friendly MTFlipV()
71(1)
3.7 Performance Of IMFLIPPM.C
72(2)
3.7.1 Comparing Performances of imflipP.c and imflipPM.c
72(1)
3.7.2 Speed Improvement: MTFlipV() versus MTFlipVM()
73(1)
3.7.3 Speed Improvement: MTFlipH() versus MTFlipHM()
73(1)
3.7.4 Understanding the Speedup: MTFlipH() versus MTFlipHM()
73(1)
3.8 Process Memory Map
74(2)
3.9 Intel MIC Architecture: Xeon Phi
76(1)
3.10 What About The GPU?
77(1)
3.11 Chapter Summary
78(1)
Chapter 4 Understanding the Cores and Memory
79(28)
4.1 Once Upon A Time...INTEL...
79(1)
4.2 CPU And Memory Manufacturers
80(1)
4.3 Dynamic (DRAM) Versus Static (SRAM) Memory
81(2)
4.3.1 Static Random Access Memory (SRAM)
81(1)
4.3.2 Dynamic Random Access Memory (DRAM)
81(1)
4.3.3 DRAM Interface Standards
81(1)
4.3.4 Influence of DRAM on our Program Performance
82(1)
4.3.5 Influence of SRAM (Cache) on our Program Performance
83(1)
4.4 Image Rotation Program: IMROTATE.C
83(6)
4.4.1 Description of the imrotate.c
84(1)
4.4.2 imrotate.c: Parametric Restrictions and Simplifications
84(1)
4.4.3 imrotate.c: Theory of Operation
85(4)
4.5 Performance Of Imrotate
89(2)
4.5.1 Qualitative Analysis of Threading Efficiency
89(1)
4.5.2 Quantitative Analysis: Defining Threading Efficiency
89(2)
4.6 The Architecture Of The Computer
91(6)
4.6.1 The Cores, L1$ and L2$
91(1)
4.6.2 Internal Core Resources
92(2)
4.6.3 The Shared L3 Cache Memory (L3$)
94(1)
4.6.4 The Memory Controller
94(1)
4.6.5 The Main Memory
95(1)
4.6.6 Queue, Uncore, and I/O
96(1)
4.7 IMROTATEMC: Making Imrotate More Efficient
97(9)
4.7.1 Rotate2(): How Bad is Square Root and FP Division?
99(1)
4.7.2 Rotate3() and Rotate4(): How Bad Is sin() and cos()?
100(2)
4.7.3 Rotate5(): How Bad Is Integer Division/Multiplication?
102(1)
4.7.4 Rotate6(): Consolidating Computations
102(2)
4.7.5 Rotate7(): Consolidating More Computations
104(1)
4.7.6 Overall Performance of imrotateMC
104(2)
4.8 Chapter Summary
106(1)
Chapter 5 Thread Management and Synchronization
107(30)
5.1 Edge Detection Program: IMEDGE.C
107(4)
5.1.1 Description of the imedge.c
108(1)
5.1.2 imedge.c: Parametric Restrictions and Simplifications
108(1)
5.1.3 imedge.c: Theory of Operation
109(2)
5.2 IMEDGE.C: Implementation
111(7)
5.2.1 Initialization and Time-Stamping
112(1)
5.2.2 Initialization Functions for Different Image Representations
113(1)
5.2.3 Launching and Terminating Threads
114(1)
5.2.4 Gaussian Filter
115(1)
5.2.5 Sobel
116(1)
5.2.6 Threshold
117(1)
5.3 Performance Of IMEDGE
118(1)
5.4 IMEDGEMC: Making Imedge More Efficient
118(8)
5.4.1 Using Precomputation to Reduce Bandwidth
119(1)
5.4.2 Storing the Precomputed Pixel Values
120(1)
5.4.3 Precomputing Pixel Values
121(1)
5.4.4 Reading the Image and Precomputing Pixel Values
122(1)
5.4.5 PrGaussianFilter
123(1)
5.4.6 PrSobel
124(1)
5.4.7 PrThreshold
125(1)
5.5 Performance Of IMEDGEMC
126(1)
5.6 IMEDGEMCT: Synchronizing Threads Efficiently
127(3)
5.6.1 Barrier Synchronization
128(1)
5.6.2 MUTEX Structure for Data Sharing
129(1)
5.7 IMEDGEMCT: Implementation
130(4)
5.7.1 Using a MUTEX: Read Image, Precompute
132(1)
5.7.2 Precomputing One Row at a Time
133(1)
5.8 Performance Of IMEDGEMCT
134(3)
Part II GPU Programming Using CUDA
Chapter 6 Introduction to GPU Parallelism and CUDA
137(48)
6.1 Once Upon A Time...NVIDIA
137(6)
6.1.1 The Birth of the GPU
137(1)
6.1.2 Early GPU Architectures
138(2)
6.1.3 The Birth of the GPGPU
140(1)
6.1.4 Nvidia, ATI Technologies, and Intel
141(2)
6.2 Compute-Unified Device Architecture (CUDA)
143(1)
6.2.1 CUDA, OpenCL, and Other GPU Languages
143(1)
6.2.2 Device Side versus Host Side Code
143(1)
6.3 Understanding GPU Parallelism
144(3)
6.3.1 How Does the GPU Achieve High Performance?
145(1)
6.3.2 CPU versus GPU Architectural Differences
146(1)
6.4 CUDA Version Of The Image Flipper: IMFLIPG.CU
147(23)
6.4.1 imflipG.cu: Read the Image into a CPU-Side Array
149(2)
6.4.2 Initialize and Query the GPUs
151(2)
6.4.3 GPU-Side Time-Stamping
153(2)
6.4.4 GPU-Side Memory Allocation
155(1)
6.4.5 GPU Drivers and Nvidia Runtime Engine
155(1)
6.4.6 CPU→GPU Data Transfer
156(1)
6.4.7 Error Reporting Using Wrapper Functions
157(1)
6.4.8 GPU Kernel Execution
157(3)
6.4.9 Finish Executing the GPU Kernel
160(1)
6.4.10 Transfer GPU Results Back to the CPU
161(1)
6.4.11 Complete Time-Stamping
161(1)
6.4.12 Report the Results and Cleanup
162(1)
6.4.13 Reading and Writing the BMP File
163(1)
6.4.14 Vflip(): The GPU Kernel for Vertical Flipping
164(2)
6.4.15 What Is My Thread ID, Block ID, and Block Dimension?
166(3)
6.4.16 Hflip(): The GPU Kernel for Horizontal Flipping
169(1)
6.4.17 Hardware Parameters: threadIdx.x, blockIdx.x, blockDim.x
169(1)
6.4.18 PixCopy(): The GPU Kernel for Copying an Image
169(1)
6.4.19 CUDA Keywords
170(1)
6.5 CUDA Program Development In Windows
170(9)
6.5.1 Installing MS Visual Studio 2015 and CUDA Toolkit 8.0
171(1)
6.5.2 Creating Project imflipG.cu in Visual Studio 2015
172(2)
6.5.3 Compiling Project imflipG.cu in Visual Studio 2015
174(3)
6.5.4 Running Our First CUDA Application: imflipG.exe
177(1)
6.5.5 Ensuring Your Program's Correctness
178(1)
6.6 CUDA Program Development On A Mac Platform
179(2)
6.6.1 Installing XCode on Your Mac
179(1)
6.6.2 Installing the CUDA Driver and CUDA Toolkit
180(1)
6.6.3 Compiling and Running CUDA Applications on a Mac
180(1)
6.7 CUDA Program Development In A Unix Platform
181(4)
6.7.1 Installing Eclipse and CUDA Toolkit
181(1)
6.7.2 ssh into a Cluster
182(1)
6.7.3 Compiling and Executing Your CUDA Code
182(3)
Chapter 7 CUDA Host/Device Programming Model
185(40)
7.1 Designing Your Program's Parallelism
185(4)
7.1.1 Conceptually Parallelizing a Task
186(1)
7.1.2 What Is a Good Block Size for Vflip()?
187(1)
7.1.3 imflipG.cu: Interpreting the Program Output
187(1)
7.1.4 imflipG.cu: Performance Impact of Block and Image Size
188(1)
7.2 Kernel Launch Components
189(4)
7.2.1 Grids
189(1)
7.2.2 Blocks
190(1)
7.2.3 Threads
191(1)
7.2.4 Warps and Lanes
192(1)
7.3 IMFLIPG.CU: Understanding The Kernel Details
193(6)
7.3.1 Launching Kernels in main() and Passing Arguments to Them
193(1)
7.3.2 Thread Execution Steps
194(1)
7.3.3 Vflip() Kernel Details
195(1)
7.3.4 Comparing Vflip() and MTFlipV()
196(1)
7.3.5 Hflip() Kernel Details
197(1)
7.3.6 PixCopy() Kernel Details
197(2)
7.4 Dependence Of PCI Express Speed On The CPU
199(1)
7.5 Performance Impact Of PCI Express Bus
200(4)
7.5.1 Data Transfer Time, Speed, Latency, Throughput, and Bandwidth
200(1)
7.5.2 PCIe Throughput Achieved with imflipG.cu
201(3)
7.6 Performance Impact Of Global Memory Bus
204(2)
7.7 Performance Impact Of Compute Capability
206(8)
7.7.1 Fermi, Kepler, Maxwell, Pascal, and Volta Families
207(1)
7.7.2 Relative Bandwidth Achieved in Different Families
207(1)
7.7.3 imflipG2.cu: Compute Capability 2.0 Version of imflipG.cu
208(2)
7.7.4 imflipG2.cu: Changes in main()
210(1)
7.7.5 The PxCC20() Kernel
211(1)
7.7.6 The VfCC20() Kernel
212(2)
7.8 Performance Of IMFLIPG2.CU
214(1)
7.9 Old-School CUDA Debugging
214(7)
7.9.1 Common CUDA Bugs
216(2)
7.9.2 return Debugging
218(2)
7.9.3 Comment-Based Debugging
220(1)
7.9.4 printf() Debugging
220(1)
7.10 Biological Reasons For Software Bugs
221(4)
7.10.1 How Is Our Brain Involved in Writing/Debugging Code?
222(1)
7.10.2 Do We Write Buggy Code When We Are Tired?
222(4)
7.10.2.1 Attention
223(1)
7.10.2.2 Physical Tiredness
223(1)
7.10.2.3 Tiredness Due to Heavy Physical Activity
223(1)
7.10.2.4 Tiredness Due to Needing Sleep
223(1)
7.10.2.5 Mental Tiredness
224(1)
Chapter 8 Understanding GPU Hardware Architecture
225(38)
8.1 GPU Hardware Architecture
226(1)
8.2 GPU Hardware Components
226(4)
8.2.1 SM: Streaming Multiprocessor
226(1)
8.2.2 GPU Cores
227(1)
8.2.3 Giga-Thread Scheduler
227(2)
8.2.4 Memory Controllers
229(1)
8.2.5 Shared Cache Memory (L2$)
229(1)
8.2.6 Host Interface
229(1)
8.3 NVIDIA GPU Architectures
230(3)
8.3.1 Fermi Architecture
231(1)
8.3.2 GT, GTX, and Compute Accelerators
231(1)
8.3.3 Kepler Architecture
232(1)
8.3.4 Maxwell Architecture
232(1)
8.3.5 Pascal Architecture and NVLink
233(1)
8.4 CUDA Edge Detection: IMEDGEG.CU
233(9)
8.4.1 Variables to Store the Image in CPU, GPU Memory
233(2)
8.4.1.1 TheImage and CopyImage
233(1)
8.4.1.2 GPUImg
234(1)
8.4.1.3 GPUBWImg
234(1)
8.4.1.4 GPUGaussImg
234(1)
8.4.1.5 GPUGradient and GPUTheta
234(1)
8.4.1.6 GPUResultImg
235(1)
8.4.2 Allocating Memory for the GPU Variables
235(3)
8.4.3 Calling the Kernels and Time-Stamping Their Execution
238(1)
8.4.4 Computing the Kernel Performance
239(1)
8.4.5 Computing the Amount of Kernel Data Movement
239(3)
8.4.6 Reporting the Kernel Performance
242(1)
8.5 IMEDGEG: Kernels
242(7)
8.5.1 BWKernel()
242(2)
8.5.2 GaussKernel()
244(2)
8.5.3 SobelKernel()
246(3)
8.5.4 ThresholdKernel()
249(1)
8.6 Performance Of IMEDGEG.CU
249(4)
8.6.1 imedgeG.cu: PCIe Bus Utilization
250(1)
8.6.2 imedgeG.cu: Runtime Results
250(2)
8.6.3 imedgeG.cu: Kernel Performance Comparison
252(1)
8.7 GPU Code: Compile Time
253(2)
8.7.1 Designing CUDA Code
253(2)
8.7.2 Compiling CUDA Code
255(1)
8.7.3 GPU Assembly: PTX, CUBIN
255(1)
8.8 GPU Code: Launch
255(2)
8.8.1 OS Involvement and CUDA DLL File
255(1)
8.8.2 GPU Graphics Driver
256(1)
8.8.3 CPU↔GPU Memory Transfers
256(1)
8.9 GPU Code: Execution (Run Time)
257(6)
8.9.1 Getting the Data
257(1)
8.9.2 Getting the Code and Parameters
257(1)
8.9.3 Launching Grids of Blocks
258(1)
8.9.4 Giga Thread Scheduler (GTS)
258(1)
8.9.5 Scheduling Blocks
259(1)
8.9.6 Executing Blocks
260(1)
8.9.7 Transparent Scalability
261(2)
Chapter 9 Understanding GPU Cores
263(40)
9.1 GPU Architecture Families
263(12)
9.1.1 Fermi Architecture
263(1)
9.1.2 Fermi SM Structure
264(2)
9.1.3 Kepler Architecture
266(1)
9.1.4 Kepler SMX Structure
267(1)
9.1.5 Maxwell Architecture
268(1)
9.1.6 Maxwell SMM Structure
268(2)
9.1.7 Pascal GP100 Architecture
270(1)
9.1.8 Pascal GP100 SM Structure
271(1)
9.1.9 Family Comparison: Peak GFLOPS and Peak DGFLOPS
272(1)
9.1.10 GPU Boost
273(1)
9.1.11 GPU Power Consumption
274(1)
9.1.12 Computer Power Supply
274(1)
9.2 Streaming Multiprocessor (SM) Building Blocks
275(4)
9.2.1 GPU Cores
275(1)
9.2.2 Double Precision Units (DPU)
276(1)
9.2.3 Special Function Units (SFU)
276(1)
9.2.4 Register File (RF)
276(1)
9.2.5 Load/Store Queues (LDST)
277(1)
9.2.6 L1$ and Texture Cache
277(1)
9.2.7 Shared Memory
278(1)
9.2.8 Constant Cache
278(1)
9.2.9 Instruction Cache
278(1)
9.2.10 Instruction Buffer
278(1)
9.2.11 Warp Schedulers
278(1)
9.2.12 Dispatch Units
279(1)
9.3 Parallel Thread Execution (PTX) Data Types
279(7)
9.3.1 INT8: 8-bit Integer
280(1)
9.3.2 INT16: 16-bit Integer
280(1)
9.3.3 24-bit Integer
280(1)
9.3.4 INT32: 32-bit Integer
281(1)
9.3.5 Predicate Registers (32-bit)
281(1)
9.3.6 INT64: 64-bit Integer
282(1)
9.3.7 128-bit Integer
282(1)
9.3.8 FP32: Single Precision Floating Point (float)
282(1)
9.3.9 FP64: Double Precision Floating Point (double)
283(1)
9.3.10 FP16: Half Precision Floating Point (half)
284(1)
9.3.11 What is a FLOP?
284(1)
9.3.12 Fused Multiply-Accumulate (FMA) versus Multiply-Add (MAD)
285(1)
9.3.13 Quad and Octo Precision Floating Point
285(1)
9.3.14 Pascal GP104 Engine SM Structure
285(1)
9.4 IMFLIPGC.CU: Core-Friendly IMFLIPG
286(13)
9.4.1 Hflip2(): Precomputing Kernel Parameters
288(2)
9.4.2 Vflip2(): Precomputing Kernel Parameters
290(1)
9.4.3 Computing Image Coordinates by a Thread
290(1)
9.4.4 Block ID versus Image Row Mapping
291(1)
9.4.5 Hflip3(): Using a 2D Launch Grid
292(1)
9.4.6 Vflip3(): Using a 2D Launch Grid
293(1)
9.4.7 Hflip4(): Computing Two Consecutive Pixels
294(1)
9.4.8 Vflip4(): Computing Two Consecutive Pixels
295(1)
9.4.9 Hflip5(): Computing Four Consecutive Pixels
296(1)
9.4.10 Vflip5(): Computing Four Consecutive Pixels
297(1)
9.4.11 PixCopy2(), PixCopy3(): Copying 2, 4 Consecutive Pixels at a Time
298(1)
9.5 IMEDGEGC.CU: Core-Friendly IMEDGEG
299(4)
9.5.1 BWKernel2(): Using Precomputed Values and 2D Blocks
299(1)
9.5.2 GaussKernel2(): Using Precomputed Values and 2D Blocks
300(3)
Chapter 10 Understanding GPU Memory
303(42)
10.1 Global Memory
303(1)
10.2 L2 CACHE
304(1)
10.3 Texture/L1 Cache
304(1)
10.4 Shared Memory
305(2)
10.4.1 Split versus Dedicated Shared Memory
305(1)
10.4.2 Memory Resources Available Per Core
306(1)
10.4.3 Using Shared Memory as Software Cache
306(1)
10.4.4 Allocating Shared Memory in an SM
307(1)
10.5 Instruction Cache
307(1)
10.6 Constant Memory
307(1)
10.7 IMFLIPGCM.CU: Core And Memory Friendly IMFLIPG
308(11)
10.7.1 Hflip6(), Vflip6(): Using Shared Memory as Buffer
308(2)
10.7.2 Hflip7(): Consecutive Swap Operations in Shared Memory
310(2)
10.7.3 Hflip8(): Using Registers to Swap Four Pixels
312(2)
10.7.4 Vflip7(): Copying 4 Bytes (int) at a Time
314(1)
10.7.5 Aligned versus Unaligned Data Access in Memory
314(1)
10.7.6 Vflip8(): Copying 8 Bytes at a Time
315(1)
10.7.7 Vflip9(): Using Only Global Memory, 8 Bytes at a Time
316(1)
10.7.8 PixCopy4(), PixCopy5(): Copying One versus 4 Bytes Using Shared Memory
317(1)
10.7.9 PixCopy6(), PixCopy7(): Copying One/Two Integers Using Global Memory
318(1)
10.8 IMEDGEGCM.CU: Core- & Memory-Friendly IMEDGEG
319(14)
10.8.1 BWKernel3(): Using Byte Manipulation to Extract RGB
319(2)
10.8.2 GaussKernel3(): Using Constant Memory
321(1)
10.8.3 Ways to Handle Constant Values
321(2)
10.8.4 GaussKernel4(): Buffering Neighbors of 1 Pixel in Shared Memory
323(2)
10.8.5 GaussKernel5(): Buffering Neighbors of 4 Pixels in Shared Memory
325(2)
10.8.6 GaussKernel6(): Reading 5 Vertical Pixels into Shared Memory
327(2)
10.8.7 GaussKernel7(): Eliminating the Need to Account for Edge Pixels
329(2)
10.8.8 GaussKernel8(): Computing 8 Vertical Pixels
331(2)
10.9 CUDA Occupancy Calculator
333(12)
10.9.1 Choosing the Optimum Threads/Block
334(1)
10.9.2 SM-Level Resource Limitations
335(1)
10.9.3 What is "Occupancy"?
336(1)
10.9.4 CUDA Occupancy Calculator: Resource Computation
336(4)
10.9.5 Case Study: GaussKernel7()
340(3)
10.9.6 Case Study: GaussKernel8()
343(2)
Chapter 11 CUDA Streams
345(38)
11.1 What Is Pipelining?
347(2)
11.1.1 Execution Overlapping
347(1)
11.1.2 Exposed versus Coalesced Runtime
348(1)
11.2 Memory Allocation
349(2)
11.2.1 Physical versus Virtual Memory
349(1)
11.2.2 Physical to Virtual Address Translation
350(1)
11.2.3 Pinned Memory
350(1)
11.2.4 Allocating Pinned Memory with cudaMallocHost()
351(1)
11.3 Fast CPU↔GPU Data Transfers
351(1)
11.3.1 Synchronous Data Transfers
351(1)
11.3.2 Asynchronous Data Transfers
351(1)
11.4 CUDA Streams
352(4)
11.4.1 CPU->GPU Transfer, Kernel Exec, GPU-*CPUTransfer
352(1)
11.4.2 Implementing Streaming in CUDA
353(1)
11.4.3 Copy Engine
353(1)
11.4.4 Kernel Execution Engine
353(1)
11.4.5 Concurrent Upstream and Downstream PCIe Transfers
354(1)
11.4.6 Creating CUDA Streams
355(1)
11.4.7 Destroying CUDA Streams
355(1)
11.4.8 Synchronizing CUDA Streams
355(1)
11.5 IMGSTR.CU: Streaming Image Processing
356(10)
11.5.1 Reading the Image into Pinned Memory
356(2)
11.5.2 Synchronous versus Single Stream
358(1)
11.5.3 Multiple Streams
359(2)
11.5.4 Data Dependence Across Multiple Streams
361(10)
11.5.4.1 Horizontal Flip: No Data Dependence
362(1)
11.5.4.2 Edge Detection: Data Dependence
363(1)
11.5.4.3 Preprocessing Overlapping Rows Synchronously
363(1)
11.5.4.4 Asynchronous Processing of the Non-Overlapping Rows
364(2)
11.6 Streaming Horizontal Flip Kernel
366(1)
11.7 IMGSTR.CU: Streaming Edge Detection
367(4)
11.8 Performance Comparison: IMGSTR.CU
371(4)
11.8.1 Synchronous versus Asynchronous Results
371(1)
11.8.2 Randomness in the Results
372(1)
11.8.3 Optimum Queuing
372(1)
11.8.4 Best Case Streaming Results
373(1)
11.8.5 Worst Case Streaming Results
374(1)
11.9 NVIDIA Visual Profiler: NVVP
375(8)
11.9.1 Installing nvvp and nvprof
375(1)
11.9.2 Using nvvp
376(1)
11.9.3 Using nvprof
377(1)
11.9.4 imGStr Synchronous and Single-Stream Results
377(1)
11.9.5 imGStr 2- and 4-Stream Results
378(5)
Part III More To Know
Chapter 12 CUDA Libraries
383(14)
Mohammadhadi Habibzadeh
Omid Rajabi Shishvan
Tolga Soyata
12.1 cuBLAS
383(7)
12.1.1 BLAS Levels
383(1)
12.1.2 cuBLAS Datatypes
384(1)
12.1.3 Installing cuBLAS
385(1)
12.1.4 Variable Declaration and Initialization
385(1)
12.1.5 Device Memory Allocation
386(1)
12.1.6 Creating Context
386(1)
12.1.7 Transferring Data to the Device
386(1)
12.1.8 Calling cuBLAS Functions
387(1)
12.1.9 Transfer Data Back to the Host
388(1)
12.1.10 Deallocating Memory
388(1)
12.1.11 Example cuBLAS Program: Matrix Scalar
388(2)
12.2 cuFFT
390(2)
12.2.1 cuFFT Library Characteristics
390(1)
12.2.2 A Sample Complex-to-Complex Transform
390(1)
12.2.3 A Sample Real-to-Complex Transform
391(1)
12.3 NVIDIA Performance Primitives (NPP)
392(1)
12.4 Thrust Library
393(4)
Chapter 13 Introduction to OpenCL
397(16)
Chase Conklin
Tolga Soyata
13.1 What Is OpenCL?
397(1)
13.1.1 Multiplatform
397(1)
13.1.2 Queue-Based
397(1)
13.2 Image Flip Kernel In OPENCL
398(1)
13.3 Running Our Kernel
399(7)
13.3.1 Selecting a Device
400(1)
13.3.2 Running the Kernel
401(1)
13.3.2.1 Creating a Compute Context
401(1)
13.3.2.2 Creating a Command Queue
401(1)
13.3.2.3 Loading Kernel File
402(1)
13.3.2.4 Setting Up Kernel Invocation
403(2)
13.3.3 Runtimes of Our OpenCL Program
405(1)
13.4 Edge Detection In OpenCL
406(7)
Chapter 14 Other GPU Programming Languages
413(12)
Sia Miller
Andrew Boggio-Dandry
Tolga Soyata
14.1 GPU Programming With Python
413(7)
14.1.1 PyOpenCL Version of imflip
414(4)
14.1.2 PyOpenCL Element-Wise Kernel
418(2)
14.2 OPENGL
420(1)
14.3 OPENGL ES: OPENGL For Embedded Systems
420(1)
14.4 VULKAN
421(1)
14.5 Microsoft's High-Level Shading Language (HLSL)
421(1)
14.5.1 Shading
421(1)
14.5.2 Microsoft HLSL
422(1)
14.6 Apple's Metal API
422(1)
14.7 Apple's Swift Programming Language
423(1)
14.8 OPENCV
423(2)
14.8.1 Installing OpenCV and Face Recognition
423(1)
14.8.2 Mobile-Cloudlet-Cloud Real-Time Face Recognition
423(1)
14.8.3 Acceleration as a Service (AXaaS)
423(2)
Chapter 15 Deep Learning Using CUDA
425(10)
Omid Rajabi Shishvan
Tolga Soyata
15.1 Artificial Neural Networks (ANNs)
425(1)
15.1.1 Neurons
425(1)
15.1.2 Activation Functions
425(1)
15.2 Fully Connected Neural Networks
425(2)
15.3 Deep Networks/Convolutional Neural Networks
427(1)
15.4 Training A Network
428(1)
15.5 cuDNN Library For Deep Learning
428(4)
15.5.1 Creating a Layer
429(1)
15.5.2 Creating a Network
430(1)
15.5.3 Forward Propagation
431(1)
15.5.4 Backpropagation
431(1)
15.5.5 Using cuBLAS in the Network
431(1)
15.6 Keras
432(3)
Bibliography 435(4)
Index 439
Tolga Soyata is an associate professor in the Department of Electrical and Computer Engineering at SUNY Albany.