List of Figures  xxiii
List of Tables  xxix
Preface  xxxiii
About the Author  xxxv
Part I  Understanding CPU Parallelism

Chapter 1 Introduction to CPU Parallel Programming  3 (24)
1.1 Evolution Of Parallel Programming  3 (1)
1.2 More Cores, More Parallelism  4 (1)
1.3 …  5 (3)
1.3.1 More Threads or More Cores to Parallelize?  5 (2)
1.3.2 Influence of Core Resource Sharing  7 (1)
1.3.3 Influence of Memory Resource Sharing  7 (1)
1.4 Our First Serial Program  8 (5)
1.4.1 Understanding Data Transfer Speeds  8 (2)
1.4.2 The main() Function in imflip.c  10 (1)
1.4.3 Flipping Rows Vertically: FlipImageV()  11 (1)
1.4.4 Flipping Columns Horizontally: FlipImageH()  12 (1)
1.5 Writing, Compiling, Running Our Programs  13 (2)
1.5.1 Choosing an Editor and a Compiler  13 (1)
1.5.2 Developing in Windows 7, 8, and Windows 10 Platforms  13 (2)
1.5.3 Developing in a Mac Platform  15 (1)
1.5.4 Developing in a Unix Platform  15 (1)
1.6 …  15 (4)
1.6.1 Unix Directory-Related Commands  15 (1)
1.6.2 Unix File-Related Commands  16 (3)
1.7 Debugging Your Programs  19 (4)
1.7.1 …  20 (1)
1.7.2 Old School Debugging  21 (1)
1.7.3 …  22 (1)
1.8 Performance Of Our First Serial Program  23 (4)
1.8.1 Can We Estimate the Execution Time?  24 (1)
1.8.2 What Does the OS Do When Our Code Is Executing?  24 (1)
1.8.3 How Do We Parallelize It?  25 (1)
1.8.4 Thinking About the Resources  25 (2)
|
Chapter 2 Developing Our First Parallel CPU Program  27 (26)
2.1 Our First Parallel Program  27 (10)
2.1.1 The main() Function in imflipP.c  28 (1)
2.1.2 Timing the Execution  29 (1)
2.1.3 Split Code Listing for main() in imflipP.c  29 (3)
2.1.4 Thread Initialization  32 (1)
2.1.5 …  32 (2)
2.1.6 Thread Launch/Execution  34 (1)
2.1.7 Thread Termination (Join)  35 (1)
2.1.8 Thread Task and Data Splitting  35 (2)
2.2 Working With Bitmap (BMP) Files  37 (5)
2.2.1 BMP is a Non-Lossy/Uncompressed File Format  37 (1)
2.2.2 BMP Image File Format  38 (1)
2.2.3 Header File ImageStuff.h  39 (1)
2.2.4 Image Manipulation Routines in ImageStuff.c  40 (2)
2.3 Task Execution By Threads  42 (9)
2.3.1 …  43 (2)
2.3.2 Multithreaded Vertical Flip: MTFlipV()  45 (3)
2.3.3 Comparing FlipImageV() and MTFlipV()  48 (2)
2.3.4 Multithreaded Horizontal Flip: MTFlipH()  50 (1)
2.4 Testing/Timing The Multithreaded Code  51 (2)
|
Chapter 3 Improving Our First Parallel CPU Program  53 (26)
3.1 Effect Of The "Programmer" On Performance  53 (1)
3.2 Effect Of The "CPU" On Performance  54 (3)
3.2.1 In-Order versus Out-Of-Order Cores  55 (2)
3.2.2 Thin versus Thick Threads  57 (1)
3.3 Performance Of IMFLIPP  57 (1)
3.4 Effect Of The "OS" On Performance  58 (5)
3.4.1 …  59 (1)
3.4.2 Thread Launch and Execution  59 (1)
3.4.3 …  60 (1)
3.4.4 Mapping Software Threads to Hardware Threads  61 (1)
3.4.5 Program Performance versus Launched Pthreads  62 (1)
3.5 …  63 (4)
3.5.1 Analyzing Memory Access Patterns in MTFlipH()  64 (1)
3.5.2 Multithreaded Memory Access of MTFlipH()  64 (2)
3.5.3 DRAM Access Rules of Thumb  66 (1)
3.6 IMFLIPPM: Obeying DRAM Rules Of Thumb  67 (5)
3.6.1 Chaotic Memory Access Patterns of imflipP  67 (1)
3.6.2 Improving Memory Access Patterns of imflipP  68 (1)
3.6.3 MTFlipHM(): The Memory Friendly MTFlipH()  69 (2)
3.6.4 MTFlipVM(): The Memory Friendly MTFlipV()  71 (1)
3.7 Performance Of IMFLIPPM.C  72 (2)
3.7.1 Comparing Performances of imflipP.c and imflipPM.c  72 (1)
3.7.2 Speed Improvement: MTFlipV() versus MTFlipVM()  73 (1)
3.7.3 Speed Improvement: MTFlipH() versus MTFlipHM()  73 (1)
3.7.4 Understanding the Speedup: MTFlipH() versus MTFlipHM()  73 (1)
3.8 …  74 (2)
3.9 Intel MIC Architecture: Xeon Phi  76 (1)
3.10 …  77 (1)
3.11 …  78 (1)
|
Chapter 4 Understanding the Cores and Memory  79 (28)
4.1 Once Upon A Time...INTEL...  79 (1)
4.2 CPU And Memory Manufacturers  80 (1)
4.3 Dynamic (DRAM) Versus Static (SRAM) Memory  81 (2)
4.3.1 Static Random Access Memory (SRAM)  81 (1)
4.3.2 Dynamic Random Access Memory (DRAM)  81 (1)
4.3.3 DRAM Interface Standards  81 (1)
4.3.4 Influence of DRAM on our Program Performance  82 (1)
4.3.5 Influence of SRAM (Cache) on our Program Performance  83 (1)
4.4 Image Rotation Program: IMROTATE.C  83 (6)
4.4.1 Description of the imrotate.c  84 (1)
4.4.2 imrotate.c: Parametric Restrictions and Simplifications  84 (1)
4.4.3 imrotate.c: Theory of Operation  85 (4)
4.5 Performance Of Imrotate  89 (2)
4.5.1 Qualitative Analysis of Threading Efficiency  89 (1)
4.5.2 Quantitative Analysis: Defining Threading Efficiency  89 (2)
4.6 The Architecture Of The Computer  91 (6)
4.6.1 The Cores, L1$ and L2$  91 (1)
4.6.2 Internal Core Resources  92 (2)
4.6.3 The Shared L3 Cache Memory (L3$)  94 (1)
4.6.4 The Memory Controller  94 (1)
4.6.5 …  95 (1)
4.6.6 Queue, Uncore, and I/O  96 (1)
4.7 IMROTATEMC: Making Imrotate More Efficient  97 (9)
4.7.1 Rotate2(): How Bad is Square Root and FP Division?  99 (1)
4.7.2 Rotate3() and Rotate4(): How Bad Is sin() and cos()?  100 (2)
4.7.3 Rotate5(): How Bad Is Integer Division/Multiplication?  102 (1)
4.7.4 Rotate6(): Consolidating Computations  102 (2)
4.7.5 Rotate7(): Consolidating More Computations  104 (1)
4.7.6 Overall Performance of imrotateMC  104 (2)
4.8 …  106 (1)
|
Chapter 5 Thread Management and Synchronization  107 (30)
5.1 Edge Detection Program: IMEDGE.C  107 (4)
5.1.1 Description of the imedge.c  108 (1)
5.1.2 imedge.c: Parametric Restrictions and Simplifications  108 (1)
5.1.3 imedge.c: Theory of Operation  109 (2)
5.2 IMEDGE.C: Implementation  111 (7)
5.2.1 Initialization and Time-Stamping  112 (1)
5.2.2 Initialization Functions for Different Image Representations  113 (1)
5.2.3 Launching and Terminating Threads  114 (1)
5.2.4 …  115 (1)
5.2.5 …  116 (1)
5.2.6 …  117 (1)
5.3 Performance Of IMEDGE  118 (1)
5.4 IMEDGEMC: Making Imedge More Efficient  118 (8)
5.4.1 Using Precomputation to Reduce Bandwidth  119 (1)
5.4.2 Storing the Precomputed Pixel Values  120 (1)
5.4.3 Precomputing Pixel Values  121 (1)
5.4.4 Reading the Image and Precomputing Pixel Values  122 (1)
5.4.5 …  123 (1)
5.4.6 …  124 (1)
5.4.7 …  125 (1)
5.5 Performance Of IMEDGEMC  126 (1)
5.6 IMEDGEMCT: Synchronizing Threads Efficiently  127 (3)
5.6.1 Barrier Synchronization  128 (1)
5.6.2 MUTEX Structure for Data Sharing  129 (1)
5.7 IMEDGEMCT: Implementation  130 (4)
5.7.1 Using a MUTEX: Read Image, Precompute  132 (1)
5.7.2 Precomputing One Row at a Time  133 (1)
5.8 Performance Of IMEDGEMCT  134 (3)
Part II  GPU Programming Using CUDA

Chapter 6 Introduction to GPU Parallelism and CUDA  137 (48)
6.1 Once Upon A Time...NVIDIA  137 (6)
6.1.1 The Birth of the GPU  137 (1)
6.1.2 Early GPU Architectures  138 (2)
6.1.3 The Birth of the GPGPU  140 (1)
6.1.4 Nvidia, ATI Technologies, and Intel  141 (2)
6.2 Compute-Unified Device Architecture (CUDA)  143 (1)
6.2.1 CUDA, OpenCL, and Other GPU Languages  143 (1)
6.2.2 Device Side versus Host Side Code  143 (1)
6.3 Understanding GPU Parallelism  144 (3)
6.3.1 How Does the GPU Achieve High Performance?  145 (1)
6.3.2 CPU versus GPU Architectural Differences  146 (1)
6.4 CUDA Version Of The Image Flipper: IMFLIPG.CU  147 (23)
6.4.1 imflipG.cu: Read the Image into a CPU-Side Array  149 (2)
6.4.2 Initialize and Query the GPUs  151 (2)
6.4.3 GPU-Side Time-Stamping  153 (2)
6.4.4 GPU-Side Memory Allocation  155 (1)
6.4.5 GPU Drivers and Nvidia Runtime Engine  155 (1)
6.4.6 CPU→GPU Data Transfer  156 (1)
6.4.7 Error Reporting Using Wrapper Functions  157 (1)
6.4.8 GPU Kernel Execution  157 (3)
6.4.9 Finish Executing the GPU Kernel  160 (1)
6.4.10 Transfer GPU Results Back to the CPU  161 (1)
6.4.11 Complete Time-Stamping  161 (1)
6.4.12 Report the Results and Cleanup  162 (1)
6.4.13 Reading and Writing the BMP File  163 (1)
6.4.14 Vflip(): The GPU Kernel for Vertical Flipping  164 (2)
6.4.15 What Is My Thread ID, Block ID, and Block Dimension?  166 (3)
6.4.16 Hflip(): The GPU Kernel for Horizontal Flipping  169 (1)
6.4.17 Hardware Parameters: threadIdx.x, blockIdx.x, blockDim.x  169 (1)
6.4.18 PixCopy(): The GPU Kernel for Copying an Image  169 (1)
6.4.19 …  170 (1)
6.5 CUDA Program Development In Windows  170 (9)
6.5.1 Installing MS Visual Studio 2015 and CUDA Toolkit 8.0  171 (1)
6.5.2 Creating Project imflipG.cu in Visual Studio 2015  172 (2)
6.5.3 Compiling Project imflipG.cu in Visual Studio 2015  174 (3)
6.5.4 Running Our First CUDA Application: imflipG.exe  177 (1)
6.5.5 Ensuring Your Program's Correctness  178 (1)
6.6 CUDA Program Development On A Mac Platform  179 (2)
6.6.1 Installing XCode on Your Mac  179 (1)
6.6.2 Installing the CUDA Driver and CUDA Toolkit  180 (1)
6.6.3 Compiling and Running CUDA Applications on a Mac  180 (1)
6.7 CUDA Program Development In A Unix Platform  181 (4)
6.7.1 Installing Eclipse and CUDA Toolkit  181 (1)
6.7.2 …  182 (1)
6.7.3 Compiling and Executing Your CUDA Code  182 (3)
|
Chapter 7 CUDA Host/Device Programming Model  185 (40)
7.1 Designing Your Program's Parallelism  185 (4)
7.1.1 Conceptually Parallelizing a Task  186 (1)
7.1.2 What Is a Good Block Size for Vflip()?  187 (1)
7.1.3 imflipG.cu: Interpreting the Program Output  187 (1)
7.1.4 imflipG.cu: Performance Impact of Block and Image Size  188 (1)
7.2 Kernel Launch Components  189 (4)
7.2.1 …  189 (1)
7.2.2 …  190 (1)
7.2.3 …  191 (1)
7.2.4 …  192 (1)
7.3 IMFLIPG.CU: Understanding The Kernel Details  193 (6)
7.3.1 Launching Kernels in main() and Passing Arguments to Them  193 (1)
7.3.2 Thread Execution Steps  194 (1)
7.3.3 Vflip() Kernel Details  195 (1)
7.3.4 Comparing Vflip() and MTFlipV()  196 (1)
7.3.5 Hflip() Kernel Details  197 (1)
7.3.6 PixCopy() Kernel Details  197 (2)
7.4 Dependence Of PCI Express Speed On The CPU  199 (1)
7.5 Performance Impact Of PCI Express Bus  200 (4)
7.5.1 Data Transfer Time, Speed, Latency, Throughput, and Bandwidth  200 (1)
7.5.2 PCIe Throughput Achieved with imflipG.cu  201 (3)
7.6 Performance Impact Of Global Memory Bus  204 (2)
7.7 Performance Impact Of Compute Capability  206 (8)
7.7.1 Fermi, Kepler, Maxwell, Pascal, and Volta Families  207 (1)
7.7.2 Relative Bandwidth Achieved in Different Families  207 (1)
7.7.3 imflipG2.cu: Compute Capability 2.0 Version of imflipG.cu  208 (2)
7.7.4 imflipG2.cu: Changes in main()  210 (1)
7.7.5 The PxCC20() Kernel  211 (1)
7.7.6 The VfCC20() Kernel  212 (2)
7.8 Performance Of IMFLIPG2.CU  214 (1)
7.9 Old-School CUDA Debugging  214 (7)
7.9.1 …  216 (2)
7.9.2 …  218 (2)
7.9.3 Comment-Based Debugging  220 (1)
7.9.4 …  220 (1)
7.10 Biological Reasons For Software Bugs  221 (4)
7.10.1 How Is Our Brain Involved in Writing/Debugging Code?  222 (1)
7.10.2 Do We Write Buggy Code When We Are Tired?  222 (4)
7.10.2.1 …  223 (1)
7.10.2.2 Physical Tiredness  223 (1)
7.10.2.3 Tiredness Due to Heavy Physical Activity  223 (1)
7.10.2.4 Tiredness Due to Needing Sleep  223 (1)
7.10.2.5 Mental Tiredness  224 (1)
|
Chapter 8 Understanding GPU Hardware Architecture  225 (38)
8.1 GPU Hardware Architecture  226 (1)
8.2 GPU Hardware Components  226 (4)
8.2.1 SM: Streaming Multiprocessor  226 (1)
8.2.2 …  227 (1)
8.2.3 Giga-Thread Scheduler  227 (2)
8.2.4 …  229 (1)
8.2.5 Shared Cache Memory (L2$)  229 (1)
8.2.6 …  229 (1)
8.3 NVIDIA GPU Architectures  230 (3)
8.3.1 …  231 (1)
8.3.2 GT, GTX, and Compute Accelerators  231 (1)
8.3.3 Kepler Architecture  232 (1)
8.3.4 Maxwell Architecture  232 (1)
8.3.5 Pascal Architecture and NVLink  233 (1)
8.4 CUDA Edge Detection: IMEDGEG.CU  233 (9)
8.4.1 Variables to Store the Image in CPU, GPU Memory  233 (2)
8.4.1.1 TheImage and CopyImage  233 (1)
8.4.1.2 …  234 (1)
8.4.1.3 …  234 (1)
8.4.1.4 …  234 (1)
8.4.1.5 GPUGradient and GPUTheta  234 (1)
8.4.1.6 …  235 (1)
8.4.2 Allocating Memory for the GPU Variables  235 (3)
8.4.3 Calling the Kernels and Time-Stamping Their Execution  238 (1)
8.4.4 Computing the Kernel Performance  239 (1)
8.4.5 Computing the Amount of Kernel Data Movement  239 (3)
8.4.6 Reporting the Kernel Performance  242 (1)
8.5 …  242 (7)
8.5.1 …  242 (2)
8.5.2 …  244 (2)
8.5.3 …  246 (3)
8.5.4 …  249 (1)
8.6 Performance Of IMEDGEG.CU  249 (4)
8.6.1 imedgeG.cu: PCIe Bus Utilization  250 (1)
8.6.2 imedgeG.cu: Runtime Results  250 (2)
8.6.3 imedgeG.cu: Kernel Performance Comparison  252 (1)
8.7 GPU Code: Compile Time  253 (2)
8.7.1 Designing CUDA Code  253 (2)
8.7.2 Compiling CUDA Code  255 (1)
8.7.3 GPU Assembly: PTX, CUBIN  255 (1)
8.8 …  255 (2)
8.8.1 OS Involvement and CUDA DLL File  255 (1)
8.8.2 GPU Graphics Driver  256 (1)
8.8.3 CPU↔GPU Memory Transfers  256 (1)
8.9 GPU Code: Execution (Run Time)  257 (6)
8.9.1 …  257 (1)
8.9.2 Getting the Code and Parameters  257 (1)
8.9.3 Launching Grids of Blocks  258 (1)
8.9.4 Giga Thread Scheduler (GTS)  258 (1)
8.9.5 …  259 (1)
8.9.6 …  260 (1)
8.9.7 Transparent Scalability  261 (2)
|
Chapter 9 Understanding GPU Cores  263 (40)
9.1 GPU Architecture Families  263 (12)
9.1.1 …  263 (1)
9.1.2 …  264 (2)
9.1.3 Kepler Architecture  266 (1)
9.1.4 Kepler SMX Structure  267 (1)
9.1.5 Maxwell Architecture  268 (1)
9.1.6 Maxwell SMM Structure  268 (2)
9.1.7 Pascal GP100 Architecture  270 (1)
9.1.8 Pascal GP100 SM Structure  271 (1)
9.1.9 Family Comparison: Peak GFLOPS and Peak DGFLOPS  272 (1)
9.1.10 …  273 (1)
9.1.11 GPU Power Consumption  274 (1)
9.1.12 Computer Power Supply  274 (1)
9.2 Streaming Multiprocessor (SM) Building Blocks  275 (4)
9.2.1 …  275 (1)
9.2.2 Double Precision Units (DPU)  276 (1)
9.2.3 Special Function Units (SFU)  276 (1)
9.2.4 …  276 (1)
9.2.5 Load/Store Queues (LDST)  277 (1)
9.2.6 L1$ and Texture Cache  277 (1)
9.2.7 …  278 (1)
9.2.8 …  278 (1)
9.2.9 …  278 (1)
9.2.10 Instruction Buffer  278 (1)
9.2.11 …  278 (1)
9.2.12 …  279 (1)
9.3 Parallel Thread Execution (PTX) Data Types  279 (7)
9.3.1 INT8: 8-bit Integer  280 (1)
9.3.2 INT16: 16-bit Integer  280 (1)
9.3.3 …  280 (1)
9.3.4 INT32: 32-bit Integer  281 (1)
9.3.5 Predicate Registers (32-bit)  281 (1)
9.3.6 INT64: 64-bit Integer  282 (1)
9.3.7 …  282 (1)
9.3.8 FP32: Single Precision Floating Point (float)  282 (1)
9.3.9 FP64: Double Precision Floating Point (double)  283 (1)
9.3.10 FP16: Half Precision Floating Point (half)  284 (1)
9.3.11 …  284 (1)
9.3.12 Fused Multiply-Accumulate (FMA) versus Multiply-Add (MAD)  285 (1)
9.3.13 Quad and Octo Precision Floating Point  285 (1)
9.3.14 Pascal GP104 Engine SM Structure  285 (1)
9.4 IMFLIPGC.CU: Core-Friendly IMFLIPG  286 (13)
9.4.1 Hflip2(): Precomputing Kernel Parameters  288 (2)
9.4.2 Vflip2(): Precomputing Kernel Parameters  290 (1)
9.4.3 Computing Image Coordinates by a Thread  290 (1)
9.4.4 Block ID versus Image Row Mapping  291 (1)
9.4.5 Hflip3(): Using a 2D Launch Grid  292 (1)
9.4.6 Vflip3(): Using a 2D Launch Grid  293 (1)
9.4.7 Hflip4(): Computing Two Consecutive Pixels  294 (1)
9.4.8 Vflip4(): Computing Two Consecutive Pixels  295 (1)
9.4.9 Hflip5(): Computing Four Consecutive Pixels  296 (1)
9.4.10 Vflip5(): Computing Four Consecutive Pixels  297 (1)
9.4.11 PixCopy2(), PixCopy3(): Copying 2,4 Consecutive Pixels at a Time  298 (1)
9.5 IMEDGEGC.CU: Core-Friendly IMEDGEG  299 (4)
9.5.1 BWKernel2(): Using Precomputed Values and 2D Blocks  299 (1)
9.5.2 GaussKernel2(): Using Precomputed Values and 2D Blocks  300 (3)
|
Chapter 10 Understanding GPU Memory  303 (42)
10.1 …  303 (1)
10.2 …  304 (1)
10.3 …  304 (1)
10.4 …  305 (2)
10.4.1 Split versus Dedicated Shared Memory  305 (1)
10.4.2 Memory Resources Available Per Core  306 (1)
10.4.3 Using Shared Memory as Software Cache  306 (1)
10.4.4 Allocating Shared Memory in an SM  307 (1)
10.5 …  307 (1)
10.6 …  307 (1)
10.7 IMFLIPGCM.CU: Core And Memory Friendly IMFLIPG  308 (11)
10.7.1 Hflip6(), Vflip6(): Using Shared Memory as Buffer  308 (2)
10.7.2 Hflip7(): Consecutive Swap Operations in Shared Memory  310 (2)
10.7.3 Hflip8(): Using Registers to Swap Four Pixels  312 (2)
10.7.4 Vflip7(): Copying 4 Bytes (int) at a Time  314 (1)
10.7.5 Aligned versus Unaligned Data Access in Memory  314 (1)
10.7.6 Vflip8(): Copying 8 Bytes at a Time  315 (1)
10.7.7 Vflip9(): Using Only Global Memory, 8 Bytes at a Time  316 (1)
10.7.8 PixCopy4(), PixCopy5(): Copying One versus 4 Bytes Using Shared Memory  317 (1)
10.7.9 PixCopy6(), PixCopy7(): Copying One/Two Integers Using Global Memory  318 (1)
10.8 IMEDGEGCM.CU: Core- & Memory-Friendly IMEDGEG  319 (14)
10.8.1 BWKernel3(): Using Byte Manipulation to Extract RGB  319 (2)
10.8.2 GaussKernel3(): Using Constant Memory  321 (1)
10.8.3 Ways to Handle Constant Values  321 (2)
10.8.4 GaussKernel4(): Buffering Neighbors of 1 Pixel in Shared Memory  323 (2)
10.8.5 GaussKernel5(): Buffering Neighbors of 4 Pixels in Shared Memory  325 (2)
10.8.6 GaussKernel6(): Reading 5 Vertical Pixels into Shared Memory  327 (2)
10.8.7 GaussKernel7(): Eliminating the Need to Account for Edge Pixels  329 (2)
10.8.8 GaussKernel8(): Computing 8 Vertical Pixels  331 (2)
10.9 CUDA Occupancy Calculator  333 (12)
10.9.1 Choosing the Optimum Threads/Block  334 (1)
10.9.2 SM-Level Resource Limitations  335 (1)
10.9.3 What is "Occupancy"?  336 (1)
10.9.4 CUDA Occupancy Calculator: Resource Computation  336 (4)
10.9.5 Case Study: GaussKernel7()  340 (3)
10.9.6 Case Study: GaussKernel8()  343 (2)
|
|
Chapter 11 …  345 (38)
11.1 …  347 (2)
11.1.1 Execution Overlapping  347 (1)
11.1.2 Exposed versus Coalesced Runtime  348 (1)
11.2 …  349 (2)
11.2.1 Physical versus Virtual Memory  349 (1)
11.2.2 Physical to Virtual Address Translation  350 (1)
11.2.3 …  350 (1)
11.2.4 Allocating Pinned Memory with cudaMallocHost()  351 (1)
11.3 Fast CPU↔GPU Data Transfers  351 (1)
11.3.1 Synchronous Data Transfers  351 (1)
11.3.2 Asynchronous Data Transfers  351 (1)
11.4 …  352 (4)
11.4.1 CPU→GPU Transfer, Kernel Exec, GPU→CPU Transfer  352 (1)
11.4.2 Implementing Streaming in CUDA  353 (1)
11.4.3 …  353 (1)
11.4.4 Kernel Execution Engine  353 (1)
11.4.5 Concurrent Upstream and Downstream PCIe Transfers  354 (1)
11.4.6 Creating CUDA Streams  355 (1)
11.4.7 Destroying CUDA Streams  355 (1)
11.4.8 Synchronizing CUDA Streams  355 (1)
11.5 IMGSTR.CU: Streaming Image Processing  356 (10)
11.5.1 Reading the Image into Pinned Memory  356 (2)
11.5.2 Synchronous versus Single Stream  358 (1)
11.5.3 …  359 (2)
11.5.4 Data Dependence Across Multiple Streams  361 (10)
11.5.4.1 Horizontal Flip: No Data Dependence  362 (1)
11.5.4.2 Edge Detection: Data Dependence  363 (1)
11.5.4.3 Preprocessing Overlapping Rows Synchronously  363 (1)
11.5.4.4 Asynchronous Processing the Non-Overlapping Rows  364 (2)
11.6 Streaming Horizontal Flip Kernel  366 (1)
11.7 IMGSTR.CU: Streaming Edge Detection  367 (4)
11.8 Performance Comparison: IMGSTR.CU  371 (4)
11.8.1 Synchronous versus Asynchronous Results  371 (1)
11.8.2 Randomness in the Results  372 (1)
11.8.3 …  372 (1)
11.8.4 Best Case Streaming Results  373 (1)
11.8.5 Worst Case Streaming Results  374 (1)
11.9 NVIDIA Visual Profiler: NVVP  375 (8)
11.9.1 Installing nvvp and nvprof  375 (1)
11.9.2 …  376 (1)
11.9.3 …  377 (1)
11.9.4 imGStr Synchronous and Single-Stream Results  377 (1)
11.9.5 imGStr 2- and 4-Stream Results  378 (5)
Part III  More To Know

Chapter 12 CUDA Libraries  383 (14)
12.1 …  383 (7)
12.1.1 …  383 (1)
12.1.2 …  384 (1)
12.1.3 …  385 (1)
12.1.4 Variable Declaration and Initialization  385 (1)
12.1.5 Device Memory Allocation  386 (1)
12.1.6 …  386 (1)
12.1.7 Transferring Data to the Device  386 (1)
12.1.8 Calling cuBLAS Functions  387 (1)
12.1.9 Transfer Data Back to the Host  388 (1)
12.1.10 Deallocating Memory  388 (1)
12.1.11 Example cuBLAS Program: Matrix Scalar  388 (2)
12.2 …  390 (2)
12.2.1 cuFFT Library Characteristics  390 (1)
12.2.2 A Sample Complex-to-Complex Transform  390 (1)
12.2.3 A Sample Real-to-Complex Transform  391 (1)
12.3 NVIDIA Performance Primitives (NPP)  392 (1)
12.4 …  393 (4)
|
Chapter 13 Introduction to OpenCL  397 (16)
13.1 …  397 (1)
13.1.1 …  397 (1)
13.1.2 …  397 (1)
13.2 Image Flip Kernel In OpenCL  398 (1)
13.3 …  399 (7)
13.3.1 Selecting a Device  400 (1)
13.3.2 Running the Kernel  401 (1)
13.3.2.1 Creating a Compute Context  401 (1)
13.3.2.2 Creating a Command Queue  401 (1)
13.3.2.3 Loading Kernel File  402 (1)
13.3.2.4 Setting Up Kernel Invocation  403 (2)
13.3.3 Runtimes of Our OpenCL Program  405 (1)
13.4 Edge Detection In OpenCL  406 (7)
|
Chapter 14 Other GPU Programming Languages  413 (12)
14.1 GPU Programming With Python  413 (7)
14.1.1 PyOpenCL Version of imflip  414 (4)
14.1.2 PyOpenCL Element-Wise Kernel  418 (2)
14.2 …  420 (1)
14.3 OpenGL ES: OpenGL For Embedded Systems  420 (1)
14.4 …  421 (1)
14.5 Microsoft's High-Level Shading Language (HLSL)  421 (1)
…  421 (1)
…  422 (1)
…  422 (1)
14.7 Apple's Swift Programming Language  423 (1)
14.8 …  423 (2)
14.8.1 Installing OpenCV and Face Recognition  423 (1)
14.8.2 Mobile-Cloudlet-Cloud Real-Time Face Recognition  423 (1)
14.8.3 Acceleration as a Service (AXaaS)  423 (2)
|
Chapter 15 Deep Learning Using CUDA  425 (10)
15.1 Artificial Neural Networks (ANNs)  425 (1)
15.1.1 …  425 (1)
15.1.2 Activation Functions  425 (1)
15.2 Fully Connected Neural Networks  425 (2)
15.3 Deep Networks/Convolutional Neural Networks  427 (1)
15.4 …  428 (1)
15.5 cuDNN Library For Deep Learning  428 (4)
15.5.1 …  429 (1)
15.5.2 Creating a Network  430 (1)
15.5.3 Forward Propagation  431 (1)
15.5.4 …  431 (1)
15.5.5 Using cuBLAS in the Network  431 (1)
15.6 …  432 (3)

Bibliography  435 (4)
Index  439