Preface    xxi
Acknowledgments    xxiii
About the Author    xxv
|
|
PART I    1

Chapter 1  Background    3
  1.1  Our Approach    5
  1.2  Code    6
    1.2.1  Microbenchmarks    6
    1.2.2  Microdemos    7
    1.2.3  Optimization Journeys    7
  1.3  Administrative Items    7
    1.3.1  Open Source    7
    1.3.2  CUDA Handbook Library (chLib)    8
    1.3.3  Coding Style    8
    1.3.4  CUDA SDK    8
  1.4  Road Map    8
|
Chapter 2  Hardware Architecture    11
  2.1  CPU Configurations    11
    2.1.1  Front-Side Bus    12
    2.1.2  Symmetric Multiprocessors    13
    2.1.3  Nonuniform Memory Access    14
    2.1.4  PCI Express Integration    17
  2.2  Integrated GPUs    17
  2.3  Multiple GPUs    19
  2.4  Address Spaces in CUDA    22
    2.4.1  Virtual Addressing: A Brief History    22
    2.4.2  Disjoint Address Spaces    26
    2.4.3  Mapped Pinned Memory    28
    2.4.4  Portable Pinned Memory    29
    2.4.5  Unified Addressing    30
    2.4.6  Peer-to-Peer Mappings    31
  2.5  CPU/GPU Interactions    32
    2.5.1  Pinned Host Memory and Command Buffers    32
    2.5.2  CPU/GPU Concurrency    35
    2.5.3  The Host Interface and Intra-GPU Synchronization    39
    2.5.4  Inter-GPU Synchronization    41
  2.6  GPU Architecture    41
    2.6.1  Overview    42
    2.6.2  Streaming Multiprocessors    46
  2.7  Further Reading    50
|
Chapter 3  Software Architecture    51
  3.1  Software Layers    51
    3.1.1  CUDA Runtime and Driver    53
    3.1.2  Driver Models    54
    3.1.3  NVCC, PTX, and Microcode    57
  3.2  Devices and Initialization    59
    3.2.1  Device Count    60
    3.2.2  Device Attributes    60
    3.2.3  When CUDA Is Not Present    63
  3.3  Contexts    67
    3.3.1  Lifetime and Scoping    68
    3.3.2  Preallocation of Resources    68
    3.3.3  Address Space    69
    3.3.4  Current Context Stack    69
    3.3.5  Context State    71
  3.4  Modules and Functions    71
  3.5  Kernels (Functions)    73
  3.6  Device Memory    75
  3.7  Streams and Events    76
    3.7.1  Software Pipelining    76
    3.7.2  Stream Callbacks    77
    3.7.3  The NULL Stream    77
    3.7.4  Events    78
  3.8  Host Memory    79
    3.8.1  Pinned Host Memory    80
    3.8.2  Portable Pinned Memory    81
    3.8.3  Mapped Pinned Memory    81
    3.8.4  Host Memory Registration    81
  3.9  CUDA Arrays and Texturing    82
    3.9.1  Texture References    82
    3.9.2  Surface References    85
  3.10  Graphics Interoperability    86
  3.11  The CUDA Runtime and CUDA Driver API    87
|
Chapter 4  Software Environment    93
  4.1  nvcc: CUDA Compiler Driver    93
  4.2  ptxas: the PTX Assembler    100
  4.3  cuobjdump    105
  4.4  nvidia-smi    106
  4.5  Amazon Web Services    109
    4.5.1  Command-Line Tools    110
    4.5.2  EC2 and Virtualization    110
    4.5.3  Key Pairs    111
    4.5.4  Availability Zones (AZs) and Regions    112
    4.5.5  S3    112
    4.5.6  EBS    113
    4.5.7  AMIs    113
    4.5.8  Linux on EC2    114
    4.5.9  Windows on EC2    115
|
PART II    119

Chapter 5  Memory    121
  5.1  Host Memory    122
    5.1.1  Allocating Pinned Memory    122
    5.1.2  Portable Pinned Memory    123
    5.1.3  Mapped Pinned Memory    124
    5.1.4  Write-Combined Pinned Memory    124
    5.1.5  Registering Pinned Memory    125
    5.1.6  Pinned Memory and UVA    126
    5.1.7  Mapped Pinned Memory Usage    127
    5.1.8  NUMA, Thread Affinity, and Pinned Memory    128
  5.2  Global Memory    130
    5.2.1  Pointers    131
    5.2.2  Dynamic Allocations    132
    5.2.3  Querying the Amount of Global Memory    137
    5.2.4  Static Allocations    138
    5.2.5  Memset APIs    139
    5.2.6  Pointer Queries    140
    5.2.7  Peer-to-Peer Access    143
    5.2.8  Reading and Writing Global Memory    143
    5.2.9  Coalescing Constraints    143
    5.2.10  Microbenchmarks: Peak Memory Bandwidth    147
    5.2.11  Atomic Operations    152
    5.2.12  Texturing from Global Memory    155
    5.2.13  ECC (Error Correcting Codes)    155
  5.3  Constant Memory    156
    5.3.1  Host and Device __constant__ Memory    157
    5.3.2  Accessing __constant__ Memory    157
  5.4  Local Memory    158
  5.5  Texture Memory    162
  5.6  Shared Memory    162
    5.6.1  Unsized Shared Memory Declarations    163
    5.6.2  Warp-Synchronous Coding    164
    5.6.3  Pointers to Shared Memory    164
  5.7  Memory Copy    164
    5.7.1  Synchronous versus Asynchronous Memcpy    165
    5.7.2  Unified Virtual Addressing    166
    5.7.3  CUDA Runtime    166
    5.7.4  Driver API    169
|
Chapter 6  Streams and Events    173
  6.1  CPU/GPU Concurrency: Covering Driver Overhead    174
    6.1.1  Kernel Launches    174
  6.2  Asynchronous Memcpy    178
    6.2.1  Asynchronous Memcpy: Host→Device    179
    6.2.2  Asynchronous Memcpy: Device→Host    181
    6.2.3  The NULL Stream and Concurrency Breaks    181
  6.3  CUDA Events: CPU/GPU Synchronization    183
    6.3.1  Blocking Events    186
    6.3.2  Queries    186
  6.4  CUDA Events: Timing    186
  6.5  Concurrent Copying and Kernel Processing    187
    6.5.1  concurrencyMemcpyKernel.cu    189
    6.5.2  Performance Results    194
    6.5.3  Breaking Interengine Concurrency    196
  6.6  Mapped Pinned Memory    197
  6.7  Concurrent Kernel Processing    199
  6.8  GPU/GPU Synchronization: cudaStreamWaitEvent()    202
    6.8.1  Streams and Events on Multi-GPU: Notes and Limitations    202
  6.9  Source Code Reference    202
|
Chapter 7  Kernel Execution    205
  7.1  Overview    205
  7.2  Syntax    206
    7.2.1  Limitations    208
    7.2.2  Caches and Coherency    209
    7.2.3  Asynchrony and Error Handling    209
    7.2.4  Timeouts    210
    7.2.5  Local Memory    210
    7.2.6  Shared Memory    211
  7.3  Blocks, Threads, Warps, and Lanes    211
    7.3.1  Grids of Blocks    211
    7.3.2  Execution Guarantees    215
    7.3.3  Block and Thread IDs    216
  7.4  Occupancy    220
  7.5  Dynamic Parallelism    222
    7.5.1  Scoping and Synchronization    223
    7.5.2  Memory Model    224
    7.5.3  Streams and Events    225
    7.5.4  Error Handling    225
    7.5.5  Compiling and Linking    226
    7.5.6  Resource Management    226
    7.5.7  Summary    228
|
Chapter 8  Streaming Multiprocessors    231
  8.1  Memory    233
    8.1.1  Registers    233
    8.1.2  Local Memory    234
    8.1.3  Global Memory    235
    8.1.4  Constant Memory    237
    8.1.5  Shared Memory    237
    8.1.6  Barriers and Coherency    240
  8.2  Integer Support    241
    8.2.1  Multiplication    241
    8.2.2  Miscellaneous (Bit Manipulation)    242
    8.2.3  Funnel Shift (SM 3.5)    243
  8.3  Floating-Point Support    244
    8.3.1  Formats    244
    8.3.2  Single Precision (32-Bit)    250
    8.3.3  Double Precision (64-Bit)    253
    8.3.4  Half Precision (16-Bit)    253
    8.3.5  Case Study: float→half Conversion    253
    8.3.6  Math Library    258
    8.3.7  Additional Reading    266
  8.4  Conditional Code    267
    8.4.1  Predication    267
    8.4.2  Divergence and Convergence    268
    8.4.3  Special Cases: Min, Max and Absolute Value    269
  8.5  Textures and Surfaces    269
  8.6  Miscellaneous Instructions    270
    8.6.1  Warp-Level Primitives    270
    8.6.2  Block-Level Primitives    272
    8.6.3  Performance Counter    272
    8.6.4  Video Instructions    272
    8.6.5  Special Registers    275
  8.7  Instruction Sets    275
|
|
Chapter 9  Multiple GPUs    287
  9.1  Overview    287
  9.2  Peer-to-Peer    288
    9.2.1  Peer-to-Peer Memcpy    288
    9.2.2  Peer-to-Peer Addressing    289
  9.3  UVA: Inferring Device from Address    291
  9.4  Inter-GPU Synchronization    292
  9.5  Single-Threaded Multi-GPU    294
    9.5.1  Current Context Stack    294
    9.5.2  N-Body    296
  9.6  Multithreaded Multi-GPU    299
|
|
Chapter 10  Texturing    305
  10.1  Overview    305
    10.1.1  Two Use Cases    306
  10.2  Texture Memory    306
    10.2.1  Device Memory    307
    10.2.2  CUDA Arrays and Block Linear Addressing    308
    10.2.3  Device Memory versus CUDA Arrays    313
  10.3  1D Texturing    314
    10.3.1  Texture Setup    314
  10.4  Texture as a Read Path    317
    10.4.1  Increasing Effective Address Coverage    318
    10.4.2  Texturing from Host Memory    321
  10.5  Texturing with Unnormalized Coordinates    323
  10.6  Texturing with Normalized Coordinates    331
  10.7  1D Surface Read/Write    333
  10.8  2D Texturing    335
    10.8.1  Microdemo: tex2d_opengl.cu    335
  10.9  2D Texturing: Copy Avoidance    338
    10.9.1  2D Texturing from Device Memory    338
    10.9.2  2D Surface Read/Write    340
  10.10  3D Texturing    340
  10.11  Layered Textures    342
    10.11.1  1D Layered Textures    343
    10.11.2  2D Layered Textures    343
  10.12  Optimal Block Sizing and Performance    343
    10.12.1  Results    344
  10.13  Texturing Quick References    345
    10.13.1  Hardware Capabilities    345
    10.13.2  CUDA Runtime    347
    10.13.3  Driver API    349
|
|
PART III    351

Chapter 11  Streaming Workloads    353
  11.1  Device Memory    355
  11.2  Asynchronous Memcpy    358
  11.3  Streams    359
  11.4  Mapped Pinned Memory    361
  11.5  Performance and Summary    362
|
|
Chapter 12  Reduction    365
  12.1  Overview    365
  12.2  Two-Pass Reduction    367
  12.3  Single-Pass Reduction    373
  12.4  Reduction with Atomics    376
  12.5  Arbitrary Block Sizes    377
  12.6  Reduction Using Arbitrary Data Types    378
  12.7  Predicate Reduction    382
  12.8  Warp Reduction with Shuffle    382
|
|
Chapter 13  Scan    385
  13.1  Definition and Variations    385
  13.2  Overview    387
  13.3  Scan and Circuit Design    390
  13.4  CUDA Implementations    394
    13.4.1  Scan-Then-Fan    394
    13.4.2  Reduce-Then-Scan (Recursive)    400
    13.4.3  Reduce-Then-Scan (Two Pass)    403
  13.5  Warp Scans    407
    13.5.1  Zero Padding    408
    13.5.2  Templated Formulations    409
    13.5.3  Warp Shuffle    410
    13.5.4  Instruction Counts    412
  13.6  Stream Compaction    414
  13.7  References (Parallel Scan Algorithms)    418
  13.8  Further Reading (Parallel Prefix Sum Circuits)    419
|
|
Chapter 14  N-Body    421
  14.1  Introduction    423
    14.1.1  A Matrix of Forces    424
  14.2  Naive Implementation    428
  14.3  Shared Memory    432
  14.4  Constant Memory    434
  14.5  Warp Shuffle    436
  14.6  Multiple GPUs and Scalability    438
  14.7  CPU Optimizations    439
  14.8  Conclusion    444
  14.9  References and Further Reading    446
|
Chapter 15  Image Processing: Normalized Correlation    449
  15.1  Overview    449
  15.2  Naive Texture-Texture Implementation    452
  15.3  Template in Constant Memory    456
  15.4  Image in Shared Memory    459
  15.5  Further Optimizations    463
    15.5.1  SM-Aware Coding    463
    15.5.2  Loop Unrolling    464
  15.6  Source Code    465
  15.7  Performance and Further Reading    466
|
  15.8  Further Reading    469
|
Appendix A  The CUDA Handbook Library    471
  A.1  Timing    471
  A.2  Threading    472
  A.3  Driver API Facilities    474
  A.4  Shmoos    475
  A.5  Command Line Parsing    476
  A.6  Error Handling    477

Glossary / TLA Decoder    481

Index    487