Klienditugi: 7440010 (E-R 10-18)

Programming Massively Parallel Processors: A Hands-on Approach 3rd edition [Pehme köide]

4.02/5 (136 hinnangut Goodreads-ist)

David B. Kirk (NVIDIA Fellow), Wen-mei W. Hwu (CTO, MulticoreWare and professor specializing in compiler design, computer architecture, microarchitecture, and parallel processing, University of Illinois at Urbana-Champaign, USA)

Formaat: Paperback / softback, 576 pages, kõrgus x laius: 235x191 mm, kaal: 1560 g, 330 illustrations; Illustrations, unspecified
Ilmumisaeg: 08-Dec-2016
Kirjastus: Morgan Kaufmann Publishers In
ISBN-10: 0128119861
ISBN-13: 9780128119860

Teised raamatud teemal:

Parallel processing
Human-computer interaction
Programming & scripting languages: general - (Hetkel poes: 1 nimetust)
Microprocessors

Pehme köide
Hind: 95,04 €*
* saadame teile pakkumise kasutatud raamatule, mille hind võib erineda kodulehel olevast hinnast
See raamat on trükist otsas, kuid me saadame teile pakkumise kasutatud raamatule.
Kogus:
- - 1
  - 2
  - 3
  - 4
  - 5
  - 6
  - 7
  - 8
  - 9
  - 10
Lisa ostukorvi
Tasuta tarne
Lisa soovinimekirja

Formaat: Paperback / softback, 576 pages, kõrgus x laius: 235x191 mm, kaal: 1560 g, 330 illustrations; Illustrations, unspecified
Ilmumisaeg: 08-Dec-2016
Kirjastus: Morgan Kaufmann Publishers In
ISBN-10: 0128119861
ISBN-13: 9780128119860

Teised raamatud teemal:

Parallel processing
Human-computer interaction
Programming & scripting languages: general - (Hetkel poes: 1 nimetust)
Microprocessors

Püsilink: https://www.kriso.ee/db/9780128119860.html

Märksõnad:

Programming Massively Parallel Processors: A Hands-on Approach shows both student and professional alike the basic concepts of parallel programming and GPU architecture. Various techniques for constructing parallel programs are explored in detail. Case studies demonstrate the development process, which begins with computational thinking and ends with effective and efficient parallel programs. Topics of performance, floating-point format, parallel patterns, and dynamic parallelism are covered in depth. For this new edition, the authors are updating their coverage of CUDA, including coverage of newer libraries such as CuDNN, moving content that has become less important to appendices, adding two new chapters on parallel patterns, and updating case studies to reflect current industry practices, while still retaining its concise, intuitive, practical approach based on years of road-testing in the authors' own parallel computing courses.

Teaches computational thinking and problem-solving techniques that facilitate high-performance parallel computing
Utilizes CUDA version 7.5, NVIDIA's software development tool created specifically for massively parallel environments
Contains new and updated case studies
Includes coverage of newer libraries such as CuDNN for Deep Learning

Muu info

Learn how to program massively parallel processors with this best-selling guide to CUDA and GPU parallel programming

Preface

Acknowledgements

xxi

Chapter 1 Introduction

(18)

1.1 Heterogeneous Parallel Computing

(4)

1.2 Architecture of a Modern GPU

(2)

1.3 Why More Speed or Parallelism?

(2)

1.4 Speeding Up Real Applications

(2)

1.5 Challenges in Parallel Programming

(1)

1.6 Parallel Programming Languages and Models

(2)

1.7 Overarching Goals

(1)

1.8 Organization of the Book

(4)

References

(1)

Chapter 2 Data Parallel Computing

(24)

2.1 Data Parallelism

(2)

2.2 CUDA C Program Structure

(3)

2.3 A Vector Addition Kernel

(2)

2.4 Device Global Memory and Data Transfer

(5)

2.5 Kernel Functions and Threading

(5)

2.6 Kernel Launch

(1)

2.7 Summary

(1)

Function Declarations

(1)

Kernel Launch

(1)

Built-in (Predefined) Variables

(1)

Run-time API

(1)

2.8 Exercises

(4)

References

(2)

Chapter 3 Scalable Parallel Execution

(28)

3.1 CUDA Thread Organization

(4)

3.2 Mapping Threads to Multidimensional Data

(7)

3.3 Image Blur: A More Complex Kernel

(4)

3.4 Synchronization and Transparent Scalability

(2)

3.5 Resource Assignment

(1)

3.6 Querying Device Properties

(3)

3.7 Thread Scheduling and Latency Tolerance

(3)

3.8 Summary

(1)

3.9 Exercises

(4)

Chapter 4 Memory and Data Locality

(32)

4.1 Importance of Memory Access Efficiency

(1)

4.2 Matrix Multiplication

(4)

4.3 CUDA Memory Types

(7)

4.4 Tiling for Reduced Memory Traffic

(6)

4.5 A Tiled Matrix Multiplication Kernel

(4)

4.6 Boundary Checks

(3)

4.7 Memory as a Limiting Factor to Parallelism

(2)

4.8 Summary

(1)

4.9 Exercises

100

(3)

Chapter 5 Performance Considerations

103

(28)

5.1 Global Memory Bandwidth

104

(8)

5.2 More on Memory Parallelism

112

(5)

5.3 Warps and SIMD Hardware

117

(8)

5.4 Dynamic Partitioning of Resources

125

(2)

5.5 Thread Granularity

127

(1)

5.6 Summary

128

(1)

5.7 Exercises

128

(3)

References

130

(1)

Chapter 6 Numerical Considerations

131

(18)

6.1 Floating-Point Data Representation

132

(2)

Normalized Representation of M

132

(1)

Excess Encoding of E

133

(1)

6.2 Representable Numbers

134

(4)

6.3 Special Bit Patterns and Precision in IEEE Format

138

(1)

6.4 Arithmetic Accuracy and Rounding

139

(1)

6.5 Algorithm Considerations

140

(2)

6.6 Linear Solvers and Numerical Stability

142

(4)

6.7 Summary

146

(1)

6.8 Exercises

147

(2)

References

147

(2)

Chapter 7 Parallel Patterns: Convolution

149

(26)

7.1 Background

150

(3)

7.2 ID Parallel Convolution---A Basic Algorithm

153

(3)

7.3 Constant Memory and Caching

156

(4)

7.4 Tiled 1D Convolution with Halo Cells

160

(5)

7.5 A Simpler Tiled ID Convolution---General Caching

165

(1)

7.6 Tiled 2D Convolution with Halo Cells

166

(6)

7.7 Summary

172

(1)

7.8 Exercises

173

(2)

Chapter 8 Parallel Patterns: Prefix Sum

175

(24)

8.1 Background

176

(1)

8.2 A Simple Parallel Scan

177

(4)

8.3 Speed and Work Efficiency

181

(2)

8.4 A More Work-Efficient Parallel Scan

183

(4)

8.5 An Even More Work-Efficient Parallel Scan

187

(2)

8.6 Hierarchical Parallel Scan for Arbitrary-Length Inputs

189

(3)

8.7 Single-Pass Scan for Memory Access Efficiency

192

(3)

8.8 Summary

195

(1)

8.9 Exercises

195

(4)

References

196

(3)

Chapter 9 Parallel Patterns---Parallel Histogram Computation

199

(16)

9.1 Background

200

(2)

9.2 Use of Atomic Operations

202

(4)

9.3 Block versus Interleaved Partitioning

206

(1)

9.4 Latency versus Throughput of Atomic Operations

207

(3)

9.5 Atomic Operation in Cache Memory

210

(1)

9.6 Privatization

210

(1)

9.7 Aggregation

211

(2)

9.8 Summary

213

(1)

9.9 Exercises

213

(2)

Reference

214

(1)

Chapter 10 Parallel Patterns: Sparse Matrix Computation

215

(16)

10.1 Background

216

(3)

10.2 Parallel SpMV Using CSR

219

(2)

10.3 Padding and Transposition

221

(3)

10.4 Using a Hybrid Approach to Regulate Padding

224

(3)

10.5 Sorting and Partitioning for Regularization

227

(2)

10.6 Summary

229

(1)

10.7 Exercises

229

(2)

References

230

(1)

Chapter 11 Parallel Patterns: Merge Sort

231

(26)

11.1 Background

231

(2)

11.2 A Sequential Merge Algorithm

233

(1)

11.3 A Parallelization Approach

234

(2)

11.4 Co-Rank Function Implementation

236

(5)

11.5 A Basic Parallel Merge Kernel

241

(1)

11.6 A Tiled Merge Kernel

242

(7)

11.7 A Circular-Buffer Merge Kernel

249

(7)

11.8 Summary

256

(1)

11.9 Exercises

256

(1)

Reference

256

(1)

Chapter 12 Parallel Patterns: Graph Search

257

(18)

12.1 Background

258

(2)

12.2 Breadth-First Search

260

(2)

12.3 A Sequential BFS Function

262

(3)

12.4 A Parallel BFS Function

265

(5)

12.5 Optimizations

270

(3)

Memory Bandwidth

270

(1)

Hierarchical Queues

271

(1)

Kernel Launch Overhead

272

(1)

Load Balance

273

(1)

12.6 Summary

273

(1)

12.7 Exercises

273

(2)

References

274

(1)

Chapter 13 CUDA Dynamic Parallelism

275

(30)

13.1 Background

276

(2)

13.2 Dynamic Parallelism Overview

278

(1)

13.3 A Simple Example

279

(2)

13.4 Memory Data Visibility

281

(2)

Global Memory

281

(1)

Zero-Copy Memory

282

(1)

Constant Memory

282

(1)

Local Memory

282

(1)

Shared Memory

283

(1)

Texture Memory

283

(1)

13.5 Configurations and Memory Management

283

(2)

Launch Environment Configuration

283

(1)

Memory Allocation and Lifetime

283

(1)

Nesting Depth

284

(1)

Pending Launch Pool Configuration

284

(1)

Errors and Launch Failures

284

(1)

13.6 Synchronization, Streams, and Events

285

(2)

Synchronization

285

(1)

Synchronization Depth

285

(1)

Streams

286

(1)

Events

287

(1)

13.7 A More Complex Example

287

(6)

Linear Bezier Curves

288

(1)

Quadratic Bezier Curves

288

(1)

Bezier Curve Calculation (Without Dynamic Parallelism)

288

(2)

Bezier Curve Calculation (With Dynamic Parallelism)

290

(2)

Launch Pool Size

292

(1)

Streams

292

(1)

13.8 A Recursive Example

293

(4)

13.9 Summary

297

(2)

13.10 Exercises

299

(2)

References

301

(1)

13.1 Code Appendix

301

(4)

Chapter 14 Application Case Study---non-Cartesian Magnetic Resonance Imaging

305

(26)

14.1 Background

306

(2)

14.2 Iterative Reconstruction

308

(2)

14.3 Computing FHD

310

(17)

Step 1 Determine the Kernel Parallelism Structure

312

(5)

Step 2 Getting Around the Memory Bandwidth Limitation

317

(6)

Step 3 Using Hardware Trigonometry Functions

323

(3)

Step 4 Experimental Performance Tuning

326

(1)

14.4 Final Evaluation

327

(1)

14.5 Exercises

328

(3)

References

329

(2)

Chapter 15 Application Case Study---Molecular Visualization and Analysis

331

(14)

15.1 Background

332

(1)

15.2 A Simple Kernel Implementation

333

(4)

15.3 Thread Granularity Adjustment

337

(1)

15.4 Memory Coalescing

338

(4)

15.5 Summary

342

(1)

15.6 Exercises

343

(2)

References

344

(1)

Chapter 16 Application Case Study---Machine Learning

345

(24)

16.1 Background

346

(1)

16.2 Convolutional Neural Networks

347

(8)

ConvNets: Basic Layers

348

(3)

ConvNets: Backpropagation

351

(4)

16.3 Convolutional Layer: A Basic CUDA Implementation of Forward Propagation

355

(4)

16.4 Reduction of Convolutional Layer to Matrix Multiplication

359

(5)

16.5 cuDNN Library

364

(2)

16.6 Exercises

366

(3)

References

367

(2)

Chapter 17 Parallel Programming and Computational Thinking

369

(18)

17.1 Goals of Parallel Computing

370

(1)

17.2 Problem Decomposition

371

(3)

17.3 Algorithm Selection

374

(5)

17.4 Computational Thinking

379

(1)

17.5 Single Program, Multiple Data, Shared Memory and Locality

380

(2)

17.6 Strategies for Computational Thinking

382

(1)

17.7 A Hypothetical Example: Sodium Map of the Brain

383

(3)

17.8 Summary

386

(1)

17.9 Exercises

386

(1)

References

386

(1)

Chapter 18 Programming a Heterogeneous Computing Cluster

387

(26)

18.1 Background

388

(1)

18.2 A Running Example

388

(3)

18.3 Message Passing Interface Basics

391

(2)

18.4 Message Passing Interface Point-to-Point Communication

393

(7)

18.5 Overlapping Computation and Communication

400

(8)

18.6 Message Passing Interface Collective Communication

408

(1)

18.7 CUDA-Aware Message Passing Interface

409

(1)

18.8 Summary

410

(1)

18.9 Exercises

410

(3)

Reference

411

(2)

Chapter 19 Parallel Programming with OpenACC

413

(30)

19.1 The OpenACC Execution Model

414

(2)

19.2 OpenACC Directive Format

416

(2)

19.3 OpenACC by Example

418

(17)

The OpenACC Kernels Directive

419

(3)

The OpenACC Parallel Directive

422

(2)

Comparison of Kernels and Parallel Directives

424

(1)

OpenACC Data Directives

425

(5)

OpenACC Loop Optimizations

430

(2)

OpenACC Routine Directive

432

(2)

Asynchronous Computation and Data

434

(1)

19.4 Comparing OpenACC and CUDA

435

(2)

Portability

435

(1)

Performance

436

(1)

Simplicity

436

(1)

19.5 Interoperability with CUDA and Libraries

437

(3)

Calling CUDA or Libraries with OpenACC Arrays

437

(1)

Using CUDA Pointers in OpenACC

438

(1)

Calling CUDA Device Kernels from OpenACC

439

(1)

19.6 The Future of OpenACC

440

(1)

19.7 Exercises

441

(2)

Chapter 20 More on CUDA and Graphics Processing Unit Computing

443

(14)

20.1 Model of Host/Device Interaction

444

(5)

20.2 Kernel Execution Control

449

(2)

20.3 Memory Bandwidth and Compute Throughput

451

(2)

20.4 Programming Environment

453

(2)

20.5 Future Outlook

455

(2)

References

456

(1)

Chapter 21 Conclusion and Outlook

457

(4)

21.1 Goals Revisited

457

(1)

21.2 Future Outlook

458

(3)

Appendix A An Introduction to OpenCL

461

(14)

Appendix B THRUST: a Productivity-oriented Library for CUDA

475

(18)

Appendix C CUDA Fortran

493

(22)

Appendix D An introduction to C++ AMP

515

(20)

Index

535

David B. Kirk is well recognized for his contributions to graphics hardware and algorithm research. By the time he began his studies at Caltech, he had already earned B.S. and M.S. degrees in mechanical engineering from MIT and worked as an engineer for Raster Technologies and Hewlett-Packard's Apollo Systems Division, and after receiving his doctorate, he joined Crystal Dynamics, a video-game manufacturing company, as chief scientist and head of technology. In 1997, he took the position of Chief Scientist at NVIDIA, a leader in visual computing technologies, and he is currently an NVIDIA Fellow. At NVIDIA, Kirk led graphics-technology development for some of today's most popular consumer-entertainment platforms, playing a key role in providing mass-market graphics capabilities previously available only on workstations costing hundreds of thousands of dollars. For his role in bringing high-performance graphics to personal computers, Kirk received the 2002 Computer Graphics Achievement Award from the Association for Computing Machinery and the Special Interest Group on Graphics and Interactive Technology (ACM SIGGRAPH) and, in 2006, was elected to the National Academy of Engineering, one of the highest professional distinctions for engineers. Kirk holds 50 patents and patent applications relating to graphics design and has published more than 50 articles on graphics technology, won several best-paper awards, and edited the book Graphics Gems III. A technological "evangelist" who cares deeply about education, he has supported new curriculum initiatives at Caltech and has been a frequent university lecturer and conference keynote speaker worldwide. Wen-mei W. Hwu is a Professor and holds the Sanders-AMD Endowed Chair in the Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign. His research interests are in the area of architecture, implementation, compilation, and algorithms for parallel computing. He is the chief scientist of Parallel Computing Institute and director of the IMPACT research group (www.impact.crhc.illinois.edu). He is a co-founder and CTO of MulticoreWare. For his contributions in research and teaching, he received the ACM SigArch Maurice Wilkes Award, the ACM Grace Murray Hopper Award, the Tau Beta Pi Daniel C. Drucker Eminent Faculty Award, the ISCA Influential Paper Award, the IEEE Computer Society B. R. Rau Award and the Distinguished Alumni Award in Computer Science of the University of California, Berkeley. He is a fellow of IEEE and ACM. He directs the UIUC CUDA Center of Excellence and serves as one of the principal investigators of the NSF Blue Waters Petascale computer project. Dr. Hwu received his Ph.D. degree in Computer Science from the University of California, Berkeley.

Programming Massively Parallel Processors: A Hands-on Approach 3rd edition [Pehme köide]

Muu info

Konto & seaded

Otsing

Otsingu andmebaas

Filtreeri tulemusi

Teemad Ingliskeelsed raamatud

Vali ostukorv