
Programming Massively Parallel Processors: A Hands-on Approach 3rd edition [Paperback]

David B. Kirk (NVIDIA Fellow), Wen-mei W. Hwu (CTO, MulticoreWare and professor specializing in compiler design, computer architecture, microarchitecture, and parallel processing, University of Illinois at Urbana-Champaign, USA)
  • Format: Paperback / softback, 576 pages, height x width: 235x191 mm, weight: 1560 g, 330 illustrations
  • Publication date: 08-Dec-2016
  • Publisher: Morgan Kaufmann Publishers Inc.
  • ISBN-10: 0128119861
  • ISBN-13: 9780128119860
  • Paperback
  • Price: 95,04 €*
  • * We will send you an offer for a used copy; its price may differ from the price shown on this page.
  • This book is out of print, but we will send you an offer for a used copy.

Programming Massively Parallel Processors: A Hands-on Approach shows students and professionals alike the basic concepts of parallel programming and GPU architecture. Various techniques for constructing parallel programs are explored in detail. Case studies demonstrate the development process, which begins with computational thinking and ends with effective and efficient parallel programs. Performance, floating-point format, parallel patterns, and dynamic parallelism are covered in depth. For this new edition, the authors have updated their coverage of CUDA, including coverage of newer libraries such as cuDNN, moved content that has become less important to appendices, added two new chapters on parallel patterns, and updated the case studies to reflect current industry practices, while retaining the concise, intuitive, practical approach based on years of road-testing in the authors' own parallel computing courses.

  • Teaches computational thinking and problem-solving techniques that facilitate high-performance parallel computing
  • Utilizes CUDA version 7.5, NVIDIA's software development tool created specifically for massively parallel environments (see the kernel sketch after this list)
  • Contains new and updated case studies
  • Includes coverage of newer libraries such as cuDNN for Deep Learning
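
As a taste of the hands-on style described above, here is a minimal CUDA C vector-addition program in the spirit of the book's Section 2.3 (A Vector Addition Kernel). It is an illustrative sketch, not a listing from the book: each thread computes one element of the output, and the host allocates device memory, copies the inputs over, launches the kernel, and copies the result back.

// Minimal sketch of a CUDA C vector addition in the style the book teaches.
// Illustrative only; not the book's own listing.
#include <stdio.h>
#include <cuda_runtime.h>

// Each thread computes one element of C = A + B.
__global__ void vecAdd(const float *A, const float *B, float *C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) C[i] = A[i] + B[i];   // guard threads mapped past the end
}

int main(void) {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // Host arrays with known inputs.
    float *hA = (float *)malloc(bytes);
    float *hB = (float *)malloc(bytes);
    float *hC = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { hA[i] = 1.0f; hB[i] = 2.0f; }

    // Device copies.
    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    int threads = 256;                         // threads per block
    int blocks = (n + threads - 1) / threads;  // enough blocks to cover n
    vecAdd<<<blocks, threads>>>(dA, dB, dC, n);

    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
    printf("C[0] = %f\n", hC[0]);              // expect 3.000000

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(hA); free(hB); free(hC);
    return 0;
}

Compiled with nvcc, the program prints C[0] = 3.000000; the if (i < n) guard is the standard idiom for input sizes that are not a multiple of the block size.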

Other information

Learn how to program massively parallel processors with this best-selling guide to CUDA and GPU parallel programming
Preface xv
Acknowledgements xxi
Chapter 1 Introduction 1(18)
  1.1 Heterogeneous Parallel Computing 2(4)
  1.2 Architecture of a Modern GPU 6(2)
  1.3 Why More Speed or Parallelism? 8(2)
  1.4 Speeding Up Real Applications 10(2)
  1.5 Challenges in Parallel Programming 12(1)
  1.6 Parallel Programming Languages and Models 12(2)
  1.7 Overarching Goals 14(1)
  1.8 Organization of the Book 15(4)
  References 18(1)
Chapter 2 Data Parallel Computing 19(24)
  2.1 Data Parallelism 20(2)
  2.2 CUDA C Program Structure 22(3)
  2.3 A Vector Addition Kernel 25(2)
  2.4 Device Global Memory and Data Transfer 27(5)
  2.5 Kernel Functions and Threading 32(5)
  2.6 Kernel Launch 37(1)
  2.7 Summary 38(1)
    Function Declarations 38(1)
    Kernel Launch 38(1)
    Built-in (Predefined) Variables 39(1)
    Run-time API 39(1)
  2.8 Exercises 39(4)
  References 41(2)
Chapter 3 Scalable Parallel Execution 43(28)
  3.1 CUDA Thread Organization 43(4)
  3.2 Mapping Threads to Multidimensional Data 47(7)
  3.3 Image Blur: A More Complex Kernel 54(4)
  3.4 Synchronization and Transparent Scalability 58(2)
  3.5 Resource Assignment 60(1)
  3.6 Querying Device Properties 61(3)
  3.7 Thread Scheduling and Latency Tolerance 64(3)
  3.8 Summary 67(1)
  3.9 Exercises 67(4)
Chapter 4 Memory and Data Locality 71(32)
  4.1 Importance of Memory Access Efficiency 72(1)
  4.2 Matrix Multiplication 73(4)
  4.3 CUDA Memory Types 77(7)
  4.4 Tiling for Reduced Memory Traffic 84(6)
  4.5 A Tiled Matrix Multiplication Kernel 90(4)
  4.6 Boundary Checks 94(3)
  4.7 Memory as a Limiting Factor to Parallelism 97(2)
  4.8 Summary 99(1)
  4.9 Exercises 100(3)
Chapter 5 Performance Considerations 103(28)
  5.1 Global Memory Bandwidth 104(8)
  5.2 More on Memory Parallelism 112(5)
  5.3 Warps and SIMD Hardware 117(8)
  5.4 Dynamic Partitioning of Resources 125(2)
  5.5 Thread Granularity 127(1)
  5.6 Summary 128(1)
  5.7 Exercises 128(3)
  References 130(1)
Chapter 6 Numerical Considerations 131(18)
  6.1 Floating-Point Data Representation 132(2)
    Normalized Representation of M 132(1)
    Excess Encoding of E 133(1)
  6.2 Representable Numbers 134(4)
  6.3 Special Bit Patterns and Precision in IEEE Format 138(1)
  6.4 Arithmetic Accuracy and Rounding 139(1)
  6.5 Algorithm Considerations 140(2)
  6.6 Linear Solvers and Numerical Stability 142(4)
  6.7 Summary 146(1)
  6.8 Exercises 147(2)
  References 147(2)
Chapter 7 Parallel Patterns: Convolution 149(26)
  7.1 Background 150(3)
  7.2 1D Parallel Convolution---A Basic Algorithm 153(3)
  7.3 Constant Memory and Caching 156(4)
  7.4 Tiled 1D Convolution with Halo Cells 160(5)
  7.5 A Simpler Tiled 1D Convolution---General Caching 165(1)
  7.6 Tiled 2D Convolution with Halo Cells 166(6)
  7.7 Summary 172(1)
  7.8 Exercises 173(2)
Chapter 8 Parallel Patterns: Prefix Sum 175(24)
  8.1 Background 176(1)
  8.2 A Simple Parallel Scan 177(4)
  8.3 Speed and Work Efficiency 181(2)
  8.4 A More Work-Efficient Parallel Scan 183(4)
  8.5 An Even More Work-Efficient Parallel Scan 187(2)
  8.6 Hierarchical Parallel Scan for Arbitrary-Length Inputs 189(3)
  8.7 Single-Pass Scan for Memory Access Efficiency 192(3)
  8.8 Summary 195(1)
  8.9 Exercises 195(4)
  References 196(3)
Chapter 9 Parallel Patterns---Parallel Histogram Computation 199(16)
  9.1 Background 200(2)
  9.2 Use of Atomic Operations 202(4)
  9.3 Block versus Interleaved Partitioning 206(1)
  9.4 Latency versus Throughput of Atomic Operations 207(3)
  9.5 Atomic Operation in Cache Memory 210(1)
  9.6 Privatization 210(1)
  9.7 Aggregation 211(2)
  9.8 Summary 213(1)
  9.9 Exercises 213(2)
  Reference 214(1)
Chapter 10 Parallel Patterns: Sparse Matrix Computation 215(16)
  10.1 Background 216(3)
  10.2 Parallel SpMV Using CSR 219(2)
  10.3 Padding and Transposition 221(3)
  10.4 Using a Hybrid Approach to Regulate Padding 224(3)
  10.5 Sorting and Partitioning for Regularization 227(2)
  10.6 Summary 229(1)
  10.7 Exercises 229(2)
  References 230(1)
Chapter 11 Parallel Patterns: Merge Sort 231(26)
  11.1 Background 231(2)
  11.2 A Sequential Merge Algorithm 233(1)
  11.3 A Parallelization Approach 234(2)
  11.4 Co-Rank Function Implementation 236(5)
  11.5 A Basic Parallel Merge Kernel 241(1)
  11.6 A Tiled Merge Kernel 242(7)
  11.7 A Circular-Buffer Merge Kernel 249(7)
  11.8 Summary 256(1)
  11.9 Exercises 256(1)
  Reference 256(1)
Chapter 12 Parallel Patterns: Graph Search 257(18)
  12.1 Background 258(2)
  12.2 Breadth-First Search 260(2)
  12.3 A Sequential BFS Function 262(3)
  12.4 A Parallel BFS Function 265(5)
  12.5 Optimizations 270(3)
    Memory Bandwidth 270(1)
    Hierarchical Queues 271(1)
    Kernel Launch Overhead 272(1)
    Load Balance 273(1)
  12.6 Summary 273(1)
  12.7 Exercises 273(2)
  References 274(1)
Chapter 13 CUDA Dynamic Parallelism 275(30)
  13.1 Background 276(2)
  13.2 Dynamic Parallelism Overview 278(1)
  13.3 A Simple Example 279(2)
  13.4 Memory Data Visibility 281(2)
    Global Memory 281(1)
    Zero-Copy Memory 282(1)
    Constant Memory 282(1)
    Local Memory 282(1)
    Shared Memory 283(1)
    Texture Memory 283(1)
  13.5 Configurations and Memory Management 283(2)
    Launch Environment Configuration 283(1)
    Memory Allocation and Lifetime 283(1)
    Nesting Depth 284(1)
    Pending Launch Pool Configuration 284(1)
    Errors and Launch Failures 284(1)
  13.6 Synchronization, Streams, and Events 285(2)
    Synchronization 285(1)
    Synchronization Depth 285(1)
    Streams 286(1)
    Events 287(1)
  13.7 A More Complex Example 287(6)
    Linear Bezier Curves 288(1)
    Quadratic Bezier Curves 288(1)
    Bezier Curve Calculation (Without Dynamic Parallelism) 288(2)
    Bezier Curve Calculation (With Dynamic Parallelism) 290(2)
    Launch Pool Size 292(1)
    Streams 292(1)
  13.8 A Recursive Example 293(4)
  13.9 Summary 297(2)
  13.10 Exercises 299(2)
  References 301(1)
  A13.1 Code Appendix 301(4)
Chapter 14 Application Case Study---Non-Cartesian Magnetic Resonance Imaging 305(26)
  14.1 Background 306(2)
  14.2 Iterative Reconstruction 308(2)
  14.3 Computing FHD 310(17)
    Step 1: Determine the Kernel Parallelism Structure 312(5)
    Step 2: Getting Around the Memory Bandwidth Limitation 317(6)
    Step 3: Using Hardware Trigonometry Functions 323(3)
    Step 4: Experimental Performance Tuning 326(1)
  14.4 Final Evaluation 327(1)
  14.5 Exercises 328(3)
  References 329(2)
Chapter 15 Application Case Study---Molecular Visualization and Analysis 331(14)
  15.1 Background 332(1)
  15.2 A Simple Kernel Implementation 333(4)
  15.3 Thread Granularity Adjustment 337(1)
  15.4 Memory Coalescing 338(4)
  15.5 Summary 342(1)
  15.6 Exercises 343(2)
  References 344(1)
Chapter 16 Application Case Study---Machine Learning 345(24)
  16.1 Background 346(1)
  16.2 Convolutional Neural Networks 347(8)
    ConvNets: Basic Layers 348(3)
    ConvNets: Backpropagation 351(4)
  16.3 Convolutional Layer: A Basic CUDA Implementation of Forward Propagation 355(4)
  16.4 Reduction of Convolutional Layer to Matrix Multiplication 359(5)
  16.5 cuDNN Library 364(2)
  16.6 Exercises 366(3)
  References 367(2)
Chapter 17 Parallel Programming and Computational Thinking 369(18)
  17.1 Goals of Parallel Computing 370(1)
  17.2 Problem Decomposition 371(3)
  17.3 Algorithm Selection 374(5)
  17.4 Computational Thinking 379(1)
  17.5 Single Program, Multiple Data, Shared Memory and Locality 380(2)
  17.6 Strategies for Computational Thinking 382(1)
  17.7 A Hypothetical Example: Sodium Map of the Brain 383(3)
  17.8 Summary 386(1)
  17.9 Exercises 386(1)
  References 386(1)
Chapter 18 Programming a Heterogeneous Computing Cluster 387(26)
  18.1 Background 388(1)
  18.2 A Running Example 388(3)
  18.3 Message Passing Interface Basics 391(2)
  18.4 Message Passing Interface Point-to-Point Communication 393(7)
  18.5 Overlapping Computation and Communication 400(8)
  18.6 Message Passing Interface Collective Communication 408(1)
  18.7 CUDA-Aware Message Passing Interface 409(1)
  18.8 Summary 410(1)
  18.9 Exercises 410(3)
  Reference 411(2)
Chapter 19 Parallel Programming with OpenACC 413(30)
  19.1 The OpenACC Execution Model 414(2)
  19.2 OpenACC Directive Format 416(2)
  19.3 OpenACC by Example 418(17)
    The OpenACC Kernels Directive 419(3)
    The OpenACC Parallel Directive 422(2)
    Comparison of Kernels and Parallel Directives 424(1)
    OpenACC Data Directives 425(5)
    OpenACC Loop Optimizations 430(2)
    OpenACC Routine Directive 432(2)
    Asynchronous Computation and Data 434(1)
  19.4 Comparing OpenACC and CUDA 435(2)
    Portability 435(1)
    Performance 436(1)
    Simplicity 436(1)
  19.5 Interoperability with CUDA and Libraries 437(3)
    Calling CUDA or Libraries with OpenACC Arrays 437(1)
    Using CUDA Pointers in OpenACC 438(1)
    Calling CUDA Device Kernels from OpenACC 439(1)
  19.6 The Future of OpenACC 440(1)
  19.7 Exercises 441(2)
Chapter 20 More on CUDA and Graphics Processing Unit Computing 443(14)
  20.1 Model of Host/Device Interaction 444(5)
  20.2 Kernel Execution Control 449(2)
  20.3 Memory Bandwidth and Compute Throughput 451(2)
  20.4 Programming Environment 453(2)
  20.5 Future Outlook 455(2)
  References 456(1)
Chapter 21 Conclusion and Outlook 457(4)
  21.1 Goals Revisited 457(1)
  21.2 Future Outlook 458(3)
Appendix A An Introduction to OpenCL 461(14)
Appendix B THRUST: a Productivity-oriented Library for CUDA 475(18)
Appendix C CUDA Fortran 493(22)
Appendix D An Introduction to C++ AMP 515(20)
Index 535
David B. Kirk is well recognized for his contributions to graphics hardware and algorithm research. By the time he began his studies at Caltech, he had already earned B.S. and M.S. degrees in mechanical engineering from MIT and worked as an engineer for Raster Technologies and Hewlett-Packard's Apollo Systems Division. After receiving his doctorate, he joined Crystal Dynamics, a video-game manufacturer, as chief scientist and head of technology. In 1997, he took the position of Chief Scientist at NVIDIA, a leader in visual computing technologies, and he is currently an NVIDIA Fellow. At NVIDIA, Kirk led graphics-technology development for some of today's most popular consumer-entertainment platforms, playing a key role in providing mass-market graphics capabilities previously available only on workstations costing hundreds of thousands of dollars. For his role in bringing high-performance graphics to personal computers, Kirk received the 2002 Computer Graphics Achievement Award from the Association for Computing Machinery and the Special Interest Group on Graphics and Interactive Technology (ACM SIGGRAPH) and, in 2006, was elected to the National Academy of Engineering, one of the highest professional distinctions for engineers. Kirk holds 50 patents and patent applications relating to graphics design, has published more than 50 articles on graphics technology, has won several best-paper awards, and edited the book Graphics Gems III. A technological "evangelist" who cares deeply about education, he has supported new curriculum initiatives at Caltech and has been a frequent university lecturer and conference keynote speaker worldwide.

Wen-mei W. Hwu is a Professor and holds the Sanders-AMD Endowed Chair in the Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign. His research interests are in the areas of architecture, implementation, compilation, and algorithms for parallel computing. He is the chief scientist of the Parallel Computing Institute and director of the IMPACT research group (www.impact.crhc.illinois.edu), and he is a co-founder and CTO of MulticoreWare. For his contributions to research and teaching, he received the ACM SIGARCH Maurice Wilkes Award, the ACM Grace Murray Hopper Award, the Tau Beta Pi Daniel C. Drucker Eminent Faculty Award, the ISCA Influential Paper Award, the IEEE Computer Society B. R. Rau Award, and the Distinguished Alumni Award in Computer Science of the University of California, Berkeley. He is a fellow of both IEEE and ACM. He directs the UIUC CUDA Center of Excellence and serves as one of the principal investigators of the NSF Blue Waters petascale computer project. Dr. Hwu received his Ph.D. in Computer Science from the University of California, Berkeley.