E-book: Programming Massively Parallel Processors: A Hands-on Approach

David B. Kirk (NVIDIA Fellow), Izzat El Hajj (Assistant Professor, Department of Computer Science, American University of Beirut, Lebanon), Wen-mei W. Hwu (CTO, MulticoreWare and professor specializing in compiler design, computer architecture, microarchitecture, and parallel processing, University of Illinois at Urbana-Champaign, USA)
  • Format: EPUB+DRM
  • Publication date: 28-May-2022
  • Publisher: Morgan Kaufmann
  • Language: English
  • ISBN-13: 9780323984638
  • Price: 75.06 €*
  • * the price is final, i.e. no further discounts apply
  • This e-book is intended for personal use only. E-books cannot be returned.

DRM restrictions

  • Copying (copy/paste):

    not allowed

  • Printing:

    not allowed

  • Usage:

    Digital rights management (DRM)
    The publisher has issued this e-book in encrypted form, which means that you must install special software to read it. You must also create an Adobe ID. More information here. The e-book can be read by 1 user and downloaded to up to 6 devices (all authorized with the same Adobe ID).

    Required software
    To read on a mobile device (phone or tablet), you must install this free app: PocketBook Reader (iOS / Android)

    To read on a PC or Mac, you must install Adobe Digital Editions. (This is a free application designed specifically for reading e-books. It should not be confused with Adobe Reader, which is probably already installed on your computer.)

    This e-book cannot be read on an Amazon Kindle.

Programming Massively Parallel Processors: A Hands-on Approach shows students and professionals alike the basic concepts of parallel programming and GPU architecture. Various techniques for constructing parallel programs are explored in detail. Case studies demonstrate the development process, which begins with computational thinking and ends with effective and efficient parallel programs. Performance, floating-point format, parallel patterns, and dynamic parallelism are covered in depth. For this new edition, the authors have updated their coverage of CUDA, including the concept of unified memory, and expanded content in areas such as threads, while retaining the concise, intuitive, practical approach road-tested over years in the authors' own parallel computing courses. A brief illustrative CUDA sketch in this spirit appears after the feature list below.
  • Teaches computational thinking and problem-solving techniques that facilitate high-performance parallel computing
  • Updated to utilize CUDA version 10.0, NVIDIA's software development tool created specifically for massively parallel environments
  • Features new content on unified memory, as well as expanded content on threads, streams, warp divergence, and OpenMP
  • Includes updated and new case studies
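To give a flavor of the book's hands-on style, here is a minimal CUDA vector addition program of the kind developed in Chapter 2 (Section 2.3, "A vector addition kernel"). This is an illustrative sketch rather than code from the book; the names (vecAddKernel, hA, dA, and so on) are our own. The kernel assigns one thread per output element, and the host allocates device memory, copies the inputs over, launches the kernel, and copies the result back.

// vecadd.cu - minimal CUDA vector addition sketch (illustrative, not from the book)
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

__global__ void vecAddKernel(const float *A, const float *B, float *C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard against overrun
        C[i] = A[i] + B[i];
}

int main(void) {
    const int n = 1 << 20;
    const size_t size = n * sizeof(float);

    // Host allocations and initialization
    float *hA = (float *)malloc(size);
    float *hB = (float *)malloc(size);
    float *hC = (float *)malloc(size);
    for (int i = 0; i < n; ++i) { hA[i] = 1.0f; hB[i] = 2.0f; }

    // Device allocations and host-to-device transfers
    float *dA, *dB, *dC;
    cudaMalloc(&dA, size);
    cudaMalloc(&dB, size);
    cudaMalloc(&dC, size);
    cudaMemcpy(dA, hA, size, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, size, cudaMemcpyHostToDevice);

    // One thread per element; round the block count up
    const int threadsPerBlock = 256;
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    vecAddKernel<<<blocks, threadsPerBlock>>>(dA, dB, dC, n);

    // Copy the result back (implicitly synchronizes with the kernel)
    cudaMemcpy(hC, dC, size, cudaMemcpyDeviceToHost);
    printf("C[0] = %f (expected 3.0)\n", hC[0]);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(hA); free(hB); free(hC);
    return 0;
}

Compiled with nvcc (e.g., nvcc vecadd.cu -o vecadd), this pattern of allocate, copy, launch, copy back is the template the book's early chapters build on before turning to performance.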
Foreword xv
Preface xvii
Acknowledgments xxvii
Chapter 1 Introduction 1(22)
1.1 Heterogeneous parallel computing 3(4)
1.2 Why more speed or parallelism? 7(2)
1.3 Speeding up real applications 9(2)
1.4 Challenges in parallel programming 11(2)
1.5 Related parallel programming interfaces 13(1)
1.6 Overarching goals 14(1)
1.7 Organization of the book 15(8)
References 19(4)
Part I Fundamental Concepts
Chapter 2 Heterogeneous data parallel computing 23(24)
David Luebke
2.1 Data parallelism 23(4)
2.2 CUDA C program structure 27(1)
2.3 A vector addition kernel 28(3)
2.4 Device global memory and data transfer 31(4)
2.5 Kernel functions and threading 35(5)
2.6 Calling kernel functions 40(2)
2.7 Compilation 42(1)
2.8 Summary 43(4)
Exercises 44(2)
References 46(1)
Chapter 3 Multidimensional grids and data 47(22)
3.1 Multidimensional grid organization 47(4)
3.2 Mapping threads to multidimensional data 51(7)
3.3 Image blur: a more complex kernel 58(4)
3.4 Matrix multiplication 62(4)
3.5 Summary 66(3)
Exercises 67(2)
Chapter 4 Compute architecture and scheduling 69(24)
4.1 Architecture of a modern GPU 70(1)
4.2 Block scheduling 70(1)
4.3 Synchronization and transparent scalability 71(3)
4.4 Warps and SIMD hardware 74(5)
4.5 Control divergence 79(4)
4.6 Warp scheduling and latency tolerance 83(2)
4.7 Resource partitioning and occupancy 85(2)
4.8 Querying device properties 87(3)
4.9 Summary 90(3)
Exercises 90(2)
References 92(1)
Chapter 5 Memory architecture and data locality 93(30)
5.1 Importance of memory access efficiency 94(2)
5.2 CUDA memory types 96(7)
5.3 Tiling for reduced memory traffic 103(4)
5.4 A tiled matrix multiplication kernel 107(5)
5.5 Boundary checks 112(3)
5.6 Impact of memory usage on occupancy 115(3)
5.7 Summary 118(5)
Exercises 119(4)
Chapter 6 Performance considerations 123(28)
6.1 Memory coalescing 124(9)
6.2 Hiding memory latency 133(5)
6.3 Thread coarsening 138(3)
6.4 A checklist of optimizations 141(4)
6.5 Knowing your computation's bottleneck 145(1)
6.6 Summary 146(5)
Exercises 146(1)
References 147(4)
Part II Parallel Patterns
Chapter 7 Convolution: An introduction to constant memory and caching 151(22)
7.1 Background 152(4)
7.2 Parallel convolution: a basic algorithm 156(3)
7.3 Constant memory and caching 159(4)
7.4 Tiled convolution with halo cells 163(5)
7.5 Tiled convolution using caches for halo cells 168(2)
7.6 Summary 170(3)
Exercises 171(2)
Chapter 8 Stencil 173(18)
8.1 Background 174(4)
8.2 Parallel stencil: a basic algorithm 178(1)
8.3 Shared memory tiling for stencil sweep 179(4)
8.4 Thread coarsening 183(3)
8.5 Register tiling 186(2)
8.6 Summary 188(3)
Exercises 188(3)
Chapter 9 Parallel histogram 191(20)
9.1 Background 192(2)
9.2 Atomic operations and a basic histogram kernel 194(4)
9.3 Latency and throughput of atomic operations 198(2)
9.4 Privatization 200(3)
9.5 Coarsening 203(3)
9.6 Aggregation 206(2)
9.7 Summary 208(3)
Exercises 209(1)
References 210(1)
Chapter 10 Reduction: And minimizing divergence 211(24)
10.1 Background 211(2)
10.2 Reduction trees 213(4)
10.3 A simple reduction kernel 217(2)
10.4 Minimizing control divergence 219(4)
10.5 Minimizing memory divergence 223(2)
10.6 Minimizing global memory accesses 225(1)
10.7 Hierarchical reduction for arbitrary input length 226(2)
10.8 Thread coarsening for reduced overhead 228(3)
10.9 Summary 231(4)
Exercises 232(3)
Chapter 11 Prefix sum (scan): An introduction to work efficiency in parallel algorithms 235(28)
Li-Wen Chang
Juan Gomez-Luna
John Owens
11.1 Background 236(2)
11.2 Parallel scan with the Kogge-Stone algorithm 238(6)
11.3 Speed and work efficiency consideration 244(2)
11.4 Parallel scan with the Brent-Kung algorithm 246(5)
11.5 Coarsening for even more work efficiency 251(2)
11.6 Segmented parallel scan for arbitrary-length inputs 253(3)
11.7 Single-pass scan for memory access efficiency 256(3)
11.8 Summary 259(4)
Exercises 260(1)
References 261(2)
Chapter 12 Merge: An introduction to dynamic input data identification 263(30)
Li-Wen Chang
Jie Lv
12.1 Background 263(2)
12.2 A sequential merge algorithm 265(1)
12.3 A parallelization approach 266(2)
12.4 Co-rank function implementation 268(5)
12.5 A basic parallel merge kernel 273(2)
12.6 A tiled merge kernel to improve coalescing 275(7)
12.7 A circular buffer merge kernel 282(6)
12.8 Thread coarsening for merge 288(1)
12.9 Summary 288(5)
Exercises 289(1)
References 289(4)
Part III Advanced Patterns and Applications
Chapter 13 Sorting 293(18)
Michael Garland
13.1 Background 294(1)
13.2 Radix sort 295(1)
13.3 Parallel radix sort 296(4)
13.4 Optimizing for memory coalescing 300(2)
13.5 Choice of radix value 302(3)
13.6 Thread coarsening to improve coalescing 305(1)
13.7 Parallel merge sort 306(2)
13.8 Other parallel sort methods 308(1)
13.9 Summary 309(2)
Exercises 310(1)
References 310(1)
Chapter 14 Sparse matrix computation 311(20)
14.1 Background 312(2)
14.2 A simple SpMV kernel with the COO format 314(3)
14.3 Grouping row nonzeros with the CSR format 317(3)
14.4 Improving memory coalescing with the ELL format 320(4)
14.5 Regulating padding with the hybrid ELL-COO format 324(1)
14.6 Reducing control divergence with the JDS format 325(3)
14.7 Summary 328(3)
Exercises 329(1)
References 329(2)
Chapter 15 Graph traversal 331(24)
John Owens
Juan Gomez-Luna
15.1 Background 332(3)
15.2 Breadth-first search 335(3)
15.3 Vertex-centric parallelization of breadth-first search 338(5)
15.4 Edge-centric parallelization of breadth-first search 343(2)
15.5 Improving efficiency with frontiers 345(3)
15.6 Reducing contention with privatization 348(2)
15.7 Other optimizations 350(2)
15.8 Summary 352(3)
Exercises 353(1)
References 354(1)
Chapter 16 Deep learning 355(36)
Carl Pearson
Boris Ginsburg
16.1 Background 356(10)
16.2 Convolutional neural networks 366(10)
16.3 Convolutional layer: a CUDA inference kernel 376(3)
16.4 Formulating a convolutional layer as GEMM 379(6)
16.5 cuDNN library 385(2)
16.6 Summary 387(4)
Exercises 388(1)
References 388(3)
Chapter 17 Iterative magnetic resonance imaging reconstruction 391(24)
17.1 Background 391(3)
17.2 Iterative reconstruction 394(2)
17.3 Computing FHD 396(16)
17.4 Summary 412(3)
Exercises 413(1)
References 414(1)
Chapter 18 Electrostatic potential map 415(18)
John Stone
18.1 Background 415(2)
18.2 Scatter versus gather in kernel design 417(5)
18.3 Thread coarsening 422(2)
18.4 Memory coalescing 424(1)
18.5 Cutoff binning for data size scalability 425(5)
18.6 Summary 430(3)
Exercises 431(1)
References 431(2)
Chapter 19 Parallel programming and computational thinking 433(16)
19.1 Goals of parallel computing 433(3)
19.2 Algorithm selection 436(4)
19.3 Problem decomposition 440(4)
19.4 Computational thinking 444(2)
19.5 Summary 446(3)
References 446(3)
Part IV Advanced Practices
Chapter 20 Programming a heterogeneous computing cluster: An introduction to CUDA streams 449(26)
Isaac Gelado
Javier Cabezas
20.1 Background 449(1)
20.2 A running example 450(2)
20.3 Message passing interface basics 452(3)
20.4 Message passing interface point-to-point communication 455(7)
20.5 Overlapping computation and communication 462(8)
20.6 Message passing interface collective communication 470(1)
20.7 CUDA-aware message passing interface 471(1)
20.8 Summary 472(3)
Exercises 472(1)
References 473(2)
Chapter 21 CUDA dynamic parallelism 475(24)
Juan Gomez-Luna
21.1 Background 476(2)
21.2 Dynamic parallelism overview 478(3)
21.3 An example: Bezier curves 481(3)
21.4 A recursive example: quadtrees 484(6)
21.5 Important considerations 490(2)
21.6 Summary 492(3)
Exercises 493(2)
A21.1 Support code for quadtree example 495(4)
References 497(2)
Chapter 22 Advanced practices and future evolution 499(16)
Isaac Gelado
Mark Harris
22.1 Model of host/device interaction 500(5)
22.2 Kernel execution control 505(3)
22.3 Memory bandwidth and compute throughput 508(2)
22.4 Programming environment 510(3)
22.5 Future outlook 513(2)
References 513(2)
Chapter 23 Conclusion and outlook 515(4)
23.1 Goals revisited 515(1)
23.2 Future outlook 516(3)
Appendix A Numerical considerations 519(18)
Index 537