
Multicore and GPU Programming: An Integrated Approach, 2nd edition [Paperback]

Gerassimos Barlas (Professor, Computer Science and Engineering Department, American University of Sharjah, UAE)
  • Format: Paperback / softback, 1024 pages, height x width: 235x191 mm, weight: 1520 g
  • Publication date: 08-Aug-2022
  • Publisher: Morgan Kaufmann Publishers
  • ISBN-10: 0128141204
  • ISBN-13: 9780128141205
Multicore and GPU Programming: An Integrated Approach offers broad coverage of the key parallel computing skillsets: multicore CPU programming and manycore "massively parallel" computing. Using threads, OpenMP, MPI, CUDA, and other current tools, it teaches the design and development of software capable of taking advantage of today's computing platforms that incorporate CPU and GPU hardware, and explains how to transition from sequential programming to a parallel computing paradigm. Presenting material refined over more than a decade of teaching parallel computing, author Gerassimos Barlas minimizes the challenge with multiple examples, extensive case studies, and full source code. Using this book, readers can develop programs that run over distributed memory machines using MPI, create multi-threaded applications with either libraries or directives, write optimized applications that balance the workload between available computing resources, and profile and debug programs targeting multicore machines.
  • Comprehensive coverage of all major multicore programming tools, including threads, OpenMP, MPI, and CUDA, with coverage of OpenCL and OpenACC added
  • Demonstrates parallel programming design patterns and examples of how different tools and paradigms can be integrated for superior performance
  • New features in the second edition include the use of the C++14 standard for all sample code, a new chapter on concurrent data structures, and the latest research on load balancing
  • Download source code, examples, and instructor support materials on the book’s companion website
List of tables xv
Preface xvii
PART 1 Introduction
Chapter 1 Introduction
3(28)
1.1 The era of multicore machines
3(2)
1.2 A taxonomy of parallel machines
5(2)
1.3 A glimpse of influential computing machines
7(10)
1.3.1 The Cell BE processor
8(1)
1.3.2 NVidia's Ampere
9(3)
1.3.3 Multicore to many-core: TILERA's TILE-Gx8072 and Intel's Xeon Phi
12(2)
1.3.4 AMD's Epyc Rome: scaling up with smaller chips
14(2)
1.3.5 Fujitsu A64FX: compute and memory integration
16(1)
1.4 Performance metrics
17(4)
1.5 Predicting and measuring parallel program performance
21(10)
1.5.1 Amdahl's law
25(2)
1.5.2 Gustafson-Barsis' rebuttal
27(1)
Exercises
28(3)
Chapter 2 Multicore and parallel program design
31(34)
2.1 Introduction
31(1)
2.2 The PCAM methodology
32(4)
2.3 Decomposition patterns
36(18)
2.3.1 Task parallelism
37(1)
2.3.2 Divide-and-conquer decomposition
37(3)
2.3.3 Geometric decomposition
40(3)
2.3.4 Recursive data decomposition
43(6)
2.3.5 Pipeline decomposition
49(4)
2.3.6 Event-based coordination decomposition
53(1)
2.4 Program structure patterns
54(6)
2.4.1 Single program, multiple data
54(1)
2.4.2 Multiple program, multiple data
55(1)
2.4.3 Master-worker
56(1)
2.4.4 Map-reduce
57(1)
2.4.5 Fork/join
58(2)
2.4.6 Loop parallelism
60(1)
2.5 Matching decomposition patterns with program structure patterns
60(5)
Exercises
61(4)
PART 2 Programming with threads and processes
Chapter 3 Threads and concurrency in standard C++
65(116)
3.1 Introduction
65(3)
3.2 Threads
68(1)
3.2.1 What is a thread?
68(1)
3.2.2 What are threads good for?
68(1)
3.3 Thread creation and initialization
69(8)
3.4 Sharing data between threads
77(3)
3.5 Design concerns
80(2)
3.6 Semaphores
82(5)
3.7 Applying semaphores in classical problems
87(30)
3.7.1 Producers-consumers
89(4)
3.7.2 Dealing with termination
93(12)
3.7.3 The barbershop problem - introducing fairness
105(6)
3.7.4 Readers-writers
111(6)
3.8 Atomic data types
117(9)
3.8.1 Memory ordering
122(4)
3.9 Monitors
126(12)
3.9.1 Design approach #1: critical section inside the monitor
131(1)
3.9.2 Design approach #2: monitor controls entry to critical section
132(4)
3.9.3 General semaphores revisited
136(2)
3.10 Applying monitors in classical problems
138(14)
3.10.1 Producers-consumers revisited
138(7)
3.10.2 Readers-writers
145(7)
3.11 Asynchronous threads
152(4)
3.12 Dynamic vs. static thread management
156(9)
3.13 Threads and fibers
165(7)
3.14 Debugging multi-threaded applications
172(9)
Exercises
177(4)
Chapter 4 Parallel data structures
181(50)
4.1 Introduction
181(4)
4.2 Lock-based structures
185(18)
4.2.1 Queues
185(4)
4.2.2 Lists
189(14)
4.3 Lock-free structures
203(24)
4.3.1 Lock-free stacks
204(5)
4.3.2 A bounded lock-free queue: first attempt
209(7)
4.3.3 The ABA problem
216(2)
4.3.4 A fixed bounded lock-free queue
218(4)
4.3.5 An unbounded lock-free queue
222(5)
4.4 Closing remarks
227(4)
Exercises
228(3)
Chapter 5 Distributed memory programming
231(158)
5.1 Introduction
231(1)
5.2 MPI
232(2)
5.3 Core concepts
234(1)
5.4 Your first MPI program
234(4)
5.5 Program architecture
238(3)
5.5.1 SPMD
238(2)
5.5.2 MPMD
240(1)
5.6 Point-to-point communication
241(4)
5.7 Alternative point-to-point communication modes
245(3)
5.7.1 Buffered communications
246(2)
5.8 Non-blocking communications
248(4)
5.9 Point-to-point communications: summary
252(1)
5.10 Error reporting & handling
252(2)
5.11 Collective communications
254(29)
5.11.1 Scattering
259(6)
5.11.2 Gathering
265(2)
5.11.3 Reduction
267(4)
5.11.4 All-to-all gathering
271(5)
5.11.5 All-to-all scattering
276(6)
5.11.6 All-to-all reduction
282(1)
5.11.7 Global synchronization
282(1)
5.12 Persistent communications
283(3)
5.13 Big-count communications in MPI 4.0
286(1)
5.14 Partitioned communications
287(2)
5.15 Communicating objects
289(11)
5.15.1 Derived datatypes
291(7)
5.15.2 Packing/unpacking
298(2)
5.16 Node management: communicators and groups
300(6)
5.16.1 Creating groups
301(2)
5.16.2 Creating intracommunicators
303(3)
5.17 One-sided communication
306(12)
5.17.1 RMA communication functions
307(2)
5.17.2 RMA synchronization functions
309(9)
5.18 I/O considerations
318(8)
5.19 Combining MPI processes with threads
326(2)
5.20 Timing and performance measurements
328(1)
5.21 Debugging, profiling, and tracing MPI programs
329(7)
5.21.1 Brief introduction to Scalasca
330(4)
5.21.2 Brief introduction to TAU
334(2)
5.22 The Boost.MPI library
336(13)
5.22.1 Blocking and non-blocking communications
337(5)
5.22.2 Data serialization
342(3)
5.22.3 Collective operations
345(4)
5.23 A case study: diffusion-limited aggregation
349(6)
5.24 A case study: brute-force encryption cracking
355(6)
5.25 A case study: MPI implementation of the master-worker pattern
361(28)
5.25.1 A simple master-worker setup
361(8)
5.25.2 A multi-threaded master-worker setup
369(15)
Exercises
384(5)
Chapter 6 GPU programming: CUDA
389(194)
6.1 Introduction
389(3)
6.2 CUDA's programming model: threads, blocks, and grids
392(6)
6.3 CUDA's execution model: streaming multiprocessors and warps
398(3)
6.4 CUDA compilation process
401(5)
6.5 Putting together a CUDA project
406(3)
6.6 Memory hierarchy
409(24)
6.6.1 Local memory/registers
416(1)
6.6.2 Shared memory
417(9)
6.6.3 Constant memory
426(7)
6.6.4 Texture and surface memory
433(1)
6.7 Optimization techniques
433(49)
6.7.1 Block and grid design
433(12)
6.7.2 Kernel structure
445(8)
6.7.3 Shared memory access
453(9)
6.7.4 Global memory access
462(12)
6.7.5 Asynchronous execution and streams: overlapping GPU memory transfers and more
474(8)
6.8 Graphs
482(10)
6.8.1 Creating a graph using the CUDA graph API
483(6)
6.8.2 Creating a graph by capturing a stream
489(3)
6.9 Warp functions
492(9)
6.10 Cooperative groups
501(22)
6.10.1 Intrablock cooperative groups
501(13)
6.10.2 Interblock cooperative groups
514(5)
6.10.3 Grid-level reduction
519(4)
6.11 Dynamic parallelism
523(4)
6.12 Debugging CUDA programs
527(2)
6.13 Profiling CUDA programs
529(4)
6.14 CUDA and MPI
533(6)
6.15 Case studies
539(44)
6.15.1 Fractal set calculation
540(11)
6.15.2 Block cipher encryption
551(27)
Exercises
578(5)
Chapter 7 GPU and accelerator programming: OpenCL
583(100)
7.1 The OpenCL architecture
583(2)
7.2 The platform model
585(5)
7.3 The execution model
590(3)
7.4 The programming model
593(14)
7.4.1 Summarizing the structure of an OpenCL program
603(4)
7.5 The memory model
607(33)
7.5.1 Buffer objects
609(9)
7.5.2 Local memory
618(1)
7.5.3 Image objects
619(12)
7.5.4 Pipe objects
631(9)
7.6 Shared virtual memory
640(4)
7.7 Atomics and synchronization
644(5)
7.8 Work group functions
649(6)
7.9 Events and profiling OpenCL programs
655(2)
7.10 OpenCL and other parallel software platforms
657(3)
7.11 Case study: Mandelbrot set
660(23)
7.11.1 Calculating the Mandelbrot set using OpenCL
661(7)
7.11.2 Hybrid calculation of the Mandelbrot set using OpenCL and C++11
668(6)
7.11.3 Hybrid calculation of the Mandelbrot set using OpenCL on both host and device
674(3)
7.11.4 Performance comparison
677(1)
Exercises
677(6)
PART 3 Higher-level parallel programming
Chapter 8 Shared-memory programming: OpenMP
683(120)
8.1 Introduction
683(1)
8.2 Your first OpenMP program
684(5)
8.3 Variable scope
689(10)
8.3.1 OpenMP integration V.0: manual partitioning
691(2)
8.3.2 OpenMP integration V.1: manual partitioning without a race condition
693(1)
8.3.3 OpenMP integration V.2: implicit partitioning with locking
694(2)
8.3.4 OpenMP integration V.3: implicit partitioning with reduction
696(2)
8.3.5 Final words on variable scope
698(1)
8.4 Loop-level parallelism
699(17)
8.4.1 Data dependencies
701(10)
8.4.2 Nested loops
711(1)
8.4.3 Scheduling
712(4)
8.5 Task parallelism
716(23)
8.5.1 The sections directive
716(6)
8.5.2 The task directive
722(5)
8.5.3 Task dependencies
727(4)
8.5.4 The taskloop directive
731(1)
8.5.5 The taskgroup directive and task-level reduction
732(7)
8.6 Synchronization constructs
739(7)
8.7 Cancellation constructs
746(1)
8.8 SIMD extensions
747(4)
8.9 Offloading to devices
751(15)
8.9.1 Device work-sharing directives
753(5)
8.9.2 Device memory management directives
758(6)
8.9.3 CUDA interoperability
764(2)
8.10 The loop construct
766(1)
8.11 Thread affinity
767(4)
8.12 Correctness and optimization issues
771(13)
8.12.1 Thread safety
771(7)
8.12.2 False-sharing
778(6)
8.13 A case study: sorting in OpenMP
784(9)
8.13.1 Bottom-up mergesort in OpenMP
784(3)
8.13.2 Top-down mergesort in OpenMP
787(6)
8.13.3 Performance comparison
793(1)
8.14 A case study: brute-force encryption cracking, combining MPI and OpenMP
793(10)
Exercises
797(6)
Chapter 9 High-level multi-threaded programming with the Qt library
803(32)
9.1 Introduction
803(1)
9.2 Implicit thread creation
804(2)
9.3 Qt's pool of threads
806(2)
9.4 Higher-level constructs - multi-threaded programming without threads!
808(27)
9.4.1 Concurrent map
808(3)
9.4.2 Map-reduce
811(2)
9.4.3 Concurrent filter
813(2)
9.4.4 Filter-reduce
815(1)
9.4.5 A case study: multi-threaded sorting
816(9)
9.4.6 A case study: multi-threaded image matching
825(8)
Exercises
833(2)
Chapter 10 The Thrust template library
835(52)
10.1 Introduction
835(1)
10.2 First steps in Thrust
836(4)
10.3 Working with Thrust datatypes
840(4)
10.4 Thrust algorithms
844(18)
10.4.1 Transformations
844(5)
10.4.2 Sorting & searching
849(5)
10.4.3 Reductions
854(3)
10.4.4 Scans/prefix-sums
857(2)
10.4.5 Data management and reordering
859(3)
10.5 Fancy iterators
862(6)
10.6 Switching device back-ends
868(2)
10.7 Thrust execution policies and asynchronous execution
870(2)
10.8 Case studies
872(15)
10.8.1 Monte Carlo integration
872(4)
10.8.2 DNA sequence alignment
876(7)
Exercises
883(4)
PART 4 Advanced topics
Chapter 11 Load balancing
887(56)
11.1 Introduction
887(1)
11.2 Dynamic load balancing: the Linda legacy
888(2)
11.3 Static load balancing: the divisible load theory approach
890(24)
11.3.1 Modeling costs
891(7)
11.3.2 Communication configuration
898(3)
11.3.3 Analysis
901(10)
11.3.4 Summary - short literature review
911(3)
11.4 DLTLib: a library for partitioning workloads
914(3)
11.5 Case studies
917(26)
11.5.1 Hybrid computation of a Mandelbrot set "movie": a case study in dynamic load balancing
917(13)
11.5.2 Distributed block cipher encryption: a case study in static load balancing
930(10)
Exercises
940(3)
Appendix A Creating Qt programs
943(2)
A.1 Using an IDE
943(1)
A.2 The qmake utility
943(2)
Appendix B Running MPI programs: preparatory and configuration steps
945(4)
B.1 Preparatory steps
945(1)
B.2 Computing nodes discovery for MPI program deployment
946(3)
B.2.1 Host discovery with the nmap utility
946(1)
B.2.2 Automatic generation of a hostfile
947(2)
Appendix C Time measurement
949(6)
C.1 Introduction
949(1)
C.2 POSIX high-resolution timing
949(2)
C.3 Timing in C++11
951(1)
C.4 Timing in Qt
952(1)
C.5 Timing in OpenMP
952(1)
C.6 Timing in MPI
953(1)
C.7 Timing in CUDA
953(2)
Appendix D Boost.MPI
955(2)
D.1 Mapping from MPI C to Boost.MPI
955(2)
Appendix E Setting up CUDA
957(4)
E.1 Installation
957(1)
E.2 Issues with GCC
957(1)
E.3 Combining CUDA with third-party libraries
958(3)
Appendix F OpenCL helper functions
961(6)
F.1 Function readCLFromFile
961(1)
F.2 Function isError
962(1)
F.3 Function getCompilationError
963(1)
F.4 Function handleError
963(1)
F.5 Function setupDevice
964(1)
F.6 Function setupProgramAndKernel
965(2)
Appendix G DLTlib
967(10)
G.1 DLTlib functions
967(8)
G.1.1 Class Network: generic methods
968(2)
G.1.2 Class Network: query processing
970(1)
G.1.3 Class Network: image processing
971(2)
G.1.4 Class Network: image registration
973(2)
G.2 DLTlib files
975(2)
Glossary 977(2)
Bibliography 979(4)
Index 983
Gerassimos Barlas is a Professor with the Computer Science & Engineering Department, American University of Sharjah, Sharjah, UAE. His research interests include parallel algorithms; development, analysis, and modeling frameworks for load balancing; and distributed Video-on-Demand. Prof. Barlas has taught parallel computing for more than 12 years, has been involved with parallel computing since the early 1990s, and is active in the emerging field of Divisible Load Theory for parallel and distributed systems.