Preface  xvii

PART 1

Chapter 1 Introduction  3
1.1 The era of multicore machines  3
1.2 A taxonomy of parallel machines  5
1.3 A glimpse of influential computing machines  7
1.3.1 The Cell BE processor  8
1.3.2  9
1.3.3 Multicore to many-core: TILERA's TILE-Gx8072 and Intel's Xeon Phi  12
1.3.4 AMD's Epyc Rome: scaling up with smaller chips  14
1.3.5 Fujitsu A64FX: compute and memory integration  16
1.4 Performance metrics  17
1.5 Predicting and measuring parallel program performance  21
1.5.1 Amdahl's law  25
1.5.2 Gustafson-Barsis' rebuttal  27
Exercises  28

Chapter 2 Multicore and parallel program design  31
2.1 Introduction  31
2.2 The PCAM methodology  32
2.3 Decomposition patterns  36
2.3.1 Task parallelism  37
2.3.2 Divide-and-conquer decomposition  37
2.3.3 Geometric decomposition  40
2.3.4 Recursive data decomposition  43
2.3.5 Pipeline decomposition  49
2.3.6 Event-based coordination decomposition  53
2.4 Program structure patterns  54
2.4.1 Single program, multiple data  54
2.4.2 Multiple program, multiple data  55
2.4.3 Master-worker  56
2.4.4 Map  57
2.4.5 Fork/join  58
2.4.6 Loop parallelism  60
2.5 Matching decomposition patterns with program structure patterns  60
Exercises  61

PART 2 Programming with threads and processes

Chapter 3 Threads and concurrency in standard C++  65
3.1 Introduction  65
3.2 Threads  68
3.2.1 What is a thread?  68
3.2.2 What are threads good for?  68
3.3 Thread creation and initialization  69
3.4 Sharing data between threads  77
3.5 Design concerns  80
3.6 Semaphores  82
3.7 Applying semaphores in classical problems  87
3.7.1 Producers-consumers  89
3.7.2 Dealing with termination  93
3.7.3 The barbershop problem - introducing fairness  105
3.7.4 Readers-writers  111
3.8  117
3.8.1  122
3.9 Monitors  126
3.9.1 Design approach #1: critical section inside the monitor  131
3.9.2 Design approach #2: monitor controls entry to critical section  132
3.9.3 General semaphores revisited  136
3.10 Applying monitors in classical problems  138
3.10.1 Producers-consumers revisited  138
3.10.2 Readers-writers revisited  145
3.11 Asynchronous threads  152
3.12 Dynamic vs. static thread management  156
3.13  165
3.14 Debugging multi-threaded applications  172
Exercises  177

Chapter 4 Parallel data structures  181
4.1 Introduction  181
4.2 Lock-based structures  185
4.2.1  185
4.2.2  189
4.3 Lock-free structures  203
4.3.1  204
4.3.2 A bounded lock-free queue: first attempt  209
4.3.3  216
4.3.4 A fixed bounded lock-free queue  218
4.3.5 An unbounded lock-free queue  222
4.4  227
Exercises  228

Chapter 5 Distributed memory programming  231
5.1 Communicating processes  231
5.2 MPI  232
5.3 Core concepts  234
5.4 Your first MPI program  234
5.5 Program architecture  238
5.5.1 SPMD  238
5.5.2 MPMD  240
5.6 Point-to-point communication  241
5.7 Alternative point-to-point communication modes  245
5.7.1 Buffered communications  246
5.8 Non-blocking communications  248
5.9 Point-to-point communications: summary  252
5.10 Error reporting & handling  252
5.11 Collective communications  254
5.11.1 Scattering  259
5.11.2 Gathering  265
5.11.3 Reduction  267
5.11.4 All-to-all gathering  271
5.11.5 All-to-all scattering  276
5.11.6 All-to-all reduction  282
5.11.7 Global synchronization  282
5.12 Persistent communications  283
5.13 Big-count communications in MPI 4.0  286
5.14 Partitioned communications  287
5.15 Communicating objects  289
5.15.1 Derived datatypes  291
5.15.2 Packing/unpacking  298
5.16 Node management: communicators and groups  300
5.16.1 Creating groups  301
5.16.2 Creating intracommunicators  303
5.17 One-sided communication  306
5.17.1 RMA communication functions  307
5.17.2 RMA synchronization functions  309
5.18 I/O considerations  318
5.19 Combining MPI processes with threads  326
5.20 Timing and performance measurements  328
5.21 Debugging, profiling, and tracing MPI programs  329
5.21.1 Brief introduction to Scalasca  330
5.21.2 Brief introduction to TAU  334
5.22 The Boost.MPI library  336
5.22.1 Blocking and non-blocking communications  337
5.22.2 Data serialization  342
5.22.3 Collective operations  345
5.23 A case study: diffusion-limited aggregation  349
5.24 A case study: brute-force encryption cracking  355
5.25 A case study: MPI implementation of the master-worker pattern  361
5.25.1 A simple master-worker setup  361
5.25.2 A multi-threaded master-worker setup  369
Exercises  384

Chapter 6 GPU programming: CUDA  389
6.1 GPU programming at a glance  389
6.2 CUDA's programming model: threads, blocks, and grids  392
6.3 CUDA's execution model: streaming multiprocessors and warps  398
6.4 CUDA compilation process  401
6.5 Putting together a CUDA project  406
6.6 Memory hierarchy  409
6.6.1 Local memory/registers  416
6.6.2 Shared memory  417
6.6.3 Constant memory  426
6.6.4 Texture and surface memory  433
6.7 Optimization techniques  433
6.7.1 Block and grid design  433
6.7.2 Kernel structure  445
6.7.3 Shared memory access  453
6.7.4 Global memory access  462
6.7.5 Asynchronous execution and streams: overlapping GPU memory transfers and more  474
6.8 CUDA graphs  482
6.8.1 Creating a graph using the CUDA graph API  483
6.8.2 Creating a graph by capturing a stream  489
6.9  492
6.10 Cooperative groups  501
6.10.1 Intrablock cooperative groups  501
6.10.2 Interblock cooperative groups  514
6.10.3 Grid-level reduction  519
6.11 Dynamic parallelism  523
6.12 Debugging CUDA programs  527
6.13 Profiling CUDA programs  529
6.14 CUDA and MPI  533
6.15 Case studies  539
6.15.1 Fractal set calculation  540
6.15.2 Block cipher encryption  551
Exercises  578

Chapter 7 GPU and accelerator programming: OpenCL  583
7.1 The OpenCL architecture  583
7.2 The platform model  585
7.3 The execution model  590
7.4 The programming model  593
7.4.1 Summarizing the structure of an OpenCL program  603
7.5 The memory model  607
7.5.1  609
7.5.2  618
7.5.3  619
7.5.4  631
7.6 Shared virtual memory  640
7.7 Atomics and synchronization  644
7.8  649
7.9 Events and profiling OpenCL programs  655
7.10 OpenCL and other parallel software platforms  657
7.11 Case study: Mandelbrot set  660
7.11.1 Calculating the Mandelbrot set using OpenCL  661
7.11.2 Hybrid calculation of the Mandelbrot set using OpenCL and C++11  668
7.11.3 Hybrid calculation of the Mandelbrot set using OpenCL on both host and device  674
7.11.4 Performance comparison  677
Exercises  677

PART 3 Higher-level parallel programming

Chapter 8 Shared-memory programming: OpenMP  683
8.1 Introduction  683
8.2 Your first OpenMP program  684
8.3 Variable scope  689
8.3.1 OpenMP integration V.0: manual partitioning  691
8.3.2 OpenMP integration V.1: manual partitioning without a race condition  693
8.3.3 OpenMP integration V.2: implicit partitioning with locking  694
8.3.4 OpenMP integration V.3: implicit partitioning with reduction  696
8.3.5 Final words on variable scope  698
8.4 Loop-level parallelism  699
8.4.1 Data dependencies  701
8.4.2 Nested loops  711
8.4.3 Scheduling  712
8.5 Task parallelism  716
8.5.1 The sections directive  716
8.5.2 The task directive  722
8.5.3  727
8.5.4 The taskloop directive  731
8.5.5 The taskgroup directive and task-level reduction  732
8.6 Synchronization constructs  739
8.7 Cancellation constructs  746
8.8  747
8.9 Offloading to devices  751
8.9.1 Device work-sharing directives  753
8.9.2 Device memory management directives  758
8.9.3 CUDA interoperability  764
8.10  766
8.11  767
8.12 Correctness and optimization issues  771
8.12.1 Thread safety  771
8.12.2 False sharing  778
8.13 A case study: sorting in OpenMP  784
8.13.1 Bottom-up mergesort in OpenMP  784
8.13.2 Top-down mergesort in OpenMP  787
8.13.3 Performance comparison  793
8.14 A case study: brute-force encryption cracking, combining MPI and OpenMP  793
Exercises  797

Chapter 9 High-level multi-threaded programming with the Qt library  803
9.1 Introduction  803
9.2 Implicit thread creation  804
9.3  806
9.4 Higher-level constructs - multi-threaded programming without threads!  808
9.4.1 Concurrent map  808
9.4.2 Map-reduce  811
9.4.3 Concurrent filter  813
9.4.4 Filter-reduce  815
9.4.5 A case study: multi-threaded sorting  816
9.4.6 A case study: multi-threaded image matching  825
Exercises  833

Chapter 10 The Thrust template library  835
10.1 Introduction  835
10.2 First steps in Thrust  836
10.3 Working with Thrust datatypes  840
10.4 Thrust algorithms  844
10.4.1 Transformations  844
10.4.2 Sorting & searching  849
10.4.3 Reductions  854
10.4.4 Scans/prefix sums  857
10.4.5 Data management and reordering  859
10.5 Fancy iterators  862
10.6 Switching device back-ends  868
10.7 Thrust execution policies and asynchronous execution  870
10.8 Case studies  872
10.8.1 Monte Carlo integration  872
10.8.2 DNA sequence alignment  876
Exercises  883

Chapter 11 Load balancing  887
11.1 Introduction  887
11.2 Dynamic load balancing: the Linda legacy  888
11.3 Static load balancing: the divisible load theory approach  890
11.3.1 Modeling costs  891
11.3.2 Communication configuration  898
11.3.3 Analysis  901
11.3.4 Summary - short literature review  911
11.4 DLTlib: a library for partitioning workloads  914
11.5 Case studies  917
11.5.1 Hybrid computation of a Mandelbrot set "movie": a case study in dynamic load balancing  917
11.5.2 Distributed block cipher encryption: a case study in static load balancing  930
Exercises  940

Appendix A Creating Qt programs  943
A.1 Using an IDE  943
A.2 The qmake utility  943

Appendix B Running MPI programs: preparatory and configuration steps  945
B.1 Preparatory steps  945
B.2 Computing nodes discovery for MPI program deployment  946
B.2.1 Host discovery with the nmap utility  946
B.2.2 Automatic generation of a hostfile  947

Appendix C Time measurement  949
C.1 Introduction  949
C.2 POSIX high-resolution timing  949
C.3  951
C.4  952
C.5  952
C.6  953
C.7  953

Appendix D Boost.MPI  955
D.1 Mapping from MPI C to Boost.MPI  955

Appendix E Setting up CUDA  957
E.1 Installation  957
E.2  957
E.3 Combining CUDA with third-party libraries  958

Appendix F OpenCL helper functions  961
F.1 Function readCLFromFile  961
F.2  962
F.3 Function getCompilationError  963
F.4 Function handleError  963
F.5  964
F.6 Function setupProgramAndKernel  965

Appendix G DLTlib  967
G.1 DLTlib functions  967
G.1.1 Class Network: generic methods  968
G.1.2 Class Network: query processing  970
G.1.3 Class Network: image processing  971
G.1.4 Class Network: image registration  973
G.2 DLTlib files  975

Glossary  977
Bibliography  979
Index  983