List of Tables .... xiii
Preface .... xv

Chapter 1  Introduction .... 1
  1.1 The era of multicore machines .... 1
  1.2 A taxonomy of parallel machines .... 3
  1.3 A glimpse of contemporary computing machines .... 5
    1.3.1 The Cell BE processor .... 6
    1.3.2 Nvidia's Kepler .... 7
    1.3.3 AMD's APUs .... 10
    1.3.4 Multicore to many-core: Tilera's TILE-Gx8072 and Intel's Xeon Phi .... 11
  1.4 Performance metrics .... 14
  1.5 Predicting and measuring parallel program performance .... 18
    1.5.1 Amdahl's law .... 21
    1.5.2 Gustafson-Barsis's rebuttal .... 24
  Exercises .... 26

Chapter 2  Multicore and parallel program design .... 27
  2.1 Introduction .... 27
  2.2 The PCAM methodology .... 28
  2.3 Decomposition patterns .... 32
    2.3.1 Task parallelism .... 33
    2.3.2 Divide-and-conquer decomposition .... 34
    2.3.3 Geometric decomposition .... 36
    2.3.4 Recursive data decomposition .... 39
    2.3.5 Pipeline decomposition .... 42
    2.3.6 Event-based coordination decomposition .... 46
  2.4 Program structure patterns .... 47
    2.4.1 Single-program, multiple-data .... 48
    2.4.2 Multiple-program, multiple-data .... 48
    2.4.3 Master-worker .... 49
    2.4.4 Map-reduce .... 50
    2.4.5 Fork/join .... 51
    2.4.6 Loop parallelism .... 53
  2.5 Matching decomposition patterns with program structure patterns .... 53
  Exercises .... 54

Chapter 3  Shared-memory programming: threads .... 55
  3.1 Introduction .... 55
  3.2 Threads .... 58
    3.2.1 What is a thread? .... 58
    3.2.2 What are threads good for? .... 59
    3.2.3 Thread creation and initialization .... 59
    3.2.4 Sharing data between threads .... 65
  3.3 Design concerns .... 68
  3.4 Semaphores .... 70
  3.5 Applying semaphores in classical problems .... 75
    3.5.1 Producers-consumers .... 75
    3.5.2 Dealing with termination .... 79
    3.5.3 The barbershop problem: introducing fairness .... 90
    3.5.4 Readers-writers .... 95
  3.6 Monitors .... 99
    3.6.1 Design approach 1: critical section inside the monitor .... 103
    3.6.2 Design approach 2: monitor controls entry to critical section .... 104
  3.7 Applying monitors in classical problems .... 107
    3.7.1 Producers-consumers revisited .... 107
    3.7.2 Readers-writers .... 113
  3.8 Dynamic vs. static thread management .... 120
    3.8.1 Qt's thread pool .... 120
    3.8.2 Creating and managing a pool of threads .... 121
  3.9 Debugging multithreaded applications .... 130
  3.10 Higher-level constructs: multithreaded programming without threads .... 135
    3.10.1 Concurrent map .... 136
    3.10.2 Map-reduce .... 138
    3.10.3 Concurrent filter .... 140
    3.10.4 Filter-reduce .... 142
    3.10.5 A case study: multithreaded sorting .... 143
    3.10.6 A case study: multithreaded image matching .... 152
  Exercises .... 160

Chapter 4  Shared-memory programming: OpenMP .... 165
  4.1 Introduction .... 165
  4.2 Your first OpenMP program .... 166
  4.3 Variable scope .... 169
    4.3.1 OpenMP integration V.0: manual partitioning .... 171
    4.3.2 OpenMP integration V.1: manual partitioning without a race condition .... 173
    4.3.3 OpenMP integration V.2: implicit partitioning with locking .... 175
    4.3.4 OpenMP integration V.3: implicit partitioning with reduction .... 176
    4.3.5 Final words on variable scope .... 178
  4.4 Loop-level parallelism .... 179
    4.4.1 Data dependencies .... 181
    4.4.2 Nested loops .... 191
    4.4.3 Scheduling .... 192
  4.5 Task parallelism .... 195
    4.5.1 The sections directive .... 196
    4.5.2 The task directive .... 202
  4.6 Synchronization constructs .... 208
  4.7 Correctness and optimization issues .... 216
    4.7.1 Thread safety .... 216
    4.7.2 False sharing .... 220
  4.8 A case study: sorting in OpenMP .... 226
    4.8.1 Bottom-up mergesort in OpenMP .... 227
    4.8.2 Top-down mergesort in OpenMP .... 230
    4.8.3 Performance comparison .... 235
  Exercises .... 237

Chapter 5  Distributed memory programming .... 239
  5.1 Communicating processes .... 239
  5.2 MPI .... 240
  5.3 Core concepts .... 241
  5.4 Your first MPI program .... 242
  5.5 Program architecture .... 246
    5.5.1 SPMD .... 246
    5.5.2 MPMD .... 246
  5.6 Point-to-point communication .... 248
  5.7 Alternative point-to-point communication modes .... 252
    5.7.1 Buffered communications .... 253
  5.8 Non-blocking communications .... 255
  5.9 Point-to-point communications: summary .... 259
  5.10 Error reporting and handling .... 259
  5.11 Collective communications .... 261
    5.11.1 Scattering .... 266
    5.11.2 Gathering .... 272
    5.11.3 Reduction .... 274
    5.11.4 All-to-all gathering .... 279
    5.11.5 All-to-all scattering .... 283
    5.11.6 All-to-all reduction .... 288
    5.11.7 Global synchronization .... 289
  5.12 Communicating objects .... 289
    5.12.1 Derived datatypes .... 290
    5.12.2 Packing/unpacking .... 297
  5.13 Node management: communicators and groups .... 300
    5.13.1 Creating groups .... 300
    5.13.2 Creating intra-communicators .... 302
  5.14 One-sided communications .... 305
    5.14.1 RMA communication functions .... 307
    5.14.2 RMA synchronization functions .... 308
  5.15 I/O considerations .... 317
  5.16 Combining MPI processes with threads .... 325
  5.17 Timing and performance measurements .... 328
  5.18 Debugging and profiling MPI programs .... 329
  5.19 The Boost.MPI library .... 333
    5.19.1 Blocking and non-blocking communications .... 335
    5.19.2 Data serialization .... 340
    5.19.3 Collective operations .... 343
  5.20 A case study: diffusion-limited aggregation .... 347
  5.21 A case study: brute-force encryption cracking .... 352
    5.21.1 Version #1: "plain-vanilla" MPI .... 352
    5.21.2 Version #2: combining MPI and OpenMP .... 358
  5.22 A case study: MPI implementation of the master-worker pattern .... 362
    5.22.1 A simple master-worker setup .... 363
    5.22.2 A multithreaded master-worker setup .... 371
  Exercises .... 386

Chapter 6  GPU programming .... 391
  6.1 GPU programming .... 391
  6.2 CUDA's programming model: threads, blocks, and grids .... 394
  6.3 CUDA's execution model: streaming multiprocessors and warps .... 400
  6.4 CUDA compilation process .... 403
  6.5 Putting together a CUDA project .... 407
  6.6 Memory hierarchy .... 410
    6.6.1 Local memory/registers .... 416
    6.6.2 Shared memory .... 417
    6.6.3 Constant memory .... 425
    6.6.4 Texture and surface memory .... 432
  6.7 Optimization techniques .... 432
    6.7.1 Block and grid design .... 432
    6.7.2 Kernel structure .... 442
    6.7.3 Shared memory access .... 446
    6.7.4 Global memory access .... 454
    6.7.5 Page-locked and zero-copy memory .... 458
    6.7.6 Unified memory .... 461
    6.7.7 Asynchronous execution and streams .... 464
  6.8 Dynamic parallelism .... 471
  6.9 Debugging CUDA programs .... 475
  6.10 Profiling CUDA programs .... 476
  6.11 CUDA and MPI .... 480
  6.12 Case studies .... 485
    6.12.1 Fractal set calculation .... 486
    6.12.2 Block cipher encryption .... 496
  Exercises .... 523

Chapter 7  The Thrust template library .... 527
  7.1 Introduction .... 527
  7.2 First steps in Thrust .... 528
  7.3 Working with Thrust datatypes .... 532
  7.4 Thrust algorithms .... 535
    7.4.1 Transformations .... 536
    7.4.2 Sorting and searching .... 540
    7.4.3 Reductions .... 546
    7.4.4 Scans/prefix sums .... 548
    7.4.5 Data management and manipulation .... 550
  7.5 Fancy iterators .... 553
  7.6 Switching device back ends .... 559
  7.7 Case studies .... 561
    7.7.1 Monte Carlo integration .... 561
    7.7.2 DNA sequence alignment .... 564
  Exercises .... 571

Chapter 8  Load balancing .... 575
  8.1 Introduction .... 575
  8.2 Dynamic load balancing: the Linda legacy .... 576
  8.3 Static load balancing: the divisible load theory approach .... 578
    8.3.1 Modeling costs .... 579
    8.3.2 Communication configuration .... 586
    8.3.3 Analysis .... 589
    8.3.4 Summary: short literature review .... 598
  8.4 DLTlib: a library for partitioning workloads .... 601
  8.5 Case studies .... 604
    8.5.1 Hybrid computation of a Mandelbrot set "movie": a case study in dynamic load balancing .... 604
    8.5.2 Distributed block cipher encryption: a case study in static load balancing .... 617
  Exercises .... 627

Appendix A  Compiling Qt programs .... 629
  A.1 .... 629
  A.2 .... 629

Appendix B  Running MPI programs: preparatory and configuration steps .... 631
  B.1 .... 631
  B.2 Computing nodes discovery for MPI program deployment .... 632
    B.2.1 Host discovery with the nmap utility .... 632
    B.2.2 Automatic generation of a hostfile .... 633

Appendix C  Time measurement .... 635
  C.1 .... 635
  C.2 POSIX high-resolution timing .... 635
  C.3 .... 637
  C.4 .... 638
  C.5 .... 638
  C.6 .... 638

Appendix D  Boost.MPI .... 641
  D.1 Mapping from MPI C to Boost.MPI .... 641

Appendix E  Setting up CUDA .... 643
  E.1 .... 643
  E.2 .... 643
  E.3 Running CUDA without an Nvidia GPU .... 644
  E.4 Running CUDA on Optimus-equipped laptops .... 645
  E.5 Combining CUDA with third-party libraries .... 646

Appendix F  DLTlib .... 649
  F.1 DLTlib functions .... 649
    F.1.1 Class Network: generic methods .... 650
    F.1.2 Class Network: query processing .... 652
    F.1.3 Class Network: image processing .... 653
    F.1.4 Class Network: image registration .... 654
  F.2 DLTlib files .... 657

Glossary .... 659
Bibliography .... 661
Index .... 665