Preface    xv

Chapter 1  Why parallel computing    1
    1.1  Why we need ever-increasing performance    2
    1.2  Why we're building parallel systems    3
    1.3  Why we need to write parallel programs    3
    1.4  How do we write parallel programs?    6
    1.5  What we'll be doing    8
    1.6  Concurrent, parallel, distributed    10
    1.7  The rest of the book    11
    1.8  A word of warning    11
    1.9  Typographical conventions    12
    1.10  Summary    12
    1.11  Exercises    14
|
Chapter 2  Parallel hardware and parallel software    17
    2.1  Some background    17
        2.1.1  The von Neumann architecture    17
        2.1.2  Processes, multitasking, and threads    19
    2.2  Modifications to the von Neumann model    20
        2.2.1  The basics of caching    21
        2.2.2  Cache mappings    23
        2.2.3  Caches and programs: an example    24
        2.2.4  Virtual memory    25
        2.2.5  Instruction-level parallelism    27
        2.2.6  Hardware multithreading    30
    2.3  Parallel hardware    31
        2.3.1  Classifications of parallel computers    31
        2.3.2  SIMD systems    31
        2.3.3  MIMD systems    34
        2.3.4  Interconnection networks    37
        2.3.5  Cache coherence    45
        2.3.6  Shared-memory vs. distributed-memory    49
    2.4  Parallel software    49
        2.4.1  Caveats    50
        2.4.2  Coordinating the processes/threads    50
        2.4.3  Shared-memory    51
        2.4.4  Distributed-memory    55
        2.4.5  GPU programming    58
        2.4.6  Programming hybrid systems    60
    2.5  Input and output    60
        2.5.1  MIMD systems    60
        2.5.2  GPUs    61
    2.6  Performance    61
        2.6.1  Speedup and efficiency in MIMD systems    61
        2.6.2  Amdahl's law    65
        2.6.3  Scalability in MIMD systems    66
        2.6.4  Taking timings of MIMD programs    67
        2.6.5  GPU performance    70
    2.7  Parallel program design    71
        2.7.1  An example    71
    2.8  Writing and running parallel programs    75
    2.9  Assumptions    77
    2.10  Summary    78
        2.10.1  Serial systems    78
        2.10.2  Parallel hardware    79
        2.10.3  Parallel software    81
        2.10.4  Input and output    82
        2.10.5  Performance    82
        2.10.6  Parallel program design    83
        2.10.7  Assumptions    84
    2.11  Exercises    84

Chapter 3  Distributed-memory programming with MPI    89
    3.1  Getting started    90
        3.1.1  Compilation and execution    91
        3.1.2  MPI programs    92
        3.1.3  MPI_Init and MPI_Finalize    93
        3.1.4  Communicators, MPI_Comm_size, and MPI_Comm_rank    94
        3.1.5  SPMD programs    94
        3.1.6  Communication    95
        3.1.7  MPI_Send    95
        3.1.8  MPI_Recv    97
        3.1.9  Message matching    97
        3.1.10  The status_p argument    98
        3.1.11  Semantics of MPI_Send and MPI_Recv    99
        3.1.12  Some potential pitfalls    100
    3.2  The trapezoidal rule in MPI    100
        3.2.1  The trapezoidal rule    101
        3.2.2  Parallelizing the trapezoidal rule    102
    3.3  Dealing with I/O    104
        3.3.1  Output    105
        3.3.2  Input    107
    3.4  Collective communication    108
        3.4.1  Tree-structured communication    108
        3.4.2  MPI_Reduce    110
        3.4.3  Collective vs. point-to-point communications    112
        3.4.4  MPI_Allreduce    113
        3.4.5  Broadcast    113
        3.4.6  Data distributions    116
        3.4.7  Scatter    117
        3.4.8  Gather    119
        3.4.9  Allgather    121
    3.5  MPI derived datatypes    123
    3.6  Performance evaluation of MPI programs    127
        3.6.1  Taking timings    127
        3.6.2  Results    130
        3.6.3  Speedup and efficiency    133
        3.6.4  Scalability    134
    3.7  A parallel sorting algorithm    135
        3.7.1  Some simple serial sorting algorithms    135
        3.7.2  Parallel odd-even transposition sort    137
        3.7.3  Safety in MPI programs    140
        3.7.4  Final details of parallel odd-even sort    143
    3.8  Summary    144
    3.9  Exercises    148
    3.10  Programming assignments    155

Chapter 4  Shared-memory programming with Pthreads    159
    4.1  Processes, threads, and Pthreads    160
    4.2  Hello, world    161
        4.2.1  Execution    161
        4.2.2  Preliminaries    163
        4.2.3  Starting the threads    164
        4.2.4  Running the threads    166
        4.2.5  Stopping the threads    167
        4.2.6  Error checking    168
        4.2.7  Other approaches to thread startup    168
    4.3  Matrix-vector multiplication    169
    4.4  Critical sections    171
    4.5  Busy-waiting    175
    4.6  Mutexes    178
    4.7  Producer-consumer synchronization and semaphores    182
    4.8  Barriers and condition variables    187
        4.8.1  Busy-waiting and a mutex    188
        4.8.2  Semaphores    189
        4.8.3  Condition variables    190
        4.8.4  Pthreads barriers    193
    4.9  Read-write locks    193
        4.9.1  Sorted linked list functions    193
        4.9.2  A multithreaded linked list    194
        4.9.3  Pthreads read-write locks    198
        4.9.4  Performance of the various implementations    200
        4.9.5  Implementing read-write locks    201
    4.10  Caches, cache-coherence, and false sharing    202
    4.11  Thread-safety    207
        4.11.1  Incorrect programs can produce correct output    210
    4.12  Summary    210
    4.13  Exercises    212
    4.14  Programming assignments    219

Chapter 5  Shared-memory programming with OpenMP    221
    5.1  Getting started    222
        5.1.1  Compiling and running OpenMP programs    223
        5.1.2  The program    224
        5.1.3  Error checking    227
    5.2  The trapezoidal rule    228
        5.2.1  A first OpenMP version    229
    5.3  Scope of variables    233
    5.4  The reduction clause    234
    5.5  The parallel for directive    237
        5.5.1  Caveats    238
        5.5.2  Data dependences    239
        5.5.3  Finding loop-carried dependences    241
        5.5.4  Estimating π    241
        5.5.5  More on scope    244
    5.6  More about loops in OpenMP: sorting    245
        5.6.1  Bubble sort    245
        5.6.2  Odd-even transposition sort    246
    5.7  Scheduling loops    249
        5.7.1  The schedule clause    250
        5.7.2  The static schedule type    252
        5.7.3  The dynamic and guided schedule types    252
        5.7.4  The runtime schedule type    253
        5.7.5  Which schedule?    255
    5.8  Producers and consumers    256
        5.8.1  Queues    256
        5.8.2  Message-passing    256
        5.8.3  Sending messages    257
        5.8.4  Receiving messages    257
        5.8.5  Termination detection    258
        5.8.6  Startup    259
        5.8.7  The atomic directive    259
        5.8.8  Critical sections and locks    260
        5.8.9  Using locks in the message-passing program    262
        5.8.10  Critical directives, atomic directives, or locks?    263
        5.8.11  Some caveats    264
    5.9  Caches, cache coherence, and false sharing    265
    5.10  Tasking    270
    5.11  Thread-safety    274
        5.11.1  Incorrect programs can produce correct output    276
    5.12  Summary    277
    5.13  Exercises    281
    5.14  Programming assignments    286

Chapter 6  GPU programming with CUDA    291
    6.1  GPUs and GPGPU    291
    6.2  GPU architectures    292
    6.3  Heterogeneous computing    293
    6.4  CUDA hello    295
        6.4.1  The source code    296
        6.4.2  Compiling and running the program    296
    6.5  A closer look    298
    6.6  Threads, blocks, and grids    299
    6.7  Nvidia compute capabilities and device architectures    301
    6.8  Vector addition    302
        6.8.1  The kernel    303
        6.8.2  Get_args    305
        6.8.3  Allocate_vectors and managed memory    305
        6.8.4  Other functions called from main    307
        6.8.5  Explicit memory transfers    309
    6.9  Returning results from CUDA kernels    312
    6.10  CUDA trapezoidal rule I    314
        6.10.1  The trapezoidal rule    314
        6.10.2  A CUDA implementation    315
        6.10.3  Initialization, return value, and final update    316
        6.10.4  Using the correct threads    318
        6.10.5  Updating the return value and atomicAdd    318
        6.10.6  Performance of the CUDA trapezoidal rule    319
    6.11  CUDA trapezoidal rule II: improving performance    320
        6.11.1  Tree-structured communication    320
        6.11.2  Local variables, registers, shared and global memory    322
        6.11.3  Warps and warp shuffles    324
        6.11.4  Implementing tree-structured global sum with a warp shuffle    325
        6.11.5  Shared memory and an alternative to the warp shuffle    326
    6.12  Implementation of trapezoidal rule with warpSize thread blocks    328
        6.12.1  Host code    329
        6.12.2  Kernel with warp shuffle    329
        6.12.3  Kernel with shared memory    329
        6.12.4  Performance    330
    6.13  CUDA trapezoidal rule III: blocks with more than one warp    331
        6.13.1  __syncthreads    331
        6.13.2  More shared memory    333
        6.13.3  Shared memory warp sums    333
        6.13.4  Shared memory banks    334
        6.13.5  Finishing up    336
        6.13.6  Performance    336
    6.14  Bitonic sort    338
        6.14.1  Serial bitonic sort    338
        6.14.2  Butterflies and binary representations    342
        6.14.3  Parallel bitonic sort I    345
        6.14.4  Parallel bitonic sort II    347
        6.14.5  Performance of CUDA bitonic sort    348
    6.15  Summary    349
    6.16  Exercises    355
    6.17  Programming assignments    358

Chapter 7  Parallel program development    361
    7.1  Two n-body solvers    361
        7.1.1  The problem    361
        7.1.2  Two serial programs    363
        7.1.3  Parallelizing the n-body solvers    368
        7.1.4  A word about I/O    371
        7.1.5  Parallelizing the basic solver using OpenMP    371
        7.1.6  Parallelizing the reduced solver using OpenMP    375
        7.1.7  Evaluating the OpenMP codes    379
        7.1.8  Parallelizing the solvers using Pthreads    380
        7.1.9  Parallelizing the basic solver using MPI    381
        7.1.10  Parallelizing the reduced solver using MPI    383
        7.1.11  Performance of the MPI solvers    389
        7.1.12  Parallelizing the basic solver using CUDA    390
        7.1.13  A note on cooperative groups in CUDA    393
        7.1.14  Performance of the basic CUDA n-body solver    393
        7.1.15  Improving the performance of the CUDA n-body solver    394
        7.1.16  Using shared memory in the n-body solver    395
    7.2  Sample sort    398
        7.2.1  Sample sort and bucket sort    398
        7.2.2  Choosing the sample    400
        7.2.3  A simple implementation of the Map function    401
        7.2.4  An alternative implementation of Map    402
        7.2.5  Parallelizing sample sort    407
        7.2.6  Implementing sample sort with OpenMP    410
        7.2.7  Implementing sample sort with Pthreads    415
        7.2.8  Implementing sample sort with MPI    418
        7.2.9  Implementing sample sort with CUDA    429
    7.3  A word of caution    439
    7.4  Which API?    439
    7.5  Summary    440
        7.5.1  MPI    442
    7.6  Exercises    444
    7.7  Programming assignments    451

Chapter 8  Where to go from here    455

Bibliography    459
Index    463