Preface    xv
Acknowledgments    xviii
About the Author    xix
|
Chapter 1 Why Parallel Computing?    1
1.1 Why We Need Ever-Increasing Performance    2
1.2 Why We're Building Parallel Systems    3
1.3 Why We Need to Write Parallel Programs    3
1.4 How Do We Write Parallel Programs?    6
1.5 What We'll Be Doing    8
1.6 Concurrent, Parallel, Distributed    9
1.7 The Rest of the Book    10
1.8 A Word of Warning    10
1.9 Typographical Conventions    11
1.10 Summary    12
1.11 Exercises    12
|
Chapter 2 Parallel Hardware and Parallel Software    15
2.1 Some Background    15
2.1.1 The von Neumann architecture    15
2.1.2 Processes, multitasking, and threads    17
2.2 Modifications to the von Neumann Model    18
2.2.1 The basics of caching    19
2.2.2 Cache mappings    20
2.2.3 Caches and programs: an example    22
2.2.4 Virtual memory    23
2.2.5 Instruction-level parallelism    25
2.2.6 Hardware multithreading    28
2.3 Parallel Hardware    29
2.3.1 SIMD systems    29
2.3.2 MIMD systems    32
2.3.3 Interconnection networks    35
2.3.4 Cache coherence    43
2.3.5 Shared-memory versus distributed-memory    46
2.4 Parallel Software    47
2.4.1 Caveats    47
2.4.2 Coordinating the processes/threads    48
2.4.3 Shared-memory    49
2.4.4 Distributed-memory    53
2.4.5 Programming hybrid systems    56
2.5 Input and Output    56
2.6 Performance    58
2.6.1 Speedup and efficiency    58
2.6.2 Amdahl's law    61
2.6.3 Scalability    62
2.6.4 Taking timings    63
2.7 Parallel Program Design    65
2.7.1 An example    66
2.8 Writing and Running Parallel Programs    70
2.9 Assumptions    70
2.10 Summary    71
2.10.1 Serial systems    71
2.10.2 Parallel hardware    73
2.10.3 Parallel software    74
2.10.4 Input and output    75
2.10.5 Performance    75
2.10.6 Parallel program design    76
2.10.7 Assumptions    76
2.11 Exercises    77
|
Chapter 3 Distributed-Memory Programming with MPI    83
3.1 Getting Started    84
3.1.1 Compilation and execution    84
3.1.2 MPI programs    86
3.1.3 MPI_Init and MPI_Finalize    86
3.1.4 Communicators, MPI_Comm_size and MPI_Comm_rank    87
3.1.5 SPMD programs    88
3.1.6 Communication    88
3.1.7 MPI_Send    88
3.1.8 MPI_Recv    90
3.1.9 Message matching    91
3.1.10 The status_p argument    92
3.1.11 Semantics of MPI_Send and MPI_Recv    93
3.1.12 Some potential pitfalls    94
3.2 The Trapezoidal Rule in MPI    94
3.2.1 The trapezoidal rule    94
3.2.2 Parallelizing the trapezoidal rule    96
3.3 Dealing with I/O    97
3.3.1 Output    97
3.3.2 Input    100
3.4 Collective Communication    101
3.4.1 Tree-structured communication    102
3.4.2 MPI_Reduce    103
3.4.3 Collective vs. point-to-point communications    105
3.4.4 MPI_Allreduce    106
3.4.5 Broadcast    106
3.4.6 Data distributions    109
3.4.7 Scatter    110
3.4.8 Gather    112
3.4.9 Allgather    113
3.5 MPI Derived Datatypes    116
3.6 Performance Evaluation of MPI Programs    119
3.6.1 Taking timings    119
3.6.2 Results    122
3.6.3 Speedup and efficiency    125
3.6.4 Scalability    126
3.7 A Parallel Sorting Algorithm    127
3.7.1 Some simple serial sorting algorithms    127
3.7.2 Parallel odd-even transposition sort    129
3.7.3 Safety in MPI programs    132
3.7.4 Final details of parallel odd-even sort    134
3.8 Summary    136
3.9 Exercises    140
3.10 Programming Assignments    147
|
Chapter 4 Shared-Memory Programming with Pthreads    151
4.1 Processes, Threads, and Pthreads    151
4.2 Hello, World    153
4.2.1 Execution    153
4.2.2 Preliminaries    155
4.2.3 Starting the threads    156
4.2.4 Running the threads    157
4.2.5 Stopping the threads    158
4.2.6 Error checking    158
4.2.7 Other approaches to thread startup    159
4.3 Matrix-Vector Multiplication    159
4.4 Critical Sections    162
4.5 Busy-Waiting    165
4.6 Mutexes    168
4.7 Producer-Consumer Synchronization and Semaphores    171
4.8 Barriers and Condition Variables    176
4.8.1 Busy-waiting and a mutex    177
4.8.2 Semaphores    177
4.8.3 Condition variables    179
4.8.4 Pthreads barriers    181
4.9 Read-Write Locks    181
4.9.1 Linked list functions    181
4.9.2 A multi-threaded linked list    183
4.9.3 Pthreads read-write locks    187
4.9.4 Performance of the various implementations    188
4.9.5 Implementing read-write locks    190
4.10 Caches, Cache Coherence, and False Sharing    190
4.11 Thread-Safety    195
4.11.1 Incorrect programs can produce correct output    198
4.12 Summary    198
4.13 Exercises    200
4.14 Programming Assignments    206
|
Chapter 5 Shared-Memory Programming with OpenMP    209
5.1 Getting Started    210
5.1.1 Compiling and running OpenMP programs    211
5.1.2 The program    212
5.1.3 Error checking    215
5.2 The Trapezoidal Rule    216
5.2.1 A first OpenMP version    216
5.3 Scope of Variables    220
5.4 The Reduction Clause    221
5.5 The parallel for Directive    224
5.5.1 Caveats    225
5.5.2 Data dependences    227
5.5.3 Finding loop-carried dependences    228
5.5.4 Estimating π    229
5.5.5 More on scope    231
5.6 More About Loops in OpenMP: Sorting    232
5.6.1 Bubble sort    232
5.6.2 Odd-even transposition sort    233
5.7 Scheduling Loops    236
5.7.1 The schedule clause    237
5.7.2 The static schedule type    238
5.7.3 The dynamic and guided schedule types    239
5.7.4 The runtime schedule type    239
5.7.5 Which schedule?    241
5.8 Producers and Consumers    241
5.8.1 Queues    241
5.8.2 Message-passing    242
5.8.3 Sending messages    243
5.8.4 Receiving messages    243
5.8.5 Termination detection    244
5.8.6 Startup    244
5.8.7 The atomic directive    245
5.8.8 Critical sections and locks    246
5.8.9 Using locks in the message-passing program    248
5.8.10 Critical directives, atomic directives, or locks?    249
5.8.11 Some caveats    249
5.9 Caches, Cache Coherence, and False Sharing    251
5.10 Thread-Safety    256
5.10.1 Incorrect programs can produce correct output    258
5.11 Summary    259
5.12 Exercises    263
5.13 Programming Assignments    267
|
Chapter 6 Parallel Program Development    271
6.1 Two n-Body Solvers    271
6.1.1 The problem    271
6.1.2 Two serial programs    273
6.1.3 Parallelizing the n-body solvers    277
6.1.4 A word about I/O    280
6.1.5 Parallelizing the basic solver using OpenMP    281
6.1.6 Parallelizing the reduced solver using OpenMP    284
6.1.7 Evaluating the OpenMP codes    288
6.1.8 Parallelizing the solvers using pthreads    289
6.1.9 Parallelizing the basic solver using MPI    290
6.1.10 Parallelizing the reduced solver using MPI    292
6.1.11 Performance of the MPI solvers    297
6.2 Tree Search    299
6.2.1 Recursive depth-first search    302
6.2.2 Nonrecursive depth-first search    303
6.2.3 Data structures for the serial implementations    305
6.2.4 Performance of the serial implementations    306
6.2.5 Parallelizing tree search    306
6.2.6 A static parallelization of tree search using pthreads    309
6.2.7 A dynamic parallelization of tree search using pthreads    310
6.2.8 Evaluating the pthreads tree-search programs    315
6.2.9 Parallelizing the tree-search programs using OpenMP    316
6.2.10 Performance of the OpenMP implementations    318
6.2.11 Implementation of tree search using MPI and static partitioning    319
6.2.12 Implementation of tree search using MPI and dynamic partitioning    327
6.3 A Word of Caution    335
6.4 Which API?    335
6.5 Summary    336
6.5.1 Pthreads and OpenMP    337
6.5.2 MPI    338
6.6 Exercises    341
6.7 Programming Assignments    350
|
Chapter 7 Where to Go from Here    353
References    357
Index    361