Preface  xxiii
|
Chapter 1  Understanding the need for parallel computing  1

1.2  From Problem to Parallel Solution - Development Steps  2
1.3  Approaches to Parallelization  4
1.4  Selected Use Cases With Popular APIs  6
|
Chapter 2  Overview of selected parallel and distributed systems for high performance computing  11

2.1  Generic Taxonomy of Parallel Computing Systems  11
2.4  Manycore CPUs/Coprocessors  17
2.6  Growth of High Performance Computing Systems and Relevant Metrics  20
2.7  Volunteer-Based Systems  22
|
Chapter 3  Typical paradigms for parallel applications  29

3.1  Aspects of Parallelization  30
3.1.1  Data partitioning and granularity  30
3.1.5  HPC related metrics  34
|
Chapter 4  Selected APIs for parallel programming  69

4.1  Message Passing Interface (MPI)  74
4.1.1  Programming model and application structure  74
4.1.2  The world of MPI processes and threads  75
4.1.3  Initializing and finalizing usage of MPI  75
4.1.4  Communication modes  76
4.1.5  Basic point-to-point communication routines  76
4.1.6  Basic MPI collective communication routines  78
4.1.7  Packing buffers and creating custom data types  83
4.1.8  Receiving a message with wildcards  85
4.1.9  Receiving a message with unknown data size  86
4.1.10  Various send modes  87
4.1.11  Non-blocking communication  88
4.1.13  A sample MPI application  95
4.1.14  Multithreading in MPI  97
4.1.15  Dynamic creation of processes in MPI  99
|
4.2  OpenMP  102
4.2.1  Programming model and application structure  102
4.2.2  Commonly used directives and functions  104
4.2.3  The number of threads in a parallel region  109
4.2.4  Synchronization of threads within a parallel region and single thread execution  109
4.2.5  Important environment variables  111
4.2.6  A sample OpenMP application  112
4.2.7  Selected SIMD directives  115
4.2.8  Device offload instructions  115
|
4.3  Pthreads  118
4.3.1  Programming model and application structure  118
4.3.3  Using condition variables  123
4.3.6  A sample Pthreads application  125
|
4.4  CUDA  127
4.4.1  Programming model and application structure  127
4.4.2  Scheduling and synchronization  131
4.4.4  A sample CUDA application  134
4.4.5  Streams and asynchronous operations  137
4.4.6  Dynamic parallelism  141
4.4.7  Unified Memory in CUDA  143
4.4.8  Management of GPU devices  145
|
4.5  OpenCL  147
4.5.1  Programming model and application structure  147
4.5.2  Coordinates and indexing  155
4.5.3  Queuing data reads/writes and kernel execution  156
4.5.4  Synchronization functions  157
4.5.5  A sample OpenCL application  158
|
4.6  OpenACC  167
4.6.1  Programming model and application structure  167
4.6.4  A sample OpenACC application  171
4.6.5  Asynchronous processing and synchronization  171
|
4.7  Selected Hybrid Approaches  172
|
Chapter 5  Programming parallel paradigms using selected APIs  185

5.3.3.2  Version with dynamic process creation  240
|
Chapter 6  Optimization techniques and best practices for parallel codes  251

6.1  Data Prefetching, Communication and Computations Overlapping and Increasing Computation Efficiency  252
6.3  Minimization of Overheads  258
6.3.1  Initialization and synchronization overheads  258
6.3.2  Load balancing vs cost of synchronization  260
6.4  Process/Thread Affinity  260
6.5  Data Types and Accuracy  261
6.6  Data Organization and Arrangement  261
6.8  Simulation of Parallel Application Execution  264
6.9  Best Practices and Typical Optimizations  265
|
Appendix B  Further reading  275

B.2  Other Resources on Parallel Programming  275

Index  297