Foreword  vii
Editors  ix
Contributors  xi

Chapter 1  Introduction  1
        1.4.1  Part I: Basics of Parallel Programming  3
        1.4.2  Part II: Programming Languages for Multicore  3
        1.4.3  Part III: Programming Heterogeneous Processors  4
        1.4.4  Part IV: Emerging Technologies  4

Part I  Basics of Parallel Programming  7
|
Chapter 2  Fundamentals of Multicore Hardware and Parallel Programming  9
    2.2  Potential for Increased Speed  10
    2.3  Types of Parallel Computing Platforms  13
    2.5  Multicore Processor Architectures  18
        2.5.2  Symmetric Multicore Designs  18
        2.5.3  Asymmetric Multicore Designs  20
    2.6  Programming Multicore Systems  21
        2.6.1  Processes and Threads  21
    2.7  Parallel Programming Strategies  25
        2.7.1  Task and Data Parallelism  25
            2.7.1.1  Embarrassingly Parallel Computations  25
            2.7.1.3  Synchronous Computations  27
|
Chapter 3  Parallel Design Patterns  31
    3.1  Parallel Programming Challenge  31
    3.2  Design Patterns: Background and History  32
    3.3  Essential Patterns for Parallel Programming  33
        3.3.1  Parallel Algorithm Strategy Patterns  34
            3.3.1.1  Task Parallelism Pattern  34
            3.3.1.3  Divide and Conquer  38
            3.3.1.5  Geometric Decomposition  40
        3.3.2  Implementation Strategy Patterns  41
            3.3.2.3  Loop-Level Parallelism  44
            3.3.2.5  Master-Worker/Task-Queue  49
    3.4  Conclusions and Next Steps  50

Part II  Programming Languages for Multicore  53
|
Chapter 4  Threads and Shared Variables in C++  55
    4.1  Basic Model and Thread Creation  56
    4.2  Small Detour: C++0x Lambda Expressions  57
    4.5  More Refined Approach  60
    4.9  Other Synchronization Mechanisms  67
        4.9.2  Condition Variables  68
        4.9.3  Other Mutex Variants and Facilities  69
    4.10  Terminating a Multi-Threaded C++ Program  70
    4.12  Relationship to Earlier Standards  73
        4.12.1  Separate Thread Libraries  73
        4.12.3  Adjacent Field Overwrites  75
        4.12.4  Other Compiler-Introduced Races  75
        4.12.5  Program Termination  76
|
Chapter 5  Parallelism in .NET and Java  79
        5.1.1  Types of Parallelism  80
        5.1.2  Overview of the Chapter  81
    5.2  .NET Parallel Landscape  81
    5.3  Task Parallel Library  82
        5.3.1  Basic Methods: For, ForEach, and Invoke  82
        5.3.2  Breaking Out of a Loop  84
        5.6.2  java.util.concurrent  91
    5.8  ParallelArray Package  95

Chapter 6  OpenMP  101
        6.1.2  Overview of Features  102
        6.1.3  Who Developed OpenMP? How Is It Evolving?  104
    6.2  OpenMP 3.0 Specification  105
        6.2.1  Parallel Regions and Worksharing  105
            6.2.1.1  Scheduling Parallel Loops  108
            6.2.2.1  Using Data Attributes  110
            6.2.3.1  Using Explicit Tasks  112
            6.2.4.1  Performing Reductions in OpenMP  115
        6.2.5  OpenMP Library Routines and Environment Variables  116
            6.2.5.1  SPMD Programming Style  117
    6.3  Implementation of OpenMP  118
    6.4  Programming for Performance  121

Part III  Programming Heterogeneous Processors  129
|
Chapter 7  Scalable Manycore Computing with CUDA  131
    7.2  Manycore GPU Machine Model  132
    7.3  Structure of CUDA Programs  134
        7.3.3  Communicating within Blocks  136
        7.3.4  Device Memory Management  137
        7.3.5  Complete CUDA Example  138
    7.4  Execution of Kernels on the GPU  138
        7.4.2  Coordinating Tasks in Kernels  141
    7.5  Writing a CUDA Program  143
        7.5.1  Block-Level Parallel Prefix  143
        7.5.3  Coordinating Whole Grids  147
|
Chapter 8  Programming the Cell Processor  155
    8.2  Cell Processor Architecture Overview  157
        8.2.1  Power Processing Element  157
        8.2.2  Synergistic Processing Element  158
        8.2.3  Element Interconnect Bus  160
        8.2.4  DMA Communication and Memory Access  161
    8.3  Cell Programming with the SDK  165
        8.3.1  PPE/SPE Thread Coordination  165
        8.3.3  DMA Communication and Multi-Buffering  169
        8.3.4  Using SIMD Instructions on SPE  170
        8.3.5  Summary: Cell Programming with the SDK  174
    8.4  Cell SDK Compilers, Libraries, and Tools  174
        8.4.2  Full-System Simulator  175
        8.4.3  Performance Analysis and Visualization  175
        8.4.5  Libraries, Components, and Frameworks  176
    8.5  Higher-Level Programming Environments for Cell  177
        8.5.10  Other High-Level Programming Environments for Cell  186
    8.6  Algorithms and Components for Cell  187
    8.8  Bibliographical Remarks  191
    Disclaimers and Declarations  192

Part IV  Emerging Technologies  199
|
Chapter 9  Automatic Extraction of Parallelism from Sequential Code  201
        9.1.2  Techniques and Tools  202
        9.2.2  Data Dependence Analysis  204
            9.2.2.1  Data Dependence Graph  204
        9.2.3  Control Dependence Analysis  207
            9.2.3.1  Control Dependence Graph  207
        9.2.4  Program Dependence Graph  209
    9.3  DOALL Parallelization  209
        9.3.3  Advanced Topic: Reduction  212
        9.3.4  Advanced Topic: Speculative DOALL  214
        9.3.5  Advanced Topic: Further Techniques and Transformations  215
    9.4  DOACROSS Parallelization  217
        9.4.3  Advanced Topic: Speculation  220
    9.5  Pipeline Parallelization  223
        9.5.4  Advanced Topic: Speculation  226
    9.6  Bringing It All Together  230
|
Chapter 10  Auto-Tuning Parallel Application Performance  239
        10.3.2  Classification of Approaches  243
    10.4  Overview of the Tunable Architectures Approach  244
    10.5  Designing Tunable Applications  245
        10.5.1  Tunable Architectures  246
            10.5.1.1  Atomic Components  246
            10.5.1.3  Runtime System and Backend  247
            10.5.1.4  A Tunable Architecture Example  248
    10.6  Implementation with Tuning Instrumentation Languages  251
    10.7  Performance Optimization  256
        10.7.3  Auto-Tuning Systems  258
            10.7.3.4  Model-Based Systems  260
    10.8  Conclusion and Outlook  260
|
Chapter 11  Transactional Memory  265
    11.2  Transactional Memory Taxonomy  268
        11.2.1  Eager/Lazy Version Management  268
        11.2.2  Eager/Lazy Conflict Detection  268
    11.3  Hardware Transactional Memory  270
        11.3.1  Classical Cache-Based Bounded-Size HTM  270
    11.4  Software Transactional Memory  273
        11.5.1  Semantics of Atomic Blocks  280
        11.5.2  Optimizing Atomic Blocks  282
        11.5.3  Composable Blocking  283
|
Chapter 12  Emerging Applications  291
        12.2.1  Interactive RMS (iRMS)  294
        12.2.2  Growing Significance of Data-Driven Models  296
            12.2.2.1  Massive Data Computing: An Algorithmic Opportunity  297
        12.2.4  Structured Decomposition of RMS Applications  298
        12.3.1  Nature and Source of Underlying Parallelism  300
            12.3.1.1  Approximate, Yet Real Time  300
            12.3.1.2  Curse of Dimensionality and Irregular Access Pattern  301
            12.3.1.3  Parallelism: Both Coarse and Fine Grain  301
            12.3.1.4  Throughput Computing and Manycore  302
            12.3.1.5  Revisiting Amdahl's Law for Throughput Computing  302
        12.3.2  Scalability of RMS Applications  303
            12.3.2.1  Scalability Implications of Dataset Growth  304
        12.3.3  Homogeneous versus Heterogeneous Decomposition  305

Index  309