Preface |
|
xv | |
Acknowledgement |
|
xix | |
About the Author |
|
xxi | |
List of Abbreviations |
|
xxiii | |
1 Perspectives on Multicore Architectures |
|
1 | (26) |
|
1.1 The Origin of the Multicore Architecture |
|
|
2 | (10) |
|
1.1.1 Power Consumption Issue |
|
|
6 | (6) |
|
1.2 Perspectives on Parallel Computers |
|
|
12 | (6) |
|
1.2.1 Flynn's Taxonomy of Parallel Computers |
|
|
15 | (2) |
|
1.2.2 Classes of MIMD Parallel Computers |
|
|
17 | (1) |
|
1.3 Future Multicore Architectures |
|
|
18 | (6) |
|
|
24 | (3) |
2 Perspectives on Parallel Programming |
|
27 | (24) |
|
2.1 Limits on Parallel Program Performance |
|
|
28 | (3) |
|
2.2 Parallel Programming Models |
|
|
31 | (18) |
|
2.2.1 Comparing Shared Memory and Message Passing Models |
|
|
33 | (2) |
|
|
35 | (4) |
|
2.2.3 Other Programming Models |
|
|
39 | (10) |
|
|
49 | (2) |
3 Shared Memory Parallel Programming |
|
51 | (52) |
|
3.1 Steps in Parallel Programming |
|
|
52 | (1) |
|
|
53 | (5) |
|
3.2.1 Loop-Level Dependence Analysis |
|
|
55 | (1) |
|
3.2.2 Iteration-Space Traversal Graph and Loop-Carried Dependence Graph |
|
|
56 | (2) |
|
3.3 Identifying Parallel Tasks in Loop Structures |
|
|
58 | (9) |
|
3.3.1 Parallelism between Loop Iterations and DOALL Parallelism |
|
|
58 | (3) |
|
3.3.2 DOACROSS: Synchronized Parallelism between Loop Iterations |
|
|
61 | (2) |
|
3.3.3 Parallelism Across Statements in a Loop |
|
|
63 | (2) |
|
3.3.4 DOPIPE: Parallelism Across Statements of a Loop |
|
|
65 | (2) |
|
3.4 Identifying Parallelism at Other Levels |
|
|
67 | (2) |
|
3.5 Identifying Parallelism through Algorithm Knowledge |
|
|
69 | (3) |
|
3.6 Determining the Scope of Variables |
|
|
72 | (5) |
|
|
73 | (2) |
|
3.6.2 Reduction Variables and Operation |
|
|
75 | (1) |
|
3.6.3 Summary of Criteria |
|
|
76 | (1) |
|
|
77 | (1) |
|
3.8 Assigning Tasks to Threads |
|
|
78 | (5) |
|
3.9 Mapping Threads to Processors |
|
|
83 | (4) |
|
3.10 A Brief Introduction to OpenMP |
|
|
87 | (7) |
|
|
94 | (9) |
4 Parallel Programming for Linked Data Structures |
|
103 | (30) |
|
4.1 Parallelization Challenges in LDS |
|
|
104 | (1) |
|
4.1.1 Loop-Level Parallelization is Insufficient |
|
|
104 | (1) |
|
4.2 Approaches to Parallelization of LDS |
|
|
105 | (10) |
|
4.2.1 Parallelizing Computation vs. Traversal |
|
|
105 | (2) |
|
4.2.2 Parallelizing Operations on the Data Structure |
|
|
107 | (8) |
|
4.3 Parallelization Techniques for Linked Lists |
|
|
115 | (11) |
|
4.3.1 Parallelization among Readers |
|
|
115 | (2) |
|
4.3.2 Parallelism among LDS Traversals |
|
|
117 | (4) |
|
4.3.3 Fine-Grain Lock Approach |
|
|
121 | (5) |
|
4.4 The Role of Transactional Memory |
|
|
126 | (2) |
|
|
128 | (5) |
5 Introduction to Memory Hierarchy Organization |
|
133 | (58) |
|
5.1 Motivation for Memory Hierarchy |
|
|
134 | (1) |
|
5.2 Basic Architectures of a Cache |
|
|
135 | (20) |
|
|
136 | (5) |
|
|
141 | (3) |
|
|
144 | (2) |
|
5.2.4 Inclusion Policy on Multi-Level Caches |
|
|
146 | (4) |
|
5.2.5 Unified/Split/Banked Cache Organization and Cache Pipelining |
|
|
150 | (2) |
|
5.2.6 Cache Addressing and Translation Lookaside Buffer |
|
|
152 | (3) |
|
|
155 | (1) |
|
|
155 | (7) |
|
5.3.1 The Power Law of Cache Misses |
|
|
158 | (1) |
|
5.3.2 Stack Distance Profile |
|
|
159 | (1) |
|
5.3.3 Cache Performance Metrics |
|
|
160 | (2) |
|
|
162 | (4) |
|
5.4.1 Stride and Sequential Prefetching |
|
|
164 | (1) |
|
5.4.2 Prefetching in Multiprocessor Systems |
|
|
165 | (1) |
|
5.5 Cache Design in Multicore Architecture |
|
|
166 | (1) |
|
5.6 Physical Cache Organization |
|
|
167 | (4) |
|
5.6.1 United Cache Organization |
|
|
167 | (1) |
|
5.6.2 Distributed Cache Organization |
|
|
168 | (1) |
|
5.6.3 Hybrid United+Distributed Cache Organization |
|
|
169 | (2) |
|
5.7 Logical Cache Organization |
|
|
171 | (9) |
|
|
175 | (2) |
|
5.7.2 Improving Distance Locality of Shared Cache |
|
|
177 | (1) |
|
5.7.3 Capacity Sharing in the Private Cache Organization |
|
|
178 | (2) |
|
|
180 | (6) |
|
5.8.1 IBM Power? Memory Hierarchy |
|
|
180 | (4) |
|
5.8.2 Comparing AMD Shanghai and Intel Barcelona's Memory Hierarchy |
|
|
184 | (2) |
|
|
186 | (5) |
6 Introduction to Shared Memory Multiprocessors |
|
191 | (14) |
|
6.1 The Cache Coherence Problem |
|
|
192 | (3) |
|
6.2 Memory Consistency Problem |
|
|
195 | (2) |
|
6.3 Synchronization Problem |
|
|
197 | (5) |
|
|
202 | (3) |
7 Basic Cache Coherence Issues |
|
205 | (60) |
|
|
206 | (6) |
|
7.1.1 Basic Support for Bus-Based Multiprocessors |
|
|
209 | (3) |
|
7.2 Cache Coherence in Bus-Based Multiprocessors |
|
|
212 | (23) |
|
7.2.1 Coherence Protocol for Write Through Caches |
|
|
212 | (2) |
|
7.2.2 MSI Protocol with Write Back Caches |
|
|
214 | (6) |
|
7.2.3 MESI Protocol with Write Back Caches |
|
|
220 | (6) |
|
7.2.4 MOESI Protocol with Write Back Caches |
|
|
226 | (5) |
|
7.2.5 Update-Based Protocol with Write Back Caches |
|
|
231 | (4) |
|
7.3 Impact of Cache Design on Cache Coherence Performance |
|
|
235 | (1) |
|
7.4 Performance and Other Practical Issues |
|
|
236 | (4) |
|
7.4.1 Prefetching and Coherence Misses |
|
|
236 | (1) |
|
|
237 | (2) |
|
|
239 | (1) |
|
7.5 Broadcast Protocol with Point-to-Point Interconnect |
|
|
240 | (17) |
|
|
257 | (8) |
8 Hardware Support for Synchronization |
|
265 | (36) |
|
|
266 | (16) |
|
8.1.1 Evaluating the Performance of Lock Implementations |
|
|
266 | (1) |
|
8.1.2 The Need for Atomic Instructions |
|
|
266 | (3) |
|
|
269 | (2) |
|
8.1.4 Test and Test and Set Lock |
|
|
271 | (1) |
|
8.1.5 Load Linked and Store Conditional Lock |
|
|
272 | (4) |
|
|
276 | (2) |
|
8.1.7 Array-Based Queuing Lock |
|
|
278 | (2) |
|
8.1.8 Qualitative Comparison of Lock Implementations |
|
|
280 | (2) |
|
8.2 Bather Implementations |
|
|
282 | (5) |
|
8.2.1 Sense-Reversing Centralized Barrier |
|
|
282 | (3) |
|
8.2.2 Combining Tree Barrier |
|
|
285 | (1) |
|
8.2.3 Hardware Barrier Implementation |
|
|
285 | (2) |
|
|
287 | (7) |
|
|
294 | (7) |
9 Memory Consistency Models |
|
301 | (30) |
|
9.1 Programmers' Intuition |
|
|
302 | (4) |
|
9.2 Architecture Mechanisms for Ensuring Sequential Consistency |
|
|
306 | (4) |
|
9.2.1 Basic SC Implementation on a Bus-Based Multiprocessor |
|
|
306 | (2) |
|
9.2.2 Techniques to Improve SC Performance |
|
|
308 | (2) |
|
9.3 Relaxed Consistency Models |
|
|
310 | (10) |
|
|
311 | (1) |
|
9.3.2 Processor Consistency |
|
|
311 | (2) |
|
|
313 | (2) |
|
9.3.4 Release Consistency |
|
|
315 | (4) |
|
9.3.5 Lazy Release Consistency |
|
|
319 | (1) |
|
9.4 Synchronization in Different Memory Consistency Models |
|
|
320 | (4) |
|
|
324 | (7) |
10 Advanced Cache Coherence Issues |
|
331 | (40) |
|
10.1 Directory Coherence Protocols |
|
|
332 | (1) |
|
10.2 Overview of Directory Coherence Protocol |
|
|
332 | (7) |
|
10.2.1 Directory Format and Location |
|
|
333 | (6) |
|
10.3 Basic Directory Cache Coherence Protocol |
|
|
339 | (4) |
|
10.4 Implementation Correctness and Performance |
|
|
343 | (13) |
|
10.4.1 Handling Races Due to Out-of-Sync Directory State |
|
|
343 | (3) |
|
10.4.2 Handling Races Due to Non-Instantaneous Processing of a Request |
|
|
346 | (7) |
|
10.4.3 Write Propagation and Transaction Serialization |
|
|
353 | (2) |
|
10.4.4 Synchronization Support |
|
|
355 | (1) |
|
10.4.5 Memory Consistency Models |
|
|
356 | (1) |
|
10.5 Contemporary Design Issues |
|
|
356 | (11) |
|
10.5.1 Dealing with Imprecise Directory Information |
|
|
356 | (5) |
|
10.5.2 Granularity of Coherence |
|
|
361 | (2) |
|
10.5.3 System Partitioning |
|
|
363 | (2) |
|
10.5.4 Accelerating Thread Migration |
|
|
365 | (2) |
|
|
367 | (4) |
11 Interconnection Network Architecture |
|
371 | (38) |
|
11.1 Link, Channel, and Latency |
|
|
372 | (4) |
|
|
376 | (5) |
|
11.3 Routing Policies and Algorithms |
|
|
381 | (12) |
|
|
393 | (4) |
|
11.5 Case Study: Alpha 21364 Network Architecture |
|
|
397 | (2) |
|
11.6 Multicore Design Issues |
|
|
399 | (4) |
|
11.6.1 Contemporary Design Issues |
|
|
401 | (2) |
|
|
403 | (6) |
12 MATT Architecture |
|
409 | (18) |
|
12.1 SIMT Programming Model |
|
|
410 | (2) |
|
12.2 Mapping SIMT Workloads to SIMT Cores |
|
|
412 | (1) |
|
12.3 STMT Core Architecture |
|
|
413 | (10) |
|
|
413 | (1) |
|
12.3.2 SLMDization/Vectorization: Warp Formation |
|
|
413 | (2) |
|
12.3.3 Fine-Grain Multithreading (Warp-Level Parallelism) |
|
|
415 | (1) |
|
|
416 | (1) |
|
12.3.5 Pipeline Execution |
|
|
417 | (1) |
|
12.3.6 Control Flow Processing |
|
|
418 | (1) |
|
|
419 | (4) |
|
|
423 | (4) |
13 Ask the Experts |
|
427 | (26) |
Bibliography |
|
453 | (6) |
Index |
|
459 | |