Preface |
|
xv | |
About the Editor-in-Chief and Authors |
|
xix | |
Part I Prologue |
|
|
|
3 | (50) |
|
1.1 The Dawn of the Many-Core Era |
|
|
3 | (2) |
|
1.2 Communication-Centric Cross-Layer Optimizations |
|
|
5 | (2) |
|
1.3 A Baseline Design Space Exploration of NoCs |
|
|
7 | (10) |
|
|
8 | (1) |
|
|
9 | (2) |
|
|
11 | (2) |
|
1.3.4 Router Microarchitecture |
|
|
13 | (3) |
|
|
16 | (1) |
|
1.4 Review of NoC Research |
|
|
17 | (6) |
|
1.4.1 Research on Topologies |
|
|
17 | (1) |
|
1.4.2 Research on Unicast Routing |
|
|
18 | (1) |
|
1.4.3 Research on Supporting Collective Communications |
|
|
19 | (1) |
|
1.4.4 Research on Flow Control |
|
|
20 | (2) |
|
1.4.5 Research on Router Microarchitecture |
|
|
22 | (1) |
|
1.5 Trends of Real Processors |
|
|
23 | (15) |
|
1.5.1 The MIT Raw Processor |
|
|
23 | (1) |
|
1.5.2 The Tilera TILE64 Processor |
|
|
24 | (2) |
|
1.5.3 The Sony/Toshiba/IBM Cell Processor |
|
|
26 | (2) |
|
1.5.4 The U.T. Austin TRIPS Processor |
|
|
28 | (1) |
|
1.5.5 The Intel Teraflops Processor |
|
|
29 | (1) |
|
1.5.6 The Intel SCC Processor |
|
|
30 | (2) |
|
1.5.7 The Intel Larrabee Processor |
|
|
32 | (2) |
|
1.5.8 The Intel Knights Corner Processor |
|
|
34 | (2) |
|
1.5.9 Summary of Real Processors |
|
|
36 | (2) |
|
|
38 | (2) |
|
|
40 | (13) |
Part II Logic Implementations |
|
|
Chapter 2 A Single-Cycle Router with Wing Channels |
|
|
53 | (24) |
|
|
53 | (2) |
|
2.2 The Router Architecture |
|
|
55 | (7) |
|
2.2.1 The Overall Architecture |
|
|
56 | (4) |
|
|
60 | (2) |
|
2.3 Microarchitecture Designs |
|
|
62 | (5) |
|
|
62 | (2) |
|
2.3.2 Fast Arbiter Components |
|
|
64 | (1) |
|
2.3.3 SIG Managers and SIG Controllers |
|
|
65 | (2) |
|
|
67 | (7) |
|
2.4.1 Simulation Infrastructures |
|
|
67 | (1) |
|
2.4.2 Pipeline Delay Analysis |
|
|
67 | (1) |
|
2.4.3 Latency and Throughput |
|
|
68 | (5) |
|
2.4.4 Area and Power Consumption |
|
|
73 | (1) |
|
|
74 | (1) |
|
|
74 | (3) |
|
Chapter 3 Dynamic Virtual Channel Routers with Congestion Awareness |
|
|
77 | (30) |
|
|
77 | (2) |
|
3.2 DVC with Congestion Awareness |
|
|
79 | (3) |
|
|
79 | (2) |
|
3.2.2 Congestion Avoidance Scheme |
|
|
81 | (1) |
|
3.3 Multiple-Port Shared Buffer with Congestion Awareness |
|
|
82 | (3) |
|
3.3.1 DVC Scheme Among Multiple Ports |
|
|
82 | (2) |
|
3.3.2 Congestion Avoidance Scheme |
|
|
84 | (1) |
|
3.4 DVC Router Microarchitecture |
|
|
85 | (6) |
|
|
86 | (2) |
|
3.4.2 Metric Aggregation and Congestion Avoidance |
|
|
88 | (2) |
|
3.4.3 VC Allocation Module |
|
|
90 | (1) |
|
3.5 HiBB Router Microarchitecture |
|
|
91 | (5) |
|
|
92 | (1) |
|
3.5.2 VC Allocation and Output Port Allocation |
|
|
92 | (3) |
|
|
95 | (1) |
|
|
96 | (6) |
|
3.6.1 DVC Router Evaluation |
|
|
96 | (2) |
|
3.6.2 HiBB Router Evaluation |
|
|
98 | (4) |
|
|
102 | (1) |
|
|
102 | (5) |
|
Chapter 4 Virtual Bus Structure-Based Network-on-Chip Topologies |
|
|
107 | (34) |
|
|
108 | (1) |
|
|
109 | (1) |
|
|
110 | (4) |
|
4.3.1 Baseline On-Chip Communication Networks |
|
|
110 | (1) |
|
4.3.2 Analysis of NoC Problems |
|
|
111 | (2) |
|
4.3.3 Advantages of a Transaction-Based Bus |
|
|
113 | (1) |
|
|
114 | (11) |
|
4.4.1 Interconnect Structures |
|
|
114 | (2) |
|
|
116 | (7) |
|
4.4.3 Starvation and Deadlock Avoidance |
|
|
123 | (1) |
|
4.4.4 The VBON Router Microarchitecture |
|
|
124 | (1) |
|
|
125 | (10) |
|
4.5.1 Simulation Infrastructures |
|
|
126 | (3) |
|
4.5.2 Synthetic Traffic Evaluations |
|
|
129 | (3) |
|
4.5.3 Real Application Evaluations |
|
|
132 | (1) |
|
4.5.4 Power Consumption Analysis |
|
|
132 | (1) |
|
|
132 | (3) |
|
|
135 | (1) |
|
|
136 | (5) |
Part III Routing and Flow Control |
|
|
Chapter 5 Routing Algorithms for Workload Consolidation |
|
|
141 | (34) |
|
|
142 | (1) |
|
|
143 | (2) |
|
|
145 | (3) |
|
5.3.1 Insufficient Information |
|
|
145 | (1) |
|
5.3.2 Intraregion Interference |
|
|
145 | (2) |
|
5.3.3 Inter-Region Interference |
|
|
147 | (1) |
|
5.4 Destination-Based Adaptive Routing |
|
|
148 | (7) |
|
5.4.1 Destination-Based Selection Strategy |
|
|
148 | (4) |
|
5.4.2 Routing Function Design |
|
|
152 | (3) |
|
|
155 | (12) |
|
5.5.1 Evaluation of Routing Functions |
|
|
156 | (2) |
|
5.5.2 Single-Region Performance |
|
|
158 | (3) |
|
5.5.3 Multiple-Region Performance |
|
|
161 | (2) |
|
|
163 | (3) |
|
|
166 | (1) |
|
5.6 Analysis and Discussion |
|
|
167 | (2) |
|
5.6.1 In-Depth Analysis of Interference |
|
|
167 | (2) |
|
5.6.2 Design Space Exploration |
|
|
169 | (1) |
|
|
169 | (1) |
|
|
170 | (5) |
|
Chapter 6 Flow Control for Fully Adaptive Routing |
|
|
175 | (40) |
|
|
176 | (3) |
|
|
179 | (1) |
|
6.2.1 Deadlock Avoidance Theories |
|
|
179 | (1) |
|
6.2.2 Fully Adaptive Routing Algorithms |
|
|
179 | (1) |
|
|
180 | (1) |
|
|
180 | (1) |
|
6.3.2 Routing Flexibility |
|
|
180 | (1) |
|
6.4 Flow Control and Routing Designs |
|
|
181 | (9) |
|
6.4.1 Whole Packet Forwarding |
|
|
182 | (3) |
|
6.4.2 Aggressive VC Reallocation for EVCs |
|
|
185 | (3) |
|
6.4.3 Maintain Routing Flexibility |
|
|
188 | (1) |
|
6.4.4 Router Microarchitecture |
|
|
188 | (2) |
|
6.5 Evaluation on Synthetic Traffic |
|
|
190 | (9) |
|
6.5.1 Performance of Synthetic Workloads |
|
|
191 | (1) |
|
6.5.2 Buffer Utilization of Routing Algorithms |
|
|
192 | (2) |
|
6.5.3 Sensitivity to Network Design |
|
|
194 | (5) |
|
6.6 Evaluation of PARSEC Workloads |
|
|
199 | (2) |
|
6.6.1 Methodology and Configuration |
|
|
199 | (1) |
|
|
200 | (1) |
|
6.7 Detailed Analysis of Flow Control |
|
|
201 | (6) |
|
6.7.1 The Detailed Buffer Utilization |
|
|
201 | (3) |
|
6.7.2 The Effect of Flow Control on Fairness |
|
|
204 | (3) |
|
|
207 | (2) |
|
|
207 | (1) |
|
6.8.2 Dynamically Allocated Multiqueue and Hybrid Flow Controls |
|
|
208 | (1) |
|
|
209 | (1) |
|
Appendix: Logical Equivalence of Alg and Alg + WPF |
|
|
209 | (2) |
|
|
211 | (4) |
|
Chapter 7 Deadlock-Free Flow Control for Torus Networks-on-Chip |
|
|
215 | (40) |
|
|
216 | (2) |
|
7.2 Limitations of Existing Designs |
|
|
218 | (3) |
|
|
218 | (1) |
|
7.2.2 Localized Bubble Scheme |
|
|
219 | (1) |
|
7.2.3 Critical Bubble Scheme |
|
|
219 | (1) |
|
7.2.4 Inefficiency with Variable-Size Packets |
|
|
220 | (1) |
|
7.3 Flit Bubble Flow Control |
|
|
221 | (4) |
|
7.3.1 Theoretical Description |
|
|
221 | (1) |
|
|
222 | (1) |
|
|
223 | (1) |
|
|
224 | (1) |
|
7.4 Router Microarchitecture |
|
|
225 | (2) |
|
|
225 | (1) |
|
|
226 | (1) |
|
|
227 | (1) |
|
7.6 Evaluation on 1D Tori (Rings) |
|
|
228 | (3) |
|
|
228 | (2) |
|
|
230 | (1) |
|
7.6.3 Latency of Short and Long Packets |
|
|
231 | (1) |
|
7.7 Evaluation on 2D Tori |
|
|
231 | (9) |
|
7.7.1 Performance for a 4 x 4 Torus |
|
|
231 | (2) |
|
7.7.2 Sensitivity to SFP Ratios |
|
|
233 | (1) |
|
7.7.3 Sensitivity to Buffer Size |
|
|
234 | (2) |
|
7.7.4 Scalability for an 8 x 8 Torus |
|
|
236 | (1) |
|
7.7.5 Effect of Starvation |
|
|
236 | (2) |
|
7.7.6 Real Application Performance |
|
|
238 | (1) |
|
7.7.7 Large-Scale Systems and Message Passing |
|
|
239 | (1) |
|
7.8 Overheads: Power and Area |
|
|
240 | (8) |
|
|
240 | (1) |
|
|
241 | (3) |
|
|
244 | (1) |
|
7.8.4 Comparison with Meshes |
|
|
245 | (3) |
|
7.9 Discussion and Related Work |
|
|
248 | (1) |
|
|
248 | (1) |
|
|
248 | (1) |
|
|
249 | (1) |
|
|
249 | (6) |
Part IV Programming Paradigms |
|
|
Chapter 8 Supporting Cache-Coherent Collective Communications |
|
|
255 | (30) |
|
|
256 | (2) |
|
8.2 Message Combination Framework |
|
|
258 | (5) |
|
|
260 | (1) |
|
8.2.2 Message Combination Example |
|
|
260 | (3) |
|
8.2.3 Insufficient MCT Entries |
|
|
263 | (1) |
|
|
263 | (2) |
|
8.4 Router Pipeline and Microarchitecture |
|
|
265 | (2) |
|
|
267 | (11) |
|
|
269 | (3) |
|
8.5.2 Comparing Multicast VN Configurations |
|
|
272 | (2) |
|
|
274 | (2) |
|
8.5.4 Sensitivity to Network Design |
|
|
276 | (2) |
|
|
278 | (2) |
|
|
280 | (1) |
|
8.7.1 Message Combination |
|
|
280 | (1) |
|
8.7.2 NoC Multicast Routing |
|
|
280 | (1) |
|
|
281 | (1) |
|
|
281 | (4) |
|
Chapter 9 Network-on-Chip Customizations for Message Passing Interface Primitives |
|
|
285 | (32) |
|
|
286 | (1) |
|
|
287 | (2) |
|
|
289 | (1) |
|
9.3.1 MPI Adaption in NoC Designs |
|
|
289 | (1) |
|
9.3.2 Optimizations of MPI Functions |
|
|
290 | (1) |
|
9.4 Communication Customization Architectures |
|
|
290 | (12) |
|
9.4.1 Architecture Overview |
|
|
290 | (2) |
|
9.4.2 The Customized NoC Design: VBON |
|
|
292 | (1) |
|
9.4.3 The MPI Primitive Implementation: MU |
|
|
292 | (10) |
|
|
302 | (10) |
|
|
302 | (1) |
|
9.5.2 Experimental Results |
|
|
303 | (9) |
|
|
312 | (1) |
|
|
312 | (5) |
|
Chapter 10 Message Passing Interface Communication Protocol Optimizations |
|
|
317 | (36) |
|
|
318 | (1) |
|
|
319 | (7) |
|
10.2.1 Communication Protocols in MPI |
|
|
319 | (1) |
|
|
320 | (5) |
|
|
325 | (1) |
|
|
326 | (2) |
|
10.4 Adaptive Communication Mechanisms |
|
|
328 | (10) |
|
10.4.1 Goals and Approaches |
|
|
328 | (1) |
|
10.4.2 Baseline MPI-Accelerated NoC Designs |
|
|
329 | (2) |
|
10.4.3 ADCM Architectural Support |
|
|
331 | (6) |
|
10.4.4 Comparison with the Ideal Protocol |
|
|
337 | (1) |
|
|
338 | (9) |
|
|
338 | (2) |
|
10.5.2 Synthetic Traffic Results |
|
|
340 | (3) |
|
10.5.3 Real Application Results |
|
|
343 | (3) |
|
10.5.4 Sensitivity Analysis |
|
|
346 | (1) |
|
10.5.5 The Hardware Overhead |
|
|
346 | (1) |
|
|
347 | (1) |
|
|
348 | (5) |
Part V Epilogue |
|
|
Chapter 11 Conclusions and Future Work |
|
|
353 | (4) |
|
|
353 | (2) |
|
|
355 | (2) |
Index |
|
357 | |