Foreword
Preface
Acknowledgments

Chapter 1 Fundamentals of Quantitative Design and Analysis
1.3 Defining Computer Architecture
1.5 Trends in Power and Energy in Integrated Circuits
1.8 Measuring, Reporting, and Summarizing Performance
1.9 Quantitative Principles of Computer Design
1.10 Putting It All Together: Performance, Price, and Power
1.11 Fallacies and Pitfalls
1.13 Historical Perspectives and References
Case Studies and Exercises

Chapter 2 Memory Hierarchy Design
2.2 Memory Technology and Optimizations
2.3 Ten Advanced Optimizations of Cache Performance
2.4 Virtual Memory and Virtual Machines
2.5 Cross-Cutting Issues: The Design of Memory Hierarchies
2.6 Putting It All Together: Memory Hierarchies in the ARM Cortex-A53 and Intel Core i7 6700
2.7 Fallacies and Pitfalls
2.8 Concluding Remarks: Looking Ahead
2.9 Historical Perspectives and References
Case Studies and Exercises

Chapter 3 Instruction-Level Parallelism and Its Exploitation
3.1 Instruction-Level Parallelism: Concepts and Challenges
3.2 Basic Compiler Techniques for Exposing ILP
3.3 Reducing Branch Costs With Advanced Branch Prediction
3.4 Overcoming Data Hazards With Dynamic Scheduling
3.5 Dynamic Scheduling: Examples and the Algorithm
3.6 Hardware-Based Speculation
3.7 Exploiting ILP Using Multiple Issue and Static Scheduling
3.8 Exploiting ILP Using Dynamic Scheduling, Multiple Issue, and Speculation
3.9 Advanced Techniques for Instruction Delivery and Speculation
3.10 Cross-Cutting Issues
3.11 Multithreading: Exploiting Thread-Level Parallelism to Improve Uniprocessor Throughput
3.12 Putting It All Together: The Intel Core i7 6700 and ARM Cortex-A53
3.13 Fallacies and Pitfalls
3.14 Concluding Remarks: What's Ahead?
3.15 Historical Perspective and References
Case Studies and Exercises

Chapter 4 Data-Level Parallelism in Vector, SIMD, and GPU Architectures
4.3 SIMD Instruction Set Extensions for Multimedia
4.4 Graphics Processing Units
4.5 Detecting and Enhancing Loop-Level Parallelism
4.7 Putting It All Together: Embedded Versus Server GPUs and Tesla Versus Core i7
4.8 Fallacies and Pitfalls
4.10 Historical Perspective and References

Chapter 5 Thread-Level Parallelism
5.2 Centralized Shared-Memory Architectures
5.3 Performance of Symmetric Shared-Memory Multiprocessors
5.4 Distributed Shared-Memory and Directory-Based Coherence
5.5 Synchronization: The Basics
5.6 Models of Memory Consistency: An Introduction
5.8 Putting It All Together: Multicore Processors and Their Performance
5.9 Fallacies and Pitfalls
5.10 The Future of Multicore Scaling
5.12 Historical Perspectives and References
Case Studies and Exercises

Chapter 6 Warehouse-Scale Computers to Exploit Request-Level and Data-Level Parallelism
6.2 Programming Models and Workloads for Warehouse-Scale Computers
6.3 Computer Architecture of Warehouse-Scale Computers
6.4 The Efficiency and Cost of Warehouse-Scale Computers
6.5 Cloud Computing: The Return of Utility Computing
6.7 Putting It All Together: A Google Warehouse-Scale Computer
6.8 Fallacies and Pitfalls
6.10 Historical Perspectives and References
Case Studies and Exercises
Parthasarathy Ranganathan

Chapter 7 Domain-Specific Architectures
7.3 Example Domain: Deep Neural Networks
7.4 Google's Tensor Processing Unit, an Inference Data Center Accelerator
7.5 Microsoft Catapult, a Flexible Data Center Accelerator
7.6 Intel Crest, a Data Center Accelerator for Training
7.7 Pixel Visual Core, a Personal Mobile Device Image Processing Unit
7.9 Putting It All Together: CPUs Versus GPUs Versus DNN Accelerators
7.10 Fallacies and Pitfalls
7.12 Historical Perspectives and References
Case Studies and Exercises

Appendix A Instruction Set Principles
A.2 Classifying Instruction Set Architectures
A.4 Type and Size of Operands
A.5 Operations in the Instruction Set
A.6 Instructions for Control Flow
A.7 Encoding an Instruction Set
A.8 Cross-Cutting Issues: The Role of Compilers
A.9 Putting It All Together: The RISC-V Architecture
A.10 Fallacies and Pitfalls
A.12 Historical Perspective and References

Appendix B Review of Memory Hierarchy
B.3 Six Basic Cache Optimizations
B.5 Protection and Examples of Virtual Memory
B.6 Fallacies and Pitfalls
B.8 Historical Perspective and References

Appendix C Pipelining: Basic and Intermediate Concepts
C.2 The Major Hurdle of Pipelining---Pipeline Hazards
C.3 How Is Pipelining Implemented?
C.4 What Makes Pipelining Hard to Implement?
C.5 Extending the RISC-V Integer Pipeline to Handle Multicycle Operations
C.6 Putting It All Together: The MIPS R4000 Pipeline
C.8 Fallacies and Pitfalls
C.10 Historical Perspective and References

Appendix D Storage Systems
Appendix E Embedded Systems
Appendix F Interconnection Networks
Appendix G Vector Processors in More Depth
Appendix H Hardware and Software for VLIW and EPIC
Appendix I Large-Scale Multiprocessors and Scientific Applications
Appendix J Computer Arithmetic
Appendix K Survey of Instruction Set Architectures
Appendix L Advanced Concepts on Address Translation
Appendix M Historical Perspectives and References

References
Index