Preface |
|
xvii | |
About the Authors |
|
xix | |
List of Figures |
|
xxi | |
List of Tables |
|
xxv | |
List of Excerpts |
|
xxix | |
Chapter 1 Introduction |
|
1 | (6) |
|
|
1 | (2) |
|
|
3 | (4) |
Chapter 2 Determining an Exaflop Strategy |
|
7 | (14) |
|
2.1 Foreword By John Levesque |
|
|
7 | (1) |
|
|
8 | (1) |
|
2.3 Looking At The Application |
|
|
9 | (4) |
|
2.4 Degree Of Hybridization Required |
|
|
13 | (2) |
|
2.5 Decomposition And I/O |
|
|
15 | (1) |
|
2.6 Parallel And Vector Lengths |
|
|
15 | (1) |
|
2.7 Productivity And Performance Portability |
|
|
15 | (4) |
|
|
19 | (1) |
|
|
19 | (2) |
Chapter 3 Target Hybrid Multi/Manycore System |
|
21 | (22) |
|
3.1 Foreword By John Levesque |
|
|
21 | (1) |
|
3.2 Understanding The Architecture |
|
|
22 | (1) |
|
|
23 | (2) |
|
|
24 | (1) |
|
|
25 | (1) |
|
|
25 | (3) |
|
3.4.1 Knights Landing Cache |
|
|
27 | (1) |
|
|
28 | (5) |
|
|
33 | (5) |
|
3.7 Importance Of Vectorization |
|
|
38 | (2) |
|
3.8 Alignment For Vectorization |
|
|
40 | (1) |
|
|
40 | (3) |
Chapter 4 How Compilers Optimize Programs |
|
43 | (24) |
|
4.1 Foreword By John Levesque |
|
|
43 | (2) |
|
|
45 | (1) |
|
|
45 | (2) |
|
|
47 | (1) |
|
4.5 Comment-Line Directive |
|
|
48 | (1) |
|
4.6 Interprocedural Analysis |
|
|
49 | (1) |
|
|
49 | (1) |
|
4.8 Fortran 2003 And Inefficiencies |
|
|
50 | (5) |
|
|
51 | (2) |
|
4.8.2 Use Optimized Libraries |
|
|
53 | (1) |
|
4.8.3 Passing Array Sections |
|
|
53 | (1) |
|
4.8.4 Using Modules for Local Variables |
|
|
54 | (1) |
|
|
54 | (1) |
|
4.9 C/C++ And Inefficiencies |
|
|
55 | (6) |
|
4.10 Compiler Scalar Optimizations |
|
|
61 | (4) |
|
4.10.1 Strength Reduction |
|
|
61 | (2) |
|
4.10.2 Avoiding Floating Point Exponents |
|
|
63 | (1) |
|
4.10.3 Common Subexpression Elimination |
|
|
64 | (1) |
|
|
65 | (2) |
Chapter 5 Gathering Runtime Statistics for Optimizing |
|
67 | (12) |
|
5.1 Foreword By John Levesque |
|
|
67 | (1) |
|
|
68 | (1) |
|
5.3 What's Important To Profile |
|
|
69 | (7) |
|
|
69 | (5) |
|
|
74 | (2) |
|
|
76 | (1) |
|
|
77 | (2) |
Chapter 6 Utilization of Available Memory Bandwidth |
|
79 | (18) |
|
6.1 Foreword By John Levesque |
|
|
79 | (1) |
|
|
80 | (1) |
|
6.3 Importance Of Cache Optimization |
|
|
80 | (1) |
|
6.4 Variable Analysis In Multiple Loops |
|
|
81 | (3) |
|
6.5 Optimizing For The Cache Hierarchy |
|
|
84 | (9) |
|
6.6 Combining Multiple Loops |
|
|
93 | (3) |
|
|
96 | (1) |
|
|
96 | (1) |
Chapter 7 Vectorization |
|
97 | (50) |
|
7.1 Foreword By John Levesque |
|
|
97 | (1) |
|
|
98 | (1) |
|
7.3 Vectorization Inhibitors |
|
|
99 | (2) |
|
7.4 Vectorization Rejection From Inefficiencies |
|
|
101 | (10) |
|
7.4.1 Access Modes and Computational Intensity |
|
|
101 | (3) |
|
|
104 | (3) |
|
7.5 Striding Versus Contiguous Accessing |
|
|
107 | (4) |
|
|
111 | (3) |
|
7.7 Loops Saving Maxima And Minima |
|
|
114 | (2) |
|
7.8 Multinested Loop Structures |
|
|
116 | (3) |
|
7.9 There's MATMUL And Then There's MATMUL |
|
|
119 | (3) |
|
7.10 Decision Processes In Loops |
|
|
122 | (12) |
|
7.10.1 Loop-Independent Conditionals |
|
|
123 | (2) |
|
7.10.2 Conditionals Directly Testing Indices |
|
|
125 | (5) |
|
7.10.3 Loop-Dependent Conditionals |
|
|
130 | (2) |
|
7.10.4 Conditionals Causing Early Loop Exit |
|
|
132 | (2) |
|
7.11 Handling Function Calls Within Loops |
|
|
134 | (5) |
|
|
139 | (4) |
|
7.13 Outer Loop Vectorization |
|
|
143 | (1) |
|
|
144 | (3) |
Chapter 8 Hybridization of an Application |
|
147 | (22) |
|
8.1 Foreword By John Levesque |
|
|
147 | (1) |
|
|
147 | (1) |
|
8.3 The Node's NUMA Architecture |
|
|
148 | (1) |
|
8.4 First Touch In The Himeno Benchmark |
|
|
149 | (4) |
|
8.5 Identifying Which Loops To Thread |
|
|
153 | (5) |
|
|
158 | (9) |
|
|
167 | (2) |
Chapter 9 Porting Entire Applications |
|
169 | (74) |
|
9.1 Foreword By John Levesque |
|
|
169 | (1) |
|
|
170 | (1) |
|
9.3 SPEC OpenMP Benchmarks |
|
|
170 | (38) |
|
|
170 | (5) |
|
|
175 | (2) |
|
|
177 | (2) |
|
|
179 | (3) |
|
|
182 | (2) |
|
|
184 | (6) |
|
|
190 | (2) |
|
|
192 | (2) |
|
|
194 | (7) |
|
|
201 | (7) |
|
9.4 NASA Parallel Benchmark (NPB) BT |
|
|
208 | (10) |
|
|
218 | (5) |
|
|
223 | (3) |
|
9.7 Refactoring S3D - 2016 Production Version |
|
|
226 | (4) |
|
9.8 Performance Portable - S3D On Titan |
|
|
230 | (11) |
|
|
241 | (2) |
Chapter 10 Future Hardware Advancements |
|
243 | (12) |
|
|
243 | (1) |
|
|
244 | (1) |
|
|
244 | (1) |
|
|
244 | (1) |
|
|
245 | (5) |
|
10.3.1 Scalable Vector Extension |
|
|
245 | (3) |
|
|
248 | (1) |
|
|
249 | (1) |
|
|
249 | (1) |
|
|
249 | (1) |
|
10.4 Future Memory Technologies |
|
|
250 | (2) |
|
10.4.1 Die-Stacking Technologies |
|
|
250 | (1) |
|
|
251 | (1) |
|
10.5 Future Hardware Conclusions |
|
|
252 | (3) |
|
10.5.1 Increased Thread Counts |
|
|
252 | (1) |
|
|
252 | (2) |
|
10.5.3 Increasingly Complex Memory Hierarchies |
|
|
254 | (1) |
Appendix A Supercomputer Cache Architectures |
|
255 | (6) |
|
|
255 | (6) |
Appendix B The Translation Look-Aside Buffer |
|
261 | (2) |
|
B.1 Introduction To The TLB |
|
|
261 | (2) |
Appendix C Command Line Options and Compiler Directives |
|
263 | (2) |
|
C.1 Command Line Options And Compiler Directives |
|
|
263 | (2) |
Appendix D Previously Used Optimizations |
|
265 | (4) |
|
|
265 | (1) |
|
|
266 | (1) |
|
|
266 | (1) |
|
|
266 | (1) |
|
|
266 | (1) |
|
D.6 Removal Of Loop-Independent Ifs |
|
|
267 | (1) |
|
D.7 Use Of Intrinsics To Remove Ifs |
|
|
267 | (1) |
|
|
267 | (1) |
|
|
267 | (1) |
|
D.10 Pulling Loops Into Subroutines |
|
|
267 | (1) |
|
|
268 | (1) |
|
|
268 | (1) |
|
D.13 Outer Loop Vectorization |
|
|
268 | (1) |
Appendix E I/O Optimization |
|
269 | (4) |
|
|
269 | (1) |
|
|
269 | (1) |
|
|
269 | (1) |
|
E.2.2 Multiple Writers - Multiple Files |
|
|
270 | (1) |
|
E.2.3 Collective I/O to Single or Multiple Files |
|
|
270 | (1) |
|
|
270 | (3) |
Appendix F Terminology |
|
273 | (4) |
|
|
273 | (4) |
Appendix G 12-Step Process |
|
277 | (2) |
|
|
277 | (1) |
|
|
277 | (2) |
Bibliography |
|
279 | (4) |
Crypto |
|
283 | (2) |
Index |
|
285 | |