
Programming for Hybrid Multi/Manycore MPP Systems [Hardback]

John Levesque, Aaron Vose (Cray, Inc., Knoxville, Tennessee, USA)
  • Format: Hardback, 342 pages, height x width: 234x156 mm, weight: 676 g, 253 Tables, black and white; 74 Illustrations, black and white
  • Series: Chapman & Hall/CRC Computational Science
  • Publication date: 10-Oct-2017
  • Publisher: Chapman & Hall/CRC
  • ISBN-10: 1439873712
  • ISBN-13: 9781439873717
"Ask not what your compiler can do for you, ask what you can do for your compiler." --John Levesque, Director of Crays Supercomputing Centers of Excellence

The next decade of computationally intensive computing lies with more powerful multi/manycore nodes in which processors share a large memory space. These nodes will be the building block for systems that range from a single-node workstation up to systems approaching the exaflop regime. The node itself will consist of tens to hundreds of MIMD (multiple instruction, multiple data) processing units with SIMD (single instruction, multiple data) parallel instructions. Since a standard, affordable memory architecture will not be able to supply the bandwidth required by these cores, new memory organizations will be introduced. These new node architectures will represent a significant challenge to application developers.

Programming for Hybrid Multi/Manycore MPP Systems briefly describes the current state of the art in programming these systems and proposes an approach for developing a performance-portable application that can effectively utilize all of these systems from a single application. The book starts with a strategy for optimizing an application for multi/manycore architectures. It then looks at the three typical architectures, covering their advantages and disadvantages.

The next section of the book explores the other important component of the target: the compiler. The compiler ultimately converts the input language into executable code on the target, and the book explores how to make the compiler do what we want. The book then discusses gathering runtime statistics from running the application on the important problem sets previously discussed.
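To make this concrete, the following is a minimal sketch (illustrative only, not taken from the book) of the kind of information a compiler needs before it will vectorize even a simple loop: the restrict qualifiers and the OpenMP simd directive promise that the arrays do not overlap, so the compiler is free to generate SIMD instructions.

    /* Minimal vectorization sketch (illustrative, not from the book).
     * The restrict qualifiers and the OpenMP simd directive tell the
     * compiler the arrays do not alias, so the loop can be turned into
     * SIMD instructions. Compile with, e.g., "cc -O2 -fopenmp-simd". */
    #include <stddef.h>

    void triad(size_t n, double a, const double *restrict x,
               const double *restrict y, double *restrict z)
    {
    #pragma omp simd
        for (size_t i = 0; i < n; ++i)
            z[i] = a * x[i] + y[i];
    }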

How best to utilize available memory bandwidth and vectorization is covered next, along with hybridization of a program. The last part of the book covers several major applications and examines future hardware advancements and how the application developer may prepare for them.
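As a taste of what hybridization means in practice, here is a minimal MPI + OpenMP sketch (illustrative only, not taken from the book): one MPI rank per node or NUMA domain, with OpenMP threads sharing that rank's memory. It can be built with an MPI compiler wrapper, e.g. mpicc -fopenmp.

    /* Minimal hybrid MPI + OpenMP sketch (illustrative, not from the book). */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, rank;

        /* Ask for thread support so OpenMP regions are safe alongside MPI. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Each rank spawns a team of threads that share the node's memory. */
        #pragma omp parallel
        {
            printf("rank %d: thread %d of %d\n",
                   rank, omp_get_thread_num(), omp_get_num_threads());
        }

        MPI_Finalize();
        return 0;
    }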
Preface xvii
About the Authors xix
List of Figures xxi
List of Tables xxv
List of Excerpts xxix
Chapter 1 Introduction 1(6)
1.1 Introduction 1(2)
1.2 Chapter Overviews 3(4)
Chapter 2 Determining an Exaflop Strategy 7(14)
2.1 Foreword By John Levesque 7(1)
2.2 Introduction 8(1)
2.3 Looking At The Application 9(4)
2.4 Degree Of Hybridization Required 13(2)
2.5 Decomposition And I/O 15(1)
2.6 Parallel And Vector Lengths 15(1)
2.7 Productivity And Performance Portability 15(4)
2.8 Conclusion 19(1)
2.9 Exercises 19(2)
Chapter 3 Target Hybrid Multi/Manycore System 21(22)
3.1 Foreword By John Levesque 21(1)
3.2 Understanding The Architecture 22(1)
3.3 Cache Architectures 23(2)
3.3.1 Xeon Cache 24(1)
3.3.2 NVIDIA GPU Cache 25(1)
3.4 Memory Hierarchy 25(3)
3.4.1 Knights Landing Cache 27(1)
3.5 KNL Clustering Modes 28(5)
3.6 KNL MCDRAM Modes 33(5)
3.7 Importance Of Vectorization 38(2)
3.8 Alignment For Vectorization 40(1)
3.9 Exercises 40(3)
Chapter 4 How Compilers Optimize Programs 43(24)
4.1 Foreword By John Levesque 43(2)
4.2 Introduction 45(1)
4.3 Memory Allocation 45(2)
4.4 Memory Alignment 47(1)
4.5 Comment-Line Directive 48(1)
4.6 Interprocedural Analysis 49(1)
4.7 Compiler Switches 49(1)
4.8 Fortran 2003 And Inefficiencies 50(5)
4.8.1 Array Syntax 51(2)
4.8.2 Use Optimized Libraries 53(1)
4.8.3 Passing Array Sections 53(1)
4.8.4 Using Modules for Local Variables 54(1)
4.8.5 Derived Types 54(1)
4.9 C/C++ And Inefficiencies 55(6)
4.10 Compiler Scalar Optimizations 61(4)
4.10.1 Strength Reduction 61(2)
4.10.2 Avoiding Floating Point Exponents 63(1)
4.10.3 Common Subexpression Elimination 64(1)
4.11 Exercises 65(2)
Chapter 5 Gathering Runtime Statistics for Optimizing 67(12)
5.1 Foreword By John Levesque 67(1)
5.2 Introduction 68(1)
5.3 What's Important To Profile 69(7)
5.3.1 Profiling NAS BT 69(5)
5.3.2 Profiling VH1 74(2)
5.4 Conclusion 76(1)
5.5 Exercises 77(2)
Chapter 6 Utilization of Available Memory Bandwidth 79(18)
6.1 Foreword By John Levesque 79(1)
6.2 Introduction 80(1)
6.3 Importance Of Cache Optimization 80(1)
6.4 Variable Analysis In Multiple Loops 81(3)
6.5 Optimizing For The Cache Hierarchy 84(9)
6.6 Combining Multiple Loops 93(3)
6.7 Conclusion 96(1)
6.8 Exercises 96(1)
Chapter 7 Vectorization 97(50)
7.1 Foreword By John Levesque 97(1)
7.2 Introduction 98(1)
7.3 Vectorization Inhibitors 99(2)
7.4 Vectorization Rejection From Inefficiencies 101(10)
7.4.1 Access Modes and Computational Intensity 101(3)
7.4.2 Conditionals 104(3)
7.5 Striding Versus Contiguous Accessing 107(4)
7.6 Wrap-Around Scalar 111(3)
7.7 Loops Saving Maxima And Minima 114(2)
7.8 Multinested Loop Structures 116(3)
7.9 There's MATMUL And Then There's MATMUL 119(3)
7.10 Decision Processes In Loops 122(12)
7.10.1 Loop-Independent Conditionals 123(2)
7.10.2 Conditionals Directly Testing Indices 125(5)
7.10.3 Loop-Dependent Conditionals 130(2)
7.10.4 Conditionals Causing Early Loop Exit 132(2)
7.11 Handling Function Calls Within Loops 134(5)
7.12 Rank Expansion 139(4)
7.13 Outer Loop Vectorization 143(1)
7.14 Exercises 144(3)
Chapter 8 Hybridization of an Application 147(22)
8.1 Foreword By John Levesque 147(1)
8.2 Introduction 147(1)
8.3 The Node's NUMA Architecture 148(1)
8.4 First Touch In The Himeno Benchmark 149(4)
8.5 Identifying Which Loops To Thread 153(5)
8.6 SPMD OpenMP 158(9)
8.7 Exercises 167(2)
Chapter 9 Porting Entire Applications 169(74)
9.1 Foreword By John Levesque 169(1)
9.2 Introduction 170(1)
9.3 SPEC OpenMP Benchmarks 170(38)
9.3.1 WUPWISE 170(5)
9.3.2 MGRID 175(2)
9.3.3 GALGEL 177(2)
9.3.4 APSI 179(3)
9.3.5 FMA3D 182(2)
9.3.6 AMMP 184(6)
9.3.7 SWIM 190(2)
9.3.8 APPLU 192(2)
9.3.9 EQUAKE 194(7)
9.3.10 ART 201(7)
9.4 NAS Parallel Benchmark (NPB) BT 208(10)
9.5 Refactoring VH-1 218(5)
9.6 Refactoring LESLIE3D 223(3)
9.7 Refactoring S3D - 2016 Production Version 226(4)
9.8 Performance Portable - S3D On Titan 230(11)
9.9 Exercises 241(2)
Chapter 10 Future Hardware Advancements 243(12)
10.1 Introduction 243(1)
10.2 Future X86 CPUs 244(1)
10.2.1 Intel Skylake 244(1)
10.2.2 AMD Zen 244(1)
10.3 Future Arm CPUs 245(5)
10.3.1 Scalable Vector Extension 245(3)
10.3.2 Broadcom Vulcan 248(1)
10.3.3 Cavium ThunderX 249(1)
10.3.4 Fujitsu Post-K 249(1)
10.3.5 Qualcomm Centriq 249(1)
10.4 Future Memory Technologies 250(2)
10.4.1 Die-Stacking Technologies 250(1)
10.4.2 Compute Near Data 251(1)
10.5 Future Hardware Conclusions 252(3)
10.5.1 Increased Thread Counts 252(1)
10.5.2 Wider Vectors 252(2)
10.5.3 Increasingly Complex Memory Hierarchies 254(1)
Appendix A Supercomputer Cache Architectures 255(6)
A.1 Associativity 255(6)
Appendix B The Translation Look-Aside Buffer 261(2)
B.1 Introduction To The TLB 261(2)
Appendix C Command Line Options and Compiler Directives 263(2)
C.1 Command Line Options And Compiler Directives 263(2)
Appendix D Previously Used Optimizations 265(4)
D.1 Loop Reordering 265(1)
D.2 Index Reordering 266(1)
D.3 Loop Unrolling 266(1)
D.4 Loop Fission 266(1)
D.5 Scalar Promotion 266(1)
D.6 Removal Of Loop-Independent Ifs 267(1)
D.7 Use Of Intrinsics To Remove Ifs 267(1)
D.8 Strip Mining 267(1)
D.9 Subroutine Inlining 267(1)
D.10 Pulling Loops Into Subroutines 267(1)
D.11 Cache Blocking 268(1)
D.12 Loop Fusion 268(1)
D.13 Outer Loop Vectorization 268(1)
Appendix E I/O Optimization 269(4)
E.1 Introduction 269(1)
E.2 I/O Strategies 269(1)
E.2.1 Spokesperson 269(1)
E.2.2 Multiple Writers - Multiple Files 270(1)
E.2.3 Collective I/O to Single or Multiple Files 270(1)
E.3 Lustre Mechanics 270(3)
Appendix F Terminology 273(4)
F.1 Selected Definitions 273(4)
Appendix G 12-Step Process 277(2)
G.1 Introduction 277(1)
G.2 Process 277(2)
Bibliography 279(4)
Crypto 283(2)
Index 285
John Levesque works in the Chief Technology Office at Cray Inc., where he is responsible for application performance on Cray's HPC systems. He is also the director of Cray's Supercomputing Center of Excellence for the Trinity system, installed at the end of 2016 at Los Alamos National Laboratory. Prior to Trinity, he was director of the Center of Excellence at Oak Ridge National Laboratory (ORNL). ORNL installed a 27-petaflop Cray XK7 system, Titan, which was the fastest computer in the world according to the Top500 list in 2012, and a 2.7-petaflop Cray XT5, Jaguar, which was number one in 2009. For the past 50 years, Mr. Levesque has optimized scientific application programs for successful HPC systems. He is an expert in application tuning and compiler analysis of scientific applications. He has written two previous books, on optimization for the Cray 1 in 1989 [20] and on optimization for multi-core MPP systems in 2010 [19].

Aaron Vose is an HPC software engineer who spent two years at Cray's Supercomputing Center of Excellence at Oak Ridge National Laboratory. Aaron helped domain scientists at ORNL port and optimize scientific software to achieve maximum scalability and performance on world-class, high-performance computing resources such as the Titan supercomputer. Aaron now works for Cray Inc. as a software engineer helping R&D design next-generation computer systems. Prior to joining Cray, Aaron spent time at the National Institute for Computational Sciences (NICS) as well as the Joint Institute for Computational Sciences (JICS). There, he worked on scaling and porting bioinformatics software to the Kraken supercomputer. Aaron holds a Master's degree in Computer Science from the University of Tennessee at Knoxville.