
Programming for Hybrid Multi/Manycore MPP Systems [Hardback]

John Levesque, Aaron Vose (Cray, Inc., Knoxville, Tennessee, USA)
  • Format: Hardback, 342 pages, height x width: 234x156 mm, weight: 676 g, 253 Tables, black and white; 74 Illustrations, black and white
  • Series: Chapman & Hall/CRC Computational Science
  • Publication date: 10-Oct-2017
  • Publisher: Chapman & Hall/CRC
  • ISBN-10: 1439873712
  • ISBN-13: 9781439873717
"Ask not what your compiler can do for you, ask what you can do for your compiler." --John Levesque, Director of Crays Supercomputing Centers of Excellence

The next decade of computationally intensive computing lies with more powerful multi/manycore nodes in which processors share a large memory space. These nodes will be the building block for systems that range from a single-node workstation up to systems approaching the exaflop regime. The node itself will consist of tens to hundreds of MIMD (multiple instruction, multiple data) processing units with SIMD (single instruction, multiple data) parallel instructions. Since a standard, affordable memory architecture will not be able to supply the bandwidth required by these cores, new memory organizations will be introduced. These new node architectures will represent a significant challenge to application developers.

Programming for Hybrid Multi/Manycore MPP Systems briefly describes the current state of the art in programming these systems and proposes an approach for developing a performance-portable application that can effectively utilize all of these systems from a single application. The book starts with a strategy for optimizing an application for multi/manycore architectures. It then looks at the three typical architectures, covering their advantages and disadvantages.

The next section of the book explores the other important component of the target: the compiler. The compiler ultimately converts the input language into executable code on the target, and the book explores how to make the compiler do what we want. The book then discusses gathering runtime statistics from running the application on the important problem sets previously discussed.
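To make this concrete, the following is a minimal sketch (illustrative only, not taken from the book) of the kind of information a compiler needs before it will vectorize even a simple loop: the restrict qualifiers and the OpenMP simd directive promise that the arrays do not overlap, so the compiler is free to generate SIMD instructions.

    /* Minimal vectorization sketch (illustrative, not from the book).
     * The restrict qualifiers and the OpenMP simd directive tell the
     * compiler the arrays do not alias, so the loop can be turned into
     * SIMD instructions. Compile with, e.g., "cc -O2 -fopenmp-simd". */
    #include <stddef.h>

    void triad(size_t n, double a, const double *restrict x,
               const double *restrict y, double *restrict z)
    {
    #pragma omp simd
        for (size_t i = 0; i < n; ++i)
            z[i] = a * x[i] + y[i];
    }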

How best to utilize available memory bandwidth and vectorization is covered next, along with hybridization of a program. The last part of the book covers several major applications and examines future hardware advancements and how the application developer may prepare for them.
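As a taste of what hybridization means in practice, here is a minimal MPI + OpenMP sketch (illustrative only, not taken from the book): one MPI rank per node or NUMA domain, with OpenMP threads sharing that rank's memory. It can be built with an MPI compiler wrapper, e.g. mpicc -fopenmp.

    /* Minimal hybrid MPI + OpenMP sketch (illustrative, not from the book). */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, rank;

        /* Ask for thread support so OpenMP regions are safe alongside MPI. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Each rank spawns a team of threads that share the node's memory. */
        #pragma omp parallel
        {
            printf("rank %d: thread %d of %d\n",
                   rank, omp_get_thread_num(), omp_get_num_threads());
        }

        MPI_Finalize();
        return 0;
    }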
Preface xvii
About the Authors xix
List of Figures xxi
List of Tables xxv
List of Excerpts xxix
Chapter 1 Introduction 1(6)
1.1 Introduction 1(2)
1.2 Chapter Overviews 3(4)
Chapter 2 Determining an Exaflop Strategy 7(14)
2.1 Foreword By John Levesque 7(1)
2.2 Introduction 8(1)
2.3 Looking At The Application 9(4)
2.4 Degree Of Hybridization Required 13(2)
2.5 Decomposition And I/O 15(1)
2.6 Parallel And Vector Lengths 15(1)
2.7 Productivity And Performance Portability 15(4)
2.8 Conclusion 19(1)
2.9 Exercises 19(2)
Chapter 3 Target Hybrid Multi/Manycore System 21(22)
3.1 Foreword By John Levesque 21(1)
3.2 Understanding The Architecture 22(1)
3.3 Cache Architectures 23(2)
3.3.1 Xeon Cache 24(1)
3.3.2 NVIDIA GPU Cache 25(1)
3.4 Memory Hierarchy 25(3)
3.4.1 Knights Landing Cache 27(1)
3.5 KNL Clustering Modes 28(5)
3.6 KNL MCDRAM Modes 33(5)
3.7 Importance Of Vectorization 38(2)
3.8 Alignment For Vectorization 40(1)
3.9 Exercises 40(3)
Chapter 4 How Compilers Optimize Programs 43(24)
4.1 Foreword By John Levesque 43(2)
4.2 Introduction 45(1)
4.3 Memory Allocation 45(2)
4.4 Memory Alignment 47(1)
4.5 Comment-Line Directive 48(1)
4.6 Interprocedural Analysis 49(1)
4.7 Compiler Switches 49(1)
4.8 Fortran 2003 And Inefficiencies 50(5)
4.8.1 Array Syntax 51(2)
4.8.2 Use Optimized Libraries 53(1)
4.8.3 Passing Array Sections 53(1)
4.8.4 Using Modules for Local Variables 54(1)
4.8.5 Derived Types 54(1)
4.9 C/C++ And Inefficiencies 55(6)
4.10 Compiler Scalar Optimizations 61(4)
4.10.1 Strength Reduction 61(2)
4.10.2 Avoiding Floating Point Exponents 63(1)
4.10.3 Common Subexpression Elimination 64(1)
4.11 Exercises 65(2)
Chapter 5 Gathering Runtime Statistics for Optimizing 67(12)
5.1 Foreword By John Levesque 67(1)
5.2 Introduction 68(1)
5.3 What's Important To Profile 69(7)
5.3.1 Profiling NAS BT 69(5)
5.3.2 Profiling VH1 74(2)
5.4 Conclusion 76(1)
5.5 Exercises 77(2)
Chapter 6 Utilization of Available Memory Bandwidth 79(18)
6.1 Foreword By John Levesque 79(1)
6.2 Introduction 80(1)
6.3 Importance Of Cache Optimization 80(1)
6.4 Variable Analysis In Multiple Loops 81(3)
6.5 Optimizing For The Cache Hierarchy 84(9)
6.6 Combining Multiple Loops 93(3)
6.7 Conclusion 96(1)
6.8 Exercises 96(1)
Chapter 7 Vectorization 97(50)
7.1 Foreword By John Levesque 97(1)
7.2 Introduction 98(1)
7.3 Vectorization Inhibitors 99(2)
7.4 Vectorization Rejection From Inefficiencies 101(10)
7.4.1 Access Modes and Computational Intensity 101(3)
7.4.2 Conditionals 104(3)
7.5 Striding Versus Contiguous Accessing 107(4)
7.6 Wrap-Around Scalar 111(3)
7.7 Loops Saving Maxima And Minima 114(2)
7.8 Multinested Loop Structures 116(3)
7.9 There's MATMUL And Then There's MATMUL 119(3)
7.10 Decision Processes In Loops 122(12)
7.10.1 Loop-Independent Conditionals 123(2)
7.10.2 Conditionals Directly Testing Indices 125(5)
7.10.3 Loop-Dependent Conditionals 130(2)
7.10.4 Conditionals Causing Early Loop Exit 132(2)
7.11 Handling Function Calls Within Loops 134(5)
7.12 Rank Expansion 139(4)
7.13 Outer Loop Vectorization 143(1)
7.14 Exercises 144(3)
Chapter 8 Hybridization of an Application 147(22)
8.1 Foreword By John Levesque 147(1)
8.2 Introduction 147(1)
8.3 The Node's NUMA Architecture 148(1)
8.4 First Touch In The Himeno Benchmark 149(4)
8.5 Identifying Which Loops To Thread 153(5)
8.6 SPMD OpenMP 158(9)
8.7 Exercises 167(2)
Chapter 9 Porting Entire Applications 169(74)
9.1 Foreword By John Levesque 169(1)
9.2 Introduction 170(1)
9.3 SPEC OpenMP Benchmarks 170(38)
9.3.1 WUPWISE 170(5)
9.3.2 MGRID 175(2)
9.3.3 GALGEL 177(2)
9.3.4 APSI 179(3)
9.3.5 FMA3D 182(2)
9.3.6 AMMP 184(6)
9.3.7 SWIM 190(2)
9.3.8 APPLU 192(2)
9.3.9 EQUAKE 194(7)
9.3.10 ART 201(7)
9.4 NAS Parallel Benchmark (NPB) BT 208(10)
9.5 Refactoring VH-1 218(5)
9.6 Refactoring LESLIE3D 223(3)
9.7 Refactoring S3D - 2016 Production Version 226(4)
9.8 Performance Portable - S3D On Titan 230(11)
9.9 Exercises 241(2)
Chapter 10 Future Hardware Advancements 243(12)
10.1 Introduction 243(1)
10.2 Future X86 CPUs 244(1)
10.2.1 Intel Skylake 244(1)
10.2.2 AMD Zen 244(1)
10.3 Future Arm CPUs 245(5)
10.3.1 Scalable Vector Extension 245(3)
10.3.2 Broadcom Vulcan 248(1)
10.3.3 Cavium ThunderX 249(1)
10.3.4 Fujitsu Post-K 249(1)
10.3.5 Qualcomm Centriq 249(1)
10.4 Future Memory Technologies 250(2)
10.4.1 Die-Stacking Technologies 250(1)
10.4.2 Compute Near Data 251(1)
10.5 Future Hardware Conclusions 252(3)
10.5.1 Increased Thread Counts 252(1)
10.5.2 Wider Vectors 252(2)
10.5.3 Increasingly Complex Memory Hierarchies 254(1)
Appendix A Supercomputer Cache Architectures 255(6)
A.1 Associativity 255(6)
Appendix B The Translation Look-Aside Buffer 261(2)
B.1 Introduction To The TLB 261(2)
Appendix C Command Line Options and Compiler Directives 263(2)
C.1 Command Line Options And Compiler Directives 263(2)
Appendix D Previously Used Optimizations 265(4)
D.1 Loop Reordering 265(1)
D.2 Index Reordering 266(1)
D.3 Loop Unrolling 266(1)
D.4 Loop Fission 266(1)
D.5 Scalar Promotion 266(1)
D.6 Removal Of Loop-Independent Ifs 267(1)
D.7 Use Of Intrinsics To Remove Ifs 267(1)
D.8 Strip Mining 267(1)
D.9 Subroutine Inlining 267(1)
D.10 Pulling Loops Into Subroutines 267(1)
D.11 Cache Blocking 268(1)
D.12 Loop Fusion 268(1)
D.13 Outer Loop Vectorization 268(1)
Appendix E I/O Optimization 269(4)
E.1 Introduction 269(1)
E.2 I/O Strategies 269(1)
E.2.1 Spokesperson 269(1)
E.2.2 Multiple Writers - Multiple Files 270(1)
E.2.3 Collective I/O to Single or Multiple Files 270(1)
E.3 Lustre Mechanics 270(3)
Appendix F Terminology 273(4)
F.1 Selected Definitions 273(4)
Appendix G 12-Step Process 277(2)
G.1 Introduction 277(1)
G.2 Process 277(2)
Bibliography 279(4)
Crypto 283(2)
Index 285
John Levesque works in the Chief Technology Office at Cray Inc., where he is responsible for application performance on Cray's HPC systems. He is also the director of Cray's Supercomputing Center of Excellence for the Trinity system, installed at the end of 2016 at Los Alamos National Laboratory. Prior to Trinity, he was director of the Center of Excellence at Oak Ridge National Laboratory (ORNL). ORNL installed a 27-petaflop Cray XK7 system, Titan, which was the fastest computer in the world according to the Top500 list in 2012, and a 2.7-petaflop Cray XT5, Jaguar, which was number one in 2009. For the past 50 years, Mr. Levesque has optimized scientific application programs for successful HPC systems. He is an expert in application tuning and compiler analysis of scientific applications. He has written two previous books, on optimization for the Cray 1 in 1989 [20] and on optimization for multi-core MPP systems in 2010 [19].

Aaron Vose is an HPC software engineer who spent two years at Cray's Supercomputing Center of Excellence at Oak Ridge National Laboratory. Aaron helped domain scientists at ORNL port and optimize scientific software to achieve maximum scalability and performance on world-class, high-performance computing resources such as the Titan supercomputer. Aaron now works for Cray Inc. as a software engineer helping R&D design next-generation computer systems. Prior to joining Cray, Aaron spent time at the National Institute for Computational Sciences (NICS) as well as the Joint Institute for Computational Sciences (JICS). There, he worked on scaling and porting bioinformatics software to the Kraken supercomputer. Aaron holds a Master's degree in Computer Science from the University of Tennessee at Knoxville.