Contributors | xv
Acknowledgments | xxxix
Foreword | xli
Preface | xlv

Chapter 1 | 1 (6)
    Learning from Successful Experiences | 1 (1)
    Modernize with Concurrent Algorithms | 2 (1)
    Modernize with Vectorization and Data Locality | 2 (1)
    Understanding Power Usage | 2 (1)
    Intel Xeon Phi Coprocessor Specific | 3 (1)
    Many-Core, Neo-Heterogeneous | 3 (1)
    No "Xeon Phi" in the Title, Neo-Heterogeneous Programming | 3 (1)

Chapter 2 From "Correct" to "Correct & Efficient": A Hydro2D Case Study with Godunov's Scheme | 7 (36)
    Scientific Computing on Contemporary Computers | 7 (2)
    Modern Computing Environments | 8 (1)
    A Numerical Method for Shock Hydrodynamics | 9 (4)
    Features of Modern Architectures | 13 (2)
    Performance-Oriented Architecture | 13 (1)
    Programming Tools and Runtimes | 14 (1)
    Our Computing Environments | 14 (1)
    Arithmetic Efficiency and Instruction-Level Parallelism | 30 (2)
    The Coprocessor vs. the Processor | 39 (1)
    A Rising Tide Lifts All Boats | 39 (2)

Chapter 3 Better Concurrency and SIMD on HBM | 43 (26)
    The Application: HIROMB-BOOS-Model | 43 (1)
    Overview for the Optimization of HBM | 45 (1)
    Data Structures: Locality Done Right | 46 (4)
    Thread Parallelism in HBM | 50 (5)
    Data Parallelism: SIMD Vectorization | 55 (6)
    Premature Abstraction Is the Root of All Evil | 58 (3)
    Scaling on Processor vs. Coprocessor | 62 (2)

Chapter 4 Optimizing for Reacting Navier-Stokes Equations | 69 (18)
    Version 5.0 Vectorization | 80 (3)
    Intel Xeon Phi Coprocessor Results | 83 (1)

Chapter 5 Plesiochronous Phasing Barriers | 87 (30)
    What Can Be Done to Improve the Code? | 89 (2)
    What More Can Be Done to Improve the Code? | 91 (1)
    What Is Nonoptimal About This Strategy? | 93 (1)
    Coding the Hyper-Thread Phalanx | 93 (1)
    How to Determine Thread Binding to Core and HT Within Core? | 94 (5)
    The Hyper-Thread Phalanx Hand-Partitioning Technique | 95 (2)
    Use Aligned Data When Possible | 100 (1)
    Redundancy Can Be Good for You | 100 (3)
    The Plesiochronous Phasing Barrier | 103 (2)
    Let Us Do Something to Recover This Wasted Time | 105 (4)
    A Few "Left to the Reader" Possibilities | 109 (1)
    Xeon Host Performance Improvements Similar to Xeon Phi | 110 (5)

Chapter 6 Parallel Evaluation of Fault Tree Expressions | 117 (12)
    Motivation and Background | 117 (1)
    Expression of Choice: Fault Trees | 117 (1)
    An Application for Fault Trees: Ballistic Simulation | 118 (1)
    Using ispc for Vectorization | 121 (5)

Chapter 7 Deep-Learning Numerical Optimization | 129 (14)
    Fitting an Objective Function | 129 (5)
    Objective Functions and Principal Components Analysis | 134 (1)
    Software and Example Data | 135 (1)

Chapter 8 Optimizing Gather/Scatter Patterns | 143 (16)
    Gather/Scatter Instructions in Intel® Architecture | 145 (1)
    Gather/Scatter Patterns in Molecular Dynamics | 145 (3)
    Optimizing Gather/Scatter Patterns | 148 (8)
    Improving Temporal and Spatial Locality | 148 (2)
    Choosing an Appropriate Data Layout: AoS Versus SoA | 150 (1)
    On-the-Fly Transposition Between AoS and SoA | 151 (3)
    Amortizing Gather/Scatter and Transposition Costs | 154 (2)

Chapter 9 A Many-Core Implementation of the Direct N-Body Problem | 159 (16)
    Reduce the Overheads, Align Your Data | 164 (3)
    Optimize the Memory Hierarchy | 167 (3)
    What Does All This Mean to the Host Version? | 172 (2)

Chapter 10 N-Body Methods | 175 (10)
    Fast N-Body Methods and Direct N-Body Kernels | 175 (1)
    Applications of N-Body Methods | 176 (1)

Chapter 11 Dynamic Load Balancing Using OpenMP 4.0 | 185 (16)
    Maximizing Hardware Usage | 185 (2)
    A First Processor Combined with Coprocessor Version | 193 (3)
    Version for Processor with Multiple Coprocessors | 196 (4)

Chapter 12 Concurrent Kernel Offloading | 201 (24)
    Motivating Example: Particle Dynamics | 202 (1)
    Organization of This Chapter | 203 (1)
    Concurrent Kernels on the Coprocessor | 204 (9)
    Coprocessor Device Partitioning and Thread Affinity | 204 (6)
    Concurrent Data Transfers | 210 (3)
    Force Computation in PD Using Concurrent Kernel Offloading | 213 (8)
    Parallel Force Evaluation Using Newton's 3rd Law | 213 (2)
    Implementation of the Concurrent Force Computation | 215 (5)
    Performance Evaluation: Before and After | 220 (1)

Chapter 13 Heterogeneous Computing with MPI | 225 (14)
    MPI in the Modern Clusters | 225 (1)
    Single-Task Hybrid Programs | 229 (2)
    Selection of the DAPL Providers | 231 (6)
    The First Provider: OFA-V2-MLX4_0-1U | 231 (1)
    The Second Provider: ofa-v2-scif0 and the Impact of the Intra-Node Fabric | 232 (1)
    The Last Provider, Also Called the Proxy | 232 (2)
    Hybrid Application Scalability | 234 (2)

Chapter 14 Power Analysis on the Intel® Xeon Phi™ Coprocessor | 239 (16)
    Measuring Power and Temperature with Software | 241 (5)
    Creating a Power and Temperature Monitor Script | 243 (1)
    Creating a Power and Temperature Logger with the micsmc Tool | 243 (2)
    Power Analysis Using IPMI | 245 (1)
    Hardware-Based Power Analysis Methods | 246 (6)
    A Hardware-Based Coprocessor Power Analyzer | 249 (3)

Chapter 15 Integrating Intel Xeon Phi Coprocessors into a Cluster Environment | 255 (22)
    Beacon System Architecture | 256 (2)
    Intel MPSS Installation Procedure | 258 (7)
    Installation of the Intel MPSS Stack | 259 (2)
    Generating and Customizing Configuration Files | 261 (4)
    Setting Up the Resource and Workload Managers | 265 (4)
    TORQUE/Coprocessor Integration | 268 (1)
    Improving Network Locality | 269 (1)
    Moab/Coprocessor Integration | 269 (1)
    Health Checking and Monitoring | 269 (2)
    Scripting Common Commands | 271 (2)
    User Software Environment | 273 (1)

Chapter 16 Supporting Cluster File Systems on Intel® Xeon Phi™ Coprocessors | 277 (10)
    Network Configuration Concepts and Goals | 278 (3)
    A Look at Networking Options | 278 (2)
    Steps to Set Up a Cluster-Enabled Coprocessor | 280 (1)
    Coprocessor File Systems Support | 281 (4)
    Support for Lustre® File System | 282 (2)
    Support for Fraunhofer BeeGFS® (formerly FHGFS) File System | 284 (1)
    Support for Panasas® PanFS® File System | 285 (1)
    Choosing a Cluster File System | 285 (1)

Chapter 17 NWChem: Quantum Chemistry Simulations at Scale | 287 (20)
    Overview of Single-Reference CC Formalism | 288 (3)
    NWChem Software Architecture | 291 (2)
    Tensor Contraction Engine | 292 (1)
    Engineering an Offload Solution | 293 (4)

Chapter 18 Efficient Nested Parallelism on Large-Scale Systems | 307 (12)
    Pipeline Approach---Flat_arena Class | 310 (1)
    Intel® TBB User-Managed Task Arenas | 311 (2)
    Hierarchical Approach---Hierarchical_arena Class | 313 (1)
    Implications on NUMA Architectures | 316 (1)

Chapter 19 Performance Optimization of Black-Scholes Pricing | 319 (22)
    Financial Market Model Basics and the Black-Scholes Formula | 320 (3)
    Financial Market Mathematical Model | 320 (1)
    European Option and Fair Price Concepts | 321 (1)
    Preliminary Version---Checking Correctness | 323 (1)
    Reference Version---Choose Appropriate Data Structures | 323 (2)
    Reference Version---Do Not Mix Data Types | 325 (1)
    Use Fast Math Functions: erff() vs. cdfnormf() | 329 (2)
    Equivalent Transformations of Code | 331 (1)
    Reduce Precision if Possible | 333 (1)
    Using the Intel Xeon Phi Coprocessor---"No Effort" Port | 336 (1)
    Use Intel Xeon Phi Coprocessor: Work in Parallel | 337 (1)
    Use Intel Xeon Phi Coprocessor and Streaming Stores | 338 (1)

Chapter 20 Data Transfer Using the Intel COI Library | 341 (8)
    First Steps with the Intel COI Library | 341 (1)
    COI Buffer Types and Transfer Performance | 342 (4)

Chapter 21 High-Performance Ray Tracing | 349 (10)
    Vectorizing Ray Traversal | 351 (1)
    The Embree Ray Tracing Kernels | 352 (1)
    Using Embree in an Application | 352 (2)

Chapter 22 Portable Performance with OpenCL | 359 (18)
    A Brief Introduction to OpenCL | 360 (4)
    A Matrix Multiply Example in OpenCL | 364 (2)
    OpenCL and the Intel Xeon Phi Coprocessor | 366 (2)
    Matrix Multiply Performance Results | 368 (1)
    Case Study: Molecular Docking | 369 (4)
    Results: Portable Performance | 373 (1)

Chapter 23 Characterization and Optimization Methodology Applied to Stencil Computations | 377 (20)
    Automatic Application Tuning | 386 (6)

Chapter 24 Profiling-Guided Optimization | 397 (28)
    Matrix Transposition in Computer Science | 397 (2)
    "Serial": Our Original In-Place Transposition | 400 (5)
    "Parallel": Adding Parallelism with OpenMP | 405 (1)
    "Tiled": Improving Data Locality | 405 (6)
    "Regularized": Microkernel with Multiversioning | 411 (6)
    "Planned": Exposing More Parallelism | 417 (4)

Chapter 25 Heterogeneous MPI Application Optimization with ITAC | 425 (18)
    Synchronization in Heterogeneous Clusters | 428 (1)
    Finding Bottlenecks with ITAC | 429 (1)
    Dynamic "Boss-Workers" Load Balancing | 436 (3)

Chapter 26 Scalable Out-of-Core Solvers on a Cluster | 443 (14)
    An OOC Factorization Based on ScaLAPACK | 444 (3)
    Porting from NVIDIA GPU to the Intel Xeon Phi Coprocessor | 447 (2)
    Conclusions and Future Work | 454 (1)

Chapter 27 Sparse Matrix-Vector Multiplication: Parallelization and Vectorization | 457 (20)
    Sparse Matrix Data Structures | 458 (4)
    Compressed Data Structures | 459 (3)
    Parallel SpMV Multiplication | 462 (3)
    Partially Distributed Parallel SpMV | 462 (1)
    Fully Distributed Parallel SpMV | 463 (2)
    Vectorization on the Intel Xeon Phi Coprocessor | 465 (5)
    Implementation of the Vectorized SpMV Kernel | 467 (3)
    On the Intel Xeon Phi Coprocessor | 471 (1)

Chapter 28 Morton Order Improves Performance | 477 (14)
    Improving Cache Locality by Data Ordering | 477 (1)

Author Index | 491 (4)
Subject Index | 495