
E-book: High Performance Parallelism Pearls Volume One: Multicore and Many-core Programming Approaches

Jim Jeffers (Principal Engineer and Visualization Lead, Intel Corporation), James Reinders (Director and Programming Model Architect, Intel Corporation)
  • Format: PDF+DRM
  • Publication date: 04-Nov-2014
  • Publisher: Morgan Kaufmann Publishers Inc.
  • Language: English
  • ISBN-13: 9780128021996
  • This e-book is intended for personal use only. E-books cannot be returned.

DRM restrictions

  • Copying (copy/paste):

    not allowed

  • Printing:

    not allowed

  • Usage:

    Digital rights management (DRM)
    The publisher has issued this e-book in encrypted form, which means that you must install special software to read it. You must also create an Adobe ID. The e-book can be read by 1 user and downloaded to up to 6 devices (all authorized with the same Adobe ID).

    Required software
    To read on a mobile device (phone or tablet), you must install this free app: PocketBook Reader (iOS / Android)

    To read on a PC or Mac, you must install Adobe Digital Editions (this is a free application designed specifically for reading e-books; it should not be confused with Adobe Reader, which is most likely already installed on your computer).

    This e-book cannot be read on an Amazon Kindle.

High Performance Parallelism Pearls shows how to leverage parallelism on processors and coprocessors with the same programming approach, illustrating the most effective ways to tap the computational potential of systems with Intel Xeon Phi coprocessors and Intel Xeon or other multicore processors. The book includes examples of successful programming efforts drawn from industries and domains such as chemistry, engineering, and environmental science. Each chapter in this edited work explains in detail the programming techniques used while showing high-performance results on both Intel Xeon Phi coprocessors and multicore processors. Dozens of new examples and case studies illustrate "success stories" that demonstrate not just the features of these powerful systems but also how to leverage parallelism across heterogeneous systems.

  • Promotes consistent standards-based programming, showing in detail how to code for high performance on multicore processors and Intel® Xeon Phi™ (see the sketch after this list)
  • Examples from multiple vertical domains illustrating parallel optimizations to modernize real-world codes
  • Source code available for download to facilitate further exploration
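
To make the "same programming" idea concrete, here is a minimal sketch, not taken from the book: one standards-based OpenMP 4.0 kernel (a SAXPY operation; the function name and array sizes are illustrative assumptions) that can be compiled unchanged for an Intel Xeon processor or, natively, for an Intel Xeon Phi coprocessor, drawing thread-level parallelism and SIMD vectorization from the same directive.

    /* Minimal sketch (not from the book): the same source compiles for an
     * Intel Xeon processor (e.g. icc -qopenmp) or natively for an Intel
     * Xeon Phi coprocessor (e.g. icc -qopenmp -mmic); only the compile
     * line changes. Names and sizes are illustrative assumptions. */
    #include <stdio.h>
    #include <stdlib.h>

    /* SAXPY kernel: threads split the iterations, SIMD lanes fill each chunk. */
    static void saxpy(int n, float a, const float *x, float *y)
    {
        #pragma omp parallel for simd  /* OpenMP 4.0 combined construct */
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }

    int main(void)
    {
        enum { N = 1 << 20 };
        float *x = malloc(N * sizeof *x);
        float *y = malloc(N * sizeof *y);
        if (!x || !y)
            return 1;
        for (int i = 0; i < N; ++i) {
            x[i] = 1.0f;
            y[i] = 2.0f;
        }

        saxpy(N, 3.0f, x, y);

        printf("y[0] = %f\n", y[0]); /* expect 5.000000 */
        free(x);
        free(y);
        return 0;
    }

The point of the sketch is that retargeting changes the compile line, not the source; the chapters listed below apply the same idea to full applications.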

Reviews

"This book will make it much easier in general to exploit high levels of parallelism including programming optimally for the Intel Xeon Phi products. The common programming methodology between the Xeon and Xeon Phi families is good news for the entire scientific and engineering community; the same programming can realize parallel scaling and vectorization for both multicore and many-core." -from the Foreword by Sverre Jarp, CERN Openlab CTO

Other information

Case studies and examples illustrating the power of high performance parallelism
Contributors xv
Acknowledgments xxxix
Foreword xli
Preface xlv
Chapter 1 Introduction
1(6)
Learning from Successful Experiences
1(1)
Code Modernization
1(1)
Modernize with Concurrent Algorithms
2(1)
Modernize with Vectorization and Data Locality
2(1)
Understanding Power Usage
2(1)
ispc and OpenCL Anyone?
2(1)
Intel Xeon Phi Coprocessor Specific
3(1)
Many-Core, Neo-Heterogeneous
3(1)
No "Xeon Phi" In The Title, Neo-Heterogeneous Programming
3(1)
The Future of Many-Core
4(1)
Downloads
4(1)
For More Information
5(2)
Chapter 2 From "Correct" to "Correct & Efficient": A Hydro2D Case Study with Godunov's Scheme
7(36)
Scientific Computing on Contemporary Computers
7(2)
Modern Computing Environments
8(1)
CEA's Hydro2D
9(1)
A Numerical Method for Shock Hydrodynamics
9(4)
Euler's Equation
10(1)
Godunov's Method
10(2)
Where It Fits
12(1)
Features of Modern Architectures
13(2)
Performance-Oriented Architecture
13(1)
Programming Tools and Runtimes
14(1)
Our Computing Environments
14(1)
Paths to Performance
15(24)
Running Hydro2D
15(1)
Hydro2D's Structure
15(5)
Optimizations
20(1)
Memory Usage
21(1)
Thread-Level Parallelism
22(8)
Arithmetic Efficiency and Instruction-Level Parallelism
30(2)
Data-Level Parallelism
32(7)
Summary
39(3)
The Coprocessor vs the Processor
39(1)
A Rising Tide Lifts All Boats
39(2)
Performance Strategies
41(1)
For More Information
42(1)
Chapter 3 Better Concurrency and SIMD on HBM
43(26)
The Application: HIROMB-BOOS-Model
43(1)
Key Usage: DMI
44(1)
HBM Execution Profile
44(1)
Overview for the Optimization of HBM
45(1)
Data Structures: Locality Done Right
46(4)
Thread Parallelism in HBM
50(5)
Data Parallelism: SIMD Vectorization
55(6)
Trivial Obstacles
55(3)
Premature Abstraction is the Root of All Evil
58(3)
Results
61(1)
Profiling Details
62(1)
Scaling on Processor vs. Coprocessor
62(2)
Contiguous Attribute
64(2)
Summary
66(1)
References
66(1)
For More Information
66(3)
Chapter 4 Optimizing for Reacting Navier-Stokes Equations
69(18)
Getting Started
69(1)
Version 1.0 Baseline
70(3)
Version 2.0 ThreadBox
73(4)
Version 3.0 Stack Memory
77(1)
Version 4.0 Blocking
77(3)
Version 5.0 Vectorization
80(3)
Intel Xeon Phi Coprocessor Results
83(1)
Summary
84(1)
For More Information
85(2)
Chapter 5 Plesiochronous Phasing Barriers
87(30)
What Can Be Done to Improve the Code?
89(2)
What More Can Be Done to Improve the Code?
91(1)
Hyper-Thread Phalanx
91(2)
What is Nonoptimal About This Strategy?
93(1)
Coding the Hyper-Thread Phalanx
93(1)
How to Determine Thread Binding to Core and HT Within Core?
94(5)
The Hyper-Thread Phalanx Hand-Partitioning Technique
95(2)
A Lesson Learned
97(2)
Back to Work
99(1)
Data Alignment
99(4)
Use Aligned Data When Possible
100(1)
Redundancy Can Be Good for You
100(3)
The Plesiochronous Phasing Barrier
103(2)
Let us do Something to Recover This Wasted Time
105(4)
A Few "Left to the Reader" Possibilities
109(1)
Xeon Host Performance Improvements Similar to Xeon Phi
110(5)
Summary
115(1)
For More Information
115(2)
Chapter 6 Parallel Evaluation of Fault Tree Expressions
117(12)
Motivation and Background
117(1)
Expressions
117(1)
Expression of Choice: Fault Trees
117(1)
An Application for Fault Trees: Ballistic Simulation
118(1)
Example Implementation
118(8)
Using ispc for Vectorization
121(5)
Other Considerations
126(2)
Summary
128(1)
For More Information
128(1)
Chapter 7 Deep-Learning Numerical Optimization
129(14)
Fitting an Objective Function
129(5)
Objective Functions and Principal Components Analysis
134(1)
Software and Example Data
135(1)
Training Data
136(3)
Runtime Results
139(2)
Scaling Results
141(1)
Summary
141(1)
For More Information
142(1)
Chapter 8 Optimizing Gather/Scatter Patterns
143(16)
Gather/Scatter Instructions in Intel® Architecture
145(1)
Gather/Scatter Patterns in Molecular Dynamics
145(3)
Optimizing Gather/Scatter Patterns
148(8)
Improving Temporal and Spatial Locality
148(2)
Choosing an Appropriate Data Layout: AoS Versus SoA
150(1)
On-the-Fly Transposition Between AoS and SoA
151(3)
Amortizing Gather/Scatter and Transposition Costs
154(2)
Summary
156(1)
For More Information
157(2)
Chapter 9 A Many-Core Implementation of the Direct N-Body Problem
159(16)
N-Body Simulations
159(1)
Initial Solution
159(3)
Theoretical Limit
162(2)
Reduce the Overheads, Align Your Data
164(3)
Optimize the Memory Hierarchy
167(3)
Improving Our Tiling
170(2)
What Does All This Mean to the Host Version?
172(2)
Summary
174(1)
For More Information
174(1)
Chapter 10 N-Body Methods
175(10)
Fast N-Body Methods and Direct N-Body Kernels
175(1)
Applications of N-Body Methods
176(1)
Direct N-Body Code
177(2)
Performance Results
179(3)
Summary
182(1)
For More Information
183(2)
Chapter 11 Dynamic Load Balancing Using OpenMP 4.0
185(16)
Maximizing Hardware Usage
185(2)
The N-Body Kernel
187(4)
The Offloaded Version
191(2)
A First Processor Combined with Coprocessor Version
193(3)
Version for Processor with Multiple Coprocessors
196(4)
For More Information
200(1)
Chapter 12 Concurrent Kernel Offloading
201(24)
Setting the Context
201(3)
Motivating Example: Particle Dynamics
202(1)
Organization of This Chapter
203(1)
Concurrent Kernels on the Coprocessor
204(9)
Coprocessor Device Partitioning and Thread Affinity
204(6)
Concurrent Data Transfers
210(3)
Force Computation in PD Using Concurrent Kernel Offloading
213(8)
Parallel Force Evaluation Using Newton's 3rd Law
213(2)
Implementation of the Concurrent Force Computation
215(5)
Performance Evaluation: Before and After
220(1)
The Bottom Line
221(2)
For More Information
223(2)
Chapter 13 Heterogeneous Computing with MPI
225(14)
MPI in the Modern Clusters
225(1)
MPI Task Location
226(5)
Single-Task Hybrid Programs
229(2)
Selection of the DAPL Providers
231(6)
The First Provider OFA-V2-MLX4_0-1U
231(1)
The Second Provider ofa-v2-scif0 and the Impact of the Intra-Node Fabric
232(1)
The Last Provider, Also Called the Proxy
232(2)
Hybrid Application Scalability
234(2)
Load Balance
236(1)
Task and Thread Mapping
236(1)
Summary
237(1)
Acknowledgments
238(1)
For More Information
238(1)
Chapter 14 Power Analysis on the Intel® Xeon Phi™ Coprocessor
239(16)
Power Analysis 101
239(2)
Measuring Power and Temperature with Software
241(5)
Creating a Power and Temperature Monitor Script
243(1)
Creating a Power and Temperature Logger with the micsmc Tool
243(2)
Power Analysis Using IPMI
245(1)
Hardware-Based Power Analysis Methods
246(6)
A Hardware-Based Coprocessor Power Analyzer
249(3)
Summary
252(1)
For More Information
253(2)
Chapter 15 Integrating Intel Xeon Phi Coprocessors into a Cluster Environment
255(22)
Early Explorations
255(1)
Beacon System History
256(1)
Beacon System Architecture
256(2)
Hardware
256(1)
Software Environment
256(2)
Intel MPSS Installation Procedure
258(7)
Preparing the System
258(1)
Installation of the Intel MPSS Stack
259(2)
Generating and Customizing Configuration Files
261(4)
MPSS Upgrade Procedure
265(1)
Setting Up the Resource and Workload Managers
265(4)
TORQUE
265(1)
Prologue
266(2)
Epilogue
268(1)
TORQUE/Coprocessor Integration
268(1)
Moab
269(1)
Improving Network Locality
269(1)
Moab/Coprocessor Integration
269(1)
Health Checking and Monitoring
269(2)
Scripting Common Commands
271(2)
User Software Environment
273(1)
Future Directions
274(1)
Summary
275(1)
Acknowledgments
275(1)
For More Information
275(2)
Chapter 16 Supporting Cluster File Systems on Intel® Xeon Phi™ Coprocessors
277(10)
Network Configuration Concepts and Goals
278(3)
A Look At Networking Options
278(2)
Steps to Set Up a Cluster Enabled Coprocessor
280(1)
Coprocessor File Systems Support
281(4)
Support for NFS
282(1)
Support for Lustre® File System
282(2)
Support for Fraunhofer BeeGFS® (formerly FHGFS) File System
284(1)
Support for Panasas® PanFS® File System
285(1)
Choosing a Cluster File System
285(1)
Summary
285(1)
For More Information
286(1)
Chapter 17 NWChem: Quantum Chemistry Simulations at Scale
287(20)
Introduction
287(1)
Overview of Single-Reference CC Formalism
288(3)
NWChem Software Architecture
291(2)
Global Arrays
291(1)
Tensor Contraction Engine
292(1)
Engineering an Offload Solution
293(4)
Offload Architecture
297(1)
Kernel Optimizations
298(3)
Performance Evaluation
301(3)
Summary
304(1)
Acknowledgments
305(1)
For More Information
305(2)
Chapter 18 Efficient Nested Parallelism on Large-Scale Systems
307(12)
Motivation
307(1)
The Benchmark
307(2)
Baseline Benchmarking
309(1)
Pipeline Approach - Flat_arena Class
310(1)
Intel® TBB User-Managed Task Arenas
311(2)
Hierarchical Approach - Hierarchical_arena Class
313(1)
Performance Evaluation
314(2)
Implication on NUMA Architectures
316(1)
Summary
317(1)
For More Information
318(1)
Chapter 19 Performance Optimization of Black-Scholes Pricing
319(22)
Financial Market Model Basics and the Black-Scholes Formula
320(3)
Financial Market Mathematical Model
320(1)
European Option and Fair Price Concepts
321(1)
Black-Scholes Formula
322(1)
Options Pricing
322(1)
Test Infrastructure
323(1)
Case Study
323(15)
Preliminary Version - Checking Correctness
323(1)
Reference Version - Choose Appropriate Data Structures
323(2)
Reference Version - Do Not Mix Data Types
325(1)
Vectorize Loops
326(3)
Use Fast Math Functions: erff() vs. cdfnormf()
329(2)
Equivalent Transformations of Code
331(1)
Align Arrays
331(2)
Reduce Precision if Possible
333(1)
Work in Parallel
334(1)
Use Warm-Up
334(2)
Using the Intel Xeon Phi Coprocessor - "No Effort" Port
336(1)
Use Intel Xeon Phi Coprocessor: Work in Parallel
337(1)
Use Intel Xeon Phi Coprocessor and Streaming Stores
338(1)
Summary
338(1)
For More Information
339(2)
Chapter 20 Data Transfer Using the Intel COI Library
341(8)
First Steps with the Intel COI Library
341(1)
COI Buffer Types and Transfer Performance
342(4)
Applications
346(2)
Summary
348(1)
For More Information
348(1)
Chapter 21 High-Performance Ray Tracing
349(10)
Background
349(2)
Vectorizing Ray Traversal
351(1)
The Embree Ray Tracing Kernels
352(1)
Using Embree in an Application
352(2)
Performance
354(3)
Summary
357(1)
For More Information
358(1)
Chapter 22 Portable Performance with OpenCL
359(18)
The Dilemma
359(1)
A Brief Introduction to OpenCL
360(4)
A Matrix Multiply Example in OpenCL
364(2)
OpenCL and the Intel Xeon Phi Coprocessor
366(2)
Matrix Multiply Performance Results
368(1)
Case Study: Molecular Docking
369(4)
Results: Portable Performance
373(1)
Related Work
374(1)
Summary
375(1)
For More Information
375(2)
Chapter 23 Characterization and Optimization Methodology Applied to Stencil Computations
377(20)
Introduction
377(1)
Performance Evaluation
378(4)
AI of the Test Platforms
379(1)
AI of the Kernel
380(2)
Standard Optimizations
382(13)
Automatic Application Tuning
386(6)
The Auto-Tuning Tool
392(1)
Results
393(2)
Summary
395(1)
For More Information
395(2)
Chapter 24 Profiling-Guided Optimization
397(28)
Matrix Transposition in Computer Science
397(2)
Tools and Methods
399(1)
"Serial": Our Original In-Place Transposition
400(5)
"Parallel": Adding Parallelism with OpenMP
405(1)
"Tiled": Improving Data Locality
405(6)
"Regularized": Microkernel with Multiversioning
411(6)
"Planned": Exposing More Parallelism
417(4)
Summary
421(2)
For More Information
423(2)
Chapter 25 Heterogeneous MPI Application Optimization with ITAC
425(18)
Asian Options Pricing
425(1)
Application Design
426(2)
Synchronization in Heterogeneous Clusters
428(1)
Finding Bottlenecks with ITAC
429(1)
Setting Up ITAC
430(1)
Unbalanced MPI Run
431(3)
Manual Workload Balance
434(2)
Dynamic "Boss-Workers" Load Balancing
436(3)
Conclusion
439(2)
For More Information
441(2)
Chapter 26 Scalable Out-of-Core Solvers on a Cluster
443(14)
Introduction
443(1)
An OOC Factorization Based on ScaLAPACK
444(3)
In-Core Factorization
445(1)
OOC Factorization
446(1)
Porting from NVIDIA GPU to the Intel Xeon Phi Coprocessor
447(2)
Numerical Results
449(5)
Conclusions and Future Work
454(1)
Acknowledgments
454(1)
For More Information
454(3)
Chapter 27 Sparse Matrix-Vector Multiplication: Parallelization and Vectorization
457(20)
Background
457(1)
Sparse Matrix Data Structures
458(4)
Compressed Data Structures
459(3)
Blocking
462(1)
Parallel SpMV Multiplication
462(3)
Partially Distributed Parallel SpMV
462(1)
Fully Distributed Parallel SpMV
463(2)
Vectorization on the Intel Xeon Phi Coprocessor
465(5)
Implementation of the Vectorized SpMV Kernel
467(3)
Evaluation
470(4)
On the Intel Xeon Phi Coprocessor
471(1)
On Intel Xeon CPUs
472(2)
Performance Comparison
474(1)
Summary
474(1)
Acknowledgments
475(1)
For More Information
475(2)
Chapter 28 Morton Order Improves Performance
477(14)
Improving Cache Locality by Data Ordering
477(1)
Improving Performance
477(1)
Matrix Transpose
478(4)
Matrix Multiply
482(6)
Summary
488(2)
For More Information
490(1)
Author Index 491(4)
Subject Index 495
James Reinders is a senior engineer who joined Intel Corporation in 1989 and has contributed to projects including the world's first TeraFLOP supercomputer (ASCI Red), as well as compiler and architecture work for a number of Intel processors and parallel systems. He has been a driver behind the development of Intel as a major provider of software development products and serves as their chief software evangelist. He has published numerous articles, contributed to several books, and is widely interviewed on parallelism. He has managed software development groups, customer service and consulting teams, and business development and marketing teams. He is sought after to keynote on parallel programming and is the author or co-author of three books currently in print, including Structured Parallel Programming, published by Morgan Kaufmann in 2012.

Jim Jeffers was the primary strategic planner and one of the first full-time employees on the program that became Intel® MIC. He served as lead software engineering manager on the program and formed and launched the software development team. As the program evolved, he became the manager of the workloads (applications) and software performance team. He has some of the deepest insight into the market, architecture, and programming usages of the MIC product line, and he has been a developer and development manager for embedded and high-performance systems for close to 30 years.