Contributors | xv
Acknowledgments | xxxix
Foreword | xli
Preface | xlv

Chapter 1 | 1 (6)
    Learning from Successful Experiences | 1 (1)
    Modernize with Concurrent Algorithms | 2 (1)
    Modernize with Vectorization and Data Locality | 2 (1)
    Understanding Power Usage | 2 (1)
    Intel Xeon Phi Coprocessor Specific | 3 (1)
    Many-Core, Neo-Heterogeneous | 3 (1)
    No "Xeon Phi" in the Title, Neo-Heterogeneous Programming | 3 (1)

Chapter 2 From "Correct" to "Correct & Efficient": A Hydro2D Case Study with Godunov's Scheme | 7 (36)
    Scientific Computing on Contemporary Computers | 7 (2)
    Modern Computing Environments | 8 (1)
    A Numerical Method for Shock Hydrodynamics | 9 (4)
    Features of Modern Architectures | 13 (2)
    Performance-Oriented Architecture | 13 (1)
    Programming Tools and Runtimes | 14 (1)
    Our Computing Environments | 14 (1)
    Arithmetic Efficiency and Instruction-Level Parallelism | 30 (2)
    The Coprocessor vs. the Processor | 39 (1)
    A Rising Tide Lifts All Boats | 39 (2)

Chapter 3 Better Concurrency and SIMD on HBM | 43 (26)
    The Application: HIROMB-BOOS-Model | 43 (1)
    Overview for the Optimization of HBM | 45 (1)
    Data Structures: Locality Done Right | 46 (4)
    Thread Parallelism in HBM | 50 (5)
    Data Parallelism: SIMD Vectorization | 55 (6)
    Premature Abstraction Is the Root of All Evil | 58 (3)
    Scaling on Processor vs. Coprocessor | 62 (2)

Chapter 4 Optimizing for Reacting Navier-Stokes Equations | 69 (18)
    Version 5.0 Vectorization | 80 (3)
    Intel Xeon Phi Coprocessor Results | 83 (1)

Chapter 5 Plesiochronous Phasing Barriers | 87 (30)
    What Can Be Done to Improve the Code? | 89 (2)
    What More Can Be Done to Improve the Code? | 91 (1)
    What Is Nonoptimal About This Strategy? | 93 (1)
    Coding the Hyper-Thread Phalanx | 93 (1)
    How to Determine Thread Binding to Core and HT Within Core? | 94 (5)
    The Hyper-Thread Phalanx Hand-Partitioning Technique | 95 (2)
    Use Aligned Data When Possible | 100 (1)
    Redundancy Can Be Good for You | 100 (3)
    The Plesiochronous Phasing Barrier | 103 (2)
    Let Us Do Something to Recover This Wasted Time | 105 (4)
    A Few "Left to the Reader" Possibilities | 109 (1)
    Xeon Host Performance Improvements Similar to Xeon Phi | 110 (5)

Chapter 6 Parallel Evaluation of Fault Tree Expressions | 117 (12)
    Motivation and Background | 117 (1)
    Expression of Choice: Fault Trees | 117 (1)
    An Application for Fault Trees: Ballistic Simulation | 118 (1)
    Using ispc for Vectorization | 121 (5)

Chapter 7 Deep-Learning Numerical Optimization | 129 (14)
    Fitting an Objective Function | 129 (5)
    Objective Functions and Principal Components Analysis | 134 (1)
    Software and Example Data | 135 (1)

Chapter 8 Optimizing Gather/Scatter Patterns | 143 (16)
    Gather/Scatter Instructions in Intel® Architecture | 145 (1)
    Gather/Scatter Patterns in Molecular Dynamics | 145 (3)
    Optimizing Gather/Scatter Patterns | 148 (8)
    Improving Temporal and Spatial Locality | 148 (2)
    Choosing an Appropriate Data Layout: AoS Versus SoA | 150 (1)
    On-the-Fly Transposition Between AoS and SoA | 151 (3)
    Amortizing Gather/Scatter and Transposition Costs | 154 (2)

Chapter 9 A Many-Core Implementation of the Direct N-Body Problem | 159 (16)
    Reduce the Overheads, Align Your Data | 164 (3)
    Optimize the Memory Hierarchy | 167 (3)
    What Does All This Mean to the Host Version? | 172 (2)

Chapter 10 N-Body Methods | 175 (10)
    Fast N-Body Methods and Direct N-Body Kernels | 175 (1)
    Applications of N-Body Methods | 176 (1)

Chapter 11 Dynamic Load Balancing Using OpenMP 4.0 | 185 (16)
    Maximizing Hardware Usage | 185 (2)
    A First Processor Combined with Coprocessor Version | 193 (3)
    Version for Processor with Multiple Coprocessors | 196 (4)

Chapter 12 Concurrent Kernel Offloading | 201 (24)
    Motivating Example: Particle Dynamics | 202 (1)
    Organization of This Chapter | 203 (1)
    Concurrent Kernels on the Coprocessor | 204 (9)
    Coprocessor Device Partitioning and Thread Affinity | 204 (6)
    Concurrent Data Transfers | 210 (3)
    Force Computation in PD Using Concurrent Kernel Offloading | 213 (8)
    Parallel Force Evaluation Using Newton's 3rd Law | 213 (2)
    Implementation of the Concurrent Force Computation | 215 (5)
    Performance Evaluation: Before and After | 220 (1)

Chapter 13 Heterogeneous Computing with MPI | 225 (14)
    MPI in the Modern Clusters | 225 (1)
    Single-Task Hybrid Programs | 229 (2)
    Selection of the DAPL Providers | 231 (6)
    The First Provider: OFA-V2-MLX4_0-1U | 231 (1)
    The Second Provider: ofa-v2-scif0 and the Impact of the Intra-Node Fabric | 232 (1)
    The Last Provider, Also Called the Proxy | 232 (2)
    Hybrid Application Scalability | 234 (2)

Chapter 14 Power Analysis on the Intel® Xeon Phi™ Coprocessor | 239 (16)
    Measuring Power and Temperature with Software | 241 (5)
    Creating a Power and Temperature Monitor Script | 243 (1)
    Creating a Power and Temperature Logger with the micsmc Tool | 243 (2)
    Power Analysis Using IPMI | 245 (1)
    Hardware-Based Power Analysis Methods | 246 (6)
    A Hardware-Based Coprocessor Power Analyzer | 249 (3)

Chapter 15 Integrating Intel Xeon Phi Coprocessors into a Cluster Environment | 255 (22)
    Beacon System Architecture | 256 (2)
    Intel MPSS Installation Procedure | 258 (7)
    Installation of the Intel MPSS Stack | 259 (2)
    Generating and Customizing Configuration Files | 261 (4)
    Setting Up the Resource and Workload Managers | 265 (4)
    TORQUE/Coprocessor Integration | 268 (1)
    Improving Network Locality | 269 (1)
    Moab/Coprocessor Integration | 269 (1)
    Health Checking and Monitoring | 269 (2)
    Scripting Common Commands | 271 (2)
    User Software Environment | 273 (1)

Chapter 16 Supporting Cluster File Systems on Intel® Xeon Phi™ Coprocessors | 277 (10)
    Network Configuration Concepts and Goals | 278 (3)
    A Look at Networking Options | 278 (2)
    Steps to Set Up a Cluster-Enabled Coprocessor | 280 (1)
    Coprocessor File Systems Support | 281 (4)
    Support for Lustre® File System | 282 (2)
    Support for Fraunhofer BeeGFS® (formerly FHGFS) File System | 284 (1)
    Support for Panasas® PanFS® File System | 285 (1)
    Choosing a Cluster File System | 285 (1)

Chapter 17 NWChem: Quantum Chemistry Simulations at Scale | 287 (20)
    Overview of Single-Reference CC Formalism | 288 (3)
    NWChem Software Architecture | 291 (2)
    Tensor Contraction Engine | 292 (1)
    Engineering an Offload Solution | 293 (4)

Chapter 18 Efficient Nested Parallelism on Large-Scale Systems | 307 (12)
    Pipeline Approach---Flat_arena Class | 310 (1)
    Intel® TBB User-Managed Task Arenas | 311 (2)
    Hierarchical Approach---Hierarchical_arena Class | 313 (1)
    Implications on NUMA Architectures | 316 (1)

Chapter 19 Performance Optimization of Black-Scholes Pricing | 319 (22)
    Financial Market Model Basics and the Black-Scholes Formula | 320 (3)
    Financial Market Mathematical Model | 320 (1)
    European Option and Fair Price Concepts | 321 (1)
    Preliminary Version---Checking Correctness | 323 (1)
    Reference Version---Choose Appropriate Data Structures | 323 (2)
    Reference Version---Do Not Mix Data Types | 325 (1)
    Use Fast Math Functions: erff() vs. cdfnormf() | 329 (2)
    Equivalent Transformations of Code | 331 (1)
    Reduce Precision if Possible | 333 (1)
    Using the Intel Xeon Phi Coprocessor---"No Effort" Port | 336 (1)
    Use Intel Xeon Phi Coprocessor: Work in Parallel | 337 (1)
    Use Intel Xeon Phi Coprocessor and Streaming Stores | 338 (1)

Chapter 20 Data Transfer Using the Intel COI Library | 341 (8)
    First Steps with the Intel COI Library | 341 (1)
    COI Buffer Types and Transfer Performance | 342 (4)

Chapter 21 High-Performance Ray Tracing | 349 (10)
    Vectorizing Ray Traversal | 351 (1)
    The Embree Ray Tracing Kernels | 352 (1)
    Using Embree in an Application | 352 (2)

Chapter 22 Portable Performance with OpenCL | 359 (18)
    A Brief Introduction to OpenCL | 360 (4)
    A Matrix Multiply Example in OpenCL | 364 (2)
    OpenCL and the Intel Xeon Phi Coprocessor | 366 (2)
    Matrix Multiply Performance Results | 368 (1)
    Case Study: Molecular Docking | 369 (4)
    Results: Portable Performance | 373 (1)

Chapter 23 Characterization and Optimization Methodology Applied to Stencil Computations | 377 (20)
    Automatic Application Tuning | 386 (6)

Chapter 24 Profiling-Guided Optimization | 397 (28)
    Matrix Transposition in Computer Science | 397 (2)
    "Serial": Our Original In-Place Transposition | 400 (5)
    "Parallel": Adding Parallelism with OpenMP | 405 (1)
    "Tiled": Improving Data Locality | 405 (6)
    "Regularized": Microkernel with Multiversioning | 411 (6)
    "Planned": Exposing More Parallelism | 417 (4)

Chapter 25 Heterogeneous MPI Application Optimization with ITAC | 425 (18)
    Synchronization in Heterogeneous Clusters | 428 (1)
    Finding Bottlenecks with ITAC | 429 (1)
    Dynamic "Boss-Workers" Load Balancing | 436 (3)

Chapter 26 Scalable Out-of-Core Solvers on a Cluster | 443 (14)
    An OOC Factorization Based on ScaLAPACK | 444 (3)
    Porting from NVIDIA GPU to the Intel Xeon Phi Coprocessor | 447 (2)
    Conclusions and Future Work | 454 (1)

Chapter 27 Sparse Matrix-Vector Multiplication: Parallelization and Vectorization | 457 (20)
    Sparse Matrix Data Structures | 458 (4)
    Compressed Data Structures | 459 (3)
    Parallel SpMV Multiplication | 462 (3)
    Partially Distributed Parallel SpMV | 462 (1)
    Fully Distributed Parallel SpMV | 463 (2)
    Vectorization on the Intel Xeon Phi Coprocessor | 465 (5)
    Implementation of the Vectorized SpMV Kernel | 467 (3)
    On the Intel Xeon Phi Coprocessor | 471 (1)

Chapter 28 Morton Order Improves Performance | 477 (14)
    Improving Cache Locality by Data Ordering | 477 (1)

Author Index | 491 (4)
Subject Index | 495