Acknowledgments |
|
xiii | |
Foreword |
|
xvii | |
Preface |
|
xxiii | |
Section I Knights Landing |
|
|
|
3 | (12) |
|
Introduction to Many-Core Programming |
|
|
4 | (1) |
|
|
4 | (2) |
|
Why Intel® Xeon Phi™ Processors Are Needed |
|
|
6 | (2) |
|
Processors Versus Coprocessor |
|
|
8 | (1) |
|
Measuring Readiness for Highly Parallel Execution |
|
|
9 | (1) |
|
|
10 | (1) |
|
Enjoy the Lack of Porting Needed but Still Tune! |
|
|
10 | (1) |
|
Transformation for Performance |
|
|
11 | (1) |
|
Hyper-Threading Versus Multithreading |
|
|
11 | (1) |
|
|
12 | (1) |
|
Why We Could Skip To Section II Now |
|
|
12 | (1) |
|
|
13 | (2) |
|
Chapter 2 Knights Landing Overview |
|
|
15 | (10) |
|
|
15 | (1) |
|
|
16 | (1) |
|
|
17 | (4) |
|
Motivation: Our Vision and Purpose |
|
|
21 | (2) |
|
|
23 | (1) |
|
|
24 | (1) |
|
Chapter 3 Programming MCDRAM and Cluster Modes |
|
|
25 | (38) |
|
Programming for Cluster Modes |
|
|
26 | (1) |
|
Programming for Memory Modes |
|
|
27 | (18) |
|
Query Memory Mode and MCDRAM Available |
|
|
45 | (1) |
|
SNC Performance Implications of Allocation and Threading |
|
|
45 | (2) |
|
How to Not Hard Code the NUMA Node Numbers |
|
|
47 | (1) |
|
Approaches to Determining What to Put in MCDRAM |
|
|
48 | (8) |
|
Why Rebooting Is Required to Change Modes |
|
|
56 | (1) |
|
|
56 | (4) |
|
|
60 | (1) |
|
|
60 | (3) |
|
Chapter 4 Knights Landing Architecture |
|
|
63 | (22) |
|
|
63 | (8) |
|
|
71 | (5) |
|
|
76 | (2) |
|
|
78 | (4) |
|
Interactions of Cluster and Memory Modes |
|
|
82 | (2) |
|
|
84 | (1) |
|
|
84 | (1) |
|
Chapter 5 Intel Omni-Path Fabric |
|
|
85 | (22) |
|
|
85 | (3) |
|
Performance and Scalability |
|
|
88 | (2) |
|
|
90 | (2) |
|
|
92 | (3) |
|
|
95 | (6) |
|
Unicast Address Resolution |
|
|
101 | (2) |
|
Multicast Address Resolution |
|
|
103 | (1) |
|
|
104 | (1) |
|
|
105 | (2) |
|
Chapter 6 µarch Optimization Advice |
|
|
107 | (42) |
|
Best Performance From 1, 2, or 4 Threads Per Core, Rarely 3 |
|
|
107 | (2) |
|
|
109 | (1) |
|
|
110 | (9) |
|
Direct Mapped MCDRAM Cache |
|
|
119 | (1) |
|
|
120 | (24) |
|
|
144 | (1) |
|
|
145 | (4) |
Section II Parallel Programming |
|
|
Chapter 7 Programming Overview for Knights Landing |
|
|
149 | (6) |
|
To Refactor, or Not to Refactor, That Is the Question |
|
|
150 | (1) |
|
Evolutionary Optimization of Applications |
|
|
151 | (1) |
|
Revolutionary Optimization of Applications |
|
|
152 | (1) |
|
Know When to Hold'em and When to Fold'em |
|
|
153 | (1) |
|
|
154 | (1) |
|
Chapter 8 Tasks and Threads |
|
|
155 | (18) |
|
|
157 | (5) |
|
|
162 | (3) |
|
|
165 | (5) |
|
|
170 | (1) |
|
|
171 | (1) |
|
|
172 | (1) |
|
|
173 | (40) |
|
|
174 | (1) |
|
|
174 | (1) |
|
Three Approaches to Achieving Vectorization |
|
|
174 | (2) |
|
Six-Step Vectorization Methodology |
|
|
176 | (2) |
|
Streaming Through Caches: Data Layout, Alignment, Prefetching, and so on |
|
|
178 | (9) |
|
|
187 | (3) |
|
|
190 | (2) |
|
|
192 | (14) |
|
Use Array Sections to Encourage Vectorization |
|
|
206 | (3) |
|
Look at What the Compiler Created: Assembly Code Inspection |
|
|
209 | (2) |
|
Numerical Result Variations with Vectorization |
|
|
211 | (1) |
|
|
211 | (1) |
|
|
211 | (2) |
|
Chapter 10 Vectorization Advisor |
|
|
213 | (38) |
|
Getting Started with Intel Advisor for Knights Landing |
|
|
214 | (2) |
|
Enabling and Improving AVX-512 Code with the Survey Report |
|
|
216 | (16) |
|
Memory Access Pattern Report |
|
|
232 | (1) |
|
AVX-512 Gather/Scatter Profiler |
|
|
233 | (3) |
|
Mask Utilization and FLOPS Profiler |
|
|
236 | (2) |
|
|
238 | (2) |
|
Explore AVX-512 Code Characteristics Without AVX-512 Hardware |
|
|
240 | (2) |
|
Example - Analysis of a Computational Chemistry Code |
|
|
242 | (8) |
|
|
250 | (1) |
|
|
250 | (1) |
|
Chapter 11 Vectorization with SDLT |
|
|
251 | (18) |
|
|
251 | (1) |
|
|
252 | (2) |
|
|
254 | (2) |
|
Example: Normalizing 3D Points with SIMD |
|
|
256 | (2) |
|
What Is Wrong with AOS Memory Layout and SIMD? |
|
|
258 | (1) |
|
SIMD Prefers Unit-Stride Memory Accesses |
|
|
259 | (1) |
|
Alpha-Blended Overlay Reference |
|
|
260 | (3) |
|
Alpha-Blended Overlay With SDLT |
|
|
263 | (3) |
|
|
266 | (1) |
|
|
266 | (1) |
|
|
267 | (2) |
|
Chapter 12 Vectorization with AVX-512 Intrinsics |
|
|
269 | (28) |
|
|
269 | (5) |
|
|
274 | (3) |
|
Migrating From Knights Corner |
|
|
277 | (1) |
|
|
278 | (3) |
|
Learning AVX-512 Instructions |
|
|
281 | (1) |
|
Learning AVX-512 Intrinsics |
|
|
281 | (2) |
|
Step-by-Step Example Using AVX-512 Intrinsics |
|
|
283 | (11) |
|
Results Using Our Intrinsics Code |
|
|
294 | (1) |
|
|
295 | (2) |
|
Chapter 13 Performance Libraries |
|
|
297 | (18) |
|
Intel Performance Library Overview |
|
|
297 | (2) |
|
Intel Math Kernel Library Overview |
|
|
299 | (1) |
|
Intel Data Analytics Library Overview |
|
|
300 | (2) |
|
|
302 | (1) |
|
Intel Integrated Performance Primitives Library Overview |
|
|
303 | (2) |
|
Intel Performance Libraries and Intel Compilers |
|
|
305 | (1) |
|
Native (Direct) Library Usage |
|
|
306 | (2) |
|
Offloading to Knights Landing While Using a Library |
|
|
308 | (4) |
|
Precision Choices and Variations |
|
|
312 | (1) |
|
Performance Tip for Faster Dynamic Libraries |
|
|
313 | (1) |
|
|
314 | (1) |
|
Chapter 14 Profiling and Timing |
|
|
315 | (24) |
|
Introduction to Knights Landing Tuning |
|
|
315 | (1) |
|
Event-Monitoring Registers |
|
|
316 | (1) |
|
|
317 | (6) |
|
Potential Performance Issues |
|
|
323 | (10) |
|
Intel VTune Amplifier XE Product |
|
|
333 | (1) |
|
Performance Application Programming Interface |
|
|
334 | (1) |
|
|
334 | (1) |
|
|
335 | (1) |
|
Tuning and Analysis Utilities |
|
|
335 | (1) |
|
|
335 | (2) |
|
|
337 | (1) |
|
|
337 | (2) |
|
|
339 | (30) |
|
|
339 | (1) |
|
|
339 | (1) |
|
|
340 | (1) |
|
How to Run MPI Applications |
|
|
341 | (6) |
|
Analyzing MPI Application Runs |
|
|
347 | (5) |
|
Tuning of MPI Applications |
|
|
352 | (3) |
|
|
355 | (2) |
|
Recent Trends in MPI Coding |
|
|
357 | (5) |
|
|
362 | (3) |
|
|
365 | (1) |
|
|
365 | (4) |
|
Chapter 16 PGAS Programming Models |
|
|
369 | (14) |
|
|
369 | (3) |
|
Why Use PGAS on Knights Landing? |
|
|
372 | (1) |
|
|
373 | (5) |
|
|
378 | (3) |
|
|
381 | (1) |
|
|
381 | (1) |
|
|
382 | (1) |
|
Chapter 17 Software-Defined Visualization |
|
|
383 | (20) |
|
Motivation for Software-Defined Visualization |
|
|
384 | (3) |
|
Software-Defined Visualization Architecture |
|
|
387 | (1) |
|
OpenSWR: OpenGL Raster-Graphics Software Rendering |
|
|
388 | (2) |
|
Embree: High-Performance Ray Tracing Kernel Library |
|
|
390 | (2) |
|
OSPRay: Scalable Ray Tracing Framework |
|
|
392 | (7) |
|
|
399 | (1) |
|
|
400 | (1) |
|
|
400 | (3) |
|
Chapter 18 Offload to Knights Landing |
|
|
403 | (10) |
|
Offload Programming Model: Using with Knights Landing |
|
|
403 | (1) |
|
Processors Versus Coprocessor |
|
|
404 | (1) |
|
Offload Model Considerations |
|
|
405 | (1) |
|
|
406 | (2) |
|
Concurrent Host and Target Execution |
|
|
408 | (2) |
|
|
410 | (1) |
|
|
411 | (1) |
|
|
411 | (2) |
|
Chapter 19 Power Analysis |
|
|
413 | (30) |
|
Power Demand Gates Exascale |
|
|
413 | (2) |
|
|
415 | (1) |
|
Hardware-Based Power Analysis Techniques |
|
|
416 | (3) |
|
Software-Based Knights Landing Power Analyzer |
|
|
419 | (10) |
|
ManyCore Platform Software Package Power Tools |
|
|
429 | (1) |
|
Running Average Power Limit |
|
|
430 | (4) |
|
Performance Profiling on Knights Landing |
|
|
434 | (2) |
|
Intel Remote Management Module |
|
|
436 | (2) |
|
|
438 | (1) |
|
|
439 | (4) |
Section III Pearls |
|
|
Chapter 20 Optimizing Classical Molecular Dynamics in LAMMPS |
|
|
443 | (28) |
|
|
443 | (3) |
|
|
446 | (1) |
|
Knights Landing Processors |
|
|
447 | (2) |
|
|
449 | (1) |
|
|
449 | (1) |
|
|
450 | (2) |
|
|
452 | (7) |
|
|
459 | (3) |
|
Long-Range Electrostatics |
|
|
462 | (1) |
|
MPI and OpenMP Parallelization |
|
|
462 | (3) |
|
|
465 | (1) |
|
System, Build, and Run Configurations |
|
|
465 | (1) |
|
|
466 | (1) |
|
Organic Photovoltaic Molecules |
|
|
467 | (1) |
|
|
467 | (1) |
|
Rhodopsin Protein in Solvated Lipid Bilayer |
|
|
468 | (1) |
|
Coarse-Grain Liquid Crystal Simulation |
|
|
468 | (1) |
|
Coarse-Grain Water Simulation |
|
|
468 | (1) |
|
|
469 | (1) |
|
|
470 | (1) |
|
|
470 | (1) |
|
Chapter 21 High Performance Seismic Simulations |
|
|
471 | (28) |
|
High-Order Seismic Simulations |
|
|
472 | (1) |
|
|
472 | (4) |
|
Application Characteristics |
|
|
476 | (8) |
|
Intel Architecture as Compute Engine |
|
|
484 | (1) |
|
Highly-Efficient Small Matrix Kernels |
|
|
484 | (1) |
|
Sparse Matrix Kernel Generation and Sparse/Dense Kernel Selection |
|
|
485 | (1) |
|
Dense Matrix Kernel Generation: AVX2 |
|
|
486 | (1) |
|
Dense Matrix Kernel Generation: AVX-512 |
|
|
487 | (2) |
|
Kernel Performance Benchmarking |
|
|
489 | (1) |
|
Incorporating Knights Landing's Different Memory Subsystems |
|
|
490 | (3) |
|
|
493 | (1) |
|
|
493 | (2) |
|
|
495 | (2) |
|
|
497 | (1) |
|
|
498 | (1) |
|
Chapter 22 Weather Research and Forecasting (WRF) |
|
|
499 | (12) |
|
|
499 | (1) |
|
WRF Execution Profile: Relatively Flat |
|
|
500 | (1) |
|
History of WRF on Intel Many-Core (Intel Xeon Phi Product Line) |
|
|
500 | (1) |
|
Our Early Experiences with WRF on Knights Landing |
|
|
501 | (2) |
|
Compiling WRF for Intel Xeon and Intel Xeon Phi Systems |
|
|
503 | (1) |
|
WRF CONUS12km Benchmark Performance |
|
|
504 | (1) |
|
|
504 | (3) |
|
Vectorization: Boost of AVX-512 Over AVX2 |
|
|
507 | (1) |
|
|
508 | (1) |
|
|
509 | (1) |
|
|
509 | (2) |
|
Chapter 23 N-Body Simulation |
|
|
511 | (16) |
|
Parallel Programming for Noncomputer Scientists |
|
|
511 | (1) |
|
Step-by-Step Improvements |
|
|
512 | (1) |
|
|
513 | (2) |
|
|
515 | (1) |
|
Initial Implementation (Optimization Step 0) |
|
|
515 | (1) |
|
Thread Parallelism (Optimization Step 1) |
|
|
516 | (2) |
|
Scalar Performance Tuning (Optimization Step 2) |
|
|
518 | (1) |
|
Vectorization with SOA (Optimization Step 3) |
|
|
519 | (2) |
|
Memory Traffic (Optimization Step 4) |
|
|
521 | (2) |
|
Impact of MCDRAM on Performance |
|
|
523 | (1) |
|
|
524 | (1) |
|
|
525 | (2) |
|
Chapter 24 Machine Learning |
|
|
527 | (22) |
|
Convolutional Neural Networks |
|
|
528 | (10) |
|
|
538 | (10) |
|
|
548 | (1) |
|
Chapter 25 Trinity Workloads |
|
|
549 | (32) |
|
Out of the Box Performance |
|
|
549 | (22) |
|
Optimizing MiniGhost OpenMP Performance |
|
|
571 | (7) |
|
|
578 | (1) |
|
|
579 | (2) |
|
Chapter 26 Quantum Chromodynamics |
|
|
581 | (18) |
|
|
581 | (1) |
|
The QPhiX Library and Code Generator |
|
|
582 | (1) |
|
|
583 | (3) |
|
Configuring the QPhiX Code Generator |
|
|
586 | (3) |
|
|
589 | (1) |
|
|
590 | (7) |
|
|
597 | (1) |
|
|
597 | (2) |
Contributors |
|
599 | (14) |
Glossary |
|
613 | (10) |
Index |
|
623 | |