Foreword  xiii
Preface  xvii
Acknowledgements  xix

Chapter 1 Introduction  1 (22)
    …  1 (1)
    Why Intel® Xeon Phi™ coprocessors are needed  2 (3)
    Platforms with coprocessors  5 (1)
    The first Intel® Xeon Phi™ coprocessor  6 (3)
    Keeping the "Ninja Gap" under control  9 (1)
    Transforming-and-tuning double advantage  10 (1)
    When to use an Intel® Xeon Phi™ coprocessor  11 (1)
    Maximizing performance on processors first  11 (1)
    Why scaling past one hundred threads is so important  12 (3)
    Maximizing parallel program performance  15 (1)
    Measuring readiness for highly parallel execution  15 (1)
    …  16 (1)
    Beyond the ease of porting to increased performance  16 (1)
    Transformation for performance  17 (1)
    Hyper-threading versus multithreading  17 (1)
    Coprocessor major usage model: MPI versus offload  18 (1)
    Compiler and programming models  19 (1)
    …  20 (1)
    …  21 (1)
    …  21 (2)

Chapter 2 High Performance Closed Track Test Drive!  23 (36)
    Looking under the hood: coprocessor specifications  24 (2)
    Starting the car: communicating with the coprocessor  26 (2)
    Taking it out easy: running our first code  28 (4)
    Starting to accelerate: running more than one thread  32 (6)
    Pedal to the metal: hitting full speed using all cores  38 (11)
    Easing in to the first curve: accessing memory bandwidth  49 (5)
    High speed banked curve: maximizing memory bandwidth  54 (3)
    Back to the pit: a summary  57 (2)

Chapter 3 A Friendly Country Road Race  59 (24)
    Preparing for our country road trip: chapter focus  59 (1)
    Getting a feel for the road: the 9-point stencil algorithm  60 (1)
    At the starting line: the baseline 9-point stencil implementation  61 (7)
    Rough road ahead: running the baseline stencil code  68 (2)
    Cobblestone street ride: vectors but not yet scaling  70 (2)
    Open road all-out race: vectors plus scaling  72 (3)
    Some grease and wrenches!: a bit of tuning  75 (6)
    Adjusting the "Alignment"  76 (1)
    …  77 (2)
    Using huge 2-MB memory pages  79 (2)
    …  81 (1)
    …  81 (2)

Chapter 4 Driving Around Town: Optimizing A Real-World Code Example  83 (24)
    Choosing the direction: the basic diffusion calculation  84 (1)
    Turn ahead: accounting for boundary effects  84 (7)
    Finding a wide boulevard: scaling the code  91 (2)
    Thunder road: ensuring vectorization  93 (4)
    Peeling out: peeling code from the inner loop  97 (3)
    Trying higher octane fuel: improving speed using data locality and tiling  100 (5)
    High speed driver certificate: summary of our high speed tour  105 (2)

Chapter 5 Lots of Data (Vectors)  107 (58)
    …  107 (1)
    …  108 (1)
    Five approaches to achieving vectorization  108 (2)
    Six step vectorization methodology  110 (2)
    Step 1 Measure baseline release build performance  111 (1)
    Step 2 Determine hotspots using Intel® VTune™ Amplifier XE  111 (1)
    Step 3 Determine loop candidates using Intel compiler vec-report  111 (1)
    Step 4 Get advice using the Intel Compiler GAP report and toolkit resources  112 (1)
    Step 5 Implement GAP advice and other suggestions (such as using elemental functions and/or array notations)  112 (1)
    …  112 (1)
    Streaming through caches: data layout, alignment, prefetching, and so on  112 (11)
    Why data layout affects vectorization performance  113 (1)
    …  114 (2)
    …  116 (5)
    …  121 (2)
    …  123 (3)
    Avoid manual loop unrolling  123 (1)
    Requirements for a loop to vectorize (Intel® Compiler)  124 (2)
    Importance of inlining, interference with simple profiling  126 (1)
    …  126 (2)
    Memory disambiguation inside vector-loops  127 (1)
    …  128 (22)
    …  129 (5)
    The VECTOR and NOVECTOR directives  134 (1)
    …  135 (2)
    Random number function vectorization  137 (1)
    Utilizing full vectors, -opt-assume-safe-padding  138 (4)
    Option -opt-assume-safe-padding  142 (1)
    Data alignment to assist vectorization  142 (4)
    Tradeoffs in array notations due to vector lengths  146 (4)
    Use array sections to encourage vectorization  150 (6)
    …  150 (2)
    Cilk Plus array sections and elemental functions  152 (4)
    Look at what the compiler created: assembly code inspection  156 (7)
    How to find the assembly code  157 (1)
    Quick inspection of assembly code  158 (5)
    Numerical result variations with vectorization  163 (1)
    …  163 (1)
    …  163 (2)

Chapter 6 Lots of Tasks (not Threads)  165 (24)
    OpenMP, Fortran 2008, Intel® TBB, Intel® Cilk™ Plus, Intel® MKL  166 (2)
    Task creation needs to happen on the coprocessor  166 (2)
    Importance of thread pools  168 (1)
    …  168 (3)
    Parallel processing model  168 (1)
    …  169 (1)
    Significant controls over OpenMP  169 (1)
    …  170 (1)
    …  171 (3)
    …  171 (1)
    DO CONCURRENT and DATA RACES  171 (1)
    …  172 (1)
    DO CONCURRENT vs. FORALL  173 (1)
    DO CONCURRENT vs. OpenMP "Parallel"  173 (1)
    …  174 (7)
    …  175 (2)
    …  177 (1)
    …  177 (1)
    …  177 (1)
    …  178 (1)
    …  179 (1)
    …  180 (1)
    …  180 (1)
    …  181 (1)
    …  181 (6)
    …  183 (1)
    Borrowing components from TBB  183 (1)
    Loaning components to TBB  184 (1)
    …  184 (1)
    …  184 (1)
    …  185 (2)
    …  187 (1)
    Array notation and elemental functions  187 (1)
    …  187 (1)
    …  187 (1)
    …  188 (1)

Chapter 7 Offload  189 (54)
    …  190 (1)
    Choosing offload vs. native execution  191 (1)
    Non-shared memory model: using offload pragmas/directives  191 (1)
    Shared virtual memory model: using offload with shared VM  191 (1)
    Intel® Math Kernel Library (Intel MKL) automatic offload  192 (1)
    Language extensions for offload  192 (3)
    Compiler options and environment variables for offload  193 (2)
    Sharing environment variables for offload  195 (1)
    Offloading to multiple coprocessors  195 (1)
    Using pragma/directive offload  195 (22)
    Placing variables and functions on the coprocessor  198 (2)
    Managing memory allocation for pointer variables  200 (6)
    Optimization for time: another reason to persist allocations  206 (1)
    Target-specific code using a pragma in C/C++  206 (3)
    Target-specific code using a directive in Fortran  209 (1)
    Code that should not be built for processor-only execution  209 (2)
    Predefined macros for Intel® MIC architecture  211 (1)
    …  211 (1)
    Allocating memory for parts of C/C++ arrays  212 (1)
    Allocating memory for parts of Fortran arrays  213 (1)
    Moving data from one variable to another  214 (1)
    Restrictions on offloaded code using a pragma  215 (2)
    Using offload with shared virtual memory  217 (11)
    Using shared memory and shared variables  217 (2)
    …  219 (1)
    Shared memory management functions  219 (1)
    Synchronous and asynchronous function execution: _Cilk_offload  219 (1)
    Sharing variables and functions: _Cilk_shared  220 (2)
    Rules for using _Cilk_shared and _Cilk_offload  222 (1)
    Synchronization between the processor and the target  222 (1)
    Writing target-specific code with _Cilk_offload  223 (1)
    Restrictions on offloaded code using shared virtual memory  224 (1)
    Persistent data when using shared virtual memory  225 (2)
    C++ declarations of persistent data with shared virtual memory  227 (1)
    About asynchronous computation  228 (1)
    About asynchronous data transfer  229 (5)
    Asynchronous data transfer from the processor to the coprocessor  229 (5)
    Applying the target attribute to multiple declarations  234 (4)
    Vec-report option used with offloads  235 (1)
    Measuring timing and data in offload regions  236 (1)
    …  236 (1)
    Using libraries in offloaded code  237 (1)
    About creating offload libraries with xiar and xild  237 (1)
    Performing file I/O on the coprocessor  238 (2)
    Logging stdout and stderr from offloaded code  240 (1)
    …  241 (1)
    …  241 (2)

Chapter 8 Coprocessor Architecture  243 (26)
    The Intel® Xeon Phi™ coprocessor family  244 (1)
    …  245 (1)
    Intel® Xeon Phi™ coprocessor silicon overview  246 (1)
    Individual coprocessor core architecture  247 (2)
    Instruction and multithread processing  249 (2)
    Cache organization and memory access considerations  251 (1)
    …  252 (1)
    Vector processing unit architecture  253 (4)
    …  254 (3)
    Coprocessor PCIe system interface and DMA  257 (3)
    …  258 (2)
    Coprocessor power management capabilities  260 (3)
    Reliability, availability, and serviceability (RAS)  263 (2)
    Machine check architecture (MCA)  264 (1)
    Coprocessor system management controller (SMC)  265 (2)
    …  265 (1)
    Thermal design power monitoring and control  266 (1)
    …  266 (1)
    Potential application impact  266 (1)
    …  267 (1)
    …  267 (1)
    …  267 (2)

Chapter 9 Coprocessor System Software  269 (24)
    Coprocessor software architecture overview  269 (2)
    …  271 (1)
    Ring levels: user and kernel  271 (1)
    Coprocessor programming models and options  271 (5)
    …  273 (1)
    Coprocessor MPI programming models  274 (2)
    Coprocessor software architecture components  276 (1)
    Development tools and application layer  276 (1)
    Intel® Manycore Platform Software Stack  277 (10)
    …  277 (1)
    COI: coprocessor offload infrastructure  278 (1)
    SCIF: symmetric communications interface  278 (1)
    Virtual networking (NetDev), TCP/IP, and sockets  278 (1)
    Coprocessor system management  279 (3)
    Coprocessor components for MPI applications  282 (5)
    Linux support for Intel® Xeon Phi™ coprocessors  287 (1)
    Tuning memory allocation performance  288 (2)
    Controlling the number of 2 MB pages  288 (1)
    Monitoring the number of 2 MB pages on the coprocessor  288 (1)
    A sample method for allocating 2 MB pages  289 (1)
    …  290 (1)
    …  291 (2)

Chapter 10 Linux on the Coprocessor  293 (32)
    Coprocessor Linux baseline  293 (1)
    Introduction to coprocessor Linux bootstrap and configuration  294 (1)
    Default coprocessor Linux configuration  295 (2)
    Step 1 Ensure root access  296 (1)
    Step 2 Generate the default configuration  296 (1)
    Step 3 Change configuration  296 (1)
    Step 4 Start the Intel® MPSS service  296 (1)
    Changing coprocessor configuration  297 (8)
    …  297 (1)
    …  298 (1)
    Configuring boot parameters  298 (2)
    Coprocessor root file system  300 (5)
    …  305 (7)
    Coprocessor state control  306 (1)
    …  306 (1)
    Shutting down coprocessors  306 (1)
    Rebooting the coprocessors  306 (1)
    …  307 (1)
    Coprocessor configuration initialization and propagation  308 (1)
    Helper functions for configuration parameters  309 (2)
    Other file system helper functions  311 (1)
    …  312 (3)
    Adding files to the root file system  313 (1)
    Example: Adding a new global file set  314 (1)
    Coprocessor Linux boot process  315 (3)
    …  315 (3)
    Coprocessors in a Linux cluster  318 (4)
    …  319 (1)
    How Intel® Cluster Checker works  319 (1)
    Intel® Cluster Checker support for coprocessors  320 (2)
    …  322 (1)
    …  323 (2)

Chapter 11 Math Library  325 (18)
    Intel Math Kernel Library overview  326 (1)
    Intel MKL differences on the coprocessor  327 (1)
    Intel MKL and Intel compiler  327 (1)
    Coprocessor support overview  327 (3)
    Control functions for automatic offload  328 (2)
    Examples of how to set the environment variables  330 (1)
    Using the coprocessor in native mode  330 (2)
    Tips for using native mode  332 (1)
    Using automatic offload mode  332 (5)
    How to enable automatic offload  333 (1)
    Examples of using control work division  333 (1)
    Tips for effective use of automatic offload  333 (3)
    Some tips for effective use of Intel MKL with or without offload  336 (1)
    Using compiler-assisted offload  337 (2)
    Tips for using compiler-assisted offload  338 (1)
    Precision choices and variations  339 (3)
    Fast transcendentals and mathematics  339 (1)
    Understanding the potential for floating-point arithmetic variations  339 (3)
    …  342 (1)
    …  342 (1)

Chapter 12 MPI  343 (20)
    …  343 (2)
    Using MPI on Intel® Xeon Phi™ coprocessors  345 (4)
    Heterogeneity (and why it matters)  345 (3)
    Prerequisites (batteries not included)  348 (1)
    …  349 (5)
    …  350 (1)
    …  350 (4)
    Using MPI natively on the coprocessor  354 (7)
    …  354 (2)
    Trapezoidal rule (revisited)  356 (5)
    …  361 (1)
    …  362 (1)

Chapter 13 Profiling and Timing  363 (22)
    Event monitoring registers on the coprocessor  364 (1)
    List of events used in this guide  364 (1)
    …  364 (6)
    …  365 (4)
    Compute to data access ratio  369 (1)
    Potential performance issues  370 (7)
    …  371 (2)
    …  373 (1)
    …  374 (2)
    …  376 (1)
    Intel® VTune™ Amplifier XE product  377 (1)
    …  378 (1)
    Performance application programming interface  378 (1)
    MPI analysis: Intel Trace Analyzer and Collector  378 (2)
    Generating a trace file: coprocessor-only application  379 (1)
    Generating a trace file: processor + coprocessor application  379 (1)
    …  380 (3)
    Clocksources on the coprocessor  380 (1)
    MIC elapsed time counter (micetc)  380 (1)
    …  380 (1)
    …  381 (1)
    …  381 (1)
    …  382 (1)
    Measuring timing and data in offload regions  383 (1)
    …  383 (1)
    …  383 (2)

Chapter 14 Summary  385 (2)
    …  385 (1)
    …  386 (1)
    …  386 (1)
    …  386 (1)

Glossary  387 (14)
Index  401