About the Authors | xvii
Preface | xix
Acknowledgments | xxiii

Chapter 1 Introduction | 1
  Read the Book, Not the Spec | 2
  SYCL 1.2.1 vs. SYCL 2020, and DPC++ | 3
  Hello, World! and a SYCL Program Dissection | 5
  It Is All About Parallelism | 7
  Data-Parallel Programming | 11
  Key Attributes of DPC++ and SYCL | 12
  Portability and Direct Programming | 21
  Concurrency vs. Parallelism | 22
  Summary | 23

Chapter 2 Where Code Executes | 25
  Method#1: Run on a Device of Any Type | 30
    Binding a Queue to a Device, When Any Device Will Do | 34
  Method#2: Using the Host Device for Development and Debugging | 35
  Method#3: Using a GPU (or Other Accelerators) | 38
  Method#4: Using Multiple Devices | 43
  Method#5: Custom (Very Specific) Device Selection | 45
    device_selector Base Class | 45
    Mechanisms to Score a Device | 46
  Three Paths to Device Code Execution on CPU | 46
  Creating Work on a Device | 48
    Introducing the Task Graph | 48
    Where Is the Device Code? | 50
  Summary | 58

Chapter 3 Data Management | 61
  The Data Management Problem | 63
  Device Local vs. Device Remote | 63
  Managing Multiple Memories | 64
  Selecting the Right Strategy | 66
  Accessing Memory Through Pointers | 67
  Ordering the Uses of Data | 75
    In-Order Queues | 77
    Out-of-Order (OoO) Queues | 78
    Explicit Dependences with Events | 78
    Implicit Dependences with Accessors | 80
  Choosing a Data Management Strategy | 86
  Handler Class: Key Members | 87
  Summary | 90

Chapter 4 Expressing Parallelism | 91
  Parallelism Within Kernels | 92
  Overview of Language Features | 97
    Separating Kernels from Host Code | 97
    Different Forms of Parallel Kernels | 98
  Basic Data-Parallel Kernels | 99
    Understanding Basic Data-Parallel Kernels | 99
    Writing Basic Data-Parallel Kernels | 100
    Details of Basic Data-Parallel Kernels | 103
  Explicit ND-Range Kernels | 106
    Understanding Explicit ND-Range Parallel Kernels | 107
    Writing Explicit ND-Range Data-Parallel Kernels | 112
    Details of Explicit ND-Range Data-Parallel Kernels | 113
  Hierarchical Parallel Kernels | 118
    Understanding Hierarchical Data-Parallel Kernels | 119
    Writing Hierarchical Data-Parallel Kernels | 119
    Details of Hierarchical Data-Parallel Kernels | 122
  Mapping Computation to Work-Items | 124
  Summary | 129
|
Chapter 5 Error Handling | 131
  Let's Create Some Errors! | 135
  Application Error Handling Strategy | 138
    Synchronous Error Handling | 140
    Asynchronous Error Handling | 141
  Summary | 147

Chapter 6 Unified Shared Memory | 149
  Summary | 170
|
Chapter 7 Buffers | 173
  What Can We Do with a Buffer? | 181
  What Can We Do with an Accessor? | 191
  Summary | 192

Chapter 8 Scheduling Kernels and Data Movement | 195
  What Is Graph Scheduling? | 196
  How Command Groups Declare Dependences | 198
  When Are the Parts of a CG Executed? | 206
  Synchronizing with the Host | 209
  Summary | 211

Chapter 9 Communication and Synchronization | 213
  Work-Groups and Work-Items | 214
  Building Blocks for Efficient Communication | 215
    Synchronization via Barriers | 215
  Using Work-Group Barriers and Local Memory | 219
    Work-Group Barriers and Local Memory in ND-Range Kernels | 223
    Work-Group Barriers and Local Memory in Hierarchical Kernels | 226
  Synchronization via Sub-Group Barriers | 230
  Exchanging Data Within a Sub-Group | 231
  A Full Sub-Group ND-Range Kernel Example | 233
  Summary | 239

Chapter 10 Defining Kernels | 241
  Why Three Ways to Represent a Kernel? | 242
  Kernels As Lambda Expressions | 244
    Elements of a Kernel Lambda Expression | 244
    Naming Kernel Lambda Expressions | 247
  Kernels As Named Function Objects | 248
    Elements of a Kernel Named Function Object | 249
  Interoperability with Other APIs | 251
    Interoperability with API-Defined Source Languages | 252
    Interoperability with API-Defined Kernel Objects | 253
  Kernels in Program Objects | 255
  Summary | 257
|
Chapter 11 Vectors | 259
  How to Think About Vectors | 260
  Load and Store Member Functions | 267
  Vector Execution Within a Parallel Kernel | 270
  Summary | 275

Chapter 12 Device Information | 277
  Refining Kernel Code to Be More Prescriptive | 278
  How to Enumerate Devices and Capabilities | 280
    Being Curious: get_info<> | 285
    Being More Curious: Detailed Enumeration Code | 286
    Device Information Descriptors | 288
    Device-Specific Kernel Information Descriptors | 288
The Specifics: Those of "Correctness" |
|
|
289 | (4) |
|
|
290 | (2) |
|
|
292 | (1) |
|
The Specifics: Those of "Tuning/Optimization" |
|
|
293 | (1) |
|
|
293 | (1) |
|
|
294 | (1) |
|
  Runtime vs. Compile-Time Properties | 294
  Summary | 295

Chapter 13 Practical Tips | 297
  Getting a DPC++ Compiler and Code Samples | 297
  Online Forum and Documentation | 298
  Multiarchitecture Binaries | 300
  Adding SYCL to Existing C++ Programs | 303
  Debugging Runtime Failures | 307
  Initializing Data and Accessing Kernel Outputs | 310
  Multiple Translation Units | 319
    Performance Implications of Multiple Translation Units | 320
  When Anonymous Lambdas Need Names | 320
  Migrating from CUDA to SYCL | 321
  Summary | 322

Chapter 14 Common Parallel Patterns | 323
  Understanding the Patterns | 324
  Using Built-in Functions and Libraries | 333
    The DPC++ Reduction Library | 334
  Summary | 351

Chapter 15 Programming for GPUs | 353
  Simpler Processors (but More of Them) | 356
  Simplified Control Logic (SIMD Instructions) | 361
  Switching Work to Hide Latency | 367
  Offloading Kernels to GPUs | 369
    Beware the Cost of Offloading! | 372
  GPU Kernel Best Practices | 374
    Accessing Work-Group Local Memory | 378
    Avoiding Local Memory Entirely with Sub-Groups | 380
    Optimizing Computation Using Small Data Types | 381
    Optimizing Math Functions | 382
    Specialized Functions and Extensions | 382
  Summary | 384

Chapter 16 Programming for CPUs | 387
  The Basics of a General-Purpose CPU | 389
  The Basics of SIMD Hardware | 391
  Exploiting Thread-Level Parallelism | 398
    Be Mindful of First Touch to Memory | 405
  SIMD Vectorization on CPU | 406
    Ensure SIMD Execution Legality | 407
    Avoid Array-of-Struct for SIMD Efficiency | 411
    Data Type Impact on SIMD Efficiency | 413
    SIMD Execution Using single_task | 415
  Summary | 417

Chapter 17 Programming for FPGAs | 419
  Kernels Consume Chip "Area" | 427
  Custom Operations or Operation Widths | 429
  Low Latency and Rich Connectivity | 431
  Customized Memory Systems | 432
  Writing Kernels for FPGAs | 440
  Summary | 468

Chapter 18 Libraries | 471
  Use the sycl:: Prefix with Built-in Functions | 474
  Standard C++ APIs in DPC++ | 479
  Error Handling with DPC++ Execution Policies | 492
  Summary | 492

Chapter 19 Memory Model and Atomics | 495
  What Is in a Memory Model? | 497
    Data Races and Synchronization | 498
    The memory_order Enumeration Class | 508
    The memory_scope Enumeration Class | 511
    Querying Device Capabilities | 512
    Atomic Operations in DPC++ | 515
  Using Atomics in Real Life | 523
    Implementing Device-Wide Synchronization | 525
  Summary | 529

Epilogue: Future Direction of DPC++ | 531
  Alignment with C++20 and C++23 | 532
  Extension and Specialization Mechanism | 536

Index | 541