
E-book: Data Parallel C++: Mastering DPC++ for Programming of Heterogeneous Systems using C++ and SYCL

  • Format: PDF+DRM
  • Publication date: 02-Nov-2020
  • Publisher: Apress
  • Language: English
  • ISBN-13: 9781484255742
  • Price: 4.08 €*
  • * the price is final, i.e. no further discounts apply
  • This e-book is intended for personal use only. E-books cannot be returned.

DRM restrictions

  • Copying (copy/paste):

    not allowed

  • Printing:

    not allowed

  • Usage:

    Digital rights management (DRM)
    The publisher has supplied this e-book in encrypted form, which means that you must install special software to read it. You also need to create an Adobe ID. More information here. The e-book can be read by 1 user and downloaded to up to 6 devices (all authorized with the same Adobe ID).

    Required software
    To read on a mobile device (phone or tablet), install this free app: PocketBook Reader (iOS / Android)

    To read on a PC or Mac, install Adobe Digital Editions. (This is a free application designed specifically for reading e-books; it should not be confused with Adobe Reader, which is probably already installed on your computer.)

    This e-book cannot be read on an Amazon Kindle.

Learn how to accelerate C++ programs using data parallelism. This open access book enables C++ programmers to be at the forefront of this exciting and important new development that is helping to push computing to new levels. It is full of practical advice, detailed explanations, and code examples to illustrate key topics. 

Data parallelism in C++ enables access to parallel resources in a modern heterogeneous system, freeing you from being locked into any particular computing device. Now a single C++ application can use any combination of devices—including GPUs, CPUs, FPGAs and AI ASICs—that are suitable to the problems at hand.
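
As a minimal illustration of that device flexibility (a hypothetical sketch, not code from the book), the snippet below creates a SYCL queue that prefers a GPU and falls back to the runtime's default device, then reports which device was chosen. It assumes a SYCL 2020-style compiler such as DPC++ and the <sycl/sycl.hpp> header; older toolchains spell the header <CL/sycl.hpp> and use selector objects such as gpu_selector{} instead of gpu_selector_v.

    #include <sycl/sycl.hpp>
    #include <iostream>

    int main() {
      // Prefer a GPU if one is present; otherwise fall back to whatever
      // default device the runtime offers (often the CPU).
      sycl::queue q;
      try {
        q = sycl::queue{sycl::gpu_selector_v};
      } catch (const sycl::exception &) {
        q = sycl::queue{sycl::default_selector_v};
      }

      std::cout << "Running on: "
                << q.get_device().get_info<sycl::info::device::name>()
                << "\n";
    }

The same source compiles unchanged whether the chosen device turns out to be a CPU, a GPU, or another accelerator; only the selector (or the runtime's default choice) differs.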

This book begins by introducing data parallelism and foundational topics for effective use of the SYCL standard from the Khronos Group and Data Parallel C++ (DPC++), the open source compiler used in this book.  Later chapters cover advanced topics including error handling, hardware-specific programming, communication and synchronization, and memory model considerations.

Data Parallel C++ provides you with everything needed to use SYCL for programming heterogeneous systems.

What You'll Learn

  • Accelerate C++ programs using data-parallel programming (see the sketch after this list)
  • Target multiple device types (e.g. CPU, GPU, FPGA)
  • Use SYCL and SYCL compilers 
  • Connect with computing’s heterogeneous future via Intel’s oneAPI initiative
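
As a taste of the first two bullets above, here is a minimal vector-addition kernel (a hypothetical sketch in SYCL 2020 style, not an excerpt from the book). Buffers hand data management to the runtime, and parallel_for launches one work-item per element:

    #include <sycl/sycl.hpp>
    #include <iostream>
    #include <vector>

    int main() {
      constexpr size_t N = 1024;
      std::vector<float> a(N, 1.0f), b(N, 2.0f), c(N, 0.0f);

      sycl::queue q;  // device chosen by the runtime
      {
        // Buffers let the runtime move data between host and device as needed.
        sycl::buffer bufA{a}, bufB{b}, bufC{c};

        q.submit([&](sycl::handler &h) {
          sycl::accessor A{bufA, h, sycl::read_only};
          sycl::accessor B{bufB, h, sycl::read_only};
          sycl::accessor C{bufC, h, sycl::write_only};
          // Basic data-parallel kernel: one work-item per element.
          h.parallel_for(sycl::range<1>{N}, [=](sycl::id<1> i) {
            C[i] = A[i] + B[i];
          });
        });
      }  // buffers go out of scope here, copying results back to the vectors

      std::cout << "c[0] = " << c[0] << "\n";  // expect 3
    }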

Who This Book Is For

Those new to data-parallel programming and computer programmers interested in data-parallel programming using C++.


About the Authors xvii
Preface xix
Acknowledgments xxiii
Chapter 1 Introduction
1(24)
Read the Book, Not the Spec
2(1)
SYCL 1.2.1 vs. SYCL 2020, and DPC++
3(1)
Getting a DPC++ Compiler
4(1)
Book GitHub
4(1)
Hello, World! and a SYCL Program Dissection
5(1)
Queues and Actions
6(1)
It Is All About Parallelism
7(5)
Throughput
7(1)
Latency
8(1)
Think Parallel
8(1)
Amdahl and Gustafson
9(1)
Scaling
9(1)
Heterogeneous Systems
10(1)
Data-Parallel Programming
11(1)
Key Attributes of DPC++ and SYCL
12(10)
Single-Source
12(1)
Host
13(1)
Devices
13(1)
Kernel Code
14(1)
Asynchronous Task Graphs
15(3)
C++ Lambda Functions
18(3)
Portability and Direct Programming
21(1)
Concurrency vs. Parallelism
22(1)
Summary
23(2)
Chapter 2 Where Code Executes
25(36)
Single-Source
26(3)
Host Code
27(1)
Device Code
28(1)
Choosing Devices
29(1)
Method #1: Run on a Device of Any Type
30(5)
Queues
31(3)
Binding a Queue to a Device, When Any Device Will Do
34(1)
Method #2: Using the Host Device for Development and Debugging
35(3)
Method #3: Using a GPU (or Other Accelerators)
38(5)
Device Types
38(1)
Device Selectors
39(4)
Method #4: Using Multiple Devices
43(2)
Method #5: Custom (Very Specific) Device Selection
45(1)
device_selector Base Class
45(1)
Mechanisms to Score a Device
46(1)
Three Paths to Device Code Execution on CPU
46(2)
Creating Work on a Device
48(10)
Introducing the Task Graph
48(2)
Where Is the Device Code?
50(3)
Actions
53(3)
Fallback
56(2)
Summary
58(3)
Chapter 3 Data Management
61(30)
Introduction
62(1)
The Data Management Problem
63(1)
Device Local vs. Device Remote
63(1)
Managing Multiple Memories
64(2)
Explicit Data Movement
64(1)
Implicit Data Movement
65(1)
Selecting the Right Strategy
66(1)
USM, Buffers, and Images
66(1)
Unified Shared Memory
67(4)
Accessing Memory Through Pointers
67(1)
USM and Data Movement
68(3)
Buffers
71(4)
Creating Buffers
72(1)
Accessing Buffers
72(2)
Access Modes
74(1)
Ordering the Uses of Data
75(11)
In-order Queues
77(1)
Out-of-Order (OoO) Queues
78(1)
Explicit Dependences with Events
78(2)
Implicit Dependences with Accessors
80(6)
Choosing a Data Management Strategy
86(1)
Handler Class: Key Members
87(3)
Summary
90(1)
Chapter 4 Expressing Parallelism
91(40)
Parallelism Within Kernels
92(5)
Multidimensional Kernels
93(2)
Loops vs. Kernels
95(2)
Overview of Language Features
97(2)
Separating Kernels from Host Code
97(1)
Different Forms of Parallel Kernels
98(1)
Basic Data-Parallel Kernels
99(7)
Understanding Basic Data-Parallel Kernels
99(1)
Writing Basic Data-Parallel Kernels
100(3)
Details of Basic Data-Parallel Kernels
103(3)
Explicit ND-Range Kernels
106(12)
Understanding Explicit ND-Range Parallel Kernels
107(5)
Writing Explicit ND-Range Data-Parallel Kernels
112(1)
Details of Explicit ND-Range Data-Parallel Kernels
113(5)
Hierarchical Parallel Kernels
118(6)
Understanding Hierarchical Data-Parallel Kernels
119(1)
Writing Hierarchical Data-Parallel Kernels
119(3)
Details of Hierarchical Data-Parallel Kernels
122(2)
Mapping Computation to Work-Items
124(3)
One-to-One Mapping
125(1)
Many-to-One Mapping
125(2)
Choosing a Kernel Form
127(2)
Summary
129(2)
Chapter 5 Error Handling
131(18)
Safety First
132(1)
Types of Errors
133(2)
Let's Create Some Errors!
135(3)
Synchronous Error
135(1)
Asynchronous Error
136(2)
Application Error Handling Strategy
138(8)
Ignoring Error Handling
138(2)
Synchronous Error Handling
140(1)
Asynchronous Error Handling
141(5)
Errors on a Device
146(1)
Summary
147(2)
Chapter 6 Unified Shared Memory
149(24)
Why Should We Use USM?
150(1)
Allocation Types
150(2)
Device Allocations
151(1)
Host Allocations
151(1)
Shared Allocations
151(1)
Allocating Memory
152(8)
What Do We Need to Know?
153(1)
Multiple Styles
154(5)
Deallocating Memory
159(1)
Allocation Example
159(1)
Data Management
160(8)
Initialization
160(1)
Data Movement
161(7)
Queries
168(2)
Summary
170(3)
Chapter 7 Buffers
173(22)
Buffers
174(8)
Creation
175(6)
What Can We Do with a Buffer?
181(1)
Accessors
182(10)
Accessor Creation
185(6)
What Can We Do with an Accessor?
191(1)
Summary
192(3)
Chapter 8 Scheduling Kernels and Data Movement
195(18)
What Is Graph Scheduling?
196(1)
How Graphs Work in DPC++
197(9)
Command Group Actions
198(1)
How Command Groups Declare Dependences
198(1)
Examples
199(7)
When Are the Parts of a CG Executed?
206(1)
Data Movement
206(3)
Explicit
207(1)
Implicit
208(1)
Synchronizing with the Host
209(2)
Summary
211(2)
Chapter 9 Communication and Synchronization
213(28)
Work-Groups and Work-Items
214(1)
Building Blocks for Efficient Communication
215(4)
Synchronization via Barriers
215(2)
Work-Group Local Memory
217(2)
Using Work-Group Barriers and Local Memory
219(11)
Work-Group Barriers and Local Memory in ND-Range Kernels
223(3)
Work-Group Barriers and Local Memory in Hierarchical Kernels
226(4)
Sub-Groups
230(4)
Synchronization via Sub-Group Barriers
230(1)
Exchanging Data Within a Sub-Group
231(2)
A Full Sub-Group ND-Range Kernel Example
233(1)
Collective Functions
234(5)
Broadcast
234(1)
Votes
235(1)
Shuffles
235(3)
Loads and Stores
238(1)
Summary
239(2)
Chapter 10 Defining Kernels
241(18)
Why Three Ways to Represent a Kernel?
242(2)
Kernels As Lambda Expressions
244(4)
Elements of a Kernel Lambda Expression
244(3)
Naming Kernel Lambda Expressions
247(1)
Kernels As Named Function Objects
248(3)
Elements of a Kernel Named Function Object
249(2)
Interoperability with Other APIs
251(4)
Interoperability with API-Defined Source Languages
252(1)
Interoperability with API-Defined Kernel Objects
253(2)
Kernels in Program Objects
255(2)
Summary
257(2)
Chapter 11 Vectors
259(18)
How to Think About Vectors
260(3)
Vector Types
263(1)
Vector Interface
264(6)
Load and Store Member Functions
267(2)
Swizzle Operations
269(1)
Vector Execution Within a Parallel Kernel
270(4)
Vector Parallelism
274(1)
Summary
275(2)
Chapter 12 Device Information
277(20)
Refining Kernel Code to Be More Prescriptive
278(2)
How to Enumerate Devices and Capabilities
280(8)
Custom Device Selector
281(4)
Being Curious: get_info<>
285(1)
Being More Curious: Detailed Enumeration Code
286(2)
Inquisitive: get_info<>
288(1)
Device Information Descriptors
288(1)
Device-Specific Kernel Information Descriptors
288(1)
The Specifics: Those of "Correctness"
289(4)
Device Queries
290(2)
Kernel Queries
292(1)
The Specifics: Those of "Tuning/Optimization"
293(1)
Device Queries
293(1)
Kernel Queries
294(1)
Runtime vs. Compile-Time Properties
294(1)
Summary
295(2)
Chapter 13 Practical Tips
297(26)
Getting a DPC++ Compiler and Code Samples
297(1)
Online Forum and Documentation
298(1)
Platform Model
298(5)
Multiarchitecture Binaries
300(1)
Compilation Model
300(3)
Adding SYCL to Existing C++ Programs
303(2)
Debugging
305(5)
Debugging Kernel Code
306(1)
Debugging Runtime Failures
307(3)
Initializing Data and Accessing Kernel Outputs
310(9)
Multiple Translation Units
319(1)
Performance Implications of Multiple Translation Units
320(1)
When Anonymous Lambdas Need Names
320(1)
Migrating from CUDA to SYCL
321(1)
Summary
322(1)
Chapter 14 Common Parallel Patterns
323(30)
Understanding the Patterns
324(9)
Map
325(1)
Stencil
326(2)
Reduction
328(2)
Scan
330(2)
Pack and Unpack
332(1)
Using Built-in Functions and Libraries
333(8)
The DPC++ Reduction Library
334(5)
oneAPI DPC++ Library
339(1)
Group Functions
340(1)
Direct Programming
341(10)
Map
341(1)
Stencil
342(2)
Reduction
344(1)
Scan
345(3)
Pack and Unpack
348(3)
Summary
351(2)
For More Information
351(2)
Chapter 15 Programming for GPUs
353(34)
Performance Caveats
354(1)
How GPUs Work
354(15)
GPU Building Blocks
354(2)
Simpler Processors (but More of Them)
356(5)
Simplified Control Logic (SIMD Instructions)
361(6)
Switching Work to Hide Latency
367(2)
Offloading Kernels to GPUs
369(5)
SYCL Runtime Library
369(1)
GPU Software Drivers
370(1)
GPU Hardware
371(1)
Beware the Cost of Offloading!
372(2)
GPU Kernel Best Practices
374(9)
Accessing Global Memory
374(4)
Accessing Work-Group Local Memory
378(2)
Avoiding Local Memory Entirely with Sub-Groups
380(1)
Optimizing Computation Using Small Data Types
381(1)
Optimizing Math Functions
382(1)
Specialized Functions and Extensions
382(1)
Summary
383(4)
For More Information
384(3)
Chapter 16 Programming for CPUs
387(32)
Performance Caveats
388(1)
The Basics of a General-Purpose CPU
389(2)
The Basics of SIMD Hardware
391(7)
Exploiting Thread-Level Parallelism
398(8)
Thread Affinity Insight
401(4)
Be Mindful of First Touch to Memory
405(1)
SIMD Vectorization on CPU
406(11)
Ensure SIMD Execution Legality
407(2)
SIMD Masking and Cost
409(2)
Avoid Array-of-Struct for SIMD Efficiency
411(2)
Data Type Impact on SIMD Efficiency
413(2)
SIMD Execution Using single_task
415(2)
Summary
417(2)
Chapter 17 Programming for FPGAs
419(52)
Performance Caveats
420(1)
How to Think About FPGAs
420(8)
Pipeline Parallelism
424(3)
Kernels Consume Chip "Area"
427(1)
When to Use an FPGA
428(5)
Lots and Lots of Work
428(1)
Custom Operations or Operation Widths
429(1)
Scalar Data Flow
430(1)
Low Latency and Rich Connectivity
431(1)
Customized Memory Systems
432(1)
Running on an FPGA
433(7)
Compile Times
435(5)
Writing Kernels for FPGAs
440(25)
Exposing Parallelism
440(16)
Pipes
456(6)
Custom Memory Systems
462(3)
Some Closing Topics
465(3)
FPGA Building Blocks
465(2)
Clock Frequency
467(1)
Summary
468(3)
Chapter 18 Libraries
471(24)
Built-in Functions
472(6)
Use the sycl:: Prefix with Built-in Functions
474(4)
DPC++ Library
478(14)
Standard C++ APIs in DPC++
479(4)
DPC++ Parallel STL
483(9)
Error Handling with DPC++ Execution Policies
492(1)
Summary
492(3)
Chapter 19 Memory Model and Atomics
495(36)
What Is in a Memory Model?
497(9)
Data Races and Synchronization
498(3)
Barriers and Fences
501(2)
Atomic Operations
503(1)
Memory Ordering
504(2)
The Memory Model
506(17)
The memory_order Enumeration Class
508(3)
The memory_scope Enumeration Class
511(1)
Querying Device Capabilities
512(2)
Barriers and Fences
514(1)
Atomic Operations in DPC++
515(8)
Using Atomics in Real Life
523(5)
Computing a Histogram
523(2)
Implementing Device-Wide Synchronization
525(3)
Summary
528(4)
For More Information
529(3)
Epilogue: Future Direction of DPC++ 531(10)
Alignment with C++20 and C++23
532(2)
Address Spaces
534(2)
Extension and Specialization Mechanism
536(1)
Hierarchical Parallelism
537(1)
Summary
538(3)
For More Information
539(2)
Index 541
James Reinders is a consultant with more than three decades of experience in parallel computing, and is an author/co-author/editor of nine technical books related to parallel programming. He has had the great fortune to help make key contributions to two of the world's fastest computers (#1 on the Top500 list) as well as many other supercomputers and software developer tools. James finished 10,001 days (over 27 years) at Intel in mid-2016, and now continues to write, teach, program, and consult in areas related to parallel computing (HPC and AI).