About the Authors | xvii
Preface | xix
Acknowledgments | xxiii

Chapter 1 Introduction | 1
  Read the Book, Not the Spec | 2
  SYCL 1.2.1 vs. SYCL 2020, and DPC++ | 3
  Hello, World! and a SYCL Program Dissection | 5
  It Is All About Parallelism | 7
  Data-Parallel Programming | 11
  Key Attributes of DPC++ and SYCL | 12
  Portability and Direct Programming | 21
  Concurrency vs. Parallelism | 22
  Summary | 23

Chapter 2 Where Code Executes | 25
  Method#1: Run on a Device of Any Type | 30
    Binding a Queue to a Device, When Any Device Will Do | 34
  Method#2: Using the Host Device for Development and Debugging | 35
  Method#3: Using a GPU (or Other Accelerators) | 38
  Method#4: Using Multiple Devices | 43
  Method#5: Custom (Very Specific) Device Selection | 45
    device_selector Base Class | 45
    Mechanisms to Score a Device | 46
  Three Paths to Device Code Execution on CPU | 46
  Creating Work on a Device | 48
    Introducing the Task Graph | 48
    Where Is the Device Code? | 50
  Summary | 58

Chapter 3 Data Management | 61
  The Data Management Problem | 63
  Device Local vs. Device Remote | 63
  Managing Multiple Memories | 64
  Selecting the Right Strategy | 66
  Accessing Memory Through Pointers | 67
  Ordering the Uses of Data | 75
    In-Order Queues | 77
    Out-of-Order (OoO) Queues | 78
    Explicit Dependences with Events | 78
    Implicit Dependences with Accessors | 80
  Choosing a Data Management Strategy | 86
  Handler Class: Key Members | 87
  Summary | 90

Chapter 4 Expressing Parallelism | 91
  Parallelism Within Kernels | 92
  Overview of Language Features | 97
    Separating Kernels from Host Code | 97
    Different Forms of Parallel Kernels | 98
  Basic Data-Parallel Kernels | 99
    Understanding Basic Data-Parallel Kernels | 99
    Writing Basic Data-Parallel Kernels | 100
    Details of Basic Data-Parallel Kernels | 103
  Explicit ND-Range Kernels | 106
    Understanding Explicit ND-Range Parallel Kernels | 107
    Writing Explicit ND-Range Data-Parallel Kernels | 112
    Details of Explicit ND-Range Data-Parallel Kernels | 113
  Hierarchical Parallel Kernels | 118
    Understanding Hierarchical Data-Parallel Kernels | 119
    Writing Hierarchical Data-Parallel Kernels | 119
    Details of Hierarchical Data-Parallel Kernels | 122
  Mapping Computation to Work-Items | 124
  Summary | 129
|
Chapter 5 Error Handling | 131
  Let's Create Some Errors! | 135
  Application Error Handling Strategy | 138
    Synchronous Error Handling | 140
    Asynchronous Error Handling | 141
  Summary | 147

Chapter 6 Unified Shared Memory | 149
  Summary | 170
|
Chapter 7 Buffers | 173
  What Can We Do with a Buffer? | 181
  What Can We Do with an Accessor? | 191
  Summary | 192

Chapter 8 Scheduling Kernels and Data Movement | 195
  What Is Graph Scheduling? | 196
  How Command Groups Declare Dependences | 198
  When Are the Parts of a CG Executed? | 206
  Synchronizing with the Host | 209
  Summary | 211

Chapter 9 Communication and Synchronization | 213
  Work-Groups and Work-Items | 214
  Building Blocks for Efficient Communication | 215
    Synchronization via Barriers | 215
  Using Work-Group Barriers and Local Memory | 219
    Work-Group Barriers and Local Memory in ND-Range Kernels | 223
    Work-Group Barriers and Local Memory in Hierarchical Kernels | 226
  Synchronization via Sub-Group Barriers | 230
  Exchanging Data Within a Sub-Group | 231
  A Full Sub-Group ND-Range Kernel Example | 233
  Summary | 239

Chapter 10 Defining Kernels | 241
  Why Three Ways to Represent a Kernel? | 242
  Kernels As Lambda Expressions | 244
    Elements of a Kernel Lambda Expression | 244
    Naming Kernel Lambda Expressions | 247
  Kernels As Named Function Objects | 248
    Elements of a Kernel Named Function Object | 249
  Interoperability with Other APIs | 251
    Interoperability with API-Defined Source Languages | 252
    Interoperability with API-Defined Kernel Objects | 253
  Kernels in Program Objects | 255
  Summary | 257
|
Chapter 11 Vectors | 259
  How to Think About Vectors | 260
  Load and Store Member Functions | 267
  Vector Execution Within a Parallel Kernel | 270
  Summary | 275

Chapter 12 Device Information | 277
  Refining Kernel Code to Be More Prescriptive | 278
  How to Enumerate Devices and Capabilities | 280
    Being Curious: get_info<> | 285
    Being More Curious: Detailed Enumeration Code | 286
    Device Information Descriptors | 288
    Device-Specific Kernel Information Descriptors | 288
The Specifics: Those of "Correctness" |
|
|
289 | (4) |
|
|
290 | (2) |
|
|
292 | (1) |
|
The Specifics: Those of "Tuning/Optimization" |
|
|
293 | (1) |
|
|
293 | (1) |
|
|
294 | (1) |
|
  Runtime vs. Compile-Time Properties | 294
  Summary | 295

Chapter 13 Practical Tips | 297
  Getting a DPC++ Compiler and Code Samples | 297
  Online Forum and Documentation | 298
  Multiarchitecture Binaries | 300
  Adding SYCL to Existing C++ Programs | 303
  Debugging Runtime Failures | 307
  Initializing Data and Accessing Kernel Outputs | 310
  Multiple Translation Units | 319
    Performance Implications of Multiple Translation Units | 320
  When Anonymous Lambdas Need Names | 320
  Migrating from CUDA to SYCL | 321
  Summary | 322

Chapter 14 Common Parallel Patterns | 323
  Understanding the Patterns | 324
  Using Built-in Functions and Libraries | 333
    The DPC++ Reduction Library | 334
  Summary | 351

Chapter 15 Programming for GPUs | 353
  Simpler Processors (but More of Them) | 356
  Simplified Control Logic (SIMD Instructions) | 361
  Switching Work to Hide Latency | 367
  Offloading Kernels to GPUs | 369
    Beware the Cost of Offloading! | 372
  GPU Kernel Best Practices | 374
    Accessing Work-Group Local Memory | 378
    Avoiding Local Memory Entirely with Sub-Groups | 380
    Optimizing Computation Using Small Data Types | 381
    Optimizing Math Functions | 382
    Specialized Functions and Extensions | 382
  Summary | 384

Chapter 16 Programming for CPUs | 387
  The Basics of a General-Purpose CPU | 389
  The Basics of SIMD Hardware | 391
  Exploiting Thread-Level Parallelism | 398
    Be Mindful of First Touch to Memory | 405
  SIMD Vectorization on CPU | 406
    Ensure SIMD Execution Legality | 407
    Avoid Array-of-Struct for SIMD Efficiency | 411
    Data Type Impact on SIMD Efficiency | 413
    SIMD Execution Using single_task | 415
  Summary | 417

Chapter 17 Programming for FPGAs | 419
  Kernels Consume Chip "Area" | 427
  Custom Operations or Operation Widths | 429
  Low Latency and Rich Connectivity | 431
  Customized Memory Systems | 432
  Writing Kernels for FPGAs | 440
  Summary | 468

Chapter 18 Libraries | 471
  Use the sycl:: Prefix with Built-in Functions | 474
  Standard C++ APIs in DPC++ | 479
  Error Handling with DPC++ Execution Policies | 492
  Summary | 492

Chapter 19 Memory Model and Atomics | 495
  What Is in a Memory Model? | 497
    Data Races and Synchronization | 498
    The memory_order Enumeration Class | 508
    The memory_scope Enumeration Class | 511
    Querying Device Capabilities | 512
    Atomic Operations in DPC++ | 515
  Using Atomics in Real Life | 523
    Implementing Device-Wide Synchronization | 525
  Summary | 529

Epilogue: Future Direction of DPC++ | 531
  Alignment with C++20 and C++23 | 532
  Extension and Specialization Mechanism | 536

Index | 541