Foreword  xv
Preface  xvii
Acknowledgments  xxvii

Chapter 1  Introduction  1
    1.1 Heterogeneous parallel computing  3
    1.2 Why more speed or parallelism?  7
    1.3 Speeding up real applications  9
    1.4 Challenges in parallel programming  11
    1.5 Related parallel programming interfaces  13
    14
    1.7 Organization of the book  15
    19
|
Part I Fundamental Concepts |
|
|
|
Chapter 2  Heterogeneous data parallel computing  23
    23
    2.2 CUDA C program structure  27
    2.3 A vector addition kernel  28
    2.4 Device global memory and data transfer  31
    2.5 Kernel functions and threading  35
    2.6 Calling kernel functions  40
    42
    43
    44
    46
|
Chapter 3  Multidimensional grids and data  47
    3.1 Multidimensional grid organization  47
    3.2 Mapping threads to multidimensional data  51
    3.3 Image blur: a more complex kernel  58
    3.4 Matrix multiplication  62
    66
    67
|
Chapter 4  Compute architecture and scheduling  69
    4.1 Architecture of a modern GPU  70
    70
    4.3 Synchronization and transparent scalability  71
    4.4 Warps and SIMD hardware  74
    79
    4.6 Warp scheduling and latency tolerance  83
    4.7 Resource partitioning and occupancy  85
    4.8 Querying device properties  87
    90
    90
    92
|
Chapter 5  Memory architecture and data locality  93
    5.1 Importance of memory access efficiency  94
    96
    5.3 Tiling for reduced memory traffic  103
    5.4 A tiled matrix multiplication kernel  107
    112
    5.6 Impact of memory usage on occupancy  115
    118
    119
|
Chapter 6  Performance considerations  123
    124
    6.2 Hiding memory latency  133
    138
    6.4 A checklist of optimizations  141
    6.5 Knowing your computation's bottleneck  145
    146
    146
    147
|
Part II Parallel Patterns |
|
|
|
Chapter 7  Convolution: An introduction to constant memory and caching  151
    152
    7.2 Parallel convolution: a basic algorithm  156
    7.3 Constant memory and caching  159
    7.4 Tiled convolution with halo cells  163
    7.5 Tiled convolution using caches for halo cells  168
    170
    171
|
|
Chapter 8  Stencil  173
    174
    8.2 Parallel stencil: a basic algorithm  178
    8.3 Shared memory tiling for stencil sweep  179
    183
    186
    188
    188
|
Chapter 9  Parallel histogram  191
    192
    9.2 Atomic operations and a basic histogram kernel  194
    9.3 Latency and throughput of atomic operations  198
    200
    203
    206
    208
    209
    210
|
Chapter 10  Reduction: And minimizing divergence  211
    211
    213
    10.3 A simple reduction kernel  217
    10.4 Minimizing control divergence  219
    10.5 Minimizing memory divergence  223
    10.6 Minimizing global memory accesses  225
    10.7 Hierarchical reduction for arbitrary input length  226
    10.8 Thread coarsening for reduced overhead  228
    231
    232
|
Chapter 11  Prefix sum (scan): An introduction to work efficiency in parallel algorithms  235
    236
    11.2 Parallel scan with the Kogge-Stone algorithm  238
    11.3 Speed and work efficiency consideration  244
    11.4 Parallel scan with the Brent-Kung algorithm  246
    11.5 Coarsening for even more work efficiency  251
    11.6 Segmented parallel scan for arbitrary-length inputs  253
    11.7 Single-pass scan for memory access efficiency  256
    259
    260
    261
|
Chapter 12  Merge: An introduction to dynamic input data identification  263
    263
    12.2 A sequential merge algorithm  265
    12.3 A parallelization approach  266
    12.4 Co-rank function implementation  268
    12.5 A basic parallel merge kernel  273
    12.6 A tiled merge kernel to improve coalescing  275
    12.7 A circular buffer merge kernel  282
    12.8 Thread coarsening for merge  288
    288
    289
    289
|
Part III Advanced Patterns and Applications |
|
|
|
|
Chapter 13  Sorting  293
    294
    295
    296
    13.4 Optimizing for memory coalescing  300
    13.5 Choice of radix value  302
    13.6 Thread coarsening to improve coalescing  305
    306
    13.8 Other parallel sort methods  308
    309
    310
    310
|
Chapter 14  Sparse matrix computation  311
    312
    14.2 A simple SpMV kernel with the COO format  314
    14.3 Grouping row nonzeros with the CSR format  317
    14.4 Improving memory coalescing with the ELL format  320
    14.5 Regulating padding with the hybrid ELL-COO format  324
    14.6 Reducing control divergence with the JDS format  325
    328
    329
    329
|
Chapter 15  Graph traversal  331
    332
    15.2 Breadth-first search  335
    15.3 Vertex-centric parallelization of breadth-first search  338
    15.4 Edge-centric parallelization of breadth-first search  343
    15.5 Improving efficiency with frontiers  345
    15.6 Reducing contention with privatization  348
    350
    352
    353
    354
|
|
Chapter 16  Deep learning  355
    356
    16.2 Convolutional neural networks  366
    16.3 Convolutional layer: a CUDA inference kernel  376
    16.4 Formulating a convolutional layer as GEMM  379
    385
    387
    388
    388
|
Chapter 17  Iterative magnetic resonance imaging reconstruction  391
    391
    17.2 Iterative reconstruction  394
    396
    412
    413
    414
|
Chapter 18  Electrostatic potential map  415
    415
    18.2 Scatter versus gather in kernel design  417
    422
    424
    18.5 Cutoff binning for data size scalability  425
    430
    431
    431
|
Chapter 19  Parallel programming and computational thinking  433
    19.1 Goals of parallel computing  433
    436
    19.3 Problem decomposition  440
    19.4 Computational thinking  444
    446
    446
|
Part IV Advanced Practices |
|
|
|
Chapter 20  Programming a heterogeneous computing cluster: An introduction to CUDA streams  449
    449
    450
    20.3 Message passing interface basics  452
    20.4 Message passing interface point-to-point communication  455
    20.5 Overlapping computation and communication  462
    20.6 Message passing interface collective communication  470
    20.7 CUDA-aware message passing interface  471
    472
    472
    473
|
Chapter 21  CUDA dynamic parallelism  475
    476
    21.2 Dynamic parallelism overview  478
    21.3 An example: Bezier curves  481
    21.4 A recursive example: quadtrees  484
    21.5 Important considerations  490
    492
    493
    A21.1 Support code for quadtree example  495
    497
|
Chapter 22  Advanced practices and future evolution  499
    22.1 Model of host/device interaction  500
    22.2 Kernel execution control  505
    22.3 Memory bandwidth and compute throughput  508
    22.4 Programming environment  510
    513
    513
|
Chapter 23  Conclusion and outlook  515
    515
    516

Appendix A  Numerical considerations  519
Index  537