Klienditugi: 7440010 (E-R 10-18)

Abi | Registreeri | Logi sisse

Multicore and GPU Programming: An Integrated Approach 2nd edition [Pehme köide]

4.00/5 (7 hinnangut Goodreads-ist)

Gerassimos Barlas (Professor, Computer Science and Engineering Department, American University of Sharjah, UAE)

Formaat: Paperback / softback, 1024 pages, kõrgus x laius: 235x191 mm, kaal: 1520 g
Ilmumisaeg: 08-Aug-2022
Kirjastus: Morgan Kaufmann Publishers In
ISBN-10: 0128141204
ISBN-13: 9780128141205

Teised raamatud teemal:

Programming & scripting languages: general - (Hetkel poes: 1 nimetust)
Computer architecture & logic design

Pehme köide
Hind: 118,89 €
Raamatu kohalejõudmiseks kirjastusest kulub orienteeruvalt 2-4 nädalat
Kogus:
- - 1
  - 2
  - 3
  - 4
  - 5
  - 6
  - 7
  - 8
  - 9
  - 10
Lisa ostukorvi
Tasuta tarne
Tellimisaeg 2-4 nädalat
Lisa soovinimekirja

Formaat: Paperback / softback, 1024 pages, kõrgus x laius: 235x191 mm, kaal: 1520 g
Ilmumisaeg: 08-Aug-2022
Kirjastus: Morgan Kaufmann Publishers In
ISBN-10: 0128141204
ISBN-13: 9780128141205

Teised raamatud teemal:

Programming & scripting languages: general - (Hetkel poes: 1 nimetust)
Computer architecture & logic design

Püsilink: https://www.kriso.ee/db/9780128141205.html

Märksõnad:

Multicore and GPU Programming: An Integrated Approach offers broad coverage of the key parallel computing skillsets: multicore CPU programming and manycore "massively parallel" computing. Using threads, OpenMP, MPI, CUDA, and other current tools it teaches the design and development of software capable of taking advantage of today’s computing platforms incorporating CPU and GPU hardware and explains how to transition from sequential programming to a parallel computing paradigm. Presenting material refined over more than a decade of teaching parallel computing, author Gerassimos Barlas minimizes the challenge with multiple examples, extensive case studies, and full source code. Using this book, readers can develop programs that run over distributed memory machines using MPI, create multi-threaded applications with either libraries or directives, write optimized applications that balance the workload between available computing resources, and profile and debug programs targeting multicore machines.

Comprehensive coverage of all major multicore programming tools, including threads, OpenMP, MPI, and CUDA, with coverage of OpenCL and OpenACC added
Demonstrates parallel programming design patterns and examples of how different tools and paradigms can be integrated for superior performance
New features in the second edition include the use of the C++14 standard for all sample code, a new chapter on concurrent data structures, and the latest research on load balancing
Download source code, examples, and instructor support materials on the book’s companion website

List of tables

Preface

xvii

PART 1 Introduction

Chapter 1 Introduction

(28)

1.1 The era of multicore machines

(2)

1.2 A taxonomy of parallel machines

(2)

1.3 A glimpse of influential computing machines

(10)

1.3.1 The Cell BE processor

(1)

1.3.2 NVidia's Ampere

(3)

1.3.3 Multicore to many-core: TILERA's TILE-Gx8072 and Intel's Xeon Phi

(2)

1.3.4 AMD's Epyc Rome: scaling up with smaller chips

(2)

1.3.5 Fujitsu A64FX: compute and memory integration

(1)

1.4 Performance metrics

(4)

1.5 Predicting and measuring parallel program performance

(10)

1.5.1 Amdahl's law

(2)

1.5.2 Gustafson-Barsis' rebuttal

(1)

Exercises

(3)

Chapter 2 Multicore and parallel program design

(34)

2.1 Introduction

(1)

2.2 The PCAM methodology

(4)

2.3 Decomposition patterns

(18)

2.3.1 Task parallelism

(1)

2.3.2 Divide-and-conquer decomposition

(3)

2.3.3 Geometric decomposition

(3)

2.3.4 Recursive data decomposition

(6)

2.3.5 Pipeline decomposition

(4)

2.3.6 Event-based coordination decomposition

(1)

2.4 Program structure patterns

(6)

2.4.1 Single program, multiple data

(1)

2.4.2 Multiple program, multiple data

(1)

2.4.3 Master-worker

(1)

2.4.4 Map-reduce

(1)

2.4.5 Fork/join

(2)

2.4.6 Loop parallelism

(1)

2.5 Matching decomposition patterns with program structure patterns

(5)

Exercises

(4)

PART 2 Programming with threads and processes

Chapter 3 Threads and concurrency in standard C++

(116)

3.1 Introduction

(3)

3.2 Threads

(1)

3.2.1 What is a thread?

(1)

3.2.2 What are threads good for?

(1)

3.3 Thread creation and initialization

(8)

3.4 Sharing data between threads

(3)

3.5 Design concerns

(2)

3.6 Semaphores

(5)

3.7 Applying semaphores in classical problems

(30)

3.7.1 Producers-consumers

(4)

3.7.2 Dealing with termination

(12)

3.7.3 The barbershop problem - introducing fairness

105

(6)

3.7.4 Readers-writers

111

(6)

3.8 Atomic data types

117

(9)

3.8.1 Memory ordering

122

(4)

3.9 Monitors

126

(12)

3.9.1 Design approach #1: critical section inside the monitor

131

(1)

3.9.2 Design approach #2: monitor controls entry to critical section

132

(4)

3.9.3 General semaphores revisited

136

(2)

3.10 Applying monitors in classical problems

138

(14)

3.10.1 Producers-consumers revisited

138

(7)

3.10.2 Readers-writers

145

(7)

3.11 Asynchronous threads

152

(4)

3.12 Dynamic vs. static thread management

156

(9)

3.13 Threads and fibers

165

(7)

3.14 Debugging multi-threaded applications

172

(9)

Exercises

177

(4)

Chapter 4 Parallel data structures

181

(50)

4.1 Introduction

181

(4)

4.2 Lock-based structures

185

(18)

4.2.1 Queues

185

(4)

4.2.2 Lists

189

(14)

4.3 Lock-free structures

203

(24)

4.3.1 Lock-free stacks

204

(5)

4.3.2 A bounded lock-free queue: first attempt

209

(7)

4.3.3 The ABA problem

216

(2)

4.3.4 A fixed bounded lock-free queue

218

(4)

4.3.5 An unbounded lock-free queue

222

(5)

4.4 Closing remarks

227

(4)

Exercises

228

(3)

Chapter 5 Distributed memory programming

231

(158)

5.1 Introduction

231

(1)

5.2 MPI

232

(2)

5.3 Core concepts

234

(1)

5.4 Your first MPI program

234

(4)

5.5 Program architecture

238

(3)

5.5.1 SPMD

238

(2)

5.5.2 MPMD

240

(1)

5.6 Point-to-point communication

241

(4)

5.7 Alternative point-to-point communication modes

245

(3)

5.7.1 Buffered communications

246

(2)

5.8 Non-blocking communications

248

(4)

5.9 Point-to-point communications: summary

252

(1)

5.10 Error reporting & handling

252

(2)

5.11 Collective communications

254

(29)

5.11.1 Scattering

259

(6)

5.11.2 Gathering

265

(2)

5.11.3 Reduction

267

(4)

5.11.4 All-to-all gathering

271

(5)

5.11.5 All-to-all scattering

276

(6)

5.11.6 All-to-all reduction

282

(1)

5.11.7 Global synchronization

282

(1)

5.12 Persistent communications

283

(3)

5.13 Big-count communications in MPI 4.0

286

(1)

5.14 Partitioned communications

287

(2)

5.15 Communicating objects

289

(11)

5.15.1 Derived datatypes

291

(7)

5.15.2 Packing/unpacking

298

(2)

5.16 Node management: communicators and groups

300

(6)

5.16.1 Creating groups

301

(2)

5.16.2 Creating intracommunicators

303

(3)

5.17 One-sided communication

306

(12)

5.17.1 RMA communication functions

307

(2)

5.17.2 RMA synchronization functions

309

(9)

5.18 I/O considerations

318

(8)

5.19 Combining MPI processes with threads

326

(2)

5.20 Timing and performance measurements

328

(1)

5.21 Debugging, profiling, and tracing MPI programs

329

(7)

5.21.1 Brief introduction to Scalasca

330

(4)

5.21.2 Brief introduction to TAU

334

(2)

5.22 The Boost.MPI library

336

(13)

5.22.1 Blocking and non-blocking communications

337

(5)

5.22.2 Data serialization

342

(3)

5.22.3 Collective operations

345

(4)

5.23 A case study: diffusion-limited aggregation

349

(6)

5.24 A case study: brute-force encryption cracking

355

(6)

5.25 A case study: MPI implementation of the master-worker pattern

361

(28)

5.25.1 A simple master-worker setup

361

(8)

5.25.2 A multi-threaded master-worker setup

369

(15)

Exercises

384

(5)

Chapter 6 GPU programming: CUDA

389

(194)

6.1 Introduction

389

(3)

6.2 CUDA's programming model: threads, blocks, and grids

392

(6)

6.3 CUDA's execution model: streaming multiprocessors and warps

398

(3)

6.4 CUDA compilation process

401

(5)

6.5 Putting together a CUDA project

406

(3)

6.6 Memory hierarchy

409

(24)

6.6.1 Local memory/registers

416

(1)

6.6.2 Shared memory

417

(9)

6.6.3 Constant memory

426

(7)

6.6.4 Texture and surface memory

433

(1)

6.7 Optimization techniques

433

(49)

6.7.1 Block and grid design

433

(12)

6.7.2 Kernel structure

445

(8)

6.7.3 Shared memory access

453

(9)

6.7.4 Global memory access

462

(12)

6.7.5 Asynchronous execution and streams: overlapping GPU memory transfers and more

474

(8)

6.8 Graphs

482

(10)

6.8.1 Creating a graph using the CUDA graph API

483

(6)

6.8.2 Creating a graph by capturing a stream

489

(3)

6.9 Warp functions

492

(9)

6.10 Cooperative groups

501

(22)

6.10.1 Intrablock cooperative groups

501

(13)

6.10.2 Interblock cooperative groups

514

(5)

6.10.3 Grid-level reduction

519

(4)

6.11 Dynamic parallelism

523

(4)

6.12 Debugging CUDA programs

527

(2)

6.13 Profiling CUDA programs

529

(4)

6.14 CUDA and MPI

533

(6)

6.15 Case studies

539

(44)

6.15.1 Fractal set calculation

540

(11)

6.15.2 Block cipher encryption

551

(27)

Exercises

578

(5)

Chapter 7 GPU and accelerator programming: OpenCL

583

(100)

7.1 The OpenCL architecture

583

(2)

7.2 The platform model

585

(5)

7.3 The execution model

590

(3)

7.4 The programming model

593

(14)

7.4.1 Summarizing the structure of an OpenCL program

603

(4)

7.5 The memory model

607

(33)

7.5.1 Buffer objects

609

(9)

7.5.2 Local memory

618

(1)

7.5.3 Image objects

619

(12)

7.5.4 Pipe objects

631

(9)

7.6 Shared virtual memory

640

(4)

7.7 Atomics and synchronization

644

(5)

7.8 Work group functions

649

(6)

7.9 Events and profiling OpenCL programs

655

(2)

7.10 OpenCL and other parallel software platforms

657

(3)

7.11 Case study: Mandelbrot set

660

(23)

7.11.1 Calculating the Mandelbrot set using OpenCL

661

(7)

7.11.2 Hybrid calculation of the Mandelbrot set using OpenCL and C++11

668

(6)

7.11.3 Hybrid calculation of the Mandelbrot set using OpenCL on both host and device

674

(3)

7.11.4 Performance comparison

677

(1)

Exercises

677

(6)

PART 3 Higher-level parallel programming

Chapter 8 Shared-memory programming: OpenMP

683

(120)

8.1 Introduction

683

(1)

8.2 Your first OpenMP program

684

(5)

8.3 Variable scope

689

(10)

8.3.1 OpenMP integration V.0: manual partitioning

691

(2)

8.3.2 OpenMP integration V. 1: manual partitioning without a race condition

693

(1)

8.3.3 OpenMP integration V.2: implicit partitioning with locking

694

(2)

8.3.4 OpenMP integration V.3: implicit partitioning with reduction

696

(2)

8.3.5 Final words on variable scope

698

(1)

8.4 Loop-level parallelism

699

(17)

8.4.1 Data dependencies

701

(10)

8.4.2 Nested loops

711

(1)

8.4.3 Scheduling

712

(4)

8.5 Task parallelism

716

(23)

8.5.1 The secti ons directive

716

(6)

8.5.2 The task directive

722

(5)

8.5.3 Task dependencies

727

(4)

8.5.4 The taskl oop directive

731

(1)

8.5.5 The taskgroup directive and task-level reduction

732

(7)

8.6 Synchronization constructs

739

(7)

8.7 Cancellation constructs

746

(1)

8.8 SIMD extensions

747

(4)

8.9 Offloading to devices

751

(15)

8.9.1 Device work-sharing directives

753

(5)

8.9.2 Device memory management directives

758

(6)

8.9.3 CUDA interoperability

764

(2)

8.10 The loop construct

766

(1)

8.11 Thread affinity

767

(4)

8.12 Correctness and optimization issues

771

(13)

8.12.1 Thread safety

771

(7)

8.12.2 False-sharing

778

(6)

8.13 A case study: sorting in OpenMP

784

(9)

8.13.1 Bottom-up mergesort in OpenMP

784

(3)

8.13.2 Top-down mergesort in OpenMP

787

(6)

8.13.3 Performance comparison

793

(1)

8.14 A case study: brute-force encryption cracking, combining MPI and OpenMP

793

(10)

Exercises

797

(6)

Chapter 9 High-level multi-threaded programming with the Qt library

803

(32)

9.1 Introduction

803

(1)

9.2 Implicit thread creation

804

(2)

9.3 Qt's pool of threads

806

(2)

9.4 Higher-level constructs - multi-threaded programming without threads!

808

(27)

9.4.1 Concurrent map

808

(3)

9.4.2 Map-reduce

811

(2)

9.4.3 Concurrent filter

813

(2)

9.4.4 Filter-reduce

815

(1)

9.4.5 A case study: multi-threaded sorting

816

(9)

9.4.6 A case study: multi-threaded image matching

825

(8)

Exercises

833

(2)

Chapter 10 The Thrust template library

835

(52)

10.1 Introduction

835

(1)

10.2 First steps in Thrust

836

(4)

10.3 Working with Thrust datatypes

840

(4)

10.4 Thrust algorithms

844

(18)

10.4.1 Transformations

844

(5)

10.4.2 Sorting & searching

849

(5)

10.4.3 Reductions

854

(3)

10.4.4 Scans/prefix-sums

857

(2)

10.4.5 Data management and reordering

859

(3)

10.5 Fancy iterators

862

(6)

10.6 Switching device back-ends

868

(2)

10.7 Thrust execution policies and asynchronous execution

870

(2)

10.8 Case studies

872

(15)

10.8.1 Monte Carlo integration

872

(4)

10.8.2 DNA sequence alignment

876

(7)

Exercises

883

(4)

PART 4 Advanced topics

Chapter 11 Load balancing

887

(56)

11.1 Introduction

887

(1)

11.2 Dynamic load balancing: the Linda legacy

888

(2)

11.3 Static load balancing: the divisible load theory approach

890

(24)

11.3.1 Modeling costs

891

(7)

11.3.2 Communication configuration

898

(3)

11.3.3 Analysis

901

(10)

11.3.4 Summary - short literature review

911

(3)

11.4 DLTLib: a library for partitioning workloads

914

(3)

11.5 Case studies

917

(26)

11.5.1 Hybrid computation of a Mandelbrot set "movie": a case study in dynamic load balancing

917

(13)

11.5.2 Distributed block cipher encryption: a case study in static load balancing

930

(10)

Exercises

940

(3)

Appendix A Creating Qt programs

943

(2)

A.1 Using an IDE

943

(1)

A.2 The qmake utility

943

(2)

Appendix B Running MPI programs: preparatory and configuration steps

945

(4)

B.1 Preparatory steps

945

(1)

B.2 Computing nodes discovery for MPI program deployment

946

(3)

B.2.1 Host discovery with the nmap utility

946

(1)

B.2.2 Automatic generation of a hostfile

947

(2)

Appendix C Time measurement

949

(6)

C.1 Introduction

949

(1)

C.2 POSIX high-resolution timing

949

(2)

C.3 Timing in C++11

951

(1)

C.4 Timing in Qt

952

(1)

C.5 Timing in OpenMP

952

(1)

C.6 Timing in MPI

953

(1)

C.7 Timing in CUDA

953

(2)

Appendix D Boost.MPI

955

(2)

D.1 Mapping from MPI C to Boost.MPI

955

(2)

Appendix E Setting up CUDA

957

(4)

E.1 Installation

957

(1)

E.2 Issues with GCC

957

(1)

E.3 Combining CUDA with third-party libraries

958

(3)

Appendix F OpenCL helper functions

961

(6)

F.1 Function readCLFromFile

961

(1)

F.2 Function is Error

962

(1)

F.3 Function getCompi 1 ationError

963

(1)

F.4 Function handl eError

963

(1)

F.5 Function setupDevice

964

(1)

F.6 Function setupProgramAndKernel

965

(2)

Appendix G DLTlib

967

(10)

G.1 DLTlib functions

967

(8)

G.1.1 Class Network: generic methods

968

(2)

G.1.2 Class Network: query processing

970

(1)

G.1.3 Class Network: image processing

971

(2)

G.1.4 Class Network: image registration

973

(2)

G.2 DLTlib files

975

(2)

Glossary

977

(2)

Bibliography

979

(4)

Index

983

Gerassimos Barlas is a Professor with the Computer Science & Engineering Department, American University of Sharjah, Sharjah, UAE. His research interest includes parallel algorithms, development, analysis and modeling frameworks for load balancing, and distributed Video on-Demand. Prof. Barlas has taught parallel computing for more than 12 years, has been involved with parallel computing since the early 90s, and is active in the emerging field of Divisible Load Theory for parallel and distributed systems.

Multicore and GPU Programming: An Integrated Approach 2nd edition [Pehme köide]

Konto & seaded

Otsing

Otsingu andmebaas

Filtreeri tulemusi

Teemad Ingliskeelsed raamatud

Vali ostukorv