List of Tables .... xiii
Preface .... xv

Chapter 1  Introduction .... 1
  1.1 The era of multicore machines .... 1
  1.2 A taxonomy of parallel machines .... 3
  1.3 A glimpse of contemporary computing machines .... 5
    1.3.1 The Cell BE processor .... 6
    1.3.2 Nvidia's Kepler .... 7
    1.3.3 AMD's APUs .... 10
    1.3.4 Multicore to many-core: Tilera's TILE-Gx8072 and Intel's Xeon Phi .... 11
  1.4 Performance metrics .... 14
  1.5 Predicting and measuring parallel program performance .... 18
    1.5.1 Amdahl's law .... 21
    1.5.2 Gustafson-Barsis's rebuttal .... 24
  Exercises .... 26

Chapter 2  Multicore and parallel program design .... 27
  2.1 Introduction .... 27
  2.2 The PCAM methodology .... 28
  2.3 Decomposition patterns .... 32
    2.3.1 Task parallelism .... 33
    2.3.2 Divide-and-conquer decomposition .... 34
    2.3.3 Geometric decomposition .... 36
    2.3.4 Recursive data decomposition .... 39
    2.3.5 Pipeline decomposition .... 42
    2.3.6 Event-based coordination decomposition .... 46
  2.4 Program structure patterns .... 47
    2.4.1 Single-program, multiple-data .... 48
    2.4.2 Multiple-program, multiple-data .... 48
    2.4.3 Master-worker .... 49
    2.4.4 Map-reduce .... 50
    2.4.5 Fork/join .... 51
    2.4.6 Loop parallelism .... 53
  2.5 Matching decomposition patterns with program structure patterns .... 53
  Exercises .... 54

Chapter 3  Shared-memory programming: threads .... 55
  3.1 Introduction .... 55
  3.2 Threads .... 58
    3.2.1 What is a thread? .... 58
    3.2.2 What are threads good for? .... 59
    3.2.3 Thread creation and initialization .... 59
    3.2.4 Sharing data between threads .... 65
  3.3 Design concerns .... 68
  3.4 Semaphores .... 70
  3.5 Applying semaphores in classical problems .... 75
    3.5.1 Producers-consumers .... 75
    3.5.2 Dealing with termination .... 79
    3.5.3 The barbershop problem: introducing fairness .... 90
    3.5.4 Readers-writers .... 95
  3.6 Monitors .... 99
    3.6.1 Design approach 1: critical section inside the monitor .... 103
    3.6.2 Design approach 2: monitor controls entry to critical section .... 104
  3.7 Applying monitors in classical problems .... 107
    3.7.1 Producers-consumers revisited .... 107
    3.7.2 Readers-writers .... 113
  3.8 Dynamic vs. static thread management .... 120
    3.8.1 Qt's thread pool .... 120
    3.8.2 Creating and managing a pool of threads .... 121
  3.9 Debugging multithreaded applications .... 130
  3.10 Higher-level constructs: multithreaded programming without threads .... 135
    3.10.1 Concurrent map .... 136
    3.10.2 Map-reduce .... 138
    3.10.3 Concurrent filter .... 140
    3.10.4 Filter-reduce .... 142
    3.10.5 A case study: multithreaded sorting .... 143
    3.10.6 A case study: multithreaded image matching .... 152
  Exercises .... 160

Chapter 4  Shared-memory programming: OpenMP .... 165
  4.1 Introduction .... 165
  4.2 Your first OpenMP program .... 166
  4.3 Variable scope .... 169
    4.3.1 OpenMP integration V.0: manual partitioning .... 171
    4.3.2 OpenMP integration V.1: manual partitioning without a race condition .... 173
    4.3.3 OpenMP integration V.2: implicit partitioning with locking .... 175
    4.3.4 OpenMP integration V.3: implicit partitioning with reduction .... 176
    4.3.5 Final words on variable scope .... 178
  4.4 Loop-level parallelism .... 179
    4.4.1 Data dependencies .... 181
    4.4.2 Nested loops .... 191
    4.4.3 Scheduling .... 192
  4.5 Task parallelism .... 195
    4.5.1 The sections directive .... 196
    4.5.2 The task directive .... 202
  4.6 Synchronization constructs .... 208
  4.7 Correctness and optimization issues .... 216
    4.7.1 Thread safety .... 216
    4.7.2 False sharing .... 220
  4.8 A case study: sorting in OpenMP .... 226
    4.8.1 Bottom-up mergesort in OpenMP .... 227
    4.8.2 Top-down mergesort in OpenMP .... 230
    4.8.3 Performance comparison .... 235
  Exercises .... 237

Chapter 5  Distributed memory programming .... 239
  5.1 Communicating processes .... 239
  5.2 MPI .... 240
  5.3 Core concepts .... 241
  5.4 Your first MPI program .... 242
  5.5 Program architecture .... 246
    5.5.1 SPMD .... 246
    5.5.2 MPMD .... 246
  5.6 Point-to-point communication .... 248
  5.7 Alternative point-to-point communication modes .... 252
    5.7.1 Buffered communications .... 253
  5.8 Non-blocking communications .... 255
  5.9 Point-to-point communications: summary .... 259
  5.10 Error reporting and handling .... 259
  5.11 Collective communications .... 261
    5.11.1 Scattering .... 266
    5.11.2 Gathering .... 272
    5.11.3 Reduction .... 274
    5.11.4 All-to-all gathering .... 279
    5.11.5 All-to-all scattering .... 283
    5.11.6 All-to-all reduction .... 288
    5.11.7 Global synchronization .... 289
  5.12 Communicating objects .... 289
    5.12.1 Derived datatypes .... 290
    5.12.2 Packing/unpacking .... 297
  5.13 Node management: communicators and groups .... 300
    5.13.1 Creating groups .... 300
    5.13.2 Creating intra-communicators .... 302
  5.14 One-sided communications .... 305
    5.14.1 RMA communication functions .... 307
    5.14.2 RMA synchronization functions .... 308
  5.15 I/O considerations .... 317
  5.16 Combining MPI processes with threads .... 325
  5.17 Timing and performance measurements .... 328
  5.18 Debugging and profiling MPI programs .... 329
  5.19 The Boost.MPI library .... 333
    5.19.1 Blocking and non-blocking communications .... 335
    5.19.2 Data serialization .... 340
    5.19.3 Collective operations .... 343
  5.20 A case study: diffusion-limited aggregation .... 347
  5.21 A case study: brute-force encryption cracking .... 352
    5.21.1 Version #1: "plain-vanilla" MPI .... 352
    5.21.2 Version #2: combining MPI and OpenMP .... 358
  5.22 A case study: MPI implementation of the master-worker pattern .... 362
    5.22.1 A simple master-worker setup .... 363
    5.22.2 A multithreaded master-worker setup .... 371
  Exercises .... 386

Chapter 6  GPU programming .... 391
  6.1 GPU programming .... 391
  6.2 CUDA's programming model: threads, blocks, and grids .... 394
  6.3 CUDA's execution model: streaming multiprocessors and warps .... 400
  6.4 CUDA compilation process .... 403
  6.5 Putting together a CUDA project .... 407
  6.6 Memory hierarchy .... 410
    6.6.1 Local memory/registers .... 416
    6.6.2 Shared memory .... 417
    6.6.3 Constant memory .... 425
    6.6.4 Texture and surface memory .... 432
  6.7 Optimization techniques .... 432
    6.7.1 Block and grid design .... 432
    6.7.2 Kernel structure .... 442
    6.7.3 Shared memory access .... 446
    6.7.4 Global memory access .... 454
    6.7.5 Page-locked and zero-copy memory .... 458
    6.7.6 Unified memory .... 461
    6.7.7 Asynchronous execution and streams .... 464
  6.8 Dynamic parallelism .... 471
  6.9 Debugging CUDA programs .... 475
  6.10 Profiling CUDA programs .... 476
  6.11 CUDA and MPI .... 480
  6.12 Case studies .... 485
    6.12.1 Fractal set calculation .... 486
    6.12.2 Block cipher encryption .... 496
  Exercises .... 523

Chapter 7  The Thrust template library .... 527
  7.1 Introduction .... 527
  7.2 First steps in Thrust .... 528
  7.3 Working with Thrust datatypes .... 532
  7.4 Thrust algorithms .... 535
    7.4.1 Transformations .... 536
    7.4.2 Sorting and searching .... 540
    7.4.3 Reductions .... 546
    7.4.4 Scans/prefix sums .... 548
    7.4.5 Data management and manipulation .... 550
  7.5 Fancy iterators .... 553
  7.6 Switching device back ends .... 559
  7.7 Case studies .... 561
    7.7.1 Monte Carlo integration .... 561
    7.7.2 DNA sequence alignment .... 564
  Exercises .... 571

Chapter 8  Load balancing .... 575
  8.1 Introduction .... 575
  8.2 Dynamic load balancing: the Linda legacy .... 576
  8.3 Static load balancing: the divisible load theory approach .... 578
    8.3.1 Modeling costs .... 579
    8.3.2 Communication configuration .... 586
    8.3.3 Analysis .... 589
    8.3.4 Summary: short literature review .... 598
  8.4 DLTlib: a library for partitioning workloads .... 601
  8.5 Case studies .... 604
    8.5.1 Hybrid computation of a Mandelbrot set "movie": a case study in dynamic load balancing .... 604
    8.5.2 Distributed block cipher encryption: a case study in static load balancing .... 617
  Exercises .... 627

Appendix A  Compiling Qt programs .... 629
  A.1 .... 629
  A.2 .... 629

Appendix B  Running MPI programs: preparatory and configuration steps .... 631
  B.1 .... 631
  B.2 Computing nodes discovery for MPI program deployment .... 632
    B.2.1 Host discovery with the nmap utility .... 632
    B.2.2 Automatic generation of a hostfile .... 633

Appendix C  Time measurement .... 635
  C.1 .... 635
  C.2 POSIX high-resolution timing .... 635
  C.3 .... 637
  C.4 .... 638
  C.5 .... 638
  C.6 .... 638

Appendix D  Boost.MPI .... 641
  D.1 Mapping from MPI C to Boost.MPI .... 641

Appendix E  Setting up CUDA .... 643
  E.1 .... 643
  E.2 .... 643
  E.3 Running CUDA without an Nvidia GPU .... 644
  E.4 Running CUDA on Optimus-equipped laptops .... 645
  E.5 Combining CUDA with third-party libraries .... 646

Appendix F  DLTlib .... 649
  F.1 DLTlib functions .... 649
    F.1.1 Class Network: generic methods .... 650
    F.1.2 Class Network: query processing .... 652
    F.1.3 Class Network: image processing .... 653
    F.1.4 Class Network: image registration .... 654
  F.2 DLTlib files .... 657

Glossary .... 659
Bibliography .... 661
Index .... 665