
E-book: Multicore and GPU Programming: An Integrated Approach

4.00/5 (11 ratings by Goodreads)
By Gerassimos Barlas (Professor, Computer Science and Engineering Department, American University of Sharjah, UAE)
  • Format: PDF+DRM
  • Pub. Date: 16-Dec-2014
  • Publisher: Morgan Kaufmann Publishers Inc.
  • Language: English
  • ISBN-13: 9780124171404
  • Price: 75,06 €*
  • * The price is final, i.e., no additional discount will apply
  • This ebook is for personal use only. E-Books are non-refundable.

DRM restrictions

  • Copying (copy/paste): not allowed
  • Printing: not allowed

  • Usage:

    Digital Rights Management (DRM)
    The publisher has supplied this book in encrypted form, which means that you need to install free software in order to unlock and read it. To read this e-book you have to create an Adobe ID. The e-book can be read and downloaded on up to 6 devices (single user with the same Adobe ID).

    Required software
    To read this e-book on a mobile device (phone or tablet) you'll need to install this free app: PocketBook Reader (iOS/Android).

    To download and read this e-book on a PC or Mac you need Adobe Digital Editions (a free app specially developed for e-books; it is not the same as Adobe Reader, which you probably already have on your computer).

    You cannot read this e-book on an Amazon Kindle.

Multicore and GPU Programming offers broad coverage of the key parallel computing skillsets: multicore CPU programming and manycore "massively parallel" computing. Using threads, OpenMP, MPI, and CUDA, it teaches the design and development of software capable of taking advantage of today’s computing platforms incorporating CPU and GPU hardware and explains how to transition from sequential programming to a parallel computing paradigm.

Presenting material refined over more than a decade of teaching parallel computing, author Gerassimos Barlas minimizes the challenge with multiple examples, extensive case studies, and full source code. Using this book, you can develop programs that run over distributed memory machines using MPI, create multi-threaded applications with either libraries or directives, write optimized applications that balance the workload between available computing resources, and profile and debug programs targeting multicore machines.

  • Comprehensive coverage of all major multicore programming tools, including threads, OpenMP, MPI, and CUDA
  • Demonstrates parallel programming design patterns and examples of how different tools and paradigms can be integrated for superior performance
  • Particular focus on the emerging area of divisible load theory and its impact on load balancing and distributed systems
  • Download source code, examples, and instructor support materials on the book's companion website
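
As a flavor of the sequential-to-parallel transition the description mentions, the sketch below places a plain summation loop next to an OpenMP version of the same loop. This is a minimal illustration assuming a C++ compiler with OpenMP support, not an excerpt from the book; the variable names and problem size are invented for the example.

#include <cstdio>
#include <vector>

int main() {
    const int N = 1000000;                // illustrative problem size
    std::vector<double> data(N, 1.0);

    // Sequential version: a plain accumulation loop.
    double serialSum = 0.0;
    for (int i = 0; i < N; ++i)
        serialSum += data[i];

    // OpenMP version: the directive splits the iterations across threads,
    // and the reduction clause gives each thread a private partial sum
    // that is combined at the end, avoiding a race on the accumulator.
    double parallelSum = 0.0;
    #pragma omp parallel for reduction(+:parallelSum)
    for (int i = 0; i < N; ++i)
        parallelSum += data[i];

    std::printf("serial = %.1f, parallel = %.1f\n", serialSum, parallelSum);
    return 0;
}

Compiled without OpenMP support (e.g., g++ without -fopenmp), the pragma is simply ignored and the second loop runs sequentially, which is what makes this style of incremental parallelization approachable for readers coming from sequential code.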


The only book covering both traditional and massively parallel computing
Table of contents

List of Tables xiii
Preface xv
Chapter 1 Introduction 1(26)
1.1 The era of multicore machines 1(2)
1.2 A taxonomy of parallel machines 3(2)
1.3 A glimpse of contemporary computing machines 5(9)
1.3.1 The Cell BE processor 6(1)
1.3.2 Nvidia's Kepler 7(3)
1.3.3 AMD's APUs 10(1)
1.3.4 Multicore to many-core: Tilera's TILE-Gx8072 and Intel's Xeon Phi 11(3)
1.4 Performance metrics 14(4)
1.5 Predicting and measuring parallel program performance 18(8)
1.5.1 Amdahl's law 21(3)
1.5.2 Gustafson-Barsis's rebuttal 24(2)
Exercises 26(1)
Chapter 2 Multicore and parallel program design 27(28)
2.1 Introduction 27(1)
2.2 The PCAM methodology 28(4)
2.3 Decomposition patterns 32(15)
2.3.1 Task parallelism 33(1)
2.3.2 Divide-and-conquer decomposition 34(2)
2.3.3 Geometric decomposition 36(3)
2.3.4 Recursive data decomposition 39(3)
2.3.5 Pipeline decomposition 42(4)
2.3.6 Event-based coordination decomposition 46(1)
2.4 Program structure patterns 47(6)
2.4.1 Single-program, multiple-data 48(1)
2.4.2 Multiple-program, multiple-data 48(1)
2.4.3 Master-worker 49(1)
2.4.4 Map-reduce 50(1)
2.4.5 Fork/join 51(2)
2.4.6 Loop parallelism 53(1)
2.5 Matching decomposition patterns with program structure patterns 53(1)
Exercises 54(1)
Chapter 3 Shared-memory programming: threads 55(110)
3.1 Introduction 55(3)
3.2 Threads 58(10)
3.2.1 What is a thread? 58(1)
3.2.2 What are threads good for? 59(1)
3.2.3 Thread creation and initialization 59(6)
3.2.4 Sharing data between threads 65(3)
3.3 Design concerns 68(2)
3.4 Semaphores 70(5)
3.5 Applying semaphores in classical problems 75(24)
3.5.1 Producers-consumers 75(4)
3.5.2 Dealing with termination 79(11)
3.5.3 The barbershop problem: introducing fairness 90(5)
3.5.4 Readers-writers 95(4)
3.6 Monitors 99(8)
3.6.1 Design approach 1: critical section inside the monitor 103(1)
3.6.2 Design approach 2: monitor controls entry to critical section 104(3)
3.7 Applying monitors in classical problems 107(13)
3.7.1 Producers-consumers revisited 107(6)
3.7.2 Readers-writers 113(7)
3.8 Dynamic vs. static thread management 120(10)
3.8.1 Qt's thread pool 120(1)
3.8.2 Creating and managing a pool of threads 121(9)
3.9 Debugging multithreaded applications 130(5)
3.10 Higher-level constructs: multithreaded programming without threads 135(25)
3.10.1 Concurrent map 136(2)
3.10.2 Map-reduce 138(2)
3.10.3 Concurrent filter 140(2)
3.10.4 Filter-reduce 142(1)
3.10.5 A case study: multithreaded sorting 143(9)
3.10.6 A case study: multithreaded image matching 152(8)
Exercises 160(5)
Chapter 4 Shared-memory programming: OpenMP 165(74)
4.1 Introduction 165(1)
4.2 Your first OpenMP program 166(3)
4.3 Variable scope 169(10)
4.3.1 OpenMP integration V.0: manual partitioning 171(2)
4.3.2 OpenMP integration V.1: manual partitioning without a race condition 173(2)
4.3.3 OpenMP integration V.2: implicit partitioning with locking 175(1)
4.3.4 OpenMP integration V.3: implicit partitioning with reduction 176(2)
4.3.5 Final words on variable scope 178(1)
4.4 Loop-level parallelism 179(16)
4.4.1 Data dependencies 181(10)
4.4.2 Nested loops 191(1)
4.4.3 Scheduling 192(3)
4.5 Task parallelism 195(13)
4.5.1 The sections directive 196(6)
4.5.2 The task directive 202(6)
4.6 Synchronization constructs 208(8)
4.7 Correctness and optimization issues 216(10)
4.7.1 Thread safety 216(4)
4.7.2 False sharing 220(6)
4.8 A case study: sorting in OpenMP 226(11)
4.8.1 Bottom-up mergesort in OpenMP 227(3)
4.8.2 Top-down mergesort in OpenMP 230(5)
4.8.3 Performance comparison 235(2)
Exercises 237(2)
Chapter 5 Distributed memory programming 239(152)
5.1 Communicating processes 239(1)
5.2 MPI 240(1)
5.3 Core concepts 241(1)
5.4 Your first MPI program 242(4)
5.5 Program architecture 246(2)
5.5.1 SPMD 246(1)
5.5.2 MPMD 246(2)
5.6 Point-to-Point communication 248(4)
5.7 Alternative Point-to-Point communication modes 252(3)
5.7.1 Buffered communications 253(2)
5.8 Non blocking communications 255(4)
5.9 Point-to-Point communications: summary 259(1)
5.10 Error reporting and handling 259(2)
5.11 Collective communications 261(28)
5.11.1 Scattering 266(6)
5.11.2 Gathering 272(2)
5.11.3 Reduction 274(5)
5.11.4 All-to-all gathering 279(4)
5.11.5 All-to-all scattering 283(5)
5.11.6 All-to-all reduction 288(1)
5.11.7 Global synchronization 289(1)
5.12 Communicating objects 289(11)
5.12.1 Derived datatypes 290(7)
5.12.2 Packing/unpacking 297(3)
5.13 Node management: communicators and groups 300(5)
5.13.1 Creating groups 300(2)
5.13.2 Creating intra-communicators 302(3)
5.14 One-sided communications 305(12)
5.14.1 RMA communication functions 307(1)
5.14.2 RMA synchronization functions 308(9)
5.15 I/O considerations 317(8)
5.16 Combining MPI processes with threads 325(3)
5.17 Timing and performance measurements 328(1)
5.18 Debugging and profiling MPI programs 329(4)
5.19 The Boost.MPI library 333(14)
5.19.1 Blocking and non blocking communications 335(5)
5.19.2 Data serialization 340(3)
5.19.3 Collective operations 343(4)
5.20 A case study: diffusion-limited aggregation 347(5)
5.21 A case study: brute-force encryption cracking 352(10)
5.21.1 Version #1: "plain-vanilla" MPI 352(6)
5.21.2 Version #2: combining MPI and OpenMP 358(4)
5.22 A case study: MPI implementation of the master-worker pattern 362(24)
5.22.1 A simple master-worker setup 363(8)
5.22.2 A multithreaded master-worker setup 371(15)
Exercises 386(5)
Chapter 6 GPU programming 391(136)
6.1 GPU programming 391(3)
6.2 CUDA's programming model: threads, blocks, and grids 394(6)
6.3 CUDA's execution model: streaming multiprocessors and warps 400(3)
6.4 CUDA compilation process 403(4)
6.5 Putting together a CUDA project 407(3)
6.6 Memory hierarchy 410(22)
6.6.1 Local memory/registers 416(1)
6.6.2 Shared memory 417(8)
6.6.3 Constant memory 425(7)
6.6.4 Texture and surface memory 432(1)
6.7 Optimization techniques 432(39)
6.7.1 Block and grid design 432(10)
6.7.2 Kernel structure 442(4)
6.7.3 Shared memory access 446(8)
6.7.4 Global memory access 454(4)
6.7.5 Page-locked and zero-copy memory 458(3)
6.7.6 Unified memory 461(3)
6.7.7 Asynchronous execution and streams 464(7)
6.8 Dynamic parallelism 471(4)
6.9 Debugging CUDA programs 475(1)
6.10 Profiling CUDA programs 476(4)
6.11 CUDA and MPI 480(5)
6.12 Case studies 485(38)
6.12.1 Fractal set calculation 486(10)
6.12.2 Block cipher encryption 496(27)
Exercises 523(4)
Chapter 7 The Thrust template library 527(48)
7.1 Introduction 527(1)
7.2 First steps in Thrust 528(4)
7.3 Working with Thrust datatypes 532(3)
7.4 Thrust algorithms 535(18)
7.4.1 Transformations 536(4)
7.4.2 Sorting and searching 540(6)
7.4.3 Reductions 546(2)
7.4.4 Scans/prefix sums 548(2)
7.4.5 Data management and manipulation 550(3)
7.5 Fancy iterators 553(6)
7.6 Switching device back ends 559(2)
7.7 Case studies 561(10)
7.7.1 Monte Carlo integration 561(3)
7.7.2 DNA sequence alignment 564(7)
Exercises 571(4)
Chapter 8 Load balancing 575(54)
8.1 Introduction 575(1)
8.2 Dynamic load balancing: the Linda legacy 576(2)
8.3 Static load balancing: the divisible load theory approach 578(23)
8.3.1 Modeling costs 579(7)
8.3.2 Communication configuration 586(3)
8.3.3 Analysis 589(9)
8.3.4 Summary - short literature review 598(3)
8.4 DLTlib: A library for partitioning workloads 601(3)
8.5 Case studies 604(23)
8.5.1 Hybrid computation of a Mandelbrot set "movie": a case study in dynamic load balancing 604(13)
8.5.2 Distributed block cipher encryption: a case study in static load balancing 617(10)
Exercises 627(2)
Appendix A Compiling Qt programs 629(2)
A.1 Using an IDE 629(1)
A.2 The qmake Utility 629(2)
Appendix B Running MPI programs: preparatory and configuration steps 631(4)
B.1 Preparatory steps 631(1)
B.2 Computing nodes discovery for MPI program deployment 632(3)
B.2.1 Host discovery with the nmap utility 632(1)
B.2.2 Automatic generation of a hostfile 633(2)
Appendix C Time measurement 635(6)
C.1 Introduction 635(1)
C.2 POSIX high-resolution timing 635(2)
C.3 Timing in Qt 637(1)
C.4 Timing in OpenMP 638(1)
C.5 Timing in MPI 638(1)
C.6 Timing in CUDA 638(3)
Appendix D Boost.MPI 641(2)
D.1 Mapping from MPI C to Boost.MPI 641(2)
Appendix E Setting up CUDA 643(6)
E.1 Installation 643(1)
E.2 Issues with GCC 643(1)
E.3 Running CUDA without an Nvidia GPU 644(1)
E.4 Running CUDA on Optimus-equipped laptops 645(1)
E.5 Combining CUDA with third-party libraries 646(3)
Appendix F DLTlib 649(10)
F.1 DLTlib Functions 649(8)
F.1.1 Class Network: generic methods 650(2)
F.1.2 Class Network: query processing 652(1)
F.1.3 Class Network: image processing 653(1)
F.1.4 Class Network: image registration 654(3)
F.2 DLTlib Files 657(2)
Glossary 659(2)
Bibliography 661(4)
Index 665

Gerassimos Barlas is a Professor in the Computer Science & Engineering Department at the American University of Sharjah, Sharjah, UAE. His research interests include parallel algorithms; development, analysis, and modeling frameworks for load balancing; and distributed video on-demand. Prof. Barlas has taught parallel computing for more than 12 years, has been involved with parallel computing since the early 1990s, and is active in the emerging field of Divisible Load Theory for parallel and distributed systems.