
An Introduction to Parallel Programming, 2nd edition [Paperback]

Peter Pacheco (University of San Francisco, USA), Matthew Malensek (Assistant Professor, Department of Computer Science, University of San Francisco, CA, USA)
  • Format: Paperback / softback, 496 pages, height x width: 235x191 mm, weight: 770 g, 65 illustrations
  • Publication date: 15-Nov-2021
  • Publisher: Morgan Kaufmann Publishers Inc.
  • ISBN-10: 0128046058
  • ISBN-13: 9780128046050

An Introduction to Parallel Programming, Second Edition presents a tried-and-true tutorial approach that shows students how to develop effective parallel programs with MPI, Pthreads and OpenMP.

As the first undergraduate text to directly address compiling and running parallel programs on multi-core and cluster architectures, this second edition carries forward its clear explanations for designing, debugging and evaluating the performance of distributed and shared-memory programs while adding coverage of accelerators via new content on GPU programming and heterogeneous programming. New and improved user-friendly exercises teach students how to compile, run and modify example programs.

  • Takes a tutorial approach, starting with small programming examples and building progressively to more challenging examples
  • Explains how to develop parallel programs using the MPI, Pthreads and OpenMP programming models (see the short MPI sketch after this list)
  • A robust package of online ancillaries for instructors and students includes lecture slides, solutions manual, downloadable source code, and an image bank
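
To give a concrete sense of the MPI programming model mentioned above, here is a minimal "hello, world" sketch in C, the kind of program the book's tutorial chapters start from. It is an illustrative example written for this listing, not code taken from the book; the mpicc and mpiexec commands in the comment are the usual MPI tool names and are assumed here.

    /* Minimal MPI "hello, world" sketch (illustrative only; not from the book).
     * Compile:  mpicc -o mpi_hello mpi_hello.c
     * Run:      mpiexec -n 4 ./mpi_hello                                      */
    #include <stdio.h>
    #include <mpi.h>

    int main(void) {
        int comm_sz;    /* number of processes in the communicator */
        int my_rank;    /* rank of this process                    */

        MPI_Init(NULL, NULL);                       /* start up MPI  */
        MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);
        MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

        printf("Hello from process %d of %d\n", my_rank, comm_sz);

        MPI_Finalize();                             /* shut down MPI */
        return 0;
    }

Every process runs the same executable and branches on its rank, the SPMD style introduced in Sections 3.1.4-3.1.5 of the contents below.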

New to this edition:

  • New chapters on GPU programming and heterogeneous programming
  • New examples and exercises related to parallel algorithms (an illustrative sketch of one such algorithm follows below)

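As an illustration of the kind of parallel algorithm those new examples and exercises revolve around, the sketch below parallelizes odd-even transposition sort (a topic that appears in Sections 3.7.2 and 5.6.2 of the contents) with one OpenMP directive per phase. It is a minimal, assumed example written for this listing, not code from the book.

    /* Odd-even transposition sort parallelized with OpenMP (illustrative sketch only).
     * Compile:  gcc -fopenmp -o odd_even odd_even.c                                   */
    #include <stdio.h>

    void odd_even_sort(int a[], int n) {
        for (int phase = 0; phase < n; phase++) {
            if (phase % 2 == 0) {
                /* Even phase: compare-swap the pairs (a[0],a[1]), (a[2],a[3]), ... */
                #pragma omp parallel for
                for (int i = 1; i < n; i += 2)
                    if (a[i - 1] > a[i]) {
                        int tmp = a[i - 1]; a[i - 1] = a[i]; a[i] = tmp;
                    }
            } else {
                /* Odd phase: compare-swap the pairs (a[1],a[2]), (a[3],a[4]), ... */
                #pragma omp parallel for
                for (int i = 1; i < n - 1; i += 2)
                    if (a[i] > a[i + 1]) {
                        int tmp = a[i]; a[i] = a[i + 1]; a[i + 1] = tmp;
                    }
            }
        }
    }

    int main(void) {
        int a[] = {9, 4, 7, 1, 8, 3, 6, 2, 5, 0};
        int n = sizeof a / sizeof a[0];

        odd_even_sort(a, n);
        for (int i = 0; i < n; i++)
            printf("%d ", a[i]);
        printf("\n");
        return 0;
    }

The pairs compared within a single phase are disjoint, so the iterations of each inner loop are independent and can be divided among threads; n phases are enough to sort n keys.
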
Other information

Leading undergraduate text in parallel programming, covering OpenMP, MPI and Pthreads, three of the most widely used parallel programming environments

Table of contents

Preface xv
Chapter 1 Why parallel computing 1(16)
1.1 Why we need ever-increasing performance 2(1)
1.2 Why we're building parallel systems 3(1)
1.3 Why we need to write parallel programs 3(3)
1.4 How do we write parallel programs? 6(2)
1.5 What we'll be doing 8(2)
1.6 Concurrent, parallel, distributed 10(1)
1.7 The rest of the book 11(1)
1.8 A word of warning 11(1)
1.9 Typographical conventions 12(1)
1.10 Summary 12(2)
1.11 Exercises 14(3)
Chapter 2 Parallel hardware and parallel software 17(72)
2.1 Some background 17(3)
2.1.1 The von Neumann architecture 17(2)
2.1.2 Processes, multitasking, and threads 19(1)
2.2 Modifications to the von Neumann model 20(11)
2.2.1 The basics of caching 21(2)
2.2.2 Cache mappings 23(1)
2.2.3 Caches and programs: an example 24(1)
2.2.4 Virtual memory 25(2)
2.2.5 Instruction-level parallelism 27(3)
2.2.6 Hardware multithreading 30(1)
2.3 Parallel hardware 31(18)
2.3.1 Classifications of parallel computers 31(1)
2.3.2 SIMD systems 31(3)
2.3.3 MIMD systems 34(3)
2.3.4 Interconnection networks 37(8)
2.3.5 Cache coherence 45(4)
2.3.6 Shared-memory vs. distributed-memory 49(1)
2.4 Parallel software 49(11)
2.4.1 Caveats 50(1)
2.4.2 Coordinating the processes/threads 50(1)
2.4.3 Shared-memory 51(4)
2.4.4 Distributed-memory 55(3)
2.4.5 GPU programming 58(2)
2.4.6 Programming hybrid systems 60(1)
2.5 Input and output 60(1)
2.5.1 MIMD systems 60(1)
2.5.2 GPUs 61(1)
2.6 Performance 61(10)
2.6.1 Speedup and efficiency in MIMD systems 61(4)
2.6.2 Amdahl's law 65(1)
2.6.3 Scalability in MIMD systems 66(1)
2.6.4 Taking timings of MIMD programs 67(3)
2.6.5 GPU performance 70(1)
2.7 Parallel program design 71(4)
2.7.1 An example 71(4)
2.8 Writing and running parallel programs 75(2)
2.9 Assumptions 77(1)
2.10 Summary 78(6)
2.10.1 Serial systems 78(1)
2.10.2 Parallel hardware 79(2)
2.10.3 Parallel software 81(1)
2.10.4 Input and output 82(1)
2.10.5 Performance 82(1)
2.10.6 Parallel program design 83(1)
2.10.7 Assumptions 84(1)
2.11 Exercises 84(5)
Chapter 3 Distributed memory programming with MPI 89(70)
3.1 Getting started 90(10)
3.1.1 Compilation and execution 91(1)
3.1.2 MPI programs 92(1)
3.1.3 MPI_Init and MPI_Finalize 93(1)
3.1.4 Communicators, MPI_Comm_size, and MPI_Comm_rank 94(1)
3.1.5 SPMD programs 94(1)
3.1.6 Communication 95(1)
3.1.7 MPI_Send 95(2)
3.1.8 MPI_Recv 97(1)
3.1.9 Message matching 97(1)
3.1.10 The status_p argument 98(1)
3.1.11 Semantics of MPI_Send and MPI_Recv 99(1)
3.1.12 Some potential pitfalls 100(1)
3.2 The trapezoidal rule in MPI 100(4)
3.2.1 The trapezoidal rule 101(1)
3.2.2 Parallelizing the trapezoidal rule 102(2)
3.3 Dealing with I/O 104(4)
3.3.1 Output 105(2)
3.3.2 Input 107(1)
3.4 Collective communication 108(15)
3.4.1 Tree-structured communication 108(2)
3.4.2 MPI_Reduce 110(2)
3.4.3 Collective vs. point-to-point communications 112(1)
3.4.4 MPI_Allreduce 113(1)
3.4.5 Broadcast 113(3)
3.4.6 Data distributions 116(1)
3.4.7 Scatter 117(2)
3.4.8 Gather 119(2)
3.4.9 Allgather 121(2)
3.5 MPI-derived datatypes 123(4)
3.6 Performance evaluation of MPI programs 127(8)
3.6.1 Taking timings 127(3)
3.6.2 Results 130(3)
3.6.3 Speedup and efficiency 133(1)
3.6.4 Scalability 134(1)
3.7 A parallel sorting algorithm 135(9)
3.7.1 Some simple serial sorting algorithms 135(2)
3.7.2 Parallel odd-even transposition sort 137(3)
3.7.3 Safety in MPI programs 140(3)
3.7.4 Final details of parallel odd-even sort 143(1)
3.8 Summary 144(4)
3.9 Exercises 148(7)
3.10 Programming assignments 155(4)
Chapter 4 Shared-memory programming with Pthreads 159(62)
4.1 Processes, threads, and Pthreads 160(1)
4.2 Hello, world 161(8)
4.2.1 Execution 161(2)
4.2.2 Preliminaries 163(1)
4.2.3 Starting the threads 164(2)
4.2.4 Running the threads 166(1)
4.2.5 Stopping the threads 167(1)
4.2.6 Error checking 168(1)
4.2.7 Other approaches to thread startup 168(1)
4.3 Matrix-vector multiplication 169(2)
4.4 Critical sections 171(4)
4.5 Busy-waiting 175(3)
4.6 Mutexes 178(4)
4.7 Producer-consumer synchronization and semaphores 182(5)
4.8 Barriers and condition variables 187(6)
4.8.1 Busy-waiting and a mutex 188(1)
4.8.2 Semaphores 189(1)
4.8.3 Condition variables 190(3)
4.8.4 Pthreads barriers 193(1)
4.9 Read-write locks 193(9)
4.9.1 Sorted linked list functions 193(1)
4.9.2 A multithreaded linked list 194(4)
4.9.3 Pthreads read-write locks 198(2)
4.9.4 Performance of the various implementations 200(1)
4.9.5 Implementing read-write locks 201(1)
4.10 Caches, cache-coherence, and false sharing 202(5)
4.11 Thread-safety 207(3)
4.11.1 Incorrect programs can produce correct output 210(1)
4.12 Summary 210(2)
4.13 Exercises 212(7)
4.14 Programming assignments 219(2)
Chapter 5 Shared-memory programming with OpenMP 221(70)
5.1 Getting started 222(6)
5.1.1 Compiling and running OpenMP programs 223(1)
5.1.2 The program 224(3)
5.1.3 Error checking 227(1)
5.2 The trapezoidal rule 228(5)
5.2.1 A first OpenMP version 229(4)
5.3 Scope of variables 233(1)
5.4 The reduction clause 234(3)
5.5 The parallel for directive 237(8)
5.5.1 Caveats 238(1)
5.5.2 Data dependences 239(2)
5.5.3 Finding loop-carried dependences 241(1)
5.5.4 Estimating π 241(3)
5.5.5 More on scope 244(1)
5.6 More about loops in OpenMP: sorting 245(4)
5.6.1 Bubble sort 245(1)
5.6.2 Odd-even transposition sort 246(3)
5.7 Scheduling loops 249(7)
5.7.1 The schedule clause 250(2)
5.7.2 The static schedule type 252(1)
5.7.3 The dynamic and guided schedule types 252(1)
5.7.4 The runtime schedule type 253(2)
5.7.5 Which schedule? 255(1)
5.8 Producers and consumers 256(9)
5.8.1 Queues 256(1)
5.8.2 Message-passing 256(1)
5.8.3 Sending messages 257(1)
5.8.4 Receiving messages 257(1)
5.8.5 Termination detection 258(1)
5.8.6 Startup 259(1)
5.8.7 The atomic directive 259(1)
5.8.8 Critical sections and locks 260(2)
5.8.9 Using locks in the message-passing program 262(1)
5.8.10 Critical directives, atomic directives, or locks? 263(1)
5.8.11 Some caveats 264(1)
5.9 Caches, cache coherence, and false sharing 265(5)
5.10 Tasking 270(4)
5.11 Thread-safety 274(3)
5.11.1 Incorrect programs can produce correct output 276(1)
5.12 Summary 277(4)
5.13 Exercises 281(5)
5.14 Programming assignments 286(5)
Chapter 6 GPU programming with CUDA 291(70)
6.1 GPUs and GPGPU 291(1)
6.2 GPU architectures 292(1)
6.3 Heterogeneous computing 293(2)
6.4 CUDA hello 295(3)
6.4.1 The source code 296(1)
6.4.2 Compiling and running the program 296(2)
6.5 A closer look 298(1)
6.6 Threads, blocks, and grids 299(2)
6.7 Nvidia compute capabilities and device architectures 301(1)
6.8 Vector addition 302(10)
6.8.1 The kernel 303(2)
6.8.2 Get_args 305(1)
6.8.3 Allocate_vectors and managed memory 305(2)
6.8.4 Other functions called from main 307(2)
6.8.5 Explicit memory transfers 309(3)
6.9 Returning results from CUDA kernels 312(2)
6.10 CUDA trapezoidal rule I 314(6)
6.10.1 The trapezoidal rule 314(1)
6.10.2 A CUDA implementation 315(1)
6.10.3 Initialization, return value, and final update 316(2)
6.10.4 Using the correct threads 318(1)
6.10.5 Updating the return value and atomicAdd 318(1)
6.10.6 Performance of the CUDA trapezoidal rule 319(1)
6.11 CUDA trapezoidal rule II: improving performance 320(8)
6.11.1 Tree-structured communication 320(2)
6.11.2 Local variables, registers, shared and global memory 322(2)
6.11.3 Warps and warp shuffles 324(1)
6.11.4 Implementing tree-structured global sum with a warp shuffle 325(1)
6.11.5 Shared memory and an alternative to the warp shuffle 326(2)
6.12 Implementation of trapezoidal rule with warpSize thread blocks 328(3)
6.12.1 Host code 329(1)
6.12.2 Kernel with warp shuffle 329(1)
6.12.3 Kernel with shared memory 329(1)
6.12.4 Performance 330(1)
6.13 CUDA trapezoidal rule III: blocks with more than one warp 331(7)
6.13.1 __syncthreads 331(2)
6.13.2 More shared memory 333(1)
6.13.3 Shared memory warp sums 333(1)
6.13.4 Shared memory banks 334(2)
6.13.5 Finishing up 336(1)
6.13.6 Performance 336(2)
6.14 Bitonic sort 338(11)
6.14.1 Serial bitonic sort 338(4)
6.14.2 Butterflies and binary representations 342(3)
6.14.3 Parallel bitonic sort I 345(2)
6.14.4 Parallel bitonic sort II 347(1)
6.14.5 Performance of CUDA bitonic sort 348(1)
6.15 Summary 349(6)
6.16 Exercises 355(3)
6.17 Programming assignments 358(3)
Chapter 7 Parallel program development 361(94)
7.1 Two n-body solvers 361(37)
7.1.1 The problem 361(2)
7.1.2 Two serial programs 363(5)
7.1.3 Parallelizing the n-body solvers 368(3)
7.1.4 A word about I/O 371(1)
7.1.5 Parallelizing the basic solver using OpenMP 371(4)
7.1.6 Parallelizing the reduced solver using OpenMP 375(4)
7.1.7 Evaluating the OpenMP codes 379(1)
7.1.8 Parallelizing the solvers using Pthreads 380(1)
7.1.9 Parallelizing the basic solver using MPI 381(2)
7.1.10 Parallelizing the reduced solver using MPI 383(6)
7.1.11 Performance of the MPI solvers 389(1)
7.1.12 Parallelizing the basic solver using CUDA 390(3)
7.1.13 A note on cooperative groups in CUDA 393(1)
7.1.14 Performance of the basic CUDA n-body solver 393(1)
7.1.15 Improving the performance of the CUDA n-body solver 394(1)
7.1.16 Using shared memory in the n-body solver 395(3)
7.2 Sample sort 398(41)
7.2.1 Sample sort and bucket sort 398(2)
7.2.2 Choosing the sample 400(1)
7.2.3 A simple implementation of the Map function 401(1)
7.2.4 An alternative implementation of Map 402(5)
7.2.5 Parallelizing sample sort 407(3)
7.2.6 Implementing sample sort with OpenMP 410(5)
7.2.7 Implementing sample sort with Pthreads 415(3)
7.2.8 Implementing sample sort with MPI 418(11)
7.2.9 Implementing sample sort with CUDA 429(10)
7.3 A word of caution 439(1)
7.4 Which API? 439(1)
7.5 Summary 440(4)
7.5.1 MPI 442(2)
7.6 Exercises 444(7)
7.7 Programming assignments 451(4)
Chapter 8 Where to go from here 455(4)
Bibliography 459(4)
Index 463

About the authors

Peter Pacheco received a PhD in mathematics from Florida State University. After completing graduate school, he became one of the first professors in UCLA's "Program in Computing," which teaches basic computer science to students in the College of Letters and Sciences there. Since leaving UCLA, he has been on the faculty of the University of San Francisco. At USF Peter has served as chair of the computer science department and is currently chair of the mathematics department. His research is in parallel scientific computing. He has worked on the development of parallel software for circuit simulation, speech recognition, and the simulation of large networks of biologically accurate neurons. Peter has been teaching parallel computing at both the undergraduate and graduate levels for nearly twenty years. He is the author of Parallel Programming with MPI, published by Morgan Kaufmann Publishers.

Matthew Malensek is an Assistant Professor in the Department of Computer Science at the University of San Francisco. His research interests center on big data, parallel/distributed systems, and cloud computing, including systems approaches for processing and managing data at scale in a variety of domains, such as fog computing and Internet of Things (IoT) devices.