| Preface |
|
xiii | |
|
|
|
xix | |
|
|
|
1 | (28) |
|
|
|
1 | (1) |
|
1.2 Toward Automating Parallel Programming |
|
|
2 | (2) |
|
|
|
4 | (8) |
|
1.4 Parallel Computing Design Considerations |
|
|
12 | (1) |
|
1.5 Parallel Algorithms and Parallel Architectures |
|
|
13 | (1) |
|
1.6 Relating Parallel Algorithm and Parallel Architecture |
|
|
14 | (1) |
|
1.7 Implementation of Algorithms: A Two-Sided Problem |
|
|
14 | (1) |
|
1.8 Measuring Benefits of Parallel Computing |
|
|
15 | (4) |
|
1.9 Amdahl's Law for Multiprocessor Systems |
|
|
19 | (2) |
|
1.10 Gustafson-Barsis's Law |
|
|
21 | (1) |
|
1.11 Applications of Parallel Computing |
|
|
22 | (7) |
|
2 Enhancing Uniprocessor Performance |
|
|
29 | (24) |
|
|
|
29 | (1) |
|
2.2 Increasing Processor Clock Frequency |
|
|
30 | (1) |
|
2.3 Parallelizing ALU Structure |
|
|
30 | (3) |
|
2.4 Using Memory Hierarchy |
|
|
33 | (6) |
|
|
|
39 | (5) |
|
2.6 Very Long Instruction Word (VLIW) Processors |
|
|
44 | (1) |
|
2.7 Instruction-Level Parallelism (ILP) and Superscalar Processors |
|
|
45 | (4) |
|
2.8 Multithreaded Processor |
|
|
49 | (4) |
|
|
|
53 | (16) |
|
|
|
53 | (1) |
|
|
|
53 | (1) |
|
3.3 Shared-Memory Multiprocessors (Uniform Memory Access [ UMA]) |
|
|
54 | (2) |
|
3.4 Distributed-Memory Multiprocessor (Nonuniform Memory Access [ NUMA]) |
|
|
56 | (1) |
|
|
|
57 | (1) |
|
|
|
57 | (3) |
|
|
|
60 | (1) |
|
3.8 Grid (Cloud) Computing |
|
|
60 | (1) |
|
|
|
61 | (1) |
|
|
|
62 | (2) |
|
3.11 Communication Between Parallel Processors |
|
|
64 | (3) |
|
3.12 Summary of Parallel Architectures |
|
|
67 | (2) |
|
4 Shared-Memory Multiprocessors |
|
|
69 | (14) |
|
|
|
69 | (1) |
|
4.2 Cache Coherence and Memory Consistency |
|
|
70 | (6) |
|
4.3 Synchronization and Mutual Exclusion |
|
|
76 | (7) |
|
5 Interconnection Networks |
|
|
83 | (22) |
|
|
|
83 | (1) |
|
5.2 Classification of Interconnection Networks by Logical Topologies |
|
|
84 | (7) |
|
5.3 Interconnection Network Switch Architecture |
|
|
91 | (14) |
|
|
|
105 | (26) |
|
|
|
105 | (1) |
|
6.2 Concurrency Platforms |
|
|
105 | (1) |
|
|
|
106 | (6) |
|
|
|
112 | (10) |
|
6.5 Compute Unified Device Architecture (CUDA) |
|
|
122 | (9) |
|
7 Ad Hoc Techniques for Parallel Algorithms |
|
|
131 | (12) |
|
|
|
131 | (2) |
|
7.2 Defining Algorithm Variables |
|
|
133 | (1) |
|
7.3 Independent Loop Scheduling |
|
|
133 | (1) |
|
|
|
134 | (1) |
|
7.5 Loop Spreading for Simple Dependent Loops |
|
|
135 | (1) |
|
|
|
135 | (1) |
|
|
|
136 | (1) |
|
7.8 Divide-and-Conquer (Recursive Partitioning) Strategies |
|
|
137 | (2) |
|
|
|
139 | (4) |
|
8 Nonserial-Parallel Algorithms |
|
|
143 | (16) |
|
|
|
143 | (1) |
|
8.2 Comparing DAG and DCG Algorithms |
|
|
143 | (2) |
|
8.3 Parallelizing NSPA Algorithms Represented by a DAG |
|
|
145 | (2) |
|
8.4 Formal Technique for Analyzing NSPAs |
|
|
147 | (3) |
|
8.5 Detecting Cycles in the Algorithm |
|
|
150 | (1) |
|
8.6 Extracting Serial and Parallel Algorithm Performance Parameters |
|
|
151 | (2) |
|
|
|
153 | (3) |
|
8.8 Performance of Serial and Parallel Algorithms on Parallel Computers |
|
|
156 | (3) |
|
|
|
159 | (8) |
|
|
|
159 | (1) |
|
9.2 Definition of z-Transform |
|
|
159 | (1) |
|
9.3 The I-D FIR Digital Filter Algorithm |
|
|
160 | (1) |
|
9.4 Software and Hardware Implementations of the z-Transform |
|
|
161 | (1) |
|
9.5 Design 1: Using Horner's Rule for Broadcast Input and Pipelined Output |
|
|
162 | (1) |
|
9.6 Design 2: Pipelined Input and Broadcast Output |
|
|
163 | (1) |
|
9.7 Design 3: Pipelined Input and Output |
|
|
164 | (3) |
|
10 Dependence Graph Analysis |
|
|
167 | (18) |
|
|
|
167 | (1) |
|
10.2 The 1-D FIR Digital Filter Algorithm |
|
|
167 | (1) |
|
10.3 The Dependence Graph of an Algorithm |
|
|
168 | (1) |
|
10.4 Deriving the Dependence Graph for an Algorithm |
|
|
168 | (3) |
|
10.5 The Scheduling Function for the 1-D FIR Filter |
|
|
171 | (6) |
|
10.6 Node Projection Operation |
|
|
177 | (2) |
|
10.7 Nonlinear Projection Operation |
|
|
179 | (1) |
|
10.8 Software and Hardware Implementations of the DAG Technique |
|
|
180 | (5) |
|
11 Computational Geometry Analysis |
|
|
185 | (24) |
|
|
|
185 | (1) |
|
11.2 Matrix Multiplication Algorithm |
|
|
185 | (1) |
|
11.3 The 3-D Dependence Graph and Computation Domain D |
|
|
186 | (2) |
|
11.4 The Facets and Vertices of D |
|
|
188 | (1) |
|
11.5 The Dependence Matrices of the Algorithm Variables |
|
|
188 | (1) |
|
11.6 Nullspace of Dependence Matrix: The Broadcast Subdomain B |
|
|
189 | (3) |
|
11.7 Design Space Exploration: Choice of Broadcasting versus Pipelining Variables |
|
|
192 | (3) |
|
|
|
195 | (5) |
|
11.9 Projection Operation Using the Linear Projection Operator |
|
|
200 | (5) |
|
11.10 Effect of Projection Operation on Data |
|
|
205 | (1) |
|
11.11 The Resulting Multithreaded/Multiprocessor Architecture |
|
|
206 | (1) |
|
11.12 Summary of Work Done in this Chapter |
|
|
207 | (2) |
|
12 Case Study: One-Dimensional IIR Digital Filters |
|
|
209 | (10) |
|
|
|
209 | (1) |
|
12.2 The 1-D IIR Digital Filter Algorithm |
|
|
209 | (1) |
|
12.3 The IIR Filter Dependence Graph |
|
|
209 | (7) |
|
12.4 z-Domain Analysis of 1-D IIR Digital Filter Algorithm |
|
|
216 | (3) |
|
13 Case Study: Two- and Three-Dimensional Digital Filters |
|
|
219 | (8) |
|
|
|
219 | (1) |
|
13.2 Line and Frame Wraparound Problems |
|
|
219 | (2) |
|
13.3 2-D Recursive Filters |
|
|
221 | (2) |
|
|
|
223 | (4) |
|
14 Case Study: Multirate Decimators and Interpolators |
|
|
227 | (18) |
|
|
|
227 | (1) |
|
14.2 Decimator Structures |
|
|
227 | (1) |
|
14.3 Decimator Dependence Graph |
|
|
228 | (2) |
|
14.4 Decimator Scheduling |
|
|
230 | (1) |
|
14.5 Decimator DAG for s1 = [ 1 0] |
|
|
231 | (2) |
|
14.6 Decimator DAG for s2 = [ 1 -1] |
|
|
233 | (2) |
|
14.7 Decimator DAG for s3 = [ 1 1] |
|
|
235 | (1) |
|
14.8 Polyphase Decimator Implementations |
|
|
235 | (1) |
|
14.9 Interpolator Structures |
|
|
236 | (1) |
|
14.10 Interpolator Dependence Graph |
|
|
237 | (1) |
|
14.11 Interpolator Scheduling |
|
|
238 | (1) |
|
14.12 Interpolator DAG for s1 = [ 1 0] |
|
|
239 | (2) |
|
14.13 Interpolator DAG for s2 = [ 1 -1] |
|
|
241 | (2) |
|
14.14 Interpolator DAG for s3 = [ 1 1] |
|
|
243 | (1) |
|
14.15 Polyphase Interpolator Implementations |
|
|
243 | (2) |
|
15 Case Study: Pattern Matching |
|
|
245 | (10) |
|
|
|
245 | (1) |
|
15.2 Expressing the Algorithm as a Regular Iterative Algorithm (RIA) |
|
|
245 | (1) |
|
15.3 Obtaining the Algorithm Dependence Graph |
|
|
246 | (1) |
|
|
|
247 | (1) |
|
|
|
248 | (1) |
|
15.6 DESIGN 1: Design Space Exploration When s = [ 1 1]t |
|
|
249 | (3) |
|
15.7 DESIGN 2: Design Space Exploration When s = [ 1 -1]t |
|
|
252 | (1) |
|
15.8 DESIGN 3: Design Space Exploration When s = [ 1 0]t |
|
|
253 | (2) |
|
16 Case Study: Motion Estimation for Video Compression |
|
|
255 | (12) |
|
|
|
255 | (1) |
|
|
|
256 | (1) |
|
16.3 Data Buffering Requirements |
|
|
257 | (1) |
|
16.4 Formulation of the FBMA |
|
|
258 | (1) |
|
16.5 Hierarchical Formulation of Motion Estimation |
|
|
259 | (2) |
|
16.6 Hardware Design of the Hierarchy Blocks |
|
|
261 | (6) |
|
17 Case Study: Multiplication over GF(2m) |
|
|
267 | (12) |
|
|
|
267 | (1) |
|
17.2 The Multiplication Algorithm in GF(2m) |
|
|
268 | (2) |
|
17.3 Expressing Field Multiplication as an RIA |
|
|
270 | (1) |
|
17.4 Field Multiplication Dependence Graph |
|
|
270 | (1) |
|
|
|
271 | (2) |
|
|
|
273 | (2) |
|
17.7 Design 1: Using d1 = [ 1 0]t |
|
|
275 | (1) |
|
17.8 Design 2: Using d2 = [ 1 1]t |
|
|
275 | (2) |
|
17.9 Design 3: Using d3 = [ 1 -1]t |
|
|
277 | (1) |
|
17.10 Applications of Finite Field Multipliers |
|
|
277 | (2) |
|
18 Case Study: Polynomial Division over GF(2) |
|
|
279 | (14) |
|
|
|
279 | (1) |
|
18.2 The Polynomial Division Algorithm |
|
|
279 | (2) |
|
18.3 The LFSR Dependence Graph |
|
|
281 | (1) |
|
|
|
282 | (1) |
|
|
|
283 | (1) |
|
18.6 Design 1: Design Space Exploration When s1 = [ 1 -1] |
|
|
284 | (2) |
|
18.7 Design 2: Design Space Exploration When s2 = [ 1 0] |
|
|
286 | (3) |
|
18.8 Design 3: Design Space Exploration When s3 = [ 1 -0.5] |
|
|
289 | (2) |
|
18.9 Comparing the Three Designs |
|
|
291 | (2) |
|
19 The Fast Fourier Transform |
|
|
293 | (12) |
|
|
|
293 | (2) |
|
19.2 Decimation-in-Time FFT |
|
|
295 | (3) |
|
19.3 Pipeline Radix-2 Decimation-in-Time FFT Processor |
|
|
298 | (1) |
|
19.4 Decimation-in-Frequency FFT |
|
|
299 | (4) |
|
19.5 Pipeline Radix-2 Decimation-in-Frequency FFT Processor |
|
|
303 | (2) |
|
20 Solving Systems of Linear Equations |
|
|
305 | (18) |
|
|
|
305 | (1) |
|
20.2 Special Matrix Structures |
|
|
305 | (4) |
|
20.3 Forward Substitution (Direct Technique) |
|
|
309 | (3) |
|
|
|
312 | (1) |
|
20.5 Matrix Triangularization Algorithm |
|
|
312 | (5) |
|
20.6 Successive over Relaxation (SOR) (Iterative Technique) |
|
|
317 | (4) |
|
|
|
321 | (2) |
|
21 Solving Partial Differential Equations Using Finite Difference Method |
|
|
323 | (8) |
|
|
|
323 | (1) |
|
|
|
324 | (7) |
| References |
|
331 | (6) |
| Index |
|
337 | |