…  xvii
…  xxv
Preface  xxvii
About the Editors  xxix
Contributors  xxxi

I  …  1 (80)

1 Implementing Matrix Multiplication on the Cell B.E.  3 (18)
1.1 …  3 (2)
1.1.1 Performance Considerations  4 (1)
1.1.2 Code Size Considerations  4 (1)
1.2 …  5 (11)
1.2.1 …  5 (1)
1.2.2 C = C - A × B^trans  6 (4)
1.2.3 …  10 (2)
1.2.4 Advancing Tile Pointers  12 (4)
…  16 (1)
…  17 (1)
…  18 (1)
…  19 (2)

2 Implementing Matrix Factorizations on the Cell B.E.  21 (16)
2.1 …  21 (1)
2.2 Cholesky Factorization  22 (1)
2.3 Tile QR Factorization  23 (3)
2.4 …  26 (2)
2.5 Parallelization---Single Cell B.E.  28 (2)
2.6 Parallelization---Dual Cell B.E.  30 (1)
…  31 (1)
…  32 (1)
…  33 (1)
…  34 (3)

3 Dense Linear Algebra for Hybrid GPU-Based Systems  37 (20)
3.1 …  37 (2)
3.1.1 Linear Algebra (LA)---Enabling New Architectures  38 (1)
3.1.2 MAGMA---LA Libraries for Hybrid Architectures  38 (1)
3.2 Hybrid DLA Algorithms  39 (11)
3.2.1 How to Code DLA for GPUs?  39 (2)
3.2.2 The Approach---Hybridization of DLA Algorithms  41 (2)
3.2.3 One-Sided Factorizations  43 (3)
3.2.4 Two-Sided Factorizations  46 (4)
…  50 (3)
…  53 (1)
…  54 (3)

4 …  57 (24)
4.1 …  57 (1)
4.2 BLAS Kernels Development  58 (10)
4.2.1 …  60 (1)
4.2.2 …  61 (1)
4.2.3 …  61 (2)
4.2.4 …  63 (1)
4.2.5 …  64 (1)
4.2.6 …  65 (1)
4.2.7 …  66 (1)
4.2.8 …  67 (1)
4.3 Generic Kernel Optimizations  68 (9)
4.3.1 Pointer Redirecting  68 (4)
4.3.2 …  72 (1)
4.3.3 …  72 (5)
…  77 (2)
…  79 (2)

II  …  81 (30)

5 Sparse Matrix-Vector Multiplication on Multicore and Accelerators  83 (28)
5.1 …  84 (1)
5.2 Sparse Matrix-Vector Multiplication: Overview and Intuition  84 (2)
5.3 Architectures, Programming Models, and Matrices  86 (5)
5.3.1 Hardware Architectures  86 (3)
5.3.2 Parallel Programming Models  89 (1)
5.3.3 …  90 (1)
5.4 Implications of Architecture on SpMV  91 (2)
5.4.1 …  91 (1)
5.4.2 …  92 (1)
5.5 Optimization Principles for SpMV  93 (6)
5.5.1 Reorganization for Efficient Parallelization  93 (2)
5.5.2 Orchestrating Data Movement  95 (1)
5.5.3 Reducing Memory Traffic  96 (1)
5.5.4 Putting It All Together: Implementations  97 (2)
5.6 …  99 (6)
5.6.1 Xeon X5550 (Nehalem)  100 (2)
5.6.2 …  102 (1)
5.6.3 …  103 (2)
5.7 Summary: Cross-Study Comparison  105 (2)
…  107 (1)
…  108 (3)

III  …  111 (38)

6 Hardware-Oriented Multigrid Finite Element Solvers on GPU-Accelerated Clusters  113 (18)
6.1 Introduction and Motivation  113 (3)
6.2 FEAST---Finite Element Analysis and Solution Tools  116 (4)
6.2.1 Separation of Structured and Unstructured Data  117 (1)
6.2.2 Parallel Multigrid Solvers  117 (1)
6.2.3 Scalar and Multivariate Problems  118 (1)
6.2.4 Co-Processor Acceleration  119 (1)
6.3 Two FEAST Applications: FEASTSOLID and FEASTFLOW  120 (4)
6.3.1 Computational Solid Mechanics  120 (1)
6.3.2 Computational Fluid Dynamics  121 (1)
6.3.3 Solving CSM and CFD Problems with FEAST  122 (2)
6.4 Performance Assessments  124 (4)
6.4.1 GPU-Based Multigrid on a Single Subdomain  124 (1)
6.4.2 …  125 (1)
6.4.3 Application Speedup  125 (3)
…  128 (1)
…  128 (1)
…  128 (3)

7 Mixed-Precision GPU-Multigrid Solvers with Strong Smoothers  131 (18)
7.1 …  131 (3)
7.1.1 Numerical Solution of Partial Differential Equations  132 (1)
7.1.2 Hardware-Oriented Discretization of Large Domains  132 (1)
7.1.3 Mixed-Precision Iterative Refinement Multigrid  133 (1)
7.2 Fine-Grained Parallelization of Multigrid Solvers  134 (7)
7.2.1 Smoothers on the CPU  134 (2)
7.2.2 Exact Parallelization: Jacobi and Tridiagonal Solvers  136 (2)
7.2.3 Multicolor Parallelization: Gauß-Seidel Solvers  138 (1)
7.2.4 Combination of Tridiagonal and Gauß-Seidel Smoothers  139 (1)
7.2.5 Alternating Direction Implicit Method  140 (1)
7.3 Numerical Evaluation and Performance Results  141 (4)
7.3.1 …  141 (1)
7.3.2 Solver Configuration and Hardware Details  142 (1)
7.3.3 Numerical Evaluation  142 (1)
…  143 (2)
…  145 (1)
7.4 Summary and Conclusions  145 (1)
…  145 (1)
…  146 (3)

IV Fast Fourier Transforms  149 (44)

8 Designing Fast Fourier Transform for the IBM Cell Broadband Engine  151 (20)
8.1 …  151 (1)
8.2 …  152 (2)
8.3 Fast Fourier Transform  154 (1)
8.4 Cell Broadband Engine Architecture  155 (3)
8.5 FFTC: Our FFT Algorithm for the Cell/B.E. Processor  158 (5)
8.5.1 Parallelizing FFTC for the Cell  158 (1)
8.5.2 Optimizing FFTC for the SPEs  159 (4)
8.6 Performance Analysis of FFTC  163 (3)
…  166 (1)
…  167 (1)
…  167 (4)

9 Implementing FFTs on Multicore Architectures  171 (22)
9.1 …  172 (1)
9.2 Computational Aspects of FFT Algorithms  173 (2)
9.2.1 An Upper Bound on FFT Performance  174 (1)
9.3 Data Movement and Preparation of FFT Algorithms  175 (2)
9.4 Multicore FFT Performance Optimization  177 (1)
9.5 …  178 (7)
9.5.1 Registers and Load and Store Operations  180 (1)
9.5.1.1 Applying SIMD Operations  180 (1)
9.5.1.2 Instruction Pipeline  181 (1)
9.5.1.3 Multi-Issue Instructions  182 (1)
9.5.2 Private and Shared Core Memory, and Their Data Movement  183 (1)
9.5.2.1 …  183 (1)
9.5.2.2 Parallel Computation on Shared Core Memory  183 (1)
9.5.3 …  184 (1)
9.5.3.1 Index-Bit Reversal of Data Block Addresses  184 (1)
9.5.3.2 Transposition of the Elements  184 (1)
…  185 (1)
9.6 Generic FFT Generators and Tooling  185 (2)
9.6.1 A Platform-Independent Expression of Performance Planning  185 (1)
9.6.2 Reducing the Mapping Space  186 (1)
…  187 (1)
9.7 Case Study: Large, Multi-Dimensional FFT on a Network Clustered System  187 (3)
…  190 (1)
…  190 (3)

V Combinatorial Algorithms  193 (24)

10 Combinatorial Algorithm Design on the Cell/B.E. Processor  195 (22)
10.1 …  195 (3)
10.2 Algorithm Design and Analysis on the Cell/B.E.  198 (3)
10.2.1 A Complexity Model  198 (1)
10.2.2 Analyzing Algorithms  199 (2)
10.3 …  201 (12)
10.3.1 A Parallelization Strategy  201 (1)
10.3.2 Complexity Analysis  202 (1)
10.3.3 A Novel Latency-Hiding Technique for Irregular Applications  203 (2)
10.3.4 Cell/B.E. Implementation  205 (1)
10.3.5 Performance Results  206 (7)
…  213 (1)
…  214 (1)
…  214 (3)

VI  …  217 (62)

11 Auto-Tuning Stencil Computations on Multicore and Accelerators  219 (36)
11.1 …  220 (1)
11.2 …  221 (1)
11.3 Experimental Testbed  222 (1)
11.4 Performance Expectation  223 (8)
11.4.1 Stencil Characteristics  225 (1)
11.4.2 A Brief Introduction to the Roofline Model  225 (2)
11.4.3 Roofline Model-Based Performance Expectations  227 (4)
11.5 Stencil Optimizations  231 (5)
11.5.1 Parallelization and Problem Decomposition  232 (1)
11.5.2 …  233 (1)
11.5.3 Bandwidth Optimizations  233 (1)
11.5.4 In-Core Optimizations  234 (1)
11.5.5 Algorithmic Transformations  235 (1)
11.6 Auto-Tuning Methodology  236 (3)
11.6.1 Architecture-Specific Exceptions  238 (1)
11.7 Results and Analysis  239 (10)
11.7.1 Nehalem Performance  240 (2)
11.7.2 Barcelona Performance  242 (1)
11.7.3 Clovertown Performance  243 (1)
11.7.4 Blue Gene/P Performance  244 (1)
11.7.5 Victoria Falls Performance  244 (1)
11.7.6 …  245 (1)
11.7.7 GTX280 Performance  246 (1)
11.7.8 Cross-Platform Performance and Power Comparison  247 (2)
…  249 (2)
…  251 (1)
…  251 (4)

12 Manycore Stencil Computations in Hyperthermia Applications  255 (24)
12.1 …  255 (1)
12.2 Hyperthermia Applications  256 (3)
12.3 Bandwidth-Saving Stencil Computations  259 (7)
12.3.1 Spatial Blocking and Parallelization  259 (2)
12.3.2 …  261 (2)
12.3.2.1 Temporally Blocking the Hyperthermia Stencil  263 (1)
12.3.2.2 Speedup for the Hyperthermia Stencil  264 (2)
12.4 Experimental Performance Results  266 (7)
12.4.1 …  268 (3)
12.4.2 Application Benchmarks  271 (2)
…  273 (1)
…  273 (1)
…  274 (1)
…  274 (5)

VII  …  279 (50)

13 Enabling Bioinformatics Algorithms on the Cell/B.E. Processor  281 (16)
13.1 Computational Biology and High-Performance Computing  281 (2)
13.2 The Cell/B.E. Processor  283 (1)
13.2.1 Cache Implementation on Cell/B.E.  283 (1)
13.3 Sequence Analysis and Its Applications  284 (2)
13.4 Sequence Analysis on the Cell/B.E. Processor  286 (3)
13.4.1 …  286 (2)
13.4.2 …  288 (1)
13.5 …  289 (5)
13.5.1 Experimental Setup  289 (1)
13.5.2 ClustalW Results and Analysis  289 (4)
13.5.3 FASTA Results and Analysis  293 (1)
13.6 Conclusions and Future Work  294 (1)
…  295 (2)

14 Pairwise Computations on the Cell Processor  297 (32)
14.1 …  298 (1)
14.2 Scheduling Pairwise Computations  299 (9)
14.2.1 …  300 (1)
14.2.2 …  301 (1)
14.2.3 …  302 (1)
14.2.3.1 Fetching Input Vectors  303 (1)
14.2.3.2 Shifting Column Vectors  303 (1)
14.2.3.3 Transferring Output Data  303 (1)
14.2.3.4 Minimizing Number of DMA Transfers  304 (1)
14.2.4 Extending Tiling across Multiple Cell Processors  305 (1)
14.2.5 Extending Tiling to Large Number of Dimensions  306 (2)
14.3 Reconstructing Gene Regulatory Networks  308 (4)
14.3.1 Computing Pairwise Mutual Information on the Cell  309 (1)
14.3.2 Performance of Pairwise MI Computations on One Cell Blade  310 (1)
14.3.3 Performance of MI Computations on Multiple Cell Blades  311 (1)
14.4 Pairwise Genomic Alignments  312 (11)
14.4.1 Computing Alignments  312 (1)
14.4.1.1 Global/Local Alignment  313 (1)
14.4.1.2 Spliced Alignment  314 (1)
14.4.1.3 Syntenic Alignment  315 (1)
14.4.2 A Parallel Alignment Algorithm for the Cell BE  316 (1)
14.4.2.1 Parallel Alignment Using Prefix Computations  316 (1)
14.4.2.2 Wavefront Communication Scheme  317 (1)
14.4.2.3 A Hybrid Parallel Algorithm  318 (2)
14.4.2.4 Hirschberg's Technique for Linear Space  320 (1)
14.4.2.5 Algorithms for Specialized Alignments  321 (1)
…  321 (1)
14.4.3 Performance of the Hybrid Alignment Algorithms  321 (2)
…  323 (1)
…  324 (1)
…  324 (5)

VIII  …  329 (44)

15 Drug Design on the Cell BE  331 (20)
15.1 …  332 (1)
15.2 Bioinformatics and Drug Design  333 (4)
15.2.1 Protein-Ligand Docking  335 (1)
15.2.2 Protein-Protein Docking  336 (1)
15.2.3 Molecular Mechanics  337 (1)
15.3 Cell BE Porting Analysis  337 (2)
15.4 …  339 (1)
15.5 Case Study: Docking with FTDock  339 (4)
15.5.1 Algorithm Description  339 (1)
15.5.2 Profiling and Implementation  340 (1)
15.5.3 Performance Evaluation  341 (2)
15.6 Case Study: Molecular Dynamics with Moldy  343 (2)
15.6.1 Algorithm Description  343 (1)
15.6.2 Profiling and Implementation  343 (1)
15.6.3 Performance Evaluation  344 (1)
…  345 (1)
…  346 (1)
…  346 (5)

16 GPU Algorithms for Molecular Modeling  351 (22)
16.1 …  352 (1)
16.2 Computational Challenges of Molecular Modeling  352 (1)
16.3 …  353 (4)
16.3.1 GPU Hardware Organization  354 (1)
16.3.2 GPU Programming Model  355 (2)
16.4 GPU Particle-Grid Algorithms  357 (4)
16.4.1 Electrostatic Potential  357 (1)
16.4.2 Direct Summation on GPUs  358 (2)
16.4.3 Cutoff Summation on GPUs  360 (1)
16.4.4 Floating-Point Precision Effects  361 (1)
16.5 GPU N-Body Algorithms  361 (4)
16.5.1 …  361 (1)
16.5.2 N-Body Forces on GPUs  362 (2)
16.5.3 Long-Range Electrostatic Forces  364 (1)
16.6 Adapting Software for GPU Acceleration  365 (3)
16.6.1 Case Study: NAMD Parallel Molecular Dynamics  365 (2)
16.6.2 Case Study: VMD Molecular Graphics and Analysis  367 (1)
…  368 (1)
…  369 (1)
…  369 (4)

IX  …  373 (88)

17 Dataflow Frameworks for Emerging Heterogeneous Architectures and Their Application to Biomedicine  375 (18)
17.1 …  375 (2)
17.2 Dataflow Computing Model and Runtime Support  377 (1)
17.3 Use Case Application: Neuroblastoma Image Analysis System  378 (2)
17.4 Middleware for Multi-Granularity Dataflow  380 (8)
17.4.1 Coarse-Grained on Distributed GPU Clusters  381 (1)
17.4.1.1 Supporting Heterogeneous Resources  381 (2)
17.4.1.2 Experimental Evaluation  383 (3)
17.4.2 Fine-Grained on Cell  386 (1)
17.4.2.1 DCL for Cell---Design and Architecture  386 (1)
…  386 (2)
17.5 Conclusions and Future Work  388 (1)
…  389 (1)
…  389 (4)

18 Accelerator Support in the Charm++ Parallel Programming Model  393 (20)
18.1 …  393 (1)
18.2 Motivations and Goals of Our Work  394 (2)
18.3 The Charm++ Parallel Programming Model  396 (2)
18.3.1 General Description of Charm++  396 (1)
18.3.2 Suitability of Charm++ for Exploiting Accelerators  397 (1)
18.4 Support for Cell and Larrabee in Charm++  398 (7)
18.4.1 SIMD Instruction Abstraction  400 (1)
18.4.2 Accelerated Entry Methods  401 (2)
18.4.3 Support for Heterogeneous Systems  403 (1)
…  404 (1)
18.5 Support for CUDA-Based GPUs  405 (2)
…  407 (1)
…  408 (1)
…  408 (5)

19 Efficient Parallel Scan Algorithms for Manycore GPUs  413 (30)
19.1 …  414 (2)
19.2 CUDA---A General-Purpose Parallel Computing Architecture for Graphics Processors  416 (1)
19.3 Scan: An Algorithmic Primitive for Efficient Data-Parallel Computation  417 (4)
19.3.1 …  417 (1)
19.3.1.1 A Serial Implementation  418 (1)
19.3.1.2 A Basic Parallel Implementation  418 (2)
…  420 (1)
19.4 Design of an Efficient Scan Algorithm  421 (4)
19.4.1 Hierarchy of the Scan Algorithm  421 (1)
19.4.2 Intra-Warp Scan Algorithm  422 (1)
19.4.3 Intra-Block Scan Algorithm  423 (1)
19.4.4 Global Scan Algorithm  423 (2)
19.5 Design of an Efficient Segmented Scan Algorithm  425 (8)
19.5.1 Operator Transformation  425 (1)
19.5.2 Direct Intra-Warp Segmented Scan  426 (4)
19.5.3 Block and Global Segmented Scan Algorithms  430 (3)
19.6 Algorithmic Complexity  433 (1)
19.7 Some Alternative Designs for Scan Algorithms  434 (3)
19.7.1 Saving Bandwidth by Performing a Reduction  434 (1)
19.7.2 Eliminating Recursion by Performing More Work per Block  435 (2)
19.8 Optimizations in CUDPP  437 (1)
19.9 Performance Analysis  437 (3)
…  440 (1)
…  440 (1)
…  441 (2)

20 High Performance Topology-Aware Communication in Multicore Processors  443 (18)
20.1 …  444 (1)
20.2 …  445 (3)
20.2.1 …  445 (1)
20.2.2 …  446 (1)
20.2.3 …  447 (1)
…  447 (1)
20.3 …  448 (1)
20.3.1 Basic Memory-Based Copy  448 (1)
20.3.2 Vector Instructions  448 (1)
20.3.3 Streaming Instructions  449 (1)
20.3.4 Kernel-Based Direct Copy  449 (1)
20.4 Experimental Results  449 (9)
20.4.1 Intra-Socket Performance Results  450 (4)
20.4.2 Inter-Socket Performance Results  454 (1)
20.4.3 Comparison with MPI  455 (2)
20.4.4 Performance Comparison of Different Multicore Architectures  457 (1)
20.5 …  458 (1)
20.6 Conclusion and Future Work  458 (1)
…  459 (2)

Index  461