
Scientific Computing with Multicore and Accelerators [Hardcover]

Edited by David A. Bader (Georgia Institute of Technology, Atlanta, USA), Jakub Kurzak (University of Tennessee, Knoxville, USA), and Jack Dongarra (University of Tennessee, Knoxville, USA)
  • Format: Hardback, 514 pages, height x width: 234x156 mm, weight: 839 g, 37 Tables, black and white; 163 Illustrations, black and white
  • Series: Chapman & Hall/CRC Computational Science
  • Publication date: 07-Dec-2010
  • Publisher: CRC Press Inc
  • ISBN-10: 143982536X
  • ISBN-13: 9781439825365
The hybrid/heterogeneous nature of future microprocessors and large high-performance computing systems will result in a reliance on two major types of components: multicore/manycore central processing units and special-purpose hardware/massively parallel accelerators. While these technologies have numerous benefits, they also pose substantial performance challenges for developers, including scalability, software tuning, and programming issues.

Researchers at the Forefront Reveal Results from Their Own State-of-the-Art Work

Edited by some of the top researchers in the field and with contributions from a variety of international experts, Scientific Computing with Multicore and Accelerators focuses on the architectural design and implementation of multicore and manycore processors and accelerators, including graphics processing units (GPUs) and the Sony Toshiba IBM (STI) Cell Broadband Engine (BE) currently used in the Sony PlayStation 3. The book explains how numerical libraries, such as LAPACK, help solve computational science problems; explores the emerging area of hardware-oriented numerics; and presents the design of a fast Fourier transform (FFT) and a parallel list ranking algorithm for the Cell BE. It covers stencil computations, auto-tuning, optimizations of a computational kernel, sequence alignment and homology, and pairwise computations. The book also evaluates the portability of drug design applications to the Cell BE and illustrates how to successfully exploit the computational capabilities of GPUs for scientific applications. It concludes with chapters on dataflow frameworks, the Charm++ programming model, scan algorithms, and a portable intracore communication framework.

Explores the New Computational Landscape of Hybrid Processors

By offering insight into the process of constructing and effectively using the technology, this volume provides a thorough and practical introduction to the area of hybrid computing. It discusses introductory concepts and simple examples of parallel computing, logical and performance debugging for parallel computing, and advanced topics and issues related to the use and building of many applications.
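As a flavor of the data-parallel primitives covered later in the book (Chapter 19 is devoted entirely to scan and segmented scan on manycore GPUs), here is a minimal serial sketch of both operations. This sketch is illustrative only and not taken from the book; the GPU implementations described in Chapter 19 are considerably more elaborate.

```python
def inclusive_scan(values, op=lambda a, b: a + b):
    """Serial inclusive scan: out[i] = values[0] op values[1] op ... op values[i]."""
    out, acc = [], None
    for v in values:
        acc = v if acc is None else op(acc, v)
        out.append(acc)
    return out

def segmented_scan(values, flags, op=lambda a, b: a + b):
    """Like inclusive_scan, but restarts wherever flags[i] == 1 marks a new segment."""
    out, acc = [], None
    for v, f in zip(values, flags):
        acc = v if (f or acc is None) else op(acc, v)
        out.append(acc)
    return out

# inclusive_scan([1, 2, 3, 4])               -> [1, 3, 6, 10]
# segmented_scan([1, 2, 3, 4], [1, 0, 1, 0]) -> [1, 3, 3, 7]
```

The serial loop is inherently sequential in its accumulator; the book's contribution is restructuring this dependence into warp-, block-, and grid-level parallel phases.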
List of Figures xvii
List of Tables xxv
Preface xxvii
About the Editors xxix
Contributors xxxi
I Dense Linear Algebra 1(80)
1 Implementing Matrix Multiplication on the Cell B. E. 3(18)
Wesley Alvaro, Jakub Kurzak, Jack Dongarra
1.1 Introduction 3(2)
1.1.1 Performance Considerations 4(1)
1.1.2 Code Size Considerations 4(1)
1.2 Implementation 5(11)
1.2.1 Loop Construction 5(1)
1.2.2 C = C - A × B^T 6(4)
1.2.3 C = C - A × B 10(2)
1.2.4 Advancing Tile Pointers 12(4)
1.3 Results 16(1)
1.4 Summary 17(1)
1.5 Code 18(1)
Bibliography 19(2)
2 Implementing Matrix Factorizations on the Cell B. E. 21(16)
Jakub Kurzak, Jack Dongarra
2.1 Introduction 21(1)
2.2 Cholesky Factorization 22(1)
2.3 Tile QR Factorization 23(3)
2.4 SIMD Vectorization 26(2)
2.5 Parallelization: Single Cell B. E. 28(2)
2.6 Parallelization: Dual Cell B. E. 30(1)
2.7 Results 31(1)
2.8 Summary 32(1)
2.9 Code 33(1)
Bibliography 34(3)
3 Dense Linear Algebra for Hybrid GPU-Based Systems 37(20)
Stanimire Tomov, Jack Dongarra
3.1 Introduction 37(2)
3.1.1 Linear Algebra (LA): Enabling New Architectures 38(1)
3.1.2 MAGMA: LA Libraries for Hybrid Architectures 38(1)
3.2 Hybrid DLA Algorithms 39(11)
3.2.1 How to Code DLA for GPUs? 39(2)
3.2.2 The Approach: Hybridization of DLA Algorithms 41(2)
3.2.3 One-Sided Factorizations 43(3)
3.2.4 Two-Sided Factorizations 46(4)
3.3 Performance Results 50(3)
3.4 Summary 53(1)
Bibliography 54(3)
4 BLAS for GPUs 57(24)
Rajib Nath, Stanimire Tomov, Jack Dongarra
4.1 Introduction 57(1)
4.2 BLAS Kernels Development 58(10)
4.2.1 Level 1 BLAS 60(1)
4.2.2 Level 2 BLAS 61(1)
4.2.2.1 xGEMV 61(2)
4.2.2.2 xSYMV 63(1)
4.2.3 Level 3 BLAS 64(1)
4.2.3.1 xGEMM 65(1)
4.2.3.2 xSYRK 66(1)
4.2.3.3 xTRSM 67(1)
4.3 Generic Kernel Optimizations 68(9)
4.3.1 Pointer Redirecting 68(4)
4.3.2 Padding 72(1)
4.3.3 Auto-Tuning 72(5)
4.4 Summary 77(2)
Bibliography 79(2)
II Sparse Linear Algebra 81(30)
5 Sparse Matrix-Vector Multiplication on Multicore and Accelerators 83(28)
Samuel Williams, Nathan Bell, Jee Whan Choi, Michael Garland, Leonid Oliker, Richard Vuduc
5.1 Introduction 84(1)
5.2 Sparse Matrix-Vector Multiplication: Overview and Intuition 84(2)
5.3 Architectures, Programming Models, and Matrices 86(5)
5.3.1 Hardware Architectures 86(3)
5.3.2 Parallel Programming Models 89(1)
5.3.3 Matrices 90(1)
5.4 Implications of Architecture on SpMV 91(2)
5.4.1 Memory Subsystem 91(1)
5.4.2 Processor Core 92(1)
5.5 Optimization Principles for SpMV 93(6)
5.5.1 Reorganization for Efficient Parallelization 93(2)
5.5.2 Orchestrating Data Movement 95(1)
5.5.3 Reducing Memory Traffic 96(1)
5.5.4 Putting It All Together: Implementations 97(2)
5.6 Results and Analysis 99(6)
5.6.1 Xeon X5550 (Nehalem) 100(2)
5.6.2 QS22 PowerXCell 8i 102(1)
5.6.3 GTX 285 103(2)
5.7 Summary: Cross-Study Comparison 105(2)
Acknowledgments 107(1)
Bibliography 108(3)
III Multigrid Methods 111(38)
6 Hardware-Oriented Multigrid Finite Element Solvers on GPU-Accelerated Clusters 113(18)
Stefan Turek, Dominik Goddeke, Sven H.M. Buijssen, Hilmar Wobker
6.1 Introduction and Motivation 113(3)
6.2 FEAST: Finite Element Analysis and Solution Tools 116(4)
6.2.1 Separation of Structured and Unstructured Data 117(1)
6.2.2 Parallel Multigrid Solvers 117(1)
6.2.3 Scalar and Multivariate Problems 118(1)
6.2.4 Co-Processor Acceleration 119(1)
6.3 Two FEAST Applications: FEASTSOLID and FEASTFLOW 120(4)
6.3.1 Computational Solid Mechanics 120(1)
6.3.2 Computational Fluid Dynamics 121(1)
6.3.3 Solving CSM and CFD Problems with FEAST 122(2)
6.4 Performance Assessments 124(4)
6.4.1 GPU-Based Multigrid on a Single Subdomain 124(1)
6.4.2 Scalability 125(1)
6.4.3 Application Speedup 125(3)
6.5 Summary 128(1)
Acknowledgments 128(1)
Bibliography 128(3)
7 Mixed-Precision GPU-Multigrid Solvers with Strong Smoothers 131(18)
Dominik Goddeke, Robert Strzodka
7.1 Introduction 131(3)
7.1.1 Numerical Solution of Partial Differential Equations 132(1)
7.1.2 Hardware-Oriented Discretization of Large Domains 132(1)
7.1.3 Mixed-Precision Iterative Refinement Multigrid 133(1)
7.2 Fine-Grained Parallelization of Multigrid Solvers 134(7)
7.2.1 Smoothers on the CPU 134(2)
7.2.2 Exact Parallelization: Jacobi and Tridiagonal Solvers 136(2)
7.2.3 Multicolor Parallelization: Gauß-Seidel Solvers 138(1)
7.2.4 Combination of Tridiagonal and Gauß-Seidel Smoothers 139(1)
7.2.5 Alternating Direction Implicit Method 140(1)
7.3 Numerical Evaluation and Performance Results 141(4)
7.3.1 Test Procedure 141(1)
7.3.2 Solver Configuration and Hardware Details 142(1)
7.3.3 Numerical Evaluation 142(1)
7.3.4 Runtime Efficiency 143(2)
7.3.5 Smoother Selection 145(1)
7.4 Summary and Conclusions 145(1)
Acknowledgments 145(1)
Bibliography 146(3)
IV Fast Fourier Transforms 149(44)
8 Designing Fast Fourier Transform for the IBM Cell Broadband Engine 151(20)
Virat Agarwal, David A. Bader
8.1 Introduction 151(1)
8.2 Related Work 152(2)
8.3 Fast Fourier Transform 154(1)
8.4 Cell Broadband Engine Architecture 155(3)
8.5 FFTC: Our FFT Algorithm for the Cell/B. E. Processor 158(5)
8.5.1 Parallelizing FFTC for the Cell 158(1)
8.5.2 Optimizing FFTC for the SPEs 159(4)
8.6 Performance Analysis of FFTC 163(3)
8.7 Conclusions 166(1)
Acknowledgments 167(1)
Bibliography 167(4)
9 Implementing FFTs on Multicore Architectures 171(22)
Alex Chunghen Chow, Gordon C. Fossum, Daniel A. Brokenshire
9.1 Introduction 172(1)
9.2 Computational Aspects of FFT Algorithms 173(2)
9.2.1 An Upper Bound on FFT Performance 174(1)
9.3 Data Movement and Preparation of FFT Algorithms 175(2)
9.4 Multicore FFT Performance Optimization 177(1)
9.5 Memory Hierarchy 178(7)
9.5.1 Registers and Load and Store Operations 180(1)
9.5.1.1 Applying SIMD Operations 180(1)
9.5.1.2 Instruction Pipeline 181(1)
9.5.1.3 Multi-Issue Instructions 182(1)
9.5.2 Private and Shared Core Memory, and Their Data Movement 183(1)
9.5.2.1 Factorization 183(1)
9.5.2.2 Parallel Computation on Shared Core Memory 183(1)
9.5.3 System Memory 184(1)
9.5.3.1 Index-Bit Reversal of Data Block Addresses 184(1)
9.5.3.2 Transposition of the Elements 184(1)
9.5.3.3 Load Balancing 185(1)
9.6 Generic FFT Generators and Tooling 185(2)
9.6.1 A Platform-Independent Expression of Performance Planning 185(1)
9.6.2 Reducing the Mapping Space 186(1)
9.6.3 Code Generation 187(1)
9.7 Case Study: Large, Multi-Dimensional FFT on a Network Clustered System 187(3)
9.8 Conclusion 190(1)
Bibliography 190(3)
V Combinatorial Algorithms 193(24)
10 Combinatorial Algorithm Design on the Cell/B. E. Processor 195(22)
David A. Bader, Virat Agarwal, Kamesh Madduri, Fabrizio Petrini
10.1 Introduction 195(3)
10.2 Algorithm Design and Analysis on the Cell/B. E. 198(3)
10.2.1 A Complexity Model 198(1)
10.2.2 Analyzing Algorithms 199(2)
10.3 List Ranking 201(12)
10.3.1 A Parallelization Strategy 201(1)
10.3.2 Complexity Analysis 202(1)
10.3.3 A Novel Latency-Hiding Technique for Irregular Applications 203(2)
10.3.4 Cell/B. E. Implementation 205(1)
10.3.5 Performance Results 206(7)
10.4 Conclusions 213(1)
Acknowledgments 214(1)
Bibliography 214(3)
VI Stencil Algorithms 217(62)
11 Auto-Tuning Stencil Computations on Multicore and Accelerators 219(36)
Kaushik Datta, Samuel Williams, Vasily Volkov, Jonathan Carter, Leonid Oliker, John Shalf, Katherine Yelick
11.1 Introduction 220(1)
11.2 Stencil Overview 221(1)
11.3 Experimental Testbed 222(1)
11.4 Performance Expectation 223(8)
11.4.1 Stencil Characteristics 225(1)
11.4.2 A Brief Introduction to the Roofline Model 225(2)
11.4.3 Roofline Model-Based Performance Expectations 227(4)
11.5 Stencil Optimizations 231(5)
11.5.1 Parallelization and Problem Decomposition 232(1)
11.5.2 Data Allocation 233(1)
11.5.3 Bandwidth Optimizations 233(1)
11.5.4 In-Core Optimizations 234(1)
11.5.5 Algorithmic Transformations 235(1)
11.6 Auto-Tuning Methodology 236(3)
11.6.1 Architecture-Specific Exceptions 238(1)
11.7 Results and Analysis 239(10)
11.7.1 Nehalem Performance 240(2)
11.7.2 Barcelona Performance 242(1)
11.7.3 Clovertown Performance 243(1)
11.7.4 Blue Gene/P Performance 244(1)
11.7.5 Victoria Falls Performance 244(1)
11.7.6 Cell Performance 245(1)
11.7.7 GTX280 Performance 246(1)
11.7.8 Cross-Platform Performance and Power Comparison 247(2)
11.8 Conclusions 249(2)
Acknowledgments 251(1)
Bibliography 251(4)
12 Manycore Stencil Computations in Hyperthermia Applications 255(24)
Matthias Christen, Olaf Schenk, Esra Neufeld, Maarten Paulides, Helmar Burkhart
12.1 Introduction 255(1)
12.2 Hyperthermia Applications 256(3)
12.3 Bandwidth-Saving Stencil Computations 259(7)
12.3.1 Spatial Blocking and Parallelization 259(2)
12.3.2 Temporal Blocking 261(2)
12.3.2.1 Temporally Blocking the Hyperthermia Stencil 263(1)
12.3.2.2 Speedup for the Hyperthermia Stencil 264(2)
12.4 Experimental Performance Results 266(7)
12.4.1 Kernel Benchmarks 268(3)
12.4.2 Application Benchmarks 271(2)
12.5 Related Work 273(1)
12.6 Conclusion 273(1)
Acknowledgments 274(1)
Bibliography 274(5)
VII Bioinformatics 279(50)
13 Enabling Bioinformatics Algorithms on the Cell/B. E. Processor 281(16)
Vipin Sachdeva, Michael Kistler, Tzy-Hwa Kathy Tzeng
13.1 Computational Biology and High-Performance Computing 281(2)
13.2 The Cell/B. E. Processor 283(1)
13.2.1 Cache Implementation on Cell/B. E. 283(1)
13.3 Sequence Analysis and Its Applications 284(2)
13.4 Sequence Analysis on the Cell/B. E. Processor 286(3)
13.4.1 ClustalW 286(2)
13.4.2 FASTA 288(1)
13.5 Results 289(5)
13.5.1 Experimental Setup 289(1)
13.5.2 ClustalW Results and Analysis 289(4)
13.5.3 FASTA Results and Analysis 293(1)
13.6 Conclusions and Future Work 294(1)
Bibliography 295(2)
14 Pairwise Computations on the Cell Processor 297(32)
Abhinav Sarje, Jaroslaw Zola, Srinivas Aluru
14.1 Introduction 298(1)
14.2 Scheduling Pairwise Computations 299(9)
14.2.1 Tiling 300(1)
14.2.2 Tile Ordering 301(1)
14.2.3 Tile Size 302(1)
14.2.3.1 Fetching Input Vectors 303(1)
14.2.3.2 Shifting Column Vectors 303(1)
14.2.3.3 Transferring Output Data 303(1)
14.2.3.4 Minimizing Number of DMA Transfers 304(1)
14.2.4 Extending Tiling across Multiple Cell Processors 305(1)
14.2.5 Extending Tiling to Large Number of Dimensions 306(2)
14.3 Reconstructing Gene Regulatory Networks 308(4)
14.3.1 Computing Pairwise Mutual Information on the Cell 309(1)
14.3.2 Performance of Pairwise MI Computations on One Cell Blade 310(1)
14.3.3 Performance of MI Computations on Multiple Cell Blades 311(1)
14.4 Pairwise Genomic Alignments 312(11)
14.4.1 Computing Alignments 312(1)
14.4.1.1 Global/Local Alignment 313(1)
14.4.1.2 Spliced Alignment 314(1)
14.4.1.3 Syntenic Alignment 315(1)
14.4.2 A Parallel Alignment Algorithm for the Cell BE 316(1)
14.4.2.1 Parallel Alignment using Prefix Computations 316(1)
14.4.2.2 Wavefront Communication Scheme 317(1)
14.4.2.3 A Hybrid Parallel Algorithm 318(2)
14.4.2.4 Hirschberg's Technique for Linear Space 320(1)
14.4.2.5 Algorithms for Specialized Alignments 321(1)
14.4.2.6 Memory Usage 321(1)
14.4.3 Performance of the Hybrid Alignment Algorithms 321(2)
14.5 Ending Notes 323(1)
Acknowledgment 324(1)
Bibliography 324(5)
VIII Molecular Modeling 329(44)
15 Drug Design on the Cell BE 331(20)
Cecilia Gonzalez-Alvarez, Harald Servat, Daniel Cabrera-Benitez, Xavier Aguilar, Carles Pons, Juan Fernandez-Recio, Daniel Jimenez-Gonzalez
15.1 Introduction 332(1)
15.2 Bioinformatics and Drug Design 333(4)
15.2.1 Protein-Ligand Docking 335(1)
15.2.2 Protein-Protein Docking 336(1)
15.2.3 Molecular Mechanics 337(1)
15.3 Cell BE Porting Analysis 337(2)
15.4 Experimental Setup 339(1)
15.5 Case Study: Docking with FTDock 339(4)
15.5.1 Algorithm Description 339(1)
15.5.2 Profiling and Implementation 340(1)
15.5.3 Performance Evaluation 341(2)
15.6 Case Study: Molecular Dynamics with Moldy 343(2)
15.6.1 Algorithm Description 343(1)
15.6.2 Profiling and Implementation 343(1)
15.6.3 Performance Evaluation 344(1)
15.7 Conclusions 345(1)
Acknowledgments 346(1)
Bibliography 346(5)
16 GPU Algorithms for Molecular Modeling 351(22)
John E. Stone, David J. Hardy, Barry Isralewitz, Klaus Schulten
16.1 Introduction 352(1)
16.2 Computational Challenges of Molecular Modeling 352(1)
16.3 GPU Overview 353(4)
16.3.1 GPU Hardware Organization 354(1)
16.3.2 GPU Programming Model 355(2)
16.4 GPU Particle-Grid Algorithms 357(4)
16.4.1 Electrostatic Potential 357(1)
16.4.2 Direct Summation on GPUs 358(2)
16.4.3 Cutoff Summation on GPUs 360(1)
16.4.4 Floating-Point Precision Effects 361(1)
16.5 GPU N-Body Algorithms 361(4)
16.5.1 N-Body Forces 361(1)
16.5.2 N-Body Forces on GPUs 362(2)
16.5.3 Long-Range Electrostatic Forces 364(1)
16.6 Adapting Software for GPU Acceleration 365(3)
16.6.1 Case Study: NAMD Parallel Molecular Dynamics 365(2)
16.6.2 Case Study: VMD Molecular Graphics and Analysis 367(1)
16.7 Concluding Remarks 368(1)
Acknowledgments 369(1)
Bibliography 369(4)
IX Complementary Topics 373(88)
17 Dataflow Frameworks for Emerging Heterogeneous Architectures and Their Application to Biomedicine 375(18)
Ümit V. Çatalyürek, Renato Ferreira, Timothy D. R. Hartley, George Teodoro, Rafael Sachetto
17.1 Motivation 375(2)
17.2 Dataflow Computing Model and Runtime Support 377(1)
17.3 Use Case Application: Neuroblastoma Image Analysis System 378(2)
17.4 Middleware for Multi-Granularity Dataflow 380(8)
17.4.1 Coarse-Grained on Distributed GPU Clusters 381(1)
17.4.1.1 Supporting Heterogeneous Resources 381(2)
17.4.1.2 Experimental Evaluation 383(3)
17.4.2 Fine-Grained on Cell 386(1)
17.4.2.1 DCL for Cell: Design and Architecture 386(1)
17.4.2.2 NBIA with DCL 386(2)
17.5 Conclusions and Future Work 388(1)
Acknowledgments 389(1)
Bibliography 389(4)
18 Accelerator Support in the Charm++ Parallel Programming Model 393(20)
Laxmikant V. Kale, David M. Kunzman, Lukasz Wesolowski
18.1 Introduction 393(1)
18.2 Motivations and Goals of Our Work 394(2)
18.3 The Charm++ Parallel Programming Model 396(2)
18.3.1 General Description of Charm++ 396(1)
18.3.2 Suitability of Charm++ for Exploiting Accelerators 397(1)
18.4 Support for Cell and Larrabee in Charm++ 398(7)
18.4.1 SIMD Instruction Abstraction 400(1)
18.4.2 Accelerated Entry Methods 401(2)
18.4.3 Support for Heterogeneous Systems 403(1)
18.4.4 Performance 404(1)
18.5 Support for CUDA-Based GPUs 405(2)
18.6 Related Work 407(1)
18.7 Concluding Remarks 408(1)
Bibliography 408(5)
19 Efficient Parallel Scan Algorithms for Manycore GPUs 413(30)
Shubhabrata Sengupta, Mark Harris, Michael Garland, John D. Owens
19.1 Introduction 414(2)
19.2 CUDA: A General-Purpose Parallel Computing Architecture for Graphics Processors 416(1)
19.3 Scan: An Algorithmic Primitive for Efficient Data-Parallel Computation 417(4)
19.3.1 Scan 417(1)
19.3.1.1 A Serial Implementation 418(1)
19.3.1.2 A Basic Parallel Implementation 418(2)
19.3.2 Segmented Scan 420(1)
19.4 Design of an Efficient Scan Algorithm 421(4)
19.4.1 Hierarchy of the Scan Algorithm 421(1)
19.4.2 Intra-Warp Scan Algorithm 422(1)
19.4.3 Intra-Block Scan Algorithm 423(1)
19.4.4 Global Scan Algorithm 423(2)
19.5 Design of an Efficient Segmented Scan Algorithm 425(8)
19.5.1 Operator Transformation 425(1)
19.5.2 Direct Intra-Warp Segmented Scan 426(4)
19.5.3 Block and Global Segmented Scan Algorithms 430(3)
19.6 Algorithmic Complexity 433(1)
19.7 Some Alternative Designs for Scan Algorithms 434(3)
19.7.1 Saving Bandwidth by Performing a Reduction 434(1)
19.7.2 Eliminating Recursion by Performing More Work per Block 435(2)
19.8 Optimizations in CUDPP 437(1)
19.9 Performance Analysis 437(3)
19.10 Conclusions 440(1)
Acknowledgments 440(1)
Bibliography 441(2)
20 High Performance Topology-Aware Communication in Multicore Processors 443(18)
Hari Subramoni, Fabrizio Petrini, Virat Agarwal, Davide Pasetto
20.1 Introduction 444(1)
20.2 Background 445(3)
20.2.1 Intel Nehalem 445(1)
20.2.2 Sun Niagara 446(1)
20.2.3 AMD Opteron 447(1)
20.2.4 MPI 447(1)
20.3 Methodology 448(1)
20.3.1 Basic Memory-Based Copy 448(1)
20.3.2 Vector Instructions 448(1)
20.3.3 Streaming Instructions 449(1)
20.3.4 Kernel-Based Direct Copy 449(1)
20.4 Experimental Results 449(9)
20.4.1 Intra-Socket Performance Results 450(4)
20.4.2 Inter-Socket Performance Results 454(1)
20.4.3 Comparison with MPI 455(2)
20.4.4 Performance Comparison of Different Multicore Architectures 457(1)
20.5 Related Work 458(1)
20.6 Conclusion and Future Work 458(1)
Bibliography 459(2)
Index 461
Jakub Kurzak is a research director in the Innovative Computing Laboratory in the Department of Electrical Engineering and Computer Science at the University of Tennessee. Dr. Kurzak is a program committee member for several international conferences and a reviewer for a number of top-ranking journals. His research focuses on utilizing multicore systems and accelerators for scientific computing.

David A. Bader is a professor in the School of Computational Science and Engineering, College of Computing, and executive director for High Performance Computing at the Georgia Institute of Technology. He is a lead scientist in the DARPA Ubiquitous High Performance Computing (UHPC) program, an associate editor for several high-impact journals, and editor of the book Petascale Computing: Algorithms and Applications (CRC Press, 2008). An IEEE Fellow and member of the ACM, Dr. Bader has been an NSF CAREER Award recipient and has received awards from IBM, NVIDIA, Intel, Sun Microsystems, and Microsoft Research. His main areas of research are in parallel algorithms, combinatorial optimization, and computational biology and genomics.

Jack Dongarra is a University Distinguished Professor of Electrical Engineering and Computer Science at the University of Tennessee, where he is the director of the Innovative Computing Laboratory and the director of the Center for Information Technology Research. He also is a member of the Distinguished Research Staff in the Computer Science and Mathematics Division at Oak Ridge National Laboratory, a Turing Fellow at the University of Manchester, and an adjunct professor in the Department of Computer Science at Rice University. A Fellow of the AAAS, ACM, IEEE, and SIAM, Dr. Dongarra has received numerous awards, including the first SIAM Special Interest Group on Supercomputing award for Career Achievement, the first IEEE Medal of Excellence in Scalable Computing, and the IEEE Sidney Fernbach Award. His research encompasses numerical algorithms in linear algebra, parallel computing, the use of advanced computer architectures, programming methodology, and tools for parallel computers.