…  xvii
…  xxv
Preface  xxvii
About the Editors  xxix
Contributors  xxxi

I  …  1 (80)

1 Implementing Matrix Multiplication on the Cell B.E.  3 (18)
1.1 …  3 (2)
1.1.1 Performance Considerations  4 (1)
1.1.2 Code Size Considerations  4 (1)
1.2 …  5 (11)
1.2.1 …  5 (1)
1.2.2 C = C - A × B^trans  6 (4)
1.2.3 …  10 (2)
1.2.4 Advancing Tile Pointers  12 (4)
…  16 (1)
…  17 (1)
…  18 (1)
…  19 (2)

2 Implementing Matrix Factorizations on the Cell B.E.  21 (16)
2.1 …  21 (1)
2.2 Cholesky Factorization  22 (1)
2.3 Tile QR Factorization  23 (3)
2.4 …  26 (2)
2.5 Parallelization---Single Cell B.E.  28 (2)
2.6 Parallelization---Dual Cell B.E.  30 (1)
…  31 (1)
…  32 (1)
…  33 (1)
…  34 (3)

3 Dense Linear Algebra for Hybrid GPU-Based Systems  37 (20)
3.1 …  37 (2)
3.1.1 Linear Algebra (LA)---Enabling New Architectures  38 (1)
3.1.2 MAGMA---LA Libraries for Hybrid Architectures  38 (1)
3.2 Hybrid DLA Algorithms  39 (11)
3.2.1 How to Code DLA for GPUs?  39 (2)
3.2.2 The Approach---Hybridization of DLA Algorithms  41 (2)
3.2.3 One-Sided Factorizations  43 (3)
3.2.4 Two-Sided Factorizations  46 (4)
…  50 (3)
…  53 (1)
…  54 (3)

4 …  57 (24)
4.1 …  57 (1)
4.2 BLAS Kernels Development  58 (10)
4.2.1 …  60 (1)
4.2.2 …  61 (1)
4.2.3 …  61 (2)
4.2.4 …  63 (1)
4.2.5 …  64 (1)
4.2.6 …  65 (1)
4.2.7 …  66 (1)
4.2.8 …  67 (1)
4.3 Generic Kernel Optimizations  68 (9)
4.3.1 Pointer Redirecting  68 (4)
4.3.2 …  72 (1)
4.3.3 …  72 (5)
…  77 (2)
…  79 (2)

II  …  81 (30)

5 Sparse Matrix-Vector Multiplication on Multicore and Accelerators  83 (28)
5.1 …  84 (1)
5.2 Sparse Matrix-Vector Multiplication: Overview and Intuition  84 (2)
5.3 Architectures, Programming Models, and Matrices  86 (5)
5.3.1 Hardware Architectures  86 (3)
5.3.2 Parallel Programming Models  89 (1)
5.3.3 …  90 (1)
5.4 Implications of Architecture on SpMV  91 (2)
5.4.1 …  91 (1)
5.4.2 …  92 (1)
5.5 Optimization Principles for SpMV  93 (6)
5.5.1 Reorganization for Efficient Parallelization  93 (2)
5.5.2 Orchestrating Data Movement  95 (1)
5.5.3 Reducing Memory Traffic  96 (1)
5.5.4 Putting It All Together: Implementations  97 (2)
5.6 …  99 (6)
5.6.1 Xeon X5550 (Nehalem)  100 (2)
5.6.2 …  102 (1)
5.6.3 …  103 (2)
5.7 Summary: Cross-Study Comparison  105 (2)
…  107 (1)
…  108 (3)

III  …  111 (38)

6 Hardware-Oriented Multigrid Finite Element Solvers on GPU-Accelerated Clusters  113 (18)
6.1 Introduction and Motivation  113 (3)
6.2 FEAST---Finite Element Analysis and Solution Tools  116 (4)
6.2.1 Separation of Structured and Unstructured Data  117 (1)
6.2.2 Parallel Multigrid Solvers  117 (1)
6.2.3 Scalar and Multivariate Problems  118 (1)
6.2.4 Co-Processor Acceleration  119 (1)
6.3 Two FEAST Applications: FEASTSOLID and FEASTFLOW  120 (4)
6.3.1 Computational Solid Mechanics  120 (1)
6.3.2 Computational Fluid Dynamics  121 (1)
6.3.3 Solving CSM and CFD Problems with FEAST  122 (2)
6.4 Performance Assessments  124 (4)
6.4.1 GPU-Based Multigrid on a Single Subdomain  124 (1)
6.4.2 …  125 (1)
6.4.3 Application Speedup  125 (3)
…  128 (1)
…  128 (1)
…  128 (3)

7 Mixed-Precision GPU-Multigrid Solvers with Strong Smoothers  131 (18)
7.1 …  131 (3)
7.1.1 Numerical Solution of Partial Differential Equations  132 (1)
7.1.2 Hardware-Oriented Discretization of Large Domains  132 (1)
7.1.3 Mixed-Precision Iterative Refinement Multigrid  133 (1)
7.2 Fine-Grained Parallelization of Multigrid Solvers  134 (7)
7.2.1 Smoothers on the CPU  134 (2)
7.2.2 Exact Parallelization: Jacobi and Tridiagonal Solvers  136 (2)
7.2.3 Multicolor Parallelization: Gauß-Seidel Solvers  138 (1)
7.2.4 Combination of Tridiagonal and Gauß-Seidel Smoothers  139 (1)
7.2.5 Alternating Direction Implicit Method  140 (1)
7.3 Numerical Evaluation and Performance Results  141 (4)
7.3.1 …  141 (1)
7.3.2 Solver Configuration and Hardware Details  142 (1)
7.3.3 Numerical Evaluation  142 (1)
…  143 (2)
…  145 (1)
7.4 Summary and Conclusions  145 (1)
…  145 (1)
…  146 (3)

IV Fast Fourier Transforms  149 (44)

8 Designing Fast Fourier Transform for the IBM Cell Broadband Engine  151 (20)
8.1 …  151 (1)
8.2 …  152 (2)
8.3 Fast Fourier Transform  154 (1)
8.4 Cell Broadband Engine Architecture  155 (3)
8.5 FFTC: Our FFT Algorithm for the Cell/B.E. Processor  158 (5)
8.5.1 Parallelizing FFTC for the Cell  158 (1)
8.5.2 Optimizing FFTC for the SPEs  159 (4)
8.6 Performance Analysis of FFTC  163 (3)
…  166 (1)
…  167 (1)
…  167 (4)

9 Implementing FFTs on Multicore Architectures  171 (22)
9.1 …  172 (1)
9.2 Computational Aspects of FFT Algorithms  173 (2)
9.2.1 An Upper Bound on FFT Performance  174 (1)
9.3 Data Movement and Preparation of FFT Algorithms  175 (2)
9.4 Multicore FFT Performance Optimization  177 (1)
9.5 …  178 (7)
9.5.1 Registers and Load and Store Operations  180 (1)
9.5.1.1 Applying SIMD Operations  180 (1)
9.5.1.2 Instruction Pipeline  181 (1)
9.5.1.3 Multi-Issue Instructions  182 (1)
9.5.2 Private and Shared Core Memory, and Their Data Movement  183 (1)
9.5.2.1 …  183 (1)
9.5.2.2 Parallel Computation on Shared Core Memory  183 (1)
9.5.3 …  184 (1)
9.5.3.1 Index-Bit Reversal of Data Block Addresses  184 (1)
9.5.3.2 Transposition of the Elements  184 (1)
…  185 (1)
9.6 Generic FFT Generators and Tooling  185 (2)
9.6.1 A Platform-Independent Expression of Performance Planning  185 (1)
9.6.2 Reducing the Mapping Space  186 (1)
…  187 (1)
9.7 Case Study: Large, Multi-Dimensional FFT on a Network Clustered System  187 (3)
…  190 (1)
…  190 (3)

V Combinatorial Algorithms  193 (24)

10 Combinatorial Algorithm Design on the Cell/B.E. Processor  195 (22)
10.1 …  195 (3)
10.2 Algorithm Design and Analysis on the Cell/B.E.  198 (3)
10.2.1 A Complexity Model  198 (1)
10.2.2 Analyzing Algorithms  199 (2)
10.3 …  201 (12)
10.3.1 A Parallelization Strategy  201 (1)
10.3.2 Complexity Analysis  202 (1)
10.3.3 A Novel Latency-Hiding Technique for Irregular Applications  203 (2)
10.3.4 Cell/B.E. Implementation  205 (1)
10.3.5 Performance Results  206 (7)
…  213 (1)
…  214 (1)
…  214 (3)

VI  …  217 (62)

11 Auto-Tuning Stencil Computations on Multicore and Accelerators  219 (36)
11.1 …  220 (1)
11.2 …  221 (1)
11.3 Experimental Testbed  222 (1)
11.4 Performance Expectation  223 (8)
11.4.1 Stencil Characteristics  225 (1)
11.4.2 A Brief Introduction to the Roofline Model  225 (2)
11.4.3 Roofline Model-Based Performance Expectations  227 (4)
11.5 Stencil Optimizations  231 (5)
11.5.1 Parallelization and Problem Decomposition  232 (1)
11.5.2 …  233 (1)
11.5.3 Bandwidth Optimizations  233 (1)
11.5.4 In-Core Optimizations  234 (1)
11.5.5 Algorithmic Transformations  235 (1)
11.6 Auto-Tuning Methodology  236 (3)
11.6.1 Architecture-Specific Exceptions  238 (1)
11.7 Results and Analysis  239 (10)
11.7.1 Nehalem Performance  240 (2)
11.7.2 Barcelona Performance  242 (1)
11.7.3 Clovertown Performance  243 (1)
11.7.4 Blue Gene/P Performance  244 (1)
11.7.5 Victoria Falls Performance  244 (1)
11.7.6 …  245 (1)
11.7.7 GTX280 Performance  246 (1)
11.7.8 Cross-Platform Performance and Power Comparison  247 (2)
…  249 (2)
…  251 (1)
…  251 (4)

12 Manycore Stencil Computations in Hyperthermia Applications  255 (24)
12.1 …  255 (1)
12.2 Hyperthermia Applications  256 (3)
12.3 Bandwidth-Saving Stencil Computations  259 (7)
12.3.1 Spatial Blocking and Parallelization  259 (2)
12.3.2 …  261 (2)
12.3.2.1 Temporally Blocking the Hyperthermia Stencil  263 (1)
12.3.2.2 Speedup for the Hyperthermia Stencil  264 (2)
12.4 Experimental Performance Results  266 (7)
12.4.1 …  268 (3)
12.4.2 Application Benchmarks  271 (2)
…  273 (1)
…  273 (1)
…  274 (1)
…  274 (5)

VII  …  279 (50)

13 Enabling Bioinformatics Algorithms on the Cell/B.E. Processor  281 (16)
13.1 Computational Biology and High-Performance Computing  281 (2)
13.2 The Cell/B.E. Processor  283 (1)
13.2.1 Cache Implementation on Cell/B.E.  283 (1)
13.3 Sequence Analysis and Its Applications  284 (2)
13.4 Sequence Analysis on the Cell/B.E. Processor  286 (3)
13.4.1 …  286 (2)
13.4.2 …  288 (1)
13.5 …  289 (5)
13.5.1 Experimental Setup  289 (1)
13.5.2 ClustalW Results and Analysis  289 (4)
13.5.3 FASTA Results and Analysis  293 (1)
13.6 Conclusions and Future Work  294 (1)
…  295 (2)

14 Pairwise Computations on the Cell Processor  297 (32)
14.1 …  298 (1)
14.2 Scheduling Pairwise Computations  299 (9)
14.2.1 …  300 (1)
14.2.2 …  301 (1)
14.2.3 …  302 (1)
14.2.3.1 Fetching Input Vectors  303 (1)
14.2.3.2 Shifting Column Vectors  303 (1)
14.2.3.3 Transferring Output Data  303 (1)
14.2.3.4 Minimizing Number of DMA Transfers  304 (1)
14.2.4 Extending Tiling across Multiple Cell Processors  305 (1)
14.2.5 Extending Tiling to Large Number of Dimensions  306 (2)
14.3 Reconstructing Gene Regulatory Networks  308 (4)
14.3.1 Computing Pairwise Mutual Information on the Cell  309 (1)
14.3.2 Performance of Pairwise MI Computations on One Cell Blade  310 (1)
14.3.3 Performance of MI Computations on Multiple Cell Blades  311 (1)
14.4 Pairwise Genomic Alignments  312 (11)
14.4.1 Computing Alignments  312 (1)
14.4.1.1 Global/Local Alignment  313 (1)
14.4.1.2 Spliced Alignment  314 (1)
14.4.1.3 Syntenic Alignment  315 (1)
14.4.2 A Parallel Alignment Algorithm for the Cell BE  316 (1)
14.4.2.1 Parallel Alignment Using Prefix Computations  316 (1)
14.4.2.2 Wavefront Communication Scheme  317 (1)
14.4.2.3 A Hybrid Parallel Algorithm  318 (2)
14.4.2.4 Hirschberg's Technique for Linear Space  320 (1)
14.4.2.5 Algorithms for Specialized Alignments  321 (1)
…  321 (1)
14.4.3 Performance of the Hybrid Alignment Algorithms  321 (2)
…  323 (1)
…  324 (1)
…  324 (5)

VIII  …  329 (44)

15 Drug Design on the Cell BE  331 (20)
15.1 …  332 (1)
15.2 Bioinformatics and Drug Design  333 (4)
15.2.1 Protein-Ligand Docking  335 (1)
15.2.2 Protein-Protein Docking  336 (1)
15.2.3 Molecular Mechanics  337 (1)
15.3 Cell BE Porting Analysis  337 (2)
15.4 …  339 (1)
15.5 Case Study: Docking with FTDock  339 (4)
15.5.1 Algorithm Description  339 (1)
15.5.2 Profiling and Implementation  340 (1)
15.5.3 Performance Evaluation  341 (2)
15.6 Case Study: Molecular Dynamics with Moldy  343 (2)
15.6.1 Algorithm Description  343 (1)
15.6.2 Profiling and Implementation  343 (1)
15.6.3 Performance Evaluation  344 (1)
…  345 (1)
…  346 (1)
…  346 (5)

16 GPU Algorithms for Molecular Modeling  351 (22)
16.1 …  352 (1)
16.2 Computational Challenges of Molecular Modeling  352 (1)
16.3 …  353 (4)
16.3.1 GPU Hardware Organization  354 (1)
16.3.2 GPU Programming Model  355 (2)
16.4 GPU Particle-Grid Algorithms  357 (4)
16.4.1 Electrostatic Potential  357 (1)
16.4.2 Direct Summation on GPUs  358 (2)
16.4.3 Cutoff Summation on GPUs  360 (1)
16.4.4 Floating-Point Precision Effects  361 (1)
16.5 GPU N-Body Algorithms  361 (4)
16.5.1 …  361 (1)
16.5.2 N-Body Forces on GPUs  362 (2)
16.5.3 Long-Range Electrostatic Forces  364 (1)
16.6 Adapting Software for GPU Acceleration  365 (3)
16.6.1 Case Study: NAMD Parallel Molecular Dynamics  365 (2)
16.6.2 Case Study: VMD Molecular Graphics and Analysis  367 (1)
…  368 (1)
…  369 (1)
…  369 (4)

IX  …  373 (88)

17 Dataflow Frameworks for Emerging Heterogeneous Architectures and Their Application to Biomedicine  375 (18)
17.1 …  375 (2)
17.2 Dataflow Computing Model and Runtime Support  377 (1)
17.3 Use Case Application: Neuroblastoma Image Analysis System  378 (2)
17.4 Middleware for Multi-Granularity Dataflow  380 (8)
17.4.1 Coarse-Grained on Distributed GPU Clusters  381 (1)
17.4.1.1 Supporting Heterogeneous Resources  381 (2)
17.4.1.2 Experimental Evaluation  383 (3)
17.4.2 Fine-Grained on Cell  386 (1)
17.4.2.1 DCL for Cell---Design and Architecture  386 (1)
…  386 (2)
17.5 Conclusions and Future Work  388 (1)
…  389 (1)
…  389 (4)

18 Accelerator Support in the Charm++ Parallel Programming Model  393 (20)
18.1 …  393 (1)
18.2 Motivations and Goals of Our Work  394 (2)
18.3 The Charm++ Parallel Programming Model  396 (2)
18.3.1 General Description of Charm++  396 (1)
18.3.2 Suitability of Charm++ for Exploiting Accelerators  397 (1)
18.4 Support for Cell and Larrabee in Charm++  398 (7)
18.4.1 SIMD Instruction Abstraction  400 (1)
18.4.2 Accelerated Entry Methods  401 (2)
18.4.3 Support for Heterogeneous Systems  403 (1)
…  404 (1)
18.5 Support for CUDA-Based GPUs  405 (2)
…  407 (1)
…  408 (1)
…  408 (5)

19 Efficient Parallel Scan Algorithms for Manycore GPUs  413 (30)
19.1 …  414 (2)
19.2 CUDA---A General-Purpose Parallel Computing Architecture for Graphics Processors  416 (1)
19.3 Scan: An Algorithmic Primitive for Efficient Data-Parallel Computation  417 (4)
19.3.1 …  417 (1)
19.3.1.1 A Serial Implementation  418 (1)
19.3.1.2 A Basic Parallel Implementation  418 (2)
…  420 (1)
19.4 Design of an Efficient Scan Algorithm  421 (4)
19.4.1 Hierarchy of the Scan Algorithm  421 (1)
19.4.2 Intra-Warp Scan Algorithm  422 (1)
19.4.3 Intra-Block Scan Algorithm  423 (1)
19.4.4 Global Scan Algorithm  423 (2)
19.5 Design of an Efficient Segmented Scan Algorithm  425 (8)
19.5.1 Operator Transformation  425 (1)
19.5.2 Direct Intra-Warp Segmented Scan  426 (4)
19.5.3 Block and Global Segmented Scan Algorithms  430 (3)
19.6 Algorithmic Complexity  433 (1)
19.7 Some Alternative Designs for Scan Algorithms  434 (3)
19.7.1 Saving Bandwidth by Performing a Reduction  434 (1)
19.7.2 Eliminating Recursion by Performing More Work per Block  435 (2)
19.8 Optimizations in CUDPP  437 (1)
19.9 Performance Analysis  437 (3)
…  440 (1)
…  440 (1)
…  441 (2)

20 High Performance Topology-Aware Communication in Multicore Processors  443 (18)
20.1 …  444 (1)
20.2 …  445 (3)
20.2.1 …  445 (1)
20.2.2 …  446 (1)
20.2.3 …  447 (1)
…  447 (1)
20.3 …  448 (1)
20.3.1 Basic Memory-Based Copy  448 (1)
20.3.2 Vector Instructions  448 (1)
20.3.3 Streaming Instructions  449 (1)
20.3.4 Kernel-Based Direct Copy  449 (1)
20.4 Experimental Results  449 (9)
20.4.1 Intra-Socket Performance Results  450 (4)
20.4.2 Inter-Socket Performance Results  454 (1)
20.4.3 Comparison with MPI  455 (2)
20.4.4 Performance Comparison of Different Multicore Architectures  457 (1)
20.5 …  458 (1)
20.6 Conclusion and Future Work  458 (1)
…  459 (2)

Index  461