Foreword  xiii
Preface  xvii
Acknowledgements  xix

Chapter 1 Introduction  1 (22)
    …  1 (1)
    Why Intel® Xeon Phi™ coprocessors are needed  2 (3)
    Platforms with coprocessors  5 (1)
    The first Intel® Xeon Phi™ coprocessor  6 (3)
    Keeping the "Ninja Gap" under control  9 (1)
    Transforming-and-tuning double advantage  10 (1)
    When to use an Intel® Xeon Phi™ coprocessor  11 (1)
    Maximizing performance on processors first  11 (1)
    Why scaling past one hundred threads is so important  12 (3)
    Maximizing parallel program performance  15 (1)
    Measuring readiness for highly parallel execution  15 (1)
    …  16 (1)
    Beyond the ease of porting to increased performance  16 (1)
    Transformation for performance  17 (1)
    Hyper-threading versus multithreading  17 (1)
    Coprocessor major usage model: MPI versus offload  18 (1)
    Compiler and programming models  19 (1)
    …  20 (1)
    …  21 (1)
    …  21 (2)

Chapter 2 High Performance Closed Track Test Drive!  23 (36)
    Looking under the hood: coprocessor specifications  24 (2)
    Starting the car: communicating with the coprocessor  26 (2)
    Taking it out easy: running our first code  28 (4)
    Starting to accelerate: running more than one thread  32 (6)
    Pedal to the metal: hitting full speed using all cores  38 (11)
    Easing in to the first curve: accessing memory bandwidth  49 (5)
    High speed banked curve: maximizing memory bandwidth  54 (3)
    Back to the pit: a summary  57 (2)

Chapter 3 A Friendly Country Road Race  59 (24)
    Preparing for our country road trip: chapter focus  59 (1)
    Getting a feel for the road: the 9-point stencil algorithm  60 (1)
    At the starting line: the baseline 9-point stencil implementation  61 (7)
    Rough road ahead: running the baseline stencil code  68 (2)
    Cobblestone street ride: vectors but not yet scaling  70 (2)
    Open road all-out race: vectors plus scaling  72 (3)
    Some grease and wrenches!: a bit of tuning  75 (6)
    Adjusting the "Alignment"  76 (1)
    …  77 (2)
    Using huge 2-MB memory pages  79 (2)
    …  81 (1)
    …  81 (2)

Chapter 4 Driving Around Town: Optimizing A Real-World Code Example  83 (24)
    Choosing the direction: the basic diffusion calculation  84 (1)
    Turn ahead: accounting for boundary effects  84 (7)
    Finding a wide boulevard: scaling the code  91 (2)
    Thunder road: ensuring vectorization  93 (4)
    Peeling out: peeling code from the inner loop  97 (3)
    Trying higher octane fuel: improving speed using data locality and tiling  100 (5)
    High speed driver certificate: summary of our high speed tour  105 (2)

Chapter 5 Lots of Data (Vectors)  107 (58)
    …  107 (1)
    …  108 (1)
    Five approaches to achieving vectorization  108 (2)
    Six step vectorization methodology  110 (2)
    Step 1 Measure baseline release build performance  111 (1)
    Step 2 Determine hotspots using Intel® VTune™ Amplifier XE  111 (1)
    Step 3 Determine loop candidates using Intel compiler vec-report  111 (1)
    Step 4 Get advice using the Intel Compiler GAP report and toolkit resources  112 (1)
    Step 5 Implement GAP advice and other suggestions (such as using elemental functions and/or array notations)  112 (1)
    …  112 (1)
    Streaming through caches: data layout, alignment, prefetching, and so on  112 (11)
    Why data layout affects vectorization performance  113 (1)
    …  114 (2)
    …  116 (5)
    …  121 (2)
    …  123 (3)
    Avoid manual loop unrolling  123 (1)
    Requirements for a loop to vectorize (Intel® Compiler)  124 (2)
    Importance of inlining, interference with simple profiling  126 (1)
    …  126 (2)
    Memory disambiguation inside vector-loops  127 (1)
    …  128 (22)
    …  129 (5)
    The VECTOR and NOVECTOR directives  134 (1)
    …  135 (2)
    Random number function vectorization  137 (1)
    Utilizing full vectors, -opt-assume-safe-padding  138 (4)
    Option -opt-assume-safe-padding  142 (1)
    Data alignment to assist vectorization  142 (4)
    Tradeoffs in array notations due to vector lengths  146 (4)
    Use array sections to encourage vectorization  150 (6)
    …  150 (2)
    Cilk Plus array sections and elemental functions  152 (4)
    Look at what the compiler created: assembly code inspection  156 (7)
    How to find the assembly code  157 (1)
    Quick inspection of assembly code  158 (5)
    Numerical result variations with vectorization  163 (1)
    …  163 (1)
    …  163 (2)

Chapter 6 Lots of Tasks (not Threads)  165 (24)
    OpenMP, Fortran 2008, Intel® TBB, Intel® Cilk™ Plus, Intel® MKL  166 (2)
    Task creation needs to happen on the coprocessor  166 (2)
    Importance of thread pools  168 (1)
    …  168 (3)
    Parallel processing model  168 (1)
    …  169 (1)
    Significant controls over OpenMP  169 (1)
    …  170 (1)
    …  171 (3)
    …  171 (1)
    DO CONCURRENT and DATA RACES  171 (1)
    …  172 (1)
    DO CONCURRENT vs. FORALL  173 (1)
    DO CONCURRENT vs. OpenMP "Parallel"  173 (1)
    …  174 (7)
    …  175 (2)
    …  177 (1)
    …  177 (1)
    …  177 (1)
    …  178 (1)
    …  179 (1)
    …  180 (1)
    …  180 (1)
    …  181 (1)
    …  181 (6)
    …  183 (1)
    Borrowing components from TBB  183 (1)
    Loaning components to TBB  184 (1)
    …  184 (1)
    …  184 (1)
    …  185 (2)
    …  187 (1)
    Array notation and elemental functions  187 (1)
    …  187 (1)
    …  187 (1)
    …  188 (1)

Chapter 7 Offload  189 (54)
    …  190 (1)
    Choosing offload vs. native execution  191 (1)
    Non-shared memory model: using offload pragmas/directives  191 (1)
    Shared virtual memory model: using offload with shared VM  191 (1)
    Intel® Math Kernel Library (Intel MKL) automatic offload  192 (1)
    Language extensions for offload  192 (3)
    Compiler options and environment variables for offload  193 (2)
    Sharing environment variables for offload  195 (1)
    Offloading to multiple coprocessors  195 (1)
    Using pragma/directive offload  195 (22)
    Placing variables and functions on the coprocessor  198 (2)
    Managing memory allocation for pointer variables  200 (6)
    Optimization for time: another reason to persist allocations  206 (1)
    Target-specific code using a pragma in C/C++  206 (3)
    Target-specific code using a directive in Fortran  209 (1)
    Code that should not be built for processor-only execution  209 (2)
    Predefined macros for Intel® MIC architecture  211 (1)
    …  211 (1)
    Allocating memory for parts of C/C++ arrays  212 (1)
    Allocating memory for parts of Fortran arrays  213 (1)
    Moving data from one variable to another  214 (1)
    Restrictions on offloaded code using a pragma  215 (2)
    Using offload with shared virtual memory  217 (11)
    Using shared memory and shared variables  217 (2)
    …  219 (1)
    Shared memory management functions  219 (1)
    Synchronous and asynchronous function execution: _Cilk_offload  219 (1)
    Sharing variables and functions: _Cilk_shared  220 (2)
    Rules for using _Cilk_shared and _Cilk_offload  222 (1)
    Synchronization between the processor and the target  222 (1)
    Writing target-specific code with _Cilk_offload  223 (1)
    Restrictions on offloaded code using shared virtual memory  224 (1)
    Persistent data when using shared virtual memory  225 (2)
    C++ declarations of persistent data with shared virtual memory  227 (1)
    About asynchronous computation  228 (1)
    About asynchronous data transfer  229 (5)
    Asynchronous data transfer from the processor to the coprocessor  229 (5)
    Applying the target attribute to multiple declarations  234 (4)
    Vec-report option used with offloads  235 (1)
    Measuring timing and data in offload regions  236 (1)
    …  236 (1)
    Using libraries in offloaded code  237 (1)
    About creating offload libraries with xiar and xild  237 (1)
    Performing file I/O on the coprocessor  238 (2)
    Logging stdout and stderr from offloaded code  240 (1)
    …  241 (1)
    …  241 (2)

Chapter 8 Coprocessor Architecture  243 (26)
    The Intel® Xeon Phi™ coprocessor family  244 (1)
    …  245 (1)
    Intel® Xeon Phi™ coprocessor silicon overview  246 (1)
    Individual coprocessor core architecture  247 (2)
    Instruction and multithread processing  249 (2)
    Cache organization and memory access considerations  251 (1)
    …  252 (1)
    Vector processing unit architecture  253 (4)
    …  254 (3)
    Coprocessor PCIe system interface and DMA  257 (3)
    …  258 (2)
    Coprocessor power management capabilities  260 (3)
    Reliability, availability, and serviceability (RAS)  263 (2)
    Machine check architecture (MCA)  264 (1)
    Coprocessor system management controller (SMC)  265 (2)
    …  265 (1)
    Thermal design power monitoring and control  266 (1)
    …  266 (1)
    Potential application impact  266 (1)
    …  267 (1)
    …  267 (1)
    …  267 (2)

Chapter 9 Coprocessor System Software  269 (24)
    Coprocessor software architecture overview  269 (2)
    …  271 (1)
    Ring levels: user and kernel  271 (1)
    Coprocessor programming models and options  271 (5)
    …  273 (1)
    Coprocessor MPI programming models  274 (2)
    Coprocessor software architecture components  276 (1)
    Development tools and application layer  276 (1)
    Intel® Manycore Platform Software Stack  277 (10)
    …  277 (1)
    COI: coprocessor offload infrastructure  278 (1)
    SCIF: symmetric communications interface  278 (1)
    Virtual networking (NetDev), TCP/IP, and sockets  278 (1)
    Coprocessor system management  279 (3)
    Coprocessor components for MPI applications  282 (5)
    Linux support for Intel® Xeon Phi™ coprocessors  287 (1)
    Tuning memory allocation performance  288 (2)
    Controlling the number of 2 MB pages  288 (1)
    Monitoring the number of 2 MB pages on the coprocessor  288 (1)
    A sample method for allocating 2 MB pages  289 (1)
    …  290 (1)
    …  291 (2)

Chapter 10 Linux on the Coprocessor  293 (32)
    Coprocessor Linux baseline  293 (1)
    Introduction to coprocessor Linux bootstrap and configuration  294 (1)
    Default coprocessor Linux configuration  295 (2)
    Step 1 Ensure root access  296 (1)
    Step 2 Generate the default configuration  296 (1)
    Step 3 Change configuration  296 (1)
    Step 4 Start the Intel® MPSS service  296 (1)
    Changing coprocessor configuration  297 (8)
    …  297 (1)
    …  298 (1)
    Configuring boot parameters  298 (2)
    Coprocessor root file system  300 (5)
    …  305 (7)
    Coprocessor state control  306 (1)
    …  306 (1)
    Shutting down coprocessors  306 (1)
    Rebooting the coprocessors  306 (1)
    …  307 (1)
    Coprocessor configuration initialization and propagation  308 (1)
    Helper functions for configuration parameters  309 (2)
    Other file system helper functions  311 (1)
    …  312 (3)
    Adding files to the root file system  313 (1)
    Example: Adding a new global file set  314 (1)
    Coprocessor Linux boot process  315 (3)
    …  315 (3)
    Coprocessors in a Linux cluster  318 (4)
    …  319 (1)
    How Intel® Cluster Checker works  319 (1)
    Intel® Cluster Checker support for coprocessors  320 (2)
    …  322 (1)
    …  323 (2)

Chapter 11 Math Library  325 (18)
    Intel Math Kernel Library overview  326 (1)
    Intel MKL differences on the coprocessor  327 (1)
    Intel MKL and Intel compiler  327 (1)
    Coprocessor support overview  327 (3)
    Control functions for automatic offload  328 (2)
    Examples of how to set the environment variables  330 (1)
    Using the coprocessor in native mode  330 (2)
    Tips for using native mode  332 (1)
    Using automatic offload mode  332 (5)
    How to enable automatic offload  333 (1)
    Examples of using control work division  333 (1)
    Tips for effective use of automatic offload  333 (3)
    Some tips for effective use of Intel MKL with or without offload  336 (1)
    Using compiler-assisted offload  337 (2)
    Tips for using compiler-assisted offload  338 (1)
    Precision choices and variations  339 (3)
    Fast transcendentals and mathematics  339 (1)
    Understanding the potential for floating-point arithmetic variations  339 (3)
    …  342 (1)
    …  342 (1)

Chapter 12 MPI  343 (20)
    …  343 (2)
    Using MPI on Intel® Xeon Phi™ coprocessors  345 (4)
    Heterogeneity (and why it matters)  345 (3)
    Prerequisites (batteries not included)  348 (1)
    …  349 (5)
    …  350 (1)
    …  350 (4)
    Using MPI natively on the coprocessor  354 (7)
    …  354 (2)
    Trapezoidal rule (revisited)  356 (5)
    …  361 (1)
    …  362 (1)

Chapter 13 Profiling and Timing  363 (22)
    Event monitoring registers on the coprocessor  364 (1)
    List of events used in this guide  364 (1)
    …  364 (6)
    …  365 (4)
    Compute to data access ratio  369 (1)
    Potential performance issues  370 (7)
    …  371 (2)
    …  373 (1)
    …  374 (2)
    …  376 (1)
    Intel® VTune™ Amplifier XE product  377 (1)
    …  378 (1)
    Performance application programming interface  378 (1)
    MPI analysis: Intel Trace Analyzer and Collector  378 (2)
    Generating a trace file: coprocessor-only application  379 (1)
    Generating a trace file: processor + coprocessor application  379 (1)
    …  380 (3)
    Clocksources on the coprocessor  380 (1)
    MIC elapsed time counter (micetc)  380 (1)
    …  380 (1)
    …  381 (1)
    …  381 (1)
    …  382 (1)
    Measuring timing and data in offload regions  383 (1)
    …  383 (1)
    …  383 (2)

Chapter 14 Summary  385 (2)
    …  385 (1)
    …  386 (1)
    …  386 (1)
    …  386 (1)

Glossary  387 (14)
Index  401