
Intel Xeon Phi Coprocessor High Performance Programming [Paperback]

Jim Jeffers (Principal Engineer and Visualization Lead, Intel Corporation), James Reinders (Director and Programming Model Architect, Intel Corporation)
  • Format: Paperback / softback, 432 pages, height x width: 235x191 mm, weight: 720 g, contains 1 digital product (delivered electronically)
  • Publication date: 28-Mar-2013
  • Publisher: Morgan Kaufmann Publishers Inc
  • ISBN-10: 0124104142
  • ISBN-13: 9780124104143
  • Paperback
  • Price: 66,03 €*
  • * We will send you an offer for a used copy; its price may differ from the price shown on the website.
  • This book is out of print, but we will send you an offer for a used copy.
  • Free shipping

Authors Jim Jeffers and James Reinders spent two years helping educate customers about the prototype and pre-production hardware before Intel introduced the first Intel Xeon Phi coprocessor. They have distilled their own experiences coupled with insights from many expert customers, Intel Field Engineers, Application Engineers and Technical Consulting Engineers, to create this authoritative first book on the essentials of programming for this new architecture and these new products.

This book is useful even before you ever touch a system with an Intel Xeon Phi coprocessor. To ensure that your applications run at maximum efficiency, the authors emphasize key techniques for programming any modern parallel computing system, whether based on Intel Xeon processors, Intel Xeon Phi coprocessors, or other high-performance microprocessors. Applying these techniques will generally increase your program performance on any system and better prepare you for Intel Xeon Phi coprocessors and the Intel MIC architecture.
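To make that claim concrete, here is a minimal sketch (our illustration, not an example taken from the book) of the portable "threads plus vectors" style the authors advocate: plain C with OpenMP that runs as-is on an Intel Xeon processor and can be rebuilt to run natively on an Intel Xeon Phi coprocessor, where it simply has more threads and wider vectors to exploit.

    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    #define N (1024 * 1024)

    int main(void) {
        float *x = malloc(N * sizeof(float));
        float *y = malloc(N * sizeof(float));
        const float a = 3.0f;
        for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

        double t0 = omp_get_wtime();
        /* Threads spread the work across cores; the compiler is free to
           vectorize the loop body across SIMD lanes, so the same source
           exploits both levels of parallelism. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            y[i] = a * x[i] + y[i];
        double t1 = omp_get_wtime();

        printf("saxpy, %d threads: %.3f ms, y[0] = %g\n",
               omp_get_max_threads(), (t1 - t0) * 1e3, y[0]);
        free(x); free(y);
        return 0;
    }

Tuning code like this on the processor and then recompiling it for the coprocessor is the "transforming-and-tuning double advantage" the table of contents refers to.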


    • A practical guide to the essentials of the Intel Xeon Phi coprocessor
    • Presents best practices for portable, high-performance computing and a familiar and proven threaded, scalar-vector programming model
    • Includes simple but informative code examples that explain the unique aspects of this new highly parallel and high performance computational product
    • Covers wide vectors, many cores, many threads and high bandwidth cache/memory architecture
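As a taste of the offload usage model the book covers in Chapter 7, here is a minimal sketch (our illustration using the pragma-based offload syntax of the Intel compilers of that era, not code from the book): the marked function and its data are shipped to the coprocessor, and execution falls back to the processor when no coprocessor is present.

    #include <stdio.h>
    #include <stdlib.h>

    /* Compile this function for the coprocessor as well as the host. */
    __attribute__((target(mic)))
    float sum_array(const float *a, int n) {
        float s = 0.0f;
        for (int i = 0; i < n; i++)
            s += a[i];
        return s;
    }

    int main(void) {
        int n = 1000000;
        float *a = malloc(n * sizeof(float));
        for (int i = 0; i < n; i++) a[i] = 1.0f;

        float s = 0.0f;
        /* Copy the array to the card, run sum_array there, copy s back.
           If no coprocessor is installed, the statement runs on the host. */
        #pragma offload target(mic) in(a : length(n)) inout(s)
        s = sum_array(a, n);

        printf("sum = %g\n", s);
        free(a);
        return 0;
    }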

    Reviews

    "Read this book. Authors Jim Jeffers and James Reinders spent two years helping educate customers about the prototype and pre-production hardware before Intel introduced the first Intel Xeon Phi coprocessor. They have distilled their own experiences coupled with insights from many expert customers, to create this authoritative first book on the essentials of programming for this new architecture and these new products." --Slashdot.org, May 5, 2013

    "The authorsare uniquely experienced in software development for this new silicon. As a result, this book is the definitive programming reference for the 60+ core monster from Intelhighly readable and interlaced with lots of code examples." --DrDobbs.com, April 2, 2013

    "This book belongs on the bookshelf of every HPC professional. Not only does it successfully and accessibly teach us how to use and obtain high performance on the Intel MIC architecture, it is about much more than that. It takes us back to the universal fundamentals of high-performance computing including how to think and reason about the performance of algorithms mapped to modern architectures, and it puts into your hands powerful tools that will be useful for years to come." --Robert J. Harrison, Institute for Advanced Computational Science, Stony Brook University, from the Foreword

    "The book benefits software engineers, scientific researchers, and high performance and supercomputing developers in need of high-performance computing resources" --HPCwire.com, March 31, 2013

    "The book benefits software engineers, scientific researchers, and high performance and supercomputing developers in need of high-performance computing resourcesI got my hands on a preliminary copy of the book back in November at SC12, and I can tell you that Jim and James did a great job."--Knowledgespeak.com, April 1, 2013

    Other information

    Exploit the parallel power of the Intel Xeon Phi coprocessor for high-performance computing.
    Foreword xiii
    Preface xvii
    Acknowledgements xix
    Chapter 1 Introduction 1(22)
    Trend: more parallelism 1(1)
    Why Intel® Xeon Phi™ coprocessors are needed 2(3)
    Platforms with coprocessors 5(1)
    The first Intel® Xeon Phi™ coprocessor 6(3)
    Keeping the "Ninja Gap" under control 9(1)
    Transforming-and-tuning double advantage 10(1)
    When to use an Intel® Xeon Phi™ coprocessor 11(1)
    Maximizing performance on processors first 11(1)
    Why scaling past one hundred threads is so important 12(3)
    Maximizing parallel program performance 15(1)
    Measuring readiness for highly parallel execution 15(1)
    What about GPUs? 16(1)
    Beyond the ease of porting to increased performance 16(1)
    Transformation for performance 17(1)
    Hyper-threading versus multithreading 17(1)
    Coprocessor major usage model: MPI versus offload 18(1)
    Compiler and programming models 19(1)
    Cache optimizations 20(1)
    Examples, then details 21(1)
    For more information 21(2)
    Chapter 2 High Performance Closed Track Test Drive! 23(36)
    Looking under the hood: coprocessor specifications 24(2)
    Starting the car: communicating with the coprocessor 26(2)
    Taking it out easy: running our first code 28(4)
    Starting to accelerate: running more than one thread 32(6)
    Petal to the metal: hitting full speed using all cores 38(11)
    Easing in to the first curve: accessing memory bandwidth 49(5)
    High speed banked curve: maximizing memory bandwidth 54(3)
    Back to the pit: a summary 57(2)
    Chapter 3 A Friendly Country Road Race 59(24)
    Preparing for our country road trip: chapter focus 59(1)
    Getting a feel for the road: the 9-point stencil algorithm 60(1)
    At the starting line: the baseline 9-point stencil implementation 61(7)
    Rough road ahead: running the baseline stencil code 68(2)
    Cobblestone street ride: vectors but not yet scaling 70(2)
    Open road all-out race: vectors plus scaling 72(3)
    Some grease and wrenches!: a bit of tuning 75(6)
    Adjusting the "Alignment" 76(1)
    Using streaming stores 77(2)
    Using huge 2-MB memory pages 79(2)
    Summary 81(1)
    For more information 81(2)
    Chapter 4 Driving Around Town: Optimizing A Real-World Code Example 83(24)
    Choosing the direction: the basic diffusion calculation 84(1)
    Turn ahead: accounting for boundary effects 84(7)
    Finding a wide boulevard: scaling the code 91(2)
    Thunder road: ensuring vectorization 93(4)
    Peeling out: peeling code from the inner loop 97(3)
    Trying higher octane fuel: improving speed using data locality and tiling 100(5)
    High speed driver certificate: summary of our high speed tour 105(2)
    Chapter 5 Lots of Data (Vectors) 107(58)
    Why vectorize? 107(1)
    How to vectorize 108(1)
    Five approaches to achieving vectorization 108(2)
    Six step vectorization methodology 110(2)
    Step 1 Measure baseline release build performance 111(1)
    Step 2 Determine hotspots using Intel® VTune™ Amplifier XE 111(1)
    Step 3 Determine loop candidates using Intel compiler vec-report 111(1)
    Step 4 Get advice using the Intel Compiler GAP report and toolkit resources 112(1)
    Step 5 Implement GAP advice and other suggestions (such as using elemental functions and/or array notations) 112(1)
    Step 6 Repeat! 112(1)
    Streaming through caches: data layout, alignment, prefetching, and so on 112(11)
    Why data layout affects vectorization performance 113(1)
    Data alignment 114(2)
    Prefetching 116(5)
    Streaming stores 121(2)
    Compiler tips 123(3)
    Avoid manual loop unrolling 123(1)
    Requirements for a loop to vectorize (Intel® Compiler) 124(2)
    Importance of inlining, interference with simple profiling 126(1)
    Compiler options 126(2)
    Memory disambiguation inside vector-loops 127(1)
    Compiler directives 128(22)
    SIMD directives 129(5)
    The VECTOR and NOVECTOR directives 134(1)
    The IVDEP directive 135(2)
    Random number function vectorization 137(1)
    Utilizing full vectors, -opt-assume-safe-padding 138(4)
    Option -opt-assume-safe-padding 142(1)
    Data alignment to assist vectorization 142(4)
    Tradeoffs in array notations due to vector lengths 146(4)
    Use array sections to encourage vectorization 150(6)
    Fortran array sections 150(2)
    Cilk Plus array sections and elemental functions 152(4)
    Look at what the compiler created: assembly code inspection 156(7)
    How to find the assembly code 157(1)
    Quick inspection of assembly code 158(5)
    Numerical result variations with vectorization 163(1)
    Summary 163(1)
    For more information 163(2)
    Chapter 6 Lots of Tasks (not Threads) 165(24)
    OpenMP, Fortran 2008, Intel® TBB, Intel® Cilk™ Plus, Intel® MKL 166(2)
    Task creation needs to happen on the coprocessor 166(2)
    Importance of thread pools 168(1)
    OpenMP 168(3)
    Parallel processing model 168(1)
    Directives 169(1)
    Significant controls over OpenMP 169(1)
    Nesting 170(1)
    Fortran 2008 171(3)
    DO CONCURRENT 171(1)
    DO CONCURRENT and DATA RACES 171(1)
    DO CONCURRENT definition 172(1)
    DO CONCURRENT vs. FORALL 173(1)
    DO CONCURRENT vs. OpenMP "Parallel" 173(1)
    Intel® TBB 174(7)
    History 175(2)
    Using TBB 177(1)
    parallel_for 177(1)
    blocked_range 177(1)
    Partitioners 178(1)
    parallel_reduce 179(1)
    parallel_invoke 180(1)
    Notes on C++11 180(1)
    TBB summary 181(1)
    Cilk Plus 181(6)
    History 183(1)
    Borrowing components from TBB 183(1)
    Loaning components to TBB 184(1)
    Keyword spelling 184(1)
    cilk_for 184(1)
    cilk_spawn and cilk_sync 185(2)
    Reducers (Hyperobjects) 187(1)
    Array notation and elemental functions 187(1)
    Cilk Plus summary 187(1)
    Summary 187(1)
    For more information 188(1)
    Chapter 7 Offload 189(54)
    Two offload models 190(1)
    Choosing offload vs. native execution 191(1)
    Non-shared memory model: using offload pragmas/directives 191(1)
    Shared virtual memory model: using offload with shared VM 191(1)
    Intel® Math Kernel Library (Intel MKL) automatic offload 192(1)
    Language extensions for offload 192(3)
    Compiler options and environment variables for offload 193(2)
    Sharing environment variables for offload 195(1)
    Offloading to multiple coprocessors 195(1)
    Using pragma/directive offload 195(22)
    Placing variables and functions on the coprocessor 198(2)
    Managing memory allocation for pointer variables 200(6)
    Optimization for time: another reason to persist allocations 206(1)
    Target-specific code using a pragma in C/C++ 206(3)
    Target-specific code using a directive in Fortran 209(1)
    Code that should not be built for processor-only execution 209(2)
    Predefined macros for Intel® MIC architecture 211(1)
    Fortran arrays 211(1)
    Allocating memory for parts of C/C++ arrays 212(1)
    Allocating memory for parts of Fortran arrays 213(1)
    Moving data from one variable to another 214(1)
    Restrictions on offloaded code using a pragma 215(2)
    Using offload with shared virtual memory 217(11)
    Using shared memory and shared variables 217(2)
    About shared functions 219(1)
    Shared memory management functions 219(1)
    Synchronous and asynchronous function execution: _Cilk_offload 219(1)
    Sharing variables and functions: _Cilk_shared 220(2)
    Rules for using _Cilk_shared and _Cilk_offload 222(1)
    Synchronization between the processor and the target 222(1)
    Writing target-specific code with _Cilk_offload 223(1)
    Restrictions on offloaded code using shared virtual memory 224(1)
    Persistent data when using shared virtual memory 225(2)
    C++ declarations of persistent data with shared virtual memory 227(1)
    About asynchronous computation 228(1)
    About asynchronous data transfer 229(5)
    Asynchronous data transfer from the processor to the coprocessor 229(5)
    Applying the target attribute to multiple declarations 234(4)
    Vec-report option used with offloads 235(1)
    Measuring timing and data in offload regions 236(1)
    _Offload_report 236(1)
    Using libraries in offloaded code 237(1)
    About creating offload libraries with xiar and xild 237(1)
    Performing file I/O on the coprocessor 238(2)
    Logging stdout and stderr from offloaded code 240(1)
    Summary 241(1)
    For more information 241(2)
    Chapter 8 Coprocessor Architecture 243(26)
    The Intel® Xeon Phi™ coprocessor family 244(1)
    Coprocessor card design 245(1)
    Intel® Xeon Phi™ coprocessor silicon overview 246(1)
    Individual coprocessor core architecture 247(2)
    Instruction and multithread processing 249(2)
    Cache organization and memory access considerations 251(1)
    Prefetching 252(1)
    Vector processing unit architecture 253(4)
    Vector instructions 254(3)
    Coprocessor PCIe system interface and DMA 257(3)
    DMA capabilities 258(2)
    Coprocessor power management capabilities 260(3)
    Reliability, availability, and serviceability (RAS) 263(2)
    Machine check architecture (MCA) 264(1)
    Coprocessor system management controller (SMC) 265(2)
    Sensors 265(1)
    Thermal design power monitoring and control 266(1)
    Fan speed control 266(1)
    Potential application impact 266(1)
    Benchmarks 267(1)
    Summary 267(1)
    For more information 267(2)
    Chapter 9 Coprocessor System Software 269(24)
    Coprocessor software architecture overview 269(2)
    Symmetry 271(1)
    Ring levels: user and kernel 271(1)
    Coprocessor programming models and options 271(5)
    Breadth and depth 273(1)
    Coprocessor MPI programming models 274(2)
    Coprocessor software architecture components 276(1)
    Development tools and application layer 276(1)
    Intel® manycore platform software stack 277(10)
    MYO: mine yours ours 277(1)
    COI: coprocessor offload infrastructure 278(1)
    SCIF: symmetric communications interface 278(1)
    Virtual networking (NetDev), TCP/IP, and sockets 278(1)
    Coprocessor system management 279(3)
    Coprocessor components for MPI applications 282(5)
    Linux support for Intel® Xeon Phi™ coprocessors 287(1)
    Tuning memory allocation performance 288(2)
    Controlling the number of 2 MB pages 288(1)
    Monitoring the number of 2 MB pages on the coprocessor 288(1)
    A sample method for allocating 2 MB pages 289(1)
    Summary 290(1)
    For more information 291(2)
    Chapter 10 Linux on the Coprocessor 293(32)
    Coprocessor Linux baseline 293(1)
    Introduction to coprocessor Linux bootstrap and configuration 294(1)
    Default coprocessor Linux configuration 295(2)
    Step 1 Ensure root access 296(1)
    Step 2 Generate the default configuration 296(1)
    Step 3 Change configuration 296(1)
    Step 4 Start the Intel® MPSS service 296(1)
    Changing coprocessor configuration 297(8)
    Configurable components 297(1)
    Configuration files 298(1)
    Configuring boot parameters 298(2)
    Coprocessor root file system 300(5)
    The micctrl utility 305(7)
    Coprocessor state control 306(1)
    Booting coprocessors 306(1)
    Shutting down coprocessors 306(1)
    Rebooting the coprocessors 306(1)
    Resetting coprocessors 307(1)
    Coprocessor configuration initialization and propagation 308(1)
    Helper functions for configuration parameters 309(2)
    Other file system helper functions 311(1)
    Adding software 312(3)
    Adding files to the root file system 313(1)
    Example: Adding a new global file set 314(1)
    Coprocessor Linux boot process 315(3)
    Booting the coprocessor 315(3)
    Coprocessors in a Linux cluster 318(4)
    Intel® Cluster Ready 319(1)
    How Intel® Cluster Checker works 319(1)
    Intel® Cluster Checker support for coprocessors 320(2)
    Summary 322(1)
    For more information 323(2)
    Chapter 11 Math Library 325(18)
    Intel Math Kernel Library overview 326(1)
    Intel MKL differences on the coprocessor 327(1)
    Intel MKL and Intel compiler 327(1)
    Coprocessor support overview 327(3)
    Control functions for automatic offload 328(2)
    Examples of how to set the environment variables 330(1)
    Using the coprocessor in native mode 330(2)
    Tips for using native mode 332(1)
    Using automatic offload mode 332(5)
    How to enable automatic offload 333(1)
    Examples of using control work division 333(1)
    Tips for effective use of automatic offload 333(3)
    Some tips for effective use of Intel MKL with or without offload 336(1)
    Using compiler-assisted offload 337(2)
    Tips for using compiler assisted offload 338(1)
    Precision choices and variations 339(3)
    Fast transcendentals and mathematics 339(1)
    Understanding the potential for floating-point arithmetic variations 339(3)
    Summary 342(1)
    For more information 342(1)
    Chapter 12 MPI 343(20)
    MPI overview 343(2)
    Using MPI on Intel® Xeon Phi™ coprocessors 345(4)
    Heterogeneity (and why it matters) 345(3)
    Prerequisites (batteries not included) 348(1)
    Offload from an MPI rank 349(5)
    Hello world 350(1)
    Trapezoidal rule 350(4)
    Using MPI natively on the coprocessor 354(7)
    Hello world (again) 354(2)
    Trapezoidal rule (revisited) 356(5)
    Summary 361(1)
    For more information 362(1)
    Chapter 13 Profiling and Timing 363(22)
    Event monitoring registers on the coprocessor 364(1)
    List of events used in this guide 364(1)
    Efficiency metrics 364(6)
    CPI 365(4)
    Compute to data access ratio 369(1)
    Potential performance issues 370(7)
    General cache usage 371(2)
    TLB misses 373(1)
    VPU usage 374(2)
    Memory bandwidth 376(1)
    Intel® VTune™ Amplifier XE product 377(1)
    Avoid simple profiling 378(1)
    Performance application programming interface 378(1)
    MPI analysis: Intel Trace Analyzer and Collector 378(2)
    Generating a trace file: coprocessor only application 379(1)
    Generating a trace file: processor + coprocessor application 379(1)
    Timing 380(3)
    Clocksources on the coprocessor 380(1)
    MIC elapsed time counter (micetc) 380(1)
    Time stamp counter (tsc) 380(1)
    Setting the clocksource 381(1)
    Time structures 381(1)
    Time penalty 382(1)
    Measuring timing and data in offload regions 383(1)
    Summary 383(1)
    For more information 383(2)
    Chapter 14 Summary 385(2)
    Advice 385(1)
    Additional resources 386(1)
    Another book coming? 386(1)
    Feedback appreciated 386(1)
    Glossary 387(14)
    Index 401
    Jim Jeffers was the primary strategic planner and one of the first full-time employees on the program that became Intel® MIC. He served as lead SW Engineering Manager on the program and formed and launched the SW development team. As the program evolved, he became the workloads (applications) and SW performance team manager. He has some of the deepest insight into the market, architecture, and programming usages of the MIC product line. He has been a developer and development manager for embedded and high performance systems for close to 30 years.

    James Reinders is a senior engineer who joined Intel Corporation in 1989 and has contributed to projects including the world's first TeraFLOP supercomputer (ASCI Red), as well as compilers and architecture work for a number of Intel processors and parallel systems. James has been a driver behind the development of Intel as a major provider of software development products, and serves as its chief software evangelist. James has published numerous articles, contributed to several books, and is widely interviewed on parallelism. James has managed software development groups, customer service and consulting teams, and business development and marketing teams. He is sought after to keynote on parallel programming, and is the author or co-author of three books currently in print, including Structured Parallel Programming, published by Morgan Kaufmann in 2012.