E-book: Pro TBB: C++ Parallel Programming with Threading Building Blocks

  • Format: PDF+DRM
  • Publication date: 09-Jul-2019
  • Publisher: APress
  • Language: English
  • ISBN-13: 9781484243985
  • Price: €4.08*
  • * The price is final, i.e. no further discounts apply
  • This e-book is intended for personal use only. E-books cannot be returned.

DRM restrictions

  • Copying (copy/paste):

    not allowed

  • Printing:

    not allowed

  • Usage:

    Digital Rights Management (DRM)
    The publisher has issued this e-book in encrypted form, which means that you must install special software to read it. You will also need to create an Adobe ID. The e-book can be read by 1 user and downloaded to up to 6 devices (all authorized with the same Adobe ID).

    Required software
    To read on a mobile device (phone or tablet), install the free PocketBook Reader app (iOS / Android).

    To read on a PC or Mac, install Adobe Digital Editions (a free application designed specifically for reading e-books; it should not be confused with Adobe Reader, which is probably already installed on your computer).

    This e-book cannot be read on an Amazon Kindle.

This open access book is a modern guide for all C++ programmers to learn Intel Threading Building Blocks (TBB). Written by TBB and parallel programming experts, this book reflects their collective decades of experience in developing and teaching parallel programming with TBB, offering their insights in an approachable manner. Throughout the book the authors present numerous examples and best practices to help you become an effective TBB programmer and leverage the power of parallel systems.

Pro TBB starts with the basics, explaining parallel algorithms and the parallel capabilities of the C++ Standard Template Library. You'll learn the key concepts of managing memory and working with data structures, and how to handle typical synchronization issues. Later chapters apply these ideas to complex systems, explaining performance tradeoffs, mapping common parallel patterns, controlling threads and overhead, and extending TBB to program heterogeneous systems or system-on-chips.

What You'll Learn

  • Use Threading Building Blocks to produce code that is portable, simple, scalable, and more understandable
  • Review best practices for parallelizing computationally intensive tasks in your applications
  • Integrate TBB with other threading packages
  • Create scalable, high performance data-parallel programs
  • Work with generic programming to write efficient algorithms

Who This Book Is For

C++ programmers learning to run applications on multicore systems, as well as C or C++ programmers without much experience with templates. No previous experience with parallel programming or multicore processors is required.

Reviews

Pro TBB is an invaluable book. The book provides comprehensive coverage of a full-fledged model of parallelism. Besides the TBB constructs, various mechanisms that address issues of exception handling, task partitioning, concurrent data structures, mutual exclusion, granularity, and task-thread affinity are elaborated and evaluated in great detail. The first part of the book is a light introduction to TBB, and the second part provides an in-depth presentation with examples and a performance analysis of TBB constructs. (B. Belkhouche, Computing Reviews, July 29, 2021)

Table of Contents

About the Authors
Acknowledgments
Preface

Part 1

Chapter 1 Jumping Right In: "Hello, TBB!"
  • Why Threading Building Blocks?
    - Performance: Small Overhead, Big Benefits for C++
    - Evolving Support for Parallelism in TBB and C++
    - Recent C++ Additions for Parallelism
  • The Threading Building Blocks (TBB) Library
    - Parallel Execution Interfaces
    - Interfaces That Are Independent of the Execution Model
    - Using the Building Blocks in TBB
  • Let's Get Started Already!
    - Getting the Threading Building Blocks (TBB) Library
    - Getting a Copy of the Examples
    - Writing a First "Hello, TBB!" Example
    - Building the Simple Examples
    - Building on Windows Using Microsoft Visual Studio
    - Building on a Linux Platform from a Terminal
  • A More Complete Example
    - Starting with a Serial Implementation
    - Adding a Message-Driven Layer Using a Flow Graph
    - Adding a Fork-Join Layer Using a parallel_for
    - Adding a SIMD Layer Using a Parallel STL Transform
Chapter 2 Generic Parallel Algorithms
  • Functional/Task Parallelism
    - A Slightly More Complicated Example: A Parallel Implementation of Quicksort
  • Loops: parallel_for, parallel_reduce, and parallel_scan
    - parallel_for: Applying a Body to Each Element in a Range
    - parallel_reduce: Calculating a Single Result Across a Range
    - parallel_scan: A Reduction with Intermediate Values
    - How Does This Work?
    - A Slightly More Complicated Example: Line of Sight
  • Cook Until Done: parallel_do and parallel_pipeline
    - parallel_do: Apply a Body Until There Are No More Items Left
    - parallel_pipeline: Streaming Items Through a Series of Filters
Chapter 3 Flow Graphs
  • Why Use Graphs to Express Parallelism?
  • The Basics of the TBB Flow Graph Interface
    - Step 1: Create the Graph Object
    - Step 2: Make the Nodes
    - Step 3: Add Edges
    - Step 4: Start the Graph
    - Step 5: Wait for the Graph to Complete Executing
  • A More Complicated Example of a Data Flow Graph
    - Implementing the Example as a TBB Flow Graph
    - Understanding the Performance of a Data Flow Graph
  • The Special Case of Dependency Graphs
    - Implementing a Dependency Graph
    - Estimating the Scalability of a Dependency Graph
  • Advanced Topics in TBB Flow Graphs
Chapter 4 TBB and the Parallel Algorithms of the C++ Standard Template Library
  • Does the C++ STL Library Belong in This Book?
  • A Parallel STL Execution Policy Analogy
  • A Simple Example Using std::for_each
  • What Algorithms Are Provided in a Parallel STL Implementation?
    - How to Get and Use a Copy of Parallel STL That Uses TBB
    - Algorithms in Intel's Parallel STL
  • Capturing More Use Cases with Custom Iterators
  • Highlighting Some of the Most Useful Algorithms
    - std::for_each, std::for_each_n
    - std::transform
    - std::reduce
    - std::transform_reduce
  • A Deeper Dive into the Execution Policies
    - The sequenced_policy
    - The parallel_policy
    - The unsequenced_policy
    - The parallel_unsequenced_policy
  • Which Execution Policy Should We Use?
  • Other Ways to Introduce SIMD Parallelism
Chapter 5 Synchronization: Why and How to Avoid It
  • A Running Example: Histogram of an Image
  • An Unsafe Parallel Implementation
  • A First Safe Parallel Implementation: Coarse-Grained Locking
    - Mutex Flavors
  • A Second Safe Parallel Implementation: Fine-Grained Locking
  • A Third Safe Parallel Implementation: Atomics
  • A Better Parallel Implementation: Privatization and Reduction
    - Thread Local Storage, TLS
    - enumerable_thread_specific, ETS
    - combinable
  • The Easiest Parallel Implementation: Reduction Template
  • Recap of Our Options
Chapter 6 Data Structures for Concurrency
  • Key Data Structures Basics
    - Unordered Associative Containers
    - Map vs. Set
    - Multiple Values
    - Hashing
    - Unordered
  • Concurrent Containers
    - Concurrent Unordered Associative Containers
    - Concurrent Queues: Regular, Bounded, and Priority
    - Concurrent Vector
Chapter 7 Scalable Memory Allocation
  • Modern C++ Memory Allocation
  • Scalable Memory Allocation: What
  • Scalable Memory Allocation: Why
    - Avoiding False Sharing with Padding
  • Scalable Memory Allocation Alternatives: Which
  • Compilation Considerations
  • Most Popular Usage (C/C++ Proxy Library): How
    - Linux: malloc/new Proxy Library Usage
    - macOS: malloc/new Proxy Library Usage
    - Windows: malloc/new Proxy Library Usage
    - Testing Our Proxy Library Usage
  • C Functions: Scalable Memory Allocators for C
  • C++ Classes: Scalable Memory Allocators for C++
    - Allocators with std::allocator<T> Signature
    - scalable_allocator
    - tbb_allocator
    - zero_allocator
    - cached_aligned_allocator
    - Memory Pool Support: memory_pool_allocator
    - Array Allocation Support: aligned_space
  • Replacing new and delete Selectively
  • Performance Tuning: Some Control Knobs
    - What Are Huge Pages?
    - TBB Support for Huge Pages
    - scalable_allocation_mode(int mode, intptr_t value)
    - TBBMALLOC_USE_HUGE_PAGES
    - TBBMALLOC_SET_SOFT_HEAP_LIMIT
    - int scalable_allocation_command(int cmd, void *param)
    - TBBMALLOC_CLEAN_ALL_BUFFERS
    - TBBMALLOC_CLEAN_THREAD_BUFFERS
Chapter 8 Mapping Parallel Patterns to TBB
  • Parallel Patterns vs. Parallel Algorithms
  • Patterns Categorize Algorithms, Designs, etc.
  • Patterns That Work
  • Data Parallelism Wins
  • Nesting Pattern
  • Map Pattern
  • Workpile Pattern
  • Reduction Patterns (Reduce and Scan)
  • Fork-Join Pattern
  • Divide-and-Conquer Pattern
  • Branch-and-Bound Pattern
  • Pipeline Pattern
  • Event-Based Coordination Pattern (Reactive Streams)
Part 2

Chapter 9 The Pillars of Composability
  • What Is Composability?
    - Nested Composition
    - Concurrent Composition
    - Serial Composition
  • The Features That Make TBB a Composable Library
    - The TBB Thread Pool (the Market) and Task Arenas
    - The TBB Task Dispatcher: Work Stealing and More
  • Putting It All Together
  • Looking Forward
    - Controlling the Number of Threads
    - Work Isolation
    - Task-to-Thread and Thread-to-Core Affinity
    - Task Priorities
Chapter 10 Using Tasks to Create Your Own Algorithms
  • A Running Example: The Sequence
  • The High-Level Approach: parallel_invoke
  • The Highest Among the Lower: task_group
  • The Low-Level Task Interface: Part One - Task Blocking
  • The Low-Level Task Interface: Part Two - Task Continuation
  • Bypassing the Scheduler
  • The Low-Level Task Interface: Part Three - Task Recycling
  • Task Interface Checklist
  • One More Thing: FIFO (aka Fire-and-Forget) Tasks
  • Putting These Low-Level Features to Work
Chapter 11 Controlling the Number of Threads Used for Execution
  • A Brief Recap of the TBB Scheduler Architecture
  • Interfaces for Controlling the Number of Threads
    - Controlling Thread Count with task_scheduler_init
    - Controlling Thread Count with task_arena
    - Controlling Thread Count with global_control
    - Summary of Concepts and Classes
  • The Best Approaches for Setting the Number of Threads
    - Using a Single task_scheduler_init Object for a Simple Application
    - Using More Than One task_scheduler_init Object in a Simple Application
    - Using Multiple Arenas with Different Numbers of Slots to Influence Where TBB Places Its Worker Threads
    - Using global_control to Control How Many Threads Are Available to Fill Arena Slots
    - Using global_control to Temporarily Restrict the Number of Available Threads
  • When NOT to Control the Number of Threads
  • Figuring Out What's Gone Wrong
Chapter 12 Using Work Isolation for Correctness and Performance
  • Work Isolation for Correctness
    - Creating an Isolated Region with this_task_arena::isolate
  • Using Task Arenas for Isolation: A Double-Edged Sword
    - Don't Be Tempted to Use task_arenas to Create Work Isolation for Correctness

Chapter 13 Creating Thread-to-Core and Task-to-Thread Affinity
  • Creating Thread-to-Core Affinity
  • Creating Task-to-Thread Affinity
  • When and How Should We Use the TBB Affinity Features?

Chapter 14 Using Task Priorities
  • Support for Non-Preemptive Priorities in the TBB Task Class
  • Setting Static and Dynamic Priorities
  • Two Small Examples
  • Implementing Priorities Without Using TBB Task Support
Chapter 15 Cancellation and Exception Handling
  • How to Cancel Collective Work
  • Advanced Task Cancellation
    - Explicit Assignment of TGC
    - Default Assignment of TGC
  • Exception Handling in TBB
  • Tailoring Our Own TBB Exceptions
  • Putting All Together: Composability, Cancellation, and Exception Handling
Chapter 16 Tuning TBB Algorithms: Granularity, Locality, Parallelism, and Determinism
  • Task Granularity: How Big Is Big Enough?
  • Choosing Ranges and Partitioners for Loops
    - An Overview of Partitioners
    - Choosing a Grainsize (or Not) to Manage Task Granularity
    - Ranges, Partitioners, and Data Cache Performance
    - Using a static_partitioner
    - Restricting the Scheduler for Determinism
  • Tuning TBB Pipelines: Number of Filters, Modes, and Tokens
    - Understanding a Balanced Pipeline
    - Understanding an Imbalanced Pipeline
    - Pipelines and Data Locality and Thread Affinity
  • Deep in the Weeds
    - Making Your Own Range Type
    - The Pipeline Class and Thread-Bound Filters
Chapter 17 Flow Graphs: Beyond the Basics
  • Optimizing for Granularity, Locality, and Parallelism
    - Node Granularity: How Big Is Big Enough?
    - Memory Usage and Data Locality
    - Task Arenas and Flow Graph
  • Key FG Advice: Dos and Don'ts
    - Do: Use Nested Parallelism
    - Don't: Use Multifunction Nodes in Place of Nested Parallelism
    - Do: Use join_node, sequencer_node, or multifunction_node to Reestablish Order in a Flow Graph When Needed
    - Do: Use the Isolate Function for Nested Parallelism
    - Do: Use Cancellation and Exception Handling in Flow Graphs
    - Do: Set a Priority for a Graph Using task_group_context
    - Don't: Make an Edge Between Nodes in Different Graphs
    - Do: Use try_put to Communicate Across Graphs
    - Do: Use composite_node to Encapsulate Groups of Nodes
  • Introducing Intel Advisor: Flow Graph Analyzer
    - The FGA Design Workflow
    - The FGA Analysis Workflow
    - Diagnosing Performance Issues with FGA
Chapter 18 Beef Up Flow Graphs with Async Nodes
  • Async World Example
  • Why and When async_node?
  • A More Realistic Example

Chapter 19 Flow Graphs on Steroids: OpenCL Nodes
  • Hello OpenCL_Node Example
  • Where Are We Running Our Kernel?
  • Back to the More Realistic Example of Chapter 18
  • The Devil Is in the Details
    - The NDRange Concept
    - Playing with the Offset
    - Specifying the OpenCL Kernel
  • Even More on Device Selection
  • A Warning Regarding the Order Is in Order!
Chapter 20 TBB on NUMA Architectures
  • Discovering Your Platform Topology
    - Understanding the Costs of Accessing Memory
    - Our Baseline Example
    - Mastering Data Placement and Processor Affinity
  • Putting hwloc and TBB to Work Together
  • More Advanced Alternatives
Appendix A: History and Inspiration
  • A Decade of "Hatchling to Soaring"
    - 1 TBB's Revolution Inside Intel
    - 2 TBB's First Revolution of Parallelism
    - 3 TBB's Second Revolution of Parallelism
    - 4 TBB's Birds
  • Inspiration for TBB
    - Relaxed Sequential Execution Model
    - Influential Libraries
    - Influential Languages
    - Influential Pragmas
    - Influences of Generic Programming
    - Considering Caches
    - Considering Costs of Time Slicing
    - Further Reading
Appendix B: TBB Precis
  • Debug and Conditional Coding
  • Preview Feature Macros
  • Ranges
  • Partitioners
  • Algorithms
  • Algorithm: parallel_do
  • Algorithm: parallel_for
  • Algorithm: parallel_for_each
  • Algorithm: parallel_invoke
  • Algorithm: parallel_pipeline
  • Algorithm: parallel_reduce and parallel_deterministic_reduce
  • Algorithm: parallel_scan
  • Algorithm: parallel_sort
  • Algorithm: pipeline
  • Flow Graph
  • Flow Graph: graph class
  • Flow Graph: ports and edges
  • Flow Graph: nodes
  • Memory Allocation
  • Containers
  • Synchronization
  • Thread Local Storage (TLS)
  • Timing
  • Task Groups: Use of the Task Stealing Scheduler
  • Task Scheduler: Fine Control of the Task Stealing Scheduler
  • Floating-Point Settings
  • Exceptions
  • Threads
  • Parallel STL

Glossary

Index
Michael Voss is a Principal Engineer in the Intel Architecture, Graphics and Software Group at Intel. He has been a member of the TBB development team since before the 1.0 release in 2006 and was the initial architect of the TBB flow graph API. He is also one of the lead developers of Flow Graph Analyzer, a graphical tool for analyzing data flow applications targeted at both homogeneous and heterogeneous platforms. He has co-authored over 40 published papers and articles on topics related to parallel programming and frequently consults with customers across a wide range of domains to help them effectively use the threading libraries provided by Intel. Prior to joining Intel in 2006, he was an Assistant Professor in the Edward S. Rogers Department of Electrical and Computer Engineering at the University of Toronto. He received his Ph.D. from the School of Electrical and Computer Engineering at Purdue University in 2001.

Rafael Asenjo, Professor of Computer Architecture at the University of Malaga, Spain, obtained a PhD in Telecommunication Engineering in 1997 and was an Associate Professor in the Computer Architecture Department from 2001 to 2017. He was a Visiting Scholar at the University of Illinois at Urbana-Champaign (UIUC) in 1996 and 1997 and a Visiting Research Associate at the same university in 1998. He was also a Research Visitor at IBM T.J. Watson in 2008 and at Cray Inc. in 2011. He has been using TBB since 2008, and over the last five years he has focused on productively exploiting heterogeneous chips, leveraging TBB as the orchestrating framework. In 2013 and 2014 he visited UIUC to work on CPU+GPU chips, and in 2015 and 2016 he began researching CPU+FPGA chips while visiting the University of Bristol. He served as General Chair for ACM PPoPP'16 and as an Organizing Committee member as well as a Program Committee member for several HPC-related conferences (PPoPP, SC, PACT, IPDPS, HPCA, EuroPar, and SBAC-PAD).

His research interests include heterogeneous programming models and architectures, the parallelization of irregular codes, and energy consumption.

James Reinders is a consultant with more than three decades of experience in parallel computing and is an author/co-author/editor of nine technical books related to parallel programming. He has had the great fortune to help make key contributions to two of the world's fastest computers (#1 on the Top500 list) as well as many other supercomputers and software developer tools. James finished 10,001 days (over 27 years) at Intel in mid-2016 and now continues to write, teach, program, and consult in areas related to parallel computing (HPC and AI).