About the Authors ..... xv
Acknowledgments ..... xvii
Preface ..... xix

Part 1 ..... 1

Chapter 1 Jumping Right In: "Hello, TBB!" ..... 3
  Why Threading Building Blocks? ..... 3
    Performance: Small Overhead, Big Benefits for C++ ..... 4
    Evolving Support for Parallelism in TBB and C++ ..... 5
    Recent C++ Additions for Parallelism ..... 6
    The Threading Building Blocks (TBB) Library ..... 7
    Parallel Execution Interfaces ..... 8
    Interfaces That Are Independent of the Execution Model ..... 10
    Using the Building Blocks in TBB ..... 10
  Let's Get Started Already! ..... 11
    Getting the Threading Building Blocks (TBB) Library ..... 11
    Getting a Copy of the Examples ..... 12
    Writing a First "Hello, TBB!" Example ..... 12
    Building the Simple Examples ..... 15
    Building on Windows Using Microsoft Visual Studio ..... 16
    Building on a Linux Platform from a Terminal ..... 17
  A More Complete Example ..... 21
    Starting with a Serial Implementation ..... 21
    Adding a Message-Driven Layer Using a Flow Graph ..... 25
    Adding a Fork-Join Layer Using a parallel_for ..... 27
    Adding a SIMD Layer Using a Parallel STL Transform ..... 29

Chapter 2 Generic Parallel Algorithms ..... 33
  Functional / Task Parallelism ..... 37
    A Slightly More Complicated Example: A Parallel Implementation of Quicksort ..... 40
  Loops: parallel_for, parallel_reduce, and parallel_scan ..... 42
    parallel_for: Applying a Body to Each Element in a Range ..... 42
    parallel_reduce: Calculating a Single Result Across a Range ..... 46
    parallel_scan: A Reduction with Intermediate Values ..... 52
    How Does This Work? ..... 54
    A Slightly More Complicated Example: Line of Sight ..... 56
  Cook Until Done: parallel_do and parallel_pipeline ..... 57
    parallel_do: Apply a Body Until There Are No More Items Left ..... 58
    parallel_pipeline: Streaming Items Through a Series of Filters ..... 67

Chapter 3 Flow Graphs ..... 79
  Why Use Graphs to Express Parallelism? ..... 80
  The Basics of the TBB Flow Graph Interface ..... 82
    Step 1: Create the Graph Object ..... 84
    Step 2: Make the Nodes ..... 84
    Step 3: Add Edges ..... 87
    Step 4: Start the Graph ..... 89
    Step 5: Wait for the Graph to Complete Executing ..... 91
  A More Complicated Example of a Data Flow Graph ..... 91
    Implementing the Example as a TBB Flow Graph ..... 93
    Understanding the Performance of a Data Flow Graph ..... 96
  The Special Case of Dependency Graphs ..... 97
    Implementing a Dependency Graph ..... 99
    Estimating the Scalability of a Dependency Graph ..... 105
  Advanced Topics in TBB Flow Graphs ..... 106

Chapter 4 TBB and the Parallel Algorithms of the C++ Standard Template Library ..... 109
  Does the C++ STL Library Belong in This Book? ..... 110
  A Parallel STL Execution Policy Analogy ..... 112
  A Simple Example Using std::for_each ..... 113
  What Algorithms Are Provided in a Parallel STL Implementation? ..... 117
    How to Get and Use a Copy of Parallel STL That Uses TBB ..... 117
    Algorithms in Intel's Parallel STL ..... 118
  Capturing More Use Cases with Custom Iterators ..... 120
  Highlighting Some of the Most Useful Algorithms ..... 124
    std::for_each, std::for_each_n ..... 124
    std::transform ..... 126
    std::reduce ..... 127
    std::transform_reduce ..... 128
  A Deeper Dive into the Execution Policies ..... 130
    The sequenced_policy ..... 131
    The parallel_policy ..... 131
    The unsequenced_policy ..... 132
    The parallel_unsequenced_policy ..... 132
  Which Execution Policy Should We Use? ..... 132
  Other Ways to Introduce SIMD Parallelism ..... 134

Chapter 5 Synchronization: Why and How to Avoid It ..... 137
  A Running Example: Histogram of an Image ..... 138
  An Unsafe Parallel Implementation ..... 141
  A First Safe Parallel Implementation: Coarse-Grained Locking ..... 145
    Mutex Flavors ..... 151
  A Second Safe Parallel Implementation: Fine-Grained Locking ..... 153
  A Third Safe Parallel Implementation: Atomics ..... 158
  A Better Parallel Implementation: Privatization and Reduction ..... 163
    Thread Local Storage, TLS ..... 164
    enumerable_thread_specific, ETS ..... 165
    combinable ..... 168
  The Easiest Parallel Implementation: Reduction Template ..... 170
  Recap of Our Options ..... 172

Chapter 6 Data Structures for Concurrency ..... 179
  Key Data Structures Basics ..... 180
    Unordered Associative Containers ..... 180
    Map vs. Set ..... 181
    Multiple Values ..... 181
    Hashing ..... 181
    Unordered ..... 182
  Concurrent Containers ..... 182
    Concurrent Unordered Associative Containers ..... 185
    Concurrent Queues: Regular, Bounded, and Priority ..... 193
    Concurrent Vector ..... 202

Chapter 7 Scalable Memory Allocation ..... 207
  Modern C++ Memory Allocation ..... 208
  Scalable Memory Allocation: What ..... 209
  Scalable Memory Allocation: Why ..... 209
    Avoiding False Sharing with Padding ..... 210
  Scalable Memory Allocation Alternatives: Which ..... 212
  Compilation Considerations ..... 214
  Most Popular Usage (C/C++ Proxy Library): How ..... 214
    Linux: malloc/new Proxy Library Usage ..... 216
    macOS: malloc/new Proxy Library Usage ..... 216
    Windows: malloc/new Proxy Library Usage ..... 217
    Testing Our Proxy Library Usage ..... 218
  C Functions: Scalable Memory Allocators for C ..... 220
  C++ Classes: Scalable Memory Allocators for C++ ..... 221
    Allocators with std::allocator<T> Signature ..... 222
      scalable_allocator ..... 222
      cache_aligned_allocator ..... 222
      tbb_allocator ..... 223
      zero_allocator ..... 223
    Memory Pool Support: memory_pool_allocator ..... 223
    Array Allocation Support: aligned_space ..... 224
  Replacing new and delete Selectively ..... 224
  Performance Tuning: Some Control Knobs ..... 228
    What Are Huge Pages? ..... 228
    TBB Support for Huge Pages ..... 228
    scalable_allocation_mode(int mode, intptr_t value) ..... 229
    TBBMALLOC_USE_HUGE_PAGES ..... 229
    TBBMALLOC_SET_SOFT_HEAP_LIMIT ..... 230
    int scalable_allocation_command(int cmd, void *param) ..... 230
    TBBMALLOC_CLEAN_ALL_BUFFERS ..... 230
    TBBMALLOC_CLEAN_THREAD_BUFFERS ..... 230

Chapter 8 Mapping Parallel Patterns to TBB ..... 233
  Parallel Patterns vs. Parallel Algorithms ..... 233
  Patterns Categorize Algorithms, Designs, etc. ..... 235
  Patterns That Work ..... 236
  Data Parallelism Wins ..... 237
  Nesting Pattern ..... 238
  Map Pattern ..... 239
  Workpile Pattern ..... 240
  Reduction Patterns (Reduce and Scan) ..... 241
  Fork-Join Pattern ..... 243
  Divide-and-Conquer Pattern ..... 244
  Branch-and-Bound Pattern ..... 244
  Pipeline Pattern ..... 246
  Event-Based Coordination Pattern (Reactive Streams) ..... 247

Part 2 ..... 249

Chapter 9 The Pillars of Composability ..... 251
  What Is Composability? ..... 253
    Nested Composition ..... 254
    Concurrent Composition ..... 256
    Serial Composition ..... 258
  The Features That Make TBB a Composable Library ..... 259
    The TBB Thread Pool (the Market) and Task Arenas ..... 260
    The TBB Task Dispatcher: Work Stealing and More ..... 263
  Looking Ahead ..... 274
    Controlling the Number of Threads ..... 274
    Work Isolation ..... 274
    Task-to-Thread and Thread-to-Core Affinity ..... 275
    Task Priorities ..... 275

Chapter 10 Using Tasks to Create Your Own Algorithms ..... 277
  A Running Example: The Sequence ..... 278
  The High-Level Approach: parallel_invoke ..... 280
  The Highest Among the Lower: task_group ..... 282
  The Low-Level Task Interface: Part One - Task Blocking ..... 284
  The Low-Level Task Interface: Part Two - Task Continuation ..... 290
    Scheduler Bypass ..... 297
  The Low-Level Task Interface: Part Three - Task Recycling ..... 297
  Task Interface Diversity ..... 300
  One More Thing: FIFO (aka Fire-and-Forget) Tasks ..... 301
  Putting These Low-Level Features to Work ..... 302

Chapter 11 Controlling the Number of Threads Used for Execution ..... 313
  A Brief Recap of the TBB Scheduler Architecture ..... 314
  Interfaces for Controlling the Number of Threads ..... 315
    Controlling Thread Count with task_scheduler_init ..... 315
    Controlling Thread Count with task_arena ..... 316
    Controlling Thread Count with global_control ..... 318
    Summary of Concepts and Classes ..... 318
  The Best Approaches for Setting the Number of Threads ..... 320
    Using a Single task_scheduler_init Object for a Simple Application ..... 320
    Using More Than One task_scheduler_init Object in a Simple Application ..... 323
    Using Multiple Arenas with Different Numbers of Slots to Influence Where TBB Places Its Worker Threads ..... 325
    Using global_control to Control How Many Threads Are Available to Fill Arena Slots ..... 329
    Using global_control to Temporarily Restrict the Number of Available Threads ..... 330
  When NOT to Control the Number of Threads ..... 332
  Figuring Out What's Gone Wrong ..... 334

Chapter 12 Using Work Isolation for Correctness and Performance ..... 337
  Work Isolation for Correctness ..... 338
    Creating an Isolated Region with this_task_arena::isolate ..... 343
  Using Task Arenas for Isolation: A Double-Edged Sword ..... 349
    Don't Be Tempted to Use task_arenas to Create Work Isolation for Correctness ..... 353

Chapter 13 Creating Thread-to-Core and Task-to-Thread Affinity ..... 357
  Creating Thread-to-Core Affinity ..... 358
  Creating Task-to-Thread Affinity ..... 362
  When and How Should We Use the TBB Affinity Features? ..... 370

Chapter 14 Using Task Priorities ..... 373
  Support for Non-Preemptive Priorities in the TBB Task Class ..... 374
  Setting Static and Dynamic Priorities ..... 376
  Two Small Examples ..... 377
  Implementing Priorities Without Using TBB Task Support ..... 382

Chapter 15 Cancellation and Exception Handling ..... 387
  How to Cancel Collective Work ..... 388
  Advanced Task Cancellation ..... 390
    Explicit Assignment of TGC ..... 392
    Default Assignment of TGC ..... 395
  Exception Handling in TBB ..... 399
  Tailoring Our Own TBB Exceptions ..... 402
  Putting It All Together: Composability, Cancellation, and Exception Handling ..... 405

Chapter 16 Tuning TBB Algorithms: Granularity, Locality, Parallelism, and Determinism ..... 411
  Task Granularity: How Big Is Big Enough? ..... 412
  Choosing Ranges and Partitioners for Loops ..... 413
    An Overview of Partitioners ..... 415
    Choosing a Grainsize (or Not) to Manage Task Granularity ..... 417
    Ranges, Partitioners, and Data Cache Performance ..... 420
    Using a static_partitioner ..... 428
    Restricting the Scheduler for Determinism ..... 431
  Tuning TBB Pipelines: Number of Filters, Modes, and Tokens ..... 433
    Understanding a Balanced Pipeline ..... 434
    Understanding an Imbalanced Pipeline ..... 436
    Pipelines and Data Locality and Thread Affinity ..... 438
  Making Your Own Range Type ..... 439
  The Pipeline Class and Thread-Bound Filters ..... 442

Chapter 17 Flow Graphs: Beyond the Basics ..... 451
  Optimizing for Granularity, Locality, and Parallelism ..... 452
    Node Granularity: How Big Is Big Enough? ..... 452
    Memory Usage and Data Locality ..... 462
    Task Arenas and Flow Graph ..... 477
  Key FG Advice: Dos and Don'ts ..... 480
    Do: Use Nested Parallelism ..... 480
    Don't: Use Multifunction Nodes in Place of Nested Parallelism ..... 481
    Do: Use join_node, sequencer_node, or multifunction_node to Reestablish Order in a Flow Graph When Needed ..... 481
    Do: Use the Isolate Function for Nested Parallelism ..... 485
    Do: Use Cancellation and Exception Handling in Flow Graphs ..... 488
    Do: Set a Priority for a Graph Using task_group_context ..... 492
    Don't: Make an Edge Between Nodes in Different Graphs ..... 492
    Do: Use try_put to Communicate Across Graphs ..... 495
    Do: Use composite_node to Encapsulate Groups of Nodes ..... 497
  Introducing Intel Advisor: Flow Graph Analyzer ..... 501
    The FGA Design Workflow ..... 502
    The FGA Analysis Workflow ..... 505
    Diagnosing Performance Issues with FGA ..... 507

Chapter 18 Beef Up Flow Graphs with Async Nodes ..... 513
  Async World Example ..... 514
  Why and When async_node? ..... 519
  A More Realistic Example ..... 521

Chapter 19 Flow Graphs on Steroids: OpenCL Nodes ..... 535
  Hello OpenCL_Node Example ..... 536
  Where Are We Running Our Kernel? ..... 544
  Back to the More Realistic Example of Chapter 18 ..... 551
  The Devil Is in the Details ..... 561
    The NDRange Concept ..... 562
    Specifying the OpenCL Kernel ..... 569
  Even More on Device Selection ..... 570
  A Warning Regarding the Order Is in Order! ..... 574

Chapter 20 TBB on NUMA Architectures ..... 581
  Discovering Your Platform Topology ..... 583
    Understanding the Costs of Accessing Memory ..... 587
    Our Baseline Example ..... 588
    Mastering Data Placement and Processor Affinity ..... 589
  Putting hwloc and TBB to Work Together ..... 595
  More Advanced Alternatives ..... 601

Appendix A: History and Inspiration ..... 605
  A Decade of "Hatchling to Soaring" ..... 605
    1 TBB's Revolution Inside Intel ..... 605
    2 TBB's First Revolution of Parallelism ..... 606
    3 TBB's Second Revolution of Parallelism ..... 607
  Inspirations for TBB ..... 611
    Relaxed Sequential Execution Model ..... 612
    Influential Libraries ..... 613
    Influential Languages ..... 614
    Influential Pragmas ..... 615
    Influences of Generic Programming ..... 615
    Considering Caches ..... 616
    Considering Costs of Time Slicing ..... 617
    Further Reading ..... 618

Appendix B: TBB Precis ..... 623
  Debug and Conditional Coding ..... 624
  Preview Feature Macro ..... 626
  Ranges ..... 626
  Partitioners ..... 627
  Algorithms ..... 628
  Algorithm: parallel_do ..... 629
  Algorithm: parallel_for ..... 631
  Algorithm: parallel_for_each ..... 635
  Algorithm: parallel_invoke ..... 636
  Algorithm: parallel_pipeline ..... 638
  Algorithm: parallel_reduce and parallel_deterministic_reduce ..... 641
  Algorithm: parallel_scan ..... 645
  Algorithm: parallel_sort ..... 648
  Flow Graph: ports and edges ..... 655
  Flow Graph: nodes ..... 655
  Memory Allocation ..... 667
  Containers ..... 673
  Synchronization ..... 693
  Thread Local Storage (TLS) ..... 699
  Timing ..... 708
  Task Groups: Use of the Task Stealing Scheduler ..... 709
  Task Scheduler: Fine Control of the Task Stealing Scheduler ..... 710
  Floating-Point Settings ..... 721
  Exceptions ..... 723
  Threads ..... 725
  Parallel STL ..... 726

Glossary ..... 729
Index ..... 745