Preface |
|
xiii | |
Acknowledgments |
|
xv | |
Authors |
|
xvii | |
|
|
xix | |
|
|
xxiii | |
|
1 Introduction and Overview |
|
|
1 | (18) |
|
|
|
1 | (1) |
|
1.2 Overview of Case Studies |
|
|
2 | (1) |
|
|
3 | (4) |
|
1.3.1 Quickly Increasing Clock Speeds Ended by 2005 |
|
|
3 | (1) |
|
1.3.2 The Move to Multicore |
|
|
4 | (2) |
|
1.3.3 SIMD Is Parallelism Too |
|
|
6 | (1) |
|
1.3.4 Highly Threaded Hardware |
|
|
7 | (1) |
|
1.4 Program in Tasks, Not Threads |
|
|
7 | (1) |
|
|
8 | (1) |
|
1.6 Scaling and Vectorization |
|
|
9 | (1) |
|
1.7 Advancing Programming Languages for Parallel Programming |
|
|
10 | (1) |
|
|
10 | (1) |
|
1.7.2 Parallel Programming Needs |
|
|
10 | (1) |
|
1.7.3 Relaxed Sequential Semantics |
|
|
11 | (1) |
|
1.8 Parallel Programming in C and C++ |
|
|
11 | (5) |
|
1.8.1 Brief Survey of Key Parallelism Options |
|
|
12 | (1) |
|
|
12 | (1) |
|
|
12 | (1) |
|
|
12 | (1) |
|
|
13 | (1) |
|
1.8.1.5 GPU Specific Models |
|
|
13 | (1) |
|
1.8.2 More on TBB: Intel Threading Building Blocks |
|
|
13 | (1) |
|
1.8.2.1 Parallel for: parallel for() |
|
|
14 | (1) |
|
1.8.2.2 Parallel Reductions: parallel reduce |
|
|
15 | (1) |
|
1.8.2.3 Parallel Invocation of Functions: parallel invoke |
|
|
16 | (1) |
|
1.8.2.4 Learning More about TBB |
|
|
16 | (1) |
|
1.9 Data Movement and Layout |
|
|
16 | (1) |
|
|
17 | (1) |
|
|
18 | (1) |
|
2 Houdini: Multithreading Existing Software |
|
|
19 | (28) |
|
|
|
19 | (2) |
|
|
21 | (9) |
|
|
22 | (5) |
|
2.2.2 Threading the Simple Cases |
|
|
27 | (3) |
|
|
30 | (4) |
|
2.3.1 Always Be Reentrant |
|
|
30 | (1) |
|
|
31 | (1) |
|
|
31 | (1) |
|
2.3.4 Never Blindly Thread |
|
|
32 | (1) |
|
2.3.5 Command Line Control |
|
|
33 | (1) |
|
2.3.6 Constant Memory versus Number of Cores |
|
|
33 | (1) |
|
|
34 | (1) |
|
|
34 | (6) |
|
|
35 | (1) |
|
2.4.2 Reader/Writer Locks |
|
|
35 | (1) |
|
2.4.3 Ownership Is Important |
|
|
36 | (1) |
|
2.4.4 Sole Ownership Is a Writer Lock |
|
|
37 | (1) |
|
2.4.5 Failure Modes of This System |
|
|
38 | (2) |
|
|
40 | (4) |
|
|
41 | (2) |
|
|
43 | (1) |
|
|
44 | (3) |
|
3 The Presto Execution System: Designing for Multithreading |
|
|
47 | (26) |
|
|
|
48 | (1) |
|
3.1.1 A Note about Interactivity |
|
|
48 | (1) |
|
|
49 | (3) |
|
|
50 | (1) |
|
|
51 | (1) |
|
3.2.3 Animation in Presto |
|
|
52 | (1) |
|
3.3 Presto's Execution System |
|
|
52 | (5) |
|
3.3.1 Phases of Execution |
|
|
53 | (1) |
|
|
53 | (1) |
|
|
54 | (1) |
|
|
54 | (1) |
|
3.3.2 Engine Architecture |
|
|
54 | (1) |
|
|
55 | (1) |
|
|
55 | (1) |
|
|
56 | (1) |
|
|
56 | (1) |
|
3.3.2.5 Engine Architecture and Multithreading |
|
|
56 | (1) |
|
|
57 | (2) |
|
3.4.1 Dependencies Declared a Priori |
|
|
57 | (1) |
|
3.4.2 Client Callbacks Are Static Functions |
|
|
57 | (1) |
|
3.4.3 Presto Singletons Are Protected |
|
|
58 | (1) |
|
|
58 | (1) |
|
3.4.5 And Then There's Python |
|
|
58 | (1) |
|
3.4.5.1 Global Interpreter Lock |
|
|
58 | (1) |
|
|
59 | (1) |
|
3.5 Memory Access Patterns |
|
|
59 | (1) |
|
3.6 Flexibility to Experiment |
|
|
60 | (1) |
|
|
60 | (1) |
|
3.6.2 Targeting Other Platforms |
|
|
60 | (1) |
|
3.7 Multithreading Strategies |
|
|
61 | (3) |
|
3.7.1 Per-Node Multithreading |
|
|
61 | (1) |
|
3.7.2 Per-Branch Multithreading |
|
|
62 | (1) |
|
3.7.3 Per-Model Multithreading |
|
|
62 | (2) |
|
3.7.4 Per-Frame Multithreading |
|
|
64 | (1) |
|
|
64 | (5) |
|
|
65 | (1) |
|
|
65 | (1) |
|
|
66 | (1) |
|
|
67 | (1) |
|
3.8.5 Problematic Data Structures |
|
|
67 | (2) |
|
3.9 Other Multithreading Strategies |
|
|
69 | (1) |
|
|
69 | (1) |
|
3.9.2 Predictive Computations |
|
|
70 | (1) |
|
3.10 Debugging and Profiling Tools |
|
|
70 | (1) |
|
|
71 | (2) |
|
4 LibEE: Parallel Evaluation of Character Rigs |
|
|
73 | (38) |
|
|
|
74 | (2) |
|
|
76 | (1) |
|
4.3 Specific Requirements for Character Animation |
|
|
76 | (3) |
|
4.3.1 Animation Graph Goals |
|
|
77 | (1) |
|
4.3.2 Animation Graph Features |
|
|
77 | (1) |
|
4.3.2.1 Few Unique Traversed Paths through Graph |
|
|
77 | (1) |
|
4.3.2.2 Animation Rigs Have Implicit Parallelism |
|
|
78 | (1) |
|
4.3.2.3 Expensive Nodes Which Can Be Internally Parallel |
|
|
78 | (1) |
|
4.3.3 Animation Graph Constraints |
|
|
78 | (1) |
|
|
78 | (1) |
|
4.3.3.2 No Scripting Languages in Operators |
|
|
78 | (1) |
|
|
79 | (1) |
|
|
79 | (1) |
|
4.4.2 Graph Evaluation Mechanism |
|
|
80 | (1) |
|
|
80 | (5) |
|
|
81 | (1) |
|
|
81 | (1) |
|
4.5.1.2 Parallel Unit Tests |
|
|
81 | (1) |
|
4.5.1.3 Threading Checker Tools |
|
|
82 | (1) |
|
|
82 | (1) |
|
|
83 | (1) |
|
|
84 | (1) |
|
|
84 | (1) |
|
4.6 Scalability: Software Considerations |
|
|
85 | (7) |
|
4.6.1 Authoring Parallel Loops |
|
|
86 | (1) |
|
|
87 | (1) |
|
|
87 | (1) |
|
4.6.4 Thread-Friendly Memory Allocators |
|
|
88 | (1) |
|
4.6.5 Oversubscription Due to Multiple Threading Models |
|
|
88 | (1) |
|
4.6.6 Cache Reuse---Chains of Nodes |
|
|
89 | (1) |
|
4.6.7 Cache Reuse---Scheduling Nodes to Maximize Sharing |
|
|
89 | (1) |
|
|
89 | (1) |
|
|
89 | (2) |
|
4.6.10 Other Processes Running on System |
|
|
91 | (1) |
|
|
91 | (1) |
|
4.6.12 Failed Approaches Discussion |
|
|
91 | (1) |
|
4.7 Scalability: Hardware Considerations |
|
|
92 | (3) |
|
|
92 | (1) |
|
|
92 | (1) |
|
|
92 | (1) |
|
|
93 | (1) |
|
|
94 | (1) |
|
4.7.6 Many-Core Architectures |
|
|
94 | (1) |
|
4.8 Production Considerations |
|
|
95 | (2) |
|
4.8.1 Character Systems Restructure |
|
|
96 | (1) |
|
4.8.2 No More Scripted Nodes |
|
|
96 | (1) |
|
4.8.3 Optimizing for Maximum Parallelism |
|
|
96 | (1) |
|
4.9 Threading Visualization Tool |
|
|
97 | (3) |
|
4.10 Rig Optimization Case Studies |
|
|
100 | (4) |
|
4.10.1 Case Study 1: Quadruped Critical Path Optimization |
|
|
100 | (1) |
|
4.10.2 Case Study 2: Hair Solver |
|
|
100 | (1) |
|
4.10.3 Case Study 3: Free Clothes! |
|
|
100 | (4) |
|
4.11 Overall Performance Results |
|
|
104 | (1) |
|
4.12 Limits of Scalability |
|
|
104 | (2) |
|
|
106 | (5) |
|
5 Fluids: Simulation on the CPU |
|
|
111 | (26) |
|
|
|
111 | (1) |
|
|
112 | (8) |
|
5.2.1 Everything You Need to Get Started |
|
|
114 | (1) |
|
|
114 | (1) |
|
5.2.3 Example: Dot Product |
|
|
115 | (2) |
|
5.2.4 Example: Maximum Absolute Value |
|
|
117 | (1) |
|
5.2.5 Platform Considerations |
|
|
118 | (1) |
|
|
119 | (1) |
|
|
120 | (16) |
|
|
120 | (2) |
|
5.3.2 Smoke, Fire, and Explosions |
|
|
122 | (2) |
|
5.3.2.1 Advection Solvers |
|
|
124 | (2) |
|
|
126 | (2) |
|
|
128 | (4) |
|
5.3.3.1 Parallel Point Rasterization |
|
|
132 | (4) |
|
|
136 | (1) |
|
6 Bullet Physics: Simulation with OpenCL |
|
|
137 | (26) |
|
|
|
138 | (2) |
|
6.1.1 Rigid Body Dynamics Simulation |
|
|
138 | (1) |
|
6.1.2 Refactoring before the Full Rewrite |
|
|
139 | (1) |
|
6.2 Rewriting from Scratch Using OpenCL |
|
|
140 | (5) |
|
6.2.1 Brief OpenCL Introduction |
|
|
140 | (2) |
|
|
142 | (1) |
|
6.2.3 Dealing with Branchy Code/Thread Divergence |
|
|
143 | (1) |
|
6.2.4 Serializing Data to Contiguous Memory |
|
|
144 | (1) |
|
6.2.5 Sharing CPU and GPU Code |
|
|
144 | (1) |
|
6.2.6 Precompiled Kernel Caching |
|
|
145 | (1) |
|
6.3 GPU Spatial Acceleration Structures |
|
|
145 | (6) |
|
6.3.1 Reference All Pairs Overlap Test |
|
|
146 | (1) |
|
|
147 | (1) |
|
6.3.3 Parallel 1-Axis Sort and Sweep |
|
|
148 | (1) |
|
6.3.4 Parallel 3-Axis Sweep and Prune |
|
|
149 | (1) |
|
|
150 | (1) |
|
6.3.6 Static Local Space AABB Tree |
|
|
150 | (1) |
|
6.4 GPU Contact Point Generation |
|
|
151 | (4) |
|
6.4.1 Collision Shape Representation |
|
|
151 | (1) |
|
6.4.2 Convex 3D Height Field Using Cube Maps |
|
|
152 | (1) |
|
6.4.3 Separating Axis Test |
|
|
153 | (1) |
|
6.4.4 Sutherland Hodgeman Clipping |
|
|
153 | (1) |
|
6.4.5 Minkowski Portal Refinement |
|
|
154 | (1) |
|
|
154 | (1) |
|
6.5 GPU Constraint Solving |
|
|
155 | (8) |
|
6.5.1 Equations of Motion |
|
|
155 | (1) |
|
6.5.2 Contact and Friction Constraint Setup |
|
|
155 | (1) |
|
6.5.3 Parallel Projected Gauss-Seidel Method |
|
|
156 | (1) |
|
6.5.4 Batch Creation and Two-Stage Batching |
|
|
157 | (2) |
|
6.5.5 Non-Contact Constraints |
|
|
159 | (1) |
|
6.5.6 GPU Deterministic Simulation |
|
|
159 | (1) |
|
6.5.7 Conclusion and Future Work |
|
|
159 | (4) |
|
7 OpenSubdiv: Interoperating GPU Compute and Drawing |
|
|
163 | (40) |
|
|
|
164 | (2) |
|
7.1.1 Why Fast Subdivision? |
|
|
165 | (1) |
|
|
165 | (1) |
|
|
166 | (1) |
|
|
166 | (3) |
|
7.2.1 Patches and Arbitrary Topology |
|
|
166 | (1) |
|
7.2.2 Topological Data Structures |
|
|
167 | (1) |
|
|
167 | (1) |
|
|
168 | (1) |
|
|
169 | (1) |
|
7.3.1 Implementing Subdivision Schemata |
|
|
169 | (1) |
|
7.4 Serializing the Mesh Representation |
|
|
170 | (3) |
|
7.4.1 Case Study: Subdividing a Pyramid |
|
|
170 | (1) |
|
7.4.2 Generating Indexing Tables |
|
|
170 | (2) |
|
7.4.3 Preparing for Parallel Execution |
|
|
172 | (1) |
|
7.5 Transition from Multicores to Many-Cores |
|
|
173 | (2) |
|
7.5.1 Streaming Multiprocessors and SIMT |
|
|
173 | (1) |
|
7.5.2 Practical Implementation with OpenCL |
|
|
174 | (1) |
|
7.6 Reducing Branching Divergence |
|
|
175 | (4) |
|
7.6.1 Sorting Vertices by Type |
|
|
176 | (1) |
|
7.6.2 Further Vertex Sorting |
|
|
176 | (3) |
|
7.7 Optimization Trade-Offs |
|
|
179 | (3) |
|
7.7.1 Alternative Strategy: NVIDIA Dynamic Parallelism |
|
|
179 | (1) |
|
7.7.2 Alternative Strategy: Vertex Stencils |
|
|
180 | (1) |
|
|
181 | (1) |
|
7.8 Evaluating Our Progress |
|
|
182 | (1) |
|
7.9 Fundamental Limitations of Uniform Subdivision |
|
|
183 | (3) |
|
|
184 | (1) |
|
|
184 | (1) |
|
7.9.3 Animating Subdivision Surfaces |
|
|
185 | (1) |
|
7.9.4 Better, Faster, Different |
|
|
185 | (1) |
|
7.10 Feature-Adaptive Subdivision |
|
|
186 | (4) |
|
7.10.1 GPU Hardware Tessellation |
|
|
186 | (1) |
|
7.10.2 Catmull-Clark Terminology |
|
|
187 | (1) |
|
7.10.3 Bi-Cubic Patch Representation |
|
|
188 | (1) |
|
7.10.4 Feature-Adaptive Subdivision |
|
|
189 | (1) |
|
7.11 Implementing the GPU Rendering Engine |
|
|
190 | (7) |
|
7.11.1 Bi-Cubic Bspline Patches with GLSL |
|
|
191 | (1) |
|
7.11.1.1 Handling Surface Boundaries |
|
|
192 | (1) |
|
7.11.1.2 Handling Patch Transitions |
|
|
193 | (1) |
|
|
194 | (2) |
|
7.11.2 Mitigating Drawing Overheads |
|
|
196 | (1) |
|
|
197 | (2) |
|
7.12.1 Displacement Mapping |
|
|
198 | (1) |
|
|
199 | (4) |
Bibliography |
|
203 | (6) |
Index |
|
209 | |