|
|
xi | |
|
|
xvii | |
Foreword |
|
xix | |
Acknowledgments |
|
xxi | |
|
|
1 | (14) |
|
1.1 Introduction to Heterogeneous Computing |
|
|
1 | (1) |
|
1.2 The Goals of This Book |
|
|
2 | (1) |
|
|
2 | (5) |
|
1.4 Concurrency and Parallel Programming Models |
|
|
7 | (1) |
|
1.5 Threads and Shared Memory |
|
|
8 | (1) |
|
1.6 Message-Passing Communication |
|
|
9 | (1) |
|
1.7 Different Grains of Parallelism |
|
|
10 | (2) |
|
1.7.1 Data Sharing and Synchronization |
|
|
11 | (1) |
|
1.7.2 Shared Virtual Memory |
|
|
11 | (1) |
|
1.8 Heterogeneous Computing with OpenCL |
|
|
12 | (1) |
|
|
13 | (2) |
|
|
14 | (1) |
|
Chapter 2 Device Architectures |
|
|
15 | (26) |
|
|
15 | (1) |
|
|
15 | (14) |
|
2.2.1 Performance Increase with Frequency, and Its Limitations |
|
|
17 | (1) |
|
2.2.2 Superscalar Execution |
|
|
18 | (1) |
|
2.2.3 Very Long Instruction Word |
|
|
19 | (2) |
|
2.2.4 SIMD and Vector Processing |
|
|
21 | (1) |
|
2.2.5 Hardware Multithreading |
|
|
22 | (3) |
|
2.2.6 Multicore Architectures |
|
|
25 | (1) |
|
2.2.7 Integration: Systems-on-Chip and the APU |
|
|
26 | (2) |
|
2.2.8 Cache Hierarchies and Memory Systems |
|
|
28 | (1) |
|
2.3 The Architectural Design Space |
|
|
29 | (9) |
|
|
29 | (4) |
|
|
33 | (4) |
|
2.3.3 APU and APU-like Designs |
|
|
37 | (1) |
|
|
38 | (3) |
|
|
39 | (2) |
|
Chapter 3 Introduction to OpenCL |
|
|
41 | (34) |
|
|
41 | (2) |
|
3.1.1 The OpenCL Standard |
|
|
41 | (1) |
|
3.1.2 The OpenCL Specification |
|
|
42 | (1) |
|
3.2 The OpenCL Platform Model |
|
|
43 | (2) |
|
3.2.1 Platforms and Devices |
|
|
44 | (1) |
|
3.3 The OpenCL Execution Model |
|
|
45 | (5) |
|
|
45 | (2) |
|
|
47 | (1) |
|
|
48 | (1) |
|
3.3.4 Device-Side Enqueuing |
|
|
49 | (1) |
|
3.4 Kernels and the OpenCL Programming Model |
|
|
50 | (6) |
|
3.4.1 Compilation and Argument Handling |
|
|
53 | (2) |
|
3.4.2 Starting Kernel Execution on a Device |
|
|
55 | (1) |
|
|
56 | (6) |
|
|
56 | (3) |
|
3.5.2 Data Transfer Commands |
|
|
59 | (1) |
|
|
60 | (2) |
|
3.5.4 Generic Address Space |
|
|
62 | (1) |
|
3.6 The OpenCL Runtime with an Example |
|
|
62 | (7) |
|
3.6.1 Complete Vector Addition Listing |
|
|
66 | (3) |
|
3.7 Vector Addition Using an OpenCL C++ Wrapper |
|
|
69 | (2) |
|
3.8 OpenCL for CUDA Programmers |
|
|
71 | (2) |
|
|
73 | (2) |
|
|
73 | (2) |
|
|
75 | (36) |
|
|
75 | (1) |
|
|
75 | (8) |
|
|
83 | (8) |
|
|
91 | (8) |
|
|
99 | (8) |
|
|
107 | (2) |
|
4.6.1 Reporting Compilation Errors |
|
|
107 | (1) |
|
4.6.2 Creating a Program String |
|
|
108 | (1) |
|
|
109 | (2) |
|
Chapter 5 OpenCL Runtime and Concurrency Model |
|
|
111 | (32) |
|
5.1 Commands and the Queuing Model |
|
|
111 | (7) |
|
5.1.1 Blocking Memory Operations |
|
|
111 | (1) |
|
|
112 | (1) |
|
5.1.3 Command Barriers and Markers |
|
|
113 | (1) |
|
|
114 | (1) |
|
5.1.5 Profiling Using Events |
|
|
114 | (1) |
|
|
115 | (1) |
|
5.1.7 Out-of-Order Command-Queues |
|
|
116 | (2) |
|
5.2 Multiple Command-Queues |
|
|
118 | (3) |
|
5.3 The Kernel Execution Domain: Work-Items, Work-Groups, and NDRanges |
|
|
121 | (9) |
|
|
124 | (1) |
|
5.3.2 Work-Group Barriers |
|
|
125 | (3) |
|
5.3.3 Built-In Work-Group Functions |
|
|
128 | (1) |
|
5.3.4 Predicate Evaluation Functions |
|
|
128 | (1) |
|
5.3.5 Broadcast Functions |
|
|
129 | (1) |
|
5.3.6 Parallel Primitive Functions |
|
|
129 | (1) |
|
5.4 Native and Built-In Kernels |
|
|
130 | (2) |
|
|
130 | (2) |
|
|
132 | (1) |
|
|
132 | (10) |
|
5.5.1 Creating a Device-Side Queue |
|
|
135 | (1) |
|
5.5.2 Enqueuing Device-Side Kernels |
|
|
136 | (6) |
|
|
142 | (1) |
|
|
142 | (1) |
|
Chapter 6 OpenCL Host-Side Memory Model |
|
|
143 | (20) |
|
|
144 | (4) |
|
|
144 | (1) |
|
|
145 | (2) |
|
|
147 | (1) |
|
|
148 | (11) |
|
6.2.1 Managing Default Memory Objects |
|
|
149 | (6) |
|
6.2.2 Managing Memory Objects with Allocation Options |
|
|
155 | (4) |
|
6.3 Shared Virtual Memory |
|
|
159 | (2) |
|
|
161 | (2) |
|
Chapter 7 OpenCL Device-Side Memory Model |
|
|
163 | (24) |
|
7.1 Synchronization and Communication |
|
|
164 | (4) |
|
|
165 | (1) |
|
|
166 | (2) |
|
|
168 | (7) |
|
|
168 | (1) |
|
|
169 | (4) |
|
|
173 | (2) |
|
|
175 | (1) |
|
|
175 | (3) |
|
|
178 | (1) |
|
7.6 Generic Address Space |
|
|
178 | (2) |
|
|
180 | (6) |
|
|
183 | (2) |
|
|
185 | (1) |
|
|
186 | (1) |
|
Chapter 8 Dissecting OpenCL on a Heterogeneous System |
|
|
187 | (26) |
|
8.1 OpenCL on an AMD FX-8350 CPU |
|
|
187 | (5) |
|
8.1.1 Runtime Implementation |
|
|
188 | (3) |
|
8.1.2 Vectorizing Within a Work-Item |
|
|
191 | (1) |
|
|
191 | (1) |
|
8.2 OpenCL on the AMD Radeon R9 290X GPU |
|
|
192 | (9) |
|
8.2.1 Threading and the Memory System |
|
|
194 | (2) |
|
8.2.2 Instruction Set Architecture and Execution Units |
|
|
196 | (4) |
|
8.2.3 Resource Allocation |
|
|
200 | (1) |
|
8.3 Memory Performance Considerations in OpenCL |
|
|
201 | (10) |
|
|
201 | (4) |
|
8.3.2 Local Memory as a Software-Managed Cache |
|
|
205 | (6) |
|
|
211 | (2) |
|
|
211 | (2) |
|
Chapter 9 Case Study: Image Clustering |
|
|
213 | (16) |
|
|
213 | (2) |
|
9.2 The Feature Histogram on the CPU |
|
|
215 | (2) |
|
9.2.1 Sequential Implementation |
|
|
215 | (1) |
|
9.2.2 OpenMP Parallelization |
|
|
216 | (1) |
|
9.3 OpenCL Implementation |
|
|
217 | (10) |
|
9.3.1 Naive GPU Implementation: GPU1 |
|
|
217 | (1) |
|
9.3.2 Coalesced Memory Accesses: GPU2 |
|
|
218 | (3) |
|
9.3.3 Vectorizing Computation: GPU3 |
|
|
221 | (2) |
|
9.3.4 Move SURF Features to Local Memory: GPU4 |
|
|
223 | (2) |
|
9.3.5 Move Cluster Centroids to Constant Memory: GPU5 |
|
|
225 | (2) |
|
|
227 | (1) |
|
|
227 | (1) |
|
|
228 | (1) |
|
|
228 | (1) |
|
Chapter 10 OpenCL Profiling and Debugging |
|
|
229 | (20) |
|
|
229 | (1) |
|
10.2 Profiling OpenCL Code Using Events |
|
|
229 | (2) |
|
|
231 | (1) |
|
10.4 Profiling Using CodeXL |
|
|
232 | (6) |
|
10.4.1 Collecting OpenCL Application Traces |
|
|
233 | (2) |
|
10.4.2 Host API Trace View |
|
|
235 | (1) |
|
10.4.3 Summary Pages View |
|
|
236 | (1) |
|
10.4.4 Collecting GPU Kernel Performance Counters |
|
|
236 | (1) |
|
10.4.5 CPU Performance Profiling Using CodeXL |
|
|
237 | (1) |
|
10.5 Analyzing Kernels Using CodeXL |
|
|
238 | (5) |
|
10.5.1 KernelAnalyzer Statistics and ISA Views |
|
|
239 | (3) |
|
10.5.2 KernelAnalyzer Analysis View |
|
|
242 | (1) |
|
10.6 Debugging OpenCL Kernels Using CodeXL |
|
|
243 | (3) |
|
10.6.1 API-Level Debugging |
|
|
244 | (1) |
|
|
245 | (1) |
|
10.7 Debugging Using printf |
|
|
246 | (1) |
|
|
247 | (2) |
|
Chapter 11 Mapping High-Level Programming Languages to OpenCL 2.0 |
|
|
249 | (24) |
|
|
249 | (1) |
|
11.2 A Brief Introduction to C++ AMP |
|
|
250 | (4) |
|
11.2.1 C++ AMP array_view |
|
|
251 | (1) |
|
11.2.2 C++ AMP parallel_for_each, or Kernel Invocation |
|
|
252 | (2) |
|
11.3 OpenCL 2.0 as a Compiler Target |
|
|
254 | (1) |
|
11.4 Mapping Key C++ AMP Constructs to OpenCL |
|
|
254 | (5) |
|
11.5 C++ AMP Compilation Flow |
|
|
259 | (1) |
|
11.6 Compiled C++ AMP Code |
|
|
260 | (1) |
|
11.7 How Shared Virtual Memory in OpenCL 2.0 Fits In |
|
|
261 | (2) |
|
11.8 Compiler Support for Tiling in C++ AMP |
|
|
263 | (2) |
|
11.8.1 Dividing the Compute Domain |
|
|
263 | (1) |
|
11.8.2 Specifying the Address Space and Barriers |
|
|
264 | (1) |
|
11.9 Address Space Deduction |
|
|
265 | (2) |
|
11.10 Data Movement Optimization |
|
|
267 | (1) |
|
|
267 | (1) |
|
11.10.2 array_view<const T, N> |
|
|
268 | (1) |
|
11.11 Binomial Options: A Full Example |
|
|
268 | (2) |
|
11.12 Preliminary Results |
|
|
270 | (1) |
|
|
271 | (2) |
|
|
272 | (1) |
|
Chapter 12 WebCL: Enabling OpenCL Acceleration of Web Applications |
|
|
273 | (18) |
|
|
273 | (1) |
|
12.2 Programming with WebCL |
|
|
273 | (8) |
|
|
281 | (1) |
|
12.4 Interoperability with WebGL |
|
|
282 | (1) |
|
|
282 | (3) |
|
12.6 Security Enhancement |
|
|
285 | (1) |
|
|
286 | (2) |
|
12.8 Status and Future of WebCL |
|
|
288 | (3) |
|
|
288 | (1) |
|
|
288 | (3) |
|
|
291 | (10) |
|
|
291 | (1) |
|
|
291 | (2) |
|
|
293 | (6) |
|
|
294 | (1) |
|
|
295 | (1) |
|
13.3.3 Reference Counting |
|
|
295 | (1) |
|
13.3.4 Platform and Devices |
|
|
296 | (1) |
|
13.3.5 The Execution Environment |
|
|
296 | (3) |
|
|
299 | (2) |
|
|
299 | (2) |
Index |
|
301 | |