Introduction |
|
xiii | |
|
PART 1 BACKGROUND KNOWLEDGE |
|
|
|
Chapter 1 Early Intel® Architecture |
|
|
3 | (28) |
|
|
5 | (11) |
|
|
7 | (1) |
|
|
8 | (2) |
|
|
10 | (3) |
|
1.1.4 Machine Code Format |
|
|
13 | (3) |
|
|
16 | (4) |
|
1.2.1 IEEE 754 Floating Point |
|
|
16 | (3) |
|
|
19 | (1) |
|
1.3 Intel® 80286 and 80287 |
|
|
20 | (3) |
|
1.3.1 Protected and Real Mode |
|
|
21 | (1) |
|
1.3.2 Protected Mode Segmentation |
|
|
21 | (1) |
|
|
22 | (1) |
|
1.4 Intel® 80386 and 80387 |
|
|
23 | (8) |
|
|
24 | (2) |
|
|
26 | (2) |
|
|
28 | (3) |
|
Chapter 2 Intel® Pentium® Processors |
|
|
31 | (12) |
|
|
32 | (2) |
|
|
33 | (1) |
|
|
34 | (4) |
|
|
34 | (1) |
|
|
35 | (1) |
|
2.2.3 Out-of-Order Execution |
|
|
36 | (2) |
|
|
38 | (5) |
|
|
38 | (2) |
|
|
40 | (1) |
|
2.3.3 Intel® Hyper-Threading |
|
|
41 | (1) |
|
|
41 | (2) |
|
Chapter 3 Intel® Core™ Processors |
|
|
43 | (10) |
|
|
44 | (4) |
|
|
44 | (4) |
|
3.2 Second Generation Intel® Core™ Processor Family |
|
|
48 | (5) |
|
|
49 | (1) |
|
3.2.2 Intel® Flex Memory Technology |
|
|
49 | (1) |
|
3.2.3 Intel® Turbo Boost Technology |
|
|
50 | (1) |
|
|
51 | (1) |
|
|
52 | (1) |
|
Chapter 4 Performance Workflow |
|
|
53 | (20) |
|
4.1 Step 0: Defining the Problem |
|
|
54 | (2) |
|
4.2 Step 1: Determine the Source of the Problem |
|
|
56 | (1) |
|
4.3 Step 2: Determine Whether the Bottleneck Can Be Avoided |
|
|
57 | (1) |
|
4.4 Step 3: Design a Reproducible Experiment |
|
|
57 | (2) |
|
4.5 Step 4: Check Upstream |
|
|
59 | (8) |
|
|
60 | (1) |
|
|
60 | (2) |
|
|
62 | (5) |
|
4.6 Step 5: Algorithmic Improvement |
|
|
67 | (1) |
|
4.7 Step 6: Architectural Tuning |
|
|
68 | (2) |
|
|
70 | (1) |
|
4.9 Step 8: Performance Regression Testing |
|
|
71 | (2) |
|
|
71 | (2) |
|
Chapter 5 Designing Experiments |
|
|
73 | (32) |
|
|
74 | (1) |
|
5.2 Dealing with External Variables |
|
|
74 | (5) |
|
5.2.1 Controllable External Variables |
|
|
74 | (3) |
|
5.2.2 Uncontrollable External Variables |
|
|
77 | (2) |
|
|
79 | (6) |
|
|
80 | (2) |
|
5.3.2 Clock Time and Unix Time |
|
|
82 | (3) |
|
|
85 | (20) |
|
|
87 | (7) |
|
5.4.2 Working with Results |
|
|
94 | (1) |
|
5.4.3 Creating Custom Tests |
|
|
95 | (3) |
|
|
98 | (2) |
|
|
100 | (5) |
|
|
|
Chapter 6 Introduction to Profiling |
|
|
105 | (14) |
|
|
106 | (4) |
|
|
107 | (2) |
|
6.1.2 Using Event Counters |
|
|
109 | (1) |
|
6.2 Top-Down Hierarchical Analysis |
|
|
110 | (9) |
|
|
112 | (2) |
|
|
114 | (3) |
|
|
117 | (1) |
|
|
117 | (1) |
|
|
118 | (1) |
|
Chapter 7 Intel® VTune™ Amplifier XE |
|
|
119 | (18) |
|
7.1 Installation and Configuration |
|
|
120 | (4) |
|
7.1.1 Building the Kernel Modules |
|
|
121 | (1) |
|
7.1.2 System Configuration |
|
|
122 | (2) |
|
7.2 Data Collection and Reporting |
|
|
124 | (13) |
|
|
125 | (4) |
|
|
129 | (1) |
|
|
129 | (6) |
|
|
135 | (2) |
|
|
137 | (30) |
|
|
138 | (20) |
|
|
138 | (1) |
|
|
139 | (5) |
|
8.1.3 Measurement Parameters |
|
|
144 | (3) |
|
8.1.4 Enabling, Disabling, and Resetting Counters |
|
|
147 | (1) |
|
8.1.5 Reading Counting Events |
|
|
148 | (4) |
|
8.1.6 Reading Sampling Events |
|
|
152 | (6) |
|
|
158 | (9) |
|
|
158 | (2) |
|
|
160 | (1) |
|
8.2.3 Perf Record, Perf Report, and Perf Top |
|
|
160 | (3) |
|
|
163 | (2) |
|
|
165 | (2) |
|
|
167 | (12) |
|
|
168 | (8) |
|
|
169 | (7) |
|
|
176 | (3) |
|
|
178 | (1) |
|
Chapter 10 GPU Profiling Tools |
|
|
179 | (12) |
|
10.1 Traditional Graphics Stack |
|
|
180 | (4) |
|
|
180 | (1) |
|
10.1.2 Hardware and Low-Level Infrastructure: DRI |
|
|
181 | (1) |
|
10.1.3 Higher Level Software Infrastructure |
|
|
182 | (2) |
|
|
184 | (3) |
|
|
187 | (4) |
|
|
189 | (2) |
|
Chapter 11 Other Helpful Tools |
|
|
191 | (16) |
|
|
191 | (3) |
|
|
194 | (1) |
|
|
195 | (6) |
|
|
196 | (1) |
|
|
197 | (1) |
|
|
198 | (1) |
|
|
199 | (1) |
|
|
199 | (2) |
|
|
201 | (1) |
|
|
202 | (5) |
|
|
202 | (1) |
|
|
203 | (4) |
|
PART 3 OPTIMIZATION TECHNIQUES |
|
|
|
Chapter 12 Toolchain Primer |
|
|
207 | (34) |
|
|
209 | (2) |
|
12.2 ELF and the x86/x86_64 ABIs |
|
|
211 | (7) |
|
12.2.1 Relocations and PIC |
|
|
212 | (4) |
|
|
216 | (2) |
|
|
218 | (9) |
|
12.3.1 Querying CPU Features |
|
|
219 | (5) |
|
12.3.2 Runtime Dispatching |
|
|
224 | (3) |
|
|
227 | (6) |
|
|
227 | (3) |
|
12.4.2 Using the Appropriate Types and Qualifiers |
|
|
230 | (1) |
|
|
231 | (2) |
|
|
233 | (1) |
|
|
233 | (8) |
|
12.5.1 Standalone Assembly |
|
|
234 | (2) |
|
|
236 | (1) |
|
12.5.3 Compiler Intrinsics |
|
|
237 | (1) |
|
|
238 | (3) |
|
|
241 | (10) |
|
|
242 | (5) |
|
13.1.1 Extra Work and Masking |
|
|
243 | (2) |
|
13.1.2 Combining and Rearranging Branches |
|
|
245 | (2) |
|
13.2 Improving Prediction |
|
|
247 | (4) |
|
13.2.1 Profile Guided Optimization |
|
|
248 | (1) |
|
|
249 | (2) |
|
Chapter 14 Optimizing Cache Usage |
|
|
251 | (12) |
|
14.1 Processor Cache Organization |
|
|
252 | (2) |
|
|
253 | (1) |
|
14.2 Querying Cache Topology |
|
|
254 | (4) |
|
|
258 | (1) |
|
|
259 | (4) |
|
|
261 | (2) |
|
Chapter 15 Exploiting Parallelism |
|
|
263 | (12) |
|
|
265 | (10) |
|
|
266 | (3) |
|
|
269 | (4) |
|
|
273 | (2) |
|
Chapter 16 Special Instructions |
|
|
275 | (4) |
|
16.1 Intel® Advanced Encryption Standard New Instructions (AES-NI) |
|
|
275 | (1) |
|
|
276 | (1) |
|
16.2 PCLMUL - Packed Carry-Less Multiplication |
|
|
276 | (1) |
|
|
277 | (1) |
|
|
277 | (1) |
|
|
277 | (1) |
|
16.4 SSE4.2 String Functions |
|
|
277 | (2) |
|
|
278 | (1) |
Index |
|
279 | |