Foreword ... xvii
Rebooting Computing and Low-Power Computer Vision ... xix
Editors ... xxi
|
Chapter 1 Book Introduction ... 3
    1.2.1 History of Low-Power Computer Vision Challenge ... 4
    1.2.2 Survey on Energy-Efficient Deep Neural Networks for Computer Vision ... 5
    1.2.3 Hardware Design and Software Practices for Efficient Neural Network Inference ... 6
    1.2.4 Progressive Automatic Design of Search Space for One-Shot Neural Architecture Search ... 6
    1.2.5 Fast Adjustable Threshold for Uniform Neural Network Quantization ... 7
    1.2.6 Power-efficient Neural Network Scheduling on Heterogeneous Systems-on-Chip (SoCs) ... 8
    1.2.7 Efficient Neural Architecture Search ... 9
    1.2.8 Design Methodology for Low-Power Image Recognition Systems ... 10
    1.2.9 Guided Design for Efficient On-device Object Detection Model ... 11
    1.2.10 Quantizing Neural Networks for Low-Power Computer Vision ... 12
    1.2.11 A Practical Guide to Designing Efficient Mobile Architectures ... 13
    1.2.12 A Survey of Quantization Methods for Efficient Neural Network Inference ... 14
|
Chapter 2 History of Low-Power Computer Vision Challenge ... 17
  2.2 Low-Power Image Recognition Challenge (LPIRC): 2015-2019 ... 18
  2.3 Low-Power Computer Vision Challenge (LPCVC): 2020 ... 20
|
Chapter 3 Survey on Energy-Efficient Deep Neural Networks for Computer Vision ... 25
    3.2.1 Computation Intensity of Deep Neural Networks ... 30
    3.2.2 Low-Power Deep Neural Networks ... 31
  3.3 Parameter Quantization ... 32
  3.4 Deep Neural Network Pruning ... 35
  3.5 Deep Neural Network Layer and Filter Compression ... 37
  3.6 Parameter Matrix Decomposition Techniques ... 39
  3.7 Neural Architecture Search ... 40
  3.8 Knowledge Distillation ... 42
  3.9 Energy Consumption-Accuracy Tradeoff with Deep Neural Networks ... 44
  3.10 Guidelines for Low-Power Computer Vision ... 46
    3.10.1 Relationship Between Low-Power Computer Vision Techniques ... 46
    3.10.2 Deep Neural Network and Resolution Scaling ... 47
    3.11.1 Accuracy Measurements on Popular Datasets ... 48
    3.11.2 Memory Requirement and Number of Operations ... 49
    3.11.3 On-Device Energy Consumption and Latency ... 50
  3.12 Summary and Conclusions ... 50
|
Section II Competition Winners |
|
|
|
Chapter 4 Hardware Design and Software Practices for Efficient Neural Network Inference ... 55
  4.1 Hardware and Software Design Framework for Efficient Neural Network Inference ... 56
    4.1.2 From Model to Instructions ... 58
  4.2 ISA-Based CNN Accelerator: Angel-Eye ... 60
    4.2.1 Hardware Architecture ... 61
    4.2.4 Extension Support of Upsampling Layers ... 69
    4.2.6 Practice on DAC-SDC Low-Power Object Detection Challenge ... 74
  4.3 Neural Network Model Optimization ... 75
    4.3.1 Pruning and Quantization ... 75
      4.3.1.2 Network Quantization ... 78
      4.3.1.3 Evaluation and Practices ... 79
    4.3.2 Pruning with Hardware Cost Model ... 81
      4.3.2.1 Iterative Search-based Pruning Methods ... 81
      4.3.2.2 Local Programming-based Pruning and the Practice in LPCVC'19 ... 82
    4.3.3 Architecture Search Framework ... 85
      4.3.3.2 Case Study Using the aw_nas Framework: Black-box Search Space Tuning for Hardware-aware NAS ... 88
|
Chapter 5 Progressive Automatic Design of Search Space for One-Shot Neural Architecture Search ... 91
    5.4.1 Problem Formulation and Motivation ... 96
    5.4.2 Progressive Automatic Design of Search Space ... 98
    5.5.1 Dataset and Implementation Details ... 101
    5.5.2 Comparison with State-of-the-art Methods ... 103
    5.5.3 Automatically Designed Search Space ... 106
|
Chapter 6 Fast Adjustable Threshold for Uniform Neural Network Quantization ... 111
    6.2.1 Quantization with Knowledge Distillation ... 115
    6.2.2 Quantization without Fine-tuning ... 115
    6.2.3 Quantization with Training/Fine-tuning ... 115
    6.3.1 Quantization with Threshold Fine-tuning ... 116
      6.3.1.1 Differentiable Quantization Threshold ... 116
      6.3.1.2 Batch Normalization Folding ... 118
      6.3.1.4 Training of Asymmetric Thresholds ... 119
      6.3.1.5 Vector Quantization ... 120
    6.3.2 Training on Unlabeled Data ... 120
    6.3.3 Quantization of Depth-wise Separable Convolution ... 121
      6.3.3.1 Scaling the Weights for MobileNet-V2 (with ReLU6) ... 122
  6.4 Experiments and Results ... 123
    6.4.1 Experiments Description ... 123
      6.4.1.1 Researched Architectures ... 123
      6.4.1.2 Training Procedure ... 124
|
Chapter 7 Power-efficient Neural Network Scheduling ... 127
  7.1 Introduction to Neural Network Scheduling on Heterogeneous SoCs ... 128
  7.2 Coarse-Grained Scheduling for Neural Network Tasks: A Case Study of the Champion Solution in LPIRC 2016 ... 131
    7.2.1 Introduction to the LPIRC 2016 Mission and the Solutions ... 131
    7.2.2 Static Scheduling for the Image Recognition Task ... 133
    7.2.3 Manual Load Balancing for Pipelined Fast R-CNN ... 134
    7.2.4 The Result of Static Scheduling ... 138
  7.3 Fine-Grained Neural Network Scheduling on Power-Efficient Processors ... 140
    7.3.1 Network Scheduling on SUs: Compiler-Level Techniques ... 140
    7.3.2 Memory-Efficient Network Scheduling ... 141
    7.3.3 The Formulation of the Layer-Fusion Problem by Computational Graphs ... 142
    7.3.4 Cost Estimation of Fused Layer-Groups ... 145
    7.3.5 Hardware-Aware Network Fusion Algorithm (HaNF) ... 149
    7.3.6 Implementation of the Network Fusion Algorithm ... 150
    7.3.7 Evaluation of Memory Overhead ... 152
    7.3.8 Performance on Different Processors ... 153
  7.4 Scheduler-Friendly Network Quantizations ... 154
    7.4.1 The Problem of Layer Pipelining between CPU and Integer SUs ... 154
    7.4.2 Introduction to Neural Network Quantization for Integer Neural Accelerators ... 155
    7.4.3 Related Work on Neural Network Quantization ... 159
    7.4.4 Linear Symmetric Quantization for Low-Precision Integer Hardware ... 160
    7.4.5 Making Full Use of the Pre-Trained Parameters ... 161
    7.4.6 Low-Precision Representation and Quantization Algorithm ... 161
    7.4.7 BN Layer Fusion of Quantized Networks ... 163
    7.4.8 Bias and Scaling Factor Quantization for Low-Precision Integer Operation ... 164
|
Chapter 8 Efficient Neural Network Architectures ... 173
  8.1 Standard Convolution Layer ... 174
  8.2 Efficient Convolution Layers ... 175
  8.3 Manually Designed Efficient CNN Models ... 175
  8.4 Neural Architecture Search ... 179
  8.5 Hardware-Aware Neural Architecture Search ... 182
    8.5.2 Specialized Models for Different Hardware ... 185
    8.5.3 Handling Many Platforms and Constraints ... 186
|
Chapter 9 Design Methodology for Low-Power Image Recognition Systems ... 191
  9.1 Design Methodology Used in LPIRC 2017 ... 193
    9.1.1 Object Detection Networks ... 194
    9.1.2 Throughput Maximization by Pipelining ... 195
    9.1.3 Software Optimization Techniques ... 196
      9.1.3.1 Tucker Decomposition ... 197
      9.1.3.2 CPU Parallelization ... 198
      9.1.3.3 16-bit Quantization ... 198
  9.2 Image Recognition Network Exploration ... 201
    9.2.1 Single-Stage Detectors ... 202
    9.2.2 Software Optimization Techniques ... 204
    9.2.4 Network Exploration ... 206
    9.2.5 LPIRC 2018 Solution ... 207
  9.3 Network Pipelining for Heterogeneous Processor Systems ... 208
    9.3.1 Network Pipelining Problem ... 209
    9.3.2 Network Pipelining Heuristic ... 211
    9.3.3 Software Framework for Network Pipelining ... 213
    9.3.4 Experimental Results ... 214
  9.4 Conclusion and Future Work ... 217
|
Chapter 10 Guided Design for Efficient On-device Object Detection Model ... 221
    10.1.1 LPIRC Track 1 in 2018 and 2019 ... 223
    10.1.2 Three Awards for the Amazon Team ... 223
  10.3 Award-Winning Methods ... 225
    10.3.1 Quantization-Friendly Model ... 225
    10.3.2 Network Architecture Optimization ... 226
    10.3.3 Training Hyper-parameters ... 226
    10.3.4 Optimal Model Architecture ... 227
    10.3.5 Neural Architecture Search ... 228
    10.3.7 Non-maximum Suppression Threshold ... 230
|
Section III Invited Articles |
|
|
|
Chapter 11 Quantizing Neural Networks ... 235
  11.2 Quantization Fundamentals ... 238
    11.2.1 Hardware Background ... 238
    11.2.2 Uniform Affine Quantization ... 240
      11.2.2.1 Symmetric Uniform Quantization ... 242
      11.2.2.2 Power-of-two Quantizer ... 242
      11.2.2.3 Quantization Granularity ... 243
    11.2.3 Quantization Simulation ... 243
      11.2.3.1 Batch Normalization Folding ... 244
      11.2.3.2 Activation Function Fusing ... 245
      11.2.3.3 Other Layers and Quantization ... 246
    11.2.4 Practical Considerations ... 247
      11.2.4.1 Symmetric vs. Asymmetric Quantization ... 247
      11.2.4.2 Per-tensor and Per-channel Quantization ... 248
  11.3 Post-Training Quantization ... 248
    11.3.1 Quantization Range Setting ... 249
    11.3.2 Cross-Layer Equalization ... 251
    11.3.5 Standard PTQ Pipeline ... 260
  11.4 Quantization-Aware Training ... 262
    11.4.1 Simulating Quantization for Backward Path ... 263
    11.4.2 Batch Normalization Folding and QAT ... 265
    11.4.3 Initialization for QAT ... 267
    11.4.4 Standard QAT Pipeline ... 268
  11.5 Summary and Conclusions ... 271
|
Chapter 12 Building Efficient Mobile Architectures ... 273
  12.2 Architecture Parameterizations ... 276
    12.2.1 Network Width Multiplier ... 277
    12.2.2 Input Resolution Multiplier ... 277
    12.2.3 Data and Internal Resolution ... 278
    12.2.4 Network Depth Multiplier ... 279
    12.2.5 Adjusting Multipliers for Multi-criteria Optimizations ... 280
  12.3 Optimizing Early Layers ... 281
  12.4 Optimizing the Final Layers ... 283
    12.4.1 Adjusting the Resolution of the Final Spatial Layer ... 283
    12.4.2 Reducing the Size of the Embedding Layer ... 284
  12.5 Adjusting Non-Linearities: H-Swish and H-Sigmoid ... 285
  12.6 Putting It All Together ... 287
|
Chapter 13 A Survey of Quantization Methods for Efficient Neural Network Inference ... 291
  13.2 General History of Quantization ... 296
  13.3 Basic Concepts of Quantization ... 298
    13.3.1 Problem Setup and Notations ... 299
    13.3.2 Uniform Quantization ... 299
    13.3.3 Symmetric and Asymmetric Quantization ... 300
    13.3.4 Range Calibration Algorithms: Static vs. Dynamic Quantization ... 302
    13.3.5 Quantization Granularity ... 303
    13.3.6 Non-Uniform Quantization ... 305
    13.3.7 Fine-tuning Methods ... 306
      13.3.7.1 Quantization-Aware Training ... 306
      13.3.7.2 Post-Training Quantization ... 309
      13.3.7.3 Zero-shot Quantization ... 310
    13.3.8 Stochastic Quantization ... 312
  13.4 Advanced Concepts: Quantization Below 8 Bits ... 313
    13.4.1 Simulated and Integer-only Quantization ... 313
    13.4.2 Mixed-Precision Quantization ... 315
    13.4.3 Hardware-Aware Quantization ... 317
    13.4.4 Distillation-Assisted Quantization ... 317
    13.4.5 Extreme Quantization ... 318
    13.4.6 Vector Quantization ... 321
  13.5 Quantization and Hardware Processors ... 322
  13.6 Future Directions for Research in Quantization ... 323
  13.7 Summary and Conclusions ... 325

Bibliography ... 327
Index ... 403