
Machine Learning on Commodity Tiny Devices: Theory and Practice [Hardback]

  • Format: Hardback, 250 pages, height x width: 254x178 mm, weight: 453 g, 7 Tables, black and white; 36 Line drawings, black and white; 20 Halftones, black and white; 56 Illustrations, black and white
  • Publication date: 13-Dec-2022
  • Publisher: CRC Press
  • ISBN-10: 1032374233
  • ISBN-13: 9781032374239
This book addresses the synergy between tiny machine learning (TinyML) software and hardware for edge intelligence applications. It presents on-device learning techniques covering model-level neural network design, algorithm-level training optimization, and hardware-level instruction acceleration.

An analysis of the limitations of conventional in-cloud computing reveals that on-device learning is a promising research direction for meeting the requirements of edge intelligence applications. At the cutting edge of TinyML research, implementing a high-efficiency learning framework and enabling system-level acceleration are among the most fundamental issues. This book comprehensively discusses the latest research progress and provides system-level insights into designing TinyML frameworks, including neural network design, training algorithm optimization, and domain-specific hardware acceleration. It identifies the main challenges of deploying TinyML tasks in the real world and guides researchers in building reliable learning systems.

This book will be of interest to students and scholars in the field of edge intelligence, especially those with professional Edge AI experience. It will also serve as an excellent guide for researchers implementing high-performance TinyML systems.
List of Figures xiii
List of Tables xvii
Chapter 1 Introduction 1(10)
1.1 What Is Machine Learning on Devices? 1(2)
1.2 On-Device Learning and TinyML Systems 3(2)
1.2.1 Property of On-Device Learning 3(1)
1.2.2 Objectives of TinyML Systems 4(1)
1.3 Challenges for Realistic Implementation 5(1)
1.4 Problem Statement of Building TinyML Systems 6(1)
1.5 Deployment Prospects and Downstream Applications 7(2)
1.5.1 Evaluation Metrics for Practical Methods 7(1)
1.5.2 Intelligent Medical Diagnosis 8(1)
1.5.3 AI-Enhanced Motion Tracking 8(1)
1.5.4 Domain-Specific Acceleration Chips 8(1)
1.6 The Scope and Organization of This Book 9(2)
Chapter 2 Fundamentals: On-Device Learning Paradigm 11(14)
2.1 Motivation 11(4)
2.1.1 Drawbacks of In-Cloud Learning 12(1)
2.1.2 Rise of On-Device Learning 12(1)
2.1.3 Bit Precision and Data Quantization 13(1)
2.1.4 Potential Gains 14(1)
2.1.5 Why Not Existing Quantization Methods? 14(1)
2.2 Basic Training Algorithms 15(3)
2.2.1 Stochastic Gradient Descent 15(1)
2.2.2 Mini-Batch Stochastic Gradient Descent 16(1)
2.2.3 Training of Neural Networks 17(1)
2.3 Parameter Synchronization for Distributed Training 18(2)
2.3.1 Parameter Server Paradigm 18(1)
2.3.2 Parameter Synchronization Pace 19(1)
2.3.3 Heterogeneity-Aware Distributed Training 19(1)
2.4 Multi-Client On-Device Learning 20(2)
2.4.1 Preliminary Experiments 20(1)
2.4.2 Observations 20(1)
2.4.2.1 Training Convergence Efficiency 20(1)
2.4.2.2 Synchronization Frequency 21(1)
2.4.2.3 Communication Traffic 21(1)
2.4.3 Summary 22(1)
2.5 Developing Kits and Evaluation Platforms 22(1)
2.5.1 Devices 22(1)
2.5.2 Benchmarks 22(1)
2.5.3 Pipeline 22(1)
2.6 Chapter Summary 23(2)
Chapter 3 Preliminary: Theories and Algorithms 25(20)
3.1 Elements of Neural Networks 25(2)
3.1.1 Fully Connected Network 25(1)
3.1.2 Convolutional Neural Network 26(1)
3.1.3 Attention-Based Neural Network 26(1)
3.2 Model-Oriented Optimization Algorithms 27(4)
3.2.1 Tiny Transformer 27(3)
3.2.2 Quantization Strategy for Transformer 30(1)
3.3 Practice on Simple Convolutional Neural Networks 31(14)
3.3.1 PyTorch Installation 31(1)
3.3.1.1 On macOS 31(1)
3.3.1.2 On Windows 32(1)
3.3.2 CIFAR-10 Dataset 32(2)
3.3.3 Construction of CNN Model 34(1)
3.3.3.1 Convolutional Layers 34(1)
3.3.3.2 Activation Layers 35(1)
3.3.3.3 Pooling Layers 35(2)
3.3.3.4 Fully Connected Layers 37(1)
3.3.3.5 Structure of LeNet-5 38(1)
3.3.4 Model Training 39(1)
3.3.5 Model Testing 40(1)
3.3.6 GPU Acceleration 41(1)
3.3.6.1 CUDA Installation 42(1)
3.3.6.2 Programming for GPU 42(1)
3.3.7 Load Pre-Trained CNNs 43(2)
Chapter 4 Model-Level Design: Computation Acceleration and Communication Saving 45(22)
4.1 Optimization of Network Architecture 45(11)
4.1.1 Network-Aware Parameter Pruning 46(1)
4.1.1.1 Pruning Steps 46(1)
4.1.1.2 Pruning Strategy 47(1)
4.1.1.3 Pruning Metrics 47(1)
4.1.1.4 Summary 48(1)
4.1.2 Knowledge Distillation 48(1)
4.1.2.1 Combination of Loss Functions 49(1)
4.1.2.2 Tuning of Hyper-Parameters 50(1)
4.1.2.3 Usage of Model Training 50(1)
4.1.2.4 Summary 51(1)
4.1.3 Model Fine-Tuning 51(1)
4.1.3.1 Transfer Learning 51(1)
4.1.3.2 Layer-Wise Freezing and Updating 52(1)
4.1.3.3 Model-Wise Feature Sharing 53(1)
4.1.3.4 Summary 54(1)
4.1.4 Neural Architecture Search 54(1)
4.1.4.1 Search Space of HW-NAS 55(1)
4.1.4.2 Targeted Hardware Platforms 56(1)
4.1.4.3 Trend of Current HW-NAS Methods 56(1)
4.2 Optimization of Training Algorithm 56(10)
4.2.1 Low Rank Factorization 57(1)
4.2.2 Data-Adaptive Regularization 57(1)
4.2.2.1 Core Formulation 57(1)
4.2.2.2 On-Device Network Sparsification 58(1)
4.2.2.3 Block-Wise Regularization 59(1)
4.2.2.4 Summary 59(1)
4.2.3 Data Representation and Numerical Quantization 60(1)
4.2.3.1 Elements of Quantization 61(2)
4.2.3.2 Post-Training Quantization 63(1)
4.2.3.3 Quantization-Aware Training 63(3)
4.2.3.4 Summary 66(1)
4.3 Chapter Summary 66(1)
Chapter 5 Hardware-Level Design: Neural Engines and Tensor Accelerators 67(16)
5.1 On-Chip Resource Scheduling 67(3)
5.1.1 Embedded Memory Controlling 67(1)
5.1.2 Underlying Computational Primitives 68(1)
5.1.3 Low-Level Arithmetical Instructions 68(1)
5.1.4 MIMO-Based Communication 69(1)
5.2 Domain-Specific Hardware Acceleration 70(2)
5.2.1 Multiple Processing Primitives Scheduling 70(1)
5.2.2 I/O Connection Optimization 70(1)
5.2.3 Cache Management 71(1)
5.2.4 Topology Construction 71(1)
5.3 Cross-Device Energy Efficiency 72(3)
5.3.1 Multi-Client Collaboration 72(1)
5.3.2 Efficiency Analysis 72(2)
5.3.3 Problem Formulation for Energy Saving 74(1)
5.3.4 Algorithm Design and Pipeline Overview 75(1)
5.4 Distributed On-Device Learning 75(6)
5.4.1 Community-Aware Synchronous Parallel 76(1)
5.4.2 Infrastructure Design 77(1)
5.4.3 Community Manager 77(1)
5.4.4 Weight Learner 77(1)
5.4.4.1 Distance Metric Learning 77(1)
5.4.4.2 Asynchronous Advantage Actor-Critic 78(1)
5.4.4.3 Agent Learning Methodology 79(1)
5.4.5 Distributed Training Controller 80(1)
5.4.5.1 Intra-Community Synchronization 80(1)
5.4.5.2 Inter-Community Synchronization 80(1)
5.4.5.3 Communication Traffic Aggregation 81(1)
5.5 Chapter Summary 81(2)
Chapter 6 Infrastructure-Level Design: Serverless and Decentralized Machine Learning 83(16)
6.1 Serverless Computing 83(11)
6.1.1 Definition of Serverless Computing 83(3)
6.1.2 Architecture of Serverless Computing 86(1)
6.1.2.1 Virtualization Layer 86(1)
6.1.2.2 Encapsulation Layer 87(1)
6.1.2.3 System Orchestration Layer 88(2)
6.1.2.4 System Coordination Layer 90(1)
6.1.3 Benefits of Serverless Computing 90(1)
6.1.4 Challenges of Serverless Computing 91(1)
6.1.4.1 Programming and Modeling 91(1)
6.1.4.2 Pricing and Cost Prediction 91(1)
6.1.4.3 Scheduling 91(1)
6.1.4.4 Intra-Communications of Functions 92(1)
6.1.4.5 Data Caching 93(1)
6.1.4.6 Security and Privacy 93(1)
6.2 Serverless Machine Learning 94(3)
6.2.1 Introduction 94(1)
6.2.2 Machine Learning and Data Management 94(1)
6.2.3 Training Large Models in Serverless Computing 94(1)
6.2.3.1 Data Transfer and Parallelism in Serverless Computing 95(1)
6.2.3.2 Data Parallelism for Model Training in Serverless Computing 95(1)
6.2.3.3 Optimizing Parallelism Structure in Serverless Training 96(1)
6.2.4 Cost-Efficiency in Serverless Computing 96(1)
6.3 Chapter Summary 97(2)
Chapter 7 System-Level Design: From Standalone to Clusters 99(24)
7.1 Staleness-Aware Pipelining 99(4)
7.1.1 Data Parallelism 100(1)
7.1.2 Model Parallelism 100(1)
7.1.2.1 Linear Models 101(1)
7.1.2.2 Non-Linear Neural Networks 101(1)
7.1.3 Hybrid Parallelism 102(1)
7.1.4 Extension of Training Parallelism 102(1)
7.1.5 Summary 103(1)
7.2 Introduction to Federated Learning 103(3)
7.3 Training With Non-IID Data 106(6)
7.3.1 The Definition of Non-IID Data 107(1)
7.3.2 Enabling Technologies for Non-IID Data 108(1)
7.3.2.1 Data Sharing 108(1)
7.3.2.2 Robust Aggregation Methods 109(2)
7.3.2.3 Other Optimized Methods 111(1)
7.4 Large-Scale Collaborative Learning 112(3)
7.4.1 Parameter Server 113(1)
7.4.2 Decentralized P2P Scheme 113(1)
7.4.3 Collective Communication-Based AllReduce 114(1)
7.4.4 Data Flow-Based Graph 114(1)
7.5 Personalized Learning 115(2)
7.5.1 Data-Based Approaches 115(1)
7.5.2 Model-Based Approaches 115(1)
7.5.2.1 Single Model-Based Methods 116(1)
7.5.2.2 Multiple Model-Based Methods 116(1)
7.6 Practice on FL Implementation 117(5)
7.6.1 Prerequisites 117(1)
7.6.2 Data Distribution 118(2)
7.6.3 Local Model Training 120(1)
7.6.4 Global Model Aggregation 121(1)
7.6.5 A Simple Example 121(1)
7.7 Chapter Summary 122(1)
Chapter 8 Application: Image-Based Visual Perception 123(22)
8.1 Image Classification 123(4)
8.1.1 Traditional Image Classification Methods 123(1)
8.1.2 Deep Learning-Based Image Classification Methods 124(3)
8.1.3 Conclusion 127(1)
8.2 Image Restoration and Super-Resolution 127(8)
8.2.1 Overview 127(2)
8.2.2 A Unified Framework for Image Restoration and Super-Resolution 129(1)
8.2.3 A Demo of Single Image Super-Resolution 130(1)
8.2.3.1 Networks Architecture 131(1)
8.2.3.2 Local Aware Attention 132(1)
8.2.3.3 Global Aware Attention 133(1)
8.2.3.4 LARD Block 134(1)
8.3 Self-Attention and Vision Transformers 135(3)
8.4 Environment Perception: Image Segmentation and Object Detection 138(6)
8.4.1 Object Detection 138(1)
8.4.1.1 Traditional Object Detection Model 138(1)
8.4.1.2 Deep Learning-Based Object Detection Model 139(2)
8.4.2 Image Segmentation 141(1)
8.4.2.1 Semantic Segmentation 141(1)
8.4.2.2 Instance Segmentation 142(1)
8.4.2.3 Panoramic Segmentation 143(1)
8.5 Chapter Summary 144(1)
Chapter 9 Application: Video-Based Real-Time Processing 145(16)
9.1 Video Recognition: Evolving From Images 145(4)
9.1.1 Challenges 146(1)
9.1.2 Methodologies 147(1)
9.1.2.1 Two-Stream Networks 147(1)
9.1.2.2 3D CNNs 148(1)
9.2 Motion Tracking: Learn From Time-Spatial Sequences 149(4)
9.2.1 Deep Learning-Based Tracking 149(1)
9.2.2 Optical Flow-Based Tracking 150(3)
9.3 Pose Estimation: Key Point Extraction 153(3)
9.3.1 2D-Based Extraction 153(1)
9.3.1.1 Single Person Estimation 153(1)
9.3.1.2 Multiple Human Estimation 154(1)
9.3.2 3D-Based Extraction 155(1)
9.4 Practice: Real-Time Mobile Human Pose Tracking 156(3)
9.4.1 Prerequisites and Data Preparation 156(2)
9.4.2 Hyper-Parameter Configuration and Model Training 158(1)
9.4.3 Realistic Inference and Performance Evaluation 158(1)
9.5 Chapter Summary 159(2)
Chapter 10 Application: Privacy, Security, Robustness and Trustworthiness in Edge AI 161(26)
10.1 Privacy Protection Methods 161(15)
10.1.1 Homomorphic Encryption-Enabled Methods 162(5)
10.1.2 Differential Privacy-Enabled Methods 167(3)
10.1.3 Secure Multi-Party Computation 170(2)
10.1.4 Lightweight Private Computation Techniques for Edge AI 172(1)
10.1.4.1 Example 1: Lightweight and Secure Decision Tree Classification 173(1)
10.1.4.2 Example 2: Lightweight and Secure SVM Classification 173(3)
10.2 Security and Robustness 176(4)
10.2.1 Practical Issues 176(2)
10.2.2 Backdoor Attacks 178(2)
10.2.3 Backdoor Defences 180(1)
10.3 Trustworthiness 180(5)
10.3.1 Blockchain and Swarm Learning 180(3)
10.3.2 Trusted Execution Environment and Federated Learning 183(2)
10.4 Chapter Summary 185(2)
Bibliography 187(60)
Index 247
Song Guo is a Full Professor leading the Edge Intelligence Lab and Research Group of Networking and Mobile Computing at the Hong Kong Polytechnic University. Professor Guo is a Fellow of the Canadian Academy of Engineering, a Fellow of the IEEE, a Fellow of the AAIA, and a Clarivate Highly Cited Researcher.

Qihua Zhou is a PhD student with the Department of Computing at the Hong Kong Polytechnic University. His research interests include distributed AI systems, large-scale parallel processing, TinyML systems and domain-specific accelerators.