Contributors  xiii
Preface  xv
|
1 The Modern Mathematics of Deep Learning  1
  1.1 Introduction  1
    1.1.1 Notation  4
    1.1.2 Foundations of Learning Theory  5
    1.1.3 Do We Need a New Theory?  23
  1.2 Generalization of Large Neural Networks  31
    1.2.1 Kernel Regime  31
    1.2.2 Norm-Based Bounds and Margin Theory  33
    1.2.3 Optimization and Implicit Regularization  35
    1.2.4 Limits of Classical Theory and Double Descent  38
  1.3 The Role of Depth in the Expressivity of Neural Networks  41
    1.3.1 Approximation of Radial Functions  41
    1.3.2 Deep ReLU Networks  44
    1.3.3 Alternative Notions of Expressivity  47
  1.4 Deep Neural Networks Overcome the Curse of Dimensionality  49
    1.4.1 Manifold Assumption  49
    1.4.2 Random Sampling  51
    1.4.3 PDE Assumption  53
  1.5 Optimization of Deep Neural Networks  57
    1.5.1 Loss Landscape Analysis  57
    1.5.2 Lazy Training and Provable Convergence of Stochastic Gradient Descent  61
  1.6 Tangible Effects of Special Architectures  65
    1.6.1 Convolutional Neural Networks  66
    1.6.2 Residual Neural Networks  68
    1.6.3 Framelets and U-Nets  70
    1.6.4 Batch Normalization  73
    1.6.5 Sparse Neural Networks and Pruning  75
    1.6.6 Recurrent Neural Networks  76
  1.7 Describing the Features that a Deep Neural Network Learns  78
    1.7.1 Invariances and the Scattering Transform  78
    1.7.2 Hierarchical Sparse Representations  79
  1.8 Effectiveness in Natural Sciences  81
    1.8.1 Deep Neural Networks Meet Inverse Problems  82
    1.8.2 PDE-Based Models  84
|
2 Generalization in Deep Learning  112
  2.1 Introduction  112
  2.2  113
  2.3 Rethinking Generalization  116
    2.3.1 Consistency of Theory  118
    2.3.2 Differences in Assumptions and Problem Settings  119
    2.3.3 Practical Role of Generalization Theory  121
  2.4 Generalization Bounds via Validation  121
  2.5 Direct Analyses of Neural Networks  122
    2.5.1 Model Description via Deep Paths  123
    2.5.2 Theoretical Insights via Tight Theory for Every Pair (P, S)  125
    2.5.3 Probabilistic Bounds over Random Datasets  127
    2.5.4 Probabilistic Bound for 0-1 Loss with Multi-Labels  130
  2.6 Discussions and Open Problems  131
  Appendix A Additional Discussions  133
    A1 Simple Regularization Algorithm  133
    A2 Relationship to Other Fields  135
    A3 SGD Chooses Direction in Terms of w  135
    A4 Simple Implementation of Two-Phase Training Procedure  136
    A5  136
    A6  137
  Appendix B Experimental Details  137
  Appendix C  138
    C1  139
    C2 Proof of Corollary 2.2  139
    C3  140
    C4  141
    C5  142
    C6 Proof of Proposition 2.5  143
|
3 Expressivity of Deep Neural Networks  149
  3.1 Introduction  149
    3.1.1 Neural Networks  151
    3.1.2 Goal and Outline of this Chapter  154
    3.1.3 Notation  154
  3.2 Shallow Neural Networks  155
    3.2.1 Universality of Shallow Neural Networks  156
    3.2.2 Lower Complexity Bounds  159
    3.2.3 Upper Complexity Bounds  160
  3.3 Universality of Deep Neural Networks  161
  3.4 Approximation of Classes of Smooth Functions  163
  3.5 Approximation of Piecewise Smooth Functions  167
  3.6 Assuming More Structure  172
    3.6.1 Hierarchical Structure  172
    3.6.2 Assumptions on the Data Manifold  174
    3.6.3 Expressivity of Deep Neural Networks for Solutions of PDEs  175
  3.7 Deep Versus Shallow Neural Networks  177
  3.8 Special Neural Network Architectures and Activation Functions  180
    3.8.1 Convolutional Neural Networks  180
    3.8.2 Residual Neural Networks  184
    3.8.3 Recurrent Neural Networks  185
|
4 Optimization Landscape of Neural Networks  200
  4.1 Introduction  201
  4.2 Basics of Statistical Learning  205
  4.3 Optimization Landscape of Linear Networks  206
    4.3.1 Single-Hidden-Layer Linear Networks with Squared Loss and Fixed Size Regularization  207
    4.3.2 Deep Linear Networks with Squared Loss  212
  4.4 Optimization Landscape of Nonlinear Networks  214
    4.4.1  215
    4.4.2 Positively Homogeneous Networks  221
  225
|
5 Explaining the Decisions of Convolutional and Recurrent Neural Networks  229
  5.1 Introduction  229
  5.2  231
    5.2.1 Practical Advantages of Explainability  231
    5.2.2 Social and Legal Role of Explainability  232
    5.2.3 Theoretical Insights Through Explainability  232
  5.3 From Explaining Linear Models to General Model Explainability  233
    5.3.1 Explainability of Linear Models  233
    5.3.2 Generalizing Explainability to Nonlinear Models  235
    5.3.3 Short Survey on Explanation Methods  236
  5.4 Layer-Wise Relevance Propagation  238
    5.4.1 LRP in Convolutional Neural Networks  239
    5.4.2 Theoretical Interpretation of the LRP Redistribution Process  242
    5.4.3 Extending LRP to LSTM Networks  248
  5.5 Explaining a Visual Question Answering Model  251
  258
|
6 Stochastic Feedforward Neural Networks: Universal Approximation  267
  6.1 Introduction  268
  6.2 Overview of Previous Works and Results  271
  6.3 Markov Kernels and Stochastic Networks  273
    6.3.1 Binary Probability Distributions and Markov Kernels  273
    6.3.2 Stochastic Feedforward Networks  274
  6.4 Results for Shallow Networks  276
    6.4.1 Fixed Weights in the Output Layer  277
    6.4.2 Trainable Weights in the Output Layer  278
  6.5 Proofs for Shallow Networks  278
    6.5.1 Fixed Weights in the Output Layer  279
    6.5.2 Trainable Weights in the Second Layer  283
    6.5.3 Discussion of the Proofs for Shallow Networks  285
  6.6 Results for Deep Networks  286
    6.6.1  288
    6.6.2 Approximation with Finite Weights and Biases  288
  6.7 Proofs for Deep Networks  289
    6.7.1  289
    6.7.2 Probability Mass Sharing  290
    6.7.3 Universal Approximation  293
    6.7.4 Error Analysis for Finite Weights and Biases  296
    6.7.5 Discussion of the Proofs for Deep Networks  298
  6.8 Lower Bounds for Shallow and Deep Networks  299
    6.8.1 Parameter Counting Lower Bounds  299
    6.8.2  301
  302
  306
  307
|
7 Deep Learning as Sparsity-Enforcing Algorithms  314
  7.1 Introduction  314
  7.2  316
  7.3  317
  7.4 Multilayer Sparse Coding  320
    7.4.1 ML-SC Pursuit and the Forward Pass  321
    7.4.2 ML-SC: A Projection Approach  323
  7.5  324
  7.6 Multilayer Iterative Shrinkage Algorithms  327
    7.6.1 Towards Principled Recurrent Neural Networks  329
  7.7 Final Remarks and Outlook  332
|
8 The Scattering Transform  338
  8.1 Introduction  338
  8.2  339
    8.2.1 Euclidean Geometric Stability  340
    8.2.2 Representations with Euclidean Geometric Stability  341
    8.2.3 Non-Euclidean Geometric Stability  342
    8.2.4  343
  8.3 Scattering on the Translation Group  346
    8.3.1 Windowed Scattering Transform  346
    8.3.2 Scattering Metric and Energy Conservation  349
    8.3.3 Local Translation Invariance and Lipschitz Continuity with Respect to Deformations  351
    8.3.4  354
    8.3.5 Empirical Analysis of Scattering Properties  357
    8.3.6 Scattering in Modern Computer Vision  362
  8.4 Scattering Representations of Stochastic Processes  363
    8.4.1 Expected Scattering  363
    8.4.2 Analysis of Stationary Textures with Scattering  367
    8.4.3 Multifractal Analysis with Scattering Moments  369
  8.5 Non-Euclidean Scattering  371
    8.5.1 Joint versus Separable Scattering  372
    8.5.2 Scattering on Global Symmetry Groups  372
    8.5.3  375
    8.5.4 Manifold Scattering  383
  8.6 Generative Modeling with Scattering  384
    8.6.1 Sufficient Statistics  384
    8.6.2 Microcanonical Scattering Models  385
    8.6.3 Gradient Descent Scattering Reconstruction  387
    8.6.4 Regularising Inverse Problems with Scattering  389
    8.6.5 Texture Synthesis with Microcanonical Scattering  391
  393
|
9 Deep Generative Models and Inverse Problems  400
  9.1 Introduction  400
  9.2 How to Tame High Dimensions  401
    9.2.1  401
    9.2.2 Conditional Independence  402
    9.2.3 Deep Generative Models  403
    9.2.4  404
    9.2.5 Invertible Generative Models  405
    9.2.6 Untrained Generative Models  405
  9.3 Linear Inverse Problems Using Deep Generative Models  406
    9.3.1 Reconstruction from Gaussian Measurements  407
    9.3.2 Optimization Challenges  409
    9.3.3 Extending the Range of the Generator  410
    9.3.4 Non-Linear Inverse Problems  410
    9.3.5 Inverse Problems with Untrained Generative Priors  412
  9.4 Supervised Methods for Inverse Problems  414
|
10 Dynamical Systems and Optimal Control Approach to Deep Learning  422
  10.1 Introduction  422
    10.1.1 The Problem of Supervised Learning  423
  10.2  424
  10.3 Mean-Field Optimal Control and Pontryagin's Maximum Principle  425
    10.3.1 Pontryagin's Maximum Principle  426
  10.4 Method of Successive Approximations  428
    10.4.1 Extended Pontryagin Maximum Principle  428
    10.4.2 The Basic Method of Successive Approximation  428
    10.4.3 Extended Method of Successive Approximation  431
    10.4.4 Discrete PMP and Discrete MSA  433
  435
|
11 Bridging Many-Body Quantum Physics and Deep Learning via Tensor Networks  439
  11.1 Introduction  440
  11.2 Preliminaries: Many-Body Quantum Physics  442
    11.2.1 The Many-Body Quantum Wave Function  443
    11.2.2 Quantum Entanglement Measures  444
    11.2.3  447
  11.3 Quantum Wave Functions and Deep Learning Architectures  450
    11.3.1 Convolutional and Recurrent Networks as Wave Functions  450
    11.3.2 Tensor Network Representations of Convolutional and Recurrent Networks  453
  11.4 Deep Learning Architecture Design via Entanglement Measures  453
    11.4.1 Dependencies via Entanglement Measures  454
    11.4.2 Quantum-Physics-Inspired Control of Inductive Bias  456
  11.5 Power of Deep Learning for Wave Function Representations  460
    11.5.1 Entanglement Scaling of Deep Recurrent Networks  461
    11.5.2 Entanglement Scaling of Overlapping Convolutional Networks  463
  467