Preface  ix
1 Introduction  1
1.1 Data Analysis and Optimization  1
1.2 Least Squares  4
1.3 Matrix Factorization Problems  5
1.4 Support Vector Machines  6
1.5 Logistic Regression  9
1.6 Deep Learning  11
1.7 Emphasis  13
2 Foundations of Smooth Optimization  15
2.1 A Taxonomy of Solutions to Optimization Problems  15
2.2 Taylor's Theorem  16
2.3 Characterizing Minima of Smooth Functions  18
2.4 Convex Sets and Functions  20
2.5 Strongly Convex Functions  22
3 Descent Methods  26
3.1 Descent Directions  27
3.2 Steepest-Descent Method  28
3.2.1 General Case  28
3.2.2 Convex Case  29
3.2.3 Strongly Convex Case  30
3.2.4 Comparison between Rates  32
3.3 Descent Methods: Convergence  33
3.4 Line-Search Methods: Choosing the Direction  36
3.5 Line-Search Methods: Choosing the Steplength  38
3.6 Convergence to Approximate Second-Order Necessary Points  42
3.7 Mirror Descent  44
3.8 The KL and PL Properties  51
4 Gradient Methods Using Momentum  55
4.1 Motivation from Differential Equations  56
4.2 Nesterov's Method: Convex Quadratics  58
4.3 Convergence for Strongly Convex Functions  62
4.4 Convergence for Weakly Convex Functions  66
4.5 Conjugate Gradient Methods  68
4.6 Lower Bounds on Convergence Rates  70
5 Stochastic Gradient  75
5.1 Examples and Motivation  76
5.1.1 Noisy Gradients  76
5.1.2 Incremental Gradient Method  77
5.1.3 Classification and the Perceptron  77
5.1.4 Empirical Risk Minimization  78
5.2 Randomness and Steplength: Insights  80
5.2.1 Example: Computing a Mean  80
5.2.2 The Randomized Kaczmarz Method  82
5.3 Key Assumptions for Convergence Analysis  85
5.3.1 Case 1: Bounded Gradients: Lg = 0  86
5.3.2 Case 2: Randomized Kaczmarz: B = 0, Lg > 0  86
5.3.3 Case 3: Additive Gaussian Noise  86
5.3.4 Case 4: Incremental Gradient  87
5.4 Convergence Analysis  87
5.4.1 Case 1: Lg = 0  89
5.4.2 Case 2: B = 0  90
5.4.3 Case 3: B and Lg Both Nonzero  92
5.5 Implementation Aspects  93
5.5.1 Epochs  93
5.5.2 Minibatching  94
5.5.3 Acceleration Using Momentum  94
6 Coordinate Descent  100
6.1 Coordinate Descent in Machine Learning  101
6.2 Coordinate Descent for Smooth Convex Functions  103
6.2.1 Lipschitz Constants  104
6.2.2 Randomized CD: Sampling with Replacement  105
6.2.3 Cyclic CD  110
6.2.4 Random Permutations CD: Sampling without Replacement  112
6.3 Block-Coordinate Descent  113
7 First-Order Methods for Constrained Optimization  118
7.1 Optimality Conditions  118
7.2 Euclidean Projection  120
7.3 The Projected Gradient Algorithm  122
7.3.1 General Case: A Short-Step Approach  123
7.3.2 General Case: Backtracking  124
7.3.3 Smooth Strongly Convex Case  125
7.3.4 Momentum Variants  126
7.3.5 Alternative Search Directions  126
7.4 The Conditional Gradient (Frank-Wolfe) Method  127
8 Nonsmooth Functions and Subgradients  132
8.1 Subgradients and Subdifferentials  134
8.2 The Subdifferential and Directional Derivatives  137
8.3 Calculus of Subdifferentials  141
8.4 Convex Sets and Convex Constrained Optimization  144
8.5 Optimality Conditions for Composite Nonsmooth Functions  146
8.6 Proximal Operators and the Moreau Envelope  148
9 Nonsmooth Optimization Methods  153
9.1  155
9.2 The Subgradient Method  156
9.2.1  158
9.3 Proximal-Gradient Algorithms for Regularized Optimization  160
9.3.1 Convergence Rate for Convex f  162
9.4 Proximal Coordinate Descent for Structured Nonsmooth Functions  164
9.5 Proximal Point Method  167
10 Duality and Algorithms  170
10.1 Quadratic Penalty Function  170
10.2 Lagrangians and Duality  172
10.3 First-Order Optimality Conditions  174
10.4 Strong Duality  178
10.5 Dual Algorithms  179
10.5.1 Dual Subgradient  179
10.5.2 Augmented Lagrangian Method  180
10.5.3 Alternating Direction Method of Multipliers  181
10.6 Some Applications of Dual Algorithms  182
10.6.1 Consensus Optimization  182
10.6.2 Utility Maximization  184
10.6.3 Linear and Quadratic Programming  185
11 Differentiation and Adjoints  188
11.1 The Chain Rule for a Nested Composition of Vector Functions  188
11.2 The Method of Adjoints  190
11.3 Adjoints in Deep Learning  191
11.4 Automatic Differentiation  192
11.5 Derivations via the Lagrangian and Implicit Function Theorem  195
11.5.1 A Constrained Optimization Formulation of the Progressive Function  195
11.5.2 A General Perspective on Unconstrained and Constrained Formulations  197
11.5.3 Extension: Control  197
Appendix  200
A.1 Definitions and Basic Concepts  200
A.2 Convergence Rates and Iteration Complexity  203
A.3 Algorithm 3.1 Is an Effective Line-Search Technique  204
A.4 Linear Programming Duality, Theorems of the Alternative  205
A.5 Limiting Feasible Directions  208
A.6  209
A.7 Bounds for Degenerate Quadratic Functions  213
Bibliography  216
Index  223