Preface  xv

1 Introduction  1
|
2 The Lasso for Linear Models  7
2.1 Introduction  7
2.2 The Lasso Estimator  8
2.3 Cross-Validation and Inference  13
2.4 Computation of the Lasso Solution  14
2.4.1 Single Predictor: Soft Thresholding  15
2.4.2 Multiple Predictors: Cyclic Coordinate Descent  15
2.4.3 Soft-Thresholding and Orthogonal Bases  17
2.5 Degrees of Freedom  17
2.6 Uniqueness of the Lasso Solutions  19
2.7 A Glimpse at the Theory  20
2.8 The Nonnegative Garrote  20
2.9 ℓq Penalties and Bayes Estimates  22
2.10 Some Perspective  23
Exercises  24
|
3 Generalized Linear Models  29
3.1 Introduction  29
3.2 Logistic Regression  31
3.2.1 Example: Document Classification  32
3.2.2 Algorithms  35
3.3 Multiclass Logistic Regression  36
3.3.1 Example: Handwritten Digits  37
3.3.2 Algorithms  39
3.3.3 Grouped-Lasso Multinomial  39
3.4 Log-Linear Models and the Poisson GLM  40
3.4.1 Example: Distribution Smoothing  40
3.5 Cox Proportional Hazards Models  42
3.5.1 Cross-Validation  43
3.5.2 Pre-Validation  45
3.6 Support Vector Machines  46
3.6.1 Logistic Regression with Separable Data  49
3.7 Computational Details and glmnet  50
Bibliographic Notes  52
Exercises  53
|
4 Generalizations of the Lasso Penalty  55
4.1 Introduction  55
4.2 The Elastic Net  56
4.3 The Group Lasso  58
4.3.1 Computation for the Group Lasso  62
4.3.2 Sparse Group Lasso  64
4.3.3 The Overlap Group Lasso  65
4.4 Sparse Additive Models and the Group Lasso  69
4.4.1 Additive Models and Backfitting  69
4.4.2 Sparse Additive Models and Backfitting  70
4.4.3 Approaches Using Optimization and the Group Lasso  72
4.4.4 Multiple Penalization for Sparse Additive Models  74
4.5 The Fused Lasso  76
4.5.1 Fitting the Fused Lasso  77
4.5.1.1 Reparametrization  78
4.5.1.2 A Path Algorithm  79
4.5.1.3 A Dual Path Algorithm  79
4.5.1.4 Dynamic Programming for the Fused Lasso  80
4.5.2 Trend Filtering  81
4.5.3 Nearly Isotonic Regression  83
4.6 Nonconvex Penalties  84
Bibliographic Notes  86
Exercises  88
|
5 Optimization Methods  95
5.1 Introduction  95
5.2 Convex Optimality Conditions  95
5.2.1 Optimality for Differentiable Problems  95
5.2.2 Nondifferentiable Functions and Subgradients  98
5.3 Gradient Descent  100
5.3.1 Unconstrained Gradient Descent  101
5.3.2 Projected Gradient Methods  102
5.3.3 Proximal Gradient Methods  103
5.3.4 Accelerated Gradient Methods  107
5.4 Coordinate Descent  109
5.4.1 Separability and Coordinate Descent  110
5.4.2 Linear Regression and the Lasso  112
5.4.3 Logistic Regression and Generalized Linear Models  115
5.5 A Simulation Study  117
5.6 Least Angle Regression  118
5.7 Alternating Direction Method of Multipliers  121
5.8 Minorization-Maximization Algorithms  123
5.9 Biconvexity and Alternating Minimization  124
5.10 Screening Rules  127
Bibliographic Notes  131
Appendix  132
Exercises  134
|
6 Statistical Inference  139
6.1 The Bayesian Lasso  139
6.2 The Bootstrap  142
6.3 Post-Selection Inference for the Lasso  147
6.3.1 The Covariance Test  147
6.3.2 A General Scheme for Post-Selection Inference  150
6.3.2.1 Fixed-λ Inference for the Lasso  154
6.3.2.2 The Spacing Test for LAR  156
6.3.3 What Hypothesis Is Being Tested?  157
6.3.4 Back to Forward Stepwise Regression  158
6.4 Inference via a Debiased Lasso  158
6.5 Other Proposals for Post-Selection Inference  160
Bibliographic Notes  161
Exercises  162
|
7 Matrix Decompositions, Approximations, and Completion  167
7.1 Introduction  167
7.2 The Singular Value Decomposition  169
7.3 Missing Data and Matrix Completion  169
7.3.1 The Netflix Movie Challenge  170
7.3.2 Matrix Completion Using Nuclear Norm  174
7.3.3 Theoretical Results for Matrix Completion  177
7.3.4 Maximum Margin Factorization and Related Methods  181
7.4 Reduced-Rank Regression  184
7.5 A General Matrix Regression Framework  185
7.6 Penalized Matrix Decomposition  187
7.7 Additive Matrix Decomposition  190
Bibliographic Notes  195
Exercises  196
|
8 Sparse Multivariate Methods  201
8.1 Introduction  201
8.2 Sparse Principal Components Analysis  202
8.2.1 Some Background  202
8.2.2 Sparse Principal Components  204
8.2.2.1 Sparsity from Maximum Variance  204
8.2.2.2 Methods Based on Reconstruction  206
8.2.3 Higher-Rank Solutions  207
8.2.3.1 Illustrative Application of Sparse PCA  209
8.2.4 Sparse PCA via Fantope Projection  210
8.2.5 Sparse Autoencoders and Deep Learning  210
8.2.6 Some Theory for Sparse PCA  212
8.3 Sparse Canonical Correlation Analysis  213
8.3.1 Example: Netflix Movie Rating Data  215
8.4 Sparse Linear Discriminant Analysis  217
8.4.1 Normal Theory and Bayes' Rule  217
8.4.2 Nearest Shrunken Centroids  218
8.4.3 Fisher's Linear Discriminant Analysis  221
8.4.3.1 Example: Simulated Data with Five Classes  222
8.4.4 Optimal Scoring  225
8.4.4.1 Example: Face Silhouettes  226
8.5 Sparse Clustering  227
8.5.1 Some Background on Clustering  227
8.5.1.1 Example: Simulated Data with Six Classes  228
8.5.2 Sparse Hierarchical Clustering  228
8.5.3 Sparse K-Means Clustering  230
8.5.4 Convex Clustering  231
Bibliographic Notes  232
Exercises  234
|
9 Graphs and Model Selection  241
9.1 Introduction  241
9.2 Basics of Graphical Models  241
9.2.1 Factorization and Markov Properties  241
9.2.1.1 Factorization Property  242
9.2.1.2 Markov Property  243
9.2.1.3 Equivalence of Factorization and Markov Properties  243
9.2.2 Some Examples  244
9.2.2.1 Discrete Graphical Models  244
9.2.2.2 Gaussian Graphical Models  245
9.3 Graph Selection via Penalized Likelihood  246
9.3.1 Global Likelihoods for Gaussian Models  247
9.3.2 Graphical Lasso Algorithm  248
9.3.3 Exploiting Block-Diagonal Structure  251
9.3.4 Theoretical Guarantees for the Graphical Lasso  252
9.3.5 Global Likelihood for Discrete Models  253
9.4 Graph Selection via Conditional Inference  254
9.4.1 Neighborhood-Based Likelihood for Gaussians  255
9.4.2 Neighborhood-Based Likelihood for Discrete Models  256
9.4.3 Pseudo-Likelihood for Mixed Models  259
9.5 Graphical Models with Hidden Variables  261
Bibliographic Notes  261
Exercises  263
|
10 Signal Approximation and Compressed Sensing  269
10.1 Introduction  269
10.2 Signals and Sparse Representations  269
10.2.1 Orthogonal Bases  269
10.2.2 Approximation in Orthogonal Bases  271
10.2.3 Reconstruction in Overcomplete Bases  274
10.3 Random Projection and Approximation  276
10.3.1 Johnson-Lindenstrauss Approximation  277
10.3.2 Compressed Sensing  278
10.4 Equivalence between ℓ0 and ℓ1 Recovery  280
10.4.1 Restricted Nullspace Property  281
10.4.2 Sufficient Conditions for Restricted Nullspace  282
10.4.3 Proofs  284
10.4.3.1 Proof of Theorem 10.1  284
10.4.3.2 Proof of Proposition 10.1  284
Bibliographic Notes  285
Exercises  286
|
11 Theoretical Results for the Lasso  289
11.1 Introduction  289
11.1.1 Types of Loss Functions  289
11.1.2 Types of Sparsity Models  290
11.2 Bounds on Lasso ℓ2-Error  291
11.2.1 Strong Convexity in the Classical Setting  291
11.2.2 Restricted Eigenvalues for Regression  293
11.2.3 A Basic Consistency Result  294
11.3 Bounds on Prediction Error  299
11.4 Support Recovery in Linear Regression  301
11.4.1 Variable-Selection Consistency for the Lasso  301
11.4.1.1 Some Numerical Studies  303
11.4.2 Proof of Theorem 11.3  305
11.5 Beyond the Basic Lasso  310
Bibliographic Notes  311
Exercises  312

Bibliography  315
Author Index  337
Index  343