Preface vii

1 Introduction 1

2 Statistical Learning 15
2.1 What Is Statistical Learning? 15
2.1.1 Why Estimate f? 17
2.1.2 How Do We Estimate f? 21
2.1.3 The Trade-Off Between Prediction Accuracy and Model Interpretability 24
2.1.4 Supervised Versus Unsupervised Learning 26
2.1.5 Regression Versus Classification Problems 28
2.2 Assessing Model Accuracy 29
2.2.1 Measuring the Quality of Fit 29
2.2.2 The Bias-Variance Trade-Off 33
2.2.3 The Classification Setting 37
2.3 Lab: Introduction to R 42
2.3.1 Basic Commands 43
2.3.2 Graphics 45
2.3.3 Indexing Data 47
2.3.4 Loading Data 48
2.3.5 Additional Graphical and Numerical Summaries 50
2.4 Exercises 52

3 Linear Regression 59
3.1 Simple Linear Regression 60
3.1.1 Estimating the Coefficients 61
3.1.2 Assessing the Accuracy of the Coefficient Estimates 63
3.1.3 Assessing the Accuracy of the Model 68
3.2 Multiple Linear Regression 71
3.2.1 Estimating the Regression Coefficients 72
3.2.2 Some Important Questions 75
3.3 Other Considerations in the Regression Model 83
3.3.1 Qualitative Predictors 83
3.3.2 Extensions of the Linear Model 87
3.3.3 Potential Problems 92
3.4 The Marketing Plan 103
3.5 Comparison of Linear Regression with K-Nearest Neighbors 105
3.6 Lab: Linear Regression 110
3.6.1 Libraries 110
3.6.2 Simple Linear Regression 111
3.6.3 Multiple Linear Regression 114
3.6.4 Interaction Terms 116
3.6.5 Non-linear Transformations of the Predictors 116
3.6.6 Qualitative Predictors 119
3.6.7 Writing Functions 120
3.7 Exercises 121

4 Classification 129
4.1 An Overview of Classification 130
4.2 Why Not Linear Regression? 131
4.3 Logistic Regression 133
4.3.1 The Logistic Model 133
4.3.2 Estimating the Regression Coefficients 135
4.3.3 Making Predictions 136
4.3.4 Multiple Logistic Regression 137
4.3.5 Multinomial Logistic Regression 140
4.4 Generative Models for Classification 141
4.4.1 Linear Discriminant Analysis for p = 1 142
4.4.2 Linear Discriminant Analysis for p > 1 145
4.4.3 Quadratic Discriminant Analysis 152
4.4.4 Naive Bayes 153
4.5 A Comparison of Classification Methods 158
4.5.1 An Analytical Comparison 158
4.5.2 An Empirical Comparison 161
4.6 Generalized Linear Models 164
4.6.1 Linear Regression on the Bikeshare Data 164
4.6.2 Poisson Regression on the Bikeshare Data 167
4.6.3 Generalized Linear Models in Greater Generality 170
4.7 Lab: Classification Methods 171
4.7.1 The Stock Market Data 171
4.7.2 Logistic Regression 172
4.7.3 Linear Discriminant Analysis 177
4.7.4 Quadratic Discriminant Analysis 179
4.7.5 Naive Bayes 180
4.7.6 K-Nearest Neighbors 181
4.7.7 Poisson Regression 185
4.8 Exercises 189

5 Resampling Methods 197
5.1 Cross-Validation 198
5.1.1 The Validation Set Approach 198
5.1.2 Leave-One-Out Cross-Validation 200
5.1.3 k-Fold Cross-Validation 203
5.1.4 Bias-Variance Trade-Off for k-Fold Cross-Validation 205
5.1.5 Cross-Validation on Classification Problems 206
5.2 The Bootstrap 209
5.3 Lab: Cross-Validation and the Bootstrap 212
5.3.1 The Validation Set Approach 213
5.3.2 Leave-One-Out Cross-Validation 214
5.3.3 k-Fold Cross-Validation 215
5.3.4 The Bootstrap 216
5.4 Exercises 219

6 Linear Model Selection and Regularization 225
6.1 Subset Selection 227
6.1.1 Best Subset Selection 227
6.1.2 Stepwise Selection 229
6.1.3 Choosing the Optimal Model 232
6.2 Shrinkage Methods 237
6.2.1 Ridge Regression 237
6.2.2 The Lasso 241
6.2.3 Selecting the Tuning Parameter 250
6.3 Dimension Reduction Methods 251
6.3.1 Principal Components Regression 252
6.3.2 Partial Least Squares 259
6.4 Considerations in High Dimensions 261
6.4.1 High-Dimensional Data 261
6.4.2 What Goes Wrong in High Dimensions? 262
6.4.3 Regression in High Dimensions 264
6.4.4 Interpreting Results in High Dimensions 266
6.5 Lab: Linear Models and Regularization Methods 267
6.5.1 Subset Selection Methods 267
6.5.2 Ridge Regression and the Lasso 274
6.5.3 PCR and PLS Regression 279
6.6 Exercises 282

7 Moving Beyond Linearity 289
7.1 Polynomial Regression 290
7.2 Step Functions 292
7.3 Basis Functions 294
7.4 Regression Splines 295
7.4.1 Piecewise Polynomials 295
7.4.2 Constraints and Splines 295
7.4.3 The Spline Basis Representation 297
7.4.4 Choosing the Number and Locations of the Knots 298
7.4.5 Comparison to Polynomial Regression 300
7.5 Smoothing Splines 301
7.5.1 An Overview of Smoothing Splines 301
7.5.2 Choosing the Smoothing Parameter λ 302
7.6 Local Regression 304
7.7 Generalized Additive Models 306
7.7.1 GAMs for Regression Problems 307
7.7.2 GAMs for Classification Problems 310
7.8 Lab: Non-linear Modeling 311
7.8.1 Polynomial Regression and Step Functions 312
7.8.2 Splines 317
7.8.3 GAMs 318
7.9 Exercises 321

8 Tree-Based Methods 327
8.1 The Basics of Decision Trees 327
8.1.1 Regression Trees 328
8.1.2 Classification Trees 335
8.1.3 Trees Versus Linear Models 338
8.1.4 Advantages and Disadvantages of Trees 339
8.2 Bagging, Random Forests, Boosting, and Bayesian Additive Regression Trees 340
8.2.1 Bagging 340
8.2.2 Random Forests 343
8.2.3 Boosting 345
8.2.4 Bayesian Additive Regression Trees 348
8.2.5 Summary of Tree Ensemble Methods 351
8.3 Lab: Decision Trees 353
8.3.1 Fitting Classification Trees 353
8.3.2 Fitting Regression Trees 356
8.3.3 Bagging and Random Forests 357
8.3.4 Boosting 359
8.3.5 Bayesian Additive Regression Trees 360
8.4 Exercises 361

9 Support Vector Machines 367
9.1 Maximal Margin Classifier 368
9.1.1 What Is a Hyperplane? 368
9.1.2 Classification Using a Separating Hyperplane 369
9.1.3 The Maximal Margin Classifier 371
9.1.4 Construction of the Maximal Margin Classifier 372
9.1.5 The Non-separable Case 373
9.2 Support Vector Classifiers 373
9.2.1 Overview of the Support Vector Classifier 373
9.2.2 Details of the Support Vector Classifier 375
9.3 Support Vector Machines 379
9.3.1 Classification with Non-Linear Decision Boundaries 379
9.3.2 The Support Vector Machine 380
9.3.3 An Application to the Heart Disease Data 383
9.4 SVMs with More than Two Classes 385
9.4.1 One-Versus-One Classification 385
9.4.2 One-Versus-All Classification 385
9.5 Relationship to Logistic Regression 386
9.6 Lab: Support Vector Machines 388
9.6.1 Support Vector Classifier 389
9.6.2 Support Vector Machine 392
9.6.3 ROC Curves 394
9.6.4 SVM with Multiple Classes 396
9.6.5 Application to Gene Expression Data 396
9.7 Exercises 398

10 Deep Learning 403
10.1 Single Layer Neural Networks 404
10.2 Multilayer Neural Networks 407
10.3 Convolutional Neural Networks 411
10.3.1 Convolution Layers 412
10.3.2 Pooling Layers 415
10.3.3 Architecture of a Convolutional Neural Network 415
10.3.4 Data Augmentation 417
10.3.5 Results Using a Pretrained Classifier 417
10.4 Document Classification 419
10.5 Recurrent Neural Networks 421
10.5.1 Sequential Models for Document Classification 424
10.5.2 Time Series Forecasting 427
10.5.3 Summary of RNNs 431
10.6 When to Use Deep Learning 432
10.7 Fitting a Neural Network 434
10.7.1 Backpropagation 435
10.7.2 Regularization and Stochastic Gradient Descent 436
10.7.3 Dropout Learning 438
10.7.4 Network Tuning 438
10.8 Interpolation and Double Descent 439
10.9 Lab: Deep Learning 443
10.9.1 A Single Layer Network on the Hitters Data 443
10.9.2 A Multilayer Network on the MNIST Digit Data 445
10.9.3 Convolutional Neural Networks 448
10.9.4 Using Pretrained CNN Models 451
10.9.5 IMDb Document Classification 452
10.9.6 Recurrent Neural Networks 454
10.10 Exercises 458

11 Survival Analysis and Censored Data 461
11.1 Survival and Censoring Times 462
11.2 A Closer Look at Censoring 463
11.3 The Kaplan-Meier Survival Curve 464
11.4 The Log-Rank Test 466
11.5 Regression Models With a Survival Response 469
11.5.1 The Hazard Function 469
11.5.2 Proportional Hazards 471
11.5.3 Example: Brain Cancer Data 475
11.5.4 Example: Publication Data 475
11.6 Shrinkage for the Cox Model 478
11.7 Additional Topics 480
11.7.1 Area Under the Curve for Survival Analysis 480
11.7.2 Choice of Time Scale 481
11.7.3 Time-Dependent Covariates 481
11.7.4 Checking the Proportional Hazards Assumption 482
11.7.5 Survival Trees 482
11.8 Lab: Survival Analysis 483
11.8.1 Brain Cancer Data 483
11.8.2 Publication Data 486
11.8.3 Call Center Data 487
11.9 Exercises 490

12 Unsupervised Learning 497
12.1 The Challenge of Unsupervised Learning 497
12.2 Principal Components Analysis 498
12.2.1 What Are Principal Components? 499
12.2.2 Another Interpretation of Principal Components 503
12.2.3 The Proportion of Variance Explained 505
12.2.4 More on PCA 507
12.2.5 Other Uses for Principal Components 510
12.3 Missing Values and Matrix Completion 510
12.4 Clustering Methods 516
12.4.1 K-Means Clustering 517
12.4.2 Hierarchical Clustering 521
12.4.3 Practical Issues in Clustering 530
12.5 Lab: Unsupervised Learning 532
12.5.1 Principal Components Analysis 532
12.5.2 Matrix Completion 535
12.5.3 Clustering 538
12.5.4 NCI60 Data Example 542
12.6 Exercises 548

13 Multiple Testing 553
13.1 A Quick Review of Hypothesis Testing 554
13.1.1 Testing a Hypothesis 555
13.1.2 Type I and Type II Errors 559
13.2 The Challenge of Multiple Testing 560
13.3 The Family-Wise Error Rate 561
13.3.1 What is the Family-Wise Error Rate? 562
13.3.2 Approaches to Control the Family-Wise Error Rate 564
13.3.3 Trade-Off Between the FWER and Power 570
13.4 The False Discovery Rate 571
13.4.1 Intuition for the False Discovery Rate 571
13.4.2 The Benjamini-Hochberg Procedure 573
13.5 A Re-Sampling Approach to p-Values and False Discovery Rates 575
13.5.1 A Re-Sampling Approach to the p-Value 576
13.5.2 A Re-Sampling Approach to the False Discovery Rate 578
13.5.3 When Are Re-Sampling Approaches Useful? 581
13.6 Lab: Multiple Testing 582
13.6.1 Review of Hypothesis Tests 582
13.6.2 The Family-Wise Error Rate 583
13.6.3 The False Discovery Rate 586
13.6.4 A Re-Sampling Approach 588
13.7 Exercises 591

Index 597