Preface to the Second Edition    vii
Preface to the First Edition    xi

Introduction    1
|
Overview of Supervised Learning    9
    Introduction    9
    Variable Types and Terminology    9
    Two Simple Approaches to Prediction: Least Squares and Nearest Neighbors    11
        Linear Models and Least Squares    11
        Nearest-Neighbor Methods    14
        From Least Squares to Nearest Neighbors    16
    Statistical Decision Theory    18
    Local Methods in High Dimensions    22
    Statistical Models, Supervised Learning and Function Approximation    28
        A Statistical Model for the Joint Distribution Pr(X, Y)    28
        Supervised Learning    29
        Function Approximation    29
    Structured Regression Models    32
        Difficulty of the Problem    32
    Classes of Restricted Estimators    33
        Roughness Penalty and Bayesian Methods    34
        Kernel Methods and Local Regression    34
        Basis Functions and Dictionary Methods    35
    Model Selection and the Bias–Variance Tradeoff    37
    Bibliographic Notes    39
    Exercises    39
|
Linear Methods for Regression    43
    Introduction    43
    Linear Regression Models and Least Squares    44
        Example: Prostate Cancer    49
        The Gauss–Markov Theorem    51
        Multiple Regression from Simple Univariate Regression    52
        Multiple Outputs    56
    Subset Selection    57
        Best-Subset Selection    57
        Forward- and Backward-Stepwise Selection    58
        Forward-Stagewise Regression    60
        Prostate Cancer Data Example (Continued)    61
    Shrinkage Methods    61
        Ridge Regression    61
        The Lasso    68
        Discussion: Subset Selection, Ridge Regression and the Lasso    69
        Least Angle Regression    73
    Methods Using Derived Input Directions    79
        Principal Components Regression    79
        Partial Least Squares    80
    Discussion: A Comparison of the Selection and Shrinkage Methods    82
    Multiple Outcome Shrinkage and Selection    84
    More on the Lasso and Related Path Algorithms    86
        Incremental Forward Stagewise Regression    86
        Piecewise-Linear Path Algorithms    89
        The Dantzig Selector    89
        The Grouped Lasso    90
        Further Properties of the Lasso    91
        Pathwise Coordinate Optimization    92
    Computational Considerations    93
    Bibliographic Notes    94
    Exercises    94
|
Linear Methods for Classification    101
    Introduction    101
    Linear Regression of an Indicator Matrix    103
    Linear Discriminant Analysis    106
        Regularized Discriminant Analysis    112
        Computations for LDA    113
        Reduced-Rank Linear Discriminant Analysis    113
    Logistic Regression    119
        Fitting Logistic Regression Models    120
        Example: South African Heart Disease    122
        Quadratic Approximations and Inference    124
        L1 Regularized Logistic Regression    125
        Logistic Regression or LDA?    127
    Separating Hyperplanes    129
        Rosenblatt's Perceptron Learning Algorithm    130
        Optimal Separating Hyperplanes    132
    Bibliographic Notes    135
    Exercises    135
|
Basis Expansions and Regularization    139
    Introduction    139
    Piecewise Polynomials and Splines    141
        Natural Cubic Splines    144
        Example: South African Heart Disease (Continued)    146
        Example: Phoneme Recognition    148
    Filtering and Feature Extraction    150
    Smoothing Splines    151
        Degrees of Freedom and Smoother Matrices    153
    Automatic Selection of the Smoothing Parameters    156
        Fixing the Degrees of Freedom    158
        The Bias–Variance Tradeoff    158
    Nonparametric Logistic Regression    161
    Multidimensional Splines    162
    Regularization and Reproducing Kernel Hilbert Spaces    167
        Spaces of Functions Generated by Kernels    168
        Examples of RKHS    170
    Wavelet Smoothing    174
        Wavelet Bases and the Wavelet Transform    176
        Adaptive Wavelet Filtering    179
    Bibliographic Notes    181
    Exercises    181
    Appendix: Computational Considerations for Splines    186
        Appendix: B-splines    186
        Appendix: Computations for Smoothing Splines    189
|
|
Kernel Smoothing Methods    191
    One-Dimensional Kernel Smoothers    192
        Local Linear Regression    194
        Local Polynomial Regression    197
    Selecting the Width of the Kernel    198
    Local Regression in ℝ^p    200
    Structured Local Regression Models in ℝ^p    201
        Structured Kernels    203
        Structured Regression Functions    203
    Local Likelihood and Other Models    205
    Kernel Density Estimation and Classification    208
        Kernel Density Estimation    208
        Kernel Density Classification    210
        The Naive Bayes Classifier    210
    Radial Basis Functions and Kernels    212
    Mixture Models for Density Estimation and Classification    214
    Computational Considerations    216
    Bibliographic Notes    216
    Exercises    216
|
Model Assessment and Selection    219
    Introduction    219
    Bias, Variance and Model Complexity    219
    The Bias–Variance Decomposition    223
        Example: Bias–Variance Tradeoff    226
    Optimism of the Training Error Rate    228
    Estimates of In-Sample Prediction Error    230
    The Effective Number of Parameters    232
    The Bayesian Approach and BIC    233
    Minimum Description Length    235
    Vapnik–Chervonenkis Dimension    237
        Example (Continued)    239
    Cross-Validation    241
        K-Fold Cross-Validation    241
        The Wrong and Right Way to Do Cross-validation    245
        Does Cross-Validation Really Work?    247
    Bootstrap Methods    249
        Example (Continued)    252
    Conditional or Expected Test Error?    254
    Bibliographic Notes    257
    Exercises    257
|
Model Inference and Averaging    261
    Introduction    261
    The Bootstrap and Maximum Likelihood Methods    261
        A Smoothing Example    261
        Maximum Likelihood Inference    265
        Bootstrap versus Maximum Likelihood    267
    Bayesian Methods    267
    Relationship Between the Bootstrap and Bayesian Inference    271
    The EM Algorithm    272
        Two-Component Mixture Model    272
        The EM Algorithm in General    276
        EM as a Maximization–Maximization Procedure    277
    MCMC for Sampling from the Posterior    279
    Bagging    282
        Example: Trees with Simulated Data    283
    Model Averaging and Stacking    288
    Stochastic Search: Bumping    290
    Bibliographic Notes    292
    Exercises    293
|
Additive Models, Trees, and Related Methods    295
    Generalized Additive Models    295
        Fitting Additive Models    297
        Example: Additive Logistic Regression    299
        Summary    304
    Tree-Based Methods    305
        Background    305
        Regression Trees    307
        Classification Trees    308
        Other Issues    310
        Spam Example (Continued)    313
    PRIM: Bump Hunting    317
        Spam Example (Continued)    320
    MARS: Multivariate Adaptive Regression Splines    321
        Spam Example (Continued)    326
        Example (Simulated Data)    327
        Other Issues    328
    Hierarchical Mixtures of Experts    329
    Missing Data    332
    Computational Considerations    334
    Bibliographic Notes    334
    Exercises    335
|
Boosting and Additive Trees    337
    Boosting Methods    337
        Outline of This Chapter    340
    Boosting Fits an Additive Model    341
    Forward Stagewise Additive Modeling    342
    Exponential Loss and AdaBoost    343
    Why Exponential Loss?    345
    Loss Functions and Robustness    346
    "Off-the-Shelf" Procedures for Data Mining    350
    Example: Spam Data    352
    Boosting Trees    353
    Numerical Optimization via Gradient Boosting    358
        Steepest Descent    358
        Gradient Boosting    359
        Implementations of Gradient Boosting    360
    Right-Sized Trees for Boosting    361
    Regularization    364
        Shrinkage    364
        Subsampling    365
    Interpretation    367
        Relative Importance of Predictor Variables    367
        Partial Dependence Plots    369
    Illustrations    371
        California Housing    371
        New Zealand Fish    375
        Demographics Data    379
    Bibliographic Notes    380
    Exercises    384
|
|
Neural Networks    389
    Introduction    389
    Projection Pursuit Regression    389
    Neural Networks    392
    Fitting Neural Networks    395
    Some Issues in Training Neural Networks    397
        Starting Values    397
        Overfitting    398
        Scaling of the Inputs    398
        Number of Hidden Units and Layers    400
        Multiple Minima    400
    Example: Simulated Data    401
    Example: ZIP Code Data    404
    Discussion    408
    Bayesian Neural Nets and the NIPS 2003 Challenge    409
        Bayes, Boosting and Bagging    410
        Performance Comparisons    412
    Computational Considerations    414
    Bibliographic Notes    415
    Exercises    415
|
Support Vector Machines and Flexible Discriminants    417
    Introduction    417
    The Support Vector Classifier    417
        Computing the Support Vector Classifier    420
        Mixture Example (Continued)    421
    Support Vector Machines and Kernels    423
        Computing the SVM for Classification    423
        The SVM as a Penalization Method    426
        Function Estimation and Reproducing Kernels    428
        SVMs and the Curse of Dimensionality    431
        A Path Algorithm for the SVM Classifier    432
        Support Vector Machines for Regression    434
        Regression and Kernels    436
        Discussion    438
    Generalizing Linear Discriminant Analysis    438
    Flexible Discriminant Analysis    440
        Computing the FDA Estimates    444
    Penalized Discriminant Analysis    446
    Mixture Discriminant Analysis    449
        Example: Waveform Data    451
    Bibliographic Notes    455
    Exercises    455
|
Prototype Methods and Nearest-Neighbors    459
    Introduction    459
    Prototype Methods    459
        K-means Clustering    460
        Learning Vector Quantization    462
        Gaussian Mixtures    463
    k-Nearest-Neighbor Classifiers    463
        Example: A Comparative Study    468
        Example: k-Nearest-Neighbors and Image Scene Classification    470
        Invariant Metrics and Tangent Distance    471
    Adaptive Nearest-Neighbor Methods    475
        Example    478
        Global Dimension Reduction for Nearest-Neighbors    479
    Computational Considerations    480
    Bibliographic Notes    481
    Exercises    481
|
|
Unsupervised Learning    485
    Introduction    485
    Association Rules    487
        Market Basket Analysis    488
        The Apriori Algorithm    489
        Example: Market Basket Analysis    492
        Unsupervised as Supervised Learning    495
        Generalized Association Rules    497
        Choice of Supervised Learning Method    499
        Example: Market Basket Analysis (Continued)    499
    Cluster Analysis    501
        Proximity Matrices    503
        Dissimilarities Based on Attributes    503
        Object Dissimilarity    505
        Clustering Algorithms    507
        Combinatorial Algorithms    507
        K-means    509
        Gaussian Mixtures as Soft K-means Clustering    510
        Example: Human Tumor Microarray Data    512
        Vector Quantization    514
        K-medoids    515
        Practical Issues    518
        Hierarchical Clustering    520
    Self-Organizing Maps    528
    Principal Components, Curves and Surfaces    534
        Principal Components    534
        Principal Curves and Surfaces    541
        Spectral Clustering    544
        Kernel Principal Components    547
        Sparse Principal Components    550
    Non-negative Matrix Factorization    553
        Archetypal Analysis    554
    Independent Component Analysis and Exploratory Projection Pursuit    557
        Latent Variables and Factor Analysis    558
        Independent Component Analysis    560
        Exploratory Projection Pursuit    565
        A Direct Approach to ICA    565
    Multidimensional Scaling    570
    Nonlinear Dimension Reduction and Local Multidimensional Scaling    572
    The Google PageRank Algorithm    576
    Bibliographic Notes    578
    Exercises    579
|
|
Random Forests    587
    Introduction    587
    Definition of Random Forests    587
    Details of Random Forests    592
        Out of Bag Samples    592
        Variable Importance    593
        Proximity Plots    595
        Random Forests and Overfitting    596
    Analysis of Random Forests    597
        Variance and the De-Correlation Effect    597
        Bias    600
        Adaptive Nearest Neighbors    601
    Bibliographic Notes    602
    Exercises    603
|
|
Ensemble Learning    605
    Introduction    605
    Boosting and Regularization Paths    607
        Penalized Regression    607
        The "Bet on Sparsity" Principle    610
        Regularization Paths, Over-fitting and Margins    613
    Learning Ensembles    616
        Learning a Good Ensemble    617
        Rule Ensembles    622
    Bibliographic Notes    623
    Exercises    624
|
Undirected Graphical Models    625
    Introduction    625
    Markov Graphs and Their Properties    627
    Undirected Graphical Models for Continuous Variables    630
        Estimation of the Parameters when the Graph Structure is Known    631
        Estimation of the Graph Structure    635
    Undirected Graphical Models for Discrete Variables    638
        Estimation of the Parameters when the Graph Structure is Known    639
        Hidden Nodes    641
        Estimation of the Graph Structure    642
        Restricted Boltzmann Machines    643
    Exercises    645
|
High-Dimensional Problems: p ≫ N    649
    When p is Much Bigger than N    649
    Diagonal Linear Discriminant Analysis and Nearest Shrunken Centroids    651
    Linear Classifiers with Quadratic Regularization    654
        Regularized Discriminant Analysis    656
        Logistic Regression with Quadratic Regularization    657
        The Support Vector Classifier    657
        Feature Selection    658
        Computational Shortcuts When p ≫ N    659
    Linear Classifiers with L1 Regularization    661
        Application of Lasso to Protein Mass Spectroscopy    664
        The Fused Lasso for Functional Data    666
    Classification When Features are Unavailable    668
        Example: String Kernels and Protein Classification    668
        Classification and Other Models Using Inner-Product Kernels and Pairwise Distances    670
        Example: Abstracts Classification    672
    High-Dimensional Regression: Supervised Principal Components    674
        Connection to Latent-Variable Modeling    678
        Relationship with Partial Least Squares    680
        Pre-Conditioning for Feature Selection    681
    Feature Assessment and the Multiple-Testing Problem    683
        The False Discovery Rate    687
        Asymmetric Cutpoints and the SAM Procedure    690
        A Bayesian Interpretation of the FDR    692
    Bibliographic Notes    693
    Exercises    694
References    699
Author Index    729
Index    737