Section | Page | Page count
Preface | xiii |
1 Introduction | 1 | 36
1.1 Select topics in statistical and machine learning | 2 | 8
1.1.1 Statistical jargon and conventions | 3 | 1
1.1.2 Supervised learning | 4 | 1
| 5 | 1
| 6 | 1
1.1.2.3 Classification vs. regression | 7 | 1
1.1.2.4 Discrimination vs. prediction | 7 | 1
1.1.2.5 The bias-variance tradeoff | 8 | 2
1.1.3 Unsupervised learning | 10 | 1
| 10 | 7
1.2.1 A brief history of decision trees | 12 | 2
1.2.2 The anatomy of a simple decision tree | 14 | 1
1.2.2.1 Example: survival on the Titanic | 15 | 2
| 17 | 3
| 17 | 2
1.3.2 Software information and conventions | 19 | 1
1.4 Some example data sets | 20 | 15
| 21 | 1
1.4.2 New York air quality measurements | 21 | 2
1.4.3 The Friedman 1 benchmark problem | 23 | 1
| 24 | 1
| 25 | 3
| 28 | 1
1.4.7 Predicting home prices in Ames, Iowa | 29 | 1
1.4.8 Wine quality ratings | 30 | 1
1.4.9 Mayo Clinic primary biliary cholangitis study | 31 | 4
1.5 There ain't no such thing as a free lunch | 35 | 1
| 35 | 2
| 37 | 140
2 Binary recursive partitioning with CART | 39 | 72
| 39 | 2
| 41 | 17
2.2.1 Splits on ordered variables | 43 | 4
2.2.1.1 So which is it in practice, Gini or entropy? | 47 | 1
2.2.2 Example: Swiss banknotes | 48 | 3
2.2.3 Fitted values and predictions | 51 | 1
2.2.4 Class priors and misclassification costs | 52 | 2
| 54 | 1
2.2.4.2 Example: employee attrition | 55 | 3
| 58 | 4
2.3.1 Example: New York air quality measurements | 59 | 3
| 62 | 7
2.4.1 Example: mushroom edibility | 64 | 3
2.4.2 Be wary of categoricals with high cardinality | 67 | 1
2.4.3 To encode, or not to encode? | 68 | 1
2.5 Building a decision tree | 69 | 9
2.5.1 Cost-complexity pruning | 71 | 3
2.5.1.1 Example: mushroom edibility | 74 | 3
| 77 | 1
| 78 | 1
2.6 Hyperparameters and tuning | 78 | 1
2.7 Missing data and surrogate splits | 78 | 4
2.7.1 Other missing value strategies | 80 | 2
| 82 | 1
2.9 Software and examples | 83 | 22
2.9.1 Example: Swiss banknotes | 84 | 4
2.9.2 Example: mushroom edibility | 88 | 8
2.9.3 Example: predicting home prices | 96 | 4
2.9.4 Example: employee attrition | 100 | 3
2.9.5 Example: letter image recognition | 103 | 2
| 105 | 3
2.10.1 Advantages of CART | 105 | 1
2.10.2 Disadvantages of CART | 106 | 2
| 108 | 3
3 Conditional inference trees | 111 | 36
| 111 | 1
3.2 Early attempts at unbiased recursive partitioning | 112 | 2
3.3 A quick digression into conditional inference | 114 | 7
3.3.1 Example: X and Y are both univariate continuous | 117 | 1
3.3.2 Example: X and Y are both nominal categorical | 118 | 2
3.3.3 Which test statistic should you use? | 120 | 1
3.4 Conditional inference trees | 121 | 11
3.4.1 Selecting the splitting variable | 121 | 2
3.4.1.1 Example: New York air quality measurements | 123 | 1
3.4.1.2 Example: Swiss banknotes | 124 | 1
3.4.2 Finding the optimal split point | 125 | 1
3.4.2.1 Example: New York air quality measurements | 126 | 2
| 128 | 1
| 128 | 1
3.4.5 Choice of α, g(), and h() | 128 | 3
3.4.6 Fitted values and predictions | 131 | 1
| 131 | 1
3.4.8 Variable importance | 132 | 1
3.5 Software and examples | 132 | 11
3.5.1 Example: New York air quality measurements | 133 | 4
3.5.2 Example: wine quality ratings | 137 | 3
3.5.3 Example: Mayo Clinic liver transplant data | 140 | 3
| 143 | 4
4 The hitchhiker's GUIDE to modern decision trees | 147 | 30
| 148 | 2
4.2 A GUIDE for regression | 150 | 7
4.2.1 Piecewise constant models | 150 | 2
4.2.1.1 Example: New York air quality measurements | 152 | 1
| 153 | 1
| 154 | 1
4.2.3.1 Example: predicting home prices | 155 | 2
4.2.3.2 Bootstrap bias correction | 157 | 1
4.3 A GUIDE for classification | 157 | 5
4.3.1 Linear/oblique splits | 157 | 1
4.3.1.1 Example: classifying the Palmer penguins | 158 | 3
4.3.2 Priors and misclassification costs | 161 | 1
| 161 | 1
4.3.3.1 Kernel-based and k-nearest neighbor fits | 162 | 1
| 162 | 1
| 163 | 1
4.6 Fitted values and predictions | 163 | 1
| 163 | 1
| 164 | 1
4.9 Software and examples | 165 | 7
4.9.1 Example: credit card default | 165 | 7
| 172 | 5
| 177 | 182
5 Ensemble algorithms | 179 | 24
5.1 Bootstrap aggregating (bagging) | 181 | 7
5.1.1 When does bagging work? | 184 | 1
5.1.2 Bagging from scratch: classifying email spam | 184 | 3
5.1.3 Sampling without replacement | 187 | 1
5.1.4 Hyperparameters and tuning | 187 | 1
| 188 | 1
| 188 | 7
5.2.1 AdaBoost.M1 for binary outcomes | 189 | 1
5.2.2 Boosting from scratch: classifying email spam | 190 | 2
| 192 | 1
5.2.4 Forward stagewise additive modeling and exponential loss | 192 | 2
| 194 | 1
5.3 Bagging or boosting: which should you use? | 195 | 1
| 195 | 1
5.5 Importance sampled learning ensembles | 196 | 6
5.5.1 Example: post-processing a bagged tree ensemble | 197 | 5
| 202 | 1
6 Peeking inside the "black box": post-hoc interpretability | 203 | 26
| 204 | 4
6.1.1 Permutation importance | 204 | 2
| 206 | 1
6.1.3 Example: predicting home prices | 206 | 2
| 208 | 9
| 208 | 1
6.2.1.1 Classification problems | 209 | 1
6.2.2 Interaction effects | 210 | 1
6.2.3 Individual conditional expectations | 210 | 1
| 211 | 1
6.2.5 Example: predicting home prices | 211 | 4
6.2.6 Example: Edgar Anderson's iris data | 215 | 2
6.3 Feature contributions | 217 | 8
| 217 | 2
6.3.2 Explaining predictions with Shapley values | 219 | 1
| 220 | 1
6.3.2.2 Monte Carlo-based Shapley explanations | 221 | 2
| 223 | 1
6.3.4 Example: predicting home prices | 223 | 2
6.4 Drawbacks of existing methods | 225 | 1
| 226 | 3
7 Random forests | 229 | 80
| 229 | 1
7.2 The random forest algorithm | 229 | 10
7.2.1 Voting and probability estimation | 232 | 2
7.2.1.1 Example: Mease model simulation | 234 | 2
7.2.2 Subsampling (without replacement) | 236 | 1
7.2.3 Random forest from scratch: predicting home prices | 237 | 2
7.3 Out-of-bag (OOB) data | 239 | 4
7.4 Hyperparameters and tuning | 243 | 2
| 245 | 4
7.5.1 Impurity-based importance | 245 | 2
7.5.2 OOB-based permutation importance | 247 | 1
7.5.2.1 Holdout permutation importance | 248 | 1
7.5.2.2 Conditional permutation importance | 249 | 1
| 249 | 7
7.6.1 Detecting anomalies and outliers | 251 | 1
7.6.1.1 Example: Swiss banknotes | 251 | 1
7.6.2 Missing value imputation | 252 | 1
7.6.3 Unsupervised random forests | 253 | 1
7.6.3.1 Example: Swiss banknotes | 254 | 1
7.6.4 Case-specific random forests | 254 | 2
7.7 Prediction standard errors | 256 | 2
7.7.1 Example: predicting email spam | 257 | 1
7.8 Random forest extensions | 258 | 18
7.8.1 Oblique random forests | 258 | 1
7.8.2 Quantile regression forests | 259 | 1
7.8.2.1 Example: predicting home prices (with prediction intervals) | 260 | 1
7.8.3 Rotation forests and random rotation forests | 261 | 2
7.8.3.1 Random rotation forests | 263 | 1
7.8.3.2 Example: Gaussian mixture data | 264 | 3
7.8.4 Extremely randomized trees | 267 | 2
7.8.5 Anomaly detection with isolation forests | 269 | 2
7.8.5.1 Extended isolation forests | 271 | 1
7.8.5.2 Example: detecting credit card fraud | 271 | 5
7.9 Software and examples | 276 | 30
7.9.1 Example: mushroom edibility | 277 | 1
7.9.2 Example: "deforesting" a random forest | 277 | 6
7.9.3 Example: survival on the Titanic | 283 | 1
7.9.3.1 Missing value imputation | 284 | 3
7.9.3.2 Analyzing the imputed data sets | 287 | 7
7.9.4 Example: class imbalance (the good, the bad, and the ugly) | 294 | 6
7.9.5 Example: partial dependence with Spark MLlib | 300 | 6
| 306 | 3
8 Gradient boosting machines | 309 | 50
8.1 Steepest descent (a brief overview) | 310 | 1
8.2 Gradient tree boosting | 311 | 6
| 314 | 3
8.2.0.2 Always a regression tree? | 317 | 1
8.2.0.3 Priors and misclassification cost | 317 | 1
8.3 Hyperparameters and tuning | 317 | 5
8.3.1 Boosting-specific hyperparameters | 318 | 1
8.3.1.1 The number of trees in the ensemble: B | 318 | 1
8.3.1.2 Regularization and shrinkage | 319 | 1
8.3.1.3 Example: predicting ALS progression | 320 | 1
8.3.2 Tree-specific hyperparameters | 321 | 1
8.3.3 A simple tuning strategy | 322 | 1
8.4 Stochastic gradient boosting | 322 | 1
| 323 | 1
8.5 Gradient tree boosting from scratch | 323 | 4
8.5.1 Example: predicting home prices | 326 | 1
| 327 | 5
8.6.1 Faster partial dependence with the recursion method | 328 | 1
8.6.1.1 Example: predicting email spam | 329 | 1
8.6.2 Monotonic constraints | 329 | 1
8.6.2.1 Example: bank marketing data | 330 | 2
| 332 | 3
8.7.1 Level-wise vs. leaf-wise tree induction | 332 | 1
| 333 | 1
8.7.3 Explainable boosting machines | 333 | 1
8.7.4 Probabilistic regression via natural gradient boosting | 334 | 1
8.8 Specialized implementations | 335 | 4
8.8.1 Extreme Gradient Boosting: XGBoost | 335 | 2
8.8.2 Light Gradient Boosting Machine: LightGBM | 337 | 1
| 338 | 1
8.9 Software and examples | 339 | 17
8.9.1 Example: Mayo Clinic liver transplant data | 339 | 7
8.9.2 Example: probabilistic predictions with NGBoost (in Python) | 346 | 1
8.9.3 Example: post-processing GBMs with the LASSO | 347 | 4
8.9.4 Example: direct marketing campaigns with XGBoost | 351 | 5
| 356 | 3
Bibliography | 359 | 22
Index | 381 |