Foreword  xxi
Preface  xxiii
About the Author  xxvii

I  First Steps  1

1  Let's Discuss Learning  3
1.2  Scope, Terminology, Prediction, and Data  4
1.2.2  Target Values and Predictions  6
1.3  Putting the Machine in Machine Learning  7
1.4  Examples of Learning Systems  9
1.4.1  Predicting Categories: Examples of Classifiers  9
1.4.2  Predicting Values: Examples of Regressors  10
1.5  Evaluating Learning Systems  11
1.5.2  Resource Consumption  12
1.6  A Process for Building Learning Systems  13
1.7  Assumptions and Reality of Learning  15
1.8  End-of-Chapter Material  17

2  Some Technical Background  19
2.2  The Need for Mathematical Language  19
2.3  Our Software for Tackling Machine Learning  20
2.4  Probability  21
2.4.3  Conditional Probability  24
2.5  Linear Combinations, Weighted Sums, and Dot Products  28
2.5.3  Sum of Squared Errors  33
2.6  A Geometric View: Points in Space  34
2.7  Notation and the Plus-One Trick  43
2.8  Getting Groovy, Breaking the Straight-Jacket, and Nonlinearity  45
2.9  NumPy versus "All the Maths"  47
2.9.1  Back to 1D versus 2D  49
2.10  Floating-Point Issues  52

3  Predicting Categories: Getting Started with Classification  55
3.2  A Simple Classification Dataset  56
3.3  Training and Testing: Don't Teach to the Test  59
3.4  Evaluation: Grading the Exam  62
3.5  Simple Classifier #1: Nearest Neighbors, Long Distance Relationships, and Assumptions  63
3.5.1  Defining Similarity  63
3.5.4  k-NN, Parameters, and Nonparametric Methods  65
3.5.5  Building a k-NN Classification Model  66
3.6  Simple Classifier #2: Naive Bayes, Probability, and Broken Promises  68
3.7  Simplistic Evaluation of Classifiers  70
3.7.1  Learning Performance  70
3.7.2  Resource Utilization in Classification  71
3.7.3  Stand-Alone Resource Evaluation  77
3.8.1  Sophomore Warning: Limitations and Open Issues  81

4  Predicting Numerical Values: Getting Started with Regression  85
4.1  A Simple Regression Dataset  85
4.2  Nearest-Neighbors Regression and Summary Statistics  87
4.2.1  Measures of Center: Median and Mean  88
4.2.2  Building a k-NN Regression Model  90
4.3  Linear Regression and Errors  91
4.3.1  No Flat Earth: Why We Need Slope  92
4.3.3  Performing Linear Regression  97
4.4  Optimization: Picking the Best Answer  98
4.4.4  Calculated Shortcuts  100
4.4.5  Application to Linear Regression  101
4.5  Simple Evaluation and Comparison of Regressors  101
4.5.1  Root Mean Squared Error  101
4.5.2  Learning Performance  102
4.5.3  Resource Utilization in Regression  102
4.6.1  Limitations and Open Issues  104

II  Evaluation  107

5  Evaluating and Comparing Learners  109
5.1  Evaluation and Why Less Is More  109
5.2  Terminology for Learning Phases  110
5.2.1  Back to the Machines  110
5.2.2  More Technically Speaking  113
5.3  Major Tom, There's Something Wrong: Overfitting and Underfitting  116
5.3.1  Synthetic Data and Linear Regression  117
5.3.2  Manually Manipulating Model Complexity  118
5.3.3  Goldilocks: Visualizing Overfitting, Underfitting, and "Just Right"  120
5.3.5  Take-Home Notes on Overfitting  124
5.5  (Re)Sampling: Making More from Less  128
5.5.3  Repeated Train-Test Splits  133
5.5.4  A Better Way and Shuffling  137
5.5.5  Leave-One-Out Cross-Validation  140
5.6  Break-It-Down: Deconstructing Error into Bias and Variance  142
5.6.1  Variance of the Data  143
5.6.2  Variance of the Model  144
5.6.5  Examples of Bias-Variance Tradeoffs  145
5.7  Graphical Evaluation and Comparison  149
5.7.1  Learning Curves: How Much Data Do We Need?  150
5.8  Comparing Learners with Cross-Validation  154

6  Evaluating Classifiers  159
6.2  Beyond Accuracy: Metrics for Classification  161
6.2.1  Eliminating Confusion from the Confusion Matrix  163
6.2.2  Ways of Being Wrong  164
6.2.3  Metrics from the Confusion Matrix  165
6.2.4  Coding the Confusion Matrix  166
6.2.5  Dealing with Multiple Classes: Multiclass Averaging  168
6.3  ROC Curves  170
6.3.1  Patterns in the ROC  173
6.3.3  AUC: Area-Under-the-(ROC)-Curve  177
6.3.4  Multiclass Learners, One-versus-Rest, and ROC  179
6.4  Another Take on Multiclass: One-versus-One  181
6.4.1  Multiclass AUC Part Two: The Quest for a Single Value  182
6.5  Precision-Recall Curves  185
6.5.1  A Note on Precision-Recall Tradeoff  185
6.5.2  Constructing a Precision-Recall Curve  186
6.6  Cumulative Response and Lift Curves  187
6.7  More Sophisticated Evaluation of Classifiers: Take Two  190
6.7.2  A Novel Multiclass Problem  195

7  Evaluating Regressors  205
7.2  Additional Measures for Regression  207
7.2.1  Creating Our Own Evaluation Metric  207
7.2.2  Other Built-in Regression Metrics  208
7.4  A First Look at Standardization  221
7.5  Evaluating Regressors in a More Sophisticated Way: Take Two  225
7.5.1  Cross-Validated Results on Multiple Metrics  226
7.5.2  Summarizing Cross-Validated Results  230

III  More Methods and Fundamentals  235

8  More Classification Methods  237
8.1  Revisiting Classification  237
8.2  Decision Trees  239
8.2.1  Tree-Building Algorithms  242
8.2.2  Let's Go: Decision Tree Time  245
8.2.3  Bias and Variance in Decision Trees  249
8.3  Support Vector Classifiers  249
8.3.2  Bias and Variance in SVCs  256
8.4  Logistic Regression  259
8.4.2  Probabilities, Odds, and Log-Odds  262
8.4.3  Just Do It: Logistic Regression Edition  267
8.4.4  A Logistic Regression: A Space Oddity  268
8.5  Discriminant Analysis  269
8.6  Assumptions, Biases, and Classifiers  285
8.7  Comparison of Classifiers: Take Three  287

9  More Regression Methods  295
9.1  Linear Regression in the Penalty Box: Regularization  295
9.1.1  Performing Regularized Regression  300
9.2  Support Vector Regression  301
9.2.2  From Linear Regression to Regularized Regression to Support Vector Regression  305
9.2.3  Just Do It-SVR Style  307
9.3  Piecewise Constant Regression  308
9.3.1  Implementing a Piecewise Constant Regressor  310
9.3.2  General Notes on Implementing Models  311
9.4  Regression Trees  313
9.4.1  Performing Regression with Trees  313
9.5  Comparison of Regressors: Take Three  314

10  Manual Feature Engineering: Manipulating Data for Fun and Profit  321
10.1  Feature Engineering Terminology and Motivation  321
10.1.1  Why Engineer Features?  322
10.1.2  When Does Engineering Happen?  323
10.1.3  How Does Feature Engineering Occur?  324
10.2  Feature Selection and Data Reduction: Taking out the Trash  324
10.5  Categorical Coding  332
10.5.1  Another Way to Code and the Curious Case of the Missing Intercept  334
10.6  Relationships and Interactions  341
10.6.1  Manual Feature Construction  341
10.6.3  Adding Features with Transformers  348
10.7  Target Manipulations  350
10.7.1  Manipulating the Input Space  351
10.7.2  Manipulating the Target  353

11  Tuning Hyperparameters and Pipelines  359
11.1  Models, Parameters, Hyperparameters  360
11.2  Tuning Hyperparameters  362
11.2.1  A Note on Computer Science and Learning Terminology  362
11.2.2  An Example of Complete Search  362
11.2.3  Using Randomness to Search for a Needle in a Haystack  368
11.3  Down the Recursive Rabbit Hole: Nested Cross-Validation  370
11.3.1  Cross-Validation, Redux  370
11.3.2  GridSearch as a Model  371
11.3.3  Cross-Validation Nested within Cross-Validation  372
11.3.4  Comments on Nested CV  375
11.4  Pipelines  377
11.4.1  A Simple Pipeline  378
11.4.2  A More Complex Pipeline  379
11.5  Pipelines and Tuning Together  380

IV  Adding Complexity  385

12  Combining Learners  387
12.3  Bagging and Random Forests  390
12.3.1  Bootstrapping  390
12.3.2  From Bootstrapping to Bagging  394
12.3.3  Through the Random Forest  396
12.4  Boosting  398
12.5  Comparing the Tree-Ensemble Methods  401

13  Models That Engineer Features for Us  409
13.1  Feature Selection  411
13.1.1  Single-Step Filtering with Metric-Based Feature Selection  412
13.1.2  Model-Based Feature Selection  423
13.1.3  Integrating Feature Selection with a Learning Pipeline  426
13.2  Feature Construction with Kernels  428
13.2.1  A Kernel Motivator  428
13.2.2  Manual Kernel Methods  433
13.2.3  Kernel Methods and Kernel Options  438
13.2.4  Kernelized SVCs: SVMs  442
13.2.5  Take-Home Notes on SVM and an Example  443
13.3  Principal Components Analysis: An Unsupervised Technique  445
13.3.1  A Warm Up: Centering  445
13.3.2  Finding a Different Best Line  448
13.3.4  Under the Hood of PCA  452
13.3.5  A Finale: Comments on General PCA  457
13.3.6  Kernel PCA and Manifold Methods  458

14  Feature Engineering for Domains: Domain-Specific Learning  469
14.1  Working with Text  470
14.1.2  Example of Text Learning  476
14.2  Clustering  479
14.2.1  k-Means Clustering  479
14.3  Working with Images  481
14.3.1  Bag of Visual Words  481
14.3.3  An End-to-End System  483
14.3.4  Complete Code of BoVW Transformer  491

15  Connections, Extensions, and Further Directions  497
15.1  Optimization  497
15.2  Linear Regression from Raw Materials  500
15.2.1  A Graphical View of Linear Regression  504
15.3  Building Logistic Regression from Raw Materials  504
15.3.1  Logistic Regression with Zero-One Coding  506
15.3.2  Logistic Regression with Plus-One Minus-One Coding  508
15.3.3  A Graphical View of Logistic Regression  509
15.4  SVM from Raw Materials  510
15.5  Neural Networks  512
15.5.1  A NN View of Linear Regression  512
15.5.2  A NN View of Logistic Regression  515
15.5.3  Beyond Basic Neural Networks  516
15.6  Probabilistic Graphical Models  516
15.6.2  A PGM View of Linear Regression  519
15.6.3  A PGM View of Logistic Regression  523

A  mlwpy.py Listing  529
Index  537