Acknowledgements  xiii
Notation and Vocabulary  xv

1 Why We Wrote This Book and How You Should Read It  1 (4)

2 Parametric Likelihood Fits  5 (34)
  5 (7)
  2.1.1 Example: CP Violation via Mixing  7 (2)
  2.1.2 The Exponential Family  9 (1)
  2.1.3 Confidence Intervals  10 (1)
  11 (1)
  2.2 Parametric Likelihood Fits  12 (9)
  2.2.1 Nuisance Parameters  16 (1)
  2.2.2 Confidence Intervals from Pivotal Quantities  17 (2)
  2.2.3 Asymptotic Inference  19 (1)
  20 (1)
  2.2.5 Conditional Likelihood  20 (1)
  2.3 Fits for Small Statistics  21 (5)
  2.3.1 Sample Study of Coverage at Small Statistics  22 (3)
  2.3.2 When the pdf Goes Negative  25 (1)
  2.4 Results Near the Boundary of a Physical Region  26 (2)
  2.5 Likelihood Ratio Test for Presence of Signal  28 (3)
  31 (4)
  35 (4)
  37 (2)

39 (24)
  3.1 Binned Goodness of Fit Tests  41 (5)
  3.2 Statistics Converging to Chi-Square  46 (3)
  3.3 Univariate Unbinned Goodness of Fit Tests  49 (3)
  3.3.1 Kolmogorov–Smirnov  49 (1)
  50 (1)
  51 (1)
  51 (1)
  52 (7)
  53 (1)
  3.4.2 Transformations to a Uniform Distribution  54 (1)
  3.4.3 Local Density Tests  55 (1)
  56 (1)
  57 (1)
  58 (1)
  59 (4)
  61 (2)

63 (26)
  63 (2)
  65 (5)
  4.2.1 Bootstrap Confidence Intervals  68 (2)
  70 (1)
  4.2.3 Parametric Bootstrap  70 (1)
  70 (6)
  4.4 BCa Confidence Intervals  76 (2)
  78 (4)
  4.6 Resampling Weighted Observations  82 (4)
  86 (3)
  86 (3)

89 (32)
  5.1 Empirical Density Estimate  90 (1)
  90 (2)
  92 (1)
  5.3.1 Multivariate Kernel Estimation  92 (1)
  93 (1)
  5.5 Parametric vs. Nonparametric Density Estimation  93 (1)
  94 (6)
  5.6.1 Choosing Histogram Binning  97 (3)
  100 (2)
  5.8 The Curse of Dimensionality  102 (1)
  5.9 Adaptive Kernel Estimation  103 (2)
  5.10 Naive Bayes Classification  105 (1)
  5.11 Multivariate Kernel Estimation  106 (2)
  5.12 Estimation Using Orthogonal Series  108 (3)
  5.13 Using Monte Carlo Models  111 (1)
  112 (8)
  5.14.1 Unfolding: Regularization  116 (4)
  120 (1)
  120 (1)

6 Basic Concepts and Definitions of Machine Learning  121 (8)
  6.1 Supervised, Unsupervised, and Semi-Supervised  121 (2)
  123 (1)
  6.3 Batch and Online Learning  124 (1)
  125 (2)
  6.5 Classification and Regression  127 (2)
  128 (1)

129 (16)
  7.1 Categorical Variables  129 (3)
  132 (7)
  7.2.1 Likelihood Optimization  134 (1)
  135 (2)
  137 (1)
  137 (2)
  139 (1)
  139 (2)
  141 (4)
  142 (3)

8 Linear Transformations and Dimensionality Reduction  145 (20)
  8.1 Centering, Scaling, Reflection and Rotation  145 (1)
  8.2 Rotation and Dimensionality Reduction  146 (1)
  8.3 Principal Component Analysis (PCA)  147 (11)
  148 (1)
  8.3.2 Numerical Implementation  149 (1)
  150 (1)
  8.3.4 How Many Principal Components Are Enough?  151 (3)
  8.3.5 Example: Apply PCA and Choose the Optimal Number of Components  154 (4)
  8.4 Independent Component Analysis (ICA)  158 (5)
  158 (3)
  8.4.2 Numerical Implementation  161 (1)
  162 (1)
  163 (2)
  163 (2)

9 Introduction to Classification  165 (30)
  9.1 Loss Functions: Hard Labels and Soft Scores  165 (3)
  9.2 Bias, Variance, and Noise  168 (5)
  9.3 Training, Validating and Testing: The Optimal Splitting Rule  173 (4)
  9.4 Resampling Techniques: Cross-Validation and Bootstrap  177 (5)
  177 (2)
  179 (2)
  9.4.3 Sampling with Stratification  181 (1)
  9.5 Data with Unbalanced Classes  182 (8)
  9.5.1 Adjusting Prior Probabilities  183 (1)
  9.5.2 Undersampling the Majority Class  184 (1)
  9.5.3 Oversampling the Minority Class  185 (1)
  9.5.4 Example: Classification of Forest Cover Type Data  186 (4)
  190 (1)
  191 (4)
  192 (3)

10 Assessing Classifier Performance  195 (26)
  10.1 Classification Error and Other Measures of Predictive Power  195 (1)
  10.2 Receiver Operating Characteristic (ROC) and Other Curves  196 (14)
  10.2.1 Empirical ROC Curve  196 (2)
  10.2.2 Other Performance Measures  198 (1)
  10.2.3 Optimal Operating Point  198 (2)
  200 (1)
  200 (5)
  10.2.6 Confidence Bounds for ROC Curves  205 (5)
  10.3 Testing Equivalence of Two Classification Models  210 (5)
  10.4 Comparing Several Classifiers  215 (2)
  217 (4)
  218 (3)

11 Linear and Quadratic Discriminant Analysis, Logistic Regression, and Partial Least Squares Regression  221 (30)
  11.1 Discriminant Analysis  221 (10)
  11.1.1 Estimating the Covariance Matrix  223 (2)
  11.1.2 Verifying Discriminant Analysis Assumptions  225 (1)
  11.1.3 Applying LDA When LDA Assumptions Are Invalid  226 (2)
  11.1.4 Numerical Implementation  228 (1)
  11.1.5 Regularized Discriminant Analysis  228 (1)
  11.1.6 LDA for Variable Transformation  229 (2)
  231 (4)
  11.2.1 Binomial Logistic Regression: Theory and Numerical Implementation  231 (2)
  11.2.2 Properties of the Binomial Model  233 (1)
  11.2.3 Verifying Model Assumptions  233 (1)
  11.2.4 Logistic Regression with Multiple Classes  234 (1)
  11.3 Classification by Linear Regression  235 (1)
  11.4 Partial Least Squares Regression  236 (3)
  11.5 Example: Linear Models for MAGIC Telescope Data  239 (8)
  11.6 Choosing a Linear Classifier for Your Analysis  247 (1)
  247 (4)
  248 (3)

251 (14)
  251 (3)
  12.2 The Feed-Forward Neural Network  254 (2)
  256 (4)
  12.4 Bayes Neural Networks  260 (2)
  262 (1)
  263 (2)
  263 (2)

13 Local Learning and Kernel Expansion  265 (42)
  13.1 From Input Variables to the Feature Space  266 (4)
  269 (1)
  270 (8)
  13.2.1 Kernel Ridge Regression  274 (4)
  13.3 Making and Choosing Kernels  278 (1)
  13.4 Radial Basis Functions  279 (4)
  13.4.1 Example: RBF Classification for the MAGIC Telescope Data  280 (3)
  13.5 Support Vector Machines (SVM)  283 (10)
  13.5.1 SVM with Weighted Data  286 (2)
  13.5.2 SVM with Probabilistic Outputs  288 (1)
  13.5.3 Numerical Implementation  288 (5)
  13.5.4 Multiclass Extensions  293 (1)
  13.6 Empirical Local Methods  293 (9)
  13.6.1 Classification by Probability Density Estimation  294 (1)
  13.6.2 Locally Weighted Regression  295 (3)
  13.6.3 Nearest Neighbors and Fuzzy Rules  298 (4)
  13.7 Kernel Methods: The Good, the Bad and the Curse of Dimensionality  302 (1)
  303 (4)
  304 (3)

307 (24)
  308 (4)
  14.2 Predicting by Decision Trees  312 (1)
  312 (1)
  313 (6)
  14.4.1 Example: Pruning a Classification Tree  317 (2)
  14.5 Trees for Multiple Classes  319 (1)
  14.6 Splits on Categorical Variables  320 (1)
  321 (2)
  323 (1)
  324 (3)
  14.10 Why Are Decision Trees Good (or Bad)?  327 (1)
  328 (3)
  329 (2)

331 (40)
  332 (26)
  332 (1)
  15.1.2 AdaBoost for Two Classes  333 (3)
  15.1.3 Minimizing Convex Loss by Stagewise Additive Modeling  336 (7)
  15.1.4 Maximizing the Minimal Margin  343 (8)
  15.1.5 Nonconvex Loss and Robust Boosting  351 (6)
  15.1.6 Boosting for Multiple Classes  357 (1)
  15.2 Diversifying the Weak Learner: Bagging, Random Subspace and Random Forest  358 (7)
  15.2.1 Measures of Diversity  359 (2)
  15.2.2 Bagging and Random Forest  361 (2)
  363 (1)
  15.2.4 Example: K/π Separation for BaBar PID  364 (1)
  15.3 Choosing an Ensemble for Your Analysis  365 (2)
  367 (4)
  367 (4)

16 Reducing Multiclass to Binary  371 (10)
  372 (3)
  375 (3)
  16.3 Summary: Choosing the Right Design  378 (3)
  379 (2)

17 How to Choose the Right Classifier for Your Analysis and Apply It Correctly  381 (4)
  17.1 Predictive Performance and Interpretability  381 (1)
  17.2 Matching Classifiers and Variables  382 (1)
  17.3 Using Classifier Predictions  382 (1)
  383 (1)
  17.5 CPU and Memory Requirements  383 (2)

18 Methods for Variable Ranking and Selection  385 (32)
  386 (3)
  18.1.1 Variable Ranking and Selection  386 (1)
  18.1.2 Strong and Weak Relevance  386 (3)
  389 (12)
  18.2.1 Filters: Correlation and Mutual Information  390 (4)
  18.2.2 Wrappers: Sequential Forward Selection (SFS), Sequential Backward Elimination (SBE), and Feature-based Sensitivity of Posterior Probabilities (FSPP)  394 (6)
  18.2.3 Embedded Methods: Estimation of Variable Importance by Decision Trees, Neural Networks, Nearest Neighbors, and Linear Models  400 (1)
  401 (12)
  18.3.1 Optimal-Set Search Strategies  401 (2)
  18.3.2 Multiple Testing: Backward Elimination by Change in Margin (BECM)  403 (7)
  18.3.3 Estimation of the Reference Distribution by Permutations: Artificial Contrasts with Ensembles (ACE) Algorithm  410 (3)
  413 (4)
  414 (3)

19 Bump Hunting in Multivariate Data  417 (8)
  19.1 Voronoi Tessellation and SLEUTH Algorithm  418 (2)
  19.2 Identifying Box Regions by PRIM and Other Algorithms  420 (2)
  19.3 Bump Hunting Through Supervised Learning  422 (3)
  423 (2)

20 Software Packages for Machine Learning  425 (6)
  20.1 Tools Developed in HEP  425 (1)
  426 (1)
  427 (1)
  20.4 Tools for Java and Python  428 (1)
  20.5 What Software Tool Is Right for You?  429 (2)
  430 (1)

Appendix A Optimization Algorithms  431 (4)
  431 (1)
  A.2 Linear Programming (LP)  432 (3)

Index  435