Preface  xvii
Notations  xxi
|
|
1 Introduction  1
1.1 What Is Machine Learning?  1
1.2 Examples of Machine Learning Applications  4
1.2.1 Learning Associations  4
1.2.2 Classification  5
1.2.3 Regression  9
1.2.4 Unsupervised Learning  11
1.2.5 Reinforcement Learning  13
1.3 Notes  14
1.4 Relevant Resources  17
1.5 Exercises  18
1.6 References  20
|
|
2 Supervised Learning  21
2.1 Learning a Class from Examples  21
2.2 Vapnik-Chervonenkis Dimension  27
2.3 Probably Approximately Correct Learning  29
2.4 Noise  30
2.5 Learning Multiple Classes  32
2.6 Regression  34
2.7 Model Selection and Generalization  37
2.8 Dimensions of a Supervised Machine Learning Algorithm  41
2.9 Notes  42
2.10 Exercises  43
2.11 References  47
|
3 Bayesian Decision Theory  49
3.1 Introduction  49
3.2 Classification  51
3.3 Losses and Risks  53
3.4 Discriminant Functions  55
3.5 Association Rules  56
3.6 Notes  59
3.7 Exercises  60
3.8 References  64
|
|
4 Parametric Methods  65
4.1 Introduction  65
4.2 Maximum Likelihood Estimation  66
4.2.1 Bernoulli Density  67
4.2.2 Multinomial Density  68
4.2.3 Gaussian (Normal) Density  68
4.3 Evaluating an Estimator: Bias and Variance  69
4.4 The Bayes' Estimator  70
4.5 Parametric Classification  73
4.6 Regression  77
4.7 Tuning Model Complexity: Bias/Variance Dilemma  80
4.8 Model Selection Procedures  83
4.9 Notes  87
4.10 Exercises  88
4.11 References  90
|
|
5 Multivariate Methods  93
5.1 Multivariate Data  93
5.2 Parameter Estimation  94
5.3 Estimation of Missing Values  95
5.4 Multivariate Normal Distribution  96
5.5 Multivariate Classification  100
5.6 Tuning Complexity  106
5.7 Discrete Features  108
5.8 Multivariate Regression  109
5.9 Notes  111
5.10 Exercises  112
5.11 References  113
|
6 Dimensionality Reduction  115
6.1 Introduction  115
6.2 Subset Selection  116
6.3 Principal Component Analysis  120
6.4 Feature Embedding  127
6.5 Factor Analysis  130
6.6 Singular Value Decomposition and Matrix Factorization  135
6.7 Multidimensional Scaling  136
6.8 Linear Discriminant Analysis  140
6.9 Canonical Correlation Analysis  145
6.10 Isomap  148
6.11 Locally Linear Embedding  150
6.12 Laplacian Eigenmaps  153
6.13 Notes  155
6.14 Exercises  157
6.15 References  158
|
|
7 Clustering  161
7.1 Introduction  161
7.2 Mixture Densities  162
7.3 k-Means Clustering  163
7.4 Expectation-Maximization Algorithm  167
7.5 Mixtures of Latent Variable Models  172
7.6 Supervised Learning after Clustering  173
7.7 Spectral Clustering  175
7.8 Hierarchical Clustering  176
7.9 Choosing the Number of Clusters  178
7.10 Notes  179
7.11 Exercises  180
7.12 References  182
|
|
8 Nonparametric Methods  185
8.1 Introduction  185
8.2 Nonparametric Density Estimation  186
8.2.1 Histogram Estimator  187
8.2.2 Kernel Estimator  188
8.2.3 k-Nearest Neighbor Estimator  190
8.3 Generalization to Multivariate Data  192
8.4 Nonparametric Classification  193
8.5 Condensed Nearest Neighbor  194
8.6 Distance-Based Classification  196
8.7 Outlier Detection  199
8.8 Nonparametric Regression: Smoothing Models  201
8.8.1 Running Mean Smoother  201
8.8.2 Kernel Smoother  203
8.8.3 Running Line Smoother  204
8.9 How to Choose the Smoothing Parameter  204
8.10 Notes  205
8.11 Exercises  208
8.12 References  210
|
|
9 Decision Trees  213
9.1 Introduction  213
9.2 Univariate Trees  215
9.2.1 Classification Trees  216
9.2.2 Regression Trees  220
9.3 Pruning  222
9.4 Rule Extraction from Trees  225
9.5 Learning Rules from Data  226
9.6 Multivariate Trees  230
9.7 Notes  232
9.8 Exercises  235
9.9 References  237
|
10 Linear Discrimination  239
10.1 Introduction  239
10.2 Generalizing the Linear Model  241
10.3 Geometry of the Linear Discriminant  242
10.3.1 Two Classes  242
10.3.2 Multiple Classes  244
10.4 Pairwise Separation  246
10.5 Parametric Discrimination Revisited  247
10.6 Gradient Descent  248
10.7 Logistic Discrimination  250
10.7.1 Two Classes  250
10.7.2 Multiple Classes  254
10.8 Discrimination by Regression  257
10.9 Learning to Rank  260
10.10 Notes  263
10.11 Exercises  263
10.12 References  266
|
11 Multilayer Perceptrons  267
11.1 Introduction  267
11.1.1 Understanding the Brain  268
11.1.2 Neural Networks as a Paradigm for Parallel Processing  269
11.2 The Perceptron  271
11.3 Training a Perceptron  274
11.4 Learning Boolean Functions  277
11.5 Multilayer Perceptrons  279
11.6 MLP as a Universal Approximator  281
11.7 Backpropagation Algorithm  283
11.7.1 Nonlinear Regression  284
11.7.2 Two-Class Discrimination  286
11.7.3 Multiclass Discrimination  288
11.7.4 Multiple Hidden Layers  290
11.8 Training Procedures  290
11.8.1 Improving Convergence  290
11.8.2 Overtraining  291
11.8.3 Structuring the Network  292
11.8.4 Hints  295
11.9 Tuning the Network Size  297
11.10 Bayesian View of Learning  300
11.11 Dimensionality Reduction  301
11.12 Learning Time  304
11.12.1 Time Delay Neural Networks  304
11.12.2 Recurrent Networks  305
11.13 Deep Learning  306
11.14 Notes  309
11.15 Exercises  311
11.16 References  313
|
|
12 Local Models  317
12.1 Introduction  317
12.2 Competitive Learning  318
12.2.1 Online k-Means  318
12.2.2 Adaptive Resonance Theory  323
12.2.3 Self-Organizing Maps  324
12.3 Radial Basis Functions  326
12.4 Incorporating Rule-Based Knowledge  332
12.5 Normalized Basis Functions  333
12.6 Competitive Basis Functions  335
12.7 Learning Vector Quantization  338
12.8 The Mixture of Experts  338
12.8.1 Cooperative Experts  341
12.8.2 Competitive Experts  342
12.9 Hierarchical Mixture of Experts  342
12.10 Notes  343
12.11 Exercises  344
12.12 References  347
|
|
13 Kernel Machines  349
13.1 Introduction  349
13.2 Optimal Separating Hyperplane  351
13.3 The Nonseparable Case: Soft Margin Hyperplane  355
13.4 ν-SVM  358
13.5 Kernel Trick  359
13.6 Vectorial Kernels  361
13.7 Defining Kernels  364
13.8 Multiple Kernel Learning  365
13.9 Multiclass Kernel Machines  367
13.10 Kernel Machines for Regression  368
13.11 Kernel Machines for Ranking  373
13.12 One-Class Kernel Machines  374
13.13 Large Margin Nearest Neighbor Classifier  377
13.14 Kernel Dimensionality Reduction  379
13.15 Notes  380
13.16 Exercises  382
13.17 References  383
|
|
14 Graphical Models  387
14.1 Introduction  387
14.2 Canonical Cases for Conditional Independence  389
14.3 Generative Models  396
14.4 d-Separation  399
14.5 Belief Propagation  399
14.5.1 Chains  400
14.5.2 Trees  402
14.5.3 Polytrees  404
14.5.4 Junction Trees  406
14.6 Undirected Graphs: Markov Random Fields  407
14.7 Learning the Structure of a Graphical Model  410
14.8 Influence Diagrams  411
14.9 Notes  412
14.10 Exercises  413
14.11 References  415
|
|
15 Hidden Markov Models  417
15.1 Introduction  417
15.2 Discrete Markov Processes  418
15.3 Hidden Markov Models  421
15.4 Three Basic Problems of HMMs  423
15.5 Evaluation Problem  423
15.6 Finding the State Sequence  427
15.7 Learning Model Parameters  429
15.8 Continuous Observations  432
15.9 The HMM as a Graphical Model  433
15.10 Model Selection in HMMs  436
15.11 Notes  438
15.12 Exercises  440
15.13 References  443
|
|
16 Bayesian Estimation  445
16.1 Introduction  445
16.2 Bayesian Estimation of the Parameters of a Discrete Distribution  449
16.2.1 K > 2 States: Dirichlet Distribution  449
16.2.2 K = 2 States: Beta Distribution  450
16.3 Bayesian Estimation of the Parameters of a Gaussian Distribution  451
16.3.1 Univariate Case: Unknown Mean, Known Variance  451
16.3.2 Univariate Case: Unknown Mean, Unknown Variance  453
16.3.3 Multivariate Case: Unknown Mean, Unknown Covariance  455
16.4 Bayesian Estimation of the Parameters of a Function  456
16.4.1 Regression  456
16.4.2 Regression with Prior on Noise Precision  460
16.4.3 The Use of Basis/Kernel Functions  461
16.4.4 Bayesian Classification  463
16.5 Choosing a Prior  466
16.6 Bayesian Model Comparison  467
16.7 Bayesian Estimation of a Mixture Model  470
16.8 Nonparametric Bayesian Modeling  473
16.9 Gaussian Processes  474
16.10 Dirichlet Processes and Chinese Restaurants  478
16.11 Latent Dirichlet Allocation  480
16.12 Beta Processes and Indian Buffets  482
16.13 Notes  483
16.14 Exercises  484
16.15 References  485
|
17 Combining Multiple Learners  487
17.1 Rationale  487
17.2 Generating Diverse Learners  488
17.3 Model Combination Schemes  491
17.4 Voting  492
17.5 Error-Correcting Output Codes  496
17.6 Bagging  498
17.7 Boosting  499
17.8 The Mixture of Experts Revisited  502
17.9 Stacked Generalization  504
17.10 Fine-Tuning an Ensemble  505
17.10.1 Choosing a Subset of the Ensemble  506
17.10.2 Constructing Metalearners  506
17.11 Cascading  507
17.12 Notes  509
17.13 Exercises  511
17.14 References  513
|
18 Reinforcement Learning  517
18.1 Introduction  517
18.2 Single State Case: K-Armed Bandit  519
18.3 Elements of Reinforcement Learning  520
18.4 Model-Based Learning  523
18.4.1 Value Iteration  523
18.4.2 Policy Iteration  524
18.5 Temporal Difference Learning  525
18.5.1 Exploration Strategies  525
18.5.2 Deterministic Rewards and Actions  526
18.5.3 Nondeterministic Rewards and Actions  527
18.5.4 Eligibility Traces  530
18.6 Generalization  531
18.7 Partially Observable States  534
18.7.1 The Setting  534
18.7.2 Example: The Tiger Problem  536
18.8 Notes  541
18.9 Exercises  542
18.10 References  544
|
19 Design and Analysis of Machine Learning Experiments  547
19.1 Introduction  547
19.2 Factors, Response, and Strategy of Experimentation  550
19.3 Response Surface Design  553
19.4 Randomization, Replication, and Blocking  554
19.5 Guidelines for Machine Learning Experiments  555
19.6 Cross-Validation and Resampling Methods  558
19.6.1 K-Fold Cross-Validation  559
19.6.2 5×2 Cross-Validation  560
19.6.3 Bootstrapping  561
19.7 Measuring Classifier Performance  561
19.8 Interval Estimation  564
19.9 Hypothesis Testing  568
19.10 Assessing a Classification Algorithm's Performance  570
19.10.1 Binomial Test  571
19.10.2 Approximate Normal Test  572
19.10.3 t Test  572
19.11 Comparing Two Classification Algorithms  573
19.11.1 McNemar's Test  573
19.11.2 K-Fold Cross-Validated Paired t Test  573
19.11.3 5×2 cv Paired t Test  574
19.11.4 5×2 cv Paired F Test  575
19.12 Comparing Multiple Algorithms: Analysis of Variance  576
19.13 Comparison over Multiple Datasets  580
19.13.1 Comparing Two Algorithms  581
19.13.2 Multiple Algorithms  583
19.14 Multivariate Tests  584
19.14.1 Comparing Two Algorithms  585
19.14.2 Comparing Multiple Algorithms  586
19.15 Notes  587
19.16 Exercises  588
19.17 References  590
|
|
A Probability  593
A.1 Elements of Probability  593
A.1.1 Axioms of Probability  594
A.1.2 Conditional Probability  594
A.2 Random Variables  595
A.2.1 Probability Distribution and Density Functions  595
A.2.2 Joint Distribution and Density Functions  596
A.2.3 Conditional Distributions  596
A.2.4 Bayes' Rule  597
A.2.5 Expectation  597
A.2.6 Variance  598
A.2.7 Weak Law of Large Numbers  599
A.3 Special Random Variables  599
A.3.1 Bernoulli Distribution  599
A.3.2 Binomial Distribution  600
A.3.3 Multinomial Distribution  600
A.3.4 Uniform Distribution  600
A.3.5 Normal (Gaussian) Distribution  601
A.3.6 Chi-Square Distribution  602
A.3.7 t Distribution  603
A.3.8 F Distribution  603
A.4 References  603
Index  605