Preface xix
Notations xxiii
|
|
1 Introduction 1
  1.1 What Is Machine Learning? 1
  1.2 Examples of Machine Learning Applications 4
    1.2.1 Learning Associations 4
    1.2.2 Classification 4
    1.2.3 Regression 9
    1.2.4 Unsupervised Learning 11
    1.2.5 Reinforcement Learning 12
  1.3 History 13
  1.4 Related Topics 15
    1.4.1 High-Performance Computing 15
    1.4.2 Data Privacy and Security 16
    1.4.3 Model Interpretability and Trust 17
    1.4.4 Data Science 18
  1.5 Exercises 18
  1.6 References 20
|
|
2 Supervised Learning 23
  2.1 Learning a Class from Examples 23
  2.2 Vapnik-Chervonenkis Dimension 29
  2.3 Probably Approximately Correct Learning 31
  2.4 Noise 32
  2.5 Learning Multiple Classes 34
|
8 Nonparametric Methods
  8.2 Nonparametric Density Estimation 190
    8.2.1 Histogram Estimator 191
    8.2.2 Kernel Estimator 192
    8.2.3 k-Nearest Neighbor Estimator 194
  8.3 Generalization to Multivariate Data 196
  8.4 Nonparametric Classification 197
  8.5 Condensed Nearest Neighbor 198
  8.6 Distance-Based Classification 200
  8.7 Outlier Detection 203
  8.8 Nonparametric Regression: Smoothing Models 205
    8.8.1 Running Mean Smoother 205
    8.8.2 Kernel Smoother 207
    8.8.3 Running Line Smoother 208
  8.9 How to Choose the Smoothing Parameter 208
  8.10 Notes 209
  8.11 Exercises 212
  8.12 References 214
|
|
9 Decision Trees 217
  9.1 Introduction 217
  9.2 Univariate Trees 219
    9.2.1 Classification Trees 220
    9.2.2 Regression Trees 224
  9.3 Pruning 226
  9.4 Rule Extraction from Trees 229
  9.5 Learning Rules from Data 230
  9.6 Multivariate Trees 234
  9.7 Notes 236
  9.8 Exercises 239
  9.9 References 241
|
|
10 Linear Discrimination 243
  10.1 Introduction 243
  10.2 Generalizing the Linear Model 245
  10.3 Geometry of the Linear Discriminant 246
    10.3.1 Two Classes 246
    10.3.2 Multiple Classes 248
  10.4 Pairwise Separation 250
  10.5 Parametric Discrimination Revisited 251
  10.6 Gradient Descent 252
  10.7 Logistic Discrimination 254
    10.7.1 Two Classes 254
    10.7.2 Multiple Classes 257
    10.7.3 Multiple Labels 263
  10.8 Learning to Rank 264
  10.9 Notes 265
  10.10 Exercises 267
  10.11 References 269
|
11 Multilayer Perceptrons 271
  11.1 Introduction 271
    11.1.1 Understanding the Brain 272
    11.1.2 Neural Networks as a Paradigm for Parallel Processing 273
  11.2 The Perceptron 275
  11.3 Training a Perceptron 278
  11.4 Learning Boolean Functions 282
  11.5 Multilayer Perceptrons 283
  11.6 MLP as a Universal Approximator 286
  11.7 Backpropagation Algorithm 288
    11.7.1 Nonlinear Regression 288
    11.7.2 Two-Class Discrimination 291
    11.7.3 Multiclass Discrimination 292
    11.7.4 Multilabel Discrimination 294
  11.8 Overtraining 295
  11.9 Learning Hidden Representations 296
  11.10 Autoencoders 301
  11.11 Word2vec Architecture 303
  11.12 Notes 307
  11.13 Exercises 309
  11.14 References 310
|
|
12 Deep Learning 313
  12.1 Introduction 313
  12.2 How to Train Multiple Hidden Layers 317
    12.2.1 Rectified Linear Unit 317
    12.2.2 Initialization 317
    12.2.3 Generalizing Backpropagation to Multiple Hidden Layers 318
  12.3 Improving Training Convergence 321
    12.3.1 Momentum 321
    12.3.2 Adaptive Learning Factor 322
    12.3.3 Batch Normalization 323
  12.4 Overtraining 325
    12.4.1 Hints 325
    12.4.2 Weight Decay 327
    12.4.3 Dropout 330
  12.5 Convolutional Layers 331
    12.5.1 The Idea 331
    12.5.2 Formalization 333
    12.5.3 Examples: LeNet-5 and AlexNet 337
    12.5.4 Extensions 338
    12.5.5 Multimodal Deep Networks 340
  12.6 Tuning the Network Structure 340
    12.6.1 Structure and Hyperparameter Search 340
    12.6.2 Skip Connections 342
    12.6.3 Gating Units 343
  12.7 Learning Sequences 344
    12.7.1 Example Tasks 344
    12.7.2 Time-Delay Neural Networks 345
    12.7.3 Recurrent Networks 345
    12.7.4 Long Short-Term Memory Unit 348
    12.7.5 Gated Recurrent Unit 349
  12.8 Generative Adversarial Network 350
  12.9 Notes 353
  12.10 Exercises 354
  12.11 References 356
|
|
13 Local Models 361
  13.1 Introduction 361
  13.2 Competitive Learning 362
    13.2.1 Online k-Means 362
    13.2.2 Adaptive Resonance Theory 367
    13.2.3 Self-Organizing Maps 368
  13.3 Radial Basis Functions 370
  13.4 Incorporating Rule-Based Knowledge 376
  13.5 Normalized Basis Functions 377
  13.6 Competitive Basis Functions 379
  13.7 Learning Vector Quantization 382
  13.8 The Mixture of Experts 382
    13.8.1 Cooperative Experts 385
    13.8.2 Competitive Experts 386
  13.9 Hierarchical Mixture of Experts and Soft Decision Trees 386
  13.10 Notes 388
  13.11 Exercises 389
  13.12 References 392
|
|
14 Kernel Machines 395
  14.1 Introduction 395
  14.2 Optimal Separating Hyperplane 397
  14.3 The Nonseparable Case: Soft Margin Hyperplane 401
  14.4 ν-SVM 404
  14.5 Kernel Trick 405
  14.6 Vectorial Kernels 407
  14.7 Defining Kernels 410
  14.8 Multiple Kernel Learning 411
  14.9 Multiclass Kernel Machines 413
  14.10 Kernel Machines for Regression 414
  14.11 Kernel Machines for Ranking 419
  14.12 One-Class Kernel Machines 420
  14.13 Large Margin Nearest Neighbor Classifier 423
  14.14 Kernel Dimensionality Reduction 425
  14.15 Notes 426
  14.16 Exercises 428
  14.17 References 429
|
|
15 Graphical Models 433
  15.1 Introduction 433
  15.2 Canonical Cases for Conditional Independence 435
  15.3 Generative Models 442
  15.4 d-Separation 445
  15.5 Belief Propagation 445
    15.5.1 Chains 446
    15.5.2 Trees 448
    15.5.3 Polytrees 450
    15.5.4 Junction Trees 452
  15.6 Undirected Graphs: Markov Random Fields 453
  15.7 Learning the Structure of a Graphical Model 456
  15.8 Influence Diagrams 457
  15.9 Notes 458
  15.10 Exercises 459
  15.11 References 461
|
|
16 Hidden Markov Models 463
  16.1 Introduction 463
  16.2 Discrete Markov Processes 464
  16.3 Hidden Markov Models 467
  16.4 Three Basic Problems of HMMs 469
  16.5 Evaluation Problem 469
  16.6 Finding the State Sequence 473
  16.7 Learning Model Parameters 475
  16.8 Continuous Observations 478
  16.9 The HMM as a Graphical Model 479
  16.10 Model Selection in HMMs 482
  16.11 Notes 484
  16.12 Exercises 486
  16.13 References 489
|
|
17 Bayesian Estimation 491
  17.1 Introduction 491
  17.2 Bayesian Estimation of the Parameters of a Discrete Distribution 495
    17.2.1 K > 2 States: Dirichlet Distribution 495
    17.2.2 K = 2 States: Beta Distribution 496
  17.3 Bayesian Estimation of the Parameters of a Gaussian Distribution 497
    17.3.1 Univariate Case: Unknown Mean, Known Variance 497
    17.3.2 Univariate Case: Unknown Mean, Unknown Variance 499
    17.3.3 Multivariate Case: Unknown Mean, Unknown Covariance 501
  17.4 Bayesian Estimation of the Parameters of a Function 502
    17.4.1 Regression 502
    17.4.2 Regression with Prior on Noise Precision 506
    17.4.3 The Use of Basis/Kernel Functions 507
    17.4.4 Bayesian Classification 509
  17.5 Choosing a Prior 512
  17.6 Bayesian Model Comparison 513
  17.7 Bayesian Estimation of a Mixture Model 516
  17.8 Nonparametric Bayesian Modeling 519
  17.9 Gaussian Processes 520
  17.10 Dirichlet Processes and Chinese Restaurants 524
  17.11 Latent Dirichlet Allocation 526
  17.12 Beta Processes and Indian Buffets 528
  17.13 Notes 529
  17.14 Exercises 530
  17.15 References 531
|
18 Combining Multiple Learners 533
  18.1 Rationale 533
  18.2 Generating Diverse Learners 534
  18.3 Model Combination Schemes 537
  18.4 Voting 538
  18.5 Error-Correcting Output Codes 542
  18.6 Bagging 544
  18.7 Boosting 545
  18.8 The Mixture of Experts Revisited 548
  18.9 Stacked Generalization 550
  18.10 Fine-Tuning an Ensemble 551
    18.10.1 Choosing a Subset of the Ensemble 552
    18.10.2 Constructing Metalearners 552
  18.11 Cascading 553
  18.12 Notes 555
  18.13 Exercises 557
  18.14 References 559
|
19 Reinforcement Learning 563
  19.1 Introduction 563
  19.2 Single State Case: K-Armed Bandit 565
  19.3 Elements of Reinforcement Learning 566
  19.4 Model-Based Learning 569
    19.4.1 Value Iteration 569
    19.4.2 Policy Iteration 570
  19.5 Temporal Difference Learning 571
    19.5.1 Exploration Strategies 571
    19.5.2 Deterministic Rewards and Actions 572
    19.5.3 Nondeterministic Rewards and Actions 573
    19.5.4 Eligibility Traces 576
  19.6 Generalization 577
  19.7 Partially Observable States 580
    19.7.1 The Setting 580
    19.7.2 Example: The Tiger Problem 582
  19.8 Deep Q Learning 587
  19.9 Policy Gradients 588
  19.10 Learning to Play Backgammon and Go 591
  19.11 Notes 592
  19.12 Exercises 593
  19.13 References 595
|
20 Design and Analysis of Machine Learning Experiments 597
  20.1 Introduction 597
  20.2 Factors, Response, and Strategy of Experimentation 600
  20.3 Response Surface Design 603
  20.4 Randomization, Replication, and Blocking 604
  20.5 Guidelines for Machine Learning Experiments 605
  20.6 Cross-Validation and Resampling Methods 608
    20.6.1 K-Fold Cross-Validation 609
    20.6.2 5 × 2 Cross-Validation 610
    20.6.3 Bootstrapping 611
  20.7 Measuring Classifier Performance 611
  20.8 Interval Estimation 614
  20.9 Hypothesis Testing 618
  20.10 Assessing a Classification Algorithm's Performance 620
    20.10.1 Binomial Test 621
    20.10.2 Approximate Normal Test 622
    20.10.3 t Test 622
  20.11 Comparing Two Classification Algorithms 623
    20.11.1 McNemar's Test 623
    20.11.2 K-Fold Cross-Validated Paired t Test 623
    20.11.3 5 × 2 cv Paired t Test 624
    20.11.4 5 × 2 cv Paired F Test 625
  20.12 Comparing Multiple Algorithms: Analysis of Variance 626
  20.13 Comparison over Multiple Datasets 630
    20.13.1 Comparing Two Algorithms 631
    20.13.2 Multiple Algorithms 633
  20.14 Multivariate Tests 634
    20.14.1 Comparing Two Algorithms 635
    20.14.2 Comparing Multiple Algorithms 636
  20.15 Notes 637
  20.16 Exercises 638
  20.17 References 640
|
|
A Probability 643
  A.1 Elements of Probability 643
    A.1.1 Axioms of Probability 644
    A.1.2 Conditional Probability 644
  A.2 Random Variables 645
    A.2.1 Probability Distribution and Density Functions 645
    A.2.2 Joint Distribution and Density Functions 646
    A.2.3 Conditional Distributions 646
    A.2.4 Bayes' Rule 647
    A.2.5 Expectation 647
    A.2.6 Variance 648
    A.2.7 Weak Law of Large Numbers 649
  A.3 Special Random Variables 649
    A.3.1 Bernoulli Distribution 649
    A.3.2 Binomial Distribution 650
    A.3.3 Multinomial Distribution 650
    A.3.4 Uniform Distribution 650
    A.3.5 Normal (Gaussian) Distribution 651
    A.3.6 Chi-Square Distribution 652
    A.3.7 t Distribution 653
    A.3.8 F Distribution 653
  A.4 References 653
|
|
B Linear Algebra 655
  B.1 Vectors 655
  B.2 Matrices 657
  B.3 Similarity of Vectors 658
  B.4 Square Matrices 659
  B.5 Linear Dependence and Ranks 659
  B.6 Inverses 660
  B.7 Positive Definite Matrices 660
  B.8 Trace and Determinant 660
  B.9 Eigenvalues and Eigenvectors 661
  B.10 Spectral Decomposition 662
  B.11 Singular Value Decomposition 662
  B.12 References 663
|
|
C Optimization 665
  C.1 Introduction 665
  C.2 Linear Optimization 667
  C.3 Convex Optimization 667
  C.4 Local Optimization 668
  C.5 Constrained Optimization 670
  C.6 References 671
Index 673