
E-book: Machine Learning: A Probabilistic Perspective

4.34/5 (1,021 ratings on Goodreads)
  • Pages: 1104
  • Series: Machine Learning
  • Publication date: 07-Sep-2012
  • Publisher: MIT Press
  • Language: English
  • ISBN-13: 9780262305242
  • Format: PDF + DRM
  • Price: €260.00*
  • * The price is final, i.e. no further discounts apply.
  • This e-book is intended for personal use only. E-books cannot be returned.

DRM restrictions

  • Copying (copy/paste):

    not allowed

  • Printing:

    not allowed

  • Usage:

    Digital rights management (DRM)
    The publisher has issued this e-book in encrypted form, which means you must install special software to read it. You will also need to create an Adobe ID (more information here). The e-book can be read by 1 user and downloaded to up to 6 devices (all authorized with the same Adobe ID).

    Required software
    To read on a mobile device (phone or tablet), install this free app: PocketBook Reader (iOS / Android)

    To read on a PC or Mac, install Adobe Digital Editions. (This is a free application designed specifically for reading e-books; it should not be confused with Adobe Reader, which is probably already installed on your computer.)

    This e-book cannot be read on an Amazon Kindle.

Today's Web-enabled deluge of electronic data calls for automated methods of data analysis. Machine learning provides these, developing methods that can automatically detect patterns in data and then use the uncovered patterns to predict future data. This textbook offers a comprehensive and self-contained introduction to the field of machine learning, based on a unified, probabilistic approach. The coverage combines breadth and depth, offering necessary background material on such topics as probability, optimization, and linear algebra as well as discussion of recent developments in the field, including conditional random fields, L1 regularization, and deep learning. The book is written in an informal, accessible style, complete with pseudo-code for the most important algorithms. All topics are copiously illustrated with color images and worked examples drawn from such application domains as biology, text processing, computer vision, and robotics. Rather than providing a cookbook of different heuristic methods, the book stresses a principled model-based approach, often using the language of graphical models to specify models in a concise and intuitive way. Almost all the models described have been implemented in a MATLAB software package--PMTK (probabilistic modeling toolkit)--that is freely available online. The book is suitable for upper-level undergraduates with an introductory-level college math background and beginning graduate students.
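
To give a flavor of the model-based approach described above, here is a minimal Python sketch of the beta-binomial model covered in Section 3.3 of the table of contents below. The example is illustrative only: it is not taken from the book or from PMTK, and the prior hyperparameters and coin-flip counts are invented.

    # Beta-binomial model: a conjugate Beta prior on the heads probability theta,
    # updated with (hypothetical) Bernoulli coin-flip data.
    a, b = 2.0, 2.0          # Beta(a, b) prior hyperparameters (assumed values)
    heads, tails = 7, 3      # invented observations

    # Conjugacy: Beta prior + Bernoulli likelihood => Beta posterior.
    a_post, b_post = a + heads, b + tails

    posterior_mean = a_post / (a_post + b_post)           # E[theta | data]
    map_estimate = (a_post - 1) / (a_post + b_post - 2)   # posterior mode (MAP)
    pred_heads = posterior_mean                           # p(next flip = heads | data)

    print(f"posterior: Beta({a_post:.0f}, {b_post:.0f})")
    print(f"posterior mean = {posterior_mean:.3f}, MAP estimate = {map_estimate:.3f}")
    print(f"posterior predictive p(heads) = {pred_heads:.3f}")

Here the posterior predictive probability of heads equals the posterior mean of theta; this kind of closed-form Bayesian update is what the early chapters work through before moving on to approximate inference.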

Additional information

Winner of the 2013 DeGroot Prize, awarded by the International Society for Bayesian Analysis.
Preface xxvii
1 Introduction 1(26)
1.1 Machine learning: what and why?
1(2)
1.1.1 Types of machine learning
2(1)
1.2 Supervised learning
3(6)
1.2.1 Classification
3(5)
1.2.2 Regression
8(1)
1.3 Unsupervised learning
9(7)
1.3.1 Discovering clusters
10(1)
1.3.2 Discovering latent factors
11(2)
1.3.3 Discovering graph structure
13(1)
1.3.4 Matrix completion
14(2)
1.4 Some basic concepts in machine learning
16(11)
1.4.1 Parametric vs non-parametric models
16(1)
1.4.2 A simple non-parametric classifier: K-nearest neighbors
16(2)
1.4.3 The curse of dimensionality
18(1)
1.4.4 Parametric models for classification and regression
19(1)
1.4.5 Linear regression
19(2)
1.4.6 Logistic regression
21(1)
1.4.7 Overfitting
22(1)
1.4.8 Model selection
22(2)
1.4.9 No free lunch theorem
24(3)
2 Probability 27(38)
2.1 Introduction
27(1)
2.2 A brief review of probability theory
28(6)
2.2.1 Discrete random variables
28(1)
2.2.2 Fundamental rules
28(1)
2.2.3 Bayes rule
29(1)
2.2.4 Independence and conditional independence
30(2)
2.2.5 Continuous random variables
32(1)
2.2.6 Quantiles
33(1)
2.2.7 Mean and variance
33(1)
2.3 Some common discrete distributions
34(4)
2.3.1 The binomial and Bernoulli distributions
34(1)
2.3.2 The multinomial and multinoulli distributions
35(2)
2.3.3 The Poisson distribution
37(1)
2.3.4 The empirical distribution
37(1)
2.4 Some common continuous distributions
38(6)
2.4.1 Gaussian (normal) distribution
38(1)
2.4.2 Degenerate pdf
39(2)
2.4.3 Laplace distribution
41(1)
2.4.4 The gamma distribution
41(1)
2.4.5 The beta distribution
42(1)
2.4.6 Pareto distribution
43(1)
2.5 Joint probability distributions
44(5)
2.5.1 Covariance and correlation
44(2)
2.5.2 The multivariate Gaussian
46(1)
2.5.3 Multivariate Student t distribution
46(1)
2.5.4 Dirichlet distribution
47(2)
2.6 Transformations of random variables
49(3)
2.6.1 Linear transformations
49(1)
2.6.2 General transformations
50(1)
2.6.3 Central limit theorem
51(1)
2.7 Monte Carlo approximation
52(4)
2.7.1 Example: change of variables, the MC way
53(1)
2.7.2 Example: estimating π by Monte Carlo integration
54(1)
2.7.3 Accuracy of Monte Carlo approximation
54(2)
2.8 Information theory
56(9)
2.8.1 Entropy
56(1)
2.8.2 KL divergence
57(2)
2.8.3 Mutual information
59(6)
3 Generative models for discrete data 65(32)
3.1 Introduction
65(1)
3.2 Bayesian concept learning
65(7)
3.2.1 Likelihood
67(1)
3.2.2 Prior
67(1)
3.2.3 Posterior
68(3)
3.2.4 Posterior predictive distribution
71(1)
3.2.5 A more complex prior
72(1)
3.3 The beta-binomial model
72(6)
3.3.1 Likelihood
73(1)
3.3.2 Prior
74(1)
3.3.3 Posterior
75(2)
3.3.4 Posterior predictive distribution
77(1)
3.4 The Dirichlet-multinomial model
78(4)
3.4.1 Likelihood
79(1)
3.4.2 Prior
79(1)
3.4.3 Posterior
79(2)
3.4.4 Posterior predictive
81(1)
3.5 Naive Bayes classifiers
82(15)
3.5.1 Model fitting
83(2)
3.5.2 Using the model for prediction
85(1)
3.5.3 The log-sum-exp trick
86(1)
3.5.4 Feature selection using mutual information
86(1)
3.5.5 Classifying documents using bag of words
87(10)
4 Gaussian models 97(52)
4.1 Introduction
97(4)
4.1.1 Notation
97(1)
4.1.2 Basics
97(2)
4.1.3 MLE for an MVN
99(2)
4.1.4 Maximum entropy derivation of the Gaussian
101(1)
4.2 Gaussian discriminant analysis
101(9)
4.2.1 Quadratic discriminant analysis (QDA)
102(1)
4.2.2 Linear discriminant analysis (LDA)
103(1)
4.2.3 Two-class LDA
104(2)
4.2.4 MLE for discriminant analysis
106(1)
4.2.5 Strategies for preventing overfitting
106(1)
4.2.6 Regularized LDA
107(1)
4.2.7 Diagonal LDA
108(1)
4.2.8 Nearest shrunken centroids classifier
109(1)
4.3 Inference in jointly Gaussian distributions
110(9)
4.3.1 Statement of the result
111(1)
4.3.2 Examples
111(4)
4.3.3 Information form
115(1)
4.3.4 Proof of the result
116(3)
4.4 Linear Gaussian systems
119(6)
4.4.1 Statement of the result
119(1)
4.4.2 Examples
120(4)
4.4.3 Proof of the result
124(1)
4.5 Digression: The Wishart distribution
125(2)
4.5.1 Inverse Wishart distribution
126(1)
4.5.2 Visualizing the Wishart distribution
127(1)
4.6 Inferring the parameters of an MVN
127(22)
4.6.1 Posterior distribution of µ
128(1)
4.6.2 Posterior distribution of Σ
128(4)
4.6.3 Posterior distribution of µ and Σ
132(6)
4.6.4 Sensor fusion with unknown precisions
138(11)
5 Bayesian statistics 149(42)
5.1 Introduction
149(1)
5.2 Summarizing posterior distributions
149(6)
5.2.1 MAP estimation
149(3)
5.2.2 Credible intervals
152(2)
5.2.3 Inference for a difference in proportions
154(1)
5.3 Bayesian model selection
155(10)
5.3.1 Bayesian Occam's razor
156(2)
5.3.2 Computing the marginal likelihood (evidence)
158(5)
5.3.3 Bayes factors
163(1)
5.3.4 Jeffreys-Lindley paradox
164(1)
5.4 Priors
165(6)
5.4.1 Uninformative priors
165(1)
5.4.2 Jeffreys priors
166(2)
5.4.3 Robust priors
168(1)
5.4.4 Mixtures of conjugate priors
168(3)
5.5 Hierarchical Bayes
171(1)
5.5.1 Example: modeling related cancer rates
171(1)
5.6 Empirical Bayes
172(4)
5.6.1 Example: beta-binomial model
173(1)
5.6.2 Example: Gaussian-Gaussian model
173(3)
5.7 Bayesian decision theory
176(15)
5.7.1 Bayes estimators for common loss functions
177(3)
5.7.2 The false positive vs false negative tradeoff
180(4)
5.7.3 Other topics
184(7)
6 Frequentist statistics 191(26)
6.1 Introduction
191(1)
6.2 Sampling distribution of an estimator
191(3)
6.2.1 Bootstrap
192(1)
6.2.2 Large sample theory for the MLE
193(1)
6.3 Frequentist decision theory
194(6)
6.3.1 Bayes risk
195(1)
6.3.2 Minimax risk
196(1)
6.3.3 Admissible estimators
197(3)
6.4 Desirable properties of estimators
200(4)
6.4.1 Consistent estimators
200(1)
6.4.2 Unbiased estimators
200(1)
6.4.3 Minimum variance estimators
201(1)
6.4.4 The bias-variance tradeoff
202(2)
6.5 Empirical risk minimization
204(7)
6.5.1 Regularized risk minimization
205(1)
6.5.2 Structural risk minimization
206(1)
6.5.3 Estimating the risk using cross validation
206(3)
6.5.4 Upper bounding the risk using statistical learning theory
209(1)
6.5.5 Surrogate loss functions
210(1)
6.6 Pathologies of frequentist statistics
211(6)
6.6.1 Counter-intuitive behavior of confidence intervals
212(1)
6.6.2 p-values considered harmful
213(1)
6.6.3 The likelihood principle
214(1)
6.6.4 Why isn't everyone a Bayesian?
215(2)
7 Linear regression 217(28)
7.1 Introduction
217(1)
7.2 Model specification
217(1)
7.3 Maximum likelihood estimation (least squares)
217(6)
7.3.1 Derivation of the MLE
219(1)
7.3.2 Geometric interpretation
220(1)
7.3.3 Convexity
221(2)
7.4 Robust linear regression
223(2)
7.5 Ridge regression
225(6)
7.5.1 Basic idea
225(2)
7.5.2 Numerically stable computation
227(1)
7.5.3 Connection with PCA
228(2)
7.5.4 Regularization effects of big data
230(1)
7.6 Bayesian linear regression
231(14)
7.6.1 Computing the posterior
232(1)
7.6.2 Computing the posterior predictive
233(1)
7.6.3 Bayesian inference when σ2 is unknown
234(4)
7.6.4 EB for linear regression (evidence procedure)
238(7)
8 Logistic regression 245(36)
8.1 Introduction
245(1)
8.2 Model specification
245(1)
8.3 Model fitting
245(9)
8.3.1 MLE
246(1)
8.3.2 Steepest descent
247(2)
8.3.3 Newton's method
249(1)
8.3.4 Iteratively reweighted least squares (IRLS)
250(1)
8.3.5 Quasi-Newton (variable metric) methods
251(1)
8.3.6 l2 regularization
252(1)
8.3.7 Multi-class logistic regression
252(2)
8.4 Bayesian logistic regression
254(7)
8.4.1 Laplace approximation
255(1)
8.4.2 Derivation of the BIC
255(1)
8.4.3 Gaussian approximation for logistic regression
256(1)
8.4.4 Approximating the posterior predictive
256(4)
8.4.5 Residual analysis (outlier detection)
260(1)
8.5 Online learning and stochastic optimization
261(6)
8.5.1 Online learning and regret minimization
262(1)
8.5.2 Stochastic optimization and risk minimization
262(2)
8.5.3 The LMS algorithm
264(1)
8.5.4 The perceptron algorithm
265(1)
8.5.5 A Bayesian view
266(1)
8.6 Generative vs discriminative classifiers
267(14)
8.6.1 Pros and cons of each approach
268(1)
8.6.2 Dealing with missing data
269(2)
8.6.3 Fisher's linear discriminant analysis (FLDA)
271(10)
9 Generalized linear models and the exponential family 281(26)
9.1 Introduction
281(1)
9.2 The exponential family
281(9)
9.2.1 Definition
282(1)
9.2.2 Examples
282(2)
9.2.3 Log partition function
284(2)
9.2.4 MLE for the exponential family
286(1)
9.2.5 Bayes for the exponential family
287(2)
9.2.6 Maximum entropy derivation of the exponential family
289(1)
9.3 Generalized linear models (GLMs)
290(3)
9.3.1 Basics
290(2)
9.3.2 ML and MAP estimation
292(1)
9.3.3 Bayesian inference
293(1)
9.4 Probit regression
293(3)
9.4.1 ML/MAP estimation using gradient-based optimization
294(1)
9.4.2 Latent variable interpretation
294(1)
9.4.3 Ordinal probit regression
295(1)
9.4.4 Multinomial probit models
295(1)
9.5 Multi-task learning
296(2)
9.5.1 Hierarchical Bayes for multi-task learning
296(1)
9.5.2 Application to personalized email spam filtering
296(1)
9.5.3 Application to domain adaptation
297(1)
9.5.4 Other kinds of prior
297(1)
9.6 Generalized linear mixed models
298(2)
9.6.1 Example: semi-parametric GLMMs for medical data
298(2)
9.6.2 Computational issues
300(1)
9.7 Learning to rank
300(7)
9.7.1 The pointwise approach
301(1)
9.7.2 The pairwise approach
301(1)
9.7.3 The listwise approach
302(1)
9.7.4 Loss functions for ranking
303(4)
10 Directed graphical models (Bayes nets) 307(30)
10.1 Introduction
307(4)
10.1.1 Chain rule
307(1)
10.1.2 Conditional independence
308(1)
10.1.3 Graphical models
308(1)
10.1.4 Graph terminology
309(1)
10.1.5 Directed graphical models
310(1)
10.2 Examples
311(8)
10.2.1 Naive Bayes classifiers
311(1)
10.2.2 Markov and hidden Markov models
312(1)
10.2.3 Medical diagnosis
313(2)
10.2.4 Genetic linkage analysis
315(3)
10.2.5 Directed Gaussian graphical models
318(1)
10.3 Inference
319(1)
10.4 Learning
320(4)
10.4.1 Plate notation
320(2)
10.4.2 Learning from complete data
322(1)
10.4.3 Learning with missing and/or latent variables
323(1)
10.5 Conditional independence properties of DGMs
324(4)
10.5.1 d-separation and the Bayes Ball algorithm (global Markov properties)
324(3)
10.5.2 Other Markov properties of DGMs
327(1)
10.5.3 Markov blanket and full conditionals
327(1)
10.6 Influence (decision) diagrams
328(9)
11 Mixture models and the EM algorithm 337(44)
11.1 Latent variable models
337(1)
11.2 Mixture models
337(8)
11.2.1 Mixtures of Gaussians
339(1)
11.2.2 Mixture of multinoullis
340(1)
11.2.3 Using mixture models for clustering
340(2)
11.2.4 Mixtures of experts
342(3)
11.3 Parameter estimation for mixture models
345(3)
11.3.1 Unidentifiability
346(1)
11.3.2 Computing a MAP estimate is non-convex
347(1)
11.4 The EM algorithm
348(22)
11.4.1 Basic idea
349(1)
11.4.2 EM for GMMs
350(7)
11.4.3 EM for mixture of experts
357(1)
11.4.4 EM for DGMs with hidden variables
358(1)
11.4.5 EM for the Student distribution
359(3)
11.4.6 EM for probit regression
362(1)
11.4.7 Theoretical basis for EM
363(2)
11.4.8 Online EM
365(2)
11.4.9 Other EM variants
367(3)
11.5 Model selection for latent variable models
370(2)
11.5.1 Model selection for probabilistic models
370(1)
11.5.2 Model selection for non-probabilistic methods
370(2)
11.6 Fitting models with missing data
372(9)
11.6.1 EM for the MLE of an MVN with missing data
373(8)
12 Latent linear models 381(40)
12.1 Factor analysis
381(6)
12.1.1 FA is a low rank parameterization of an MVN
381(1)
12.1.2 Inference of the latent factors
382(1)
12.1.3 Unidentifiability
383(2)
12.1.4 Mixtures of factor analysers
385(1)
12.1.5 EM for factor analysis models
386(1)
12.1.6 Fitting FA models with missing data
387(1)
12.2 Principal components analysis (PCA)
387(11)
12.2.1 Classical PCA: statement of the theorem
387(2)
12.2.2 Proof
389(3)
12.2.3 Singular value decomposition (SVD)
392(3)
12.2.4 Probabilistic PCA
395(1)
12.2.5 EM algorithm for PCA
396(2)
12.3 Choosing the number of latent dimensions
398(4)
12.3.1 Model selection for FA/PPCA
398(1)
12.3.2 Model selection for PCA
399(3)
12.4 PCA for categorical data
402(2)
12.5 PCA for paired and multi-view data
404(3)
12.5.1 Supervised PCA (latent factor regression)
405(1)
12.5.2 Partial least squares
406(1)
12.5.3 Canonical correlation analysis
407(1)
12.6 Independent Component Analysis (ICA)
407(14)
12.6.1 Maximum likelihood estimation
410(1)
12.6.2 The FastICA algorithm
411(3)
12.6.3 Using EM
414(1)
12.6.4 Other estimation principles
415(6)
13 Sparse linear models 421(58)
13.1 Introduction
421(1)
13.2 Bayesian variable selection
422(7)
13.2.1 The spike and slab model
424(1)
13.2.2 From the Bernoulli-Gaussian model to l0 regularization
425(1)
13.2.3 Algorithms
426(3)
13.3 l1 regularization: basics
429(12)
13.3.1 Why does l1 regularization yield sparse solutions?
430(1)
13.3.2 Optimality conditions for lasso
431(4)
13.3.3 Comparison of least squares, lasso, ridge and subset selection
435(1)
13.3.4 Regularization path
436(3)
13.3.5 Model selection
439(1)
13.3.6 Bayesian inference for linear models with Laplace priors
440(1)
13.4 l1 regularization: algorithms
441(8)
13.4.1 Coordinate descent
441(1)
13.4.2 LARS and other homotopy methods
441(1)
13.4.3 Proximal and gradient projection methods
442(5)
13.4.4 EM for lasso
447(2)
13.5 l1 regularization: extensions
449(8)
13.5.1 Group Lasso
449(5)
13.5.2 Fused lasso
454(1)
13.5.3 Elastic net (ridge and lasso combined)
455(2)
13.6 Non-convex regularizers
457(6)
13.6.1 Bridge regression
458(1)
13.6.2 Hierarchical adaptive lasso
458(4)
13.6.3 Other hierarchical priors
462(1)
13.7 Automatic relevance determination (ARD)/sparse Bayesian learning (SBL)
463(5)
13.7.1 ARD for linear regression
463(2)
13.7.2 Whence sparsity?
465(1)
13.7.3 Connection to MAP estimation
465(1)
13.7.4 Algorithms for ARD
466(2)
13.7.5 ARD for logistic regression
468(1)
13.8 Sparse coding
468(11)
13.8.1 Learning a sparse coding dictionary
469(1)
13.8.2 Results of dictionary learning from image patches
470(2)
13.8.3 Compressed sensing
472(1)
13.8.4 Image inpainting and denoising
472(7)
14 Kernels 479(36)
14.1 Introduction
479(1)
14.2 Kernel functions
479(7)
14.2.1 RBF kernels
480(1)
14.2.2 Kernels for comparing documents
480(1)
14.2.3 Mercer (positive definite) kernels
481(1)
14.2.4 Linear kernels
482(1)
14.2.5 Matern kernels
482(1)
14.2.6 String kernels
483(1)
14.2.7 Pyramid match kernels
484(1)
14.2.8 Kernels derived from probabilistic generative models
485(1)
14.3 Using kernels inside GLMs
486(2)
14.3.1 Kernel machines
486(1)
14.3.2 L1VMs, RVMs, and other sparse vector machines
487(1)
14.4 The kernel trick
488(8)
14.4.1 Kernelized nearest neighbor classification
489(1)
14.4.2 Kernelized K-medoids clustering
489(3)
14.4.3 Kernelized ridge regression
492(1)
14.4.4 Kernel PCA
493(3)
14.5 Support vector machines (SVMs)
496(9)
14.5.1 SVMs for regression
497(1)
14.5.2 SVMs for classification
498(6)
14.5.3 Choosing C
504(1)
14.5.4 Summary of key points
504(1)
14.5.5 A probabilistic interpretation of SVMs
505(1)
14.6 Comparison of discriminative kernel methods
505(2)
14.7 Kernels for building generative models
507(8)
14.7.1 Smoothing kernels
507(1)
14.7.2 Kernel density estimation (KDE)
508(1)
14.7.3 From KDE to KNN
509(1)
14.7.4 Kernel regression
510(2)
14.7.5 Locally weighted regression
512(3)
15 Gaussian processes 515(28)
15.1 Introduction
515(1)
15.2 GPs for regression
516(9)
15.2.1 Predictions using noise-free observations
517(1)
15.2.2 Predictions using noisy observations
518(1)
15.2.3 Effect of the kernel parameters
519(2)
15.2.4 Estimating the kernel parameters
521(3)
15.2.5 Computational and numerical issues
524(1)
15.2.6 Semi-parametric GPs
524(1)
15.3 GPs meet GLMs
525(7)
15.3.1 Binary classification
525(3)
15.3.2 Multi-class classification
528(3)
15.3.3 GPs for Poisson regression
531(1)
15.4 Connection with other methods
532(8)
15.4.1 Linear models compared to GPs
532(1)
15.4.2 Linear smoothers compared to GPs
533(1)
15.4.3 SVMs compared to GPs
534(1)
15.4.4 L1VM and RVM compared to GPs
534(1)
15.4.5 Neural networks compared to GPs
535(1)
15.4.6 Smoothing splines compared to GPs
536(2)
15.4.7 RKHS methods compared to GPs
538(2)
15.5 GP latent variable model
540(2)
15.6 Approximation methods for large datasets
542(1)
16 Adaptive basis function models 543(46)
16.1 Introduction
543(1)
16.2 Classification and regression trees (CART)
544(8)
16.2.1 Basics
544(1)
16.2.2 Growing a tree
545(4)
16.2.3 Pruning a tree
549(1)
16.2.4 Pros and cons of trees
550(1)
16.2.5 Random forests
550(1)
16.2.6 CART compared to hierarchical mixture of experts
551(1)
16.3 Generalized additive models
552(2)
16.3.1 Backfitting
552(1)
16.3.2 Computational efficiency
553(1)
16.3.3 Multivariate adaptive regression splines (MARS)
553(1)
16.4 Boosting
554(9)
16.4.1 Forward stagewise additive modeling
555(2)
16.4.2 L2boosting
557(1)
16.4.3 AdaBoost
558(1)
16.4.4 LogitBoost
559(1)
16.4.5 Boosting as functional gradient descent
560(1)
16.4.6 Sparse boosting
561(1)
16.4.7 Multivariate adaptive regression trees (MART)
562(1)
16.4.8 Why does boosting work so well?
562(1)
16.4.9 A Bayesian view
563(1)
16.5 Feedforward neural networks (multilayer perceptrons)
563(17)
16.5.1 Convolutional neural networks
564(4)
16.5.2 Other kinds of neural networks
568(1)
16.5.3 A brief history of the field
568(1)
16.5.4 The backpropagation algorithm
569(3)
16.5.5 Identifiability
572(1)
16.5.6 Regularization
572(4)
16.5.7 Bayesian inference
576(4)
16.6 Ensemble learning
580(2)
16.6.1 Stacking
580(1)
16.6.2 Error-correcting output codes
581(1)
16.6.3 Ensemble learning is not equivalent to Bayes model averaging
581(1)
16.7 Experimental comparison
582(3)
16.7.1 Low-dimensional features
582(1)
16.7.2 High-dimensional features
583(2)
16.8 Interpreting black-box models
585(4)
17 Markov and hidden Markov models 589(42)
17.1 Introduction
589(1)
17.2 Markov models
589(14)
17.2.1 Transition matrix
589(2)
17.2.2 Application: Language modeling
591(5)
17.2.3 Stationary distribution of a Markov chain
596(4)
17.2.4 Application: Google's PageRank algorithm for web page ranking
600(3)
17.3 Hidden Markov models
603(3)
17.3.1 Applications of HMMs
604(2)
17.4 Inference in HMMs
606(11)
17.4.1 Types of inference problems for temporal models
606(3)
17.4.2 The forwards algorithm
609(1)
17.4.3 The forwards-backwards algorithm
610(2)
17.4.4 The Viterbi algorithm
612(4)
17.4.5 Forwards filtering, backwards sampling
616(1)
17.5 Learning for HMMs
617(4)
17.5.1 Training with fully observed data
617(1)
17.5.2 EM for HMMs (the Baum-Welch algorithm)
618(2)
17.5.3 Bayesian methods for "fitting" HMMs
620(1)
17.5.4 Discriminative training
620(1)
17.5.5 Model selection
621(1)
17.6 Generalizations of HMMs
621(10)
17.6.1 Variable duration (semi-Markov) HMMs
622(2)
17.6.2 Hierarchical HMMs
624(1)
17.6.3 Input-output HMMs
625(1)
17.6.4 Auto-regressive and buried HMMs
626(1)
17.6.5 Factorial HMM
627(1)
17.6.6 Coupled HMM and the influence model
628(1)
17.6.7 Dynamic Bayesian networks (DBNs)
628(3)
18 State space models 631(30)
18.1 Introduction
631(1)
18.2 Applications of SSMs
632(8)
18.2.1 SSMs for object tracking
632(1)
18.2.2 Robotic SLAM
633(3)
18.2.3 Online parameter learning using recursive least squares
636(1)
18.2.4 SSM for time series forecasting
637(3)
18.3 Inference in LG-SSM
640(6)
18.3.1 The Kalman filtering algorithm
640(3)
18.3.2 The Kalman smoothing algorithm
643(3)
18.4 Learning for LG-SSM
646(1)
18.4.1 Identifiability and numerical stability
646(1)
18.4.2 Training with fully observed data
647(1)
18.4.3 EM for LG-SSM
647(1)
18.4.4 Subspace methods
647(1)
18.4.5 Bayesian methods for "fitting" LG-SSMs
647(1)
18.5 Approximate online inference for non-linear, non-Gaussian SSMs
647(8)
18.5.1 Extended Kalman filter (EKF)
648(2)
18.5.2 Unscented Kalman filter (UKF)
650(2)
18.5.3 Assumed density filtering (ADF)
652(3)
18.6 Hybrid discrete/continuous SSMs
655(6)
18.6.1 Inference
656(2)
18.6.2 Application: data association and multi-target tracking
658(1)
18.6.3 Application: fault diagnosis
659(1)
18.6.4 Application: econometric forecasting
660(1)
19 Undirected graphical models (Markov random fields) 661(46)
19.1 Introduction
661(1)
19.2 Conditional independence properties of UGMs
661(4)
19.2.1 Key properties
661(2)
19.2.2 An undirected alternative to d-separation
663(1)
19.2.3 Comparing directed and undirected graphical models
664(1)
19.3 Parameterization of MRFs
665(3)
19.3.1 The Hammersley-Clifford theorem
665(2)
19.3.2 Representing potential functions
667(1)
19.4 Examples of MRFs
668(8)
19.4.1 Ising model
668(1)
19.4.2 Hopfield networks
669(2)
19.4.3 Potts model
671(1)
19.4.4 Gaussian MRFs
672(2)
19.4.5 Markov logic networks
674(2)
19.5 Learning
676(8)
19.5.1 Training maxent models using gradient methods
676(1)
19.5.2 Training partially observed maxent models
677(1)
19.5.3 Approximate methods for computing the MLEs of MRFs
678(1)
19.5.4 Pseudo likelihood
678(1)
19.5.5 Stochastic maximum likelihood
679(1)
19.5.6 Feature induction for maxent models
680(1)
19.5.7 Iterative proportional fitting (IPF)
681(3)
19.6 Conditional random fields (CRFs)
684(9)
19.6.1 Chain-structured CRFs, MEMMs and the label-bias problem
684(2)
19.6.2 Applications of CRFs
686(6)
19.6.3 CRF training
692(1)
19.7 Structural SVMs
693(14)
19.7.1 SSVMs: a probabilistic view
693(2)
19.7.2 SSVMs: a non-probabilistic view
695(3)
19.7.3 Cutting plane methods for fitting SSVMs
698(2)
19.7.4 Online algorithms for fitting SSVMs
700(1)
19.7.5 Latent structural SVMs
701(6)
20 Exact inference for graphical models 707(24)
20.1 Introduction
707(1)
20.2 Belief propagation for trees
707(7)
20.2.1 Serial protocol
707(2)
20.2.2 Parallel protocol
709(1)
20.2.3 Gaussian BP
710(2)
20.2.4 Other BP variants
712(2)
20.3 The variable elimination algorithm
714(6)
20.3.1 The generalized distributive law
717(1)
20.3.2 Computational complexity of VE
717(3)
20.3.3 A weakness of VE
720(1)
20.4 The junction tree algorithm
720(6)
20.4.1 Creating a junction tree
720(2)
20.4.2 Message passing on a junction tree
722(3)
20.4.3 Computational complexity of JTA
725(1)
20.4.4 JTA generalizations
726(1)
20.5 Computational intractability of exact inference in the worst case
726(5)
20.5.1 Approximate inference
727(4)
21 Variational inference 731(36)
21.1 Introduction
731(1)
21.2 Variational inference
732(3)
21.2.1 Alternative interpretations of the variational objective
733(1)
21.2.2 Forward or reverse KL?
733(2)
21.3 The mean field method
735(4)
21.3.1 Derivation of the mean field update equations
736(1)
21.3.2 Example: mean field for the Ising model
737(2)
21.4 Structured mean field
739(3)
21.4.1 Example: factorial HMM
740(2)
21.5 Variational Bayes
742(7)
21.5.1 Example: VB for a univariate Gaussian
742(4)
21.5.2 Example: VB for linear regression
746(3)
21.6 Variational Bayes EM
749(7)
21.6.1 Example: VBEM for mixtures of Gaussians
750(6)
21.7 Variational message passing and VIBES
756(1)
21.8 Local variational bounds
756(11)
21.8.1 Motivating applications
756(2)
21.8.2 Bohning's quadratic bound to the log-sum-exp function
758(2)
21.8.3 Bounds for the sigmoid function
760(2)
21.8.4 Other bounds and approximations to the log-sum-exp function
762(1)
21.8.5 Variational inference based on upper bounds
763(4)
22 More variational inference 767(48)
22.1 Introduction
767(1)
22.2 Loopy belief propagation: algorithmic issues
767(9)
22.2.1 A brief history
767(1)
22.2.2 LBP on pairwise models
768(1)
22.2.3 LBP on a factor graph
769(2)
22.2.4 Convergence
771(3)
22.2.5 Accuracy of LBP
774(1)
22.2.6 Other speedup tricks for LBP
775(1)
22.3 Loopy belief propagation: theoretical issues
776(7)
22.3.1 UGMs represented in exponential family form
776(1)
22.3.2 The marginal polytope
777(1)
22.3.3 Exact inference as a variational optimization problem
778(1)
22.3.4 Mean field as a variational optimization problem
779(1)
22.3.5 LBP as a variational optimization problem
779(4)
22.3.6 Loopy BP vs mean field
783(1)
22.4 Extensions of belief propagation
783(4)
22.4.1 Generalized belief propagation
783(2)
22.4.2 Convex belief propagation
785(2)
22.5 Expectation propagation
787(12)
22.5.1 EP as a variational inference problem
788(1)
22.5.2 Optimizing the EP objective using moment matching
789(2)
22.5.3 EP for the clutter problem
791(1)
22.5.4 LBP is a special case of EP
792(1)
22.5.5 Ranking players using TrueSkill
793(6)
22.5.6 Other applications of EP
799(1)
22.6 MAP state estimation
799(16)
22.6.1 Linear programming relaxation
799(1)
22.6.2 Max-product belief propagation
800(1)
22.6.3 Graphcuts
801(3)
22.6.4 Experimental comparison of graphcuts and BP
804(2)
22.6.5 Dual decomposition
806(9)
23 Monte Carlo inference 815(22)
23.1 Introduction
815(1)
23.2 Sampling from standard distributions
815(2)
23.2.1 Using the cdf
815(2)
23.2.2 Sampling from a Gaussian (Box-Muller method)
817(1)
23.3 Rejection sampling
817(3)
23.3.1 Basic idea
817(1)
23.3.2 Example
818(1)
23.3.3 Application to Bayesian statistics
819(1)
23.3.4 Adaptive rejection sampling
819(1)
23.3.5 Rejection sampling in high dimensions
820(1)
23.4 Importance sampling
820(3)
23.4.1 Basic idea
820(1)
23.4.2 Handling unnormalized distributions
821(1)
23.4.3 Importance sampling for a DGM: likelihood weighting
822(1)
23.4.4 Sampling importance resampling (SIR)
822(1)
23.5 Particle filtering
823(8)
23.5.1 Sequential importance sampling
824(1)
23.5.2 The degeneracy problem
825(1)
23.5.3 The resampling step
825(2)
23.5.4 The proposal distribution
827(1)
23.5.5 Application: robot localization
828(1)
23.5.6 Application: visual object tracking
828(3)
23.5.7 Application: time series forecasting
831(1)
23.6 Rao-Blackwellised particle filtering (RBPF)
831(6)
23.6.1 RBPF for switching LG-SSMs
831(1)
23.6.2 Application: tracking a maneuvering target
832(2)
23.6.3 Application: FastSLAM
834(3)
24 Markov chain Monte Carlo (MCMC) inference 837(38)
24.1 Introduction
837(1)
24.2 Gibbs sampling
838(10)
24.2.1 Basic idea
838(1)
24.2.2 Example: Gibbs sampling for the Ising model
838(2)
24.2.3 Example: Gibbs sampling for inferring the parameters of a GMM
840(1)
24.2.4 Collapsed Gibbs sampling
841(3)
24.2.5 Gibbs sampling for hierarchical GLMs
844(2)
24.2.6 BUGS and JAGS
846(1)
24.2.7 The Imputation Posterior (IP) algorithm
847(1)
24.2.8 Blocking Gibbs sampling
847(1)
24.3 Metropolis Hastings algorithm
848(8)
24.3.1 Basic idea
848(1)
24.3.2 Gibbs sampling is a special case of MH
849(1)
24.3.3 Proposal distributions
850(3)
24.3.4 Adaptive MCMC
853(1)
24.3.5 Initialization and mode hopping
854(1)
24.3.6 Why MH works
854(1)
24.3.7 Reversible jump (trans-dimensional) MCMC
855(1)
24.4 Speed and accuracy of MCMC
856(7)
24.4.1 The burn-in phase
856(1)
24.4.2 Mixing rates of Markov chains
857(1)
24.4.3 Practical convergence diagnostics
858(2)
24.4.4 Accuracy of MCMC
860(2)
24.4.5 How many chains?
862(1)
24.5 Auxiliary variable MCMC
863(5)
24.5.1 Auxiliary variable sampling for logistic regression
863(1)
24.5.2 Slice sampling
864(2)
24.5.3 Swendsen Wang
866(2)
24.5.4 Hybrid/Hamiltonian MCMC
868(1)
24.6 Annealing methods
868(4)
24.6.1 Simulated annealing
869(2)
24.6.2 Annealed importance sampling
871(1)
24.6.3 Parallel tempering
871(1)
24.7 Approximating the marginal likelihood
872(3)
24.7.1 The candidate method
872(1)
24.7.2 Harmonic mean estimate
872(1)
24.7.3 Annealed importance sampling
873(2)
25 Clustering 875(32)
25.1 Introduction
875(4)
25.1.1 Measuring (dis)similarity
875(1)
25.1.2 Evaluating the output of clustering methods
876(3)
25.2 Dirichlet process mixture models
879(8)
25.2.1 From finite to infinite mixture models
879(3)
25.2.2 The Dirichlet process
882(3)
25.2.3 Applying Dirichlet processes to mixture modeling
885(1)
25.2.4 Fitting a DP mixture model
886(1)
25.3 Affinity propagation
887(3)
25.4 Spectral clustering
890(3)
25.4.1 Graph Laplacian
891(1)
25.4.2 Normalized graph Laplacian
892(1)
25.4.3 Example
893(1)
25.5 Hierarchical clustering
893(8)
25.5.1 Agglomerative clustering
895(3)
25.5.2 Divisive clustering
898(1)
25.5.3 Choosing the number of clusters
899(1)
25.5.4 Bayesian hierarchical clustering
899(2)
25.6 Clustering datapoints and features
901(6)
25.6.1 Biclustering
903(1)
25.6.2 Multi-view clustering
903(4)
26 Graphical model structure learning 907(38)
26.1 Introduction
907(1)
26.2 Structure learning for knowledge discovery
908(2)
26.2.1 Relevance networks
908(1)
26.2.2 Dependency networks
909(1)
26.3 Learning tree structures
910(4)
26.3.1 Directed or undirected tree?
911(1)
26.3.2 Chow-Liu algorithm for finding the ML tree structure
912(1)
26.3.3 Finding the MAP forest
912(2)
26.3.4 Mixtures of trees
914(1)
26.4 Learning DAG structures
914(8)
26.4.1 Markov equivalence
914(2)
26.4.2 Exact structural inference
916(4)
26.4.3 Scaling up to larger graphs
920(2)
26.5 Learning DAG structure with latent variables
922(9)
26.5.1 Approximating the marginal likelihood when we have missing data
922(3)
26.5.2 Structural EM
925(1)
26.5.3 Discovering hidden variables
926(2)
26.5.4 Case study: Google's Rephil
928(1)
26.5.5 Structural equation models
929(2)
26.6 Learning causal DAGs
931(7)
26.6.1 Causal interpretation of DAGs
931(2)
26.6.2 Using causal DAGs to resolve Simpson's paradox
933(2)
26.6.3 Learning causal DAG structures
935(3)
26.7 Learning undirected Gaussian graphical models
938(4)
26.7.1 MLE for a GGM
938(1)
26.7.2 Graphical lasso
939(2)
26.7.3 Bayesian inference for GGM structure
941(1)
26.7.4 Handling non-Gaussian data using copulas
942(1)
26.8 Learning undirected discrete graphical models
942(3)
26.8.1 Graphical lasso for MRFs/CRFs
942(2)
26.8.2 Thin junction trees
944(1)
27 Latent variable models for discrete data 945(50)
27.1 Introduction
945(1)
27.2 Distributed state LVMs for discrete data
946(4)
27.2.1 Mixture models
946(1)
27.2.2 Exponential family PCA
947(1)
27.2.3 LDA and mPCA
948(1)
27.2.4 GaP model and non-negative matrix factorization
949(1)
27.3 Latent Dirichlet allocation (LDA)
950(11)
27.3.1 Basics
950(3)
27.3.2 Unsupervised discovery of topics
953(1)
27.3.3 Quantitatively evaluating LDA as a language model
953(2)
27.3.4 Fitting using (collapsed) Gibbs sampling
955(1)
27.3.5 Example
956(1)
27.3.6 Fitting using batch variational inference
957(2)
27.3.7 Fitting using online variational inference
959(1)
27.3.8 Determining the number of topics
960(1)
27.4 Extensions of LDA
961(9)
27.4.1 Correlated topic model
961(1)
27.4.2 Dynamic topic model
962(1)
27.4.3 LDA-HMM
963(4)
27.4.4 Supervised LDA
967(3)
27.5 LVMs for graph-structured data
970(5)
27.5.1 Stochastic block model
971(2)
27.5.2 Mixed membership stochastic block model
973(1)
27.5.3 Relational topic model
974(1)
27.6 LVMs for relational data
975(8)
27.6.1 Infinite relational model
976(3)
27.6.2 Probabilistic matrix factorization for collaborative filtering
979(4)
27.7 Restricted Boltzmann machines (RBMs)
983(12)
27.7.1 Varieties of RBMs
985(2)
27.7.2 Learning RBMs
987(4)
27.7.3 Applications of RBMs
991(4)
28 Deep learning 995(14)
28.1 Introduction
995(1)
28.2 Deep generative models
995(4)
28.2.1 Deep directed networks
996(1)
28.2.2 Deep Boltzmann machines
996(1)
28.2.3 Deep belief networks
997(1)
28.2.4 Greedy layer-wise learning of DBNs
998(1)
28.3 Deep neural networks
999(2)
28.3.1 Deep multi-layer perceptrons
999(1)
28.3.2 Deep auto-encoders
1000(1)
28.3.3 Stacked denoising auto-encoders
1001(1)
28.4 Applications of deep networks
1001(4)
28.4.1 Handwritten digit classification using DBNs
1001(1)
28.4.2 Data visualization and feature discovery using deep auto-encoders
1002(1)
28.4.3 Information retrieval using deep auto-encoders (semantic hashing)
1003(1)
28.4.4 Learning audio features using 1d convolutional DBNs
1004(1)
28.4.5 Learning image features using 2d convolutional DBNs
1005(1)
28.5 Discussion
1005(4)
Notation 1009(6)
Bibliography 1015(32)
Indexes 1047
Index to code
1047(3)
Index to keywords
1050