Preface | xxvii |
1 Introduction | 1 | (26)
1.1 Machine learning: what and why? | 1 | (2)
1.1.1 Types of machine learning | 2 | (1)
| 3 | (6)
| 3 | (5)
| 8 | (1)
1.3 Unsupervised learning | 9 | (7)
1.3.1 Discovering clusters | 10 | (1)
1.3.2 Discovering latent factors | 11 | (2)
1.3.3 Discovering graph structure | 13 | (1)
| 14 | (2)
1.4 Some basic concepts in machine learning | 16 | (11)
1.4.1 Parametric vs non-parametric models | 16 | (1)
1.4.2 A simple non-parametric classifier: K-nearest neighbors | 16 | (2)
1.4.3 The curse of dimensionality | 18 | (1)
1.4.4 Parametric models for classification and regression | 19 | (1)
| 19 | (2)
1.4.6 Logistic regression | 21 | (1)
| 22 | (1)
| 22 | (2)
1.4.9 No free lunch theorem | 24 | (3)

2 Probability | 27 | (38)
| 27 | (1)
2.2 A brief review of probability theory | 28 | (6)
2.2.1 Discrete random variables | 28 | (1)
| 28 | (1)
| 29 | (1)
2.2.4 Independence and conditional independence | 30 | (2)
2.2.5 Continuous random variables | 32 | (1)
| 33 | (1)
| 33 | (1)
2.3 Some common discrete distributions | 34 | (4)
2.3.1 The binomial and Bernoulli distributions | 34 | (1)
2.3.2 The multinomial and multinoulli distributions | 35 | (2)
2.3.3 The Poisson distribution | 37 | (1)
2.3.4 The empirical distribution | 37 | (1)
2.4 Some common continuous distributions | 38 | (6)
2.4.1 Gaussian (normal) distribution | 38 | (1)
| 39 | (2)
2.4.3 Laplace distribution | 41 | (1)
2.4.4 The gamma distribution | 41 | (1)
2.4.5 The beta distribution | 42 | (1)
2.4.6 Pareto distribution | 43 | (1)
2.5 Joint probability distributions | 44 | (5)
2.5.1 Covariance and correlation | 44 | (2)
2.5.2 The multivariate Gaussian | 46 | (1)
2.5.3 Multivariate Student t distribution | 46 | (1)
2.5.4 Dirichlet distribution | 47 | (2)
2.6 Transformations of random variables | 49 | (3)
2.6.1 Linear transformations | 49 | (1)
2.6.2 General transformations | 50 | (1)
2.6.3 Central limit theorem | 51 | (1)
2.7 Monte Carlo approximation | 52 | (4)
2.7.1 Example: change of variables, the MC way | 53 | (1)
2.7.2 Example: estimating π by Monte Carlo integration | 54 | (1)
2.7.3 Accuracy of Monte Carlo approximation | 54 | (2)
| 56 | (9)
| 56 | (1)
| 57 | (2)
| 59 | (6)

3 Generative models for discrete data | 65 | (32)
| 65 | (1)
3.2 Bayesian concept learning | 65 | (7)
| 67 | (1)
| 67 | (1)
| 68 | (3)
3.2.4 Posterior predictive distribution | 71 | (1)
3.2.5 A more complex prior | 72 | (1)
3.3 The beta-binomial model | 72 | (6)
| 73 | (1)
| 74 | (1)
| 75 | (2)
3.3.4 Posterior predictive distribution | 77 | (1)
3.4 The Dirichlet-multinomial model | 78 | (4)
| 79 | (1)
| 79 | (1)
| 79 | (2)
3.4.4 Posterior predictive | 81 | (1)
3.5 Naive Bayes classifiers | 82 | (15)
| 83 | (2)
3.5.2 Using the model for prediction | 85 | (1)
3.5.3 The log-sum-exp trick | 86 | (1)
3.5.4 Feature selection using mutual information | 86 | (1)
3.5.5 Classifying documents using bag of words | 87 | (10)

4 Gaussian models | 97 | (52)
| 97 | (4)
| 97 | (1)
| 97 | (2)
| 99 | (2)
4.1.4 Maximum entropy derivation of the Gaussian | 101 | (1)
4.2 Gaussian discriminant analysis | 101 | (9)
4.2.1 Quadratic discriminant analysis (QDA) | 102 | (1)
4.2.2 Linear discriminant analysis (LDA) | 103 | (1)
| 104 | (2)
4.2.4 MLE for discriminant analysis | 106 | (1)
4.2.5 Strategies for preventing overfitting | 106 | (1)
| 107 | (1)
| 108 | (1)
4.2.8 Nearest shrunken centroids classifier | 109 | (1)
4.3 Inference in jointly Gaussian distributions | 110 | (9)
4.3.1 Statement of the result | 111 | (1)
| 111 | (4)
| 115 | (1)
4.3.4 Proof of the result | 116 | (3)
4.4 Linear Gaussian systems | 119 | (6)
4.4.1 Statement of the result | 119 | (1)
| 120 | (4)
4.4.3 Proof of the result | 124 | (1)
4.5 Digression: The Wishart distribution | 125 | (2)
4.5.1 Inverse Wishart distribution | 126 | (1)
4.5.2 Visualizing the Wishart distribution | 127 | (1)
4.6 Inferring the parameters of an MVN | 127 | (22)
4.6.1 Posterior distribution of µ | 128 | (1)
4.6.2 Posterior distribution of Σ | 128 | (4)
4.6.3 Posterior distribution of µ and Σ | 132 | (6)
4.6.4 Sensor fusion with unknown precisions | 138 | (11)

5 Bayesian statistics | 149 | (42)
| 149 | (1)
5.2 Summarizing posterior distributions | 149 | (6)
| 149 | (3)
| 152 | (2)
5.2.3 Inference for a difference in proportions | 154 | (1)
5.3 Bayesian model selection | 155 | (10)
5.3.1 Bayesian Occam's razor | 156 | (2)
5.3.2 Computing the marginal likelihood (evidence) | 158 | (5)
| 163 | (1)
5.3.4 Jeffreys-Lindley paradox | 164 | (1)
| 165 | (6)
5.4.1 Uninformative priors | 165 | (1)
| 166 | (2)
| 168 | (1)
5.4.4 Mixtures of conjugate priors | 168 | (3)
| 171 | (1)
5.5.1 Example: modeling related cancer rates | 171 | (1)
| 172 | (4)
5.6.1 Example: beta-binomial model | 173 | (1)
5.6.2 Example: Gaussian-Gaussian model | 173 | (3)
5.7 Bayesian decision theory | 176 | (15)
5.7.1 Bayes estimators for common loss functions | 177 | (3)
5.7.2 The false positive vs false negative tradeoff | 180 | (4)
| 184 | (7)

6 Frequentist statistics | 191 | (26)
| 191 | (1)
6.2 Sampling distribution of an estimator | 191 | (3)
| 192 | (1)
6.2.2 Large sample theory for the MLE | 193 | (1)
6.3 Frequentist decision theory | 194 | (6)
| 195 | (1)
| 196 | (1)
6.3.3 Admissible estimators | 197 | (3)
6.4 Desirable properties of estimators | 200 | (4)
6.4.1 Consistent estimators | 200 | (1)
6.4.2 Unbiased estimators | 200 | (1)
6.4.3 Minimum variance estimators | 201 | (1)
6.4.4 The bias-variance tradeoff | 202 | (2)
6.5 Empirical risk minimization | 204 | (7)
6.5.1 Regularized risk minimization | 205 | (1)
6.5.2 Structural risk minimization | 206 | (1)
6.5.3 Estimating the risk using cross validation | 206 | (3)
6.5.4 Upper bounding the risk using statistical learning theory | 209 | (1)
6.5.5 Surrogate loss functions | 210 | (1)
6.6 Pathologies of frequentist statistics | 211 | (6)
6.6.1 Counter-intuitive behavior of confidence intervals | 212 | (1)
6.6.2 p-values considered harmful | 213 | (1)
6.6.3 The likelihood principle | 214 | (1)
6.6.4 Why isn't everyone a Bayesian? | 215 | (2)

7 Linear regression | 217 | (28)
| 217 | (1)
| 217 | (1)
7.3 Maximum likelihood estimation (least squares) | 217 | (6)
7.3.1 Derivation of the MLE | 219 | (1)
7.3.2 Geometric interpretation | 220 | (1)
| 221 | (2)
7.4 Robust linear regression | 223 | (2)
| 225 | (6)
| 225 | (2)
7.5.2 Numerically stable computation | 227 | (1)
7.5.3 Connection with PCA | 228 | (2)
7.5.4 Regularization effects of big data | 230 | (1)
7.6 Bayesian linear regression | 231 | (14)
7.6.1 Computing the posterior | 232 | (1)
7.6.2 Computing the posterior predictive | 233 | (1)
7.6.3 Bayesian inference when σ^2 is unknown | 234 | (4)
7.6.4 EB for linear regression (evidence procedure) | 238 | (7)

8 Logistic regression | 245 | (36)
| 245 | (1)
| 245 | (1)
| 245 | (9)
| 246 | (1)
| 247 | (2)
| 249 | (1)
8.3.4 Iteratively reweighted least squares (IRLS) | 250 | (1)
8.3.5 Quasi-Newton (variable metric) methods | 251 | (1)
| 252 | (1)
8.3.7 Multi-class logistic regression | 252 | (2)
8.4 Bayesian logistic regression | 254 | (7)
8.4.1 Laplace approximation | 255 | (1)
8.4.2 Derivation of the BIC | 255 | (1)
8.4.3 Gaussian approximation for logistic regression | 256 | (1)
8.4.4 Approximating the posterior predictive | 256 | (4)
8.4.5 Residual analysis (outlier detection) | 260 | (1)
8.5 Online learning and stochastic optimization | 261 | (6)
8.5.1 Online learning and regret minimization | 262 | (1)
8.5.2 Stochastic optimization and risk minimization | 262 | (2)
| 264 | (1)
8.5.4 The perceptron algorithm | 265 | (1)
| 266 | (1)
8.6 Generative vs discriminative classifiers | 267 | (14)
8.6.1 Pros and cons of each approach | 268 | (1)
8.6.2 Dealing with missing data | 269 | (2)
8.6.3 Fisher's linear discriminant analysis (FLDA) | 271 | (10)

9 Generalized linear models and the exponential family | 281 | (26)
| 281 | (1)
9.2 The exponential family | 281 | (9)
| 282 | (1)
| 282 | (2)
9.2.3 Log partition function | 284 | (2)
9.2.4 MLE for the exponential family | 286 | (1)
9.2.5 Bayes for the exponential family | 287 | (2)
9.2.6 Maximum entropy derivation of the exponential family | 289 | (1)
9.3 Generalized linear models (GLMs) | 290 | (3)
| 290 | (2)
9.3.2 ML and MAP estimation | 292 | (1)
| 293 | (1)
| 293 | (3)
9.4.1 ML/MAP estimation using gradient-based optimization | 294 | (1)
9.4.2 Latent variable interpretation | 294 | (1)
9.4.3 Ordinal probit regression | 295 | (1)
9.4.4 Multinomial probit models | 295 | (1)
| 296 | (2)
9.5.1 Hierarchical Bayes for multi-task learning | 296 | (1)
9.5.2 Application to personalized email spam filtering | 296 | (1)
9.5.3 Application to domain adaptation | 297 | (1)
9.5.4 Other kinds of prior | 297 | (1)
9.6 Generalized linear mixed models | 298 | (2)
9.6.1 Example: semi-parametric GLMMs for medical data | 298 | (2)
9.6.2 Computational issues | 300 | (1)
| 300 | (7)
9.7.1 The pointwise approach | 301 | (1)
9.7.2 The pairwise approach | 301 | (1)
9.7.3 The listwise approach | 302 | (1)
9.7.4 Loss functions for ranking | 303 | (4)

10 Directed graphical models (Bayes nets) | 307 | (30)
| 307 | (4)
| 307 | (1)
10.1.2 Conditional independence | 308 | (1)
| 308 | (1)
| 309 | (1)
10.1.5 Directed graphical models | 310 | (1)
| 311 | (8)
10.2.1 Naive Bayes classifiers | 311 | (1)
10.2.2 Markov and hidden Markov models | 312 | (1)
| 313 | (2)
10.2.4 Genetic linkage analysis | 315 | (3)
10.2.5 Directed Gaussian graphical models | 318 | (1)
| 319 | (1)
| 320 | (4)
| 320 | (2)
10.4.2 Learning from complete data | 322 | (1)
10.4.3 Learning with missing and/or latent variables | 323 | (1)
10.5 Conditional independence properties of DGMs | 324 | (4)
10.5.1 d-separation and the Bayes Ball algorithm (global Markov properties) | 324 | (3)
10.5.2 Other Markov properties of DGMs | 327 | (1)
10.5.3 Markov blanket and full conditionals | 327 | (1)
10.6 Influence (decision) diagrams | 328 | (9)

11 Mixture models and the EM algorithm | 337 | (44)
11.1 Latent variable models | 337 | (1)
| 337 | (8)
11.2.1 Mixtures of Gaussians | 339 | (1)
11.2.2 Mixture of multinoullis | 340 | (1)
11.2.3 Using mixture models for clustering | 340 | (2)
11.2.4 Mixtures of experts | 342 | (3)
11.3 Parameter estimation for mixture models | 345 | (3)
| 346 | (1)
11.3.2 Computing a MAP estimate is non-convex | 347 | (1)
| 348 | (22)
| 349 | (1)
| 350 | (7)
11.4.3 EM for mixture of experts | 357 | (1)
11.4.4 EM for DGMs with hidden variables | 358 | (1)
11.4.5 EM for the Student distribution | 359 | (3)
11.4.6 EM for probit regression | 362 | (1)
11.4.7 Theoretical basis for EM | 363 | (2)
| 365 | (2)
| 367 | (3)
11.5 Model selection for latent variable models | 370 | (2)
11.5.1 Model selection for probabilistic models | 370 | (1)
11.5.2 Model selection for non-probabilistic methods | 370 | (2)
11.6 Fitting models with missing data | 372 | (9)
11.6.1 EM for the MLE of an MVN with missing data | 373 | (8)

12 Latent linear models | 381 | (40)
| 381 | (6)
12.1.1 FA is a low rank parameterization of an MVN | 381 | (1)
12.1.2 Inference of the latent factors | 382 | (1)
| 383 | (2)
12.1.4 Mixtures of factor analysers | 385 | (1)
12.1.5 EM for factor analysis models | 386 | (1)
12.1.6 Fitting FA models with missing data | 387 | (1)
12.2 Principal components analysis (PCA) | 387 | (11)
12.2.1 Classical PCA: statement of the theorem | 387 | (2)
| 389 | (3)
12.2.3 Singular value decomposition (SVD) | 392 | (3)
| 395 | (1)
12.2.5 EM algorithm for PCA | 396 | (2)
12.3 Choosing the number of latent dimensions | 398 | (4)
12.3.1 Model selection for FA/PPCA | 398 | (1)
12.3.2 Model selection for PCA | 399 | (3)
12.4 PCA for categorical data | 402 | (2)
12.5 PCA for paired and multi-view data | 404 | (3)
12.5.1 Supervised PCA (latent factor regression) | 405 | (1)
12.5.2 Partial least squares | 406 | (1)
12.5.3 Canonical correlation analysis | 407 | (1)
12.6 Independent Component Analysis (ICA) | 407 | (14)
12.6.1 Maximum likelihood estimation | 410 | (1)
12.6.2 The FastICA algorithm | 411 | (3)
| 414 | (1)
12.6.4 Other estimation principles | 415 | (6)

13 Sparse linear models | 421 | (58)
| 421 | (1)
13.2 Bayesian variable selection | 422 | (7)
13.2.1 The spike and slab model | 424 | (1)
13.2.2 From the Bernoulli-Gaussian model to l0 regularization | 425 | (1)
| 426 | (3)
13.3 l1 regularization: basics | 429 | (12)
13.3.1 Why does l1 regularization yield sparse solutions? | 430 | (1)
13.3.2 Optimality conditions for lasso | 431 | (4)
13.3.3 Comparison of least squares, lasso, ridge and subset selection | 435 | (1)
13.3.4 Regularization path | 436 | (3)
| 439 | (1)
13.3.6 Bayesian inference for linear models with Laplace priors | 440 | (1)
13.4 l1 regularization: algorithms | 441 | (8)
13.4.1 Coordinate descent | 441 | (1)
13.4.2 LARS and other homotopy methods | 441 | (1)
13.4.3 Proximal and gradient projection methods | 442 | (5)
| 447 | (2)
13.5 l1 regularization: extensions | 449 | (8)
| 449 | (5)
| 454 | (1)
13.5.3 Elastic net (ridge and lasso combined) | 455 | (2)
13.6 Non-convex regularizers | 457 | (6)
| 458 | (1)
13.6.2 Hierarchical adaptive lasso | 458 | (4)
13.6.3 Other hierarchical priors | 462 | (1)
13.7 Automatic relevance determination (ARD)/sparse Bayesian learning (SBL) | 463 | (5)
13.7.1 ARD for linear regression | 463 | (2)
| 465 | (1)
13.7.3 Connection to MAP estimation | 465 | (1)
13.7.4 Algorithms for ARD | 466 | (2)
13.7.5 ARD for logistic regression | 468 | (1)
| 468 | (11)
13.8.1 Learning a sparse coding dictionary | 469 | (1)
13.8.2 Results of dictionary learning from image patches | 470 | (2)
13.8.3 Compressed sensing | 472 | (1)
13.8.4 Image inpainting and denoising | 472 | (7)

14 Kernels | 479 | (36)
| 479 | (1)
| 479 | (7)
| 480 | (1)
14.2.2 Kernels for comparing documents | 480 | (1)
14.2.3 Mercer (positive definite) kernels | 481 | (1)
| 482 | (1)
| 482 | (1)
| 483 | (1)
14.2.7 Pyramid match kernels | 484 | (1)
14.2.8 Kernels derived from probabilistic generative models | 485 | (1)
14.3 Using kernels inside GLMs | 486 | (2)
| 486 | (1)
14.3.2 L1VMs, RVMs, and other sparse vector machines | 487 | (1)
| 488 | (8)
14.4.1 Kernelized nearest neighbor classification | 489 | (1)
14.4.2 Kernelized K-medoids clustering | 489 | (3)
14.4.3 Kernelized ridge regression | 492 | (1)
| 493 | (3)
14.5 Support vector machines (SVMs) | 496 | (9)
14.5.1 SVMs for regression | 497 | (1)
14.5.2 SVMs for classification | 498 | (6)
| 504 | (1)
14.5.4 Summary of key points | 504 | (1)
14.5.5 A probabilistic interpretation of SVMs | 505 | (1)
14.6 Comparison of discriminative kernel methods | 505 | (2)
14.7 Kernels for building generative models | 507 | (8)
| 507 | (1)
14.7.2 Kernel density estimation (KDE) | 508 | (1)
| 509 | (1)
| 510 | (2)
14.7.5 Locally weighted regression | 512 | (3)

15 Gaussian processes | 515 | (28)
| 515 | (1)
| 516 | (9)
15.2.1 Predictions using noise-free observations | 517 | (1)
15.2.2 Predictions using noisy observations | 518 | (1)
15.2.3 Effect of the kernel parameters | 519 | (2)
15.2.4 Estimating the kernel parameters | 521 | (3)
15.2.5 Computational and numerical issues | 524 | (1)
15.2.6 Semi-parametric GPs | 524 | (1)
| 525 | (7)
15.3.1 Binary classification | 525 | (3)
15.3.2 Multi-class classification | 528 | (3)
15.3.3 GPs for Poisson regression | 531 | (1)
15.4 Connection with other methods | 532 | (8)
15.4.1 Linear models compared to GPs | 532 | (1)
15.4.2 Linear smoothers compared to GPs | 533 | (1)
15.4.3 SVMs compared to GPs | 534 | (1)
15.4.4 L1VMs and RVMs compared to GPs | 534 | (1)
15.4.5 Neural networks compared to GPs | 535 | (1)
15.4.6 Smoothing splines compared to GPs | 536 | (2)
15.4.7 RKHS methods compared to GPs | 538 | (2)
15.5 GP latent variable model | 540 | (2)
15.6 Approximation methods for large datasets | 542 | (1)

16 Adaptive basis function models | 543 | (46)
| 543 | (1)
16.2 Classification and regression trees (CART) | 544 | (8)
| 544 | (1)
| 545 | (4)
| 549 | (1)
16.2.4 Pros and cons of trees | 550 | (1)
| 550 | (1)
16.2.6 CART compared to hierarchical mixture of experts | 551 | (1)
16.3 Generalized additive models | 552 | (2)
| 552 | (1)
16.3.2 Computational efficiency | 553 | (1)
16.3.3 Multivariate adaptive regression splines (MARS) | 553 | (1)
| 554 | (9)
16.4.1 Forward stagewise additive modeling | 555 | (2)
| 557 | (1)
| 558 | (1)
| 559 | (1)
16.4.5 Boosting as functional gradient descent | 560 | (1)
| 561 | (1)
16.4.7 Multivariate adaptive regression trees (MART) | 562 | (1)
16.4.8 Why does boosting work so well? | 562 | (1)
| 563 | (1)
16.5 Feedforward neural networks (multilayer perceptrons) | 563 | (17)
16.5.1 Convolutional neural networks | 564 | (4)
16.5.2 Other kinds of neural networks | 568 | (1)
16.5.3 A brief history of the field | 568 | (1)
16.5.4 The backpropagation algorithm | 569 | (3)
| 572 | (1)
| 572 | (4)
16.5.7 Bayesian inference | 576 | (4)
| 580 | (2)
| 580 | (1)
16.6.2 Error-correcting output codes | 581 | (1)
16.6.3 Ensemble learning is not equivalent to Bayes model averaging | 581 | (1)
16.7 Experimental comparison | 582 | (3)
16.7.1 Low-dimensional features | 582 | (1)
16.7.2 High-dimensional features | 583 | (2)
16.8 Interpreting black-box models | 585 | (4)

17 Markov and hidden Markov models | 589 | (42)
| 589 | (1)
| 589 | (14)
| 589 | (2)
17.2.2 Application: Language modeling | 591 | (5)
17.2.3 Stationary distribution of a Markov chain | 596 | (4)
17.2.4 Application: Google's PageRank algorithm for web page ranking | 600 | (3)
17.3 Hidden Markov models | 603 | (3)
17.3.1 Applications of HMMs | 604 | (2)
| 606 | (11)
17.4.1 Types of inference problems for temporal models | 606 | (3)
17.4.2 The forwards algorithm | 609 | (1)
17.4.3 The forwards-backwards algorithm | 610 | (2)
17.4.4 The Viterbi algorithm | 612 | (4)
17.4.5 Forwards filtering, backwards sampling | 616 | (1)
| 617 | (4)
17.5.1 Training with fully observed data | 617 | (1)
17.5.2 EM for HMMs (the Baum-Welch algorithm) | 618 | (2)
17.5.3 Bayesian methods for "fitting" HMMs | 620 | (1)
17.5.4 Discriminative training | 620 | (1)
| 621 | (1)
17.6 Generalizations of HMMs | 621 | (10)
17.6.1 Variable duration (semi-Markov) HMMs | 622 | (2)
| 624 | (1)
| 625 | (1)
17.6.4 Auto-regressive and buried HMMs | 626 | (1)
| 627 | (1)
17.6.6 Coupled HMM and the influence model | 628 | (1)
17.6.7 Dynamic Bayesian networks (DBNs) | 628 | (3)

18 State space models | 631 | (30)
| 631 | (1)
18.2 Applications of SSMs | 632 | (8)
18.2.1 SSMs for object tracking | 632 | (1)
| 633 | (3)
18.2.3 Online parameter learning using recursive least squares | 636 | (1)
18.2.4 SSM for time series forecasting | 637 | (3)
| 640 | (6)
18.3.1 The Kalman filtering algorithm | 640 | (3)
18.3.2 The Kalman smoothing algorithm | 643 | (3)
| 646 | (1)
18.4.1 Identifiability and numerical stability | 646 | (1)
18.4.2 Training with fully observed data | 647 | (1)
| 647 | (1)
| 647 | (1)
18.4.5 Bayesian methods for "fitting" LG-SSMs | 647 | (1)
18.5 Approximate online inference for non-linear, non-Gaussian SSMs | 647 | (8)
18.5.1 Extended Kalman filter (EKF) | 648 | (2)
18.5.2 Unscented Kalman filter (UKF) | 650 | (2)
18.5.3 Assumed density filtering (ADF) | 652 | (3)
18.6 Hybrid discrete/continuous SSMs | 655 | (6)
| 656 | (2)
18.6.2 Application: data association and multi-target tracking | 658 | (1)
18.6.3 Application: fault diagnosis | 659 | (1)
18.6.4 Application: econometric forecasting | 660 | (1)

19 Undirected graphical models (Markov random fields) | 661 | (46)
| 661 | (1)
19.2 Conditional independence properties of UGMs | 661 | (4)
| 661 | (2)
19.2.2 An undirected alternative to d-separation | 663 | (1)
19.2.3 Comparing directed and undirected graphical models | 664 | (1)
19.3 Parameterization of MRFs | 665 | (3)
19.3.1 The Hammersley-Clifford theorem | 665 | (2)
19.3.2 Representing potential functions | 667 | (1)
| 668 | (8)
| 668 | (1)
| 669 | (2)
| 671 | (1)
| 672 | (2)
19.4.5 Markov logic networks | 674 | (2)
| 676 | (8)
19.5.1 Training maxent models using gradient methods | 676 | (1)
19.5.2 Training partially observed maxent models | 677 | (1)
19.5.3 Approximate methods for computing the MLEs of MRFs | 678 | (1)
| 678 | (1)
19.5.5 Stochastic maximum likelihood | 679 | (1)
19.5.6 Feature induction for maxent models | 680 | (1)
19.5.7 Iterative proportional fitting (IPF) | 681 | (3)
19.6 Conditional random fields (CRFs) | 684 | (9)
19.6.1 Chain-structured CRFs, MEMMs and the label-bias problem | 684 | (2)
19.6.2 Applications of CRFs | 686 | (6)
| 692 | (1)
| 693 | (14)
19.7.1 SSVMs: a probabilistic view | 693 | (2)
19.7.2 SSVMs: a non-probabilistic view | 695 | (3)
19.7.3 Cutting plane methods for fitting SSVMs | 698 | (2)
19.7.4 Online algorithms for fitting SSVMs | 700 | (1)
19.7.5 Latent structural SVMs | 701 | (6)

20 Exact inference for graphical models | 707 | (24)
| 707 | (1)
20.2 Belief propagation for trees | 707 | (7)
| 707 | (2)
| 709 | (1)
| 710 | (2)
| 712 | (2)
20.3 The variable elimination algorithm | 714 | (6)
20.3.1 The generalized distributive law | 717 | (1)
20.3.2 Computational complexity of VE | 717 | (3)
| 720 | (1)
20.4 The junction tree algorithm | 720 | (6)
20.4.1 Creating a junction tree | 720 | (2)
20.4.2 Message passing on a junction tree | 722 | (3)
20.4.3 Computational complexity of JTA | 725 | (1)
20.4.4 JTA generalizations | 726 | (1)
20.5 Computational intractability of exact inference in the worst case | 726 | (5)
20.5.1 Approximate inference | 727 | (4)

21 Variational inference | 731 | (36)
| 731 | (1)
21.2 Variational inference | 732 | (3)
21.2.1 Alternative interpretations of the variational objective | 733 | (1)
21.2.2 Forward or reverse KL? | 733 | (2)
21.3 The mean field method | 735 | (4)
21.3.1 Derivation of the mean field update equations | 736 | (1)
21.3.2 Example: mean field for the Ising model | 737 | (2)
21.4 Structured mean field | 739 | (3)
21.4.1 Example: factorial HMM | 740 | (2)
| 742 | (7)
21.5.1 Example: VB for a univariate Gaussian | 742 | (4)
21.5.2 Example: VB for linear regression | 746 | (3)
21.6 Variational Bayes EM | 749 | (7)
21.6.1 Example: VBEM for mixtures of Gaussians | 750 | (6)
21.7 Variational message passing and VIBES | 756 | (1)
21.8 Local variational bounds | 756 | (11)
21.8.1 Motivating applications | 756 | (2)
21.8.2 Bohning's quadratic bound to the log-sum-exp function | 758 | (2)
21.8.3 Bounds for the sigmoid function | 760 | (2)
21.8.4 Other bounds and approximations to the log-sum-exp function | 762 | (1)
21.8.5 Variational inference based on upper bounds | 763 | (4)

22 More variational inference | 767 | (48)
| 767 | (1)
22.2 Loopy belief propagation: algorithmic issues | 767 | (9)
| 767 | (1)
22.2.2 LBP on pairwise models | 768 | (1)
22.2.3 LBP on a factor graph | 769 | (2)
| 771 | (3)
| 774 | (1)
22.2.6 Other speedup tricks for LBP | 775 | (1)
22.3 Loopy belief propagation: theoretical issues | 776 | (7)
22.3.1 UGMs represented in exponential family form | 776 | (1)
22.3.2 The marginal polytope | 777 | (1)
22.3.3 Exact inference as a variational optimization problem | 778 | (1)
22.3.4 Mean field as a variational optimization problem | 779 | (1)
22.3.5 LBP as a variational optimization problem | 779 | (4)
22.3.6 Loopy BP vs mean field | 783 | (1)
22.4 Extensions of belief propagation | 783 | (4)
22.4.1 Generalized belief propagation | 783 | (2)
22.4.2 Convex belief propagation | 785 | (2)
22.5 Expectation propagation | 787 | (12)
22.5.1 EP as a variational inference problem | 788 | (1)
22.5.2 Optimizing the EP objective using moment matching | 789 | (2)
22.5.3 EP for the clutter problem | 791 | (1)
22.5.4 LBP is a special case of EP | 792 | (1)
22.5.5 Ranking players using TrueSkill | 793 | (6)
22.5.6 Other applications of EP | 799 | (1)
22.6 MAP state estimation | 799 | (16)
22.6.1 Linear programming relaxation | 799 | (1)
22.6.2 Max-product belief propagation | 800 | (1)
| 801 | (3)
22.6.4 Experimental comparison of graphcuts and BP | 804 | (2)
22.6.5 Dual decomposition | 806 | (9)

23 Monte Carlo inference | 815 | (22)
| 815 | (1)
23.2 Sampling from standard distributions | 815 | (2)
| 815 | (2)
23.2.2 Sampling from a Gaussian (Box-Muller method) | 817 | (1)
| 817 | (3)
| 817 | (1)
| 818 | (1)
23.3.3 Application to Bayesian statistics | 819 | (1)
23.3.4 Adaptive rejection sampling | 819 | (1)
23.3.5 Rejection sampling in high dimensions | 820 | (1)
| 820 | (3)
| 820 | (1)
23.4.2 Handling unnormalized distributions | 821 | (1)
23.4.3 Importance sampling for a DGM: likelihood weighting | 822 | (1)
23.4.4 Sampling importance resampling (SIR) | 822 | (1)
| 823 | (8)
23.5.1 Sequential importance sampling | 824 | (1)
23.5.2 The degeneracy problem | 825 | (1)
23.5.3 The resampling step | 825 | (2)
23.5.4 The proposal distribution | 827 | (1)
23.5.5 Application: robot localization | 828 | (1)
23.5.6 Application: visual object tracking | 828 | (3)
23.5.7 Application: time series forecasting | 831 | (1)
23.6 Rao-Blackwellised particle filtering (RBPF) | 831 | (6)
23.6.1 RBPF for switching LG-SSMs | 831 | (1)
23.6.2 Application: tracking a maneuvering target | 832 | (2)
23.6.3 Application: FastSLAM | 834 | (3)

24 Markov chain Monte Carlo (MCMC) inference | 837 | (38)
| 837 | (1)
| 838 | (10)
| 838 | (1)
24.2.2 Example: Gibbs sampling for the Ising model | 838 | (2)
24.2.3 Example: Gibbs sampling for inferring the parameters of a GMM | 840 | (1)
24.2.4 Collapsed Gibbs sampling | 841 | (3)
24.2.5 Gibbs sampling for hierarchical GLMs | 844 | (2)
| 846 | (1)
24.2.7 The Imputation Posterior (IP) algorithm | 847 | (1)
24.2.8 Blocking Gibbs sampling | 847 | (1)
24.3 Metropolis-Hastings algorithm | 848 | (8)
| 848 | (1)
24.3.2 Gibbs sampling is a special case of MH | 849 | (1)
24.3.3 Proposal distributions | 850 | (3)
| 853 | (1)
24.3.5 Initialization and mode hopping | 854 | (1)
| 854 | (1)
24.3.7 Reversible jump (trans-dimensional) MCMC | 855 | (1)
24.4 Speed and accuracy of MCMC | 856 | (7)
| 856 | (1)
24.4.2 Mixing rates of Markov chains | 857 | (1)
24.4.3 Practical convergence diagnostics | 858 | (2)
| 860 | (2)
| 862 | (1)
24.5 Auxiliary variable MCMC | 863 | (5)
24.5.1 Auxiliary variable sampling for logistic regression | 863 | (1)
| 864 | (2)
| 866 | (2)
24.5.4 Hybrid/Hamiltonian MCMC | 868 | (1)
| 868 | (4)
24.6.1 Simulated annealing | 869 | (2)
24.6.2 Annealed importance sampling | 871 | (1)
24.6.3 Parallel tempering | 871 | (1)
24.7 Approximating the marginal likelihood | 872 | (3)
24.7.1 The candidate method | 872 | (1)
24.7.2 Harmonic mean estimate | 872 | (1)
24.7.3 Annealed importance sampling | 873 | (2)

25 Clustering | 875 | (32)
| 875 | (4)
25.1.1 Measuring (dis)similarity | 875 | (1)
25.1.2 Evaluating the output of clustering methods | 876 | (3)
25.2 Dirichlet process mixture models | 879 | (8)
25.2.1 From finite to infinite mixture models | 879 | (3)
25.2.2 The Dirichlet process | 882 | (3)
25.2.3 Applying Dirichlet processes to mixture modeling | 885 | (1)
25.2.4 Fitting a DP mixture model | 886 | (1)
25.3 Affinity propagation | 887 | (3)
| 890 | (3)
| 891 | (1)
25.4.2 Normalized graph Laplacian | 892 | (1)
| 893 | (1)
25.5 Hierarchical clustering | 893 | (8)
25.5.1 Agglomerative clustering | 895 | (3)
25.5.2 Divisive clustering | 898 | (1)
25.5.3 Choosing the number of clusters | 899 | (1)
25.5.4 Bayesian hierarchical clustering | 899 | (2)
25.6 Clustering datapoints and features | 901 | (6)
| 903 | (1)
25.6.2 Multi-view clustering | 903 | (4)

26 Graphical model structure learning | 907 | (38)
| 907 | (1)
26.2 Structure learning for knowledge discovery | 908 | (2)
26.2.1 Relevance networks | 908 | (1)
26.2.2 Dependency networks | 909 | (1)
26.3 Learning tree structures | 910 | (4)
26.3.1 Directed or undirected tree? | 911 | (1)
26.3.2 Chow-Liu algorithm for finding the ML tree structure | 912 | (1)
26.3.3 Finding the MAP forest | 912 | (2)
| 914 | (1)
26.4 Learning DAG structures | 914 | (8)
26.4.1 Markov equivalence | 914 | (2)
26.4.2 Exact structural inference | 916 | (4)
26.4.3 Scaling up to larger graphs | 920 | (2)
26.5 Learning DAG structure with latent variables | 922 | (9)
26.5.1 Approximating the marginal likelihood when we have missing data | 922 | (3)
| 925 | (1)
26.5.3 Discovering hidden variables | 926 | (2)
26.5.4 Case study: Google's Rephil | 928 | (1)
26.5.5 Structural equation models | 929 | (2)
26.6 Learning causal DAGs | 931 | (7)
26.6.1 Causal interpretation of DAGs | 931 | (2)
26.6.2 Using causal DAGs to resolve Simpson's paradox | 933 | (2)
26.6.3 Learning causal DAG structures | 935 | (3)
26.7 Learning undirected Gaussian graphical models | 938 | (4)
| 938 | (1)
| 939 | (2)
26.7.3 Bayesian inference for GGM structure | 941 | (1)
26.7.4 Handling non-Gaussian data using copulas | 942 | (1)
26.8 Learning undirected discrete graphical models | 942 | (3)
26.8.1 Graphical lasso for MRFs/CRFs | 942 | (2)
26.8.2 Thin junction trees | 944 | (1)

27 Latent variable models for discrete data | 945 | (50)
| 945 | (1)
27.2 Distributed state LVMs for discrete data | 946 | (4)
| 946 | (1)
27.2.2 Exponential family PCA | 947 | (1)
| 948 | (1)
27.2.4 GaP model and non-negative matrix factorization | 949 | (1)
27.3 Latent Dirichlet allocation (LDA) | 950 | (11)
| 950 | (3)
27.3.2 Unsupervised discovery of topics | 953 | (1)
27.3.3 Quantitatively evaluating LDA as a language model | 953 | (2)
27.3.4 Fitting using (collapsed) Gibbs sampling | 955 | (1)
| 956 | (1)
27.3.6 Fitting using batch variational inference | 957 | (2)
27.3.7 Fitting using online variational inference | 959 | (1)
27.3.8 Determining the number of topics | 960 | (1)
| 961 | (9)
27.4.1 Correlated topic model | 961 | (1)
27.4.2 Dynamic topic model | 962 | (1)
| 963 | (4)
| 967 | (3)
27.5 LVMs for graph-structured data | 970 | (5)
27.5.1 Stochastic block model | 971 | (2)
27.5.2 Mixed membership stochastic block model | 973 | (1)
27.5.3 Relational topic model | 974 | (1)
27.6 LVMs for relational data | 975 | (8)
27.6.1 Infinite relational model | 976 | (3)
27.6.2 Probabilistic matrix factorization for collaborative filtering | 979 | (4)
27.7 Restricted Boltzmann machines (RBMs) | 983 | (12)
| 985 | (2)
| 987 | (4)
27.7.3 Applications of RBMs | 991 | (4)

28 Deep learning | 995 | (14)
| 995 | (1)
28.2 Deep generative models | 995 | (4)
28.2.1 Deep directed networks | 996 | (1)
28.2.2 Deep Boltzmann machines | 996 | (1)
28.2.3 Deep belief networks | 997 | (1)
28.2.4 Greedy layer-wise learning of DBNs | 998 | (1)
28.3 Deep neural networks | 999 | (2)
28.3.1 Deep multi-layer perceptrons | 999 | (1)
28.3.2 Deep auto-encoders | 1000 | (1)
28.3.3 Stacked denoising auto-encoders | 1001 | (1)
28.4 Applications of deep networks | 1001 | (4)
28.4.1 Handwritten digit classification using DBNs | 1001 | (1)
28.4.2 Data visualization and feature discovery using deep auto-encoders | 1002 | (1)
28.4.3 Information retrieval using deep auto-encoders (semantic hashing) | 1003 | (1)
28.4.4 Learning audio features using 1d convolutional DBNs | 1004 | (1)
28.4.5 Learning image features using 2d convolutional DBNs | 1005 | (1)
| 1005 | (4)

Notation | 1009 | (6)
Bibliography | 1015 | (32)
Indexes | 1047 |
| 1047 | (3)
| 1050 |