
E-book: Machine Learning: A Probabilistic Perspective

4.34/5 (1,021 ratings on Goodreads)
  • Pages: 1104
  • Series: Machine Learning
  • Publication date: 07-Sep-2012
  • Publisher: MIT Press
  • Language: English
  • ISBN-13: 9780262305242
  • Format: PDF + DRM
  • Price: €260.00*
  • * The price is final, i.e. no further discounts apply.
  • This e-book is intended for personal use only. E-books cannot be returned.

DRM restrictions

  • Copying (copy/paste):

    not allowed

  • Printing:

    not allowed

  • Usage:

    Digital rights management (DRM)
    The publisher has issued this e-book in encrypted form, which means you must install special software to read it. You will also need to create an Adobe ID (more information here). The e-book can be read by 1 user and downloaded to up to 6 devices (all authorized with the same Adobe ID).

    Required software
    To read on a mobile device (phone or tablet), install this free app: PocketBook Reader (iOS / Android)

    To read on a PC or Mac, install Adobe Digital Editions. (This is a free application designed specifically for reading e-books; it should not be confused with Adobe Reader, which is probably already installed on your computer.)

    This e-book cannot be read on an Amazon Kindle.

Today's Web-enabled deluge of electronic data calls for automated methods of data analysis. Machine learning provides these, developing methods that can automatically detect patterns in data and then use the uncovered patterns to predict future data. This textbook offers a comprehensive and self-contained introduction to the field of machine learning, based on a unified, probabilistic approach. The coverage combines breadth and depth, offering necessary background material on such topics as probability, optimization, and linear algebra as well as discussion of recent developments in the field, including conditional random fields, L1 regularization, and deep learning. The book is written in an informal, accessible style, complete with pseudo-code for the most important algorithms. All topics are copiously illustrated with color images and worked examples drawn from such application domains as biology, text processing, computer vision, and robotics. Rather than providing a cookbook of different heuristic methods, the book stresses a principled model-based approach, often using the language of graphical models to specify models in a concise and intuitive way. Almost all the models described have been implemented in a MATLAB software package--PMTK (probabilistic modeling toolkit)--that is freely available online. The book is suitable for upper-level undergraduates with an introductory-level college math background and beginning graduate students.
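
To give a flavor of the model-based approach described above, here is a minimal Python sketch of the beta-binomial model covered in Section 3.3 of the table of contents below. The example is illustrative only: it is not taken from the book or from PMTK, and the prior hyperparameters and coin-flip counts are invented.

    # Beta-binomial model: a conjugate Beta prior on the heads probability theta,
    # updated with (hypothetical) Bernoulli coin-flip data.
    a, b = 2.0, 2.0          # Beta(a, b) prior hyperparameters (assumed values)
    heads, tails = 7, 3      # invented observations

    # Conjugacy: Beta prior + Bernoulli likelihood => Beta posterior.
    a_post, b_post = a + heads, b + tails

    posterior_mean = a_post / (a_post + b_post)           # E[theta | data]
    map_estimate = (a_post - 1) / (a_post + b_post - 2)   # posterior mode (MAP)
    pred_heads = posterior_mean                           # p(next flip = heads | data)

    print(f"posterior: Beta({a_post:.0f}, {b_post:.0f})")
    print(f"posterior mean = {posterior_mean:.3f}, MAP estimate = {map_estimate:.3f}")
    print(f"posterior predictive p(heads) = {pred_heads:.3f}")

Here the posterior predictive probability of heads equals the posterior mean of theta; this kind of closed-form Bayesian update is what the early chapters work through before moving on to approximate inference.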

Additional information

Winner of the 2013 DeGroot Prize, awarded by the International Society for Bayesian Analysis.
Preface xxvii
1 Introduction 1(26)
1.1 Machine learning: what and why?
1(2)
1.1.1 Types of machine learning
2(1)
1.2 Supervised learning
3(6)
1.2.1 Classification
3(5)
1.2.2 Regression
8(1)
1.3 Unsupervised learning
9(7)
1.3.1 Discovering clusters
10(1)
1.3.2 Discovering latent factors
11(2)
1.3.3 Discovering graph structure
13(1)
1.3.4 Matrix completion
14(2)
1.4 Some basic concepts in machine learning
16(11)
1.4.1 Parametric vs non-parametric models
16(1)
1.4.2 A simple non-parametric classifier: K-nearest neighbors
16(2)
1.4.3 The curse of dimensionality
18(1)
1.4.4 Parametric models for classification and regression
19(1)
1.4.5 Linear regression
19(2)
1.4.6 Logistic regression
21(1)
1.4.7 Overfitting
22(1)
1.4.8 Model selection
22(2)
1.4.9 No free lunch theorem
24(3)
2 Probability 27(38)
2.1 Introduction
27(1)
2.2 A brief review of probability theory
28(6)
2.2.1 Discrete random variables
28(1)
2.2.2 Fundamental rules
28(1)
2.2.3 Bayes rule
29(1)
2.2.4 Independence and conditional independence
30(2)
2.2.5 Continuous random variables
32(1)
2.2.6 Quantiles
33(1)
2.2.7 Mean and variance
33(1)
2.3 Some common discrete distributions
34(4)
2.3.1 The binomial and Bernoulli distributions
34(1)
2.3.2 The multinomial and multinoulli distributions
35(2)
2.3.3 The Poisson distribution
37(1)
2.3.4 The empirical distribution
37(1)
2.4 Some common continuous distributions
38(6)
2.4.1 Gaussian (normal) distribution
38(1)
2.4.2 Degenerate pdf
39(2)
2.4.3 Laplace distribution
41(1)
2.4.4 The gamma distribution
41(1)
2.4.5 The beta distribution
42(1)
2.4.6 Pareto distribution
43(1)
2.5 Joint probability distributions
44(5)
2.5.1 Covariance and correlation
44(2)
2.5.2 The multivariate Gaussian
46(1)
2.5.3 Multivariate Student t distribution
46(1)
2.5.4 Dirichlet distribution
47(2)
2.6 Transformations of random variables
49(3)
2.6.1 Linear transformations
49(1)
2.6.2 General transformations
50(1)
2.6.3 Central limit theorem
51(1)
2.7 Monte Carlo approximation
52(4)
2.7.1 Example: change of variables, the MC way
53(1)
2.7.2 Example: estimating π by Monte Carlo integration
54(1)
2.7.3 Accuracy of Monte Carlo approximation
54(2)
2.8 Information theory
56(9)
2.8.1 Entropy
56(1)
2.8.2 KL divergence
57(2)
2.8.3 Mutual information
59(6)
3 Generative models for discrete data 65(32)
3.1 Introduction
65(1)
3.2 Bayesian concept learning
65(7)
3.2.1 Likelihood
67(1)
3.2.2 Prior
67(1)
3.2.3 Posterior
68(3)
3.2.4 Posterior predictive distribution
71(1)
3.2.5 A more complex prior
72(1)
3.3 The beta-binomial model
72(6)
3.3.1 Likelihood
73(1)
3.3.2 Prior
74(1)
3.3.3 Posterior
75(2)
3.3.4 Posterior predictive distribution
77(1)
3.4 The Dirichlet-multinomial model
78(4)
3.4.1 Likelihood
79(1)
3.4.2 Prior
79(1)
3.4.3 Posterior
79(2)
3.4.4 Posterior predictive
81(1)
3.5 Naive Bayes classifiers
82(15)
3.5.1 Model fitting
83(2)
3.5.2 Using the model for prediction
85(1)
3.5.3 The log-sum-exp trick
86(1)
3.5.4 Feature selection using mutual information
86(1)
3.5.5 Classifying documents using bag of words
87(10)
4 Gaussian models 97(52)
4.1 Introduction
97(4)
4.1.1 Notation
97(1)
4.1.2 Basics
97(2)
4.1.3 MLE for an MVN
99(2)
4.1.4 Maximum entropy derivation of the Gaussian
101(1)
4.2 Gaussian discriminant analysis
101(9)
4.2.1 Quadratic discriminant analysis (QDA)
102(1)
4.2.2 Linear discriminant analysis (LDA)
103(1)
4.2.3 Two-class LDA
104(2)
4.2.4 MLE for discriminant analysis
106(1)
4.2.5 Strategies for preventing overfitting
106(1)
4.2.6 Regularized LDA
107(1)
4.2.7 Diagonal LDA
108(1)
4.2.8 Nearest shrunken centroids classifier
109(1)
4.3 Inference in jointly Gaussian distributions
110(9)
4.3.1 Statement of the result
111(1)
4.3.2 Examples
111(4)
4.3.3 Information form
115(1)
4.3.4 Proof of the result
116(3)
4.4 Linear Gaussian systems
119(6)
4.4.1 Statement of the result
119(1)
4.4.2 Examples
120(4)
4.4.3 Proof of the result
124(1)
4.5 Digression: The Wishart distribution
125(2)
4.5.1 Inverse Wishart distribution
126(1)
4.5.2 Visualizing the Wishart distribution
127(1)
4.6 Inferring the parameters of an MVN
127(22)
4.6.1 Posterior distribution of µ
128(1)
4.6.2 Posterior distribution of Σ
128(4)
4.6.3 Posterior distribution of µ and Σ
132(6)
4.6.4 Sensor fusion with unknown precisions
138(11)
5 Bayesian statistics 149(42)
5.1 Introduction
149(1)
5.2 Summarizing posterior distributions
149(6)
5.2.1 MAP estimation
149(3)
5.2.2 Credible intervals
152(2)
5.2.3 Inference for a difference in proportions
154(1)
5.3 Bayesian model selection
155(10)
5.3.1 Bayesian Occam's razor
156(2)
5.3.2 Computing the marginal likelihood (evidence)
158(5)
5.3.3 Bayes factors
163(1)
5.3.4 Jeffreys-Lindley paradox
164(1)
5.4 Priors
165(6)
5.4.1 Uninformative priors
165(1)
5.4.2 Jeffreys priors
166(2)
5.4.3 Robust priors
168(1)
5.4.4 Mixtures of conjugate priors
168(3)
5.5 Hierarchical Bayes
171(1)
5.5.1 Example: modeling related cancer rates
171(1)
5.6 Empirical Bayes
172(4)
5.6.1 Example: beta-binomial model
173(1)
5.6.2 Example: Gaussian-Gaussian model
173(3)
5.7 Bayesian decision theory
176(15)
5.7.1 Bayes estimators for common loss functions
177(3)
5.7.2 The false positive vs false negative tradeoff
180(4)
5.7.3 Other topics
184(7)
6 Frequentist statistics 191(26)
6.1 Introduction
191(1)
6.2 Sampling distribution of an estimator
191(3)
6.2.1 Bootstrap
192(1)
6.2.2 Large sample theory for the MLE
193(1)
6.3 Frequentist decision theory
194(6)
6.3.1 Bayes risk
195(1)
6.3.2 Minimax risk
196(1)
6.3.3 Admissible estimators
197(3)
6.4 Desirable properties of estimators
200(4)
6.4.1 Consistent estimators
200(1)
6.4.2 Unbiased estimators
200(1)
6.4.3 Minimum variance estimators
201(1)
6.4.4 The bias-variance tradeoff
202(2)
6.5 Empirical risk minimization
204(7)
6.5.1 Regularized risk minimization
205(1)
6.5.2 Structural risk minimization
206(1)
6.5.3 Estimating the risk using cross validation
206(3)
6.5.4 Upper bounding the risk using statistical learning theory
209(1)
6.5.5 Surrogate loss functions
210(1)
6.6 Pathologies of frequentist statistics
211(6)
6.6.1 Counter-intuitive behavior of confidence intervals
212(1)
6.6.2 p-values considered harmful
213(1)
6.6.3 The likelihood principle
214(1)
6.6.4 Why isn't everyone a Bayesian?
215(2)
7 Linear regression 217(28)
7.1 Introduction
217(1)
7.2 Model specification
217(1)
7.3 Maximum likelihood estimation (least squares)
217(6)
7.3.1 Derivation of the MLE
219(1)
7.3.2 Geometric interpretation
220(1)
7.3.3 Convexity
221(2)
7.4 Robust linear regression
223(2)
7.5 Ridge regression
225(6)
7.5.1 Basic idea
225(2)
7.5.2 Numerically stable computation
227(1)
7.5.3 Connection with PCA
228(2)
7.5.4 Regularization effects of big data
230(1)
7.6 Bayesian linear regression
231(14)
7.6.1 Computing the posterior
232(1)
7.6.2 Computing the posterior predictive
233(1)
7.6.3 Bayesian inference when σ2 is unknown
234(4)
7.6.4 EB for linear regression (evidence procedure)
238(7)
8 Logistic regression 245(36)
8.1 Introduction
245(1)
8.2 Model specification
245(1)
8.3 Model fitting
245(9)
8.3.1 MLE
246(1)
8.3.2 Steepest descent
247(2)
8.3.3 Newton's method
249(1)
8.3.4 Iteratively reweighted least squares (IRLS)
250(1)
8.3.5 Quasi-Newton (variable metric) methods
251(1)
8.3.6 l2 regularization
252(1)
8.3.7 Multi-class logistic regression
252(2)
8.4 Bayesian logistic regression
254(7)
8.4.1 Laplace approximation
255(1)
8.4.2 Derivation of the BIC
255(1)
8.4.3 Gaussian approximation for logistic regression
256(1)
8.4.4 Approximating the posterior predictive
256(4)
8.4.5 Residual analysis (outlier detection)
260(1)
8.5 Online learning and stochastic optimization
261(6)
8.5.1 Online learning and regret minimization
262(1)
8.5.2 Stochastic optimization and risk minimization
262(2)
8.5.3 The LMS algorithm
264(1)
8.5.4 The perceptron algorithm
265(1)
8.5.5 A Bayesian view
266(1)
8.6 Generative vs discriminative classifiers
267(14)
8.6.1 Pros and cons of each approach
268(1)
8.6.2 Dealing with missing data
269(2)
8.6.3 Fisher's linear discriminant analysis (FLDA)
271(10)
9 Generalized linear models and the exponential family 281(26)
9.1 Introduction
281(1)
9.2 The exponential family
281(9)
9.2.1 Definition
282(1)
9.2.2 Examples
282(2)
9.2.3 Log partition function
284(2)
9.2.4 MLE for the exponential family
286(1)
9.2.5 Bayes for the exponential family
287(2)
9.2.6 Maximum entropy derivation of the exponential family
289(1)
9.3 Generalized linear models (GLMs)
290(3)
9.3.1 Basics
290(2)
9.3.2 ML and MAP estimation
292(1)
9.3.3 Bayesian inference
293(1)
9.4 Probit regression
293(3)
9.4.1 ML/MAP estimation using gradient-based optimization
294(1)
9.4.2 Latent variable interpretation
294(1)
9.4.3 Ordinal probit regression
295(1)
9.4.4 Multinomial probit models
295(1)
9.5 Multi-task learning
296(2)
9.5.1 Hierarchical Bayes for multi-task learning
296(1)
9.5.2 Application to personalized email spam filtering
296(1)
9.5.3 Application to domain adaptation
297(1)
9.5.4 Other kinds of prior
297(1)
9.6 Generalized linear mixed models
298(2)
9.6.1 Example: semi-parametric GLMMs for medical data
298(2)
9.6.2 Computational issues
300(1)
9.7 Learning to rank
300(7)
9.7.1 The pointwise approach
301(1)
9.7.2 The pairwise approach
301(1)
9.7.3 The listwise approach
302(1)
9.7.4 Loss functions for ranking
303(4)
10 Directed graphical models (Bayes nets) 307(30)
10.1 Introduction
307(4)
10.1.1 Chain rule
307(1)
10.1.2 Conditional independence
308(1)
10.1.3 Graphical models
308(1)
10.1.4 Graph terminology
309(1)
10.1.5 Directed graphical models
310(1)
10.2 Examples
311(8)
10.2.1 Naive Bayes classifiers
311(1)
10.2.2 Markov and hidden Markov models
312(1)
10.2.3 Medical diagnosis
313(2)
10.2.4 Genetic linkage analysis
315(3)
10.2.5 Directed Gaussian graphical models
318(1)
10.3 Inference
319(1)
10.4 Learning
320(4)
10.4.1 Plate notation
320(2)
10.4.2 Learning from complete data
322(1)
10.4.3 Learning with missing and/or latent variables
323(1)
10.5 Conditional independence properties of DGMs
324(4)
10.5.1 d-separation and the Bayes Ball algorithm (global Markov properties)
324(3)
10.5.2 Other Markov properties of DGMs
327(1)
10.5.3 Markov blanket and full conditionals
327(1)
10.6 Influence (decision) diagrams
328(9)
11 Mixture models and the EM algorithm 337(44)
11.1 Latent variable models
337(1)
11.2 Mixture models
337(8)
11.2.1 Mixtures of Gaussians
339(1)
11.2.2 Mixture of multinoullis
340(1)
11.2.3 Using mixture models for clustering
340(2)
11.2.4 Mixtures of experts
342(3)
11.3 Parameter estimation for mixture models
345(3)
11.3.1 Unidentifiability
346(1)
11.3.2 Computing a MAP estimate is non-convex
347(1)
11.4 The EM algorithm
348(22)
11.4.1 Basic idea
349(1)
11.4.2 EM for GMMs
350(7)
11.4.3 EM for mixture of experts
357(1)
11.4.4 EM for DGMs with hidden variables
358(1)
11.4.5 EM for the Student distribution
359(3)
11.4.6 EM for probit regression
362(1)
11.4.7 Theoretical basis for EM
363(2)
11.4.8 Online EM
365(2)
11.4.9 Other EM variants
367(3)
11.5 Model selection for latent variable models
370(2)
11.5.1 Model selection for probabilistic models
370(1)
11.5.2 Model selection for non-probabilistic methods
370(2)
11.6 Fitting models with missing data
372(9)
11.6.1 EM for the MLE of an MVN with missing data
373(8)
12 Latent linear models 381(40)
12.1 Factor analysis
381(6)
12.1.1 FA is a low rank parameterization of an MVN
381(1)
12.1.2 Inference of the latent factors
382(1)
12.1.3 Unidentifiability
383(2)
12.1.4 Mixtures of factor analysers
385(1)
12.1.5 EM for factor analysis models
386(1)
12.1.6 Fitting FA models with missing data
387(1)
12.2 Principal components analysis (PCA)
387(11)
12.2.1 Classical PCA: statement of the theorem
387(2)
12.2.2 Proof
389(3)
12.2.3 Singular value decomposition (SVD)
392(3)
12.2.4 Probabilistic PCA
395(1)
12.2.5 EM algorithm for PCA
396(2)
12.3 Choosing the number of latent dimensions
398(4)
12.3.1 Model selection for FA/PPCA
398(1)
12.3.2 Model selection for PCA
399(3)
12.4 PCA for categorical data
402(2)
12.5 PCA for paired and multi-view data
404(3)
12.5.1 Supervised PCA (latent factor regression)
405(1)
12.5.2 Partial least squares
406(1)
12.5.3 Canonical correlation analysis
407(1)
12.6 Independent Component Analysis (ICA)
407(14)
12.6.1 Maximum likelihood estimation
410(1)
12.6.2 The FastICA algorithm
411(3)
12.6.3 Using EM
414(1)
12.6.4 Other estimation principles
415(6)
13 Sparse linear models 421(58)
13.1 Introduction
421(1)
13.2 Bayesian variable selection
422(7)
13.2.1 The spike and slab model
424(1)
13.2.2 From the Bernoulli-Gaussian model to l0 regularization
425(1)
13.2.3 Algorithms
426(3)
13.3 l1 regularization: basics
429(12)
13.3.1 Why does l1 regularization yield sparse solutions?
430(1)
13.3.2 Optimality conditions for lasso
431(4)
13.3.3 Comparison of least squares, lasso, ridge and subset selection
435(1)
13.3.4 Regularization path
436(3)
13.3.5 Model selection
439(1)
13.3.6 Bayesian inference for linear models with Laplace priors
440(1)
13.4 l1 regularization: algorithms
441(8)
13.4.1 Coordinate descent
441(1)
13.4.2 LARS and other homotopy methods
441(1)
13.4.3 Proximal and gradient projection methods
442(5)
13.4.4 EM for lasso
447(2)
13.5 l1 regularization: extensions
449(8)
13.5.1 Group Lasso
449(5)
13.5.2 Fused lasso
454(1)
13.5.3 Elastic net (ridge and lasso combined)
455(2)
13.6 Non-convex regularizers
457(6)
13.6.1 Bridge regression
458(1)
13.6.2 Hierarchical adaptive lasso
458(4)
13.6.3 Other hierarchical priors
462(1)
13.7 Automatic relevance determination (ARD)/sparse Bayesian learning (SBL)
463(5)
13.7.1 ARD for linear regression
463(2)
13.7.2 Whence sparsity?
465(1)
13.7.3 Connection to MAP estimation
465(1)
13.7.4 Algorithms for ARD
466(2)
13.7.5 ARD for logistic regression
468(1)
13.8 Sparse coding
468(11)
13.8.1 Learning a sparse coding dictionary
469(1)
13.8.2 Results of dictionary learning from image patches
470(2)
13.8.3 Compressed sensing
472(1)
13.8.4 Image inpainting and denoising
472(7)
14 Kernels 479(36)
14.1 Introduction
479(1)
14.2 Kernel functions
479(7)
14.2.1 RBF kernels
480(1)
14.2.2 Kernels for comparing documents
480(1)
14.2.3 Mercer (positive definite) kernels
481(1)
14.2.4 Linear kernels
482(1)
14.2.5 Matern kernels
482(1)
14.2.6 String kernels
483(1)
14.2.7 Pyramid match kernels
484(1)
14.2.8 Kernels derived from probabilistic generative models
485(1)
14.3 Using kernels inside GLMs
486(2)
14.3.1 Kernel machines
486(1)
14.3.2 L1VMs, RVMs, and other sparse vector machines
487(1)
14.4 The kernel trick
488(8)
14.4.1 Kernelized nearest neighbor classification
489(1)
14.4.2 Kernelized K-medoids clustering
489(3)
14.4.3 Kernelized ridge regression
492(1)
14.4.4 Kernel PCA
493(3)
14.5 Support vector machines (SVMs)
496(9)
14.5.1 SVMs for regression
497(1)
14.5.2 SVMs for classification
498(6)
14.5.3 Choosing C
504(1)
14.5.4 Summary of key points
504(1)
14.5.5 A probabilistic interpretation of SVMs
505(1)
14.6 Comparison of discriminative kernel methods
505(2)
14.7 Kernels for building generative models
507(8)
14.7.1 Smoothing kernels
507(1)
14.7.2 Kernel density estimation (KDE)
508(1)
14.7.3 From KDE to KNN
509(1)
14.7.4 Kernel regression
510(2)
14.7.5 Locally weighted regression
512(3)
15 Gaussian processes 515(28)
15.1 Introduction
515(1)
15.2 GPs for regression
516(9)
15.2.1 Predictions using noise-free observations
517(1)
15.2.2 Predictions using noisy observations
518(1)
15.2.3 Effect of the kernel parameters
519(2)
15.2.4 Estimating the kernel parameters
521(3)
15.2.5 Computational and numerical issues
524(1)
15.2.6 Semi-parametric GPs
524(1)
15.3 GPs meet GLMs
525(7)
15.3.1 Binary classification
525(3)
15.3.2 Multi-class classification
528(3)
15.3.3 GPs for Poisson regression
531(1)
15.4 Connection with other methods
532(8)
15.4.1 Linear models compared to GPs
532(1)
15.4.2 Linear smoothers compared to GPs
533(1)
15.4.3 SVMs compared to GPs
534(1)
15.4.4 L1VM and RVM compared to GPs
534(1)
15.4.5 Neural networks compared to GPs
535(1)
15.4.6 Smoothing splines compared to GPs
536(2)
15.4.7 RKHS methods compared to GPs
538(2)
15.5 GP latent variable model
540(2)
15.6 Approximation methods for large datasets
542(1)
16 Adaptive basis function models 543(46)
16.1 Introduction
543(1)
16.2 Classification and regression trees (CART)
544(8)
16.2.1 Basics
544(1)
16.2.2 Growing a tree
545(4)
16.2.3 Pruning a tree
549(1)
16.2.4 Pros and cons of trees
550(1)
16.2.5 Random forests
550(1)
16.2.6 CART compared to hierarchical mixture of experts
551(1)
16.3 Generalized additive models
552(2)
16.3.1 Backfitting
552(1)
16.3.2 Computational efficiency
553(1)
16.3.3 Multivariate adaptive regression splines (MARS)
553(1)
16.4 Boosting
554(9)
16.4.1 Forward stagewise additive modeling
555(2)
16.4.2 L2boosting
557(1)
16.4.3 AdaBoost
558(1)
16.4.4 LogitBoost
559(1)
16.4.5 Boosting as functional gradient descent
560(1)
16.4.6 Sparse boosting
561(1)
16.4.7 Multivariate adaptive regression trees (MART)
562(1)
16.4.8 Why does boosting work so well?
562(1)
16.4.9 A Bayesian view
563(1)
16.5 Feedforward neural networks (multilayer perceptrons)
563(17)
16.5.1 Convolutional neural networks
564(4)
16.5.2 Other kinds of neural networks
568(1)
16.5.3 A brief history of the field
568(1)
16.5.4 The backpropagation algorithm
569(3)
16.5.5 Identifiability
572(1)
16.5.6 Regularization
572(4)
16.5.7 Bayesian inference
576(4)
16.6 Ensemble learning
580(2)
16.6.1 Stacking
580(1)
16.6.2 Error-correcting output codes
581(1)
16.6.3 Ensemble learning is not equivalent to Bayes model averaging
581(1)
16.7 Experimental comparison
582(3)
16.7.1 Low-dimensional features
582(1)
16.7.2 High-dimensional features
583(2)
16.8 Interpreting black-box models
585(4)
17 Markov and hidden Markov models 589(42)
17.1 Introduction
589(1)
17.2 Markov models
589(14)
17.2.1 Transition matrix
589(2)
17.2.2 Application: Language modeling
591(5)
17.2.3 Stationary distribution of a Markov chain
596(4)
17.2.4 Application: Google's PageRank algorithm for web page ranking
600(3)
17.3 Hidden Markov models
603(3)
17.3.1 Applications of HMMs
604(2)
17.4 Inference in HMMs
606(11)
17.4.1 Types of inference problems for temporal models
606(3)
17.4.2 The forwards algorithm
609(1)
17.4.3 The forwards-backwards algorithm
610(2)
17.4.4 The Viterbi algorithm
612(4)
17.4.5 Forwards filtering, backwards sampling
616(1)
17.5 Learning for HMMs
617(4)
17.5.1 Training with fully observed data
617(1)
17.5.2 EM for HMMs (the Baum-Welch algorithm)
618(2)
17.5.3 Bayesian methods for "fitting" HMMs
620(1)
17.5.4 Discriminative training
620(1)
17.5.5 Model selection
621(1)
17.6 Generalizations of HMMs
621(10)
17.6.1 Variable duration (semi-Markov) HMMs
622(2)
17.6.2 Hierarchical HMMs
624(1)
17.6.3 Input-output HMMs
625(1)
17.6.4 Auto-regressive and buried HMMs
626(1)
17.6.5 Factorial HMM
627(1)
17.6.6 Coupled HMM and the influence model
628(1)
17.6.7 Dynamic Bayesian networks (DBNs)
628(3)
18 State space models 631(30)
18.1 Introduction
631(1)
18.2 Applications of SSMs
632(8)
18.2.1 SSMs for object tracking
632(1)
18.2.2 Robotic SLAM
633(3)
18.2.3 Online parameter learning using recursive least squares
636(1)
18.2.4 SSM for time series forecasting
637(3)
18.3 Inference in LG-SSM
640(6)
18.3.1 The Kalman filtering algorithm
640(3)
18.3.2 The Kalman smoothing algorithm
643(3)
18.4 Learning for LG-SSM
646(1)
18.4.1 Identifiability and numerical stability
646(1)
18.4.2 Training with fully observed data
647(1)
18.4.3 EM for LG-SSM
647(1)
18.4.4 Subspace methods
647(1)
18.4.5 Bayesian methods for "fitting" LG-SSMs
647(1)
18.5 Approximate online inference for non-linear, non-Gaussian SSMs
647(8)
18.5.1 Extended Kalman filter (EKF)
648(2)
18.5.2 Unscented Kalman filter (UKF)
650(2)
18.5.3 Assumed density filtering (ADF)
652(3)
18.6 Hybrid discrete/continuous SSMs
655(6)
18.6.1 Inference
656(2)
18.6.2 Application: data association and multi-target tracking
658(1)
18.6.3 Application: fault diagnosis
659(1)
18.6.4 Application: econometric forecasting
660(1)
19 Undirected graphical models (Markov random fields) 661(46)
19.1 Introduction
661(1)
19.2 Conditional independence properties of UGMs
661(4)
19.2.1 Key properties
661(2)
19.2.2 An undirected alternative to d-separation
663(1)
19.2.3 Comparing directed and undirected graphical models
664(1)
19.3 Parameterization of MRFs
665(3)
19.3.1 The Hammersley-Clifford theorem
665(2)
19.3.2 Representing potential functions
667(1)
19.4 Examples of MRFs
668(8)
19.4.1 Ising model
668(1)
19.4.2 Hopfield networks
669(2)
19.4.3 Potts model
671(1)
19.4.4 Gaussian MRFs
672(2)
19.4.5 Markov logic networks
674(2)
19.5 Learning
676(8)
19.5.1 Training maxent models using gradient methods
676(1)
19.5.2 Training partially observed maxent models
677(1)
19.5.3 Approximate methods for computing the MLEs of MRFs
678(1)
19.5.4 Pseudo likelihood
678(1)
19.5.5 Stochastic maximum likelihood
679(1)
19.5.6 Feature induction for maxent models
680(1)
19.5.7 Iterative proportional fitting (IPF)
681(3)
19.6 Conditional random fields (CRFs)
684(9)
19.6.1 Chain-structured CRFs, MEMMs and the label-bias problem
684(2)
19.6.2 Applications of CRFs
686(6)
19.6.3 CRF training
692(1)
19.7 Structural SVMs
693(14)
19.7.1 SSVMs: a probabilistic view
693(2)
19.7.2 SSVMs: a non-probabilistic view
695(3)
19.7.3 Cutting plane methods for fitting SSVMs
698(2)
19.7.4 Online algorithms for fitting SSVMs
700(1)
19.7.5 Latent structural SVMs
701(6)
20 Exact inference for graphical models 707(24)
20.1 Introduction
707(1)
20.2 Belief propagation for trees
707(7)
20.2.1 Serial protocol
707(2)
20.2.2 Parallel protocol
709(1)
20.2.3 Gaussian BP
710(2)
20.2.4 Other BP variants
712(2)
20.3 The variable elimination algorithm
714(6)
20.3.1 The generalized distributive law
717(1)
20.3.2 Computational complexity of VE
717(3)
20.3.3 A weakness of VE
720(1)
20.4 The junction tree algorithm
720(6)
20.4.1 Creating a junction tree
720(2)
20.4.2 Message passing on a junction tree
722(3)
20.4.3 Computational complexity of JTA
725(1)
20.4.4 JTA generalizations
726(1)
20.5 Computational intractability of exact inference in the worst case
726(5)
20.5.1 Approximate inference
727(4)
21 Variational inference 731(36)
21.1 Introduction
731(1)
21.2 Variational inference
732(3)
21.2.1 Alternative interpretations of the variational objective
733(1)
21.2.2 Forward or reverse KL?
733(2)
21.3 The mean field method
735(4)
21.3.1 Derivation of the mean field update equations
736(1)
21.3.2 Example: mean field for the Ising model
737(2)
21.4 Structured mean field
739(3)
21.4.1 Example: factorial HMM
740(2)
21.5 Variational Bayes
742(7)
21.5.1 Example: VB for a univariate Gaussian
742(4)
21.5.2 Example: VB for linear regression
746(3)
21.6 Variational Bayes EM
749(7)
21.6.1 Example: VBEM for mixtures of Gaussians
750(6)
21.7 Variational message passing and VIBES
756(1)
21.8 Local variational bounds
756(11)
21.8.1 Motivating applications
756(2)
21.8.2 Bohning's quadratic bound to the log-sum-exp function
758(2)
21.8.3 Bounds for the sigmoid function
760(2)
21.8.4 Other bounds and approximations to the log-sum-exp function
762(1)
21.8.5 Variational inference based on upper bounds
763(4)
22 More variational inference 767(48)
22.1 Introduction
767(1)
22.2 Loopy belief propagation: algorithmic issues
767(9)
22.2.1 A brief history
767(1)
22.2.2 LBP on pairwise models
768(1)
22.2.3 LBP on a factor graph
769(2)
22.2.4 Convergence
771(3)
22.2.5 Accuracy of LBP
774(1)
22.2.6 Other speedup tricks for LBP
775(1)
22.3 Loopy belief propagation: theoretical issues
776(7)
22.3.1 UGMs represented in exponential family form
776(1)
22.3.2 The marginal polytope
777(1)
22.3.3 Exact inference as a variational optimization problem
778(1)
22.3.4 Mean field as a variational optimization problem
779(1)
22.3.5 LBP as a variational optimization problem
779(4)
22.3.6 Loopy BP vs mean field
783(1)
22.4 Extensions of belief propagation
783(4)
22.4.1 Generalized belief propagation
783(2)
22.4.2 Convex belief propagation
785(2)
22.5 Expectation propagation
787(12)
22.5.1 EP as a variational inference problem
788(1)
22.5.2 Optimizing the EP objective using moment matching
789(2)
22.5.3 EP for the clutter problem
791(1)
22.5.4 LBP is a special case of EP
792(1)
22.5.5 Ranking players using TrueSkill
793(6)
22.5.6 Other applications of EP
799(1)
22.6 MAP state estimation
799(16)
22.6.1 Linear programming relaxation
799(1)
22.6.2 Max-product belief propagation
800(1)
22.6.3 Graphcuts
801(3)
22.6.4 Experimental comparison of graphcuts and BP
804(2)
22.6.5 Dual decomposition
806(9)
23 Monte Carlo inference 815(22)
23.1 Introduction
815(1)
23.2 Sampling from standard distributions
815(2)
23.2.1 Using the cdf
815(2)
23.2.2 Sampling from a Gaussian (Box-Muller method)
817(1)
23.3 Rejection sampling
817(3)
23.3.1 Basic idea
817(1)
23.3.2 Example
818(1)
23.3.3 Application to Bayesian statistics
819(1)
23.3.4 Adaptive rejection sampling
819(1)
23.3.5 Rejection sampling in high dimensions
820(1)
23.4 Importance sampling
820(3)
23.4.1 Basic idea
820(1)
23.4.2 Handling unnormalized distributions
821(1)
23.4.3 Importance sampling for a DGM: likelihood weighting
822(1)
23.4.4 Sampling importance resampling (SIR)
822(1)
23.5 Particle filtering
823(8)
23.5.1 Sequential importance sampling
824(1)
23.5.2 The degeneracy problem
825(1)
23.5.3 The resampling step
825(2)
23.5.4 The proposal distribution
827(1)
23.5.5 Application: robot localization
828(1)
23.5.6 Application: visual object tracking
828(3)
23.5.7 Application: time series forecasting
831(1)
23.6 Rao-Blackwellised particle filtering (RBPF)
831(6)
23.6.1 RBPF for switching LG-SSMs
831(1)
23.6.2 Application: tracking a maneuvering target
832(2)
23.6.3 Application: FastSLAM
834(3)
24 Markov chain Monte Carlo (MCMC) inference 837(38)
24.1 Introduction
837(1)
24.2 Gibbs sampling
838(10)
24.2.1 Basic idea
838(1)
24.2.2 Example: Gibbs sampling for the Ising model
838(2)
24.2.3 Example: Gibbs sampling for inferring the parameters of a GMM
840(1)
24.2.4 Collapsed Gibbs sampling
841(3)
24.2.5 Gibbs sampling for hierarchical GLMs
844(2)
24.2.6 BUGS and JAGS
846(1)
24.2.7 The Imputation Posterior (IP) algorithm
847(1)
24.2.8 Blocking Gibbs sampling
847(1)
24.3 Metropolis Hastings algorithm
848(8)
24.3.1 Basic idea
848(1)
24.3.2 Gibbs sampling is a special case of MH
849(1)
24.3.3 Proposal distributions
850(3)
24.3.4 Adaptive MCMC
853(1)
24.3.5 Initialization and mode hopping
854(1)
24.3.6 Why MH works
854(1)
24.3.7 Reversible jump (trans-dimensional) MCMC
855(1)
24.4 Speed and accuracy of MCMC
856(7)
24.4.1 The burn-in phase
856(1)
24.4.2 Mixing rates of Markov chains
857(1)
24.4.3 Practical convergence diagnostics
858(2)
24.4.4 Accuracy of MCMC
860(2)
24.4.5 How many chains?
862(1)
24.5 Auxiliary variable MCMC
863(5)
24.5.1 Auxiliary variable sampling for logistic regression
863(1)
24.5.2 Slice sampling
864(2)
24.5.3 Swendsen Wang
866(2)
24.5.4 Hybrid/Hamiltonian MCMC
868(1)
24.6 Annealing methods
868(4)
24.6.1 Simulated annealing
869(2)
24.6.2 Annealed importance sampling
871(1)
24.6.3 Parallel tempering
871(1)
24.7 Approximating the marginal likelihood
872(3)
24.7.1 The candidate method
872(1)
24.7.2 Harmonic mean estimate
872(1)
24.7.3 Annealed importance sampling
873(2)
25 Clustering 875(32)
25.1 Introduction
875(4)
25.1.1 Measuring (dis)similarity
875(1)
25.1.2 Evaluating the output of clustering methods
876(3)
25.2 Dirichlet process mixture models
879(8)
25.2.1 From finite to infinite mixture models
879(3)
25.2.2 The Dirichlet process
882(3)
25.2.3 Applying Dirichlet processes to mixture modeling
885(1)
25.2.4 Fitting a DP mixture model
886(1)
25.3 Affinity propagation
887(3)
25.4 Spectral clustering
890(3)
25.4.1 Graph Laplacian
891(1)
25.4.2 Normalized graph Laplacian
892(1)
25.4.3 Example
893(1)
25.5 Hierarchical clustering
893(8)
25.5.1 Agglomerative clustering
895(3)
25.5.2 Divisive clustering
898(1)
25.5.3 Choosing the number of clusters
899(1)
25.5.4 Bayesian hierarchical clustering
899(2)
25.6 Clustering datapoints and features
901(6)
25.6.1 Biclustering
903(1)
25.6.2 Multi-view clustering
903(4)
26 Graphical model structure learning 907(38)
26.1 Introduction
907(1)
26.2 Structure learning for knowledge discovery
908(2)
26.2.1 Relevance networks
908(1)
26.2.2 Dependency networks
909(1)
26.3 Learning tree structures
910(4)
26.3.1 Directed or undirected tree?
911(1)
26.3.2 Chow-Liu algorithm for finding the ML tree structure
912(1)
26.3.3 Finding the MAP forest
912(2)
26.3.4 Mixtures of trees
914(1)
26.4 Learning DAG structures
914(8)
26.4.1 Markov equivalence
914(2)
26.4.2 Exact structural inference
916(4)
26.4.3 Scaling up to larger graphs
920(2)
26.5 Learning DAG structure with latent variables
922(9)
26.5.1 Approximating the marginal likelihood when we have missing data
922(3)
26.5.2 Structural EM
925(1)
26.5.3 Discovering hidden variables
926(2)
26.5.4 Case study: Google's Rephil
928(1)
26.5.5 Structural equation models
929(2)
26.6 Learning causal DAGs
931(7)
26.6.1 Causal interpretation of DAGs
931(2)
26.6.2 Using causal DAGs to resolve Simpson's paradox
933(2)
26.6.3 Learning causal DAG structures
935(3)
26.7 Learning undirected Gaussian graphical models
938(4)
26.7.1 MLE for a GGM
938(1)
26.7.2 Graphical lasso
939(2)
26.7.3 Bayesian inference for GGM structure
941(1)
26.7.4 Handling non-Gaussian data using copulas
942(1)
26.8 Learning undirected discrete graphical models
942(3)
26.8.1 Graphical lasso for MRFs/CRFs
942(2)
26.8.2 Thin junction trees
944(1)
27 Latent variable models for discrete data 945(50)
27.1 Introduction
945(1)
27.2 Distributed state LVMs for discrete data
946(4)
27.2.1 Mixture models
946(1)
27.2.2 Exponential family PCA
947(1)
27.2.3 LDA and mPCA
948(1)
27.2.4 GaP model and non-negative matrix factorization
949(1)
27.3 Latent Dirichlet allocation (LDA)
950(11)
27.3.1 Basics
950(3)
27.3.2 Unsupervised discovery of topics
953(1)
27.3.3 Quantitatively evaluating LDA as a language model
953(2)
27.3.4 Fitting using (collapsed) Gibbs sampling
955(1)
27.3.5 Example
956(1)
27.3.6 Fitting using batch variational inference
957(2)
27.3.7 Fitting using online variational inference
959(1)
27.3.8 Determining the number of topics
960(1)
27.4 Extensions of LDA
961(9)
27.4.1 Correlated topic model
961(1)
27.4.2 Dynamic topic model
962(1)
27.4.3 LDA-HMM
963(4)
27.4.4 Supervised LDA
967(3)
27.5 LVMs for graph-structured data
970(5)
27.5.1 Stochastic block model
971(2)
27.5.2 Mixed membership stochastic block model
973(1)
27.5.3 Relational topic model
974(1)
27.6 LVMs for relational data
975(8)
27.6.1 Infinite relational model
976(3)
27.6.2 Probabilistic matrix factorization for collaborative filtering
979(4)
27.7 Restricted Boltzmann machines (RBMs)
983(12)
27.7.1 Varieties of RBMs
985(2)
27.7.2 Learning RBMs
987(4)
27.7.3 Applications of RBMs
991(4)
28 Deep learning 995(14)
28.1 Introduction
995(1)
28.2 Deep generative models
995(4)
28.2.1 Deep directed networks
996(1)
28.2.2 Deep Boltzmann machines
996(1)
28.2.3 Deep belief networks
997(1)
28.2.4 Greedy layer-wise learning of DBNs
998(1)
28.3 Deep neural networks
999(2)
28.3.1 Deep multi-layer perceptrons
999(1)
28.3.2 Deep auto-encoders
1000(1)
28.3.3 Stacked denoising auto-encoders
1001(1)
28.4 Applications of deep networks
1001(4)
28.4.1 Handwritten digit classification using DBNs
1001(1)
28.4.2 Data visualization and feature discovery using deep auto-encoders
1002(1)
28.4.3 Information retrieval using deep auto-encoders (semantic hashing)
1003(1)
28.4.4 Learning audio features using 1d convolutional DBNs
1004(1)
28.4.5 Learning image features using 2d convolutional DBNs
1005(1)
28.5 Discussion
1005(4)
Notation 1009(6)
Bibliography 1015(32)
Indexes 1047
Index to code
1047(3)
Index to keywords
1050