Series Foreword  xxi
Foreword  xxiii
Preface  xxv

Part I: Introductory Material  1

1 Learning, Regularity, and Compression  3
1.1 Regularity and Learning  4
1.2 Regularity and Compression  4
1.3 Solomonoff's Breakthrough -- Kolmogorov Complexity  8
1.4 Making the Idea Applicable  10
1.5 Crude MDL, Refined MDL and Universal Coding  12
1.5.1 From Crude to Refined MDL  14
1.5.2 Universal Coding and Refined MDL  17
1.5.3 Refined MDL for Model Selection  18
1.5.4 Refined MDL for Prediction and Hypothesis Selection  20
1.6 Some Remarks on Model Selection  23
1.6.1 Model Selection among Non-Nested Models  23
1.6.2 Goals of Model vs. Point Hypothesis Selection  25
1.7 The MDL Philosophy  26
1.8 MDL, Occam's Razor, and the "True Model"  29
1.8.1 Answer to Criticism No. 1  30
1.8.2 Answer to Criticism No. 2  32
1.9 History and Forms of MDL  36
1.9.1  37
1.9.2  38
1.10  40

2 Probabilistic and Statistical Preliminaries  41
2.1 General Mathematical Preliminaries  41
2.2 Probabilistic Preliminaries  46
2.2.1 Definitions; Notational Conventions  46
2.2.2 Probabilistic Sources  53
2.2.3 Limit Theorems and Statements  55
2.2.4 Probabilistic Models  57
2.2.5 Probabilistic Model Classes  60
2.3 Kinds of Probabilistic Models*  62
2.4 Terminological Preliminaries  69
2.5 Modeling Preliminaries: Goals and Methods for Inductive Inference  71
2.5.1  71
2.5.2 Basic Concepts of Bayesian Statistics  74
2.6  78

3 Information-Theoretic Preliminaries  79
3.1  79
3.1.1 Restriction to Prefix Coding Systems; Descriptions as Messages  83
3.1.2 Different Kinds of Codes  86
3.1.3 Assessing the Efficiency of Description Methods  90
3.2 The Most Important Section of This Book: Probabilities and Code Lengths  90
3.2.1 The Kraft Inequality  91
3.2.2 Code Lengths "Are" Probabilities  95
3.2.3 Immediate Insights and Consequences  99
3.3 Probabilities and Code Lengths, Part II  101
3.3.1 (Relative) Entropy and the Information Inequality  103
3.3.2 Uniform Codes, Maximum Entropy, and Minimax Codelength  106
3.4 Summary, Outlook, Further Reading  106

4 Information-Theoretic Properties of Statistical Models  109
4.1  109
4.2 Likelihood and Observed Fisher Information  111
4.3 KL Divergence and Expected Fisher Information  117
4.4 Maximum Likelihood: Data vs. Parameters  124
4.5  130

5 Crude Two-Part Code MDL  131
5.1 Introduction: Making Two-Part MDL Precise  132
5.2 Two-Part Code MDL for Markov Chain Selection  133
5.2.1  135
5.2.2  137
5.2.3 Crude Two-Part Code MDL for Markov Chains  138
5.3 Simplistic Two-Part Code MDL Hypothesis Selection  139
5.4 Two-Part MDL for Tasks Other Than Hypothesis Selection  141
5.5 Behavior of Two-Part Code MDL  142
5.6 Two-Part Code MDL and Maximum Likelihood  144
5.6.1 The Maximum Likelihood Principle  144
5.6.2  147
5.6.3 MDL as a Maximum Probability Principle  148
5.7 Computing and Approximating Two-Part MDL in Practice  150
5.8 Justifying Crude MDL: Consistency and Code Design  152
5.8.1 A General Consistency Result  153
5.8.2 Code Design for Two-Part Code MDL  157
5.9  163
5.A Appendix: Proof of Theorem 5.1  163

Part II: Universal Coding  165

6 Universal Coding with Countable Models  171
6.1 Universal Coding: The Basic Idea  172
6.1.1 Two-Part Codes as Simple Universal Codes  174
6.1.2 From Universal Codes to Universal Models  175
6.1.3 Formal Definition of Universality  177
6.2  178
6.2.1 Minimax Regret and Normalized ML  179
6.2.2 NML vs. Two-Part vs. Bayes  182
6.3 The Countably Infinite Case  184
6.3.1 The Two-Part and Bayesian Codes  184
6.3.2  187
6.4 Prequential Universal Models  190
6.4.1 Distributions as Prediction Strategies  190
6.4.2 Bayes Is Prequential; NML and Two-Part Are Not  193
6.4.3 The Prequential Plug-In Model  197
6.5 Individual vs. Stochastic Universality*  199
6.5.1 Stochastic Redundancy  199
6.5.2 Uniformly Universal Models  201
6.6 Summary, Outlook and Further Reading  204

7 Parametric Models: Normalized Maximum Likelihood  207
7.1  207
7.1.1  208
7.2 Asymptotic Expansion of Parametric Complexity  211
7.3 The Meaning of ∫ √det I(θ) dθ  216
7.3.1 Complexity and Functional Form  217
7.3.2 KL Divergence and Distinguishability  219
7.3.3 Complexity and Volume  222
7.3.4 Complexity and the Number of Distinguishable Distributions*  224
7.4 Explicit and Simplified Computations  226

8 Parametric Models: Bayes  231
8.1  231
8.1.1 Basic Interpretation of Theorem 8.1  233
8.2 Bayes Meets Minimax -- Jeffreys' Prior  234
8.2.1 Jeffreys' Prior and the Boundary  237
8.3 How to Prove the Bayesian and NML Regret Theorems  239
8.3.1 Proof Sketch of Theorem 8.1  239
8.3.2 Beyond Exponential Families  241
8.3.3 Proof Sketch of Theorem 7.1  243
8.4 Stochastic Universality*  244
8.A Appendix: Proofs of Theorem 8.1 and Theorem 8.2  248

9 Parametric Models: Prequential Plug-in  257
9.1 Prequential Plug-in for Exponential Families  257
9.2 The Plug-in vs. the Bayes Universal Model  262
9.3 More Precise Asymptotics  265
9.4  269

10 Parametric Models: Two-Part  271
10.1 The Ordinary Two-Part Universal Model  271
10.1.1 Derivation of the Two-Part Code Regret  274
10.1.2 Proof Sketch of Theorem 10.1  277
10.1.3  282
10.2 The Conditional Two-Part Universal Code*  284
10.2.1 Conditional Two-Part Codes for Discrete Exponential Families  286
10.2.2 Distinguishability and the Phase Transition*  290
10.3  293

11 NML with Infinite Complexity  295
11.1  295
11.1.1 Examples of Undefined NML Distribution  298
11.1.2 Examples of Undefined Jeffreys' Prior  299
11.2  301
11.2.1 Constrained Parametric Complexity  302
11.2.2 Meta-Two-Part Coding  303
11.2.3 Renormalized Maximum Likelihood*  306
11.3  308
11.3.1 Asymptotic Expansion of LNML  312
11.4 Conditional Universal Models  316
11.4.1 Bayesian Approach with Jeffreys' Prior  317
11.4.2  320
11.4.3 Liang and Barron's Approach  325
11.5  329
11.A Appendix: Proof of Theorem 11.4  329

12 Linear Regression  335
12.1  336
12.1.1 Prelude: The Normal Location Family  338
12.2 Least-Squares Estimation  340
12.2.1 The Normal Equations  342
12.2.2 Composition of Experiments  345
12.2.3 Penalized Least-Squares  346
12.3  348
12.3.1 Bayesian Linear Model Mx with Gaussian Prior  354
12.3.2 Bayesian Linear Models Mx and Sx with Noninformative Priors  359
12.4 Universal Models for Linear Regression  363
12.4.1  363
12.4.2  364
12.4.3 Bayes-Jeffreys and CNML  365

13 Beyond Parametrics  369
13.1  370
13.2 CUP: Unions of Parametric Models  372
13.2.1 CUP vs. Parametric Models  375
13.3 Universal Codes Based on Histograms  376
13.3.1 Redundancy of Universal CUP Histogram Codes  380
13.4 Nonparametric Redundancy  383
13.4.1 Standard CUP Universal Codes  384
13.4.2 Minimax Nonparametric Redundancy  387
13.5 Gaussian Process Regression*  390
13.5.1 Kernelization of Bayesian Linear Regression  390
13.5.2 Gaussian Processes  394
13.5.3 Gaussian Processes as Universal Models  396
13.6 Conclusion and Further Reading  402

Part III: Refined MDL  403

14 MDL Model Selection  409
14.1  409
14.2 Simple Refined MDL Model Selection  411
14.2.1 Compression Interpretation  415
14.2.2 Counting Interpretation  416
14.2.3 Bayesian Interpretation  418
14.2.4 Prequential Interpretation  419
14.3 General Parametric Model Selection  420
14.3.1 Models with Infinite Complexities  420
14.3.2 Comparing Many or Infinitely Many Models  422
14.3.3 The General Picture  425
14.4 Practical Issues in MDL Model Selection  428
14.4.1 Calculating Universal Codelengths  428
14.4.2 Computational Efficiency and Practical Quality of Non-NML Universal Codes  429
14.4.3 Model Selection with Conditional NML and Plug-in Codes  431
14.4.4 General Warnings about Model Selection  435
14.5 MDL Model Selection for Linear Regression  438
14.5.1 Rissanen's RNML Approach  439
14.5.2 Hansen and Yu's gMDL Approach  443
14.5.3 Liang and Barron's Approach  446
14.5.4  448
14.6 Worst Case vs. Average Case*  451

15 MDL Prediction and Estimation  459
15.1  459
15.2 MDL for Prediction and Predictive Estimation  460
15.2.1 Prequential MDL Estimators  461
15.2.2 Prequential MDL Estimators Are Consistent  465
15.2.3 Parametric and Nonparametric Examples  469
15.2.4 Cesàro KL Consistency vs. KL Consistency*  472
15.3 Two-Part Code MDL for Point Hypothesis Selection  476
15.3.1 Discussion of Two-Part Consistency Theorem  478
15.4 MDL Parameter Estimation  483
15.4.1 MDL Estimators vs. Luckiness ML Estimators  487
15.4.2 What Estimator To Use?  491
15.4.3 Comparison to Bayesian Estimators*  493
15.5  498
15.A Appendix: Proof of Theorem 15.3  499

16 MDL Consistency and Convergence  501
16.1  501
16.1.1 The Scenarios Considered  501
16.2 Consistency: Prequential and Two-Part MDL Estimators  502
16.3 Consistency: MDL Model Selection  505
16.3.1 Selection between a Union of Parametric Models  505
16.3.2 Nonparametric Model Selection Based on CUP Model Class  508
16.4 MDL Consistency Peculiarities  511
16.5  515
16.5.1 Relations between Divergences and Risk Measures  517
16.5.2  519
16.6 MDL Rates of Convergence  520
16.6.1 Prequential and Two-Part MDL Estimators  520
16.6.2 MDL Model Selection  522

17 MDL in Context  523
17.1 MDL and Frequentist Paradigms  524
17.1.1 Sanity Check or Design Principle?  525
17.1.2 The Weak Prequential Principle  528
17.1.3 MDL vs. Frequentist Principles: Remaining Issues  529
17.2 MDL and Bayesian Inference  531
17.2.1 Luckiness Functions vs. Prior Distributions  534
17.2.2 MDL, Bayes, and Occam  539
17.2.3 MDL and Brands of Bayesian Statistics  544
17.2.4 Conclusion: A Common Future after All?  548
17.3 MDL, AIC and BIC  549
17.3.1  549
17.3.2  550
17.3.3 Combining the Best of AIC and BIC  552
17.4 MDL and MML  555
17.4.1 Strict Minimum Message Length  556
17.4.2  558
17.4.3 The Wallace-Freeman Estimator  560
17.5 MDL and Prequential Analysis  562
17.6 MDL and Cross-Validation  565
17.7 MDL and Maximum Entropy  567
17.8 Kolmogorov Complexity and Structure Function  570
17.9 MDL and Individual Sequence Prediction  573
17.10 MDL and Statistical Learning Theory  579
17.10.1 Structural Risk Minimization  581
17.10.2 PAC-Bayesian Approaches  585
17.10.3 PAC-Bayes and MDL  588
17.11  592

Part IV: Additional Background  597

18 The Exponential or "Maximum Entropy" Families  599
18.1  600
18.2 Definition and Overview  601
18.3  605
18.4 Mean-Value, Canonical, and Other Parameterizations  609
18.4.1 The Mean-Value Parameterization  609
18.4.2 Other Parameterizations  611
18.4.3 Relating Mean-Value and Canonical Parameters  613
18.5 Exponential Families of General Probabilistic Sources*  617
18.6 Fisher Information Definitions and Characterizations*  619

19 Information-Theoretic Properties of Exponential Families  623
19.1  624
19.2 Robustness of Exponential Family Codes  624
19.2.1 If Θmean Does Not Contain the Mean  627
19.3 Behavior at the ML Estimate β̂  629
19.4 Behavior of the ML Estimate β̂  632
19.4.1 Central Limit Theorem  633
19.4.2  634
19.5 Maximum Entropy and Minimax Codelength  637
19.5.1 Exponential Families and Maximum Entropy  638
19.5.2 Exponential Families and Minimax Codelength  641
19.5.3 The Compression Game  643
19.6 Likelihood Ratio Families and Rényi Divergences*  645
19.6.1 The Likelihood Ratio Family  647
19.7  650

References  651
List of Symbols  675
Subject Index  679