|
|
xix | |
Series Foreword |
|
xxi | |
Foreword |
|
xxiii | |
Preface |
|
xxv | |
|
|
1 | (164) |
|
Learning, Regularity, and Compression |
|
|
3 | (38) |
|
|
4 | (1) |
|
Regularity and Compression |
|
|
4 | (4) |
|
Solomonoff's Breakthrough -- Kolmogorov Complexity |
|
|
8 | (2) |
|
Making the Idea Applicable |
|
|
10 | (2) |
|
Crude MDL, Refined MDL and Universal Coding |
|
|
12 | (11) |
|
From Crude to Refined MDL |
|
|
14 | (3) |
|
Universal Coding and Refined MDL |
|
|
17 | (1) |
|
Refined MDL for Model Selection |
|
|
18 | (2) |
|
Refined MDL for Prediction and Hypothesis Selection |
|
|
20 | (3) |
|
Some Remarks on Model Selection |
|
|
23 | (3) |
|
Model Selection among Non-Nested Models |
|
|
23 | (2) |
|
Goals of Model vs. Point Hypothesis Selection |
|
|
25 | (1) |
|
|
26 | (3) |
|
MDL, Occam's Razor, and the ``True Model'' |
|
|
29 | (7) |
|
Answer to Criticism No. 1 |
|
|
30 | (2) |
|
Answer to Criticism No. 2 |
|
|
32 | (4) |
|
|
36 | (4) |
|
|
37 | (1) |
|
|
38 | (2) |
|
|
40 | (1) |
|
Probabilistic and Statistical Preliminaries |
|
|
41 | (38) |
|
General Mathematical Preliminaries |
|
|
41 | (5) |
|
Probabilistic Preliminaries |
|
|
46 | (16) |
|
Definitions; Notational Conventions |
|
|
46 | (7) |
|
|
53 | (2) |
|
Limit Theorems and Statements |
|
|
55 | (2) |
|
|
57 | (3) |
|
Probabilistic Model Classes |
|
|
60 | (2) |
|
Kinds of Probabilistic Models |
|
|
62 | (7) |
|
Terminological Preliminaries |
|
|
69 | (2) |
|
Modeling Preliminaries: Goals and Methods for Inductive Inference |
|
|
71 | (7) |
|
|
71 | (3) |
|
Basic Concepts of Bayesian Statistics |
|
|
74 | (4) |
|
|
78 | (1) |
|
Information-Theoretic Preliminaries |
|
|
79 | (30) |
|
|
79 | (11) |
|
Restriction to Prefix Coding Systems; Descriptions as Messages |
|
|
83 | (3) |
|
|
86 | (4) |
|
Assessing the Efficiency of Description Methods |
|
|
90 | (1) |
|
The Most Important Section of This Book: Probabilities and Code Lengths |
|
|
90 | (11) |
|
|
91 | (4) |
|
Code Lengths ``Are'' Probabilities |
|
|
95 | (4) |
|
Immediate Insights and Consequences |
|
|
99 | (2) |
|
Probabilities and Code Lengths, Part II |
|
|
101 | (5) |
|
(Relative) Entropy and the Information Inequality |
|
|
103 | (3) |
|
Uniform Codes, Maximum Entropy, and Minimax Codelength |
|
|
106 | (1) |
|
Summary, Outlook, Further Reading |
|
|
106 | (3) |
|
Information-Theoretic Properties of Statistical Models |
|
|
109 | (22) |
|
|
109 | (2) |
|
Likelihood and Observed Fisher Information |
|
|
111 | (6) |
|
KL Divergence and Expected Fisher Information |
|
|
117 | (7) |
|
Maximum Likelihood: Data vs. Parameters |
|
|
124 | (6) |
|
|
130 | (1) |
|
|
131 | (34) |
|
Introduction: Making Two-Part MDL Precise |
|
|
132 | (1) |
|
Two-Part Code MDL for Markov Chain Selection |
|
|
133 | (6) |
|
|
135 | (2) |
|
|
137 | (1) |
|
Crude Two-Part Code MDL for Markov Chains |
|
|
138 | (1) |
|
Simplistic Two-Part Code MDL Hypothesis Selection |
|
|
139 | (2) |
|
Two-Part MDL for Tasks Other Than Hypothesis Selection |
|
|
141 | (1) |
|
Behavior of Two-Part Code MDL |
|
|
142 | (2) |
|
Two-Part Code MDL and Maximum Likelihood |
|
|
144 | (6) |
|
The Maximum Likelihood Principle |
|
|
144 | (3) |
|
|
147 | (1) |
|
MDL as a Maximum Probability Principle |
|
|
148 | (2) |
|
Computing and Approximating Two-Part MDL in Practice |
|
|
150 | (2) |
|
Justifying Crude MDL: Consistency and Code Design |
|
|
152 | (11) |
|
A General Consistency Result |
|
|
153 | (4) |
|
Code Design for Two-Part Code MDL |
|
|
157 | (6) |
|
|
163 | (1) |
|
Appendix: Proof of Theorem 5.1 |
|
|
163 | (2) |
|
|
165 | (238) |
|
Universal Coding with Countable Models |
|
|
171 | (36) |
|
Universal Coding: The Basic Idea |
|
|
172 | (6) |
|
Two-Part Codes as Simple Universal Codes |
|
|
174 | (1) |
|
From Universal Codes to Universal Models |
|
|
175 | (2) |
|
Formal Definition of Universality |
|
|
177 | (1) |
|
|
178 | (6) |
|
Minimax Regret and Normalized ML |
|
|
179 | (3) |
|
NML vs. Two-Part vs. Bayes |
|
|
182 | (2) |
|
The Countably Infinite Case |
|
|
184 | (6) |
|
The Two-Part and Bayesian Codes |
|
|
184 | (3) |
|
|
187 | (3) |
|
Prequential Universal Models |
|
|
190 | (9) |
|
Distributions as Prediction Strategies |
|
|
190 | (3) |
|
Bayes Is Prequential; NML and Two-part Are Not |
|
|
193 | (4) |
|
The Prequential Plug-In Model |
|
|
197 | (2) |
|
Individual vs. Stochastic Universality |
|
|
199 | (5) |
|
|
199 | (2) |
|
Uniformly Universal Models |
|
|
201 | (3) |
|
Summary, Outlook and Further Reading |
|
|
204 | (3) |
|
Parametric Models: Normalized Maximum Likelihood |
|
|
207 | (24) |
|
|
207 | (4) |
|
|
208 | (3) |
|
Asymptotic Expansion of Parametric Complexity |
|
|
211 | (5) |
|
The Meaning of ∫Θ √det I(theta;)dθ |
|
|
216 | (10) |
|
Complexity and Functional Form |
|
|
217 | (2) |
|
KL Divergence and Distinguishability |
|
|
219 | (3) |
|
|
222 | (2) |
|
Complexity and the Number of Distinguishable Distributions |
|
|
224 | (2) |
|
Explicit and Simplified Computations |
|
|
226 | (5) |
|
|
231 | (26) |
|
|
231 | (3) |
|
Basic Interpretation of Theorem 8.1 |
|
|
233 | (1) |
|
Bayes Meets Minimax -- Jeffreys' Prior |
|
|
234 | (5) |
|
Jeffreys' Prior and the Boundary |
|
|
237 | (2) |
|
How to Prove the Bayesian and NML Regret Theorems |
|
|
239 | (5) |
|
Proof Sketch of Theorem 8.1 |
|
|
239 | (2) |
|
Beyond Exponential Families |
|
|
241 | (2) |
|
Proof Sketch of Theorem 7.1 |
|
|
243 | (1) |
|
|
244 | (4) |
|
Appendix: Proofs of Theorem 8.1 and Theorem 8.2 |
|
|
248 | (9) |
|
Parametric Models: Prequential Plug-in |
|
|
257 | (14) |
|
Prequential Plug-in for Exponential Families |
|
|
257 | (5) |
|
The Plug-in vs. the Bayes Universal Model |
|
|
262 | (3) |
|
|
265 | (4) |
|
|
269 | (2) |
|
Parametric Models: Two-Part |
|
|
271 | (24) |
|
The Ordinary Two-Part Universal Model |
|
|
271 | (13) |
|
Derivation of the Two-Part Code Regret |
|
|
274 | (3) |
|
Proof Sketch of Theorem 10.1 |
|
|
277 | (5) |
|
|
282 | (2) |
|
The Conditional Two-Part Universal Code |
|
|
284 | (9) |
|
Conditional Two-Part Codes for Discrete Exponential Families |
|
|
286 | (4) |
|
Distinguishability and the Phase Transition |
|
|
290 | (3) |
|
|
293 | (2) |
|
NML With Infinite Complexity |
|
|
295 | (40) |
|
|
295 | (6) |
|
Examples of Undefined NML Distribution |
|
|
298 | (1) |
|
Examples of Undefined Jeffreys' Prior |
|
|
299 | (2) |
|
|
301 | (7) |
|
Constrained Parametric Complexity |
|
|
302 | (1) |
|
|
303 | (3) |
|
Renormalized Maximum Likelihood |
|
|
306 | (2) |
|
|
308 | (8) |
|
Asymptotic Expansion of LNML |
|
|
312 | (4) |
|
Conditional Universal Models |
|
|
316 | (13) |
|
Bayesian Approach with Jeffrey's Prior |
|
|
317 | (3) |
|
|
320 | (5) |
|
Liang and Barron's Approach |
|
|
325 | (4) |
|
|
329 | (1) |
|
Appendix: Proof of Theorem 11.4 |
|
|
329 | (6) |
|
|
335 | (34) |
|
|
336 | (4) |
|
Prelude: The Normal Location Family |
|
|
338 | (2) |
|
|
340 | (8) |
|
|
342 | (3) |
|
Composition of Experiments |
|
|
345 | (1) |
|
|
346 | (2) |
|
|
348 | (15) |
|
Bayesian Linear Model Mx with Gaussian Prior |
|
|
354 | (5) |
|
Bayesian Linear Models Mx and Sx with Noninformative Priors |
|
|
359 | (4) |
|
Universal Models for Linear Regression |
|
|
363 | (6) |
|
|
363 | (1) |
|
|
364 | (1) |
|
|
365 | (4) |
|
|
369 | (34) |
|
|
370 | (2) |
|
CUP: Unions of Parametric Models |
|
|
372 | (4) |
|
CUP vs. Parametric Models |
|
|
375 | (1) |
|
Universal Codes Based on Histograms |
|
|
376 | (7) |
|
Redundancy of Universal CUP Histogram Codes |
|
|
380 | (3) |
|
|
383 | (7) |
|
Standard CUP Universal Codes |
|
|
384 | (3) |
|
Minimax Nonparametric Redundancy |
|
|
387 | (3) |
|
Gaussian Process Regression |
|
|
390 | (12) |
|
Kernelization of Bayesian Linear Regression |
|
|
390 | (4) |
|
|
394 | (2) |
|
Gaussian Processes as Universal Models |
|
|
396 | (6) |
|
Conclusion and Further Reading |
|
|
402 | (1) |
|
|
403 | (194) |
|
|
409 | (50) |
|
|
409 | (2) |
|
Simple Refined MDL Model Selection |
|
|
411 | (9) |
|
Compression Interpretation |
|
|
415 | (1) |
|
|
416 | (2) |
|
|
418 | (1) |
|
Prequential Interpretation |
|
|
419 | (1) |
|
General Parametric Model Selection |
|
|
420 | (8) |
|
Models with Infinite Complexities |
|
|
420 | (2) |
|
Comparing Many or Infinitely Many Models |
|
|
422 | (3) |
|
|
425 | (3) |
|
Practical Issues in MDL Model Selection |
|
|
428 | (10) |
|
Calculating Universal Codelengths |
|
|
428 | (1) |
|
Computational Efficiency and Practical Quality of Non-NML Universal Codes |
|
|
429 | (2) |
|
Model Selection with Conditional NML and Plug-in Codes |
|
|
431 | (4) |
|
General Warnings about Model Selection |
|
|
435 | (3) |
|
MDL Model Selection for Linear Regression |
|
|
438 | (13) |
|
|
439 | (4) |
|
Hansen and Yu's gMDL Approach |
|
|
443 | (3) |
|
Liang and Barron's Approach |
|
|
446 | (2) |
|
|
448 | (3) |
|
Worst Case vs. Average Case |
|
|
451 | (8) |
|
MDL Prediction and Estimation |
|
|
459 | (42) |
|
|
459 | (1) |
|
MDL for Prediction and Predictive Estimation |
|
|
460 | (16) |
|
Prequential MDL Estimators |
|
|
461 | (4) |
|
Prequential MDL Estimators Are Consistent |
|
|
465 | (4) |
|
Parametric and Nonparametric Examples |
|
|
469 | (3) |
|
Cesaro KL consistency vs. KL consistency |
|
|
472 | (4) |
|
Two-Part Code MDL for Point Hypothesis Selection |
|
|
476 | (7) |
|
Discussion of Two-Part Consistency Theorem |
|
|
478 | (5) |
|
|
483 | (15) |
|
MDL Estimators vs. Luckiness ML Estimators |
|
|
487 | (4) |
|
|
491 | (2) |
|
Comparison to Bayesian Estimators |
|
|
493 | (5) |
|
|
498 | (1) |
|
Appendix: Proof of Theorem 15.3 |
|
|
499 | (2) |
|
MDL Consistency and Convergence |
|
|
501 | (22) |
|
|
501 | (1) |
|
|
501 | (1) |
|
Consistency: Prequential and Two-Part MDL Estimators |
|
|
502 | (3) |
|
Consistency: MDL Model Selection |
|
|
505 | (6) |
|
Selection between a Union of Parametric Models |
|
|
505 | (3) |
|
Nonparametric Model Selection Based on CUP Model Class |
|
|
508 | (3) |
|
MDL Consistency Peculiarities |
|
|
511 | (4) |
|
|
515 | (5) |
|
Relations between Divergences and Risk Measures |
|
|
517 | (2) |
|
|
519 | (1) |
|
|
520 | (3) |
|
Prequential and Two-Part MDL Estimators |
|
|
520 | (2) |
|
|
522 | (1) |
|
|
523 | (74) |
|
MDL and Frequentist Paradigms |
|
|
524 | (7) |
|
Sanity Check or Design Principle? |
|
|
525 | (3) |
|
The Weak Prequential Principle |
|
|
528 | (1) |
|
MDL vs. Frequentist Principles: Remaining Issues |
|
|
529 | (2) |
|
MDL and Bayesian Inference |
|
|
531 | (18) |
|
Luckiness Functions vs. Prior Distributions |
|
|
534 | (5) |
|
|
539 | (5) |
|
MDL and Brands of Bayesian Statistics |
|
|
544 | (4) |
|
Conclusion: a Common Future after All? |
|
|
548 | (1) |
|
|
549 | (6) |
|
|
549 | (1) |
|
|
550 | (2) |
|
Combining the Best of AIC and BIC |
|
|
552 | (3) |
|
|
555 | (7) |
|
Strict Minimum Message Length |
|
|
556 | (2) |
|
|
558 | (2) |
|
The Wallace-Freeman Estimator |
|
|
560 | (2) |
|
MDL and Prequential Analysis |
|
|
562 | (3) |
|
|
565 | (2) |
|
|
567 | (3) |
|
Kolmogorov Complexity and Structure Function |
|
|
570 | (3) |
|
MDL and Individual Sequence Prediction |
|
|
573 | (6) |
|
MDL and Statistical Learning Theory |
|
|
579 | (13) |
|
Structural Risk Minimization |
|
|
581 | (4) |
|
|
585 | (3) |
|
|
588 | (4) |
|
|
592 | (5) |
|
IV. Additional Background |
|
|
597 | (54) |
|
The Exponential or ``Maximum Entropy'' Families |
|
|
599 | (24) |
|
|
600 | (1) |
|
|
601 | (4) |
|
|
605 | (4) |
|
Mean-Value, Canonical, and Other Parameterizations |
|
|
609 | (8) |
|
The Mean Value Parameterization |
|
|
609 | (2) |
|
|
611 | (2) |
|
Relating Mean-Value and Canonical Parameters |
|
|
613 | (4) |
|
Exponential Families of General Probabilistic Sources |
|
|
617 | (2) |
|
Fisher Information Definitions and Characterizations |
|
|
619 | (4) |
|
Information-Theoretic Properties of Exponential Families |
|
|
623 | (28) |
|
|
624 | (1) |
|
Robustness of Exponential Family Codes |
|
|
624 | (5) |
|
If Θmean Does Not Contain the Mean |
|
|
627 | (2) |
|
Behavior at the ML Estimate β |
|
|
629 | (3) |
|
Behavior of the ML Estimate β |
|
|
632 | (5) |
|
|
633 | (1) |
|
|
634 | (3) |
|
Maximum Entropy and Minimax Codelength |
|
|
637 | (8) |
|
Exponential Families and Maximum Entropy |
|
|
638 | (3) |
|
Exponential Families and Minimax Codelength |
|
|
641 | (2) |
|
|
643 | (2) |
|
Likelihood Ratio Families and Renyi Divergences |
|
|
645 | (5) |
|
The Likelihood Ratio Family |
|
|
647 | (3) |
|
|
650 | (1) |
References |
|
651 | (24) |
List of Symbols |
|
675 | (4) |
Subject Index |
|
679 | |