
The Minimum Description Length Principle [Paperback]

Peter D. Grünwald (Centrum voor Wiskunde en Informatica)
  • Format: Paperback / softback, 736 pages, height x width x thickness: 233x178x38 mm, 12 illustrations
  • Series: Adaptive Computation and Machine Learning
  • Publication date: 23-Mar-2007
  • Publisher: MIT Press
  • ISBN-10: 0262529637
  • ISBN-13: 9780262529631

A comprehensive introduction and reference guide to the minimum description length (MDL) Principle that is accessible to researchers dealing with inductive inference in diverse areas including statistics, pattern classification, machine learning, data mining, biology, econometrics, and experimental psychology, as well as philosophers interested in the foundations of statistics.

The minimum description length (MDL) principle is a powerful method of inductive inference, the basis of statistical modeling, pattern recognition, and machine learning. It holds that the best explanation, given a limited set of observed data, is the one that permits the greatest compression of the data. MDL methods are particularly well-suited for dealing with model selection, prediction, and estimation problems in situations where the models under consideration can be arbitrarily complex, and overfitting the data is a serious concern. This extensive, step-by-step introduction to the MDL Principle provides a comprehensive reference (with an emphasis on conceptual issues) that is accessible to graduate students and researchers in statistics, pattern classification, machine learning, and data mining, to philosophers interested in the foundations of statistics, and to researchers in other applied sciences that involve model selection, including biology, econometrics, and experimental psychology.
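
The crude, two-part form of this idea (the "crude two-part code MDL" of chapter 5, illustrated there with Markov chain selection) fits in a few lines of code: score each candidate model by the number of bits needed to describe its fitted parameters plus the number of bits needed to describe the data with their help, and keep the model with the shortest total. The Python sketch below only illustrates that recipe and is not taken from the book: it uses the common (k/2) * log2(n) approximation for the parameter cost rather than the refined universal codes developed in Parts II and III, and the function name and toy sequence are invented for the example.

    import math
    from collections import Counter

    def two_part_mdl_markov(bits, order):
        # Crude two-part description length (in bits) of a binary sequence
        # under a Markov chain of the given order: parameter cost plus the
        # cost of the data encoded with the maximum-likelihood parameters.
        n = len(bits)
        k = 2 ** order                        # one free parameter per context
        param_cost = 0.5 * k * math.log2(n)   # rough cost of stating the parameters

        # Data cost: minus log2 of the ML probability of the sequence.
        counts = Counter()
        for i in range(order, n):
            context = tuple(bits[i - order:i])
            counts[(context, bits[i])] += 1
        data_cost = 0.0
        for (context, symbol), c in counts.items():
            total = counts[(context, 0)] + counts[(context, 1)]
            data_cost -= c * math.log2(c / total)
        return param_cost + data_cost

    # Model selection: keep the order with the shortest total description length.
    sequence = [0, 1] * 50 + [0, 0, 1, 1] * 25
    best_order = min(range(4), key=lambda r: two_part_mdl_markov(sequence, r))
    print("selected Markov order:", best_order)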

Part I provides a basic introduction to MDL and an overview of the concepts in statistics and information theory needed to understand MDL. Part II treats universal coding, the information-theoretic notion on which MDL is built, and Part III gives a formal treatment of MDL theory as a theory of inductive inference based on universal coding. Part IV provides a comprehensive overview of the statistical theory of exponential families with an emphasis on their information-theoretic properties. The text includes a number of summaries, paragraphs offering the reader a "fast track" through the material, and boxes highlighting the most important concepts.

List of Figures
Series Foreword
Foreword
Preface
I Introductory Material
1 Learning, Regularity, and Compression
1.1 Regularity and Learning
1.2 Regularity and Compression
1.3 Solomonoff's Breakthrough -- Kolmogorov Complexity
1.4 Making the Idea Applicable
1.5 Crude MDL, Refined MDL and Universal Coding
1.5.1 From Crude to Refined MDL
1.5.2 Universal Coding and Refined MDL
1.5.3 Refined MDL for Model Selection
1.5.4 Refined MDL for Prediction and Hypothesis Selection
1.6 Some Remarks on Model Selection
1.6.1 Model Selection among Non-Nested Models
1.6.2 Goals of Model vs. Point Hypothesis Selection
1.7 The MDL Philosophy
1.8 MDL, Occam's Razor, and the "True Model"
1.8.1 Answer to Criticism No. 1
1.8.2 Answer to Criticism No. 2
1.9 History and Forms of MDL
1.9.1 What Is MDL?
1.9.2 MDL Literature
1.10 Summary and Outlook
2 Probabilistic and Statistical Preliminaries
2.1 General Mathematical Preliminaries
2.2 Probabilistic Preliminaries
2.2.1 Definitions; Notational Conventions
2.2.2 Probabilistic Sources
2.2.3 Limit Theorems and Statements
2.2.4 Probabilistic Models
2.2.5 Probabilistic Model Classes
2.3 Kinds of Probabilistic Models*
2.4 Terminological Preliminaries
2.5 Modeling Preliminaries: Goals and Methods for Inductive Inference
2.5.1 Consistency
2.5.2 Basic Concepts of Bayesian Statistics
2.6 Summary and Outlook
3 Information-Theoretic Preliminaries
3.1 Coding Preliminaries
3.1.1 Restriction to Prefix Coding Systems; Descriptions as Messages
3.1.2 Different Kinds of Codes
3.1.3 Assessing the Efficiency of Description Methods
3.2 The Most Important Section of This Book: Probabilities and Code Lengths
3.2.1 The Kraft Inequality
3.2.2 Code Lengths "Are" Probabilities
3.2.3 Immediate Insights and Consequences
3.3 Probabilities and Code Lengths, Part II
3.3.1 (Relative) Entropy and the Information Inequality
3.3.2 Uniform Codes, Maximum Entropy, and Minimax Codelength
3.4 Summary, Outlook, Further Reading
4 Information-Theoretic Properties of Statistical Models
4.1 Introduction
4.2 Likelihood and Observed Fisher Information
4.3 KL Divergence and Expected Fisher Information
4.4 Maximum Likelihood: Data vs. Parameters
4.5 Summary and Outlook
5 Crude Two-Part Code MDL
5.1 Introduction: Making Two-Part MDL Precise
5.2 Two-Part Code MDL for Markov Chain Selection
5.2.1 The Code C2
5.2.2 The Code C1
5.2.3 Crude Two-Part Code MDL for Markov Chains
5.3 Simplistic Two-Part Code MDL Hypothesis Selection
5.4 Two-Part MDL for Tasks Other Than Hypothesis Selection
5.5 Behavior of Two-Part Code MDL
5.6 Two-Part Code MDL and Maximum Likelihood
5.6.1 The Maximum Likelihood Principle
5.6.2 MDL vs. ML
5.6.3 MDL as a Maximum Probability Principle
5.7 Computing and Approximating Two-Part MDL in Practice
5.8 Justifying Crude MDL: Consistency and Code Design
5.8.1 A General Consistency Result
5.8.2 Code Design for Two-Part Code MDL
5.9 Summary and Outlook
5.A Appendix: Proof of Theorem 5.1
II Universal Coding
6 Universal Coding with Countable Models
6.1 Universal Coding: The Basic Idea
6.1.1 Two-Part Codes as Simple Universal Codes
6.1.2 From Universal Codes to Universal Models
6.1.3 Formal Definition of Universality
6.2 The Finite Case
6.2.1 Minimax Regret and Normalized ML
6.2.2 NML vs. Two-Part vs. Bayes
6.3 The Countably Infinite Case
6.3.1 The Two-Part and Bayesian Codes
6.3.2 The NML Code
6.4 Prequential Universal Models
6.4.1 Distributions as Prediction Strategies
6.4.2 Bayes Is Prequential; NML and Two-part Are Not
6.4.3 The Prequential Plug-In Model
6.5 Individual vs. Stochastic Universality*
6.5.1 Stochastic Redundancy
6.5.2 Uniformly Universal Models
6.6 Summary, Outlook and Further Reading
7 Parametric Models: Normalized Maximum Likelihood
7.1 Introduction
7.1.1 Preliminaries
7.2 Asymptotic Expansion of Parametric Complexity
7.3 The Meaning of ∫√det I(θ) dθ
7.3.1 Complexity and Functional Form
7.3.2 KL Divergence and Distinguishability
7.3.3 Complexity and Volume
7.3.4 Complexity and the Number of Distinguishable Distributions*
7.4 Explicit and Simplified Computations
8 Parametric Models: Bayes
8.1 The Bayesian Regret
8.1.1 Basic Interpretation of Theorem 8.1
8.2 Bayes Meets Minimax - Jeffreys' Prior
8.2.1 Jeffreys' Prior and the Boundary
8.3 How to Prove the Bayesian and NML Regret Theorems
8.3.1 Proof Sketch of Theorem 8.1
8.3.2 Beyond Exponential Families
8.3.3 Proof Sketch of Theorem 7.1
8.4 Stochastic Universality*
8.A Appendix: Proofs of Theorem 8.1 and Theorem 8.2
9 Parametric Models: Prequential Plug-in
9.1 Prequential Plug-in for Exponential Families
9.2 The Plug-in vs. the Bayes Universal Model
9.3 More Precise Asymptotics
9.4 Summary
10 Parametric Models: Two-Part
10.1 The Ordinary Two-Part Universal Model
10.1.1 Derivation of the Two-Part Code Regret
10.1.2 Proof Sketch of Theorem 10.1
10.1.3 Discussion
10.2 The Conditional Two-Part Universal Code*
10.2.1 Conditional Two-Part Codes for Discrete Exponential Families
10.2.2 Distinguishability and the Phase Transition*
10.3 Summary and Outlook
11 NML With Infinite Complexity
11.1 Introduction
11.1.1 Examples of Undefined NML Distribution
11.1.2 Examples of Undefined Jeffreys' Prior
11.2 Metauniversal Codes
11.2.1 Constrained Parametric Complexity
11.2.2 Meta-Two-Part Coding
11.2.3 Renormalized Maximum Likelihood*
11.3 NML with Luckiness
11.3.1 Asymptotic Expansion of LNML
11.4 Conditional Universal Models
11.4.1 Bayesian Approach with Jeffreys' Prior
11.4.2 Conditional NML
11.4.3 Liang and Barron's Approach
11.5 Summary and Remarks
11.A Appendix: Proof of Theorem 11.4
12 Linear Regression
12.1 Introduction
12.1.1 Prelude: The Normal Location Family
12.2 Least-Squares Estimation
12.2.1 The Normal Equations
12.2.2 Composition of Experiments
12.2.3 Penalized Least-Squares
12.3 The Linear Model
12.3.1 Bayesian Linear Model Mx with Gaussian Prior
12.3.2 Bayesian Linear Models Mx and Sx with Noninformative Priors
12.4 Universal Models for Linear Regression
12.4.1 NML
12.4.2 Bayes and LNML
12.4.3 Bayes-Jeffreys and CNML
13 Beyond Parametrics
13.1 Introduction
13.2 CUP: Unions of Parametric Models
13.2.1 CUP vs. Parametric Models
13.3 Universal Codes Based on Histograms
13.3.1 Redundancy of Universal CUP Histogram Codes
13.4 Nonparametric Redundancy
13.4.1 Standard CUP Universal Codes
13.4.2 Minimax Nonparametric Redundancy
13.5 Gaussian Process Regression*
13.5.1 Kernelization of Bayesian Linear Regression
13.5.2 Gaussian Processes
13.5.3 Gaussian Processes as Universal Models
13.6 Conclusion and Further Reading
III Refined MDL
14 MDL Model Selection
14.1 Introduction
14.2 Simple Refined MDL Model Selection
14.2.1 Compression Interpretation
14.2.2 Counting Interpretation
14.2.3 Bayesian Interpretation
14.2.4 Prequential Interpretation
14.3 General Parametric Model Selection
14.3.1 Models with Infinite Complexities
14.3.2 Comparing Many or Infinitely Many Models
14.3.3 The General Picture
14.4 Practical Issues in MDL Model Selection
14.4.1 Calculating Universal Codelengths
14.4.2 Computational Efficiency and Practical Quality of Non-NML Universal Codes
14.4.3 Model Selection with Conditional NML and Plug-in Codes
14.4.4 General Warnings about Model Selection
14.5 MDL Model Selection for Linear Regression
14.5.1 Rissanen's RNML Approach
14.5.2 Hansen and Yu's gMDL Approach
14.5.3 Liang and Barron's Approach
14.5.4 Discussion
14.6 Worst Case vs. Average Case*
15 MDL Prediction and Estimation
15.1 Introduction
15.2 MDL for Prediction and Predictive Estimation
15.2.1 Prequential MDL Estimators
15.2.2 Prequential MDL Estimators Are Consistent
15.2.3 Parametric and Nonparametric Examples
15.2.4 Cesaro KL consistency vs. KL consistency*
15.3 Two-Part Code MDL for Point Hypothesis Selection
15.3.1 Discussion of Two-Part Consistency Theorem
15.4 MDL Parameter Estimation
15.4.1 MDL Estimators vs. Luckiness ML Estimators
15.4.2 What Estimator To Use?
15.4.3 Comparison to Bayesian Estimators*
15.5 Summary and Outlook
15.A Appendix: Proof of Theorem 15.3
16 MDL Consistency and Convergence
16.1 Introduction
16.1.1 The Scenarios Considered
16.2 Consistency: Prequential and Two-Part MDL Estimators
16.3 Consistency: MDL Model Selection
16.3.1 Selection between a Union of Parametric Models
16.3.2 Nonparametric Model Selection Based on CUP Model Class
16.4 MDL Consistency Peculiarities
16.5 Risks and Rates
16.5.1 Relations between Divergences and Risk Measures
16.5.2 Minimax Rates
16.6 MDL Rates of Convergence
16.6.1 Prequential and Two-Part MDL Estimators
16.6.2 MDL Model Selection
17 MDL in Context
17.1 MDL and Frequentist Paradigms
17.1.1 Sanity Check or Design Principle?
17.1.2 The Weak Prequential Principle
17.1.3 MDL vs. Frequentist Principles: Remaining Issues
17.2 MDL and Bayesian Inference
17.2.1 Luckiness Functions vs. Prior Distributions
17.2.2 MDL, Bayes, and Occam
17.2.3 MDL and Brands of Bayesian Statistics
17.2.4 Conclusion: a Common Future after All?
17.3 MDL, AIC and BIC
17.3.1 BIC
17.3.2 AIC
17.3.3 Combining the Best of AIC and BIC
17.4 MDL and MML
17.4.1 Strict Minimum Message Length
17.4.2 Comparison to MDL
17.4.3 The Wallace-Freeman Estimator
17.5 MDL and Prequential Analysis
17.6 MDL and Cross-Validation
17.7 MDL and Maximum Entropy
17.8 Kolmogorov Complexity and Structure Function
17.9 MDL and Individual Sequence Prediction
17.10 MDL and Statistical Learning Theory
17.10.1 Structural Risk Minimization
17.10.2 PAC-Bayesian Approaches
17.10.3 PAC-Bayes and MDL
17.11 The Road Ahead
IV Additional Background
18 The Exponential or "Maximum Entropy" Families
18.1 Introduction
18.2 Definition and Overview
18.3 Basic Properties
18.4 Mean-Value, Canonical, and Other Parameterizations
18.4.1 The Mean Value Parameterization
18.4.2 Other Parameterizations
18.4.3 Relating Mean-Value and Canonical Parameters
18.5 Exponential Families of General Probabilistic Sources*
18.6 Fisher Information Definitions and Characterizations*
19 Information-Theoretic Properties of Exponential Families
19.1 Introduction
19.2 Robustness of Exponential Family Codes
19.2.1 If Θ_mean Does Not Contain the Mean
19.3 Behavior at the ML Estimate β
19.4 Behavior of the ML Estimate β
19.4.1 Central Limit Theorem
19.4.2 Large Deviations
19.5 Maximum Entropy and Minimax Codelength
19.5.1 Exponential Families and Maximum Entropy
19.5.2 Exponential Families and Minimax Codelength
19.5.3 The Compression Game
19.6 Likelihood Ratio Families and Rényi Divergences*
19.6.1 The Likelihood Ratio Family
19.7 Summary
References
List of Symbols
Subject Index