1 An (extremely) brief and crude introduction to the minimum description length principle. jdu, 2006-04

2 Outline Conceptual/non-technical introduction Probabilities and Codelengths Crude MDL Refined MDL Other topics

3 Outline Conceptual/non-technical introduction Probabilities and Codelengths Crude MDL Refined MDL Other topics

4 Introduction Example: data compression –Description methods Source: Grünwald et al. (2005) Advances in Minimum Description Length: Theory and Applications.

5 Introduction Example: regression –Model selection and overfitting –Complexity of the model vs. goodness of fit Source: Grünwald et al. (2005) Advances in Minimum Description Length: Theory and Applications.

6 Introduction Models vs. Hypotheses Source: Grünwald et al. (2005) Advances in Minimum Description Length: Theory and Applications.

7 Introduction Crude 2-part version of MDL Source: Grünwald et al. (2005) Advances in Minimum Description Length: Theory and Applications.

8 Outline Conceptual/non-technical introduction Probabilities and Codelengths Crude MDL Refined MDL Other topics

9 Probabilities and Codelengths Let X be a finite or countable set –A code C for X: a 1-to-1 mapping from X to $\bigcup_{n>0}\{0,1\}^n$ –$L_C(x)$: the number of bits needed to encode x using C –P: a probability distribution defined on X; P(x) is the probability of x –A sequence of (usually i.i.d.) observations $x_1, x_2, \ldots, x_n$ is abbreviated $x^n$

10 Probabilities and Codelengths Prefix codes, as examples of uniquely decodable codes –No codeword is a prefix of any other –Example: a → 0, b → 111, c → 1011, d → 1010, r → 110, ! → 100
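A small sketch (function names are mine) that checks the prefix property mechanically and encodes a string with the code table above:

    # Verify the prefix property and encode a string with the slide's code table.
    CODE = {"a": "0", "b": "111", "c": "1011", "d": "1010", "r": "110", "!": "100"}

    def is_prefix_free(code):
        """True iff no codeword is a prefix of another codeword."""
        words = sorted(code.values())
        # In sorted order, a prefix is immediately followed by a word it prefixes.
        return all(not words[i + 1].startswith(words[i]) for i in range(len(words) - 1))

    def encode(text, code):
        """Concatenate codewords; decodable without separators because prefix-free."""
        return "".join(code[ch] for ch in text)

    assert is_prefix_free(CODE)
    print(encode("abracadabra!", CODE))  # a 28-bit string, decodable left to right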

11 Probabilities and Codelengths Expected codelength of a code C: $E_P[L_C(X)] = \sum_x P(x) L_C(x)$ –Lower bound: the entropy $H(P) = -\sum_x P(x) \log P(x)$ –Optimal code: one with minimum expected codelength over all uniquely decodable codes –How to design one given P? Huffman coding

12 Probabilities and Codelengths Huffman coding
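A minimal Huffman construction, as a sketch; the distribution below is made up for illustration, and the result can be checked against the entropy bound from the previous slide:

    import heapq
    import math
    from itertools import count

    def huffman_code(probs):
        """probs: dict symbol -> probability. Returns dict symbol -> codeword."""
        tiebreak = count()  # avoids comparing dicts when probabilities tie
        heap = [(p, next(tiebreak), {sym: ""}) for sym, p in probs.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            p0, _, c0 = heapq.heappop(heap)  # the two least probable subtrees...
            p1, _, c1 = heapq.heappop(heap)
            merged = {s: "0" + w for s, w in c0.items()}
            merged.update({s: "1" + w for s, w in c1.items()})
            heapq.heappush(heap, (p0 + p1, next(tiebreak), merged))  # ...are merged
        return heap[0][2]

    P = {"a": 0.45, "b": 0.15, "c": 0.1, "d": 0.1, "r": 0.15, "!": 0.05}
    code = huffman_code(P)
    entropy = -sum(p * math.log2(p) for p in P.values())
    expected = sum(P[s] * len(w) for s, w in code.items())
    print(code)               # probable symbols get short words, e.g. a -> '0'
    print(entropy, expected)  # H(P) <= expected codelength < H(P) + 1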

13 Probabilities and Codelengths How to design a code for {1, 2, …, M}? –Assuming a uniform distribution: probability 1/M for each number –About $\log M$ bits per number

14 Probabilities and Codelengths How to design a code for all the positive integers? –For each k: first describe the length of k's binary expansion in unary, as that many 0s followed by a 1, then encode k itself with the uniform code for binary strings of that length –In total, about $2\log k + 1$ bits –Can be refined…
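One concrete instantiation of this scheme, as a sketch (function names are mine):

    # Universal code for positive integers: a unary length header, then k in binary.
    def encode_int(k):
        assert k >= 1
        bits = bin(k)[2:]                  # binary expansion of k, no leading zeros
        return "0" * len(bits) + "1" + bits

    def decode_int(s):
        n = s.index("1")                   # the unary header gives the length
        return int(s[n + 1:2 * n + 1], 2), s[2 * n + 1:]  # (value, rest of stream)

    print(encode_int(9))                     # '000011001': 9 bits, ~ 2*log2(9) + 1
    print(decode_int("000011001" + "1101"))  # (9, '1101'): the code is self-delimiting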

15 Probabilities and Codelengths Let P be a probability distribution over X; then there exists a (prefix) code C for X such that for all x: $L_C(x) \le -\log P(x) + 1$ –Conversely, let C be a uniquely decodable code over X; then there exists a probability distribution P such that for all x: $-\log P(x) \le L_C(x)$
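A sketch of both directions on a toy distribution (dyadic, so the bounds are tight):

    import math

    P = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

    # Direction 1: P -> code lengths ceil(-log2 P(x)), which satisfy Kraft's inequality.
    lengths = {x: math.ceil(-math.log2(p)) for x, p in P.items()}
    print(lengths)                                   # {'a': 1, 'b': 2, 'c': 3, 'd': 3}
    print(sum(2.0 ** -l for l in lengths.values()))  # Kraft sum <= 1; here exactly 1.0

    # Direction 2: code lengths -> distribution Q(x) = 2^(-L(x)).
    Q = {x: 2.0 ** -l for x, l in lengths.items()}
    assert all(-math.log2(Q[x]) <= lengths[x] for x in P)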

16 Probabilities and Codelengths Codelength revisited Source: Grünwald et al. (2005) Advances in Minimum Description Length: Theory and Applications.

17 Outline Conceptual/non-technical introduction Probabilities and Codelengths Crude MDL Refined MDL Other topics

18 Crude MDL Preliminary: k-th order Markov chain on X = {0,1} –A sequence: $X_1, X_2, \ldots, X_N$ –Special case, 0-th order: the Bernoulli model (biased coin) –Maximum likelihood estimator: $\hat\theta = n_1/N$, the observed fraction of ones
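A toy sketch of the 0-th order case:

    # Bernoulli (0-th order) maximum likelihood estimate from a binary sequence.
    x = [1, 0, 1, 1, 0, 1, 1, 1]
    theta_hat = sum(x) / len(x)   # fraction of ones: here 6/8 = 0.75
    print(theta_hat)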

19 Crude MDL Preliminary: k-th order Markov chain on X = {0,1} –Special case: the first-order Markov chain $B^{(1)}$ –MLE: $\hat\theta[1|0] = n[1|0]/n[0]$ and $\hat\theta[1|1] = n[1|1]/n[1]$, where n[a|s] counts occurrences of symbol a after context s

20 Crude MDL Preliminary: k-th order Markov chain on X = {0,1} –$2^k$ parameters: $\theta[1|00\ldots00], \theta[1|00\ldots01], \ldots, \theta[1|11\ldots10], \theta[1|11\ldots11]$, one per length-k context –Log-likelihood function: $\log P(x^n|\theta) = \sum_s \left( n[1|s]\log\theta[1|s] + n[0|s]\log(1-\theta[1|s]) \right)$ –MLE: $\hat\theta[1|s] = n[1|s]/n[s]$ for every context s, as in the sketch below
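A sketch of the ML estimation (the helper name is mine):

    # Maximum likelihood estimates for a k-th order binary Markov chain:
    # theta_hat[1|s] = n[1|s] / n[s] for each length-k context s.
    from collections import Counter

    def markov_mle(x, k):
        """x: list of 0/1 symbols. Returns dict: context -> estimated P(next = 1)."""
        n_s, n_1s = Counter(), Counter()
        for i in range(k, len(x)):
            s = tuple(x[i - k:i])          # the k symbols preceding position i
            n_s[s] += 1
            n_1s[s] += x[i]
        return {s: n_1s[s] / n_s[s] for s in n_s}

    x = [0, 1, 1, 0, 1, 1, 1, 0, 1, 1]
    print(markov_mle(x, k=1))   # {(0,): 1.0, (1,): 0.666...}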

21 Crude MDL Question: given data $D = x^n$, find the Markov chain that best explains D –We do not want to restrict ourselves to chains of a fixed order –How do we avoid overfitting? Obviously, an (n−1)-th order Markov chain always fits the data perfectly

22 Crude MDL Two-part MDL revisited Source: Grünwald et al. (2005) Advances in Minimum Description Length: Theory and Applications.

23 Crude MDL Description length of data given hypothesis: $L(D|H) = -\log P(D|H)$

24 Crude MDL Description length of hypothesis –The code should not change with the sample size n –Different codes lead to preferences for different hypotheses –How do we design a code that leads to good inferences at small, practically relevant sample sizes?

25 Crude MDL An "intuitive" and "reasonable" code for a k-th order Markov chain –First describe k using $2\log k + 1$ bits –Then describe the $d = 2^k$ parameters: assume n is given in advance; each ML estimate $\hat\theta[1|s] = n[1|s]/n[s]$ is a count ratio, so a precision of $1/(n+1)$ is the best we need, and each θ can be described with $\log(n+1)$ bits –$L(H) = 2\log k + 1 + d\log(n+1)$ –$L(H) + L(D|H) = 2\log k + 1 + d\log(n+1) - \log P(D\,|\,k, \hat\theta)$ –For a given k, only the MLE $\hat\theta$ needs to be considered
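A sketch putting the whole two-part criterion together; the 1-bit header cost for k = 0 is a convention I chose, and the function names are mine:

    # Pick the Markov order k minimizing the two-part code length
    # L(H) + L(D|H) = 2*log2(k) + 1 + 2^k * log2(n+1) - log2 P(D | k, theta_hat).
    import math
    from collections import Counter

    def neg_log_likelihood(x, k):
        """-log2 P(x | k, theta_hat) under the ML k-th order Markov chain."""
        n_s, n_1s = Counter(), Counter()
        for i in range(k, len(x)):
            s = tuple(x[i - k:i])
            n_s[s] += 1
            n_1s[s] += x[i]
        nll = 0.0
        for s in n_s:
            p = n_1s[s] / n_s[s]
            for cnt, prob in ((n_1s[s], p), (n_s[s] - n_1s[s], 1 - p)):
                if cnt:                      # skip 0 * log(0) terms
                    nll -= cnt * math.log2(prob)
        return nll

    def two_part_length(x, k):
        n = len(x)
        l_h = (2 * math.log2(k) + 1 if k else 1) + 2 ** k * math.log2(n + 1)
        return l_h + neg_log_likelihood(x, k)

    x = [0, 1] * 50                     # strongly first-order data
    print(min(range(4), key=lambda k: two_part_length(x, k)))   # selects k = 1

Higher orders also fit this sequence perfectly, but their $2^k \log(n+1)$ parameter cost outweighs the gain, which is exactly the balance the two-part code enforces.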

26 Crude MDL Good news –We have found a principled way to encode data D using a hypothesis H Bad news –We have not found clear guidelines for designing the code for H itself

27 Outline Conceptual/non-technical introduction Probabilities and Codelengths Crude MDL Refined MDL Other topics

28 Refined MDL Universal codes and universal distributions –The maximum likelihood code depends on the data: how can the data be described in an unambiguous manner? –Design a single code whose length on every possible observation matches its ML codelength? Impossible

29 Refined MDL Worst-case regret of a distribution $\bar{P}$: $\max_{x^n} \left[ -\log \bar{P}(x^n) + \log P(x^n|\hat\theta(x^n)) \right]$ –Optimal universal model: the one minimizing the worst-case regret

30 Refined MDL Normalized maximum likelihood (NML): $P_{nml}(x^n) = P(x^n|\hat\theta(x^n)) / \sum_{y^n} P(y^n|\hat\theta(y^n))$, which minimizes the worst-case regret –Minimizing $-\log P_{nml}(x^n) = -\log P(x^n|\hat\theta(x^n)) + \log \sum_{y^n} P(y^n|\hat\theta(y^n))$; the second term depends only on the model
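A sketch of that second term, the model's parametric complexity, for the Bernoulli model, where the sum over all sequences can be grouped by the number of ones:

    # COMP(n) = log2( sum over all x^n of P(x^n | theta_hat(x^n)) ) for the
    # Bernoulli model; every sequence with n1 ones has ML probability
    # (n1/n)^n1 * (1 - n1/n)^(n - n1), and there are C(n, n1) of them.
    import math

    def bernoulli_nml_complexity(n):
        total = 0.0
        for n1 in range(n + 1):
            p = n1 / n
            total += math.comb(n, n1) * p ** n1 * (1 - p) ** (n - n1)  # 0**0 == 1
        return math.log2(total)

    for n in (10, 100, 1000):
        print(n, bernoulli_nml_complexity(n))   # grows like (1/2) * log2(n) + O(1)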

31 Refined MDL Complexity of a model –The more sequences that can be fit well by an element of M, the larger M's complexity –Would it lead to the "right" balance between complexity and fit? Hopefully…

32 Refined MDL General refined MDL Source: Grünwald et al. (2005) Advances in Minimum Description Length: Theory and Applications.

33 Outline Conceptual/non-technical introduction Probabilities and Codelengths Crude MDL Refined MDL Other topics

34 Other topics Mixture code Resolvability …

35 References
Barron, A., Rissanen, J. & Yu, B. (1998). The minimum description length principle in coding and modeling. IEEE Transactions on Information Theory, 44(6).
Grünwald, P. D., Myung, I. J. & Pitt, M. A. (2005). Advances in Minimum Description Length: Theory and Applications (Neural Information Processing). The MIT Press.
Hall, P. & Hannan, E. J. (1988). On stochastic complexity and nonparametric density estimation. Biometrika, 75(4).