1 A(n) (extremely) brief/crude introduction to the minimum description length principle jdu
2 Outline Conceptual/non-technical introduction Probabilities and Codelengths Crude MDL Refined MDL Other topics
4 Introduction Example: data compression –Description methods Source: Grünwald et al. (2005) Advances in Minimum Description Length: Theory and Applications.
5 Introduction Example: regression –Model selection and overfitting –Complexity of the model vs. goodness of fit Source: Grünwald et al. (2005) Advances in Minimum Description Length: Theory and Applications.
6 Introduction Models vs. Hypotheses Source: Grünwald et al. (2005) Advances in Minimum Description Length: Theory and Applications.
7 Introduction Crude 2-part version of MDL Source: Grünwald et al. (2005) Advances in Minimum Description Length: Theory and Applications.
8 Outline Conceptual/non-technical introduction Probabilities and Codelengths Crude MDL Refined MDL Other topics
9 Probabilities and Codelengths Let X be a finite or countable set –A code C for X is a 1-to-1 mapping from X to $\bigcup_{n>0} \{0,1\}^n$; $L_C(x)$ is the number of bits needed to encode x using C –P: a probability distribution defined on X; P(x) is the probability of x –A sequence of (usually i.i.d.) observations $x_1, x_2, \ldots, x_n$ is written $x^n$
10 Probabilities and Codelengths Prefix codes: examples of uniquely decodable codes –No codeword is a prefix of any other –Example: a: 0, b: 111, c: 1011, d: 1010, r: 110, !: 100 Source:
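The prefix property on this slide can be checked mechanically. The following is a minimal sketch, not part of the original slides: the codeword table is copied from the slide, while the function names and the test string "abracadabra!" are assumptions chosen for illustration.

```python
# Codeword table copied from the slide.
CODE = {"a": "0", "b": "111", "c": "1011", "d": "1010", "r": "110", "!": "100"}


def is_prefix_free(code):
    """Return True if all codewords are distinct and none is a prefix of another."""
    words = list(code.values())
    if len(set(words)) != len(words):
        return False
    return not any(u != v and v.startswith(u) for u in words for v in words)


def encode(text, code=CODE):
    """Concatenate the codewords of the symbols in text."""
    return "".join(code[symbol] for symbol in text)


def decode(bitstring, code=CODE):
    """Decode left to right; the prefix property makes the first match the only match."""
    inverse = {word: symbol for symbol, word in code.items()}
    decoded, buffer = [], ""
    for bit in bitstring:
        buffer += bit
        if buffer in inverse:
            decoded.append(inverse[buffer])
            buffer = ""
    if buffer:
        raise ValueError("bitstring ends in the middle of a codeword")
    return "".join(decoded)


if __name__ == "__main__":
    message = "abracadabra!"  # hypothetical test string, not from the slides
    bits = encode(message)
    print(is_prefix_free(CODE), len(bits), decode(bits) == message)
```

Because no codeword is a prefix of another, the decoder never needs to look ahead or backtrack, which is why prefix codes are also called instantaneous codes.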
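11 Probabilities and Codelengths Expected codelength of a code C: $E_P[L_C(X)] = \sum_{x \in X} P(x) L_C(x)$ –Lower bound (for any uniquely decodable C): $E_P[L_C(X)] \ge H(P) = -\sum_{x \in X} P(x) \log_2 P(x)$ Optimal code –A code is optimal if it has minimum expected codelength over all uniquely decodable codes –How to design one given P? Huffman coding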
12 Probabilities and Codelengths Huffman coding Source:
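Since the figure for this slide did not survive, here is a small sketch of Huffman's construction: repeatedly merge the two least probable subtrees, prefixing their codewords with 0 and 1. The function names and the example distribution are assumptions for illustration, not taken from the slides.

```python
import heapq
import itertools
import math


def huffman_code(probs):
    """Build a binary Huffman code for a dict {symbol: probability}."""
    if len(probs) == 1:
        # Degenerate case: a single symbol still needs a non-empty codeword.
        return {symbol: "0" for symbol in probs}
    tie_breaker = itertools.count()  # keeps heap entries comparable on ties
    heap = [(p, next(tie_breaker), {symbol: ""}) for symbol, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, left = heapq.heappop(heap)   # least probable subtree
        p1, _, right = heapq.heappop(heap)  # second least probable subtree
        merged = {s: "0" + w for s, w in left.items()}
        merged.update({s: "1" + w for s, w in right.items()})
        heapq.heappush(heap, (p0 + p1, next(tie_breaker), merged))
    return heap[0][2]


def expected_length(probs, code):
    """Expected codelength in bits under the distribution probs."""
    return sum(p * len(code[symbol]) for symbol, p in probs.items())


def entropy(probs):
    """Shannon entropy in bits, the lower bound on expected codelength."""
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)


if __name__ == "__main__":
    # Hypothetical dyadic distribution, chosen so the bound is met exactly.
    probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
    code = huffman_code(probs)
    print(code, expected_length(probs, code), entropy(probs))
```

For dyadic probabilities (powers of 1/2) the expected codelength equals the entropy; for other distributions Huffman coding stays within one bit of it.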
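13 Probabilities and Codelengths How to design a code for {1, 2, …, M}? –Assume a uniform distribution: probability 1/M for each number –Each number then takes about $\log_2 M$ bits (exactly $\lceil \log_2 M \rceil$ with a fixed-length code)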
14 Probabilities and Codelengths How to design a code for all the positive integers? –For each k: describe $\lceil \log_2 k \rceil$ with that many 0s, followed by a 1; then encode k using the uniform code for $\{1, \ldots, 2^{\lceil \log_2 k \rceil}\}$ –In total, about $2\log_2 k + 1$ bits –Can be refined…
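One possible reading of this integer code can be sketched as follows. It assumes the unary prefix consists of ⌈log2 k⌉ zeros and that the uniform code writes k−1 as a fixed-width binary number; the function names are hypothetical.

```python
def encode_int(k):
    """Encode a positive integer as m zeros, a one, then k-1 in m bits,
    where m = ceil(log2 k). The total length is 2*m + 1 bits."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    m = (k - 1).bit_length()  # equals ceil(log2 k) for every k >= 1
    payload = format(k - 1, "b").zfill(m) if m else ""
    return "0" * m + "1" + payload


def decode_stream(bitstring):
    """Decode a concatenation of codewords (assumes well-formed input)."""
    values, i = [], 0
    while i < len(bitstring):
        m = 0
        while bitstring[i] == "0":  # count the zeros of the unary prefix
            m += 1
            i += 1
        i += 1  # skip the terminating one
        payload = bitstring[i:i + m]
        values.append((int(payload, 2) if m else 0) + 1)
        i += m
    return values


if __name__ == "__main__":
    numbers = [1, 2, 3, 17, 1000]
    stream = "".join(encode_int(k) for k in numbers)
    print([encode_int(k) for k in numbers[:3]], decode_stream(stream) == numbers)
```

The decoder learns the payload width from the run of zeros, so codewords can be concatenated without separators; this is what makes the code a prefix code over all positive integers.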
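15 Probabilities and Codelengths –Let P be a probability distribution over X; then there exists a (prefix) code C for X such that $L_C(x) = \lceil -\log_2 P(x) \rceil$ for all x in X –Let C be a uniquely decodable code over X; then there exists a (possibly defective, i.e. summing to at most 1) probability distribution P such that $-\log_2 P(x) = L_C(x)$ for all x in X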
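Both directions of this correspondence rest on the Kraft inequality, $\sum_x 2^{-L_C(x)} \le 1$. The sketch below illustrates it numerically; the helper names and the three-symbol distribution are assumptions, and the codeword table is the prefix code shown earlier in the deck.

```python
import math


def kraft_sum(lengths):
    """Sum of 2^-L over the codeword lengths; at most 1 for a uniquely decodable code."""
    return sum(2.0 ** -length for length in lengths)


def shannon_lengths(probs):
    """Distribution -> codelengths: L(x) = ceil(-log2 P(x))."""
    return {x: math.ceil(-math.log2(p)) for x, p in probs.items() if p > 0}


def implied_distribution(code):
    """Code -> (possibly defective) distribution: P(x) = 2^-L_C(x)."""
    return {x: 2.0 ** -len(word) for x, word in code.items()}


if __name__ == "__main__":
    # Prefix code from the earlier slide in this deck.
    code = {"a": "0", "b": "111", "c": "1011", "d": "1010", "r": "110", "!": "100"}
    print(sum(implied_distribution(code).values()))

    # Hypothetical distribution for the other direction.
    probs = {"a": 0.5, "b": 0.3, "c": 0.2}
    lengths = shannon_lengths(probs)
    print(lengths, kraft_sum(lengths.values()))
```

When the Kraft sum of the Shannon lengths is below 1, as in the second example, the implied distribution is defective: some probability mass corresponds to codewords that are never used.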
16 Probabilities and Codelengths Codelength revisited Source: Grünwald et al. (2005) Advances in Minimum Description Length: Theory and Applications.
17 Outline Conceptual/non-technical introduction Probabilities and Codelengths Crude MDL Refined MDL Other topics
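18 Crude MDL Preliminary: k-th order Markov chain on X={0,1} –A sequence: $X_1, X_2, \ldots, X_N$ –Special case: 0-th order: Bernoulli model (biased coin) with $P(X_i = 1) = \theta$, so $P(x^n \mid \theta) = \theta^{n[1]} (1-\theta)^{n[0]}$, where n[1] and n[0] count the 1s and 0s in $x^n$ –Maximum likelihood estimator: $\hat\theta = n[1]/n$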
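19 Crude MDL Preliminary: k-th order Markov chain on X={0,1} –Special case: first-order Markov chain $B^{(1)}$, with parameters theta[1|0] and theta[1|1] –MLE: theta[1|0] = n[1|0]/n[0] and theta[1|1] = n[1|1]/n[1], where n[1|c] counts how often a 1 follows the context c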
20 Crude MDL Preliminary: k-th order Markov chain on X={0,1} –$2^k$ parameters: theta[1|000…000] = n[1|000…000]/n[000…000], theta[1|000…001], …, theta[1|111…110], theta[1|111…111] –Log likelihood function: … –MLE: …
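The counting estimator theta[1|context] = n[1|context]/n[context] can be sketched in a few lines. This is an illustrative sketch under one stated assumption that the slides do not specify: the first k symbols are treated as given, so the likelihood covers symbols k+1 through n only. Function names are hypothetical.

```python
import math
from collections import Counter


def fit_markov(bits, k):
    """MLE of a k-th order binary Markov chain from a string of '0'/'1'.

    Returns {context: theta[1|context]} for every context seen in the data.
    """
    context_count, one_count = Counter(), Counter()
    for i in range(k, len(bits)):
        context = bits[i - k:i]  # the k symbols preceding position i
        context_count[context] += 1
        if bits[i] == "1":
            one_count[context] += 1
    return {c: one_count[c] / n_c for c, n_c in context_count.items()}


def log_likelihood(bits, k, theta):
    """log2 P(x_{k+1..n} | x_{1..k}, theta).

    With theta fitted on the same data, no observed symbol has probability 0,
    so log2 is always defined here.
    """
    total = 0.0
    for i in range(k, len(bits)):
        p_one = theta[bits[i - k:i]]
        total += math.log2(p_one if bits[i] == "1" else 1.0 - p_one)
    return total


if __name__ == "__main__":
    data = "0101010101"  # hypothetical data
    for k in (0, 1):
        theta = fit_markov(data, k)
        print(k, theta, log_likelihood(data, k, theta))
```

On the alternating sequence the first-order chain reaches log-likelihood 0 (probability 1), while the Bernoulli model pays one bit per symbol; raising k can only improve the fit, which is the overfitting problem the next slide raises.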
21 Crude MDL Question: Given data $D = x^n$, find the Markov chain that best explains D. –We do not want to restrict ourselves to chains of a fixed order –How to avoid overfitting? An (n-1)-th order Markov model would always fit the data best
22 Crude MDL Two-part MDL revisited Source: Grünwald et al. (2005) Advances in Minimum Description Length: Theory and Applications.
23 Crude MDL Description length of data given hypothesis: $L(D \mid H) = -\log P(D \mid H)$
24 Crude MDL Description length of hypothesis –The code should not change with the sample size n –Different codes will lead to preferences for different hypotheses –How to design a code that leads to good inferences with small, practically relevant sample sizes?
25 Crude MDL An "intuitive" and "reasonable" code for the k-th order Markov chain –First describe k using $2\log k + 1$ bits –Then describe the $d = 2^k$ parameters; assume n is given in advance –For each theta in the MLE {theta[1|000…000], …, theta[1|111…111]}, the best precision we can achieve by counting is 1/(n+1) –Describe each theta with $\log(n+1)$ bits –$L(H) = 2\log k + 1 + d\log(n+1)$ –$L(H) + L(D \mid H) = 2\log k + 1 + d\log(n+1) - \log P(D \mid k, \theta)$ –For a given k, only the MLE theta needs to be considered
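The two-part codelength on this slide can be turned into a small order-selection routine. This is a sketch under two assumptions that are not in the slides: k+1 is encoded instead of k, so that the Bernoulli case k = 0 gets a finite codelength, and the first k symbols of the data are sent literally at one bit each. Function names are hypothetical.

```python
import math
from collections import Counter


def data_codelength(bits, k):
    """-log2 P(D | k, theta_hat) for symbols k+1..n, with theta_hat the MLE."""
    context_count, one_count = Counter(), Counter()
    for i in range(k, len(bits)):
        context = bits[i - k:i]
        context_count[context] += 1
        if bits[i] == "1":
            one_count[context] += 1
    total = 0.0
    for context, n_c in context_count.items():
        n1 = one_count[context]
        n0 = n_c - n1
        if n1:
            total -= n1 * math.log2(n1 / n_c)
        if n0:
            total -= n0 * math.log2(n0 / n_c)
    return total


def two_part_codelength(bits, k):
    """L(H) + L(D|H) in bits for a k-th order Markov chain."""
    n, d = len(bits), 2 ** k
    # Assumption: encode k+1 rather than k, so k = 0 is allowed.
    hypothesis_bits = 2 * math.log2(k + 1) + 1 + d * math.log2(n + 1)
    # Assumption: the first k symbols cost one bit each.
    data_bits = k + data_codelength(bits, k)
    return hypothesis_bits + data_bits


def select_order(bits, max_order):
    """Return the order k in 0..max_order with the smallest two-part codelength."""
    return min(range(max_order + 1), key=lambda k: two_part_codelength(bits, k))


if __name__ == "__main__":
    data = "01" * 100  # hypothetical data: a perfectly alternating sequence
    for k in range(4):
        print(k, round(two_part_codelength(data, k), 2))
    print("selected order:", select_order(data, 5))
```

Higher orders fit the alternating sequence no better than order 1, but each doubling of d adds log(n+1) bits per parameter, so the parameter cost stops the selection from drifting toward large k.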
26 Crude MDL Good news –We have found a principled manner to encode data D using H Bad news –We have not found clear guidelines to design codes for H
27 Outline Conceptual/non-technical introduction Probabilities and Codelengths Crude MDL Refined MDL Other topics
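28 Refined MDL Universal codes and universal distributions –The maximum likelihood code depends on the data, so a decoder cannot know in advance which code was used –How to describe the data in an unambiguous manner? –Can we design a single code such that, for every possible observation, its codelength equals the codelength under its own ML distribution? Impossible in general: the maximized likelihoods $P(x^n \mid \hat\theta(x^n))$ typically sum to more than 1 over all $x^n$, so they do not define a code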
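29 Refined MDL Worst-case regret –Regret of a distribution $\bar P$ relative to the model M on $x^n$: $-\log \bar P(x^n) - [-\log P(x^n \mid \hat\theta(x^n))]$ –Worst-case regret: the maximum of this quantity over all $x^n$ Optimal universal model –The $\bar P$ that minimizes the worst-case regret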
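30 Refined MDL Normalized maximum likelihood (NML): $P_{nml}(x^n) = P(x^n \mid \hat\theta(x^n)) / \sum_{y^n} P(y^n \mid \hat\theta(y^n))$ –Minimizing $-\log P_{nml}(x^n) = -\log P(x^n \mid \hat\theta(x^n)) + \log \sum_{y^n} P(y^n \mid \hat\theta(y^n))$: the first term measures fit, the second the complexity of the model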
31 Refined MDL Complexity of a model –The more sequences that can be fit well by an element of M, the larger M's complexity –Would it lead to a "right" balance between complexity and fit? Hopefully…
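For the Bernoulli model the complexity term can be computed exactly, because the maximized likelihood of a sequence depends only on its number of ones. The sketch below takes the complexity to be the log of the sum of maximized likelihoods over all sequences of length n (the usual NML normalizer); function names are hypothetical.

```python
import math
from itertools import product


def max_likelihood(n_ones, n):
    """P(x^n | theta_hat(x^n)) for a binary sequence with n_ones ones."""
    theta = n_ones / n
    # Python evaluates 0.0 ** 0 as 1.0, which is the convention needed here.
    return theta ** n_ones * (1.0 - theta) ** (n - n_ones)


def normalizer(n):
    """Sum of maximized likelihoods over all 2^n sequences, grouped by count of ones."""
    return sum(math.comb(n, j) * max_likelihood(j, n) for j in range(n + 1))


def complexity(n):
    """Model complexity in bits: log2 of the normalizer."""
    return math.log2(normalizer(n))


def nml_probability(bits):
    """Normalized maximum likelihood probability of a '0'/'1' string."""
    n = len(bits)
    return max_likelihood(bits.count("1"), n) / normalizer(n)


if __name__ == "__main__":
    for n in (1, 2, 10, 100):
        print(n, round(complexity(n), 3))
    # Sanity check: NML is a proper distribution over sequences of length 4.
    print(sum(nml_probability("".join(s)) for s in product("01", repeat=4)))
```

The normalizer exceeds 1 for every n, which is the numerical counterpart of the statement that maximized likelihoods alone do not define a code. For richer models the sum over all sequences is rarely this easy to evaluate.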
32 Refined MDL General refined MDL Source: Grünwald et al. (2005) Advances in Minimum Description Length: Theory and Applications.
33 Outline Conceptual/non-technical introduction Probabilities and Codelengths Crude MDL Refined MDL Other topics
34 Other topics Mixture code Resolvability …
35 References
Barron, A.; Rissanen, J. & Yu, B. (1998), 'The minimum description length principle in coding and modeling', IEEE Transactions on Information Theory 44(6).
Grünwald, P.D.; Myung, I.J. & Pitt, M.A. (2005), Advances in Minimum Description Length: Theory and Applications (Neural Information Processing), The MIT Press.
Hall, P. & Hannan, E.J. (1988), 'On stochastic complexity and nonparametric density estimation', Biometrika 75(4).