Variant definitions of pointer length in MDL
  Aris Xanthos, Yu Hu, and John Goldsmith
  University of Chicago
Degrees of freedom in MDL modeling
  MDL does not specify the form of the grammar being inferred.
  Carl de Marcken (1996)
  There are alternatives to pointers for representing connections.
  Different representations may lead to different grammars.
Linguistica (Goldsmith 2001)
  Website: linguistica.uchicago.edu
  Data: corpus segmented into words
  Model:
    List of stems
    List of suffixes
    List of signatures
  A sample signature: { walk, jump, ... } { ed, ing, ... }
Reminder: MDL analysis
  Corpus $C$
  2 or more competing models describing $C$
  Model $M$ assigns a probability to $C$: $\mathrm{pr}(C \mid M)$
  Compressed length of $C$ given $M$: $L(C \mid M) = -\log_2 \mathrm{pr}(C \mid M)$
  Length of model $M$: $L(M)$
  Description length of $C$ given $M$: $DL(C \mid M) = L(C \mid M) + L(M)$
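The definitions on this slide can be written out as a minimal Python sketch. The function names and the numbers for the two competing models are hypothetical and chosen only for illustration; they do not come from the talk.

```python
import math


def compressed_length(prob_corpus_given_model: float) -> float:
    """L(C | M) = -log2 pr(C | M), in bits."""
    return -math.log2(prob_corpus_given_model)


def description_length(prob_corpus_given_model: float, model_length: float) -> float:
    """DL(C | M) = L(C | M) + L(M), in bits."""
    return compressed_length(prob_corpus_given_model) + model_length


# Hypothetical values: model A is small but fits the corpus less well,
# model B is larger but assigns the corpus a higher probability.
dl_a = description_length(prob_corpus_given_model=2 ** -40, model_length=25.0)
dl_b = description_length(prob_corpus_given_model=2 ** -30, model_length=40.0)
preferred = "A" if dl_a < dl_b else "B"
```

With these made-up values, model A has the smaller description length, so MDL would prefer it even though model B compresses the corpus better.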
Learning process
  Bootstrapping heuristic: word = stem + suffix
  Successive heuristics propose modifications.
  MDL sanctions modifications:
    Compute L( corpus | model ) + L( model ) before and after each modification.
    If the modification decreases DL, retain it; otherwise discard it.
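The accept-or-discard step can be sketched as a simple greedy loop. This is only an illustration of the procedure described on the slide, not Linguistica's actual code; `mdl_search`, `heuristics`, and `description_length` are hypothetical names, and each heuristic is assumed to return a modified copy of the model.

```python
from typing import Callable, Iterable, TypeVar

Model = TypeVar("Model")


def mdl_search(
    model: Model,
    heuristics: Iterable[Callable[[Model], Model]],
    description_length: Callable[[Model], float],
) -> Model:
    """Try each heuristic in turn; keep its proposal only if DL decreases."""
    best_dl = description_length(model)
    for propose in heuristics:
        candidate = propose(model)              # proposed modification
        candidate_dl = description_length(candidate)
        if candidate_dl < best_dl:              # DL decreased: retain
            model, best_dl = candidate, candidate_dl
        # otherwise the candidate is discarded and the model is unchanged
    return model
```

Any object can serve as the model here, as long as `description_length` returns L( corpus | model ) + L( model ) for it.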
Length of the morphology
  L( morphology ) = sum of the lengths of its lists (stems, suffixes, signatures)
  Length of a list = sum of the lengths of its elements + a small cost for list structure
  Length of a stem or suffix is proportional to the number of symbols in it.
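A rough sketch of this bookkeeping follows. The per-symbol cost (a uniform code over 26 letters) and the list overhead of 1 bit are assumptions made here for illustration; the slide only says the cost is proportional to the number of symbols and that the list structure cost is small. Signature lengths are taken as given inputs.

```python
import math

BITS_PER_SYMBOL = math.log2(26)  # assumption: uniform code over a 26-letter alphabet
LIST_OVERHEAD = 1.0              # assumption: placeholder for the small list-structure cost


def string_length(morpheme: str) -> float:
    """Length of a stem or suffix: proportional to its number of symbols."""
    return len(morpheme) * BITS_PER_SYMBOL


def list_length(element_lengths) -> float:
    """Length of a list: sum of its element lengths plus the structure cost."""
    return sum(element_lengths) + LIST_OVERHEAD


def morphology_length(stems, suffixes, signature_lengths) -> float:
    """L(morphology): sum over the three lists (stems, suffixes, signatures)."""
    return (
        list_length(string_length(t) for t in stems)
        + list_length(string_length(f) for f in suffixes)
        + list_length(signature_lengths)
    )


total = morphology_length(["walk", "jump"], ["ed", "ing"], [10.0])
```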
Length of the morphology (2)
  A signature specifies that a set of stems associates with a set of suffixes:
    List of stems: { walk, jump, great, ... }
    List of suffixes: { ed, ing, est, ... }
    Signature: { walk, jump, ... } { ed, ing, ... }
Length of the morphology (3)
  A pointer is a symbol that stands for a given morpheme.
  The information content of a pointer to a morpheme $m$ is $-\log_2 \mathrm{pr}(m)$.
  The more probable the morpheme, the smaller the cost of a pointer to it.
  [Table: $\mathrm{pr}(m)$ against $-\log_2 \mathrm{pr}(m)$ in bits; the one legible entry is 10 bits.]
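The relation between probability and pointer cost is easy to check numerically. The probabilities below are illustrative values chosen here, not figures from the talk.

```python
import math


def pointer_cost(prob: float) -> float:
    """Information content, in bits, of a pointer to a morpheme of probability prob."""
    return -math.log2(prob)


# Illustrative probabilities: halving the probability adds one bit.
for prob in (1 / 2, 1 / 4, 1 / 1024):
    print(f"pr(m) = {prob:.6f} -> {pointer_cost(prob):.0f} bits")
```

A morpheme with probability 1/2 costs 1 bit to point to, while one with probability 1/1024 costs 10 bits.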
Length of the morphology (4)
  Length of a signature = sum of the lengths of 2 lists of pointers (to stems and to suffixes)
  Length of each list = sum of the information cost of the pointers in it + a small cost for list structure
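As a sketch, the length of a signature under the pointer encoding could be computed as below. The probability tables and the 1-bit list overhead are hypothetical; how morpheme probabilities are estimated is not specified on the slide.

```python
import math

LIST_OVERHEAD = 1.0  # assumption: placeholder for the small list-structure cost


def pointer_list_length(morphemes, prob) -> float:
    """Sum of -log2 pr(m) over the pointers in the list, plus the structure cost."""
    return sum(-math.log2(prob[m]) for m in morphemes) + LIST_OVERHEAD


def signature_length(stems, suffixes, stem_prob, suffix_prob) -> float:
    """Length of a signature: one list of stem pointers plus one of suffix pointers."""
    return pointer_list_length(stems, stem_prob) + pointer_list_length(suffixes, suffix_prob)


# Hypothetical probabilities for a signature { walk, jump } { ed, ing }.
stem_prob = {"walk": 0.25, "jump": 0.125}
suffix_prob = {"ed": 0.5, "ing": 0.25}
sig_len = signature_length(["walk", "jump"], ["ed", "ing"], stem_prob, suffix_prob)
```

With these made-up probabilities the stem pointers cost 2 + 3 bits and the suffix pointers 1 + 2 bits, so the signature costs 10 bits including the two list overheads.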
Compressed length of the corpus
  Corpus: walking in the...
  Morphology:
    List of stems: { walk, jump, great, ... }
    List of suffixes: { ed, ing, est, ... }
    Signature: { } { }
Compressed length of the corpus (2)
  Compressed length of a word $w$
    = information content of pointer to signature $\sigma$
    + information content of pointer to stem $t$ given $\sigma$
    + information content of pointer to suffix $f$ given $\sigma$
    $= -\log_2 \mathrm{pr}(\sigma) - \log_2 \mathrm{pr}(t \mid \sigma) - \log_2 \mathrm{pr}(f \mid \sigma)$
  L( corpus | morphology ) = sum of the compressed lengths of the individual words
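The word-level formula translates directly into code. In this sketch each word is represented by a hypothetical triple of probabilities (signature, stem given signature, suffix given signature); the values are invented for illustration.

```python
import math


def word_length(pr_sig: float, pr_stem_given_sig: float, pr_suffix_given_sig: float) -> float:
    """-log2 pr(sigma) - log2 pr(t | sigma) - log2 pr(f | sigma), in bits."""
    return -(
        math.log2(pr_sig)
        + math.log2(pr_stem_given_sig)
        + math.log2(pr_suffix_given_sig)
    )


def corpus_length(analyses) -> float:
    """L(corpus | morphology): sum of the compressed lengths of the words."""
    return sum(word_length(*analysis) for analysis in analyses)


# Hypothetical analyses of two word tokens.
analyses = [(0.5, 0.25, 0.5), (0.25, 0.5, 0.5)]
total_bits = corpus_length(analyses)
```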
Alternatives to pointers
  There are alternatives to pointers for representing connections in the morphology.
  List of (all) stems: { walk, jump, great, ..., chin }
  Signature σ as a binary string over that list: 110…0
  (one symbol per stem: 1 if the stem belongs to σ, 0 otherwise)
List of pointers vs. binary strings
  The number of symbols in a binary string is constant and equal to the total number of stems.
  The information content of the string depends on the distribution of 0's and 1's in it:
    total number of stems times entropy of string
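A minimal sketch of the binary-string cost follows. It assumes that "entropy of string" means the per-symbol entropy of the 0/1 distribution, estimated from the proportion of 1's in the string itself; the paper may define it more precisely. The function names are hypothetical.

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy, in bits per symbol, of a 0/1 source with pr(1) = p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def binary_string_cost(total_stems: int, stems_in_signature: int) -> float:
    """Total number of stems times the entropy of the string."""
    p = stems_in_signature / total_stems
    return total_stems * binary_entropy(p)


half = binary_string_cost(8, 4)     # half the stems flagged: 1 bit per symbol
quarter = binary_string_cost(8, 2)  # a skewed string is cheaper
```

Under this reading, the cost depends only on how many stems are flagged, not on which ones.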
Expected difference in DL
  Theoretical inference (see details in paper):
  1. Binary strings are shorter when:
       the distribution of stems tends to be uniform
       the distribution of the number of stems being pointed to tends to be uniform
  2. Lists of pointers are shorter when:
       the distribution of stems departs from uniformity
       the average number of stems being pointed to is small
A specific example
  Current state of the morphology:
    { walk, jump, ... } { ed, ing }
    { walks, broke, ... } { }
  Proposed modification: walks = walk + s
A specific example (2)
  State of the morphology after modification:
    { ..., jump, ... } { ed, ing }
    { walk } { ed, ing, s }
    { walks, broke, ... } { }
  Cost: pointers to ed, ing and s
  Savings: the string walks, a pointer to it
Crucial difference
  The compressed length of binary strings is independent of the frequency of the items being pointed to.
  This encoding does not favor the creation of pointers to frequent items (or the deletion of pointers to rare items).
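The contrast can be illustrated with a toy comparison. The stem probabilities below are hypothetical, list overheads are ignored, and the binary-string cost assumes the same reading as before (number of stems times the per-symbol entropy of the string).

```python
import math


def pointer_list_cost(stems, stem_prob) -> float:
    """Sum of -log2 pr(t) over the stems pointed to (list overhead ignored)."""
    return sum(-math.log2(stem_prob[t]) for t in stems)


def binary_string_cost(total_stems: int, stems_in_signature: int) -> float:
    """Number of stems times the per-symbol entropy of the 0/1 string."""
    p = stems_in_signature / total_stems
    if p in (0.0, 1.0):
        return 0.0
    return total_stems * (-p * math.log2(p) - (1 - p) * math.log2(1 - p))


# Hypothetical stem probabilities (not from the source).
stem_prob = {"walk": 0.5, "jump": 0.25, "great": 0.125, "chin": 0.125}

frequent_cost = pointer_list_cost(["walk", "jump"], stem_prob)  # frequent stems
rare_cost = pointer_list_cost(["great", "chin"], stem_prob)     # rare stems
string_cost = binary_string_cost(len(stem_prob), 2)             # same for both signatures
```

In this toy case the pointer list costs 3 bits for the two frequent stems and 6 bits for the two rare ones, while the binary string costs 4 bits in both cases, because it only records how many stems are selected.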
Conclusion
  There is more than one way of representing the connections between items in a grammar.
  The choice of a representation can have important consequences for the grammar being induced.
  Mathematical details can be found in the paper.