Unsupervised Learning of Natural Language Morphology using MDL
John Goldsmith
November 9, 2001
Today's presentation
1. The task: unsupervised learning
2. Overview of program and output
3. Overview of Minimum Description Length framework
4. Application of MDL to iterative search of morphology-space, with successively finer-grained descriptions
5. Mathematical model
6. Current capabilities
7. Current challenges
Unsupervised learning
Input: untagged text in orthographic or phonetic form, with spaces (or punctuation) separating words; no tagging or other text preparation.
Overview of program and output
Linguistica: a C++ Windows-based program available for download at http://humanities.uchicago.edu/faculty/goldsmith/Linguistica2000
Technical discussion in Computational Linguistics (June 2001).
Good results with 5,000 words; very fine-grained results with 500,000 words (corpus length, not lexicon count).
Output
List of stems, suffixes, and prefixes.
List of signatures. A signature: a list of all suffixes (prefixes) appearing in a given corpus with a given stem.
Hence, a stem in a corpus has a unique signature, and a signature has a unique set of stems associated with it.
(example of signature in English)
Signature NULL.ed.ing.s, with stems ask, call, point:
ask → ask, asked, asking, asks
call → call, called, calling, calls
point → point, pointed, pointing, points
…output
Roots (“stems of stems”) and the inner structure of stems.
Regular allomorphy of stems: e.g., it learns “delete stem-final –e in English before –ing and –ed”.
Minimum Description Length (MDL)
Jorma Rissanen: Stochastic Complexity in Statistical Inquiry (1989).
Work by Michael Brent and Carl de Marcken on word-discovery using MDL.
Essence of MDL
We are given:
1. a corpus, and
2. a probabilistic morphology, which technically means that we are given a distribution over certain strings of stems and affixes.
(“Given”? Given by whom? We'll get back to that.)
(Remember: a distribution is a set of non-negative numbers summing to 1.0.)
The higher the probability that the morphology assigns to the (observed) corpus, the better that morphology is as a model of that data.
Better said: -1 * log prob(corpus) is a measure of how well the morphology models the data: the smaller that number is, the better the morphology models the data. This is known as the optimal compressed length of the data, given the model. Using base-2 logs, this number is a measure in information-theoretic bits.
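In code, this measure is one line (a sketch; the function name is mine):

```python
import math

def compressed_length_bits(prob: float) -> float:
    """Optimal compressed length, in bits, of data to which the
    model assigns probability `prob`: -log2(prob). Smaller is better."""
    return -math.log2(prob)

# A morphology assigning the corpus probability 2**-1000 compresses it
# to 1000 bits; one assigning 2**-900 compresses it to 900 bits (better).
print(compressed_length_bits(2 ** -1000))   # 1000.0
print(compressed_length_bits(2 ** -900))    # 900.0
```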
Essence of MDL…
The goodness of the morphology is also measured by how compact the morphology itself is. We can measure that compactness in information-theoretic bits.
How can we measure the compactness of a morphology? Let’s consider a naïve version of description length: count the number of letters. This naïve version is nonetheless helpful in seeing the intuition involved.
Naive Minimum Description Length
Corpus: jump, jumps, jumping; laugh, laughed, laughing; sing, sang, singing; the, dog, dogs. Total: 61 letters.
Analysis:
Stems: jump laugh sing sang dog (20 letters)
Suffixes: s ing ed (6 letters)
Unanalyzed: the (3 letters)
Total: 29 letters.
Notice that the description length goes UP if we analyze sing into s+ing.
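The naïve letter-count above can be verified mechanically (a sketch; the variable names are mine):

```python
corpus = ["jump", "jumps", "jumping", "laugh", "laughed", "laughing",
          "sing", "sang", "singing", "the", "dog", "dogs"]

stems = ["jump", "laugh", "sing", "sang", "dog"]
suffixes = ["s", "ing", "ed"]
unanalyzed = ["the"]

def letters(words):
    """Naive description length of a list: total number of letters."""
    return sum(len(w) for w in words)

raw = letters(corpus)                                    # list every word whole
analyzed = letters(stems) + letters(suffixes) + letters(unanalyzed)
print(raw, analyzed)                                     # the analysis is much shorter
```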
Essence of MDL…
The best overall theory of a corpus is the one for which the sum
-log prob(corpus) + length of the morphology
(that's the description length) is the smallest.
Overall logic Search through morphology space for the morphology which provides the smallest description length.
Corpus Pick a large corpus from a language -- 5,000 to 1,000,000 words.
The search loop:
1. Feed the corpus into the “bootstrapping” heuristic...
2. Out of which comes a preliminary morphology, which need not be superb.
3. Feed it to the incremental heuristics...
4. Out comes a modified morphology.
5. Is the modification an improvement? Ask MDL!
6. If it is an improvement, replace the morphology (the old one goes to the garbage)...
7. Send it back to the incremental heuristics again...
8. Continue until there are no improvements to try.
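The loop just described can be sketched in Python; the argument names (bootstrap, incremental_heuristics, description_length) are hypothetical stand-ins for the components above, not Linguistica's actual API:

```python
def mdl_search(corpus, bootstrap, incremental_heuristics, description_length):
    """Sketch of the search loop: bootstrap an initial morphology, then
    repeatedly let each heuristic propose a modification, keeping it only
    if it shortens the description length."""
    morphology = bootstrap(corpus)
    improved = True
    while improved:
        improved = False
        for heuristic in incremental_heuristics:
            candidate = heuristic(morphology, corpus)
            if description_length(candidate, corpus) < description_length(morphology, corpus):
                morphology = candidate   # improvement: replace the morphology...
                improved = True          # ...and send it around the loop again
    return morphology
```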
1. Bootstrap heuristic
A function that takes words as inputs and gives an initial hypothesis regarding what are stems and what are affixes.
In theory, the search space is enormous: each word w of length |w| has at least |w| analyses, so the search space has at least as many members as the product of |w| over all words in the corpus.
Better bootstrap heuristics
Heuristic, not perfection! There are several good heuristics; the best is a modification of a good idea of Zellig Harris (1955).
Current variant: cut words at certain peaks of successor frequency.
Problems: can over-cut; can under-cut; and can put cuts too far to the right (the “aborti-” problem). [Not a problem!]
Successor frequency g o v e r n Empirically, only one letter follows “gover”: “n”
Successor frequency
g o v e r n m
Empirically, 6 letters follow “govern”: “m”, “i”, “o”, “s”, “e”, “#”.
Successor frequency
g o v e r n m e
Empirically, 1 letter follows “governm”: “e”.
g o v e r [1] n [6] m [1] e: the 6 after “govern” is a peak of successor frequency.
Lots of errors…
c  o  n  s  e  r  v  a  t  i  v  e  s
9 18 11  6  4  1  2  1  1  2  1  1
Peaks at “co-” (wrong), “conserv-” (right), and “conservati-” (wrong).
Even so…
We set conditions:
Accept cuts with stems at least 5 letters in length;
Demand that successor frequency be a clear peak: 1 … N … 1 (e.g., govern-ment).
Then, for each stem, collect all of its suffixes into a signature; and accept only signatures with at least 5 stems.
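A successor-frequency counter is easy to sketch (the toy lexicon and function name are mine); on the govern- example it reproduces the 1 … 6 … 1 peak:

```python
from collections import defaultdict

def successor_frequencies(word, lexicon):
    """For each prefix of `word`, count the distinct letters (or the
    word-end marker '#') that follow that prefix anywhere in the lexicon."""
    followers = defaultdict(set)
    for w in lexicon:
        for i in range(1, len(w) + 1):
            followers[w[:i]].add(w[i] if i < len(w) else "#")
    return [len(followers[word[:i]]) for i in range(1, len(word) + 1)]

lexicon = ["govern", "governed", "governing", "governor",
           "governs", "government"]
print(successor_frequencies("government", lexicon))
# → [1, 1, 1, 1, 1, 6, 1, 1, 1, 1]: a clear peak of 6 after "govern"
```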
2. Incremental heuristics
Coarse-grained to fine-grained:
1. Stems and suffixes to split: accept any analysis of a word if it consists of a known stem and a known suffix.
2. Loose fit: suffixes and signatures to split: collect any string that precedes a known suffix; find all of its apparent suffixes, and use MDL to decide if it's worth it to do the analysis. We'll return to this in a moment.
Incremental heuristic
3. Slide the stem-suffix boundary to the left: again, use MDL to decide.
How do we use MDL to decide?
Using MDL to judge a potential stem
act, acted, action, acts: we have the suffixes NULL, ed, ion, and s, but no signature NULL.ed.ion.s.
Let's compute the cost versus the savings of the signature NULL.ed.ion.s.
Savings: stem savings: 3 redundant copies of the stem act: that's 3 x 4 = 12 letters = almost 60 bits.
Cost of NULL.ed.ion.s
A pointer to each suffix: roughly -log freq(suffix) bits apiece. To give a feel for this: the total cost of the suffix list is about 30 bits.
Cost of the pointer to the signature: all the stems using the signature chip in to pay for its cost, though.
Cost of the signature: about 45 bits.
Savings: about 60 bits.
So MDL says: Do it! Analyze the words as stem + suffix.
Notice that the cost of the analysis would have been higher if one or more of the suffixes had not already “existed”.
Model
A model to give us a probability for each word in the corpus (hence, its optimal compressed length); and
a morphology whose length we can measure.
Frequency of an analyzed word
Word W is analyzed as belonging to signature σ, with stem T and suffix F:
prob(W) = ([σ]/[W]) * ([T]/[σ]) * ([F]/[σ])
Actually what we care about is the log of this:
log prob(W) = log([σ]/[W]) + log([T]/[σ]) + log([F]/[σ])
where [W] is the total number of words, and [x] means the count of x's in the corpus (token count).
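The factorization can be computed directly from token counts; the function name and the example counts here are invented for illustration:

```python
import math

def word_probability(sig_count, stem_count, suffix_count, total_words):
    """prob(W) for a word analyzed as stem T + suffix F under signature sigma:
    p(sigma) * p(T | sigma) * p(F | sigma), estimated from token counts."""
    return (sig_count / total_words) * (stem_count / sig_count) \
         * (suffix_count / sig_count)

# Hypothetical counts: a signature seen 1,000 times in a 100,000-token corpus;
# within that signature, this stem occurs 100 times and this suffix 400 times.
p = word_probability(1000, 100, 400, 100_000)
print(-math.log2(p))    # the word's optimal compressed length, in bits
```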
Next, let's see how to measure the length of a morphology.
A morphology is a set of 3 things:
a list of stems;
a list of suffixes;
a list of signatures with the associated stems.
We'll make an effort to make our grammars consist primarily of lists, whose length is conceptually simple.
Length of a list
A header telling us how long the list is, of length (roughly) log2(N), where N is the length; then the N entries.
What's in an entry?
Raw lists: strings of letters, where the length of each letter is log2(26) – the information content of a letter (we can use a more accurate conditional probability).
Pointer lists: pointers to entries on other lists.
Lists
Raw suffix list: ed, s, ing, ion, able, …
Signature 1: suffixes: pointer to “ing”, pointer to “ed”.
Signature 2: suffixes: pointer to “ing”, pointer to “ion”.
The length of each pointer is -log prob of the suffix pointed to – usually cheaper than the letters themselves.
The fact that a pointer to a symbol has a length that decreases with the symbol's frequency is the key:
We want the shortest overall grammar; so
that means maximizing the re-use of units (stems, affixes, signatures, etc.).
Length of the stem and suffix components: the number of letters, plus the list structure, plus the signatures, which we'll get to shortly.
Information contained in the signature component: a list of pointers to signatures; ⟨X⟩ indicates the number of distinct elements in X.
Repair heuristics: using MDL
We could compute the entire MDL in one state of the morphology (original morphology + compressed data); make a change; compute the whole MDL in the proposed, modified state (revised morphology + compressed data); and compare the two lengths.
But it's better to have a more thoughtful approach: define the change in each component directly. For instance, the change in the size of the “punctuation” (the list headers) for the 3 lists can be computed on its own.
Size of the suffix component, remember. The change in its size, when we consider a modification to the morphology, comes from:
1. Global effects of the change in the number of suffixes;
2. Effects of the change in size of suffixes present in both states;
3. Suffixes present only in state 1;
4. Suffixes present only in state 2.
Suffix component change:
global effect of the change on all suffixes;
suffixes whose counts change;
contribution of suffixes that appear only in state 1;
contribution of suffixes that appear only in state 2.
Digression on entropy, MDL, and morphology Why using MDL is closely related to measuring the complexity of the space of possible vocabularies
Consider the space of all words of length L, built from an alphabet of size b. How many ways are there to build a vocabulary of size N? Call that U(b,L,N). Clearly, U(b,L,N) = C(b^L, N): choose N distinct words out of the b^L possible strings.
Compare the operation of choosing a set of N words of length L (alphabet size b) with the operation of choosing a set of T stems (of length t) and a set of F suffixes (of length f), where t + f = L. If we take the complexity of each task to be measured by the log of its size, then we're comparing log U(b,L,N) with log U(b,t,T) + log U(b,f,F).
log U(b,L,N) is easy to approximate, however. Remember: for N much smaller than b^L,
log U(b,L,N) = log C(b^L, N) ≈ N * L * log(b) - log(N!)
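A quick numerical check of this approximation (my reconstruction of the slide's lost formula, valid when N is much smaller than b^L; helper names are mine):

```python
import math

def log2_vocab_count(b, L, N):
    """log2 U(b, L, N): log of the number of ways to choose a vocabulary
    of N distinct words of length L over an alphabet of size b."""
    return math.log2(math.comb(b ** L, N))

def log2_vocab_approx(b, L, N):
    """Approximation for N << b**L: log2 C(b^L, N) ≈ N*L*log2(b) - log2(N!)."""
    return N * L * math.log2(b) - math.log2(math.factorial(N))

# b=2, L=10 gives 1024 possible words; pick a vocabulary of N=5 of them.
print(log2_vocab_count(2, 10, 5), log2_vocab_approx(2, 10, 5))
```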
The number of bits needed to list all the words: the analysis.
The length of all the pointers to all the words: the compressed corpus.
Thus the log of the number of vocabularies = the description length of that vocabulary, in the terms we've been using.
That means that the difference in the sizes of the spaces of possible vocabularies is equal to the difference in description length in the two cases. Hence:
difference in complexity of the “simplex word” analysis and of the analyzed-word analysis
= log U(b,L,N) - log U(b,t,T) - log U(b,f,F)
= difference in size of the morphologies + difference in size of the compressed data
But we’ve (over)simplified in this case by ignoring the frequencies inherent in real corpora. What’s of great interest in real life is the fact that some suffixes are used often, others rarely, and similarly for stems.
We know something about the distribution of words, but nothing about the distribution of stems, and especially of suffixes. But suppose we wanted to think about the statistics of vocabulary choice in which words could be selected more than once…
We want to select N words of length L, and the same word can be selected more than once. How many ways of doing this are there? You can have any number of occurrences of a word, and two sets containing the same words the same number of times are indistinguishable. How many such vocabularies are there, then?
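The count being asked for is the number of multisets of size N drawn from the b^L possible words, the standard “stars and bars” quantity C(b^L + N - 1, N) (my reconstruction; the function name is mine):

```python
import math

def multiset_vocab_count(b, L, N):
    """Number of ways to pick N words of length L (alphabet size b)
    WITH repetition allowed, where only the words and their multiplicities
    matter: the number of multisets of size N from b**L possible words,
    i.e. C(b**L + N - 1, N)."""
    return math.comb(b ** L + N - 1, N)

# With repetition allowed there are more vocabularies than without:
# b=2, L=3 gives 8 possible words; choose N=4 of them.
print(multiset_vocab_count(2, 3, 4), math.comb(2 ** 3, 4))
```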
The answer can be written in terms of Z(i), the number of words of frequency i (“Z” stands for “Zipf”). We don't know much about the frequencies of suffixes, but Zipf's law says that the frequency of the r-th most frequent word is proportional to 1/r; hence Z(i) can be derived for a morpheme set that obeyed the Zipf distribution.
End of digression
Current research projects
1. Allomorphy: automatic discovery of relationships between stems (lov~love, win~winn)
2. Use of syntax (automatic learning of syntactic categories)
3. Rich morphology: other languages (e.g., Swahili) and sub-languages (e.g., the biochemistry sub-language) where the mean number of morphemes per word is much higher
4. Ordering of morphemes
Allomorphy: automatic discovery of relationships between stems
Currently learns (unfortunately, over-learns) how to delete stem-final letters in order to simplify signatures. E.g., delete stem-final –e in English before the suffixes –ing, –ed, –ion (etc.).
Automatic learning of syntactic categories
Work in progress with Mikhail Belkin (U of Chicago).
Pursuing Shi and Malik's 1997 application of spectral graph theory (vision).
Finding an eigenvector decomposition of a graph that represents bigrams and trigrams.
Rich morphologies
A practical challenge for use in data-mining and information retrieval in patent applications (de-oxy-ribo-nucle-ic, etc.)
Swahili, Hungarian, Turkish, etc.