1
Using Minimum Description Length to make Grammatical Generalizations Mike Dowman University of Tokyo
2
What should Syntactic Theory Explain? Which sentences are grammatical and which are not, or how to transform observed sentences into a grammar (E-to-I learning): children transform observed sentences (E-language) into psychological knowledge of language (I-language).
3
How should we study syntax? Linguists' approach: choose some sentences, decide on the grammaticality of each one, and make a grammar that accounts for which of these sentences are grammatical and which are not. (Diagram: an informant supplies sentences to the linguist, who produces a grammar.)
4
Computational Linguists' Approach (Unsupervised Learning): take a corpus, and either extract as much information from the corpus as accurately as possible or learn a grammar that describes the corpus as accurately as possible. (Diagram: a corpus is turned into a grammar, lexical items, a language model, etc.)
5
Which approach gives more insight into language? Linguists tend to aim for high precision, but only produce very limited and arbitrary coverage. Computational linguists tend to obtain much better coverage, but don't account for any body of data completely, and tend to learn only simpler kinds of structure. The approaches seem to be largely complementary.
6
Which approach gives more insight into the human mind? The huge size and complexity of languages is one of their key distinctive properties, and the linguists' approach doesn't account for this. So should we apply our algorithms to large corpora of naturally occurring data? This won't directly address the kind of issue that syntacticians focus on.
7
Negative Evidence: some constructions seem impossible to learn without negative evidence. John hurt himself / Mary hurt John; John hated himself / Mary hated John; John behaved himself / *Mary behaved John.
8
Implicit Negative Evidence: if we never hear something, can't we just assume it's not grammatical? Sentences we never heard? Phrases we never heard? Verb argument constructions we never heard? Word-affix combinations we never heard? How often does something have to not occur before we decide it's not grammatical? At what structural level do we make generalizations?
9
Minimum Description Length (MDL): MDL may be able to solve the 'no negative evidence' problem. It prefers the grammar that results in the simplest overall description of the data, so it prefers simple grammars, and grammars that allow simple descriptions of the data.
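The trade-off can be sketched as a two-part cost: bits to state the grammar plus bits to encode the observed sentences under it. All numbers below are invented purely for illustration, not taken from this work.

```python
import math

def description_length(grammar_bits, sentence_probs):
    # Two-part MDL cost: bits to state the grammar itself, plus bits to
    # encode each observed sentence given the grammar (-log2 of its
    # probability under that grammar).
    data_bits = sum(-math.log2(p) for p in sentence_probs)
    return grammar_bits + data_bits

# Invented costs: a permissive grammar is cheap to state but assigns each
# sentence a low probability; a tightly fitting grammar is the reverse.
permissive = description_length(grammar_bits=50, sentence_probs=[0.001] * 10)
tight = description_length(grammar_bits=800, sentence_probs=[0.05] * 10)
# MDL prefers whichever hypothesis yields the smaller total.
```

With these made-up figures the permissive grammar wins; because its per-sentence cost is higher, adding more data eventually tips the balance toward the tighter grammar, which is the point: the data decide how much generalization is justified.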
10
Observed sentences Space of possible sentences
11
Observed sentences Grammar Simple but non-constraining grammar Space of possible sentences
12
Observed sentences Grammars Simple but non-constraining grammar Complex but constraining grammar Space of possible sentences
13
Observed sentences Grammars Grammar that is a good fit to the data Simple but non-constraining grammar Complex but constraining grammar Space of possible sentences
14
Why it has to be MDL: many machine learning techniques have been applied in computational linguistics, but MDL is very rarely used, and it has not been especially successful at learning grammatical structure from corpora. So why MDL?
15
Maximum Likelihood: maximum likelihood can be seen as a special case of MDL in which the a priori probability P(h) of every hypothesis is equal. But the hypothesis that only the observed sentences are grammatical will result in the maximum likelihood, so ML can only be applied if there are restrictions on how well the estimated parameters can fit the data. The degree of generality of the grammars is set externally, not determined by the Maximum Likelihood principle.
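A crude illustration of the overfitting point: a maximum-likelihood estimate over whole sentences puts all probability mass on what was observed. This is a deliberately toy model, not anything from this work.

```python
from collections import Counter

def ml_sentence_model(corpus):
    # Maximum-likelihood estimate: each sentence gets probability
    # proportional to its observed count; everything else gets zero.
    counts = Counter(corpus)
    total = sum(counts.values())
    return {s: c / total for s, c in counts.items()}

corpus = ["John screamed", "John died", "Mary screamed"]
model = ml_sentence_model(corpus)

# The observed sentences exhaust the probability mass, so the unseen
# but intuitively grammatical "Mary died" is assigned probability zero:
print(model.get("Mary died", 0.0))  # 0.0
```

Unconstrained ML thus memorizes the corpus rather than generalizing, which is why the degree of generalization has to be imposed from outside the ML principle.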
16
Maximum Entropy: make the grammar as unrestrictive as possible. But constraints must be used to prevent a grammar from simply allowing any combination of words to be a grammatical sentence, so again the degree of generality of grammars is determined externally. Neither Maximum Likelihood nor Maximum Entropy provides a principle that can decide when to make generalizations.
17
Learning Phrase Structure Grammars. Data: John screamed. John died. Mary screamed. Grammar: 1 S → NP VP; 2 NP → John; 3 NP → Mary; 4 VP → screamed; 5 VP → died. Describing data in terms of the grammar: 1, 2, 4 = John screamed. There is a restricted range of choices at each stage of the derivation; fewer choices = higher probability.
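A small sketch of "fewer choices = higher probability", using rule counts read off the three example sentences; the bit cost of a derivation step is -log2 of that step's probability. The helper name is ours, not the model's.

```python
import math

# Rule counts per nonterminal, from the data above: NP -> John occurs
# twice, NP -> Mary once; VP -> screamed twice, VP -> died once.
# (S -> NP VP is the only S rule, so that step costs 0 bits.)
np_counts = {"John": 2, "Mary": 1}
vp_counts = {"screamed": 2, "died": 1}

def derivation_bits(subject, verb):
    # Each derivation step picks one rule among the alternatives for its
    # nonterminal; the cost in bits is -log2 of that choice's probability.
    p_np = np_counts[subject] / sum(np_counts.values())
    p_vp = vp_counts[verb] / sum(vp_counts.values())
    return -math.log2(p_np) - math.log2(p_vp)

# "John screamed" uses the frequent rules, so it is cheaper to encode
# than the unobserved combination "Mary died".
print(derivation_bits("John", "screamed") < derivation_bits("Mary", "died"))  # True
```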
18
Encoding in My Model: a bit string (e.g. 1010100111010100101101010001100111100011010110) is fed to a decoder, which uses symbol frequencies and rule frequencies to reconstruct the grammar and the data. Grammar: 1 S → NP VP; 2 NP → john; 3 NP → mary; 4 VP → screamed; 5 VP → died. Data: John screamed. John died. Mary screamed. Symbol frequencies: S (1), NP (3), VP (3), john (1), mary (1), screamed (1), died (1), null (4). Rule frequencies: rule 1: 3; rule 2: 2; rule 3: 1; rule 4: 2; rule 5: 1. Number of bits decoded = evaluation.
19
Example: English. Data: John hit Mary. Mary hit Ethel. Ethel ran. John ran. Mary ran. Ethel hit John. Noam hit John. Ethel screamed. Mary kicked Ethel. John hopes Ethel thinks Mary hit Ethel. Ethel thinks John ran. John thinks Ethel ran. Mary ran. Ethel hit Mary. Mary thinks John hit Ethel. John screamed. Noam hopes John screamed. Mary hopes Ethel hit John. Noam kicked Mary. Learned grammar: S → NP VP; VP → ran; VP → screamed; VP → Vt NP; VP → Vs S; Vt → hit; Vt → kicked; Vs → thinks; Vs → hopes; NP → John; NP → Ethel; NP → Mary; NP → Noam.
20
Real Language Data: can the MDL metric also learn grammars from corpora of unrestricted natural language? If it could, we'd largely have finished syntax. But the search space is far too big, so we need to simplify the task in some way: only learn verb subcategorization classes.
21
Switchboard Corpus (excerpt of an annotated parse tree): NP (DT the) (NN year) )))) (S-ADV (NP-SBJ (-NONE- *-1) ) (ADVP (RB just) ) (VP (VBG visiting) (NP (PRP$ her) (NNS children) )))) (..) (-DFL- E_S) )) Extracted information: Verb: spent; Subcategorization frame: * NP SP S
22
Extracted Data: only verbs tagged as VBD (past tense) were extracted, and modifiers to basic labels were ignored. This gave 21,759 training instances, 704 different verbs, and 706 distinct subcategorization frames; 25 different types of constituent appeared alongside the verbs (e.g. S, SBAR, NP, ADVP).
23
Verb Class Grammars: S → Class1 Subcat1; S → Class1 Subcat2; S → Class2 Subcat1; Class1 → grew; Class1 → ended; Class2 → did. grew and ended can appear with subcats 1 and 2; did only with subcat 2. Grouping together verbs with similar subcategorizations should improve the evaluation.
24
A New Search Mechanism: we need a search mechanism that will only produce candidate grammars of the right form. Start with all verbs in one class; then repeatedly move a randomly chosen verb either to a new class (P = 0.5) or to a different existing class (P = 0.5). Empty verb classes are deleted and redundant rules are removed.
25
A New Search Mechanism (2): annealing search. After no changes have been accepted for 2,000 iterations, switch to a merging phase, in which two randomly selected classes are merged. After no changes have been accepted for 2,000 iterations, switch back to the moving phase. Stop after no changes have been accepted for 20,000 iterations. Multiple runs were conducted and the grammar with the overall lowest evaluation was selected.
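The moving phase described above might be sketched as follows; `evaluate` stands in for the MDL metric (assumed to map a class assignment to a description length), and the merging phase, class deletion, and rule removal are omitted for brevity.

```python
import random

def moving_phase(verbs, evaluate, max_stale=2000):
    # Start with all verbs in a single class (class 0).
    classes = {v: 0 for v in verbs}
    best = evaluate(classes)
    stale = 0
    while stale < max_stale:
        v = random.choice(verbs)
        old = classes[v]
        existing = [c for c in set(classes.values()) if c != old]
        if random.random() < 0.5 or not existing:
            classes[v] = max(classes.values()) + 1   # move to a brand-new class
        else:
            classes[v] = random.choice(existing)     # move to another existing class
        score = evaluate(classes)
        if score < best:                             # keep only improvements
            best, stale = score, 0
        else:
            classes[v] = old                         # revert the move
            stale += 1
    return classes, best
```

Because each verb is keyed to its class in a dict, a class with no remaining members simply disappears from the assignment, mirroring the deletion of empty classes.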
26
Grammar Evaluations (in bits):

                      Best learned   Each verb in a    One verb
                      grammar        separate class    class
  Data                207,312.4      187,026.7         220,520.4
  Grammar              37,885.5      111,036.5          29,915.1
  Overall evaluation  245,198.0      298,063.3         250,435.5
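As a sanity check, each column's overall evaluation should equal its data cost plus its grammar cost; the figures below are the ones reported on this slide, and the small discrepancies are rounding.

```python
# (data bits, grammar bits, reported overall) per candidate grammar
evaluations = {
    "best learned grammar":       (207_312.4, 37_885.5, 245_198.0),
    "each verb in its own class": (187_026.7, 111_036.5, 298_063.3),
    "one verb class":             (220_520.4, 29_915.1, 250_435.5),
}
for name, (data, grammar, overall) in evaluations.items():
    # Overall evaluation = data cost + grammar cost (to within rounding).
    assert abs((data + grammar) - overall) < 0.2, name
# The best learned grammar wins by paying a larger grammar cost than
# "one verb class" in exchange for a much cheaper encoding of the data.
```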
27
Learned Classes:
Class 1: thought, vowed, prayed, decided, adjusted, wondered, wished, allowed, knew, suggested, claimed, believed, remarked, resented, detailed, misunderstood, assumed, competed, snowballed, smoked, said, struggled, determined, noted, understood, foresaw, expected, discovered, realized, negotiated, suspected, indicated. (Usually take an S or SBAR complement; the SBAR usually contains that or who etc. followed by an S.)
Class 2: enjoyed, canceled, liked, had, finished, traded, sold, ruined, needed, watched, loved, included, received, converted, rented, bred, deterred, increased, encouraged, made, swapped, shot, offered, spent, impressed, discussed, missed, carried, injured, presented, surprised… (Usually take an NP argument, often in conjunction with other arguments.)
Class 3: did. (did only.)
Class 4: all other verbs. (Miscellaneous.)
Class 5: used, named, tried, considered, tended, refused, wanted, managed, let, forced, began, appeared. (Typically take an S argument, but never just an SBAR.)
Class 6: wound, grew, ended, closed, backed. (Usually take a particle.)
28
Did MDL make appropriate generalizations? The learned verb classes are clearly linguistically coherent, but they don't account for exactly which verbs can appear with which subcats; linguists have proposed far more fine-grained classes. The data available for learning was limited (the subcats had no internal structure, and the Penn Treebank labels may not be sufficient). But linguists can't explain which verbs appear with which subcats either.
29
Conclusions: MDL (and only MDL) can determine when to make linguistic generalizations and when not to. The same MDL metric can be used both on small sets of example sentences and on unrestricted corpora. Work using corpora does not address the kind of issues that syntacticians are interested in.