1
A Bayesian Approach to the Poverty of the Stimulus
Amy Perfors (MIT), with Josh Tenenbaum (MIT) and Terry Regier (University of Chicago)
2
[Figure: a space of possible accounts of linguistic knowledge, ranging from Innate to Learned.]
3
[Figure: the hypothesis space along two dimensions: Innate vs. Learned, and Explicit structure vs. No explicit structure.]
4
[Figure: does language have hierarchical phrase structure? Yes vs. No.]
5
Why believe that language has hierarchical phrase structure? Formal properties plus an information-theoretic, simplicity-based argument (Chomsky, 1956).
The dependency structure of language: a finite-state grammar cannot capture the infinite sets of English sentences with dependencies like this. If we restrict ourselves to only a finite set of sentences, then in theory a finite-state grammar could account for them, "but this grammar will be so complex as to be of little use or interest."
6
Why believe that structure dependence is innate? The Argument from the Poverty of the Stimulus (PoS):
Data: Simple declarative: The girl is happy. / They are eating. Simple interrogative: Is the girl happy? / Are they eating?
Hypotheses: 1. Linear: move the first "is" (auxiliary) in the sentence to the beginning. 2. Hierarchical: move the auxiliary in the main clause to the beginning.
Test: Complex declarative: The girl who is sleeping is happy.
Result: Children say: Is the girl who is sleeping happy? NOT: *Is the girl who sleeping is happy?
(Chomsky, 1965, 1980; Crain & Nakayama, 1987)
7
Why believe it’s not innate?
- There are actually enough complex interrogatives in the input (Pullum & Scholz, 2002)
- Children’s behavior can be explained via statistical learning of natural language data (Lewis & Elman, 2001; Reali & Christiansen, 2005)
- It is not necessary to assume a grammar with explicit structure
8
[Figure: the Innate vs. Learned, Explicit structure vs. No explicit structure space.]
9
[Figure: the Innate vs. Learned, Explicit structure vs. No explicit structure space.]
10
Our argument
11
Our argument
We suggest that, contra the PoS claim, it is possible, given the nature of the input and certain domain-general assumptions about the learning mechanism, that an ideal, unbiased learner can realize that language has a hierarchical phrase structure; therefore this knowledge need not be innate.
The reason: grammars with hierarchical phrase structure offer an optimal tradeoff between simplicity and fit to natural language data.
12
Plan
- Model
  - Data: corpus of child-directed speech (CHILDES)
  - Grammars: linear & hierarchical; both hand-designed & the result of local search; linear also via automatic, unsupervised ML
  - Evaluation: complexity vs. fit
- Results
- Implications
13
The model: Data
- Corpus from the CHILDES database (Adam, Brown corpus): 55 files, age range 2;3 to 5;2; sentences spoken by adults to children
- Each word replaced by its syntactic category: det, n, adj, prep, pro, prop, to, part, vi, v, aux, comp, wh, c
- Ungrammatical sentences and the most grammatically complex sentence types were removed, keeping 21792 of 25876 utterances: topicalized sentences (66), sentences with serial verb constructions (459), subordinate phrases (845), sentential complements (1636), conjunctions (634), and ungrammatical sentences (444)
14
Data Final corpus contained 2336 individual sentence types corresponding to 21792 sentence tokens
15
Data: variation Amount of evidence available at different points in development
16
Data: variation Amount of evidence available at different points in development Amount comprehended at different points in development
17
Data: amount available
Rough estimate, split by age:

Epoch   # Files   Age          # types   % types
0       1         2;3          173       7.4%
1       11        2;3 to 2;8   879       38%
2       22        2;3 to 3;1   1295      55%
3       33        2;3 to 3;5   1735      74%
4       44        2;3 to 4;2   2090      89%
5       55        2;3 to 5;2   2336      100%
18
Data: amount comprehended
Rough estimate, split by frequency:

Level   Frequency   # types   % types   % tokens
1       500+        8         0.3%      28%
2       100+        37        1.6%      55%
3       50+         67        2.9%      64%
4       25+         115       4.9%      71%
5       10+         268       12%       82%
6       1+ (all)    2336      100%      100%
19
The model
- Data: child-directed speech (CHILDES)
- Grammars: linear & hierarchical; both hand-designed & the result of local search; linear also via automatic, unsupervised ML
- Evaluation: complexity vs. fit
20
Grammar types
Hierarchical:
- Context-free grammar: rules may have any mix of terminals and nonterminals on the right-hand side (e.g. NT → NT NT, NT → t NT NT, NT → t)
Linear:
- Regular grammar: rules of the form NT → t NT or NT → t
- "Flat" grammar: a list of each sentence
- 1-state grammar: anything is accepted
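To make the schema distinction concrete, here is a minimal Python sketch (with toy rule lists invented for illustration, not the grammars used in the study) of how the two classes differ: a regular grammar only allows right-linear rules, while a context-free grammar allows arbitrary right-hand sides.

```python
# Minimal sketch: grammars as lists of (lhs, rhs) rules over syntactic
# categories. Illustrative only; not the study's actual grammars.

# Terminals are lowercase category labels; nonterminals contain uppercase letters.
def is_terminal(symbol):
    return symbol.islower()

def is_right_linear(rules):
    """A grammar is regular (right-linear) if every rule is NT -> t NT or NT -> t."""
    for lhs, rhs in rules:
        if len(rhs) == 1 and is_terminal(rhs[0]):
            continue                      # NT -> t
        if len(rhs) == 2 and is_terminal(rhs[0]) and not is_terminal(rhs[1]):
            continue                      # NT -> t NT
        return False
    return True

# A toy context-free fragment (hierarchical): S -> NP VP, NP -> det n, VP -> vi
cfg_rules = [("S", ["NP", "VP"]), ("NP", ["det", "n"]), ("VP", ["vi"])]

# A toy regular fragment (linear): S -> det N1, N1 -> n V1, V1 -> vi
reg_rules = [("S", ["det", "N1"]), ("N1", ["n", "V1"]), ("V1", ["vi"])]

print(is_right_linear(cfg_rules))  # False: "NP VP" puts two nonterminals together
print(is_right_linear(reg_rules))  # True
```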
21
Specific hierarchical grammars: hand-designed
- CFG-S (standard CFG; 77 rules, 15 non-terminals): designed to be as linguistically plausible as possible
- CFG-L (larger CFG; 133 rules, 15 non-terminals): derived from CFG-S; contains additional productions corresponding to different expansions of the same NT (puts less probability mass on recursive productions)
22
Specific linear grammars: hand-designed
- FLAT (2336 rules, 0 non-terminals): a list of each sentence (exact fit, no compression)
- 1-STATE (26 rules, 0 non-terminals): anything accepted (poor fit, high compression)
23
Specific linear grammars: hand-designed
- FLAT (2336 rules, 0 non-terminals): a list of each sentence (exact fit, no compression)
- REG-N (289 rules, 85 non-terminals): narrowest regular grammar derived from the CFG
- 1-STATE (26 rules, 0 non-terminals): anything accepted (poor fit, high compression)
24
Specific linear grammars: hand-designed
- FLAT (2336 rules, 0 non-terminals): a list of each sentence (exact fit, no compression)
- REG-N (289 rules, 85 non-terminals): narrowest regular grammar derived from the CFG
- REG-M (169 rules, 14 non-terminals): mid-level regular grammar derived from the CFG
- 1-STATE (26 rules, 0 non-terminals): anything accepted (poor fit, high compression)
25
Specific linear grammars: hand-designed
- FLAT (2336 rules, 0 non-terminals): a list of each sentence (exact fit, no compression)
- REG-N (289 rules, 85 non-terminals): narrowest regular grammar derived from the CFG
- REG-M (169 rules, 14 non-terminals): mid-level regular grammar derived from the CFG
- REG-B (117 rules, 10 non-terminals): broadest regular grammar derived from the CFG
- 1-STATE (26 rules, 0 non-terminals): anything accepted (poor fit, high compression)
26
Automated search
- Local search around the hand-designed grammars
- Linear: unsupervised, automatic HMM learning (Goldwater & Griffiths, 2007): a Bayesian model for acquisition of a trigram HMM (designed for POS tagging, but given a corpus of syntactic categories, it learns a regular grammar)
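The connection between an HMM over syntactic categories and a probabilistic regular grammar can be made explicit. The sketch below assumes a simplified first-order HMM rather than the trigram model of Goldwater & Griffiths (2007), with made-up transition, emission, and stopping probabilities; it shows the standard sense in which learned HMM parameters define right-linear rules.

```python
# Sketch: turning a (first-order) HMM over syntactic categories into a
# probabilistic regular grammar. The matrices are toy values, not learned ones.

states = ["q0", "q1"]
categories = ["det", "n", "vi"]

# trans[i][j] = P(next state j | state i); emit[i][k] = P(category k | state i)
trans = [[0.3, 0.7],
         [0.6, 0.4]]
emit = [[0.6, 0.3, 0.1],
        [0.1, 0.3, 0.6]]
stop = [0.2, 0.5]   # P(sentence ends | state i); remaining mass goes to transitions

rules = []  # (lhs, rhs, probability)
for i, qi in enumerate(states):
    for k, cat in enumerate(categories):
        # Terminating rule: q_i -> cat
        rules.append((qi, [cat], emit[i][k] * stop[i]))
        # Continuing rules: q_i -> cat q_j
        for j, qj in enumerate(states):
            rules.append((qi, [cat, qj], emit[i][k] * (1 - stop[i]) * trans[i][j]))

for lhs, rhs, p in rules:
    print(f"{lhs} -> {' '.join(rhs)}   [{p:.3f}]")
```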
27
The model
- Data: child-directed speech (CHILDES)
- Grammars: linear & hierarchical; hand-designed & the result of local search; linear also via automatic, unsupervised ML
- Evaluation: complexity vs. fit
28
Grammars
[Figure: the hierarchical Bayesian model. T: type of grammar (context-free, regular, flat, 1-state), with an unbiased (uniform) prior over types; G: specific grammar; D: data.]
29
Grammars
[Figure: the same model, annotated. P(G | T) encodes complexity (the prior over specific grammars); P(D | G) encodes data fit (the likelihood).]
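Putting the slide's pieces together, the learner scores a specific grammar by the posterior that combines these terms:

P(G, T | D) ∝ P(D | G) × P(G | T) × P(T), where P(D | G) is the data fit (likelihood), P(G | T) is the complexity term (prior over specific grammars of that type), and P(T) is uniform over grammar types.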
30
Tradeoff: complexity vs. fit
Low prior probability = more complex; low likelihood = poor fit to the data.
[Figure: three example grammars along the tradeoff: fit low / simplicity high; fit moderate / simplicity moderate; fit high / simplicity low.]
31
Measuring complexity: prior
Designing a grammar (God’s eye view): grammars with more rules and non-terminals will have lower prior probability.
Notation: n = # of nonterminals; P_k = # productions of nonterminal k; N_i = # items in production i; V = vocabulary size; θ_k = production probability parameters for nonterminal k.
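The slide's prior equation did not survive extraction, so the following is only an illustrative sketch of a complexity prior of this general shape: a generative process that pays a probabilistic cost for each nonterminal, each production, and each right-hand-side symbol, so that grammars with more rules and non-terminals get lower prior probability. The geometric distributions, the parameter p = 0.5, and the omission of the θ_k term are assumptions of the sketch, not the paper's actual prior.

```python
import math

def log_geometric(count, p=0.5):
    """log P(count) under a geometric distribution on counts 1, 2, 3, ..."""
    return (count - 1) * math.log(1 - p) + math.log(p)

def log_prior(grammar, vocab_size):
    """Illustrative complexity prior over a grammar.

    grammar: dict mapping nonterminal -> list of productions,
             each production a list of symbols (terminals or nonterminals).
    Cost grows with the number of nonterminals (n), the number of productions
    per nonterminal (P_k), the length of each production (N_i), and each
    right-hand-side symbol (drawn uniformly from vocab + nonterminals).
    (The production-probability parameters theta_k would contribute a further
    term, omitted in this sketch.)
    """
    n = len(grammar)
    logp = log_geometric(n)                        # choose how many nonterminals
    n_symbols = vocab_size + n
    for productions in grammar.values():
        logp += log_geometric(len(productions))    # choose P_k
        for prod in productions:
            logp += log_geometric(len(prod))       # choose N_i
            logp += -len(prod) * math.log(n_symbols)  # choose each item uniformly
    return logp

toy_cfg = {"S": [["NP", "VP"]], "NP": [["det", "n"], ["pro"]], "VP": [["vi"], ["v", "NP"]]}
toy_flat = {"S": [["det", "n", "vi"], ["pro", "v", "det", "n"], ["pro", "vi"]]}
print(log_prior(toy_cfg, vocab_size=14), log_prior(toy_flat, vocab_size=14))
```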
32
Measuring fit: likelihood
The probability of that grammar generating the data: the product of the probability of each sentence’s parse.
Example: P(pro aux det n) = 0.5 × 0.25 × 1.0 × 0.25 × 0.5 ≈ 0.016, the product of the probabilities of the rules used in its parse.
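As a sketch of that computation (with a single parse per sentence assumed, and rule probabilities invented to match the arithmetic on the slide rather than taken from any grammar in the study):

```python
import math

def log_parse_probability(rule_probs):
    """Log probability of one parse: sum of log probabilities of the rules used."""
    return sum(math.log(p) for p in rule_probs)

def corpus_log_likelihood(parses):
    """Log likelihood of a corpus: sum over sentence types of their parse log probs."""
    return sum(log_parse_probability(rules) for rules in parses)

# The slide's example sentence "pro aux det n", with made-up rule probabilities:
example_parse = [0.5, 0.25, 1.0, 0.25, 0.5]
print(math.exp(log_parse_probability(example_parse)))   # ~0.016

# A tiny "corpus" of two sentence types:
print(corpus_log_likelihood([example_parse, [0.5, 0.5, 1.0]]))
```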
33
Plan
- Model
  - Data: corpus of child-directed speech (CHILDES)
  - Grammars: linear & hierarchical; hand-designed & the result of local search; linear also via automated, unsupervised ML
  - Evaluation: complexity vs. fit
- Results
- Implications
34
Results: data split by frequency levels (estimate of comprehension)
Log posterior probability (lower magnitude = better):

Corpus level   FLAT     REG-N    REG-M    REG-B    REG-AUTO   1-ST     CFG-S    CFG-L
1              (only partially recoverable: -116, -119, -125, -135, -161, -176)
2              -764     -537     -581     -538     -501       -476     -545     -586
3              -1480    -971     -905     -875     -841       -765     -835     -902
4              -7337    -3284    -2963    -2787    -3011      -3339    -2653    -2784
5              -13466   -5256    -4896    -4772    -5083      -6034    -4545    -4587
6              -85730   -29441   -27300   -27561   -28713     -40360   -27883   -26967
35
Results: data split by age (estimate of availability)
36
Log posterior probability (lower magnitude = better):

Corpus epoch   FLAT     REG-N    REG-M    REG-B    REG-AUTO   1-ST     CFG-S    CFG-L
0              -4849    -3181    -2671    -2488    -2422      -2443    -2187    -2312
1              -28778   -11608   -10209   -9891    -11127     -13379   -9673    -9522
2              -44158   -16346   -14972   -14557   -15643     -20594   -14541   -14194
3              -61365   -21757   -20182   -19775   -20332     -28765   -20109   -19527
4              -75570   -26201   -24507   -24193   -24786     -35547   -24706   -23904
5              -85730   -29441   -27300   -27561   -28713     -40360   -27883   -26967
37
Generalization: How well does each grammar predict sentences it hasn’t seen?
38
Complex interrogatives
[Table: for each sentence type below, whether it appears in the corpus and whether each grammar (RG-N, RG-M, RG-B, AUTO, 1-ST, CFG-S, CFG-L) generates it; the cell values did not survive extraction.]
- Simple declarative: Eagles do fly. (n aux vi)
- Simple interrogative: Do eagles fly? (aux n vi)
- Complex declarative: Eagles that are alive do fly. (n comp aux adj aux vi)
- Complex interrogative: Do eagles that are alive fly? (aux n comp aux adj vi)
- Complex interrogative, structure-independent form: Are eagles that alive do fly? (aux n comp adj aux vi)
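A generalization check of this kind can be scripted: define a candidate grammar over category symbols and ask whether it can parse each test sequence. The sketch below uses NLTK and a deliberately tiny toy CFG (not CFG-S or CFG-L from the study) just to show the mechanics.

```python
# Sketch: testing whether a toy hierarchical grammar generates the test
# category sequences. The grammar is invented for illustration and is far
# smaller than the CFGs evaluated in the study.
import nltk

toy_cfg = nltk.CFG.fromstring("""
S  -> NP VP | 'aux' NP VI
NP -> 'n' | 'n' RC
RC -> 'comp' 'aux' 'adj'
VP -> 'aux' VI
VI -> 'vi'
""")
parser = nltk.ChartParser(toy_cfg)

tests = {
    "simple declarative":            ["n", "aux", "vi"],
    "simple interrogative":          ["aux", "n", "vi"],
    "complex declarative":           ["n", "comp", "aux", "adj", "aux", "vi"],
    "complex interrogative":         ["aux", "n", "comp", "aux", "adj", "vi"],
    "structure-independent version": ["aux", "n", "comp", "adj", "aux", "vi"],
}

for name, seq in tests.items():
    parses = list(parser.parse(seq))
    print(f"{name:32s} generated: {bool(parses)}")
```

Under this toy grammar, the first four sequences are generated and the structure-independent version is not, mirroring the pattern the hierarchical grammars are claimed to produce.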
39
Take-home messages
- Given reasonable domain-general assumptions, an unbiased rational learner could realize that language has hierarchical phrase structure on the basis of typical child-directed input
- This paradigm is valuable: it makes assumptions explicit and enables us to rigorously evaluate how different representations capture the tradeoff between simplicity and fit to data
- In some ways, “higher-order” knowledge may be easier to learn than specific details (the “blessing of abstraction”)
40
Implications for innateness? This is an ideal-learner analysis.
Strong(er) assumptions:
- The learner can find the best grammar in the space of possibilities
Weak(er) assumptions:
- The learner has the ability to parse the corpus into syntactic categories
- The learner can represent both linear and hierarchical grammars
We also assume a particular way of calculating complexity & data fit. Have we actually found representative grammars?
41
The End Thanks also to the following for many helpful discussions: Virginia Savova, Jeff Elman, Danny Fox, Adam Albright, Fei Xu, Mark Johnson, Ken Wexler, Ted Gibson, Sharon Goldwater, Michael Frank, Charles Kemp, Vikash Mansinghka, Noah Goodman
43
Grammars
[Figure: the hierarchical model. T: grammar type; G: specific grammar; D: data.]
44
Grammars
[Figure: the hierarchical model with grammar types enumerated. T: grammar type (context-free, regular, flat, 1-state); G: specific grammar; D: data.]
45
The Argument from the Poverty of the Stimulus (PoS)
P1. Children show a specific pattern of behavior B
P2. A particular generalization G must be grasped in order to produce B
P3. It is impossible to reasonably induce G simply on the basis of the data D that children receive
C1. Some abstract knowledge T, limiting which specific generalizations G are possible, is necessary
46
The Argument from the Poverty of the Stimulus (PoS)
P1. Children show a specific pattern of behavior B
P2. A particular generalization G must be grasped in order to produce B
P3. It is impossible to reasonably induce G simply on the basis of the data D that children receive
C1. Some abstract knowledge T, limiting which specific generalizations G are possible, is necessary
Corollary: the abstract knowledge T could not itself be learned, or could not be learned before G is known
C2. T must be innate
47
The Argument from the Poverty of the Stimulus (PoS)
P1. Children show a specific pattern of behavior B
P2. A particular generalization G must be grasped in order to produce B
P3. It is impossible to reasonably induce G simply on the basis of the data D that children receive
C1. Some abstract knowledge T, limiting which specific generalizations G are possible, is necessary
Corollary: the abstract knowledge T could not itself be learned, or could not be learned before G is known
C2. T must be innate
In this instance:
- G: a specific grammar
- D: typical child-directed speech input
- B: children don’t make certain mistakes (they don’t seem to entertain structure-independent hypotheses)
- T: language has hierarchical phrase structure
48
Data
The final corpus contained 2336 individual sentence types corresponding to 21792 sentence tokens.
Why types?
- Grammar learning depends on what sentences are generated, not on how many of each type there are
- Much more computationally tractable
- The distribution of sentence tokens depends on many factors other than the grammar (e.g., pragmatics, semantics, discussion topics) [Goldwater, Griffiths, Johnson 05]
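The type/token reduction itself is straightforward once each utterance has been mapped to its sequence of syntactic categories; a minimal sketch with invented example utterances (not the CHILDES data):

```python
# Sketch: collapsing a corpus of category sequences (one per utterance)
# into sentence types with token counts. Example data is invented.
from collections import Counter

corpus_tokens = [
    ("pro", "aux", "vi"),
    ("det", "n", "aux", "adj"),
    ("pro", "aux", "vi"),          # a repeat of the first type
    ("wh", "aux", "pro", "vi"),
]

type_counts = Counter(corpus_tokens)
sentence_types = list(type_counts)     # what the grammars are scored on

print(f"{len(corpus_tokens)} tokens, {len(sentence_types)} types")
for seq, count in type_counts.items():
    print(" ".join(seq), "x", count)
```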
49
Specific linear grammars: hand-designed
- FLAT (2336 rules, 0 non-terminals): a list of each sentence (exact fit, no compression)
- REG-N (289 rules, 85 non-terminals): narrowest regular grammar derived from the CFG
- REG-M (169 rules, 14 non-terminals): mid-level regular grammar derived from the CFG
- REG-B (117 rules, 10 non-terminals): broadest regular grammar derived from the CFG
- 1-STATE (26 rules, 0 non-terminals): anything accepted (poor fit, high compression)
50
Why these results?
- Natural language actually is generated from a grammar that looks more like a CFG
- The other grammars (e.g., FLAT) overfit and therefore do not capture important language-specific generalizations
52
Computing the prior…
[Figure: example rule schemas for each grammar type. Regular grammar (REG): rules of the form NT → t NT or NT → t. Context-free grammar (CFG): right-hand sides may mix terminals and nonterminals, e.g. NT → NT NT or NT → t NT NT.]
54
Likelihood, intuitively
[Figure: three candidate sources X, Y, Z and a set of observed data points.]
Z is ruled out because it does not explain some of the data points. X and Y both “explain” the data points, but X is the more likely source.
56
Possible empirical tests
- Present people with data from which the model learns FLAT, REG, and CFG grammars; see which novel productions they generalize to (non-linguistic stimuli? small children?)
- Examples of learning regular grammars in real life: does the model do the same?
57
Do people learn regular grammars? Children’s songs: line-level grammar
[Figure: a small state machine (s1 → s2 → s3, with a repeated word w1) generating each line.]
Miss Mary Mack, Mack, Mack / All dressed in black, black, black / With silver buttons, buttons, buttons / All down her back, back, back / She asked her mother, mother, mother, …
Spanish dancer, do the splits. / Spanish dancer, give a kick. / Spanish dancer, turn around.
58
Do people learn regular grammars? Children’s songs: song-level grammar
[Figure: a state machine (X X s1 s2 s3) generating each line of the song.]
Teddy bear, teddy bear, turn around. / Teddy bear, teddy bear, touch the ground. / Teddy bear, teddy bear, show your shoe. / Teddy bear, teddy bear, that will do. / Teddy bear, teddy bear, go upstairs. / …
Bubble gum, bubble gum, chew and blow, / Bubble gum, bubble gum, scrape your toe, / Bubble gum, bubble gum, tastes so sweet, …
Dolly Dimple walks like this, / Dolly Dimple talks like this, / Dolly Dimple smiles like this, / Dolly Dimple throws a kiss.
59
Do people learn regular grammars? A my name is Alice And my husband's name is Arthur, We come from Alabama, Where we sell artichokes. B my name is Barney And my wife's name is Bridget, We come from Brooklyn, Where we sell bicycles. … Songs containing items represented as lists (where order matters) Dough a Thing I Buy Beer With Ray a guy who buys me beer Me, the one who wants a beer Fa, a long way to the beer So, I think I'll have a beer La, -gers great but so is beer! Tea, no thanks I'll have a beer … Cinderella, dressed in yella, Went upstairs to kiss a fella, Made a mistake and kissed a snake, How many doctors did it take? 1, 2, 3, …
60
Do people learn regular grammars? Most of the song is a template, with a repeated (varying) element:
You put your [body part] in / You put your [body part] out / You put your [body part] in and you shake it all about / You do the hokey pokey / And you turn yourself around / And that's what it's all about!
If I were the marrying kind / I thank the lord I'm not sir / The kind of rugger I would be / Would be a rugby [position/item] sir / Cos I'd [verb phrase] / And you'd [verb phrase] / We'd all [verb phrase] together / …
If you’re happy and you know it [verb] your [body part] / If you’re happy and you know it then your face will surely show it / If you’re happy and you know it [verb] your [body part]
61
Do people learn regular grammars? There was a farmer had a dog, And Bingo was his name-O. B-I-N-G-O! And Bingo was his name-O! (each subsequent verse, replace a letter with a clap) Other interesting structures… I know a song that never ends, It goes on and on my friends, I know a song that never ends, And this is how it goes: (repeat) Oh, Sir Richard, do not touch me (each subsequent verse, remove the last word at the end of the sentence)
63
New PRG: 1-state
[Figure: a single state S plus an End state; S can emit any category: det, n, pro, prop, prep, adj, aux, wh, comp, to, v, vi, part.]
Log(prior) = 0; no free parameters.
64
Another PRG: standard + noise
For instance, a level-1 PRG + noise would be the best regular grammar for the corpus at level 1, plus the 1-state model. This could parse all levels of evidence. Perhaps this would be better than a more complicated PRG at later levels of evidence.
66
Results: frequency levels (comprehension estimates)
Log prior and log likelihood, in absolute magnitude; the log posterior magnitude (smaller is better) is their sum.

Corpus level   Flat            RG-L            RG-S            CFG-S           CFG-L
               prior / lik     prior / lik     prior / lik     prior / lik     prior / lik
1              68 / 17         116 / 19        101 / 18        133 / 24        164 / 26
2              405 / 134       394 / 146       357 / 156       313 / 185       446 / 176
3              783 / 281       560 / 322       475 / 333       384 / 401       436 / 373
4              1509 / 548      783 / 627       607 / 653       491 / 490       596 / 709
5              4087 / 1499     1343 / 1758     858 / 1863      541 / 2078      778 / 1941
6              51489 / 18119   5084 / 24326    1559 / 25625    681 / 27289     1330 / 25754
67
Results: availability by age
Log prior and log likelihood, in absolute magnitude; the log posterior magnitude (smaller is better) is their sum.

Period   Flat            RG-L            RG-S            CFG-S           CFG-L
         prior / lik     prior / lik     prior / lik     prior / lik     prior / lik
0        2839 / 891      1457 / 1260     843 / 1342      552 / 1498      808 / 1425
1        16831 / 5959    3360 / 7804     1349 / 8291     667 / 8879      1175 / 8373
2        26063 / 9272    3748 / 12168    1464 / 12891    674 / 13785     1273 / 13006
3        36575 / 12932   4313 / 17185    1493 / 18123    681 / 19406     1296 / 18280
4        45292 / 15969   4681 / 21376    1521 / 22536    681 / 24059     1296 / 22674
5        51489 / 18119   5084 / 24326    1559 / 25625    681 / 27289     1330 / 25754
69
Specific grammars of each type: one type of hand-designed grammar
[Figure: example productions; one grammar with 69 productions and 14 nonterminals, and one with 390 productions and 85 nonterminals.]
70
Specific grammars of each type: the other type of hand-designed grammar
[Figure: example productions; one grammar with 126 productions and 14 nonterminals, and one with 170 productions and 14 nonterminals.]
72
The Argument from the Poverty of the Stimulus (PoS)
P1. It is impossible to have made some generalization G simply on the basis of data D
P2. Children show behavior B
P3. Behavior B is not possible without having made G
C1. Some constraints T, which limit what type of generalizations G are possible, must be innate
In this instance:
- G: a specific grammar
- D: typical child-directed speech input
- B: children don’t make certain mistakes (they don’t seem to entertain structure-independent hypotheses)
- T: language has hierarchical phrase structure
73
#1: Children hear complex interrogatives
Well, a few, but not many (Legate & Yang, 2002):
- Adam (CHILDES), 0.048%: no yes-no questions; four wh-questions (e.g., “What is the music it’s playing?”)
- Nina (CHILDES), 0.068%: no yes-no questions; 14 wh-questions
In all, most estimates are << 1% of the input.
74
#1: Children hear complex interrogatives
Well, a few, but not many (Legate & Yang, 2002):
- Adam (CHILDES), 0.048%: no yes-no questions; four wh-questions (e.g., “What is the music it’s playing?”)
- Nina (CHILDES), 0.068%: no yes-no questions; 14 wh-questions
In all, most estimates are << 1% of the input. How much is “enough”?
75
#2: Can get the behavior without structure
There is enough statistical information in the input to conclude which type of complex interrogative is ungrammatical (Lewis & Elman, 2001; Reali & Christiansen, 2004).
Rare: comp adj aux. Common: comp aux adj.
76
#2: Can get the behavior without structure
Response: there is enough statistical information in the input to conclude that “Are eagles that alive can fly?” is ungrammatical (Lewis & Elman, 2001; Reali & Christiansen, 2004). Rare: comp adj aux. Common: comp aux adj.
But this sidesteps the question: it does not address the innateness of structure (the abstract knowledge T), and it is explanatorily opaque.
77
Why do linguists believe that language has hierarchical phrase structure? Formal properties plus an information-theoretic, simplicity-based argument (Chomsky, 1956):
- A sentence S has an (i, j) dependency if replacement of the ith symbol a_i of S by b_i requires a corresponding replacement of the jth symbol a_j of S by b_j.
- If S has an m-termed dependency set in L, at least 2^m states are necessary in the finite-state grammar that generates L.
- Therefore, if L is a finite-state language, there is an m such that no sentence S of L has a dependency set of more than m terms in L.
- The “mirror language” made up of sentences consisting of a string X followed by X in reverse (e.g., aa, abba, babbab, aabbaa, etc.) has the property that for any m we can find a dependency set D = {(1, 2m), (2, 2m-1), …, (m, m+1)}. Therefore it cannot be captured by any finite-state grammar.
- English has infinite sets of sentences with dependency sets of more than any fixed number of terms. E.g., in “the man who said that S5 is arriving today”, there is a dependency between “man” and “is”. Therefore English cannot be finite-state.
- A possible counterargument: since any finite corpus could be captured by a finite-state grammar, English is only not finite-state in the limit; in practice, it could be.
- Easy counterargument: simplicity considerations. Chomsky: “If the processes have a limit, then the construction of a finite-state grammar will not be literally impossible (since a list is a trivial finite-state grammar), but this grammar will be so complex as to be of little use or interest.”
78
The big picture
[Figure: a space of accounts ranging from Innate to Learned.]
79
Grammar acquisition (Chomsky)
[Figure: the Innate vs. Learned space, locating grammar acquisition (Chomsky).]
80
The Argument from the Poverty of the Stimulus (PoS)
P1. Children show behavior B
81
The Argument from the Poverty of the Stimulus (PoS)
P1. Children show behavior B
P2. Behavior B is not possible without having some specific grammar or rule G
82
The Argument from the Poverty of the Stimulus (PoS)
P1. Children show behavior B
P2. Behavior B is not possible without having some specific grammar or rule G
P3. It is impossible to have learned G simply on the basis of data D
83
The Argument from the Poverty of the Stimulus (PoS)
P1. Children show behavior B
P2. Behavior B is not possible without having some specific grammar or rule G
P3. It is impossible to have learned G simply on the basis of data D
C1. Some constraints T, which limit what type of grammars are possible, must be innate
84
Replies to the PoS argument
P1. It is impossible to have made some generalization G simply on the basis of data D
P2. Children show behavior B
P3. Behavior B is not possible without having made G
C1. Some constraints T, which limit what type of generalizations G are possible, must be innate
Reply: there are enough complex interrogatives in D (e.g., Pullum & Scholz, 2002)
85
Replies to the PoS argument
P1. It is impossible to have made some generalization G simply on the basis of data D
P2. Children show behavior B
P3. Behavior B is not possible without having made G
C1. Some constraints T, which limit what type of generalizations G are possible, must be innate
Replies:
- There are enough complex interrogatives in D (Pullum & Scholz, 2002)
- There is a route to B other than G: statistical learning (e.g., Lewis & Elman, 2001; Reali & Christiansen, 2005)
86
[Figure: the Innate vs. Learned space.]
87
[Figure: the Innate vs. Learned, Explicit structure vs. No explicit structure space.]
88
[Figure: the Innate vs. Learned, Explicit structure vs. No explicit structure space.]
110
Our argument
Assumptions: the learner is equipped with
- the capacity to represent both linear and hierarchical grammars (no bias)
- a rational Bayesian learning mechanism & probability calculation
- the ability to effectively search the space of possible grammars
111
Take-home message
We have shown that, given reasonable domain-general assumptions, an unbiased rational learner could realize that language has a hierarchical structure based on typical child-directed input.
112
Take-home message
We have shown that, given reasonable domain-general assumptions, an unbiased rational learner could realize that language has a hierarchical structure based on typical child-directed input.
We can use this paradigm to explore the role of recursive elements in a grammar:
- The “winning” grammar contains additional non-recursive counterparts for complex NPs
- Perhaps language, while fundamentally recursive, contains duplicate non-recursive elements that more precisely match the input?
113
The role of recursion
We evaluated an additional grammar (CFG-DL) that contained no recursive complex NPs at all; instead, it has multiply-embedded, depth-limited ones. No sentence in the corpus occurred with more than two levels of nesting.
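One way to construct such a depth-limited variant is to unroll each recursive nonterminal into a chain of depth-indexed copies. The sketch below uses an invented NP rule, not CFG-DL's actual productions, to illustrate the transformation.

```python
# Sketch: unrolling a recursive nonterminal into depth-limited copies.
# The example rule is invented; CFG-DL's actual productions are not shown here.

def unroll(rules, recursive_nt, max_depth):
    """Replace self-references to `recursive_nt` with depth-indexed copies
    (NP -> ... NP becomes NP -> ... NP1, NP1 -> ... NP2, ...), dropping the
    recursive expansion at the deepest level."""
    out = []
    names = [recursive_nt] + [f"{recursive_nt}{d}" for d in range(1, max_depth + 1)]
    for depth, name in enumerate(names):
        deeper = names[depth + 1] if depth < max_depth else None
        for lhs, rhs in rules:
            if lhs != recursive_nt:
                if depth == 0:
                    out.append((lhs, rhs))   # non-recursive nonterminals copied once
                continue
            if recursive_nt in rhs and deeper is None:
                continue                     # no recursive expansion at the maximum depth
            new_rhs = [deeper if sym == recursive_nt else sym for sym in rhs]
            out.append((name, new_rhs))
    return out

# Toy grammar: NP is recursive via a relative clause containing another NP.
rules = [("S", ["NP", "VP"]),
         ("NP", ["det", "n"]),
         ("NP", ["det", "n", "comp", "v", "NP"]),
         ("VP", ["v", "NP"])]

for lhs, rhs in unroll(rules, "NP", max_depth=2):
    print(lhs, "->", " ".join(rhs))
```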
114
The role of recursion: results
Log posterior probability (lower magnitude = better), data split by age (epochs):

Corpus epoch   FLAT     REG-N    REG-M    REG-B    1-ST     CFG-S    CFG-L    CFG-DL
0              -4849    -3181    -2671    -2488    -2443    -2123    -2272    -2339
1              -28778   -11608   -10209   -9891    -13379   -9444    -9427    -9438
2              -44158   -16346   -14972   -14557   -20594   -14242   -14005   -14031
3              -61365   -21757   -20182   -19775   -28765   -19694   -19168   -19211
4              -75570   -26201   -24507   -24193   -35547   -24185   -23426   -23506
5              -85730   -29441   -27300   -27561   -40360   -27300   -26407   -26485
115
The role of recursion: results
Log posterior probability (lower magnitude = better), data split by frequency levels:

Corpus level   FLAT     REG-N    REG-M    REG-B    1-ST     CFG-S    CFG-L    CFG-DL
1              (only partially recoverable: -116, -119, -135, -164, -146, -191)
2              -764     -537     -581     -538     -476     -584     -626     -692
3              -1480    -971     -905     -875     -765     -856     -936     -1032
4              -7337    -3284    -2963    -2787    -3339    -2669    -2790    -2824
5              -13466   -5256    -4896    -4772    -6034    -4566    -4620    -4603
6              -85730   -29441   -27300   -27561   -40360   -27300   -26407   -26485
116
The role of recursion: implications
The optimal tradeoff results in a grammar that “goes beyond the data” in interesting ways: auxiliary fronting; recursive complex NPs.
A grammar with recursive complex NPs is more optimal, even though:
- recursive productions hurt the likelihood
- there are no sentences with more than two levels of nesting in the input