1
Explorations in language learnability using probabilistic grammars and child-directed speech
Amy Perfors & Josh Tenenbaum, MIT; Terry Regier, U Chicago
Acknowledgments: Tom Griffiths, Charles Kemp, the Computational Cognitive Science group at MIT, and all the researchers whose work is discussed. Thanks also: Adam Albright, Jeff Elman, Danny Fox, Ted Gibson, Sharon Goldwater, Mark Johnson, Jay McClelland, Raj Singh, Ken Wexler, Fei Xu, NSF.
2
Everyday inductive leaps
How can people learn so much about the world from such limited evidence?
- Kinds of objects and their properties
- Meanings and forms of words, phrases, and sentences
- Causal relations
- Intuitive theories of physics, psychology, …
- Social structures, conventions, and rules
The goal: a general-purpose computational framework for understanding how people make these inductive leaps, and how they can be successful … the “biochemistry” of the mind.
3
The approach: reverse engineering induction
The big questions:
- How can abstract knowledge guide generalization from sparsely observed data?
- What is the form and content of abstract knowledge, across different domains?
- How could abstract knowledge itself be acquired?
A computational toolkit for addressing these questions:
- Bayesian inference in probabilistic generative models.
- Probabilistic models defined over structured representations: graphs, grammars, schemas, predicate logic, lambda calculus, functional programs.
- Hierarchical probabilistic models, with inference at multiple levels of abstraction and multiple timescales.
4
[Diagram] “UG” (marked with a “?”) → Grammar → Phrase structure → Utterance → Speech signal, with conditional distributions P(grammar | UG), P(phrase structure | grammar), P(utterance | phrase structure), P(speech | utterance). (c.f. Chater & Manning, TiCS 2006)
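Written out, the diagram corresponds to a factorized joint distribution (a standard reconstruction of what the arrows imply, with S, U, T, G standing for speech, utterance, phrase-structure tree, and grammar):

$$P(S, U, T, G \mid \mathrm{UG}) \;=\; P(S \mid U)\, P(U \mid T)\, P(T \mid G)\, P(G \mid \mathrm{UG})$$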
5
Vision as probabilistic parsing
[Diagram] Scene graph → Surface configuration → Image, with P(scene), P(surfaces | scene), P(image | config). (Han & Zhu, 2006; c.f. Zhu, Yuanhao & Yuille, NIPS 2006)
6
Learning about categories, labels, and hidden properties
[Diagram] Form → Structure → Data, with P(form), P(structure | form), P(data | structure). Example: the form is a tree with species at the leaf nodes (mouse, squirrel, chimp, gorilla), the structure groups them under rodent, primate, and animal, and the data are the observations at the bottom level.
7
Learning causal theories
[Diagram] Abstract principles → Causal structure → Event data. Example abstract principles: “Magnets attract metal. Every magnet has a north pole and a south pole. Opposite magnetic poles attract; like magnetic poles repel.” Or: “Behaviors can cause diseases; diseases can cause symptoms.” (Tenenbaum, Griffiths, Kemp, Niyogi, et al.)
8
Goal-directed action (production and comprehension)
(Wolpert, Doya and Kawato, 2003)
9
Returning to language: UG (marked with a “?”) → Grammar → Phrase structure → Utterance → Speech signal, with P(grammar | UG), P(phrase structure | grammar), P(utterance | phrase structure), P(speech | utterance). (c.f. Chater and Manning, 2006)
10
The “Poverty of the Stimulus” argument
The generic form: children acquiring language infer the correct forms of complex syntactic constructions for which they have little or no direct evidence. They avoid simple but incorrect generalizations that would be consistent with their data, preferring much subtler rules that just happen to be correct.
How do they do this? They must have some inductive bias, some abstract knowledge about how language works, leading them to prefer the correct hypotheses even in the absence of direct supporting data. That abstract knowledge is UG.
11
A “Poverty of the Stimulus” argument
E.g., aux-fronting in complex interrogatives.
Data:
- Simple declarative: The girl is happy. / They are eating.
- Simple interrogative: Is the girl happy? / Are they eating?
Hypotheses:
- H1 (linear): move the first auxiliary in the sentence to the beginning.
- H2 (hierarchical): move the auxiliary in the main clause to the beginning.
Generalization:
- Complex declarative: The girl who is sleeping is happy.
- Complex interrogative: Is the girl who is sleeping happy? [via H2], *Is the girl who sleeping is happy? [via H1] (a toy sketch contrasting the two hypotheses follows below).
The inductive constraint: induction of specific grammatical rules must be guided by some abstract constraint that prefers certain hypotheses over others, e.g., that syntactic rules are defined over hierarchical phrase structures rather than the linear order of words.
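A minimal sketch in Python contrasting the two hypotheses (not the talk's model; the auxiliary list and the explicit bracketing of the relative clause are invented for illustration):

```python
# Toy illustration: H1 operates on the flat word string, H2 on the clause structure.
AUX = {"is", "are", "do", "does"}

# "The girl who is sleeping is happy", with the relative clause bracketed.
sentence = ["the", "girl", ["who", "is", "sleeping"], "is", "happy"]

def flatten(s):
    words = []
    for item in s:
        words.extend(item if isinstance(item, list) else [item])
    return words

def h1_linear(s):
    """H1: front the first auxiliary in the flat string of words."""
    words = flatten(s)
    i = next(j for j, w in enumerate(words) if w in AUX)
    return [words[i]] + words[:i] + words[i + 1:]

def h2_hierarchical(s):
    """H2: front the auxiliary of the main clause, skipping the embedded clause."""
    i = next(j for j, w in enumerate(s) if not isinstance(w, list) and w in AUX)
    return [s[i]] + flatten(s[:i] + s[i + 1:])

print(" ".join(h1_linear(sentence)))        # *is the girl who sleeping is happy
print(" ".join(h2_hierarchical(sentence)))  # is the girl who is sleeping happy
```

Only H2, which respects the clause structure, yields the attested complex interrogative.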
12
Hierarchical phrase structure
[Figure: two contrasting structural analyses, labeled “No” and “Yes”.]
13
The same hierarchy, with UG taken to include hierarchical phrase-structure grammars, …: UG → Grammar → Phrase structure → Utterance → Speech signal, with P(grammar | UG), P(phrase structure | grammar), P(utterance | phrase structure), P(speech | utterance). (c.f. Chater and Manning, 2006)
Recent objections: with the right kind of statistics we can remove the need for such inductive constraints (e.g., Pullum, Elman, Reali & Christiansen). That is NOT the question here!
14
Our argument
The question: what form do constraints take, and how do they arise? (When) must they be innately specified as part of the initial state of the language faculty?
The claim: it is possible that, given the data of child-directed speech and certain innate domain-general capacities, an unbiased ideal learner can recognize the hierarchical phrase structure of language; perhaps this inductive constraint need not be innately specified in the language faculty.
Assumed domain-general capacities:
- Can represent grammars of various types: hierarchical, linear, …
- Can evaluate the Bayesian probability of a grammar given a corpus.
15
How? By inferring that a hierarchical phrase-structure grammar offers the best tradeoff between simplicity and fit to natural language data.
Evaluating candidate grammars based on simplicity is an old idea. Chomsky (MMH, 1951): “As a first approximation to the notion of simplicity, we will here consider shortness of grammar as a measure of simplicity, and will use such notations as will permit similar statements to be coalesced…. Given the fixed notation, the criteria of simplicity governing the ordering of statements are as follows: that the shorter grammar is the simpler, and that among equally short grammars, the simplest is that in which the average length of derivation of sentences is least.”
LSLT applies this idea to a multi-level generative system.
16
Ideal learnability analyses
A long history of related work: Gold, Horning, Angluin, Berwick, Muggleton, Chater & Vitanyi, …
Differences from our analysis:
- Prior work is typically based on simplicity metrics that are either arbitrary or not computable. Bayes has several advantages: it gives a rational, objective way to trade off simplicity and data fit; it prescribes ideal inferences from any amount of data, not just in the infinite limit; and it naturally handles ambiguity, noise, and missing data.
- Prior work is mostly theorems; our work is mostly empirical exploration.
- Prior work typically considered highly simplified languages or an idealized corpus: infinite data, with all grammatical sentence types observed eventually and empirical frequencies given by the true grammar. The child's corpus is very different: a small finite sample of sentence types from a very complex language, with a distribution that might depend on many other factors (semantics, pragmatics, performance, ...).
17
The landscape of learnability analyses
Can X be learned from data? Four combinations: an ideal learner with ideal data; an ideal learner with realistic data (our focus here); a realistic learner with ideal data; a realistic learner with realistic data.
18
The Bayesian model: a three-level hierarchy. T: the type of grammar (context-free, regular, flat, 1-state), with an unbiased (uniform) prior over types. G: a specific grammar of that type. D: the data.
19
The Bayesian model again: T (type of grammar: context-free, regular, flat, 1-state) → G (specific grammar) → D (data). The score for a grammar trades off simplicity, the “prior”, against fit to data, the “likelihood”.
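Spelled out (standard Bayes, consistent with the labels on the slide):

$$P(G \mid D, T) \;\propto\; \underbrace{P(D \mid G)}_{\text{likelihood: fit to data}}\;\underbrace{P(G \mid T)}_{\text{prior: simplicity}}$$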
20
Bayesian learning: trading fit vs. simplicity
[Figure: a set of data points D and three candidate grammars G of increasing complexity. Simplicity: best / good / poor. Fit: poor / good / best.]
21
Bayesian learning: trading fit vs. simplicity
[Figure: the same data D with candidate grammars G drawn as regions covering the data: T = 1 region, T = 2 regions, T = 13 regions. Prior: highest / high / low. Likelihood: low / high / highest. C.f. the Subset principle.]
22
Bayesian learning: trading fit vs. simplicity
Balance between fit and simplicity should be sensitive to the amount of data observed… c.f. Subset principle
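A toy numerical illustration of this point (all numbers are invented): the prior penalty is paid once, while the likelihood penalty accrues with every observed sentence, so a complex, tight-fitting grammar overtakes a simple, loose-fitting one only after enough data.

```python
# Toy scores (invented): log prior reflects simplicity, log likelihood per
# sentence reflects average fit. The winner depends on how much data is seen.
log_prior = {"simple_loose": -50.0, "complex_tight": -400.0}
loglik_per_sentence = {"simple_loose": -12.0, "complex_tight": -9.0}

for n_sentences in (10, 100, 1000):
    score = {g: log_prior[g] + n_sentences * loglik_per_sentence[g] for g in log_prior}
    print(n_sentences, max(score, key=score.get), score)
# With little data the simpler grammar wins; with enough data the tighter fit dominates.
```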
23
The prior Measuring simplicity of a grammar
A probabilistic grammar for grammars (c.f. Horning): grammars with more rules and more non-terminals will have lower prior probability. The prior is defined in terms of: n = number of nonterminals; P_k = number of productions expanding nonterminal k; θ_k = probabilities for the expansions of nonterminal k; N_i = number of symbols in production i; V = vocabulary size.
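A minimal sketch of how such a simplicity prior could be scored, assuming geometric distributions over the number of nonterminals, productions per nonterminal, and production lengths, and a uniform choice of each right-hand-side symbol; the term for the production probabilities θ_k is omitted, and the distributions actually used in this work may differ.

```python
import math

def log_geometric(k, p=0.5):
    """log P(k) for k = 1, 2, ... under a geometric distribution (an assumed choice)."""
    return (k - 1) * math.log(1 - p) + math.log(p)

def log_prior(grammar, vocab_size):
    """Score a grammar given as {nonterminal: [list of right-hand sides]}.
    More nonterminals and more/longer productions mean lower prior probability."""
    n = len(grammar)                        # n = number of nonterminals
    n_symbols = n + vocab_size              # choices at each RHS position (n + V)
    lp = log_geometric(n)
    for productions in grammar.values():
        lp += log_geometric(len(productions))        # P_k productions for nonterminal k
        for rhs in productions:
            lp += log_geometric(len(rhs))            # N_i symbols in production i
            lp += -len(rhs) * math.log(n_symbols)    # each symbol chosen uniformly
    return lp

# Toy comparison: a small CFG vs. a "flat" grammar that just lists sentence types.
cfg  = {"S": [["NP", "VP"]], "NP": [["det", "n"]], "VP": [["v", "NP"], ["v"]]}
flat = {"S": [["det", "n", "v"], ["det", "n", "v", "det", "n"],
              ["det", "n", "v", "prep", "det", "n"]]}
print(log_prior(cfg, vocab_size=4), log_prior(flat, vocab_size=4))  # flat scores lower
```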
24
The likelihood Measuring fit of a grammar
Probability of the corpus being generated from the grammar:
- Grammars that assign long derivations to sentences will be less probable. Example: for the sentence type “pro aux det n”, the probability of the parse is the product of the probabilities of the rules used in its derivation, e.g. 0.5 * 0.25 * 1.0 * 0.25 * 0.5 ≈ 0.016.
- Grammars that generate sentences not observed in the corpus will be less probable, because they “waste” probability mass.
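A small sketch of this computation, reusing the slide's example numbers (the helper names are invented):

```python
import math

def derivation_prob(rule_probs):
    """Probability of one derivation = product of the probabilities of the rules used."""
    return math.exp(sum(math.log(p) for p in rule_probs))

# The slide's example for the sentence type "pro aux det n":
print(derivation_prob([0.5, 0.25, 1.0, 0.25, 0.5]))  # 0.015625, i.e. ~0.016

def corpus_log_likelihood(derivations):
    """Log likelihood of a corpus, assuming (for simplicity) one derivation per
    sentence type; a real grammar may require summing over alternative parses."""
    return sum(sum(math.log(p) for p in d) for d in derivations)
```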
25
Different grammar types
Linear grammar types:
- “Flat” grammar: a list of each sentence (every rule rewrites directly to a full string of terminals).
- Regular grammar: rules of the form NT → t NT or NT → t.
- 1-state grammar: anything is accepted.
Hierarchical grammar type:
- Context-free grammar: rules whose right-hand sides can mix nonterminals and terminals, e.g. NT → NT NT, NT → NT t, NT → t NT.
26
Hierarchical grammars
Two grammars, ranging from simpler with a looser fit to the data to more complex with a tighter fit:
- CFG-S (simpler, looser fit): designed to be as linguistically plausible (and as compact) as possible; 69 rules, 14 non-terminals.
- CFG-L (more complex, tighter fit): derived from CFG-S, with additional productions that put less probability mass on recursive productions (and hence overgenerate less); 120 rules, 14 non-terminals.
27
Linear grammars, ordered from simplest (poorest fit) to most complex (exact fit):
- 1-STATE: any sentence accepted (unigram probability model); 25 rules, 0 non-terminals.
- REG-B: broadest regular grammar derived from the CFG; 117 rules, 10 non-terminals.
- REG-M: mid-level regular grammar derived from the CFG; 169 rules, 13 non-terminals.
- REG-N: narrowest regular grammar derived from the CFG; 389 rules, 85 non-terminals.
- FLAT: a list of each sentence; 2336 rules, 0 non-terminals.
Plus local-search refinements and automatic grammar construction based on machine learning methods (Goldwater & Griffiths, 2007).
28
Data: child-directed speech (CHILDES database, Adam from the Brown corpus, age range 2;3 to 5;2). Each word is replaced by its syntactic category, e.g.:
- det adj n v prop prep adj n (the baby bear discovers Goldilocks in his bed)
- wh aux det n part (what is the monkey eating?)
- pro aux det n comp v det n (this is the man who wrote the book)
- aux pro det adj n (is that a green car?)
- n comp aux adj aux vi (eagles that are alive do fly)
The final data comprise 2336 individual sentence types (each corresponding to one or more sentence tokens).
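A toy sketch of this preprocessing step (the word-to-tag dictionary here is invented; the actual corpus used hand-assigned categories):

```python
# Map each word to its syntactic category and collapse the corpus to the set of
# unique sentence *types*.
tags = {"is": "aux", "that": "pro", "a": "det", "green": "adj", "car": "n",
        "what": "wh", "the": "det", "monkey": "n", "eating": "part"}

def to_type(sentence):
    return tuple(tags[w] for w in sentence.lower().rstrip("?.!").split())

corpus = ["is that a green car?", "what is the monkey eating?", "is that a green car?"]
sentence_types = sorted(set(to_type(s) for s in corpus))
print(len(sentence_types), sentence_types)   # 2 unique types from 3 tokens
```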
29
Results: Full corpus. (Note: scores are -log probabilities, so lower = better!) [Bar charts of the prior (ordered by simplicity), the likelihood (ordered by tightness of fit), and the posterior for CFG-S, CFG-L, REG-B, REG-M, and REG-N.]
30
Results: Generalization
How well does each grammar predict unseen sentence forms? E.g., complex aux-fronted interrogatives. The original table records, for each grammar (FLAT, REG-N, REG-M, REG-B, 1-STATE, CFG-S, CFG-L), whether it generates each sentence type, and whether that type appears in the corpus:
- Simple declarative: Eagles do fly (n aux vi)
- Simple interrogative: Do eagles fly? (aux n vi)
- Complex declarative: Eagles that are alive do fly (n comp aux adj aux vi)
- Complex interrogative: Do eagles that are alive fly? (aux n comp aux adj vi)
- Complex interrogative, ungrammatical variant: *Are eagles that alive do fly? (aux n comp adj aux vi)
31
Results: First file (90 mins)
(Note: scores are -log probabilities, so lower = better!) [Bar charts of the prior, likelihood, and posterior for CFG-S, CFG-L, REG-B, REG-M, and REG-N, computed from the first file only.]
32
Conclusions: poverty of the stimulus
An alternative to the standard version of linguistic nativism: hierarchical phrase structure is a crucial constraint on syntactic acquisition, but perhaps need not be specified innately in the language faculty. It can be inferred from child-directed speech by an ideal learner equipped with innate domain-general capacities to represent grammars of various types and to perform Bayesian inference.
Many open questions:
- How close have we come to finding the best grammars of each type?
- How would these results extend to richer representations of syntax, or to jointly learning multiple levels of linguistic structure?
- What are the prospects for computationally or psychologically realistic learning algorithms that might approximate this ideal learner?
33
Learnability of inductive constraints
Abstract domain knowledge may be acquired “top-down”, by Bayesian inference at the top layer of a hierarchical model, and may be inferred from much less data than concrete details. This is very different from the “bottom-up” empiricist view of abstraction. To what extent can other core inductive constraints observed in cognitive development be acquired?
- Word learning: whole-object bias (Markman), principle of contrast (Clark), shape bias (Smith)
- Causal reasoning: causal schemata (Kelley)
- Folk physics: objects are unified, persistent (Spelke)
- Number: counting principles (Gelman)
- Folk biology: taxonomic principle (Atran)
- Folk psychology: principle of rationality (Gergely)
- Predicability: M-constraint (Keil)
- Syntax: various aspects of UG (Chomsky)
- Phonology: faithfulness and markedness constraints (Prince, Smolensky)
34
What of the workshop hypothesis?
“Statistical approaches should work within the framework of classical linguistics rather than supplant it.”
35
What of the workshop hypothesis?
OK, with a minor change: “Classical linguistics should work within the frameworks of statistical approaches rather than ignore them.”
36
What of the workshop hypothesis?
“Classical linguistics and statistical approaches have much to gain by working together.” Classical linguistics suggests hypothesis spaces; statistical inference can evaluate these hypotheses in rigorous and powerful ways, using large-scale realistic corpora and whole grammars rather than just fragments of languages and grammars.
Bayesian approaches may be particularly worth investigating:
- A rational, objective metric trading off simplicity and fit: less arbitrary than traditional metrics, and better suited to learning from real data sources.
- Fits naturally with a multi-level generative system (e.g., LSLT).
- Recent progress in techniques for large-scale inference.
Potential long-term benefits:
- A method for studying the contents of UG: formalizing stimulus poverty arguments, and assessing which linguistic universals or constraints are best attributed to UG.
- A tool to show that linguistic theory “works” in an engineering sense.
- May help integration with other areas of cognitive science and AI.