
1 Bayesian models of inductive generalization in language acquisition Josh Tenenbaum MIT Joint work with Fei Xu, Amy Perfors, Terry Regier, Charles Kemp

2 The problem of generalization. How can people learn so much from such limited evidence?
– Kinds of objects and their properties
– Meanings and forms of words, phrases, and sentences
– Causal relations
– Intuitive theories of physics, psychology, …
– Social structures, conventions, and rules
The goal: a general-purpose computational framework for understanding how people make these inductive leaps, and how they can be successful.

3 The problem of generalization. How can people learn so much from such limited evidence? Example: learning word meanings (“horse”) from a few examples.

4 The problem of generalization. How can people learn so much from such limited evidence? ‘Poverty of the stimulus’ in syntactic acquisition: aux-fronting in interrogatives (Chomsky; Crain & Nakayama).
Data:
– Simple declaratives: The girl is happy. / They are eating.
– Simple interrogatives: Is the girl happy? / Are they eating?
Hypotheses:
– H1. Linear: move the first auxiliary in the sentence to the beginning.
– H2. Hierarchical: move the auxiliary in the main clause to the beginning.
Generalization:
– Complex declarative: The girl who is sleeping is happy.
– Complex interrogative: Is the girl who is sleeping happy? [via H2]
– *Is the girl who sleeping is happy? [via H1]
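
To make the contrast concrete, here is a minimal toy sketch in Python (a hypothetical illustration, not part of the original work): the linear rule H1 needs only the word string, whereas the hierarchical rule H2 needs a parse identifying the main-clause auxiliary, which is hand-supplied below.

    # Toy illustration of the two aux-fronting hypotheses (hypothetical example code).

    def linear_aux_fronting(words, auxiliaries=("is", "are")):
        """H1: move the first auxiliary in the string to the front."""
        for i, w in enumerate(words):
            if w in auxiliaries:
                return [w.capitalize()] + words[:i] + words[i + 1:]
        return words

    def hierarchical_aux_fronting(subject_phrase, main_aux, rest):
        """H2: move the main-clause auxiliary to the front.
        The parse (subject phrase vs. main-clause auxiliary) is hand-supplied here."""
        return [main_aux.capitalize()] + subject_phrase + rest

    declarative = "the girl who is sleeping is happy".split()

    print(" ".join(linear_aux_fronting(declarative)) + "?")
    # -> "Is the girl who sleeping is happy?"  (ungrammatical; via H1)

    print(" ".join(hierarchical_aux_fronting(
        subject_phrase="the girl who is sleeping".split(),
        main_aux="is",
        rest=["happy"])) + "?")
    # -> "Is the girl who is sleeping happy?"  (grammatical; via H2)

Both rules agree on the simple interrogatives in the data; only the complex interrogative distinguishes them.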

5 The problem of generalization. How can people learn so much from such limited evidence? The answer: human learners have abstract knowledge that provides inductive constraints – restrictions or biases on the hypotheses to be considered.
– Word learning: whole-object principle, taxonomic principle, basic-level bias, shape bias, mutual exclusivity, …
– Syntax: syntactic rules are defined over hierarchical phrase structures rather than the linear order of words.
Poverty of the stimulus as a scientific tool…

6 The big questions
1. How does abstract knowledge guide generalization from sparsely observed data?
2. What form does abstract knowledge take, across different domains and tasks?
3. What are the origins of abstract knowledge?

7 The approach
1. How does abstract knowledge guide generalization from sparsely observed data? Priors for Bayesian inference.
2. What form does abstract knowledge take, across different domains and tasks? Probabilities defined over structured representations: graphs, grammars, predicate logic, schemas.
3. What are the origins of abstract knowledge? Hierarchical probabilistic models, with inference at multiple levels of abstraction and multiple timescales.

8 Three case studies of generalization
– Learning words for object categories
– Learning abstract word-learning principles (“learning to learn words”): taxonomic principle, shape bias
– Learning in syntax: unobserved syntactic forms, abstract syntactic knowledge

9 Word learning as Bayesian inference (Xu & Tenenbaum, Psych Review 2007). A Bayesian model can explain several core aspects of generalization in word learning…
– learning from very few examples
– learning from only positive examples
– simultaneous learning of overlapping extensions
– graded degrees of confidence
– dependence on pragmatic and social context
… arguably better than previous computational accounts based on hypothesis elimination (e.g., Siskind) or associative learning (e.g., Regier).

10 Basics of Bayesian inference. Bayes’ rule: P(h | d) = P(d | h) P(h) / Σ_h′ P(d | h′) P(h′).
An example:
– Data: John is coughing
– Some hypotheses: 1. John has a cold; 2. John has lung cancer; 3. John has a stomach flu
– Likelihood P(d | h) favors 1 and 2 over 3
– Prior probability P(h) favors 1 and 3 over 2
– Posterior probability P(h | d) favors 1 over 2 and 3
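
To make the arithmetic concrete, here is a minimal Python sketch; the priors and likelihoods below are made-up illustrative numbers, not values from the talk.

    # Hypothetical numbers illustrating Bayes' rule for the coughing example.
    priors = {"cold": 0.50, "lung cancer": 0.01, "stomach flu": 0.49}      # P(h), assumed
    likelihoods = {"cold": 0.8, "lung cancer": 0.9, "stomach flu": 0.05}   # P(d|h), assumed

    unnormalized = {h: priors[h] * likelihoods[h] for h in priors}
    evidence = sum(unnormalized.values())                                  # P(d)
    posterior = {h: unnormalized[h] / evidence for h in priors}

    for h, p in sorted(posterior.items(), key=lambda kv: -kv[1]):
        print(f"P({h} | coughing) = {p:.3f}")
    # "cold" wins: it is favored by both the likelihood and the prior.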

11 Bayesian generalization: given a few examples labeled “horse”, to which other objects does the word extend?

12 Bayesian generalization. Hypothesis space H of possible word meanings (extensions): e.g., rectangular regions in a feature space.

13 Assume examples are sampled randomly from the word’s extension.

14–15 (Figure slides: candidate rectangular hypotheses overlaid on the observed examples.)

16–17 “Size principle”: smaller hypotheses receive greater likelihood, and exponentially more so as n increases.

18 The size principle, c.f. the Subset principle.
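
A minimal Python sketch of the size principle for toy rectangular hypotheses (the specific rectangles and example points are assumptions for illustration):

    # Size principle (toy sketch): the likelihood of n examples under a rectangular
    # hypothesis is (1 / area)^n if the rectangle contains all of them, 0 otherwise.

    def size_likelihood(examples, rect):
        """rect = (x_min, x_max, y_min, y_max); examples = list of (x, y) points."""
        x_min, x_max, y_min, y_max = rect
        area = (x_max - x_min) * (y_max - y_min)
        inside = all(x_min <= x <= x_max and y_min <= y <= y_max for x, y in examples)
        return (1.0 / area) ** len(examples) if inside else 0.0

    small = (0.0, 1.0, 0.0, 1.0)   # area 1  (e.g., a narrow candidate extension)
    large = (0.0, 2.0, 0.0, 2.0)   # area 4  (e.g., a broader candidate extension)

    for n in (1, 3, 10):
        examples = [(0.5, 0.5)] * n            # n examples consistent with both hypotheses
        ratio = size_likelihood(examples, small) / size_likelihood(examples, large)
        print(f"n = {n:2d}: the smaller hypothesis is {ratio:.0f}x more likely")
    # The smaller hypothesis is favored by a factor of 4^n: exponentially more so as n grows.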

19 Generalization gradients: Bayes (hypothesis averaging) vs. maximum likelihood or “subset principle”. Hypothesis averaging: p(y ∈ C | X) = Σ_{h : y ∈ h} p(h | X).
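
Generalization to a new item then averages over hypotheses weighted by their posterior; below is a minimal self-contained Python sketch (the nested hypotheses, uniform prior, and test points are assumptions for illustration).

    # Hypothesis averaging (toy sketch): p(y in C | X) = sum over hypotheses h that
    # contain y of the posterior p(h | X), with a uniform prior and the size-principle
    # likelihood (1 / area)^n for hypotheses containing all examples.

    def contains(rect, point):
        x_min, x_max, y_min, y_max = rect
        return x_min <= point[0] <= x_max and y_min <= point[1] <= y_max

    def size_likelihood(examples, rect):
        x_min, x_max, y_min, y_max = rect
        area = (x_max - x_min) * (y_max - y_min)
        inside = all(contains(rect, p) for p in examples)
        return (1.0 / area) ** len(examples) if inside else 0.0

    def p_generalize(y, examples, hypotheses):
        weights = [size_likelihood(examples, h) for h in hypotheses]   # uniform prior
        total = sum(weights)
        return sum(w for h, w in zip(hypotheses, weights) if contains(h, y)) / total

    hypotheses = [(0, 1, 0, 1), (0, 2, 0, 2), (0, 4, 0, 4)]   # nested candidate extensions
    examples = [(0.5, 0.5)]                                   # one observed example
    for y in [(0.8, 0.8), (1.5, 1.5), (3.0, 3.0)]:
        print(y, round(p_generalize(y, examples, hypotheses), 3))
    # Probabilities fall off gradually with distance (a graded gradient), whereas keeping
    # only the maximum-likelihood ("subset principle") hypothesis would be all-or-none.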

20 Word learning as Bayesian inference (Xu & Tenenbaum, Psych Review 2007). Test items span subordinate, basic-level, and superordinate matches.

21 Word learning as Bayesian inference (Xu & Tenenbaum, Psych Review 2007)
– Prior p(h): the choice of hypothesis space embodies traditional constraints: whole-object principle, shape bias, taxonomic principle… A more fine-grained prior favors more distinctive clusters.
– Likelihood p(X | h): random sampling assumption. Size principle: smaller hypotheses receive greater likelihood, and exponentially more so as n increases.


24 Generalization experiments: the Bayesian model’s predictions compared with children’s generalizations. Not easily explained by hypothesis elimination or associative models.

25 Further questions
– Bayesian learning for other kinds of words? Verbs (Niyogi; Alishahi & Stevenson; Perfors, Wonnacott, Tenenbaum); adjectives (Dowman; Schmidt, Goodman, Barner, Tenenbaum).
– How fundamental and general is learning by “suspicious coincidence” (the size principle)? Other domains of inductive generalization in adults and children (Tenenbaum et al.; Xu et al.); generalization in < 1-year-old infants (Gerken; Xu et al.).
– Bayesian word learning in more natural communicative contexts? Cross-situational mapping with real-world scenes and utterances (Frank, Goodman & Tenenbaum; c.f. Yu).

26 Further questions Where do the hypothesis space and priors come from? How does word learning interact with conceptual development?

27 A hierarchical Bayesian view, with knowledge at three levels: principles T (whole-object principle, shape bias, taxonomic principle, …), structure S (a taxonomy of categories such as “Basset hound”, “dog”, “animal”, “cat”, “tree”, “daisy”, “thing”), and data D (labeled examples of novel words such as “fep”, “ziv”, “gip”).

28 A hierarchical Bayesian view (same figure as the previous slide).

29 Different forms of structure: dominance order, line, ring, flat hierarchy, taxonomy, grid, cylinder.

30 Discovery of structural form (Kemp and Tenenbaum). F: form, S: structure, D: data (objects X1–X7 with observed features). Candidate structures include a tree-structured taxonomy, disjoint clusters, and a linear order over the objects. The model scores P(F) (prior over forms), P(S | F) (simplicity), and P(D | S) (fit to data).

31 Bayesian model selection: trading fit vs. simplicity. For structures carving the data D into F = 1, 2, or 13 regions: the likelihood is low, high, and highest, respectively, while the prior is highest, high, and low.

32 Bayesian model selection: trading fit vs. simplicity. The balance between fit and simplicity should be sensitive to the amount of data observed…
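
A minimal Python sketch of this tradeoff under assumed toy settings (one broad region vs. two tight regions, a complexity cost per region, and the size-principle likelihood); the numbers are illustrative, not from the talk.

    import math

    # Toy Bayesian model selection: a "simple" one-region extension vs. a "complex"
    # two-region extension.  The prior penalizes extra regions; the likelihood rewards
    # tight fit, and does so exponentially in the number of examples n.

    def log_prior(num_regions, cost_per_region=5.0):       # assumed complexity cost
        return -cost_per_region * num_regions

    def log_likelihood(n, total_size):                      # all n examples fall inside
        return n * math.log(1.0 / total_size)

    simple = {"regions": 1, "size": 1.0}                    # covers the whole space
    tight = {"regions": 2, "size": 0.4}                     # two small regions around the data

    for n in (1, 3, 6, 10):
        s = log_prior(simple["regions"]) + log_likelihood(n, simple["size"])
        t = log_prior(tight["regions"]) + log_likelihood(n, tight["size"])
        print(f"n = {n:2d}: simple = {s:7.2f}, tight = {t:7.2f}  ->  "
              f"{'tight' if t > s else 'simple'} wins")
    # With only a few examples the simple model wins; as more examples keep landing inside
    # the tight regions, the fit term overwhelms the complexity penalty and the tighter
    # model wins.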


34 Development of structural forms as more object features are observed

35 A hierarchical Bayesian view (same figure as slide 27).

36 The shape bias in word learning (Landau, Smith & Jones 1988): “This is a dax. Show me the dax…” A useful inductive constraint: many early words are labels for object categories, and shape may be the best cue to object category membership. English-speaking children typically show the shape bias at 24 months, but not at 20 months.

37 Is the shape bias learned? Smith et al. (2002) trained 17-month-olds on labels for 4 artificial categories (“wib”, “lug”, “zup”, “div”). After 8 weeks of training (20 min/week), 19-month-olds show the shape bias: “This is a dax. Show me the dax…”

38 Transfer to real-world vocabulary The puzzle: The shape bias is a powerful inductive constraint, yet can be learned from very little data.

39 Learning abstract knowledge about feature variability (training categories “wib”, “lug”, “zup”, “div”). The intuition:
– Shape varies across categories but is relatively constant within nameable categories.
– Other features (size, color, texture) vary both within and across nameable object categories.

40 Learning a Bayesian prior. Hypothesis space H of possible word meanings (extensions): e.g., rectangular regions in a shape × color feature space, with p(h) ~ uniform. Example word: “horse”.

41 Learning a Bayesian prior. Same hypothesis space, still with p(h) ~ uniform; now examples of several categories are observed: “horse”, “cat”, “cup”, “ball”, “chair”.

42–43 Learning a Bayesian prior. After observing “horse”, “cat”, “cup”, “ball”, and “chair”, the prior is no longer uniform: p(h) is high for “long & narrow” hypotheses (tight along the shape dimension, extended along color) and low for all others.

44 Hierarchical Bayesian model. Level 2: nameable object categories in general (the knowledge to be learned). Level 1: specific categories (“cat”, “cup”, “ball”, “chair”), each characterized by its variability along shape and color. Data: observed exemplars. The abstract regularity: nameable object categories tend to be homogeneous in shape, but heterogeneous in color, material, …

45 Hierarchical Bayesian model (same figure as the previous slide).

46 Hierarchical Bayesian model, generative form. Level 2: nameable object categories in general. Level 1: specific categories (“cat”, “cup”, “ball”, “chair”), each with its own shape and color distributions. Data: exemplar feature values {y_shape, y_color}. The model: p(α_i) ~ Exponential(λ); p(θ_i | α_i) ~ Dirichlet(α_i); p(y_i | θ_i) ~ Multinomial(θ_i), where α_i is the within-category variability for feature i ∈ {shape, color}, ranging from low to high.
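
A minimal forward-sampling sketch of this generative model in Python (hypothetical code: the number of feature values, exemplar counts, and the α settings are assumptions for illustration; in the full model each α_i is drawn from Exponential(λ) and inferred from the data).

    import numpy as np

    rng = np.random.default_rng(0)
    N_VALUES = 10                           # discrete values per feature (assumed)
    alpha_shape, alpha_color = 0.1, 5.0     # within-category variability (assumed settings)

    def sample_category(n_exemplars=5):
        # p(theta_i | alpha_i) ~ Dirichlet(alpha_i): small alpha -> concentrated theta
        theta_shape = rng.dirichlet(np.full(N_VALUES, alpha_shape))
        theta_color = rng.dirichlet(np.full(N_VALUES, alpha_color))
        # p(y_i | theta_i) ~ Multinomial(theta_i): draw exemplar feature values
        shapes = rng.choice(N_VALUES, size=n_exemplars, p=theta_shape)
        colors = rng.choice(N_VALUES, size=n_exemplars, p=theta_color)
        return shapes, colors

    for name in ("cat", "cup", "ball", "chair"):
        shapes, colors = sample_category()
        print(f"{name}: shapes = {shapes}, colors = {colors}")
    # With a small alpha_shape and a large alpha_color, each category's exemplars share
    # (mostly) one shape value but scatter across color values: homogeneous in shape,
    # heterogeneous in color, which is the level-2 regularity the learner should extract.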

47 Learning the shape bias. Training: labeled examples of “wib”, “lug”, “zup”, “div”.

48 Second-order generalization test. After training, test with a novel category: “This is a dax. Show me the dax…”

49 Three case studies of generalization
– Learning words for object categories
– Learning abstract word-learning principles (“learning to learn words”): taxonomic principle, shape bias
– Learning in syntax: unobserved syntactic forms, abstract syntactic knowledge

50 “Poverty of the Stimulus” argument, e.g., aux-fronting in complex interrogatives.
Data:
– Simple declarative: The girl is happy. / They are eating.
– Simple interrogative: Is the girl happy? / Are they eating?
Hypotheses:
– H1. Linear: move the first auxiliary in the sentence to the beginning.
– H2. Hierarchical: move the auxiliary in the main clause to the beginning.
Generalization:
– Complex declarative: The girl who is sleeping is happy.
– Complex interrogative: Is the girl who is sleeping happy? [via H2]
– *Is the girl who sleeping is happy? [via H1]
=> Inductive constraint: induction of specific grammatical rules must be guided by some abstract constraints to prefer certain hypotheses over others, e.g., syntactic rules are defined over hierarchical phrase structures rather than the linear order of words.

51 Must this inductive constraint (hierarchical phrase structure: yes vs. no) be innately specified as part of the initial state of the language faculty? Could it be inferred using more domain-general capacities?

52 Hierarchical Bayesian model. Grammar type T: linear (regular, …) vs. hierarchical (context-free). Specific grammar G. Data D (CHILDES: ~21,500 sentences, ~2,300 sentence types).

53–55 Hierarchical Bayesian model (figure builds). The prior over grammar types T is unbiased (uniform); P(G | T) favors simplicity; P(D | G) measures fit to the data D (CHILDES: ~21,500 sentences, ~2,300 sentence types).

56 Results: full corpus. Bar plots of the prior (simpler = lower), likelihood (tighter fit = lower), and posterior scores for each grammar (CFG-S, CFG-L, REG-B, REG-M, REG-N). Note: these are -log probabilities, so lower = better!
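
To show how such scores combine (the numbers below are made up purely for illustration, not the actual corpus scores): in -log space the unnormalized posterior score of a grammar is just the sum of its prior term and its likelihood term, and the normalizing constant is shared by all grammars, so it does not affect the ranking.

    # Combining -log probability scores (hypothetical numbers; lower = better).
    scores = {
        # grammar: (-log prior, -log likelihood)   <- made-up values for illustration
        "grammar A": (150.0, 900.0),
        "grammar B": (90.0, 1400.0),
        "grammar C": (120.0, 1050.0),
    }

    posterior = {g: p + l for g, (p, l) in scores.items()}
    for g in sorted(posterior, key=posterior.get):
        print(f"{g}: -log posterior = {posterior[g]:.1f}")
    # A grammar can win overall despite a worse (larger) prior term if its fit term is
    # sufficiently smaller, and the balance shifts as the likelihood is computed over
    # more data (compare the full-corpus results with the first-file results below).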

57 Generalization results: how well does each grammar predict unseen sentence forms, e.g., complex aux-fronted interrogatives? Grammars compared: flat (FLAT), regular (RG-N, RG-M, RG-B), one-state (1-ST), and context-free (CFG-S, CFG-L); the original table also marks whether each sentence type appears in the corpus and which grammars generate it. Sentence types:
– Simple declarative: Eagles do fly. (n aux vi)
– Simple interrogative: Do eagles fly? (aux n vi)
– Complex declarative: Eagles that are alive do fly. (n comp aux adj aux vi)
– Complex interrogative: Do eagles that are alive fly? (aux n comp aux adj vi)
– Complex interrogative (linear-rule form): Are eagles that alive do fly? (aux n comp adj aux vi)

58 Results: first file (90 minutes of child-directed speech). Prior (simpler = lower), likelihood (tighter fit = lower), and posterior scores for each grammar (CFG-S, CFG-L, REG-B, REG-M, REG-N). Note: these are -log probabilities, so lower = better!

59 Conclusions
Bayesian inference over hierarchies of structured representations provides a way to study core questions of human cognition, in language and other domains.
– What is the content and form of abstract knowledge?
– How can abstract knowledge guide generalization from sparse data?
– How can abstract knowledge itself be acquired? What is built in?
Going beyond traditional dichotomies.
– How can structured knowledge be acquired by statistical learning?
– How can domain-general learning mechanisms acquire domain-specific inductive constraints?
A different way to think about cognitive development.
– Powerful abstractions (taxonomic structure, shape bias, hierarchical organization of syntax) can be inferred “top down”, from surprisingly little data, together with learning more concrete knowledge.
– Very different from the traditional empiricist or nativist views of abstraction. Worth pursuing more generally…

60 Abstract knowledge in cognitive development
– Word learning: whole-object bias, taxonomic principle (Markman); shape bias (Smith)
– Causal reasoning: causal schemata (Kelley)
– Folk physics: objects are unified, persistent (Spelke)
– Number: counting principles (Gelman)
– Folk biology: principles of taxonomic rank (Atran)
– Folk psychology: principle of rationality (Gergely)
– Ontology: M-constraint on predicability (Keil)
– Syntax: UG (Chomsky)
– Phonology: faithfulness and markedness constraints (Prince, Smolensky)

