Bayesian models of inductive generalization in language acquisition Josh Tenenbaum MIT Joint work with Fei Xu, Amy Perfors, Terry Regier, Charles Kemp

The problem of generalization How can people learn so much from such limited evidence? –Kinds of objects and their properties –Meanings and forms of words, phrases, and sentences –Causal relations –Intuitive theories of physics, psychology, … –Social structures, conventions, and rules The goal: A general-purpose computational framework for understanding how people make these inductive leaps, and how they can be successful.

The problem of generalization How can people learn so much from such limited evidence? –Learning word meanings from examples “horse”

The problem of generalization How can people learn so much from such limited evidence? –‘Poverty of the stimulus’ in syntactic acquisition Simple declaratives: The girl is happy. They are eating. Simple interrogatives: Is the girl happy? Are they eating? H1. Linear: move the first auxiliary in the sentence to the beginning. H2. Hierarchical: move the auxiliary in the main clause to the beginning. Data → Hypotheses → Generalization. Complex declarative: The girl who is sleeping is happy. Complex interrogative: Is the girl who is sleeping happy? [via H2] *Is the girl who sleeping is happy? [via H1] Aux-fronting in interrogatives (Chomsky, Crain & Nakayama)

How can people learn so much from such limited evidence? The answer: human learners have abstract knowledge that provides inductive constraints – restrictions or biases on the hypotheses to be considered. Word learning: whole-object principle, taxonomic principle, basic-level bias, shape bias, mutual exclusivity, … Syntax: syntactic rules are defined over hierarchical phrase structures rather than linear order of words. The problem of generalization Poverty of the stimulus as a scientific tool…

1. How does abstract knowledge guide generalization from sparsely observed data? 2. What form does abstract knowledge take, across different domains and tasks? 3. What are the origins of abstract knowledge? The big questions

1. How does abstract knowledge guide generalization from sparsely observed data? Priors for Bayesian inference. 2. What form does abstract knowledge take, across different domains and tasks? Probabilities defined over structured representations: graphs, grammars, predicate logic, schemas. 3. What are the origins of abstract knowledge? Hierarchical probabilistic models, with inference at multiple levels of abstraction and multiple timescales. The approach

Three case studies of generalization Learning words for object categories Learning abstract word-learning principles (“learning to learn words”) –Taxonomic principle –Shape bias Learning in syntax: unobserved syntactic forms, abstract syntactic knowledge

Word learning as Bayesian inference (Xu & Tenenbaum, Psych Review 2007) A Bayesian model can explain several core aspects of generalization in word learning… –learning from very few examples –learning from only positive examples –simultaneous learning of overlapping extensions –graded degrees of confidence –dependence on pragmatic and social context … arguably, better than previous computational accounts based on hypothesis elimination (e.g., Siskind) or associative learning (e.g., Regier).

Basics of Bayesian inference Bayes’ rule: An example –Data: John is coughing –Some hypotheses: 1. John has a cold 2. John has lung cancer 3. John has a stomach flu –Likelihood P(d|h) favors 1 and 2 over 3 –Prior probability P(h) favors 1 and 3 over 2 –Posterior probability P(h|d) favors 1 over 2 and 3
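
For reference, Bayes’ rule in the notation used throughout the talk:

  P(h | d) = P(d | h) P(h) / Σ_h' P(d | h') P(h')

where the sum runs over all candidate hypotheses. In the example, hypothesis 1 (cold) scores well on both the likelihood and the prior, so it receives the highest posterior.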

X Bayesian generalization “horse” ? ? ? ? ? ?

X Hypothesis space H of possible word meanings (extensions): e.g., rectangular regions Bayesian generalization

X Hypothesis space H of possible word meanings (extensions): e.g., rectangular regions Assume examples are sampled randomly from the word’s extension.

X Hypothesis space H of possible word meanings (extensions): e.g., rectangular regions “Size principle”: Smaller hypotheses receive greater likelihood, and exponentially more so as n increases. (cf. the Subset principle)
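
Spelling out the size principle under the random-sampling assumption: for n examples X = {x_1, …, x_n} drawn from the word’s extension,

  p(X | h) = [1 / size(h)]^n if h contains every example, and 0 otherwise,

so a consistent hypothesis twice as large is 2^n times less likely, which is what makes a few tightly clustered examples a “suspicious coincidence” under a broad candidate meaning.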

Generalization gradients: Bayes (hypothesis averaging) vs. maximum likelihood or “subset principle”.
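
The contrast on this slide can be reproduced with a small numerical sketch (an illustrative 1-D version with interval hypotheses and a uniform prior; the intervals, ranges, and example values are made up for illustration, not taken from the experiments):

# Hypotheses: integer intervals [a, b] on a 1-D feature axis (a stand-in for the
# rectangular regions on the earlier slides).
hypotheses = [(a, b) for a in range(0, 21) for b in range(a, 21)]
examples = [9, 10, 11]   # n observed examples of the word
n = len(examples)

def size(h):
    a, b = h
    return b - a + 1

def likelihood(h):
    # Size principle: p(X | h) = (1/|h|)^n if h covers all examples, else 0.
    a, b = h
    return (1.0 / size(h)) ** n if all(a <= x <= b for x in examples) else 0.0

prior = {h: 1.0 for h in hypotheses}   # uniform prior, for simplicity
posterior = {h: prior[h] * likelihood(h) for h in hypotheses}
Z = sum(posterior.values())
posterior = {h: p / Z for h, p in posterior.items()}

def p_generalize(y):
    # Hypothesis averaging: p(y in C | X) = sum of p(h | X) over hypotheses containing y.
    return sum(p for (a, b), p in posterior.items() if a <= y <= b)

# Compare the graded Bayesian gradient with the all-or-none learner that commits to
# the single smallest hypothesis consistent with the examples.
smallest = min((h for h in hypotheses if likelihood(h) > 0), key=size)
for y in range(5, 16):
    bayes = p_generalize(y)
    subset = 1.0 if smallest[0] <= y <= smallest[1] else 0.0
    print(y, round(bayes, 3), subset)

Hypothesis averaging produces the graded Bayesian gradient; committing to the smallest consistent hypothesis produces the all-or-none pattern of the maximum-likelihood / subset-principle learner.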

Word learning as Bayesian inference (Xu & Tenenbaum, Psych Review 2007): subordinate, basic-level, and superordinate categories.

Prior p(h): Choice of hypothesis space embodies traditional constraints: whole object principle, shape bias, taxonomic principle… –More fine-grained prior favors more distinctive clusters. Likelihood p(X | h): Random sampling assumption. –Size principle: Smaller hypotheses receive greater likelihood, and exponentially more so as n increases. Word learning as Bayesian inference (Xu & Tenenbaum, Psych Review 2007)

Generalization experiments: Bayesian model vs. children’s generalizations. Not easily explained by hypothesis elimination or associative models.

Further questions Bayesian learning for other kinds of words? –Verbs (Niyogi; Alishahi & Stevenson; Perfors, Wonnacott, Tenenbaum) –Adjectives (Dowman; Schmidt, Goodman, Barner, Tenenbaum) How fundamental and general is learning by “suspicious coincidence” (the size principle)? –Other domains of inductive generalization in adults and children (Tenenbaum et al; Xu et al.) –Generalization in < 1-year-old infants (Gerken; Xu et al.) Bayesian word learning in more natural communicative contexts? –Cross-situational mapping with real-world scenes and utterances (Frank, Goodman & Tenenbaum; c.f., Yu)

Further questions Where do the hypothesis space and priors come from? How does word learning interact with conceptual development?

A hierarchical Bayesian view. Principles T: whole-object principle, shape bias, taxonomic principle, … Structure S: a taxonomy of categories (“thing”, “animal”, “dog”, “Basset hound”, “cat”, “tree”, “daisy”). Data D: labeled examples (“fep”? “ziv”? “gip”? …)

Different forms of structure: Dominance Order, Line, Ring, Flat Hierarchy, Taxonomy, Grid, Cylinder

Discovery of structural form (Kemp and Tenenbaum). F: form (e.g., tree-structured taxonomy, disjoint clusters, linear order). S: structure over the objects X1 … X7. D: data (the objects’ features). P(F): prior over forms. P(S | F): simplicity. P(D | S): fit to data.
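
Putting the three terms on this slide together, form and structure are inferred jointly from the observed features:

  P(S, F | D) ∝ P(D | S) P(S | F) P(F)

so the preferred hypothesis is the form and structure pair that best balances fit to the data against the simplicity of the structure within its form.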

Bayesian model selection: trading fit vs. simplicity. As the structure S for the data D grows more complex (F = 1 region, F = 2 regions, F = 13 regions), the likelihood goes from low to high to highest, while the prior goes from highest to high to low.

Balance between fit and simplicity should be sensitive to the amount of data observed… Bayesian model selection: trading fit vs. simplicity
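
One way to see why (a general observation about this kind of model, not a detail taken from the slides): the log posterior decomposes as

  log P(S, F | D) = log P(D | S) + log P(S | F) + log P(F) + const

and the likelihood term accumulates over every observed feature, while the prior terms do not grow with the data. With few observed features the simplicity-favoring prior dominates, so simple forms win; as more features are observed, a more complex structure that fits better can overtake them.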

Development of structural forms as more object features are observed

Principles T Structure S Data D A hierarchical Bayesian view ? ? “fep” ? ?... ? Whole-object principle Shape bias Taxonomic principle … “ziv”?“gip” ? “Basset hound” “dog” “animal” “cat” “tree” “daisy” “thing”

The shape bias in word learning (Landau, Smith, Jones 1988) This is a dax. Show me the dax… A useful inductive constraint: many early words are labels for object categories, and shape may be the best cue to object category membership. English-speaking children typically show the shape bias at 24 months, but not at 20 months.

Is the shape bias learned? Smith et al. (2002) trained 17-month-olds on labels for 4 artificial categories: After 8 weeks of training (20 min/week), 19-month-olds show the shape bias: “wib” “lug” “zup” “div” This is a dax. Show me the dax…

Transfer to real-world vocabulary The puzzle: The shape bias is a powerful inductive constraint, yet can be learned from very little data.

Learning abstract knowledge about feature variability “wib” “lug” “zup” “div” The intuition: - Shape varies across categories but is relatively constant within nameable categories. - Other features (size, color, texture) vary both within and across nameable object categories.

Learning a Bayesian prior ? ? ? ? ? ? Hypothesis space H of possible word meanings (extensions): e.g., rectangular regions “horse” p(h) ~ uniform shape color

? ? ? ? ? ? Hypothesis space H of possible word meanings (extensions): e.g., rectangular regions “cat” “cup” “ball” “chair” “horse” p(h) ~ uniform shape color Learning a Bayesian prior

“horse” ? ? ? ? ? ? Hypothesis space H of possible word meanings (extensions): e.g., rectangular regions “cat” “cup” “ball” “chair” p(h) ~ long & narrow: high, others: low shape color Learning a Bayesian prior

Hierarchical Bayesian model. Level 2: nameable object categories in general (shape, color). Level 1: specific categories “cat” “cup” “ball” “chair” (shape, color). Data. Nameable object categories tend to be homogeneous in shape, but heterogeneous in color, material, …

Level 2: nameable object categories in general: α_shape, α_color. Level 1: specific categories “cat” “cup” “ball” “chair”: θ_shape, θ_color. Data: {y_shape, y_color}. p(α_i) ~ Exponential(λ); p(θ_i | α_i) ~ Dirichlet(α_i); p(y_i | θ_i) ~ Multinomial(θ_i). α_i: within-category variability for feature i (low to high).
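
A minimal generative sketch of the model above (the number of feature values K, the symmetric Dirichlet parameterization, and the fixed α values are illustrative assumptions, not taken from the slides):

import numpy as np

rng = np.random.default_rng(0)
K = 10   # number of discrete values a feature can take (illustrative assumption)

# In the full model alpha_i ~ Exponential(lambda); here alpha is fixed at illustrative
# values: small alpha_shape (homogeneous shape within a category), large alpha_color.
alpha = {"shape": 0.1, "color": 5.0}

def sample_category(n_exemplars=5):
    # One nameable category: theta_i | alpha_i ~ Dirichlet, y_i | theta_i ~ Multinomial.
    y = {}
    for feat, a in alpha.items():
        theta = rng.dirichlet(np.full(K, a))           # p(theta_i | alpha_i)
        y[feat] = rng.multinomial(n_exemplars, theta)  # exemplar counts over the K values
    return y

for name in ["wib", "lug", "zup", "div"]:   # four training categories, as in the experiment
    print(name, sample_category())

Running this, exemplars within a category tend to pile onto one shape value but scatter across color values; learning the shape bias corresponds to inferring a small α_shape (and a larger α_color) from the four training categories, and reusing that inferred α as the prior when a single “dax” is labeled.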

Learning the shape bias “wib” “lug” “zup” “div” Training

This is a dax. Show me the dax… Training Test Second-order generalization test

Three case studies of generalization Learning words for object categories Learning abstract word-learning principles (“learning to learn words”) –Taxonomic principle –Shape bias Learning in syntax: unobserved syntactic forms, abstract syntactic knowledge

“Poverty of the Stimulus” argument Simple declarative: The girl is happy. They are eating. Simple interrogative: Is the girl happy? Are they eating? H1. Linear: move the first auxiliary in the sentence to the beginning. H2. Hierarchical: move the auxiliary in the main clause to the beginning. Data → Hypotheses → Generalization. Complex declarative: The girl who is sleeping is happy. Complex interrogative: Is the girl who is sleeping happy? [via H2] *Is the girl who sleeping is happy? [via H1] E.g., aux-fronting in complex interrogatives. Induction of specific grammatical rules must be guided by some abstract constraints to prefer certain hypotheses over others, e.g., syntactic rules are defined over hierarchical phrase structures rather than linear order of words. => Inductive constraint

Hierarchical phrase structure: yes or no? Must this inductive constraint be innately specified as part of the initial state of the language faculty? Or could it be inferred using more domain-general capacities?

Hierarchical Bayesian model. Grammar type T: linear (regular grammars) vs. hierarchical (context-free grammars); unbiased (uniform) prior over types. Specific grammar G: prior favors simplicity. Data D (CHILDES: ~21,500 sentences, ~2,300 sentence types): likelihood measures fit to data.
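
Reading the labels on this slide, the quantity compared across grammars is, up to normalization,

  P(G, T | D) ∝ P(D | G) P(G | T) P(T)

with P(T) unbiased, P(G | T) rewarding simpler grammars, and P(D | G) rewarding tighter fit to the corpus. The results that follow report these terms as -log probabilities, so lower is better.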

Results: Full corpus. Prior (simpler), likelihood (tighter fit), and posterior compared for each grammar: CFG-S, CFG-L, REG-B, REG-M, REG-N. (Note: these are -log probabilities, so lower = better!)

Generalization results: How well does each grammar predict unseen sentence forms (e.g., complex aux-fronted interrogatives)? For each sentence type, the table records whether it appears in the corpus and which grammars predict it (FLAT, RG-N, RG-M, RG-B, 1-ST, CFG-S, CFG-L; the RG- grammars are regular, the CFG- grammars context-free). Sentence types: Simple Declarative: Eagles do fly. (n aux vi); Simple Interrogative: Do eagles fly? (aux n vi); Complex Declarative: Eagles that are alive do fly. (n comp aux adj aux vi); Complex Interrogative: Do eagles that are alive fly? (aux n comp aux adj vi); Complex Interrogative: *Are eagles that alive do fly? (aux n comp adj aux vi).
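
A sketch of how such a comparison can be scored (not necessarily the exact procedure behind this table): each grammar is treated as probabilistic, the corpus likelihood is P(D | G) = Π_s P(s | G)^{n_s} over the observed sentence types s with counts n_s, and a grammar is credited with predicting an unseen form if it assigns that sequence of syntactic categories substantial probability rather than (near) zero.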

Results: First file (90 mins). Prior (simpler), likelihood (tighter fit), and posterior compared for each grammar: CFG-S, CFG-L, REG-B, REG-M, REG-N. (Note: these are -log probabilities, so lower = better!)

Conclusions Bayesian inference over hierarchies of structured representations provides a way to study core questions of human cognition, in language and other domains. –What is the content and form of abstract knowledge? –How can abstract knowledge guide generalization from sparse data? –How can abstract knowledge itself be acquired? What is built in? Going beyond traditional dichotomies. –How can structured knowledge be acquired by statistical learning? –How can domain-general learning mechanisms acquire domain-specific inductive constraints? A different way to think about cognitive development. –Powerful abstractions (taxonomic structure, shape bias, hierarchical organization of syntax) can be inferred “top down”, from surprisingly little data, together with learning more concrete knowledge. –Very different from the traditional empiricist or nativist views of abstraction. Worth pursuing more generally…

Abstract knowledge in cognitive development –Word learning: whole-object bias, taxonomic principle (Markman), shape bias (Smith) –Causal reasoning: causal schemata (Kelley) –Folk physics: objects are unified, persistent (Spelke) –Number: counting principles (Gelman) –Folk biology: principles of taxonomic rank (Atran) –Folk psychology: principle of rationality (Gergely) –Ontology: M-constraint on predicability (Keil) –Syntax: UG (Chomsky) –Phonology: faithfulness, markedness constraints (Prince, Smolensky)