1
Bayesian models of human learning and inference
Josh Tenenbaum
MIT Department of Brain and Cognitive Sciences
Computer Science and AI Lab (CSAIL)
Acknowledgments: Tom Griffiths, Charles Kemp, the Computational Cognitive Science group at MIT, and all the researchers whose work I'll discuss.
2
Collaborators: Chris Baker, Noah Goodman, Tom Griffiths, Charles Kemp, Vikash Mansinghka, Amy Perfors, Lauren Schmidt, Pat Shafto.
(Speaker note: game plan is biology; if 5 minutes are left at the end, do words and theory acquisition, and if no time is left, just do theory acquisition.)
Funding: AFOSR Cognition and Decision Program, AFOSR MURI, DARPA IPTO, NSF, HSARPA, NTT Communication Sciences Laboratories, James S. McDonnell Foundation.
3
Everyday inductive leaps
How can people learn so much about the world from such limited evidence?
Learning concepts from examples: "horse", "horse", "horse".
4
Learning concepts from examples
“tufa”
5
Everyday inductive leaps
How can people learn so much about the world from such limited evidence?
Kinds of objects and their properties
The meanings of words, phrases, and sentences
Cause-effect relations
The beliefs, goals, and plans of other people
Social structures, conventions, and rules
6
Modeling goals
Principled quantitative models of human behavior, with broad coverage and a minimum of free parameters and ad hoc assumptions.
Explain how and why human learning and reasoning work, in terms of (approximations to) optimal statistical inference in natural environments.
A framework for studying people's implicit knowledge about the structure of the world: how it is structured, used, and acquired.
A two-way bridge to the state of the art in statistics, machine learning, and AI.
7
The approach: from statistics to intelligence
1. How does background knowledge guide learning from sparsely observed data? Bayesian inference (see the sketch below).
2. What form does background knowledge take, across different domains and tasks? Probabilities defined over structured representations: graphs, grammars, predicate logic, schemas.
3. How is background knowledge itself acquired? Hierarchical probabilistic models, with inference at multiple levels of abstraction; flexible nonparametric models in which complexity grows with the data.
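In compact form (standard notation; consistent with the hierarchical framework later in the talk, not reproduced from the slides):

```latex
% 1. Bayes' rule: background knowledge enters through the prior P(h).
P(h \mid d) = \frac{P(d \mid h)\,P(h)}{\sum_{h' \in H} P(d \mid h')\,P(h')}

% 3. Hierarchical inference: the background knowledge (structure S,
%    form F) is itself inferred from the data D.
P(S, F \mid D) \propto P(D \mid S)\,P(S \mid F)\,P(F)
```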
8
Outline
Predicting everyday events
Learning concepts from examples
Learning the latent structure of the world
9
Bayesian inference in perception and sensorimotor integration
(Weiss, Simoncelli & Adelson 2002) (Kording & Wolpert 2004)
10
Everyday prediction problems (Griffiths & Tenenbaum, 2006)
You read about a movie that has made $60 million to date. How much money will it make in total?
You see that something has been baking in the oven for 34 minutes. How long until it's ready?
You meet someone who is 78 years old. How long will they live?
Your friend quotes to you from line 17 of his favorite poem. How long is the poem?
You meet a US congressman who has served for 11 years. How long will he serve in total?
In general: you encounter a phenomenon or event with an unknown extent or duration t_total, at a random time or value t < t_total. What is the total extent or duration t_total?
11
Bayesian analysis
p(t_total | t) ∝ p(t | t_total) p(t_total)
Likelihood p(t | t_total): assume a random sample, so p(t | t_total) = 1/t_total for 0 < t < t_total (and 0 otherwise).
Prior p(t_total): what form does it take? E.g., the uninformative (Jeffreys) prior p(t_total) ∝ 1/t_total.
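A minimal numerical sketch of this analysis (illustrative code, not from the talk): place t_total on a grid, apply the random-sampling likelihood, and report the posterior median as the prediction.

```python
import numpy as np

def predict_total(t, prior, grid):
    """Posterior median prediction of t_total given one observation t.

    prior: unnormalized prior p(t_total) evaluated on `grid`.
    Likelihood: random sampling, p(t | t_total) = 1/t_total
    for t < t_total, else 0.
    """
    likelihood = np.where(grid > t, 1.0 / grid, 0.0)
    posterior = likelihood * prior
    posterior /= posterior.sum()
    cdf = np.cumsum(posterior)
    return grid[np.searchsorted(cdf, 0.5)]   # posterior median

grid = np.linspace(1.0, 10_000.0, 1_000_000)

# Uninformative (Jeffreys) prior p(t_total) ~ 1/t_total:
print(predict_total(60.0, 1.0 / grid, grid))      # ~120: "double it"

# Power-law prior p(t_total) ~ t_total**(-gamma), here gamma = 2:
print(predict_total(60.0, grid ** -2.0, grid))    # ~85 = 2**(1/2) * 60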
12
Priors p(t_total) are based on empirically measured durations or magnitudes for many real-world events in each class.
[Plots: median human judgments of the total duration or magnitude t_total of events in each class, given that they are first observed at duration or magnitude t, versus Bayesian predictions (the median of p(t_total | t)).]
13
You learn that in ancient Egypt, there was a great flood in the 11th year of a pharaoh's reign. How long did he reign?
14
You learn that in ancient Egypt, there was a great flood in the 11th year of a pharaoh's reign. How long did he reign?
How long did the typical pharaoh reign in ancient Egypt?
15
Summary: prediction
Predictions about the extent or magnitude of everyday events follow Bayesian principles.
Contrast with Bayesian inference in perception, motor control, and memory: there are no "universal priors" here. Predictions depend rationally on priors that are appropriately calibrated for different domains:
the form of the prior (e.g., power-law or exponential);
the specific distribution given that form (its parameters);
a non-parametric distribution when necessary.
In the absence of concrete experience, priors may be generated by qualitative background knowledge.
16
Learning concepts from examples
Word learning: "tufa", "tufa", "tufa".
Property induction:
Cows have T9 hormones. Seals have T9 hormones. Squirrels have T9 hormones. Therefore, all mammals have T9 hormones.
versus
Cows have T9 hormones. Sheep have T9 hormones. Goats have T9 hormones. Therefore, all mammals have T9 hormones.
(The first argument feels stronger: its premises are more diverse.)
17
The computational problem (cf. semi-supervised learning)
[Matrix: rows are ten species (Horse, Cow, Chimp, Gorilla, Mouse, Squirrel, Dolphin, Seal, Rhino, Elephant); columns are the 85 features from Osherson et al. plus a new property whose values are mostly unobserved ("?"). E.g., for Elephant: 'gray', 'hairless', 'toughskin', 'big', 'bulbous', 'longleg', 'tail', 'chewteeth', 'tusks', 'smelly', 'walks', 'slow', 'strong', 'muscle', 'quadrapedal', ...]
18
Semi-supervised learning: similarity-based approach
“has T9 hormones”, “is a tufa”
19
Semi-supervised learning: similarity-based approach
“has T9 hormones”, “is a tufa”
20
Similarity-based models of property induction
[Scatter plots: human judgments ("data") vs. model predictions ("model"). Each point represents one argument of the form: X1 have property P. X2 have property P. X3 have property P. Therefore, all mammals have property P.]
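The transcript does not specify the particular similarity-based model behind these plots; one standard formulation is the max-similarity rule, sketched here as an assumption:

```python
import numpy as np

def max_sim_strength(conclusion, premises, features):
    """Similarity-based argument strength: the maximum similarity
    between the conclusion category and any premise category, using
    cosine similarity over binary feature vectors (e.g., Osherson's)."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(cos(features[conclusion], features[p]) for p in premises)

# Hypothetical toy feature matrix (rows: species, columns: features):
features = {
    "cow":      np.array([1, 1, 0, 1, 0]),
    "seal":     np.array([0, 1, 1, 0, 1]),
    "squirrel": np.array([0, 0, 1, 1, 1]),
    "horse":    np.array([1, 1, 0, 1, 1]),
}
print(max_sim_strength("horse", ["cow", "seal", "squirrel"], features))
```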
21
Hierarchical Bayesian Framework (Kemp & Tenenbaum)
F: form, e.g., a tree; prior P(form).
S: structure, a tree over mouse, squirrel, chimp, gorilla; P(structure | form).
D: data, the observed features F1-F4 plus the partially observed new property "has T9 hormones" ("?"); P(data | structure).
22
The value of structural form knowledge: inductive bias
23
P(D|S): How the structure constrains the data of experience
Define a stochastic process over structure S that generates hypotheses h. For generic properties, the prior should favor hypotheses that vary smoothly over the structure. Many properties of biological species were actually generated by such a process (i.e., mutation + selection). [Illustration: smooth hypotheses get high P(h); non-smooth hypotheses get low P(h).]
24
P(D|S): How the structure constrains the data of experience
Gaussian process (~ random walk, diffusion) [Zhu, Ghahramani & Lafferty 2003]: draw a continuous function y over the structure, then threshold it to obtain the binary property h.
25
P(D|S): How the structure constrains the data of experience
Gaussian process (~ random walk, diffusion) [Zhu, Ghahramani & Lafferty 2003]: draw a continuous function y over the structure, then threshold it to obtain the binary property h.
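A minimal sketch of this construction (assumed implementation in the spirit of Zhu, Ghahramani & Lafferty): build the graph Laplacian of structure S, draw the latent function y from the induced Gaussian prior, and threshold it to get a binary property h.

```python
import numpy as np

def sample_smooth_property(edges, n, sigma=0.1, seed=0):
    """Sample a binary property that varies smoothly over a graph.

    edges: list of (i, j) undirected edges of structure S.
    The covariance (L + sigma^2 I)^-1 makes neighboring nodes'
    latent values y correlated; thresholding y at 0 yields h.
    """
    rng = np.random.default_rng(seed)
    L = np.zeros((n, n))
    for i, j in edges:
        L[i, i] += 1; L[j, j] += 1
        L[i, j] -= 1; L[j, i] -= 1
    cov = np.linalg.inv(L + sigma**2 * np.eye(n))
    y = rng.multivariate_normal(np.zeros(n), cov)
    return (y > 0).astype(int)

# A 4-node chain: mouse - squirrel - chimp - gorilla.
print(sample_smooth_property([(0, 1), (1, 2), (2, 3)], n=4))
```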
26
[Figure: structure S (a tree over Species 1-10) generates data D, a species × features matrix. 85 features from Osherson et al., e.g., for Elephant: 'gray', 'hairless', 'toughskin', 'big', 'bulbous', 'longleg', 'tail', 'chewteeth', 'tusks', 'smelly', 'walks', 'slow', 'strong', 'muscle', 'quadrapedal', ...]
28
[cf. Lawrence, 2004; Smola & Kondor, 2003]
29
[Figure: structure S (a tree over Species 1-10) generates data D, now including a new, partially observed property ("?") alongside the 85 features from Osherson et al.]
30
Property induction: "Given that X1, ..., Xn have property P, how likely is it that Y does?" (Kemp & Tenenbaum)
[Results: tree-based vs. 2D spatial models. The tree fits biological arguments (e.g., Horses → All mammals); the 2D model fits spatial arguments (e.g., properties of cities such as Minneapolis, Houston).]
31
Hierarchical Bayesian Framework
F: form, chain vs. tree vs. (2D) space.
S: structure, the same species (mouse, squirrel, chimp, gorilla) arranged as a chain, a tree, or points in space.
D: data, the feature matrix F1-F4 over the species.
32
Discovering structural forms
[Figure: the same seven animals (snake, turtle, crocodile, robin, ostrich, bat, orangutan) arranged under different candidate structures.]
33
Discovering structural forms
[Figure: alternative structural forms for the same animals. A linear "great chain of being" (rock, plant, snake, turtle, crocodile, robin, ostrich, bat, orangutan, angel, God) vs. a Linnaean tree.]
34
Learning domain structures
People can discover structural forms...
Children's cognitive development: the hierarchical structure of category labels, the cyclical structure of seasons or days of the week, the transitive structure of comparative relations.
Scientific discovery: the tree structure for biological species, the periodic structure for chemical elements.
...but conventional algorithms assume fixed forms. E.g., principal components analysis assumes a low-dimensional spatial structure, hierarchical clustering assumes a tree structure, and k-means clustering assumes a flat partition.
[Figure: the "great chain of being" (1579); Linnaeus's Systema Naturae (1735), with the nested ranks Kingdom Animalia, Phylum Chordata, Class Mammalia, Order Primates, Family Hominidae, Genus Homo, Species Homo sapiens; an early branching tree diagram of species (1837).]
35
“Universal Structure Learner”
The ultimate goal: a "Universal Structure Learner" that maps data to representations, subsuming K-means, hierarchical clustering, factor analysis, Guttman scaling, circumplex models, self-organizing maps, ...
36
A “universal grammar” for structural forms
[Table: structural forms paired with the generative processes (graph grammars) that produce them.]
37
Hierarchical Bayesian Framework
F: form; the prior over structures P(S | F) favors simplicity.
S: structure over mouse, squirrel, chimp, gorilla.
D: data (features F1-F4); the likelihood P(D | S) favors smoothness [Zhu et al., 2003].
38
Model fitting: evaluate each form in parallel. For each form, run a heuristic search over structures based on greedy growth from a one-node seed, as sketched below:
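A hypothetical sketch of that search loop (the growth operators and scoring function are caller-supplied stand-ins for the form's grammar and the hierarchical posterior described above):

```python
def fit_structure(seed, grow_moves, score):
    """Greedy structure search for one structural form.

    seed: a one-node structure holding all entities.
    grow_moves(S): candidate structures produced by splitting a node
        of S according to the form's growth grammar (chain, tree, ...).
    score(S): posterior score, log P(data | S) + log P(S | form).
    """
    structure, best = seed, score(seed)
    improved = True
    while improved:
        improved = False
        for candidate in grow_moves(structure):
            if (s := score(candidate)) > best:
                structure, best, improved = candidate, s, True
    return structure, best

# Run once per form, in parallel; keep the best (form, structure) pair.
```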
40
Development of structural forms as more data are observed
41
Beyond “Nature” versus “Nurture” in human cognitive development
"Nativists": explicit knowledge of structural forms for core domains is innate.
Atran (1998): the tendency to group living kinds into hierarchies reflects an "innately determined cognitive structure".
Chomsky (1980): "The belief that various systems of mind are organized along quite different principles leads to the natural conclusion that these systems are intrinsically determined, not simply the result of common mechanisms of learning or growth."
"Empiricists": general-purpose learning systems without explicit knowledge of structural form.
E.g., connectionist networks (Rogers and McClelland, 2004) and traditional structure learning in probabilistic graphical models.
42
Summary: concept learning
Bayesian inference over hierarchies of structured representations provides a framework for understanding core questions of human cognition:
How does abstract domain knowledge guide learning of new concepts?
How can this abstract knowledge be represented, how might it be learned, and what might be innate?
How can probabilistic inference work together with flexibly structured, symbolic representations of knowledge?
[Diagram: the form F / structure S / data D hierarchy over mouse, squirrel, chimp, gorilla with features F1-F4.]
43
The big picture
What we need to understand: the mind's ability to build rich models of the world from sparse data.
Learning about objects, categories, their properties and relations
Causal inference
Understanding other people's actions, plans, thoughts, goals
Language comprehension and production
Scene understanding
What do we need to understand these abilities?
Bayesian inference in probabilistic generative models
Hierarchical models, with inference at all levels of abstraction
Structured representations: graphs, grammars, logic
Flexible representations, growing in response to observed data
44
Clustering models for relational data
Social networks: block models. Does prisoner x like prisoner y? Does person x respect person y?
45
Learning systems of concepts with infinite relational models (Kemp, Tenenbaum, Griffiths, Yamada & Ueda, AAAI 06). Relations take the form predicate(concept, concept).
Biomedical predicate data from UMLS (McCray et al.):
134 concepts: enzyme, hormone, organ, disease, cell function, ...
49 predicates: affects(hormone, organ), complicates(enzyme, cell function), treats(drug, disease), diagnoses(procedure, disease), ...
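A hedged sketch of the IRM's core likelihood computation (my illustration, not the authors' code): given an assignment of concepts to clusters, each between-cluster block of the relation shares one link probability, which is integrated out under a Beta prior.

```python
import numpy as np
from scipy.special import betaln

def block_log_likelihood(R, z, a=1.0, b=1.0):
    """log P(R | z) for an IRM-style block model.

    R: n x n binary relation (e.g., affects(hormone, organ)).
    z: cluster assignment for each of the n concepts.
    Each block of entries between cluster pair (k, l) shares one
    Bernoulli parameter, integrated out under a Beta(a, b) prior.
    """
    ll = 0.0
    for k in np.unique(z):
        for l in np.unique(z):
            block = R[np.ix_(z == k, z == l)]
            ones, total = block.sum(), block.size
            ll += betaln(a + ones, b + total - ones) - betaln(a, b)
    return ll

R = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [0, 0, 0, 1],
              [0, 0, 1, 0]])
z = np.array([0, 0, 1, 1])
print(block_log_likelihood(R, z))
```

In the full model, the number of clusters is unbounded: z is sampled from a Chinese restaurant process prior and inferred (e.g., by Gibbs sampling) rather than fixed in advance.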
46
Learning a medical ontology
e.g., Diseases affect Organisms; Chemicals interact with Chemicals; Chemicals cause Diseases.
47
Clustering arbitrary relational systems
International relations circa 1965 (Rummel).
14 countries: UK, USA, USSR, China, ...
54 binary relations representing interactions between countries: exports to(USA, UK), protests(USA, USSR), ...
90 (dynamic) country features: purges, protests, unemployment, communists, # languages, assassinations, ...
49
Learning a hierarchical ontology
50
Structural forms from relational data
Structural forms from relational data (form: domain, relation):
Dominance hierarchy: primate troop, "x beats y"
Tree: Bush administration, "x told y"
Cliques: prison inmates, "x likes y"
Ring: Kula islands, "x trades with y"
51
Learning causal relations
[Hierarchy: abstract principles generate structure; structure generates data.] (Griffiths, Tenenbaum, Kemp et al.)
52
[Figure: the true structure of a graphical model G over 16 variables, learned from data D as the number of samples grows. An abstract theory assigns variables to classes Z and specifies class-level edge probabilities, e.g., P(edge | class(i) = c1, class(j) = c2) = 0.4 and 0.0 for other class pairs; the theory constrains which graphs G are plausible.] (Mansinghka, Kemp, Tenenbaum, Griffiths UAI 06)
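In this framework the abstract theory acts as a prior over graphs: each variable belongs to a class, and edge probability depends only on the endpoint classes. A minimal sketch (illustrative; the class matrix echoes the 0.0 / 0.4 entries on the slide):

```python
import numpy as np

def log_prior_graph(G, z, eta):
    """log P(G | theory): each possible directed edge i -> j is an
    independent Bernoulli draw whose probability eta[z[i], z[j]]
    depends only on the classes of the two endpoint variables."""
    p = np.clip(eta[np.ix_(z, z)], 1e-9, 1 - 1e-9)  # per-pair edge probs
    mask = ~np.eye(len(z), dtype=bool)              # ignore self-edges
    ll = G * np.log(p) + (1 - G) * np.log(1 - p)
    return float(ll[mask].sum())

z = np.array([0, 0, 1, 1])              # class assignment for 4 variables
eta = np.array([[0.0, 0.4],             # class-level edge matrix, echoing
                [0.0, 0.0]])            # the slide's c1/c2 table
G = np.zeros((4, 4), dtype=int)
G[0, 2] = G[1, 3] = 1                   # edges from class-0 into class-1
print(log_prior_graph(G, z, eta))
```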
53
Goal-directed action (production and comprehension)
(Wolpert et al., 2003)
54
Goal inference as inverse probabilistic planning
Goal inference as inverse probabilistic planning (Baker, Tenenbaum & Saxe): constraints and goals feed into rational planning ((PO)MDP), which produces actions; inverting this generative model recovers goals from observed actions. [Plot: human judgments vs. model predictions.]
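A toy sketch of inverse planning (my simplification; the actual models invert full (PO)MDP planners): assume the agent chooses moves with a Boltzmann policy that prefers progress toward its goal, then score candidate goals by the likelihood of the observed path.

```python
import numpy as np

MOVES = [(1, 0), (-1, 0), (0, 1), (0, -1),
         (1, 1), (1, -1), (-1, 1), (-1, -1)]

def goal_posterior(path, goals, beta=2.0):
    """P(goal | observed path) by inverting a soft rational planner.

    At each step the agent is assumed to pick a move with probability
    proportional to exp(-beta * distance-to-goal after the move);
    Bayes' rule with a flat prior then scores candidate goals.
    """
    def log_lik(goal):
        g = np.asarray(goal, dtype=float)
        ll = 0.0
        for s, s_next in zip(path[:-1], path[1:]):
            dists = np.array([np.linalg.norm(np.add(s, m) - g) for m in MOVES])
            logits = -beta * dists - np.max(-beta * dists)
            probs = np.exp(logits) / np.exp(logits).sum()
            delta = (s_next[0] - s[0], s_next[1] - s[1])
            ll += np.log(probs[MOVES.index(delta)])
        return ll
    log_posts = np.array([log_lik(g) for g in goals.values()])
    post = np.exp(log_posts - log_posts.max())
    post /= post.sum()
    return dict(zip(goals, post))

# The agent walks rightward; goal "A" explains the path far better:
path = [(0, 0), (1, 0), (2, 0), (3, 1)]
print(goal_posterior(path, {"A": (4, 2), "B": (0, 4)}))
```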
55
Hierarchical phrase structure grammars (e.g., CFG, HPSG, TAG) as a generative chain (cf. Chater and Manning, 2006):
"Universal Grammar" → P(grammar | UG) → grammar → P(phrase structure | grammar) → phrase structure → P(utterance | phrase structure) → utterance → P(speech | utterance) → speech signal.
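To make the chain concrete, here is a toy sampler for the middle links, P(phrase structure | grammar) and the utterance it yields, using a hypothetical mini-grammar (not a linguistically serious one):

```python
import random

# Hypothetical toy PCFG: nonterminal -> list of (probability, RHS) pairs.
GRAMMAR = {
    "S":   [(1.0, ["NP", "VP"])],
    "NP":  [(0.7, ["Det", "N"]), (0.3, ["Det", "Adj", "N"])],
    "VP":  [(0.6, ["V", "NP"]), (0.4, ["V"])],
    "Det": [(1.0, ["the"])],
    "Adj": [(0.5, ["small"]), (0.5, ["furry"])],
    "N":   [(0.5, ["tufa"]), (0.5, ["gorilla"])],
    "V":   [(0.5, ["sees"]), (0.5, ["sleeps"])],
}

def sample(symbol="S"):
    """Sample a phrase structure from P(phrase structure | grammar)
    and flatten it to an utterance (the yield of the tree)."""
    if symbol not in GRAMMAR:
        return [symbol]                      # terminal word
    probs, expansions = zip(*GRAMMAR[symbol])
    rhs = random.choices(expansions, weights=probs)[0]
    return [word for child in rhs for word in sample(child)]

print(" ".join(sample()))                    # e.g. "the tufa sees the gorilla"
```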
56
Vision as probabilistic parsing
(Han & Zhu, 2006; cf. Zhu, Chen & Yuille, NIPS 06)
57
The big picture
What we need to understand: the mind's ability to build rich models of the world from sparse data.
Learning about objects, categories, and their properties
Causal inference
Understanding other people's actions, plans, thoughts, goals
Language comprehension and production
Scene understanding
What do we need to understand these abilities?
Bayesian inference in probabilistic generative models
Hierarchical models, with inference at all levels of abstraction
Structured representations: graphs, grammars, logic
Flexible representations, growing in response to observed data
59
The “nonparametric safety-net”
[Figure: the true structure of graphical model G is a ring over 12 variables. As the number of samples grows, learning with a flat prior over edges is compared against learning with the abstract theory (classes Z with class-level edge probabilities).]
60
Bayesian prediction
P(t_total | t_past) ∝ (1/t_total) × P(t_total)
(posterior probability ∝ random-sampling likelihood × domain-dependent prior)
What is the best guess for t_total? Compute t* such that P(t_total > t* | t_past) = 0.5, i.e., the median of the posterior P(t_total | t_past).
We compared the median of the Bayesian posterior with the median of subjects' judgments... but what about the distribution of subjects' judgments?
61
Sources of individual differences
Individuals' judgments could be noisy.
Individuals' judgments could be optimal, but with different priors: e.g., each individual has seen only a sparse sample of the relevant population of events.
Individuals' inferences about the posterior could be optimal, but their judgments could be based on probability (or utility) matching rather than maximizing.
62
Individual differences in prediction
[Figure: the posterior P(t_total | t_past) over t_total with candidate predicted values; x-axis: quantile of the Bayesian posterior distribution; y-axis: proportion of judgments below the predicted value.]
63
Individual differences in prediction
[Figure: proportion of judgments below the predicted value vs. quantile of the Bayesian posterior, averaged over all prediction tasks: movie run times, movie grosses, poem lengths, life spans, terms in congress, cake baking times.]
64
Individual differences in concept learning
65
Why probability matching?
Probability matching can be optimal behavior under some (evolutionarily natural) circumstances:
optimal betting theory, portfolio theory
optimal foraging theory
competitive games
dynamic tasks (changing probabilities or utilities)
It can also be a side-effect of algorithms for approximating complex Bayesian computations. Markov chain Monte Carlo (MCMC): instead of integrating over complex hypothesis spaces, construct a sample of high-probability hypotheses. Judgments from individual (independent) samples can on average be almost as good as using the full posterior distribution.
66
Markov chain Monte Carlo
(Metropolis-Hastings algorithm)
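A minimal sketch of the algorithm (standard random-walk Metropolis-Hastings; the target below is the prediction posterior from earlier, chosen purely for illustration):

```python
import numpy as np

def metropolis_hastings(log_post, x0, n_steps, step=20.0, seed=0):
    """Random-walk Metropolis-Hastings over a 1D hypothesis space.

    Propose x' ~ Normal(x, step); accept with probability
    min(1, P(x') / P(x)). The chain's states are posterior samples.
    """
    rng = np.random.default_rng(seed)
    x, samples = x0, []
    for _ in range(n_steps):
        proposal = x + step * rng.standard_normal()
        if np.log(rng.random()) < log_post(proposal) - log_post(x):
            x = proposal                     # accept; otherwise keep x
        samples.append(x)
    return np.array(samples)

# Illustrative target: the prediction posterior for t_past = 60 under
# the uninformative prior, p(t_total | t_past) ~ 1/t_total^2, t_total > 60.
log_post = lambda s: -2.0 * np.log(s) if s > 60 else -np.inf
chain = metropolis_hastings(log_post, x0=100.0, n_steps=20_000)
print(np.median(chain[1_000:]))              # ~120, the posterior median
```

One independent sample per "individual" yields probability matching: across individuals, the distribution of such single-sample judgments mirrors the posterior itself.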