Bayesian models of human learning and reasoning Josh Tenenbaum MIT Department of Brain and Cognitive Sciences Computer Science and AI Lab (CSAIL)

Lab members: Charles Kemp, Pat Shafto, Vikash Mansinghka, Amy Perfors, Lauren Schmidt, Chris Baker, Noah Goodman, Tom Griffiths*. Funding: AFOSR Cognition and Decision Program, AFOSR MURI, DARPA IPTO, NSF, HSARPA, NTT Communication Sciences Laboratories, James S. McDonnell Foundation

The probabilistic revolution in AI Principled and effective solutions for inductive inference from ambiguous data: –Vision –Robotics –Machine learning –Expert systems / reasoning –Natural language processing Standard view: no necessary connection to how the human brain solves these problems.

Bayesian models of cognition Visual perception [Weiss, Simoncelli, Adelson, Richards, Freeman, Feldman, Kersten, Knill, Maloney, Olshausen, Jacobs, Pouget,...] Language acquisition and processing [Brent, de Marcken, Niyogi, Klein, Manning, Jurafsky, Keller, Levy, Hale, Johnson, Griffiths, Perfors, Tenenbaum, …] Motor learning and motor control [Ghahramani, Jordan, Wolpert, Kording, Kawato, Doya, Todorov, Shadmehr, …] Associative learning [Dayan, Daw, Kakade, Courville, Touretzky, Kruschke, …] Memory [Anderson, Schooler, Shiffrin, Steyvers, Griffiths, McClelland, …] Attention [Mozer, Huber, Torralba, Oliva, Geisler, Yu, Itti, Baldi, …] Categorization and concept learning [Anderson, Nosofsky, Rehder, Navarro, Griffiths, Feldman, Tenenbaum, Rosseel, Goodman, Kemp, Mansinghka, …] Reasoning [Chater, Oaksford, Sloman, McKenzie, Heit, Tenenbaum, Kemp, …] Causal inference [Waldmann, Sloman, Steyvers, Griffiths, Tenenbaum, Yuille, …] Decision making and theory of mind [Lee, Stankiewicz, Rao, Baker, Goodman, Tenenbaum, …]

Want to learn more? Special issue of Trends in Cognitive Sciences (TiCS), July 2006 (Vol. 10, no. 7), on “Probabilistic models of cognition”. Tom Griffiths’ reading list: Summer school on probabilistic models of cognition, July 2007, Institute for Pure and Applied Mathematics (IPAM) at UCLA.

Everyday inductive leaps How can people learn so much about the world from such limited evidence? –Learning concepts from examples “horse”

Learning concepts from examples “tufa”

Everyday inductive leaps How can people learn so much about the world from such limited evidence? –Kinds of objects and their properties –The meanings of words, phrases, and sentences –Cause-effect relations –The beliefs, goals and plans of other people –Social structures, conventions, and rules

Modeling Goals Principled quantitative models of human behavior, with broad coverage and a minimum of free parameters and ad hoc assumptions. Explain how and why human learning and reasoning works, in terms of (approximations to) optimal statistical inference in natural environments. A framework for studying people’s implicit knowledge about the structure of the world: how it is structured, used, and acquired. A two-way bridge to state-of-the-art AI.

The approach: from statistics to intelligence
1. How does background knowledge guide learning from sparsely observed data? Bayesian inference.
2. What form does background knowledge take, across different domains and tasks? Probabilities defined over structured representations: graphs, grammars, predicate logic, schemas, theories.
3. How is background knowledge itself acquired? Hierarchical probabilistic models, with inference at multiple levels of abstraction; flexible nonparametric models in which complexity grows with the data.

Parallel history of AI and Cog Sci
– Origins to mid 1980s: goal was to understand intelligence in computational terms; the dominant approach in AI was structured symbolic knowledge.
– Mid 1980s to 2000: goal was to build smarter applications that work in the real world; the dominant approach was "simple" statistics (neural nets, clustering, counting).
– Present: goals are "human-level AI" and "cognitive systems"; the dominant approach is "sophisticated" statistics (structured statistical models).

Basics of Bayesian inference Bayes’ rule: An example –Data: John is coughing –Some hypotheses: 1. John has a cold 2. John has lung cancer 3. John has a stomach flu –Likelihood P(d|h) favors 1 and 2 over 3 –Prior probability P(h) favors 1 and 3 over 2 –Posterior probability P(h|d) favors 1 over 2 and 3
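To make the arithmetic concrete, here is a minimal Python sketch of this example. The prior and likelihood numbers are invented for illustration only; they are not values from the talk.

```python
# Minimal sketch of the coughing example; all probabilities are made-up illustrations.
priors = {"cold": 0.5, "lung cancer": 0.01, "stomach flu": 0.3}      # P(h)
likelihoods = {"cold": 0.8, "lung cancer": 0.7, "stomach flu": 0.1}  # P(d = coughing | h)

unnormalized = {h: priors[h] * likelihoods[h] for h in priors}
evidence = sum(unnormalized.values())                                # P(d)
posterior = {h: p / evidence for h, p in unnormalized.items()}       # P(h | d), Bayes' rule

for h, p in sorted(posterior.items(), key=lambda kv: -kv[1]):
    print(f"P({h} | coughing) = {p:.3f}")                            # the cold hypothesis wins
```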

Everyday prediction problems (Griffiths & Tenenbaum, 2006). You read about a movie that has made $60 million to date. How much money will it make in total? You see that something has been baking in the oven for 34 minutes. How long until it's ready? You meet someone who is 78 years old. How long will they live? Your friend quotes to you from line 17 of his favorite poem. How long is the poem? You meet a US congressman who has served for 11 years. How long will he serve in total? In general: you encounter a phenomenon or event with an unknown extent or duration t_total at a random time or value t < t_total. What is the total extent or duration t_total?

Bayesian analysis: p(t_total | t) ∝ p(t | t_total) p(t_total) ∝ (1/t_total) p(t_total). Assume random sampling: p(t | t_total) = 1/t_total for 0 < t < t_total, and 0 otherwise. What form should p(t_total) take? e.g., the uninformative (Jeffreys) prior ∝ 1/t_total.
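A numerical sketch of this analysis on a grid, assuming an illustrative power-law prior for a heavy-tailed quantity (e.g., movie grosses) and a rough Gaussian prior for life spans; the parameters here are placeholders, not the empirically fitted priors from the study.

```python
import numpy as np

def predict_total(t, prior, grid):
    """Posterior median of t_total, given one observation at time/value t."""
    likelihood = np.where(grid >= t, 1.0 / grid, 0.0)        # p(t | t_total): random sampling
    posterior = likelihood * prior(grid)
    posterior /= posterior.sum()
    cdf = np.cumsum(posterior)
    return grid[np.searchsorted(cdf, 0.5)]                    # t* with P(t_total > t* | t) = 0.5

grid = np.linspace(1, 1000, 100_000)
power_law = lambda x: x ** -1.5                               # illustrative heavy-tailed prior
gaussian = lambda x: np.exp(-0.5 * ((x - 75) / 15) ** 2)      # rough life-span prior (mean 75, sd 15)

print(predict_total(60, power_law, grid))   # prediction for a movie that has grossed $60M
print(predict_total(78, gaussian, grid))    # prediction for a 78-year-old's life span
```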

Priors P(t_total) were based on empirically measured durations or magnitudes for many real-world events in each class. Median human judgments of the total duration or magnitude t_total of events in each class, given that they are first observed at a duration or magnitude t, versus Bayesian predictions (median of P(t_total | t)).

Concept learning: Bayesian inference over a tree-structured hypothesis space (Xu & Tenenbaum; Schmidt & Tenenbaum). Example: "tufa".

Some questions How confident are we that a tree-structured model is the best way to characterize this learning task? How do people construct an appropriate tree-structured hypothesis space? What other kinds of structured probabilistic models may be needed to explain other inductive leaps that people make, and how do people acquire these different structured models? Are there general unifying principles that explain our capacity to learn and reason with structured probabilistic models across different domains?

Property induction: how can people generalize new concepts from just a few examples? Classic argument types ("similarity", "typicality", "diversity"), e.g.:
– Gorillas have T9 hormones. Seals have T9 hormones. Squirrels have T9 hormones. → Horses have T9 hormones.
– Gorillas have T9 hormones. Chimps have T9 hormones. Monkeys have T9 hormones. Baboons have T9 hormones. → Horses have T9 hormones.
– Gorillas have T9 hormones. Seals have T9 hormones. Squirrels have T9 hormones. → Flies have T9 hormones.

Experiments on property induction (Osherson, Smith, Wilkie, Lopez & Shafir, 1990).
– 20 subjects rated the strength of 45 arguments of the form: X1 have property P (e.g., Cows have T4 hormones). X2 have property P. X3 have property P. → All mammals have property P. [General argument]
– 20 subjects rated the strength of 36 arguments of the form: X1 have property P. X2 have property P. → Horses have property P. [Specific argument]

People were given 48 animals, 85 features, and asked to rate whether each animal had each feature. E.g., elephant: 'gray' 'hairless' 'toughskin' 'big' 'bulbous' 'longleg' 'tail' 'chewteeth' 'tusks' 'smelly' 'walks' 'slow' 'strong' 'muscle’ 'quadrapedal' 'inactive' 'vegetation' 'grazer' 'oldworld' 'bush' 'jungle' 'ground' 'timid' 'smart' 'group' Feature rating data (Osherson and Wilkie)

The computational problem (cf. semi-supervised learning). [Matrix diagram: rows are animals (horse, cow, chimp, gorilla, mouse, squirrel, dolphin, seal, rhino, elephant); columns are observed features plus a new property whose values are unknown.] 85 features for 50 animals (Osherson et al.): e.g., for Elephant: ‘gray’, ‘hairless’, ‘toughskin’, ‘big’, ‘bulbous’, ‘longleg’, ‘tail’, ‘chewteeth’, ‘tusks’, ‘smelly’, ‘walks’, ‘slow’, ‘strong’, ‘muscle’, ‘fourlegs’,…

[Scatter plots: human judgments of argument strength versus model predictions and versus similarity-based models, for general arguments such as "Gorillas, mice, seals have property P → all mammals have property P" and "Cows, elephants, horses have property P → all mammals have property P".]

“has T9 hormones”, “is a tufa” Semi-supervised learning: similarity-based approach

“has T9 hormones”, “is a tufa” Semi-supervised learning: similarity-based approach

Hierarchical Bayesian Framework. [Diagram: F (form) generates S (structure, a tree with species such as mouse, squirrel, chimp, gorilla at the leaf nodes), which generates D (data: observed features F1–F4 plus the new property "has T9 hormones" with unknown values); the levels are linked by P(form), P(structure | form), and P(data | structure).]

Beyond similarity-based induction.
– Reasoning based on dimensional thresholds (Smith et al., 1993): "Poodles can bite through wire. → German shepherds can bite through wire." versus "Dobermans can bite through wire. → German shepherds can bite through wire."
– Reasoning based on causal relations (Medin et al., 2004; Coley & Shafto, 2003): "Salmon carry E. Spirus bacteria. → Grizzly bears carry E. Spirus bacteria." versus "Grizzly bears carry E. Spirus bacteria. → Salmon carry E. Spirus bacteria."

Different sources for priors:
– "Chimps have T9 hormones. → Gorillas have T9 hormones.": taxonomic similarity.
– "Poodles can bite through wire. → Dobermans can bite through wire.": jaw strength.
– "Salmon carry E. Spirus bacteria. → Grizzly bears carry E. Spirus bacteria.": food web relations.

Hierarchical Bayesian Framework with background knowledge. [Diagram as before: F (form) → S (structure, a tree with species at the leaf nodes) → D (data: features F1–F4 and the new property "has T9 hormones"), linked by P(form), P(structure | form), and P(data | structure); the upper levels encode background knowledge.]

The value of structural form knowledge: inductive bias

Hierarchical Bayesian Framework: property induction. [Diagram: F (form) → S (structure, a tree with species at the leaf nodes: mouse, squirrel, chimp, gorilla) → D (data: features F1–F4 plus the partially observed property "has T9 hormones").]

P(D|S): how the structure constrains the data of experience. Define a stochastic process over structure S that generates hypotheses h. Intuitively, properties should vary smoothly over structure. [Figure: a property assignment that is smooth over the tree has high P(h); a non-smooth assignment has low P(h).]

P(D|S): how the structure constrains the data of experience. Define a stochastic process over structure S that generates hypotheses h. For generic properties, the prior should favor hypotheses that vary smoothly over structure; many properties of biological species were actually generated by such a process (i.e., mutation plus selection). [Figure: smooth property assignment, P(h) high; non-smooth assignment, P(h) low.]

P(D|S): how the structure constrains the data of experience (Zhu, Ghahramani & Lafferty, 2003). [Diagram: a Gaussian process over the structure S (akin to a random walk or diffusion) generates a real-valued function y, which is thresholded to yield the binary hypothesis h.]

A graph-based prior (Zhu, Ghahramani & Lafferty, 2003). Let d_ij be the length of the edge between i and j (d_ij = ∞ if i and j are not connected). Place a Gaussian prior y ~ N(0, Σ) over property values at the nodes, with covariance Σ = (Δ + σ^(-2) I)^(-1), where Δ is the graph Laplacian built from edge weights w_ij = 1/d_ij.
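A small numerical sketch of how such a prior can be constructed and sampled. The chain graph, edge lengths, and sigma are illustrative choices, and the covariance follows the Zhu et al.-style construction described above rather than any code from the talk.

```python
import numpy as np

edges = {(0, 1): 1.0, (1, 2): 1.0, (2, 3): 1.0}       # d_ij: edge lengths on a toy 4-node chain
n, sigma = 4, 1.0

W = np.zeros((n, n))
for (i, j), d in edges.items():
    W[i, j] = W[j, i] = 1.0 / d                        # weight w_ij = 1 / d_ij
laplacian = np.diag(W.sum(axis=1)) - W                 # graph Laplacian
cov = np.linalg.inv(laplacian + np.eye(n) / sigma**2)  # covariance of the Gaussian prior

rng = np.random.default_rng(0)
y = rng.multivariate_normal(np.zeros(n), cov)          # smooth real-valued property over the graph
h = (y > 0).astype(int)                                # threshold to a binary hypothesis h
print(y, h)
```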

[Diagram: structure S (a tree over Species 1 through Species 10) generates the observed feature data D.] 85 features for 50 animals (Osherson et al.): e.g., for Elephant: ‘gray’, ‘hairless’, ‘toughskin’, ‘big’, ‘bulbous’, ‘longleg’, ‘tail’, ‘chewteeth’, ‘tusks’, ‘smelly’, ‘walks’, ‘slow’, ‘strong’, ‘muscle’, ‘fourlegs’,…

[c.f., Lawrence, 2004; Smola & Kondor 2003]

[Diagram: structure S (a tree over Species 1 through Species 10) generates the data D: the observed features plus a new, partially observed property.] 85 features for 50 animals (Osherson et al.): e.g., for Elephant: ‘gray’, ‘hairless’, ‘toughskin’, ‘big’, ‘bulbous’, ‘longleg’, ‘tail’, ‘chewteeth’, ‘tusks’, ‘smelly’, ‘walks’, ‘slow’, ‘strong’, ‘muscle’, ‘fourlegs’,…

Property induction (Kemp & Tenenbaum): "Given that X1, …, Xn have property P, how likely is it that Y does?" [Scatter plots comparing tree-based and 2D spatial models against human judgments, for biological arguments (conclusions such as "all mammals", "horses") and spatial arguments (cities such as Minneapolis, Houston).]

[Scatter plots: tree model versus 2D model predictions against human judgments, for arguments such as "Gorillas, mice, seals have property P → all mammals have property P" and "Cows, elephants, horses have property P → all mammals have property P".]

Reasoning about spatially varying properties “Native American artifacts” task

Property types and the theories (structure plus stochastic process) that capture them:
– "has T9 hormones": taxonomic tree + diffusion process.
– "can bite through wire": directed chain + drift process.
– "carry E. Spirus bacteria": directed network + noisy transmission.
[Figure: example hypotheses over classes A through G for each structure.]

Reasoning with two property types (Shafto, Kemp, Bonawitz, Coley & Tenenbaum): "Given that X has property P, how likely is it that Y does?" [Figure: the same species (kelp, herring, tuna, sand shark, mako shark, dolphin, human) organized as a taxonomic tree for biological properties and as a food web for disease properties.]

Summary so far A framework for modeling human inductive reasoning as rational statistical inference over structured knowledge representations –Qualitatively different priors are appropriate for different domains of property induction. –In each domain, a prior that matches the world’s structure fits people’s judgments well, and better than alternative priors. –A language for representing different theories: graph structure defined over objects + probabilistic model for the distribution of properties over that graph. Remaining question: How can we learn appropriate theories for different domains?

Hierarchical Bayesian Framework. [Diagram: the form F may now be a tree, a chain, or a low-dimensional space; each form generates a structure S over the same species (mouse, squirrel, chimp, gorilla), which generates the feature data D (F1–F4).]

Discovering structural forms. [Figure: the same seven animals (ostrich, robin, crocodile, snake, bat, orangutan, turtle) organized under different candidate structures.]

Discovering structural forms. [Figure: the animals arranged in a Linnaean tree versus a "great chain of being" linear order that also includes plant, rock, angel, and God.]

People can discover structural forms Scientists –Tree structure for living kinds (Linnaeus) –Periodic structure for chemical elements (Mendeleev) Children –Hierarchical structure of category labels –Clique structure of social groups –Cyclical structure of seasons or days of the week –Transitive structure for value

People can discover structural forms. Scientific discoveries: tree structure for biological species (e.g., the "great chain of being" depicted in 1579; Linnaeus's Systema Naturae, 1735, with the hierarchy Kingdom Animalia, Phylum Chordata, Class Mammalia, Order Primates, Family Hominidae, Genus Homo, Species Homo sapiens; an evolutionary tree sketched in 1837); periodic structure for chemical elements. Children's cognitive development: hierarchical structure of category labels; clique structure of social groups; cyclical structure of seasons or days of the week; transitive structure for value.

Typical structure learning algorithms assume a fixed structural form:
– Flat clusters: K-means, mixture models, competitive learning.
– Line: Guttman scaling, ideal point models.
– Tree: hierarchical clustering, Bayesian phylogenetics.
– Circle: circumplex models.
– Euclidean space: MDS, PCA, factor analysis.
– Grid: self-organizing maps, generative topographic mapping.

The ultimate goal: a "Universal Structure Learner" that takes data and chooses an appropriate representation (K-means, hierarchical clustering, factor analysis, Guttman scaling, circumplex models, self-organizing maps, ···).

A “universal grammar” for structural forms Form Process

Node-replacement graph grammars. [Figure: a production rule for the "line" form, and successive steps of a derivation that grows a chain by repeatedly applying it.]
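As a rough illustration of the idea, here is a toy sketch of one node-replacement production that grows a chain (the "line" form). The actual grammars in this work use richer, form-specific productions over cluster nodes; the names below are invented.

```python
# Toy line-form production: replace a node by an edge node--new_node.
def split_node(graph, node, new):
    """Existing neighbours stay attached to `node`; `new` attaches only to `node`."""
    graph[new] = {node}
    graph[node].add(new)

graph = {"n0": set()}                       # one-node seed
for k in range(1, 5):
    split_node(graph, f"n{k-1}", f"n{k}")   # repeatedly apply the production
print(graph)                                # adjacency sets of a 5-node chain
```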

Hierarchical Bayesian Framework. [Diagram: F (form) → S (structure) → D (data over species mouse, squirrel, chimp, gorilla and features F1–F4); P(S | F) favors simplicity, and P(D | S) favors smoothness (Zhu et al., 2003).]

Model fitting. Evaluate each form in parallel. For each form, run a heuristic search over structures based on greedy growth from a one-node seed, as sketched below.
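A hedged sketch of what such a greedy search loop might look like. `posterior_score` and `candidate_moves` are placeholders standing in for the actual marginal likelihood P(D | S) P(S | F) and the form-specific growth moves; this is not the authors' code.

```python
def greedy_search(data, seed_structure, candidate_moves, posterior_score):
    """Grow a structure from a one-node seed, keeping any move that raises the score."""
    structure = seed_structure
    best = posterior_score(structure, data)
    improved = True
    while improved:
        improved = False
        for move in candidate_moves(structure):      # e.g., split one cluster node in two
            candidate = move(structure)
            score = posterior_score(candidate, data)
            if score > best:
                structure, best, improved = candidate, score, True
    return structure, best
```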

Synthetic 2D data. Data: continuous features drawn from a Gaussian field over points arranged as flat clusters, a line, a ring, a tree, and a grid. Model selection results: [bar charts of log posterior probabilities for each candidate form (flat, line, ring, tree, grid) on each dataset].

[Figure: the structures recovered under each candidate form (flat, line, ring, tree, grid), with their scores, compared against the true generating structure.]

Structural forms from relational data:
– Primate troop, "x beats y": dominance hierarchy.
– Bush administration, "x told y": tree.
– Prison inmates, "x likes y": cliques.
– Kula islands, "x trades with y": ring.

Development of structural forms as more data are observed

Beyond “Nativism” versus “Empiricism” “Nativism”: Explicit knowledge of structural forms for core domains is innate. –Atran (1998): The tendency to group living kinds into hierarchies reflects an “innately determined cognitive structure”. –Chomsky (1980): “The belief that various systems of mind are organized along quite different principles leads to the natural conclusion that these systems are intrinsically determined, not simply the result of common mechanisms of learning or growth.” “Empiricism”: General-purpose learning systems without explicit knowledge of structural form. –Connectionist networks (e.g., Rogers and McClelland, 2004). –Traditional structure learning in probabilistic graphical models.

Summary. Bayesian inference over hierarchies of structured representations provides a framework to understand core questions of human cognition: –What is the content and form of human knowledge, at multiple levels of abstraction? –How does abstract domain knowledge guide learning of new concepts? –How is abstract domain knowledge learned? What must be built in? –How can domain-general learning mechanisms acquire domain-specific representations? –How can probabilistic inference work together with symbolic, flexibly structured representations? [Diagram: F (form) → S (structure) → D (data).]

[Diagram (cf. Chater & Manning, 2006): "Universal Grammar" → grammar (hierarchical phrase structure grammars, e.g., CFG, HPSG, TAG) → phrase structure → utterance → speech signal, with the levels linked by P(grammar | UG), P(phrase structure | grammar), P(utterance | phrase structure), and P(speech | utterance).]
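To illustrate the generative direction of this hierarchy (grammar → phrase structure → utterance), here is a toy probabilistic context-free grammar sampler. The grammar itself is invented for illustration and is not one discussed in the talk.

```python
import random

grammar = {                                    # made-up PCFG: nonterminal -> [(rhs, probability)]
    "S":  [(["NP", "VP"], 1.0)],
    "NP": [(["the", "N"], 1.0)],
    "VP": [(["V", "NP"], 0.5), (["V"], 0.5)],
    "N":  [(["dog"], 0.5), (["cat"], 0.5)],
    "V":  [(["chased"], 0.5), (["slept"], 0.5)],
}

def sample(symbol):
    """Expand a symbol top-down; any symbol without a rule is treated as a terminal word."""
    if symbol not in grammar:
        return [symbol]
    rules, weights = zip(*grammar[symbol])
    rhs = random.choices(rules, weights=weights)[0]
    return [word for s in rhs for word in sample(s)]

print(" ".join(sample("S")))                   # e.g., "the dog chased the cat"
```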

Vision as probabilistic parsing (Han & Zhu, 2006; cf. Zhu, Chen & Yuille, NIPS 2006).

Learning word meanings. [Diagram: abstract principles (whole-object principle, shape bias, taxonomic principle, contrast principle, basic-level bias) → structure → data.]

Word learning: Bayesian inference over a tree-structured hypothesis space (Xu & Tenenbaum; Schmidt & Tenenbaum). Example: "tufa".

Learning causal relations (Griffiths, Tenenbaum, Kemp et al.). [Diagram: abstract principles → structure → data.]

First-order probabilistic theories for causal inference

(Mansinghka, Kemp, Tenenbaum & Griffiths, UAI 2006) [Figure: the true structure of a graphical model G; an abstract theory consisting of classes Z of variables (c1, c2) and class-level edge probabilities, together with the graph G, is recovered from data D as the number of samples grows.]

Learning grounded causal models (Goodman, Mansinghka & Tenenbaum). A child learns that petting the cat leads to purring, while pounding leads to growling. But how are the symbolic event concepts over which causal links are defined learned in the first place? [Diagram: candidate groundings of events a, b, c.]

Goal-directed action (production and comprehension) (Wolpert et al., 2003)

Goal inference as inverse probabilistic planning (Baker, Tenenbaum & Saxe). [Diagram: constraints and goals feed a rational planning process, modeled as a (PO)MDP, which produces actions; inverting this model yields goal inferences. Scatter plot: model predictions versus human judgments.]
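A minimal one-dimensional sketch of the inverse-planning idea: infer which goal an agent is heading toward from its observed moves, assuming a noisily rational (softmax) policy. The track, candidate goals, and rationality parameter are invented for illustration; this is not the Baker et al. model itself.

```python
import numpy as np

goals = [0, 9]                                   # candidate goal locations on a 1-D track
beta = 2.0                                       # rationality (softmax inverse temperature)

def action_prob(pos, action, goal):
    """Softmax over actions (-1, +1) scored by how close each move leaves the agent to the goal."""
    scores = {a: -abs((pos + a) - goal) for a in (-1, +1)}
    exps = {a: np.exp(beta * s) for a, s in scores.items()}
    return exps[action] / sum(exps.values())

observed = [(5, +1), (6, +1), (7, +1)]           # (position, action): agent keeps moving right
posterior = np.ones(len(goals)) / len(goals)     # uniform prior over goals
for pos, a in observed:
    posterior = posterior * np.array([action_prob(pos, a, g) for g in goals])
posterior /= posterior.sum()
print(dict(zip(goals, posterior)))               # goal 9 should receive nearly all the probability
```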

The big picture What we need to understand: the mind’s ability to build rich models of the world from sparse data. –Learning about objects, categories, and their properties. –Causal inference –Understanding other people’s actions, plans, thoughts, goals –Language comprehension and production –Scene understanding What do we need to understand these abilities? –Bayesian inference in probabilistic generative models –Hierarchical models, with inference at all levels of abstraction –Structured representations: graphs, grammars, logic –Flexible representations, growing in response to observed data

The chicken-and-egg problem of structure learning and feature selection. A raw data matrix:

The chicken-and-egg problem of structure learning and feature selection. Conventional clustering (CRP mixture):

Learning multiple structures to explain different feature subsets (Shafto, Kemp, Mansinghka, Gordon & Tenenbaum, 2006). CrossCat: [figure: System 1, System 2, System 3, each a different partition of the objects explaining a different subset of features].

The "nonparametric safety-net". [Figure: as in the previous model, an abstract theory (classes Z) and graph G are learned from data D as the number of samples grows, here for a true graphical model G whose structure need not conform to any simple class structure.]

Bayesian prediction. What is the best guess for t_total? Compute the t* such that P(t_total > t* | t_past) = 0.5. The posterior is P(t_total | t_past) ∝ (1/t_total) × P(t_total), combining the random-sampling likelihood with a domain-dependent prior. We compared the median of the Bayesian posterior with the median of subjects' judgments… but what about the distribution of subjects' judgments?

Sources of individual differences. Individuals' judgments could be noisy. Individuals' judgments could be optimal, but with different priors: e.g., each individual has seen only a sparse sample of the relevant population of events. Individuals' inferences about the posterior could be optimal, but their judgments could be based on probability (or utility) matching rather than maximizing.

Individual differences in prediction. [Figure: posterior P(t_total | t_past); calibration plot of the proportion of judgments falling below the predicted value versus the quantile of the Bayesian posterior distribution.]

Individual differences in prediction. Average over all prediction tasks: movie run times, movie grosses, poem lengths, life spans, terms in Congress, cake baking times. [Figure: calibration of judgments against quantiles of P(t_total | t_past), averaged across tasks.]

Individual differences in concept learning

Why probability matching? Optimal behavior under some (evolutionarily natural) circumstances: –Optimal betting theory, portfolio theory –Optimal foraging theory –Competitive games –Dynamic tasks (changing probabilities or utilities). A side-effect of algorithms for approximating complex Bayesian computations: –Markov chain Monte Carlo (MCMC): instead of integrating over complex hypothesis spaces, construct a sample of high-probability hypotheses. –Judgments from individual (independent) samples can on average be almost as good as using the full posterior distribution.

Markov chain Monte Carlo (Metropolis-Hastings algorithm)
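A bare-bones sketch of the Metropolis-Hastings algorithm on a one-dimensional target; the log density, step size, and sample count are placeholders chosen only to make the example run.

```python
import numpy as np

def metropolis_hastings(log_target, x0, n_samples, step=0.5, seed=0):
    """Random-walk MH: sample from an unnormalized density instead of integrating over it."""
    rng = np.random.default_rng(seed)
    samples, x = [], x0
    for _ in range(n_samples):
        proposal = x + rng.normal(0.0, step)                  # symmetric proposal
        if np.log(rng.uniform()) < log_target(proposal) - log_target(x):
            x = proposal                                      # accept
        samples.append(x)                                     # otherwise keep the current state
    return np.array(samples)

log_posterior = lambda x: -0.5 * (x - 2.0) ** 2               # e.g., an unnormalized Gaussian
draws = metropolis_hastings(log_posterior, x0=0.0, n_samples=5000)
print(draws.mean(), draws.std())                              # roughly 2.0 and 1.0
```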

Bayesian inference in perception and sensorimotor integration (Weiss, Simoncelli & Adelson, 2002; Kording & Wolpert, 2004).

Everyday prediction problems (Griffiths & Tenenbaum, 2006). You read about a movie that has made $60 million to date. How much money will it make in total? You see that something has been baking in the oven for 34 minutes. How long until it's ready? You meet someone who is 78 years old. How long will they live? Your friend quotes to you from line 17 of his favorite poem. How long is the poem? You meet a US congressman who has served for 11 years. How long will he serve in total? In general: you encounter a phenomenon or event with an unknown extent or duration t_total at a random time or value t < t_total. What is the total extent or duration t_total?

Bayesian analysis: p(t_total | t) ∝ p(t | t_total) p(t_total) ∝ (1/t_total) p(t_total). Assume random sampling: p(t | t_total) = 1/t_total for 0 < t < t_total, and 0 otherwise. What form should p(t_total) take? e.g., the uninformative (Jeffreys) prior ∝ 1/t_total.

Bayesian inference. Posterior: P(t_total | t_past) ∝ (1/t_total) × (1/t_total), combining the random-sampling likelihood with the "uninformative" prior. The best guess for t_total is the t* such that P(t_total > t* | t_past) = 0.5. This yields Gott's Rule: guess t_total = 2 t_past.
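The one-line derivation behind Gott's Rule, written out from the assumptions on this slide (random sampling plus the 1/t_total prior):

```latex
p(t_{\mathrm{total}} \mid t_{\mathrm{past}})
  \;\propto\; \frac{1}{t_{\mathrm{total}}}\cdot\frac{1}{t_{\mathrm{total}}}
  \;=\; t_{\mathrm{total}}^{-2},
  \qquad t_{\mathrm{total}} \ge t_{\mathrm{past}},
\qquad
P(t_{\mathrm{total}} > t^{*} \mid t_{\mathrm{past}})
  \;=\; \frac{\int_{t^{*}}^{\infty} u^{-2}\,du}{\int_{t_{\mathrm{past}}}^{\infty} u^{-2}\,du}
  \;=\; \frac{t_{\mathrm{past}}}{t^{*}} \;=\; \tfrac{1}{2}
  \;\Rightarrow\; t^{*} = 2\,t_{\mathrm{past}} .
```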

Evaluating Gott’s Rule You read about a movie that has made $78 million to date. How much money will it make in total? –“$156 million” seems reasonable. You meet someone who is 35 years old. How long will they live? –“70 years” seems reasonable. Not so simple: –You meet someone who is 78 years old. How long will they live? –You meet someone who is 6 years old. How long will they live?

The importance of priors. Different kinds of priors P(t_total) are appropriate in different domains. Gott's rule corresponds to the prior P(t_total) ∝ t_total^(-1).

Evaluating human predictions. Different domains with different priors: –A movie has made $60 million –Your friend quotes from line 17 of a poem –You meet a 78-year-old man –A movie has been running for 55 minutes –A U.S. congressman has served for 11 years –A cake has been in the oven for 34 minutes. Use 5 values of t_past for each. People predict t_total.

Priors P(t_total) were based on empirically measured durations or magnitudes for many real-world events in each class. Median human judgments of the total duration or magnitude t_total of events in each class, given that they are first observed at a duration or magnitude t, versus Bayesian predictions (median of P(t_total | t)).

You learn that in ancient Egypt, there was a great flood in the 11th year of a pharaoh’s reign. How long did he reign?

You learn that in ancient Egypt, there was a great flood in the 11th year of a pharaoh's reign. How long did he reign? How long did the typical pharaoh reign in ancient Egypt?

Summary: prediction Predictions about the extent or magnitude of everyday events follow Bayesian principles. Contrast with Bayesian inference in perception, motor control, memory: no “universal priors” here. Predictions depend rationally on priors that are appropriately calibrated for different domains. –Form of the prior (e.g., power-law or exponential) –Specific distribution given that form (parameters) –Non-parametric distribution when necessary. In the absence of concrete experience, priors may be generated by qualitative background knowledge.

Learning concepts from examples. Property induction: e.g., "Cows have T9 hormones. Sheep have T9 hormones. Goats have T9 hormones. → All mammals have T9 hormones." versus "Cows have T9 hormones. Seals have T9 hormones. Squirrels have T9 hormones. → All mammals have T9 hormones." Word learning: "tufa".

Clustering models for relational data Social networks: block models Does person x respect person y? Does prisoner x like prisoner y?

Learning systems of concepts with infinite relational models (Kemp, Tenenbaum, Griffiths, Yamada & Ueda, AAAI 2006). [Diagram: concepts and predicates jointly clustered.] Biomedical predicate data from UMLS (McCrae et al.): –134 concepts: enzyme, hormone, organ, disease, cell function... –49 predicates: affects(hormone, organ), complicates(enzyme, cell function), treats(drug, disease), diagnoses(procedure, disease) …
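A hedged sketch of the two ingredients of such a model: a Chinese restaurant process (CRP) prior over partitions of concepts, and a Beta-Bernoulli likelihood scoring how densely each pair of clusters is linked by a binary predicate. The hyperparameters and the tiny relation matrix are illustrative; the real model jointly infers partitions over all concepts and predicates in the data.

```python
import numpy as np
from scipy.special import betaln

def crp_log_prior(assignments, alpha=1.0):
    """Log probability of a partition (list of cluster ids, in order) under the CRP."""
    logp, counts = 0.0, {}
    for i, z in enumerate(assignments):
        n_z = counts.get(z, 0)
        logp += np.log((n_z if n_z > 0 else alpha) / (i + alpha))  # join existing vs. start new cluster
        counts[z] = n_z + 1
    return logp

def block_log_likelihood(relation, assignments):
    """Marginal likelihood of a binary relation given the partition, Beta(1, 1) prior per block."""
    z = np.asarray(assignments)
    logp = 0.0
    for a in np.unique(z):
        for b in np.unique(z):
            block = relation[np.ix_(z == a, z == b)]
            ones, total = block.sum(), block.size
            logp += betaln(1 + ones, 1 + total - ones) - betaln(1, 1)
    return logp

relation = np.array([[1, 1, 0, 0],                 # e.g., "x likes y" with two obvious cliques
                     [1, 1, 0, 0],
                     [0, 0, 1, 1],
                     [0, 0, 1, 1]])
for z in ([0, 0, 1, 1], [0, 1, 0, 1]):
    print(z, crp_log_prior(z) + block_log_likelihood(relation, z))  # the clique partition scores higher
```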

Learning a medical ontology: e.g., Diseases affect Organisms; Chemicals interact with Chemicals; Chemicals cause Diseases.

Clustering arbitrary relational systems International relations circa 1965 (Rummel) –14 countries: UK, USA, USSR, China, …. –54 binary relations representing interactions between countries: exports to( USA, UK ), protests( USA, USSR ), …. –90 (dynamic) country features: purges, protests, unemployment, communists, # languages, assassinations, ….

Learning a hierarchical ontology

Relational data. [Diagram: F (form: "people cluster into cliques") → S (structure) → D (data: the "x likes y" matrix).]

[Scatter plots: human judgments of argument strength versus model predictions and versus similarity-based models. Each point represents one argument: X1 have property P. X2 have property P. X3 have property P. → All mammals have property P.]