Bayesian models as a tool for revealing inductive biases Tom Griffiths University of California, Berkeley

Inductive problems Learning languages from utterances (e.g., strings like “blicket toma”, “dax wug”, “blicket wug” generated by the grammar S → X Y, X → {blicket, dax}, Y → {toma, wug}) Learning functions from (x, y) pairs Learning categories from instances of their members

Revealing inductive biases Many problems in cognitive science can be formulated as problems of induction –learning languages, concepts, and causal relations Such problems are not solvable without bias (e.g., Goodman, 1955; Kearns & Vazirani, 1994; Vapnik, 1995) What biases guide human inductive inferences? How can computational models be used to investigate human inductive biases?

Models and inductive biases [figure: models arranged by how transparent their inductive biases are]

Bayesian models (Reverend Thomas Bayes)

Bayes’ theorem: P(h|d) = P(d|h) P(h) / Σ_h′ P(d|h′) P(h′), where h is a hypothesis and d is data; P(h|d) is the posterior probability, P(d|h) the likelihood, P(h) the prior probability, and the denominator sums over the space of hypotheses.
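A minimal sketch of this computation over a finite hypothesis space; the function name and the dict-based representation are illustrative choices, not anything from the talk:

```python
def posterior(prior, likelihood, data):
    """Bayes' theorem over a finite hypothesis space.

    prior: dict mapping hypothesis -> P(h)
    likelihood: function (data, hypothesis) -> P(d | h)
    Returns a dict mapping hypothesis -> P(h | d).
    """
    unnormalized = {h: likelihood(data, h) * p for h, p in prior.items()}
    z = sum(unnormalized.values())  # sum over the space of hypotheses
    return {h: v / z for h, v in unnormalized.items()}
```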

Three advantages of Bayesian models Transparent identification of inductive biases through hypothesis space, prior, and likelihood Opportunity to explore a range of biases expressed in terms that are natural to the problem at hand Rational statistical inference provides an upper bound on human inferences from data

Two examples Causal induction from small samples (Josh Tenenbaum, David Sobel, Alison Gopnik) Statistical learning and word segmentation (Sharon Goldwater, Mark Johnson)


Blicket detector (Dave Sobel, Alison Gopnik, and colleagues) See this? It’s a blicket machine. Blickets make it go. Let’s put this one on the machine. Oooh, it’s a blicket!

“One cause” (Gopnik, Sobel, Schulz, & Glymour, 2001) –Two objects: A and B –Trial 1: A and B on detector – detector activates –Trial 2: B alone on detector – detector does not activate –4-year-olds judge whether each object is a blicket: A: a blicket (100% say yes); B: almost certainly not a blicket (16% say yes)

Hypotheses: causal models (Pearl, 2000; Spirtes, Glymour, & Scheines, 1993) Each hypothesis is a causal graph over A, B, and the detector’s activation E, and defines a probability distribution over the variables (for both observation and intervention). Four candidate structures: neither object causes E; A → E only; B → E only; both A → E and B → E.

Prior and likelihood: causal theory Prior probability an object is a blicket is q –defines a distribution over causal models Detectors have a deterministic “activation law” –always activate if a blicket is on the detector –never activate otherwise (Tenenbaum & Griffiths, 2003; Griffiths, 2005)

Prior and likelihood: causal theory Under the deterministic activation law, each causal model sets P(E=1 | A, B) = 1 exactly when a blicket is on the detector, and 0 otherwise: –h00 (neither a blicket), prior P(h00) = (1 – q)²: the detector never activates –h10 (only A), prior P(h10) = q(1 – q): activates iff A is on it –h01 (only B), prior P(h01) = (1 – q)q: activates iff B is on it –h11 (both), prior P(h11) = q²: activates iff A or B is on it

Modeling “one cause” Before any data, all four causal models are live, with priors P(h00) = (1 – q)², P(h10) = q(1 – q), P(h01) = (1 – q)q, and P(h11) = q².

Modeling “one cause” Trial 1 (A and B on the detector, detector activates) rules out h00, which predicts no activation; h10, h01, and h11 survive, with relative probabilities q(1 – q), (1 – q)q, and q².

Modeling “one cause” Trial 2 (B alone on the detector, detector inactive) rules out h01 and h11, which both predict activation, leaving only h10: A is definitely a blicket, and B is definitely not a blicket.
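A sketch of the whole “one cause” computation; the tuple encoding of hypotheses and the value q = 1/3 are illustrative assumptions:

```python
from itertools import product

q = 1/3  # illustrative prior probability that any object is a blicket

# Hypotheses: (A is a blicket, B is a blicket); priors from independent baserates
prior = {(a, b): (q if a else 1 - q) * (q if b else 1 - q)
         for a, b in product([True, False], repeat=2)}

def likelihood(trials, hypothesis):
    """Deterministic activation law: the detector fires iff a blicket is on it."""
    a_blicket, b_blicket = hypothesis
    p = 1.0
    for (a_on, b_on), active in trials:
        predicted = (a_on and a_blicket) or (b_on and b_blicket)
        p *= 1.0 if predicted == active else 0.0
    return p

trials = [((True, True), True),    # Trial 1: A and B on detector -> active
          ((False, True), False)]  # Trial 2: B alone -> inactive

unnorm = {h: likelihood(trials, h) * p for h, p in prior.items()}
z = sum(unnorm.values())
post = {h: v / z for h, v in unnorm.items()}

print(sum(p for (a, _), p in post.items() if a))  # P(A is a blicket) -> 1.0
print(sum(p for (_, b), p in post.items() if b))  # P(B is a blicket) -> 0.0
```

Running it reproduces the elimination above: only h10 survives the two trials.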

“One cause” (Gopnik, Sobel, Schulz, & Glymour, 2001) –Two objects: A and B –Trial 1: A and B on detector – detector activates –Trial 2: B alone on detector – detector does not activate –4-year-olds judge whether each object is a blicket: A: a blicket (100% say yes); B: almost certainly not a blicket (16% say yes)

Building on this analysis [figure: the models-and-inductive-biases spectrum shown earlier]

Other physical systems From stick-ball machines… …to lemur colonies (Kushnir, Schulz, Gopnik, & Danks, 2003) (Griffiths, Baraff, & Tenenbaum, 2004) (Griffiths & Tenenbaum, 2007)

Two examples Causal induction from small samples (Josh Tenenbaum, David Sobel, Alison Gopnik) Statistical learning and word segmentation (Sharon Goldwater, Mark Johnson)

Bayesian segmentation In the domain of segmentation, we have: –Data: unsegmented corpus (transcriptions). –Hypotheses: sequences of word tokens. The optimal solution is the segmentation with highest posterior probability; since the likelihood P(d|h) = 1 if concatenating the hypothesized words forms the corpus and 0 otherwise, this is the consistent segmentation with highest prior probability P(h), which encodes assumptions about the structure of language.
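A brute-force sketch of this search space for a tiny string; the character-level toy prior is an illustrative stand-in for the linguistic priors discussed next:

```python
def segmentations(s):
    """All ways to split s into words (the hypothesis space for one utterance)."""
    for mask in range(2 ** (len(s) - 1)):
        words, start = [], 0
        for i in range(len(s) - 1):
            if mask & (1 << i):
                words.append(s[start:i + 1])
                start = i + 1
        words.append(s[start:])
        yield words

def prior(words, p_char=1/26, p_stop=0.3):
    """Toy prior: each word generated character by character with a stopping
    probability, so fewer and shorter words get higher probability."""
    p = 1.0
    for w in words:
        p *= (p_char * (1 - p_stop)) ** (len(w) - 1) * p_char * p_stop
    return p

# Likelihood is 1 for every hypothesis yielded by segmentations(s), 0 otherwise,
# so the best posterior is simply the best prior among consistent segmentations.
print(max(segmentations("thedog"), key=prior))  # -> ['thedog']
```

With this toy prior, fewer words always win, so the utterance stays one word; a usable prior must trade that pressure against reusing a small lexicon across utterances, which is what the models below do.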

Brent (1999) Describes a Bayesian unigram model for segmentation. –Prior favors solutions with fewer words, shorter words. Problems with Brent’s system: –Learning algorithm is approximate (non-optimal). –Difficult to extend to incorporate bigram info.

A new unigram model (Dirichlet process) Assume word w_i is generated as follows: 1. Is w_i a novel lexical item? Under the Dirichlet process, P(novel) = α/(n + α) and P(not novel) = n/(n + α), where n is the number of words generated so far. Fewer word types = higher probability.

A new unigram model (Dirichlet process) Assume word w_i is generated as follows: 2. If novel, generate a phonemic form x_1…x_m with P(w_i = x_1…x_m) = ∏_j P(x_j): shorter words = higher probability. If not novel, choose the lexical identity of w_i in proportion to how often each word has occurred before: frequent words are reused, yielding a power-law distribution over word frequencies (power law = higher probability).
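A generative sketch of this two-step process, viewing the Dirichlet process through its Chinese-restaurant-process form; the concentration parameter α, the toy phoneme inventory, and the stopping probability are illustrative, not the paper’s settings:

```python
import random
from collections import Counter

alpha = 10.0                # DP concentration parameter (illustrative)
phonemes = list("abcdefg")  # toy phoneme inventory (illustrative)
p_stop = 0.5                # per-phoneme stop probability: shorter words favored

counts = Counter()          # token counts of previously generated words
n = 0                       # total number of words generated so far

def generate_word():
    global n
    if random.random() < alpha / (n + alpha):
        # Step 2a, novel lexical item: build a phonemic form x1...xm
        w = random.choice(phonemes)
        while random.random() > p_stop:
            w += random.choice(phonemes)
    else:
        # Step 2b, reuse: pick proportionally to past frequency (rich get richer)
        words, freqs = zip(*counts.items())
        w = random.choices(words, weights=freqs)[0]
    counts[w] += 1
    n += 1
    return w

corpus = [generate_word() for _ in range(50)]
print(corpus)
```

The rich-get-richer reuse step is what produces the power-law word-frequency distribution the slide refers to.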

Unigram model: simulations Same corpus as Brent (Bernstein-Ratner, 1987) : –9790 utterances of phonemically transcribed child-directed speech (19-23 months). –Average utterance length: 3.4 words. –Average word length: 2.9 phonemes. Example input: yuwanttusiD6bUk lUkD*z6b7wIThIzh&t &nd6dOgi yuwanttulUk&tDIs...

Example results

What happened? Model assumes (falsely) that words have the same probability regardless of context. Positing amalgams allows the model to capture word-to-word dependencies: P(D&t) = .024, but P(D&t | WAts) = .46 and P(D&t | tu) = .0019.

What about other unigram models? Brent’s learning algorithm is insufficient to identify the optimal segmentation. –Our solution has higher probability under his model than his own solution does. –On randomly permuted corpus, our system achieves 96% accuracy; Brent gets 81%. Formal analysis shows undersegmentation is the optimal solution for any (reasonable) unigram model.

Bigram model (hierarchical Dirichlet process) Assume word w_i is generated as follows: 1. Is (w_{i-1}, w_i) a novel bigram? 2. If novel, generate w_i using the unigram model (almost). If not, choose the lexical identity of w_i from words previously occurring after w_{i-1}.

Example results

Conclusions Both adults and children are sensitive to the nature of mechanisms in using covariation Both adults and children can use covariation to make inferences about the nature of mechanisms Bayesian inference provides a formal framework for understanding how statistics and knowledge interact in making these inferences –how theories constrain hypotheses, and are learned

A probabilistic mechanism? Children in Gopnik et al. (2001) who said that B was a blicket had seen evidence that the detector was probabilistic –one block activated the detector 5/6 times Replace the deterministic “activation law”… –activate with probability 1 – ε if a blicket is on the detector –never activate otherwise

Deterministic vs. probabilistic [figure: probability that each object is a blicket in the “one cause” condition, under deterministic vs. probabilistic detectors] Mechanism knowledge affects interpretation of contingency data.

Manipulating mechanisms I. Familiarization phase: establish the nature of the mechanism (the same block is placed on the detector repeatedly). II. Test phase: “one cause” trials (Trial 1: A and B on the detector; Trial 2: B alone). At the end of the test phase, adults judge the probability that each object is a blicket.

Manipulating mechanisms Expose participants to different kinds of mechanism –deterministic: detector always activates –probabilistic: detector activates with probability 1 – ε Test with “one cause” trials Model makes two qualitative predictions: –people will infer the nature of the mechanism –evaluation of B as a blicket will increase with the probabilistic mechanism (Griffiths & Sobel, submitted)
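A sketch of the second prediction, replacing the deterministic law with the noisy one; ε = 1/6 echoes the 5/6 activation rate mentioned earlier, and q = 1/3 is again an illustrative baserate:

```python
from itertools import product

def p_blicket_B(eps, q=1/3):
    """Posterior that B is a blicket after the 'one cause' trials,
    under an activation law that fires with prob 1 - eps for blickets."""
    prior = {(a, b): (q if a else 1 - q) * (q if b else 1 - q)
             for a, b in product([True, False], repeat=2)}
    trials = [((True, True), True),    # A and B on detector -> active
              ((False, True), False)]  # B alone -> inactive
    post = {}
    for (a_bl, b_bl), p in prior.items():
        for (a_on, b_on), active in trials:
            blicket_present = (a_on and a_bl) or (b_on and b_bl)
            p_active = (1 - eps) if blicket_present else 0.0
            p *= p_active if active else (1 - p_active)
        post[(a_bl, b_bl)] = p
    z = sum(post.values())
    return sum(p for (_, b), p in post.items() if b) / z

print(p_blicket_B(eps=0.0))  # deterministic: B cannot be a blicket -> 0.0
print(p_blicket_B(eps=1/6))  # probabilistic: B keeps some probability (~0.08)
```

The inactive B-alone trial is fatal to “B is a blicket” only when ε = 0; with a noisy detector it merely lowers the probability, matching the prediction above.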

Manipulating mechanisms (n = 12 undergraduates per condition) [figure: probability of being a blicket in the “one cause” condition, Bayesian model vs. people, deterministic vs. probabilistic detectors]

Manipulating mechanisms (n = 12 undergraduates per condition) [figure: probability of being a blicket for “one cause”, “one control”, and “three control” trials, Bayesian model vs. people, deterministic vs. probabilistic detectors]

Acquiring mechanism knowledge I. Familiarization phase: establish the nature of the mechanism (the same block is placed on the detector repeatedly). II. Test phase: “one cause” trials. At the end of the test phase, adults judge the probability that each object is a blicket.

Learning causal theories Apply Bayes’ rule as before: P(T|d) ∝ P(d|T) P(T), summing over causal structures h_j to get P(d|T) = Σ_j P(d|h_j) P(h_j|T).
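A sketch of this theory-level inference under strong simplifying assumptions: only two candidate theories (deterministic, ε = 0, vs. probabilistic, ε = 1/6), and familiarization data from a single known blicket, which collapses the sum over structures h_j to one term:

```python
def p_data_given_theory(eps, activations, trials):
    """P(d | T) for familiarization with a single known blicket:
    under theory T the detector fires with prob 1 - eps on each trial."""
    return ((1 - eps) ** activations) * (eps ** (trials - activations))

theories = {"deterministic": 0.0, "probabilistic": 1/6}
prior_T = {name: 0.5 for name in theories}  # uniform prior over theories

like = {name: p_data_given_theory(eps, activations=5, trials=6)
        for name, eps in theories.items()}
z = sum(like[name] * prior_T[name] for name in theories)
post_T = {name: like[name] * prior_T[name] / z for name in theories}
print(post_T)  # one failed activation gives the deterministic theory probability 0
```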

Results with children Tested 24 four-year-olds (mean age 54 months) Instead of a rating, children gave a yes/no response Significant difference in “one cause” B responses –deterministic: 8% say yes –probabilistic: 79% say yes No significant difference in “one control” trials –deterministic: 4% say yes –probabilistic: 21% say yes (Griffiths & Sobel, submitted)

“Backwards blocking” (Sobel, Tenenbaum & Gopnik, 2004) –Two objects: A and B –Trial 1: A and B on detector – detector activates –Trial 2: A alone on detector – detector activates –4-year-olds judge whether each object is a blicket: A: a blicket (100% of judgments); B: probably not a blicket (66% of judgments)

Manipulating plausibility I. Pre-training phase: establish the baserate for blickets (q). II. Backwards blocking phase: Trial 1: A and B on the detector; Trial 2: A alone. After each trial, adults judge the probability that each object is a blicket.
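A sketch of why the baserate q matters here: under the deterministic law, the backwards blocking trials leave only the “A alone” and “A and B” hypotheses, so the posterior that B is a blicket falls back to q itself. The function below is illustrative:

```python
def p_blicket_B_backwards_blocking(q):
    """Posterior that B is a blicket after AB+ then A+ trials,
    with a deterministic activation law and blicket baserate q."""
    # AB+ rules out 'neither'; A+ rules out 'B alone'. Survivors:
    p_A_only = q * (1 - q)
    p_A_and_B = q * q
    return p_A_and_B / (p_A_only + p_A_and_B)  # algebraically equal to q

for q in (0.1, 0.3, 0.5):
    print(q, p_blicket_B_backwards_blocking(q))  # each prints q back
```

So manipulating the pre-trained baserate q should directly shift judgments about B, which is what the plausibility manipulation tests.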

[figure: judged probability that each object is a blicket initially, after the AB trial, and after the A trial]

Comparison to previous results Proposed boundaries are more accurate than Brent’s (higher boundary precision), but fewer proposals are made (lower boundary recall). Result: word tokens are less accurate. Token F-score: Brent .68, GGJ .54. Precision: #correct / #found [= hits / (hits + false alarms)]. Recall: #correct / #true [= hits / (hits + misses)]. F-score: an average (harmonic mean) of precision and recall.
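The evaluation measures in code form, a direct rendering of the definitions above; representing boundaries or word tokens as sets is a simplification:

```python
def precision_recall_f(found, true):
    """found, true: sets of items, e.g. boundary positions
    or (start, end) word-token spans."""
    hits = len(found & true)
    precision = hits / len(found)  # #correct / #found
    recall = hits / len(true)      # #correct / #true
    f_score = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f_score

# Example: boundaries proposed at positions {3, 7} vs. gold boundaries {3, 5, 7}
print(precision_recall_f({3, 7}, {3, 5, 7}))  # -> (1.0, 0.666..., 0.8)
```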

Quantitative evaluation Compared to the unigram model, the bigram model proposes more boundaries, with no loss in accuracy. Accuracy is also higher than previous models such as Brent’s unigram system: GGJ (bigram) reaches a token F-score of .77 and a type F-score of .63.

Two examples Causal induction from small samples (Josh Tenenbaum, David Sobel, Alison Gopnik) Statistical learning and word segmentation (Sharon Goldwater, Mark Johnson)