Priors and predictions in everyday cognition. Tom Griffiths, Cognitive and Linguistic Sciences, Brown University

(Diagram: data → behavior.) What computational problem is the brain solving? Do optimal solutions to that problem help to explain human behavior?

Inductive problems Inferring structure from data Perception –e.g. structure of 3D world from 2D visual data (figure: the same 2D image as data, with hypotheses such as a cube or a shaded hexagon)

Inductive problems Inferring structure from data Perception –e.g. structure of 3D world from 2D data Cognition –e.g. form of causal relationship from samples

Reverend Thomas Bayes

Bayes’ theorem: p(h|d) = p(d|h) p(h) / Σ_h' p(d|h') p(h'), where h is a hypothesis and d is data. The posterior probability is the likelihood times the prior probability, divided by a sum over the space of hypotheses.

Bayes’ theorem: p(h|d) ∝ p(d|h) p(h) (h: hypothesis, d: data)
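
A minimal numerical sketch of the theorem in action (my illustration, not from the slides; the hypotheses, data, and numbers are invented for the example): computing a posterior over two hypotheses about a coin.

```python
import numpy as np

# Two hypotheses about a coin: fair (p = 0.5) or biased toward heads (p = 0.8).
# Data d: 8 heads in 10 flips. All numbers here are illustrative assumptions.
prior = np.array([0.9, 0.1])             # p(h) for [fair, biased]
likelihood = np.array([0.0439, 0.3020])  # p(d | h): Binomial(10, 0.5) and Binomial(10, 0.8) at 8 heads
posterior = likelihood * prior           # numerator of Bayes' theorem
posterior /= posterior.sum()             # normalize by the sum over hypotheses
print(posterior)                         # approximately [0.57, 0.43]
```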

Perception is optimal Körding & Wolpert (2004)

Cognition is not

Do people use priors? Standard answer: no (Tversky & Kahneman, 1974)

Explaining inductive leaps How do people –infer causal relationships –identify the work of chance –predict the future –assess similarity and make generalizations –learn functions, languages, and concepts... from such limited data?

Explaining inductive leaps How do people –infer causal relationships –identify the work of chance –predict the future –assess similarity and make generalizations –learn functions, languages, and concepts... from such limited data? What knowledge guides human inferences?

Prior knowledge matters when… …using a single datapoint –predicting the future …using secondhand data –effects of priors on cultural transmission

Outline …using a single datapoint –predicting the future –joint work with Josh Tenenbaum (MIT) …using secondhand data –effects of priors on cultural transmission –joint work with Mike Kalish (Louisiana) Conclusions

Predicting the future How often is Google News updated? t = time since last update, t_total = time between updates. What should we guess for t_total given t?

Everyday prediction problems You read about a movie that has made $60 million to date. How much money will it make in total? You see that something has been baking in the oven for 34 minutes. How long until it’s ready? You meet someone who is 78 years old. How long will they live? Your friend quotes to you from line 17 of his favorite poem. How long is the poem? You see taxicab #107 pull up to the curb in front of the train station. How many cabs in this city?

Making predictions You encounter a phenomenon that has existed for t units of time. How long will it continue into the future? (i.e. what is t_total?) We could replace “time” with any other variable that ranges from 0 to some unknown upper limit.

Bayesian inference p(t_total | t) ∝ p(t | t_total) p(t_total) (posterior probability ∝ likelihood × prior)

Bayesian inference p(t_total | t) ∝ p(t | t_total) p(t_total) ∝ (1/t_total) p(t_total), assuming t is a random sample from (0 < t < t_total), so the likelihood is p(t | t_total) = 1/t_total (posterior ∝ likelihood × prior)

Bayesian inference p(t_total | t) ∝ p(t | t_total) p(t_total) ∝ (1/t_total)(1/t_total), combining the random-sampling likelihood (0 < t < t_total) with the “uninformative” prior p(t_total) ∝ 1/t_total (Gott, 1993)

Bayesian inference What is the best guess for t_total? How about the maximal value of p(t_total | t)? Under random sampling and the “uninformative” prior, p(t_total | t) ∝ (1/t_total)(1/t_total) is maximized at t_total = t.

Bayesian inference What is the best guess for t_total? Instead, compute t* such that p(t_total > t* | t) = 0.5, using the posterior p(t_total | t) ∝ (1/t_total)(1/t_total) from random sampling and the “uninformative” prior.

Bayesian inference Setting P(t_total > t* | t) = 0.5 under random sampling and the “uninformative” prior yields Gott’s Rule: t* = 2t, i.e., the best guess for t_total is 2t.
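
The algebra behind the rule is not spelled out on the slides, but it follows directly from the posterior above (my reconstruction): normalizing p(t_total | t) ∝ 1/t_total² over t_total ≥ t gives p(t_total | t) = t / t_total², and the posterior median then satisfies

```latex
\[
  P(t_{\mathrm{total}} > t^{*} \mid t)
    = \int_{t^{*}}^{\infty} \frac{t}{t_{\mathrm{total}}^{2}} \, dt_{\mathrm{total}}
    = \frac{t}{t^{*}},
  \qquad
  \frac{t}{t^{*}} = \frac{1}{2} \;\Rightarrow\; t^{*} = 2t.
\]
```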

Applying Gott’s rule t ≈ 4,000 years, so t* ≈ 8,000 years

Applying Gott’s rule t ≈ 130,000 years, so t* ≈ 260,000 years

Predicting everyday events You read about a movie that has made $78 million to date. How much money will it make in total? –“$156 million” seems reasonable You meet someone who is 35 years old. How long will they live? –“70 years” seems reasonable Not so simple: –You meet someone who is 78 years old. How long will they live? –You meet someone who is 6 years old. How long will they live?

The effects of priors Different kinds of priors p(t_total) are appropriate in different domains. Gott: p(t_total) ∝ 1/t_total

The effects of priors Different kinds of priors p(t_total) are appropriate in different domains: heavy-tailed, power-law priors (e.g., wealth, contacts) versus more sharply peaked, roughly Gaussian priors (e.g., height, lifespan).

The effects of priors
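
A minimal sketch of how the prediction function depends on the prior (my illustration, not the original analysis code): it computes the posterior median of t_total given t under the random-sampling likelihood 1/t_total, once with a power-law prior and once with a roughly Gaussian prior. The helper name and the Gaussian’s mean and spread (75 and 16) are illustrative assumptions, not values from the talk.

```python
import numpy as np

def posterior_median(t, prior, grid_max=10_000.0, n=200_000):
    """Posterior median of t_total given observed t, with likelihood p(t | t_total) = 1/t_total."""
    t_total = np.linspace(t, grid_max, n)        # support: t_total >= t
    post = prior(t_total) / t_total              # p(t_total | t) ∝ p(t_total) / t_total
    cdf = np.cumsum(post)
    cdf /= cdf[-1]                               # normalize on the grid
    return t_total[np.searchsorted(cdf, 0.5)]

power_law = lambda x: 1.0 / x                                   # heavy-tailed prior
gaussian = lambda x: np.exp(-0.5 * ((x - 75.0) / 16.0) ** 2)    # roughly Gaussian prior (illustrative parameters)

for t in [10.0, 40.0, 70.0]:
    print(t, round(posterior_median(t, power_law), 1), round(posterior_median(t, gaussian), 1))
# Power-law prior: predictions scale with t (roughly 2t, as in Gott's rule).
# Gaussian prior: predictions stay near the prior mean until t approaches it.
```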

Evaluating human predictions Different domains with different priors: –A movie has made $60 million –Your friend quotes from line 17 of a poem –You meet a 78-year-old man –A movie has been running for 55 minutes –A U.S. congressman has served for 11 years –A cake has been in the oven for 34 minutes Five values of t were used for each domain; people predicted t_total

(Results figure: human predictions compared with Bayesian predictions based on a parametric prior, an empirical prior, and Gott’s rule.)

You learn that in ancient Egypt, there was a great flood in the 11th year of a pharaoh’s reign. How long did he reign?

How long did the typical pharaoh reign in ancient Egypt?

…using a single datapoint People produce accurate predictions for the duration and extent of everyday events Strong prior knowledge –form of the prior (power-law or exponential) –distribution given that form (parameters) –non-parametric distribution when necessary Reveals a surprising correspondence between probabilities in the mind and in the world

Outline …using a single datapoint –predicting the future –joint work with Josh Tenenbaum (MIT) …using secondhand data –effects of priors on cultural transmission –joint work with Mike Kalish (Louisiana) Conclusions

Cultural transmission Most knowledge is based on secondhand data Some things can only be learned from others –language –religious concepts How do priors affect cultural transmission?

Iterated learning (Briscoe, 1998; Kirby, 2001) Each learner sees data, forms a hypothesis, and produces the data given to the next learner (data → learning → hypothesis → production → data → …); cf. the playground game “telephone”

Explaining linguistic universals Human languages are a subset of all logically possible communication schemes –universal properties common to all languages (Comrie, 1981; Greenberg, 1963; Hawkins, 1988) Two questions: –why do linguistic universals exist? –why are particular properties universal?

Explaining linguistic universals Traditional answer: –linguistic universals reflect innate constraints specific to a system for acquiring language Alternative answer: –iterated learning imposes “information bottleneck” –universal properties survive this bottleneck (Briscoe, 1998; Kirby, 2001)

Analyzing iterated learning What are the consequences of iterated learning? (Table contrasting simulations with analytic results, and complex with simple learning algorithms: simulations with complex algorithms include Kirby (2001), Brighton (2002), and Smith, Kirby, & Brighton (2003); analytic results exist for simple algorithms, e.g. Komarova, Niyogi, & Nowak (2002); analytic results for complex algorithms are the open question, marked “?”.)

Iterated Bayesian learning Learners are rational Bayesian agents –covers a wide range of learning algorithms Defines a Markov chain on (h, d) pairs: d_0 → h_1 → d_1 → h_2 → …, where each h_{n+1} is sampled from p(h | d_n) (inference) and each d_{n+1} from p(d | h_{n+1}) (production/sampling)

Analytic results The stationary distribution of the Markov chain is the joint distribution p(d, h) = p(d | h) p(h) defined by the prior. Convergence under easily checked conditions. Rate of convergence is geometric –iterated learning is a Gibbs sampler on p(d, h) –the Gibbs sampler converges geometrically (Liu, Wong, & Kong, 1995)

Analytic results The stationary distribution of the Markov chain is the joint distribution p(d, h) = p(d | h) p(h). Corollaries: –the distribution over hypotheses converges to the prior p(h) –the distribution over data converges to the prior predictive p(d) –the proportion of a population of iterated learners with hypothesis h converges to p(h)
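
A one-step check of the stationarity claim (my derivation, not shown on the slides): if (h_n, d_n) is distributed according to p(d | h) p(h), then applying inference followed by production preserves that distribution, because the marginal of the next hypothesis is

```latex
\[
  \sum_{d_n} p(h_{n+1} \mid d_n) \sum_{h_n} p(d_n \mid h_n)\, p(h_n)
    = \sum_{d_n} p(h_{n+1} \mid d_n)\, p(d_n)
    = \sum_{d_n} p(d_n \mid h_{n+1})\, p(h_{n+1})
    = p(h_{n+1}),
\]
```

so h_{n+1} is again drawn from the prior, and sampling d_{n+1} from p(d | h_{n+1}) returns the pair to the joint distribution p(d | h) p(h).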

An example: Gaussians If we assume… –data, d, is a single real number, x –hypotheses, h, are means of a Gaussian, μ –prior, p(μ), is Gaussian(μ_0, σ_0²) …then p(x_{n+1} | x_n) is Gaussian(μ_n, σ_x² + σ_n²)

With μ_0 = 0, σ_0² = 1, and x_0 = 20, iterated learning results in rapid convergence to the prior

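A minimal simulation sketch of this Gaussian example (my code, not the original; the likelihood variance σ_x² = 1 is an assumption, since the slide does not state it): each learner computes the conjugate posterior on μ from the previous learner's x, samples a hypothesis, and generates data for the next learner.

```python
import numpy as np

rng = np.random.default_rng(0)

mu0, var0 = 0.0, 1.0   # prior on the Gaussian mean (from the slide)
var_x = 1.0            # likelihood variance; assumed here, not given on the slide
x = 20.0               # first learner's data (from the slide)

chain = [x]
for _ in range(20):
    # Inference: conjugate Gaussian posterior on mu given the single observation x
    post_var = 1.0 / (1.0 / var0 + 1.0 / var_x)
    post_mean = post_var * (mu0 / var0 + x / var_x)
    mu = rng.normal(post_mean, np.sqrt(post_var))  # learner samples a hypothesis
    # Production: generate data for the next learner
    x = rng.normal(mu, np.sqrt(var_x))
    chain.append(x)

print(np.round(chain, 2))  # values fall quickly from 20 toward the prior mean of 0
```
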
Implications for linguistic universals Two questions: –why do linguistic universals exist? –why are particular properties universal? Different answers: –existence explained through iterated learning –universal properties depend on the prior Focuses inquiry on the priors of the learners

A method for discovering priors Iterated learning converges to the prior… …so we can evaluate priors by reproducing iterated learning in the laboratory

Iterated function learning Assume –data, d, are pairs of real numbers (x, y) –hypotheses, h, are functions An example: linear regression –hypotheses have slope θ and pass through the origin –p(θ) is Gaussian(θ_0, σ_0²) (Diagram: a line through the origin with slope θ; at x = 1 its height y equals θ.)

θ_0 = 1, σ_0² = 0.1, y_0 = −1

Function learning in the lab (interface: stimulus, response slider, feedback) Examine iterated learning with different initial data

(Results figure: responses plotted by iteration for different initial data.)

…using secondhand data Iterated Bayesian learning converges to the prior Constrains explanations of linguistic universals Provides a method for evaluating priors –concepts, causal relationships, languages, … Open questions in Bayesian language evolution –variation in priors –other selective pressures

Outline …using a single datapoint –predicting the future …using secondhand data –effects of priors on cultural transmission Conclusions

Bayes’ theorem A unifying principle for explaining inductive inferences

Bayes’ theorem inference = f(data,knowledge)

Bayes’ theorem inference = f(data,knowledge) A means of evaluating the priors that inform those inferences

Explaining inductive leaps How do people –infer causal relationships –identify the work of chance –predict the future –assess similarity and make generalizations –learn functions, languages, and concepts... from such limited data? What knowledge guides human inferences?

Markov chain Monte Carlo Sample from a Markov chain that converges to the target distribution Allows sampling from an unnormalized posterior distribution Can compute approximate statistics from intractable distributions (MacKay, 2002)

Markov chain Monte Carlo States of the chain are the variables of interest; the transition matrix is chosen to give the target distribution as the stationary distribution. (Diagram: a chain of states x → x → … → x.) Transition matrix: P(x^(t+1) | x^(t)) = T(x^(t), x^(t+1))

Gibbs sampling A particular choice of proposal distribution (for single-component Metropolis-Hastings). For variables x = (x_1, x_2, …, x_n), draw x_i^(t+1) from P(x_i | x_-i), where x_-i = (x_1^(t+1), x_2^(t+1), …, x_(i-1)^(t+1), x_(i+1)^(t), …, x_n^(t)) (a.k.a. the heat bath algorithm in statistical physics)
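
A minimal, self-contained sketch of the Gibbs sampler described above (my illustration, not from the slides): sampling from a bivariate Gaussian with correlation ρ by alternately drawing each variable from its conditional distribution.

```python
import numpy as np

# Target: bivariate Gaussian, zero means, unit variances, correlation rho.
# Its full conditionals are Gaussian: x1 | x2 ~ N(rho*x2, 1 - rho^2), and symmetrically for x2 | x1.
rho = 0.8
rng = np.random.default_rng(1)

x1, x2 = 0.0, 0.0
samples = []
for t in range(5000):
    x1 = rng.normal(rho * x2, np.sqrt(1 - rho ** 2))  # draw x1 from P(x1 | x2)
    x2 = rng.normal(rho * x1, np.sqrt(1 - rho ** 2))  # draw x2 from P(x2 | x1)
    samples.append((x1, x2))

samples = np.array(samples[500:])                      # discard burn-in
print(np.corrcoef(samples.T)[0, 1])                    # sample correlation, close to rho = 0.8
```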

Gibbs sampling (MacKay, 2002)