Markov chain Monte Carlo with people Tom Griffiths Department of Psychology Cognitive Science Program UC Berkeley with Mike Kalish, Stephan Lewandowsky, and Adam Sanborn

Inductive problems Learning languages from utterances (blicket toma, dax wug, blicket wug; S → X Y, X → {blicket, dax}, Y → {toma, wug}) Learning functions from (x,y) pairs Learning categories from instances of their members

Computational cognitive science Identify the underlying computational problem Find the optimal solution to that problem Compare human cognition to that solution For inductive problems, solutions come from statistics

Statistics and inductive problems Cognitive science Categorization Causal learning Function learning Language … Statistics Density estimation Graphical models Regression Probabilistic grammars …

Statistics and human cognition How can we use statistics to understand cognition? How can cognition inspire new statistical models? –applications of Dirichlet process and Pitman-Yor process models to natural language –exchangeable distributions on infinite binary matrices via the Indian buffet process (priors on causal structure) –nonparametric Bayesian models for relational data

Are people Bayesian? Reverend Thomas Bayes

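Bayes’ theorem $P(h \mid d) = \frac{P(d \mid h)\,P(h)}{\sum_{h'} P(d \mid h')\,P(h')}$ Posterior probability: $P(h \mid d)$ Likelihood: $P(d \mid h)$ Prior probability: $P(h)$ Denominator: sum over space of hypotheses h: hypothesis d: data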
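As a small illustration of the theorem, the sketch below computes a posterior over two hypotheses. The hypothesis names, prior values, and coin-flip data are invented for the example and do not come from the talk.

```python
def posterior(prior, likelihood):
    """Return P(h|d) for each hypothesis h, given P(h) and P(d|h)."""
    joint = {h: likelihood[h] * prior[h] for h in prior}
    evidence = sum(joint.values())  # sum over the space of hypotheses
    return {h: joint[h] / evidence for h in joint}

# Hypothetical example: is a coin fair, or biased towards heads?
prior = {"fair": 0.9, "biased": 0.1}
# P(d|h) for observing five heads in a row under each hypothesis
likelihood = {"fair": 0.5 ** 5, "biased": 0.9 ** 5}

post = posterior(prior, likelihood)
print(post)
```

Even with a strong prior for the fair coin, five heads shift most of the posterior mass to the biased hypothesis in this toy setup.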

People are stupid

Predicting the future How often is Google News updated? t = time since last update t_total = time between updates What should we guess for t_total given t?

The effects of priors

Evaluating human predictions Different domains with different priors: –a movie has made $60 million [power-law] –your friend quotes from line 17 of a poem [power-law] –you meet a 78 year old man [Gaussian] –a movie has been running for 55 minutes [Gaussian] –a U.S. congressman has served for 11 years [Erlang] Prior distributions derived from actual data Use 5 values of t for each People predict t_total
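A minimal sketch of how such a prediction could be computed, under assumptions that are not stated on the slide: t is taken to be equally likely to fall anywhere between 0 and t_total, so P(t | t_total) = 1/t_total for t_total ≥ t, and the prediction is the posterior median. The function name, the grid, and the prior parameters are illustrative only.

```python
import math

def predict_t_total(t, prior, grid):
    """Posterior median of t_total given t, over a grid of candidate values."""
    # P(t_total | t) is proportional to P(t | t_total) * P(t_total), with
    # P(t | t_total) = 1 / t_total when t_total >= t and 0 otherwise.
    weights = [prior(T) / T if T >= t else 0.0 for T in grid]
    half = 0.5 * sum(weights)
    running = 0.0
    for T, w in zip(grid, weights):
        running += w
        if running >= half:
            return T
    return grid[-1]

# Candidate t_total values from 0.05 to 5000 (an arbitrary truncation).
grid = [0.05 * i for i in range(1, 100001)]

def power_law(T):
    """Unnormalized prior proportional to 1 / t_total."""
    return 1.0 / T

def gaussian(T, mean=75.0, sd=15.0):
    """Unnormalized Gaussian prior; mean and sd are made-up values."""
    return math.exp(-0.5 * ((T - mean) / sd) ** 2)

power_law_prediction = predict_t_total(10.0, power_law, grid)
gaussian_prediction = predict_t_total(78.0, gaussian, grid)
print(power_law_prediction, gaussian_prediction)
```

With the 1/t_total prior the median lands close to 2t, which is the prediction usually called Gott's rule; with the Gaussian prior the prediction stays only modestly above t.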

[Figure legend: people, parametric prior, empirical prior, Gott’s rule]

A different approach… Instead of asking whether people are rational, use assumption of rationality to investigate cognition If we can predict people’s responses, we can design experiments that measure psychological variables

Two deep questions What are the biases that guide human learning? –prior probability distribution P(h) What do mental representations look like? –category distribution P(x|c)

Two deep questions What are the biases that guide human learning? –prior probability distribution on hypotheses, P(h) What do mental representations look like? –distribution over objects x in category c, P(x|c) Develop ways to sample from these distributions

Outline Markov chain Monte Carlo Sampling from the prior Sampling from category distributions

Markov chains Variables $x^{(t+1)}$ independent of history given $x^{(t)}$ Transition matrix $T = P(x^{(t+1)} \mid x^{(t)})$ Converges to a stationary distribution under easily checked conditions (i.e., if it is ergodic)
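Convergence to a stationary distribution can be seen by repeatedly applying a transition matrix to a starting distribution. The three-state matrix below is a made-up example, written with the convention that row i holds P(next state | current state i).

```python
def step(dist, T):
    """One step of the chain: new_dist[j] = sum_i dist[i] * T[i][j]."""
    n = len(dist)
    return [sum(dist[i] * T[i][j] for i in range(n)) for j in range(n)]

# Hypothetical ergodic chain on three states; each row sums to 1.
T = [
    [0.9, 0.1, 0.0],
    [0.2, 0.6, 0.2],
    [0.0, 0.3, 0.7],
]

dist = [1.0, 0.0, 0.0]  # start with all mass on state 0
for _ in range(200):
    dist = step(dist, T)

print(dist)  # approaches the stationary distribution (6/11, 3/11, 2/11)
```

Starting from any other initial distribution gives the same limit, which is the property MCMC relies on.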

Markov chain Monte Carlo Sample from a target distribution P(x) by constructing Markov chain for which P(x) is the stationary distribution Two main schemes: –Gibbs sampling –Metropolis-Hastings algorithm

Gibbs sampling For variables $x = x_1, x_2, \ldots, x_n$ and target $P(x)$ Draw $x_i^{(t+1)}$ from $P(x_i \mid x_{-i})$, where $x_{-i} = x_1^{(t+1)}, x_2^{(t+1)}, \ldots, x_{i-1}^{(t+1)}, x_{i+1}^{(t)}, \ldots, x_n^{(t)}$

Gibbs sampling (MacKay, 2002)
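A minimal sketch of Gibbs sampling for a case where the conditionals are easy to write down: a bivariate standard normal with correlation rho, for which each variable given the other is Normal(rho * other, 1 - rho^2). The target, the function name, and the parameter values are chosen for illustration and are not from the talk.

```python
import random

def gibbs_bivariate_normal(rho, n_samples, seed=0):
    """Alternately draw x1 | x2 and x2 | x1 for a correlated standard normal."""
    rng = random.Random(seed)
    sd = (1.0 - rho ** 2) ** 0.5  # conditional standard deviation
    x1, x2 = 0.0, 0.0
    samples = []
    for _ in range(n_samples):
        x1 = rng.gauss(rho * x2, sd)  # uses the current x2
        x2 = rng.gauss(rho * x1, sd)  # uses the freshly updated x1
        samples.append((x1, x2))
    return samples

samples = gibbs_bivariate_normal(rho=0.8, n_samples=20000)
mean_x1 = sum(a for a, _ in samples) / len(samples)
mean_product = sum(a * b for a, b in samples) / len(samples)
print(mean_x1, mean_product)  # roughly 0 and rho
```

Each update conditions on the most recent value of the other variable, matching the definition of x_{-i} in the slide.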

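Metropolis-Hastings algorithm (Metropolis et al., 1953; Hastings, 1970) Step 1: propose a state (we assume symmetrically) $Q(x^{(t+1)} \mid x^{(t)}) = Q(x^{(t)} \mid x^{(t+1)})$ Step 2: decide whether to accept, with probability given by an acceptance function Metropolis acceptance function: $A(x^{(t)}, x^{(t+1)}) = \min\left(1, \frac{p(x^{(t+1)})}{p(x^{(t)})}\right)$ Barker acceptance function: $A(x^{(t)}, x^{(t+1)}) = \frac{p(x^{(t+1)})}{p(x^{(t+1)}) + p(x^{(t)})}$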
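The two steps can be sketched as a random-walk sampler in which the acceptance function is a parameter, so the Metropolis and Barker rules can be swapped. The Gaussian target, the step size, and all function names below are assumptions made for the example.

```python
import math
import random

def metropolis_accept(p_new, p_old):
    """Metropolis rule: always accept uphill moves."""
    return min(1.0, p_new / p_old)

def barker_accept(p_new, p_old):
    """Barker rule: accept in proportion to relative probability."""
    return p_new / (p_new + p_old)

def mh_sample(p, x0, n_samples, accept, step=1.0, seed=0):
    """Random-walk sampler with a symmetric Gaussian proposal."""
    rng = random.Random(seed)
    x = x0
    samples = []
    for _ in range(n_samples):
        proposal = x + rng.gauss(0.0, step)         # Step 1: propose
        if rng.random() < accept(p(proposal), p(x)):  # Step 2: accept?
            x = proposal
        samples.append(x)  # a rejected proposal repeats the current state
    return samples

def target(x):
    """Unnormalized Gaussian density with mean 3 and standard deviation 1."""
    return math.exp(-0.5 * (x - 3.0) ** 2)

metropolis_samples = mh_sample(target, 0.0, 20000, metropolis_accept)
barker_samples = mh_sample(target, 0.0, 20000, barker_accept)
print(sum(metropolis_samples[1000:]) / 19000, sum(barker_samples[1000:]) / 19000)
```

Both rules leave the target distribution stationary; only the ratio p(proposal)/p(current) is needed, so the target does not have to be normalized.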

Metropolis-Hastings algorithm [Figure: successive proposals on a target density p(x), including steps with $A(x^{(t)}, x^{(t+1)}) = 0.5$ and $A(x^{(t)}, x^{(t+1)}) = 1$]

Outline Markov chain Monte Carlo Sampling from the prior Sampling from category distributions

Iterated learning (Kirby, 2001) What are the consequences of learners learning from other learners?

Analyzing iterated learning $P_L(h \mid d)$: probability of inferring hypothesis h from data d $P_P(d \mid h)$: probability of generating data d from hypothesis h [Diagram: a chain of learners alternating $P_L(h \mid d)$ and $P_P(d \mid h)$]

Iterated Bayesian learning Assume learners sample from their posterior distribution: $P_L(h \mid d) = \frac{P_P(d \mid h)\,P(h)}{\sum_{h'} P_P(d \mid h')\,P(h')}$

Analyzing iterated learning $d_0 \to h_1 \to d_1 \to h_2 \to d_2 \to h_3 \to \cdots$, alternating $P_L(h \mid d)$ and $P_P(d \mid h)$ A Markov chain on hypotheses: $h_1 \to h_2 \to h_3 \to \cdots$ with transition probability $\sum_d P_P(d \mid h)\,P_L(h \mid d)$ A Markov chain on data: $d_0 \to d_1 \to d_2 \to \cdots$ with transition probability $\sum_h P_L(h \mid d)\,P_P(d \mid h)$

Stationary distributions Markov chain on h converges to the prior, P(h) Markov chain on d converges to the “prior predictive distribution” (Griffiths & Kalish, 2005)

Explaining convergence to the prior Intuitively: data acts once, prior many times Formally: iterated learning with Bayesian agents is a Gibbs sampler on P(d,h) (Griffiths & Kalish, in press)
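The convergence claim can be checked numerically on a toy problem. In the sketch below each simulated learner samples a hypothesis from its posterior given the previous learner's data, then generates data for the next learner. The three hypotheses, three data values, and all probabilities are invented for illustration.

```python
import random

def iterated_learning(prior, likelihood, data_values, n_iter, seed=0):
    """Run a chain of posterior-sampling learners; return hypothesis frequencies."""
    rng = random.Random(seed)
    hyps = list(prior)
    h = rng.choice(hyps)  # arbitrary starting hypothesis
    counts = {g: 0 for g in hyps}
    for _ in range(n_iter):
        # Production: d ~ P_P(d | h)
        d = rng.choices(data_values, weights=[likelihood[h][v] for v in data_values])[0]
        # Learning: h ~ P_L(h | d), proportional to P_P(d | h) * P(h)
        h = rng.choices(hyps, weights=[likelihood[g][d] * prior[g] for g in hyps])[0]
        counts[h] += 1
    return {g: counts[g] / n_iter for g in hyps}

# Hypothetical prior and production probabilities.
prior = {"A": 0.7, "B": 0.2, "C": 0.1}
data_values = [0, 1, 2]
likelihood = {
    "A": {0: 0.6, 1: 0.3, 2: 0.1},
    "B": {0: 0.2, 1: 0.5, 2: 0.3},
    "C": {0: 0.1, 1: 0.2, 2: 0.7},
}

freqs = iterated_learning(prior, likelihood, data_values, n_iter=50000)
print(freqs)  # close to the prior, regardless of the starting hypothesis
```

Alternating the two draws is exactly a two-variable Gibbs sampler on the joint P(d,h), whose marginal over h is the prior.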

Revealing inductive biases Many problems in cognitive science can be formulated as problems of induction –learning languages, concepts, and causal relations Such problems are not solvable without bias (e.g., Goodman, 1955; Kearns & Vazirani, 1994; Vapnik, 1995) What biases guide human inductive inferences? If iterated learning converges to the prior, then it may provide a method for investigating biases

Serial reproduction (Bartlett, 1932) Participants see stimuli, then reproduce them from memory Reproductions of one participant are stimuli for the next Stimuli were interesting, rather than controlled –e.g., “War of the Ghosts”

General strategy Use well-studied and simple stimuli for which people’s inductive biases are known –function learning –concept learning –color words Examine dynamics of iterated learning –convergence to state reflecting biases –predictable path to convergence

Iterated function learning Each learner sees a set of (x,y) pairs Makes predictions of y for new x values Predictions are data for the next learner (Kalish, Griffiths, & Lewandowsky, in press) [Figure labels: data, hypotheses]

Function learning experiments Examine iterated learning with different initial data [Screen labels: stimulus, response slider, feedback]

[Figure: responses by iteration for each set of initial data]

Identifying inductive biases Formal analysis suggests that iterated learning provides a way to determine inductive biases Experiments with human learners support this idea –when stimuli for which biases are well understood are used, those biases are revealed by iterated learning What do inductive biases look like in other cases? –continuous categories –causal structure –word learning –language learning

Statistics and cultural evolution Iterated learning for MAP learners reduces to a form of the stochastic EM algorithm –Monte Carlo EM with a single sample Provides connections between cultural evolution and classic models used in population genetics –MAP learning of multinomials = Wright-Fisher More generally, an account of how products of cultural evolution relate to the biases of learners

Outline Markov chain Monte Carlo Sampling from the prior Sampling from category distributions

Categories are central to cognition

Sampling from categories Frog distribution P(x|c)

A task Ask subjects which of two alternatives comes from a target category Which animal is a frog?

A Bayesian analysis of the task Assume:

Response probabilities If people probability match to the posterior, response probability is equivalent to the Barker acceptance function for target distribution p(x|c)
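The assumptions behind this result did not survive in the transcript. One set of assumptions that yields it, offered as a sketch rather than as the talk's own derivation: exactly one of the two stimuli $x_1, x_2$ comes from category $c$, the other comes from an alternative distribution $g$ (a symbol introduced here), and both assignments are equally likely a priori.

```latex
\begin{align*}
P(x_1 \text{ from } c \mid x_1, x_2)
  &= \frac{p(x_1 \mid c)\, g(x_2)}
          {p(x_1 \mid c)\, g(x_2) + p(x_2 \mid c)\, g(x_1)} \\
  &= \frac{p(x_1 \mid c)}{p(x_1 \mid c) + p(x_2 \mid c)}
  \qquad \text{if } g(x_1) = g(x_2).
\end{align*}
```

The last line has the form of the Barker acceptance function with target $p(x \mid c)$, so a participant who probability matches plays the role of the accept step, with one stimulus as the current state and the other as the proposal.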

Collecting the samples Which is the frog? Trial 1, Trial 2, Trial 3

Verifying the method

Training Subjects were shown schematic fish of different sizes and trained on whether they came from the ocean (uniform) or a fish farm (Gaussian)

Between-subject conditions

Choice task Subjects judged which of the two fish came from the fish farm (Gaussian) distribution

Examples of subject MCMC chains

Estimates from all subjects Estimated means and standard deviations are significantly different across groups Estimated means are accurate, but standard deviation estimates are high –result could be due to perceptual noise or response gain

Sampling from natural categories Examined distributions for four natural categories: giraffes, horses, cats, and dogs Presented stimuli with nine-parameter stick figures (Olman & Kersten, 2004)

Choice task

Samples from Subject 3 (projected onto plane from LDA)

Mean animals by subject [Figure: mean giraffe, horse, cat, and dog for each of subjects S1 to S8]

Marginal densities (aggregated across subjects) Giraffes are distinguished by neck length, body height and body tilt Horses are like giraffes, but with shorter bodies and nearly uniform necks Cats have longer tails than dogs

Relative volume of categories Convex hull content divided by enclosing hypercube content [Figure: convex hull and minimum enclosing hypercube for giraffe, horse, cat, and dog]

Discrimination method (Olman & Kersten, 2004)

Parameter space for discrimination Restricted so that most random draws were animal-like

MCMC and discrimination means

Conclusion Markov chain Monte Carlo provides a way to sample from subjective probability distributions Many interesting questions can be framed in terms of subjective probability distributions –inductive biases (priors) –mental representations (category distributions) Other MCMC methods may provide further empirical methods… –Gibbs for categories, adaptive MCMC, …

A different approach… Instead of asking whether people are rational, use assumption of rationality to investigate cognition If we can predict people’s responses, we can design experiments that measure psychological variables Randomized algorithms ↔ Psychological experiments

From sampling to maximizing [Figure panels: r = 1, r = 2, r = ∞]

From sampling to maximizing General analytic results are hard to obtain –(r = ∞ is Monte Carlo EM with a single sample) For certain classes of languages, it is possible to show that the stationary distribution gives each hypothesis h probability proportional to P(h)^r –the ordering identified by the prior is preserved, but not the corresponding probabilities (Kirby, Dowman, & Griffiths, in press)
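A short numerical sketch of what a stationary distribution proportional to P(h)^r looks like as r grows. The two-hypothesis prior is a made-up example of a weak bias.

```python
def exponentiate(prior, r):
    """Return the distribution proportional to P(h) ** r."""
    weights = {h: p ** r for h, p in prior.items()}
    z = sum(weights.values())
    return {h: w / z for h, w in weights.items()}

# Hypothetical weak bias: h1 is only slightly favored by the prior.
prior = {"h1": 0.6, "h2": 0.4}

stationary = {r: exponentiate(prior, r) for r in (1, 2, 10)}
for r, dist in stationary.items():
    print(r, dist)
```

The ordering of the hypotheses never changes, but the favored hypothesis takes a larger share as r increases, which is the sense in which maximizing exaggerates the prior.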

Implications for linguistic universals When learners sample from P(h|d), the distribution over languages converges to the prior –identifies a one-to-one correspondence between inductive biases and linguistic universals As learners move towards maximizing, the influence of the prior is exaggerated –weak biases can produce strong universals –cultural evolution is a viable alternative to traditional explanations for linguistic universals

Iterated concept learning Each learner sees examples from a species Identifies species of four amoebae Iterated learning is run within-subjects data hypotheses (Griffiths, Christian, & Kalish, in press)

Two positive examples data (d) hypotheses (h)

Bayesian model (Tenenbaum, 1999; Tenenbaum & Griffiths, 2001) d: 2 amoebae h: set of 4 amoebae m: # of amoebae in the set d (= 2) |h|: # of amoebae in the set h (= 4) Posterior is renormalized prior What is the prior?
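The formula for this slide is missing from the transcript. A sketch of why the posterior reduces to a renormalized prior, assuming the size-principle likelihood associated with the cited Tenenbaum work (examples drawn uniformly and independently from the concept); that likelihood is an assumption here, not something the transcript states.

```latex
\begin{align*}
P(d \mid h) &=
  \begin{cases}
    1 / |h|^{m} & \text{if } d \subseteq h \\
    0           & \text{otherwise}
  \end{cases} \\[4pt]
P(h \mid d) &= \frac{P(d \mid h)\, P(h)}{\sum_{h'} P(d \mid h')\, P(h')}
             = \frac{P(h)}{\sum_{h' \,:\, d \subseteq h'} P(h')}
  \qquad \text{for } d \subseteq h .
\end{align*}
```

Because every hypothesis has the same size, $|h| = 4$, the factor $1/4^{m}$ is common to all consistent hypotheses and cancels, leaving the prior restricted to hypotheses that contain the observed amoebae.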

Classes of concepts (Shepard, Hovland, & Jenkins, 1961) Class 1 Class 2 Class 3 Class 4 Class 5 Class 6 [Stimulus dimensions: shape, size, color]

Experiment design (for each subject) 6 iterated learning chains (Class 1 to Class 6) 6 independent learning “chains” (Class 1 to Class 6)

Estimating the prior data (d) hypotheses (h)

Estimating the prior [Figure: prior over concept classes (Class 1, Class 2, Class 3, Class 4, Class 5, ...), Bayesian model and human subjects; the value of r is not recoverable from the transcript]

Two positive examples (n = 20) [Figure: probability by iteration, human learners and Bayesian model]

Two positive examples (n = 20) [Figure: probability, Bayesian model and human learners]

Three positive examples data (d) hypotheses (h)

Three positive examples (n = 20) [Figure: probability by iteration, human learners and Bayesian model]

Three positive examples (n = 20) [Figure: Bayesian model and human learners]

Classification objects

Problems with classification objects [Figure panels: Category 1, Category 2]

Problems with classification objects Convex hull content divided by enclosing hypercube content [Figure: convex hull and minimum enclosing hypercube for giraffe, horse, cat, and dog]

Allowing a Wider Range of Behavior An exponentiated choice rule results in a Markov chain with stationary distribution corresponding to an exponentiated version of the category distribution, proportional to p(x|c) raised to the exponent of the choice rule
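A sketch of why this holds, writing the exponent as $\gamma$ (a symbol chosen here, since the one on the slide is not recoverable) and assuming a symmetric proposal $Q$. The exponentiated choice rule is the Barker rule applied to $p(x \mid c)^{\gamma}$:

```latex
\begin{align*}
A_\gamma(x \to x') &=
  \frac{p(x' \mid c)^{\gamma}}{p(x' \mid c)^{\gamma} + p(x \mid c)^{\gamma}} \\[4pt]
p(x \mid c)^{\gamma}\, Q(x' \mid x)\, A_\gamma(x \to x') &=
  Q(x' \mid x)\,
  \frac{p(x \mid c)^{\gamma}\, p(x' \mid c)^{\gamma}}
       {p(x' \mid c)^{\gamma} + p(x \mid c)^{\gamma}} .
\end{align*}
```

With $Q(x' \mid x) = Q(x \mid x')$ the right-hand side is unchanged when $x$ and $x'$ are swapped, so detailed balance holds for a stationary distribution proportional to $p(x \mid c)^{\gamma}$. Setting $\gamma = 1$ recovers probability matching.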

Category drift For fragile categories, the MCMC procedure could influence the category representation Interleaved training and test blocks in the training experiments