Revealing inductive biases with Bayesian models Tom Griffiths UC Berkeley with Mike Kalish, Brian Christian, and Steve Lewandowsky

Inductive problems Learning languages from utterances (e.g., "blicket toma", "dax wug", "blicket wug", generated by the grammar S → X Y, X → {blicket, dax}, Y → {toma, wug}) Learning functions from (x,y) pairs Learning categories from instances of their members

Generalization requires induction Generalization: predicting the properties of an entity from observed properties of others

What makes a good inductive learner? Hypothesis 1: more representational power –more hypotheses, more complexity –spirit of many accounts of learning and development

Some hypothesis spaces Linear functions Quadratic functions 8th degree polynomials

Minimizing squared error

Measuring prediction error

What makes a good inductive learner? Hypothesis 1: more representational power –more hypotheses, more complexity –spirit of many accounts of learning and development Hypothesis 2: good inductive biases –constraints on hypotheses that match the environment

Outline The bias-variance tradeoff Bayesian inference and inductive biases Revealing inductive biases Conclusions

A simple schema for induction Data D are n pairs (x,y) generated from function f Hypothesis space of functions, y = g(x) Error is E = (y - g(x))² Pick function g that minimizes error on D Measure prediction error, averaging over x and y
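A minimal Python sketch of this schema. The quadratic true function matches the later slides, but the specific coefficients, the noise level, and the use of numpy.polyfit for the least-squares fit are illustrative assumptions, not taken from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # An assumed "true" quadratic function (placeholder coefficients).
    return 1.0 + 2.0 * x - 3.0 * x ** 2

def make_data(n, noise=0.1):
    # Data D: n pairs (x, y), uniform p(x), y = f(x) plus Gaussian noise.
    x = rng.uniform(0, 1, n)
    return x, f(x) + rng.normal(0, noise, n)

def fit_g(x, y, degree):
    # Hypothesis space: polynomials of the given degree; pick the g that
    # minimizes squared error on D (ordinary least squares).
    return np.poly1d(np.polyfit(x, y, degree))

def prediction_error(g, n_test=10_000):
    # Prediction error: average of (y - g(x))^2 over fresh (x, y) pairs.
    x, y = make_data(n_test)
    return np.mean((y - g(x)) ** 2)

x, y = make_data(n=10)
for degree in (1, 2, 8):
    g = fit_g(x, y, degree)
    print(degree, np.mean((y - g(x)) ** 2), prediction_error(g))
```

With only 10 data points, the 8th-degree fit typically achieves the lowest error on D but the highest prediction error, previewing the bias-variance tradeoff discussed next.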

Bias and variance A good learner makes (f(x) - g(x))² small g is chosen on the basis of the data D Evaluate learners by the average of (f(x) - g(x))² over datasets D generated from f, which decomposes as E_D[(f(x) - g(x))²] = (f(x) - E_D[g(x)])² + E_D[(g(x) - E_D[g(x)])²] = bias² + variance (Geman, Bienenstock, & Doursat, 1992)

Making things more intuitive… The next few slides were generated by: –choosing a true function f(x) –generating a number of datasets D from p(x,y) defined by uniform p(x), p(y|x) = f(x) plus noise –finding the function g(x) in the hypothesis space that minimized the error on D Comparing average of g(x) to f(x) reveals bias Spread of g(x) around average is the variance
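A sketch of that procedure, under the same assumed setup as the previous block (quadratic f, uniform p(x), Gaussian noise; all settings are placeholders). Squared bias and variance are averaged over a grid of x values:

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: 1.0 + 2.0 * x - 3.0 * x ** 2   # assumed true function
x_grid = np.linspace(0, 1, 50)

def bias_and_variance(degree, n=10, n_datasets=1000, noise=0.1):
    preds = np.empty((n_datasets, x_grid.size))
    for i in range(n_datasets):
        # Generate a dataset D and fit the error-minimizing g(x) on it.
        x = rng.uniform(0, 1, n)
        y = f(x) + rng.normal(0, noise, n)
        preds[i] = np.poly1d(np.polyfit(x, y, degree))(x_grid)
    avg_g = preds.mean(axis=0)                    # the "average g(x)" curve
    bias_sq = np.mean((f(x_grid) - avg_g) ** 2)   # how far the average is from f
    variance = np.mean(preds.var(axis=0))         # spread of g(x) around its average
    return bias_sq, variance

for degree in (1, 2, 8):
    print(degree, bias_and_variance(degree))
```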

Linear functions (n = 10)

(figure legend: pink is g(x) for each dataset, red is the average g(x), black is f(x); the gap between the red and black curves shows the bias, and the spread of the pink curves around the red one shows the variance)

Quadratic functions (n = 10): pink is g(x) for each dataset, red is the average g(x), black is f(x)

8-th degree polynomials (n = 10): pink is g(x) for each dataset, red is the average g(x), black is f(x)

Bias and variance (for our quadratic f(x), with n = 10): linear functions have high bias and medium variance; quadratic functions have low bias and low variance; 8-th degree polynomials have low bias and super-high variance

In general… Larger hypothesis spaces result in higher variance, but lower bias across a range of possible f(x) The bias-variance tradeoff: –if we want a learner that has low bias on a range of problems, we pay a price in variance This is mainly an issue when n is small –the regime of much of human learning

Quadratic functions (n = 100): pink is g(x) for each dataset, red is the average g(x), black is f(x)

8-th degree polynomials (n = 100): pink is g(x) for each dataset, red is the average g(x), black is f(x)

The moral General-purpose learning mechanisms do not work well with small amounts of data –more representational power isn’t always better To make good predictions from small amounts of data, you need a bias that matches the problem –these biases are the key to successful induction, and characterize the nature of an inductive learner So… how can we identify human inductive biases?

Outline The bias-variance tradeoff Bayesian inference and inductive biases Revealing inductive biases Conclusions

Bayesian inference Reverend Thomas Bayes Rational procedure for updating beliefs Foundation of many learning algorithms Lets us make the inductive biases of learners precise

Bayes’ theorem: P(h|d) = P(d|h) P(h) / Σ_h' P(d|h') P(h'), where h is a hypothesis and d is data; P(h|d) is the posterior probability, P(d|h) is the likelihood, P(h) is the prior probability, and the denominator sums over the space of hypotheses
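A toy numerical illustration of the formula, on a three-hypothesis space with made-up prior and likelihood values:

```python
import numpy as np

hypotheses = ["h1", "h2", "h3"]
prior = np.array([0.5, 0.3, 0.2])          # P(h), toy values
likelihood = np.array([0.1, 0.4, 0.7])     # P(d|h) for the observed data d, toy values

posterior = likelihood * prior
posterior /= posterior.sum()               # divide by the sum over all hypotheses
print(dict(zip(hypotheses, posterior.round(3))))
```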

Priors and biases Priors indicate the kind of world a learner expects to encounter, guiding their conclusions In our function learning example… –the likelihood assigns probabilities to data that decrease with the sum of squared errors (i.e., Gaussian noise) –the priors are uniform over all functions in hypothesis spaces of different kinds of polynomials –having more functions corresponds to a belief in a more complex world…

Outline The bias-variance tradeoff Bayesian inference and inductive biases Revealing inductive biases Conclusions

Two ways of using Bayesian models Specify models that make different assumptions about priors, and compare their fit to human data (Anderson & Schooler, 1991; Oaksford & Chater, 1994; Griffiths & Tenenbaum, 2006) Design experiments explicitly intended to reveal the priors of Bayesian learners

Iterated learning (Kirby, 2001) What are the consequences of learners learning from other learners?

Objects of iterated learning Knowledge communicated across generations through provision of data by learners Examples: –religious concepts –social norms –myths and legends –causal theories –language

Analyzing iterated learning P_L(h|d): probability of inferring hypothesis h from data d P_P(d|h): probability of generating data d from hypothesis h Each learner infers a hypothesis from the previous learner's data via P_L(h|d), then generates data for the next learner via P_P(d|h)

Markov chains A sequence of variables x^(1), x^(2), x^(3), … in which x^(t+1) is independent of the history given x^(t) Transition matrix T = P(x^(t+1)|x^(t)) Converges to a stationary distribution under easily checked conditions (i.e., if the chain is ergodic)
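A small sketch of this convergence, using a made-up three-state transition matrix: repeatedly applying T drives any starting distribution toward the same stationary distribution.

```python
import numpy as np

# T[j, i] = P(x^(t+1) = j | x^(t) = i); each column sums to 1 (toy values).
T = np.array([[0.8, 0.1, 0.3],
              [0.1, 0.7, 0.3],
              [0.1, 0.2, 0.4]])

p = np.array([1.0, 0.0, 0.0])   # start in state 0 with certainty
for _ in range(100):
    p = T @ p                   # one step of the chain
print(p.round(4))               # approximately the stationary distribution
```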

Analyzing iterated learning The sequence of hypotheses h_1, h_2, h_3, … forms a Markov chain on hypotheses, with transition probability Σ_d P_P(d|h) P_L(h'|d) The sequence of data d_0, d_1, d_2, … forms a Markov chain on data, with transition probability Σ_h P_L(h|d) P_P(d'|h)

Iterated Bayesian learning Assume learners sample hypotheses from their posterior distribution: P_L(h|d) = P_P(d|h) P(h) / Σ_h' P_P(d|h') P(h')

Stationary distributions The Markov chain on h converges to the prior, P(h) The Markov chain on d converges to the "prior predictive distribution", P(d) = Σ_h P_P(d|h) P(h) (Griffiths & Kalish, 2005)
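A sketch that checks this claim numerically on a made-up discrete hypothesis space (the prior values and the randomly drawn P_P(d|h) are placeholders): when P_L(h|d) is the Bayesian posterior, the prior is a fixed point of the transition matrix on hypotheses defined above.

```python
import numpy as np

rng = np.random.default_rng(2)
n_h, n_d = 4, 6
prior = np.array([0.4, 0.3, 0.2, 0.1])        # P(h), toy values
pp = rng.dirichlet(np.ones(n_d), size=n_h).T  # P_P(d|h); column h sums to 1

# Learner's posterior: P_L(h|d) proportional to P_P(d|h) P(h).
post = pp * prior
post /= post.sum(axis=1, keepdims=True)       # normalize over h for each d

# Transition matrix on hypotheses: T[h', h] = sum_d P_L(h'|d) P_P(d|h).
T = post.T @ pp

print(np.allclose(T @ prior, prior))          # True: the prior is stationary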

Explaining convergence to the prior Intuitively: the data act once, the prior acts many times Formally: iterated learning with Bayesian agents is a Gibbs sampler on P(d,h) (Griffiths & Kalish, in press)

Revealing inductive biases If iterated learning converges to the prior, it might provide a tool for determining the inductive biases of human learners We can test this by reproducing iterated learning in the lab, with stimuli for which human biases are well understood

Iterated function learning Each learner sees a set of (x,y) pairs Makes predictions of y for new x values Those predictions are the data for the next learner (Kalish, Griffiths, & Lewandowsky, in press)

Function learning experiments (task interface: stimulus, response slider, feedback) Examine iterated learning with different initial data

(figure: learners' responses at each iteration, for chains started from different initial data)

Identifying inductive biases Formal analysis suggests that iterated learning provides a way to determine inductive biases Experiments with human learners support this idea –when stimuli for which biases are well understood are used, those biases are revealed by iterated learning What do inductive biases look like in other cases? –continuous categories –causal structure –word learning –language learning

Outline The bias-variance tradeoff Bayesian inference and inductive biases Revealing inductive biases Conclusions

Solving inductive problems and forming good generalizations requires good inductive biases Bayesian inference provides a way to make assumptions about the biases of learners explicit Two ways to identify human inductive biases: –compare Bayesian models assuming different priors –design tasks to extract biases from Bayesian learners Iterated learning provides a lens for magnifying the inductive biases of learners –small effects for individuals are big effects for groups

Iterated concept learning Each learner sees examples from a species, then identifies the species (a set of four amoebae) Iterated learning is run within-subjects (Griffiths, Christian, & Kalish, in press)

Two positive examples (figure: the data d, two amoebae, and the candidate hypotheses h, sets of four amoebae)

Bayesian model (Tenenbaum, 1999; Tenenbaum & Griffiths, 2001) d: 2 amoebae h: set of 4 amoebae m: number of amoebae in the set d (= 2) |h|: number of amoebae in the set h (= 4) Likelihood given by the size principle: P(d|h) = 1/|h|^m if every amoeba in d belongs to h, and 0 otherwise Since |h| = 4 for every hypothesis, the posterior is the renormalized prior What is the prior?
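A minimal sketch of this model with made-up amoebae and hypotheses. The likelihood is the size principle stated above; the integer labels, the three hypothesis sets, and the prior values are placeholders, since estimating the actual prior is the point of the experiment.

```python
import numpy as np

# Each hypothesis is a species: a set of 4 amoebae, labelled by integers.
hypotheses = [frozenset(s) for s in ([0, 1, 2, 3], [0, 1, 4, 5], [2, 3, 6, 7])]
prior = np.array([0.5, 0.3, 0.2])   # placeholder prior over hypotheses

def posterior(d, hypotheses, prior):
    m = len(d)
    # Size principle: P(d|h) = 1/|h|^m if d is a subset of h, else 0.
    like = np.array([1.0 / len(h) ** m if d <= h else 0.0 for h in hypotheses])
    post = like * prior
    return post / post.sum()        # with |h| fixed at 4, this renormalizes the prior

d = frozenset({0, 1})               # two positive examples
print(posterior(d, hypotheses, prior).round(3))
```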

Classes of concepts (Shepard, Hovland, & Jenkins, 1961) (figure: the six classes of concepts, Classes 1–6, defined over three binary dimensions: shape, size, and color)

Experiment design (for each subject) 6 iterated learning chains and 6 independent learning "chains", one for each class of concepts (Classes 1–6)

Estimating the prior (figure: data d and hypotheses h)

Estimating the prior (figure: estimated prior probability for each of Classes 1–6, Bayesian model vs. human subjects, with the correlation r between them)

Two positive examples (n = 20) (figure: probability of each hypothesis by iteration, human learners vs. Bayesian model)

Two positive examples (n = 20) (figure: Bayesian model vs. human learners)

Three positive examples (figure: data d and hypotheses h)

Three positive examples (n = 20) (figure: probability of each hypothesis by iteration, human learners vs. Bayesian model)

Three positive examples (n = 20) (figure: Bayesian model vs. human learners)

Serial reproduction (Bartlett, 1932) Participants see stimuli, then reproduce them from memory Reproductions of one participant are stimuli for the next Stimuli were interesting, rather than controlled –e.g., “War of the Ghosts”

Discovering the biases of models Generic neural network (figure: iterated learning results)

Discovering the biases of models EXAM (Delosh, Busemeyer, & McDaniel, 1997) (figure: iterated learning results)

Discovering the biases of models POLE (Kalish, Lewandowsky, & Kruschke, 2004) (figure: iterated learning results)