Basic Bayes: model fitting, model selection, model averaging Josh Tenenbaum MIT.

Slides:



Advertisements
Similar presentations
Bayes rule, priors and maximum a posteriori
Advertisements

Basics of Statistical Estimation
A Tutorial on Learning with Bayesian Networks
Flipping A Biased Coin Suppose you have a coin with an unknown bias, θ ≡ P(head). You flip the coin multiple times and observe the outcome. From observations,
1 Methods of Experimental Particle Physics Alexei Safonov Lecture #21.
.. . Parameter Estimation using likelihood functions Tutorial #1 This class has been cut and slightly edited from Nir Friedman’s full course of 12 lectures.
Announcements CS Ice Cream Social 9/5 3:30-4:30, ECCR 265 includes poster session, student group presentations.
CSC321: 2011 Introduction to Neural Networks and Machine Learning Lecture 10: The Bayesian way to fit models Geoffrey Hinton.
Kevin Murphy UBC CS & Stats 9 February 2005
Intro to Bayesian Learning Exercise Solutions Ata Kaban The University of Birmingham 2005.
Probabilistic Models of Cognition Conceptual Foundations Chater, Tenenbaum, & Yuille TICS, 10(7), (2006)
Parameter Estimation using likelihood functions Tutorial #1
Ai in game programming it university of copenhagen Statistical Learning Methods Marco Loog.
The dynamics of iterated learning Tom Griffiths UC Berkeley with Mike Kalish, Steve Lewandowsky, Simon Kirby, and Mike Dowman.
Introduction  Bayesian methods are becoming very important in the cognitive sciences  Bayesian statistics is a framework for doing inference, in a principled.
Bayesian models of inductive learning
Bayesian models of inductive learning
Learning with Bayesian Networks David Heckerman Presented by Colin Rickert.
A Bayesian view of language evolution by iterated learning Tom Griffiths Brown University Mike Kalish University of Louisiana.
This presentation has been cut and slightly edited from Nir Friedman’s full course of 12 lectures which is available at Changes.
Basics of Statistical Estimation. Learning Probabilities: Classical Approach Simplest case: Flipping a thumbtack tails heads True probability  is unknown.
Presenting: Assaf Tzabari
Machine Learning CMPT 726 Simon Fraser University
. PGM: Tirgul 10 Parameter Learning and Priors. 2 Why learning? Knowledge acquisition bottleneck u Knowledge acquisition is an expensive process u Often.
Kansas State University Department of Computing and Information Sciences CIS 830: Advanced Topics in Artificial Intelligence Monday, March 6, 2000 William.
Bayesian Learning and Learning Bayesian Networks.
. Approximate Inference Slides by Nir Friedman. When can we hope to approximate? Two situations: u Highly stochastic distributions “Far” evidence is discarded.
CS Bayesian Learning1 Bayesian Learning. CS Bayesian Learning2 States, causes, hypotheses. Observations, effect, data. We need to reconcile.
Lecture 9: p-value functions and intro to Bayesian thinking Matthew Fox Advanced Epidemiology.
Recitation 1 Probability Review
Bayesian Model Comparison and Occam’s Razor Lecture 2.
Bayesian approaches to cognitive sciences. Word learning Bayesian property induction Theory-based causal inference.
Midterm Review Rao Vemuri 16 Oct Posing a Machine Learning Problem Experience Table – Each row is an instance – Each column is an attribute/feature.
Optimal predictions in everyday cognition Tom Griffiths Josh Tenenbaum Brown University MIT Predicting the future Optimality and Bayesian inference Results.
CSC321: 2011 Introduction to Neural Networks and Machine Learning Lecture 11: Bayesian learning continued Geoffrey Hinton.
ECE 8443 – Pattern Recognition LECTURE 07: MAXIMUM LIKELIHOOD AND BAYESIAN ESTIMATION Objectives: Class-Conditional Density The Multivariate Case General.
Bayesian models of inductive learning Tom Griffiths UC Berkeley Josh Tenenbaum MIT Charles Kemp CMU.
Statistical Learning (From data to distributions).
Randomized Algorithms for Bayesian Hierarchical Clustering
Correlation Assume you have two measurements, x and y, on a set of objects, and would like to know if x and y are related. If they are directly related,
Probability Course web page: vision.cis.udel.edu/cv March 19, 2003  Lecture 15.
ECE 8443 – Pattern Recognition ECE 8527 – Introduction to Machine Learning and Pattern Recognition LECTURE 07: BAYESIAN ESTIMATION (Cont.) Objectives:
CHAPTER 5 Probability Theory (continued) Introduction to Bayesian Networks.
CHAPTER 6 Naive Bayes Models for Classification. QUESTION????
CSC321: Neural Networks Lecture 16: Hidden Markov Models
The generalization of Bayes for continuous densities is that we have some density f(y|  ) where y and  are vectors of data and parameters with  being.
CIAR Summer School Tutorial Lecture 1b Sigmoid Belief Nets Geoffrey Hinton.
Bayesian decision theory: A framework for making decisions when uncertainty exit 1 Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e.
Logistics Class size? Who is new? Who is listening? Everyone on Athena mailing list “concepts- and-theories”? If not write to me. Everyone on stellar yet?
1 Learning P-maps Param. Learning Graphical Models – Carlos Guestrin Carnegie Mellon University September 24 th, 2008 Readings: K&F: 3.3, 3.4, 16.1,
Probabilistic Robotics Introduction Probabilities Bayes rule Bayes filters.
Statistical Methods. 2 Concepts and Notations Sample unit – the basic landscape unit at which we wish to establish the presence/absence of the species.
CHAPTER 3: BAYESIAN DECISION THEORY. Making Decision Under Uncertainty Based on E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1)
Parameter Estimation. Statistics Probability specified inferred Steam engine pump “prediction” “estimation”
Naïve Bayes Classifier April 25 th, Classification Methods (1) Manual classification Used by Yahoo!, Looksmart, about.com, ODP Very accurate when.
Markov-Chain-Monte-Carlo (MCMC) & The Metropolis-Hastings Algorithm P548: Intro Bayesian Stats with Psych Applications Instructor: John Miyamoto 01/19/2016:
CS Ensembles and Bayes1 Ensembles, Model Combination and Bayesian Combination.
FIXETH LIKELIHOODS this is correct. Bayesian methods I: theory.
CSC321: Lecture 8: The Bayesian way to fit models Geoffrey Hinton.
Model Comparison.
Bayesian Estimation and Confidence Intervals Lecture XXII.
Bayesian Estimation and Confidence Intervals
Bayesian data analysis
Bayes Net Learning: Bayesian Approaches
Special Topics In Scientific Computing
Data Mining Lecture 11.
Review of Probability and Estimators Arun Das, Jason Rebello
LECTURE 07: BAYESIAN ESTIMATION
Machine Learning: Lecture 6
Mathematical Foundations of BME Reza Shadmehr
Presentation transcript:

Basic Bayes: model fitting, model selection, model averaging Josh Tenenbaum MIT

Bayes’ rule Posterior probability LikelihoodPrior probability Sum over space of alternative hypotheses For any hypothesis h and data d,

Bayesian inference Bayes’ rule: An example –Data: John is coughing –Some hypotheses: 1. John has a cold 2. John has lung cancer 3. John has a stomach flu –Prior P(h) favors 1 and 3 over 2 –Likelihood P(d|h) favors 1 and 2 over 3 –Posterior P(h|d) favors 1 over 2 and 3

Plan for this lecture Some basic aspects of Bayesian statistics –Model fitting –Model averaging –Model selection A case study in Bayesian cognitive modeling –The number game

Coin flipping Comparing two hypotheses –data = HHTHT or HHHHH –compare two simple hypotheses: P( H ) = 0.5 vs. P( H ) = 1.0 Parameter estimation (Model fitting) –compare many hypotheses in a parameterized family P( H ) =  : Infer  Model selection –compare qualitatively different hypotheses, often varying in complexity: P( H ) = 0.5 vs. P( H ) = 

Coin flipping HHTHT HHHHH What process produced these sequences?

Comparing two simple hypotheses Contrast simple hypotheses: –h 1 : “fair coin”, P( H ) = 0.5 –h 2 :“always heads”, P( H ) = 1.0 Bayes’ rule: With two hypotheses, use odds form

Comparing two simple hypotheses D: HHTHT H 1, H 2 : “fair coin”, “always heads” P(D|H 1 ) =1/2 5 P(H 1 ) =? P(D|H 2 ) =0 P(H 2 ) = 1-?

Comparing two simple hypotheses D: HHTHT H 1, H 2 : “fair coin”, “always heads” P(D|H 1 ) =1/2 5 P(H 1 ) =999/1000 P(D|H 2 ) =0 P(H 2 ) = 1/1000

Comparing two simple hypotheses D: HHHHH H 1, H 2 : “fair coin”, “always heads” P(D|H 1 ) =1/2 5 P(H 1 ) =999/1000 P(D|H 2 ) =1 P(H 2 ) = 1/1000

Comparing two simple hypotheses D: HHHHHHHHHH H 1, H 2 : “fair coin”, “always heads” P(D|H 1 ) =1/2 10 P(H 1 ) =999/1000 P(D|H 2 ) =1 P(H 2 ) = 1/1000

The role of priors The fact that HHTHT looks representative of a fair coin, and HHHHH does not, depends entirely on our hypothesis space and prior probabilities. Should we be worried about that? … or happy?

The role of intuitive theories The fact that HHTHT looks representative of a fair coin, and HHHHH does not, reflects our implicit theories of how the world works. –Easy to imagine how a trick all-heads coin could work: high prior probability. –Hard to imagine how a trick “ HHTHT ” coin could work: low prior probability.

Coin flipping Basic Bayes –data = HHTHT or HHHHH –compare two hypotheses: P( H ) = 0.5 vs. P( H ) = 1.0 Parameter estimation (Model fitting) –compare many hypotheses in a parameterized family P( H ) =  : Infer  Model selection –compare qualitatively different hypotheses, often varying in complexity: P( H ) = 0.5 vs. P( H ) = 

Assume data are generated from a parameterized model: What is the value of  ? –each value of  is a hypothesis H –requires inference over infinitely many hypotheses Parameter estimation d 1 d 2 d 3 d 4 P( H ) =  

Assume hypothesis space of possible models: Which model generated the data? –requires summing out hidden variables –requires some form of Occam’s razor to trade off complexity with fit to the data. Model selection d1d1 d2d2 d3d3 d4d4 Fair coin: P( H ) = 0.5 d1d1 d2d2 d3d3 d4d4 P( H ) =   d1d1 d2d2 d3d3 d4d4 Hidden Markov model: s i {Fair coin, Trick coin} s1s1 s2s2 s3s3 s4s4

Parameter estimation vs. Model selection across learning and development Causality: learning the strength of a relation vs. learning the existence and form of a relation Language acquisition: learning a speaker's accent, or frequencies of different words vs. learning a new tense or syntactic rule (or learning a new language, or the existence of different languages) Concepts: learning what horses look like vs. learning that there is a new species (or learning that there are species) Intuitive physics: learning the mass of an object vs. learning about gravity or angular momentum

A hierarchical learning framework model data Parameter estimation: parameter setting

A hierarchical learning framework model data Model selection:Parameter estimation: model class parameter setting

Assume data are generated from a model: What is the value of  ? –each value of  is a hypothesis H –requires inference over infinitely many hypotheses Bayesian parameter estimation d 1 d 2 d 3 d 4 P( H ) =  

D = 10 flips, with 5 heads and 5 tails.  = P( H ) on next flip? 50% Why? 50% = 5 / (5+5) = 5/10. Why? “The future will be like the past” Suppose we had seen 4 heads and 6 tails. P( H ) on next flip? Closer to 50% than to 40%. Why? Prior knowledge. Some intuitions

Posterior distribution P(  | D) is a probability density over  = P( H ) Need to specify likelihood P(D |  ) and prior distribution P(  ). Integrating prior knowledge and data

Likelihood and prior Likelihood: Bernoulli distribution P(D |  ) =  N H (1-  ) N T –N H : number of heads –N T : number of tails Prior: P(  )  ?

D = 10 flips, with 5 heads and 5 tails.  = P( H ) on next flip? 50% Why? 50% = 5 / (5+5) = 5/10. Why? Maximum likelihood: Suppose we had seen 4 heads and 6 tails. P( H ) on next flip? Closer to 50% than to 40%. Why? Prior knowledge. Some intuitions

A simple method of specifying priors Imagine some fictitious trials, reflecting a set of previous experiences –strategy often used with neural networks or building invariance into machine vision. e.g., F ={1000 heads, 1000 tails} ~ strong expectation that any new coin will be fair In fact, this is a sensible statistical idea...

Likelihood and prior Likelihood: Bernoulli(  ) distribution P(D |  ) =  N H (1-  ) N T –N H : number of heads –N T : number of tails Prior: Beta(F H,F T ) distribution P(  )   F H-1 (1-  ) F T-1 –F H : fictitious observations of heads –F T : fictitious observations of tails

Shape of the Beta prior

F H = 0.5, F T = 0.5F H = 0.5, F T = 2 F H = 2, F T = 0.5F H = 2, F T = 2

Posterior is Beta(N H +F H,N T +F T ) –same form as prior! Bayesian parameter estimation P(  | D)  P(D |  ) P(  ) =  N H+ F H-1 ( 1 -  ) N T+ F T-1

Conjugate priors A prior p(  ) is conjugate to a likelihood function p(D |  ) if the posterior has the same functional form of the prior. –Parameter values in the prior can be thought of as a summary of “fictitious observations”. –Different parameter values in the prior and posterior reflect the impact of observed data. –Conjugate priors exist for many standard models (e.g., all exponential family models)

d 1 d 2 d 3 d 4  FH,FTFH,FT H Posterior predictive distribution: D = N H,N T P(  | D)  P(D |  ) P(  ) =  N H+ F H-1 ( 1 -  ) N T+ F T-1 Bayesian parameter estimation P( H| D, F H, F T ) = P( H|  ) P(  D, F H, F T ) d  “Bayesian model averaging”

d 1 d 2 d 3 d 4  FH,FTFH,FT H Posterior predictive distribution: D = N H,N T P(  | D)  P(D |  ) P(  ) =  N H+ F H-1 ( 1 -  ) N T+ F T-1 Bayesian parameter estimation (N H+ F H+ N T+ F T ) (N H+ F H ) P( H| D, F H, F T ) =

Some examples e.g., F ={1000 heads, 1000 tails} ~ strong expectation that any new coin will be fair After seeing 4 heads, 6 tails, P( H ) on next flip = 1004 / ( ) = 49.95% e.g., F ={3 heads, 3 tails} ~ weak expectation that any new coin will be fair After seeing 4 heads, 6 tails, P( H ) on next flip = 7 / (7+9) = 43.75% Prior knowledge too weak

But… flipping thumbtacks e.g., F ={4 heads, 3 tails} ~ weak expectation that tacks are slightly biased towards heads After seeing 2 heads, 0 tails, P( H ) on next flip = 6 / (6+3) = 67% Some prior knowledge is always necessary to avoid jumping to hasty conclusions... Suppose F = { }: After seeing 1 heads, 0 tails, P( H ) on next flip = 1 / (1+0) = 100%

Origin of prior knowledge Tempting answer: prior experience Suppose you have previously seen 2000 coin flips: 1000 heads, 1000 tails

Problems with simple empiricism Haven’t really seen 2000 coin flips, or any flips of a thumbtack –Prior knowledge is stronger than raw experience justifies Haven’t seen exactly equal number of heads and tails –Prior knowledge is smoother than raw experience justifies Should be a difference between observing 2000 flips of a single coin versus observing 10 flips each for 200 coins, or 1 flip each for 2000 coins –Prior knowledge is more structured than raw experience

A simple theory “Coins are manufactured by a standardized procedure that is effective but not perfect, and symmetric with respect to heads and tails. Tacks are asymmetric, and manufactured to less exacting standards.” –Justifies generalizing from previous coins to the present coin. –Justifies smoother and stronger prior than raw experience alone. –Explains why seeing 10 flips each for 200 coins is more valuable than seeing 2000 flips of one coin.

A hierarchical Bayesian model d 1 d 2 d 3 d 4 FH,FTFH,FT 11  ~ Beta(F H,F T ) Coin 1Coin 2Coin 22  200 physical knowledge Qualitative physical knowledge (symmetry) can influence estimates of continuous parameters (F H, F T ). Explains why 10 flips of 200 coins are better than 2000 flips of a single coin: more informative about F H, F T. Coins

Learning the parameters of a generative model as Bayesian inference. Prediction by Bayesian model averaging. Conjugate priors –an elegant way to represent simple kinds of prior knowledge. Hierarchical Bayesian models –integrate knowledge across instances of a system, or different systems within a domain, to explain the origins of priors. Summary: Bayesian parameter estimation

A hierarchical learning framework model data Model selection: Model fitting: model class parameter setting

Stability versus Flexibility Can all domain knowledge be represented with conjugate priors? Suppose you flip a coin 25 times and get all heads. Something funny is going on … But with F ={1000 heads, 1000 tails}, P(heads) on next flip = 1025 / ( ) = 50.6%. Looks like nothing unusual. How do we balance stability and flexibility? –Stability: 6 heads, 4 tails  ~ 0.5 –Flexibility: 25 heads, 0 tails  ~ 1

Bayesian model selection Which provides a better account of the data: the simple hypothesis of a fair coin, or the complex hypothesis that P( H ) =  ? d 1 d 2 d 3 d 4 Fair coin, P( H ) = 0.5 vs. d 1 d 2 d 3 d 4 P( H ) =  

P( H ) =  is more complex than P( H ) = 0.5 in two ways: –P( H ) = 0.5 is a special case of P( H ) =  –for any observed sequence D, we can choose  such that D is more probable than if P( H ) = 0.5 Comparing simple and complex hypotheses

Probability  = 0.5 D = HHHHH

Comparing simple and complex hypotheses Probability  = 0.5  = 1.0 D = HHHHH

Comparing simple and complex hypotheses Probability D = HHTHT  = 0.5  = 0.6

P( H ) =  is more complex than P( H ) = 0.5 in two ways: –P( H ) = 0.5 is a special case of P( H ) =  –for any observed sequence X, we can choose  such that X is more probable than if P( H ) = 0.5 How can we deal with this? –Some version of Occam’s razor? –Bayes: automatic version of Occam’s razor follows from the “law of conservation of belief”. Comparing simple and complex hypotheses

P(h 1 |D) P(D|h 1 ) P(h 1 ) P(h 0 |D) P(D|h 0 ) P(h 0 ) = x Comparing simple and complex hypotheses The “evidence” or “marginal likelihood”: The probability that randomly selected parameters from the prior would generate the data.

Model class hypothesis: is this coin fair or unfair? Example probabilities: –P(fair) = –P(  |fair) is Beta(1000,1000) –P(  |unfair) is Beta(1,1) 25 heads in a row propagates up, affecting  and then P(fair|D) d 1 d 2 d 3 d 4  P(fair|25 heads) P(25 heads|fair) P(fair) P(unfair|25 heads) P(25 heads|unfair) P(unfair) = ~ FH,FTFH,FT fair/unfair? Stability versus Flexibility revisited

Bayesian Occam’s Razor All possible data sets d p(D = d | M ) M1M1 M2M2 For any model M, Law of “conservation of belief”: A model that can predict many possible data sets must assign each of them low probability.

Occam’s Razor in curve fitting

D p(D = d | M ) M1M1 M2M2 M3M3 Observed data M1M1 M2M2 M3M3 M 1 : A model that is too simple is unlikely to generate the data. M 3 : A model that is too complex can generate many possible data sets, so it is unlikely to generate this particular data set at random.

The “blessing of abstraction” Often easier to learn at higher levels of abstraction –Easier to learn that you have a biased coin than to learn its precise bias, or to learn that you have a second-order polynomial than to learn the precise coefficients. –Easier to learn causal structure than causal strength. –Easier to learn that you are hearing two languages (vs. one), or to learn that language has a hierarchical phrase structure, than to learn how any one language works. Why? Hypothesis space gets smaller as you go up. –But the total hypothesis space gets bigger when we add levels of abstraction (e.g., model selection). –Can make better predictions by expanding the hypothesis space, if we introduce good inductive biases.

Summary Three kinds of Bayesian inference –Comparing two simple hypotheses –Parameter estimation The importance and subtlety of prior knowledge –Model selection Bayesian Occam’s razor, the blessing of abstraction Key concepts –Probabilistic generative models –Hierarchies of abstraction, with statistical inference at all levels –Flexibly structured representations