Bayesian Reasoning: Markov Chain Monte Carlo

Bayesian Reasoning: Markov Chain Monte Carlo A/Prof Geraint F. Lewis Rm 560: gfl@physics.usyd.edu.au

Sampling the Posterior So far we have investigated how to use Bayesian techniques to determine the posterior probability distribution for a set of parameters in light of some data. However, our parameter set may be high-dimensional, and we may only be interested in a subset of (marginalized) parameters. Markov Chain Monte Carlo (MCMC) is an efficient approach to this problem. http://www.physics.usyd.edu.au/~gfl/Lecture

Why Monte Carlo? Monte Carlo is a district of Monaco famous for its casinos and gambling. Monte Carlo techniques rely on a sequence of random numbers (like the roll of dice) to make calculations. By throwing random points over the x-y plane, it is easy to see that an approximation to the integral of a function is the fraction of points "under the curve" times the area of interest.
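This "throw random points and count the fraction under the curve" idea can be sketched in a few lines of Python (an illustrative example, not from the lecture; the integrand, bounds and sample size are arbitrary choices):

```python
import random

def hit_or_miss(f, x_lo, x_hi, y_hi, n=100_000, seed=0):
    """Estimate the integral of f over [x_lo, x_hi] by throwing n random
    points into the bounding box and taking the fraction that land under
    the curve, times the area of the box."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        x = rng.uniform(x_lo, x_hi)
        y = rng.uniform(0.0, y_hi)
        if y <= f(x):
            hits += 1
    return (hits / n) * (x_hi - x_lo) * y_hi

# Integral of x^2 on [0, 1]; the exact answer is 1/3.
estimate = hit_or_miss(lambda x: x * x, 0.0, 1.0, 1.0)
```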

Integration of Functions Instead of sampling the area we can sample a function directly. For a posterior distribution p(X|D) we can calculate an expectation value through

⟨f(X)⟩ = ∫ f(X) p(X|D) dX

This can be calculated through a Monte Carlo sample by multiplying the volume by the mean value of the sampled function: drawing N points Xi uniformly over a volume V,

⟨f(X)⟩ ≈ (V/N) Σi f(Xi) p(Xi|D)
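A minimal sketch of this volume-times-mean estimate (illustrative code, not part of the lecture; a unit Gaussian stands in for the posterior p(X|D), and the integration interval is chosen wide enough to capture essentially all of its probability mass):

```python
import math
import random

def mc_expectation(f, p, lo, hi, n=200_000, seed=1):
    """Estimate the expectation ∫ f(x) p(x) dx over [lo, hi] as the
    volume (hi - lo) times the mean of the sampled integrand f*p at
    uniformly drawn points."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = rng.uniform(lo, hi)
        total += f(x) * p(x)
    return (hi - lo) * total / n

# Posterior: unit Gaussian; the expectation of X^2 is its variance, 1.
gauss = lambda x: math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)
second_moment = mc_expectation(lambda x: x * x, gauss, -6.0, 6.0)
```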

Markov Chains Straightforward Monte Carlo integration suffers from some problems, especially if your posterior probability is peaked in a small volume of your parameter space. What we would like is a method that throws more points into the regions of interest, and does not waste points where the integrand is negligible. We can use a Markov Chain to "walk" through the parameter space, loitering in regions of high significance and avoiding everywhere else.

Metropolis-Hastings Metropolis et al. (1953), later generalized by Hastings (1970), presented a (very simple) approach to extract a sample from the target distribution, p(X|D,I), via a weighted random walk. To do this, we need a transition probability, p(Xt+1|Xt), which takes us from one point in the parameter space to the next. Suppose we are at a point Xt and we choose another point Y (drawn from the proposal distribution, q(Y|Xt)); how do we choose whether to accept the step Xt → Y?

Metropolis-Hastings We can then calculate the Metropolis ratio

r = [ p(Y|D,I) / p(Xt|D,I) ] × [ q(Xt|Y) / q(Y|Xt) ]

If the proposal distribution is symmetric, the right-most factor is unity. If r > 1 we take the step. If not, we accept it with a probability of r (i.e. compare r to a uniformly drawn random deviate).

Metropolis-Hastings Hence, the probability that we take the step can be summarized in the acceptance probability

α(Y|Xt) = min(1, r)

and the Metropolis-Hastings algorithm is simply:

Set t=0 and initialize X0
Repeat {
  Choose Y from q(Y|Xt);
  Calculate r;
  Obtain a random deviate U = U(0,1);
  If U ≤ r then Xt+1 = Y, else Xt+1 = Xt;
  Increment t
}
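The algorithm can be sketched as follows (an illustrative Python implementation, not the lecture's own code; it assumes a symmetric Gaussian proposal so the Hastings correction factor is unity, and works with the log of the target for numerical stability):

```python
import math
import random

def metropolis(log_target, x0, sigma, n_steps, seed=2):
    """Metropolis algorithm with a symmetric Gaussian proposal.
    Returns the chain of visited points as a list."""
    rng = random.Random(seed)
    x = x0
    chain = [x]
    for _ in range(n_steps):
        y = x + rng.gauss(0.0, sigma)          # propose Y ~ q(Y|Xt)
        log_r = log_target(y) - log_target(x)  # Metropolis ratio (in log)
        # Accept with probability min(1, r).
        if log_r >= 0 or rng.random() < math.exp(log_r):
            x = y
        chain.append(x)
    return chain

# Sample a unit Gaussian; after burn-in the chain mean should be near 0
# and the chain variance near 1.
log_gauss = lambda x: -0.5 * x * x
chain = metropolis(log_gauss, x0=3.0, sigma=1.0, n_steps=50_000)
samples = chain[1000:]                         # discard burn-in
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```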

Example 1 Suppose that the posterior distribution is given by

p(X|D,I) = λ^X e^(-λ) / X!

This is the Poisson distribution, and X can only take non-negative integer values. Our MCMC algorithm is:

Set t=0 and initialize X0
Repeat {
  Given Xt, choose U1 = U(0,1);
  If U1 > 0.5, propose Y = Xt+1, else Y = Xt-1;
  Calculate r = p(Y|D,I)/p(Xt|D,I) = λ^(Y-Xt) Xt!/Y! (a proposal Y < 0 has r = 0 and is always rejected);
  If U2 = U(0,1) < r, then Xt+1 = Y, else Xt+1 = Xt
}
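The integer-step chain above can be sketched directly (illustrative code, not from the lecture; λ = 3 is an arbitrary choice):

```python
import math
import random

def poisson_chain(lam, n_steps, x0=0, seed=3):
    """Integer random walk whose stationary distribution is Poisson(lam).
    Proposes Xt+1 or Xt-1 with equal probability and accepts with the
    Metropolis ratio r = lam**(Y - Xt) * Xt! / Y!."""
    rng = random.Random(seed)
    x = x0
    chain = []
    for _ in range(n_steps):
        y = x + 1 if rng.random() > 0.5 else x - 1
        if y < 0:
            r = 0.0          # the Poisson has no support below zero
        else:
            r = lam ** (y - x) * math.factorial(x) / math.factorial(y)
        if rng.random() < r:  # accept with probability min(1, r)
            x = y
        chain.append(x)
    return chain

chain = poisson_chain(lam=3.0, n_steps=100_000)
samples = chain[1000:]                 # discard burn-in
mean = sum(samples) / len(samples)     # should be close to lam
```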

Example 1 The result is a chain of integer values. How are we to interpret the chain? Neglecting the burn-in, the distribution of integer values is the same as the posterior distribution we are sampling, as seen in the histogram of their values (solid line is the Poisson distribution).

Example 1 So the values in the chain are a representation of the posterior probability distribution. Hence, the mean value of our parameter is simply given by

⟨X⟩ ≈ (1/N) Σt Xt

Example 2 Clearly, the nature of the previous problem required us to take integer steps through the posterior distribution, but in general our parameters will be continuous; what is our proposal distribution, q(Y|Xt), in this case? Typically, this is taken to be a multivariate Gaussian distribution, centered upon Xt, with a width σ. If we have n parameters (and hence dimensions), the proposal is

q(Y|Xt) = ∏i (2πσi²)^(-1/2) exp( -(Yi - Xt,i)² / (2σi²) )

(Note, each dimension may have a different σi).
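An n-dimensional sampler with an independent Gaussian proposal per dimension can be sketched as below (illustrative code, not from the lecture; the two-lump target is a stand-in for the bimodal posterior of this example, with mode centers chosen arbitrarily):

```python
import math
import random

def metropolis_nd(log_target, x0, sigmas, n_steps, seed=4):
    """Metropolis sampler in n dimensions: each dimension gets its own
    Gaussian proposal width sigma_i."""
    rng = random.Random(seed)
    x = list(x0)
    chain = []
    for _ in range(n_steps):
        y = [xi + rng.gauss(0.0, s) for xi, s in zip(x, sigmas)]
        log_r = log_target(y) - log_target(x)
        if log_r >= 0 or rng.random() < math.exp(log_r):
            x = y
        chain.append(list(x))
    return chain

# A bimodal 2-d target: two Gaussian lumps centered at (-2, 0) and (+2, 0).
def log_bimodal(p):
    x, y = p
    a = math.exp(-0.5 * ((x + 2) ** 2 + y ** 2))
    b = math.exp(-0.5 * ((x - 2) ** 2 + y ** 2))
    return math.log(a + b)

chain = metropolis_nd(log_bimodal, x0=[0.0, 0.0],
                      sigmas=[2.0, 2.0], n_steps=50_000)
# By symmetry, about half the points should fall in each lump.
frac_left = sum(1 for p in chain if p[0] < 0) / len(chain)
```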

Example 2 Here is a case where we have a 2-dimensional, bimodal posterior distribution. Clearly the points are clustered in the parameter space in two “lumps”.

Example 2 The contours show that the density of points mirrors the underlying posterior distribution. Furthermore, the projected (integrated) density of points corresponds to the marginalized distribution.

Proposal Choice If we are going to step through the posterior probability distribution, the step-size of the proposal distribution determines how efficiently we can explore the volume. Clearly, if your step-size is too small, it will take many steps (a lot of time) to cover the volume. If the step-size is too large, we can skip over the volume without spending time in regions of interest.

Proposal Choice Changing σ by a factor of 10 illustrates this. If we wait long enough, any of these proposal distributions will cover the volume, but how do we know what to choose?

Autocorrelation Function So, how do you choose the ideal proposal distribution? Unfortunately, this is an art which comes with experience, although the autocorrelation function gives us a good clue through

ρ(h) = Cov(Xt, Xt+h) / Var(Xt)

The MCMC samples are not independent, and so the autocorrelation reveals the convergence (i.e. we want solutions for which the ACF goes to zero as quickly as possible).
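Computing the sample ACF of a chain is straightforward (illustrative code, not from the lecture; an AR(1) process, whose ACF is known to be φ^h, stands in for a correlated MCMC chain):

```python
import random

def acf(chain, max_lag):
    """Sample autocorrelation function rho(h) of a chain, h = 0..max_lag."""
    n = len(chain)
    mean = sum(chain) / n
    var = sum((x - mean) ** 2 for x in chain) / n
    rhos = []
    for h in range(max_lag + 1):
        cov = sum((chain[t] - mean) * (chain[t + h] - mean)
                  for t in range(n - h)) / n
        rhos.append(cov / var)
    return rhos

# An AR(1) chain x_{t+1} = phi * x_t + noise mimics the correlated
# output of an MCMC sampler; its ACF is approximately phi**h.
rng = random.Random(5)
phi, x = 0.9, 0.0
chain = []
for _ in range(100_000):
    x = phi * x + rng.gauss(0.0, 1.0)
    chain.append(x)
rhos = acf(chain, max_lag=10)
```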

Autocorrelation Function The ACF is well represented by an exponential function of the form

ρ(h) ≈ exp(-h/h*)

The smaller the decay scale h*, the faster the convergence. Of course, we need to try several proposal widths to find the sweet spot; however, we have a rule of thumb to guide this.

Autocorrelation Function Work by Roberts, Gelman & Gilks (1997) recommends calibrating the acceptance rate to 25% for high-dimensional problems, and 50% for low (1-2) dimensional problems. But there are questions remaining: How long is the burn-in? When do we stop the chain? What's the proposal? Answer: Experience :)
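Calibrating to a target acceptance rate can be done empirically, as this sketch shows (illustrative code, not from the lecture; a unit Gaussian stands in for the posterior, and the three trial widths are arbitrary):

```python
import math
import random

def acceptance_rate(log_target, sigma, n_steps=20_000, seed=6):
    """Fraction of proposed Metropolis steps that are accepted for a
    given Gaussian proposal width sigma."""
    rng = random.Random(seed)
    x, accepted = 0.0, 0
    for _ in range(n_steps):
        y = x + rng.gauss(0.0, sigma)
        log_r = log_target(y) - log_target(x)
        if log_r >= 0 or rng.random() < math.exp(log_r):
            x = y
            accepted += 1
    return accepted / n_steps

# Tiny steps are almost always accepted (but explore slowly); huge
# steps are almost always rejected. Tune sigma toward the middle.
log_gauss = lambda x: -0.5 * x * x
rates = {s: acceptance_rate(log_gauss, s) for s in (0.1, 1.0, 10.0)}
```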