Bayesian statistics – MCMC techniques


Bayesian statistics – MCMC techniques
How to sample from the posterior distribution

Bayes formula – the problem term

Bayes' theorem:

f(θ|D) = f(D|θ) f(θ) / ∫ f(D|θ') f(θ') dθ'

Seems straightforward: you multiply the likelihood (which you know) by your prior (which you have specified). So far, so analytic. But the integral in the denominator may not be analytically tractable. Solutions: numerical integration or Monte Carlo integration of the integral ... or sampling from the posterior distribution using Markov chain Monte Carlo (MCMC).
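To make the Monte Carlo option concrete: the problem integral is the expectation of the likelihood under the prior, so it can be estimated by averaging the likelihood over prior draws. Below is a minimal Python sketch, assuming a hypothetical toy model (N(θ, 1) likelihood, N(0, 1) prior) that is not from the slides:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical toy model: data ~ N(theta, 1), prior theta ~ N(0, 1).
data = rng.normal(0.5, 1.0, size=20)

def likelihood(theta):
    # f(D|theta): product of N(theta, 1) densities, vectorized over theta
    dens = np.exp(-0.5 * (data[:, None] - theta) ** 2) / np.sqrt(2 * np.pi)
    return dens.prod(axis=0)

# The normalization f(D) = integral of f(D|theta) f(theta) dtheta is the
# expectation of the likelihood under the prior, so average over prior draws:
theta_draws = rng.normal(0.0, 1.0, size=100_000)
f_D = likelihood(theta_draws).mean()
print("Monte Carlo estimate of the normalization f(D):", f_D)
```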

Normalization

Bayes' theorem: note that the problematic integral is simply a normalization factor; its value has no functional dependency on the parameters, θ. Normalization is the constant in a probability density function which makes the probabilities integrate to one. The functional form of the posterior distribution depends only on the two terms in the numerator, the likelihood f(D|θ) and the prior f(θ): f(θ|D) ∝ f(D|θ) f(θ). If we can find a way of sampling from a distribution without knowing its normalization constant, we can use those samples to tell us about all properties of the posterior distribution (mean, variance, covariance, quantiles etc.). How to sample from a distribution f(x) = g(x)/A without knowing A?

Markov chains

X1 → X2 → X3 → … → Xi-1 → Xi → Xi+1 → … → Xn

Organize a series of stochastic variables, X1,...,Xn, as above: each variable depends on its past only through the nearest past, f(xt|xt-1,...,x1) = f(xt|xt-1). This simplifies the model for the combined series a lot:

f(xt, xt-1, ..., x1) = f(xt|xt-1) f(xt-1|xt-2) ... f(x1).

Markov chains can be very practical for modelling data with some time-dependency. Example: the autoregressive model in the "water temperature" series.

Markov chains – stationary distribution

Each element in the chain has a distribution; by simulating the chain multiple times and plotting the histograms, you get a sense of it. Sometimes a Markov chain will converge towards a fixed distribution, the stationary distribution.

Example: a random walk with reflective barriers at (-1, 1): Xi = Xi-1 + εi with εi ~ N(0, σ²); if Xi < -1 then set Xi to -1 - (Xi + 1), and if Xi > 1 then set Xi to 1 - (Xi - 1).

While we sampled from the normal distribution, what we get is uniformly distributed on (-1, 1). Even if we did not know how to sample from the uniform directly, we could do it indirectly using this Markov chain.

Idea: make a Markov chain that converges towards the posterior, f(θ|D).
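The reflected random walk is easy to simulate. A minimal sketch, assuming a hypothetical step size σ = 0.5 (any moderate value shows the same effect); the histogram of the chain comes out flat on (-1, 1) even though every increment is normal:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
n, sigma = 100_000, 0.5   # chain length and step size (hypothetical choices)
x = np.empty(n)
x[0] = 0.0
for i in range(1, n):
    x[i] = x[i - 1] + rng.normal(0.0, sigma)
    # Reflect at the barriers -1 and +1 (looped, in case a rare big step
    # overshoots past both barriers).
    while x[i] < -1 or x[i] > 1:
        if x[i] < -1:
            x[i] = -1 - (x[i] + 1)   # reflect about -1
        else:
            x[i] = 1 - (x[i] - 1)    # reflect about +1

# Normal increments, but the marginal distribution is ~ Uniform(-1, 1):
plt.hist(x, bins=50, density=True)
plt.show()
```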

MCMC – why?

Make a Markov chain time series of parameter samples, (θ1, θ2, θ3, …), which has stationary distribution f(θ|D), in the following way (Metropolis–Hastings): propose a new value from the previous one, θ* ~ q(θ*|θi-1), then accept the move (θi = θ*) with probability

α = min(1, [f(D|θ*) f(θ*) q(θi-1|θ*)] / [f(D|θi-1) f(θi-1) q(θ*|θi-1)])

and otherwise reject it (θi = θi-1). We don't need the troublesome normalization constant, f(D): it cancels in the ratio, so we can drop it! We may even drop further normalization constants in the likelihood, prior and proposal density. This solves the normalization problem in Bayesian statistics.
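A sketch of the algorithm for one parameter, assuming the same hypothetical toy model as before (N(θ, 1) likelihood, N(0, 1) prior) and a symmetric random-walk proposal, so the q terms cancel from the acceptance ratio. Working with unnormalized log densities both drops the constants, as described above, and avoids numerical underflow:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(0.5, 1.0, size=20)

def log_unnorm_post(theta):
    # log f(D|theta) + log f(theta), with normalization constants dropped
    log_lik = -0.5 * np.sum((data - theta) ** 2)   # N(theta, 1) likelihood
    log_prior = -0.5 * theta ** 2                  # N(0, 1) prior
    return log_lik + log_prior

n_iter, step = 10_000, 0.5   # hypothetical run length and proposal std. dev.
theta = np.empty(n_iter)
theta[0] = 0.0
for i in range(1, n_iter):
    proposal = theta[i - 1] + rng.normal(0.0, step)   # symmetric proposal
    log_alpha = log_unnorm_post(proposal) - log_unnorm_post(theta[i - 1])
    if np.log(rng.uniform()) < log_alpha:
        theta[i] = proposal        # accept the move
    else:
        theta[i] = theta[i - 1]    # reject: repeat the previous value

print("posterior mean estimate:", theta[1000:].mean())  # discard burn-in
```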

Everything solved? What about the proposal distribution?

Almost any proposal distribution, θ* ~ q(θ*|θi-1), will do; there's a lot of (too much?) freedom here. We can restrict ourselves to going through the parameters one at a time: for θ = (θ(1), θ(2)), propose a change in θ(1), accept/reject, then do the same for θ(2). Pros: simpler. Cons: strong dependency between the parameters means little change per iteration.

Special case: a Gibbs sample of a parameter comes from its distribution conditioned on the data and all the rest of the parameters, e.g. θ(1) ~ f(θ(1)|θ(2), D), and such a proposal is always accepted. The proposal is now specified by the model, so there is no need to choose anything. Automatic methods are possible, where we only need to specify the model (WinBUGS).
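A sketch of a two-parameter Gibbs sampler. As a stand-in posterior (a hypothetical choice, so that both full conditionals are known in closed form) it uses a standard bivariate normal with correlation ρ; setting ρ close to 1 demonstrates the "Cons" above, a chain that moves in tiny steps:

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 0.9          # correlation between theta(1) and theta(2) (hypothetical)
n_iter = 10_000
theta = np.zeros((n_iter, 2))
for i in range(1, n_iter):
    # theta(1) | theta(2) ~ N(rho * theta(2), 1 - rho^2)
    theta[i, 0] = rng.normal(rho * theta[i - 1, 1], np.sqrt(1 - rho ** 2))
    # theta(2) | theta(1) ~ N(rho * theta(1), 1 - rho^2)
    theta[i, 1] = rng.normal(rho * theta[i, 0], np.sqrt(1 - rho ** 2))

# The samples recover the target correlation:
print("sample correlation:", np.corrcoef(theta[1000:].T)[0, 1])
```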

MCMC – convergence

Sampling using MCMC means that if we keep doing it for long enough, we get a sample from the posterior distribution. Question 1: how much is "enough time"? Markov chains will not converge immediately, so we need ways of testing whether convergence has occurred:

Start many chains, from different initial positions in parameter space. When do they start to mix? Compare plots of the parameter values or of the likelihoods.

Gelman's ANOVA-type test for MCMC samples (the Gelman–Rubin diagnostic): check whether we get the same results when we rerun the analysis from different starting points.

After determining the point of convergence, we discard the samples from before that point (the burn-in) and keep only the samples from after it.
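One way to formalize the "rerun from different starting points" check is the Gelman–Rubin statistic, which compares between-chain and within-chain variance. A minimal sketch (one common form of the diagnostic; values near 1 suggest the chains have mixed):

```python
import numpy as np

def gelman_rubin(chains):
    """Potential scale reduction factor R-hat.
    chains: array of shape (m, n) -- m chains, n post-burn-in samples each."""
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)          # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()    # mean within-chain variance
    var_hat = (n - 1) / n * W + B / n        # pooled variance estimate
    return np.sqrt(var_hat / W)              # close to 1 => chains have mixed

# Demo: four chains of independent draws from the same distribution
# should give R-hat very close to 1.
rng = np.random.default_rng(3)
print("R-hat:", gelman_rubin(rng.normal(size=(4, 2000))))
```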

MCMC – dependency

Question 2: how many samples do we need? In a Markov chain, one sample depends on the previous one. How does that affect the precision of the total sample, and how should we deal with it?

Look at the time-series plot of a single chain: do we see dependency? Estimate the autocorrelation -> effective number of samples. Thinning: keep only every k'th sample, so that the dependency is almost gone; we can then aim for a target number of (almost) independent samples. Once we have a set of (almost) independent samples, we can use standard sampling theory to say something about the precision of means and covariances, for instance.

(Figure: autocorrelation plots, keeping every sample vs. keeping every 600th sample.)
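A sketch of estimating the autocorrelation and the effective number of samples from one chain; truncating the autocorrelation sum at the first non-positive lag is one simple convention among several:

```python
import numpy as np

def autocorr(x, max_lag=200):
    # Sample autocorrelation of a single chain, lags 0..max_lag.
    x = x - x.mean()
    acf = np.correlate(x, x, mode="full")[len(x) - 1:]
    return acf[:max_lag + 1] / acf[0]

def effective_sample_size(x, max_lag=200):
    rho = autocorr(x, max_lag)
    pos = rho[1:]
    # Truncate the sum at the first non-positive autocorrelation.
    cutoff = np.argmax(pos <= 0) if np.any(pos <= 0) else len(pos)
    return len(x) / (1 + 2 * pos[:cutoff].sum())

# Demo on an AR(1) chain with strong, known dependency: the effective
# sample size is far below the nominal 5000 samples.
rng = np.random.default_rng(2)
x = np.zeros(5000)
for i in range(1, len(x)):
    x[i] = 0.95 * x[i - 1] + rng.normal()
print("effective sample size:", effective_sample_size(x))
```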

MCMC – causes for concern

High posterior dependency between parameters: gives high dependency between samples and thus low efficiency. Look at scatterplots to see whether this is the case.

Multimodality: the posterior density has many "peaks". Chains starting from different places "converge" to different places, and mixing is greatly reduced.

Usual solution within WinBUGS: re-parametrize or sample a lot!