Opinionated Lessons in Statistics #39: MCMC and Gibbs Sampling


#39: MCMC and Gibbs Sampling, by Bill Press (Professor William H. Press, Department of Computer Science, The University of Texas at Austin)

Markov Chain Monte Carlo (MCMC)

Data set $D$, parameters $\mathbf{x}$ (sorry, we've changed notation!). We want to go beyond simply maximizing the likelihood and get the whole Bayesian posterior distribution of $\mathbf{x}$,
$$p(\mathbf{x}) \equiv P(\mathbf{x} \mid D).$$
Bayes says this is proportional to likelihood times prior,
$$p(\mathbf{x}) \propto P(D \mid \mathbf{x})\, P(\mathbf{x}),$$
but with an unknown proportionality constant (the Bayes denominator $P(D)$). It seems as if we need this denominator to find confidence regions, e.g., ones containing 95% of the posterior probability. But no! MCMC is a way of drawing samples from the distribution without having to know its normalization. With such a sample, we can compute any quantity of interest about the distribution of $\mathbf{x}$: confidence regions, means, standard deviations, covariances, etc.

Two ideas due to Metropolis and colleagues make this possible:

1. Instead of sampling unrelated points, sample a Markov chain in which each point $\mathbf{x}_i$ is (stochastically) determined by the previous one through some chosen transition probability $p(\mathbf{x}_i \mid \mathbf{x}_{i-1})$. Although locally correlated, this sequence can be made ergodic, meaning that it visits every $\mathbf{x}$ in proportion to $p(\mathbf{x})$.

2. Any transition probability that satisfies "detailed balance",
$$p(\mathbf{x}_1)\, p(\mathbf{x}_2 \mid \mathbf{x}_1) = p(\mathbf{x}_2)\, p(\mathbf{x}_1 \mid \mathbf{x}_2),$$
will give such an ergodic sequence! Deceptively simple proof: compute the distribution of $\mathbf{x}_1$'s successor point,
$$\int p(\mathbf{x}_2 \mid \mathbf{x}_1)\, p(\mathbf{x}_1)\, d\mathbf{x}_1 = \int p(\mathbf{x}_1 \mid \mathbf{x}_2)\, p(\mathbf{x}_2)\, d\mathbf{x}_1 = p(\mathbf{x}_2),$$
so $p(\mathbf{x})$ is the stationary distribution of the chain.

So how do we find such a $p(\mathbf{x}_i \mid \mathbf{x}_{i-1})$?
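Before the slides answer that question, here is a quick numerical sanity check of idea 2: a minimal sketch of my own (not from the lecture), assuming a 3-state chain with an illustrative target `p` and a Metropolis-style transition rule. It verifies that a transition matrix obeying detailed balance leaves $p$ stationary.

```python
import numpy as np

p = np.array([0.2, 0.3, 0.5])   # illustrative target distribution p(x) over 3 states

# A Metropolis rule with a symmetric (uniform) proposal yields a transition
# matrix satisfying detailed balance with respect to p.
n = len(p)
P = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        if i != j:
            P[i, j] = (1 / n) * min(1, p[j] / p[i])  # propose j, accept w.p. min(1, p_j/p_i)
    P[i, i] = 1 - P[i].sum()                         # otherwise stay put

F = p[:, None] * P            # probability flow F_ij = p_i * P_ij
assert np.allclose(F, F.T)    # detailed balance: p_i P_ij = p_j P_ji
assert np.allclose(p @ P, p)  # hence the successor of a p-distributed point is p-distributed
print("detailed balance holds and p is stationary")
```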

Metropolis-Hastings algorithm: Metropolis, Rosenbluth, Rosenbluth, Teller, and Teller (1953); Hastings (1970).

Pick more or less any "proposal distribution" $q(\mathbf{x}_2 \mid \mathbf{x}_1)$ (a multivariate normal centered on $\mathbf{x}_1$ is a typical example). Then the algorithm is:

1. Generate a candidate point $\mathbf{x}_{2c}$ by drawing from the proposal distribution around $\mathbf{x}_1$.
2. Calculate an "acceptance probability" by
$$\alpha = \min\!\left(1,\ \frac{p(\mathbf{x}_{2c})\, q(\mathbf{x}_1 \mid \mathbf{x}_{2c})}{p(\mathbf{x}_1)\, q(\mathbf{x}_{2c} \mid \mathbf{x}_1)}\right).$$
Notice that the $q$'s cancel out if the proposal is symmetric in its arguments, as a multivariate Gaussian is.
3. Choose $\mathbf{x}_2 = \mathbf{x}_{2c}$ with probability $\alpha$, $\mathbf{x}_2 = \mathbf{x}_1$ with probability $(1-\alpha)$.

So, it's something like: always accept a proposal that increases the probability, and sometimes accept one that doesn't. (Not exactly this, because of the ratio of $q$'s.)
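As a concrete illustration, here is a minimal Metropolis sampler (my own sketch, not code from the lecture; the target `p_un`, the step size, and the burn-in length are all illustrative choices). The Gaussian proposal is symmetric, so the $q$'s cancel and the acceptance ratio is just $p(\mathbf{x}_{2c})/p(\mathbf{x}_1)$; note that $p$ need only be known up to a constant.

```python
import numpy as np

rng = np.random.default_rng(42)

def p_un(x):
    """Unnormalized target density: an arbitrary 2-d Gaussian mixture."""
    return np.exp(-0.5 * np.dot(x, x)) + 0.5 * np.exp(-0.5 * np.dot(x - 3.0, x - 3.0))

def metropolis(p_un, x0, n_steps, step=1.0):
    """Metropolis sampler with a symmetric Gaussian proposal."""
    x = np.array(x0, dtype=float)
    samples = np.empty((n_steps, x.size))
    for t in range(n_steps):
        xc = x + step * rng.standard_normal(x.size)  # 1. candidate drawn around x
        a = min(1.0, p_un(xc) / p_un(x))             # 2. acceptance probability
        if rng.random() < a:                         # 3. accept with probability a,
            x = xc                                   #    else repeat the current point
        samples[t] = x
    return samples

samples = metropolis(p_un, x0=[0.0, 0.0], n_steps=50_000)
print("posterior mean estimate:", samples[5_000:].mean(axis=0))  # discard burn-in
```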

Proof that Metropolis-Hastings satisfies detailed balance:
$$p(\mathbf{x}_1)\, p(\mathbf{x}_2 \mid \mathbf{x}_1) = p(\mathbf{x}_1)\, q(\mathbf{x}_2 \mid \mathbf{x}_1)\, \alpha(\mathbf{x}_1, \mathbf{x}_2) = \min\!\big(p(\mathbf{x}_1)\, q(\mathbf{x}_2 \mid \mathbf{x}_1),\ p(\mathbf{x}_2)\, q(\mathbf{x}_1 \mid \mathbf{x}_2)\big),$$
and also the other way around,
$$p(\mathbf{x}_2)\, p(\mathbf{x}_1 \mid \mathbf{x}_2) = \min\!\big(p(\mathbf{x}_2)\, q(\mathbf{x}_1 \mid \mathbf{x}_2),\ p(\mathbf{x}_1)\, q(\mathbf{x}_2 \mid \mathbf{x}_1)\big).$$
The two right-hand sides are equal, so
$$p(\mathbf{x}_1)\, p(\mathbf{x}_2 \mid \mathbf{x}_1) = p(\mathbf{x}_2)\, p(\mathbf{x}_1 \mid \mathbf{x}_2),$$
which is just detailed balance, q.e.d.

History has treated Metropolis perhaps more kindly than he deserves.

The Gibbs sampler is an interesting special case of Metropolis-Hastings.

A "full conditional distribution" of $\mathbf{x}$ is the normalized distribution obtained by sampling along one coordinate direction (i.e., "drilling through" the full distribution). We write it as
$$p(x_i \mid x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n),$$
"given all coordinate values except one."

Theorem: a multivariate distribution is uniquely determined by all of its full conditional distributions. Proof (sort of): it's a hugely overdetermined set of linear equations, so any degeneracy is infinitely unlikely!

Metropolis-Hastings along one direction looks like this: choose the proposal distribution to be the full conditional itself,
$$q(\mathbf{x}_2 \mid \mathbf{x}_1) = p(x_{2i} \mid \text{all other components of } \mathbf{x}_1).$$
Then the acceptance probability is $\alpha = 1$: we always accept the step! But a proposal distribution must be normalized, so we actually do need to be able to calculate $p$, but only along one "drill hole" at a time.

So, Gibbs sampling looks like this:

- Cycle through the coordinate directions of $\mathbf{x}$.
- Hold the values of all the other coordinates fixed.
- "Drill", i.e., sample from the one-dimensional distribution along the non-fixed coordinate. (This requires knowing the normalization, if necessary by doing an integral or sum along the line.)
- Now fix that coordinate at the sampled value and go on to the next coordinate direction.

(Figure: a Gibbs-sampling trajectory moving in axis-parallel steps; source: Vincent Zoonekynd.)

Amazingly, this actually samples the full (joint) posterior distribution!
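A minimal runnable sketch of this procedure (mine, not the lecture's), assuming the simplest case where the one-dimensional conditionals are known in closed form: a unit-variance bivariate normal with correlation $\rho$, whose full conditionals are the textbook results $x_1 \mid x_2 \sim N(\rho x_2,\ 1-\rho^2)$ and symmetrically.

```python
import numpy as np

rng = np.random.default_rng(42)
rho = 0.8            # correlation of the target bivariate normal
n_steps = 50_000

x1, x2 = 0.0, 0.0
samples = np.empty((n_steps, 2))
for t in range(n_steps):
    # "Drill" along coordinate 1, holding x2 fixed: x1 | x2 ~ N(rho*x2, 1 - rho^2)
    x1 = rng.normal(rho * x2, np.sqrt(1 - rho**2))
    # Fix x1 at its sampled value, drill along coordinate 2
    x2 = rng.normal(rho * x1, np.sqrt(1 - rho**2))
    samples[t] = (x1, x2)

print("sample correlation:", np.corrcoef(samples[5_000:].T)[0, 1])  # should be near rho
```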

Gibbs sampling can really win if there are only a few discrete values along each drill hole.

(Figure: objects A through K, each assigned to one of three clusters Cl1, Cl2, Cl3.)

Example: assigning objects to clusters. Now $\mathbf{x}$ is the vector of assignments of the objects, and each component of $\mathbf{x}$ has just (here) 3 possible values. We assume, of course, some statistical model giving the relative likelihood of a particular assignment (e.g., that all the objects in a given column are in fact drawn from that column's cluster distribution). Then Gibbs sampling says: just cycle through the objects, reassigning each in turn by drawing from its full conditional distribution. Amazingly, this actually samples the full (joint) posterior distribution!
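A hedged sketch of what this can look like in code. The model here is my own illustrative choice, not the lecture's: three clusters with known unit-variance Gaussian distributions and scalar data per object, so each "drill hole" is a draw over just three values.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical model: cluster means assumed known; 11 objects with scalar data y.
mus = np.array([-3.0, 0.0, 3.0])         # means for clusters Cl1, Cl2, Cl3
true = rng.integers(3, size=11)          # hidden true assignments (for data generation)
y = rng.normal(mus[true], 1.0)           # observed data for the 11 objects

assign = rng.integers(3, size=y.size)    # arbitrary starting assignments
for sweep in range(1_000):
    for i in range(y.size):              # cycle through the objects
        # Full conditional over the 3 possible values of x_i. (With the cluster
        # parameters known it factorizes across objects; with unknown parameters
        # it would depend on the other objects' current assignments.)
        logp = -0.5 * (y[i] - mus) ** 2
        w = np.exp(logp - logp.max())
        assign[i] = rng.choice(3, p=w / w.sum())  # reassign by a conditional draw
print("one posterior sample of the assignments:", assign)
```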

Burn-in can have multiple timescales (e.g., ascent to a ridge, then travel along the ridge).

(Figure: an illustrative MCMC trajectory; source: Wikipedia.)