Markov Chain Monte Carlo Convergence Diagnostics: A Comparative Review
By Mary Kathryn Cowles and Bradley P. Carlin
Presented by Yuting Qi, 12/01/2006

OUTLINE
MCMC convergence diagnostics: introduce 4 methods in detail, focusing for each on
- Prescriptive summary
- Underlying theoretical basis
- Advantages and disadvantages
Comparative results

1. Gelman and Rubin (1992) 1/4
What?
Based on a normal-theory approximation to exact Bayesian posterior inference: the focus is applied inference for Bayesian posterior distributions in real problems, which often tend toward normality after transformation and marginalization.
Two major steps:
1. Create an overdispersed estimate of the target distribution and use it to start several independent sequences.
2. Analyze the multiple sequences to form a distributional estimate of what is known about the target random variable given the simulations so far. This distributional estimate is a Student's t distribution for each scalar quantity of interest.
Convergence: monitored by estimating the factor by which the scale parameter of that t distribution would shrink if sampling continued indefinitely.

1. Gelman and Rubin (1992) 2/4
How?
Step 1: Creating a starting distribution
- Locate the high-density regions of the target distribution of x and find the K modes.
- Approximate the high-density regions by a Gaussian mixture model (GMM).
- Form an overdispersed distribution by first drawing from the GMM and then dividing each sample by a positive random scalar (e.g., the square root of a chi-squared draw), which yields a mixture of t distributions.
- Sharpen the overdispersed approximation by downweighting regions of relatively low density, for example through importance resampling.

1. Gelman and Rubin (1992) 3/4
Step 2: Re-estimating the target distribution
Independently simulate m sequences of length 2n from the overdispersed distribution and discard the first n iterations.
For each scalar parameter of interest, estimate the following quantities from the last n iterations of the m sequences:
- B/n: the variance between the means of the m sequences, B/n = (1/(m-1)) Σ_j (x̄_j − x̄)²
- W: the average of the m within-sequence variances, W = (1/m) Σ_j s_j²
- μ̂: estimate of the target mean, the mean of all mn samples
- σ̂²: estimate of the target variance (unbiased under stationarity), σ̂² = ((n−1)/n) W + B/n
Estimate the posterior of the target distribution as a t distribution (accounting for the sampling variability of μ̂ and σ̂²) with center μ̂ and scale √V̂, where V̂ = σ̂² + B/(mn).
Monitor convergence by the shrink factor √R̂ = √(V̂/W); as it nears 1 for all scalars, collect the post-burn-in samples.

1. Gelman and Rubin (1992) 4/4
Comments:
- √R̂ approaching 1 means the within-sequence variance dominates the between-sequence variance: all sequences have escaped the influence of their starting points and traversed all of the target distribution.
- Quantitative.
Criticisms:
- Relies on the user's ability to find an overdispersed starting distribution.
- Relies on the normal approximation for diagnosing convergence to the true posterior.
- Inefficient: requires multiple sequences and discards a large number of early iterations.
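The shrink-factor computation above can be sketched in a few lines of NumPy. This is a minimal illustration using the simplified form √(V̂/W) without the degrees-of-freedom correction the paper applies; the array shapes and seed are assumptions for the demo, not from the paper.

```python
import numpy as np

def gelman_rubin(chains):
    """Shrink factor (potential scale reduction) for one scalar quantity.

    chains: array of shape (m, n) -- m independent sequences of length n,
    with the first half of each run already discarded.
    """
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    means = chains.mean(axis=1)
    B = n * means.var(ddof=1)                 # n * variance of the sequence means
    W = chains.var(axis=1, ddof=1).mean()     # average within-sequence variance
    var_hat = (n - 1) / n * W + B / n         # pooled estimate of the target variance
    return np.sqrt(var_hat / W)

# Two well-mixed chains sampling the same target: shrink factor near 1
rng = np.random.default_rng(0)
print(gelman_rubin(rng.normal(size=(2, 5000))))
```

With chains started from overdispersed points that have not yet mixed, the between-sequence term B inflates the pooled variance and the factor stays well above 1.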

2. Geweke (1992) 1/3
What?
Uses methods from spectral analysis to assess convergence; the intent is to estimate the mean E[g(θ)] of some function g(θ) of interest.
- Collect g(θ^(j)) after each iteration j.
- Treat {g(θ^(j))}, j = 1, …, p, as a time series and compute its spectral density S_G(ω).
- Use the numerical standard error (NSE) and relative numerical efficiency (RNE) to monitor convergence.
Assumption: the MCMC process and the function g(θ) jointly imply the existence of a spectrum, with a spectral density that has no discontinuity at frequency 0.

2. Geweke (1992) 2/3
How?
Estimate E[g(θ)] from p iterations: Ḡ_p = (1/p) Σ_{j=1}^p g(θ^(j)).
The estimator is asymptotically normal, with asymptotic variance S_G(0)/p.
Determine preliminary iterations: given the sequence {G(j)}, j = 1, …, p, if {G(j)} is stationary, then as p → ∞ the z-statistic comparing the means of an early segment and a late segment of the chain, each standardized by its NSE, converges to N(0, 1); a large |z| indicates the chain is still in its transient phase.
Determine sufficient iterations:
- Numerical standard error (NSE): NSE = √(S_G(0)/p).
- Relative numerical efficiency (RNE): the ratio of the sample variance of g(θ) to S_G(0), indicating the number of draws that would be required to produce the same numerical accuracy if the draws had been an iid sample taken directly from the posterior distribution.

2. Geweke (1992) 3/3
Comments:
- Addresses the issues of both bias and variance.
- Is univariate.
- Requires only a single sampler chain.
Disadvantages:
- Is sensitive to the choice of spectral window.
- Does not specify a procedure for applying the diagnostic, leaving that to the subjective choice of the user.

3. Ritter and Tanner (1992) 1/3
The Gibbs Stopper
Converts the output of the Gibbs sampler to a sample from the exact distribution.
Assign a weight w to the d-dimensional vector X drawn at the current iteration:
w = q(X) / g_i(X)
where q is a function proportional to the joint distribution and g_i is the current Gibbs sampler approximation.
Assessing convergence: if the current approximation to the joint distribution is "close" to the true one, then the distribution of the weights will be degenerate about a constant.

3. Ritter and Tanner (1992) 2/3
Computing g_i:
Let K(X', X) denote the probability of moving from X' (at iteration i) to X (at iteration i+1).
The joint distribution of the samples obtained at iteration i+1 is
g_{i+1}(X) = ∫ K(X', X) g_i(X') dX'
The integration can be approximated by the Monte Carlo method:
g_{i+1}(X) ≈ (1/m) Σ_{l=1}^m K(X_l, X)
where X_1, …, X_m are samples drawn at iteration i.

3. Ritter and Tanner (1992) 3/3
Comments:
- Assesses distributional convergence.
Disadvantages:
- Applicable only with the Gibbs sampler.
- Coding is problem-specific.
- Computation of the weights can be time-intensive.
- If the full conditionals are not standard distributions, the normalizing constants must be estimated.
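The weight-degeneracy idea can be illustrated in one dimension (this is a made-up toy, not the multivariate Gibbs setting of the paper: `q`, `good_g`, and `bad_g` are hypothetical densities standing in for the unnormalized target and two candidate approximations g_i).

```python
import numpy as np

def gibbs_stopper_weights(samples, q_unnorm, g_density):
    """Weights w = q(X) / g_i(X): q is proportional to the target joint,
    g_i is the current sampler approximation, evaluated at the draws."""
    samples = np.asarray(samples, dtype=float)
    return q_unnorm(samples) / g_density(samples)

# Toy target: N(0,1) known only up to its normalizing constant.
q = lambda x: np.exp(-0.5 * x ** 2)
rng = np.random.default_rng(1)
x = rng.normal(size=1000)

good_g = lambda x: np.exp(-0.5 * x ** 2) / np.sqrt(2 * np.pi)             # matches target
bad_g = lambda x: np.exp(-0.5 * (x / 2) ** 2) / (2 * np.sqrt(2 * np.pi))  # too wide

w_good = gibbs_stopper_weights(x, q, good_g)
w_bad = gibbs_stopper_weights(x, q, bad_g)
# At convergence the weights are degenerate about a constant (the unknown
# normalizing constant of q), so their coefficient of variation is near
# zero; a mismatched approximation spreads them out.
print(w_good.std() / w_good.mean(), w_bad.std() / w_bad.mean())
```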

4. Zellner and Min (1995) 1/3
Gibbs Sampler Convergence Criteria (GSC²)
Aims to determine whether the Gibbs sampler not only has converged, but also has converged to the correct result.
Divide the model parameters into two parts, θ and η, with easily sampled posterior conditionals, and write the joint posterior as
p(θ, η | y) ∝ p(θ, η) × L(θ, η | y)   (prior × likelihood)
Derive analytical forms for the conditionals and marginals where possible.
Three convergence criteria are defined; assume (θ₁, η₁) and (θ₂, η₂) are two points in the parameter space.

4. Zellner and Min (1995) 2/3
1. The anchored ratio convergence criterion (ARC²): calculate an anchored ratio of posterior densities analytically and estimate the same ratio from the Gibbs sampler output. If the output is "satisfactory", the estimated values will be close to the analytical one.
2. The difference convergence criterion (DC²): since p(θ|y) p(η|θ, y) = p(η|y) p(θ|η, y), the difference between the two Gibbs-based estimates of the joint density should go to 0; if it does, the output is "satisfactory".
3. The ratio convergence criterion (RC²): the ratio of the joint posterior density estimates at the two selected points should go to 1; if it does, the output is "satisfactory".

4. Zellner and Min (1995) 3/3
Comments:
- Quantitative.
- Requires only a single sampler chain.
- Coding is problem-specific, and analytical work is needed.
Disadvantage:
- Application is limited when the required factorization cannot be achieved.
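A toy sketch of the idea behind GSC², under an assumed bivariate normal "posterior" with known conditionals (everything here, including the correlation and the two test points, is hypothetical, not from the paper): estimate the joint density as a Rao-Blackwellized marginal times a known conditional, and check that a density ratio at two points matches its analytical value.

```python
import numpy as np

# Hypothetical target: (theta, eta) bivariate normal, standard normal
# marginals, correlation rho, so both full conditionals are N(rho*., sd).
rho = 0.6
cond_sd = np.sqrt(1 - rho ** 2)

def norm_pdf(x, mean=0.0, sd=1.0):
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

# Gibbs sampler alternating the two conditionals
rng = np.random.default_rng(2)
G = 20000
eta = np.empty(G)
t = 0.0
for g in range(G):
    e = rng.normal(rho * t, cond_sd)   # draw eta  | theta
    t = rng.normal(rho * e, cond_sd)   # draw theta | eta
    eta[g] = e

def marginal_theta(t0):
    # Rao-Blackwell estimate of p(theta|y) averaged over the eta draws
    return norm_pdf(t0, rho * eta, cond_sd).mean()

def joint_est(t0, e0):
    # joint density via the factorization p(theta|y) * p(eta|theta, y)
    return marginal_theta(t0) * norm_pdf(e0, rho * t0, cond_sd)

def joint_exact(t0, e0):
    return norm_pdf(t0) * norm_pdf(e0, rho * t0, cond_sd)

# Ratio criterion: the estimated joint ratio at two points should match
# the analytical ratio (so rc -> 1) if the sampler has converged.
p1, p2 = (0.5, 0.5), (-1.0, 1.0)
rc = (joint_est(*p1) / joint_est(*p2)) / (joint_exact(*p1) / joint_exact(*p2))
print(rc)
```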

Comparative results 1/5
Test case: a trivariate normal with high correlations.
Run the samplers for relatively few iterations to test whether these methods detect convergence failure or ambiguity.

Comparative results 2/5
1. Gelman & Rubin: shrink factors (→ 1)
2. Geweke: NSE (→ 0)

Comparative results 3/5
Ritter & Tanner: Gibbs stopper (weights w → constant)

Comparative results 4/5
Zellner & Min: difference convergence criterion (→ 0)

Comparative results 5/5
Remarks:
- Geweke's diagnostic appears to declare convergence prematurely.
- Gelman & Rubin's method may be consistent with the facts; however, choosing the starting points is critical.
- The results of the other methods are difficult to interpret.

Summary, Discussion, and Recommendations
- Be cautious when using these diagnostics.
- Use a variety of diagnostic tools rather than any single one.
- Learn as much as possible about the target density before applying an MCMC algorithm.