The University of Texas at Austin, CS 395T, Spring 2008, Prof. William H. Press 1 Computational Statistics with Application to Bioinformatics Prof. William.

Slides:



Advertisements
Similar presentations
Introduction to Monte Carlo Markov chain (MCMC) methods
Advertisements

Markov Chain Monte Carlo Convergence Diagnostics: A Comparative Review By Mary Kathryn Cowles and Bradley P. Carlin Presented by Yuting Qi 12/01/2006.
Monte Carlo Methods and Statistical Physics
Bayesian Estimation in MARK
Ch 11. Sampling Models Pattern Recognition and Machine Learning, C. M. Bishop, Summarized by I.-H. Lee Biointelligence Laboratory, Seoul National.
Markov-Chain Monte Carlo
CHAPTER 16 MARKOV CHAIN MONTE CARLO
Bayesian Reasoning: Markov Chain Monte Carlo
Bayesian statistics – MCMC techniques
Suggested readings Historical notes Markov chains MCMC details
BAYESIAN INFERENCE Sampling techniques
CS774. Markov Random Field : Theory and Application Lecture 16 Kyomin Jung KAIST Nov
Computing the Posterior Probability The posterior probability distribution contains the complete information concerning the parameters, but need often.
The University of Texas at Austin, CS 395T, Spring 2008, Prof. William H. Press IMPRS Summer School 2009, Prof. William H. Press 1 4th IMPRS Astronomy.
. PGM: Tirgul 8 Markov Chains. Stochastic Sampling  In previous class, we examined methods that use independent samples to estimate P(X = x |e ) Problem:
Computational statistics 2009 Random walk. Computational statistics 2009 Random walk with absorbing barrier.
The University of Texas at Austin, CS 395T, Spring 2008, Prof. William H. Press IMPRS Summer School 2009, Prof. William H. Press 1 4th IMPRS Astronomy.
Machine Learning CUNY Graduate Center Lecture 7b: Sampling.
The University of Texas at Austin, CS 395T, Spring 2008, Prof. William H. Press 1 Computational Statistics with Application to Bioinformatics Prof. William.
G. Cowan Lectures on Statistical Data Analysis 1 Statistical Data Analysis: Lecture 8 1Probability, Bayes’ theorem, random variables, pdfs 2Functions of.
G. Cowan Lectures on Statistical Data Analysis Lecture 10 page 1 Statistical Data Analysis: Lecture 10 1Probability, Bayes’ theorem 2Random variables and.
Bayes Factor Based on Han and Carlin (2001, JASA).
Estimation and Hypothesis Testing. The Investment Decision What would you like to know? What will be the return on my investment? Not possible PDF for.
Correlation With Errors-In-Variables3/28/20021 Correlation with Errors-In-Variables and an Application to Galaxies William H. Jefferys University of Texas.
880.P20 Winter 2006 Richard Kass 1 Confidence Intervals and Upper Limits Confidence intervals (CI) are related to confidence limits (CL). To calculate.
Model Inference and Averaging
Priors, Normal Models, Computing Posteriors
G. Cowan Lectures on Statistical Data Analysis Lecture 3 page 1 Lecture 3 1 Probability (90 min.) Definition, Bayes’ theorem, probability densities and.
Machine Learning Lecture 23: Statistical Estimation with Sampling Iain Murray’s MLSS lecture on videolectures.net:
Module 1: Statistical Issues in Micro simulation Paul Sousa.
Finding Scientific topics August , Topic Modeling 1.A document as a probabilistic mixture of topics. 2.A topic as a probability distribution.
1 Gil McVean Tuesday 24 th February 2009 Markov Chain Monte Carlo.
Monte Carlo Methods1 T Special Course In Information Science II Tomas Ukkonen
Suppressing Random Walks in Markov Chain Monte Carlo Using Ordered Overrelaxation Radford M. Neal 발표자 : 장 정 호.
Bayes’ Nets: Sampling [These slides were created by Dan Klein and Pieter Abbeel for CS188 Intro to AI at UC Berkeley. All CS188 materials are available.
Bayesian Reasoning: Tempering & Sampling A/Prof Geraint F. Lewis Rm 560:
Improved Cross Entropy Method For Estimation Presented by: Alex & Yanna.
Chapter 3: Maximum-Likelihood Parameter Estimation l Introduction l Maximum-Likelihood Estimation l Multivariate Case: unknown , known  l Univariate.
Fall 2002Biostat Statistical Inference - Confidence Intervals General (1 -  ) Confidence Intervals: a random interval that will include a fixed.
The University of Texas at Austin, CS 395T, Spring 2008, Prof. William H. Press 1 Computational Statistics with Application to Bioinformatics Prof. William.
MCMC (Part II) By Marc Sobel. Monte Carlo Exploration  Suppose we want to optimize a complicated distribution f(*). We assume ‘f’ is known up to a multiplicative.
The generalization of Bayes for continuous densities is that we have some density f(y|  ) where y and  are vectors of data and parameters with  being.
An Introduction to Markov Chain Monte Carlo Teg Grenager July 1, 2004.
The University of Texas at Austin, CS 395T, Spring 2008, Prof. William H. Press 1 Computational Statistics with Application to Bioinformatics Prof. William.
Lecture #9: Introduction to Markov Chain Monte Carlo, part 3
Sampling and estimation Petter Mostad
The University of Texas at Austin, CS 395T, Spring 2008, Prof. William H. Press 1 Computational Statistics with Application to Bioinformatics Prof. William.
A tutorial on Markov Chain Monte Carlo. Problem  g (x) dx I = If{X } form a Markov chain with stationary probability  i  I  g(x ) i  (x ) i  i=1.
1 Introduction to Statistics − Day 4 Glen Cowan Lecture 1 Probability Random variables, probability densities, etc. Lecture 2 Brief catalogue of probability.
Professor William H. Press, Department of Computer Science, the University of Texas at Austin1 Opinionated in Statistics by Bill Press Lessons #31 A Tale.
1 Chapter 8: Model Inference and Averaging Presented by Hui Fang.
Introduction to Sampling Methods Qi Zhao Oct.27,2004.
Review of statistical modeling and probability theory Alan Moses ML4bio.
CS Statistical Machine learning Lecture 25 Yuan (Alan) Qi Purdue CS Nov
G. Cowan Lectures on Statistical Data Analysis Lecture 9 page 1 Statistical Data Analysis: Lecture 9 1Probability, Bayes’ theorem 2Random variables and.
Introduction: Metropolis-Hasting Sampler Purpose--To draw samples from a probability distribution There are three steps 1Propose a move from x to y 2Accept.
G. Cowan Lectures on Statistical Data Analysis Lecture 10 page 1 Statistical Data Analysis: Lecture 10 1Probability, Bayes’ theorem 2Random variables and.
Kevin Stevenson AST 4762/5765. What is MCMC?  Random sampling algorithm  Estimates model parameters and their uncertainty  Only samples regions of.
SIR method continued. SIR: sample-importance resampling Find maximum likelihood (best likelihood × prior), Y Randomly sample pairs of r and N 1973 For.
Markov Chain Monte Carlo in R
MCMC Output & Metropolis-Hastings Algorithm Part I
Advanced Statistical Computing Fall 2016
Jun Liu Department of Statistics Stanford University
Markov chain monte carlo
Remember that our objective is for some density f(y|) for observations where y and  are vectors of data and parameters,  being sampled from a prior.
Hidden Markov Models Part 2: Algorithms
Where did we stop? The Bayes decision rule guarantees an optimal classification… … But it requires the knowledge of P(ci|x) (or p(x|ci) and P(ci)) We.
Computing and Statistical Data Analysis / Stat 7
Opinionated Lessons #39 MCMC and Gibbs Sampling in Statistics
#22 Uncertainty of Derived Parameters
Presentation transcript:

The University of Texas at Austin, CS 395T, Spring 2008, Prof. William H. Press 1 Computational Statistics with Application to Bioinformatics Prof. William H. Press Spring Term, 2008 The University of Texas at Austin Unit 13: Markov Chain Monte Carlo (MCMC)

The University of Texas at Austin, CS 395T, Spring 2008, Prof. William H. Press 2 Markov Chain Monte Carlo (MCMC) samples an unnormalized distribution –usually it’s the Bayes posterior, where we don’t know the normalizing denominator –often in many dimensions (the parameters of a statistical model) –from the sample, we can estimate all quantities of interest The key ideas go back to Metropolis et al. –sample via a Markov chain –impose detailed balance and get ergodicity The Metropolis-Hastings algorithm satisfies detailed balance –it’s an “always-take-favorable, sometimes-take-unfavorable” random stepping scheme –you have to design the steps appropriate to the problem –Gibbs sampler (not done here) is a special case We try MCMC on the Student-t exon length model –see that it is actually bimodel, explaining the sensitivity previously seen We also try the Lazy Birdwatcher problem (in NR3) –see that we can measure the rate parameter in a Poisson process by the fluctuation level, without even knowing the rate of events –easy for MCMC, would be much harder by frequentist methods Unit 13: Markov Chain Monte Carlo (MCMC) (Summary)

The University of Texas at Austin, CS 395T, Spring 2008, Prof. William H. Press 3 Markov Chain Monte Carlo (MCMC) Data set Parameters We want to go beyond simply maximizing and get the whole Bayesian posterior distribution of Bayes says this is proportional to but with an unknown proportionality constant (the Bayes denominator). With such a sample, we can compute any quantity of interest about the distribution of, e.g., means, standard deviations, covariances, etc. (sorry, we’ve changed notation yet again!) MCMC is a way of drawing samples from the distribution without having to know its normalization!

The University of Texas at Austin, CS 395T, Spring 2008, Prof. William H. Press 4 Two ideas due to Metropolis and colleagues make this possible: 1. Instead of sampling unrelated points, sample a Markov chain where each point is (stochastically) determined by the previous one by some chosen distribution Although locally correlated, it is possible to make this sequence ergodic, meaning that it visits every x in proportion to  (x). 2. Any distribution that satisfies (“detailed balance”) will be such an ergodic sequence! Deceptively simple proof: Compute distribution of x 1 ’s successor point So how do we find such a p(x i |x i-1 ) ?

The University of Texas at Austin, CS 395T, Spring 2008, Prof. William H. Press 5 Metropolis-Hastings algorithm: Pick more or less any “proposal distribution” (A multivariate normal centered on x 1 is a typical example.) Then the algorithm is: 1. Generate a candidate point x 2c by drawing from the proposal distribution around x 1 2. Calculate an “acceptance probability” by Notice that the q’s cancel out if symmetric on arguments, as is a multivariate Gaussian 3. Choose x 2 = x 2c with probability , x 2 = x 1 with probability (1-  ) So, It’s something like: always accept a proposal that increases the probability, and sometimes accept one that doesn’t. (Not exactly this because of ratio of q’s.)

The University of Texas at Austin, CS 395T, Spring 2008, Prof. William H. Press 6 Proof: which is just detailed balance! (“Gibbs sampler”, beyond our scope, is a special case of Metropolis- Hastings. See, e.g., NR3.)

The University of Texas at Austin, CS 395T, Spring 2008, Prof. William H. Press 7 g = readgenestats('genestats.dat'); exloglen = log10(cell2mat(g.exonlen)); global exsamp; exsamp = randsample(exloglen,600); [count cbin] = hist(exsamp,(1:.1:4)); count = count(2:end-1); cbin = cbin(2:end-1); bar(cbin,count,'y') Let’s try MCMC on our two-Student-t model, assuming (as before) 600 data points, drawn as a random sample from the full data set Get the data: cstart = [ ] loglikefn twostudloglikenu(cc,exsamp); covar = 0.1 * inv(hessian(loglikefn,cstart,.001)) cstart = covar = Get a starting value from one of our previous fits, and use its (scaled) covariance matrix for the proposal distribution: (they’re not really zeros, they just print that way) shown here binned, but we won’t actually use bins

The University of Texas at Austin, CS 395T, Spring 2008, Prof. William H. Press 8 Here’s the Metropolis-Hastings step function: function cnew = mcmcstep(cold, covar) cprop = mvnrnd(cold,covar); alpha = min(1,exp(loglikefn(cold)-loglikefn(cprop))); if (rand < alpha) cnew = cprop; else cnew = cold; end function ll = loglikefn(cc) %subfunction global exsamp; ll = twostudloglikenu(cc,exsamp); chain = zeros(1000,6); chain(1,:) = cstart; for i=2:1000 chain(i,:) = mcmcstep(chain(i-1,:),covar); end plot(chain(:,5),chain(:,6)) Let’s see the first 1000 steps: width of 2 nd component Student-t index we’re only plotting 2 components of chain, but it of course actually is a sample of the joint distribution of all the parameters

The University of Texas at Austin, CS 395T, Spring 2008, Prof. William H. Press 9 chain = zeros(10000,6); chain(1,:) = cstart; for i=2:10000, chain(i,:) = mcmcstep(chain(i-1,:),covar); end plot(chain(:,5),chain(:,6),'.r') Try steps: OK, plausibly ergodic. Should probably do steps, but Matlab is too slow and I’m too lazy to program it in C. (Don’t you be!) There are various ways of checking for convergence more rigorously, none of them foolproof; see NR3.

The University of Texas at Austin, CS 395T, Spring 2008, Prof. William H. Press 10 The big payoff now is that we can look at the posterior distribution of any quantity, or derived quantity, or joint distribution of quantities, etc., etc. areas = chain(:,2)./(chain(:,3).*chain(:,5)); plot(areas,chain(:,6)) Student t index ratio of areas

The University of Texas at Austin, CS 395T, Spring 2008, Prof. William H. Press 11 hist(areas,1:.2:15) We can now see why the ratio of areas is so hard to determine: For some of our data samples, it can be bimodal. Maximum likelihood and derivatives of the log-likelihood (Fisher information matrix) don’t capture this. MCMC and Bayes posteriors do.

The University of Texas at Austin, CS 395T, Spring 2008, Prof. William H. Press 12 Incidentally, the distributions are very sensitive to the tail data. Different samples of 600 points give (e.g.) With only one data set, you could diagnose this sensitivity by bootstrap resampling. However, good Bayesians show only the posteriors from all the data actually known.

The University of Texas at Austin, CS 395T, Spring 2008, Prof. William H. Press 13 Let’s do another MCMC example to show how it can be used with models that might be analytically intractable (e.g., discontinuous or non-analytic). [This is the example worked in NR3.] The lazy birdwatcher problem You hire someone to sit in the forest and look for mockingbirds. They are supposed to report the time of each sighting t i –But they are lazy and only write down (exactly) every k 1 sightings (e.g., k 1 = every 3rd) Even worse, at some time t c they get a school kid to do the counting for them –He doesn’t recognize mockingbirds and counts grackles instead –And, he writes down only every k 2 sightings, which may be different from k 1 You want to salvage something from this data –E.g., average rate of sightings of mockingbirds and grackles –Given only the list of times –That is, k 1, k 2, and t c are all unknown nuisance parameters This all hinges on the fact that every second (say) event in a Poisson process is statistically distinguishable from every event in a Poisson process at half the mean rate –same mean rates –but different fluctuations –We are hoping that the difference in fluctuations is enough to recover useful information Perfect problem for MCMC

The University of Texas at Austin, CS 395T, Spring 2008, Prof. William H. Press 14 Waiting time to the kth event in a Poisson process with rate is distributed as Gamma(k, ) And non-overlapping intervals are independent: So Proof: p ( ¿ ) d ¿ = P ( k ¡ 1 coun t s i n¿ ) £ P ( l as t d ¿ h asacoun t ) = P o i sson ( k ¡ 1 ; ¸ ¿ ) £ ( ¸d ¿ ) = ( ¸ ¿ ) k ¡ 1 ( k ¡ 1 ) ! e ¡ ¸ ¿ ¸d ¿

The University of Texas at Austin, CS 395T, Spring 2008, Prof. William H. Press 15 In the acceptance probability the ratio of the q’s in is just x 2c /x 1, because What shall we take as our proposal generator? This is often the creative part of getting MCMC to work well! For t c, step by small additive changes (e.g., normal) For 1 and 2, step by small multiplicative changes (e.g., lognormal) Bad idea: For k 1,2 step by 0 or ±1 This is bad because, if the ’s have converged to about the right rate, then a change in k will throw them way off, and therefore nearly always be rejected. Even though this appears to be a “small” step of a discrete variable, it is not a small step in the model! Good idea: For k 1,2 step by 0 or ±1, also changing 1,2 so as to keep /k constant in the step This is genuinely a small step, since it changes only the clumping statistics, by the smallest allowed amount.

The University of Texas at Austin, CS 395T, Spring 2008, Prof. William H. Press 16 Let’s try it. We simulate 1000 t i ’s with the secretly known 1 =3.0,  2 =2.0, t c =200, k 1 =1, k 2 =2 Start with wrong values 1 =1.0,  2 =3.0, t c =100, k 1 =1, k 2 =1 “burn-in” period while it locates the Bayes maximum ergodic period during which we record data for plotting, averages, etc.

The University of Texas at Austin, CS 395T, Spring 2008, Prof. William H. Press 17 Histogram of quantities during a long-enough ergodic time These are the actual Bayesian posteriors of the model! Could as easily do joint probabilities, covariances, etc., etc. Notice does not converge to being centered on the true values, because the (finite available) data is held fixed. Convergence is to the Bayesian posterior for that data.