Markov-Chain Monte Carlo CSE586 Computer Vision II Spring 2010, Penn State Univ.


References

Recall: Sampling Motivation If we can generate random samples x_i from a given distribution P(x), then we can estimate expected values of functions under this distribution by summation rather than integration. That is, we can approximate E[f(x)] = ∫ f(x) P(x) dx by first generating N i.i.d. samples x_1, ..., x_N from P(x) and then forming the empirical estimate E[f(x)] ≈ (1/N) Σ_i f(x_i).
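For instance, in matlab, a minimal sketch of this idea (an illustrative example, not from the slides) for f(x) = x^2 under a standard normal P(x):

% Estimate E[f(x)] for f(x) = x.^2 under P(x) = N(0,1) by averaging
% over i.i.d. samples; the true value is 1.
N = 100000;              % number of samples
x = randn(N, 1);         % i.i.d. samples from the standard normal
estimate = mean(x.^2)    % empirical estimate of E[f(x)]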

Complications: P(x) may be known only up to an unknown normalization factor, and we only know how to sample directly from a few "nice" multidimensional distributions (uniform, normal).

Recall: Sampling Methods Inverse Transform Sampling (CDF) Rejection Sampling Importance Sampling
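As a reminder of the first of these, here is a minimal inverse transform sampling sketch in matlab (an assumed example, not from the slides): push uniform samples through the inverse CDF of an exponential distribution.

% Inverse transform sampling for an exponential distribution with rate lambda:
% if u ~ Uniform(0,1), then x = -log(1-u)/lambda has the exponential CDF.
lambda = 2;
u = rand(10000, 1);          % uniform samples on (0,1)
x = -log(1 - u) / lambda;    % exponential(lambda) samples
mean(x)                      % should be close to 1/lambda = 0.5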

Problem Intuition: In high-dimensional problems, the "typical set" (the region of non-negligible probability in the state space) is a small fraction of the total space.

Recall: Markov Chain Question Assume you start in some state and then run the simulation for a large number of time steps. What percentage of time do you spend at X1, X2 and X3? Recall: K is the transpose of the transition matrix (columns sum to one).

Starting from four possible initial distributions and repeatedly applying K (initial distribution, distribution after one time step, ...), all eventually end up at the same distribution -- this is the stationary distribution!

in matlab: [E,D] = eigs(K)
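Spelled out slightly (a sketch using an assumed 3-state K, not the matrix from the slides): the stationary distribution is the eigenvector of K with eigenvalue 1, rescaled to sum to one.

% K is the transpose of the transition matrix, so its columns sum to one
% and the stationary distribution satisfies K * pi = pi.
K = [0.5 0.2 0.3;
     0.3 0.5 0.2;
     0.2 0.3 0.5];                % assumed example; columns sum to one
[E, D] = eig(K);                  % eig is fine for a small dense matrix
[~, idx] = max(real(diag(D)));    % the largest eigenvalue of K is 1
pi_stat = real(E(:, idx));
pi_stat = pi_stat / sum(pi_stat)  % normalize so the entries sum to one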

General Idea Start in some state, and then run the simulation for some number of time steps. After you have run it "long enough", start keeping track of the states you visit: {... X1 X2 X1 X3 X3 X2 X1 X2 X1 X1 X3 X3 X2...} These are samples from the distribution you want, so you can now compute any expected values with respect to that distribution empirically.
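A minimal simulation sketch in matlab (an assumed example, continuing the 3-state chain above; T is the row-stochastic transition matrix, i.e. T = K' in the slides' notation):

% Simulate the chain, discard a burn-in period, then estimate the fraction
% of time spent in each state from the remaining visits.
T = [0.5 0.3 0.2;
     0.2 0.5 0.3;
     0.3 0.2 0.5];                 % rows sum to one (T = K' above)
nsteps = 50000;  burnin = 1000;
state = 1;                         % arbitrary starting state
visits = zeros(1, 3);
for t = 1:nsteps
    state = find(cumsum(T(state, :)) >= rand, 1);   % sample the next state
    if t > burnin
        visits(state) = visits(state) + 1;
    end
end
visits / sum(visits)               % approximates the stationary distribution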

Theory If the Markov chain is positive recurrent (the expected return time to every state is finite), there exists a stationary distribution. If it is positive recurrent and irreducible (every state is accessible from every other state), there exists a unique stationary distribution. Then the average of a function f over samples of the Markov chain is equal to the average of f with respect to the stationary distribution. The sample average can be computed empirically as we generate samples; the stationary average is what we actually want to compute, and is infeasible to compute in any other way.
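Written as an equation, the statement being used is that, for such a chain with stationary distribution $\pi$,

\[ \frac{1}{N}\sum_{t=1}^{N} f(X_t) \;\longrightarrow\; \mathbb{E}_{\pi}[f(X)] \;=\; \sum_{x} f(x)\,\pi(x) \quad \text{as } N \to \infty , \]

with the left-hand side computable from the generated samples and the right-hand side the quantity of interest.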

But how to "design" the chain? Assume you want to spend a particular percentage of time at X1, X2 and X3. What should the transition probabilities be? States X1, X2, X3 with P(x1) = 0.2, P(x2) = 0.3, P(x3) = 0.5, and K = [ ? ? ? ; ? ? ? ; ? ? ? ].

Let’s consider a specific example

Example: People counting Problem statement: Given a foreground image and a person-sized bounding box*, find a configuration (number and locations) of bounding boxes that cover a majority of foreground pixels while leaving a majority of background pixels uncovered. *Note: the height, width and orientation of the bounding box may depend on image location; we determine these relationships beforehand through a calibration procedure. [figures: foreground image; person-sized bounding box]

Likelihood Score To measure how "good" a proposed configuration is, we generate a foreground image from it and compare with the observed foreground image to get a likelihood score. config = {{x1,y1,w1,h1,theta1}, {x2,y2,w2,h2,theta2}, {x3,y3,w3,h3,theta3}} [figure: observed foreground image compared with generated foreground image to give the likelihood score]

Bernoulli distribution model: each pixel of the generated image independently matches the observed foreground image with some probability p, so the likelihood of a configuration is p^(#pixels that agree) * (1-p)^(#pixels that disagree). Simplifying by taking the log, the log likelihood is (up to constants) a decreasing linear function of the number of pixels that disagree!
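A minimal matlab sketch of this score (an assumed implementation, not the course code; the placeholder images and the agreement probability p are made up for illustration):

% Score a configuration by counting pixels where the generated foreground
% image disagrees with the observed one. Under a per-pixel Bernoulli model
% with agreement probability p, the log likelihood is an affine, decreasing
% function of that count.
observed  = rand(240, 320) > 0.5;     % placeholder binary images, illustration only
generated = rand(240, 320) > 0.5;
p = 0.9;                              % assumed per-pixel agreement probability
d = sum(observed(:) ~= generated(:)); % number of pixels that disagree
loglik = numel(observed) * log(p) + d * (log(1 - p) - log(p))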

Searching for the Max config_k = {{x1,y1,w1,h1,theta1}, {x2,y2,w2,h2,theta2}, …, {xk,yk,wk,hk,thetak}} The space of configurations is very large. We can't exhaustively search for the max likelihood configuration. We can't even really uniformly sample the space to a reasonable degree of accuracy. Let N = number of possible locations for (xi,yi) in a k-person configuration. Size of config_k space = N^k. And we don't even know how many people there are... Size of config space = N^0 + N^1 + N^2 + N^3 + … If we also wanted to search for width, height and orientation, this space would be even more huge.

Searching for the Max Local Search Approach –Given a current configuration, propose a small change to it –Compare likelihood of proposed config with likelihood of the current config –Decide whether to accept the change

Proposals Add a rectangle (birth): the proposed configuration is the current configuration with one rectangle added.

Proposals Remove a rectangle (death): the proposed configuration is the current configuration with one rectangle removed.

Proposals Move a rectangle (move): the proposed configuration is the current configuration with one rectangle shifted to a new location.

Searching for the Max Naïve Acceptance –Accept the proposed configuration if it has a larger likelihood score, i.e. compute a = L(proposed) / L(current) and accept if a > 1 –Problem: leads to hill-climbing behavior that gets stuck in local maxima. [figure: likelihood curve; starting on a small bump, hill climbing brings us to a nearby local max, but we really want to be at the higher peak over there!]

Searching for the Max The MCMC approach –Generate random configurations from a distribution proportional to the likelihood! [figure: likelihood curve; many high likelihood configurations are generated, few low likelihood ones]

Searching for the Max The MCMC approach –Generate random configurations from a distribution proportional to the likelihood! – This searches the space of configurations in an efficient way. – Now just remember the generated configuration with the highest likelihood.

Sounds good, but how to do it? Think of configurations as nodes in a graph. Put a link between nodes if you can get from one config to the other in one step (birth, death, move). [figure: configurations A through E connected by birth/death and move links] Note links come in pairs: birth/death; move/move.

Detailed Balance Consider a pair of configuration nodes r, s. We want to generate them with frequency relative to their likelihoods L(r) and L(s). Let q(r,s) be the relative frequency of proposing configuration s when the current state is r (and q(s,r) vice versa). A sufficient condition to generate r and s with the desired frequency is the "detailed balance" condition L(r) q(r,s) = L(s) q(s,r).

Detailed Balance Typically, your proposal frequencies do NOT satisfy detailed balance (unless you are extremely lucky). To "fix this", we introduce a computational fudge factor a: the proposed move from r to s is accepted only with probability a, so the effective transition frequency becomes a * q(r,s). Detailed balance then requires a * L(r) q(r,s) = L(s) q(s,r). Solve for a: a = [ L(s) q(s,r) ] / [ L(r) q(r,s) ].

MCMC Sampling Metropolis Hastings algorithm: Propose a new configuration. Compute a = [ L(proposed) q(proposed, current) ] / [ L(current) q(current, proposed) ]. Accept if a > 1; else accept anyway with probability a. (That last step is the difference from the naïve algorithm.)
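A minimal sketch of the resulting loop for the people-counting search (illustrative matlab pseudocode; init_config, loglik and propose are assumed helper functions, not the course code, with propose also returning the forward and reverse proposal probabilities):

% Metropolis-Hastings search over configurations; work in log space so the
% likelihood ratio does not under/overflow.
config = init_config();                    % assumed starting configuration
cur_ll = loglik(config);
best = config;  best_ll = cur_ll;
for iter = 1:10000
    [proposal, q_fwd, q_rev] = propose(config);          % e.g. birth / death / move
    prop_ll = loglik(proposal);
    log_a = prop_ll - cur_ll + log(q_rev) - log(q_fwd);  % log acceptance ratio
    if log(rand) < log_a                   % accept with probability min(1, a)
        config = proposal;  cur_ll = prop_ll;
        if cur_ll > best_ll                % remember the best configuration seen
            best_ll = cur_ll;  best = config;
        end
    end
end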

Trans-dimensional MCMC Green's reversible-jump approach (RJMCMC) gives a general template for exploring and comparing states of differing dimension (different numbers of rectangles, in our case). Proposals come in reversible pairs: birth/death and move/move. We should add another term to the acceptance ratio for pairs that jump across dimensions; however, that term is 1 for our simple proposals.

MCMC in Action [movies: the sequence of proposed configurations and the sequence of accepted configurations]

MCMC in Action [figure: the max likelihood configuration found. Looking good!]

Examples

Difference from rejection sampling: instead of throwing away rejections, you replicate the current sample into the next time step. Note: you can just make q up on the fly.

Metropolis Hastings Example States X1, X2, X3 with target distribution P(x1) = 0.2, P(x2) = 0.3, P(x3) = 0.5. Proposal distribution: q(x_i, x_((i-1) mod 3)) = 0.4, q(x_i, x_((i+1) mod 3)) = 0.6. Matlab demo
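A minimal sketch of what such a demo might look like (a guess for illustration, not the actual course demo):

% Metropolis-Hastings on three states with target P = [0.2 0.3 0.5] and the
% asymmetric proposal above: from state x, propose x+1 (mod 3) with prob 0.6
% and x-1 (mod 3) with prob 0.4.
P = [0.2 0.3 0.5];
nsteps = 100000;
x = 1;  counts = zeros(1, 3);
for t = 1:nsteps
    if rand < 0.6
        xp = mod(x, 3) + 1;       q_fwd = 0.6;  q_rev = 0.4;   % "next" state
    else
        xp = mod(x - 2, 3) + 1;   q_fwd = 0.4;  q_rev = 0.6;   % "previous" state
    end
    a = (P(xp) * q_rev) / (P(x) * q_fwd);    % MH acceptance ratio
    if rand < min(1, a)
        x = xp;
    end
    counts(x) = counts(x) + 1;    % rejected steps replicate the current state
end
counts / nsteps                   % should approach [0.2 0.3 0.5]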

Variants of MCMC there are many variations on this general approach, some derived as special cases of the Metropolis-Hastings algorithm

When the proposal distribution is symmetric (e.g. a Gaussian centered on the current sample), q(x',x) and q(x,x') cancel in the acceptance ratio; this special case is the Metropolis algorithm.
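For example, a minimal random-walk Metropolis sketch in matlab (an assumed 1D example, not from the slides); note the target needs to be known only up to its normalization:

% Symmetric Gaussian proposal, so the q terms cancel and only the ratio of
% (unnormalized) target densities enters the acceptance test.
target = @(x) exp(-0.5 * (x - 2).^2) + 0.5 * exp(-0.5 * (x + 2).^2);
sigma = 1.0;                      % proposal standard deviation
nsteps = 50000;
x = 0;  samples = zeros(nsteps, 1);
for t = 1:nsteps
    xp = x + sigma * randn;       % proposal centered on the current sample
    if rand < min(1, target(xp) / target(x))
        x = xp;
    end
    samples(t) = x;
end
hist(samples, 50)                 % both modes appear, weighted roughly 2:1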

A simpler version updates one variable at a time, using 1D conditional distributions, or line search, or...

[figure: interleaving moves along the 1D marginal wrt x1 and the 1D marginal wrt x2]

Gibbs Sampler Special case of MH in which each proposal redraws one component from its conditional distribution given the others; the acceptance ratio is then always 1 (so you always accept the proposal). S. Brooks, "Markov Chain Monte Carlo and its Application".
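A minimal Gibbs sampling sketch in matlab (an assumed example, not from the reference): for a zero-mean bivariate Gaussian with unit variances and correlation rho, each 1D conditional is itself Gaussian, so each coordinate can be resampled exactly in turn.

% Gibbs sampling for a bivariate Gaussian: x1|x2 ~ N(rho*x2, 1-rho^2) and
% symmetrically for x2|x1, so every "proposal" is accepted.
rho = 0.8;  nsteps = 20000;
x = [0 0];  samples = zeros(nsteps, 2);
for t = 1:nsteps
    x(1) = rho * x(2) + sqrt(1 - rho^2) * randn;   % resample x1 given x2
    x(2) = rho * x(1) + sqrt(1 - rho^2) * randn;   % resample x2 given new x1
    samples(t, :) = x;
end
c = corrcoef(samples);  c(1, 2)    % should be close to rho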

Simulated Annealing introduce a “temperature” term that makes it more likely to accept proposals early on. This leads to more aggressive exploration of the state space. Gradually reduce the temperature, causing the process to spend more time exploring high likelihood states. Rather than remember all states visited, keep track of the best state you’ve seen so far. This is a method that attempts to find the global max (MAP) state.
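A minimal simulated annealing sketch in matlab (an assumed toy example; the objective, proposal width and cooling schedule are made up for illustration):

% The temperature T scales the log-likelihood difference: early on (large T)
% almost any proposal is accepted; as T shrinks the walk concentrates on
% high-likelihood states. Only the best state seen so far is kept.
loglik = @(x) -(x.^2 - 4).^2;          % toy objective with maxima at x = +/-2
x = 5;  best = x;  best_ll = loglik(x);
T0 = 10;  nsteps = 20000;
for t = 1:nsteps
    T = T0 * 0.999^t;                  % assumed geometric cooling schedule
    xp = x + 0.5 * randn;              % symmetric random-walk proposal
    if rand < min(1, exp((loglik(xp) - loglik(x)) / T))
        x = xp;
    end
    if loglik(x) > best_ll
        best_ll = loglik(x);  best = x;
    end
end
best                                   % should end up close to +2 or -2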

Trans-dimensional MCMC Exploring alternative state spaces of differing dimensions (for example, when doing EM, also trying to estimate the number of clusters along with the parameters of each cluster). Green's reversible-jump approach (RJMCMC) gives a general template for exploring and comparing states of differing dimension.