MCMC (Part II) By Marc Sobel

Monte Carlo Exploration: Suppose we want to optimize a complicated distribution f(·). We assume f is known only up to a multiplicative constant of proportionality. Newton-Raphson-type reasoning says that we can pick a point nearer a mode by iterating the deterministic gradient step x_{t+1} = x_t + (ε²/2) ∇log f(x_t) for a small step size ε.
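
A minimal Matlab sketch of this deterministic gradient step; the handle gradlogf and the target (a standard normal, so ∇log f(x) = -x) are illustrative assumptions, not part of the lecture:

% Deterministic gradient steps toward a mode of f (a sketch, not the lecture's code).
gradlogf = @(x) -x;                  % assumed example target: standard normal, grad log f(x) = -x
x = [2; -3];                         % starting point
eps2 = 0.1;                          % squared step size epsilon^2
for t = 1:200
    x = x + (eps2/2)*gradlogf(x);    % x_{t+1} = x_t + (eps^2/2) * grad log f(x_t)
end
disp(x)                              % ends up near the mode (the origin here)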

Langevin Algorithms: Monte Carlo demands that we explore the distribution rather than simply move toward a mode. Therefore, we introduce a noise term: x_{t+1} = x_t + (σ²/2) ∇log f(x_t) + σ ε_t, where ε_t ~ N(0, I). (Note that the deterministic step size ε of the previous slide has been replaced by σ.) We can use this chain as is, or combine it with a Metropolis-Hastings step.

Langevin Algorithm with Metropolis-Hastings: Treat the Langevin update as a proposal y drawn from q(y | x_t) = N(x_t + (σ²/2) ∇log f(x_t), σ² I). The move probability is ρ(x_t, y) = min{ 1, [f(y) q(x_t | y)] / [f(x_t) q(y | x_t)] }.
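
A hedged Matlab sketch of this Metropolis-adjusted Langevin step; the handles logf and gradlogf (here a standard normal target) and the value of σ are illustrative assumptions:

% Metropolis-adjusted Langevin (MALA) transitions for a target f (a sketch under the assumptions above).
logf     = @(x) -0.5*sum(x.^2);                 % assumed example: standard normal target (up to a constant)
gradlogf = @(x) -x;
sigma = 0.5;
x = randn(2,1);                                  % current state
for t = 1:1000
    mu_x = x + (sigma^2/2)*gradlogf(x);          % Langevin drift from the current state
    y    = mu_x + sigma*randn(size(x));          % Langevin proposal
    mu_y = y + (sigma^2/2)*gradlogf(y);          % drift from the proposed state
    logq_y_given_x = -sum((y - mu_x).^2)/(2*sigma^2);   % log q(y | x), constants cancel
    logq_x_given_y = -sum((x - mu_y).^2)/(2*sigma^2);   % log q(x | y)
    logrho = logf(y) - logf(x) + logq_x_given_y - logq_y_given_x;
    if log(rand) < logrho                        % accept with probability min(1, rho)
        x = y;
    end
end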

Extending the Langevin to a Hybrid Monte Carlo Algorithm: Instead of moving based entirely on the gradient (with noise added on), we can introduce an auxiliary momentum p and add 'kinetic energy' via the Hamiltonian H(x, p) = E(x) + p'p/2, where E(x) = -log f(x). At each iteration a fresh p is drawn from N(0, I) and the pair (x, p) is moved along an approximately constant-energy path before an accept/reject step. Iterate this algorithm.

Matlab Code for Hybrid MC: a total of Tau leapfrog steps along the (approximately) constant-energy path.

g = gradE(x);                        % gradE: user-supplied gradient of the energy E(x) = -log f(x)
E = -log(f(x));                      % set the energy (negative log target density); x is a column vector
for i = 1:L                          % L Hybrid MC iterations
    p = randn(size(x));              % draw a fresh momentum
    H = p'*p/2 + E;                  % current Hamiltonian
    gnew = g; xnew = x;
    for tau = 1:Tau                  % Tau leapfrog steps
        p = p - epsilon*gnew/2;      % make a half step in p
        xnew = xnew + epsilon*p;     % make an x step
        gnew = gradE(xnew);          % update the gradient
        p = p - epsilon*gnew/2;      % make another half step in p
    end
    Enew = -log(f(xnew));            % find the new energy value
    Hnew = p'*p/2 + Enew;            % find the new Hamiltonian
    dH = Hnew - H;
    if rand < exp(-dH), Accept = 1; else Accept = 0; end
    if Accept == 1, x = xnew; g = gnew; E = Enew; end   % keep the accepted state
end

Example: log f(x) = x² + a² - log(cosh(ax)); k(p) = p².
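
A hedged sketch of problem-specific inputs for the Hybrid MC code above, treating the displayed quantity x² + a² - log(cosh(ax)) as the energy E(x) = -log f(x) (a sign convention the slide leaves implicit); the values of a, epsilon, Tau, and L are illustrative:

a = 2; epsilon = 0.1; Tau = 20; L = 500;              % illustrative settings
f     = @(x) exp(-(x.^2 + a^2 - log(cosh(a*x))));      % target density f, up to a constant
gradE = @(x) 2*x - a*tanh(a*x);                        % gradient of E(x) = x^2 + a^2 - log(cosh(ax))
x = 1;                                                 % starting point
% These definitions supply the f, gradE, epsilon, Tau, L, and x used by the loop on the previous
% slide; record x there (e.g., samples(i) = x;) to collect the draws.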

Project: Use Hybrid MC to sample from a multimodal multivariate density. Does it improve the simulation?

Monte Carlo Optimization: Feedback, Random Updates, and Maximization. Can Monte Carlo help us search for the optimum value of a function? We have already talked about simulated annealing; there are other methods as well.

Random Updates to Get to the Optimum: Suppose we return to the problem of finding modes. Let ζ denote a uniform random variable on the unit sphere, and let the step sizes α_x, β_x be determined by numerical-analytic considerations (see Duflo 1998). Because the search direction is random, we do not get stuck at non-optimal points.
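
The slide's update rule itself is not reproduced here; the Matlab sketch below uses one standard random-direction, finite-difference update in the spirit of Duflo's method, x_{j+1} = x_j + (α_j/2β_j)[h(x_j + β_j ζ_j) - h(x_j - β_j ζ_j)] ζ_j, and both this form and the step-size choices are assumptions:

% Random-direction search for a maximizer of h (a sketch; the update and step sizes are assumptions).
h = @(x) -sum((x - [1; 2]).^2);                      % illustrative objective with maximum at (1, 2)
x = zeros(2,1);
for j = 1:5000
    zeta = randn(size(x)); zeta = zeta/norm(zeta);   % uniform random direction on the unit sphere
    alpha = 1/j; beta = j^(-0.25);                   % decreasing step sizes (illustrative choices)
    x = x + (alpha/(2*beta)) * (h(x + beta*zeta) - h(x - beta*zeta)) * zeta;
end
disp(x)                                              % should end up close to (1, 2)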

Optimization of a Function Depending on the Data: Minimize the (two-way) KLD between a density q(x) and a Gaussian mixture f(x) = Σ_i α_i φ(x - θ_i) using samples. The two-way KLD is KL(q, f) + KL(f, q) = ∫ q log(q/f) dx + ∫ f log(f/q) dx. We can minimize this by first sampling X_1,…,X_n from q, then sampling Y_1,…,Y_n from an importance density s_0(x) (assumed to contain the support of the f's), and minimizing the Monte Carlo estimate (1/n) Σ_i log( q(X_i)/f(X_i) ) + (1/n) Σ_i [ f(Y_i)/s_0(Y_i) ] log( f(Y_i)/q(Y_i) ) over the mixture parameters.
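
A hedged one-dimensional Matlab sketch of this Monte Carlo two-way-KLD objective; the choices of q, s_0, the two-component mixture parametrization, and the use of fminsearch are illustrative assumptions, not the authors' implementation:

% Monte Carlo two-way KLD between q and a two-component Gaussian mixture (1-D sketch).
n = 2000;
npdf = @(x,m,s) exp(-(x-m).^2/(2*s^2)) / (sqrt(2*pi)*s);   % normal density helper
q  = @(x) npdf(x, 0, 1);                   % stand-in for the density q
s0 = @(x) npdf(x, 0, 4);                   % importance density assumed to cover the support
X = randn(n,1);                            % X_i ~ q
Y = 4*randn(n,1);                          % Y_i ~ s0
% Mixture f with unit-variance components; p = [logit of alpha, theta_1, theta_2].
mixf = @(x,p) (1./(1+exp(-p(1)))).*npdf(x,p(2),1) + (1-1./(1+exp(-p(1)))).*npdf(x,p(3),1);
obj  = @(p) mean(log(q(X)./mixf(X,p))) + mean((mixf(Y,p)./s0(Y)).*log(mixf(Y,p)./q(Y)));
phat = fminsearch(obj, [0, -1, 1]);        % minimize the estimated two-way KLD over the parameters
disp(phat)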

Example (Two-Way KLD): Monte Carlo rules dictate that we cannot sample from a distribution that depends on the parameters we want to optimize. Hence we importance sample the second KLD term using s_0. We also employ an EM-type step in which latent component labels Z are introduced for the mixture terms.

Prior Research: We (Dr. Latecki, Dr. Lakaemper, and I) minimized the one-way KLD between a nonparametric density q and a Gaussian mixture (paper pending). Note, however, that for mixture models that put large weight on regions where the nonparametric density is not well supported, minimizing the one-way KLD may not give the best possible result.

Project: Use this formulation to minimize the KLD between q (e.g., a nonparametric density based on a data set) and a Gaussian mixture.

General Theorem in Monte Carlo Optimization: One way of finding an optimal value of a function f(θ), defined on a closed bounded set, is as follows. Define a distribution h_λ(θ) ∝ exp{λ f(θ)} for a parameter λ which we let tend to infinity. If we then simulate θ_1,…,θ_n ~ h_λ(θ), the simulated values concentrate around the maximizer of f; in particular, max_i f(θ_i) converges to max_θ f(θ) as λ and n grow.
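
A hedged Matlab sketch of this idea: sample from h_λ(θ) ∝ exp{λ f(θ)} with a simple random-walk Metropolis chain and keep the best visited point. The objective f, the value of λ, and the proposal scale are illustrative assumptions:

% Maximize f(theta) by sampling from h_lambda(theta) proportional to exp(lambda*f(theta)).
f = @(th) -(th(1)^2 + th(2)^2) + cos(3*th(1));    % illustrative objective, maximized at the origin
lambda = 50;                                       % large lambda concentrates h_lambda near the maximizer
th = [2; 2]; best = th; fbest = f(th);
for t = 1:20000
    prop = th + 0.2*randn(2,1);                    % random-walk proposal
    if log(rand) < lambda*(f(prop) - f(th))        % Metropolis acceptance for h_lambda
        th = prop;
    end
    if f(th) > fbest, best = th; fbest = f(th); end
end
disp(best)                                         % approximate maximizer of f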

Monte Carlo Optimization: Observe (X_1,…,X_n | θ) ~ L(X|θ) and simulate θ_1,…,θ_n from the prior distribution π(θ). Define the posterior (up to a constant of proportionality) through the likelihood l(θ|X). It follows that the weighted average Σ_i θ_i l(θ_i|X) / Σ_i l(θ_i|X) converges to the MLE as the likelihood becomes increasingly concentrated (λ → ∞ in the examples that follow). The proof uses a Laplace approximation (see Robert (1993)).
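
A hedged Matlab sketch of this recipe for a normal-mean model, with the likelihood raised to a power λ as in the examples that follow; the prior, the data, and the value of λ are illustrative assumptions:

% Prior simulation plus likelihood weighting: the weighted mean approaches the MLE (here, mean(X)).
n = 20; X = 3 + randn(n,1);                        % data from a N(3, 1) model; the MLE is mean(X)
m = 1e5; theta = 10*randn(m,1);                    % theta_1,...,theta_m simulated from a wide prior
lambda = 5;                                        % power on the likelihood; larger moves us closer to the MLE
loglik = @(th) -0.5*sum((X - th).^2);              % log-likelihood of the N(theta, 1) model
logw = lambda*arrayfun(loglik, theta);             % log of l(theta_i | X)^lambda
logw = logw - max(logw);                           % stabilize before exponentiating
w = exp(logw);
thetahat = sum(w.*theta)/sum(w);                   % weighted average of the prior draws
disp([thetahat, mean(X)])                          % compare with the MLE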

Exponential Family Example: Let X have the exponential-family density exp{λθx - λψ(θ)}, and let θ ~ π. The log-likelihood is λ(θx - ψ(θ)), so the MLE solves ψ'(θ) = x, and as λ → ∞ the weighted posterior mean above converges to this value.
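
A hedged Matlab check of this claim for one concrete (assumed) choice, ψ(θ) = e^θ, for which the MLE is θ = log(x); the prior, λ, and x are illustrative:

% Exponential family exp{lambda*theta*x - lambda*psi(theta)} with psi(theta) = exp(theta).
x = 2.5;                                           % observed sufficient statistic; the MLE is log(x)
lambda = 50;                                       % power parameter
m = 1e5; theta = 5*randn(m,1);                     % draws from a diffuse prior pi
logw = lambda*(theta*x - exp(theta));              % log of the powered likelihood
logw = logw - max(logw);                           % stabilize before exponentiating
w = exp(logw);
thetahat = sum(w.*theta)/sum(w);                   % weighted posterior mean
disp([thetahat, log(x)])                           % close for large lambda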

Possible Example: It is known that maximum likelihood estimators for the parameters of a k-component mixture model are hard to compute. If, instead of maximizing the likelihood directly, we treat the mixture as a Bayesian model together with a scale parameter λ and an indifference prior, we can (typically) use Gibbs sampling to sample from this model. Letting λ tend to infinity then lets us construct MLEs. A sketch of such a Gibbs sampler is given below.
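
A hedged Matlab sketch of a Gibbs sampler for a 3-component Gaussian mixture with unknown means and weights, known unit variances, and flat-ish priors; the λ-powering step is omitted (λ = 1 here), and the data, priors, and initialization are illustrative assumptions rather than the lecture's specification:

% Gibbs sampler for a K = 3 Gaussian mixture with unknown means and weights, unit variances.
rng(1);
K = 3; n = 300;
x = [randn(100,1)-4; randn(100,1); randn(100,1)+4];    % simulated data from three components
mu = [-1; 0; 1]; alpha = ones(K,1)/K;                   % initial means and weights
T = 2000; mustore = zeros(T,K);
for t = 1:T
    % 1. Sample the component labels z_i given (mu, alpha).
    logp = -(x - mu').^2/2 + log(alpha');               % n-by-K log of alpha_k * phi(x_i - mu_k)
    p = exp(logp - max(logp,[],2)); p = p./sum(p,2);
    u = rand(n,1); z = sum(cumsum(p,2) < u, 2) + 1;     % categorical draws of the labels
    counts = accumarray(z, 1, [K 1]);
    % 2. Sample the weights alpha ~ Dirichlet(1 + counts), via sums of Exp(1) draws.
    g = zeros(K,1);
    for k = 1:K
        g(k) = -sum(log(rand(1 + counts(k), 1)));       % Gamma(1 + counts(k), 1) draw
    end
    alpha = g/sum(g);
    % 3. Sample each mean mu_k from its normal full conditional (flat prior on mu_k).
    for k = 1:K
        if counts(k) > 0
            mu(k) = mean(x(z==k)) + randn/sqrt(counts(k));
        else
            mu(k) = 5*randn;                            % diffuse draw when a component is empty
        end
    end
    mustore(t,:) = mu';
end
disp(mean(mustore(T/2:end,:)))                          % posterior means of the component means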

Project: Implement an algorithm to find the MLE for a simple 3-component mixture model (use Robert (1993)).