STAT 534: Statistical Computing Hari Narayanan harin@uw.edu
Course objectives Write programs in R and C tailored to specifics of statistics problems you want to solve Familiarize yourself with: optimization techniques Markov Chain Monte Carlo (mcmc)
Logistics Class: Office hours: Textbooks: Evaluation: Tuesdays and Thursdays 12:00pm – 1:20pm Office hours: Thursday 2:30pm – 4pm (Padelford B-301) or by appt Textbooks: Robert & Casella : Introducing Monte Carlo Methods with R Kernighan & Ritchie : C Programming Language Evaluation: 4 assignments 2 quizzes Final project
Note to students These slides contain more material than Was covered in class. The main source http://pages.uoregon.edu/dlevin/MARKOV/markovmixing.pdf is referred to at http://faculty.washington.edu/harin/stat534-lectures.html
Review of MCMC done so far Natural questions 1. What is the Monte carlo method? an elementary example - (a) finding the area of an irregular region on the plane. (b) integrating a nonnegative function over an area
(a) Area of an irregular region Bound the region by a rectangle Pick points uniformly with rectangle Area of region/Area of rectangle= (Number in)/Total Example: Area of a circle of radius 1 /(area of square of side 2)= π
Approximating Pi Draw a circle with radius =1 Inside a 2x2 square – [-1,1] x [-1,1] Pick points (x,y) taking x<- rnorm(n, min=-1,max=1); y<- rnorm(n, min=-1,max=1) Let int be the number of points which fell inside the circle. Approx_pi = int/4
Integrating a nonnegative function over an area Pick points according to a suitable distribution How to pick points according to a given distribution?
Picking points randomly for area of irregular region pick points uniformly at random within a larger regular region and count how many fall within the region in question. for finding the integral of a nonnegative function over an irregular region, pick points according to a suitable distribution How to pick points according to a given distribution?
The markov chain method Pick a Markov chain whose steady state distribution is the desired distribution. Start from any point and let the markov chain run for sometime. Pick the point you reach. You would have picked the point you reached with approximately the steady state probability for that point
Markov Chain + filter=New MC Here you begin with a certain Markov chain with a symmetric transition matrix But you make your move to the next point with an additional condition of acceptance or rejection
Details of the Metropolis Filter Let T be a symmetric transition matrix. (T is reversible with respect to the uniform Probability distribution on the sample space Ω.) We will modify the transitions made according to T to obtain a chain with stationary distribution π Ωπ
Details of Metropolis Filter Evolution of the new chain : When at state x, candidate move generated from T(x,.). If proposed new state is y. then the move is censored with probability (1-a(x,y)) , equivalently with probability a(x,y) the new state is the proposed state and with probability (1-a(x,y)), the chain remains at x. Note : This is a new Markov chain with transition matrix, say P Ωπ
What is the new transition matrix? P(x,y) = T(x,y) a(x,y) if y not equal to x = (1 -∑z T(x,z)a(x,z)), z not equal to x if y=x We know transition matrix P has stationary distribution π if π(x)T(x,y)a(x,y)= π(y)T(y,x)a(y,x) for all x not equal to y.
Transition matrix P has stationary distribution π if π(x)T(x,y)a(x,y)= π(y)T(y,x)a(y,x) for all x not equal to y. Proof: For p to be stationary distribution for transition matrix P(x,y), we need πP= π, i.e ∑ over x( π (x)P(x,y)) = π (y). Here take P(x,y)=T(x,y)a(x,y). But π(x)P(x,y)= π(x)T(x,y)a(x,y)= π(y)T(y,x)a(y,x) = π (y)P(y,x) So ∑ over x( π (x)P(x,y)) = ∑ over x( π (y)P(y,x)) = π (y) (∑ over x P(y,x))= π (y) since (∑ over x P(y,x))= 1.
Back to the filter The acceptance probability of a move is a(x,y) Rejection of moves slows the chain. So make a(x,y) as large as possible. How large can it be? We will force the `nice’ condition π(x)T(x,y)a(x,y)= π(y)T(y,x)a(y,x),i.e., π(x)a(x,y)= π(y)a(y,x). Since a(x,y) is a probability π(x)T(x,y)a(x,y)= π(y)T(y,x)a(y,x) <= π(x) and π(y) So a(x,y)<= min(π(y)/ π(x), 1), since T is symmetric and a(x,y) is a probability
Metropolis chain of a Markov chain T(x,y) Transition matrix of Markov Chain P(x,y) Transition matrix of Metropolis Chain Is therefore defined as follows P(x,y)= T(x,y) (min( π(y)/ π(x), 1), if y not = x = 1- ∑ over z not=x (P(x,z)) if y = x For this P(x,y) , π is a stationary distribution
Why all this complication? Note that the Metropolis chain is `just another Markov Chain’. So why not start with it right away instead of starting with some other Markov Chain and then do this acceptance/rejection?
Reasons for the indirect procedure 1. May be difficult to build a Markov Chain directly for a stationary distribution π 2. The Metropolis chain depends only on the ratio π(y)/ π(x). Suppose we know π only within some normalizing constant M which is difficult to compute. Metropolis chain does not care what M is.
Numbers assigned to vertices of a graph. Find max value vertex If you `greedily’ move from any point to a higher neighbour you can get trapped at a local maximum Consider a regular graph Random walk has a symmetric transition matrix For k>=1, define πk (x)= k^(f(x))/Z (k) where f(x) is the number assigned to the vertex and Z(k)= ∑ k^(f(x)) over all x, making πk (x), a probability measure.
Suppose number of vertices Large Markov chain corresponding to π, difficult to compute because Z difficult to compute. But Metropolis chain of the `random walk Markov Chain’ accepts transition with probability k^(f(x)-f(y)). We will show that for large k this randomized algorithm ultimately gets to a maximum vertex. I
Proof that Metropolis method hits max value vertex Since πk (x)= k^(f(x))/Z (k), this distribution favours vertices with large value of f(x). Let f* be the max value that a vertex has. Define Ω* = collection of all vertices with max value. Now πk (x) can be rewritten as ( k^(f(x))/k^(f*)) / (Z (k)/ k^(f*)) (Z (k)/ k^(f*)) can be written as |Ω*|(k^f*/k^f*) + summation over the remaining elements of the expression k^f(x)/k^f* For large k, therefore, πk (x) converges to 1 over the vertices in Ω*, i.e with probability 1 we hit one of the vertices with max value.