Markov Chain Monte Carlo and Gibbs Sampling Vasileios Hatzivassiloglou University of Texas at Dallas
2 Markov chains A subtype of random walks (not necessarily uniform) where the entire memory of the system is contained in the current state Described by a transition matrix P, where p ij is the probability of going from state i to state j Very useful for describing stochastic discrete systems
3 Markov chain example CG A T p AC p TC p CT p TG p GT p GA p AG p CA p TA p AT p CG p GC p CC p AA p GG p TT
4 Example application of Markov chains Markov chains model dependencies across time or position The model assigns a probability to each sequence of observed data and can be used to measure how likely an observed sequence is to follow it The GeneMark algorithm uses 5th-order Markov chains (why?) to find genes (distinguishing them from other regions in the DNA)
5 Markov Chains – Stationary Distribution Under general assumptions (irreducibility and aperiodicity), Markov chains have a stationary distribution π, the limit of P k as k goes to infinity An irreducible MC has no identical states An aperiodic MC can reach any state from any other state Such a Markov chain is called ergodic
6 Markov Chain Monte Carlo Used as a means to guide the selection of samples We want the stationary distribution of the Markov chain to be the distribution we are sampling from In many cases, this is relatively easy to do
7 Gibbs sampling A special case of MCMC where the conditional probabilities on specific variables can be calculated easily, but the joint probability must be sampled from
8 Gibbs sampling in our problem Start with a single candidate S={S 1,...,S k } where each S i chosen randomly and uniformly from input sequence i Calculate A and D(A||B) for S Choose one member of S randomly to remove Choose an alternative (from the corresponding sequence) with probability proportional to the corresponding D(A||B) Repeat until D(A||B) converges
9 Exploring alternative strings When we replace a string from sequence i –We examine in turn each of the m-n+1 strings that that sequence could offer –For each such string, we add it temporarily to S and calculate the new A and D(A||B) –Then we assign to each string S ij (j varies across these strings) probability May pick a “worse” string, or the same string we just removed
10 Gibbs sampler convergence Return best S seen across all iterations (may not be the last one) Stop after a fixed number of iterations, or when D(A||B) does not change very much Solution is sensitive to the starting S, so we typically run the algorithm several (thousand) times from different starting points
11 Complexity of Gibbs sampler Construct initial S and calculate A and D(A||B) in O(kn) time Each iterative step takes O(n) time to remove a string and recalculate D(A||B), O(mn) time to calculate the probabilities of the m-n+1 alternatives Total time is O(mnd) where d is the number of iteration steps (dm>>k), multiplied by the number of random restarts
12 Why Gibbs sampling works Retains elements of the greedy approach –weighing by relative entropy makes likely to move towards locally better solutions Allows for locally bad moves with a small probability, to escape local maxima
13 Variations in Gibbs sampling Discard substrings non-uniformly (weighed by relative entropy, analogous to subsequent selection of new string) Use simulated annealing to reduce chance of making a bad move (and gradually ensure convergence)
14 Annealing Annealing is a process in metallurgy for improving metals by increasing crystal size and reducing defects The process works by heating the metal and controlled cooling which lets the atoms go through a series of states with gradually lower internal energy The goal is to have the metal settle in a configuration with lower than the original internal energy
15 Simulated Annealing Simulated annealing (SA) adopts an energy function equal to the function we want to minimize Transitions between neighboring states are adopted with probability specified by a function f of the energy gain ΔE=E new -E old and the temperature T f(ΔE,T)>0 even if ΔE>0, but as T→0, f(ΔE,T)→1 if ΔE 0
16 Simulated Annealing Original f (from the Metropolis-Hasting algorithm) T controls acceptance of locally bad solutions The annealing schedule is a process for gradually reducing T so that eventually only good moves are accepted
17 Special cases If T is always zero, –simulated annealing reduces to greedy local optimization If T is constant but non-zero, –simulated annealing reduces to the process we described for Gibbs sampling (choose solutions randomly with probability proportional to their improvement)