Pattern Recognition and Machine Learning, Chapter 13: Sequential Data
Affiliation: Kyoto University
Name: Kevin Chien, Dr. Oba Shigeyuki, Dr. Ishii Shin
Date: Dec. 9, 2011
Idea: Origin of Markov Models
Why Markov Models
The i.i.d. assumption is not always appropriate: future data (predictions) often depend on some recent observations. We model this dependence with DAGs, where inference is done by the sum-product algorithm.
State space (Markov) model: latent variables
Discrete latent variables: Hidden Markov Model
Gaussian latent variables: Linear Dynamical Systems
Order of a Markov chain: how far back the data dependence reaches
1st order: the current observation depends only on the previous observation (see the factorization below)
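The first-order assumption can be written as a factorization of the observation sequence, as in Bishop, Ch. 13:
\[ p(x_1, \dots, x_N) = p(x_1) \prod_{n=2}^{N} p(x_n \mid x_{n-1}) \]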
State Space Model
The latent variables z_n form a Markov chain, and each z_n generates its observation x_n. As the order of a Markov model grows, the number of parameters grows with it; the state space model keeps the parameterization manageable. In this model z_{n-1} and z_{n+1} are independent given z_n (they are d-separated by z_n).
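The joint distribution over observations and latent variables then factorizes as in Bishop, Ch. 13:
\[ p(x_1, \dots, x_N, z_1, \dots, z_N) = p(z_1) \left[ \prod_{n=2}^{N} p(z_n \mid z_{n-1}) \right] \prod_{n=1}^{N} p(x_n \mid z_n) \]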
Terminologies: For Understanding Markov Models
Terminologies
Markovian property: in a stochastic process, the probability of a transition depends only on the present state, not on the manner in which the current state was reached.
Transition diagram: shows the transitions between different states of the same variable.
Terminologies (cont.)
Θ notation (review): f = Θ(g) means f is bounded above and below by g asymptotically [Big_O_notation, Wikipedia, Dec. 2011].
z_{n+1} and z_{n-1} are d-separated given z_n: once z_n is observed, every path between z_{n+1} and z_{n-1} through z_n is blocked, so they are conditionally independent.
Markov Models: Formula and Motivation
Hidden Markov Models (HMM)
z_n is a discrete multinomial latent variable.
Transition probability matrix A, with A_{jk} = p(z_n = k | z_{n-1} = j): each row sums to 1, and the probability of staying in the present state is non-zero.
Counting the off-diagonal entries (equivalently, K² entries minus K row-sum constraints) gives K(K-1) independent parameters.
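As an illustration (values assumed, not from the slides), a K = 3 transition matrix whose rows each sum to 1 has 3 × 2 = 6 free parameters:
\[ A = \begin{pmatrix} 0.7 & 0.2 & 0.1 \\ 0.1 & 0.8 & 0.1 \\ 0.3 & 0.3 & 0.4 \end{pmatrix}, \qquad \sum_{k} A_{jk} = 1 \]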
Hidden Markov Models (cont.)
Emission probability p(x_n | z_n, φ), with parameters φ governing the distribution.
Homogeneous model: all latent-variable transitions share the same matrix A (and all emissions share the same φ).
Sampling data is simply following the transitions of the latent chain and, at each step, emitting an observation from the emission distribution while noting the sampled values; see the sketch below.
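A minimal sketch of this ancestral sampling procedure for a homogeneous HMM with discrete (categorical) emissions. The parameter names and sizes (pi, A, B, K = 2 states, M = 3 symbols) are illustrative assumptions, not values from the slides.
```python
import numpy as np

def sample_hmm(pi, A, B, n_steps, rng=None):
    """Ancestral sampling from a homogeneous HMM.

    pi : (K,)   initial state distribution p(z_1)
    A  : (K, K) transition matrix, A[j, k] = p(z_n = k | z_{n-1} = j)
    B  : (K, M) emission matrix,   B[k, m] = p(x_n = m | z_n = k)
    """
    rng = np.random.default_rng() if rng is None else rng
    K, M = B.shape
    z = np.empty(n_steps, dtype=int)
    x = np.empty(n_steps, dtype=int)
    z[0] = rng.choice(K, p=pi)               # sample the initial latent state
    x[0] = rng.choice(M, p=B[z[0]])          # emit an observation from it
    for n in range(1, n_steps):
        z[n] = rng.choice(K, p=A[z[n - 1]])  # follow a latent transition
        x[n] = rng.choice(M, p=B[z[n]])      # emit from the new state
    return z, x

# Illustrative (assumed) parameters: K = 2 states, M = 3 observation symbols
pi = np.array([0.6, 0.4])
A = np.array([[0.9, 0.1],
              [0.2, 0.8]])
B = np.array([[0.5, 0.4, 0.1],
              [0.1, 0.3, 0.6]])
z, x = sample_hmm(pi, A, B, n_steps=10)
```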
HMM: Expectation Maximization for maximum likelihood
Likelihood function: marginalize over the latent variables, p(X | θ) = Σ_Z p(X, Z | θ).
EM: start with initial model parameters θ^old, evaluate the posterior over the latent variables p(Z | X, θ^old) (E step), then maximize the resulting expected complete-data log likelihood with respect to θ (M step).
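In Bishop's notation, the E step defines the function that the M step maximizes:
\[ Q(\theta, \theta^{\text{old}}) = \sum_{Z} p(Z \mid X, \theta^{\text{old}}) \ln p(X, Z \mid \theta) \]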
HMM: forward-backward algorithm
Two-stage message passing on a tree, applied to the HMM to find the marginals p(node) efficiently; here the marginals are p(z_k | x).
Assume p(x_k | z_k), p(z_k | z_{k-1}) and p(z_1) are known.
Notation: x = (x_1, ..., x_n), x_{i:j} = (x_i, x_{i+1}, ..., x_j).
Goal: compute p(z_k | x).
Forward part: compute p(z_k, x_{1:k}) for every k = 1, ..., n.
Backward part: compute p(x_{k+1:n} | z_k) for every k = 1, ..., n.
HMM: forward-backward algorithm (cont.)
p(z_k | x) ∝ p(z_k, x) = p(x_{k+1:n} | z_k, x_{1:k}) p(z_k, x_{1:k})
Since x_{k+1:n} and x_{1:k} are d-separated given z_k,
p(z_k | x) ∝ p(z_k, x) = p(x_{k+1:n} | z_k) p(z_k, x_{1:k})
With these quantities we can run the EM (Baum-Welch) algorithm to estimate the parameter values, sample from the posterior over z given x, or find the most likely z with the Viterbi algorithm.
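Writing α_k(z_k) = p(z_k, x_{1:k}) and β_k(z_k) = p(x_{k+1:n} | z_k), as defined in the next slides, the posterior marginal is obtained by normalizing their product:
\[ p(z_k \mid x_{1:n}) = \frac{\alpha_k(z_k)\, \beta_k(z_k)}{\sum_{z_k} \alpha_k(z_k)\, \beta_k(z_k)} \]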
HMM forward-backward algorithm: Forward part
Compute p(z_k, x_{1:k}):
\[ p(z_k, x_{1:k}) = \sum_{z_{k-1}} p(z_k, z_{k-1}, x_{1:k}) = \sum_{z_{k-1}} p(x_k \mid z_k, z_{k-1}, x_{1:k-1})\, p(z_k \mid z_{k-1}, x_{1:k-1})\, p(z_{k-1}, x_{1:k-1}) \]
This looks like a recursive function: label p(z_k, x_{1:k}) as α_k(z_k). Since {z_{k-1}, x_{1:k-1}} and x_k are d-separated given z_k, and z_k and x_{1:k-1} are d-separated given z_{k-1},
\[ \alpha_k(z_k) = \sum_{z_{k-1}} \underbrace{p(x_k \mid z_k)}_{\text{emission prob.}}\, \underbrace{p(z_k \mid z_{k-1})}_{\text{transition prob.}}\, \underbrace{\alpha_{k-1}(z_{k-1})}_{\text{recursive part}} \qquad k = 2, \dots, n \]
HMM forward-backward algorithm: Forward part (cont.)
Base case: α_1(z_1) = p(z_1, x_1) = p(z_1) p(x_1 | z_1).
If each z has m states, the computational complexity is Θ(m) for each value of z_k at one k, Θ(m²) for each k, and Θ(nm²) in total.
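A minimal sketch of the forward (alpha) recursion for a discrete-emission HMM, reusing the illustrative (pi, A, B) layout assumed in the sampling sketch above. Real implementations usually rescale the alphas or work in log space to avoid underflow.
```python
import numpy as np

def forward(pi, A, B, x):
    """Compute alpha[k, i] = p(z_k = i, x_{1:k}) for a discrete-emission HMM.

    pi : (K,)  p(z_1);  A : (K, K) transitions;  B : (K, M) emissions
    x  : (n,)  observed symbol indices
    """
    n, K = len(x), len(pi)
    alpha = np.zeros((n, K))
    alpha[0] = pi * B[:, x[0]]               # alpha_1 = p(z_1) p(x_1 | z_1)
    for k in range(1, n):
        # alpha_k(z_k) = p(x_k | z_k) * sum_{z_{k-1}} p(z_k | z_{k-1}) alpha_{k-1}(z_{k-1})
        alpha[k] = B[:, x[k]] * (alpha[k - 1] @ A)
    return alpha
```
Each step is a matrix-vector product over the m states, matching the Θ(nm²) total cost noted above.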
HMM forward-backward algorithm: Backward part
Compute p(x_{k+1:n} | z_k) for all z_k and all k = 1, ..., n-1:
\[ p(x_{k+1:n} \mid z_k) = \sum_{z_{k+1}} p(x_{k+1:n}, z_{k+1} \mid z_k) = \sum_{z_{k+1}} p(x_{k+2:n} \mid z_{k+1}, z_k, x_{k+1})\, p(x_{k+1} \mid z_{k+1}, z_k)\, p(z_{k+1} \mid z_k) \]
This also looks like a recursive function: label p(x_{k+1:n} | z_k) as β_k(z_k). Since {z_k, x_{k+1}} and x_{k+2:n} are d-separated given z_{k+1}, and z_k and x_{k+1} are d-separated given z_{k+1},
\[ \beta_k(z_k) = \sum_{z_{k+1}} \underbrace{\beta_{k+1}(z_{k+1})}_{\text{recursive part}}\, \underbrace{p(x_{k+1} \mid z_{k+1})}_{\text{emission prob.}}\, \underbrace{p(z_{k+1} \mid z_k)}_{\text{transition prob.}} \qquad k = 1, \dots, n-1 \]
HMM forward-backward algorithm: Backward part (cont.)
Base case: β_n(z_n) = 1 for all z_n.
If each z has m states, the computational complexity is the same as the forward part: Θ(nm²) in total.
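A matching sketch of the backward (beta) recursion, plus the normalization from the smoothing formula shown earlier. It reuses the assumed (pi, A, B, x) layout and the forward() sketch above; the same caveat about rescaling or log space applies.
```python
import numpy as np

def backward(A, B, x):
    """Compute beta[k, i] = p(x_{k+1:n} | z_k = i) for a discrete-emission HMM."""
    n, K = len(x), A.shape[0]
    beta = np.ones((n, K))                   # base case: beta_n(z_n) = 1
    for k in range(n - 2, -1, -1):
        # beta_k(z_k) = sum_{z_{k+1}} beta_{k+1}(z_{k+1}) p(x_{k+1} | z_{k+1}) p(z_{k+1} | z_k)
        beta[k] = A @ (beta[k + 1] * B[:, x[k + 1]])
    return beta

def posterior_marginals(pi, A, B, x):
    """p(z_k | x_{1:n}) from normalizing alpha * beta at each k (uses forward() above)."""
    alpha, beta = forward(pi, A, B, x), backward(A, B, x)
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)
```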
HMM: Viterbi algorithm
Max-sum algorithm for the HMM: find the most probable sequence of hidden states for a given observation sequence x_{1:n} (example: transforming handwriting images into text).
Assume p(x_k | z_k), p(z_k | z_{k-1}) and p(z_1) are known.
Goal: compute z* = argmax_z p(z | x), where x = x_{1:n} and z = z_{1:n}.
Lemma: if f(a) ≥ 0 for all a and g(a, b) ≥ 0 for all a, b, then max_{a,b} f(a) g(a, b) = max_a [ f(a) max_b g(a, b) ].
Note that max_z p(z | x) ∝ max_z p(z, x).
HMM: Viterbi algorithm (cont.)
\[ \mu_k(z_k) = \max_{z_{1:k-1}} p(z_{1:k}, x_{1:k}) = \max_{z_{1:k-1}} \underbrace{p(x_k \mid z_k)\, p(z_k \mid z_{k-1})}_{f(a)\ \text{part}}\, \underbrace{p(z_{1:k-1}, x_{1:k-1})}_{g(a,b)\ \text{part}} \]
This looks like a recursive function if we can move the max in front of p(z_{1:k-1}, x_{1:k-1}). Apply the lemma with a = z_{k-1}, b = z_{1:k-2}:
\[ \mu_k(z_k) = \max_{z_{k-1}} \Big[ p(x_k \mid z_k)\, p(z_k \mid z_{k-1}) \max_{z_{1:k-2}} p(z_{1:k-1}, x_{1:k-1}) \Big] = \max_{z_{k-1}} \Big[ p(x_k \mid z_k)\, p(z_k \mid z_{k-1})\, \mu_{k-1}(z_{k-1}) \Big] \qquad k = 2, \dots, n \]
HMM: Viterbi algorithm (finish up)
\[ \mu_k(z_k) = \max_{z_{k-1}} p(x_k \mid z_k)\, p(z_k \mid z_{k-1})\, \mu_{k-1}(z_{k-1}) \qquad k = 2, \dots, n \]
Base case: μ_1(z_1) = p(x_1, z_1) = p(z_1) p(x_1 | z_1).
The same reasoning gives max_{z_n} μ_n(z_n) = max_z p(x, z).
This yields the maximum value; to recover the maximizing sequence, compute the recursion bottom-up while remembering which value of z_{k-1} achieved each maximum (μ_k(z_k) looks at all paths through μ_{k-1}(z_{k-1})), then backtrack; see the sketch below.
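A minimal Viterbi sketch in the same assumed (pi, A, B, x) layout, working in log space to avoid underflow (an implementation choice, not something from the slides), with backpointers for recovering the most probable state sequence.
```python
import numpy as np

def viterbi(pi, A, B, x):
    """Most probable hidden-state sequence argmax_z p(z | x) for a discrete-emission HMM."""
    n, K = len(x), len(pi)
    log_mu = np.zeros((n, K))
    back = np.zeros((n, K), dtype=int)            # back[k, i]: best z_{k-1} given z_k = i
    log_mu[0] = np.log(pi) + np.log(B[:, x[0]])   # mu_1(z_1) = p(z_1) p(x_1 | z_1)
    for k in range(1, n):
        # scores[j, i] = log mu_{k-1}(j) + log p(z_k = i | z_{k-1} = j) + log p(x_k | z_k = i)
        scores = log_mu[k - 1][:, None] + np.log(A) + np.log(B[:, x[k]])[None, :]
        back[k] = scores.argmax(axis=0)
        log_mu[k] = scores.max(axis=0)
    z = np.empty(n, dtype=int)                    # backtrack from the best final state
    z[-1] = log_mu[-1].argmax()
    for k in range(n - 1, 0, -1):
        z[k - 1] = back[k, z[k]]
    return z
```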
Additional Information
Excerpts of equations and diagrams from [Pattern Recognition and Machine Learning, Bishop C. M.], pages 605-646.
Excerpts of equations from mathematicalmonk (YouTube LLC, Google Inc.), videos ML 14.6 and 14.7 and related titles, July 2011.