Hidden Markov Models
Lecture 5, Tuesday April 15, 2003
Review of Last Lecture (Lecture 2, Thursday April 3, 2003)
Time Warping
Definition: two curves μ(u), ν(v) are connected by an approximate continuous time warping (u_0, v_0) if:
u_0, v_0 are strictly increasing functions on [0, T], and
μ(u_0(t)) ≈ ν(v_0(t)) for 0 ≤ t ≤ T
[Figure: u_0(t) and v_0(t) plotted as increasing functions of t on [0, T]]
Time Warping
[Figure: the (u, v) grid, with u running from 0 to M and v from 0 to N, and a warping path through it]
Define the possible steps: (Δu, Δv) is the allowed difference of u and v between steps h-1 and h:
(Δu, Δv) ∈ { (1, 0), (1, 1), (0, 1) }
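A minimal dynamic-programming sketch of the discrete version of this warping, in Python. Only the allowed steps come from the slide; the per-pair cost |u_i − v_j| and the requirement that the warp cover both sequences end to end are assumptions made here for illustration.

def time_warp_cost(u, v):
    # D[i][j] = cheapest warp of u[:i] onto v[:j] that pairs u[i-1] with v[j-1]
    M, N = len(u), len(v)
    INF = float("inf")
    D = [[INF] * (N + 1) for _ in range(M + 1)]
    D[0][0] = 0.0
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            # allowed steps (du, dv) in {(1,0), (1,1), (0,1)}
            best = min(D[i - 1][j], D[i - 1][j - 1], D[i][j - 1])
            D[i][j] = best + abs(u[i - 1] - v[j - 1])   # assumed local cost
    return D[M][N]

# toy usage: two similar signals sampled at slightly different rates
print(time_warp_cost([1, 2, 3, 3, 4], [1, 2, 2, 3, 4]))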
Definition of a hidden Markov model
Definition: A hidden Markov model (HMM) consists of:
Alphabet Σ = { b_1, b_2, …, b_M }
Set of states Q = { 1, ..., K }
Transition probabilities between any two states: a_ij = transition probability from state i to state j, with a_i1 + … + a_iK = 1 for all states i = 1…K
Start probabilities a_0i, with a_01 + … + a_0K = 1
Emission probabilities within each state: e_k(b) = P( x_i = b | π_i = k ), with e_k(b_1) + … + e_k(b_M) = 1 for all states k = 1…K
[Figure: a fully connected graph over states 1, 2, …, K]
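To make the definition concrete, here is one minimal way to hold such an HMM in Python, filled in with the dishonest-casino numbers used later in this lecture (fair/loaded die, 0.95 stay probability). The uniform start probabilities and the 0.05 switching probability are assumptions, chosen only so that each distribution sums to 1.

states = ["F", "L"]                        # Q = {Fair, Loaded}
alphabet = ["1", "2", "3", "4", "5", "6"]  # Σ

a0 = {"F": 0.5, "L": 0.5}                  # start probabilities (assumed uniform)
a = {                                      # transition probabilities a_kl
    "F": {"F": 0.95, "L": 0.05},
    "L": {"F": 0.05, "L": 0.95},
}
e = {                                      # emission probabilities e_k(b)
    "F": {b: 1.0 / 6 for b in alphabet},
    "L": {b: (0.5 if b == "6" else 0.1) for b in alphabet},
}

# sanity checks: every distribution sums to 1
assert abs(sum(a0.values()) - 1) < 1e-9
for k in states:
    assert abs(sum(a[k].values()) - 1) < 1e-9
    assert abs(sum(e[k].values()) - 1) < 1e-9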
The three main questions on HMMs
1. Evaluation
GIVEN an HMM M, and a sequence x,
FIND Prob[ x | M ]
2. Decoding
GIVEN an HMM M, and a sequence x,
FIND the sequence π of states that maximizes P[ x, π | M ]
3. Learning
GIVEN an HMM M, with unspecified transition/emission probabilities, and a sequence x,
FIND parameters θ = (e_i(.), a_ij) that maximize P[ x | θ ]
Today
Decoding
Evaluation
Problem 1: Decoding
Find the best parse of a sequence
Decoding
GIVEN x = x_1 x_2 …… x_N
We want to find π = π_1, ……, π_N, such that P[ x, π ] is maximized:
π* = argmax_π P[ x, π ]
We can use dynamic programming!
Let V_k(i) = max_{π_1,…,π_{i-1}} P[x_1…x_{i-1}, π_1, …, π_{i-1}, x_i, π_i = k]
= probability of the most likely sequence of states ending at state π_i = k
[Figure: the trellis of states 1…K at positions x_1, x_2, x_3, …, x_N]
Decoding – main idea
Given that, for all states k and for a fixed position i,
V_k(i) = max_{π_1,…,π_{i-1}} P[x_1…x_{i-1}, π_1, …, π_{i-1}, x_i, π_i = k],
what is V_l(i+1)?
From the definition,
V_l(i+1) = max_{π_1,…,π_i} P[x_1…x_i, π_1, …, π_i, x_{i+1}, π_{i+1} = l]
= max_{π_1,…,π_i} P(x_{i+1}, π_{i+1} = l | x_1…x_i, π_1,…, π_i) P[x_1…x_i, π_1,…, π_i]
= max_{π_1,…,π_i} P(x_{i+1}, π_{i+1} = l | π_i) P[x_1…x_{i-1}, π_1, …, π_{i-1}, x_i, π_i]
= max_k P(x_{i+1}, π_{i+1} = l | π_i = k) max_{π_1,…,π_{i-1}} P[x_1…x_{i-1}, π_1,…, π_{i-1}, x_i, π_i = k]
= e_l(x_{i+1}) · max_k a_kl V_k(i)
The Viterbi Algorithm
Input: x = x_1……x_N
Initialization:
V_0(0) = 1   (0 is the imaginary first position)
V_k(0) = 0, for all k > 0
Iteration:
V_j(i) = e_j(x_i) · max_k a_kj V_k(i-1)
Ptr_j(i) = argmax_k a_kj V_k(i-1)
Termination:
P(x, π*) = max_k V_k(N)
Traceback:
π*_N = argmax_k V_k(N)
π*_{i-1} = Ptr_{π*_i}(i)
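A sketch of this algorithm in Python, run on the dishonest-casino dictionaries (states, a0, a, e) defined after the HMM definition slide; it assumes those are in scope. It works in plain probability space; the underflow issue and the log-space fix are discussed two slides below. The imaginary position 0 of the slide is folded into position 1 via a0.

def viterbi(x, states, a0, a, e):
    N = len(x)
    V = [{k: 0.0 for k in states} for _ in range(N)]
    ptr = [{k: None for k in states} for _ in range(N)]

    # Initialization: position 1 comes directly from the start state 0
    for k in states:
        V[0][k] = a0[k] * e[k][x[0]]

    # Iteration: V_l(i) = e_l(x_i) * max_k a_kl V_k(i-1)
    for i in range(1, N):
        for l in states:
            best_k = max(states, key=lambda k: V[i - 1][k] * a[k][l])
            V[i][l] = e[l][x[i]] * V[i - 1][best_k] * a[best_k][l]
            ptr[i][l] = best_k

    # Termination and traceback
    last = max(states, key=lambda k: V[N - 1][k])
    path = [last]
    for i in range(N - 1, 0, -1):
        path.append(ptr[i][path[-1]])
    return list(reversed(path)), V[N - 1][last]

path, p = viterbi("123456266666", states, a0, a, e)
print("".join(path), p)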
The Viterbi Algorithm
Similar to "aligning" a set of states to a sequence
Time: O(K²N)
Space: O(KN)
[Figure: the K×N dynamic-programming table over states 1…K and positions x_1 x_2 x_3 … x_N, with entry V_j(i)]
Viterbi Algorithm – a practical detail
Underflows are a significant problem:
P[ x_1,…, x_i, π_1, …, π_i ] = a_0π_1 a_π_1π_2 ·…· a_π_{i-1}π_i · e_π_1(x_1) ·…· e_π_i(x_i)
These numbers become extremely small – underflow.
Solution: take the logs of all values:
V_l(i) = log e_l(x_i) + max_k [ V_k(i-1) + log a_kl ]
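For illustration, here is the iteration of the Viterbi sketch above rewritten in log space. It assumes all a_kl and e_l(x_i) are strictly positive; zero probabilities would need to be mapped to −∞ before taking logs.

import math

def viterbi_log_step(V_prev, x_i, states, a, e):
    # V_l(i) = log e_l(x_i) + max_k [ V_k(i-1) + log a_kl ]
    V_i, ptr_i = {}, {}
    for l in states:
        best_k = max(states, key=lambda k: V_prev[k] + math.log(a[k][l]))
        V_i[l] = math.log(e[l][x_i]) + V_prev[best_k] + math.log(a[best_k][l])
        ptr_i[l] = best_k
    return V_i, ptr_i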
Example
Let x be a sequence with a portion of ~1/6 6's, followed by a portion of ~1/2 6's…
x = … …
Then it is not hard to show that the optimal parse is (exercise):
FFF…………………...F LLL………………………...L
The six characters "123456":
parsed as F, contribute 0.95^6 (1/6)^6 = 1.6 × 10^-5
parsed as L, contribute 0.95^6 (1/2)^1 (1/10)^5 = 0.4 × 10^-5
The six characters "162636":
parsed as F, contribute 0.95^6 (1/6)^6 = 1.6 × 10^-5
parsed as L, contribute 0.95^6 (1/2)^3 (1/10)^3 = 9.0 × 10^-5
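A quick check of the arithmetic above (the slide rounds; the exact values are printed here):

# contributions of a six-character window under the fair (F) and loaded (L) parses
fair          = 0.95**6 * (1/6)**6               # "123456" or "162636" as FFFFFF
loaded_123456 = 0.95**6 * (1/2)**1 * (1/10)**5   # one six, five non-sixes as LLLLLL
loaded_162636 = 0.95**6 * (1/2)**3 * (1/10)**3   # three sixes, three non-sixes as LLLLLL
print(f"{fair:.2e}  {loaded_123456:.2e}  {loaded_162636:.2e}")
# prints roughly 1.58e-05  3.68e-06  9.19e-05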
Problem 2: Evaluation
Find the likelihood that a sequence is generated by the model
Generating a sequence by the model
Given an HMM, we can generate a sequence of length n as follows:
1. Start at state π_1 according to probability a_0π_1
2. Emit letter x_1 according to probability e_π_1(x_1)
3. Go to state π_2 according to probability a_π_1π_2
4. … until emitting x_n
[Figure: the trellis of states 1…K at positions x_1, x_2, x_3, …, x_n, starting from state 0, with example arrows labelled a_02 and e_2(x_1)]
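A sketch of this generation procedure in Python, reusing the casino dictionaries (states, alphabet, a0, a, e) from the sketch after the HMM definition slide; the uniform start distribution there is an assumption.

import random

def generate(n, states, a0, a, e, alphabet):
    x, pi = [], []
    # step 1: choose the first state according to a_0k
    state = random.choices(states, weights=[a0[k] for k in states])[0]
    for _ in range(n):
        pi.append(state)
        # emit a letter according to e_state(.)
        x.append(random.choices(alphabet, weights=[e[state][b] for b in alphabet])[0])
        # move to the next state according to a_state,l
        state = random.choices(states, weights=[a[state][l] for l in states])[0]
    return "".join(x), "".join(pi)

x, pi = generate(20, states, a0, a, e, alphabet)
print(x)
print(pi)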
A couple of questions
Given a sequence x:
What is the probability that x was generated by the model?
Given a position i, what is the most likely state that emitted x_i?
Example: the dishonest casino
Say x = …
Most likely path: π = FF……F
However: the marked letters are more likely to be L than the unmarked letters
Evaluation
We will develop algorithms that allow us to compute:
P(x)  Probability of x given the model
P(x_i…x_j)  Probability of a substring of x given the model
P(π_i = k | x)  Probability that the i-th state is k, given x
A more refined measure of which states x may be in
The Forward Algorithm
We want to calculate P(x) = probability of x, given the HMM.
Sum over all possible ways of generating x:
P(x) = Σ_π P(x, π) = Σ_π P(x | π) P(π)
To avoid summing over an exponential number of paths π, define
f_k(i) = P(x_1…x_i, π_i = k)   (the forward probability)
The Forward Algorithm – derivation
Define the forward probability:
f_l(i) = P(x_1…x_i, π_i = l)
= Σ_{π_1…π_{i-1}} P(x_1…x_{i-1}, π_1,…, π_{i-1}, π_i = l) e_l(x_i)
= Σ_k Σ_{π_1…π_{i-2}} P(x_1…x_{i-1}, π_1,…, π_{i-2}, π_{i-1} = k) a_kl e_l(x_i)
= e_l(x_i) Σ_k f_k(i-1) a_kl
The Forward Algorithm
We can compute f_k(i) for all k, i, using dynamic programming!
Initialization:
f_0(0) = 1
f_k(0) = 0, for all k > 0
Iteration:
f_l(i) = e_l(x_i) Σ_k f_k(i-1) a_kl
Termination:
P(x) = Σ_k f_k(N) a_k0
where a_k0 is the probability that the terminating state is k (usually = a_0k)
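A sketch of the Forward algorithm in Python, reusing the casino dictionaries (states, a0, a, e) defined earlier. The end probabilities a_k0 are kept as a parameter; setting them all to 1 (i.e., not modelling an end state) is an assumption, since the slide leaves them open. As in the Viterbi sketch, the imaginary position 0 is folded into position 1.

def forward(x, states, a0, a, e, a_end=None):
    # a_end[k] plays the role of a_k0; 1.0 for every k means no modelled end state
    if a_end is None:
        a_end = {k: 1.0 for k in states}
    N = len(x)
    f = [{k: 0.0 for k in states} for _ in range(N)]
    for k in states:                              # start folded into position 1
        f[0][k] = a0[k] * e[k][x[0]]
    for i in range(1, N):
        for l in states:                          # f_l(i) = e_l(x_i) Σ_k f_k(i-1) a_kl
            f[i][l] = e[l][x[i]] * sum(f[i - 1][k] * a[k][l] for k in states)
    p_x = sum(f[N - 1][k] * a_end[k] for k in states)   # termination
    return f, p_x

f, p_x = forward("123456266666", states, a0, a, e)
print(p_x)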
Relation between Forward and Viterbi
VITERBI
Initialization:
V_0(0) = 1
V_k(0) = 0, for all k > 0
Iteration:
V_j(i) = e_j(x_i) max_k V_k(i-1) a_kj
Termination:
P(x, π*) = max_k V_k(N)
FORWARD
Initialization:
f_0(0) = 1
f_k(0) = 0, for all k > 0
Iteration:
f_l(i) = e_l(x_i) Σ_k f_k(i-1) a_kl
Termination:
P(x) = Σ_k f_k(N) a_k0
Motivation for the Backward Algorithm
We want to compute P(π_i = k | x), the probability distribution on the i-th position, given x.
We start by computing
P(π_i = k, x) = P(x_1…x_i, π_i = k, x_{i+1}…x_N)
= P(x_1…x_i, π_i = k) P(x_{i+1}…x_N | x_1…x_i, π_i = k)
= P(x_1…x_i, π_i = k) P(x_{i+1}…x_N | π_i = k)
Forward, f_k(i)        Backward, b_k(i)
The Backward Algorithm – derivation
Define the backward probability:
b_k(i) = P(x_{i+1}…x_N | π_i = k)
= Σ_{π_{i+1}…π_N} P(x_{i+1}, x_{i+2}, …, x_N, π_{i+1}, …, π_N | π_i = k)
= Σ_l Σ_{π_{i+1}…π_N} P(x_{i+1}, x_{i+2}, …, x_N, π_{i+1} = l, π_{i+2}, …, π_N | π_i = k)
= Σ_l e_l(x_{i+1}) a_kl Σ_{π_{i+2}…π_N} P(x_{i+2}, …, x_N, π_{i+2}, …, π_N | π_{i+1} = l)
= Σ_l e_l(x_{i+1}) a_kl b_l(i+1)
The Backward Algorithm
We can compute b_k(i) for all k, i, using dynamic programming.
Initialization:
b_k(N) = a_k0, for all k
Iteration:
b_k(i) = Σ_l e_l(x_{i+1}) a_kl b_l(i+1)
Termination:
P(x) = Σ_l a_0l e_l(x_1) b_l(1)
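A matching sketch of the Backward algorithm, again reusing the casino dictionaries. As in the Forward sketch, taking a_k0 = 1 for every k (no modelled end state) is an assumption; with that choice the termination line should print the same P(x) as the Forward sketch.

def backward(x, states, a0, a, e, a_end=None):
    if a_end is None:
        a_end = {k: 1.0 for k in states}          # slide's a_k0
    N = len(x)
    b = [{k: 0.0 for k in states} for _ in range(N)]
    for k in states:                              # initialization: b_k(N) = a_k0
        b[N - 1][k] = a_end[k]
    for i in range(N - 2, -1, -1):                # b_k(i) = Σ_l e_l(x_{i+1}) a_kl b_l(i+1)
        for k in states:
            b[i][k] = sum(e[l][x[i + 1]] * a[k][l] * b[i + 1][l] for l in states)
    p_x = sum(a0[l] * e[l][x[0]] * b[0][l] for l in states)   # termination
    return b, p_x

b, p_x = backward("123456266666", states, a0, a, e)
print(p_x)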
Computational Complexity
What is the running time, and space required, for Forward and Backward?
Time: O(K²N)
Space: O(KN)
Useful implementation technique to avoid underflows:
Viterbi: sum of logs
Forward/Backward: rescaling at each position by multiplying by a constant
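A sketch of the rescaling idea for Forward (the same trick applies to Backward): divide each column by its sum and accumulate the logs of the scaling constants, so that log P(x) is recovered without underflow. As before, no end state is modelled here, which is an assumption.

import math

def forward_scaled(x, states, a0, a, e):
    N = len(x)
    f = [{k: 0.0 for k in states} for _ in range(N)]
    log_p_x = 0.0
    for i in range(N):
        for l in states:
            if i == 0:
                f[i][l] = a0[l] * e[l][x[i]]
            else:
                f[i][l] = e[l][x[i]] * sum(f[i - 1][k] * a[k][l] for k in states)
        s = sum(f[i].values())                    # scaling constant for column i
        for l in states:
            f[i][l] /= s                          # now f[i][l] = P(π_i = l | x_1..x_i)
        log_p_x += math.log(s)
    return f, log_p_x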
Posterior Decoding
We can now calculate
P(π_i = k | x) = f_k(i) b_k(i) / P(x)
Then we can ask: what is the most likely state at position i of sequence x?
Define π̂ by Posterior Decoding:
π̂_i = argmax_k P(π_i = k | x)
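A sketch of posterior decoding built on the forward() and backward() sketches from the algorithm slides above; it assumes those functions and the casino dictionaries are in scope, with the same no-end-state assumption.

def posterior_decode(x, states, a0, a, e):
    f, p_x = forward(x, states, a0, a, e)          # f_k(i) and P(x)
    b, _ = backward(x, states, a0, a, e)           # b_k(i)
    # P(π_i = k | x) = f_k(i) * b_k(i) / P(x)
    post = [{k: f[i][k] * b[i][k] / p_x for k in states} for i in range(len(x))]
    pi_hat = [max(states, key=lambda k: post[i][k]) for i in range(len(x))]
    return post, pi_hat

post, pi_hat = posterior_decode("123456266666", states, a0, a, e)
print("".join(pi_hat))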
Posterior Decoding
For each state k, Posterior Decoding gives us a curve of the likelihood of that state at each position.
That is sometimes more informative than the Viterbi path π*.
A Modeling Example
CpG islands in DNA sequences
Example: CpG Islands
CpG dinucleotides in the genome are frequently methylated.
(Write CpG to avoid confusion with the C·G base pair.)
C → methyl-C → T
Methylation is often suppressed around genes and promoters: CpG islands
Example: CpG Islands
In CpG islands, CG is more frequent
Other pairs (AA, AG, AT, …) also have different frequencies
Question: detect CpG islands computationally
A model of CpG Islands – (1) Architecture
CpG Island states: A+, C+, G+, T+
Not CpG Island states: A-, C-, G-, T-
[Figure: the eight-state architecture, with transitions within and between the two groups of states]
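A sketch of this eight-state architecture in Python. Only the state set and the deterministic emissions (each state emits its own nucleotide) follow from the slide; every transition number below is a placeholder, since in practice the transition probabilities would be estimated from labelled island/background sequence, which is the learning problem.

nucleotides = "ACGT"
states = [n + "+" for n in nucleotides] + [n + "-" for n in nucleotides]

# each state emits its own nucleotide with probability 1
e = {s: {n: (1.0 if n == s[0] else 0.0) for n in nucleotides} for s in states}

switch = 0.001   # placeholder probability mass for moving island <-> background
a = {}
for s in states:
    a[s] = {}
    for t in states:
        same_group = (s[1] == t[1])               # both '+' or both '-'
        # placeholder: spread the within-group / switching mass uniformly
        a[s][t] = (1 - switch) / 4 if same_group else switch / 4

# every row of the transition matrix sums to 1
for s in states:
    assert abs(sum(a[s].values()) - 1) < 1e-9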