Hidden Markov Models
Andrew W. Moore
Professor, School of Computer Science, Carnegie Mellon University
www.cs.cmu.edu/~awm   awm@cs.cmu.edu   412-268-7599
Nov 29th, 2001. Copyright © 2001-2003, Andrew W. Moore.
Note to other teachers and users of these slides: Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials. Comments and corrections gratefully received.
A Markov System
Has N states, called s1, s2, ..., sN.
There are discrete timesteps, t = 0, t = 1, ...
On the t'th timestep the system is in exactly one of the available states; call it q_t. Note: q_t ∈ {s1, s2, ..., sN}.
Between each timestep, the next state is chosen randomly.
The current state determines the probability distribution for the next state.
Example (N = 3; at t = 1 the current state is q_t = q_1 = s2):
P(q_{t+1}=s1 | q_t=s1) = 0     P(q_{t+1}=s2 | q_t=s1) = 0     P(q_{t+1}=s3 | q_t=s1) = 1
P(q_{t+1}=s1 | q_t=s2) = 1/2   P(q_{t+1}=s2 | q_t=s2) = 1/2   P(q_{t+1}=s3 | q_t=s2) = 0
P(q_{t+1}=s1 | q_t=s3) = 1/3   P(q_{t+1}=s2 | q_t=s3) = 2/3   P(q_{t+1}=s3 | q_t=s3) = 0
The transition probabilities are often notated with arcs between the states.
Markov Property
q_{t+1} is conditionally independent of {q_{t-1}, q_{t-2}, ..., q_1, q_0} given q_t.
In other words: P(q_{t+1} = s_j | q_t = s_i) = P(q_{t+1} = s_j | q_t = s_i, any earlier history).
Markov Property: Representation
As a graphical model, a Markov system is a chain: q0 → q1 → q2 → q3 → q4 → ...
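The chain above can be simulated directly. Here is a minimal Python sketch using the example transition probabilities from the slides; the function and variable names are my own:

```python
import random

# Transition probabilities of the 3-state example (indices 0,1,2 = s1,s2,s3):
# row i is the distribution of the next state given the current state.
A = [
    [0.0, 0.0, 1.0],   # from s1: always go to s3
    [0.5, 0.5, 0.0],   # from s2: s1 or s2, each with prob 1/2
    [1/3, 2/3, 0.0],   # from s3: s1 with prob 1/3, s2 with prob 2/3
]

def run(start, t_final, rng):
    """Simulate q_0 .. q_{t_final}; each step samples the next state
    from the row of A belonging to the current state."""
    path = [start]
    for _ in range(t_final):
        path.append(rng.choices(range(3), weights=A[path[-1]])[0])
    return path

path = run(2, 10, random.Random(0))   # start in s3, as in the example (q0 = s3)
```

Every sampled transition has positive probability under A; only the Markov property is used (the next state depends only on the current one).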
A Blind Robot
A human (H) and a robot (R) wander around randomly on a grid.
STATE q = (Location of Robot, Location of Human)
Note: N (number of states) = 18 × 18 = 324.
Dynamics of the System
Each timestep the human moves randomly to an adjacent cell, and the robot also moves randomly to an adjacent cell. q_0 = the initial (Robot, Human) locations.
Typical questions:
"What's the expected time until the human is crushed like a bug?"
"What's the probability that the robot will hit the left wall before it hits the human?"
"What's the probability the robot crushes the human on the next time step?"
Example Question
"It's currently time t, and the human remains uncrushed. What's the probability of crushing occurring at time t + 1?"
If the robot is blind, we can compute this in advance. (We'll do this first.)
If the robot is omniscient, i.e. if the robot knows the state at time t, it can compute the probability directly. (Too easy; we won't do this.)
If the robot has some sensors, but incomplete state information, Hidden Markov Models are applicable! (The main body of the lecture.)
What is P(q_t = s)? Slow, stupid way
Step 1: Work out how to compute P(Q) for any path Q = q0 q1 q2 q3 .. q_t, given that we know the start state q0:
P(q0 q1 .. q_t) = P(q0 q1 .. q_{t-1}) P(q_t | q0 q1 .. q_{t-1})
= P(q0 q1 .. q_{t-1}) P(q_t | q_{t-1})    (Markov property)
= P(q1 | q0) P(q2 | q1) ... P(q_t | q_{t-1})
Step 2: Use this knowledge to get P(q_t = s): sum P(Q) over every path Q that ends in s.
Why not do it this way? The computation is exponential in t.
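The slow way can be written out explicitly. A brute-force sketch (function names are my own), enumerating all N^t paths from the known start state:

```python
from itertools import product

# Example transition matrix from the earlier slides (indices 0,1,2 = s1,s2,s3).
A = [
    [0.0, 0.0, 1.0],
    [0.5, 0.5, 0.0],
    [1/3, 2/3, 0.0],
]

def path_prob(path):
    """P(q0 q1 .. qt) for a known start state:
    P(q1|q0) P(q2|q1) ... P(qt|q_{t-1})."""
    p = 1.0
    for i, j in zip(path, path[1:]):
        p *= A[i][j]
    return p

def p_state_bruteforce(start, s, t):
    """P(q_t = s): sum the probabilities of all N**t paths from `start`
    that end in state s. Exponential cost in t."""
    total = 0.0
    for tail in product(range(3), repeat=t):
        path = (start,) + tail
        if path[-1] == s:
            total += path_prob(path)
    return total
```

Starting from s3 (index 2), one step leads to s1 with probability 1/3 and s2 with probability 2/3, and the distribution sums to 1 at every t, which is what the assertions below check.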
What is P(q_t = s)? Clever answer
For each state s_i, define
p_t(i) = Prob. the state is s_i at time t = P(q_t = s_i).
It is easy to define p_t(i) inductively:
p_0(i) = 1 if s_i is the start state, 0 otherwise;
p_{t+1}(j) = P(q_{t+1} = s_j) = Σ_i P(q_{t+1} = s_j | q_t = s_i) p_t(i).
(Remember, P(q_{t+1} = s_j | q_t = s_i) is a fixed transition probability of the Markov system.)
Computation is simple: just fill in this table, row by row, in order of increasing t:

t         p_t(1)   p_t(2)   ...   p_t(N)
0         0        1        ...   0        (probability 1 in the known start state)
1
:
t_final
The cost of computing p_t(i) for all states s_i is now O(t N²); the stupid way was O(N^t). This was a simple example, meant to warm you up to this trick, called Dynamic Programming, because HMMs do many tricks like this.
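The table-filling trick above can be sketched in a few lines (names are my own); each step costs O(N²), for O(t N²) total:

```python
# Example transition matrix from the earlier slides (indices 0,1,2 = s1,s2,s3).
A = [
    [0.0, 0.0, 1.0],
    [0.5, 0.5, 0.0],
    [1/3, 2/3, 0.0],
]

def p_state_dp(start, t_final):
    """Fill the p_t(i) table row by row:
    p_0(i) = 1 iff s_i is the start state;
    p_{t+1}(j) = sum_i p_t(i) * P(next = s_j | current = s_i).
    Cost O(t N^2) instead of O(N^t)."""
    p = [0.0] * 3
    p[start] = 1.0
    for _ in range(t_final):
        p = [sum(p[i] * A[i][j] for i in range(3)) for j in range(3)]
    return p
```

Starting from s3 it agrees with the one-step transition row (1/3, 2/3, 0), and each row of the table remains a probability distribution.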
Hidden State
Recall the question: "It's currently time t, and the human remains uncrushed. What's the probability of crushing occurring at time t + 1?" We have handled the blind robot (compute in advance) and dismissed the omniscient robot (too easy). Now for the interesting case: the robot has some sensors, but incomplete state information, so Hidden Markov Models are applicable.
The previous example tried to estimate P(q_t = s_i) unconditionally (using no observed evidence). Suppose we can observe something that's affected by the true state.
Example: proximity sensors, which tell us the contents of the 8 adjacent squares. W denotes "WALL".
True state: q_t. What the robot sees: observation O_t.
Noisy Hidden State
Example: noisy proximity sensors, which unreliably tell us the contents of the 8 adjacent squares. W denotes "WALL".
The true state q_t gives rise to an uncorrupted observation, but what the robot actually sees is the (possibly corrupted) observation O_t.
O_t is noisily determined, depending on the current state. Assume that O_t is conditionally independent of {q_{t-1}, q_{t-2}, ..., q_1, q_0, O_{t-1}, O_{t-2}, ..., O_1, O_0} given q_t.
In other words: P(O_t = X | q_t = s_i) = P(O_t = X | q_t = s_i, any earlier history).
Noisy Hidden State: Representation
As a graphical model: a chain of hidden states q0 → q1 → q2 → q3 → q4, with each observation O_t depending only on the corresponding state q_t (O0 from q0, O1 from q1, and so on).
Hidden Markov Models
Our robot with noisy sensors is a good example of an HMM.
Question 1: State Estimation. What is P(q_T = S_i | O1 O2 ... OT)? It will turn out that a new cute D.P. trick will get this for us.
Question 2: Most Probable Path. Given O1 O2 ... OT, what is the most probable path that I took? And what is that probability? Yet another famous D.P. trick, the VITERBI algorithm, gets this.
Question 3: Learning HMMs. Given O1 O2 ... OT, what is the maximum likelihood HMM that could have produced this string of observations? Very, very useful. Uses the E.M. algorithm.
Are HMMs Useful? You bet!
Robot planning + sensing when there's uncertainty.
Speech recognition/understanding: phones → words, signal → phones.
Consumer decision modeling.
Economics & finance.
Many others...
HMM Notation (from Rabiner's survey*)
The states are labeled S1, S2, ..., SN.
For a particular trial, let T be the number of observations. T is also the number of states passed through.
O = O1 O2 .. OT is the sequence of observations.
Q = q1 q2 .. qT is the notation for a path of states.
λ = (N, M, {π_i}, {a_ij}, {b_i(j)}) is the specification of an HMM.
*L. R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proc. of the IEEE, Vol. 77, No. 2, pp. 257-286, 1989.
HMM Formal Definition
An HMM, λ, is a 5-tuple consisting of:
N, the number of states;
M, the number of possible observations;
{π_1, π_2, ..., π_N}, the starting state probabilities, P(q_0 = S_i) = π_i (this is new: in our previous example, the start state was deterministic);
the state transition probabilities a_ij = P(q_{t+1} = S_j | q_t = S_i), an N × N matrix:
a_11 a_12 ... a_1N
a_21 a_22 ... a_2N
 :    :        :
a_N1 a_N2 ... a_NN
and the observation probabilities b_i(k) = P(O_t = k | q_t = S_i), an N × M matrix:
b_1(1) b_1(2) ... b_1(M)
b_2(1) b_2(2) ... b_2(M)
 :      :          :
b_N(1) b_N(2) ... b_N(M)
Here's an HMM
N = 3, M = 3 (observation symbols X, Y, Z)
π_1 = 1/2   π_2 = 1/2   π_3 = 0
a_11 = 0    a_12 = 1/3  a_13 = 2/3
a_21 = 1/3  a_22 = 0    a_23 = 2/3
a_31 = 1/3  a_32 = 1/3  a_33 = 1/3
b_1(X) = 1/2  b_1(Y) = 1/2  b_1(Z) = 0
b_2(X) = 0    b_2(Y) = 1/2  b_2(Z) = 1/2
b_3(X) = 1/2  b_3(Y) = 0    b_3(Z) = 1/2
Start randomly in state 1 or 2; in each state, choose one of that state's output symbols at random.
Let's generate a sequence of observations. One trial:
q0: a 50-50 choice between S1 and S2 → q0 = S1
O0: in S1, a 50-50 choice between X and Y → O0 = X
q1: go to S3 with probability 2/3 or S2 with probability 1/3 → q1 = S3
O1: in S3, a 50-50 choice between Z and X → O1 = X
q2: from S3, each of the three next states is equally likely → q2 = S3
O2: in S3, a 50-50 choice between Z and X → O2 = Z
Result: the state path was S1 S3 S3 and the observation sequence was X X Z.
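The generation procedure just traced can be written as a short sampler for the example HMM. A sketch (function and variable names are my own; the transition matrix is written row by row as a_i1, a_i2, a_i3):

```python
import random

# The example HMM: states S1,S2,S3 -> indices 0,1,2; symbols X,Y,Z.
PI = [0.5, 0.5, 0.0]                 # starting state probabilities
A = [                                # A[i][j] = P(next state j | current state i)
    [0.0, 1/3, 2/3],
    [1/3, 0.0, 2/3],
    [1/3, 1/3, 1/3],
]
B = [                                # B[i][k] = b_i(symbol k), symbols ordered X,Y,Z
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
    [0.5, 0.0, 0.5],
]
SYMBOLS = "XYZ"

def generate(T, rng):
    """Sample a state path q_0..q_{T-1} and observations O_0..O_{T-1}."""
    q = rng.choices(range(3), weights=PI)[0]
    states, obs = [], []
    for _ in range(T):
        states.append(q)
        obs.append(SYMBOLS[rng.choices(range(3), weights=B[q])[0]])
        q = rng.choices(range(3), weights=A[q])[0]
    return states, obs
```

Every sampled trial starts in S1 or S2 (since π_3 = 0), and every emitted symbol has positive probability in the state that emitted it.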
State Estimation
The observer does not see the states, only the observations:
q0 = ?   O0 = X
q1 = ?   O1 = X
q2 = ?   O2 = Z
This is what the observer has to work with.
Prob. of a series of observations
What is P(O) = P(O1 O2 O3) = P(O1 = X ∧ O2 = X ∧ O3 = Z)?
Slow, stupid way: sum P(O|Q) P(Q) over every possible path Q.
How do we compute P(Q) for an arbitrary path Q?
P(Q) = P(q1, q2, q3)
= P(q1) P(q2, q3 | q1)              (chain rule)
= P(q1) P(q2 | q1) P(q3 | q2, q1)   (chain rule again)
= P(q1) P(q2 | q1) P(q3 | q2)       (Markov property)
Example in the case Q = S1 S3 S3:
= 1/2 × 2/3 × 1/3 = 1/9
How do we compute P(O|Q) for an arbitrary path Q?
P(O|Q) = P(O1 O2 O3 | q1 q2 q3)
= P(O1 | q1) P(O2 | q2) P(O3 | q3)   (observations are conditionally independent given the states)
Example in the case Q = S1 S3 S3:
= P(X | S1) P(X | S3) P(Z | S3)
= 1/2 × 1/2 × 1/2 = 1/8
P(O) would need 27 P(Q) computations and 27 P(O|Q) computations. A sequence of 20 observations would need 3^20 ≈ 3.5 billion P(Q) computations and 3.5 billion P(O|Q) computations. So let's be smarter...
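The stupid way, written out for the example HMM (a brute-force sketch; names are my own):

```python
from itertools import product

# The example HMM (states S1,S2,S3 -> 0,1,2; symbols X,Y,Z).
PI = [0.5, 0.5, 0.0]
A = [[0.0, 1/3, 2/3], [1/3, 0.0, 2/3], [1/3, 1/3, 1/3]]
B = [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.5, 0.0, 0.5]]
SYMBOLS = "XYZ"

def prob_obs_bruteforce(obs):
    """P(O) = sum over every path Q of P(Q) * P(O|Q): N**T terms."""
    T = len(obs)
    o = [SYMBOLS.index(c) for c in obs]
    total = 0.0
    for Q in product(range(3), repeat=T):
        p = PI[Q[0]] * B[Q[0]][o[0]]          # start and first emission
        for t in range(1, T):
            p *= A[Q[t - 1]][Q[t]] * B[Q[t]][o[t]]   # transition and emission
        total += p
    return total
```

For the observed sequence X X Z this sums the 27 path terms; summing P(O) over all possible observation strings of a fixed length gives 1, a useful sanity check.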
The prob. of a given series of observations, non-exponential-cost-style
Given observations O1 O2 ... OT, define
α_t(i) = P(O1 O2 ... Ot ∧ q_t = S_i | λ),  where 1 ≤ t ≤ T.
α_t(i) = the probability that, in a random trial, we'd have seen the first t observations, and we'd have ended up in S_i as the t'th state visited.
In our example, what is α_2(3)?
α_t(i): easy to define recursively
α_t(i) = P(O1 O2 ... Ot ∧ q_t = S_i | λ)
(α_t(i) could also be computed stupidly, by summing over all paths of length t. How?)
The recursive definition:
Base case: α_1(i) = P(O1 ∧ q_1 = S_i) = π_i b_i(O1)
Inductive step: α_{t+1}(j) = P(O1 ... O_{t+1} ∧ q_{t+1} = S_j) = b_j(O_{t+1}) Σ_i a_ij α_t(i)
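The recursion above is the forward algorithm; here is a minimal sketch for the example HMM (names are my own):

```python
# The example HMM (states S1,S2,S3 -> 0,1,2; symbols X,Y,Z).
PI = [0.5, 0.5, 0.0]
A = [[0.0, 1/3, 2/3], [1/3, 0.0, 2/3], [1/3, 1/3, 1/3]]
B = [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.5, 0.0, 0.5]]
SYMBOLS = "XYZ"

def forward(obs):
    """Return [alpha_1, ..., alpha_T], where alpha_t[i] = P(O_1..O_t ^ q_t = S_i).
    Base case: alpha_1(i) = pi_i b_i(O_1).
    Step: alpha_{t+1}(j) = b_j(O_{t+1}) * sum_i alpha_t(i) a_ij.  Cost O(T N^2)."""
    k = SYMBOLS.index(obs[0])
    alpha = [PI[i] * B[i][k] for i in range(3)]
    history = [alpha]
    for c in obs[1:]:
        k = SYMBOLS.index(c)
        alpha = [B[j][k] * sum(alpha[i] * A[i][j] for i in range(3))
                 for j in range(3)]
        history.append(alpha)
    return history

alphas = forward("XXZ")   # the observed sequence from the example
```

`alphas[1][2]` answers the α_2(3) question from the slide above, and `sum(alphas[-1])` gives P(O1 O2 O3).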
In our example
We saw O1 O2 O3 = X X Z, so:
α_1 = (π_1 b_1(X), π_2 b_2(X), π_3 b_3(X)) = (1/4, 0, 0)
α_2 = (0, 0, 1/12)   (in particular, α_2(3) = 1/12)
α_3 = (0, 1/72, 1/72)
Easy Question
We can cheaply compute α_t(i) = P(O1 O2 ... Ot ∧ q_t = S_i).
(How) can we cheaply compute P(O1 O2 ... Ot)? Sum out the final state: P(O1 ... Ot) = Σ_i α_t(i).
(How) can we cheaply compute P(q_t = S_i | O1 O2 ... Ot)? Divide: α_t(i) / Σ_j α_t(j).
Most probable path given observations
Given O1 O2 ... OT, what path Q = q1 q2 ... qT maximizes P(Q | O1 O2 ... OT)? The slow, stupid answer is to check every path. Note that argmax_Q P(Q | O1 ... OT) = argmax_Q P(Q ∧ O1 ... OT), since the denominator P(O1 ... OT) does not depend on Q.
Efficient MPP computation
We're going to compute the following variables:
δ_t(i) = max over q1 q2 .. q_{t-1} of P(q1 q2 .. q_{t-1} ∧ q_t = S_i ∧ O1 .. Ot)
= the probability of the path of length t-1 with the maximum chance of doing all these things:
...occurring, and
...ending up in state S_i, and
...producing output O1 ... Ot.
Define mpp_t(i) = that path. So δ_t(i) = Prob(mpp_t(i)).
The Viterbi Algorithm
Now, suppose we have all the δ_t(i)'s and mpp_t(i)'s for all i: for each state S_i we know the most probable path ending there at time t, mpp_t(i), and its probability δ_t(i). How do we get δ_{t+1}(j) and mpp_{t+1}(j)?
The most probable path with last two states S_i S_j is the most probable path to S_i, followed by the transition S_i → S_j.
What is the probability of that path?
δ_t(i) × P(S_i → S_j ∧ O_{t+1} | λ) = δ_t(i) a_ij b_j(O_{t+1})
So the most probable path to S_j has S_i* as its penultimate state, where
i* = argmax_i δ_t(i) a_ij b_j(O_{t+1})
Summary, with i* defined as above:
δ_{t+1}(j) = δ_t(i*) a_{i*j} b_j(O_{t+1})
mpp_{t+1}(j) = mpp_t(i*) followed by S_j
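The δ/mpp recursion can be sketched directly for the example HMM (names are my own; note b_j(O_{t+1}) does not depend on i, so it can be dropped from the argmax):

```python
# The example HMM (states S1,S2,S3 -> 0,1,2; symbols X,Y,Z).
PI = [0.5, 0.5, 0.0]
A = [[0.0, 1/3, 2/3], [1/3, 0.0, 2/3], [1/3, 1/3, 1/3]]
B = [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.5, 0.0, 0.5]]
SYMBOLS = "XYZ"

def viterbi(obs):
    """Return (mpp, prob): the most probable state path given obs and its
    probability, via delta_{t+1}(j) = max_i delta_t(i) a_ij b_j(O_{t+1})."""
    k = SYMBOLS.index(obs[0])
    delta = [PI[i] * B[i][k] for i in range(3)]
    paths = [[i] for i in range(3)]           # paths[i] = mpp_t(i)
    for c in obs[1:]:
        k = SYMBOLS.index(c)
        new_delta, new_paths = [], []
        for j in range(3):
            # penultimate state i*: b_j is a common factor, so argmax over delta*a
            i_star = max(range(3), key=lambda i: delta[i] * A[i][j])
            new_delta.append(delta[i_star] * A[i_star][j] * B[j][k])
            new_paths.append(paths[i_star] + [j])
        delta, paths = new_delta, new_paths
    best = max(range(3), key=lambda j: delta[j])
    return paths[best], delta[best]

mpp, prob = viterbi("XXZ")
```

For X X Z the most probable path starts S1 S3 (with two equally probable final states, each giving path probability 1/72).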
What's Viterbi used for?
Classic example, speech recognition: signal → words.
The HMM's observable is the signal; the hidden state is part of word formation. What is the most probable word given this signal?
UTTERLY GROSS SIMPLIFICATION: in practice there are many levels of inference, not one big jump.
HMMs are used and useful
But how do you design an HMM? Occasionally (e.g. in our robot example) it is reasonable to deduce the HMM from first principles. But usually, especially in speech or genetics, it is better to infer it from large amounts of data: a training sequence O1 O2 .. OT with a big "T" (unlike the short observation sequences used earlier in the lecture).
Inferring an HMM
Remember, we've been doing things like P(O1 O2 .. OT | λ). That "λ" is the notation for our HMM parameters. Now we have some observations and we want to estimate λ from them.
As usual, we could use:
(i) MAX LIKELIHOOD: λ = argmax_λ P(O1 .. OT | λ)
(ii) BAYES: work out P(λ | O1 .. OT), and then take E[λ] or argmax_λ P(λ | O1 .. OT)
Max likelihood HMM estimation
Define:
γ_t(i) = P(q_t = S_i | O1 O2 ... OT, λ)
ε_t(i,j) = P(q_t = S_i ∧ q_{t+1} = S_j | O1 O2 ... OT, λ)
γ_t(i) and ε_t(i,j) can be computed efficiently for all i, j, t (details in the Rabiner paper).
Σ_{t=1}^{T-1} γ_t(i) = expected number of transitions out of state i during the path.
Σ_{t=1}^{T-1} ε_t(i,j) = expected number of transitions from state i to state j during the path.
HMM estimation
The max likelihood re-estimates are ratios of these expected counts:
π_i^new = γ_1(i)
a_ij^new = (Σ_{t=1}^{T-1} ε_t(i,j)) / (Σ_{t=1}^{T-1} γ_t(i))   (expected i → j transitions over expected transitions out of i)
b_i(k)^new = (Σ_{t : Ot = k} γ_t(i)) / (Σ_{t=1}^{T} γ_t(i))   (expected visits to i that emit k, over expected visits to i)
EM for HMMs
If we knew λ, we could estimate EXPECTATIONS of quantities such as: the expected number of times in state i; the expected number of transitions i → j.
If we knew the quantities such as the expected number of times in state i and the expected number of transitions i → j, we could compute the MAX LIKELIHOOD estimate of λ = ({a_ij}, {b_i(j)}, {π_i}).
Roll on the EM Algorithm...
EM 4 HMMs
1. Get your observations O1 ... OT.
2. Guess your first λ estimate λ(0); set k = 0.
3. Given O1 ... OT and λ(k), compute γ_t(i) and ε_t(i,j) for 1 ≤ t ≤ T, 1 ≤ i ≤ N, 1 ≤ j ≤ N.
4. Compute the expected frequency of state i, and the expected frequency of i → j.
5. Compute new estimates of a_ij, b_j(k), π_i accordingly. Call them λ(k+1).
6. Set k = k + 1; goto 3, unless converged.
Also known (for the HMM case) as the BAUM-WELCH algorithm.
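A compact sketch of one Baum-Welch iteration for a single observation sequence, using the standard forward-backward formulas for γ and ε from the Rabiner paper; all function and variable names here are my own:

```python
SYMBOLS = "XYZ"
N, M = 3, 3

def forward_backward(obs, pi, A, B):
    """Compute the forward (alpha) and backward (beta) tables."""
    o = [SYMBOLS.index(c) for c in obs]
    T = len(o)
    alpha = [[pi[i] * B[i][o[0]] for i in range(N)]]
    for t in range(1, T):
        alpha.append([B[j][o[t]] * sum(alpha[-1][i] * A[i][j] for i in range(N))
                      for j in range(N)])
    beta = [[1.0] * N]                       # beta_{T-1}(i) = 1
    for t in range(T - 2, -1, -1):           # beta_t(i) from beta_{t+1}(j)
        beta.insert(0, [sum(A[i][j] * B[j][o[t + 1]] * beta[0][j] for j in range(N))
                        for i in range(N)])
    return alpha, beta, o

def baum_welch_step(obs, pi, A, B):
    """One EM iteration: compute gamma/epsilon, re-estimate pi, A, B.
    Also returns P(O | old params), which EM never decreases."""
    alpha, beta, o = forward_backward(obs, pi, A, B)
    T = len(o)
    pO = sum(alpha[-1])
    gamma = [[alpha[t][i] * beta[t][i] / pO for i in range(N)] for t in range(T)]
    eps = [[[alpha[t][i] * A[i][j] * B[j][o[t + 1]] * beta[t + 1][j] / pO
             for j in range(N)] for i in range(N)] for t in range(T - 1)]
    new_pi = gamma[0][:]
    new_A = [[sum(eps[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1))
              for j in range(N)] for i in range(N)]
    new_B = [[sum(gamma[t][i] for t in range(T) if o[t] == k) /
              sum(gamma[t][i] for t in range(T))
              for k in range(M)] for i in range(N)]
    return new_pi, new_A, new_B, pO
```

A real implementation would handle multiple training sequences, zero-count states, and numerical underflow (scaling or log-space), all discussed in Rabiner; this sketch assumes strictly positive initial parameters.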
Bad News, Good News, Notice
Bad news: there are lots of local minima.
Good news: the local minima are usually adequate models of the data.
Notice: EM does not estimate the number of states; that must be given. There is a trade-off between too few states (inadequately modeling the structure in the data) and too many (fitting the noise). Thus #states is a regularization parameter. Blah blah blah... bias variance tradeoff... blah blah... cross-validation... blah blah... AIC, BIC... blah blah (same ol' same ol').
Often, HMMs are forced to have some links with zero probability. This is done by setting a_ij = 0 in the initial estimate λ(0).
Easy extension of everything seen today: HMMs with real-valued outputs.
What You Should Know
What an HMM is.
Computing (and defining) α_t(i).
The Viterbi algorithm.
Outline of the EM algorithm.
To be very happy with the kind of maths and analysis needed for HMMs.
Fairly thorough reading of Rabiner* up to page 266 [up to but not including "IV. Types of HMMs"]. DON'T PANIC: it starts on p. 257.
*L. R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proc. of the IEEE, Vol. 77, No. 2, pp. 257-286, 1989.