Slide 1: EM in Hidden Markov Models (Tutorial 7)
© Ydo Wexler & Dan Geiger, revised by Sivan Yogev
Slide 2: Learning the parameters (EM algorithm)
A common algorithm for learning the parameters from unlabeled sequences is Expectation-Maximization (EM). We will devote several classes to it. In the current context it reads as follows:
Start with some probability tables (many possible choices), then iterate until convergence:
- E-step: compute p(s_i, s_{i-1}, x_1,…,x_L) for every i, using the current probability tables ("current parameters"). Comment: if each s_i has k possible values, there are k·k such expressions per position. (A generic sketch of the loop appears below.)
- M-step: use the expected counts found in the E-step to update the local probability tables.
We focus today on the E-step.
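As a rough illustration of the loop's shape, here is a minimal generic sketch in Python. The names `em`, `e_step`, `m_step`, and `params` are placeholders of mine, not part of the tutorial; the concrete HMM versions appear in the sketches further below.

```python
def em(x, params, e_step, m_step, tol=1e-9, max_iters=1000):
    """Generic EM loop: alternate expected counts (E) and re-estimation (M).

    x: observed sequence; params: dict of current parameter values;
    e_step(x, params) -> expected counts; m_step(counts) -> new params dict.
    """
    for _ in range(max_iters):
        counts = e_step(x, params)   # E-step under the current parameters
        new_params = m_step(counts)  # M-step: re-estimate from expected counts
        if all(abs(new_params[k] - params[k]) < tol for k in params):
            return new_params        # parameters stopped changing
        params = new_params
    return params
```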
Slide 3: Example I: Homogeneous HMM, one sample
Start with some probability tables (say λ = μ = ½) and iterate until convergence:
- E-step: compute p_{λ,μ}(s_i, s_{i-1}, x_1,…,x_L) using the forward-backward algorithm, as will soon be explained.
- M-step: update the parameters simultaneously:
  λ ← Σ_i p_{λ,μ}(s_i=1, s_{i-1}=0, x_1,…,x_L) / Σ_i p_{λ,μ}(s_{i-1}=0, x_1,…,x_L)
  μ ← Σ_i p_{λ,μ}(s_i=0, s_{i-1}=1, x_1,…,x_L) / Σ_i p_{λ,μ}(s_{i-1}=1, x_1,…,x_L)
[Figure: HMM chain s_1 → s_2 → … → s_L; each hidden state s_i emits the observation x_i.]
Slide 4: Decomposing the computation (from the previous tutorial)
P(x_1,…,x_L, s_i) = P(x_1,…,x_i, s_i) · P(x_{i+1},…,x_L | x_1,…,x_i, s_i)
                  = P(x_1,…,x_i, s_i) · P(x_{i+1},…,x_L | s_i)
                  = f(s_i) · b(s_i)
(The second equality holds because, given s_i, the observations x_{i+1},…,x_L are independent of x_1,…,x_i.)
Answer: P(s_i | x_1,…,x_L) = (1/K) · P(x_1,…,x_L, s_i), where K = Σ_{s_i} P(x_1,…,x_L, s_i).
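The two factors f(s_i) and b(s_i) are exactly what the forward and backward recursions compute. A minimal dictionary-based sketch (the function and argument names are mine; lists are 0-based, so `f[i]` corresponds to the slides' f(s_{i+1})):

```python
def forward(x, start, trans, emit, states):
    # f[i][s] is the slides' f(s_{i+1}) = P(x_1..x_{i+1}, s_{i+1} = s)
    f = [{s: start[s] * emit[(s, x[0])] for s in states}]
    for i in range(1, len(x)):
        f.append({s: emit[(s, x[i])] *
                     sum(f[i - 1][t] * trans[(t, s)] for t in states)
                  for s in states})
    return f

def backward(x, trans, emit, states):
    # b[i][s] is the slides' b(s_{i+1}) = P(x_{i+2}..x_L | s_{i+1} = s),
    # with b = 1 at the last position
    L = len(x)
    b = [None] * L
    b[L - 1] = {s: 1.0 for s in states}
    for i in range(L - 2, -1, -1):
        b[i] = {s: sum(trans[(s, t)] * emit[(t, x[i + 1])] * b[i + 1][t]
                       for t in states)
                for s in states}
    return b
```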
Slide 5: The E-step
We already know how to do this computation:
P(x_1,…,x_L, s_i) = P(x_1,…,x_i, s_i) · P(x_{i+1},…,x_L | s_i) = f(s_i) · b(s_i)
Now we wish to compute (for the E-step):
p(x_1,…,x_L, s_i, s_{i+1}) = p(x_1,…,x_i, s_i) · p(s_{i+1}|s_i) · p(x_{i+1}|s_{i+1}) · p(x_{i+2},…,x_L | s_{i+1})
                           = f(s_i) · p(s_{i+1}|s_i) · p(x_{i+1}|s_{i+1}) · b(s_{i+1})
Special case (i = L-1):
p(x_1,…,x_L, s_{L-1}, s_L) = p(x_1,…,x_{L-1}, s_{L-1}) · p(s_L|s_{L-1}) · p(x_L|s_L)
                           = f(s_{L-1}) · p(s_L|s_{L-1}) · p(x_L|s_L)    {define b(s_L) ≡ 1}
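Continuing the sketch above, this pairwise joint is a single product of four already-computed factors; `pair_joint` is my name for it, not the tutorial's:

```python
def pair_joint(f, b, trans, emit, x, i, s, t):
    # P(x_1..x_L, s_{i+1} = s, s_{i+2} = t) at 0-based list index i:
    # the slides' f(s_i) * p(s_{i+1}|s_i) * p(x_{i+1}|s_{i+1}) * b(s_{i+1})
    return f[i][s] * trans[(s, t)] * emit[(t, x[i + 1])] * b[i + 1][t]
```

Because `backward` sets b = 1 at the last position, the special case i = L-1 falls out of the same formula with no extra code.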
Slide 6: Coin-Tossing Example
An HMM for L tosses: hidden states Fair/Loaded, observations Head/Tail.
Start: p(s_1 = Fair) = p(s_1 = Loaded) = 1/2.
Transitions: stay in the same state with probability 0.9, switch with probability 0.1.
Emissions: Fair coin: p(head) = p(tail) = 1/2; Loaded coin: p(head) = 3/4, p(tail) = 1/4.
[Figure: two-state diagram Fair ↔ Loaded with the transition and emission probabilities above.]
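A minimal encoding of this model, for use with the sketches above (the constant names are mine):

```python
STATES = ("fair", "loaded")
START = {"fair": 0.5, "loaded": 0.5}                      # p(s_1)
TRANS = {("fair", "fair"): 0.9, ("fair", "loaded"): 0.1,  # p(s_i | s_{i-1})
         ("loaded", "loaded"): 0.9, ("loaded", "fair"): 0.1}
EMIT = {("fair", "head"): 0.5, ("fair", "tail"): 0.5,     # p(x_i | s_i)
        ("loaded", "head"): 0.75, ("loaded", "tail"): 0.25}
```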
Slide 7: Example II: Homogeneous HMM, one sample
Start with some probability tables and iterate until convergence:
- E-step: compute p_θ(s_i, s_{i-1}, x_1,…,x_L) using the forward-backward algorithm, as explained earlier.
- M-step: update the parameter:
  θ ← Σ_i [p_θ(s_i=1, s_{i-1}=1, x_1,…,x_L) + p_θ(s_i=0, s_{i-1}=0, x_1,…,x_L)] / Σ_i [p_θ(s_{i-1}=1, x_1,…,x_L) + p_θ(s_{i-1}=0, x_1,…,x_L)]
  (this expression will be simplified later; a code sketch of the update follows below)
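Using `pair_joint` from the earlier sketch, the whole update is two sums over positions and state pairs; `update_theta` is a hypothetical helper name of mine:

```python
def update_theta(f, b, trans, emit, x, states):
    # Numerator: expected mass of "stay" transitions (s_i = s_{i-1});
    # denominator: the same sum over all state pairs, position by position.
    L = len(x)
    num = sum(pair_joint(f, b, trans, emit, x, i, s, s)
              for i in range(L - 1) for s in states)
    den = sum(pair_joint(f, b, trans, emit, x, i, s, t)
              for i in range(L - 1) for s in states for t in states)
    return num / den
```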
Slide 8: Coin-Tossing Example
Numeric example: 3 tosses. Outcomes: head, head, tail.
Slide 9: Coin-Tossing Example
Numeric example: 3 tosses, outcomes: head, head, tail.
Recall the recursions:
f(s_i) = P(x_1,…,x_i, s_i) = Σ_{s_{i-1}} P(x_1,…,x_{i-1}, s_{i-1}) · P(s_i | s_{i-1}) · P(x_i | s_i)
b(s_i) = P(x_{i+1},…,x_L | s_i) = Σ_{s_{i+1}} P(s_{i+1} | s_i) · P(x_{i+1} | s_{i+1}) · b(s_{i+1})
Last time we calculated:

forward      s_1      s_2      s_3
loaded       0.375    0.2719   0.0645
fair         0.25     0.1313   0.0727

backward     s_1      s_2      s_3
loaded       0.2094   0.275    1
fair         0.2344   0.475    1
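Running the sketches above on this example reproduces the tables (and the values used on the next slides):

```python
x = ("head", "head", "tail")
f = forward(x, START, TRANS, EMIT, STATES)
b = backward(x, TRANS, EMIT, STATES)
print(f[0])  # {'fair': 0.25, 'loaded': 0.375}
print(f[1])  # {'fair': 0.13125, 'loaded': 0.271875}
print(f[2])  # ≈ {'fair': 0.0727, 'loaded': 0.0645}
print(b[0])  # ≈ {'fair': 0.2344, 'loaded': 0.2094}
print(b[1])  # {'fair': 0.475, 'loaded': 0.275}
print(b[2])  # {'fair': 1.0, 'loaded': 1.0}
```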
Slide 10: Coin-Tossing Example
Outcomes: head, head, tail.
f(s_1=loaded) = 0.375, f(s_1=fair) = 0.25
b(s_2=loaded) = 0.275, b(s_2=fair) = 0.475
p(x_1,x_2,x_3, s_1, s_2) = f(s_1) · p(s_2|s_1) · p(x_2|s_2) · b(s_2)
p(x_1,x_2,x_3, s_1=loaded, s_2=loaded) = 0.375 · 0.9 · 0.75 · 0.275 = 0.0696
p(x_1,x_2,x_3, s_1=loaded, s_2=fair)   = 0.375 · 0.1 · 0.5 · 0.475  = 0.0089
p(x_1,x_2,x_3, s_1=fair, s_2=loaded)   = 0.25 · 0.1 · 0.75 · 0.275  = 0.0052
p(x_1,x_2,x_3, s_1=fair, s_2=fair)     = 0.25 · 0.9 · 0.5 · 0.475   = 0.0534
Slide 11: Coin-Tossing Example
Outcomes: head, head, tail.
f(s_2=loaded) = 0.271875, f(s_2=fair) = 0.13125
b(s_3=loaded) = 1, b(s_3=fair) = 1
p(x_1,x_2,x_3, s_2, s_3) = f(s_2) · p(s_3|s_2) · p(x_3|s_3) · b(s_3)
p(x_1,x_2,x_3, s_2=loaded, s_3=loaded) = 0.271875 · 0.9 · 0.25 · 1 = 0.0612
p(x_1,x_2,x_3, s_2=loaded, s_3=fair)   = 0.271875 · 0.1 · 0.5 · 1  = 0.0136
p(x_1,x_2,x_3, s_2=fair, s_3=loaded)   = 0.13125 · 0.1 · 0.25 · 1  = 0.0033
p(x_1,x_2,x_3, s_2=fair, s_3=fair)     = 0.13125 · 0.9 · 0.5 · 1   = 0.0591
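The same numbers, via `pair_joint` from the E-step sketch (list index 0 is the pair (s_1, s_2), index 1 is (s_2, s_3)):

```python
for i in (0, 1):
    for s in STATES:
        for t in STATES:
            p = pair_joint(f, b, TRANS, EMIT, x, i, s, t)
            print(f"i={i + 1}: {s}->{t}: {p:.4f}")
# i=1: fair->fair 0.0534, fair->loaded 0.0052, loaded->fair 0.0089, loaded->loaded 0.0696
# i=2: fair->fair 0.0591, fair->loaded 0.0033, loaded->fair 0.0136, loaded->loaded 0.0612
```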
Slide 12: M-step
M-step: update the parameters simultaneously (in this case we only have one parameter, θ):
θ ← Σ_i [p_θ(s_i=1, s_{i-1}=1, x_1,…,x_L) + p_θ(s_i=0, s_{i-1}=0, x_1,…,x_L)] / Σ_i [p_θ(s_{i-1}=1, x_1,…,x_L) + p_θ(s_{i-1}=0, x_1,…,x_L)]
The denominator:
Σ_i [p_θ(s_{i-1}=1, x_1,…,x_L) + p_θ(s_{i-1}=0, x_1,…,x_L)] = Σ_i p_θ(x_1,…,x_L) = (L-1) · p_θ(x_1,…,x_L)
In the previous tutorial we saw that p_θ(x_1,…,x_L) = Σ_{s_L} f(s_L).
Slide 13: M-step (cont.)
M-step, in our example:
θ ← [p(x_1,x_2,x_3, s_1=l, s_2=l) + p(x_1,x_2,x_3, s_1=f, s_2=f) + p(x_1,x_2,x_3, s_2=l, s_3=l) + p(x_1,x_2,x_3, s_2=f, s_3=f)] / [2 · (f(s_3=l) + f(s_3=f))]
  = [0.0696 + 0.0534 + 0.0612 + 0.0591] / [2 · (0.0645 + 0.0727)]
  = 0.2433 / 0.2743 ≈ 0.887
Iterating the E- and M-steps, θ converges to 0.4.
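Putting the sketches together, one can iterate the two steps and watch θ move from 0.9 toward 0.4. The function name `em_theta` and the fixed iteration cap are mine; a production version would test for convergence instead.

```python
def em_theta(x, theta, iters=200):
    # Re-estimate only the stay probability theta; the emissions and the
    # start distribution are held fixed, as in the slides.
    for _ in range(iters):
        trans = {(s, t): theta if s == t else 1.0 - theta
                 for s in STATES for t in STATES}
        f = forward(x, START, trans, EMIT, STATES)
        b = backward(x, trans, EMIT, STATES)
        theta = update_theta(f, b, trans, EMIT, x, STATES)
    return theta

print(em_theta(("head", "head", "tail"), 0.9))  # first step ≈ 0.887, then -> 0.4
```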