
1  Hidden Markov Models and the Forward-Backward Algorithm
600.465 - Intro to NLP - J. Eisner

2  Please See the Spreadsheet
- I like to teach this material using an interactive spreadsheet:
- http://cs.jhu.edu/~jason/papers/#tnlp02 has the spreadsheet and the lesson plan.
- I'll also show the following slides at appropriate points.

3  Marginalization

    SALES       Jan   Feb   Mar   Apr   …
    Widgets       5     0     3     2   …
    Grommets      7     3    10     8   …
    Gadgets       0     0     1     0   …
    …             …     …     …     …   …

4  Marginalization
Write the totals in the margins.

    SALES       Jan   Feb   Mar   Apr   …   TOTAL
    Widgets       5     0     3     2   …      30
    Grommets      7     3    10     8   …      80
    Gadgets       0     0     1     0   …       2
    …             …     …     …     …   …       …
    TOTAL        99    25   126    90   …    1000   <- grand total

5  Marginalization
Given a random sale, what & when was it?

    prob        Jan    Feb    Mar    Apr   …   TOTAL
    Widgets    .005      0   .003   .002   …    .030
    Grommets   .007   .003   .010   .008   …    .080
    Gadgets       0      0   .001      0   …    .002
    …             …      …      …      …   …       …
    TOTAL      .099   .025   .126   .090   …   1.000   <- grand total

6  Marginalization
Given a random sale, what & when was it? (Same probability table as the previous slide.)
- joint prob: p(Jan, widget) = .005 (a cell of the table)
- marginal prob: p(Jan) = .099 (a column total)
- marginal prob: p(widget) = .030 (a row total)
- marginal prob: p(anything in table) = 1.000 (the grand total)

7  Conditionalization
Given a random sale in Jan., what was it? (Same probability table as the previous slide.)
- marginal prob: p(Jan) = .099
- joint prob: p(Jan, widget) = .005
- conditional prob: p(widget | Jan) = .005/.099
- Divide the Jan column through by Z = .099 so that it sums to 1:
    p(… | Jan):  widgets .005/.099,  grommets .007/.099,  gadgets 0,  …,  TOTAL .099/.099 = 1
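
To make the two operations concrete, here is a small Python sketch (my own illustration, not part of the slides or the course spreadsheet). Only the twelve table cells shown above are included, so the column totals come out smaller than the .099 and .030 of the full table.

    joint = {  # p(product, month)
        ("Widgets",  "Jan"): 0.005, ("Widgets",  "Feb"): 0.000,
        ("Widgets",  "Mar"): 0.003, ("Widgets",  "Apr"): 0.002,
        ("Grommets", "Jan"): 0.007, ("Grommets", "Feb"): 0.003,
        ("Grommets", "Mar"): 0.010, ("Grommets", "Apr"): 0.008,
        ("Gadgets",  "Jan"): 0.000, ("Gadgets",  "Feb"): 0.000,
        ("Gadgets",  "Mar"): 0.001, ("Gadgets",  "Apr"): 0.000,
    }

    def p_month(month):
        # Marginal p(month): write the column total in the margin.
        return sum(p for (prod, m), p in joint.items() if m == month)

    def p_product_given_month(product, month):
        # Conditional p(product | month): divide the column through by Z = p(month).
        return joint[(product, month)] / p_month(month)

    print(p_month("Jan"))                            # .012 here; .099 in the full table
    print(p_product_given_month("Widgets", "Jan"))   # .005 / p(Jan)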

8  Marginalization & conditionalization in the weather example
- Instead of a 2-dimensional table, now we have a 66-dimensional table:
    - 33 of the dimensions have 2 choices: {C, H}
    - 33 of the dimensions have 3 choices: {1, 2, 3}
- Cross-section showing just 3 of the dimensions:

                     Weather_2 = C    Weather_2 = H
    IceCream_2 = 1       0.000…           0.000…
    IceCream_2 = 2       0.000…           0.000…
    IceCream_2 = 3       0.000…           0.000…

9  Interesting probabilities in the weather example
- Prior probability of weather: p(Weather = CHH…)
- Posterior probability of weather (after observing evidence): p(Weather = CHH… | IceCream = 233…)
- Posterior marginal probability that day 3 is hot:
    p(Weather_3 = H | IceCream = 233…) = Σ over w such that w_3 = H of p(Weather = w | IceCream = 233…)
- Posterior conditional probability that day 3 is hot if day 2 is:
    p(Weather_3 = H | Weather_2 = H, IceCream = 233…)
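
As a concrete (but brute-force) illustration of the posterior marginal, the Python sketch below enumerates every weather sequence for a 3-day toy version of the example. It is my own illustration, using the transition and emission numbers that appear on the next slide's trellis and ignoring the Stop probability; with 33 days there would be 2^33 sequences, which is why the dynamic programming on the following slides matters.

    from itertools import product

    p_trans = {("Start", "C"): 0.5, ("Start", "H"): 0.5,
               ("C", "C"): 0.8, ("C", "H"): 0.1,
               ("H", "H"): 0.8, ("H", "C"): 0.1}
    p_emit = {("C", 2): 0.2, ("C", 3): 0.1,
              ("H", 2): 0.2, ("H", 3): 0.7}

    ice = [2, 3, 3]

    def joint_prob(weather):
        # p(weather sequence & ice cream sequence) = product of the trellis edge weights
        p, prev = 1.0, "Start"
        for w, cones in zip(weather, ice):
            p *= p_trans[(prev, w)] * p_emit[(w, cones)]
            prev = w
        return p

    total = sum(joint_prob(w) for w in product("CH", repeat=3))
    day3_hot = sum(joint_prob(w) for w in product("CH", repeat=3) if w[2] == "H")
    print(day3_hot / total)   # posterior marginal probability that day 3 was hot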

10  The HMM trellis
The dynamic programming computation of α works forward from Start.

    Columns: Start;  Day 1: 2 cones;  Day 2: 3 cones;  Day 3: 3 cones.  Each day has states C and H.
    Arc weights into Day 1:
        p(C|Start)·p(2|C) = 0.5·0.2 = 0.1        p(H|Start)·p(2|H) = 0.5·0.2 = 0.1
    Arc weights into Days 2 and 3 (3 cones each):
        p(C|C)·p(3|C) = 0.8·0.1 = 0.08           p(H|C)·p(3|H) = 0.1·0.7 = 0.07
        p(H|H)·p(3|H) = 0.8·0.7 = 0.56           p(C|H)·p(3|C) = 0.1·0.1 = 0.01
    α values:
        α(Start) = 1
        Day 1:  α(C) = 0.1                                   α(H) = 0.1
        Day 2:  α(C) = 0.1·0.08 + 0.1·0.01 = 0.009           α(H) = 0.1·0.07 + 0.1·0.56 = 0.063
        Day 3:  α(C) = 0.009·0.08 + 0.063·0.01 = 0.00135     α(H) = 0.009·0.07 + 0.063·0.56 = 0.03591

- This "trellis" graph has 2^33 paths. These represent all possible weather sequences that could explain the observed ice cream sequence 2, 3, 3, …
- What is the product of all the edge weights on one path H, H, H, …? The edge weights are chosen so that the product is p(weather = H,H,H,… & icecream = 2,3,3,…).
- What is the α probability at each state? It's the total probability of all paths from Start to that state.
- How can we compute it fast when there are many paths?

11  Computing α Values
All paths to a state: the paths a, b, c reach one predecessor state (so its α is α_1 = a + b + c) and continue along an arc of weight p_1; the paths d, e, f reach the other predecessor (α_2 = d + e + f) and continue along an arc of weight p_2.

    α = (a·p_1 + b·p_1 + c·p_1) + (d·p_2 + e·p_2 + f·p_2)
      = α_1·p_1 + α_2·p_2

Thanks, distributive law!
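
A minimal Python sketch (mine, not the course spreadsheet) of this forward computation, reproducing the α values on the trellis slide:

    p_trans = {("Start", "C"): 0.5, ("Start", "H"): 0.5,
               ("C", "C"): 0.8, ("C", "H"): 0.1,
               ("H", "H"): 0.8, ("H", "C"): 0.1}
    p_emit = {("C", 2): 0.2, ("C", 3): 0.1,
              ("H", 2): 0.2, ("H", 3): 0.7}

    def forward(ice):
        # alpha(state) = sum over predecessors of alpha(prev) * p(state|prev) * p(cones|state)
        alpha = [{"Start": 1.0}]                 # alpha(Start) = 1
        for cones in ice:
            prev_row = alpha[-1]
            alpha.append({s: sum(prev_row[prev] * p_trans[(prev, s)] * p_emit[(s, cones)]
                                 for prev in prev_row)
                          for s in ("C", "H")})
        return alpha

    for day, row in enumerate(forward([2, 3, 3])):
        print(day, row)
    # day 1: C 0.1, H 0.1;  day 2: C 0.009, H 0.063;  day 3: C 0.00135, H 0.03591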

12  The HMM trellis
The dynamic programming computation of β works back from Stop.

    Columns at the end of the sequence: …;  Day 32: 2 cones;  Day 33: 2 cones;  Day 34: lose diary (Stop).
    Arc weights between the 2-cone days:
        p(C|C)·p(2|C) = 0.8·0.2 = 0.16           p(H|C)·p(2|H) = 0.1·0.2 = 0.02
        p(H|H)·p(2|H) = 0.8·0.2 = 0.16           p(C|H)·p(2|C) = 0.1·0.2 = 0.02
    Arcs into Stop:  p(Stop|C) = 0.1             p(Stop|H) = 0.1
    β values:
        Day 33 (last day of data):  β(C) = β(H) = 0.1
        Day 32:  β(C) = β(H) = 0.16·0.1 + 0.02·0.1 = 0.018
        One day earlier:  β(C) = β(H) = 0.16·0.018 + 0.02·0.018 = 0.00324

- What is the β probability at each state? It's the total probability of all paths from that state to Stop.
- How can we compute it fast when there are many paths?

13  Computing β Values
All paths from a state: an arc of weight p_1 leads to one successor state, from which the paths u, v, w continue on to Stop (so that successor's β is β_1 = u + v + w); an arc of weight p_2 leads to the other successor, from which the paths x, y, z continue (β_2 = x + y + z).

    β = (p_1·u + p_1·v + p_1·w) + (p_2·x + p_2·y + p_2·z)
      = p_1·β_1 + p_2·β_2
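
A companion Python sketch (again mine) of the backward computation, run on three 2-cone days standing in for the end of the diary; it reproduces the 0.1, 0.018, and 0.00324 values above:

    p_trans = {("C", "C"): 0.8, ("C", "H"): 0.1, ("C", "Stop"): 0.1,
               ("H", "H"): 0.8, ("H", "C"): 0.1, ("H", "Stop"): 0.1}
    p_emit = {("C", 2): 0.2, ("H", 2): 0.2}

    def backward(ice):
        # beta(state) = sum over successors of p(next|state) * p(next cones|next) * beta(next),
        # with beta = p(Stop|state) = 0.1 on the last observed day.
        beta = [{s: p_trans[(s, "Stop")] for s in ("C", "H")}]   # last day: beta = 0.1
        for cones in reversed(ice[1:]):   # the next day's emission rides on the arc into it
            nxt = beta[0]
            beta.insert(0, {s: sum(p_trans[(s, n)] * p_emit[(n, cones)] * nxt[n]
                                   for n in ("C", "H"))
                            for s in ("C", "H")})
        return beta

    print(backward([2, 2, 2]))
    # [{'C': 0.00324, 'H': 0.00324}, {'C': 0.018, 'H': 0.018}, {'C': 0.1, 'H': 0.1}]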

14  Computing State Probabilities
All paths through state C: the paths a, b, c arrive at C, and the paths x, y, z leave it.

    a·x + a·y + a·z + b·x + b·y + b·z + c·x + c·y + c·z
      = (a + b + c)·(x + y + z)
      = α(C)·β(C)

Thanks, distributive law!

15  Computing Arc Probabilities
All paths through the arc of weight p from H to C: the paths a, b, c arrive at H, the arc p crosses to C, and the paths x, y, z leave C.

    a·p·x + a·p·y + a·p·z + b·p·x + b·p·y + b·p·z + c·p·x + c·p·y + c·p·z
      = (a + b + c)·p·(x + y + z)
      = α(H)·p·β(C)

Thanks, distributive law!
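
These two identities are what forward-backward actually uses. The sketch below is my own code, assuming the forward and backward sketches above were run on the same ice-cream sequence with one consistent parameter table that includes the Stop arcs; it turns the identities into posterior state and arc probabilities, where Z is the total probability of the observed sequence.

    def posteriors(ice, alpha, beta, p_trans, p_emit):
        # alpha from forward() has an extra Start row; drop it so days line up with beta.
        alpha = alpha[1:]
        # Z = total probability of the evidence = sum over last-day states of alpha * beta.
        Z = sum(alpha[-1][s] * beta[-1][s] for s in ("C", "H"))
        # p(state on day t | evidence) = alpha(state) * beta(state) / Z
        state_post = [{s: alpha[t][s] * beta[t][s] / Z for s in ("C", "H")}
                      for t in range(len(ice))]
        # p(arc s -> n between days t and t+1 | evidence) = alpha(s) * p * beta(n) / Z,
        # where p is that arc's weight p(n|s) * p(cones on day t+1 | n).
        arc_post = [{(s, n): alpha[t][s]
                             * p_trans[(s, n)] * p_emit[(n, ice[t + 1])]
                             * beta[t + 1][n] / Z
                     for s in ("C", "H") for n in ("C", "H")}
                    for t in range(len(ice) - 1)]
        return state_post, arc_post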

16  Posterior tagging
- Give each word its highest-prob tag according to forward-backward. Do this independently of the other words.

    sequence    prob    expected # correct tags
    Det Adj     0.35    0.55 + 0.35 = 0.9
    Det N       0.2     0.55 + 0.2  = 0.75
    N   V       0.45    0.45 + 0.45 = 0.9
    Det V       0       0.55 + 0.45 = 1.0   <- the output

- Defensible: it maximizes the expected # of correct tags.
- But the output is not a coherent sequence, which may screw up subsequent processing (e.g., we can't find any parse).
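
A tiny Python check of the arithmetic on this slide (my own illustration): the per-word marginals are summed from the three possible sequences, exactly the quantities forward-backward would supply, and each candidate output is scored by its expected number of correct tags.

    seq_probs = {("Det", "Adj"): 0.35, ("Det", "N"): 0.2, ("N", "V"): 0.45}

    def marginal(position, tag):
        # Posterior probability that the word at this position gets this tag.
        return sum(p for seq, p in seq_probs.items() if seq[position] == tag)

    def expected_correct(tagging):
        # Expected number of correct tags if we output this tagging.
        return sum(marginal(i, tag) for i, tag in enumerate(tagging))

    print(expected_correct(("Det", "Adj")))   # 0.55 + 0.35 = 0.90
    print(expected_correct(("N", "V")))       # 0.45 + 0.45 = 0.90
    print(expected_correct(("Det", "V")))     # 0.55 + 0.45 = 1.00  <- the posterior tagging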

17  Alternative: Viterbi tagging
- Posterior tagging: give each word its highest-prob tag according to forward-backward.
      Det Adj   0.35
      Det N     0.2
      N   V     0.45
- Viterbi tagging: pick the single best tag sequence (the best path): N V, with probability 0.45.
- Same algorithm as forward-backward, but it uses a semiring that maximizes over paths instead of summing over paths.

18  Max-product instead of sum-product
Use a semiring that maximizes over paths instead of summing over them. We write µ, ν instead of α, β for these "Viterbi forward" and "Viterbi backward" probabilities.

The dynamic programming computation of µ (ν is similar but works back from Stop):

    Columns: Start;  Day 1: 2 cones;  Day 2: 3 cones;  Day 3: 3 cones (same arc weights as the trellis on slide 10).
    µ values:
        Day 1:  µ(C) = 0.1                                        µ(H) = 0.1
        Day 2:  µ(C) = max(0.1·0.08, 0.1·0.01) = 0.008            µ(H) = max(0.1·0.07, 0.1·0.56) = 0.056
        Day 3:  µ(C) = max(0.008·0.08, 0.056·0.01) = 0.00064      µ(H) = max(0.008·0.07, 0.056·0.56) = 0.03136

- α·β at a state = total prob of all paths through that state.
- µ·ν at a state = max prob of any path through that state.
- Suppose the max-prob path has prob p: how do we print it?
    - Print the state at each time step with the highest µ·ν (= p); works if there are no ties.
    - Or, compute the µ values from left to right to find p, then follow backpointers.
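
A minimal Python sketch (mine, reusing the hypothetical parameter tables from the earlier forward sketch) of the max-product version: the sum in the forward recurrence becomes a max, and a backpointer at each state lets us print the best path.

    p_trans = {("Start", "C"): 0.5, ("Start", "H"): 0.5,
               ("C", "C"): 0.8, ("C", "H"): 0.1,
               ("H", "H"): 0.8, ("H", "C"): 0.1}
    p_emit = {("C", 2): 0.2, ("C", 3): 0.1,
              ("H", 2): 0.2, ("H", 3): 0.7}

    def viterbi(ice):
        best = [{"Start": (1.0, None)}]      # state -> (max prob of any path here, backpointer)
        for cones in ice:
            prev_row = best[-1]
            row = {}
            for s in ("C", "H"):
                prob, back = max((prev_row[p][0] * p_trans[(p, s)] * p_emit[(s, cones)], p)
                                 for p in prev_row)
                row[s] = (prob, back)
            best.append(row)
        state = max(best[-1], key=lambda s: best[-1][s][0])   # best final state
        path = []
        for row in reversed(best[1:]):       # follow backpointers right to left
            path.append(state)
            state = row[state][1]
        return list(reversed(path)), best

    path, best = viterbi([2, 3, 3])
    print(path)      # ['H', 'H', 'H'] for this 3-day sequence
    print(best[2])   # day 2: C -> 0.008, H -> 0.056, matching the slide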

