Intro to NLP - J. Eisner, slide 1: Hidden Markov Models and the Forward-Backward Algorithm
Intro to NLP - J. Eisner, slide 2: Please See the Spreadsheet
I like to teach this material using an interactive spreadsheet; the accompanying page has the spreadsheet and the lesson plan.
I'll also show the following slides at appropriate points.
Intro to NLP - J. Eisner, slide 3: Marginalization

SALES      Jan  Feb  Mar  Apr  ...
Widgets      5    0    3    2  ...
Grommets     7    3   10    8  ...
Gadgets      0    0    1    0  ...
...        ...  ...  ...  ...  ...
Intro to NLP - J. Eisner, slide 4: Marginalization

Write the totals in the margins.

SALES      Jan  Feb  Mar  Apr  ...  TOTAL
Widgets      5    0    3    2  ...     30
Grommets     7    3   10    8  ...     80
Gadgets      0    0    1    0  ...      2
...        ...  ...  ...  ...  ...    ...
TOTAL      ...  ...  ...  ...  ...  grand total
Intro to NLP - J. Eisner, slide 5: Marginalization

Given a random sale, what & when was it?

prob.      Jan  Feb  Mar  Apr  ...  TOTAL
Widgets    ...  ...  ...  ...  ...   .030
Grommets   ...  ...  ...  ...  ...   .080
Gadgets    ...  ...  ...  ...  ...   .002
...        ...  ...  ...  ...  ...    ...
TOTAL      ...  ...  ...  ...  ...  grand total
Intro to NLP - J. Eisner, slide 6: Marginalization

Given a random sale, what & when was it?

prob.      Jan  Feb  Mar  Apr  ...  TOTAL
Widgets    ...  ...  ...  ...  ...   .030
Grommets   ...  ...  ...  ...  ...   .080
Gadgets    ...  ...  ...  ...  ...   .002
...        ...  ...  ...  ...  ...    ...
TOTAL      ...  ...  ...  ...  ...  grand total

Each cell is a joint prob, e.g. p(Jan, widget).
Each column total is a marginal prob, e.g. p(Jan); each row total is a marginal prob, e.g. p(widget).
The grand total is the marginal prob p(anything in table) = 1.
Intro to NLP - J. Eisner, slide 7: Conditionalization

Given a random sale in Jan., what was it?

prob.      Jan  Feb  Mar  Apr  ...  TOTAL
Widgets   .005  ...  ...  ...  ...   .030
Grommets   ...  ...  ...  ...  ...   .080
Gadgets    ...  ...  ...  ...  ...   .002
...        ...  ...  ...  ...  ...    ...
TOTAL     .099  ...  ...  ...  ...  grand total

marginal prob: p(Jan) = .099
joint prob: p(Jan, widget) = .005
conditional prob: p(widget | Jan) = .005/.099
The p(... | Jan) column runs from .005/.099 down to a total of .099/.099 = 1.
Divide the Jan column through by Z = .099 so it sums to 1.
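To make the column-division concrete, here is a minimal Python sketch over just the visible cells of the sales table. The function names are mine, and since the real table has more rows and columns, the totals here will not match the slide's .030/.080/.099 figures.

```python
# Marginalization and conditionalization on a small joint count table.
# Only the cells visible on the slide are included, so the totals differ
# from the slide's (which sums over the full table).

sales = {
    ("Widgets",  "Jan"): 5, ("Widgets",  "Feb"): 0, ("Widgets",  "Mar"): 3,  ("Widgets",  "Apr"): 2,
    ("Grommets", "Jan"): 7, ("Grommets", "Feb"): 3, ("Grommets", "Mar"): 10, ("Grommets", "Apr"): 8,
    ("Gadgets",  "Jan"): 0, ("Gadgets",  "Feb"): 0, ("Gadgets",  "Mar"): 1,  ("Gadgets",  "Apr"): 0,
}

grand_total = sum(sales.values())
joint = {key: count / grand_total for key, count in sales.items()}   # p(item, month)

def marginal_month(month):
    """p(month): sum the month's column of the joint table."""
    return sum(p for (item, m), p in joint.items() if m == month)

def conditional_item_given_month(item, month):
    """p(item | month): divide the cell by the column total Z."""
    Z = marginal_month(month)                  # the normalizer for that column
    return joint[(item, month)] / Z

print(marginal_month("Jan"))                           # marginal prob of a January sale
print(conditional_item_given_month("Widgets", "Jan"))  # joint cell / column total
```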
Intro to NLP - J. Eisner, slide 8: Marginalization & conditionalization in the weather example

Instead of a 2-dimensional table, now we have a 66-dimensional table:
33 of the dimensions have 2 choices: {C, H}
33 of the dimensions have 3 choices: {1, 2, 3}
Cross-section showing just 3 of the dimensions: columns for Weather2 = C and Weather2 = H, rows for IceCream2 = 1, 2, 3 (cell values elided).
Intro to NLP - J. Eisner, slide 9: Interesting probabilities in the weather example

Prior probability of weather: p(Weather = CHH...)
Posterior probability of weather (after observing evidence): p(Weather = CHH... | IceCream = 233...)
Posterior marginal probability that day 3 is hot:
  p(Weather3 = H | IceCream = 233...) = sum over all w such that w3 = H of p(Weather = w | IceCream = 233...)
Posterior conditional probability that day 3 is hot if day 2 is:
  p(Weather3 = H | Weather2 = H, IceCream = 233...)
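The posterior marginal can be checked by brute force on a miniature version of the example. The sketch below enumerates all weather sequences for a 3-day diary reading 2, 3, 3, using the transition and emission probabilities that appear on the trellis slides that follow; the variable names are mine. Forward-backward gets the same answer without enumerating all 2^33 sequences.

```python
from itertools import product

# HMM parameters as they appear on the trellis slides (only the entries needed here).
p_trans = {("Start", "C"): 0.5, ("Start", "H"): 0.5,
           ("C", "C"): 0.8, ("C", "H"): 0.1, ("C", "Stop"): 0.1,
           ("H", "H"): 0.8, ("H", "C"): 0.1, ("H", "Stop"): 0.1}
p_emit  = {("C", 2): 0.2, ("C", 3): 0.1,
           ("H", 2): 0.2, ("H", 3): 0.7}

ice_cream = [2, 3, 3]          # a 3-day miniature of the 33-day diary

def joint_prob(weather):
    """p(Weather = weather, IceCream = ice_cream): product of edge weights on one path."""
    p, prev = 1.0, "Start"
    for w, x in zip(weather, ice_cream):
        p *= p_trans[(prev, w)] * p_emit[(w, x)]
        prev = w
    return p * p_trans[(prev, "Stop")]

# Posterior marginal p(Weather3 = H | IceCream = 233):
# sum the joint over all weather sequences w with w3 = H, then normalize.
all_seqs  = list(product("CH", repeat=len(ice_cream)))
Z         = sum(joint_prob(w) for w in all_seqs)                 # p(IceCream = 233)
numerator = sum(joint_prob(w) for w in all_seqs if w[2] == "H")
print(numerator / Z)
```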
Intro to NLP - J. Eisner, slide 10: The HMM trellis

The dynamic programming computation of α works forward from Start.

Trellis figure: Start, then states C and H for Day 1 (2 cones), Day 2 (3 cones), Day 3 (3 cones).
Edge weights: p(C|Start)*p(2|C) = 0.5*0.2 = 0.1; p(H|Start)*p(2|H) = 0.5*0.2 = 0.1;
p(C|C)*p(3|C) = 0.8*0.1 = 0.08; p(H|H)*p(3|H) = 0.8*0.7 = 0.56;
p(H|C)*p(3|H) = 0.1*0.7 = 0.07; p(C|H)*p(3|C) = 0.1*0.1 = 0.01.
α values: α(Start) = 1; Day 1: α(C) = 0.1, α(H) = 0.1;
Day 2: α(C) = 0.1*0.08 + 0.1*0.01 = 0.009, α(H) = 0.1*0.07 + 0.1*0.56 = 0.063;
Day 3: α(C) = 0.009*0.08 + 0.063*0.01 = 0.00135, α(H) = 0.009*0.07 + 0.063*0.56 = 0.03591.

This "trellis" graph has 2^33 paths. These represent all possible weather sequences that could explain the observed ice cream sequence 2, 3, 3, ...
What is the product of all the edge weights on one path H, H, H, ...? The edge weights are chosen so that the product is p(weather = H,H,H,... & ice cream = 2,3,3,...).
What is the α value at each state? It's the total probability of all paths from Start to that state.
How can we compute it fast when there are many paths?
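A short sketch of this forward pass in Python, hard-coding the slide's probabilities for the first three days (the variable names are mine); it reproduces the α values 0.1, 0.063, 0.009 shown above.

```python
# Forward (alpha) pass over the first three days of the trellis above.
# alpha[state] = total probability of all paths from Start to that state.

p_trans = {("Start", "C"): 0.5, ("Start", "H"): 0.5,
           ("C", "C"): 0.8, ("C", "H"): 0.1,
           ("H", "H"): 0.8, ("H", "C"): 0.1}
p_emit  = {("C", 2): 0.2, ("C", 3): 0.1, ("H", 2): 0.2, ("H", 3): 0.7}

ice_cream = [2, 3, 3]
states = ["C", "H"]

alpha = {"Start": 1.0}
prev_layer = ["Start"]
for x in ice_cream:
    new_alpha = {}
    for s in states:
        # Sum over predecessors: alpha(prev) * p(s | prev) * p(x | s).
        new_alpha[s] = sum(alpha[prev] * p_trans[(prev, s)] * p_emit[(s, x)]
                           for prev in prev_layer)
    alpha, prev_layer = new_alpha, states
    print(alpha)   # day 1: C ≈ 0.1, H ≈ 0.1; day 2: C ≈ 0.009, H ≈ 0.063; ...
```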
Intro to NLP - J. Eisner, slide 11: Computing α Values

All paths to a state: if one predecessor state has value α1 (reached by paths a, b, c) and connects by an arc of weight p1, and the other has value α2 (reached by paths d, e, f) and connects by an arc of weight p2, then
  α = (a p1 + b p1 + c p1) + (d p2 + e p2 + f p2) = α1 p1 + α2 p2
Thanks, distributive law!
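A tiny numeric check of that identity, with arbitrary made-up values standing in for the path probabilities a–f and the arc probabilities p1, p2:

```python
# The distributive law behind the alpha recurrence:
# summing six path products equals combining two precomputed alpha values.
a, b, c, d, e, f = 0.02, 0.03, 0.05, 0.01, 0.04, 0.06   # arbitrary path probabilities
p1, p2 = 0.7, 0.2                                        # arc probabilities into the state

six_paths = (a*p1 + b*p1 + c*p1) + (d*p2 + e*p2 + f*p2)
alpha1, alpha2 = a + b + c, d + e + f
assert abs(six_paths - (alpha1*p1 + alpha2*p2)) < 1e-12
```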
Intro to NLP - J. Eisner, slide 12: The HMM trellis

The dynamic programming computation of β works back from Stop.

Trellis figure: on Day 34 we lose the diary, so every path ends at Stop; states C and H are shown for Day 33 (2 cones) and Day 32 (2 cones).
Edge weights into Stop: p(Stop|C) = 0.1, p(Stop|H) = 0.1.
Edge weights between adjacent days (2 cones observed): p(C|C)*p(2|C) = 0.8*0.2 = 0.16; p(H|H)*p(2|H) = 0.8*0.2 = 0.16; p(H|C)*p(2|H) = 0.1*0.2 = 0.02; p(C|H)*p(2|C) = 0.1*0.2 = 0.02.
β values: at the Day 33 states, β(C) = β(H) = 0.1; one day back, β(C) = β(H) = 0.16*0.1 + 0.02*0.1 = 0.018; one day back again, β(C) = β(H) = 0.16*0.018 + 0.02*0.018 = 0.00324.

What is the β value at each state? It's the total probability of all paths from that state to Stop.
How can we compute it fast when there are many paths?
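The same backward pass as a short Python sketch, hard-coding the slide's probabilities (the variable names are mine); it reproduces the β values 0.1, 0.018, 0.00324 read off the trellis above.

```python
# Backward (beta) pass over the last days of the trellis above.
# beta[state] = total probability of all paths from that state to Stop.

p_trans = {("C", "C"): 0.8, ("C", "H"): 0.1, ("C", "Stop"): 0.1,
           ("H", "H"): 0.8, ("H", "C"): 0.1, ("H", "Stop"): 0.1}
p_emit  = {("C", 2): 0.2, ("H", 2): 0.2}

states = ["C", "H"]
beta = {s: p_trans[(s, "Stop")] for s in states}    # day 33: beta(C) = beta(H) = 0.1
print(beta)

for x in [2, 2]:         # observed cones on the following day, working backward in time
    beta = {s: sum(p_trans[(s, nxt)] * p_emit[(nxt, x)] * beta[nxt] for nxt in states)
            for s in states}
    print(beta)          # ≈ 0.018, 0.018 and then ≈ 0.00324, 0.00324
```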
Intro to NLP - J. Eisner, slide 13: Computing β Values

All paths from a state: if one successor state has value β1 (with outgoing paths u, v, w) and is reached by an arc of weight p1, and the other has value β2 (with outgoing paths x, y, z) and is reached by an arc of weight p2, then
  β = (p1 u + p1 v + p1 w) + (p2 x + p2 y + p2 z) = p1 β1 + p2 β2
Intro to NLP - J. Eisner, slide 14: Computing State Probabilities

All paths through a state C, where paths a, b, c come in and paths x, y, z go out:
  ax + ay + az + bx + by + bz + cx + cy + cz = (a + b + c)(x + y + z) = α(C) β(C)
Thanks, distributive law!
Intro to NLP - J. Eisner, slide 15: Computing Arc Probabilities

All paths through an arc of weight p from H to C, where paths a, b, c come into H and paths x, y, z go out of C:
  apx + apy + apz + bpx + bpy + bpz + cpx + cpy + cpz = (a + b + c) p (x + y + z) = α(H) p β(C)
Thanks, distributive law!
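Putting slides 14 and 15 together, here is a sketch that runs both passes on the 3-day miniature (ice cream = 2, 3, 3) and then reads off a state posterior α·β/Z and an arc posterior α·p·β/Z. The particular day and arc chosen, and all variable names, are mine; the probabilities are the ones on the trellis slides.

```python
# State and arc posteriors from alpha and beta, on the 3-day miniature (ice cream = 2, 3, 3).

p_trans = {("Start", "C"): 0.5, ("Start", "H"): 0.5,
           ("C", "C"): 0.8, ("C", "H"): 0.1, ("C", "Stop"): 0.1,
           ("H", "H"): 0.8, ("H", "C"): 0.1, ("H", "Stop"): 0.1}
p_emit  = {("C", 2): 0.2, ("C", 3): 0.1, ("H", 2): 0.2, ("H", 3): 0.7}
ice_cream = [2, 3, 3]
states = ["C", "H"]

# Forward: alpha[t][s] = total prob of all paths from Start to state s on day t (1-based days).
alpha = [{"Start": 1.0}]
for t, x in enumerate(ice_cream, start=1):
    prev = alpha[-1]
    alpha.append({s: sum(prev[r] * p_trans[(r, s)] * p_emit[(s, x)] for r in prev)
                  for s in states})

# Backward: beta[t][s] = total prob of all paths from state s on day t to Stop.
beta = [None] * (len(ice_cream) + 1)
beta[len(ice_cream)] = {s: p_trans[(s, "Stop")] for s in states}
for t in range(len(ice_cream) - 1, 0, -1):
    x = ice_cream[t]     # ice_cream[t] is the observation on day t+1 in this 1-based indexing
    beta[t] = {s: sum(p_trans[(s, r)] * p_emit[(r, x)] * beta[t + 1][r] for r in states)
               for s in states}

Z = sum(alpha[3][s] * beta[3][s] for s in states)      # p(ice cream = 2, 3, 3)

# State posterior: all paths through (H, day 2), normalized.
print(alpha[2]["H"] * beta[2]["H"] / Z)

# Arc posterior: all paths using the H -> C arc from day 2 to day 3,
# i.e. alpha(H, day 2) * p(C|H) * p(3|C) * beta(C, day 3), normalized.
print(alpha[2]["H"] * p_trans[("H", "C")] * p_emit[("C", 3)] * beta[3]["C"] / Z)
```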
Intro to NLP - J. Eisner, slide 16: Posterior tagging

Give each word its highest-prob tag according to forward-backward. Do this independently of other words.

sequence   prob   exp # correct tags
Det Adj    0.35   0.55 + 0.35 = 0.9
Det N      0.2    0.55 + 0.2  = 0.75
N V        0.45   0.45 + 0.45 = 0.9
Output is:
Det V      0      0.55 + 0.45 = 1.0

Defensible: maximizes expected # of correct tags.
But not a coherent sequence. May screw up subsequent processing (e.g., can't find any parse).
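A small sketch of the arithmetic on this slide: it recovers the per-word marginals from the three sequence probabilities, picks the posterior tagging Det V, and checks the expected-number-correct figures (the helper names are mine).

```python
from collections import defaultdict

# Posterior distribution over whole tag sequences for a 2-word sentence (from the slide).
seq_prob = {("Det", "Adj"): 0.35, ("Det", "N"): 0.2, ("N", "V"): 0.45}

# Per-word marginals, as forward-backward would give them.
marginal = [defaultdict(float) for _ in range(2)]
for seq, p in seq_prob.items():
    for i, tag in enumerate(seq):
        marginal[i][tag] += p

# Posterior tagging: each word independently takes its highest-marginal tag.
posterior_tags = tuple(max(m, key=m.get) for m in marginal)
print(posterior_tags)                       # ('Det', 'V'), a sequence with probability 0

def expected_correct(tags):
    """Expected number of correct tags = sum of the chosen tags' marginals."""
    return sum(marginal[i][t] for i, t in enumerate(tags))

for tags in [("Det", "Adj"), ("Det", "N"), ("N", "V"), posterior_tags]:
    print(tags, expected_correct(tags))     # 0.9, 0.75, 0.9, 1.0
```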
Intro to NLP - J. Eisner, slide 17: Alternative: Viterbi tagging

Posterior tagging: Give each word its highest-prob tag according to forward-backward.
  Det Adj  0.35
  Det N    0.2
  N V      0.45
Viterbi tagging: Pick the single best tag sequence (best path):
  N V      0.45
Same algorithm as forward-backward, but uses a semiring that maximizes over paths instead of summing over paths.
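One way to see "same algorithm, different semiring" in code: a single trellis pass parameterized by the semiring's additive operation. This abstraction and its names are my own sketch, not code from the course.

```python
# One trellis pass, parameterized by the semiring's "plus" operation.
# plus=sum -> forward probabilities (total prob of all paths to each state)
# plus=max -> Viterbi forward probabilities (best single path to each state)

def forward_pass(obs, states, p_trans, p_emit, plus):
    values = {"Start": 1.0}
    for x in obs:
        values = {s: plus(values[r] * p_trans[(r, s)] * p_emit[(s, x)] for r in values)
                  for s in states}
    return values

p_trans = {("Start", "C"): 0.5, ("Start", "H"): 0.5,
           ("C", "C"): 0.8, ("C", "H"): 0.1, ("H", "H"): 0.8, ("H", "C"): 0.1}
p_emit  = {("C", 2): 0.2, ("C", 3): 0.1, ("H", 2): 0.2, ("H", 3): 0.7}

print(forward_pass([2, 3, 3], ["C", "H"], p_trans, p_emit, plus=sum))  # forward-backward's alpha
print(forward_pass([2, 3, 3], ["C", "H"], p_trans, p_emit, plus=max))  # Viterbi best-path scores
```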
Intro to NLP - J. Eisner, slide 18: Max-product instead of sum-product

Use a semiring that maximizes over paths instead of summing.
We give these "Viterbi forward" and "Viterbi backward" probabilities their own symbols, in place of α and β.

Trellis figure (the dynamic programming computation of the Viterbi forward values; the Viterbi backward computation is similar but works back from Stop):
Day 1 (2 cones): p(C|Start)*p(2|C) = 0.5*0.2 = 0.1; p(H|Start)*p(2|H) = 0.5*0.2 = 0.1; Viterbi forward value 0.1 at both C and H.
Edge weights for a 3-cone day: p(C|C)*p(3|C) = 0.8*0.1 = 0.08; p(H|H)*p(3|H) = 0.8*0.7 = 0.56; p(H|C)*p(3|H) = 0.1*0.7 = 0.07; p(C|H)*p(3|C) = 0.1*0.1 = 0.01.
Day 2 (3 cones): H: max(0.1*0.07, 0.1*0.56) = 0.056; C: max(0.1*0.08, 0.1*0.01) = 0.008.
Day 3 (3 cones): H: max(0.008*0.07, 0.056*0.56) = 0.03136; C: max(0.008*0.08, 0.056*0.01) = 0.00064.

α*β at a state = total prob of all paths through that state.
The product of the Viterbi forward and backward values at a state = max prob of any path through that state.
Suppose the max prob path has prob p: how to print it?
  Print the state at each time step with the highest such product (= p); works if there are no ties.
  Or, compute the Viterbi forward values from left to right to find p, then follow backpointers.
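And a sketch of the max-product pass itself, with backpointers for printing the best path (the names are mine; the probabilities are the slide's). It reproduces the 0.056 and 0.008 values above.

```python
# Viterbi (max-product) forward pass with backpointers over days 1-3 (ice cream = 2, 3, 3).

p_trans = {("Start", "C"): 0.5, ("Start", "H"): 0.5,
           ("C", "C"): 0.8, ("C", "H"): 0.1, ("H", "H"): 0.8, ("H", "C"): 0.1}
p_emit  = {("C", 2): 0.2, ("C", 3): 0.1, ("H", 2): 0.2, ("H", 3): 0.7}
ice_cream = [2, 3, 3]
states = ["C", "H"]

best = {"Start": 1.0}          # max prob of any path from Start to each state so far
backpointers = []
for x in ice_cream:
    new_best, back = {}, {}
    for s in states:
        # Keep only the best incoming path instead of summing all of them.
        prev, score = max(((r, best[r] * p_trans[(r, s)] * p_emit[(s, x)]) for r in best),
                          key=lambda pair: pair[1])
        new_best[s], back[s] = score, prev
    best = new_best
    backpointers.append(back)
    print(best)                # day 2: C ≈ 0.008, H ≈ 0.056, matching the slide

# Print the best path by following backpointers from the best final state.
state = max(best, key=best.get)
path = [state]
for back in reversed(backpointers[1:]):   # backpointers[0] just points to Start
    state = back[state]
    path.append(state)
print(list(reversed(path)), max(best.values()))
```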