Hidden Markov Models and the Forward-Backward Algorithm

Presentation transcript:

Hidden Markov Models and the Forward-Backward Algorithm
600.465 - Intro to NLP - J. Eisner

Please See the Spreadsheet
I like to teach this material using an interactive spreadsheet: http://cs.jhu.edu/~jason/papers/#tnlp02 (has the spreadsheet and the lesson plan). I'll also show the following slides at appropriate points.
600.465 - Intro to NLP - J. Eisner

Marginalization
[Table: SALES counts by product and month. Rows: Widgets (5, 3, 2, …), Grommets (7, 10, 8, …), Gadgets (1, …). Columns: Jan, Feb, Mar, Apr, ….]
600.465 - Intro to NLP - J. Eisner

Marginalization
Write the totals in the margins.
[Same SALES table with marginal totals added: row totals Widgets 30, Grommets 80, Gadgets …; Jan column total 99; other totals 25, 126, 90, …; grand total 1000.]
600.465 - Intro to NLP - J. Eisner

Marginalization
Given a random sale, what & when was it?
[Same table converted to probabilities by dividing through by the grand total: Widgets row .005, .003, .002 with total .030; Grommets .007, .010, .008 with total .080; Gadgets .001, …; Jan total .099; other totals .025, .126, .090, …; grand total 1.000.]
600.465 - Intro to NLP - J. Eisner

Marginalization
Given a random sale, what & when was it?
Each cell of the probability table is a joint probability, e.g. p(Jan, widget) = .005. The row totals in the margin are marginal probabilities such as p(widget) = .030, the column totals are marginals such as p(Jan) = .099, and the grand total is the marginal prob of anything in the table, 1.000.
600.465 - Intro to NLP - J. Eisner

Conditionalization
Given a random sale in Jan., what was it?
Divide the Jan column through by Z = 0.099 so that it sums to 1: the joint prob p(Jan, widget) = .005 becomes the conditional prob p(widget | Jan) = .005/.099, likewise .007/.099 for grommets, …, and the marginal prob p(Jan) = .099 becomes .099/.099 = 1.
600.465 - Intro to NLP - J. Eisner
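The arithmetic on these two slides is easy to mirror in code. Below is a minimal Python sketch, not part of the original lesson; the product and month labels echo the example, but the counts are made-up placeholders, since the slide's full table is not recoverable here.

```python
# Minimal sketch of marginalization and conditionalization over a joint table.
# The counts below are placeholders, not the (partly unreadable) numbers on the slide.

sales = {                      # sales[product][month] = count
    "widget":  {"Jan": 5, "Feb": 3, "Mar": 2},
    "grommet": {"Jan": 7, "Feb": 10, "Mar": 8},
    "gadget":  {"Jan": 1, "Feb": 0, "Mar": 4},
}

grand_total = sum(c for row in sales.values() for c in row.values())

# Joint probability p(product, month): divide every cell by the grand total.
p_joint = {prod: {m: c / grand_total for m, c in row.items()}
           for prod, row in sales.items()}

# Marginal p(product): sum each row ("write the total in the margin").
p_product = {prod: sum(row.values()) for prod, row in p_joint.items()}

# Marginal p(month): sum each column.
months = next(iter(sales.values())).keys()
p_month = {m: sum(p_joint[prod][m] for prod in sales) for m in months}

# Conditional p(product | Jan): divide the Jan column through by Z = p(Jan).
Z = p_month["Jan"]
p_product_given_jan = {prod: p_joint[prod]["Jan"] / Z for prod in sales}

print(p_product_given_jan)     # sums to 1 by construction
```

Dividing the whole table by the grand total gives the joint distribution; dividing one column by its own total gives a conditional distribution, which is exactly the Z = p(Jan) normalization on the slide.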

Marginalization & conditionalization in the weather example
Instead of a 2-dimensional table, now we have a 66-dimensional table:
33 of the dimensions have 2 choices: {C, H} (the weather on each of the 33 days)
33 of the dimensions have 3 choices: {1, 2, 3} (the number of ice creams on each day)
Cross-section showing just 3 of the dimensions: Weather3 = C vs. Weather3 = H, Weather2 = C vs. Weather2 = H, and IceCream2 = 1, 2, or 3; each cell holds a tiny joint probability (0.000…).
600.465 - Intro to NLP - J. Eisner

Interesting probabilities in the weather example
Prior probability of weather: p(Weather = CHH…)
Posterior probability of weather (after observing evidence): p(Weather = CHH… | IceCream = 233…)
Posterior marginal probability that day 3 is hot:
p(Weather3 = H | IceCream = 233…) = Σ_{w : w3 = H} p(Weather = w | IceCream = 233…)
Posterior conditional probability that day 3 is hot if day 2 is: p(Weather3 = H | Weather2 = H, IceCream = 233…)
600.465 - Intro to NLP - J. Eisner

The HMM trellis
The dynamic programming computation of α works forward from Start. This "trellis" graph has 2^33 paths, representing all possible weather sequences that could explain the observed ice cream sequence 2, 3, 3, …

Day 1: 2 cones; Day 2: 3 cones; Day 3: 3 cones
Start: α = 1
Arcs out of Start: p(C|Start)·p(2|C) = 0.5·0.2 = 0.1 and p(H|Start)·p(2|H) = 0.5·0.2 = 0.1
Arcs between days: p(C|C)·p(3|C) = 0.8·0.1 = 0.08; p(H|C)·p(3|H) = 0.1·0.7 = 0.07; p(C|H)·p(3|C) = 0.1·0.1 = 0.01; p(H|H)·p(3|H) = 0.8·0.7 = 0.56
Day 1: α(C) = 0.1, α(H) = 0.1
Day 2: α(C) = 0.1·0.08 + 0.1·0.01 = 0.009; α(H) = 0.1·0.07 + 0.1·0.56 = 0.063
Day 3: α(C) = 0.009·0.08 + 0.063·0.01 = 0.00135; α(H) = 0.009·0.07 + 0.063·0.56 = 0.03591

What is the product of all the edge weights on one path H, H, H, …? The edge weights are chosen so that it equals p(weather = H,H,H,… & icecream = 2,3,3,…).
What is the α probability at each state? It's the total probability of all paths from Start to that state. How can we compute it fast when there are many paths?
600.465 - Intro to NLP - J. Eisner

Computing α Values
All paths to a state:
α = (a·p1 + b·p1 + c·p1) + (d·p2 + e·p2 + f·p2)
  = (a + b + c)·p1 + (d + e + f)·p2
  = α1·p1 + α2·p2
Here α1 = a + b + c and α2 = d + e + f are the α values already computed at the two predecessor states (C and H), and p1, p2 are the weights of the arcs from those predecessors into the current state.
Thanks, distributive law!
600.465 - Intro to NLP - J. Eisner
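To make the α recursion concrete, here is a minimal Python sketch of the forward pass on the three observed days (2, 3, 3 cones), not part of the original lesson. The transition and emission numbers come from the trellis slide; p(1|C) = 0.7 and p(1|H) = 0.1 are filled in so that each emission distribution over {1, 2, 3} cones sums to 1.

```python
# Minimal sketch of the forward (alpha) recursion on the ice-cream HMM.

trans = {                               # p(next_state | prev_state)
    "Start": {"C": 0.5, "H": 0.5},
    "C":     {"C": 0.8, "H": 0.1},      # the remaining 0.1 from C and from H goes to Stop
    "H":     {"C": 0.1, "H": 0.8},
}
emit = {                                # p(ice creams | state)
    "C": {1: 0.7, 2: 0.2, 3: 0.1},
    "H": {1: 0.1, 2: 0.2, 3: 0.7},
}
ice_cream = [2, 3, 3]                   # observations for days 1..3
states = ["C", "H"]

# alpha[t][s] = total weight of all paths from Start to state s on day t+1
alpha = [{s: trans["Start"][s] * emit[s][ice_cream[0]] for s in states}]
for obs in ice_cream[1:]:
    prev = alpha[-1]
    alpha.append({
        s: sum(prev[r] * trans[r][s] * emit[s][obs] for r in states)
        for s in states
    })

for day, a in enumerate(alpha, start=1):
    print(f"day {day}: alpha = {a}")
# day 2: alpha ≈ {'C': 0.009, 'H': 0.063}        (matches the slide, up to float rounding)
# day 3: alpha ≈ {'C': 0.00135, 'H': 0.03591}
```

Each α is a sum over predecessors, which is exactly the α1·p1 + α2·p2 pattern on the slide above.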

The HMM trellis
The dynamic programming computation of β works back from Stop.

Day 32: 2 cones; Day 33: 2 cones; Day 34: lose diary
Arcs into Stop: p(Stop|C) = 0.1 and p(Stop|H) = 0.1
Arcs between days: p(C|C)·p(2|C) = 0.8·0.2 = 0.16; p(H|C)·p(2|H) = 0.1·0.2 = 0.02; p(C|H)·p(2|C) = 0.1·0.2 = 0.02; p(H|H)·p(2|H) = 0.8·0.2 = 0.16
At the last observed day: β(C) = β(H) = 0.1
One day earlier: β(C) = 0.16·0.1 + 0.02·0.1 = 0.018; β(H) = 0.02·0.1 + 0.16·0.1 = 0.018
One day earlier still: β(C) = 0.16·0.018 + 0.02·0.018 = 0.00324; β(H) = 0.02·0.018 + 0.16·0.018 = 0.00324

What is the β probability at each state? It's the total probability of all paths from that state to Stop. How can we compute it fast when there are many paths?
600.465 - Intro to NLP - J. Eisner

Computing β Values
All paths from a state:
β = (p1·u + p1·v + p1·w) + (p2·x + p2·y + p2·z)
  = p1·(u + v + w) + p2·(x + y + z)
  = p1·β1 + p2·β2
Here β1 = u + v + w and β2 = x + y + z are the β values already computed at the two successor states (C and H), and p1, p2 are the weights of the arcs into them.
600.465 - Intro to NLP - J. Eisner
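A matching sketch of the β recursion, again mine rather than the lesson's, run backwards over a run of 2-cone days as in the Day 32/33 trellis; p(Stop|C) = p(Stop|H) = 0.1 comes from that slide, and the trans/emit tables repeat the same numbers as the forward sketch.

```python
# Minimal sketch of the backward (beta) recursion over the last observed days.

trans  = {"C": {"C": 0.8, "H": 0.1}, "H": {"C": 0.1, "H": 0.8}}          # p(next | prev)
emit   = {"C": {1: 0.7, 2: 0.2, 3: 0.1}, "H": {1: 0.1, 2: 0.2, 3: 0.7}}  # p(cones | state)
p_stop = {"C": 0.1, "H": 0.1}
tail_obs = [2, 2, 2]                    # ice creams on the last few observed days
states = ["C", "H"]

# beta[t][s] = total weight of all paths from state s on that day to Stop
beta = [{s: p_stop[s] for s in states}]          # beta at the final observed day
for obs in reversed(tail_obs[1:]):               # the observation emitted on the *next* day
    nxt = beta[0]
    beta.insert(0, {
        s: sum(trans[s][r] * emit[r][obs] * nxt[r] for r in states)
        for s in states
    })

for day_beta in beta:
    print(day_beta)
# ≈ {'C': 0.00324, 'H': 0.00324}, {'C': 0.018, 'H': 0.018}, {'C': 0.1, 'H': 0.1}
# (matches the slide, up to float rounding)
```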

Computing State Probabilities
All paths through a state:
  a·x + a·y + a·z + b·x + b·y + b·z + c·x + c·y + c·z
  = (a + b + c)(x + y + z)
  = α(C) · β(C)
Thanks, distributive law!
600.465 - Intro to NLP - J. Eisner

Computing Arc Probabilities
All paths through the p arc (from an H state to the next C state):
  a·p·x + a·p·y + a·p·z + b·p·x + b·p·y + b·p·z + c·p·x + c·p·y + c·p·z
  = (a + b + c) · p · (x + y + z)
  = α(H) · p · β(C)
Thanks, distributive law!
600.465 - Intro to NLP - J. Eisner
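Combining the two recursions gives the state and arc posteriors these slides describe. The sketch below, again not from the lesson itself, runs forward and backward passes over the short sequence 2, 3, 3 (treated, for illustration, as the entire diary before Stop) and then forms α·β/Z and α·p·β/Z.

```python
# Minimal sketch: forward-backward state and arc posteriors for the ice-cream HMM.

states = ["C", "H"]
trans = {"Start": {"C": 0.5, "H": 0.5},
         "C": {"C": 0.8, "H": 0.1}, "H": {"C": 0.1, "H": 0.8}}
p_stop = {"C": 0.1, "H": 0.1}
emit = {"C": {1: 0.7, 2: 0.2, 3: 0.1}, "H": {1: 0.1, 2: 0.2, 3: 0.7}}
obs = [2, 3, 3]

# Forward pass.
alpha = [{s: trans["Start"][s] * emit[s][obs[0]] for s in states}]
for o in obs[1:]:
    prev = alpha[-1]
    alpha.append({s: sum(prev[r] * trans[r][s] * emit[s][o] for r in states)
                  for s in states})

# Backward pass.
beta = [{s: p_stop[s] for s in states}]
for o in reversed(obs[1:]):
    nxt = beta[0]
    beta.insert(0, {s: sum(trans[s][r] * emit[r][o] * nxt[r] for r in states)
                    for s in states})

Z = sum(alpha[-1][s] * p_stop[s] for s in states)   # total prob of the whole observation sequence

# Posterior prob of being in state s on day t: alpha * beta / Z.
state_post = [{s: alpha[t][s] * beta[t][s] / Z for s in states}
              for t in range(len(obs))]

# Posterior prob of the arc s -> r between day t and day t+1:
# alpha(s) * [p(r|s) * p(obs[t+1]|r)] * beta(r) / Z.
arc_post = [{(s, r): alpha[t][s] * trans[s][r] * emit[r][obs[t + 1]] * beta[t + 1][r] / Z
             for s in states for r in states}
            for t in range(len(obs) - 1)]

print(state_post[2])   # posterior over the day-3 weather, i.e. p(Weather3 | IceCream = 2,3,3)
```

Every state_post[t] sums to 1, and summing arc_post[t] over all four arcs also gives 1, which is a handy sanity check on the distributive-law bookkeeping.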

Posterior tagging
Give each word its highest-prob tag according to forward-backward, independently of the other words.
Suppose the possible tag sequences and their probabilities are:
  Det Adj  0.35
  Det N    0.2
  N   V    0.45
Then p(tag1 = Det) = 0.55 and p(tag2 = V) = 0.45, so the output is Det V, a sequence whose own probability is 0.
Defensible: this maximizes the expected number of correct tags:
  Det Adj: exp # correct tags = 0.55 + 0.35 = 0.9
  Det N:   exp # correct tags = 0.55 + 0.2  = 0.75
  N V:     exp # correct tags = 0.45 + 0.45 = 0.9
  Det V:   exp # correct tags = 0.55 + 0.45 = 1.0
But Det V is not a coherent sequence, and may screw up subsequent processing (e.g., can't find any parse).
600.465 - Intro to NLP - J. Eisner
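The Det/Adj vs. N/V example can be replayed directly: compute each position's marginal from the sequence probabilities, then take an independent argmax at each position. A minimal sketch using the numbers on the slide:

```python
# Per-position posteriors from the three possible tag sequences,
# followed by an independent argmax at each position (posterior tagging).

seq_probs = {("Det", "Adj"): 0.35, ("Det", "N"): 0.2, ("N", "V"): 0.45}

n = 2                                         # two tagged positions
marginal = [{} for _ in range(n)]             # marginal[i][tag] = p(tag at position i)
for seq, p in seq_probs.items():
    for i, tag in enumerate(seq):
        marginal[i][tag] = marginal[i].get(tag, 0.0) + p

posterior_tags = [max(m, key=m.get) for m in marginal]
exp_correct = sum(max(m.values()) for m in marginal)

print(marginal)        # ≈ [{'Det': 0.55, 'N': 0.45}, {'Adj': 0.35, 'N': 0.2, 'V': 0.45}]
print(posterior_tags)  # ['Det', 'V']
print(exp_correct)     # ≈ 1.0 expected correct tags (up to float rounding)
```

The chosen pair Det V maximizes the expected number of correct tags (1.0) even though its probability as a sequence is 0.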

Alternative: Viterbi tagging
Posterior tagging: give each word its highest-prob tag according to forward-backward.
  Det Adj  0.35
  Det N    0.2
  N   V    0.45
Viterbi tagging: pick the single best tag sequence (best path), here N V (0.45).
Same algorithm as forward-backward, but uses a semiring that maximizes over paths instead of summing over paths.
600.465 - Intro to NLP - J. Eisner

Max-product instead of sum-product
Use a semiring that maximizes over paths instead of summing over them. We write μ and ν, instead of α and β, for these "Viterbi forward" and "Viterbi backward" probabilities.

The dynamic programming computation of μ works forward from Start (ν is similar but works back from Stop):
Day 1: 2 cones; Day 2: 3 cones; Day 3: 3 cones
Arcs out of Start: p(C|Start)·p(2|C) = 0.5·0.2 = 0.1 and p(H|Start)·p(2|H) = 0.5·0.2 = 0.1
Arcs between days: p(C|C)·p(3|C) = 0.8·0.1 = 0.08; p(H|C)·p(3|H) = 0.1·0.7 = 0.07; p(C|H)·p(3|C) = 0.1·0.1 = 0.01; p(H|H)·p(3|H) = 0.8·0.7 = 0.56
Day 1: μ(C) = 0.1, μ(H) = 0.1
Day 2: μ(C) = max(0.1·0.08, 0.1·0.01) = 0.008; μ(H) = max(0.1·0.07, 0.1·0.56) = 0.056
Day 3: μ(C) = max(0.008·0.08, 0.056·0.01) = 0.00064; μ(H) = max(0.008·0.07, 0.056·0.56) = 0.03136

α·β at a state = total prob of all paths through that state; μ·ν at a state = max prob of any path through that state.
Suppose the max-prob path has prob p: how do we print it?
Print the state at each time step with the highest μ·ν (= p); this works if there are no ties.
Or, compute the μ values from left to right to find p, then follow backpointers.
600.465 - Intro to NLP - J. Eisner
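For comparison with the forward sketch earlier, here is a minimal max-product (Viterbi) sketch of my own on the same three days, with backpointers so the single best path can be printed; the numbers reproduce the μ values on the slide.

```python
# Minimal sketch of the Viterbi (max-product) recursion with backpointers.

states = ["C", "H"]
trans = {"Start": {"C": 0.5, "H": 0.5},
         "C": {"C": 0.8, "H": 0.1}, "H": {"C": 0.1, "H": 0.8}}
emit = {"C": {1: 0.7, 2: 0.2, 3: 0.1}, "H": {1: 0.1, 2: 0.2, 3: 0.7}}
obs = [2, 3, 3]

mu = [{s: trans["Start"][s] * emit[s][obs[0]] for s in states}]   # best-path prob so far
back = [{}]                                                       # best predecessor per state
for o in obs[1:]:
    prev = mu[-1]
    scores = {s: {r: prev[r] * trans[r][s] * emit[s][o] for r in states} for s in states}
    back.append({s: max(scores[s], key=scores[s].get) for s in states})
    mu.append({s: max(scores[s].values()) for s in states})

# Follow backpointers from the best final state (p(Stop|C) = p(Stop|H) here, so it can be ignored).
best = max(mu[-1], key=mu[-1].get)
path = [best]
for t in range(len(obs) - 1, 0, -1):
    path.append(back[t][path[-1]])
path.reverse()

print(mu)     # day 2 ≈ {'C': 0.008, 'H': 0.056}; day 3 ≈ {'C': 0.00064, 'H': 0.03136}
print(path)   # ['H', 'H', 'H'], the single best weather sequence for 2, 3, 3
```

Swapping sum for max (and recording the argmaxes) is the only change from the forward pass, which is exactly the semiring point the slide makes.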