600.465 - Intro to NLP - J. Eisner: Hidden Markov Models and the Forward-Backward Algorithm


Intro to NLP - J. Eisner, slide 1: Hidden Markov Models and the Forward-Backward Algorithm

Intro to NLP - J. Eisner, slide 2: Please See the Spreadsheet
- I like to teach this material using an interactive spreadsheet; the linked page has the spreadsheet and the lesson plan.
- I'll also show the following slides at appropriate points.

Intro to NLP - J. Eisner, slide 3: Marginalization

  SALES      Jan  Feb  Mar  Apr  ...
  Widgets     5    0    3    2   ...
  Grommets    7    3   10    8   ...
  Gadgets     0    0    1    0   ...
  ...        ...  ...  ...  ...  ...

Intro to NLP - J. Eisner, slide 4: Marginalization
Write the totals in the margins:

  SALES      Jan  Feb  Mar  Apr  ...  TOTAL
  Widgets     5    0    3    2   ...    30
  Grommets    7    3   10    8   ...    80
  Gadgets     0    0    1    0   ...     2
  ...        ...  ...  ...  ...  ...   ...
  TOTAL      99   ...  ...  ...  ...  1000   (grand total)

Intro to NLP - J. Eisner, slide 5: Marginalization
Given a random sale, what & when was it? (Divide every count by the grand total to get probabilities.)

  prob.      Jan   Feb   Mar   Apr   ...  TOTAL
  Widgets     ...   ...   ...   ...  ...   .030
  Grommets    ...   ...   ...   ...  ...   .080
  Gadgets     ...   ...   ...   ...  ...   .002
  ...         ...   ...   ...   ...  ...    ...
  TOTAL       ...   ...   ...   ...  ...  1.000  (grand total)

Intro to NLP - J. Eisner, slide 6: Marginalization
Given a random sale, what & when was it?

  prob.      Jan   Feb   Mar   Apr   ...  TOTAL
  Widgets    .005   ...   ...   ...  ...   .030
  Grommets    ...   ...   ...   ...  ...   .080
  Gadgets     ...   ...   ...   ...  ...   .002
  ...         ...   ...   ...   ...  ...    ...
  TOTAL      .099   ...   ...   ...  ...  1.000

- joint prob: p(Jan, widget) = .005 (a single cell)
- marginal prob: p(Jan) = .099 (a column total)
- marginal prob: p(widget) = .030 (a row total)
- marginal prob: p(anything in table) = 1.000 (the grand total)
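
These table operations are easy to mirror in code. The lecture itself uses a spreadsheet; the Python sketch below is only an illustration that builds a joint distribution from the cells visible on the slide and marginalizes it. Since only a few cells appear on the slide, the totals it prints are for this sub-table, not the .030 / .099 figures of the full table.

```python
# A minimal marginalization sketch (illustration only; the lecture uses a spreadsheet).
# Only the sales cells visible on the slide are included, so the totals printed here
# are for this sub-table, not the full table's .030 / .099 values.
counts = {
    ("Widgets",  "Jan"): 5, ("Widgets",  "Feb"): 0, ("Widgets",  "Mar"): 3,  ("Widgets",  "Apr"): 2,
    ("Grommets", "Jan"): 7, ("Grommets", "Feb"): 3, ("Grommets", "Mar"): 10, ("Grommets", "Apr"): 8,
    ("Gadgets",  "Jan"): 0, ("Gadgets",  "Feb"): 0, ("Gadgets",  "Mar"): 1,  ("Gadgets",  "Apr"): 0,
}

grand_total = sum(counts.values())                        # the number in the corner
joint = {k: v / grand_total for k, v in counts.items()}   # joint prob p(product, month)

# Marginalize by summing the joint over the variable you don't care about.
p_product, p_month = {}, {}
for (product, month), p in joint.items():
    p_product[product] = p_product.get(product, 0.0) + p  # row total    = p(product)
    p_month[month] = p_month.get(month, 0.0) + p           # column total = p(month)

print(p_product)             # marginal over products
print(p_month)               # marginal over months
print(sum(joint.values()))   # the whole joint sums to 1.0
```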

Intro to NLP - J. Eisner, slide 7: Conditionalization
Given a random sale in Jan., what was it? Divide the Jan column through by Z = p(Jan) = .099 so it sums to 1.

  prob.      Jan    p(... | Jan)
  Widgets    .005   .005/.099
  Grommets    ...      ...
  Gadgets     ...      ...
  ...         ...      ...
  TOTAL      .099   .099/.099 = 1

- marginal prob: p(Jan) = .099
- joint prob: p(Jan, widget) = .005
- conditional prob: p(widget | Jan) = .005/.099
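
Continuing the sketch above, conditionalization is just dividing one column of the joint by its column total, exactly as the slide divides the Jan column by Z = p(Jan):

```python
# Conditionalization, continuing the sketch above:
# p(product | Jan) = p(product, Jan) / p(Jan), i.e. divide the Jan column by Z.
Z = p_month["Jan"]                                   # the slide's p(Jan) (.099 in the full table)
p_product_given_jan = {prod: joint[(prod, "Jan")] / Z for prod in p_product}

print(p_product_given_jan)
print(sum(p_product_given_jan.values()))             # the conditional distribution sums to 1.0
```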

Intro to NLP - J. Eisner, slide 8: Marginalization & conditionalization in the weather example
- Instead of a 2-dimensional table, now we have a 66-dimensional table:
  - 33 of the dimensions have 2 choices: {C, H}
  - 33 of the dimensions have 3 choices: {1, 2, 3}
- Cross-section showing just 3 of the dimensions:

                    Weather_2 = C   Weather_2 = H
  IceCream_2 = 1         ...             ...
  IceCream_2 = 2         ...             ...
  IceCream_2 = 3         ...             ...

Intro to NLP - J. Eisner, slide 9: Interesting probabilities in the weather example
- Prior probability of weather: p(Weather = CHH...)
- Posterior probability of weather (after observing evidence): p(Weather = CHH... | IceCream = 233...)
- Posterior marginal probability that day 3 is hot:
  p(Weather_3 = H | IceCream = 233...) = Σ_{w such that w_3 = H} p(Weather = w | IceCream = 233...)
- Posterior conditional probability that day 3 is hot if day 2 is:
  p(Weather_3 = H | Weather_2 = H, IceCream = 233...)
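
For a short diary these posteriors can be computed by brute force, before any dynamic programming. The sketch below is my own illustration with a 3-day diary; the transition and emission numbers are the ones on the trellis slides, except p(1 cone | state), which the slides do not show and which is assumed here (chosen so each state's emission probabilities sum to 1).

```python
from itertools import product

# Transition and emission probabilities from the trellis slides.
trans = {("Start", "C"): 0.5, ("Start", "H"): 0.5,
         ("C", "C"): 0.8, ("C", "H"): 0.1, ("C", "Stop"): 0.1,
         ("H", "C"): 0.1, ("H", "H"): 0.8, ("H", "Stop"): 0.1}
emit = {("C", 1): 0.7, ("C", 2): 0.2, ("C", 3): 0.1,   # p(1|C), p(1|H) are assumed values
        ("H", 1): 0.1, ("H", 2): 0.2, ("H", 3): 0.7}

ice_cream = [2, 3, 3]          # a 3-day diary instead of the full 33 days (2^33 paths!)

def path_prob(weather):        # p(weather sequence & ice cream sequence & Stop)
    p, prev = 1.0, "Start"
    for w, cones in zip(weather, ice_cream):
        p *= trans[(prev, w)] * emit[(w, cones)]
        prev = w
    return p * trans[(prev, "Stop")]

every_path = list(product("CH", repeat=len(ice_cream)))
total = sum(path_prob(w) for w in every_path)                  # p(IceCream = 2,3,3)
hot3 = sum(path_prob(w) for w in every_path if w[2] == "H")    # ... and Weather_3 = H

print(hot3 / total)            # posterior marginal p(Weather_3 = H | IceCream = 2,3,3)
```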

Intro to NLP - J. Eisner, slide 10: The HMM trellis
The dynamic programming computation of α works forward from Start.

Edge weights (transition prob * emission prob):
  Start -> C on day 1 (2 cones): p(C|Start)*p(2|C) = 0.5*0.2 = 0.1
  Start -> H on day 1 (2 cones): p(H|Start)*p(2|H) = 0.5*0.2 = 0.1
  C -> C on a 3-cone day: p(C|C)*p(3|C) = 0.8*0.1 = 0.08
  C -> H on a 3-cone day: p(H|C)*p(3|H) = 0.1*0.7 = 0.07
  H -> C on a 3-cone day: p(C|H)*p(3|C) = 0.1*0.1 = 0.01
  H -> H on a 3-cone day: p(H|H)*p(3|H) = 0.8*0.7 = 0.56

α values:
  Start: α = 1
  Day 1 (2 cones): α(C) = 0.1, α(H) = 0.1
  Day 2 (3 cones): α(C) = 0.1*0.08 + 0.1*0.01 = 0.009;  α(H) = 0.1*0.07 + 0.1*0.56 = 0.063
  Day 3 (3 cones): α(C) = 0.009*0.08 + 0.063*0.01 = 0.00135;  α(H) = 0.009*0.07 + 0.063*0.56 = 0.03591

- This "trellis" graph has 2^33 paths. These represent all possible weather sequences that could explain the observed ice cream sequence 2, 3, 3, ...
- What is the product of all the edge weights on one path H, H, H, ...? Edge weights are chosen to get p(weather = H,H,H,... & icecream = 2,3,3,...).
- What is the α probability at each state? It's the total probability of all paths from Start to that state.
- How can we compute it fast when there are many paths?

Intro to NLP - J. Eisner, slide 11: Computing α Values
All paths to a state sum up as:
  α = (a·p₁ + b·p₁ + c·p₁) + (d·p₂ + e·p₂ + f·p₂) = α₁·p₁ + α₂·p₂
where α₁ = a + b + c and α₂ = d + e + f are the α values at the two predecessor states (reached by paths a, b, c and d, e, f respectively), and p₁, p₂ are the weights of the edges from those predecessors into the state. Thanks, distributive law!
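
As a rough Python rendering of that recurrence (reusing the trans and emit tables from the brute-force sketch above), a forward pass over the 2, 3, 3 diary reproduces the α values on the trellis slide:

```python
# Forward pass: alpha(s, t) = sum over previous states of alpha(prev, t-1) * p(s|prev) * p(obs_t|s).
# Reuses the `trans` / `emit` tables defined in the brute-force sketch above.
def forward(obs):
    alpha = [{"Start": 1.0}]                                  # alpha = 1 at Start
    for t, o in enumerate(obs):
        alpha.append({s: sum(alpha[t][prev] * trans[(prev, s)] * emit[(s, o)]
                             for prev in alpha[t])
                      for s in ("C", "H")})
    return alpha

print(forward([2, 3, 3]))
# day 1: C 0.1,     H 0.1
# day 2: C 0.009,   H 0.063      (the values on the trellis slide)
# day 3: C 0.00135, H 0.03591
```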

Intro to NLP - J. Eisner, slide 12: The HMM trellis
The dynamic programming computation of β works back from Stop. (Day 34: lose diary, so the path ends at Stop; Day 33: 2 cones; Day 32: 2 cones.)

Edge weights (transition prob * emission prob):
  C -> C on a 2-cone day: p(C|C)*p(2|C) = 0.8*0.2 = 0.16
  C -> H on a 2-cone day: p(H|C)*p(2|H) = 0.1*0.2 = 0.02
  H -> C on a 2-cone day: p(C|H)*p(2|C) = 0.1*0.2 = 0.02
  H -> H on a 2-cone day: p(H|H)*p(2|H) = 0.8*0.2 = 0.16
  C -> Stop: p(Stop|C) = 0.1;  H -> Stop: p(Stop|H) = 0.1

β values:
  Day 33: β(C) = p(Stop|C) = 0.1, β(H) = p(Stop|H) = 0.1
  Day 32: β(C) = 0.16*0.1 + 0.02*0.1 = 0.018;  β(H) = 0.16*0.1 + 0.02*0.1 = 0.018
  One day earlier: β(C) = 0.16*0.018 + 0.02*0.018 = 0.00324;  β(H) = 0.16*0.018 + 0.02*0.018 = 0.00324

- What is the β probability at each state? It's the total probability of all paths from that state to Stop.
- How can we compute it fast when there are many paths?

Intro to NLP - J. Eisner, slide 13: Computing β Values
All paths from a state sum up as:
  β = (p₁·u + p₁·v + p₁·w) + (p₂·x + p₂·y + p₂·z) = p₁·β₁ + p₂·β₂
where β₁ = u + v + w and β₂ = x + y + z are the β values at the two successor states (whose outgoing paths are u, v, w and x, y, z respectively), and p₁, p₂ are the weights of the edges from the state to those successors.
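
The mirror-image backward pass, again reusing the same tables, reproduces the β values from the backward trellis slide when the last observed days are all 2-cone days:

```python
# Backward pass: beta(s, t) = sum over next states of p(next|s) * p(obs_{t+1}|next) * beta(next, t+1),
# with beta(s, last day) = p(Stop|s).  Reuses `trans` / `emit` from the sketch above.
def backward(obs):
    beta = [{s: trans[(s, "Stop")] for s in ("C", "H")}]      # last observed day
    for o in reversed(obs[1:]):                               # observation emitted after each earlier day
        beta.insert(0, {s: sum(trans[(s, nxt)] * emit[(nxt, o)] * beta[0][nxt]
                               for nxt in ("C", "H"))
                        for s in ("C", "H")})
    return beta

print(backward([2, 2, 2]))     # three 2-cone days, then the diary is lost
# last day: 0.1 for C and H;  one day earlier: 0.018;  two days earlier: 0.00324
```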

Intro to NLP - J. Eisner, slide 14: Computing State Probabilities
All paths through state C (incoming path weights a, b, c; outgoing path weights x, y, z):
  ax + ay + az + bx + by + bz + cx + cy + cz = (a + b + c) · (x + y + z) = α(C) · β(C)
Thanks, distributive law!
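
Combining the two passes (still building on the sketches above), α·β at a state divided by Z, the total probability of the observed sequence, gives the posterior probability of passing through that state:

```python
# Posterior state probability: p(Weather_t = s | ice cream) = alpha_t(s) * beta_t(s) / Z.
obs = [2, 3, 3]
a, b = forward(obs), backward(obs)                   # a[0] is Start; b[0] is day 1
Z = sum(a[-1][s] * b[-1][s] for s in ("C", "H"))     # p(IceCream = 2,3,3, then Stop)

t = 2                                                # day 2
posterior = {s: a[t][s] * b[t - 1][s] / Z for s in ("C", "H")}
print(posterior)                                     # p(Weather_2 = s | 2,3,3); sums to 1
```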

Intro to NLP - J. Eisner, slide 15: Computing Arc Probabilities
All paths through the arc of weight p from state H to state C (incoming path weights a, b, c into H; outgoing path weights x, y, z from C):
  apx + apy + apz + bpx + bpy + bpz + cpx + cpy + cpz = (a + b + c) · p · (x + y + z) = α(H) · p · β(C)
Thanks, distributive law!
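
The same bookkeeping gives a posterior arc probability, here for the arc from H on day 2 to C on day 3 of the 2, 3, 3 example (continuing the variables defined just above):

```python
# Posterior arc probability: alpha_2(H) * p(C|H) * p(3|C) * beta_3(C) / Z.
arc_posterior = a[2]["H"] * trans[("H", "C")] * emit[("C", obs[2])] * b[2]["C"] / Z
print(arc_posterior)
```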

Intro to NLP - J. Eisner, slide 16: Posterior tagging
- Give each word its highest-prob tag according to forward-backward. Do this independently of other words.
    Det Adj  0.35    exp # correct tags = 0.55 + 0.35 = 0.9
    Det N    0.2     exp # correct tags = 0.55 + 0.2  = 0.75
    N   V    0.45    exp # correct tags = 0.45 + 0.45 = 0.9
  Output is
    Det V    0       exp # correct tags = 0.55 + 0.45 = 1.0
- Defensible: maximizes expected # of correct tags.
- But not a coherent sequence. May screw up subsequent processing (e.g., can't find any parse).
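
The expected-number-of-correct-tags arithmetic on this slide can be checked mechanically. The sketch below takes the three tag sequences and probabilities listed above, computes the per-word posterior marginals, the posterior tagging Det V, and the expected number of correct tags for each candidate:

```python
# The two-word tagging example: distribution over whole tag sequences (e.g. from forward-backward).
seqs = {("Det", "Adj"): 0.35, ("Det", "N"): 0.2, ("N", "V"): 0.45}

# Per-word posterior marginals.
marg = [{}, {}]
for (t1, t2), p in seqs.items():
    marg[0][t1] = marg[0].get(t1, 0.0) + p        # word 1: Det 0.55, N 0.45
    marg[1][t2] = marg[1].get(t2, 0.0) + p        # word 2: V 0.45, Adj 0.35, N 0.2

posterior_tagging = tuple(max(m, key=m.get) for m in marg)    # ('Det', 'V'): prob 0 as a sequence

def expected_correct(tagging):                    # sum over words of p(this word's tag is right)
    return sum(m.get(tag, 0.0) for m, tag in zip(marg, tagging))

for tagging in [("Det", "Adj"), ("Det", "N"), ("N", "V"), posterior_tagging]:
    print(tagging, expected_correct(tagging))     # 0.9, 0.75, 0.9, 1.0
```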

Intro to NLP - J. Eisner, slide 17: Alternative: Viterbi tagging
- Posterior tagging: Give each word its highest-prob tag according to forward-backward.
    Det Adj  0.35
    Det N    0.2
    N   V    0.45
- Viterbi tagging: Pick the single best tag sequence (best path):
    N   V    0.45
- Same algorithm as forward-backward, but uses a semiring that maximizes over paths instead of summing over paths.

Intro to NLP - J. Eisner, slide 18: Max-product instead of sum-product
Use a semiring that maximizes over paths instead of summing. We write μ, ν instead of α, β for these "Viterbi forward" and "Viterbi backward" probabilities. The dynamic programming computation of μ is shown below; ν is similar but works back from Stop.

Edge weights are as before:
  Start -> C and Start -> H on day 1 (2 cones): 0.5*0.2 = 0.1
  C -> C on a 3-cone day: p(C|C)*p(3|C) = 0.8*0.1 = 0.08
  C -> H on a 3-cone day: p(H|C)*p(3|H) = 0.1*0.7 = 0.07
  H -> C on a 3-cone day: p(C|H)*p(3|C) = 0.1*0.1 = 0.01
  H -> H on a 3-cone day: p(H|H)*p(3|H) = 0.8*0.7 = 0.56

μ values:
  Day 1 (2 cones): μ(C) = 0.1, μ(H) = 0.1
  Day 2 (3 cones): μ(C) = max(0.1*0.08, 0.1*0.01) = 0.008;  μ(H) = max(0.1*0.07, 0.1*0.56) = 0.056
  Day 3 (3 cones): μ(C) = max(0.008*0.08, 0.056*0.01) = 0.00064;  μ(H) = max(0.008*0.07, 0.056*0.56) = 0.03136

- α·β at a state = total prob of all paths through that state.
- μ·ν at a state = max prob of any path through that state.
- Suppose the max-prob path has prob p: how do we print it?
  - Print the state at each time step with highest μ·ν (= p); works if no ties.
  - Or, compute μ values from left to right to find p, then follow backpointers.
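
As a final sketch (reusing the trans and emit tables from earlier), the max-product pass below replaces the sum in the forward pass with a max and keeps backpointers, reproducing the μ values above and printing the best path; the backpointer bookkeeping is one way to realize the slide's last bullet.

```python
# Viterbi / max-product forward pass with backpointers.
def viterbi(obs):
    mu, back = [{"Start": 1.0}], []
    for t, o in enumerate(obs):
        layer, ptrs = {}, {}
        for s in ("C", "H"):
            best_prev = max(mu[t], key=lambda prev: mu[t][prev] * trans[(prev, s)] * emit[(s, o)])
            layer[s] = mu[t][best_prev] * trans[(best_prev, s)] * emit[(s, o)]
            ptrs[s] = best_prev
        mu.append(layer)
        back.append(ptrs)
    # Follow backpointers from the best final state.
    state = max(mu[-1], key=mu[-1].get)
    path = [state]
    for ptrs in reversed(back):
        path.insert(0, ptrs[path[0]])
    return mu, path[1:]                            # drop the leading "Start"

mu, best_path = viterbi([2, 3, 3])
print(mu)          # day 2: C 0.008, H 0.056;  day 3: C 0.00064, H 0.03136  (as on the slide)
print(best_path)   # ['H', 'H', 'H'], the single best weather sequence
```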