CS 4705 Hidden Markov Models Julia Hirschberg CS4705.

POS Tagging using HMMs
A different approach to POS tagging
–A special case of Bayesian inference
–Related to the “noisy channel” model used in MT, ASR, and other applications

POS Tagging as a Sequence ID Task
Given a sentence (a sequence of words, or observations)
–Secretariat is expected to race tomorrow
What is the best sequence of tags which corresponds to this sequence of observations?
Bayesian approach:
–Consider all possible sequences of tags
–Choose the tag sequence t1…tn which is most probable given the observation sequence of words w1…wn

Out of all sequences of n tags t1…tn, we want the single tag sequence such that P(t1…tn|w1…wn) is highest
I.e., find our best estimate of the sequence that maximizes P(t1…tn|w1…wn)
How do we do this? Use Bayes' rule to transform P(t1…tn|w1…wn) into a set of other probabilities that are easier to compute

Bayes Rule
Rewrite P(t1…tn|w1…wn) using Bayes' rule, then substitute into the argmax

Drop the denominator P(w1…wn), since it is the same for every tag sequence we consider. So…
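The equations on these two slides appeared only as images in the original deck; the following is a reconstruction of the standard derivation in LaTeX, using the t1…tn / w1…wn notation above:

```latex
\hat{t}_{1:n}
  = \arg\max_{t_{1:n}} P(t_{1:n} \mid w_{1:n})                                    % what we want
  = \arg\max_{t_{1:n}} \frac{P(w_{1:n} \mid t_{1:n}) \, P(t_{1:n})}{P(w_{1:n})}   % Bayes' rule
  = \arg\max_{t_{1:n}} P(w_{1:n} \mid t_{1:n}) \, P(t_{1:n})                      % drop the constant denominator
```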

Likelihoods, Priors, and Simplifying Assumptions
A1: P(wi) depends only on its own POS tag ti
A2: P(ti) depends only on the previous tag ti-1
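Applying A1 and A2, the argmax factors into word-likelihood and tag-transition terms. A reconstruction of the slide's equation (shown only as an image in the original):

```latex
\hat{t}_{1:n} = \arg\max_{t_{1:n}} \prod_{i=1}^{n} P(w_i \mid t_i) \, P(t_i \mid t_{i-1})
```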

Simplified Equation
Now we have two probabilities to calculate:
–Probability of a word occurring given its tag
–Probability of a tag occurring given a previous tag
We can calculate each of these from a POS-tagged corpus
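A minimal sketch of how both probabilities can be estimated by counting in a tagged corpus. The toy corpus, the function names, and the `<s>` sentence-start pseudo-tag are illustrative assumptions, not from the lecture:

```python
from collections import defaultdict

def estimate_hmm(tagged_sentences):
    """Relative-frequency estimates of tag-transition and word-likelihood
    probabilities from a POS-tagged corpus (no smoothing)."""
    trans = defaultdict(lambda: defaultdict(int))   # C(t_{i-1}, t_i)
    emit = defaultdict(lambda: defaultdict(int))    # C(t_i, w_i)
    tag_totals = defaultdict(int)                   # C(t_i)

    for sentence in tagged_sentences:
        prev = "<s>"                                # sentence-start pseudo-tag
        for word, tag in sentence:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            tag_totals[tag] += 1
            prev = tag

    def p_trans(t, t_prev):      # P(t_i | t_{i-1}) = C(t_{i-1}, t_i) / C(t_{i-1})
        return trans[t_prev][t] / sum(trans[t_prev].values())

    def p_emit(w, t):            # P(w_i | t_i) = C(t_i, w_i) / C(t_i)
        return emit[t][w] / tag_totals[t]

    return p_trans, p_emit

# Toy usage: estimates from a one-sentence corpus
p_trans, p_emit = estimate_hmm([[("the", "DT"), ("red", "JJ"), ("hat", "NN")]])
print(p_trans("JJ", "DT"), p_emit("red", "JJ"))     # 1.0 1.0
```

The two helper functions are exactly the counting estimates described on the next two slides.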

Tag Transition Probabilities P(ti|ti-1)
Determiners likely to precede adjs and nouns but unlikely to follow adjs
–The/DT red/JJ hat/NN; *Red/JJ the/DT hat/NN
–So we expect P(NN|DT) and P(JJ|DT) to be high but P(DT|JJ) to be low
Compute P(NN|DT) by counting in a tagged corpus: P(NN|DT) = C(DT, NN) / C(DT)

Word Likelihood Probabilities P(wi|ti)
VBZ (3sg pres verb) likely to be is
Compute P(is|VBZ) by counting in a tagged corpus: P(is|VBZ) = C(VBZ, is) / C(VBZ)

Some Data on race
Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NR
People/NNS continue/VB to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN
How do we pick the right tag for race in new data?

Disambiguating to race tomorrow

Look Up the Probabilities
P(NN|TO) = .00047
P(VB|TO) = .83
P(race|NN) = .00057
P(race|VB) = .00012
P(NR|VB) = .0027
P(NR|NN) = .0012
P(VB|TO)P(NR|VB)P(race|VB) = .00000027
P(NN|TO)P(NR|NN)P(race|NN) = .00000000032
So we (correctly) choose the verb reading
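A quick arithmetic check of the comparison above. Several of these numbers were dropped from the transcript and have been restored from the standard Jurafsky & Martin race example, so treat them as a reconstruction:

```python
# Compare the two tag hypotheses for "race" in "to race tomorrow"
p_vb = 0.83 * 0.0027 * 0.00012        # P(VB|TO) * P(NR|VB) * P(race|VB)
p_nn = 0.00047 * 0.0012 * 0.00057     # P(NN|TO) * P(NR|NN) * P(race|NN)
print(f"VB path: {p_vb:.2e}  NN path: {p_nn:.2e}")   # ~2.7e-07 vs ~3.2e-10
best_tag = "VB" if p_vb > p_nn else "NN"             # VB wins: the verb reading
```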

Markov Chains
–Have transition probabilities like P(ti|ti-1)
A special case of weighted FSTs (FSTs which have probabilities or weights on the arcs)
Can only represent unambiguous phenomena

A Weather Example: cold, hot, hot

Weather Markov Chain
What is the probability of 4 consecutive rainy days?
Sequence is rainy-rainy-rainy-rainy, i.e., state sequence is (3,3,3,3)
P(3,3,3,3) = π3 a33 a33 a33 = 0.2 x (0.6)^3 = 0.0432
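A minimal sketch of this computation. Only the rainy-state parameters (initial probability 0.2 and a33 = 0.6) appear on the slide, so the dictionaries below contain just those entries:

```python
def sequence_prob(states, pi, A):
    """Markov chain probability of a state sequence:
    P(q1..qT) = pi[q1] * product of A[q_{t-1}][q_t]."""
    p = pi[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= A[prev][cur]
    return p

pi = {"rainy": 0.2}                          # initial probability of rainy
A = {"rainy": {"rainy": 0.6}}                # a33: rainy -> rainy
print(sequence_prob(["rainy"] * 4, pi, A))   # 0.2 * 0.6**3 = 0.0432
```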

Markov Chain Defined
A set of states Q = q1, q2 … qN
Transition probabilities
–A set of probabilities A = a01 a02 … an1 … ann
–Each aij represents the probability of transitioning from state i to state j
–The set of these is the transition probability matrix A
Distinguished start and end states q0, qF

Can have a special initial probability vector π instead of a start state
–An initial distribution over start states
–Must sum to 1

Hidden Markov Models
Markov Chains are useful for representing problems in which
–Events are observable
–Sequences to be labeled are unambiguous
Problems like POS tagging are neither
HMMs are useful for computing events we cannot directly observe in the world, using other events we can observe
–Unobservable (Hidden): e.g., POS tags
–Observable: e.g., words
–We have to learn the relationships

Hidden Markov Models
A set of states Q = q1, q2 … qN
Transition probabilities
–Transition probability matrix A = {aij}
Observations O = o1, o2 … oN
–Each a symbol from a vocabulary V = {v1, v2, … vV}
Observation likelihoods or emission probabilities
–Output probability matrix B = {bi(ot)}

Special initial probability vector π
A set of legal accepting states

First-Order HMM Assumptions
Markov assumption: probability of a state depends only on the state that precedes it
–This is the same Markov assumption we made when we decided to represent sentence probabilities as the product of bigram probabilities
Output-independence assumption: probability of an output observation depends only on the state that produced the observation
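In standard HMM notation (a reconstruction; these formulas were not spelled out on the slide):

```latex
\text{Markov assumption:}\quad  P(q_i \mid q_1 \ldots q_{i-1}) = P(q_i \mid q_{i-1})

\text{Output independence:}\quad P(o_i \mid q_1 \ldots q_T,\ o_1 \ldots o_T) = P(o_i \mid q_i)
```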

Weather Again

Weather and Ice Cream
You are a climatologist in the year 2799 studying global warming
You can’t find any records of the weather in Baltimore for the summer of 2007
But you find JE’s diary, which lists how many ice creams J ate every day that summer
Your job: determine the (hidden) sequence of weather states that `led to' J’s (observed) ice cream behavior

Weather/Ice Cream HMM
Hidden states: {Hot, Cold}
Transition probabilities (A matrix) between H and C
Observations: {1, 2, 3} = # of ice creams eaten per day
Goal: learn observation likelihoods between observations and weather states (output matrix B) by training the HMM on aligned input streams from a training corpus
Result: trained HMM for weather prediction given ice cream information alone
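A minimal sketch of how this HMM might be represented as plain dictionaries. The specific probability values are illustrative placeholders, not numbers from the lecture:

```python
# Ice-cream / weather HMM as dictionaries (values are illustrative placeholders)
states = ["H", "C"]                      # hidden weather states
vocab = [1, 2, 3]                        # observed ice creams per day

pi = {"H": 0.5, "C": 0.5}                # initial distribution (sums to 1)
A = {                                    # transition probabilities a_ij
    "H": {"H": 0.7, "C": 0.3},
    "C": {"H": 0.4, "C": 0.6},
}
B = {                                    # emission probabilities b_j(o_t)
    "H": {1: 0.1, 2: 0.3, 3: 0.6},
    "C": {1: 0.6, 2: 0.3, 3: 0.1},
}
```

The forward and Viterbi sketches below take pi, A, and B in this format.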

Ice Cream/Weather HMM

HMM Configurations
–Bakis = left-to-right
–Ergodic = fully-connected

What can HMMs Do?
Likelihood: given an HMM λ = (A,B) and an observation sequence O, determine the likelihood P(O|λ): given a sequence of ice cream counts, how likely is it under the model?
Decoding: given an observation sequence O and an HMM λ = (A,B), discover the best hidden state sequence Q: given a sequence of ice creams, what was the most likely weather on those days?
Learning: given an observation sequence O and the set of states in the HMM, learn the HMM parameters A and B

Likelihood: The Forward Algorithm
Likelihood: given an HMM λ = (A,B) and an observation sequence O, determine the likelihood P(O|λ)
–E.g., what is the probability of the ice-cream sequence 3 – 1 – 3 given the HMM above?

Compute the joint probability of 3 – 1 – 3 by summing over all the possible {H,C} weather sequences – since weather is hidden
Forward Algorithm:
–Dynamic programming algorithm; stores a table of intermediate values so it need not recompute them
–Computes P(O) by summing over the probabilities of all hidden state paths that could generate the observation sequence 3 – 1 – 3:

Each forward cell αt(j) is computed by multiplying:
–The previous forward path probability, αt-1(i)
–The transition probability aij from the previous state to the current state
–The state observation likelihood bj(ot) of the observation ot given the current state j
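A minimal sketch of the forward algorithm under these definitions. It expects pi/A/B dictionaries like the ice-cream sketch above; there is no probability scaling, so very long sequences may underflow:

```python
def forward(obs, states, pi, A, B):
    """Forward algorithm: P(O | lambda), summing over all hidden state paths."""
    # Initialization: alpha_1(j) = pi_j * b_j(o_1)
    alpha = [{j: pi[j] * B[j][obs[0]] for j in states}]
    # Recursion: alpha_t(j) = sum_i alpha_{t-1}(i) * a_ij * b_j(o_t)
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append({j: sum(prev[i] * A[i][j] for i in states) * B[j][o]
                      for j in states})
    # Termination: P(O | lambda) = sum_j alpha_T(j)
    return sum(alpha[-1].values())

# e.g. forward([3, 1, 3], ["H", "C"], pi, A, B) with the dictionaries sketched earlier
```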

Decoding: The Viterbi Algorithm
Decoding: given an observation sequence O and an HMM λ = (A,B), discover the best hidden state sequence of weather states in Q
–E.g., given the observations 3 – 1 – 1 and an HMM, what is the best (most probable) hidden weather sequence of {H,C}?
Viterbi algorithm
–Dynamic programming algorithm
–Uses a dynamic programming trellis to store probabilities that the HMM is in state j after seeing the first t observations, for all states j

–Value in each cell computed by taking the MAX over all paths leading to this cell – i.e., the best path
–Extension of a path from state i at time t-1 is computed by multiplying the previous Viterbi path probability vt-1(i), the transition probability aij, and the observation likelihood bj(ot)
–Most probable path is the max over all possible previous state sequences
Like the Forward Algorithm, but it takes the max over previous path probabilities rather than the sum
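A minimal Viterbi sketch in the same pi/A/B dictionary format as the earlier ice-cream example (no explicit start/end states):

```python
def viterbi(obs, states, pi, A, B):
    """Viterbi decoding: most probable hidden state sequence and its probability."""
    # Initialization: v_1(j) = pi_j * b_j(o_1)
    v = [{j: pi[j] * B[j][obs[0]] for j in states}]
    backptr = [{}]
    # Recursion: like the forward algorithm, but keep the max and a backpointer
    for o in obs[1:]:
        prev = v[-1]
        cur, ptr = {}, {}
        for j in states:
            best_i = max(states, key=lambda i: prev[i] * A[i][j])
            ptr[j] = best_i
            cur[j] = prev[best_i] * A[best_i][j] * B[j][o]
        v.append(cur)
        backptr.append(ptr)
    # Termination: pick the best final state, then follow backpointers
    last = max(states, key=lambda j: v[-1][j])
    path = [last]
    for ptr in reversed(backptr[1:]):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, v[-1][last]

# e.g. viterbi([3, 1, 1], ["H", "C"], pi, A, B) -> (most likely H/C sequence, its probability)
```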

HMM Training: The Forward-Backward (Baum-Welch) Algorithm
Learning: given an observation sequence O and the set of states in the HMM, learn the HMM parameters A (transition) and B (emission)
Input: unlabeled sequence of observations O and vocabulary of possible hidden states Q
–E.g., for ice-cream weather:
Observations = {1,3,2,1,3,3,…}
Hidden states = {H,C,C,C,H,C,….}

Intuitions
–Iteratively re-estimate counts, starting from an initialization for the A and B probabilities, e.g. all equiprobable
–Estimate new probabilities by computing the forward probability for an observation, dividing the probability mass among all paths contributing to it, and computing the backward probability from the same state
Details: left as an exercise to the Reader
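For reference, the standard re-estimation quantities behind those intuitions, following Jurafsky & Martin's presentation (these formulas were not on the slide itself):

```latex
\xi_t(i,j)   = \frac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{P(O \mid \lambda)}
\qquad
\gamma_t(j)  = \frac{\alpha_t(j)\, \beta_t(j)}{P(O \mid \lambda)}

\hat{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \sum_{k} \xi_t(i,k)}
\qquad
\hat{b}_j(v_k) = \frac{\sum_{t \,:\, o_t = v_k} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}
```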

Very Important
Check the errata for Chapter 6 – there are some important errors in the equations
Please use clic.cs.columbia.edu, not cluster.cs.columbia.edu, for HW

Summary
HMMs are a major tool for relating observed sequences to hidden information that explains or predicts the observations
Forward, Viterbi, and Forward-Backward algorithms are used for computing likelihoods, decoding, and training HMMs
Next week: Sameer Maskey will discuss Maximum Entropy Models and other machine learning approaches to NLP tasks
These will be very useful in HW2