Hidden Markov Model (HMM) Tagging
 Using an HMM to do POS tagging
 HMM tagging is a special case of Bayesian inference

Hidden Markov Model (HMM) Taggers
 Goal: maximize P(word|tag) x P(tag|previous n tags)
 P(word|tag) — word/lexical likelihood (lexical information)
   the probability that, given this tag, we see this word
   NOT the probability that this word has this tag
   modeled through the language model (word-tag matrix)
 P(tag|previous n tags) — tag sequence likelihood (syntagmatic information)
   the probability that this tag follows these previous tags
   modeled through the language model (tag-tag matrix)

POS tagging as a sequence classification task
 We are given a sentence, treated as an “observation” (a sequence of observations): a sequence of n words w1…wn, e.g. “Secretariat is expected to race tomorrow.”
 What is the best sequence of tags which corresponds to this sequence of observations?
 Probabilistic/Bayesian view: consider all possible sequences of tags and, out of this universe of sequences, choose the tag sequence which is most probable given the observation sequence of n words w1…wn.

Getting to HMM
 Let T = t1, t2, …, tn
 Let W = w1, w2, …, wn
 Goal: out of all sequences of tags t1…tn, find the most probable sequence of POS tags T underlying the observed sequence of words w1, w2, …, wn
 The hat ^ means “our estimate of the best, i.e. the most probable, tag sequence”
 argmax_x f(x) means “the x such that f(x) is maximized”; it picks out our estimate of the best tag sequence
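This goal can be stated compactly; a reconstruction of the formula the original slide most likely showed as an image (standard notation, with T ranging over tag sequences t1…tn):

```latex
\hat{T} \;=\; \operatorname*{argmax}_{t_1 \ldots t_n} \; P(t_1 \ldots t_n \mid w_1 \ldots w_n)
```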

Bayes Rule
 We can drop the denominator P(W): it does not change across tag sequences, since we are looking for the best tag sequence for the same observation, i.e. the same fixed set of words.

Bayes Rule
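The rule itself did not survive the transcript; the standard form, applied to tags T and words W, is:

```latex
P(T \mid W) \;=\; \frac{P(W \mid T)\, P(T)}{P(W)}
\qquad\Longrightarrow\qquad
\hat{T} \;=\; \operatorname*{argmax}_{T} \; P(W \mid T)\, P(T)
```

Here P(W|T) is the likelihood and P(T) is the prior, which is what the next slides label “likelihood and prior”.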

Likelihood and prior

Likelihood and prior: further simplifications
 1. The probability of a word appearing depends only on its own POS tag, i.e. it is independent of the other words around it
 2. BIGRAM assumption: the probability of a tag appearing depends only on the previous tag
 3. The most probable tag sequence is estimated by the bigram tagger

Likelihood and prior: further simplifications
 2. BIGRAM assumption: the probability of a tag appearing depends only on the previous tag.
   Bigrams are groups of two written letters, two syllables, or two words; they are a special case of N-gram. Bigrams are used as the basis for simple statistical analysis of text.
   The bigram assumption corresponds to the first-order Markov assumption.

Likelihood and prior: further simplifications
 3. The most probable tag sequence as estimated by the bigram tagger (under the bigram assumption)
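Putting the simplifications together gives the bigram tagging equation (reconstructed; the slide presumably showed it as an image):

```latex
\hat{t}_1^{\,n}
\;=\; \operatorname*{argmax}_{t_1 \ldots t_n} P(t_1 \ldots t_n \mid w_1 \ldots w_n)
\;\approx\; \operatorname*{argmax}_{t_1 \ldots t_n} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})
```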

Probability estimates
 Tag transition probabilities P(ti|ti-1)
   Determiners are likely to precede adjectives and nouns:
    That/DT flight/NN
    The/DT yellow/JJ hat/NN
   So we expect P(NN|DT) and P(JJ|DT) to be high

Estimating probabilities
 Tag transition probabilities P(ti|ti-1)
   Compute P(NN|DT) by counting in a labeled corpus:
   P(NN|DT) = C(DT, NN) / C(DT), i.e. the number of times DT is followed by NN, divided by the number of times DT occurs

Two kinds of probabilities
 Word likelihood (emission) probabilities P(wi|ti)
   P(is|VBZ) = probability of a VBZ (3sg present verb) being “is”
   Compute P(is|VBZ) by counting in a labeled corpus: P(is|VBZ) = C(VBZ, is) / C(VBZ)
   If we were expecting a third person singular verb, how likely is it that this verb would be “is”?
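A minimal sketch of how these two kinds of counts turn into probabilities (illustrative Python, not from the slides; the function name, the "<s>" start pseudo-tag and the toy corpus are assumptions):

```python
from collections import defaultdict

def train_hmm(tagged_sentences):
    """Estimate P(t_i | t_{i-1}) and P(w_i | t_i) by maximum likelihood (counting)."""
    trans = defaultdict(lambda: defaultdict(int))   # C(t_{i-1}, t_i)
    emit = defaultdict(lambda: defaultdict(int))    # C(t_i, w_i)
    tag_count = defaultdict(int)                    # C(t)

    for sentence in tagged_sentences:
        prev = "<s>"                                # start-of-sentence pseudo-tag
        tag_count[prev] += 1
        for word, tag in sentence:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            tag_count[tag] += 1
            prev = tag

    # Turn counts into conditional probabilities
    P_trans = {p: {t: c / tag_count[p] for t, c in nxt.items()} for p, nxt in trans.items()}
    P_emit = {t: {w: c / tag_count[t] for w, c in words.items()} for t, words in emit.items()}
    return P_trans, P_emit

# Toy labeled corpus (an assumption for illustration)
corpus = [[("That", "DT"), ("flight", "NN")],
          [("The", "DT"), ("yellow", "JJ"), ("hat", "NN")]]
P_trans, P_emit = train_hmm(corpus)
print(P_trans["DT"])   # {'NN': 0.5, 'JJ': 0.5}  -> P(NN|DT), P(JJ|DT)
print(P_emit["NN"])    # {'flight': 0.5, 'hat': 0.5}
```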

An example: the verb “race” has two possible tags
 Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NR
 People/NNS continue/VB to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN
 How do we pick the right tag?

Disambiguating “race”

 P(NN|TO) =
 P(VB|TO) = .83
 The tag transition probabilities P(NN|TO) and P(VB|TO) answer the question: “How likely are we to expect a verb/noun given the previous tag TO?”
 P(race|NN) =
 P(race|VB) =
 Lexical likelihoods from the Brown corpus for “race” given the POS tag NN or VB.

Disambiguating “race”
 P(NR|VB) = .0027
 P(NR|NN) = .0012
 (the tag sequence probability for an adverb occurring given that the previous tag is a verb or a noun)
 P(VB|TO) P(NR|VB) P(race|VB) =
 P(NN|TO) P(NR|NN) P(race|NN) =
 Multiply the lexical likelihoods by the tag sequence probabilities: the verb wins.
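Several figures were lost from the transcript, so the worked comparison below uses the preserved values plus placeholder lexical likelihoods that are explicitly assumptions; the point is the shape of the computation, not the exact numbers:

```python
# Preserved from the slides:
P_VB_given_TO = 0.83      # P(VB|TO)
P_NR_given_VB = 0.0027    # P(NR|VB)
P_NR_given_NN = 0.0012    # P(NR|NN)

# Illustrative placeholders (NOT from the slides; the transcript lost these values):
P_race_given_VB = 0.00012   # assumed P(race|VB)
P_race_given_NN = 0.00057   # assumed P(race|NN)
P_NN_given_TO = 0.00047     # assumed P(NN|TO)

verb_score = P_VB_given_TO * P_NR_given_VB * P_race_given_VB
noun_score = P_NN_given_TO * P_NR_given_NN * P_race_given_NN
print(verb_score, noun_score)   # the verb path scores higher, so "race" is tagged VB
```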

Hidden Markov Models  What we’ve described with these two kinds of probabilities is a Hidden Markov Model (HMM) ‏  Let’s just spend a bit of time tying this into the model  In order to define HMM, we will first introduce the Markov Chain, or observable Markov Model.

Definitions
 A weighted finite-state automaton adds probabilities to the arcs
   The probabilities on the arcs leaving any state must sum to one
 A Markov chain is a special case of a weighted FSA in which the input sequence uniquely determines which states the automaton will go through
 Markov chains can’t represent inherently ambiguous problems
   They are useful for assigning probabilities to unambiguous sequences

Hidden Markov Models: formal definition
 States Q = q1, q2, …, qN
 Observations O = o1, o2, …, oN; each observation is a symbol from a vocabulary V = {v1, v2, …, vV}
 Transition probabilities (prior): transition probability matrix A = {aij}
 Observation likelihoods (likelihood): output probability matrix B = {bi(ot)}, a set of observation likelihoods, each expressing the probability of an observation ot being generated from a state i (the emission probabilities)
 Special initial probability vector π: each πi expresses the probability P(qi|START) that the HMM will start in state i
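A compact container for these components, as an illustrative Python sketch (the field names are assumptions, not from the slides):

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class HMM:
    states: List[str]                      # Q: the tag set
    vocab: List[str]                       # V: observation symbols (words)
    trans: Dict[str, Dict[str, float]]     # A: trans[prev_tag][tag] = P(tag | prev_tag)
    emit: Dict[str, Dict[str, float]]      # B: emit[tag][word] = P(word | tag)
    init: Dict[str, float]                 # pi: init[tag] = P(tag | START)
```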

Assumptions
 Markov assumption: the probability of a particular state depends only on the previous state
 Output-independence assumption: the probability of an output observation depends only on the state that produced that observation
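Written out (a standard formalization, not shown on the slide):

```latex
P(q_i \mid q_1 \ldots q_{i-1}) \;=\; P(q_i \mid q_{i-1})
\qquad
P(o_i \mid q_1 \ldots q_i,\, o_1 \ldots o_{i-1}) \;=\; P(o_i \mid q_i)
```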

HMM Taggers  Two kinds of probabilities A transition probabilities (PRIOR) B observation likelihoods (LIKELIHOOD)  HMM Taggers choose the tag sequence which maximizes the product of word likelihood and tag sequence probability

Weighted FSM corresponding to the hidden states of the HMM

Observation likelihoods for the POS HMM

Transition matrix for the POS HMM

The output matrix for the POS HMM

HMM Taggers  The probabilities are trained on hand-labeled training corpora (training set) ‏  Combine different N-gram levels  Evaluated by comparing their output from a test set to human labels for that test set (Gold Standard)

The Viterbi Algorithm
 What is the best tag sequence for “John likes to fish in the sea”?
 Viterbi efficiently computes the most likely state sequence given a particular output sequence
 It is based on dynamic programming

A smaller example
 (Figure: a small weighted automaton with states start, q, r, end and arcs labeled with symbols a, b and probabilities such as 0.6)
 What is the best sequence of states for the input string “bbba”?
 Computing all possible paths and finding the one with the maximum probability is exponential

Viterbi for POS tagging
Let:
  n = number of words in the sentence to tag (number of input tokens)
  T = number of tags in the tag set (number of states)
  vit = path probability matrix (Viterbi): vit[i,j] = probability of being at state (tag) j at word i
  path = matrix to recover the nodes of the best path (best tag sequence): path[i+1,j] = the state (tag) of the incoming arc that led to this most probable state j at word i+1

// Initialization
vit[1,PERIOD] := 1.0      // pretend that there is a period before our sentence (start tag = PERIOD)
vit[1,t] := 0.0 for t ≠ PERIOD

Viterbi for POS tagging (cont’d)
// Induction (build the path probability matrix)
for i := 1 to n step 1 do                  // for all words in the sentence
  for all tags tj do                       // for all possible tags
    // store the max probability of the path leading to state tj at word i+1
    vit[i+1,tj]  := max 1≤k≤T ( vit[i,tk] × P(wi+1|tj) × P(tj|tk) )
    // store the actual state (tag) that achieved the max
    path[i+1,tj] := argmax 1≤k≤T ( vit[i,tk] × P(wi+1|tj) × P(tj|tk) )
  end
end

// Termination and path read-out
bestState[n+1] := argmax 1≤j≤T vit[n+1,j]
for j := n to 1 step -1 do                 // for all the words in the sentence
  bestState[j] := path[j+1, bestState[j+1]]
end
P(bestState[1], …, bestState[n]) := max 1≤j≤T vit[n+1,j]

// P(wi+1|tj) is the emission probability, P(tj|tk) the state transition probability,
// and vit[i,tk] the probability of the best path leading to state tk at word i
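A runnable Python sketch of the same procedure (the function name and interface are assumptions; it expects P_trans/P_emit dictionaries like those produced by the earlier counting sketch, with a "<s>" start tag in place of the PERIOD start tag used above):

```python
def viterbi(words, tags, P_trans, P_emit, start_tag="<s>"):
    """Most probable tag sequence for `words` under a bigram HMM.

    P_trans[prev][t] = P(t | prev), P_emit[t][w] = P(w | t).
    Missing entries are treated as probability 0.
    """
    # vit[i][t]: probability of the best path ending in tag t at word i
    vit = [{t: 0.0 for t in tags} for _ in words]
    back = [{t: None for t in tags} for _ in words]

    for t in tags:  # initialization: transition out of the start tag
        vit[0][t] = P_trans.get(start_tag, {}).get(t, 0.0) * P_emit.get(t, {}).get(words[0], 0.0)

    for i in range(1, len(words)):  # induction
        for t in tags:
            best_prev, best_score = None, 0.0
            for k in tags:
                score = vit[i - 1][k] * P_trans.get(k, {}).get(t, 0.0) * P_emit.get(t, {}).get(words[i], 0.0)
                if score > best_score:
                    best_prev, best_score = k, score
            vit[i][t], back[i][t] = best_score, best_prev

    # termination and path read-out
    last = max(tags, key=lambda t: vit[-1][t])
    best_prob = vit[-1][last]
    seq = [last]
    for i in range(len(words) - 1, 0, -1):
        seq.append(back[i][seq[-1]])
    return list(reversed(seq)), best_prob
```

In practice the product of many small probabilities underflows, so real implementations work with log probabilities; this sketch keeps plain products for readability.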

Possible improvements
 In bigram POS tagging, we condition a tag only on the preceding tag
 Why not use more context (e.g. a trigram model)?
   more precise:
    “is clearly marked” --> verb, past participle
    “he clearly marked” --> verb, past tense
   combine trigram, bigram, and unigram models (one standard way is sketched below)
 Condition on words too
 But with an n-gram approach this is too costly (too many parameters to model)
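One standard way to combine the three levels (not spelled out on the slide) is linear interpolation; the λ weights are assumptions to be estimated, e.g. by deleted interpolation:

```latex
P(t_i \mid t_{i-2}, t_{i-1}) \;\approx\;
\lambda_3\, \hat{P}(t_i \mid t_{i-2}, t_{i-1}) \;+\;
\lambda_2\, \hat{P}(t_i \mid t_{i-1}) \;+\;
\lambda_1\, \hat{P}(t_i),
\qquad \lambda_1 + \lambda_2 + \lambda_3 = 1
```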

Further issues with Markov Model tagging
 Unknown words are a problem since we don’t have the required probabilities. Possible solutions:
   Assign word probabilities based on the corpus-wide distribution of POS tags
   Use morphological cues (capitalization, suffix) to make a more informed guess (a toy sketch follows below)
 Using higher-order Markov models:
   A trigram model captures more context
   However, data sparseness is much more of a problem
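A toy illustration of the morphological-cue idea (hypothetical Python; the suffix rules and the fallback distribution are assumptions, not from the slides):

```python
def guess_tag_unknown(word, open_class_priors):
    """Heuristic POS guess for an unknown word from simple morphological cues.

    `open_class_priors` is an assumed dict of fallback tag probabilities
    estimated from the corpus-wide distribution of open-class tags.
    """
    if word[0].isupper():
        return "NNP"            # capitalized -> likely proper noun
    if word.endswith("ing"):
        return "VBG"            # gerund / present participle
    if word.endswith("ed"):
        return "VBD"            # past-tense verb (or VBN)
    if word.endswith("ly"):
        return "RB"             # adverb
    if word.endswith("s"):
        return "NNS"            # plural noun
    # otherwise fall back to the most frequent open-class tag
    return max(open_class_priors, key=open_class_priors.get)

print(guess_tag_unknown("blorping", {"NN": 0.5, "VB": 0.3, "JJ": 0.2}))  # VBG
```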