POS Tagging: HMM Taggers (continued)
Today
– Walk through the guts of an HMM tagger
– Address problems with HMM taggers, specifically unknown words
HMM Tagger
What is the goal of a Markov tagger? To maximize the following expression:
P(w_i | t_i) × P(t_i | t_1, ..., t_{i-1})
Or: P(word | tag) × P(tag | previous n tags)
By the Markov assumption, this simplifies to:
P(w_i | t_i) × P(t_i | t_{i-1})
HMM Tagger
P(word | tag) × P(tag | previous n tags)
P(word | tag):
– The probability of the word given its tag (not vice versa)
– Modeled with a word-tag matrix (the emission probabilities)
– Familiar? HW 4 (3)
HMM Tagger
P(word | tag) × P(tag | previous n tags)
P(tag | previous n tags):
– How likely a tag is given the n tags before it
– Simplified to just the previous tag
– Modeled with a tag-tag matrix (the transition probabilities)
– Familiar? HW 4 (2)
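Both matrices are usually just relative-frequency estimates from a tagged training corpus. A minimal sketch, assuming the count dictionaries have already been collected (the function and variable names here are illustrative, not from the homework):

```python
def p_word_given_tag(word, tag, word_tag_counts, tag_counts):
    """Word-tag matrix entry: P(word | tag) = C(tag, word) / C(tag)."""
    return word_tag_counts.get((tag, word), 0) / tag_counts[tag]

def p_tag_given_prev(tag, prev_tag, tag_bigram_counts, tag_counts):
    """Tag-tag matrix entry: P(tag_i | tag_{i-1}) = C(tag_{i-1}, tag_i) / C(tag_{i-1})."""
    return tag_bigram_counts.get((prev_tag, tag), 0) / tag_counts[prev_tag]
```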
HMM Tagger
But why is it P(word | tag) and not P(tag | word)? Take the following examples (from J&M):
1. Secretariat/NNP is/VBZ expected/VBN to/TO race/?? tomorrow/NN
2. People/NNS continue/VBP to/TO inquire/VB the/DT reason/NN for/IN the/DT race/?? for/IN outer/JJ space/NN
HMM Tagger
Secretariat/NNP is/VBZ expected/VBN to/TO race/?? tomorrow/NN
Maximize: P(w_i | t_i) × P(t_i | t_{i-1})
We can choose between:
a. P(race | VB) P(VB | TO)
b. P(race | NN) P(NN | TO)
The good HMM Tagger
From the Brown/Switchboard corpus:
– P(VB | TO) = .34
– P(NN | TO) = .021
– P(race | VB) = .00003
– P(race | NN) = .00041
a. P(VB | TO) × P(race | VB) = .34 × .00003 ≈ .00001
b. P(NN | TO) × P(race | NN) = .021 × .00041 ≈ .000009
So (a) wins: after TO, the tag VB is far more likely, and that transition probability outweighs the fact that 'race' is more often a noun.
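A throwaway check of the arithmetic, using the corpus numbers quoted on this slide:

```python
p_tag = {("TO", "VB"): 0.34, ("TO", "NN"): 0.021}            # P(tag | previous tag)
p_word = {("VB", "race"): 0.00003, ("NN", "race"): 0.00041}  # P(word | tag)

score_vb = p_tag[("TO", "VB")] * p_word[("VB", "race")]  # 0.0000102
score_nn = p_tag[("TO", "NN")] * p_word[("NN", "race")]  # 0.0000086
print("VB" if score_vb > score_nn else "NN")             # prints VB
```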
The no-go HMM Tagger
Invert word and tag, P(tag | word) instead of P(word | tag):
1. P(VB | race) = .02
2. P(NN | race) = .98
With these probabilities, 'race' would be tagged NN every time, regardless of context, even right after TO.
HMM Tagger
But don't we really want to maximize the probability of the best sequence of tags for a given sequence of words, not just the best tag for each word in isolation?
Thus, what we really want to maximize (and implement) is:
P(t_1, ..., t_n | w_1, ..., w_n), or T^ = argmax_T P(T | W)
HMM Tagger
By Bayes' rule:
P(T | W) = P(T) P(W | T) / P(W)
Since P(W) is the same for every candidate tag sequence (why?), we can drop the denominator when maximizing:
T^ = argmax_T P(T | W) = argmax_T P(T) P(W | T)
HMM Tagger
P(T) P(W | T) = P(t_1, ..., t_n) P(w_1, ..., w_n | t_1, ..., t_n)
By the chain rule (which rewrites a joint probability as a product of conditional probabilities):
= P(t_n | t_1, ..., t_{n-1}) × P(t_{n-1} | t_1, ..., t_{n-2}) × ... × P(t_1)
  × P(w_1 | t_1, ..., t_n) × P(w_2 | w_1, t_1, ..., t_n) × ... × P(w_n | w_1, ..., w_{n-1}, t_1, ..., t_n)
Equivalently, interleaving words and tags:
= Π_{i=1}^{n} P(w_i | w_1 t_1 ... w_{i-1} t_{i-1} t_i) P(t_i | w_1 t_1 ... w_{i-1} t_{i-1})
HMM Tagger
P(T) P(W | T) = Π_{i=1}^{n} P(w_i | w_1 t_1 ... w_{i-1} t_{i-1} t_i) P(t_i | w_1 t_1 ... w_{i-1} t_{i-1})
Simplifying assumption: the probability of a word depends only on its own tag:
P(w_i | w_1 t_1 ... w_{i-1} t_{i-1} t_i) = P(w_i | t_i)
And the Markov assumption (for a bigram tagger): the probability of a tag depends only on the previous tag:
P(t_i | w_1 t_1 ... w_{i-1} t_{i-1}) = P(t_i | t_{i-1})
The best tag sequence is then:
T^ = argmax_T Π_{i=1}^{n} P(t_i | t_{i-1}) P(w_i | t_i)
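To make the quantity being maximized concrete, here is a minimal sketch that scores one candidate tag sequence under this bigram model. It assumes transition probabilities keyed as (previous tag, tag), emission probabilities keyed as (tag, word), and an initial-tag distribution standing in for the first factor, where there is no previous tag; all of these names are illustrative.

```python
def sequence_score(words, tags, initial, transition, emission):
    """Product over i of P(t_i | t_{i-1}) * P(w_i | t_i).

    The first position uses the initial-tag probability P(t_1), since there is
    no previous tag there. Unseen pairs get probability 0 (no smoothing).
    """
    score = initial.get(tags[0], 0.0) * emission.get((tags[0], words[0]), 0.0)
    for i in range(1, len(words)):
        score *= transition.get((tags[i - 1], tags[i]), 0.0) * emission.get((tags[i], words[i]), 0.0)
    return score
```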
Implementation
So the best tag sequence is the one that maximizes:
T^ = argmax_T Π P(t_i | t_{i-1}) P(w_i | t_i)
– P(t_i | t_{i-1}): the state transition probabilities
– P(w_i | t_i): the emission probabilities
Training: learn the transition and emission probabilities from a corpus (smoothing may be necessary).
Training
An HMM tagger needs to be trained on the following:
1. The initial state probabilities
2. The state transition probabilities
– the tag-tag matrix
3. The emission probabilities
– the tag-word matrix
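A minimal training sketch, assuming the corpus is available as a list of sentences, each a list of (word, tag) pairs. The function name and table layout are illustrative choices, and no smoothing is applied here.

```python
from collections import Counter

def train_hmm(tagged_sentences):
    """Relative-frequency estimates for the three tables an HMM tagger needs."""
    initial_counts = Counter()     # how often each tag starts a sentence
    transition_counts = Counter()  # (previous tag, tag) bigram counts
    emission_counts = Counter()    # (tag, word) counts
    tag_counts = Counter()

    for sent in tagged_sentences:
        tags = [t for _, t in sent]
        initial_counts[tags[0]] += 1
        for word, tag in sent:
            tag_counts[tag] += 1
            emission_counts[(tag, word)] += 1
        for prev, cur in zip(tags, tags[1:]):
            transition_counts[(prev, cur)] += 1

    n_sents = len(tagged_sentences)
    initial = {t: c / n_sents for t, c in initial_counts.items()}
    transition = {(p, t): c / tag_counts[p] for (p, t), c in transition_counts.items()}
    emission = {(t, w): c / tag_counts[t] for (t, w), c in emission_counts.items()}
    return initial, transition, emission
```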
Implementation
Once trained, how do we implement such a maximization function?
T^ = argmax_T Π P(t_i | t_{i-1}) P(w_i | t_i)
Can't we just walk through every path, calculate all the probabilities, and choose the path with the highest probability (the max)?
Yeah, if we have a lot of time. (Why?)
– The number of possible tag sequences grows exponentially with sentence length (|tagset|^n paths)
– Better to use a dynamic programming (DP) algorithm, such as Viterbi
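A compact Viterbi sketch, assuming the three tables produced by the train_hmm sketch above. Unseen transitions and emissions simply get probability 0 here; a real implementation would smooth and would work in log space to avoid underflow.

```python
def viterbi(words, tagset, initial, transition, emission):
    """Best tag sequence for words under a bigram HMM, in O(n * |tagset|^2) time."""
    # best[i][t] = probability of the best tag sequence for words[:i+1] ending in tag t
    best = [{t: initial.get(t, 0.0) * emission.get((t, words[0]), 0.0) for t in tagset}]
    back = [{t: None for t in tagset}]

    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tagset:
            # Pick the previous tag that maximizes the path probability into t.
            prev = max(tagset, key=lambda p: best[i - 1][p] * transition.get((p, t), 0.0))
            best[i][t] = best[i - 1][prev] * transition.get((prev, t), 0.0) * emission.get((t, words[i]), 0.0)
            back[i][t] = prev

    # Follow back-pointers from the best final tag to recover the whole sequence.
    last = max(best[-1], key=best[-1].get)
    tags = [last]
    for i in range(len(words) - 1, 0, -1):
        tags.append(back[i][tags[-1]])
    return list(reversed(tags))
```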
Unknown Words
The tagger just described will do poorly on unknown words. Why? Because P(w_i | t_i) = 0 for any word it has never seen (more precisely, for any unseen word-tag pair).
How do we resolve this problem?
– A dictionary with the most common tag (the stupid tagger)
  Still doesn't solve the problem for completely novel words
– Morphological/typographical analysis
– The probability of a tag generating an unknown word
  Secondary training required
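One way to flesh out the morphological idea above, sketched here under assumptions of my own rather than taken from the slides: estimate P(tag | suffix) from the rare words in the training data, then use that distribution in place of the missing emission probability for unknown words. The suffix length, rarity cutoff, and function names are all illustrative; real taggers such as TnT also invert P(tag | suffix) with Bayes' rule rather than using it directly.

```python
from collections import Counter

def suffix_tag_distribution(tagged_sentences, suffix_len=3, rare_cutoff=5):
    """P(tag | word suffix), estimated from words seen at most rare_cutoff times."""
    word_freq = Counter(w for sent in tagged_sentences for w, _ in sent)
    suffix_tag = Counter()
    suffix_total = Counter()
    for sent in tagged_sentences:
        for word, tag in sent:
            if word_freq[word] <= rare_cutoff:
                suffix = word[-suffix_len:]
                suffix_tag[(suffix, tag)] += 1
                suffix_total[suffix] += 1
    return {(s, t): c / suffix_total[s] for (s, t), c in suffix_tag.items()}

def emission_or_guess(word, tag, emission, known_words, suffix_dist, suffix_len=3):
    """Use the trained emission probability for known words, the suffix guess otherwise."""
    if word in known_words:
        return emission.get((tag, word), 0.0)
    return suffix_dist.get((word[-suffix_len:], tag), 0.0)
```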