1 POS Tagging: HMM Taggers (continued)

2 Today
– Walk through the guts of an HMM Tagger
– Address problems with HMM Taggers, specifically unknown words

3 HMM Tagger
What is the goal of a Markov tagger? To maximize the following expression:
P(w_i | t_i) × P(t_i | t_1, …, t_{i-1})
Or: P(word | tag) × P(tag | previous n tags)
By the Markov assumption, this simplifies to:
P(w_i | t_i) × P(t_i | t_{i-1})

4 HMM Tagger
P(word | tag) × P(tag | previous n tags)
P(word | tag):
– The probability of the word given a tag (not vice versa)
– We model this with a word-tag matrix (often called a language model)
– Familiar? HW 4 (3)

5 HMM Tagger
P(word | tag) × P(tag | previous n tags)
P(tag | previous n tags):
– How likely a tag is given the n tags before it
– Simplified here to just the previous tag
– Modeled with a tag-tag matrix
– Familiar? HW 4 (2)
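
These two matrices are just normalized counts from a tagged corpus. As a concrete illustration, here is a minimal Python sketch (the toy corpus and function names are hypothetical, not from the homework) that estimates both by relative frequency:

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus: each sentence is a list of (word, tag) pairs.
corpus = [
    [("the", "DT"), ("race", "NN"), ("is", "VBZ"), ("long", "JJ")],
    [("people", "NNS"), ("race", "VB"), ("to", "TO"), ("win", "VB")],
]

trans_counts = defaultdict(Counter)  # tag-tag matrix: tag -> next-tag counts
emit_counts = defaultdict(Counter)   # word-tag matrix: tag -> word counts

for sentence in corpus:
    prev = "<s>"                     # sentence-start pseudo-tag
    for word, tag in sentence:
        trans_counts[prev][tag] += 1
        emit_counts[tag][word] += 1
        prev = tag

def p_tag_given_prev(tag, prev):     # P(tag | previous tag), by relative frequency
    total = sum(trans_counts[prev].values())
    return trans_counts[prev][tag] / total if total else 0.0

def p_word_given_tag(word, tag):     # P(word | tag), by relative frequency
    total = sum(emit_counts[tag].values())
    return emit_counts[tag][word] / total if total else 0.0

print(p_tag_given_prev("VB", "TO"))    # 1.0 in this toy corpus
print(p_word_given_tag("race", "VB"))  # 0.5 in this toy corpus
```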

6 HMM Tagger
But why is it P(word | tag) and not P(tag | word)? Take the following examples (from J&M):
1. Secretariat/NNP is/VBZ expected/VBN to/TO race/?? tomorrow/NN
2. People/NNS continue/VBP to/TO inquire/VB the/DT reason/NN for/IN the/DT race/?? for/IN outer/JJ space/NN

7 HMM Tagger
Secretariat/NNP is/VBZ expected/VBN to/TO race/?? tomorrow/NN
Maximize: P(w_i | t_i) × P(t_i | t_{i-1})
We can choose between:
a. P(race | VB) × P(VB | TO)
b. P(race | NN) × P(NN | TO)

8 The good HMM Tagger
From the Brown/Switchboard corpus:
– P(VB | TO) = .34
– P(NN | TO) = .021
– P(race | VB) = .00003
– P(race | NN) = .00041
a. P(VB | TO) × P(race | VB) = .34 × .00003 ≈ .00001
b. P(NN | TO) × P(race | NN) = .021 × .00041 ≈ .0000086
⇒ a. wins: TO followed by VB is the more probable reading; the transition probabilities dominate, so 'race' itself has little effect here.
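
As a quick sanity check, the comparison can be reproduced in a few lines of Python:

```python
# Probabilities quoted above (Brown/Switchboard estimates).
p_vb_given_to, p_nn_given_to = 0.34, 0.021
p_race_given_vb, p_race_given_nn = 0.00003, 0.00041

score_vb = p_vb_given_to * p_race_given_vb  # P(VB|TO) × P(race|VB)
score_nn = p_nn_given_to * p_race_given_nn  # P(NN|TO) × P(race|NN)
print(score_vb, score_nn)  # 1.02e-05 vs 8.61e-06: the VB reading wins
```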

9 The no-go HMM Tagger
Invert word and tag, using P(tag | word) instead of P(word | tag):
1. P(VB | race) = .02
2. P(NN | race) = .98
With these, the tagger would pick NN for 'race' every time, regardless of context, even after TO.

10 HMM Tagger
But don't we really want to maximize the probability of the best sequence of tags for a given sequence of words, not just the best tag for a given word?
Thus, we really want to maximize (and implement):
P(t_1, …, t_n | w_1, …, w_n), or
T^ = argmax_T P(T | W)

11 HMM Tagger
By Bayes' Rule:
P(T | W) = P(T) P(W | T) / P(W)
Since P(W) is the same for every candidate tag sequence (why? the word sequence is fixed), it can be dropped from the maximization:
T^ = argmax_T P(T | W) = argmax_T P(T) P(W | T)

12 HMM Tagger
P(T | W) ∝ P(T) P(W | T) = P(t_1, …, t_n) P(w_1, …, w_n | t_1, …, t_n)
By the chain rule (which rewrites a joint probability as a product of conditional probabilities):
P(t_1, …, t_n) = P(t_1) × P(t_2 | t_1) × … × P(t_n | t_1, …, t_{n-1})
P(w_1, …, w_n | t_1, …, t_n) = P(w_1 | t_1, …, t_n) × P(w_2 | w_1, t_1, …, t_n) × … × P(w_n | w_1, …, w_{n-1}, t_1, …, t_n)
Interleaving the two products over words and tags gives:
P(T) P(W | T) = Π_{i=1..n} P(w_i | w_1 t_1 … w_{i-1} t_{i-1} t_i) × P(t_i | w_1 t_1 … w_{i-1} t_{i-1})

13 HMM Tagger
P(T) P(W | T) = Π_{i=1..n} P(w_i | w_1 t_1 … w_{i-1} t_{i-1} t_i) × P(t_i | w_1 t_1 … w_{i-1} t_{i-1})
Simplifying assumption: the probability of a word depends only on its own tag:
P(w_i | w_1 t_1 … w_{i-1} t_{i-1} t_i) = P(w_i | t_i)
And the Markov assumption (for a bigram tagger): a tag depends only on the previous tag:
P(t_i | w_1 t_1 … w_{i-1} t_{i-1}) = P(t_i | t_{i-1})
The best tag sequence is then:
T^ = argmax_T Π_{i=1..n} P(t_i | t_{i-1}) P(w_i | t_i)
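
Taken literally, this argmax can be computed by scoring every candidate tag sequence with the product above. A naive brute-force sketch (the table names are hypothetical; slide 16 explains why this approach is too slow in practice):

```python
from itertools import product as cartesian

# Hypothetical tables: trans[(prev, tag)] = P(tag | prev), emit[(tag, word)] = P(word | tag)
def score(tags, words, trans, emit, start="<s>"):
    # Product of P(t_i | t_{i-1}) × P(w_i | t_i) over the sentence.
    p, prev = 1.0, start
    for word, tag in zip(words, tags):
        p *= trans.get((prev, tag), 0.0) * emit.get((tag, word), 0.0)
        prev = tag
    return p

def brute_force_tag(words, tagset, trans, emit):
    # Scores all |tagset| ** len(words) sequences: exponential in sentence length.
    return max(cartesian(tagset, repeat=len(words)),
               key=lambda tags: score(tags, words, trans, emit))
```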

14 Implementation
So the best tag sequence is the one that maximizes:
T^ = argmax_T Π_{i=1..n} P(t_i | t_{i-1}) P(w_i | t_i)
Training: learn the transition and emission probabilities from a corpus
– State transition probabilities: P(t_i | t_{i-1})
– Emission probabilities: P(w_i | t_i)
– Smoothing may be necessary

15 Training
An HMM needs to be trained on the following (a minimal training sketch appears below):
1. The initial state probabilities
2. The state transition probabilities (the tag-tag matrix)
3. The emission probabilities (the tag-word matrix)
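
A minimal training sketch, assuming the same toy (word, tag) corpus format as in the earlier counting example. Add-alpha (Laplace) smoothing is my assumption here; the slides only note that smoothing may be needed:

```python
from collections import Counter, defaultdict

def train_hmm(corpus, alpha=1.0):
    # corpus: list of sentences, each a list of (word, tag) pairs.
    init = Counter()              # 1. initial state (first-tag) counts
    trans = defaultdict(Counter)  # 2. tag-tag counts
    emit = defaultdict(Counter)   # 3. tag-word counts
    tagset, vocab = set(), set()
    for sentence in corpus:
        init[sentence[0][1]] += 1           # tag of the sentence's first word
        for i, (word, tag) in enumerate(sentence):
            emit[tag][word] += 1
            tagset.add(tag)
            vocab.add(word)
            if i > 0:
                trans[sentence[i - 1][1]][tag] += 1

    n_sents, n_tags = len(corpus), len(tagset)

    def p_init(t):         # smoothed P(t begins a sentence)
        return (init[t] + alpha) / (n_sents + alpha * n_tags)

    def p_trans(t, prev):  # smoothed P(t | prev)
        return (trans[prev][t] + alpha) / (sum(trans[prev].values()) + alpha * n_tags)

    def p_emit(w, t):      # smoothed P(w | t); reserves mass for unseen words
        return (emit[t][w] + alpha) / (sum(emit[t].values()) + alpha * (len(vocab) + 1))

    return p_init, p_trans, p_emit, sorted(tagset)
```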

16 Implementation
Once trained, how do we implement such a maximization function?
T^ = argmax_T Π_{i=1..n} P(t_i | t_{i-1}) P(w_i | t_i)
Can't we just walk through every path, calculate all the probabilities, and choose the path with the highest probability (the max)?
Yes, if we have a lot of time. (Why? The search is exponential: with a tagset of size T, an n-word sentence has T^n candidate paths.)
– Better to use a dynamic programming (DP) algorithm, such as Viterbi (sketched below)
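
A compact sketch of Viterbi decoding for the bigram model. It assumes the probability functions returned by the train_hmm sketch above (those names are mine, not the slides'):

```python
def viterbi(words, tagset, p_init, p_trans, p_emit):
    # best[i][t]: probability of the best tag sequence for words[:i+1] ending in t
    best = [{t: p_init(t) * p_emit(words[0], t) for t in tagset}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tagset:
            # Only the best predecessor matters: this DP step is what avoids
            # enumerating all |tagset| ** n paths.
            prev = max(tagset, key=lambda tp: best[i - 1][tp] * p_trans(t, tp))
            best[i][t] = best[i - 1][prev] * p_trans(t, prev) * p_emit(words[i], t)
            back[i][t] = prev
    # Recover the path by following back-pointers from the best final tag.
    last = max(tagset, key=lambda t: best[-1][t])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))

# Usage with the hypothetical training sketch from the Training slide:
# p_init, p_trans, p_emit, tagset = train_hmm(corpus)
# print(viterbi("people race to win".split(), tagset, p_init, p_trans, p_emit))
```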

17 Unknown Words
The tagger just described will do poorly on unknown words. Why? Because P(w_i | t_i) = 0 for any word it has not seen (or, more specifically, for the unseen word-tag pair).
How do we resolve this problem?
– A dictionary with the most common tag (the stupid tagger); still doesn't solve the problem for completely novel words
– Morphological/typographical analysis (sketched below)
– Probability of a tag generating an unknown word; secondary training required
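
For the morphological/typographical route, here is an illustrative sketch. The suffix and capitalization rules, and their weights, are placeholder assumptions; a real tagger would estimate them from held-out data in the secondary training pass mentioned above:

```python
import re

def guess_unknown(word):
    # Placeholder heuristics mapping surface features to a tag distribution.
    if word[0].isupper():
        return {"NNP": 0.9, "NN": 0.1}        # capitalized: likely proper noun
    if word.endswith("ing"):
        return {"VBG": 0.7, "NN": 0.3}        # gerund / present participle
    if word.endswith("ed"):
        return {"VBD": 0.6, "VBN": 0.4}       # past tense / past participle
    if word.endswith("s"):
        return {"NNS": 0.7, "VBZ": 0.3}       # plural noun / 3sg verb
    if re.fullmatch(r"[\d.,]+", word):
        return {"CD": 1.0}                    # numbers
    return {"NN": 0.6, "JJ": 0.2, "VB": 0.2}  # default open-class guess

def p_emit_with_unknowns(word, tag, p_emit, vocab):
    # Use the trained emission probability for known words; otherwise fall
    # back to the feature-based guess as a pseudo-emission probability.
    return p_emit(word, tag) if word in vocab else guess_unknown(word).get(tag, 0.0)
```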

