Part of Speech Tagging with MaxEnt Re-ranked Hidden Markov Model
Brian Highfill


Part of Speech Tagging

Train a model on a set of hand-tagged sentences
Find the best sequence of POS tags for a new sentence

Generative models
  Hidden Markov Model (HMM)
Discriminative models
  Maximum Entropy Markov Model (MEMM)

Brown Corpus
  ~57,000 tagged sentences
  87 tags (reduced to 45 for Penn TreeBank tagging)
  ~300 tags including compound tags
  that_DT fire's_NN+BEZ too_QL big_JJ ._.
  “fire’s” = fire_NN is_BEZ
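The word_TAG notation above can be read into (word, tag) pairs with a few lines of Python. This is a minimal sketch, assuming tokens are whitespace-separated and use "_" as the separator exactly as in the slide example; the function names `parse_tagged` and `split_compound` are illustrative and not part of the original project.

```python
def parse_tagged(line, sep="_"):
    """Split whitespace-separated 'word_TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # rpartition splits on the LAST separator, so a word that itself
        # contains the separator keeps its full text.
        word, _, tag = token.rpartition(sep)
        pairs.append((word, tag))
    return pairs


def split_compound(tag):
    """Break a compound tag such as 'NN+BEZ' into its component tags."""
    return tag.split("+")


pairs = parse_tagged("that_DT fire's_NN+BEZ too_QL big_JJ ._.")
print(pairs)
print(split_compound(pairs[1][1]))  # ['NN', 'BEZ']
```

Whether compound tags are kept whole or split before training is a design choice the slides do not state; the sketch only shows that both views are easy to derive.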

Hidden Markov Models

Set of hidden states (POS tags)
Set of observations (word tokens)
  Depend ONLY on the current tag

HMM parameters
  Transition probabilities: P(t_i | t_0 … t_{i-1}) = P(t_i | t_{i-1})
  Observation probabilities: P(w_i | t_0 … t_n, w_0 … w_{i-1}) = P(w_i | t_i)
  Initial tag distribution: P(t_0)

Example (hidden tag, then observed word)
  DT (singular determiner): That
  NN+BEZ (common noun + “is”): Fire’s
  QL (qualifier): too
  JJ (adjective): big
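The three parameter sets can be estimated from tagged sentences by relative-frequency counting. The sketch below assumes plain maximum-likelihood estimates with no smoothing and lowercased words; the slides do not say how the original model was estimated, and the second training sentence is made up for illustration.

```python
from collections import Counter, defaultdict


def train_hmm(tagged_sentences):
    """Estimate P(t_0), P(t_i | t_{i-1}) and P(w_i | t_i) by counting."""
    init = Counter()
    trans = defaultdict(Counter)
    emit = defaultdict(Counter)
    for sent in tagged_sentences:
        prev = None
        for word, tag in sent:
            if prev is None:
                init[tag] += 1          # first tag of the sentence
            else:
                trans[prev][tag] += 1   # tag bigram count
            emit[tag][word.lower()] += 1
            prev = tag

    def normalize(counter):
        total = sum(counter.values())
        return {key: count / total for key, count in counter.items()}

    return (
        normalize(init),
        {tag: normalize(c) for tag, c in trans.items()},
        {tag: normalize(c) for tag, c in emit.items()},
    )


# Toy data: the slide's sentence plus one invented sentence.
toy = [
    [("That", "DT"), ("fire's", "NN+BEZ"), ("too", "QL"), ("big", "JJ")],
    [("That", "DT"), ("big", "JJ"), ("fire", "NN")],
]
init, trans, emit = train_hmm(toy)
print(init)
print(trans["DT"])
print(emit["JJ"])
```

A real tagger would need some smoothing or unknown-word handling, since any unseen word or tag pair gets probability zero under these estimates.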

HMM Best Tag Sequence

For an HMM, the Viterbi algorithm finds the most probable tagging for a new sentence
For re-ranking later, we want not just the single best tagging but the k best taggings for each sentence
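A compact Viterbi decoder over the three HMM parameter tables might look like the following. It is a sketch under stated assumptions: the probability tables are hand-written toy numbers, and `FLOOR` is an assumed stand-in probability for unseen events, neither of which comes from the slides.

```python
import math

FLOOR = 1e-12  # assumed probability for unseen events (not from the slides)


def log_p(table, key):
    return math.log(table.get(key, FLOOR))


def viterbi(words, tags, init, trans, emit):
    # best[i][t] = (log-prob of the best path ending in tag t at word i, backpointer)
    best = [{t: (log_p(init, t) + log_p(emit.get(t, {}), words[0]), None) for t in tags}]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            # Best previous tag p for reaching tag t at position i.
            score, prev = max(
                (best[i - 1][p][0] + log_p(trans.get(p, {}), t), p) for p in tags
            )
            column[t] = (score + log_p(emit.get(t, {}), words[i]), prev)
        best.append(column)
    # Follow backpointers from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return path[::-1]


# Toy parameters, invented for illustration only.
tags = ["DT", "NN", "JJ"]
init = {"DT": 0.8, "NN": 0.1, "JJ": 0.1}
trans = {
    "DT": {"JJ": 0.5, "NN": 0.5},
    "JJ": {"NN": 0.9, "JJ": 0.1},
    "NN": {"NN": 0.5, "DT": 0.5},
}
emit = {"DT": {"that": 1.0}, "JJ": {"big": 1.0}, "NN": {"fire": 0.9, "big": 0.1}}

best_path = viterbi(["that", "big", "fire"], tags, init, trans, emit)
print(best_path)  # ['DT', 'JJ', 'NN']
```

Viterbi returns only the single best path, which is why the k-best requirement for re-ranking calls for a different search.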

HMM Beam Search

Step 1: Enumerate all possible tags for the first word
Step 2: Evaluate each tagging using the trained HMM and keep only the best k (first-word sentence taggings)
Step 3: For each of the k taggings from the previous step, enumerate all possible tags for the second word
Step 4: Evaluate each two-word sentence tagging and discard all but the k best
Repeat for all words in the sentence

[Figure: search tree branching from Start through Word 0, Word 1, and Word 2]
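The four steps can be sketched as a short loop that extends every surviving tagging by every tag and then prunes to the k highest-scoring candidates. The probability tables and the `FLOOR` value below are toy assumptions, and enumerating the full tag set for each word is a simplification of "all possible tags".

```python
import math

FLOOR = 1e-12  # assumed probability for unseen events (not from the slides)


def log_p(table, key):
    return math.log(table.get(key, FLOOR))


def beam_search(words, tags, init, trans, emit, k=3):
    """Return up to k (log_prob, tag_sequence) pairs, best first."""
    # Steps 1-2: score every tag for the first word, keep the best k.
    beam = [(log_p(init, t) + log_p(emit.get(t, {}), words[0]), [t]) for t in tags]
    beam = sorted(beam, key=lambda c: c[0], reverse=True)[:k]
    for word in words[1:]:
        candidates = []
        # Step 3: extend each surviving tagging with every tag.
        for score, seq in beam:
            for t in tags:
                step = log_p(trans.get(seq[-1], {}), t) + log_p(emit.get(t, {}), word)
                candidates.append((score + step, seq + [t]))
        # Step 4: discard all but the k best.
        beam = sorted(candidates, key=lambda c: c[0], reverse=True)[:k]
    return beam


# Toy parameters, invented for illustration only.
tags = ["DT", "NN", "JJ"]
init = {"DT": 0.8, "NN": 0.1, "JJ": 0.1}
trans = {
    "DT": {"JJ": 0.5, "NN": 0.5},
    "JJ": {"NN": 0.9, "JJ": 0.1},
    "NN": {"NN": 0.5, "DT": 0.5},
}
emit = {"DT": {"that": 1.0}, "JJ": {"big": 1.0}, "NN": {"fire": 0.9, "big": 0.1}}

k_best = beam_search(["that", "big", "fire"], tags, init, trans, emit, k=2)
for score, seq in k_best:
    print(round(math.exp(score), 3), seq)
```

Because pruning happens at every word, the beam is not guaranteed to contain the true k most probable taggings; a larger k trades speed for a lower chance of dropping a good candidate early.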

MaxEnt Re-ranking

After beam search, we have the k “best” taggings for our sentence
Use a trained MaxEnt model to select the most probable sequence of tags

[Figure: k candidate tag paths from Start through Word 1 … Word t]
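One plausible way to re-rank is to score each candidate tagging as the sum of per-word log-probabilities under a log-linear (MaxEnt) model and keep the highest-scoring one. The slides do not specify the scoring rule, so everything here is an assumption: the hand-set `weights`, the "<s>" start symbol, and the use of only two features (current word and previous tag).

```python
import math


def tag_log_prob(tag, feats, weights, tags):
    """log P(tag | feats) under a log-linear model with (feature, tag) weights."""
    scores = {t: sum(weights.get((f, t), 0.0) for f in feats) for t in tags}
    log_z = math.log(sum(math.exp(s) for s in scores.values()))
    return scores[tag] - log_z


def rerank(words, candidates, weights, tags):
    """Return the candidate tag sequence with the highest MaxEnt score."""
    def seq_score(seq):
        total, prev = 0.0, "<s>"  # "<s>" is an assumed start-of-sentence symbol
        for word, tag in zip(words, seq):
            feats = ["word=" + word.lower(), "prev_tag=" + prev]
            total += tag_log_prob(tag, feats, weights, tags)
            prev = tag
        return total

    return max(candidates, key=seq_score)


# Hypothetical weights; a real model would learn these from training data.
tags = ["DT", "NN", "JJ"]
weights = {
    ("word=that", "DT"): 1.0,
    ("word=big", "JJ"): 2.0,
    ("word=fire", "NN"): 1.5,
}
candidates = [["DT", "NN", "NN"], ["DT", "JJ", "NN"]]
chosen = rerank(["that", "big", "fire"], candidates, weights, tags)
print(chosen)  # ['DT', 'JJ', 'NN']
```

The re-ranker only ever chooses among the k candidates it is given, so its accuracy is bounded by how often the beam contains the correct tagging.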

Results

Feature
  Current word
  Previous tag
  Word contains a numeral
  Suffixes: “-ing”, “-ness”, “-ity”, “-ed”, “-able”, “-s”, “-ion”, “-al”, “-ive”, “-ly”
  Word is capitalized
  Word is hyphenated
  Word is all uppercase
  Word is all uppercase with a numeral
  Word is capitalized and a word ending in “Co.” or “Inc.” is found within 3 words ahead
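The feature list maps naturally onto a function that returns the set of features firing for one word. This is a sketch with assumed details: the feature-name strings are invented, suffixes are matched case-insensitively, and "all uppercase" is taken to mean Python's `str.isupper()`, which ignores digits and hyphens.

```python
SUFFIXES = ("ing", "ness", "ity", "ed", "able", "s", "ion", "al", "ive", "ly")


def extract_features(words, i, prev_tag):
    """Return the set of active feature names for words[i]."""
    word = words[i]
    feats = {"word=" + word.lower(), "prev_tag=" + prev_tag}
    has_numeral = any(ch.isdigit() for ch in word)
    if has_numeral:
        feats.add("has_numeral")
    for suffix in SUFFIXES:
        if word.lower().endswith(suffix):
            feats.add("suffix=-" + suffix)
    if word[:1].isupper():
        feats.add("capitalized")
        # Look at most 3 words ahead for a company marker.
        if any(w.endswith(("Co.", "Inc.")) for w in words[i + 1:i + 4]):
            feats.add("capitalized_near_company")
    if "-" in word:
        feats.add("hyphenated")
    if word.isupper():
        feats.add("all_upper")
        if has_numeral:
            feats.add("all_upper_with_numeral")
    return feats


sentence = ["Acme", "Widget", "Co.", "is", "hiring"]
print(sorted(extract_features(sentence, 0, "<s>")))
print(sorted(extract_features(sentence, 4, "BEZ")))
print(sorted(extract_features(["B-52"], 0, "DT")))
```

Note that several features can fire at once (for example "all_upper" and "all_upper_with_numeral"); whether the original model treated them as mutually exclusive is not stated in the slides.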

Results

Baseline “most frequent class” tagger: 73.41% (24%)
HMM Viterbi tagger: 92.96% (32.76% on )
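For reference, a "most frequent class" baseline can be sketched as follows: each known word gets the tag it carried most often in training. How the original baseline handled unseen words is not stated; this sketch assumes they receive the overall most frequent tag, and the training sentences are invented.

```python
from collections import Counter, defaultdict


def train_baseline(tagged_sentences):
    """Map each word to its most frequent tag; also return a default tag."""
    per_word = defaultdict(Counter)
    overall = Counter()
    for sent in tagged_sentences:
        for word, tag in sent:
            per_word[word.lower()][tag] += 1
            overall[tag] += 1
    default = overall.most_common(1)[0][0]  # assumed fallback for unseen words
    best = {w: c.most_common(1)[0][0] for w, c in per_word.items()}
    return best, default


def tag_baseline(words, best, default):
    return [best.get(w.lower(), default) for w in words]


# Toy training data, invented for illustration only.
train = [
    [("the", "DT"), ("fire", "NN"), ("burns", "VBZ")],
    [("they", "PPSS"), ("fire", "VB"), ("the", "DT"), ("cook", "NN")],
    [("the", "DT"), ("big", "JJ"), ("fire", "NN"), ("alarm", "NN")],
]
best, default = train_baseline(train)
predicted = tag_baseline(["The", "fire", "spreads"], best, default)
print(predicted)  # ['DT', 'NN', 'NN']
```

The baseline ignores context entirely, which is the gap the HMM's transition probabilities are meant to close.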
