Part-of-Speech Tagging (Foundations of Statistical NLP, Chapter 10)
Contents
- Markov Model Taggers
- Hidden Markov Model Taggers
- Transformation-Based Learning of Tags
- Tagging Accuracy and Uses of Taggers
Markov Model Taggers
- Markov properties:
  - Limited horizon: the next tag depends only on the current tag
  - Time invariant (stationary): the transition probabilities do not change with the position in the sentence
- cf. wh-extraction (Chomsky), a long-distance dependency that a limited-horizon model cannot capture:
  a. Should Peter buy a book?
  b. Which book should Peter buy?
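Stated as equations (a sketch in the usual notation, where X_i is the tag at position i):

$$P(X_{i+1} = t^k \mid X_1, \ldots, X_i) = P(X_{i+1} = t^k \mid X_i) \quad \text{(limited horizon)}$$

$$P(X_{i+1} = t^k \mid X_i) = P(X_2 = t^k \mid X_1) \quad \text{(time invariance)}$$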
Markov Model Taggers
- The probabilistic model: find the best tag sequence t_{1,n} for a sentence w_{1,n}, i.e. the sequence that maximizes P(t_{1,n} | w_{1,n})
- Example: P(AT NN BEZ IN AT VB | The bear is on the move.)
- Assumptions:
  - words are conditionally independent of each other given their tags
  - a word's identity depends only on its own tag
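Putting the model and these two assumptions together (a sketch; t_0 denotes a designated start tag):

$$\hat{t}_{1,n} = \arg\max_{t_{1,n}} P(t_{1,n} \mid w_{1,n}) \approx \arg\max_{t_{1,n}} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})$$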
Markov Model Taggers
- Training: maximum likelihood estimates from a tagged corpus (a Python sketch follows)
  for all tags t^j do
    for all tags t^k do
      P(t^k | t^j) = C(t^j, t^k) / C(t^j)
    end
  end
  for all tags t^j do
    for all words w^l do
      P(w^l | t^j) = C(w^l, t^j) / C(t^j)
    end
  end
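A minimal Python sketch of this training loop, assuming the tagged corpus is a list of (word, tag) sentences (the function and variable names are illustrative, not from the chapter):

```python
from collections import defaultdict

def train_bigram_tagger(tagged_sentences):
    """Estimate P(t^k | t^j) and P(w^l | t^j) by maximum likelihood counts."""
    tag_count = defaultdict(int)      # C(t^j)
    trans_count = defaultdict(int)    # C(t^j, t^k)
    emit_count = defaultdict(int)     # C(w^l, t^j)

    for sentence in tagged_sentences:
        prev = "<s>"                  # designated start tag
        tag_count[prev] += 1
        for word, tag in sentence:
            trans_count[(prev, tag)] += 1
            emit_count[(word, tag)] += 1
            tag_count[tag] += 1
            prev = tag

    trans = {(tj, tk): c / tag_count[tj] for (tj, tk), c in trans_count.items()}
    emit = {(w, tj): c / tag_count[tj] for (w, tj), c in emit_count.items()}
    return trans, emit

# Toy usage:
corpus = [[("the", "AT"), ("bear", "NN"), ("is", "BEZ"),
           ("on", "IN"), ("the", "AT"), ("move", "NN")]]
transition_probs, emission_probs = train_bigram_tagger(corpus)
```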
[Tables omitted: bigram tag transition counts for first tag vs. second tag over AT, BEZ, IN, NN, VB, PERIOD, and word-by-tag counts for bear, is, move, on, president, progress, the.]
Markov Model Taggers
- Tagging: find the most probable tag sequence with the Viterbi algorithm (sketched below)
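A compact Viterbi sketch over the probabilities estimated above (names are again illustrative; unseen word/tag pairs get probability 0 here, which is why the smoothing and unknown-word variations below matter in practice):

```python
def viterbi(words, tags, trans, emit, start="<s>"):
    """Return the most probable tag sequence for `words` under a bigram tagging model."""
    # delta[t] = probability of the best tag sequence for the words so far ending in tag t
    delta = {t: trans.get((start, t), 0.0) * emit.get((words[0], t), 0.0) for t in tags}
    backptrs = []

    for word in words[1:]:
        new_delta, back = {}, {}
        for t in tags:
            # best previous tag to transition from into t
            prev, score = max(((p, delta[p] * trans.get((p, t), 0.0)) for p in tags),
                              key=lambda x: x[1])
            new_delta[t] = score * emit.get((word, t), 0.0)
            back[t] = prev
        delta = new_delta
        backptrs.append(back)

    # follow the back-pointers from the best final tag
    best = max(delta, key=delta.get)
    sequence = [best]
    for back in reversed(backptrs):
        best = back[best]
        sequence.append(best)
    return list(reversed(sequence))

# e.g. viterbi(["the", "bear", "is", "on", "the", "move"],
#              ["AT", "NN", "BEZ", "IN"], transition_probs, emission_probs)
```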
Variations
- Models for unknown words:
  1. assume that an unknown word can be any part of speech
  2. use morphological features (capitalization, suffixes, etc.) to infer its possible parts of speech, e.g.
     P(w^l | t^j) = (1/Z) * P(unknown word | t^j) * P(capitalized | t^j) * P(suffix | t^j)
     where Z is a normalization constant
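A sketch of the second option, assuming per-tag feature probabilities have already been estimated from the training data (the feature set and names are illustrative):

```python
def unknown_word_prob(word, tag, p_unknown, p_cap, p_suffix,
                      suffixes=("ing", "ed", "ly", "s")):
    """Rough P(w | t) for an unseen word, built from per-tag feature probabilities."""
    p = p_unknown.get(tag, 0.0)                     # P(unknown word | tag)
    cap = p_cap.get(tag, 0.5)                       # P(capitalized | tag)
    p *= cap if word[0].isupper() else (1.0 - cap)
    for suf in suffixes:                            # first matching suffix, if any
        if word.endswith(suf):
            p *= p_suffix.get((suf, tag), 1e-4)
            break
    return p   # divide by the normalization constant Z (sum over all tags) if needed
```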
Variations
- Trigram taggers
- Linear interpolation of unigram, bigram, and trigram estimates
- Variable Memory Markov Model (VMMM): the length of the conditioning context is chosen adaptively
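The interpolated trigram estimate has the standard form (a sketch; the weights λ sum to one and are tuned, e.g. on held-out data):

$$P(t_i \mid t_{i-1}, t_{i-2}) = \lambda_1 \hat{P}(t_i) + \lambda_2 \hat{P}(t_i \mid t_{i-1}) + \lambda_3 \hat{P}(t_i \mid t_{i-1}, t_{i-2})$$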
Variations
- Smoothing of the estimates (K^l: the number of possible parts of speech of w^l)
- Reversibility: decoding the sentence from left to right and from right to left yields the same tagging
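One simple additive-smoothing sketch that uses K^l; this is illustrative and may differ from the chapter's exact formula:

$$P(t^j \mid w^l) = \frac{C(t^j, w^l) + 0.5}{C(w^l) + 0.5\, K^l}$$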
Variations
- Maximizing the whole sequence vs. tag by tag: "Time flies like an arrow."
  a. NN VBZ RB AT NN .   P(a) = 0.01
  b. NN NNS VB AT NN .   P(b) = 0.01
- Maximizing each tag individually can mix the two readings into an incoherent sequence, but in practice there is no large difference in accuracy between maximizing the sequence and maximizing tag by tag.
Hidden Markov Model Taggers
- Used when we have no tagged training data
- Initialize all parameters with dictionary information:
  - Jelinek's method
  - Kupiec's method
Hidden Markov Model Taggers
- Jelinek's method: initialize the HMM with maximum likelihood estimates for P(w^k | t^i), assuming that each word occurs equally likely with each of its possible tags (a sketch follows)
  T(w^j): the number of tags allowed for w^j
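A Python sketch of this initialization, assuming a dictionary of allowed tags per word and raw word counts C(w) from the untagged corpus (names are illustrative):

```python
def jelinek_init(word_counts, allowed_tags):
    """Initial P(w | t): each word's count is split evenly over its T(w) allowed tags,
    then the mass assigned to each tag is normalized."""
    mass = {}        # unnormalized mass for each (word, tag) pair
    tag_total = {}   # per-tag normalizer
    for word, count in word_counts.items():
        tags = allowed_tags[word]
        share = count / len(tags)            # equally likely over the T(w) tags
        for tag in tags:
            mass[(word, tag)] = share
            tag_total[tag] = tag_total.get(tag, 0.0) + share
    return {(w, t): m / tag_total[t] for (w, t), m in mass.items()}

# e.g. jelinek_init({"the": 120, "move": 10}, {"the": ["AT"], "move": ["NN", "VB"]})
```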
Hidden Markov Model Taggers
- Kupiec's method: group all words with the same set of possible parts of speech into 'metawords' u_L, so that parameters are not fine-tuned for each individual word
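A sketch of the grouping step (names are illustrative); output parameters are then estimated per metaword rather than per word:

```python
from collections import defaultdict

def build_metawords(allowed_tags):
    """Group words by their set of admissible parts of speech: one metaword per tag set."""
    metawords = defaultdict(list)
    for word, tags in allowed_tags.items():
        metawords[frozenset(tags)].append(word)
    return metawords

# e.g. all words that can only be NN or VB fall into the metaword for {NN, VB}
```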
Hidden Markov Model Taggers
- Training: after initialization, the HMM is trained with the Forward-Backward algorithm
- Tagging: identical to the Visible Markov Model (Viterbi decoding); the difference between VMM tagging and HMM tagging lies in how we train the model, not in how we tag
Hidden Markov Model Taggers
- The effect of initialization on the HMM (overtraining problem)
  Lexical probabilities:
  - D0: maximum likelihood estimates from a tagged training corpus
  - D1: correct ordering only of the lexical probabilities
  - D2: lexical probabilities proportional to overall tag probabilities
  - D3: equal lexical probabilities for all tags admissible for a word
  Transition probabilities:
  - T0: maximum likelihood estimates from a tagged training corpus
  - T1: equal probabilities for all transitions
Which method to use:
- A sufficiently large training text similar to the intended text of application: use the Visible Markov Model.
- No training text, or training and test text very different, but at least some lexical information available: run Forward-Backward for a few iterations.
- No lexical information: run Forward-Backward for a larger number of iterations.