Ling 570 Day 6: HMM POS Taggers
Overview
– Open Questions
– HMM POS Tagging Review
– Viterbi Algorithm
– Training and Smoothing
– HMM Implementation Details
HMM POS TAGGING
HMM Tagger
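A compact statement of the bigram HMM tagging objective that the following slides work through, in its standard formulation:

```latex
\hat{t}_{1}^{n}
  = \operatorname*{argmax}_{t_{1}^{n}} P(t_{1}^{n} \mid w_{1}^{n})
  \approx \operatorname*{argmax}_{t_{1}^{n}} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})
```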
The good HMM Tagger
From the Brown/Switchboard corpus:
– P(VB|TO) = .34
– P(NN|TO) = .021
– P(race|VB) = .00003
– P(race|NN) = .00041
a. P(VB|TO) x P(race|VB) = .34 x .00003 = .00001
b. P(NN|TO) x P(race|NN) = .021 x .00041 = .0000086
So (a) wins: TO followed by VB is more probable in this context ('race' itself has little effect here; the transition probability dominates).
HMM Philosophy
Imagine: the author, when creating this sentence, also had in mind the parts of speech of each of these words. After the fact, we're now trying to recover those parts of speech. They're the hidden part of the Markov model.
What happens when we do it the wrong way?
Invert word and tag, using P(t|w) instead of P(w|t):
1. P(VB|race) = .02
2. P(NN|race) = .98
Probability 2 would drown out virtually any other probability: we'd always tag 'race' with NN!
N-gram POS tagging
JJ         JJ      NNS     VB      RB
colorless  green   ideas   sleep   furiously
– Predict the current tag conditioned on the prior n-1 tags
– Predict the current word conditioned on the current tag
HMM bigram tagger
JJ         JJ      NNS     VB      RB
colorless  green   ideas   sleep   furiously
HMM trigram tagger
JJ         JJ      NNS     VB      RB
colorless  green   ideas   sleep   furiously
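A minimal sketch of how the bigram tagger scores one candidate tag sequence for this sentence, adding one transition log probability and one emission log probability per word (the log-probability values below are hypothetical placeholders, not estimates from a real corpus):

```python
# Hypothetical (made-up) log-probability tables, for illustration only.
log_trans = {("<s>", "JJ"): -1.0, ("JJ", "JJ"): -1.5, ("JJ", "NNS"): -1.2,
             ("NNS", "VB"): -2.0, ("VB", "RB"): -1.8}
log_emit = {("JJ", "colorless"): -8.0, ("JJ", "green"): -6.5,
            ("NNS", "ideas"): -7.0, ("VB", "sleep"): -6.0,
            ("RB", "furiously"): -7.5}

def bigram_score(words, tags):
    """Bigram HMM score: sum of log P(tag | prev tag) + log P(word | tag)."""
    total, prev = 0.0, "<s>"          # "<s>" marks the start of the sentence
    for w, t in zip(words, tags):
        total += log_trans[(prev, t)] + log_emit[(t, w)]
        prev = t
    return total

print(bigram_score("colorless green ideas sleep furiously".split(),
                   ["JJ", "JJ", "NNS", "VB", "RB"]))
```

A trigram tagger differs only in the transition term: each tag is conditioned on the previous two tags rather than just one.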
Training
An HMM needs to be trained on the following:
1. The initial state probabilities
2. The state transition probabilities
 – The tag-tag matrix
3. The emission probabilities
 – The tag-word matrix
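As a sketch, all three distributions can be estimated from a tag-annotated corpus by counting and normalizing (maximum likelihood); the corpus format and variable names here are assumptions:

```python
from collections import Counter, defaultdict

def train_hmm(tagged_sents):
    """tagged_sents: a list of sentences, each a list of (word, tag) pairs."""
    init, trans, emit = Counter(), defaultdict(Counter), defaultdict(Counter)
    for sent in tagged_sents:
        prev = None
        for i, (word, tag) in enumerate(sent):
            emit[tag][word] += 1            # tag-word counts
            if i == 0:
                init[tag] += 1              # initial state counts
            else:
                trans[prev][tag] += 1       # tag-tag counts
            prev = tag
    # Normalize counts into (unsmoothed) maximum-likelihood probabilities.
    n_sents = sum(init.values())
    pi = {t: c / n_sents for t, c in init.items()}
    a = {t1: {t2: c / sum(row.values()) for t2, c in row.items()}
         for t1, row in trans.items()}
    b = {t: {w: c / sum(row.values()) for w, c in row.items()}
         for t, row in emit.items()}
    return pi, a, b
```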
Implementation
– Transition distribution
– Emission distribution
REVIEW: VITERBI ALGORITHM
Consider two examples
Mariners hit a home run
Mariners hit made the news
Candidate tags for each word are drawn from {N, V, DT}; 'hit' in particular is ambiguous between N and V, and the two sentences resolve it differently.
Parameters
As probabilities, the parameters get very small: the transition table over {N, V, DT} and the emission table over the words a, hit, home, made, Mariners, news, run, the contain values on the order of 10^-5 and smaller.
As log probabilities, they won't underflow, and we can just add them instead of multiplying.
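A tiny illustration of why log probabilities are preferred, using hypothetical values of the same magnitude as the tables above:

```python
import math

# Hypothetical small probabilities, like the entries in the tag/word tables.
probs = [3e-5, 2e-4, 1e-6, 5e-3]

product = 1.0
for p in probs:
    product *= p                    # shrinks toward 0; long sentences underflow

log_sum = sum(math.log10(p) for p in probs)   # safe: just add the log probs

print(product)                      # 3e-17
print(log_sum)                      # about -16.5
```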
Worked example: 'Mariners hit a home run', candidate states N, V, DT per word, scored with the log-probability tables.
Worked example: 'Mariners hit made the news', candidate states N, V, DT per word, scored with the log-probability tables.
Viterbi
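A restatement of the standard Viterbi recurrence in the π/a/b notation used in the implementation slides later on:

```latex
v_1(j) = \pi_j \, b_j(o_1)
\qquad
v_t(j) = \max_{i} \; v_{t-1}(i)\, a_{ij}\, b_j(o_t)
\qquad
\hat{q}_T = \operatorname*{argmax}_{j} v_T(j),\ \text{then follow backpointers}
```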
Pseudocode
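A minimal Viterbi sketch in Python, using log probabilities and the pi/a/b naming that appears in the implementation slides below; the dictionary-of-dictionaries representation and variable names are assumptions, not necessarily what the pseudocode slide shows:

```python
import math

def viterbi(obs, states, pi, a, b):
    """obs: list of symbols; pi[s], a[s1][s2], b[s][o] hold probabilities."""
    NEG_INF = float("-inf")

    def lg(p):                       # log of a probability, treating 0 as -inf
        return math.log(p) if p > 0 else NEG_INF

    # delta[t][s]: best log score of any tag path ending in state s at time t
    delta = [{s: lg(pi.get(s, 0.0)) + lg(b[s].get(obs[0], 0.0)) for s in states}]
    back = [{}]                      # back[t][s]: best predecessor of s at time t
    for t in range(1, len(obs)):
        delta.append({})
        back.append({})
        for s in states:
            best_prev, best_score = None, NEG_INF
            for prev in states:
                score = delta[t - 1][prev] + lg(a.get(prev, {}).get(s, 0.0))
                if best_prev is None or score > best_score:
                    best_prev, best_score = prev, score
            delta[t][s] = best_score + lg(b[s].get(obs[t], 0.0))
            back[t][s] = best_prev

    # Pick the best final state and follow backpointers to recover the path.
    last = max(states, key=lambda s: delta[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path)), delta[-1][last]
```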
SMOOTHING
Training
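The maximum-likelihood estimates in question are the standard count ratios, stated here for reference:

```latex
\hat{P}(t_i \mid t_{i-1}) = \frac{C(t_{i-1}, t_i)}{C(t_{i-1})}
\qquad
\hat{P}(w_i \mid t_i) = \frac{C(t_i, w_i)}{C(t_i)}
```

Both ratios are zero whenever the corresponding pair was never observed in training, which is what the next slides address.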
Why Smoothing?
Zero counts:
– Handle missing tag sequences: smooth the transition probabilities
– Handle unseen words: smooth the observation probabilities
– Handle unseen (word, tag) pairs where both the word and the tag are known
Smoothing Tag Sequences
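One common way to smooth the transition distribution is add-λ smoothing over the tag-tag counts; this is a sketch of that general idea, not necessarily the specific scheme on the original slides:

```python
def smoothed_transitions(trans_counts, tagset, lam=1.0):
    """Add-lambda smoothing: P(t2 | t1) = (C(t1, t2) + lam) / (C(t1) + lam * |T|)."""
    a = {}
    for t1 in tagset:
        row = trans_counts.get(t1, {})
        total = sum(row.values())
        a[t1] = {t2: (row.get(t2, 0) + lam) / (total + lam * len(tagset))
                 for t2 in tagset}
    return a
```

With lam = 1 this is add-one (Laplace) smoothing; unseen tag bigrams get a small non-zero probability instead of zero.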
Smoothing Emission Probabilities
Preprocessing the training corpus:
– Count occurrences of all words
– Replace singleton words with a magic unknown-word token (e.g. <UNK>)
– Gather counts on the modified data, estimate parameters
Preprocessing the test set:
– For each test-set word: if it was seen at least twice in the training set, leave it alone; otherwise replace it with the unknown-word token
– Run Viterbi on this modified input
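A sketch of that preprocessing, assuming <UNK> as the name of the magic token:

```python
from collections import Counter

UNK = "<UNK>"   # assumed spelling of the magic unknown-word token

def replace_rare(train_sents):
    """Replace words seen only once in training with UNK; return vocab and data."""
    counts = Counter(w for sent in train_sents for (w, _t) in sent)
    vocab = {w for w, c in counts.items() if c >= 2}
    train = [[(w if w in vocab else UNK, t) for (w, t) in sent]
             for sent in train_sents]
    return vocab, train

def map_test_words(words, vocab):
    """Map unseen test words to UNK before running Viterbi."""
    return [w if w in vocab else UNK for w in words]
```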
Unknown Words
Is there other information we could use for P(w|t)?
– Information in the words themselves?
Morphology:
– -able: JJ
– -tion: NN
– -ly: RB
– Case: John is NP, etc.
Augment the models:
– Add to the 'context' of tags
– Include as features in classifier models
– We'll come back to this idea!
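As a rough sketch of how such surface cues could supply a fallback tag for an unknown word, with hypothetical suffix preferences mirroring the bullets above (a real tagger would estimate P(tag | suffix) from training data rather than hard-code it):

```python
# Hypothetical suffix-to-tag preferences, mirroring the slide's examples.
SUFFIX_TAGS = {"able": "JJ", "tion": "NN", "ly": "RB"}

def guess_tag(word):
    """Very rough fallback tag for an unknown word, based on surface cues only."""
    for suffix, tag in SUFFIX_TAGS.items():
        if word.lower().endswith(suffix):
            return tag
    if word[:1].isupper():
        return "NP"        # capitalized unknown words are often proper nouns
    return "NN"            # default guess for everything else
```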
HMM IMPLEMENTATION
HMM Implementation: Storing an HMM
Approach #1: hash table (direct):
– π_i: pi{state_str}
– a_ij: a{from_state_str}{to_state_str}
– b_i(o_t): b{state_str}{symbol}
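In Python terms, Approach #1 amounts to dictionaries keyed directly by state and symbol strings; a minimal sketch (the concrete key layout, using tuples for the two-key tables, is an assumption):

```python
# Direct hash-table storage: keys are the state / symbol strings themselves.
pi = {"DT": 0.4, "NN": 0.3, "VB": 0.3}                 # pi{state_str}
a  = {("DT", "NN"): 0.7, ("NN", "VB"): 0.5}            # a{from_state_str}{to_state_str}
b  = {("NN", "dog"): 0.01, ("VB", "runs"): 0.005}      # b{state_str}{symbol}

# Lookup during decoding: a missing key means zero probability.
p = a.get(("DT", "NN"), 0.0) * b.get(("NN", "dog"), 0.0)
```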
HMM Implementation: Storing an HMM
Approach #2: hash tables + arrays:
– state2idx{state_str} = state_idx
– symbol2idx{symbol} = symbol_idx
– idx2symbol[symbol_idx] = symbol
– idx2state[state_idx] = state_str
– π_i: pi[state_idx]
– a_ij: a[from_state_idx][to_state_idx]
– b_i(o_t): b[state_idx][symbol_idx]
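A sketch of Approach #2: map strings to integer indices once, then keep the parameters in dense arrays indexed by those integers (the toy tag set and vocabulary are placeholders):

```python
states  = ["DT", "NN", "VB"]
symbols = ["the", "dog", "runs"]

# String <-> index maps, built once.
state2idx  = {s: i for i, s in enumerate(states)}     # state2idx{state_str}
symbol2idx = {w: i for i, w in enumerate(symbols)}    # symbol2idx{symbol}
idx2state, idx2symbol = states, symbols               # index -> string

# Dense parameter arrays indexed by integers.
N, V = len(states), len(symbols)
pi = [0.0] * N                                        # pi[state_idx]
a  = [[0.0] * N for _ in range(N)]                    # a[from_state_idx][to_state_idx]
b  = [[0.0] * V for _ in range(N)]                    # b[state_idx][symbol_idx]

# Example lookup: P(dog | NN)
p = b[state2idx["NN"]][symbol2idx["dog"]]
```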
HMM Matrix Representations
Issue:
– Many matrix entries are 0, especially in b[i][o]
Approach #3: sparse matrix representation:
– a[i] = "j1 p1 j2 p2 …"
– a[j] = "i1 p1 i2 p2 …"
– b[i] = "o1 p1 o2 p2 …"
– b[o] = "i1 p1 i2 p2 …"
Could be:
– Array of hashes
– Array of lists of non-empty values
– The latter is often quite fast, because the lists are short and fit into cache lines
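A sketch of the "array of lists of non-empty values" option: for each row index, store only the (column index, probability) pairs that are actually non-zero, so the decoder iterates over just those entries:

```python
# Sparse transition matrix: row i holds only the non-zero (to_state_idx, prob) pairs.
a_sparse = [
    [(1, 0.7), (2, 0.3)],     # transitions out of state 0
    [(2, 0.5)],               # transitions out of state 1
    [(0, 0.1), (1, 0.9)],     # transitions out of state 2
]

def successors(from_idx):
    """Iterate only over states actually reachable from from_idx."""
    return a_sparse[from_idx]

for to_idx, p in successors(0):
    print(to_idx, p)          # the Viterbi inner loop would update scores here
```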