MaxEnt POS Tagging Shallow Processing Techniques for NLP Ling570 November 21, 2011
Roadmap MaxEnt POS Tagging Features Beam Search vs Viterbi Named Entity Tagging
MaxEnt Feature Template Words: Current word: w0 Previous word: w-1 Word two back: w-2 Next word: w+1 Next next word: w+2 Tags: Previous tag: t-1 Previous tag pair: t-2 t-1 How many features? 5|V| + |T| + |T|^2
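A minimal sketch (not the course code) of how these word- and tag-context templates could be turned into binary features; the feature-name strings follow the "well-heeled" example slide below, and context_features is a hypothetical helper:

    def context_features(words, tags, i):
        # Word-context and tag-context features for position i.
        # Boundary positions fall back to BOS/EOS pseudo-tokens.
        feats = {}
        feats['curW=' + words[i]] = 1
        feats['prevW=' + (words[i-1] if i >= 1 else 'BOS')] = 1
        feats['prev2W=' + (words[i-2] if i >= 2 else 'BOS')] = 1
        feats['nextW=' + (words[i+1] if i + 1 < len(words) else 'EOS')] = 1
        feats['next2W=' + (words[i+2] if i + 2 < len(words) else 'EOS')] = 1
        feats['preT=' + (tags[i-1] if i >= 1 else 'BOS')] = 1
        feats['pre2T=' + ('-'.join(tags[i-2:i]) if i >= 2 else 'BOS')] = 1
        return feats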
Representing Orthographic Patterns How can we represent morphological patterns as features? Character sequences Which sequences? Prefixes/suffixes e.g. suffix(wi)=ing or prefix(wi)=well Specific characters or character types Which? is-capitalized is-hyphenated
MaxEnt Feature Set
Examples well-heeled: rare word JJ prevW=about:1 prev2W=stories:1 nextW=communities:1 next2W=and:1 pref=w:1 pref=we:1 pref=wel:1 pref=well:1 suff=d:1 suff=ed:1 suff=led:1 suff=eled:1 is-hyphenated:1 preT=IN:1 pre2T=NNS-IN:1
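A sketch of the orthographic features for rare words (again a hypothetical helper, not the course code); applied to "well-heeled" it produces the pref=, suff=, and is-hyphenated features shown above:

    def orthographic_features(word, max_len=4):
        # Prefix/suffix features up to length 4, plus character-type features.
        feats = {}
        for n in range(1, min(max_len, len(word)) + 1):
            feats['pref=' + word[:n]] = 1
            feats['suff=' + word[-n:]] = 1
        if '-' in word:
            feats['is-hyphenated'] = 1
        if word[0].isupper():
            feats['is-capitalized'] = 1
        return feats

    # orthographic_features('well-heeled') -> pref=w, pref=we, pref=wel, pref=well,
    #                                          suff=d, suff=ed, suff=led, suff=eled,
    #                                          is-hyphenated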
Finding Features In training, where do features come from? Where do features come from in testing? Tag features come from the classification of prior words.

    Instance     w-1     w0      w-1 w0       w0 w+1       t-1    y
    x1 (Time)    --      Time    --           Time flies   BOS    N
    x2 (flies)   Time    flies   Time flies   flies like   N      N
    x3 (like)    flies   like    flies like   like an      N      V
Sequence Labeling
Goal: Find most probable labeling of a sequence Many sequence labeling tasks POS tagging Word segmentation Named entity tagging Story/spoken sentence segmentation Pitch accent detection Dialog act tagging
Solving Sequence Labeling Direct: Use a sequence labeling algorithm E.g. HMM, CRF, MEMM Via classification: Use a classification algorithm Issue: What about tag features? Features that use class labels depend on prior classifications Solutions: Don't use features that depend on class labels (loses info) Use some other process to generate class labels, then use them as features Perform incremental classification to get labels, then use those labels as features for instances later in the sequence (see the sketch below)
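A minimal sketch of the third solution, incremental (left-to-right) classification, assuming the context_features helper sketched earlier; classify stands in for any trained classifier mapping a feature dict to a tag:

    def tag_incrementally(words, classify):
        # Left-to-right tagging: tags predicted so far supply the tag features
        # (preT, pre2T) for later positions.
        tags = []
        for i in range(len(words)):
            feats = context_features(words, tags, i)
            tags.append(classify(feats))
        return tags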
HMM Trellis time flies like an arrow Adapted from F. Xia
Viterbi Initialization: δ(1, t) = P(t | BOS) * P(w1 | t) Recursion: δ(i, t) = max over t' of [ δ(i-1, t') * P(t | t') ] * P(wi | t), storing a backpointer to the best t' Termination: best score = max over t of δ(n, t); follow backpointers to recover the tag sequence
Viterbi trellis for "time flies like an arrow" (rows: tags BOS, N, V, D, P; columns 1-6: BOS, time, flies, like, an, arrow). Each cell [tag, i] holds the best path probability ending in that tag at position i, e.g. [N, 6] = [D, 5] * P(N|D) * P(arrow|N)
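For concreteness, a minimal bigram-HMM Viterbi sketch matching the initialization/recursion/termination above; trans and emit are assumed probability tables, not part of the original slides:

    def viterbi(words, tags, trans, emit):
        # Bigram HMM Viterbi. trans[(t_prev, t)] = P(t | t_prev),
        # emit[(t, w)] = P(w | t); 'BOS' is the start state.
        n = len(words)
        delta = [{} for _ in range(n)]   # best path probability per (position, tag)
        back = [{} for _ in range(n)]    # backpointers
        for t in tags:                   # initialization
            delta[0][t] = trans.get(('BOS', t), 0.0) * emit.get((t, words[0]), 0.0)
            back[0][t] = 'BOS'
        for i in range(1, n):            # recursion
            for t in tags:
                prev = max(tags, key=lambda tp: delta[i-1][tp] * trans.get((tp, t), 0.0))
                delta[i][t] = delta[i-1][prev] * trans.get((prev, t), 0.0) * emit.get((t, words[i]), 0.0)
                back[i][t] = prev
        best = max(tags, key=lambda t: delta[n-1][t])   # termination
        seq = [best]
        for i in range(n - 1, 0, -1):    # follow backpointers
            seq.append(back[i][seq[-1]])
        return list(reversed(seq))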
Decoding Goal: Identify the highest probability tag sequence Issues: Features include tags from previous words, which are not immediately available The model uses tag history: just knowing the single highest-probability preceding tag is insufficient
Decoding Approach: Retain multiple candidate tag sequences Essentially search through tagging choices Which sequences? All sequences? No. Why not? How many sequences? Branching factor: N (# tags); Depth: T (# words), so N^T sequences (e.g., with ~45 Penn Treebank tags and a 20-word sentence, far too many to enumerate) Instead: keep the top K highest probability sequences
Breadth-First Search time flies like an arrow
Breadth-first Search Is it efficient? No, it tries everything
Beam Search Intuition: Breadth-first search explores all paths Lots of paths are (pretty obviously) bad Why explore bad paths? Restrict to (apparently best) paths Approach: Perform breadth-first search, but Retain only k ‘best’ paths thus far k: beam width
Beam Search, k=3 time flies like an arrow
Beam Search W = {w1, w2, …, wn}: test sentence sij: jth highest prob. sequence up to & including word wi Generate tags for w1, keep top k, set s1j accordingly for i = 2 to n: Extension: add tags for wi to each s(i-1)j Beam selection: Sort sequences by probability Keep only top k sequences Return highest probability sequence sn1
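A minimal sketch of this algorithm, reusing the context_features helper sketched earlier; score is a stand-in for the trained MaxEnt model's P(tag | history):

    def beam_search(words, tags, score, k=3):
        # Beam search decoding. `score(feats, tag)` returns P(tag | history),
        # e.g. from a trained MaxEnt model. Each beam entry is
        # (tag sequence so far, probability of that sequence).
        beam = [([], 1.0)]
        for i in range(len(words)):
            candidates = []
            for seq, p in beam:                       # extension step
                feats = context_features(words, seq, i)
                for t in tags:
                    candidates.append((seq + [t], p * score(feats, t)))
            candidates.sort(key=lambda c: c[1], reverse=True)
            beam = candidates[:k]                     # beam selection: keep top k
        return beam[0][0]                             # highest-probability sequence

With k = 1 this degenerates to greedy left-to-right tagging; per the slides, a beam of 3-5 is near optimal in practice.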
POS Tagging Overall accuracy: 96.3+% Unseen word accuracy: 86.2% Comparable to HMM or TBL tagging accuracy Provides: Probabilistic framework Better able to model different information sources Topline (ceiling) accuracy is 96-97%, limited by annotation consistency issues
Beam Search Beam search decoding: Variant of breadth-first search At each layer, keep only the top k sequences Advantages: Efficient in practice: a beam of 3-5 is near optimal Empirically, the beam explores only 5-10% of the search space, i.e. prunes 90-95% Simple to implement: just extensions + sorting, no dynamic programming Running time: O(kT) [vs. O(N^T)] Disadvantage: Not guaranteed optimal (or complete)
Viterbi Decoding Viterbi search: Exploits dynamic programming, memoization Requires small history window Efficient search: O(N^2 T) Advantage: Exact: optimal solution is returned Disadvantage: Limited window of context
Beam vs Viterbi Dynamic programming vs heuristic search Guaranteed optimal vs no guarantee Different context window
MaxEnt POS Tagging Part of speech tagging by classification: Feature design word and tag context features orthographic features for rare words Sequence classification problems: Tag features depend on prior classification Beam search decoding Efficient, but inexact Near optimal in practice
Named Entity Recognition
Roadmap Named Entity Recognition Definition Motivation Challenges Common Approach
Named Entity Recognition Task: Identify Named Entities in (typically) unstructured text Typical entities: Person names Locations Organizations Dates Times
Example Microsoft released Windows Vista in 2007.
Entities: Often application/domain specific Business intelligence: products, companies, features Biomedical: genes, proteins, diseases, drugs, …
Why NER? Machine translation: Person names typically not translated Possibly transliterated (e.g., Waldheim) Numbers: 9/11: date vs. ratio 911: emergency phone number vs. simple number
Why NER? Information extraction: MUC task: joint ventures/mergers Focus on company names, person names (CEO), valuations Information retrieval: Named entities are the focus of retrieval In some data sets, 60+% of queries target NEs Text-to-speech: Phone numbers (vs. other digit strings); conventions differ by language
Challenges Ambiguity "Washington chose …": Washington could be D.C., the state, George (Washington), etc. Most digit strings are ambiguous cat: (95 results) CAT(erpillar) stock ticker Computerized Axial Tomography Chloramphenicol Acetyl Transferase small furry mammal
Evaluation Precision Recall F-measure
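The standard definitions (not spelled out on the slide), with counts over system-proposed vs. gold-standard entities:

    Precision P = # correctly identified entities / # entities proposed by the system
    Recall    R = # correctly identified entities / # entities in the gold standard
    F-measure F1 = 2PR / (P + R)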
Resources Online: Name lists (baby names, who's who, newswire services) Gazetteers, etc. Tools: LingPipe OpenNLP Stanford NLP toolkit