MaxEnt POS Tagging
Shallow Processing Techniques for NLP
Ling570, November 21, 2011
Roadmap
  MaxEnt POS tagging features
  Beam search vs. Viterbi
  Named entity tagging
MaxEnt Feature Template
Words:
  Current word: w0
  Previous word: w-1
  Word two back: w-2
  Next word: w+1
  Next next word: w+2
Tags:
  Previous tag: t-1
  Previous tag pair: t-2 t-1
How many features? 5|V| + |T| + |T|^2
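For a rough sense of scale (illustrative numbers, not from the slides): with the 45-tag Penn Treebank tagset and a hypothetical 40,000-word vocabulary, this template yields roughly 5*40,000 + 45 + 45^2 = 202,070 candidate features, before any feature count cutoffs.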
Representing Orthographic Patterns
How can we represent morphological patterns as features?
Character sequences: which sequences?
  Prefixes/suffixes, e.g. suffix(wi) = ing or prefix(wi) = well
Specific characters or character types: which?
  is-capitalized
  is-hyphenated
MaxEnt Feature Set

Examples
well-heeled: rare word, tagged JJ
  prevW=about:1  prev2W=stories:1  nextW=communities:1  next2W=and:1
  pref=w:1  pref=we:1  pref=wel:1  pref=well:1
  suff=d:1  suff=ed:1  suff=led:1  suff=eled:1
  is-hyphenated:1
  preT=IN:1  pre2T=NNS-IN:1
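As a concrete illustration of how such feature vectors could be produced, here is a minimal sketch, not the course's actual feature extractor: the feature names mirror the example above, while the rare-word threshold and helper names are assumptions.

```python
# Minimal sketch of MaxEnt-style feature extraction for POS tagging.
# Feature names (prevW, pref, suff, is-hyphenated, preT, ...) mirror the
# slide's example; the rare-word threshold is an assumed cutoff.

RARE_THRESHOLD = 5  # orthographic features fire only for rare words

def extract_features(words, tags, i, word_counts):
    """Return a dict of binary features for position i.

    words: tokens of the sentence
    tags:  tags assigned so far (used for the tag-history features)
    """
    w = words[i]
    feats = {}

    # Word-context features (w-2 .. w+2)
    feats[f"curW={w}"] = 1
    if i >= 1: feats[f"prevW={words[i-1]}"] = 1
    if i >= 2: feats[f"prev2W={words[i-2]}"] = 1
    if i + 1 < len(words): feats[f"nextW={words[i+1]}"] = 1
    if i + 2 < len(words): feats[f"next2W={words[i+2]}"] = 1

    # Tag-history features (t-1 and the t-2 t-1 pair)
    prev_t = tags[i-1] if i >= 1 else "BOS"
    prev2_t = tags[i-2] if i >= 2 else "BOS"
    feats[f"preT={prev_t}"] = 1
    feats[f"pre2T={prev2_t}-{prev_t}"] = 1

    # Orthographic features for rare words: prefixes/suffixes up to length 4
    if word_counts.get(w, 0) < RARE_THRESHOLD:
        for k in range(1, min(4, len(w)) + 1):
            feats[f"pref={w[:k]}"] = 1
            feats[f"suff={w[-k:]}"] = 1
        if "-" in w:
            feats["is-hyphenated"] = 1
        if w[:1].isupper():
            feats["is-capitalized"] = 1

    return feats

# Example: features for 'well-heeled' in a fragment of the slide's sentence
words = ["stories", "about", "well-heeled", "communities", "and"]
tags = ["NNS", "IN"]   # tags already assigned to the first two words
print(extract_features(words, tags, 2, word_counts={}))
```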
Finding Features
In training, where do features come from? Where do features come from in testing?
In testing, tag features come from the classification of the prior word.
Training examples from "Time flies like an arrow":
  x1 (Time):  w-1 = (none),  w0 = Time,   w-1 w0 = (none),       w0 w+1 = Time flies,  t-1 = BOS,  y = N
  x2 (flies): w-1 = Time,    w0 = flies,  w-1 w0 = Time flies,   w0 w+1 = flies like,  t-1 = N,    y = N
  x3 (like):  w-1 = flies,   w0 = like,   w-1 w0 = flies like,   w0 w+1 = like an,     t-1 = N,    y = V
Sequence Labeling
Goal: Find the most probable labeling of a sequence.
Many sequence labeling tasks:
  POS tagging
  Word segmentation
  Named entity tagging
  Story/spoken sentence segmentation
  Pitch accent detection
  Dialog act tagging
Solving Sequence Labeling
Direct: use a sequence labeling algorithm, e.g. HMM, CRF, MEMM.
Via classification: use a classification algorithm.
Issue: what about tag features? Features that use class labels depend on the classification itself.
Solutions (the last of these is sketched below):
  Don't use features that depend on class labels (loses information)
  Use another process to generate class labels, then use them
  Perform incremental classification to get labels, and use those labels as features for instances later in the sequence
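A minimal sketch of the incremental-classification idea, reusing the feature extractor sketched earlier; the classifier object and its predict method are stand-ins for any trained MaxEnt model, not the course's code. The point is only that each prediction feeds the tag features of the next position.

```python
# Greedy (incremental) sequence classification: earlier predictions supply
# the tag-history features for later positions. 'classifier' is a stand-in
# for any trained model exposing predict(features) -> tag.

def greedy_tag(words, classifier, extract_features, word_counts):
    tags = []
    for i in range(len(words)):
        feats = extract_features(words, tags, i, word_counts)  # uses tags[:i]
        tags.append(classifier.predict(feats))                 # commit to one tag
    return tags
```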
HMM Trellis (figure): states over the sentence "time flies like an arrow". Adapted from F. Xia.
Viterbi
(standard bigram HMM formulation: a = transition probabilities, b = emission probabilities, pi = initial probabilities)
Initialization: delta_1(j) = pi_j * b_j(o_1)
Recursion: delta_t(j) = max_i [ delta_{t-1}(i) * a_{ij} ] * b_j(o_t), with backpointer psi_t(j) = argmax_i delta_{t-1}(i) * a_{ij}
Termination: P* = max_j delta_T(j); recover the best path by following backpointers from argmax_j delta_T(j)
Viterbi trellis for "time flies like an arrow" (each cell: probability, backpointer in parentheses):
      | 1 (BOS) | 2 time     | 3 flies   | 4 like        | 5 an          | 6 arrow
  N   | 0       | 0.05 (BOS) | 0.001 (N) | 0             | 0             | [D,5]*P(N|D)*P(arrow|N) = 0.000001680 (D)
  V   | 0       | 0.01 (BOS) | 0.007 (N) | 0.00014 (N,V) | 0             |
  P   | 0       | 0          | 0         | 0.00007 (V)   | 0             |
  D   | 0       | 0          | 0         | 0             | 0.0000168 (V) |
  BOS | 1.0     | 0          | 0         | 0             | 0             | 0
Decoding
Goal: Identify the highest probability tag sequence.
Issues:
  Features include tags from previous words, which are not immediately available.
  The model uses tag history, so just knowing the highest probability preceding tag is insufficient.
Decoding
Approach: Retain multiple candidate tag sequences; essentially, search through the tagging choices.
Which sequences? All sequences? No. Why not?
How many sequences are there? Branching factor N (# tags), depth T (# words): N^T sequences.
Instead, keep the top K highest probability sequences.
Breadth-First Search (figure): the search tree over "time flies like an arrow", expanded word by word.
Breadth-First Search
Is breadth-first search efficient? No: it tries everything.
Beam Search
Intuition: Breadth-first search explores all paths, but lots of paths are (pretty obviously) bad. Why explore bad paths? Restrict the search to the (apparently) best paths.
Approach: Perform breadth-first search, but retain only the k "best" paths so far (k: beam width).
Beam Search, k=3 (figure): the same search over "time flies like an arrow", keeping only the 3 best partial sequences at each word.
Beam Search
W = {w1, w2, ..., wn}: test sentence
sij: the j-th highest probability sequence up to and including word wi
Algorithm:
  Generate tags for w1, keep the top k, and set s1j accordingly.
  For i = 2 to n:
    Extension: add tags for wi to each s(i-1)j.
    Beam selection: sort the sequences by probability and keep only the top k.
  Return the highest probability sequence, sn1.
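A compact sketch of this beam-search decoder, reusing the hypothetical classifier and feature extractor from the earlier sketches; predict_proba returning a tag-to-probability dict is an assumed interface, not the course's code.

```python
import math

def beam_search_tag(words, classifier, extract_features, word_counts, k=3):
    """Return the best tag sequence under a beam of width k.

    Each beam entry is (log_prob, tags_so_far). classifier.predict_proba
    is assumed to map a feature dict to {tag: probability}.
    """
    beam = [(0.0, [])]                       # start with the empty sequence
    for i in range(len(words)):
        candidates = []
        for logp, tags in beam:
            feats = extract_features(words, tags, i, word_counts)
            for tag, p in classifier.predict_proba(feats).items():
                if p > 0:
                    candidates.append((logp + math.log(p), tags + [tag]))
        # Beam selection: keep only the k most probable extensions
        beam = sorted(candidates, key=lambda c: c[0], reverse=True)[:k]
    best_logp, best_tags = beam[0]
    return best_tags
```

With k = 1 this is essentially the greedy tagger sketched earlier; larger beams trade extra work for a better approximation of the full search.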
POS Tagging
Overall accuracy: 96.3+%
Unseen word accuracy: 86.2%
Comparable to HMM or TBL tagging accuracy
Provides a probabilistic framework, better able to model different information sources
Topline accuracy is 96-97% (limited by annotation consistency issues)
Beam Search
Beam search decoding: a variant of breadth-first search that, at each layer, keeps only the top k sequences.
Advantages:
  Efficient in practice: a beam of 3-5 is near optimal.
  Empirically, the beam explores only 5-10% of the search space (prunes 90-95%).
  Simple to implement: just extensions plus sorting, no dynamic programming.
  Running time: O(kT) [vs. O(N^T)]
Disadvantage: not guaranteed optimal (or complete).
Viterbi Decoding
Viterbi search: exploits dynamic programming and memoization; requires only a small history window; efficient search: O(N^2 T).
Advantage: exact; the optimal solution is returned.
Disadvantage: limited window of context.
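For comparison with the beam-search sketch, here is a minimal Viterbi decoder for a bigram HMM; the transition, emission, and initial probability tables are assumed inputs, not material from the slides.

```python
def viterbi(words, tags, trans, emit, init):
    """Exact decoding for a bigram HMM.

    trans[(t_prev, t)]: P(t | t_prev)   emit[(t, w)]: P(w | t)   init[t]: P(t | BOS)
    Runs in O(N^2 * T) for N tags and T words.
    """
    T = len(words)
    delta = [{} for _ in range(T)]   # delta[i][t]: best prob of a path ending in t at i
    back = [{} for _ in range(T)]    # backpointers

    for t in tags:                   # initialization
        delta[0][t] = init.get(t, 0.0) * emit.get((t, words[0]), 0.0)

    for i in range(1, T):            # recursion
        for t in tags:
            best_prev, best_p = None, 0.0
            for tp in tags:
                p = delta[i-1][tp] * trans.get((tp, t), 0.0)
                if p > best_p:
                    best_prev, best_p = tp, p
            delta[i][t] = best_p * emit.get((t, words[i]), 0.0)
            back[i][t] = best_prev

    # Termination: best final state, then follow backpointers
    last = max(tags, key=lambda t: delta[T-1][t])
    path = [last]
    for i in range(T - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))
```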
Beam vs. Viterbi
  Dynamic programming vs. heuristic search
  Guaranteed optimal vs. no guarantee
  Different context windows
MaxEnt POS Tagging
Part-of-speech tagging by classification:
  Feature design: word and tag context features; orthographic features for rare words
Sequence classification problems: tag features depend on prior classifications
Beam search decoding: efficient but inexact, yet near optimal in practice
Named Entity Recognition
Roadmap
Named Entity Recognition:
  Definition
  Motivation
  Challenges
  Common approach
Named Entity Recognition
Task: Identify named entities in (typically) unstructured text.
Typical entities: person names, locations, organizations, dates, times
Example
Microsoft released Windows Vista in 2007.
(the slide shows the same sentence again with its named entities marked)
Entities are often application/domain specific:
  Business intelligence: products, companies, features
  Biomedical: genes, proteins, diseases, drugs, ...
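One widely used way to mark such spans, not named on this slide, is per-token BIO tagging; the label inventory below is illustrative only, not the lecture's own tag set.

```python
# BIO encoding of the slide's example sentence (illustrative labels).
sentence = ["Microsoft", "released", "Windows", "Vista", "in", "2007", "."]
bio_tags = ["B-ORG", "O", "B-PRODUCT", "I-PRODUCT", "O", "B-DATE", "O"]

for token, tag in zip(sentence, bio_tags):
    print(f"{token}\t{tag}")
```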
Why NER?
Machine translation:
  Person names are typically not translated, though possibly transliterated (e.g. Waldheim).
  Numbers: 9/11 can be a date or a ratio; 911 can be the emergency phone number or a simple number.
Why NER?
Information extraction:
  MUC task: joint ventures/mergers; focus on company names, person names (CEO), valuations.
Information retrieval:
  Named entities are the focus of retrieval; in some data sets, 60+% of queries target NEs.
Text-to-speech:
  206-616-5728: phone numbers must be read differently from other digit strings, and conventions differ by language.
Challenges
Ambiguity:
  "Washington chose ...": D.C., the state, George, etc.
  Most digit strings are ambiguous.
  cat (95 results): CAT(erpillar) stock ticker, Computerized Axial Tomography, Chloramphenicol Acetyl Transferase, or the small furry mammal
Evaluation
Precision, recall, F-measure
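These are the standard definitions, not spelled out on the slide: precision P = true positives / (true positives + false positives), recall R = true positives / (true positives + false negatives), and the F-measure F1 = 2PR / (P + R), the harmonic mean of precision and recall.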
Resources
Online resources:
  Name lists: baby names, who's who, newswire services
  Gazetteers, etc.
Tools:
  LingPipe
  OpenNLP
  Stanford NLP toolkit