MaxEnt POS Tagging
Shallow Processing Techniques for NLP
Ling570, November 21, 2011
Roadmap
  MaxEnt POS tagging features
  Beam search vs. Viterbi
  Named entity tagging
MaxEnt Feature Template
Words:
  Current word: w0
  Previous word: w-1
  Word two back: w-2
  Next word: w+1
  Next next word: w+2
Tags:
  Previous tag: t-1
  Previous tag pair: t-2 t-1
How many features? 5|V| + |T| + |T|^2
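For a rough sense of scale (illustrative numbers, not from the slides): with the 45-tag Penn Treebank tagset and a hypothetical 40,000-word vocabulary, this template yields roughly 5*40,000 + 45 + 45^2 = 202,070 candidate features, before any feature count cutoffs.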
Representing Orthographic Patterns
How can we represent morphological patterns as features?
Character sequences: which sequences?
  Prefixes/suffixes, e.g. suffix(wi) = ing or prefix(wi) = well
Specific characters or character types: which?
  is-capitalized
  is-hyphenated
MaxEnt Feature Set

Examples
well-heeled: rare word, tagged JJ
  prevW=about:1  prev2W=stories:1  nextW=communities:1  next2W=and:1
  pref=w:1  pref=we:1  pref=wel:1  pref=well:1
  suff=d:1  suff=ed:1  suff=led:1  suff=eled:1
  is-hyphenated:1
  preT=IN:1  pre2T=NNS-IN:1
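As a concrete illustration of how such feature vectors could be produced, here is a minimal sketch, not the course's actual feature extractor: the feature names mirror the example above, while the rare-word threshold and helper names are assumptions.

```python
# Minimal sketch of MaxEnt-style feature extraction for POS tagging.
# Feature names (prevW, pref, suff, is-hyphenated, preT, ...) mirror the
# slide's example; the rare-word threshold is an assumed cutoff.

RARE_THRESHOLD = 5  # orthographic features fire only for rare words

def extract_features(words, tags, i, word_counts):
    """Return a dict of binary features for position i.

    words: tokens of the sentence
    tags:  tags assigned so far (used for the tag-history features)
    """
    w = words[i]
    feats = {}

    # Word-context features (w-2 .. w+2)
    feats[f"curW={w}"] = 1
    if i >= 1: feats[f"prevW={words[i-1]}"] = 1
    if i >= 2: feats[f"prev2W={words[i-2]}"] = 1
    if i + 1 < len(words): feats[f"nextW={words[i+1]}"] = 1
    if i + 2 < len(words): feats[f"next2W={words[i+2]}"] = 1

    # Tag-history features (t-1 and the t-2 t-1 pair)
    prev_t = tags[i-1] if i >= 1 else "BOS"
    prev2_t = tags[i-2] if i >= 2 else "BOS"
    feats[f"preT={prev_t}"] = 1
    feats[f"pre2T={prev2_t}-{prev_t}"] = 1

    # Orthographic features for rare words: prefixes/suffixes up to length 4
    if word_counts.get(w, 0) < RARE_THRESHOLD:
        for k in range(1, min(4, len(w)) + 1):
            feats[f"pref={w[:k]}"] = 1
            feats[f"suff={w[-k:]}"] = 1
        if "-" in w:
            feats["is-hyphenated"] = 1
        if w[:1].isupper():
            feats["is-capitalized"] = 1

    return feats

# Example: features for 'well-heeled' in a fragment of the slide's sentence
words = ["stories", "about", "well-heeled", "communities", "and"]
tags = ["NNS", "IN"]   # tags already assigned to the first two words
print(extract_features(words, tags, 2, word_counts={}))
```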
Finding Features
In training, where do features come from? Where do features come from in testing?
In testing, tag features come from the classification of the prior word.
Training examples from "Time flies like an arrow":
  x1 (Time):  w-1 = (none),  w0 = Time,   w-1 w0 = (none),       w0 w+1 = Time flies,  t-1 = BOS,  y = N
  x2 (flies): w-1 = Time,    w0 = flies,  w-1 w0 = Time flies,   w0 w+1 = flies like,  t-1 = N,    y = N
  x3 (like):  w-1 = flies,   w0 = like,   w-1 w0 = flies like,   w0 w+1 = like an,     t-1 = N,    y = V
Sequence Labeling
Goal: Find the most probable labeling of a sequence.
Many sequence labeling tasks:
  POS tagging
  Word segmentation
  Named entity tagging
  Story/spoken sentence segmentation
  Pitch accent detection
  Dialog act tagging
Solving Sequence Labeling
Direct: use a sequence labeling algorithm, e.g. HMM, CRF, MEMM.
Via classification: use a classification algorithm.
Issue: what about tag features? Features that use class labels depend on the classification itself.
Solutions (the last of these is sketched below):
  Don't use features that depend on class labels (loses information)
  Use another process to generate class labels, then use them
  Perform incremental classification to get labels, and use those labels as features for instances later in the sequence
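A minimal sketch of the incremental-classification idea, reusing the feature extractor sketched earlier; the classifier object and its predict method are stand-ins for any trained MaxEnt model, not the course's code. The point is only that each prediction feeds the tag features of the next position.

```python
# Greedy (incremental) sequence classification: earlier predictions supply
# the tag-history features for later positions. 'classifier' is a stand-in
# for any trained model exposing predict(features) -> tag.

def greedy_tag(words, classifier, extract_features, word_counts):
    tags = []
    for i in range(len(words)):
        feats = extract_features(words, tags, i, word_counts)  # uses tags[:i]
        tags.append(classifier.predict(feats))                 # commit to one tag
    return tags
```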
HMM Trellis (figure): states over the sentence "time flies like an arrow". Adapted from F. Xia.
Viterbi
(standard bigram HMM formulation: a = transition probabilities, b = emission probabilities, pi = initial probabilities)
Initialization: delta_1(j) = pi_j * b_j(o_1)
Recursion: delta_t(j) = max_i [ delta_{t-1}(i) * a_{ij} ] * b_j(o_t), with backpointer psi_t(j) = argmax_i delta_{t-1}(i) * a_{ij}
Termination: P* = max_j delta_T(j); recover the best path by following backpointers from argmax_j delta_T(j)
Viterbi trellis for "time flies like an arrow" (each cell: probability, backpointer in parentheses):
      | 1 (BOS) | 2 time     | 3 flies   | 4 like        | 5 an          | 6 arrow
  N   | 0       | 0.05 (BOS) | 0.001 (N) | 0             | 0             | [D,5]*P(N|D)*P(arrow|N) = 0.000001680 (D)
  V   | 0       | 0.01 (BOS) | 0.007 (N) | 0.00014 (N,V) | 0             |
  P   | 0       | 0          | 0         | 0.00007 (V)   | 0             |
  D   | 0       | 0          | 0         | 0             | 0.0000168 (V) |
  BOS | 1.0     | 0          | 0         | 0             | 0             | 0
Decoding
Goal: Identify the highest probability tag sequence.
Issues:
  Features include tags from previous words, which are not immediately available.
  The model uses tag history, so just knowing the highest probability preceding tag is insufficient.
Decoding
Approach: Retain multiple candidate tag sequences; essentially, search through the tagging choices.
Which sequences? All sequences? No. Why not?
How many sequences are there? Branching factor N (# tags), depth T (# words): N^T sequences.
Instead, keep the top K highest probability sequences.
Breadth-First Search (figure): the search tree over "time flies like an arrow", expanded word by word.
Breadth-First Search
Is breadth-first search efficient? No: it tries everything.
Beam Search
Intuition: Breadth-first search explores all paths, but lots of paths are (pretty obviously) bad. Why explore bad paths? Restrict the search to the (apparently) best paths.
Approach: Perform breadth-first search, but retain only the k "best" paths so far (k: beam width).
Beam Search, k=3 (figure): the same search over "time flies like an arrow", keeping only the 3 best partial sequences at each word.
Beam Search
W = {w1, w2, ..., wn}: test sentence
sij: the j-th highest probability sequence up to and including word wi
Algorithm:
  Generate tags for w1, keep the top k, and set s1j accordingly.
  For i = 2 to n:
    Extension: add tags for wi to each s(i-1)j.
    Beam selection: sort the sequences by probability and keep only the top k.
  Return the highest probability sequence, sn1.
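A compact sketch of this beam-search decoder, reusing the hypothetical classifier and feature extractor from the earlier sketches; predict_proba returning a tag-to-probability dict is an assumed interface, not the course's code.

```python
import math

def beam_search_tag(words, classifier, extract_features, word_counts, k=3):
    """Return the best tag sequence under a beam of width k.

    Each beam entry is (log_prob, tags_so_far). classifier.predict_proba
    is assumed to map a feature dict to {tag: probability}.
    """
    beam = [(0.0, [])]                       # start with the empty sequence
    for i in range(len(words)):
        candidates = []
        for logp, tags in beam:
            feats = extract_features(words, tags, i, word_counts)
            for tag, p in classifier.predict_proba(feats).items():
                if p > 0:
                    candidates.append((logp + math.log(p), tags + [tag]))
        # Beam selection: keep only the k most probable extensions
        beam = sorted(candidates, key=lambda c: c[0], reverse=True)[:k]
    best_logp, best_tags = beam[0]
    return best_tags
```

With k = 1 this is essentially the greedy tagger sketched earlier; larger beams trade extra work for a better approximation of the full search.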
POS Tagging
Overall accuracy: 96.3+%
Unseen word accuracy: 86.2%
Comparable to HMM or TBL tagging accuracy
Provides a probabilistic framework, better able to model different information sources
Topline accuracy is 96-97% (limited by annotation consistency issues)
Beam Search
Beam search decoding: a variant of breadth-first search that, at each layer, keeps only the top k sequences.
Advantages:
  Efficient in practice: a beam of 3-5 is near optimal.
  Empirically, the beam explores only 5-10% of the search space (prunes 90-95%).
  Simple to implement: just extensions plus sorting, no dynamic programming.
  Running time: O(kT) [vs. O(N^T)]
Disadvantage: not guaranteed optimal (or complete).
Viterbi Decoding
Viterbi search: exploits dynamic programming and memoization; requires only a small history window; efficient search: O(N^2 T).
Advantage: exact; the optimal solution is returned.
Disadvantage: limited window of context.
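For comparison with the beam-search sketch, here is a minimal Viterbi decoder for a bigram HMM; the transition, emission, and initial probability tables are assumed inputs, not material from the slides.

```python
def viterbi(words, tags, trans, emit, init):
    """Exact decoding for a bigram HMM.

    trans[(t_prev, t)]: P(t | t_prev)   emit[(t, w)]: P(w | t)   init[t]: P(t | BOS)
    Runs in O(N^2 * T) for N tags and T words.
    """
    T = len(words)
    delta = [{} for _ in range(T)]   # delta[i][t]: best prob of a path ending in t at i
    back = [{} for _ in range(T)]    # backpointers

    for t in tags:                   # initialization
        delta[0][t] = init.get(t, 0.0) * emit.get((t, words[0]), 0.0)

    for i in range(1, T):            # recursion
        for t in tags:
            best_prev, best_p = None, 0.0
            for tp in tags:
                p = delta[i-1][tp] * trans.get((tp, t), 0.0)
                if p > best_p:
                    best_prev, best_p = tp, p
            delta[i][t] = best_p * emit.get((t, words[i]), 0.0)
            back[i][t] = best_prev

    # Termination: best final state, then follow backpointers
    last = max(tags, key=lambda t: delta[T-1][t])
    path = [last]
    for i in range(T - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))
```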
Beam vs. Viterbi
  Dynamic programming vs. heuristic search
  Guaranteed optimal vs. no guarantee
  Different context windows
MaxEnt POS Tagging
Part-of-speech tagging by classification:
  Feature design: word and tag context features; orthographic features for rare words
Sequence classification problems: tag features depend on prior classifications
Beam search decoding: efficient but inexact, yet near optimal in practice
Named Entity Recognition
Roadmap
Named Entity Recognition:
  Definition
  Motivation
  Challenges
  Common approach
Named Entity Recognition
Task: Identify named entities in (typically) unstructured text.
Typical entities: person names, locations, organizations, dates, times
Example
Microsoft released Windows Vista in 2007.
(the slide shows the same sentence again with its named entities marked)
Entities are often application/domain specific:
  Business intelligence: products, companies, features
  Biomedical: genes, proteins, diseases, drugs, ...
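One widely used way to mark such spans, not named on this slide, is per-token BIO tagging; the label inventory below is illustrative only, not the lecture's own tag set.

```python
# BIO encoding of the slide's example sentence (illustrative labels).
sentence = ["Microsoft", "released", "Windows", "Vista", "in", "2007", "."]
bio_tags = ["B-ORG", "O", "B-PRODUCT", "I-PRODUCT", "O", "B-DATE", "O"]

for token, tag in zip(sentence, bio_tags):
    print(f"{token}\t{tag}")
```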
Why NER?
Machine translation:
  Person names are typically not translated, though possibly transliterated (e.g. Waldheim).
  Numbers: 9/11 can be a date or a ratio; 911 can be the emergency phone number or a simple number.
Why NER?
Information extraction:
  MUC task: joint ventures/mergers; focus on company names, person names (CEO), valuations.
Information retrieval:
  Named entities are the focus of retrieval; in some data sets, 60+% of queries target NEs.
Text-to-speech:
  206-616-5728: phone numbers must be read differently from other digit strings, and conventions differ by language.
Challenges
Ambiguity:
  "Washington chose ...": D.C., the state, George, etc.
  Most digit strings are ambiguous.
  cat (95 results): CAT(erpillar) stock ticker, Computerized Axial Tomography, Chloramphenicol Acetyl Transferase, or the small furry mammal
Evaluation
Precision, recall, F-measure
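These are the standard definitions, not spelled out on the slide: precision P = true positives / (true positives + false positives), recall R = true positives / (true positives + false negatives), and the F-measure F1 = 2PR / (P + R), the harmonic mean of precision and recall.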
Resources
Online resources:
  Name lists: baby names, who's who, newswire services
  Gazetteers, etc.
Tools:
  LingPipe
  OpenNLP
  Stanford NLP toolkit