Ling 570 Day 16: Sequence Modeling & Named Entity Recognition
Sequence Labeling Goal: Find most probable labeling of a sequence Many sequence labeling tasks – POS tagging – Word segmentation – Named entity tagging – Story/spoken sentence segmentation – Pitch accent detection – Dialog act tagging
HMM search space: trellis for "time flies like an arrow", one column per word and one state per POS tag (N, V, P, DT), weighted by transition and emission probabilities. [Figure: HMM trellis]
Viterbi on the same trellis: fill each cell with the best path probability and a back-pointer, then (1) find the max in the last column, (2) follow back-pointer chains to recover that best sequence. [Figure: Viterbi trellis with back-pointers]
Viterbi
Decoding Goal: Identify the highest-probability tag sequence Issues: – Features include tags from previous words, which are not immediately available – Uses tag history: knowing only the single highest-probability preceding tag is insufficient
Decoding Approach: Retain multiple candidate tag sequences – Essentially search through tagging choices Which sequences? – We can’t look at all of them – exponentially many! – Instead, use top K highest probability sequences
Breadth-First Search: expand every tag sequence word by word for "time flies like an arrow". [Figure: search tree grows by a full layer at each word]
Breadth-first Search Is breadth-first search efficient? – No, it tries everything
Beam Search Intuition: – Breadth-first search explores all paths – Lots of paths are (pretty obviously) bad – Why explore bad paths? – Restrict to (apparently best) paths Approach: – Perform breadth-first search, but – Retain only k ‘best’ paths thus far – k: beam width
Beam Search, k=3: the same expansion for "time flies like an arrow", but only the 3 best partial sequences are kept at each word. [Figure: pruned search tree]
Beam Search W = {w_1, w_2, …, w_n}: test sentence; s_ij: j-th highest-probability sequence up to and including word w_i. Generate tags for w_1, keep the top k, and set the s_1j accordingly. For i = 2 to n: – Extension: add tags for w_i to each s_(i-1)j – Beam selection: sort sequences by probability, keep only the top k. Return the highest-probability sequence s_n1.
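A minimal Python sketch of this beam-search procedure, assuming a hypothetical `score(prev_tags, word, tag)` function that returns a log-probability from the underlying model (e.g. a MaxEnt tagger); it is not the lecture's reference implementation.

```python
def beam_search(words, tagset, score, k=3):
    """Return (tags, log-prob) of the best sequence found under beam width k.

    score(prev_tags, word, tag) is assumed to return the log-probability of
    assigning `tag` to `word` given the previously chosen tags.
    """
    # Initialize: score every tag for the first word, keep the top k.
    beam = sorted(
        (([tag], score((), words[0], tag)) for tag in tagset),
        key=lambda x: x[1], reverse=True)[:k]

    # Extension: add every tag for the next word to each retained sequence,
    # then beam selection: sort by probability and prune back to the top k.
    for word in words[1:]:
        candidates = [
            (tags + [tag], logp + score(tuple(tags), word, tag))
            for tags, logp in beam
            for tag in tagset
        ]
        beam = sorted(candidates, key=lambda x: x[1], reverse=True)[:k]

    return beam[0]  # highest-probability sequence s_n1
```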
POS Tagging Overall accuracy: 96.3+% Unseen word accuracy: 86.2% Comparable to HMM tagging accuracy or TBL Provides – Probabilistic framework – Better able to model different info sources Topline accuracy 96-97% – Consistency issues
Beam Search Beam search decoding: – Variant of breadth-first search – At each layer, keep only the top k sequences Advantages: – Efficient in practice: beam of 3-5 is near optimal Empirically, the beam explores 5-10% of the search space, i.e. prunes 90-95% – Simple to implement Just extensions + sorting, no dynamic programming – Running time: O(kT) [vs. O(N^T)] Disadvantage: Not guaranteed optimal (or complete)
Viterbi Decoding Viterbi search: – Exploits dynamic programming, memoization Requires small history window – Efficient search: O(N^2 T) Advantage: – Exact: optimal solution is returned Disadvantage: – Limited window of context
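A compact sketch of bigram-HMM Viterbi decoding in Python, assuming hypothetical log-probability tables `trans_logp`, `emit_logp`, and `init_logp` from an already-trained model; names and data layout are my own choices.

```python
def viterbi(words, tagset, trans_logp, emit_logp, init_logp):
    """Exact decoding by dynamic programming: O(N^2 * T) for N tags, T words.

    trans_logp[prev][tag], emit_logp[tag][word], init_logp[tag] are assumed
    log-probability lookups (dicts of dicts) from a trained HMM.
    """
    NEG_INF = float("-inf")
    # delta[tag] = best log-prob of any path ending in `tag` at the current word;
    # backptr[t][tag] = best previous tag for that path (the memoized history).
    delta = {tag: init_logp[tag] + emit_logp[tag].get(words[0], NEG_INF)
             for tag in tagset}
    backptr = []

    for word in words[1:]:
        new_delta, bp = {}, {}
        for tag in tagset:
            # Only the single best-scoring predecessor matters (small history window).
            best_prev = max(tagset, key=lambda p: delta[p] + trans_logp[p][tag])
            new_delta[tag] = (delta[best_prev] + trans_logp[best_prev][tag]
                              + emit_logp[tag].get(word, NEG_INF))
            bp[tag] = best_prev
        delta = new_delta
        backptr.append(bp)

    # (1) Find the max in the last column, (2) follow back-pointer chains.
    last = max(tagset, key=lambda t: delta[t])
    tags = [last]
    for bp in reversed(backptr):
        tags.append(bp[tags[-1]])
    return list(reversed(tags))
```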
Beam vs Viterbi Heuristic search vs. dynamic programming Not guaranteed optimal vs. guaranteed optimal Different context windows
MaxEnt POS Tagging Part of speech tagging by classification: – Feature design word and tag context features orthographic features for rare words Sequence classification problems: – Tag features depend on prior classification Beam search decoding – Efficient, but inexact Near optimal in practice
NAMED ENTITY RECOGNITION
Roadmap Named Entity Recognition – Definition – Motivation – Challenges – Common Approach
Named Entity Recognition Task: Identify Named Entities in (typically) unstructured text Typical entities: – Person names – Locations – Organizations – Dates – Times
Example Lady Gaga is playing a concert for the Bushes in Texas next September
Example (annotated): [Lady Gaga]_PERSON is playing a concert for the [Bushes]_PERSON in [Texas]_LOCATION next [September]_TIME
Example from financial news (annotated): [Ray Dalio]_PERSON’s [Bridgewater Associates]_ORGANIZATION is an extremely large and extremely successful hedge fund. Based in [Westport]_LOCATION and known for its strong -- some would say cultish -- culture, it has grown to well over [$100 billion]_VALUE in assets under management with little negative impact on its returns.
Entity types may differ by application News: – People, countries, organizations, dates, etc. Medical records: – Diseases, medications, organisms, organs, etc.
Named Entity Types and Examples: common categories (people, organizations, locations, dates, times) with example mentions of each. [Table on original slides]
Why NER? Machine translation: – Lady Gaga is playing a concert for the Bushes in Texas next September – La señora Gaga es toca un concierto para los arbustos … (the name “Bushes” mistranslated as “los arbustos”, i.e. shrubs) – Numbers: 9/11: date vs. ratio; 911: emergency phone number vs. simple number
Why NER? Information extraction: – MUC task: Joint ventures/mergers Focus on Company names, Person Names (CEO), valuations Information retrieval: – Named entities focus of retrieval – In some data sets, 60+% queries target NEs Text-to-speech: – Phone numbers (vs other digit strings), differ by language
Challenges Ambiguity – Washington: D.C., the state, George, etc. – Most digit strings – cat (95 results): CAT(erpillar) stock ticker; Computerized Axial Tomography; Chloramphenicol Acetyl Transferase; small furry mammal
Context & Ambiguity
Evaluation Precision Recall F-measure
Resources Online: – Name lists: baby names, Who’s Who, newswire services, census.gov – Gazetteers – SEC listings of companies Tools: – LingPipe – OpenNLP – Stanford NLP toolkit
Approaches to NER Rule/Regex-based: – Match names/entities in lists – Regex: e.g. \d\d/\d\d/\d\d matches 11/23/11 – Currency: \$\d+\.\d+ Machine Learning via Sequence Labeling: – Better for names, organizations Hybrid
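A tiny sketch of the rule/regex approach in Python: the date and currency patterns follow the slide, while the name list is a hypothetical stand-in for a gazetteer lookup.

```python
import re

DATE_RE = re.compile(r"\b\d\d/\d\d/\d\d\b")          # e.g. 11/23/11
MONEY_RE = re.compile(r"\$\d+\.\d+")                 # e.g. $9.99
NAME_LIST = {"Lady Gaga", "Bridgewater Associates"}  # assumed gazetteer entries

def rule_tag(text):
    """Return (start, end, type) spans found by regexes and list matching."""
    spans = []
    for m in DATE_RE.finditer(text):
        spans.append((m.start(), m.end(), "DATE"))
    for m in MONEY_RE.finditer(text):
        spans.append((m.start(), m.end(), "MONEY"))
    for name in NAME_LIST:
        for m in re.finditer(re.escape(name), text):
            spans.append((m.start(), m.end(), "NE"))
    return sorted(spans)

print(rule_tag("Lady Gaga paid $9.99 on 11/23/11"))
```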
NER AS SEQUENCE LABELING
NER as Classification Task Instance: token Labels: – Position: B(eginning), I(nside), O(utside) – NER types: PER, ORG, LOC, NUM – Label: Type-Position, e.g. PER-B, PER-I, O, … – How many tags? (|NER types| × 2) + 1
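A quick check of the tag-count arithmetic in Python, using the four NER types from the slide (the string format "TYPE-POSITION" mirrors the PER-B / PER-I examples):

```python
# Cross each NER type with the B and I positions, plus the single O tag.
ner_types = ["PER", "ORG", "LOC", "NUM"]
tags = ["%s-%s" % (t, p) for t in ner_types for p in ("B", "I")] + ["O"]
print(tags)
print(len(tags))  # (|NER types| * 2) + 1 = 9
```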
NER as Classification: Features What information can we use for NER? – Predictive tokens: e.g. MD, Rev, Inc., … How general are these features? – Language? Genre? Domain?
NER as Classification: Shape Features Shape types: – lower: e.g. e. e. cummings (all lower case) – capitalized: e.g. Washington (first letter uppercase) – all caps: e.g. WHO (all letters capitalized) – mixed case: e.g. eBay – capitalized with period: e.g. H. – ends with digit: e.g. A9 – contains hyphen: e.g. H-P (a sample shape function follows below)
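A rough sketch of a shape-feature function covering the classes above; the class names and the precedence order of the checks are my own choices, not the lecture's.

```python
import re

def word_shape(token):
    """Map a token to one of the shape classes listed on the slide."""
    if "-" in token:
        return "contains-hyphen"          # H-P
    if token[-1].isdigit() and token[0].isalpha():
        return "ends-with-digit"          # A9
    if re.fullmatch(r"[A-Z]\.", token):
        return "capitalized-with-period"  # H.
    if token.islower():
        return "lower"                    # e. e. cummings
    if token.isupper():
        return "all-caps"                 # WHO
    if token[0].isupper() and token[1:].islower():
        return "capitalized"              # Washington
    return "mixed-case"                   # eBay

print([word_shape(t) for t in ["Washington", "WHO", "eBay", "H.", "A9", "H-P"]])
# ['capitalized', 'all-caps', 'mixed-case', 'capitalized-with-period',
#  'ends-with-digit', 'contains-hyphen']
```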
Example Instance Representation [example feature vector shown on slide]
Sequence Labeling Example
Evaluation System: output of automatic tagging Gold standard: true tags Precision: # correct chunks / # system chunks Recall: # correct chunks / # gold chunks F-measure: F1 = 2PR / (P + R) balances precision & recall
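A small sketch of chunk-level scoring in Python; chunks are assumed to be (start, end, type) triples, and a system chunk counts as correct only if it matches a gold chunk exactly.

```python
def chunk_prf(gold_chunks, sys_chunks):
    """Return chunk-level (precision, recall, F1)."""
    gold, system = set(gold_chunks), set(sys_chunks)
    correct = len(gold & system)
    p = correct / len(system) if system else 0.0
    r = correct / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# One correct PER chunk, one spurious LOC chunk, one missed ORG chunk:
print(chunk_prf([(0, 2, "PER"), (5, 6, "ORG")],
                [(0, 2, "PER"), (3, 4, "LOC")]))  # (0.5, 0.5, 0.5)
```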
Evaluation Standard measures: – Precision, Recall, F-measure – Computed on entity types (CoNLL evaluation) Classifiers vs. evaluation measures: – Classifiers optimize tag accuracy Most common tag? – O – most tokens aren’t NEs – Evaluation measures focus on NEs State of the art: – Standard tasks: PER, LOC: 0.92; ORG: 0.84
Hybrid Approaches Practical systems – Exploit lists, rules, learning… – Multi-pass: Early passes: high precision, low recall Later passes: noisier sequence learning Hybrid system: – High-precision rules tag unambiguous mentions Use string matching to capture substring matches – Tag items from domain-specific name lists – Apply sequence labeler (see the sketch below)
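A hedged sketch of such a multi-pass pipeline in Python; `gazetteer` and `sequence_labeler` are hypothetical inputs (a name-to-tag dictionary and a trained statistical tagger, respectively), not components defined in the lecture.

```python
def hybrid_ner(tokens, gazetteer, sequence_labeler):
    """Multi-pass NER: high-precision list matching first, then a learned
    sequence labeler fills in whatever the rules left untagged."""
    tags = ["O"] * len(tokens)

    # Pass 1: tag unambiguous mentions from a domain-specific name list.
    for i, tok in enumerate(tokens):
        if tok in gazetteer:
            tags[i] = gazetteer[tok]          # e.g. "ORG-B"

    # Pass 2: the sequence labeler predicts a tag per token (noisier, but
    # higher recall); keep the rule-based tags where they fired.
    predicted = sequence_labeler(tokens)      # assumed: one tag per token
    return [p if t == "O" else t for t, p in zip(tags, predicted)]
```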