Ling 570 Day 16: Sequence Modeling and Named Entity Recognition


Sequence Labeling
Goal: Find the most probable labeling of a sequence
Many sequence labeling tasks:
– POS tagging
– Word segmentation
– Named entity tagging
– Story/spoken sentence segmentation
– Pitch accent detection
– Dialog act tagging

HMM search space
[Figure: trellis over "time flies like an arrow" with tags N, V, P, DT at each position, together with emission and transition probability tables.]

[Figure: Viterbi trellis for "time flies like an arrow" filled with per-cell probabilities and back-pointers, alongside the emission and transition tables.]
(1) Find the max in the last column
(2) Follow back-pointer chains to recover that best sequence

Viterbi

Decoding
Goal: Identify the highest probability tag sequence
Issues:
– Features include tags from previous words, which are not immediately available
– Uses tag history: just knowing the highest probability preceding tag is insufficient

Decoding
Approach: Retain multiple candidate tag sequences
– Essentially search through tagging choices
Which sequences?
– We can't look at all of them – exponentially many!
– Instead, use the top K highest probability sequences

Breadth-First Search
[Figure: search tree over "time flies like an arrow", expanded one word at a time across several slides.]

Breadth-First Search
Is breadth-first search efficient?
– No, it tries everything

Beam Search
Intuition:
– Breadth-first search explores all paths
– Lots of paths are (pretty obviously) bad
– Why explore bad paths?
– Restrict to (apparently best) paths
Approach:
– Perform breadth-first search, but
– Retain only the k 'best' paths thus far
– k: beam width

Beam Search, k=3
[Figure: the same search tree over "time flies like an arrow", keeping only the top 3 paths at each step, built up across several slides.]

Beam Search
W = {w_1, w_2, …, w_n}: test sentence
s_ij: j-th highest probability sequence up to and including word w_i
Algorithm:
– Generate tags for w_1, keep the top k, and set s_1j accordingly
– for i = 2 to n:
  Extension: add tags for w_i to each s_(i-1)j
  Beam selection: sort sequences by probability, keep only the top k
– Return the highest probability sequence s_n1
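The pseudocode above maps directly onto a short implementation. A minimal sketch in Python, assuming a placeholder scoring function score(prev_tags, word, tag) that returns a log-probability for assigning tag to word given the tags chosen so far (e.g. log P(tag | history) from a MaxEnt tagger); this is an illustration, not the course's reference code:

```python
def beam_search(words, tagset, score, k=3):
    """Return the highest-scoring tag sequence under a beam of width k.

    `score(prev_tags, word, tag)` is assumed to return a log-probability
    contribution for assigning `tag` to `word` given the tags chosen so far.
    """
    # Generate tags for w_1 and keep the top k (this is s_1j).
    beam = [((tag,), score((), words[0], tag)) for tag in tagset]
    beam = sorted(beam, key=lambda item: item[1], reverse=True)[:k]

    for word in words[1:]:
        # Extension: add every possible tag to each retained sequence.
        candidates = [
            (prev + (tag,), logp + score(prev, word, tag))
            for prev, logp in beam
            for tag in tagset
        ]
        # Beam selection: sort by probability, keep only the top k.
        beam = sorted(candidates, key=lambda item: item[1], reverse=True)[:k]

    # Highest-probability complete sequence (s_n1 in the slide's notation).
    return beam[0][0]
```

With k large enough to cover every sequence this reduces to exhaustive breadth-first search; with a small k it prunes most of the space, which is the trade-off the following slides quantify.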

POS Tagging
Overall accuracy: 96.3+%
Unseen word accuracy: 86.2%
Comparable to HMM tagging accuracy or TBL
Provides:
– Probabilistic framework
– Better able to model different information sources
Topline accuracy: 96–97%
– Consistency issues

Beam Search
Beam search decoding:
– Variant of breadth-first search
– At each layer, keep only the top k sequences
Advantages:
– Efficient in practice: beam of 3–5 is near optimal
  (empirically, the beam covers 5–10% of the search space, i.e. prunes 90–95%)
– Simple to implement: just extensions + sorting, no dynamic programming
– Running time: O(kT) [vs. O(N^T)]
Disadvantage: not guaranteed optimal (or complete)

Viterbi Decoding
Viterbi search:
– Exploits dynamic programming, memoization
– Requires only a small history window
– Efficient search: O(N^2 T)
Advantage:
– Exact: the optimal solution is returned
Disadvantage:
– Limited window of context
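For contrast with beam search, here is a minimal sketch of the standard bigram-HMM Viterbi decoder (my illustration, not code from the lecture; the init/trans/emit dictionaries of log probabilities are assumed inputs):

```python
def viterbi(words, tagset, init, trans, emit):
    """Exact bigram-HMM decoding in O(N^2 * T) time.

    `init[t]`, `trans[(t_prev, t)]`, and `emit[(t, word)]` are assumed to be
    log probabilities; unseen events fall back to a large negative constant.
    """
    NEG = -1e9
    # Fill the first column of the trellis: (score, back-pointer) per tag.
    table = [{t: (init.get(t, NEG) + emit.get((t, words[0]), NEG), None)
              for t in tagset}]
    for word in words[1:]:
        column = {}
        for t in tagset:
            # Best previous tag for reaching t here (the back-pointer).
            prev = max(tagset,
                       key=lambda p: table[-1][p][0] + trans.get((p, t), NEG))
            score = (table[-1][prev][0] + trans.get((prev, t), NEG)
                     + emit.get((t, word), NEG))
            column[t] = (score, prev)
        table.append(column)

    # (1) Find the max in the last column, (2) follow back-pointer chains.
    best = max(tagset, key=lambda t: table[-1][t][0])
    tags = [best]
    for column in reversed(table[1:]):
        tags.append(column[tags[-1]][1])
    return list(reversed(tags))
```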

Beam vs Viterbi Dynamic programming vs heuristic search Guaranteed optimal vs no guarantee Different context window

MaxEnt POS Tagging
Part-of-speech tagging by classification:
– Feature design: word and tag context features, orthographic features for rare words
Sequence classification problems:
– Tag features depend on prior classifications
Beam search decoding:
– Efficient, but inexact
– Near optimal in practice

NAMED ENTITY RECOGNITION

Roadmap Named Entity Recognition – Definition – Motivation – Challenges – Common Approach

Named Entity Recognition Task: Identify Named Entities in (typically) unstructured text Typical entities: – Person names – Locations – Organizations – Dates – Times

Example Lady Gaga is playing a concert for the Bushes in Texas next September

Example (annotated)
[Lady Gaga]_person is playing a concert for [the Bushes]_person in [Texas]_location [next September]_time

Example from financial news (annotated with person, organization, location, value)
[Ray Dalio]_person's [Bridgewater Associates]_organization is an extremely large and extremely successful hedge fund. Based in [Westport]_location and known for its strong -- some would say cultish -- culture, it has grown to well over [$100 billion]_value in assets under management with little negative impact on its returns.

Entity types may differ by application
News:
– People, countries, organizations, dates, etc.
Medical records:
– Diseases, medications, organisms, organs, etc.

Named Entity Types Common categories

Named Entity Examples For common categories:

Why NER?
Machine translation:
– Lady Gaga is playing a concert for the Bushes in Texas next September
– La señora Gaga es toca un concierto para los arbustos … ("the Bushes" mistranslated as "los arbustos", i.e. the shrubs)
Numbers:
– 9/11: date vs. ratio
– 911: emergency phone number vs. simple number

Why NER?
Information extraction:
– MUC task: joint ventures/mergers
– Focus on company names, person names (CEO), valuations
Information retrieval:
– Named entities are the focus of retrieval
– In some data sets, 60+% of queries target NEs
Text-to-speech:
– Phone numbers (vs. other digit strings); conventions differ by language

Challenges
Ambiguity:
– Washington: D.C.? the state? George? etc.
– Most digit strings
– cat (95 results): CAT(erpillar) stock ticker; Computerized Axial Tomography; Chloramphenicol Acetyl Transferase; small furry mammal

Context & Ambiguity

Evaluation Precision Recall F-measure

Resources
Online:
– Name lists: baby names, who's who, newswire services, census.gov
– Gazetteers
– SEC listings of companies
Tools:
– LingPipe
– OpenNLP
– Stanford NLP toolkit

Approaches to NER
Rule/regex-based:
– Match names/entities in lists
– Regex: e.g. \d\d/\d\d/\d\d matches dates like 11/23/11
– Currency: $\d+\.\d+
Machine learning via sequence labeling:
– Better for names, organizations
Hybrid (combining both)
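As an illustration of the rule/regex side, a minimal sketch using the date and currency patterns from the slide plus a tiny hypothetical gazetteer (a real system would load much larger name lists):

```python
import re

# Patterns from the slide: dates like 11/23/11 and currency amounts.
DATE_RE = re.compile(r"\b\d\d/\d\d/\d\d\b")
CURRENCY_RE = re.compile(r"\$\d+\.\d+")

# A tiny hypothetical gazetteer; real systems load large name lists.
LOCATIONS = {"Texas", "Westport"}

def rule_based_ner(text):
    """Return (start, end, type) spans found by regexes and list lookup."""
    spans = []
    for m in DATE_RE.finditer(text):
        spans.append((m.start(), m.end(), "DATE"))
    for m in CURRENCY_RE.finditer(text):
        spans.append((m.start(), m.end(), "CURRENCY"))
    for m in re.finditer(r"\b[A-Z][a-z]+\b", text):   # capitalized tokens
        if m.group() in LOCATIONS:
            spans.append((m.start(), m.end(), "LOC"))
    return sorted(spans)

print(rule_based_ner("Paid $10.50 in Texas on 11/23/11"))
# [(5, 11, 'CURRENCY'), (15, 20, 'LOC'), (24, 32, 'DATE')]
```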

NER AS SEQUENCE LABELING

NER as Classification Task
Instance: token
Labels:
– Position: B(eginning), I(nside), O(utside)
– NER types: PER, ORG, LOC, NUM
– Label: Type-Position, e.g. PER-B, PER-I, O, …
– How many tags? (|NER types| × 2) + 1
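With the four types listed above, that works out to nine tags; a quick sketch of the enumeration:

```python
NER_TYPES = ["PER", "ORG", "LOC", "NUM"]

# One B(eginning) and one I(nside) tag per type, plus a single O(utside) tag.
tags = [f"{t}-{pos}" for t in NER_TYPES for pos in ("B", "I")] + ["O"]

print(tags)
# ['PER-B', 'PER-I', 'ORG-B', 'ORG-I', 'LOC-B', 'LOC-I', 'NUM-B', 'NUM-I', 'O']
print(len(tags))  # (|NER types| x 2) + 1 = 9
```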

NER as Classification: Features
What information can we use for NER?
– Predictive tokens: e.g. MD, Rev, Inc., …
How general are these features?
– Language? Genre? Domain?

NER as Classification: Shape Features
Shape types:
– lower: e.g. e. e. cummings (all lower case)
– capitalized: e.g. Washington (first letter uppercase)
– all caps: e.g. WHO (all letters capitalized)
– mixed case: e.g. eBay (mixed upper and lower case)
– capitalized with period: e.g. H.
– ends with digit: e.g. A9
– contains hyphen: e.g. H-P
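A minimal sketch of how a token's shape type might be computed as a feature (the category names and the precedence among overlapping cases are my own choices; the lecture does not prescribe an implementation):

```python
import re

def word_shape(token):
    """Map a token to one of the coarse shape categories from the slide."""
    if "-" in token:
        return "contains-hyphen"        # e.g. H-P
    if token and token[-1].isdigit():
        return "ends-with-digit"        # e.g. A9
    if re.fullmatch(r"[A-Z]\.", token):
        return "capitalized-period"     # e.g. H.
    if token.isupper():
        return "all-caps"               # e.g. WHO
    if token.islower():
        return "lower"                  # e.g. cummings
    if token[:1].isupper() and token[1:].islower():
        return "capitalized"            # e.g. Washington
    return "mixed-case"                 # e.g. eBay

for t in ["cummings", "Washington", "WHO", "eBay", "H.", "A9", "H-P"]:
    print(t, "->", word_shape(t))
```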

Example Instance Representation

Sequence Labeling Example
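The original slide shows this as a figure. As a stand-in, here is how the earlier Lady Gaga example could look as a BIO-labeled token sequence (a sketch; TIME is my addition for "next September", and whether "the" belongs inside the Bushes span is a convention choice):

```python
example = [
    ("Lady", "PER-B"), ("Gaga", "PER-I"), ("is", "O"), ("playing", "O"),
    ("a", "O"), ("concert", "O"), ("for", "O"), ("the", "O"),
    ("Bushes", "PER-B"), ("in", "O"), ("Texas", "LOC-B"),
    ("next", "TIME-B"), ("September", "TIME-I"),
]
```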

Evaluation
System: output of automatic tagging
Gold standard: true tags
Precision: # correct chunks / # system chunks
Recall: # correct chunks / # gold chunks
F-measure: F1 = 2PR / (P + R) balances precision & recall
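A minimal sketch of chunk-level scoring under these definitions, assuming gold and system entities are represented as sets of (start, end, type) spans:

```python
def prf(gold_chunks, system_chunks):
    """Precision, recall, and F1 over exact-match chunks (span + type)."""
    correct = len(gold_chunks & system_chunks)
    precision = correct / len(system_chunks) if system_chunks else 0.0
    recall = correct / len(gold_chunks) if gold_chunks else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = {(0, 2, "PER"), (8, 9, "PER"), (10, 11, "LOC")}
system = {(0, 2, "PER"), (10, 11, "ORG")}   # one correct, one wrong type
print(prf(gold, system))  # approximately (0.5, 0.33, 0.4)
```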

Evaluation
Standard measures:
– Precision, recall, F-measure
– Computed on entity types (CoNLL evaluation)
Classifiers vs. evaluation measures:
– Classifiers optimize tag accuracy
– Most common tag? O – most tokens aren't NEs
– Evaluation measures focus on NEs
State of the art:
– Standard tasks: PER, LOC: 0.92; ORG: 0.84

Hybrid Approaches
Practical systems:
– Exploit lists, rules, learning, …
– Multi-pass:
  Early passes: high precision, low recall
  Later passes: noisier sequence learning
Hybrid system:
– High-precision rules tag unambiguous mentions
  (use string matching to capture substring matches)
– Tag items from domain-specific name lists
– Apply a sequence labeler
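A minimal sketch of such a multi-pass pipeline; the pass structure follows the slide, but the callables (rule_matcher, list_matcher, sequence_labeler) are placeholders, not a real library API:

```python
def hybrid_ner(tokens, rule_matcher, list_matcher, sequence_labeler):
    """Multi-pass NER: high-precision rules, then name lists, then a learned labeler.

    All three arguments are assumed callables: the matchers return
    (start, end, type) spans over the token list, and sequence_labeler
    returns one BIO tag per token.
    """
    labels = ["O"] * len(tokens)

    def tag_span(start, end, etype):
        labels[start] = f"{etype}-B"
        for i in range(start + 1, end):
            labels[i] = f"{etype}-I"

    # Pass 1: high-precision rules tag unambiguous mentions.
    for start, end, etype in rule_matcher(tokens):
        tag_span(start, end, etype)

    # Pass 2: domain-specific name lists, without overwriting pass 1.
    for start, end, etype in list_matcher(tokens):
        if all(l == "O" for l in labels[start:end]):
            tag_span(start, end, etype)

    # Pass 3: noisier sequence labeler fills in whatever is still untagged.
    for i, tag in enumerate(sequence_labeler(tokens)):
        if labels[i] == "O":
            labels[i] = tag
    return labels
```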