Named Entity Tagging Thanks to Dan Jurafsky, Jim Martin, Ray Mooney, Tom Mitchell for slides.

Outline
Named Entities and the basic idea
IOB Tagging
A new classifier: Logistic Regression
  Linear regression
  Logistic regression
  Multinomial logistic regression = MaxEnt
Why classifiers aren’t as good as sequence models
A new sequence model:
  MEMM = Maximum Entropy Markov Model

Named Entity Tagging
CHICAGO (AP) — Citing high fuel prices, United Airlines said Friday it has increased fares by $6 per round trip on flights to some cities also served by lower-cost carriers. American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said. United, a unit of UAL, said the increase took effect Thursday night and applies to most routes where it competes against discount carriers, such as Chicago to Dallas and Atlanta and Denver to San Francisco, Los Angeles and New York.
Slide from Jim Martin

Named Entity Recognition
Find the named entities and classify them by type
Typical approach
  Acquire training data
  Encode using IOB labeling
  Train a sequential supervised classifier
  Augment with pre- and post-processing using available list resources (census data, gazetteers, etc.)
Slide from Jim Martin

Temporal and Numerical Expressions
Temporals
  Find all the temporal expressions
  Normalize them based on some reference point
Numerical Expressions
  Find all the expressions
  Classify by type
  Normalize
Slide from Jim Martin

NE Types Slide from Jim Martin

NE Types: Examples Slide from Jim Martin

Ambiguity

Biomedical Entities
Disease
Symptom
Drug
Body Part
Treatment
Enzyme
Protein
Difficulty: discontiguous or overlapping mentions
  Abdomen is soft, nontender, nondistended, negative bruits

NER Approaches
As with partial parsing and chunking, there are two basic approaches (and hybrids)
  Rule-based (regular expressions)
    Lists of names
    Patterns to match things that look like names
    Patterns to match the environments that classes of names tend to occur in
  ML-based approaches
    Get annotated training data
    Extract features
    Train systems to replicate the annotation
Slide from Jim Martin

ML Approach Slide from Jim Martin

Encoding for Sequence Labeling
We can use IOB encoding:
  …United  Airlines  said  Friday  it  has  increased
   B_ORG   I_ORG     O     O       O   O    O

  the  move  ,  spokesman  Tim    Wagner  said  .
  O    O     O  O          B_PER  I_PER   O
How many tags?
  For N classes we have 2*N+1 tags
    An I and B for each class and one O for no-class
Each token in a text gets a tag
Can use simpler IO tagging if what?
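
The encoding above can be sketched in a few lines of Python. This is only an illustration: the function names and the (start, end, label) span format are assumptions made for the example, not part of the original slides.

```python
def iob_encode(tokens, spans):
    """Turn entity spans into one IOB tag per token.

    spans: list of (start, end_exclusive, label) over token indices
    (a hypothetical input format chosen for this sketch).
    """
    tags = ["O"] * len(tokens)          # default: outside any entity
    for start, end, label in spans:
        tags[start] = "B_" + label      # first token of the entity
        for i in range(start + 1, end):
            tags[i] = "I_" + label      # remaining tokens of the entity
    return tags


def tagset_size(num_classes):
    """One B and one I tag per class, plus a single O tag."""
    return 2 * num_classes + 1


tokens = ["United", "Airlines", "said", "Friday", "it", "has", "increased"]
tags = iob_encode(tokens, [(0, 2, "ORG")])
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Keeping a separate B tag is what lets two adjacent entities of the same class be told apart; the tag count grows from N+1 to 2*N+1 as a result.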

NER Features Slide from Jim Martin

Reminder: Naïve Bayes Learner
Train: For each class $c_j$ of documents
  1. Estimate $P(c_j)$
  2. For each word $w_i$ estimate $P(w_i \mid c_j)$
Classify (doc): Assign doc to most probable class
Slide from Jim Martin
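
As a reminder of how the two training steps and the classification step fit together, here is a minimal sketch. The add-one smoothing, the log-space scoring, and all function names are assumptions added for the example; the slide does not specify them.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs):
    """docs: list of (list_of_words, class_label) pairs."""
    class_counts = Counter(label for _, label in docs)   # for P(c_j)
    word_counts = defaultdict(Counter)                   # for P(w_i | c_j)
    vocab = set()
    for words, label in docs:
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab


def classify_nb(words, model):
    """Return the class with the highest log P(c) + sum of log P(w | c)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_class, best_score = None, -math.inf
    for c in class_counts:
        score = math.log(class_counts[c] / total_docs)
        # Add-one smoothing (an assumption of this sketch).
        denom = sum(word_counts[c].values()) + len(vocab)
        for w in words:
            if w in vocab:                               # skip unseen words
                score += math.log((word_counts[c][w] + 1) / denom)
        if score > best_score:
            best_class, best_score = c, score
    return best_class


model = train_nb([
    (["fares", "flights"], "travel"),
    (["flights", "routes"], "travel"),
    (["protein", "enzyme"], "bio"),
])
print(classify_nb(["flights", "fares"], model))
```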

Logistic Regression
How to compute:
Naïve Bayes:
  Use Bayes rule:
Logistic Regression
  Compute posterior probability directly:

How to do NE tagging?
Classifiers
  Naïve Bayes
  Logistic Regression
Sequence Models
  HMMs
  MEMMs
  CRFs
Sequence models work better

Linear Regression
Example from Freakonomics (Levitt and Dubner 2005)
  Fantastic/cute/charming versus granite/maple
Can we predict price from the number of adjectives?

Linear Regression

Multiple Linear Regression
Predicting values:
In general:
  Let’s pretend there is an extra “intercept” feature $f_0$ with value 1
Multiple Linear Regression

Learning in Linear Regression
Consider one instance $x_j$
We’d like to choose weights to minimize the difference between predicted and observed value for $x_j$:
This is an optimization problem that turns out to have a closed-form solution

Put the feature values from the training set into a matrix X, one row per observation $f^{(i)}$
Put the observed values in a vector y
Formula that minimizes the cost:
  $W = (X^T X)^{-1} X^T y$
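
To make the closed-form solution concrete, the sketch below applies $W = (X^T X)^{-1} X^T y$ to the simplest case: one feature plus the intercept feature $f_0 = 1$, so that $X^T X$ is a 2x2 matrix that can be inverted by hand. The function name and the toy numbers are assumptions for illustration only, not data from the lecture.

```python
def fit_simple_regression(xs, ys):
    """Solve W = (X^T X)^{-1} X^T y for rows of X of the form [1, x]."""
    n = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # X^T X = [[n, sx], [sx, sxx]] and X^T y = [sy, sxy]
    det = n * sxx - sx * sx
    if det == 0:
        raise ValueError("X^T X is singular (all x values are equal)")
    # Multiply the explicit 2x2 inverse by X^T y.
    w0 = (sxx * sy - sx * sxy) / det    # intercept
    w1 = (n * sxy - sx * sy) / det      # slope
    return w0, w1


# Toy data lying exactly on y = 2 + 3x (made-up numbers).
w0, w1 = fit_simple_regression([0, 1, 2], [2, 5, 8])
print(w0, w1)
```

With more features the same formula applies, but in practice one would call a linear-algebra routine rather than invert the matrix by hand.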

Logistic Regression

But in these language problems we are doing classification
  Predicting one of a small set of discrete values
Could we just use linear regression for this?

Logistic regression
Not possible: the result of a linear function doesn’t fall between 0 and 1, as a probability must
Instead of predicting the probability, predict the ratio of probabilities:
  but still not good: the ratio lies between 0 and ∞, while a linear function ranges from −∞ to ∞
So how about if we predict the log:

Logistic regression
Solving this for p(y=true)
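
For reference, the standard derivation behind this step can be written out as follows. The notation is an assumption: $w \cdot f$ stands for the weighted feature sum $\sum_i w_i f_i$, and the probability is conditioned on the input $x$.

```latex
\begin{align*}
\ln\frac{p(y=\text{true}\mid x)}{1 - p(y=\text{true}\mid x)} &= w \cdot f \\
\frac{p(y=\text{true}\mid x)}{1 - p(y=\text{true}\mid x)} &= e^{w \cdot f} \\
p(y=\text{true}\mid x) &= \frac{e^{w \cdot f}}{1 + e^{w \cdot f}}
                        = \frac{1}{1 + e^{-w \cdot f}}
\end{align*}
```

The last form is the logistic function applied to $w \cdot f$, which is where the model gets its name.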

Logistic function

Logistic Regression
How do we do classification?
Or:
Or back to explicit sum notation:
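
One way to read the classification rule: choose "true" when p(y=true) exceeds p(y=false), which is the same as p(y=true) > 0.5 and, since the logistic function equals 0.5 at zero, the same as a positive weighted sum. The sketch below illustrates this; the function names and the example weights are assumptions, not values from the lecture.

```python
import math


def logistic(z):
    """Logistic (sigmoid) function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def p_true(weights, features):
    """p(y=true | x) for a weight vector and a feature vector."""
    z = sum(w * f for w, f in zip(weights, features))
    return logistic(z)


def classify(weights, features):
    """True iff p(y=true) > 0.5, i.e. iff the weighted sum is positive."""
    return p_true(weights, features) > 0.5


weights = [-1.0, 2.0]      # made-up weights; the first one is the intercept
features = [1.0, 1.0]      # f_0 = 1 is the intercept feature
print(p_true(weights, features), classify(weights, features))
```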

Multinomial logistic regression
Multiple classes:
One change: indicator functions f(c,x) instead of real values
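
A minimal sketch of the multi-class case, assuming the usual form in which each class is scored by a weighted sum of indicator features and the scores are normalized with a softmax. The two features, their weights, and the class names below are invented for illustration.

```python
import math


def maxent_probs(x, classes, features, weights):
    """p(c | x) proportional to exp(sum_i w_i * f_i(c, x))."""
    scores = {
        c: sum(w * f(c, x) for f, w in zip(features, weights))
        for c in classes
    }
    top = max(scores.values())                  # subtract max for stability
    exps = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(exps.values())                      # normalizing constant
    return {c: e / z for c, e in exps.items()}


# Indicator features: each returns 1 only for a particular class/input pair.
features = [
    lambda c, x: 1 if c == "PER" and x["word"][0].isupper() else 0,
    lambda c, x: 1 if c == "O" and x["word"].islower() else 0,
]
weights = [1.5, 2.0]                            # hypothetical learned weights

probs = maxent_probs({"word": "Wagner"}, ["PER", "ORG", "O"], features, weights)
print(probs)
```

Because each feature fires only for a specific class, a single weight vector can express different evidence for different classes.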

Estimating the weights
Generalized Iterative Scaling

Features

Summary so far
Naïve Bayes Classifier
Logistic Regression Classifier
  Sometimes called MaxEnt classifiers

How do we apply classification to sequences?

Sequence Labeling as Classification
Classify each token independently, but use information about the surrounding tokens as input features (sliding window).
Sentence: John saw the saw and decided to take it to the table.
Classifier output as the window moves over each token:
  John     NNP
  saw      VBD
  the      DT
  saw      NN
  and      CC
  decided  VBD
  to       TO
  take     VB
  it       PRP
  to       IN
  the      DT
  table    NN
Slide from Ray Mooney
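
A sliding window can be turned into classifier input along the following lines. This is a sketch only: the feature names, the padding symbol, and the capitalization feature are assumptions chosen for the example.

```python
def window_features(tokens, i, size=1):
    """Features for token i built from a window of +/- size tokens."""
    feats = {}
    for offset in range(-size, size + 1):
        j = i + offset
        # Positions outside the sentence get a padding symbol.
        word = tokens[j] if 0 <= j < len(tokens) else "<pad>"
        feats[f"word[{offset:+d}]"] = word.lower()
    feats["is_capitalized"] = tokens[i][0].isupper()
    return feats


tokens = "John saw the saw and decided to take it to the table".split()
print(window_features(tokens, 1))   # first "saw": window is John / saw / the
print(window_features(tokens, 3))   # second "saw": window is the / saw / and
```

The two occurrences of "saw" get different feature dictionaries, which is what gives an independent per-token classifier a chance to tag them differently.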

Using Outputs as Inputs
Better input features are usually the categories of the surrounding tokens, but these are not available yet
Can use category of either the preceding or succeeding tokens by going forward or back and using previous output
Slide from Ray Mooney

Forward Classification
Sentence: John saw the saw and decided to take it to the table.
Moving left to right, each decision can use the tags already assigned to the preceding tokens:
  John     NNP
  saw      VBD
  the      DT
  saw      NN
  and      CC
  decided  VBD
  to       TO
  take     VB
Slide from Ray Mooney

Backward Classification
Disambiguating “to” in this case would be even easier backward.
Sentence: John saw the saw and decided to take it to the table.
Moving right to left, with “the table” already tagged DT NN, each decision can use the tags already assigned to the following tokens:
  to       IN
  it       PRP
  take     VB
  to       TO
  decided  VBD
  and      CC
  saw      VBD
  the      DT
  saw      VBD
  John     NNP
Slide from Ray Mooney
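
The idea of feeding earlier outputs back in as features, in either direction, can be sketched as a greedy tagger. Everything here is a toy: the tiny lexicon and the hand-written rule for "to" stand in for a trained classifier and are assumptions made purely for illustration.

```python
LEXICON = {"take": "VB", "it": "PRP", "the": "DT", "table": "NN"}


def toy_classify(tokens, i, neighbor_tag):
    """Stand-in for a trained classifier (hypothetical rules).

    neighbor_tag is the tag already assigned to the adjacent token:
    the previous one when going forward, the next one when going backward.
    """
    word = tokens[i]
    if word == "to":
        # "to" before a verb is the infinitive marker, otherwise a preposition.
        return "TO" if neighbor_tag == "VB" else "IN"
    return LEXICON.get(word, "NN")


def greedy_tag(tokens, classify, backward=False):
    """Tag tokens one at a time, reusing the most recent output as a feature."""
    n = len(tokens)
    order = range(n - 1, -1, -1) if backward else range(n)
    tags = [None] * n
    for i in order:
        j = i + 1 if backward else i - 1        # the already-tagged neighbor
        neighbor = tags[j] if 0 <= j < n else "<boundary>"
        tags[i] = classify(tokens, i, neighbor)
    return tags


tokens = ["to", "take", "it", "to", "the", "table"]
print(greedy_tag(tokens, toy_classify, backward=True))
print(greedy_tag(tokens, toy_classify, backward=False))
```

With this toy rule the backward pass sees the tag of the following word before deciding on "to", so it can separate the two uses; the forward pass has no such information and tags both the same way.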

NER as Sequence Labeling

Why classifiers aren’t as good as sequence models

Problems with using Classifiers for Sequence Labeling
It’s not easy to integrate information from hidden labels on both sides
We make a hard decision on each token
  We’d rather choose a global optimum
  The best labeling for the whole sequence
  Keeping each local decision as just a probability, not a hard decision

Probabilistic Sequence Models
Probabilistic sequence models allow integrating uncertainty over multiple, interdependent classifications and collectively determine the most likely global assignment
Two standard models
  Hidden Markov Model (HMM)
  Conditional Random Field (CRF)
  Maximum Entropy Markov Model (MEMM) is a simplified version of CRF

HMMs vs. MEMMs Slide from Jim Martin

HMM (top) and MEMM (bottom)

Viterbi in MEMMs
We condition on the observation AND the previous state:
HMM decoding:
Which is the HMM version of:
MEMM decoding:

Decoding in MEMMs
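
A sketch of Viterbi decoding for an MEMM, assuming the usual recurrence $v_t(j) = \max_i v_{t-1}(i)\, P(s_j \mid s_i, o_t)$, in which a single local model conditioned on the previous state and the current observation replaces the HMM's separate transition and emission terms. The two-state tag set and the hand-set probabilities in `toy_local_prob` are invented for illustration; a real MEMM would use a trained MaxEnt classifier there.

```python
def memm_viterbi(observations, states, local_prob, start="<s>"):
    """Most probable state sequence under P(state | prev_state, observation)."""
    # v[t][s]: probability of the best path that ends in state s at time t
    v = [{s: local_prob(s, start, observations[0]) for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        v.append({})
        back.append({})
        for s in states:
            best_prev = max(
                states,
                key=lambda p: v[t - 1][p] * local_prob(s, p, observations[t]),
            )
            v[t][s] = v[t - 1][best_prev] * local_prob(s, best_prev, observations[t])
            back[t][s] = best_prev
    # Follow the back-pointers from the best final state.
    path = [max(states, key=lambda s: v[-1][s])]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


def toy_local_prob(state, prev_state, obs):
    """Hypothetical local model over the states PER and O."""
    if obs[0].isupper():
        p_per = 0.9 if prev_state == "PER" else 0.7
    else:
        p_per = 0.05
    return p_per if state == "PER" else 1.0 - p_per


words = ["spokesman", "Tim", "Wagner", "said"]
print(memm_viterbi(words, ["PER", "O"], toy_local_prob))
```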

Evaluation Metrics

Precision
Precision: how many of the names we returned are really names?
Recall: how many of the names in the database did we find?

F-measure
F-measure is a way to combine these:
More generally:

F-measure
Harmonic mean is the reciprocal of the arithmetic mean of reciprocals:
Hence F-measure is:
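
These metrics can be computed over sets of entity mentions as sketched below. The code assumes the standard definitions, $F_1 = 2PR/(P+R)$ and more generally $F_\beta = (\beta^2+1)PR/(\beta^2 P + R)$, and an exact-match criterion on (start, end, type) triples; the example spans are made up.

```python
def precision_recall_f(predicted, gold, beta=1.0):
    """Entity-level precision, recall and F-beta with exact-match scoring."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    b2 = beta * beta
    f = (b2 + 1) * precision * recall / (b2 * precision + recall)
    return precision, recall, f


# Hypothetical (start, end, type) spans over token positions.
gold = {(0, 2, "ORG"), (11, 13, "PER"), (14, 15, "ORG")}
predicted = {(0, 2, "ORG"), (11, 13, "PER"), (3, 4, "LOC"), (20, 21, "LOC")}

p, r, f1 = precision_recall_f(predicted, gold)
print(p, r, f1)
```

With beta = 1 the score is the harmonic mean of precision and recall, so it is pulled toward whichever of the two is lower.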

Outline
Named Entities and the basic idea
IOB Tagging
A new classifier: Logistic Regression
  Linear regression
  Logistic regression
  Multinomial logistic regression = MaxEnt
Why classifiers aren’t as good as sequence models
A new sequence model:
  MEMM = Maximum Entropy Markov Model