
1 CSC 594 Topics in AI – Applied Natural Language Processing Fall 2009/2010 7. Named Entity Recognition

2 Named Entity Recognition
Named Entities (NEs) are proper names in texts, i.e. the names of persons, organizations, locations, times and quantities.
NE Recognition (NER) is a sub-task of Information Extraction (IE): to process a text and identify the named entities in it.
– e.g. “U.N. official Ekeus heads for Baghdad.”
NER is also an important task for texts in specific domains such as biomedical texts.
Source: J. Choi, CSE842, MSU; Marti Hearst, i256, at UC Berkeley
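As an illustration of the task on this example sentence (not part of the original 2009 slides): a modern off-the-shelf tagger such as spaCy produces exactly this kind of output. The sketch assumes spaCy and its en_core_web_sm model are installed.

    # Illustration only (spaCy postdates these slides); assumes en_core_web_sm is installed
    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("U.N. official Ekeus heads for Baghdad.")
    for ent in doc.ents:
        print(ent.text, ent.label_)   # typically something like: U.N. ORG / Ekeus PERSON / Baghdad GPE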

3 Difficulties with NER
Names are too numerous to include in dictionaries.
Variations
– e.g. John Smith, Mr Smith, John
Changing constantly
– new names introduce unknown words
Ambiguities
– the same name can refer to different entities, e.g. JFK the former president, JFK his son, JFK the airport in NY
Source: J. Choi, CSE842, MSU

4 Two Kinds of NER Approaches
Knowledge Engineering
– rule-based, developed by experienced language engineers
– makes use of human intuition
– requires only a small amount of training data
– development can be very time consuming
– some changes may be hard to accommodate
Learning Systems
– use statistics or other machine learning
– developers do not need LE expertise
– require large amounts of annotated training data
– some changes may require re-annotation of the entire training corpus
– annotators are cheap (but you get what you pay for!)
Source: Marti Hearst, i256, at UC Berkeley

5 Landscape of IE/NER Techniques
Any of these models can be used to capture words, formatting, or both. [The slide illustrates each technique on the sentence “Abraham Lincoln was born in Kentucky.”]
– Lexicons: is the candidate a member of a list (e.g. Alabama, Alaska, …, Wisconsin, Wyoming)?
– Classify pre-segmented candidates: a classifier decides which class each candidate belongs to.
– Sliding window: classify each window of tokens; try alternate window sizes.
– Boundary models: classifiers mark BEGIN and END boundaries.
– Context-free grammars: find the most likely parse.
– Finite state machines: find the most likely state sequence.
Source: Marti Hearst, i256, at UC Berkeley

6 CoNLL-2003
CoNLL – Conference on Natural Language Learning, run by the ACL's Special Interest Group on Natural Language Learning
Shared Task: Language-Independent Named Entity Recognition
Goal: identify boundaries and types of named entities
– 4 entity types: Person, Organization, Location, Misc.
Data:
– Uses the IOB notation: B- marks the beginning of a chunk/NE, I- a token inside a chunk/NE, and O a token outside any chunk/NE
– 4 pieces of information for each term: Word, POS, Chunk, EntityType (see the example below)
Source: Marti Hearst, i256, at UC Berkeley
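A plausible rendering of the slide-2 example sentence in this four-column format (the POS and chunk tags here are illustrative, and the B-/I- prefixes follow the convention just described; the actual CoNLL-2003 files may differ in detail):

    U.N.      NNP  B-NP  B-ORG
    official  NN   I-NP  O
    Ekeus     NNP  I-NP  B-PER
    heads     VBZ  B-VP  O
    for       IN   B-PP  O
    Baghdad   NNP  B-NP  B-LOC
    .         .    O     O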

7 Details on Training/Test Sets Reuters Newswire + European Corpus Initiative Source: Marti Hearst, i256, at UC Berkeley

8 Evaluation of Single Entity Extraction Source: Andrew McCallum, UMass Amherst
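The evaluation figure itself is not reproduced in this transcript. For reference, the standard per-entity measures it refers to, where TP, FP, and FN count correctly extracted, spuriously extracted, and missed entities respectively:

\[
P = \frac{TP}{TP+FP}, \qquad R = \frac{TP}{TP+FN}, \qquad F_1 = \frac{2PR}{P+R}.
\]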

9 Summary of Results
16 systems participated.
Machine Learning Techniques
– Combinations of Maximum Entropy Models (5) + Hidden Markov Models (4) + Winnow/Perceptron (4)
– Others used once were Support Vector Machines, Conditional Random Fields (CRF), Transformation-Based Learning, AdaBoost, and memory-based learning
– Combining techniques often worked well
Features
– Choice of features is at least as important as the ML method
– Top-scoring systems used many types
– No one feature stands out as essential (other than the words themselves)
Source: Marti Hearst, i256, at UC Berkeley

10 Precision, Recall, and F-Scores
[The table of per-system precision, recall, and F-scores is not reproduced in this transcript; * marked results that were not significantly different from one another.]
Source: Marti Hearst, i256, at UC Berkeley

11 State of the Art Performance
Named entity recognition
– Person, Location, Organization, …
– F1 in the high 80s or low to mid 90s
However, performance depends on the entity types.
[Wikipedia] At least two hierarchies of named entity types have been proposed in the literature. The BBN categories [1], proposed in 2002, are used for Question Answering and consist of 29 types and 64 subtypes. Sekine's extended hierarchy [2], proposed in 2002, is made up of 200 subtypes.
Also, various domains use different entity types (e.g. concepts in biomedical texts).

12 Sequence Labeling
Inputs: x = (x_1, …, x_n)
Labels: y = (y_1, …, y_n)
Typical goal: given x, predict y.
Example sequence labeling tasks:
– Part-of-speech tagging
– Named Entity Recognition (NER): the target class/label is the entity type, in IOB notation (Word POS Chunk EntityType)
Source: Andrew McCallum, UMass Amherst

13 Methods for Sequence Labeling
Typically the following methods are used for NER:
a) Hidden Markov Model (HMM)
b) Maximum Entropy Classifier (MaxEnt)
c) Maximum Entropy Markov Model (MEMM)
d) Conditional Random Fields (CRF)
These are all classifiers (i.e., supervised learning) which model sequences rather than individual random variables.

14 Instance/word Features for NER Characteristics of the token & text in a surrounding window. Shape/orthographic features Source: Jurafsky & Martin “Speech and Language Processing”
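A minimal sketch (not from the slides; the feature names are invented for illustration) of the per-token feature extraction this slide describes: token identity, shape/orthographic features, and a small surrounding window.

    # Hypothetical per-token feature extractor with shape and window features
    def word_shape(w):
        # map uppercase -> X, lowercase -> x, digits -> d; keep other characters
        out = []
        for c in w:
            if c.isupper():
                out.append("X")
            elif c.islower():
                out.append("x")
            elif c.isdigit():
                out.append("d")
            else:
                out.append(c)
        return "".join(out)

    def token_features(tokens, i):
        w = tokens[i]
        return {
            "word": w,
            "lower": w.lower(),
            "shape": word_shape(w),            # e.g. "U.N." -> "X.X.", "Ekeus" -> "Xxxxx"
            "is_capitalized": w[:1].isupper(),
            "has_digit": any(c.isdigit() for c in w),
            "prefix3": w[:3],
            "suffix3": w[-3:],
            "prev_word": tokens[i - 1] if i > 0 else "<START>",          # +/- 1 token window
            "next_word": tokens[i + 1] if i + 1 < len(tokens) else "<END>",
        }

    print(token_features("U.N. official Ekeus heads for Baghdad .".split(), 2))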

15 a) Hidden Markov Models (HMMs) Source: Andrew McCallum, UMass Amherst

16 NER with HMMs Source: Andrew McCallum, UMass Amherst
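The slide's figure is not reproduced here. As a sketch of how an HMM tagger decodes a sentence, a toy Viterbi decoder over three states; the probability tables are invented purely for illustration and are not the slide's example:

    # Toy Viterbi decoding for an HMM NER tagger (tables are made up for illustration)
    import math

    states = ["O", "PER", "LOC"]
    start = {"O": 0.8, "PER": 0.1, "LOC": 0.1}                            # P(first state)
    trans = {s: {t: 1.0 / len(states) for t in states} for s in states}   # P(next state | state), uniform here
    emit = {                                                              # P(word | state), tiny toy vocabulary
        "O":   {"heads": 0.5, "for": 0.5},
        "PER": {"Ekeus": 1.0},
        "LOC": {"Baghdad": 1.0},
    }
    UNK = 1e-6   # smoothing for unseen words

    def viterbi(words):
        # best[t][s] = (log-prob of the best path ending in state s at position t, backpointer)
        best = [{s: (math.log(start[s]) + math.log(emit[s].get(words[0], UNK)), None) for s in states}]
        for t in range(1, len(words)):
            best.append({})
            for s in states:
                prob, prev = max(
                    (best[t - 1][r][0] + math.log(trans[r][s]) + math.log(emit[s].get(words[t], UNK)), r)
                    for r in states
                )
                best[t][s] = (prob, prev)
        # trace back the highest-scoring state sequence
        last = max(states, key=lambda s: best[-1][s][0])
        path = [last]
        for t in range(len(words) - 1, 0, -1):
            path.append(best[t][path[-1]][1])
        return list(reversed(path))

    print(viterbi("Ekeus heads for Baghdad".split()))   # ['PER', 'O', 'O', 'LOC']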

17 We want More than an Atomic View of Words Would like richer representation of text: many arbitrary, overlapping features of the words. Source: Andrew McCallum, UMass Amherst

18 However…
These arbitrary features are not independent.
– Multiple levels of granularity (chars, words, phrases)
– Multiple dependent modalities (words, formatting, layout)
– Past & future
Possible solutions: model the dependencies, ignore the dependencies, or use conditional sequence models.
Source: Andrew McCallum, UMass Amherst

19 b) Maximum Entropy Classifier
Conditional model p(y|x).
– Do not waste effort modeling p(x), since x is given at test time anyway.
– Allows more complicated input features, since we do not need to model dependencies between them.
Feature functions f(x,y):
– f1(x,y) = { word is Boston & y=Location }
– f2(x,y) = { first letter capitalized & y=Name }
– f3(x,y) = { x is an HTML link & y=Location }
Source: Andrew McCallum, UMass Amherst
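A small sketch (not from the slides) of these three feature functions written as Python predicates; the token representation and names are invented for illustration:

    # Hypothetical encodings of the slide's feature functions f1, f2, f3
    def f1(x, y):
        # fires when the word is "Boston" and the label is Location
        return 1 if x["word"] == "Boston" and y == "Location" else 0

    def f2(x, y):
        # fires when the first letter is capitalized and the label is Name
        return 1 if x["word"][:1].isupper() and y == "Name" else 0

    def f3(x, y):
        # fires when the token is an HTML link and the label is Location
        return 1 if x.get("is_html_link", False) and y == "Location" else 0

    x = {"word": "Boston", "is_html_link": False}
    print([f(x, "Location") for f in (f1, f2, f3)])   # [1, 0, 0]

In a trained MaxEnt model, each such feature gets a weight, and classification sums the weights of the features that fire for a candidate label.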

20 Maximum Entropy Models
Same as multinomial logistic regression (thus an exponential model for n-way classification).
Features are constraints on the model: of all models that satisfy these constraints, we choose the one that maximizes entropy. Choosing one with less entropy means we are adding information that is not justified by the empirical evidence.
Source: Jason Balbridge, U of Texas at Austin

21 MaxEnt: A Simple Example
Say we have an unknown probability distribution p(x,y), where x ∈ {a,b} and y ∈ {0,1}. We only know that p(a,0) + p(b,0) = 0.6. Well, actually we also know that Σ_{x,y} p(x,y) = 1, since p is a probability distribution. There are infinitely many distributions consistent with these constraints.
Source: Jason Balbridge, U of Texas at Austin
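A worked answer (not shown on the slide): with only these two constraints, the maximum entropy distribution spreads the mass as evenly as the constraints allow, giving p(a,0) = p(b,0) = 0.3 and p(a,1) = p(b,1) = 0.2.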

22 MaxEnt: A Simple Example (cont.)
The Principle of Maximum Entropy recommends the most non-committal assignment of probabilities that is consistent with the constraints.
What if we also knew that p(a,0) + p(a,1) = 0.6?
Source: Jason Balbridge, U of Texas at Austin
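Worked out (again, not on the slide): the two constraints fix the marginals p(x=a) = 0.6 and p(y=0) = 0.6, and the maximum entropy joint distribution consistent with given marginals is the one that treats x and y as independent, i.e. p(a,0) = 0.36, p(a,1) = 0.24, p(b,0) = 0.24, p(b,1) = 0.16.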

23 MaxEnt: A Simple Example (cont.)
An observation such as p(a,0) + p(b,0) = 0.6 is implemented as a constraint on the model p's expectation of a feature f: E_p[f] = 0.6.
In general, the constraint equates the model's expectation of f with the observed value, where f(x,y) is an indicator function for the feature f (there is one such function per feature); for this example, f(x,y) = 1 when y = 0 and 0 otherwise (see the formula below).
Source: Jason Balbridge, U of Texas at Austin
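A LaTeX reconstruction of the general constraint referred to above (the slide's own formula is not preserved in this transcript): the model's expectation of the feature must match the observed value,

\[
E_p[f] \;=\; \sum_{x,y} p(x,y)\, f(x,y) \;=\; 0.6,
\qquad
f(x,y) \;=\; \begin{cases} 1 & \text{if } y = 0,\\ 0 & \text{otherwise.} \end{cases}
\]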

24 Maximizing Entropy
To find the maximum entropy model, the objective is to maximize the entropy of the model p (equivalently, for the resulting exponential model, to maximize the log likelihood of the training data), subject to the known constraints (see the formula below). Note: −p(x,y) log p(x,y), summed over all (x,y), is the entropy of the model p.
In summary, “the intuition of maximum entropy is to build a distribution by continuously adding features. Each feature is an indicator function, which picks out a subset of the training observations. For each feature, we add a constraint on our total distribution, specifying that our distribution for this subset should match the empirical distribution that we saw in our training data. We then choose the maximum entropy distribution that otherwise accords with these constraints.”
Source: Jason Balbridge, U of Texas at Austin; Jurafsky & Martin “Speech and Language Processing”
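A LaTeX reconstruction of the objective referred to above (the formula image is not in this transcript): among the distributions C that satisfy the constraints, choose the one of maximum entropy,

\[
p^{*} \;=\; \arg\max_{p \in C} H(p),
\qquad
H(p) \;=\; -\sum_{x,y} p(x,y)\,\log p(x,y).
\]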

25 MaxEnt Classification To identify the most probable label for a particular context x, we calculate: Source: Jason Balbridge, U of Texas at Austin
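The slide's formula is not preserved in this transcript; a standard form of the MaxEnt (multinomial logistic regression) decision rule consistent with the preceding slides is

\[
y^{*} \;=\; \arg\max_{y} p(y \mid x)
      \;=\; \arg\max_{y} \frac{\exp\big(\sum_i \lambda_i f_i(x,y)\big)}{\sum_{y'} \exp\big(\sum_i \lambda_i f_i(x,y')\big)},
\]

where the λ_i are the learned feature weights; since the denominator does not depend on y, it can be dropped when taking the argmax.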

26 d) Conditional Random Fields (CRF) Conditionally-trained, undirected graphical model. Source: Andrew McCallum, UMass Amherst

27 Characteristics of CRFs
In a CRF, the graph structure is often a linear chain, where the state sequence (S) connects S_{t-1} and S_t, and each state S_t depends on the observation sequence (O).
A CRF is a more general and expressive modeling technique than an HMM, since each state can depend on the entire observation sequence, not just its corresponding observation.
Source: Andrew McCallum, UMass Amherst

28 CRFs
CRFs are widely used in NLP research and give state-of-the-art results in many domains and tasks:
– Part-of-speech tagging
– Named Entity Recognition
– Gene prediction
– Chinese word segmentation
– Extracting information from research papers
– etc.
Source: Andrew McCallum, UMass Amherst

29 For a standard linear-chain structure: Source: Andrew McCallum, UMass Amherst
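The formula itself is not reproduced in the transcript; the standard linear-chain CRF (Lafferty, McCallum & Pereira, 2001) defines

\[
p(\mathbf{y} \mid \mathbf{x}) \;=\; \frac{1}{Z(\mathbf{x})}
\exp\Big(\sum_{t=1}^{n} \sum_{k} \lambda_k\, f_k(y_{t-1}, y_t, \mathbf{x}, t)\Big),
\qquad
Z(\mathbf{x}) \;=\; \sum_{\mathbf{y}'} \exp\Big(\sum_{t=1}^{n} \sum_{k} \lambda_k\, f_k(y'_{t-1}, y'_t, \mathbf{x}, t)\Big),
\]

where the f_k are feature functions over adjacent labels and the whole observation sequence, and the λ_k are their learned weights.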