1
807 - TEXT ANALYTICS Massimo Poesio Lecture 5: Named Entity Recognition
2
Text Classification at Different Granularities
Text Categorization:
– Classify an entire document
Information Extraction (IE):
– Identify and classify small units within documents
Named Entity Extraction (NE):
– A subset of IE
– Identify and classify proper names: people, locations, organizations
3
Adapted from slide by William Cohen
What is Information Extraction
Filling slots in a database from sub-segments of text. As a task:
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying…
Slots to fill: NAME | TITLE | ORGANIZATION
4
Adapted from slide by William Cohen
What is Information Extraction
Filling slots in a database from sub-segments of text. As a task:
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying…
IE
NAME | TITLE | ORGANIZATION
Bill Gates | CEO | Microsoft
Bill Veghte | VP | Microsoft
Richard Stallman | founder | Free Soft..
5
Adapted from slide by William Cohen
What is Information Extraction
Information Extraction = segmentation + classification + association. As a family of techniques:
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying…
Extracted segments (aka "named entity extraction"): Microsoft Corporation, CEO, Bill Gates, Microsoft, Gates, Microsoft, Bill Veghte, Microsoft, VP, Richard Stallman, founder, Free Software Foundation
6
Adapted from slide by William Cohen
What is Information Extraction
Information Extraction = segmentation + classification + association. A family of techniques:
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying…
Extracted segments: Microsoft Corporation, CEO, Bill Gates, Microsoft, Gates, Microsoft, Bill Veghte, Microsoft, VP, Richard Stallman, founder, Free Software Foundation
8
INFORMATION EXTRACTION
More general definition: extraction of structured information from unstructured documents.
IE tasks:
– Named entity extraction: named entity recognition, coreference resolution, relationship extraction
– Semi-structured IE: table extraction
– Terminology extraction
9
Adapted from slide by William Cohen
Landscape of IE Tasks: Degree of Formatting
Ranges from text paragraphs without formatting, through grammatical sentences with some formatting & links, to non-grammatical snippets with rich formatting & links, to tables.
Example of plain text: "Astro Teller is the CEO and co-founder of BodyMedia. Astro holds a Ph.D. in Artificial Intelligence from Carnegie Mellon University, where he was inducted as a national Hertz fellow. His M.S. in symbolic and heuristic computation and B.S. in computer science are from Stanford University. His work in science, literature and business has appeared in international media from the New York Times to CNN to NPR."
10
Adapted from slide by William Cohen
Landscape of IE Tasks: Intended Breadth of Coverage
– Web site specific (formatting): e.g. Amazon.com book pages
– Genre specific (layout): e.g. resumes
– Wide, non-specific (language): e.g. university names
11
Landscape of IE Tasks: Complexity
– Closed set (e.g. U.S. states): "He was born in Alabama…", "The big Wyoming sky…"
– Regular set (e.g. U.S. phone numbers): "Phone: (413) 545-1323", "The CALD main office can be reached at 412-268-1299"
– Complex pattern (e.g. U.S. postal addresses): "University of Arkansas, P.O. Box 140, Hope, AR 71802", "Headquarters: 1128 Main Street, 4th Floor, Cincinnati, Ohio 45210"
– Ambiguous patterns, needing context and many sources of evidence (e.g. person names): "…was among the six houses sold by Hope Feldman that year.", "Pawel Opalinski, Software Engineer at WhizBang Labs."
12
Adapted from slide by William Cohen
Landscape of IE Tasks: Single Field/Record
Example text: "Jack Welch will retire as CEO of General Electric tomorrow. The top role at the Connecticut company will be filled by Jeffrey Immelt."
– Single entity ("named entity" extraction): Person: Jack Welch; Person: Jeffrey Immelt; Location: Connecticut
– Binary relationship: Relation: Person-Title (Person: Jack Welch, Title: CEO); Relation: Company-Location (Company: General Electric, Location: Connecticut)
– N-ary record: Relation: Succession (Company: General Electric, Title: CEO, Out: Jack Welch, In: Jeffrey Immelt)
13
Adapted from slide by William Cohen State of the Art Performance: a sample Named entity recognition from newswire text – Person, Location, Organization, … – F1 in high 80’s or low- to mid-90’s Binary relation extraction – Contained-in (Location1, Location2) Member-of (Person1, Organization1) – F1 in 60’s or 70’s or 80’s Web site structure recognition – Extremely accurate performance obtainable – Human effort (~10min?) required on each site
14
Slide by Chris Manning, based on slides by several others Three generations of IE systems Hand-Built Systems – Knowledge Engineering [1980s– ] – Rules written by hand – Require experts who understand both the systems and the domain – Iterative guess-test-tweak-repeat cycle Automatic, Trainable Rule-Extraction Systems [1990s– ] – Rules discovered automatically using predefined templates, using automated rule learners – Require huge, labeled corpora (effort is just moved!) Statistical Models [1997 – ] – Use machine learning to learn which features indicate boundaries and types of entities. – Learning usually supervised; may be partially unsupervised
15
Adapted from slide by William Cohen
Landscape of IE Techniques (any of these models can be used to capture words, formatting, or both), illustrated on "Abraham Lincoln was born in Kentucky.":
– Lexicons (Alabama, Alaska, …, Wisconsin, Wyoming): is the candidate a member of the list?
– Classify pre-segmented candidates: which class?
– Sliding window (try alternate window sizes): which class?
– Boundary models: classify BEGIN/END positions
– Context-free grammars: most likely parse?
– Finite state machines: most likely state sequence?
16
Standard approaches to IE Hand-written regular expressions or rules Supervised classification – SVM – Maximum entropy models Sequence models – Hidden Markov Models – Conditional random fields
17
Named Entity Recognition (NER)
Input: Apple Inc., formerly Apple Computer, Inc., is an American multinational corporation headquartered in Cupertino, California that designs, develops, and sells consumer electronics, computer software and personal computers. It was established on April 1, 1976, by Steve Jobs, Steve Wozniak and Ronald Wayne.
Output (named entities marked): [Apple Inc.]ORG, formerly [Apple Computer, Inc.]ORG, is an [American]MISC multinational corporation headquartered in [Cupertino]LOC, [California]LOC that designs, develops, and sells consumer electronics, computer software and personal computers. It was established on [April 1, 1976]DATE, by [Steve Jobs]PER, [Steve Wozniak]PER and [Ronald Wayne]PER.
18
Named Entity Recognition (NER)
Locate and classify atomic elements in text into predefined categories (persons, organizations, locations, temporal expressions, quantities, percentages, monetary values, …).
Input: a block of text
– Jim bought 300 shares of Acme Corp. in 2006.
Output: annotated block of text
– <ENAMEX TYPE="PERSON">Jim</ENAMEX> bought <NUMEX TYPE="QUANTITY">300</NUMEX> shares of <ENAMEX TYPE="ORGANIZATION">Acme Corp.</ENAMEX> in <TIMEX TYPE="DATE">2006</TIMEX>.
– ENAMEX-style tags (MUC in the 1990s)
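As a concrete illustration of the input/output contract above, the sketch below runs the example sentence through an off-the-shelf NER pipeline. spaCy and its en_core_web_sm model are an assumption for illustration (they are not the tools used in the lecture), and the exact labels returned are model-dependent.

# Illustrative only: one way to obtain entity spans with an off-the-shelf tagger.
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")          # small English pipeline with an NER component
doc = nlp("Jim bought 300 shares of Acme Corp. in 2006.")

for ent in doc.ents:
    print(ent.text, ent.label_)
# Typical (model-dependent) output: Jim PERSON / 300 CARDINAL / Acme Corp. ORG / 2006 DATE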
19
THE STANDARD NEWS DOMAIN
Most work on NER focuses on:
– NEWS
– Variants of the repertoire of entity types first studied in MUC and then in ACE: PERSON, ORGANIZATION, GPE, LOCATION, TEMPORAL ENTITY, NUMBER
20
Named Entity Recognition
– Subtask of information extraction
– Locate and classify elements in text into predefined categories: names of persons, organizations, locations, expressions of time, etc.
Example: James Clarke (Person), director of ABC company (Organization)
21
HOW Two tasks: – Identifying the part of text that mentions an entity (RECOGNITION) – Classifying it (CLASSIFICATION) The two tasks are reduced to a standard classification task by having the system classify WORDS
22
Basic Problems in NER Variation of NEs – e.g. John Smith, Mr Smith, John. Ambiguity of NE types – John Smith (company vs. person) – May (person vs. month) – Washington (person vs. location) – 1945 (date vs. time) Ambiguity with common words, e.g. “may”
23
Problems in NER Category definitions are intuitively quite clear, but there are many grey areas. Many of these grey areas are caused by metonymy. Organisation vs. Location : “England won the World Cup” vs. “The World Cup took place in England”. Company vs. Artefact: “shares in MTV” vs. “watching MTV” Location vs. Organisation: “she met him at Heathrow” vs. “the Heathrow authorities”
24
Solutions The task definition must be very clearly specified at the outset. The definitions adopted at the MUC conferences for each category listed guidelines, examples, counter-examples, and “logic” behind the intuition. MUC essentially adopted the simplistic approach of disregarding metonymous uses of words, e.g. “England” was always identified as a location. However, this is not always useful for practical applications of NER (e.g. the football domain). Idealistic solutions, on the other hand, are not always practical to implement, e.g. making distinctions based on world knowledge.
25
More complex problems in NER
Issues of style, structure, domain, genre, etc. – punctuation, spelling, spacing, formatting all have an impact. Examples:
Dept. of Computing and Maths
Manchester Metropolitan University
Manchester
United Kingdom
> Tell me more about Leonardo
> Da Vinci
26
Approaches to NER: List Lookup
System that recognises only entities stored in its lists (GAZETTEERS).
– Advantages: simple, fast, language independent, easy to retarget
– Disadvantages: collection and maintenance of lists, cannot deal with name variants, cannot resolve ambiguity
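A minimal sketch of the list-lookup idea just described, using a tiny invented gazetteer and longest-match lookup; real gazetteers would be far larger and need normalisation.

# Minimal gazetteer (list-lookup) tagger: longest-match against fixed name lists.
# The lists and the example sentence are made up for illustration.
GAZETTEER = {
    ("New", "York"): "LOCATION",
    ("Acme", "Corp."): "ORGANIZATION",
    ("John", "Smith"): "PERSON",
}
MAX_LEN = max(len(k) for k in GAZETTEER)

def lookup(tokens):
    i, spans = 0, []
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):   # prefer the longest match
            cand = tuple(tokens[i:i + n])
            if cand in GAZETTEER:
                spans.append((i, i + n, GAZETTEER[cand]))
                i += n
                break
        else:
            i += 1
    return spans

print(lookup("John Smith joined Acme Corp. in New York".split()))
# [(0, 2, 'PERSON'), (3, 5, 'ORGANIZATION'), (6, 8, 'LOCATION')]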
27
Approaches to NER: Shallow Parsing
Names often have internal structure. These components can be either stored or guessed.
location: CapWord + {City, Forest, Center}, e.g. Sherwood Forest
CapWord + {Street, Boulevard, Avenue, Crescent, Road}, e.g. Portobello Street
28
Shallow Parsing Approach (e.g., Mikheev et al. 1998)
External evidence – names are often used in very predictive local contexts.
Location:
– "to the" COMPASS "of" CapWord, e.g. to the south of Loitokitok
– "based in" CapWord, e.g. based in Loitokitok
– CapWord "is a" (ADJ)? GeoWord, e.g. Loitokitok is a friendly city
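A small sketch of this external-evidence idea: hand-written context patterns that guess locations. The regular expressions and example text are illustrative, not Mikheev et al.'s actual rules.

# "External evidence": a capitalised word in a predictive local context
# ("to the south of X", "based in X") is guessed to be a location.
import re

LOCATION_PATTERNS = [
    r"to the (?:north|south|east|west) of ([A-Z][a-z]+)",
    r"based in ([A-Z][a-z]+)",
    r"([A-Z][a-z]+) is a (?:\w+ )?(?:city|town|village)",
]

text = "The camp lies to the south of Loitokitok. The NGO is based in Nairobi."
locations = {m.group(1) for pat in LOCATION_PATTERNS for m in re.finditer(pat, text)}
print(locations)   # {'Loitokitok', 'Nairobi'}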
29
Difficulties in Shallow Parsing Approach Ambiguously capitalised words (first word in sentence) [All American Bank] vs. All [State Police] Semantic ambiguity “John F. Kennedy” = airport (location) “Philip Morris” = organisation Structural ambiguity [Cable and Wireless] vs. [Microsoft] and [Dell] [Center for Computational Linguistics] vs. message from [City Hospital] for [John Smith].
30
Machine learning approaches to NER NER as classification: the IOB representation Supervised methods – Support Vector Machines – Logistic regression (aka Maximum Entropy) – Sequence pattern learning – Hidden Markov Models – Conditional Random Fields Distant learning Semi-supervised methods
31
THE ML APPROACH TO NE: THE IOB REPRESENTATION
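The slide's figure is not in the transcript; as a stand-in, here is a small sketch of the IOB encoding it refers to, with an invented annotation, plus a helper that turns an IOB tag sequence back into entity spans.

# The IOB scheme in miniature: every token gets B-TYPE (first token of an entity),
# I-TYPE (inside an entity), or O (outside).  The annotation below is illustrative.
tokens = ["Jim", "bought", "300", "shares", "of", "Acme", "Corp.", "in", "2006", "."]
tags   = ["B-PER", "O",    "O",   "O",      "O",  "B-ORG", "I-ORG", "O", "O",   "O"]

def iob_to_spans(tags):
    """Recover (start, end, type) entity spans from an IOB tag sequence."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):            # sentinel closes a final entity
        if tag.startswith("B-") or tag == "O":
            if start is not None:
                spans.append((start, i, etype))
                start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans

print(iob_to_spans(tags))   # [(0, 1, 'PER'), (5, 7, 'ORG')]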
32
THE ML APPROACH TO NE: FEATURES
33
FEATURES
35
Supervised ML for NER Methods already seen – Decision trees – Support Vector Machines Sequence learning – Hidden Markov Models – Maximum Entropy Models – Conditional Random Fields
36
Linear Regression Example from Freakonomics (Levitt and Dubner 2005) Fantastic/cute/charming versus granite/maple Can we predict price from # of adjs?
37
Linear Regression
38
Multiple Linear Regression
Predicting values: price = w0 + w1 · (number of adjectives) + …
In general: y = w0 + Σi wi fi over the N features
– Let's pretend an extra "intercept" feature f0 with value 1, so that y = Σi=0..N wi fi = w · f
39
Learning in Linear Regression
Consider one instance x_j. We'd like to choose weights to minimize the difference between the predicted and observed values for x_j:
cost(W) = Σ_j ( y_pred(x_j) − y_obs(x_j) )², summed over the training instances.
This is an optimization problem that turns out to have a closed-form solution.
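A hedged sketch of fitting such a model: numpy's least-squares routine solves the closed-form problem; the adjective counts and prices are invented, not the Freakonomics data.

# Least-squares fit of price ≈ w0 + w1 * num_adjectives (toy numbers).
import numpy as np

num_adjs = np.array([1, 2, 3, 4, 5], dtype=float)
prices   = np.array([320, 305, 290, 280, 262], dtype=float)   # in $1000s, invented

X = np.column_stack([np.ones_like(num_adjs), num_adjs])       # intercept feature f0 = 1
w, *_ = np.linalg.lstsq(X, prices, rcond=None)                # closed-form least squares
print(w)        # [w0, w1]; w1 < 0: more "fluffy" adjectives, lower price
print(X @ w)    # predicted prices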
40
Logistic regression But in these language cases we are doing classification Predicting one of a small set of discrete values Could we just use linear regression for this?
41
Logistic regression
Not possible: the result doesn't fall between 0 and 1.
Instead of predicting the probability, predict the ratio of probabilities (the odds): p(y=true) / (1 − p(y=true)) = w · f
– but still not good: the ratio ranges from 0 to ∞, while w · f can be any real number.
So how about if we predict the log: ln[ p(y=true) / (1 − p(y=true)) ] = w · f
42
Logistic regression
Solving this for p(y=true):
p(y=true) = e^(w·f) / (1 + e^(w·f)) = 1 / (1 + e^(−w·f))
43
Logistic Regression
How do we do classification? Choose y=true if p(y=true|x) > p(y=false|x), i.e. if the odds ratio is greater than 1.
Or: choose y=true if w · f > 0.
Or, back to explicit sum notation: choose y=true if Σi wi fi > 0.
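A tiny sketch of this decision rule, with invented weights and features: compute w · f, pass it through the sigmoid for a probability, and threshold at 0.

# Binary logistic-regression decision: predict "true" exactly when w·f > 0,
# i.e. when sigmoid(w·f) > 0.5.  Weights and feature values are toy numbers.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w = [0.5, -1.2, 2.0]          # learned weights (invented)
f = [1.0, 0.3, 0.8]           # feature values for one instance, f[0] = intercept

score = sum(wi * fi for wi, fi in zip(w, f))
print(sigmoid(score))                      # P(y = true | x)
print("true" if score > 0 else "false")    # hard decision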
44
Multinomial logistic regression
Multiple classes: P(c|x) = exp( Σi wi fi(c,x) ) / Σc′ exp( Σi wi fi(c′,x) )
One change: indicator functions f(c,x) instead of real-valued features
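A short sketch of the multinomial case with indicator feature functions f(c, x); the classes, features and weights are invented for illustration.

# Multinomial (MaxEnt) classification: P(c|x) ∝ exp(sum_i w_i f_i(c, x)).
import math

CLASSES = ["PER", "ORG", "O"]

def features(c, word):
    # indicator features tied to a (class, observation) pair
    return {
        f"capitalised&{c}": float(word[0].isupper()),
        f"word=Corp.&{c}": float(word == "Corp."),
    }

weights = {"capitalised&PER": 1.0, "capitalised&ORG": 0.8,
           "word=Corp.&ORG": 2.5}                      # missing keys default to 0

def p_class(word):
    scores = {c: math.exp(sum(weights.get(k, 0.0) * v
                              for k, v in features(c, word).items()))
              for c in CLASSES}
    z = sum(scores.values())                           # normalisation over classes
    return {c: s / z for c, s in scores.items()}

print(p_class("Corp."))       # ORG gets most of the probability mass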
45
Estimating the weights: gradient methods; iterative scaling (e.g., GIS)
46
Features
47
Summary so far
– Naïve Bayes Classifier
– Logistic Regression Classifier (sometimes called MaxEnt classifier)
48
NER as a SEQUENCE CLASSIFICATION TASK
49
Sequence Labeling as Classification: POS Tagging Classify each token independently but use as input features, information about the surrounding tokens (sliding window). Slide from Ray Mooney John saw the saw and decided to take it to the table. classifier NNP
50
Sequence Labeling as Classification Classify each token independently but use as input features, information about the surrounding tokens (sliding window). John saw the saw and decided to take it to the table. classifier VBD Slide from Ray Mooney
51
Sequence Labeling as Classification Classify each token independently but use as input features, information about the surrounding tokens (sliding window). John saw the saw and decided to take it to the table. classifier DT Slide from Ray Mooney
52
Sequence Labeling as Classification Classify each token independently but use as input features, information about the surrounding tokens (sliding window). John saw the saw and decided to take it to the table. classifier NN Slide from Ray Mooney
53
Sequence Labeling as Classification Classify each token independently but use as input features, information about the surrounding tokens (sliding window). John saw the saw and decided to take it to the table. classifier CC Slide from Ray Mooney
54
Sequence Labeling as Classification Classify each token independently but use as input features, information about the surrounding tokens (sliding window). John saw the saw and decided to take it to the table. classifier VBD Slide from Ray Mooney
55
Sequence Labeling as Classification Classify each token independently but use as input features, information about the surrounding tokens (sliding window). John saw the saw and decided to take it to the table. classifier TO Slide from Ray Mooney
56
Sequence Labeling as Classification Classify each token independently but use as input features, information about the surrounding tokens (sliding window). John saw the saw and decided to take it to the table. classifier VB Slide from Ray Mooney
57
Sequence Labeling as Classification Classify each token independently but use as input features, information about the surrounding tokens (sliding window). John saw the saw and decided to take it to the table. classifier PRP Slide from Ray Mooney
58
Sequence Labeling as Classification Classify each token independently but use as input features, information about the surrounding tokens (sliding window). John saw the saw and decided to take it to the table. classifier IN Slide from Ray Mooney
59
Sequence Labeling as Classification Classify each token independently but use as input features, information about the surrounding tokens (sliding window). John saw the saw and decided to take it to the table. classifier DT Slide from Ray Mooney
60
Sequence Labeling as Classification Classify each token independently but use as input features, information about the surrounding tokens (sliding window). John saw the saw and decided to take it to the table. classifier NN Slide from Ray Mooney
61
Using Outputs as Inputs Better input features are usually the categories of the surrounding tokens, but these are not available yet. Can use the category of either the preceding or succeeding tokens by going forward or backward and using the previous output. Slide from Ray Mooney
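A sketch of the forward-classification scheme used on the following slides: tag left to right and feed each predicted tag back in as a feature. The classify function here is a toy rule-based stand-in for a trained classifier.

# Greedy forward sequence labeling: the previous output becomes an input feature.
def classify(word, prev_tag):
    # toy stand-in for a learned classifier
    if word[0].isupper():
        return "NNP"
    if word == "to":
        return "TO" if prev_tag in ("VBD", "VBN") else "IN"
    if word == "the":
        return "DT"
    return "NN"

def forward_tag(tokens):
    tags, prev = [], "<s>"
    for w in tokens:
        prev = classify(w, prev)          # feed the previous prediction forward
        tags.append(prev)
    return tags

print(forward_tag("John took the ball to the table".split()))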
62
Forward Classification John saw the saw and decided to take it to the table. classifier NNP Slide from Ray Mooney
63
Forward Classification NNP John saw the saw and decided to take it to the table. classifier VBD Slide from Ray Mooney
64
Forward Classification NNP VBD John saw the saw and decided to take it to the table. classifier DT Slide from Ray Mooney
65
Forward Classification NNP VBD DT John saw the saw and decided to take it to the table. classifier NN Slide from Ray Mooney
66
Forward Classification NNP VBD DT NN John saw the saw and decided to take it to the table. classifier CC Slide from Ray Mooney
67
Forward Classification NNP VBD DT NN CC John saw the saw and decided to take it to the table. classifier VBD Slide from Ray Mooney
68
Forward Classification NNP VBD DT NN CC VBD John saw the saw and decided to take it to the table. classifier TO Slide from Ray Mooney
69
Forward Classification NNP VBD DT NN CC VBD TO John saw the saw and decided to take it to the table. classifier VB Slide from Ray Mooney
70
Backward Classification Disambiguating “to” in this case would be even easier backward. DT NN John saw the saw and decided to take it to the table. classifier IN Slide from Ray Mooney
71
Backward Classification Disambiguating “to” in this case would be even easier backward. IN DT NN John saw the saw and decided to take it to the table. classifier PRP Slide from Ray Mooney
72
Backward Classification Disambiguating “to” in this case would be even easier backward. PRP IN DT NN John saw the saw and decided to take it to the table. classifier VB Slide from Ray Mooney
73
Backward Classification Disambiguating “to” in this case would be even easier backward. VB PRP IN DT NN John saw the saw and decided to take it to the table. classifier TO Slide from Ray Mooney
74
Backward Classification Disambiguating “to” in this case would be even easier backward. TO VB PRP IN DT NN John saw the saw and decided to take it to the table. classifier VBD Slide from Ray Mooney
75
Backward Classification Disambiguating “to” in this case would be even easier backward. VBD TO VB PRP IN DT NN John saw the saw and decided to take it to the table. classifier CC Slide from Ray Mooney
76
Backward Classification Disambiguating “to” in this case would be even easier backward. CC VBD TO VB PRP IN DT NN John saw the saw and decided to take it to the table. classifier VBD Slide from Ray Mooney
77
Backward Classification Disambiguating “to” in this case would be even easier backward. VBD CC VBD TO VB PRP IN DT NN John saw the saw and decided to take it to the table. classifier DT Slide from Ray Mooney
78
Backward Classification Disambiguating “to” in this case would be even easier backward. DT VBD CC VBD TO VB PRP IN DT NN John saw the saw and decided to take it to the table. classifier VBD Slide from Ray Mooney
79
Backward Classification Disambiguating “to” in this case would be even easier backward. VBD DT VBD CC VBD TO VB PRP IN DT NN John saw the saw and decided to take it to the table. classifier NNP Slide from Ray Mooney
80
NER as Sequence Labeling
81
Problems with using Classifiers for Sequence Labeling
It's not easy to integrate information from hidden labels on both sides.
We make a hard decision on each token.
– We'd rather choose a global optimum
– The best labeling for the whole sequence
– Keeping each local decision as just a probability, not a hard decision
82
Probabilistic Sequence Models
Probabilistic sequence models allow integrating uncertainty over multiple, interdependent classifications and collectively determine the most likely global assignment.
Two standard models:
– Hidden Markov Model (HMM)
– Conditional Random Field (CRF)
– Maximum Entropy Markov Model (MEMM) is a simplified version of a CRF
83
Hidden Markov Models (HMMs) Generative – Find parameters to maximize P(X,Y) Assumes features are independent When labeling X_i future observations are taken into account (forward-backward)
84
MaxEnt Markov Models (MEMMs) Discriminative – Find parameters to maximize P(Y|X) No longer assume that features are independent Do not take future observations into account (no forward-backward)
85
Conditional Random Fields (CRFs) Discriminative – Find parameters to maximize P(Y|X) Doesn't assume that features are independent When labeling Y_i future observations are taken into account The best of both worlds!
86
PROBABILISTIC CLASSIFICATION: GENERATIVE VS DISCRIMINATIVE
Let Y be the random variable for the class, which takes values {y_1, y_2, …, y_m}.
Let X be the random variable describing an instance consisting of a vector of values for n features; let x_k be a possible vector value for X and x_ij a possible value for X_i.
For classification, we need to compute P(Y=y_i | X=x_k) for i = 1…m.
This could be done using the joint distribution, but that requires estimating an exponential number of parameters.
87
Discriminative vs. Generative
Generative model: a model that generates the observed data randomly. Naïve Bayes: once the class label is known, all the features are independent.
Discriminative model: directly estimates the posterior probability, aiming to model the "discrimination" between different outputs. MaxEnt classifier: a linear combination of feature functions in the exponent.
Both generative and discriminative models describe distributions over (y, x), but they work in different directions.
88
Discriminative vs. Generative (figure: graphical models with observable and unobservable variables marked)
89
Generative vs. Discriminative Sequence Labeling Models HMMs are generative models and are not directly designed to maximize the performance of sequence labeling. They model the joint distribution P(O,Q). HMMs are trained to have an accurate probabilistic model of the underlying language, and not all aspects of this model benefit the sequence labeling task. Conditional Random Fields (CRFs) are specifically designed and trained to maximize performance of sequence labeling. They model the conditional distribution P(Q | O)
90
Classification
(Figure: graphical models over class Y and features X_1 … X_n)
– Naïve Bayes: generative
– Logistic Regression: conditional (discriminative)
91
Logistic Regression
Assumes a parametric form for directly estimating P(Y | X). For binary concepts, this is:
P(Y=1 | X) = 1 / (1 + exp(−(w_0 + Σ_i w_i X_i))), and P(Y=0 | X) = 1 − P(Y=1 | X)
92
Sequence Labeling
(Figure: graphical models over label sequence Y_1 … Y_T and observations X_1 … X_T)
– HMM: generative
– Linear-chain CRF: conditional (discriminative)
93
Simple Linear Chain CRF Features
Modeling the conditional distribution is similar to that used in multinomial logistic regression.
Create feature functions f_k(Y_t, Y_{t−1}, X_t):
– a feature for each state-transition pair (i, j): f_{i,j}(Y_t, Y_{t−1}, X_t) = 1 if Y_t = i and Y_{t−1} = j, and 0 otherwise
– a feature for each state-observation pair (i, o): f_{i,o}(Y_t, Y_{t−1}, X_t) = 1 if Y_t = i and X_t = o, and 0 otherwise
Note: the number of features grows quadratically in the number of states (i.e. tags).
94
Conditional Distribution for Linear Chain CRF
Using these feature functions for a simple linear chain CRF, we can define:
P(Y | X) = (1/Z(X)) · exp( Σ_t Σ_k λ_k f_k(Y_t, Y_{t−1}, X_t) )
where Z(X) is a normalization constant obtained by summing the same exponential over all possible label sequences.
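To make the formula concrete, here is a sketch that computes the unnormalised score of one labelling under invented transition and observation weights; the partition function Z(X), which would require summing over all labellings, is omitted.

# Unnormalised linear-chain CRF score for one labelling: sum over positions of
# transition and state-observation feature weights, then exponentiate.
import math

trans_w = {("B-PER", "O"): 0.5, ("O", "B-PER"): 0.7, ("O", "O"): 0.2}   # keys: (y_t, y_{t-1})
obs_w   = {("B-PER", "Jim"): 1.5, ("O", "bought"): 0.4, ("O", "shares"): 0.3}

def score(y, x, start="<s>"):
    total, prev = 0.0, start
    for y_t, x_t in zip(y, x):
        total += trans_w.get((y_t, prev), 0.0) + obs_w.get((y_t, x_t), 0.0)
        prev = y_t
    return math.exp(total)    # proportional to P(Y|X); divide by Z(X) to normalise

print(score(["B-PER", "O", "O"], ["Jim", "bought", "shares"]))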
95
Adding Token Features to a CRF
Can add token features X_{i,j}: each observation X_t is expanded into token features X_{t,1} … X_{t,m} (figure: each label Y_t connected to its token features).
Can add additional feature functions for each token feature to model the conditional distribution.
96
NER: EVALUATION
97
TYPICAL PERFORMANCE
98
NER Evaluation Campaigns
English NER – CoNLL 2003 – PER/ORG/LOC/MISC
– Training set: 203,621 tokens
– Development set: 51,362 tokens
– Test set: 46,435 tokens
Italian NER – Evalita 2009 – PER/ORG/LOC/GPE
– Development set: 223,706 tokens
– Test set: 90,556 tokens
Mention Detection – ACE 2005
– 599 documents
99
CoNLL2003 shared task (1)
English and German language; 4 types of NEs:
– LOC: location
– MISC: names of miscellaneous entities
– ORG: organization
– PER: person
Training set for developing the system; test data for the final evaluation.
100
CoNLL2003 shared task (2)
Data
– columns separated by a single space
– a word for each line
– an empty line after each sentence
– tags in IOB format
An example (word, POS, chunk, NE tag):
Milan NNP B-NP I-ORG
's POS B-NP O
player NN I-NP O
George NNP I-NP I-PER
Weah NNP I-NP I-PER
meet VBP B-VP O
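A small reader for this column format, under the assumption of whitespace-separated word/POS/chunk/NE columns and blank-line sentence breaks (any -DOCSTART- lines are read as ordinary tokens; filtering them is left out for brevity).

# Read CoNLL-2003-style data: one token per line, blank line between sentences.
def read_conll(path):
    sentences, current = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:                      # sentence boundary
                if current:
                    sentences.append(current)
                    current = []
                continue
            word, pos, chunk, ne = line.split()[:4]
            current.append((word, pos, chunk, ne))
    if current:
        sentences.append(current)
    return sentences

# Usage (file name is hypothetical):
# for sent in read_conll("eng.train"):
#     print([(w, ne) for w, _, _, ne in sent])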
101
CoNLL2003 shared task (3)
English: precision / recall / F
[FIJZ03] 88.99% 88.54% 88.76%
[CN03] 88.12% 88.51% 88.31%
[KSNM03] 85.93% 86.21% 86.07%
[ZJ03] 86.13% 84.88% 85.50%
---------------------------------------------------
[Ham03] 69.09% 53.26% 60.15%
baseline 71.91% 50.90% 59.61%
102
CURRENT RESEARCH ON NER
– New domains
– New approaches: semi-supervised, distant
– Handling many NE types
– Integration with Machine Translation
– Handling difficult linguistic phenomena such as metonymy
103
NEW DOMAINS BIOMEDICAL CHEMISTRY HUMANITIES: MORE FINE GRAINED TYPES
104
Bioinformatics Named Entities Protein DNA RNA Cell line Cell type Drug Chemical
105
NER IN THE HUMANITIES LOC SITE CULTURE
106
Semi-supervised learning Modest amounts of supervision – Small size of training data – Supervisor input sought when necessary Aims to match supervised learning performance, but with much less human effort Bootstrapping – Seeds used to identify contextual clues – Contextual clues used to find more NEs
107
Semi-supervised learning: examples include (Brin 1998); (Collins and Singer 1999); (Riloff and Jones 1999); (Cucchiarelli and Velardi 2001); (Pasca et al. 2006); (Heng and Grishman 2006); (Nadeau et al. 2006); and (Liao and Veeramachaneni 2009)
108
ASemiNER - Methodology
Input:
– A seed list of a few examples of a given NE type, e.g. 'Muhammad' and 'Obama' can be used as seed examples for entities of type person.
Parameters:
– Number of iterations
– Number of initial seeds
– The ranking measure (reliability measure)
109
Pattern Induction - Initial Patterns
– Sentences containing a seed instance are retrieved
– A number of tokens on each side of the seed are extracted, up to the sentence boundaries
TP pair = (Token/POS) pair
110
Pattern Induction - Final Patterns
– Tokens that are nouns: inflected forms
– Tokens that are verbs: stems
111
Pattern Induction - Final Patterns: "Trigger" words
– Lists of trigger nouns (e.g., alsayd `Mr.', alraeys `President', alduktur `Dr.'). They will be used as Arabic NE indicators or trigger words in the training phase. Arabic Wikipedia articles are crawled randomly, prepared, and POS-tagged. The nouns to the left and right of the named entity are extracted and collected. The most frequent nouns (inflected) are picked and stored as "trigger" nouns.
– Lists of trigger verbs (e.g., rasam `draw', naHat `sculpture', etc.). The most frequent verbs (stems) are picked and stored as "trigger" verbs.
112
Pattern Induction - Generalization
– TP pairs that contain nouns or verbs are stripped of their 'Token' parts, unless these tokens are in the corresponding lists of trigger words: alsayd/NN `Mr./NN' stays alsayd/NN, as alsayd `Mr.' is in the list of trigger nouns; qalam/NN `pen/NN' becomes /NN, as qalam `pen' is not among the trigger nouns.
– TP pairs that contain a preposition are kept without changes.
– TP pairs that contain other part-of-speech categories (e.g., proper noun, adjective, coordinating conjunction) are stripped of their 'Token' parts: mufyd/JJ `useful/JJ' becomes /JJ.
– All POS tags used for verbs (e.g., VBP, VBD, VBN) are converted to one form, VB.
– All POS tags used for nouns (e.g., NN, NNS) are converted to one form, NN.
– All POS tags used for proper nouns (e.g., NNP, NNPS) are converted to one form, NNP.
– The seed instance is replaced with its NE class tag.
113
Producing Final Patterns (table: Initial Pattern / Final Pattern / English Gloss)
114
Pattern Induction Two more Final Patterns
115
Pattern Induction
The Final Pattern Set (P) is modified and filtered every time a new pattern is added. Repeated patterns are rejected. A pattern consisting of fewer than six TP pairs should contain at least one 'Token' part (e.g. /VB /NN /NNP /NNP, which has none, would be rejected).
116
ASemiNER - Methodology
117
Instance Extraction ASemiNER retrieves from the training corpus the set of instances (I) that match any of the patterns in (P), using regular expressions (regexes). ASemiNER automatically generates regexes from the final patterns without any modification, regardless of the correctness of the POS tags assigned to proper nouns by the POS tagger.
118
Instance Extraction ASemiNER automatically adds information about the average NE length (e.g. 2 tokens) to the produced regexes.
119
ASemiNER - Methodology
120
Instance Ranking/Selection
Extracted instances in (I) are ranked according to:
– the number of distinct patterns used to extract them (pattern variety is a better cue to semantics than absolute frequency)
– Pointwise Mutual Information (PMI), pmi(i, p) = log( |i,p| / (|i| · |p|) ), where:
|i,p|: the frequency of the instance i extracted by pattern p
|i|: the frequency of the instance i in the corpus
|p|: the frequency of the pattern p in the corpus
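A sketch of this ranking step under stated assumptions: the exact combination used by ASemiNER is not spelled out on the slide, so here instances are ordered first by the number of distinct extracting patterns and then by a summed PMI score; all counts are invented.

# Rank extracted instances by pattern variety, breaking ties with summed PMI.
import math
from collections import defaultdict

joint     = {("Obama", "P1"): 12, ("Obama", "P2"): 5, ("Cairo", "P1"): 3}  # |i,p|
inst_freq = {"Obama": 40, "Cairo": 25}      # |i| in the corpus
pat_freq  = {"P1": 100, "P2": 60}           # |p| in the corpus

def rank(joint, inst_freq, pat_freq):
    by_inst = defaultdict(list)
    for (i, p), f_ip in joint.items():
        by_inst[i].append(math.log(f_ip / (inst_freq[i] * pat_freq[p])))
    # primary key: number of distinct extracting patterns; secondary: summed PMI
    return sorted(by_inst, key=lambda i: (len(by_inst[i]), sum(by_inst[i])), reverse=True)

print(rank(joint, inst_freq, pat_freq))      # ['Obama', 'Cairo']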
121
ASemiNER - Methodology: the top m instances are kept, where m is set to the number of instances in the previous iteration + 1.
122
Experiments & Results Several experiments – Different values of the parameters No. of iterations. No. of initial seeds. Ranking measure. Training data – ACE 2005 (Linguistic Data Consortium, LDC) – ANERcorp training set (Benajiba et al. 2007) Test data – ANERcorp test corpus
123
Experiments & Results Several experiments – Standard NE types: Person Location Organization – Specialised NE types: Politicians Sportspersons Artists
124
Simple Models (standard NE types) ANERcorp (Training data) Without iterations. No. of Initial seeds : 5
125
ASemiNER (Specialised NE types)
Politicians, Artists, and Sportspersons.
Unlike supervised techniques, ASemiNER does not require additional annotated training data or re-annotating the existing data. It requires only a minor modification: for each new NE type, generate new trigger noun and verb lists.
– Artist trigger nouns (e.g., actress, actor, painter, etc.)
– Politician trigger nouns (e.g., president, party, king, etc.)
– Sportsperson trigger nouns (e.g., player, football, athletic, etc.)
126
ASemiNER (Specialised NE types) – ASemiNER performs as well as it does on the standard person category – ASemiNER proved to be easily adaptable when extracting new types of NEs
127
ACTIVE LEARNING
128
Unsupervised learning Clustering Lexical resources (WordNet) Lexical patterns Statistics computed over large unannotated corpus Examples – Assign topic signature to WordNet synsets based on word context (words that co-occur) – PMI-IR: measure of co-occurrence of expressions using web queries
129
DISTANT LEARNING
130
WIKIPEDIA AND NER (slide by Truc-Vien T. Nguyen, May 2012)
Query: Giotto was called to work in Padua, and also in Rimini
131
CRF++ (1)
– Can redefine feature sets
– Written in C++ with STL
– Fast training based on L-BFGS, for large-scale data
– Less memory usage both in training and testing
– Encoding/decoding in practical time
– Available as open source software: http://crfpp.googlecode.com/svn/trunk/doc/index.html
132
CRF++ (2)
– Uses Conditional Random Fields (CRFs); the CRF methodology uses statistically correlated features and trains them discriminatively
– Simple, customizable, open-source implementation for segmenting/labeling sequential data
– Can define unigram/bigram features and relative positions (window size)
133
Template basic
An example (word, POS, chunk columns):
He PRP B-NP
reckons VBZ B-VP
the DT B-NP   << CURRENT TOKEN
current JJ I-NP
account NN I-NP
Template -> Expanded feature:
%x[0,0] -> the
%x[0,1] -> DT
%x[-1,0] -> reckons
%x[-2,1] -> PRP
%x[0,0]/%x[0,1] -> the/DT
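A minimal re-implementation of the template-expansion step shown above, for illustration only (this is not CRF++ code): %x[row,col] picks the field at the given row offset from the current token and the given column index.

# Expand CRF++-style unigram templates against a token window.
import re

sentence = [["He", "PRP", "B-NP"],
            ["reckons", "VBZ", "B-VP"],
            ["the", "DT", "B-NP"],        # current token (index 2)
            ["current", "JJ", "I-NP"],
            ["account", "NN", "I-NP"]]

def expand(template, sent, t):
    def repl(m):
        row, col = int(m.group(1)), int(m.group(2))
        pos = t + row
        # "_B_" is a stand-in for CRF++'s out-of-range boundary markers
        return sent[pos][col] if 0 <= pos < len(sent) else "_B_"
    return re.sub(r"%x\[(-?\d+),(\d+)\]", repl, template)

for tpl in ["%x[0,0]", "%x[0,1]", "%x[-1,0]", "%x[-2,1]", "%x[0,0]/%x[0,1]"]:
    print(tpl, "->", expand(tpl, sentence, 2))
# %x[0,0] -> the, %x[0,1] -> DT, %x[-1,0] -> reckons, %x[-2,1] -> PRP, %x[0,0]/%x[0,1] -> the/DT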
134
A Case Study Installing CRF++ Data for Training and Test Making the baseline Training CRF++ on the – NER dataset: English CoNLL2003, Italian EVALITA – Mention classification: ACE 2005 dataset Annotating the test corpus with CRF++ Evaluating results Exercise
135
Installing CRF++
– First, ssh compute-0-x where x=1..10
– Unzip the lab--NER.tar.gz file (tar -xvzf lab--NER.tar.gz)
– Enter the lab--NER directory
– Unzip the CRF++-0.54.tar.gz file (tar -xvzf CRF++-0.54.tar.gz)
– Enter the CRF++-0.54 directory
– Run ./configure
– Run make
136
Training/Classification (1)
Notations:
– xxx: train_it.dat / train_en.dat / train_mention.dat
– nnn: it.model / en.model / mention.model
– yyy: test_it.dat / test_en.dat / test_mention.dat
– zzz: test_it.tagged / test_en.tagged / test_mention.tagged
– ttt: test_it.eval / test_en.eval / test_mention.eval
Note that the test_it.dat already contains the right NE tags but the system is not using this information for tagging the data.
137
Training/Classification (2)
Enter the CRF++-0.54 directory.
Training:
./crf_learn ../templates/template_4 ../corpus/xxx ../models/nnn
Classification:
./crf_test -m ../models/nnn ../corpus/yyy > ../corpus/zzz
Evaluation:
perl ../eval/conlleval.pl ../corpus/zzz > ../corpus/ttt
See the results:
cat ../corpus/ttt
138
THANKS I used slides from Bernardo Magnini, Chris Manning, Roberto Zanoli, Ray Mooney