Extracting Personal Names from Email: Applying Named Entity Recognition to Informal Text Einat Minkov & Richard C. Wang Language Technologies Institute.

Presentation transcript:

Extracting Personal Names from Email: Applying Named Entity Recognition to Informal Text
Einat Minkov & Richard C. Wang, Language Technologies Institute
William W. Cohen, Center for Automated Learning and Discovery
School of Computer Science, Carnegie Mellon University

October 7, 2005, CMU School of Computer Science

What is an informal text?

A text that is…
– Written for a narrow audience
   – Group/task-specific abbreviations often used
   – Not self-contained (context shared by a related group of people)
– Not carefully prepared
   – Contains grammatical and spelling errors
   – Does not follow capitalization conventions

Some examples are…
– Instant messages
– Newsgroup postings
– Email messages

Objective / Outline

Investigate named entity recognition (NER) for informal text
– Conduct experiments on recognizing personal names in email
   – Examine indicative features in email and newswire
   – Suggest specialized features for email
   – Evaluate performance of a state-of-the-art extractor (CRF)
   – Analyze repetition of names in email and newswire
   – Suggest and evaluate a recall-enhancing method that is effective for email

Corpora

Mgmt corpora – emails from a management course at CMU in which students form teams to run simulated companies
– Teams: Each set (train/tune/test) formed by different simulation teams
– Game: Each set formed by different days during the simulation period

Enron corpora – emails from Enron Corporation
– Meetings: Each set formed by randomly selected meeting-related emails
– Random: Each set formed by repeatedly sampling a user, then sampling an email from that user, both at random

[Corpus statistics table not preserved in this transcript]
Note: The numbers of words and names refer to the whole annotated corpora

Extraction Method

Train Conditional Random Fields (CRF) to label and extract personal names
– A machine-learning based probabilistic approach to labeling sequences of examples

Learning reduces NER to the task of tagging, or classifying, each word using a set of five tags:
– Unique: A one-token entity
– Begin: The first token of a multi-token entity
– End: The last token of a multi-token entity
– Inside: Any other token of a multi-token entity
– Outside: A token that is not part of an entity

Example:
Einat   and      Richard  Wang  met      William  W.      Cohen  today
Unique  Outside  Begin    End   Outside  Begin    Inside  End    Outside
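The five-tag scheme can be illustrated with a minimal Python sketch. The function name `tag_tokens` and the (start, end) span format with an exclusive end index are assumptions made for this illustration, not part of the slides; the sketch only converts known entity spans into tags and does not train a CRF.

```python
from typing import List, Tuple


def tag_tokens(tokens: List[str], entities: List[Tuple[int, int]]) -> List[str]:
    """Assign one of the five tags to every token.

    `entities` holds (start, end) token indices with `end` exclusive;
    this span format is a choice made for the sketch.
    """
    tags = ["Outside"] * len(tokens)
    for start, end in entities:
        if end - start == 1:
            tags[start] = "Unique"          # one-token entity
        else:
            tags[start] = "Begin"           # first token of the entity
            tags[end - 1] = "End"           # last token of the entity
            for i in range(start + 1, end - 1):
                tags[i] = "Inside"          # any token in between
    return tags


tokens = "Einat and Richard Wang met William W. Cohen today".split()
entities = [(0, 1), (2, 4), (5, 8)]  # Einat / Richard Wang / William W. Cohen
tags = tag_tokens(tokens, entities)
for token, tag in zip(tokens, tags):
    print(f"{token:8s} {tag}")
```

Run on the slide's example sentence, the sketch reproduces the tag sequence shown above.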

Top Learned Features

Features most indicative of a token being part of a name in a Conditional Random Fields (CRF) extractor

Note: A feature is denoted by its direction (left/right) relative to the focus word, its offset, and its lexical value

[Feature lists for Newswire (MUC-6) and Email (Mgmt-Game) not preserved; surviving callout labels: In Quoted Excerpt, In Signature, Name Titles, Job Titles]

Results show that…
Email and newswire text have very different characteristics

Our Proposed Features

[Feature table not preserved in this transcript]
Note: All features are instantiated for the focus word t, and 3 tokens to the left and right of t
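As a rough illustration of instantiating features over a window of 3 tokens on each side of the focus word, the sketch below emits purely lexical features named by direction, offset, and value. The function `window_features` and the `direction.offset.value` string format are hypothetical; the actual feature set from the slide's table is not reproduced here.

```python
from typing import List


def window_features(tokens: List[str], t: int, width: int = 3) -> List[str]:
    """Lexical features for focus position t and up to `width` tokens per side."""
    feats = [f"focus.0.{tokens[t].lower()}"]
    for offset in range(1, width + 1):
        if t - offset >= 0:                      # token exists to the left
            feats.append(f"left.{offset}.{tokens[t - offset].lower()}")
        if t + offset < len(tokens):             # token exists to the right
            feats.append(f"right.{offset}.{tokens[t + offset].lower()}")
    return feats


tokens = "Please contact benjamin and confirm the meeting".split()
feats = window_features(tokens, 2)  # focus word: "benjamin"
print(feats)
```

Tokens farther than three positions from the focus word (here "meeting") produce no feature.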

Feature Evaluation

Entity-level F1 of learned extractor (CRF) using:
– Basic features (B)
– Basic and Email features (B+E)
– Basic and Dictionary features (B+D)
– All features (B+D+E)

[Charts not preserved: F1 by feature set; precision and recall for B+D+E]

Results show that…
1) Dictionary and Email features are useful (best when combined)
2) Generally high precision but low recall
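Entity-level scoring counts a prediction as correct only when the whole entity span matches a gold span. A minimal sketch follows, assuming spans are represented as (doc_id, start, end) tuples; that representation and the name `entity_prf` are illustrative choices, and the data is made up.

```python
from typing import Set, Tuple

Span = Tuple[int, int, int]  # (doc_id, start, end), end exclusive


def entity_prf(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Entity-level precision, recall, and F1 using exact span matching."""
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = {(0, 0, 1), (0, 2, 4), (0, 5, 8), (1, 0, 2)}
predicted = {(0, 0, 1), (0, 2, 4), (0, 5, 7)}   # last span has a wrong boundary
p, r, f1 = entity_prf(gold, predicted)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")
```

A partially overlapping span such as (0, 5, 7) earns no credit, which is why entity-level scores are stricter than token-level ones.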

What’s Next?

Previous experiments show high precision but low recall
– Next goal: Improve recall

One recall-enhancing method
– Look for multiple occurrences of names in a corpus

We conduct experimental studies
– Examine repetition patterns of names in email and newswire text
– Examine occurrences of names within a single document and across multiple documents

Doc. Frequency of Names

Percentage of person-name tokens that appear in at most K distinct documents, as a function of K

unique(A): duplicates removed from set A
df(w): # of documents containing token w

[Plot not preserved: x-axis Document Frequency, y-axis Percentage]
– 30% of names in Mgmt-Game appear only in one document
– Nearly 80% of names in MUC-6 appear only in one document
– About 20% of names in Mgmt-Game appear in 10+ documents
– Only 1.3% of names in MUC-6 appear in 10+ documents

Results show that…
Repetition of names across multiple documents is more common in email corpora
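The document-frequency statistic can be sketched as follows. The slide's exact formula is not preserved, so this sketch assumes the percentage is taken over unique (case-folded) name tokens and that df(w) is counted over documents in which w occurs as a name token. The helper names and the toy data are hypothetical.

```python
from collections import Counter
from typing import List


def name_doc_frequency(docs_names: List[List[str]]) -> Counter:
    """df(w): number of documents in which name token w occurs."""
    df: Counter = Counter()
    for names in docs_names:
        df.update({n.lower() for n in names})  # count each token once per document
    return df


def pct_at_most_k(df: Counter, k: int) -> float:
    """Percentage of unique name tokens with df(w) <= k."""
    if not df:
        return 0.0
    return 100.0 * sum(1 for c in df.values() if c <= k) / len(df)


# Toy corpus: the name tokens found in each of three documents.
docs_names = [["Einat", "Richard"], ["Richard", "William"], ["richard"]]
df = name_doc_frequency(docs_names)
for k in (1, 2, 3):
    print(k, round(pct_at_most_k(df, k), 1))
```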

Single vs. Multiple Documents

We define the following extractors:
1. CRF: baseline trained with all features
2. SDR (Single Document Repetition): rules that extract person-name tokens that appear more than once within a single document; hence an upper bound on recall using only name repetition within a single document
3. MDR (Multiple Document Repetition): rules that extract person-name tokens that appear in more than one document; hence an upper bound on recall using only name repetition across multiple documents
4. SDR+CRF: union of extractions by SDR and CRF; hence an upper bound on recall using CRF and name repetition within a single document
5. MDR+CRF: union of extractions by MDR and CRF; hence an upper bound on recall using CRF and name repetition across multiple documents
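The SDR and MDR oracle rules and their unions with CRF output can be sketched in a few lines. This is an illustration under stated assumptions: documents are lists of (token, is_gold_name_token) pairs, repetition is counted on case-folded token strings, and the CRF output is a hand-written set of positions rather than a real model's predictions.

```python
from collections import Counter
from typing import List, Set, Tuple

Doc = List[Tuple[str, bool]]   # (token, is_gold_name_token)
Pos = Tuple[int, int]          # (document index, token index)


def sdr(docs: List[Doc]) -> Set[Pos]:
    """Gold name tokens whose string occurs more than once in the same document."""
    hits: Set[Pos] = set()
    for d, doc in enumerate(docs):
        counts = Counter(tok.lower() for tok, _ in doc)
        hits |= {(d, i) for i, (tok, is_name) in enumerate(doc)
                 if is_name and counts[tok.lower()] > 1}
    return hits


def mdr(docs: List[Doc]) -> Set[Pos]:
    """Gold name tokens whose string occurs in more than one document."""
    df: Counter = Counter()
    for doc in docs:
        df.update({tok.lower() for tok, _ in doc})
    return {(d, i) for d, doc in enumerate(docs)
            for i, (tok, is_name) in enumerate(doc)
            if is_name and df[tok.lower()] > 1}


def token_recall(docs: List[Doc], extracted: Set[Pos]) -> float:
    """Fraction of gold name tokens covered by the extracted positions."""
    gold = {(d, i) for d, doc in enumerate(docs)
            for i, (_, is_name) in enumerate(doc) if is_name}
    return len(gold & extracted) / len(gold) if gold else 0.0


docs = [
    [("Benjamin", True), ("called", False), ("benjamin", True)],
    [("ask", False), ("Benjamin", True), ("and", False), ("Einat", True)],
]
crf = {(1, 3)}  # pretend the CRF found only "Einat"
for label, extracted in [("SDR", sdr(docs)), ("MDR", mdr(docs)),
                         ("SDR+CRF", sdr(docs) | crf), ("MDR+CRF", mdr(docs) | crf)]:
    print(label, token_recall(docs, extracted))
```

Because both rules consult the gold labels, their recall is an upper bound on what a repetition-based method could achieve, not a realistic extractor.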

Single vs. Multiple Documents

Token-level upper bounds on recall and potential recall-gains associated with methods that look for name tokens that re-occur within a single document or across multiple documents

[Chart not preserved; its callouts:]
– MUC-6 has highest recall using SDR
– MUC-6 has highest recall-gain using SDR
– MUC-6 has lowest recall using MDR
– MUC-6 has lowest recall-gain using MDR

Results show that…
Higher recall and potential recall-gains can be obtained for email corpora using the MDR method

What’s Next?

Our studies show the potential of exploiting repetition of names over multiple documents for improving recall in email corpora

We suggest a recall-enhancing method:
1. Auto-construct a dictionary of predicted names and their variants from the test set
2. Statistically filter out noisy names from the dictionary
3. Match names globally from the inferred dictionary onto the test set, exploiting repetition of names

Note: A “dictionary” is simply a list of one or more tokens

Name Dictionary Construction

Every name in the test set predicted by the learned extractor (CRF), trained with all features, is transformed into a set of name variants and inserted into a dictionary

Transformation Example
Name variants of “Benjamin Brown Smith”. The original name is included by default
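The slide's transformation rules are not preserved in this transcript, so the following is only a hypothetical variant generator. It assumes three kinds of transformation: abbreviating a token to its initial, joining tokens with a space or a hyphen, and dropping middle tokens. The function `name_variants` is an invented name, and the rule set likely differs from the authors' (it over-generates some forms).

```python
from itertools import product
from typing import Set


def name_variants(name: str) -> Set[str]:
    """Generate lowercase variants of a predicted name (hypothetical rules)."""
    tokens = name.lower().split()
    if len(tokens) == 1:
        return {tokens[0]}
    variants = {" ".join(tokens)}            # the original name is always included
    candidates = [tokens]
    if len(tokens) > 2:
        candidates.append([tokens[0], tokens[-1]])   # drop the middle tokens
    for cand in candidates:
        forms = [(tok, tok[0] + ".") for tok in cand]   # full token or initial
        for choice in product(*forms):
            for seps in product(" -", repeat=len(choice) - 1):  # space or hyphen
                parts = [choice[0]]
                for sep, tok in zip(seps, choice[1:]):
                    parts += [sep, tok]
                variants.add("".join(parts))
    variants.update(tokens)                  # each single token on its own
    return variants


variants = name_variants("Benjamin Brown Smith")
print(len(variants), sorted(variants)[:5])
```

Single tokens such as "brown" end up in the dictionary under these rules, which is the kind of noisy entry a later filtering step would need to handle.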

Name Dictionary Filtering

The previously constructed dictionary contains noisy names
– e.g. “brown” can also refer to a color
– Next goal: Filter out noisy names

We suggest a filtering scheme that removes every single-token name w from the dictionary when PF.IDF(w) < Θ

PF.IDF: Predicted Frequency × Inverse Document Frequency
cpf(w): # of times w is predicted as a name-token in corpus
ctf(w): # of occurrences of w in corpus
df(w): document frequency of w in corpus
N: # of documents in corpus

Words that get low PF.IDF scores are either highly ambiguous names or very common words in the corpus

Θ = 0.16 optimizes entity-level F1 on the tune sets; thus, we apply the same threshold to our test sets

Note: “Corpus” here refers to the test set in our experiments
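The filtering step can be sketched as below. The slide's formula is not preserved in this transcript, so the sketch assumes PF(w) = cpf(w) / ctf(w) and a normalized IDF(w) = log((N + 0.5) / df(w)) / log(N + 1); treat both as assumptions rather than the authors' exact definition. The function names and the counts in the example are made up for illustration.

```python
import math
from typing import Dict, Set, Tuple


def pf_idf(cpf: int, ctf: int, df: int, n_docs: int) -> float:
    """Predicted frequency times inverse document frequency (assumed form)."""
    pf = cpf / ctf                                        # how often w is predicted as a name
    idf = math.log((n_docs + 0.5) / df) / math.log(n_docs + 1)  # normalized IDF
    return pf * idf


def filter_dictionary(dictionary: Set[str],
                      stats: Dict[str, Tuple[int, int, int]],
                      n_docs: int,
                      theta: float = 0.16) -> Set[str]:
    """Drop single-token names whose PF.IDF falls below theta.

    `stats[w]` is (cpf, ctf, df); multi-token names are always kept.
    """
    kept = set()
    for name in dictionary:
        single = len(name.split()) == 1
        if single and name in stats and pf_idf(*stats[name], n_docs) < theta:
            continue
        kept.add(name)
    return kept


dictionary = {"brown", "smith", "benjamin brown smith"}
stats = {"brown": (2, 40, 30), "smith": (9, 10, 5)}   # illustrative counts only
kept = filter_dictionary(dictionary, stats, n_docs=100)
print(sorted(kept))
```

With these toy counts, "brown" is rarely predicted as a name and occurs in many documents, so it scores low and is removed, while "smith" is kept.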

Name Matching

– A window slides through every token in the test set
– A match occurs when the tokens in a window start with the longest possible name variant in the dictionary
– All matched names are marked for evaluation

Names Matching Example
I called Benjamin Brown Smith and left a message to send us an email if he could come. I have not received his email yet. He might not be able to come. We may want to postpone until tomorrow morning. Do you still have our class schedule? Please contact benjamin and confirm the meeting. I do not have classes tomorrow morning.

[Highlighting not preserved; legend: Predicted by CRF, Missed by CRF]

Filtered Dictionary
… benjamin brown smith benjamin-brown smith benjamin brown-smith benjamin-brown-smith benjamin brown s. benjamin-b. smith benjamin b. smith benjamin brown-s. benjamin-brown s. benjamin-brown-s benjamin-b. s. benjamin-smith benjamin smith b. brown smith benjamin b. s. b. brown-smith benjamin-s. benjamin s. b. brown s. b. b. smith b. brown-s. benjamin b. smith b. b. s. smith b. s. …
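The sliding-window matching can be sketched as a greedy longest-match scan. The sketch assumes case-insensitive comparison, whitespace-separated dictionary entries, and input text that is already tokenized with punctuation split off; `match_names` is a hypothetical helper, not the authors' implementation.

```python
from typing import List, Set, Tuple


def match_names(tokens: List[str], dictionary: Set[str]) -> List[Tuple[int, int]]:
    """Return (start, end) spans, end exclusive, of longest dictionary matches."""
    entries = {tuple(e.lower().split()) for e in dictionary}
    max_len = max((len(e) for e in entries), default=0)
    lowered = [t.lower() for t in tokens]
    matches = []
    i = 0
    while i < len(lowered):
        # Try the longest window first so longer variants win over shorter ones.
        for n in range(min(max_len, len(lowered) - i), 0, -1):
            if tuple(lowered[i:i + n]) in entries:
                matches.append((i, i + n))
                i += n            # continue after the matched name
                break
        else:
            i += 1                # no variant starts at this token
    return matches


dictionary = {"benjamin brown smith", "benjamin smith", "benjamin", "smith"}
tokens = "I called Benjamin Brown Smith . Please contact benjamin and confirm".split()
matches = match_names(tokens, dictionary)
print([" ".join(tokens[s:e]) for s, e in matches])
```

The lowercase "benjamin" is matched even though a case-sensitive extractor might miss it, which is the recall gain the method aims for.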

Experimental Results

Entity-level relative improvements (and final scores) after applying our recall-enhancing method on test sets
– Baseline: learned extractor (CRF) trained with all features

Results show that…
1) Recall improved significantly with a small sacrifice in precision
2) F1 scores improved in all cases

Conclusion

– Email and newswire text have different characteristics
– We suggested a set of specialized features for name extraction on email, exploiting structural regularities in email
– Exploiting name repetition over multiple documents is important for improving recall in email corpora
– We presented the PF.IDF recall-enhancing method, which improves recall significantly with a small sacrifice in precision

Thank You!