CS 6998 NLP for the Web Columbia University 04/22/2010 Analyzing Wikipedia and Gold-Standard Corpora for NER Training William Y. Wang Computer Science.

CS 6998 NLP for the Web, Columbia University, 04/22/2010
Analyzing Wikipedia and Gold-Standard Corpora for NER Training
William Y. Wang, Computer Science
Nothman et al. 2009, EACL

Outline
1. Motivation
2. NER and Gold-Standard Corpora
3. The Problem: Cross-corpora Performance
4. Wikipedia for NER
5. Results
6. Conclusion and My Observations

Motivation
1. Manual annotation is costly: (1) it is expensive, (2) it takes time, (3) it brings extra problems. Can we use linguistic resources to create an NER corpus automatically?
2. What is the cross-corpora NER performance?
3. How can we use Web resources (e.g. Wikipedia) to improve NER?

NER Gold Corpora
1. MUC-7: locations (LOC), organizations (ORG), personal names (PER)
2. CoNLL-03: LOC, ORG, PER, miscellaneous (MISC)
3. BBN: 54 tags in Penn Treebank
[Table: columns Corpus, Tags, Train Tokens, Dev Tokens, Test Tokens; rows MUC, CoNLL, BBN. Cell values are not preserved in the transcript.]

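Problem: Poor Cross-corpus Performance
[Table: rows are the training corpus (MUC, CoNLL, BBN); columns are the test corpus, grouped as With MISC (CoNLL, BBN) and Without MISC (MUC, CoNLL, BBN). MUC has no With MISC entry. Cell values are not preserved in the transcript.]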
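A cross-corpus comparison like the one in the table above needs a single scoring function applied to every train/test pair. The slide does not name the metric, so the following is only a minimal sketch that assumes entity-level F-score over flat tag sequences (no B-/I- prefixes). The helper names `extract_entities`, `without_misc`, and `f_score` are hypothetical and are not taken from the paper.

```python
def extract_entities(tags):
    """Collect (start, end, type) spans from a flat tag sequence.

    Assumption: tags have no B-/I- prefixes, 'O' marks non-entity tokens,
    and a run of identical non-O tags is one entity.
    """
    spans, start = set(), None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last span
        if start is not None and tag != tags[start]:
            spans.add((start, i, tags[start]))
            start = None
        if start is None and tag != "O":
            start = i
    return spans


def without_misc(tags):
    """Map MISC to O, mirroring the 'Without MISC' setting."""
    return ["O" if t == "MISC" else t for t in tags]


def f_score(gold, pred):
    """Entity-level F1: a predicted entity counts only on an exact span and type match."""
    g, p = extract_entities(gold), extract_entities(pred)
    correct = len(g & p)
    precision = correct / len(p) if p else 0.0
    recall = correct / len(g) if g else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = ["PER", "PER", "O", "LOC", "MISC"]
pred = ["PER", "PER", "O", "ORG", "O"]
score_with_misc = f_score(gold, pred)
score_without_misc = f_score(without_misc(gold), without_misc(pred))
```

In this toy pair the model finds the person but mislabels the location, and the MISC entity only counts against it in the With MISC setting.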

Corpus and Error Analysis
N-gram tag variation: check the tags of all n-grams that appear multiple times to see whether their NE tags are consistent.
Entity type frequency: (1) POS tag with its NE tag (e.g. nationalities are often tagged JJ or NNPS), (2) wordtypes, (3) wordtypes with function words kept (e.g. Bank of New England -> Aaa of Aaa Aaa).
Tag sequence confusion: look into the details of the confusion matrix.
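Two of the analyses above are easy to illustrate. The sketch below is not the paper's code: the function-word list, the extra shapes (`AAA`, `aaa`, `0`), and the names `word_type`, `entity_shape`, and `ngram_tag_variation` are assumptions made for illustration. It maps an entity to its wordtype pattern and reports n-grams that occur with more than one tag sequence.

```python
from collections import defaultdict

# Hypothetical function-word list; the real analysis may use a different one.
FUNCTION_WORDS = {"of", "the", "and", "for", "in"}


def word_type(token):
    """Collapse a token to a coarse shape, keeping function words as they are."""
    if token.lower() in FUNCTION_WORDS:
        return token.lower()
    if token.isdigit():
        return "0"
    if token.isupper():
        return "AAA"
    if token[:1].isupper():
        return "Aaa"
    return "aaa"


def entity_shape(entity):
    """Wordtype pattern of a whitespace-tokenized entity string."""
    return " ".join(word_type(t) for t in entity.split())


def ngram_tag_variation(sentences, n=2):
    """Return the n-grams that were seen with more than one NE tag sequence.

    sentences: list of sentences, each a list of (token, tag) pairs.
    """
    seen = defaultdict(set)
    for sent in sentences:
        for i in range(len(sent) - n + 1):
            window = sent[i:i + n]
            tokens = tuple(tok for tok, _ in window)
            tags = tuple(tag for _, tag in window)
            seen[tokens].add(tags)
    return {toks: tagsets for toks, tagsets in seen.items() if len(tagsets) > 1}


shape = entity_shape("Bank of New England")

corpus = [
    [("New", "LOC"), ("York", "LOC"), ("is", "O")],
    [("New", "ORG"), ("York", "ORG"), ("Times", "ORG")],
]
inconsistent = ngram_tag_variation(corpus, n=2)
```

Here `shape` reproduces the slide's example pattern, and `inconsistent` flags the bigram ("New", "York") because it is tagged LOC in one sentence and ORG in the other. Such variation is not always an annotation error, since the same n-gram can legitimately belong to different entities.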

Using Wikipedia to Build NER Corpus
1. Classify all articles into entity classes
2. Split Wikipedia articles into sentences
3. Label NEs according to link targets
4. Select sentences for inclusion in a corpus
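Steps 3 and 4 can be sketched on a single sentence of wiki markup. This is an illustration under stated assumptions, not the authors' pipeline: `ARTICLE_CLASS` stands in for the output of step 1 with made-up entries, links are assumed to use the `[[Target|anchor]]` syntax, tokenization is by whitespace, and the selection rule (drop a sentence if any link target has no known class) is a simplification chosen for this example.

```python
import re

# Hypothetical output of step 1: article title -> entity class.
ARTICLE_CLASS = {
    "Columbia University": "ORG",
    "New York City": "LOC",
    "Barack Obama": "PER",
}

# Matches [[Target]] and [[Target|anchor text]].
LINK = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")


def label_sentence(wikitext):
    """Tag anchor-text tokens with the class of the link target, 'O' elsewhere.

    Returns None when a link target has no known class, so the caller can
    leave that sentence out of the corpus (assumed selection rule).
    """
    tagged, pos = [], 0
    for m in LINK.finditer(wikitext):
        tagged += [(t, "O") for t in wikitext[pos:m.start()].split()]
        target = m.group(1)
        anchor = m.group(2) or target
        tag = ARTICLE_CLASS.get(target)
        if tag is None:
            return None
        tagged += [(t, tag) for t in anchor.split()]
        pos = m.end()
    tagged += [(t, "O") for t in wikitext[pos:].split()]
    return tagged


sentences = [
    "[[Barack Obama|Obama]] studied at [[Columbia University]] .",
    "[[Some Unclassified Page]] is mentioned here .",
]
labeled = [label_sentence(s) for s in sentences]
corpus = [s for s in labeled if s is not None]
```

Only the first sentence survives selection, with "Obama" tagged PER and "Columbia University" tagged ORG. Entity mentions that are not linked stay tagged O, which is one reason sentence selection matters.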

Improve Wikipedia NER
Baseline: 58.9% and 62.3% on CoNLL and BBN
1. Inferring extra links using Wikipedia disambiguation pages
2. Personal titles: not all preceding titles indicate PER (e.g. Prime Minister of Australia)
3. Previously missed JJ entities (e.g. American / MISC)
4. Miscellaneous changes

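Results
[Table: rows are the training corpus (MUC, CoNLL, BBN, and several Wikipedia-derived WP corpora); columns are the test corpus, grouped as With MISC (CoNLL, BBN) and Without MISC (MUC, CoNLL, BBN). MUC has no With MISC entry. Cell values are not preserved in the transcript.]
Results are on the DEV sets (higher than, but similar to, the test set results).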

Conclusion
The choice of NER training corpus has a huge impact on performance on its corresponding test set.
Annotation-free Wikipedia NER corpora were created.
Wikipedia data performs better in the cross-corpora NER task.
There is still much room for improvement.

Comments
What I like about this paper:
The scope of this paper is unique (analogy: cross-cultural studies).
It uses novel linguistic resources to solve basic NLP problems.
Good results.
Relatively clear and easy to understand.
What I don't like about this paper:
The overall method for improving Wikipedia NER training is not a principled approach.

Overall Assessment: 8/10

Thank you!