Exploiting Named Entity Taggers in a Second Language Thamar Solorio Computer Science Department National Institute of Astrophysics, Optics and Electronics.

Slides:



Advertisements
Similar presentations
A Probabilistic Representation of Systemic Functional Grammar Robert Munro Department of Linguistics, SOAS, University of London.
Advertisements

University of Sheffield NLP Module 11: Advanced Machine Learning.
Exploring the Effectiveness of Lexical Ontologies for Modeling Temporal Relations with Markov Logic Eun Y. Ha, Alok Baikadi, Carlyle Licata, Bradford Mott,
Linking Entities in #Microposts ROMIL BANSAL, SANDEEP PANEM, PRIYA RADHAKRISHNAN, MANISH GUPTA, VASUDEVA VARMA INTERNATIONAL INSTITUTE OF INFORMATION TECHNOLOGY,
Jing-Shin Chang National Chi Nan University, IJCNLP-2013, Nagoya 2013/10/15 ACLCLP – Activities ( ) & Text Corpora.
Universität des Saarlandes Seminar: Recent Advances in Parsing Technology Winter Semester Jesús Calvillo.
Mining Wiki Resources for Multilingual Named Entity Recognition Alexander E. Richman & Patrick Schone Reporter: Chia-Ying Lee Advisor: Prof. Hsin-Hsi Chen.
Shallow Processing: Summary Shallow Processing Techniques for NLP Ling570 December 7, 2011.
Hindi POS tagging and chunking : An MEMM approach Aniket Dalal Kumar Nagaraj Uma Sawant Sandeep Shelke Under the guidance of Prof. P. Bhattacharyya.
Gimme’ The Context: Context- driven Automatic Semantic Annotation with CPANKOW Philipp Cimiano et al.
A Framework for Named Entity Recognition in the Open Domain Richard Evans Research Group in Computational Linguistics University of Wolverhampton UK
Named Entity Recognition and the Stanford NER Software Jenny Rose Finkel Stanford University March 9, 2007.
Extracting Opinions, Opinion Holders, and Topics Expressed in Online News Media Text Soo-Min Kim and Eduard Hovy USC Information Sciences Institute 4676.
Albert Gatt Corpora and Statistical Methods Lecture 9.
Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification on Reviews Peter D. Turney Institute for Information Technology National.
Computational Methods to Vocalize Arabic Texts H. Safadi*, O. Al Dakkak** & N. Ghneim**
Part II. Statistical NLP Advanced Artificial Intelligence Applications of HMMs and PCFGs in NLP Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme.
Empirical Methods in Information Extraction Claire Cardie Appeared in AI Magazine, 18:4, Summarized by Seong-Bae Park.
1 A study on automatically extracted keywords in text categorization Authors:Anette Hulth and Be´ata B. Megyesi From:ACL 2006 Reporter: 陳永祥 Date:2007/10/16.
ML-based approaches to Named Entity Recognition for German newspaper texts ESSLLI 02 – Workshop on ML Aproaches for CL Marc Rössler University of Duisburg.
Final Review 31 October WP2: Named Entity Recognition and Classification Claire Grover University of Edinburgh.
NERIL: Named Entity Recognition for Indian FIRE 2013.
Survey of Semantic Annotation Platforms
A Survey of NLP Toolkits Jing Jiang Mar 8, /08/20072 Outline WordNet Statistics-based phrases POS taggers Parsers Chunkers (syntax-based phrases)
2007. Software Engineering Laboratory, School of Computer Science S E Towards Answering Opinion Questions: Separating Facts from Opinions and Identifying.
1 Named Entity Recognition based on three different machine learning techniques Zornitsa Kozareva JRC Workshop September 27, 2005.
A search-based Chinese Word Segmentation Method ——WWW 2007 Xin-Jing Wang: IBM China Wen Liu: Huazhong Univ. China Yong Qin: IBM China.
1 Statistical NLP: Lecture 9 Word Sense Disambiguation.
CSA2050: Introduction to Computational Linguistics Part of Speech (POS) Tagging II Transformation Based Tagging Brill (1995)
This work is supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center contract number.
ACBiMA: Advanced Chinese Bi-Character Word Morphological Analyzer 1 Ting-Hao (Kenneth) Huang Yun-Nung (Vivian) Chen Lingpeng Kong
A Machine Learning Approach to Sentence Ordering for Multidocument Summarization and Its Evaluation D. Bollegala, N. Okazaki and M. Ishizuka The University.
A Weakly-Supervised Approach to Argumentative Zoning of Scientific Documents Yufan Guo Anna Korhonen Thierry Poibeau 1 Review By: Pranjal Singh Paper.
A Language Independent Method for Question Classification COLING 2004.
Recognizing Names in Biomedical Texts: a Machine Learning Approach GuoDong Zhou 1,*, Jie Zhang 1,2, Jian Su 1, Dan Shen 1,2 and ChewLim Tan 2 1 Institute.
A Bootstrapping Method for Building Subjectivity Lexicons for Languages with Scarce Resources Author: Carmen Banea, Rada Mihalcea, Janyce Wiebe Source:
Semiautomatic domain model building from text-data Petr Šaloun Petr Klimánek Zdenek Velart Petr Šaloun Petr Klimánek Zdenek Velart SMAP 2011, Vigo, Spain,
CS774. Markov Random Field : Theory and Application Lecture 19 Kyomin Jung KAIST Nov
Kyoungryol Kim Extracting Schedule Information from Korean .
Opinion Holders in Opinion Text from Online Newspapers Youngho Kim, Yuchul Jung and Sung-Hyon Myaeng Reporter: Chia-Ying Lee Advisor: Prof. Hsin-Hsi Chen.
NYU: Description of the Proteus/PET System as Used for MUC-7 ST Roman Yangarber & Ralph Grishman Presented by Jinying Chen 10/04/2002.
CS 6998 NLP for the Web Columbia University 04/22/2010 Analyzing Wikipedia and Gold-Standard Corpora for NER Training William Y. Wang Computer Science.
1 Intelligente Analyse- und Informationssysteme Frank Reichartz, Hannes Korte & Gerhard Paass Fraunhofer IAIS, Sankt Augustin, Germany Dependency Tree.
A Scalable Machine Learning Approach for Semi-Structured Named Entity Recognition Utku Irmak(Yahoo! Labs) Reiner Kraft(Yahoo! Inc.) WWW 2010(Information.
1/21 Automatic Discovery of Intentions in Text and its Application to Question Answering (ACL 2005 Student Research Workshop )
Improving Named Entity Translation Combining Phonetic and Semantic Similarities Fei Huang, Stephan Vogel, Alex Waibel Language Technologies Institute School.
Creating Subjective and Objective Sentence Classifier from Unannotated Texts Janyce Wiebe and Ellen Riloff Department of Computer Science University of.
School of Computer Science 1 Information Extraction with HMM Structures Learned by Stochastic Optimization Dayne Freitag and Andrew McCallum Presented.
UWMS Data Mining Workshop Content Analysis: Automated Summarizing Prof. Marti Hearst SIMS 202, Lecture 16.
Hybrid Method for Tagging Arabic Text Written By: Yamina Tlili-Guiassa University Badji Mokhtar Annaba, Algeria Presented By: Ahmed Bukhamsin.
Learning Subjective Nouns using Extraction Pattern Bootstrapping Ellen Riloff School of Computing University of Utah Janyce Wiebe, Theresa Wilson Computing.
Shallow Parsing for South Asian Languages -Himanshu Agrawal.
Using Wikipedia for Hierarchical Finer Categorization of Named Entities Aasish Pappu Language Technologies Institute Carnegie Mellon University PACLIC.
Virtual Examples for Text Classification with Support Vector Machines Manabu Sassano Proceedings of the 2003 Conference on Emprical Methods in Natural.
BOOTSTRAPPING INFORMATION EXTRACTION FROM SEMI-STRUCTURED WEB PAGES Andrew Carson and Charles Schafer.
Multi-Criteria-based Active Learning for Named Entity Recognition ACL 2004.
CSA2050: Introduction to Computational Linguistics Part of Speech (POS) Tagging II Transformation Based Tagging Brill (1995)
Overview of Statistical NLP IR Group Meeting March 7, 2006.
Concept-Based Analysis of Scientific Literature Chen-Tse Tsai, Gourab Kundu, Dan Roth UIUC.
Pattern Recognition. What is Pattern Recognition? Pattern recognition is a sub-topic of machine learning. PR is the science that concerns the description.
Multi-Class Sentiment Analysis with Clustering and Score Representation Yan Zhu.
A CRF-BASED NAMED ENTITY RECOGNITION SYSTEM FOR TURKISH Information Extraction Project Reyyan Yeniterzi.
Dan Roth University of Illinois, Urbana-Champaign 7 Sequential Models Tutorial on Machine Learning in Natural.
Tasneem Ghnaimat. Language Model An abstract representation of a (natural) language. An approximation to real language Assume we have a set of sentences,
Language Identification and Part-of-Speech Tagging
PRESENTED BY: PEAR A BHUIYAN
张昊.
Using Uneven Margins SVM and Perceptron for IE
Hierarchical, Perceptron-like Learning for OBIE
Extracting Why Text Segment from Web Based on Grammar-gram
Presentation transcript:

Exploiting Named Entity Taggers in a Second Language Thamar Solorio Computer Science Department National Institute of Astrophysics, Optics and Electronics ACL 2005 Student Research Workshop

2/15 Abstract  Named Entity Recognition (X) Complex linguistic resources (X) A hand coded system (X) Any language dependent tools The only information we use is automatically extracted from the documents, without human intervention.  Our approach even outperformed the hand coded system on NER in Spanish, and it achieved high accuracies in Portuguese.

3/15 Introduction  Most NER approaches have very low portability  Many NE extractor systems rely heavily on complex linguistic resources, which are typically hand coded, for example regular expressions, grammars, gazetteers and the like.  Adapting a system to a different collection or language requires a lot of human effort: rewriting the grammars, acquiring new dictionaries, searching trigger words, and so on.  (O) NE extractor system for Spanish + Portuguese corpus  (X) developing linguistic tools such as parsers, POS taggers, grammars and the like.

4/15 Related Work (1/2)  Hidden Markov Models Zhou and Su, 2002  Internal features: gazetteer information  External features: the context of other NEs already recognized Bikel et al., 1997 and Bikel et al., 1999  Maximum entropy Borthwick, 1999: dictionaries and other orthographic information  Carreras et al.,2003 presented results of a NER system for Catalan using Spanish resources using cross-linguistic features

5/15 Related Work (2/2)  Petasis et al., 2000 extending a proper noun dictionary  an inductive decision-tree classifier  unsupervised probabilistic learning of syntactic and semantic context  POS tags and morphological information  Arévalo et al., 2002 External information provided by gazetteers and lists of trigger words A context free grammar, manually coded, is used for recognizing syntactic patterns

6/15 Data sets  The corpus in Spanish is that used in the CoNLL 2002 competitions for the NE extraction task. training: 20,308 NEs testa: 4,634 NEs  to tune the parameters of the classifiers (development set)  performed experiments with testa only testb: 3,948 NEs  to compare the results of the competitors  The corpus in Portuguese is “ HAREM: Evaluation contest on named entity recognition for Portuguese ”. This corpus contains newspaper articles and consists of 8,551 words with 648 NEs.

7/15 Two-step Named Entity Recognition  Named Entity Delimitation (NED) Determining boundaries of named entities  Named Entity Classification (NEC) Classifying the named entities into categories

8/15 Named Entity Delimitation (1/3)  BIO scheme  We used a modified version of C4.5 algorithm (Quinlan, 1993) implemented within the WEKA environment (Witten and Frank, 1999).  For each word we combined two types of features: Internal: word itself, orthographic information(1~6 possible states) and the position in the sentence. External: POS tag and BIO tag A given word w are extracted using a window of five words anchored in the word w, each word described by the internal and external features mentioned previously.  Our classifier learns to discriminate among the three classes and assigns labels to all the words, processing them sequentially.

9/15 Named Entity Delimitation (2/3)

10/15 Named Entity Delimitation (3/3)  The hand coded system used in this work was developed by the TALP research center (Carreras and Padró, 2002). NLP analyzers for Spanish, English and Catalan Include practical tools such as POS taggers, semantic analyzers and NE extractors. This NER system is based on hand-coded grammars, lists of trigger words and gazetteer information.

11/15 NED - Experimental Results  Average of several runs of 10-fold cross-validation

12/15 Named Entity Classification  To build a training set for the NEC learner: the same attributes as for the NED task suffix of each word. maximum size of 5 characters.  Spanish: person, organization, location and miscellaneous.  Portuguese: person, object, quantity, event, organization, artifact, location, date, abstraction and miscellaneous.  We believe that the learner will be capable of achieving good accuracies in the learning task.

13/15 NEC - Experimental Results (1/2)  Similarly to the NED case we trained C4.5 classifiers for the NEC task

14/15 NEC - Experimental Results (2/2) Only 4 instances

15/15 Conclusions  NER systems must be easy to port and robust, given the great variety of documents and languages for which it is desirable to have these tools available.  Our method does not require any language dependent features. The only information used in this approach is automatically extracted from the documents, without human intervention.  Our method has shown to be robust and easy to port to other languages. The only requirement for using our method is a tokenizer for languages.