Mining Wiki Resources for Multilingual Named Entity Recognition Alexander E. Richman & Patrick Schone Reporter: Chia-Ying Lee Advisor: Prof. Hsin-Hsi Chen.

Slides:



Advertisements
Similar presentations
Yansong Feng and Mirella Lapata
Advertisements

The Application of Machine Translation in CADAL Huang Chen, Chen Haiying Zhejiang University Libraries, Hangzhou, China
On-line Compilation of Comparable Corpora and Their Evaluation Radu ION, Dan TUFIŞ, Tiberiu BOROŞ, Alexandru CEAUŞU and Dan ŞTEFĂNESCU Research Institute.
Linking Entities in #Microposts ROMIL BANSAL, SANDEEP PANEM, PRIYA RADHAKRISHNAN, MANISH GUPTA, VASUDEVA VARMA INTERNATIONAL INSTITUTE OF INFORMATION TECHNOLOGY,
Large-Scale Entity-Based Online Social Network Profile Linkage.
Application of Subdivisions June 22, 2003 ALA Annual Conference, Toronto.
DSPIN: Detecting Automatically Spun Content on the Web Qing Zhang, David Y. Wang, Geoffrey M. Voelker University of California, San Diego 1.
Query Dependent Pseudo-Relevance Feedback based on Wikipedia SIGIR ‘09 Advisor: Dr. Koh Jia-Ling Speaker: Lin, Yi-Jhen Date: 2010/01/24 1.
1 Entity Ranking Using Wikipedia as a Pivot (CIKM 10’) Rianne Kaptein, Pavel Serdyukov, Arjen de Vries, Jaap Kamps 2010/12/14 Yu-wen,Hsu.
1 Search Engines What is the Internet? The Web is only part of the Internet The Internet is a computer network connecting millions of computers.
Sentiment Lexicon Creation from Lexical Resources BIS 2011 Bas Heerschop Erasmus School of Economics Erasmus University Rotterdam
CSE 730 Information Retrieval of Biomedical Data The use of medical lexicon in biomedical IR.
Text Mining: Finding Nuggets in Mountains of Textual Data Jochen Dörre, Peter Gerstl, and Roland Seiffert Presented By: Jake Happs,
Comments on Guillaume Pitel: “Using bilingual LSA for FrameNet annotation of French text from generic resources” Gerd Fliedner Computational Linguistics.
Text Mining: Finding Nuggets in Mountains of Textual Data Jochen Dijrre, Peter Gerstl, Roland Seiffert Presented by Drew DeHaas.
Towards a semantic extraction of named entities Diana Maynard, Kalina Bontcheva, Hamish Cunningham University of Sheffield, UK.
New Web of Science Rachel Mangan Customer Education
Opinion mining in social networks Student: Aleksandar Ponjavić 3244/2014 Mentor: Profesor dr Veljko Milutinović.
Multilingual Synchronization focusing on Wikipedia
Computational Methods to Vocalize Arabic Texts H. Safadi*, O. Al Dakkak** & N. Ghneim**
Web Usage Mining with Semantic Analysis Date: 2013/12/18 Author: Laura Hollink, Peter Mika, Roi Blanco Source: WWW’13 Advisor: Jia-Ling Koh Speaker: Pei-Hao.
Extracting Key Terms From Noisy and Multi-theme Documents Maria Grineva, Maxim Grinev and Dmitry Lizorkin Institute for System Programming of RAS.
C OLLECTIVE ANNOTATION OF WIKIPEDIA ENTITIES IN WEB TEXT - Presented by Avinash S Bharadwaj ( )
CLEF Ǻrhus Robust – Word Sense Disambiguation exercise UBC: Eneko Agirre, Oier Lopez de Lacalle, Arantxa Otegi, German Rigau UVA & Irion: Piek Vossen.
Exploiting Wikipedia as External Knowledge for Document Clustering Sakyasingha Dasgupta, Pradeep Ghosh Data Mining and Exploration-Presentation School.
1 A study on automatically extracted keywords in text categorization Authors:Anette Hulth and Be´ata B. Megyesi From:ACL 2006 Reporter: 陳永祥 Date:2007/10/16.
An Integrated Approach for Arabic-English Named Entity Translation Hany Hassan IBM Cairo Technology Development Center Jeffrey Sorensen IBM T.J. Watson.
FishBase Summary Page about Salmo salar in the standard Language of FishBase (English) ENBI-WP-11: Multilingual Access to European Biodiversity Sites through.
Automatic Detection of Tags for Political Blogs Khairun-nisa Hassanali and Vasileios Hatzivassiloglou Human Language Technology Research Institute The.
Related terms search based on WordNet / Wiktionary and its application in ontology matching RCDL'2009 St. Petersburg Institute for Informatics and Automation.
CROSSMARC Web Pages Collection: Crawling and Spidering Components Vangelis Karkaletsis Institute of Informatics & Telecommunications NCSR “Demokritos”
Annotating Words using WordNet Semantic Glosses Julian Szymański Department of Computer Systems Architecture, Faculty of Electronics, Telecommunications.
1/(13) Using Corpora and Evaluation Tools Diana Maynard Kalina Bontcheva
This work is supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center contract number.
Péter Schönhofen – Ad Hoc Hungarian → English – CLEF Workshop 20 Sep 2007 Performing Cross-Language Retrieval with Wikipedia Participation report for Ad.
A Language Independent Method for Question Classification COLING 2004.
Learning to Link with Wikipedia David Milne and Ian H. Witten Department of Computer Science, University of Waikato CIKM 2008 (Best Paper Award) Presented.
GUIDE : PROF. PUSHPAK BHATTACHARYYA Bilingual Terminology Mining BY: MUNISH MINIA (07D05016) PRIYANK SHARMA (07D05017)
Opinion Holders in Opinion Text from Online Newspapers Youngho Kim, Yuchul Jung and Sung-Hyon Myaeng Reporter: Chia-Ying Lee Advisor: Prof. Hsin-Hsi Chen.
CS 6998 NLP for the Web Columbia University 04/22/2010 Analyzing Wikipedia and Gold-Standard Corpora for NER Training William Y. Wang Computer Science.
Understanding User’s Query Intent with Wikipedia G 여 승 후.
LOGO 1 Corroborate and Learn Facts from the Web Advisor : Dr. Koh Jia-Ling Speaker : Tu Yi-Lang Date : Shubin Zhao, Jonathan Betz (KDD '07 )
An Iterative Approach to Extract Dictionaries from Wikipedia for Under-resourced Languages G. Rohit Bharadwaj Niket Tandon Vasudeva Varma Search and Information.
Evgeniy Gabrilovich and Shaul Markovitch
CLEF Kerkyra Robust – Word Sense Disambiguation exercise UBC: Eneko Agirre, Arantxa Otegi UNIPD: Giorgio Di Nunzio UH: Thomas Mandl.
WEB 2.0 PATTERNS Carolina Marin. Content  Introduction  The Participation-Collaboration Pattern  The Collaborative Tagging Pattern.
Improving Named Entity Translation Combining Phonetic and Semantic Similarities Fei Huang, Stephan Vogel, Alex Waibel Language Technologies Institute School.
Multilingual Opinion Holder Identification Using Author and Authority Viewpoints Yohei Seki, Noriko Kando,Masaki Aono Toyohashi University of Technology.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology Advisor : Dr. Hsu Presenter : Yu Cheng Chen Author: YU-SHENG.
2015/12/121 Extracting Key Terms From Noisy and Multi-theme Documents Maria Grineva, Maxim Grinev and Dmitry Lizorkin Proceeding of the 18th International.
FlashNormalize: Programming by Examples for Text Normalization International Joint Conference on Artificial Intelligence, Buenos Aires 7/29/2015FlashNormalize1.
Named Entity Disambiguation on an Ontology Enriched by Wikipedia Hien Thanh Nguyen 1, Tru Hoang Cao 2 1 Ton Duc Thang University, Vietnam 2 Ho Chi Minh.
Creating Subjective and Objective Sentence Classifier from Unannotated Texts Janyce Wiebe and Ellen Riloff Department of Computer Science University of.
UWMS Data Mining Workshop Content Analysis: Automated Summarizing Prof. Marti Hearst SIMS 202, Lecture 16.
Multi-level Bootstrapping for Extracting Parallel Sentence from a Quasi-Comparable Corpus Pascale Fung and Percy Cheung Human Language Technology Center,
Mining Wiki Resoures for Multilingual Named Entity Recognition Xiej un
Acquisition of Categorized Named Entities for Web Search Marius Pasca Google Inc. from Conference on Information and Knowledge Management (CIKM) ’04.
Discovering Relations among Named Entities from Large Corpora Takaaki Hasegawa *, Satoshi Sekine 1, Ralph Grishman 1 ACL 2004 * Cyberspace Laboratories.
Using Wikipedia for Hierarchical Finer Categorization of Named Entities Aasish Pappu Language Technologies Institute Carnegie Mellon University PACLIC.
Exploiting Named Entity Taggers in a Second Language Thamar Solorio Computer Science Department National Institute of Astrophysics, Optics and Electronics.
Machine translation the Wiki way Bittlingmayer Adam Mathias 27 February 2007 University of Washington LING 575 – Machine Translation Машинен превод Strojový.
Analysis of Experiments on Hybridization of different approaches in mono and cross-language information retrieval DAEDALUS – Data, Decisions and Language,
Zale Library at Paul Quinn College Information Literacy Module 1: Selecting Good Information Dr. David Hamrick Reference/Cataloging Librarian.
Trends in NL Analysis Jim Critz University of New York in Prague EurOpen.CZ 12 December 2008.
Automatically Extending NE coverage of Arabic WordNet using Wikipedia

Social Knowledge Mining
Clustering Algorithms for Noun Phrase Coreference Resolution
Extracting Semantic Concept Relations
Using Uneven Margins SVM and Perceptron for IE
Presentation transcript:

Mining Wiki Resources for Multilingual Named Entity Recognition Alexander E. Richman & Patrick Schone Reporter: Chia-Ying Lee Advisor: Prof. Hsin-Hsi Chen Department of Defense ACL 2008

2 Introduction Using the multilingual Wikipedia to automatically create an annotated corpus of text in any given language. Languages : French, Ukrainian, Spanish, Polish, Russian, and Portuguese. Do not use of any non-English linguistic resources outside of the Wikimedia domain and any semantic resources such as WordNet or POS tagger. Use an internally modified variant of BBN's IdentiFinder (Bikel et al., 1999), specifically modified to emphasize fast text processing, called “PhoenixIDF.” 2

3 Related Work Toral and Muñoz (2006) used Wikipedia to create lists of named entities. Rely on WordNet, and need a manual supervision step Kazama and Torisawa (2007) used Wikipedia to building entity dictionaries. Rely on POS tagger Cucerzan (2007) used Wikipedia primarily for Named Entity Disambiguation, following the path of Bunescu and Pasca (2006) Using Category, but specific to English

4 Wikipedia Multilingual, collaborative encyclopedia on the Web which is freely available As of October 2007, there were over 2 million articles in English, and 30 languages with at least 50,000 articles and another 40 with at least 10,000 articles. 4

5 Wikipedia - feature Article links, links from one article to another of the same language. Category links, links from an article to special “Category” pages. Interwiki links, links from an article to a presumably equivalent, article in another language. Redirect pages, short pages which often provide equivalent names for an entity Disambiguation pages, a page with little content that links to multiple similarly named articles. Example: 5

6 Training Data Generation 1. Initial Set-up 2. English Language Categorization 3. Multilingual Categorization 4. The Full System 6

7 Initial Set-up ACE Named Entity types: PERSON, GPE (Geo-Political Entities), ORGANIZATION, VEHICLE, WEAPON, LOCATION, FACILITY, DATE, TIME, MONEY, and PERCENT. MUC tags like Place Name Process: 1. Identifies words and phrases that might represent entities. 2. Uses category links and/or interwiki links to associate that phrase with an English language phrase or set of Categories. 3. Determines the appropriate type of the English language data and assumes that the original phrase is of the same type.

8 English Language Categorization(1) Wiki Useful Category => Key Category Phrase => Disambiguation Pages? => Wiktionary Useful Category: “Category:Living People” :PERSON “Category:Cities in Norway”:GPE Useless Category: “Category:1912 Establishments” which includes articles on Fenway Park (a facility), the Republic of China (a GPE), and the Better Business Bureau (an organization).

9 English Language Categorization(2)

10 Multilingual Categorization Not all articles have English equivalent, but many of the most useful categories have English equivalents. French: “Catégorie:Commune des Côtes- d'Armor,” “Catégorie:Ville portuaire de France,” “Catégorie:Port de plaisance,” and “Catégorie:Station balnéaire.” English: “Category: Communes of Côtes- d'Armor,” UNKNOWN, “Category:Marinas,” and “Category:Seaside resorts”

11 The Full System The first pass uses the explicit article links within the text. We then search an associated English language article, if available, for additional information. A second pass checks for multi-word phrases that exist as titles of Wikipedia articles. We look for certain types of person and organization instances. We perform additional processing for alphabetic or space-separated languages, including a third pass looking for single word Wikipedia titles. We use regular expressions to locate additional entities such as numeric dates.

12 Evaluation – All Wiki test set Three human annotated newswire test sets: Spanish, French and Ukrainian. 12 F-score Spanis h Frenc h Ukrainia n PolishPortugues e Russian ALL DATE GPE ORG PERSO N

13 Evaluation – Spanish (1) Spanish is a substantial, well-developed Wikipedia, consisting of more than 290,000 articles at October Newswire: 25,000 words from the ACE 2007 test set, manually modified extended MUC-style standards. Wiki test set: 335,000 words.

14 Evaluation – Spanish (2) Either Wikipedia is relatively poor in Organizations or that PhoenixIDF underperforms when identifying Organizations relative to other categories or a combination. Traditional Training: trained PhoenixIDF on ACE 2007 data converted to MUC-style tag.

15 French is one of the largest Wikipedias, containing more than 570,000 articles at October Newswire: 25,000 words from Agence France Presse Wiki test set: 920,000 words. Similar to Spanish. Evaluation – French 15

16 Evaluation – Ukrainian (1) 16 Ukrainian is a medium-sized Wikipedia with 74,000 articles at October The typical article is shorter and less well-linked to other articles than in the French or Spanish versions. Newswire: approximately 25,000 words from various online news sites covering primarily political topics. Wiki test set: 395,000 words. Traditional Training: trained PhoenixIDF Newswire data

17 Evaluation – Ukrainian (2) 17 The Ukrainian newswire contained a much higher proportion of organizations than the French or Spanish versions. The Ukrainian language Wikipedia contains very few articles on organizations relative to other types

18 Conclusion Wikipedia can create a NER system with performance comparable to one developed human-annotated Newswire, while not requiring any linguistic expertise. This level of performance can likely be obtained currently in languages. Wikipedia-derived system could be used as a supplement to other systems for many more languages. An automatically generated entity dictionary embedded in our system. 18

19 Future Work Automatically generate the list of key words and phrases for useful English language categories. The authors also believe performance could be improved by using higher order non-English categories and better disambiguation. Lists of organizations might be particularly useful, and “List of” pages are common in many languages. 19

20 Thank you! 20