Mining Wiki Resources for Multilingual Named Entity Recognition Xiej un 2008.07.31.



Outline
- Target
- Strategy
- Five major features of Wikipedia to be exploited
- English language categorization
- Multilingual categorization
- Full system
- Results
- Summary

Target
To use the multilingual characteristics of Wikipedia to annotate a large corpus of text with named entity recognition (NER) tags, with minimal human intervention and no linguistic expertise.

Strategy
- Use the category structure inherent to Wikipedia to determine the named entity type of a proposed entity.
- Use English language data to bootstrap the NER process in other languages.

Five major features of Wikipedia to be exploited (1)
- Article links: links from one article to another in the same language.
- Category links: links from an article to special "Category" pages.
- Interwiki links: links from an article to a presumably equivalent article in another language.
- Redirect pages: short pages that often provide equivalent names for an entity.
- Disambiguation pages: pages with little content that link to multiple similarly named articles.
The first three types are collectively referred to as wikilinks.

Five major features of Wikipedia to be exploited (2)
A typical sentence in database format, with examples of each wikilink type:
- Article links: "Nescopeck Creek is a [[tributary]] of the [[North Branch Susquehanna River]] in [[Luzerne County, Pennsylvania|Luzerne County]]."
- Category links: found near the end of the same article, such as [[Category:Luzerne County, Pennsylvania]] and [[Category:Rivers of Pennsylvania]].
- Interwiki links: for example, in the Turkish language article "Kanuni Sultan Suleyman", one can find a set of links including [[en:Suleiman the Magnificent]] and [[ru:Сулейман I]].
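The three wikilink types above can be told apart by their prefix. The following is a minimal sketch, not the authors' code: the function name `classify_links` and the rule that a two- or three-letter lowercase prefix marks a language code are assumptions made for illustration (real language codes can be longer).

```python
import re

# Matches [[target]] or [[target|display text]].
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")

def classify_links(wikitext):
    """Split wikilinks into article, category, and interwiki links."""
    links = {"article": [], "category": [], "interwiki": []}
    for match in WIKILINK.finditer(wikitext):
        target = match.group(1).strip()
        prefix, sep, rest = target.partition(":")
        prefix = prefix.strip()
        if sep and prefix.lower() == "category":
            links["category"].append(rest.strip())
        elif sep and re.fullmatch(r"[a-z]{2,3}", prefix):
            # Assumption: a short lowercase prefix is a language code.
            links["interwiki"].append((prefix, rest.strip()))
        else:
            links["article"].append(target)
    return links

sample = (
    "Nescopeck Creek is a [[tributary]] of the "
    "[[North Branch Susquehanna River]] in "
    "[[Luzerne County, Pennsylvania|Luzerne County]]. "
    "[[Category: Luzerne County, Pennsylvania ]] "
    "[[en:Suleiman the Magnificent]]"
)
found = classify_links(sample)
```

Here `found["article"]` holds the three article targets, `found["category"]` the category name without its prefix, and `found["interwiki"]` a list of (language, title) pairs.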

English Language Categorization(1) Some Useful Category Phrases (manually derived)

English Language Categorization (2)
Procedure
- For each article, search the category hierarchy until a threshold of reliability is passed or a preset limit on search distance is reached.
- If an article is not classified by this method, check whether it is a disambiguation page (Category:Disambiguation). If it is, check the links within it to see whether there is a dominant type.
- Finally, use Wiktionary to eliminate some common nouns.
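The first step of the procedure above can be sketched as a breadth-first walk up the category graph. This is only an illustration under stated assumptions: the key phrases, the vote-count threshold `min_votes`, and the depth limit `max_depth` are all hypothetical, since the slides do not give the actual phrase table or reliability measure.

```python
from collections import Counter

# Hypothetical key phrases; the manually derived table is not reproduced here.
KEY_PHRASES = {
    "lawyers": "PERSON",
    "writers": "PERSON",
    "people": "PERSON",
    "cities": "GPE",
    "rivers": "LOC",
}

def classify_by_categories(categories, parents, min_votes=2, max_depth=3):
    """Walk up the category graph level by level until one type collects
    min_votes matches (the assumed reliability threshold) or max_depth
    levels have been searched."""
    votes = Counter()
    frontier = list(categories)
    seen = set(frontier)
    for _ in range(max_depth):
        # Count key-phrase matches among the categories at this level.
        for cat in frontier:
            for phrase, label in KEY_PHRASES.items():
                if phrase in cat.lower():
                    votes[label] += 1
        if votes:
            label, count = votes.most_common(1)[0]
            if count >= min_votes:
                return label
        # Move one level up: collect unseen parent categories.
        next_frontier = []
        for cat in frontier:
            for parent in parents.get(cat, []):
                if parent not in seen:
                    seen.add(parent)
                    next_frontier.append(parent)
        if not next_frontier:
            break
        frontier = next_frontier
    return "UNKNOWN"

label = classify_by_categories(
    ["British lawyers", "Jewish American Writers", "Indian Jews"], parents={}
)
```

The `parents` argument maps a category name to its parent categories; an article left as "UNKNOWN" would then go on to the disambiguation-page check.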

English Language Categorization (3)
Example: classifying "Jacqueline Bhabha"
- Extract the article's categories: "British lawyers", "Jewish American Writers", and "Indian Jews".
- Extract the second-order categories: "Lawyers by nationality", "British legal professionals", "American writers by ethnicity", "Indian people by origin", "Indian people by ethnic or national origin", and so on.
- Result: PERSON.

Multilingual Categorization (1)
The goal is to make a decision based on English language information.
- First, whenever possible, find the title of an associated English language article by searching for a wikilink beginning with "en:".
- If such a title is found, categorize the English article and assign the same type to the non-English title.
- If not, attempt to make a decision based on category information, associating the categories with their English equivalents when possible.
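The decision order above can be written as a short fallback chain. This is a sketch only: the function `classify_foreign_article`, its parameters, and the stand-in classifier `toy_category_classifier` are hypothetical names introduced for illustration, and the sample data uses the French article on Erquy, which has no English counterpart.

```python
def classify_foreign_article(interwiki, categories, classify_english_title,
                             category_to_english, classify_english_categories):
    """interwiki: dict of language code -> title; categories: list of
    category names in the article's own language."""
    # Step 1: prefer the type of the linked English article, if any.
    en_title = interwiki.get("en")
    if en_title is not None:
        return classify_english_title(en_title)
    # Step 2: fall back to categories that have English equivalents.
    english = [category_to_english[c] for c in categories
               if c in category_to_english]
    if english:
        return classify_english_categories(english)
    return "UNKNOWN"

# Category equivalents, as would be found via [[en:...]] links on category pages.
fr_to_en = {
    "Catégorie:Commune des Côtes-d'Armor": "Category:Communes of Côtes-d'Armor",
    "Catégorie:Port de plaisance": "Category:Marinas",
    "Catégorie:Station balnéaire": "Category:Seaside resorts",
}

def toy_category_classifier(english_categories):
    # Stand-in for the English category search; not the real classifier.
    if any("Communes" in c for c in english_categories):
        return "GPE"
    return "UNKNOWN"

erquy_type = classify_foreign_article(
    interwiki={},  # no [[en:...]] link on the French article
    categories=[
        "Catégorie:Commune des Côtes-d'Armor",
        "Catégorie:Ville portuaire de France",  # no English equivalent
        "Catégorie:Port de plaisance",
        "Catégorie:Station balnéaire",
    ],
    classify_english_title=lambda title: "UNKNOWN",
    category_to_english=fr_to_en,
    classify_english_categories=toy_category_classifier,
)
```

Passing the classifiers in as arguments keeps the fallback logic separate from however the English side is actually categorized.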

Multilingual Categorization (2)
Example: the Breton town of Erquy has a substantial article in the French language Wikipedia, but no article in English.
- Extract the categories: "Catégorie:Commune des Côtes-d'Armor", "Catégorie:Ville portuaire de France", "Catégorie:Port de plaisance", and "Catégorie:Station balnéaire".
- Associate these categories respectively with "Category:Communes of Côtes-d'Armor", UNKNOWN, "Category:Marinas", and "Category:Seaside resorts" by looking in the French language page of each for wikilinks of the form [[en:...]].
- The first is a subcategory of "Category:Cities, towns and villages in France", so Erquy is classified as GPE.

Full system
The main processing of each article takes place in several stages:
- The first pass uses the explicit article links within the text.
- Then search the associated English language article, if available, for additional information.
- A second pass checks for multi-word phrases that exist as titles of Wikipedia articles.
- Look for certain types of person and organization instances.
- Perform additional processing for alphabetic or space-separated languages, including a third pass looking for single-word Wikipedia titles, to identify more names of people.
- Use regular expressions to locate additional entities such as numeric dates.
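The last stage above, finding numeric dates with regular expressions, might look like the following. The slides do not give the actual expressions, so the two date formats covered here (year-month-day with hyphens, and day/month/year style with `/`, `.` or `-`) and the name `find_numeric_dates` are assumptions.

```python
import re

# Assumed formats: 2007-10-01, and 30/04/1997 style with / . or - separators.
NUMERIC_DATE = re.compile(
    r"\b(?:\d{4}-\d{2}-\d{2}|\d{1,2}[/.-]\d{1,2}[/.-]\d{2,4})\b"
)

def find_numeric_dates(text):
    """Return every substring that looks like a numeric date."""
    return NUMERIC_DATE.findall(text)

dates = find_numeric_dates("Signed on 30/04/1997 and revised 2007-10-01.")
```

A pattern this loose accepts impossible dates such as 99/99/9999, so a real system would probably validate the matched fields afterwards.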

Results
- Spanish: 25,000 words of human-annotated newswire derived from the ACE 2007 test set vs. 335,000 words of data generated by the Wiki process, held out during training (from 290,000 articles as of Oct. 2007).
- French: 25,000 words of human-annotated newswire (Agence France Presse, 30 April and 1 May 1997) covering diverse topics vs. 920,000 words of Wiki-derived data (from 570,000 articles as of Oct. 2007).

Summary
- More suitable for bilingual or multilingual dictionaries.
- More suitable for known entities.