Published by Debra Adams. Modified over 9 years ago.
1
Mining Wiki Resources for Multilingual Named Entity Recognition
Xiejun, 2008.07.31
2
Outline
Target
Strategy
Major Wikipedia features exploited
English language categorization
Multilingual categorization
Full system
Results
Summary
3
Target
To use the multilingual characteristics of Wikipedia to annotate a large corpus of text with Named Entity Recognition (NER) tags, with minimal human intervention and no linguistic expertise.
4
Strategy
Use the category structure inherent to Wikipedia to determine the named entity type of a proposed entity.
Use English language data to bootstrap the NER process in other languages.
5
Five major Wikipedia features exploited (1)
Article links: links from one article to another in the same language.
Category links: links from an article to special “Category” pages.
Interwiki links: links from an article to a presumably equivalent article in another language.
Redirect pages: short pages that often provide equivalent names for an entity.
Disambiguation pages: pages with little content of their own that link to multiple similarly named articles.
The first three types are collectively referred to as wikilinks.
6
Five major Wikipedia features exploited (2)
A typical sentence in database format
Article links: “Nescopeck Creek is a [[tributary]] of the [[North Branch Susquehanna River]] in [[Luzerne County, Pennsylvania|Luzerne County]].”
Category links: found near the end of the same article, such as [[Category:Luzerne County, Pennsylvania]] and [[Category:Rivers of Pennsylvania]].
Interwiki links: for example, the Turkish-language article “Kanuni Sultan Suleyman” contains a set of links including [[en:Suleiman the Magnificent]] and [[ru:Сулейман I]].
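The three kinds of wikilink can be told apart by the prefix of the link target. The following is a minimal sketch, assuming simplified wikitext; the names `WIKILINK`, `LANG_CODES`, and `classify_links` are hypothetical and not taken from the slides, and the language-code set is only an illustrative subset.

```python
import re

# Matches [[target]] or [[target|display text]]; a simplification of real wikitext.
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")

# Illustrative subset of interwiki language codes (assumption, not exhaustive).
LANG_CODES = {"en", "ru", "fr", "es", "tr"}


def classify_links(wikitext):
    """Split the wikilinks in a text into article, category, and interwiki links."""
    links = {"article": [], "category": [], "interwiki": []}
    for match in WIKILINK.finditer(wikitext):
        target = match.group(1).strip()
        prefix, sep, rest = target.partition(":")
        prefix = prefix.strip().lower()
        if sep and prefix == "category":
            links["category"].append(rest.strip())
        elif sep and prefix in LANG_CODES:
            links["interwiki"].append((prefix, rest.strip()))
        else:
            # For [[target|display]], the target (not the display text) is kept.
            links["article"].append(target)
    return links


# Fragments from the slide, concatenated purely for illustration.
sample = (
    "Nescopeck Creek is a [[tributary]] of the [[North Branch Susquehanna River]] "
    "in [[Luzerne County, Pennsylvania|Luzerne County]]. "
    "[[Category: Luzerne County, Pennsylvania ]] [[en:Suleiman the Magnificent]]"
)
links = classify_links(sample)
```

Redirect and disambiguation pages are not links and would need to be detected at the page level, which this sketch does not attempt.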
7
English Language Categorization (1)
Some useful category phrases (manually derived)
8
English Language Categorization (2)
Procedure
For each article, search the category hierarchy until a threshold of reliability is passed or a preset limit on search distance is reached.
If an article is not classified by this method, check whether it is a disambiguation page (Category:Disambiguation). If it is, check the links within it to see whether there is a dominant type.
Finally, use Wiktionary to eliminate some common nouns.
9
English Language Categorization (3)
Example: classifying “Jacqueline Bhabha”
Extract the article's categories: “British lawyers”, “Jewish American writers”, and “Indian Jews”.
Extract the second-order categories: “Lawyers by nationality”, “British legal professionals”, “American writers by ethnicity”, “Indian people by origin”, “Indian people by ethnic or national origin”, and so on.
Result: PERSON
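The category search can be sketched as a breadth-first walk up the hierarchy that stops once one type dominates the votes or a depth limit is hit. This is only an illustration under stated assumptions: the key phrases, the vote-share rule, the parameter names (`threshold`, `min_votes`, `max_depth`), and the child-to-parent mapping below are all hypothetical, since the slides give neither the phrase table nor the actual reliability measure. The disambiguation and Wiktionary steps are left out.

```python
from collections import Counter

# Hypothetical key phrases; the manually derived table from the slides is not reproduced.
KEY_PHRASES = {
    "PERSON": ["people by", "people from", "living people", "births", "deaths"],
    "GPE": ["cities", "towns", "countries"],
    "ORGANIZATION": ["companies", "organizations"],
}


def phrase_type(category):
    """Return the entity type whose key phrase occurs in the category name, if any."""
    lowered = category.lower()
    for ne_type, phrases in KEY_PHRASES.items():
        if any(phrase in lowered for phrase in phrases):
            return ne_type
    return None


def classify(categories, parents, threshold=0.7, min_votes=2, max_depth=3):
    """Walk up the category hierarchy level by level, voting on the entity type."""
    votes = Counter()
    frontier = list(categories)
    seen = set(frontier)
    for _ in range(max_depth):
        for category in frontier:
            ne_type = phrase_type(category)
            if ne_type is not None:
                votes[ne_type] += 1
        total = sum(votes.values())
        if total >= min_votes:
            best, count = votes.most_common(1)[0]
            if count / total >= threshold:  # assumed stand-in for "reliability"
                return best
        # Move one level up: collect parent categories not yet visited.
        next_frontier = []
        for category in frontier:
            for parent in parents.get(category, []):
                if parent not in seen:
                    seen.add(parent)
                    next_frontier.append(parent)
        frontier = next_frontier
        if not frontier:
            break
    return "UNKNOWN"


# Category names come from the example; which parent belongs to which child is assumed.
parents = {
    "British lawyers": ["Lawyers by nationality", "British legal professionals"],
    "Jewish American writers": ["American writers by ethnicity"],
    "Indian Jews": ["Indian people by origin", "Indian people by ethnic or national origin"],
}
bhabha = ["British lawyers", "Jewish American writers", "Indian Jews"]
result = classify(bhabha, parents)
```

With these assumed phrases, none of the first-order categories votes, and the two "Indian people by ..." second-order categories decide the type, which mirrors the way the example only reaches PERSON after looking one level up.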
10
Multilingual Categorization (1)
The decision is based on English language information.
First, whenever possible, find the title of an associated English language article by searching for a wikilink beginning with “en:”.
If such a title is found, categorize the English article and assign the same type to the non-English title.
If not, attempt to make a decision based on category information, associating the categories with their English equivalents where possible.
11
Multilingual Categorization (2)
Example
The Breton town of Erquy has a substantial article in the French language Wikipedia, but no article in English.
Extract its categories: “Catégorie:Commune des Côtes-d'Armor”, “Catégorie:Ville portuaire de France”, “Catégorie:Port de plaisance”, and “Catégorie:Station balnéaire”.
Associate these categories respectively with “Category:Communes of Côtes-d'Armor”, UNKNOWN, “Category:Marinas”, and “Category:Seaside resorts” by looking in the French language page of each for wikilinks of the form [[en:...]].
The first is a subcategory of “Category:Cities, towns and villages in France”, so the type is GPE.
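The two-step decision (English article first, English-mapped categories as a fallback) might look like the sketch below. Everything here is an assumption made for illustration: `pages` maps a title to its wikitext, `article_categories` maps a title to its category page titles, and `english_types` stands in for the English categorizer as a plain lookup table. Majority voting over the mapped categories is also assumed, as the slides do not say how the category evidence is combined.

```python
import re
from collections import Counter

# Matches an interwiki link to English, e.g. [[en:Suleiman the Magnificent]].
EN_LINK = re.compile(r"\[\[en:([^\[\]|]+)\]\]")


def english_title(wikitext):
    """Return the target of the first [[en:...]] link, or None if there is none."""
    match = EN_LINK.search(wikitext)
    return match.group(1).strip() if match else None


def classify_foreign(title, pages, article_categories, english_types):
    """Type a non-English title from English information, as far as it is available."""
    en_title = english_title(pages.get(title, ""))
    if en_title is not None:
        # An English counterpart exists: reuse its type.
        return english_types.get(en_title, "UNKNOWN")
    # Fallback: map each category to its English equivalent and vote.
    votes = Counter()
    for category in article_categories.get(title, []):
        en_category = english_title(pages.get(category, ""))
        ne_type = english_types.get(en_category)
        if ne_type is not None:
            votes[ne_type] += 1
    return votes.most_common(1)[0][0] if votes else "UNKNOWN"


# Toy data modelled on the Erquy example; page texts are abbreviated placeholders.
pages = {
    "Erquy": "Erquy est une commune ...",
    "Catégorie:Commune des Côtes-d'Armor": "[[en:Category:Communes of Côtes-d'Armor]]",
    "Catégorie:Ville portuaire de France": "",
    "Catégorie:Port de plaisance": "[[en:Category:Marinas]]",
    "Catégorie:Station balnéaire": "[[en:Category:Seaside resorts]]",
    "Kanuni Sultan Suleyman": "... [[en:Suleiman the Magnificent]] ...",
}
article_categories = {
    "Erquy": [
        "Catégorie:Commune des Côtes-d'Armor",
        "Catégorie:Ville portuaire de France",
        "Catégorie:Port de plaisance",
        "Catégorie:Station balnéaire",
    ],
}
# Assumed output of the English categorizer; only entries needed for the demo.
english_types = {
    "Category:Communes of Côtes-d'Armor": "GPE",
    "Suleiman the Magnificent": "PERSON",
}
erquy_type = classify_foreign("Erquy", pages, article_categories, english_types)
suleyman_type = classify_foreign("Kanuni Sultan Suleyman", pages, article_categories, english_types)
```

In this toy data the Marinas and Seaside resorts categories carry no type, so the single typed category decides the result for Erquy.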
12
Full system
The main processing of each article takes place in several stages:
The first pass uses the explicit article links within the text.
Then search an associated English language article, if available, for additional information.
A second pass checks for multi-word phrases that exist as titles of Wikipedia articles.
Look for certain types of person and organization instances.
Perform additional processing for alphabetic or space-separated languages, including a third pass looking for single-word Wikipedia titles, to identify more names of people.
Use regular expressions to locate additional entities such as numeric dates.
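The last stage can be illustrated with a small regular expression for numeric dates. The pattern below is an assumption, since the slides do not give the expressions actually used; it accepts year-first and year-last forms with `-`, `/`, or `.` as separators and does not validate that the month and day are in range.

```python
import re

# Year-first (2008.07.31) or year-last (1/5/1997) numeric dates; illustrative only.
NUMERIC_DATE = re.compile(
    r"\b(?:\d{4}[-/.]\d{1,2}[-/.]\d{1,2}|\d{1,2}[-/.]\d{1,2}[-/.]\d{4})\b"
)


def find_numeric_dates(text):
    """Return every substring of the text that looks like a numeric date."""
    return [match.group() for match in NUMERIC_DATE.finditer(text)]


dates = find_numeric_dates("Presented on 2008.07.31 and revised 1/5/1997.")
```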
13
Results
Spanish: 25,000 words of human-annotated newswire derived from the ACE 2007 test set vs. 335,000 words of data generated by the Wiki process and held out during training (from 290,000 articles as of Oct. 2007).
French: 25,000 words of human-annotated newswire (Agence France Presse, 30 April and 1 May 1997) covering diverse topics vs. 920,000 words of Wiki-derived data (from 570,000 articles as of Oct. 2007).
14
Summary
More suitable for bilingual or multilingual dictionaries.
More suitable for known entities.