Mining Wiki Resources for Multilingual Named Entity Recognition Alexander E. Richman & Patrick Schone Reporter: Chia-Ying Lee Advisor: Prof. Hsin-Hsi Chen.


1 Mining Wiki Resources for Multilingual Named Entity Recognition
Alexander E. Richman & Patrick Schone, Department of Defense. ACL 2008.
Reporter: Chia-Ying Lee. Advisor: Prof. Hsin-Hsi Chen.

2 Introduction
The goal is to use the multilingual Wikipedia to automatically create an annotated corpus of text in any given language.
Languages: French, Ukrainian, Spanish, Polish, Russian, and Portuguese.
The approach does not use any non-English linguistic resources outside the Wikimedia domain, nor any semantic resources or tools such as WordNet or a POS tagger.
The tagger is an internally modified variant of BBN's IdentiFinder (Bikel et al., 1999), specifically modified to emphasize fast text processing, called “PhoenixIDF.”

3 Related Work
Toral and Muñoz (2006) used Wikipedia to create lists of named entities. Their method relies on WordNet and needs a manual supervision step.
Kazama and Torisawa (2007) used Wikipedia to build entity dictionaries. Their method relies on a POS tagger.
Cucerzan (2007) used Wikipedia primarily for named entity disambiguation, following the path of Bunescu and Pasca (2006). This work uses categories but is specific to English.

4 Wikipedia
A multilingual, collaborative encyclopedia on the Web that is freely available.
As of October 2007, there were over 2 million articles in English, 30 languages with at least 50,000 articles, and another 40 with at least 10,000 articles.

5 Wikipedia - features
Article links: links from one article to another in the same language.
Category links: links from an article to special “Category” pages.
Interwiki links: links from an article to a presumably equivalent article in another language.
Redirect pages: short pages that often provide equivalent names for an entity.
Disambiguation pages: pages with little content that link to multiple similarly named articles.
Example: http://en.wikipedia.org/wiki/FBI
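
The link types listed above can be pulled out of raw article markup. The following is a minimal sketch, assuming 2007-era MediaWiki markup in which category and interwiki links appear inline as `[[...]]`. The function name `extract_links` and the rule that a two- or three-letter lowercase prefix marks an interwiki link are illustrative assumptions, not the authors' code.

```python
import re

# [[target]] or [[target|display text]]; only the target is captured.
LINK = re.compile(r"\[\[([^\[\]|]+)(?:\|[^\[\]]*)?\]\]")
# Assumption: a short lowercase prefix such as "fr:" is a language code.
LANG_PREFIX = re.compile(r"^([a-z]{2,3}):(.+)$")

def extract_links(wikitext):
    """Split the [[...]] links in wikitext into article, category, and interwiki links."""
    links = {"article": [], "category": [], "interwiki": []}
    for match in LINK.finditer(wikitext):
        target = match.group(1).strip()
        lang = LANG_PREFIX.match(target)
        if target.startswith("Category:"):
            links["category"].append(target[len("Category:"):])
        elif lang:
            links["interwiki"].append((lang.group(1), lang.group(2)))
        else:
            links["article"].append(target)
    return links
```

Redirect and disambiguation pages are not handled here; they would need a check on the page text itself rather than on its links.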

6 Training Data Generation
1. Initial Set-up
2. English Language Categorization
3. Multilingual Categorization
4. The Full System

7 Initial Set-up
ACE Named Entity types: PERSON, GPE (Geo-Political Entities), ORGANIZATION, VEHICLE, WEAPON, LOCATION, FACILITY, DATE, TIME, MONEY, and PERCENT.
MUC tags, such as Place Name.
Process:
1. Identifies words and phrases that might represent entities.
2. Uses category links and/or interwiki links to associate each phrase with an English language phrase or set of categories.
3. Determines the appropriate type of the English language data and assumes that the original phrase is of the same type.

8 English Language Categorization (1)
Flow: Wiki useful category => key category phrase => disambiguation pages? => Wiktionary.
Useful categories: “Category:Living People” maps to PERSON; “Category:Cities in Norway” maps to GPE.
Useless category: “Category:1912 Establishments,” which includes articles on Fenway Park (a facility), the Republic of China (a GPE), and the Better Business Bureau (an organization).
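
The key-category-phrase step can be illustrated with a small lookup. This is a minimal sketch, assuming a hand-written table of phrases and a simple majority vote. The table `KEY_PHRASES`, the function `classify_categories`, and the voting rule are hypothetical; the slides do not give the authors' actual phrase list or decision rule.

```python
from collections import Counter

# Hypothetical key phrases; the authors' real list is not shown in the slides.
KEY_PHRASES = {
    "living people": "PERSON",
    "cities in": "GPE",
    "communes of": "GPE",
}

def classify_categories(categories):
    """Return the entity type most often suggested by the categories, or None."""
    votes = Counter()
    for category in categories:
        name = category.lower()
        if name.startswith("category:"):
            name = name[len("category:"):]
        for phrase, entity_type in KEY_PHRASES.items():
            if phrase in name:
                votes[entity_type] += 1
    if not votes:
        return None  # no useful category, e.g. "Category:1912 Establishments"
    return votes.most_common(1)[0][0]
```

A category such as “Category:1912 Establishments” matches no phrase, so the function returns None and a later fallback would have to decide.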

9 English Language Categorization (2)

10 Multilingual Categorization
Not all articles have an English equivalent, but many of the most useful categories do.
French: “Catégorie:Commune des Côtes-d'Armor,” “Catégorie:Ville portuaire de France,” “Catégorie:Port de plaisance,” and “Catégorie:Station balnéaire.”
English equivalents, in the same order: “Category:Communes of Côtes-d'Armor,” UNKNOWN, “Category:Marinas,” and “Category:Seaside resorts.”
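
The French example above amounts to a table lookup with a placeholder for missing equivalents. A minimal sketch, assuming the interwiki links have already been collected into a dictionary; the name `INTERWIKI` and the function `to_english_categories` are hypothetical.

```python
# Hypothetical interwiki table built from the slide's example:
# non-English category -> English category.
INTERWIKI = {
    "Catégorie:Commune des Côtes-d'Armor": "Category:Communes of Côtes-d'Armor",
    "Catégorie:Port de plaisance": "Category:Marinas",
    "Catégorie:Station balnéaire": "Category:Seaside resorts",
}

def to_english_categories(categories, interwiki=INTERWIKI):
    """Map each category to its English equivalent, or "UNKNOWN" if none exists."""
    return [interwiki.get(category, "UNKNOWN") for category in categories]
```

The resulting English categories could then be typed with the same logic used for English articles, with UNKNOWN entries simply contributing nothing.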

11 The Full System
The first pass uses the explicit article links within the text.
We then search an associated English language article, if available, for additional information.
A second pass checks for multi-word phrases that exist as titles of Wikipedia articles.
We look for certain types of person and organization instances.
We perform additional processing for alphabetic or space-separated languages, including a third pass looking for single-word Wikipedia titles.
We use regular expressions to locate additional entities such as numeric dates.
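
As an illustration of the last step, the sketch below finds numeric dates with a regular expression. The two formats covered (day/month/year with `/`, `.`, or `-` separators, and year-month-day) are assumptions chosen for the example; the slides do not say which patterns the authors used.

```python
import re

# Assumed formats: 12/10/2007, 12.10.2007, 12-10-2007, or 2008-01-15.
NUMERIC_DATE = re.compile(
    r"\b\d{1,2}[./-]\d{1,2}[./-]\d{4}\b"  # day and month first, four-digit year last
    r"|\b\d{4}-\d{2}-\d{2}\b"             # year first
)

def find_numeric_dates(text):
    """Return every substring of text that looks like a numeric date."""
    return [match.group(0) for match in NUMERIC_DATE.finditer(text)]
```

The pattern checks only the shape of the string, so an impossible date such as 99/99/2007 would also match.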

12 Evaluation – All Wiki test set
Three human-annotated newswire test sets: Spanish, French, and Ukrainian.

F-score   Spanish  French  Ukrainian  Polish  Portuguese  Russian
ALL       .846     .844    .807       .859    .804        .802
DATE      .925     .910    .848       .891    .861        .822
GPE       .877     .868    .887       .916    .826        .867
ORG       .701     .718    .657       .785    .706        .712
PERSON    .821     .823    .690       .836    .802        .751
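
The slides report F-scores without defining them. A minimal sketch, assuming the standard balanced F-measure (the harmonic mean of precision and recall); the exact scorer the authors used is not stated in the slides, and the function name `f_score` is hypothetical.

```python
def f_score(precision, recall):
    """Balanced F-measure: 2PR / (P + R), defined as 0.0 when both inputs are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Because it is a harmonic mean, the score is pulled toward the lower of the two inputs, so a system cannot reach a high F-score on precision or recall alone.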

13 Evaluation – Spanish (1)
Spanish is a substantial, well-developed Wikipedia, with more than 290,000 articles as of October 2007.
Newswire: 25,000 words from the ACE 2007 test set, manually modified to extended MUC-style standards.
Wiki test set: 335,000 words.

14 Evaluation – Spanish (2)
Either Wikipedia is relatively poor in organizations, or PhoenixIDF underperforms on organizations relative to other categories, or both.
Traditional training: PhoenixIDF trained on ACE 2007 data converted to MUC-style tags.

15 Evaluation – French
French is one of the largest Wikipedias, with more than 570,000 articles as of October 2007.
Newswire: 25,000 words from Agence France Presse.
Wiki test set: 920,000 words.
Results are similar to Spanish.

16 Evaluation – Ukrainian (1)
Ukrainian is a medium-sized Wikipedia, with 74,000 articles as of October 2007.
The typical article is shorter and less well-linked to other articles than in the French or Spanish versions.
Newswire: approximately 25,000 words from various online news sites covering primarily political topics.
Wiki test set: 395,000 words.
Traditional training: PhoenixIDF trained on newswire data.

17 Evaluation – Ukrainian (2)
The Ukrainian newswire contained a much higher proportion of organizations than the French or Spanish versions.
The Ukrainian language Wikipedia contains very few articles on organizations relative to other types.

18 Conclusion
Wikipedia can be used to create an NER system with performance comparable to one developed from human-annotated newswire, without requiring any linguistic expertise.
This level of performance can likely be obtained currently in 20-40 languages.
A Wikipedia-derived system could be used as a supplement to other systems for many more languages.
An automatically generated entity dictionary is embedded in the system.

19 Future Work
Automatically generate the list of key words and phrases for useful English language categories.
The authors also believe performance could be improved by using higher-order non-English categories and better disambiguation.
Lists of organizations might be particularly useful, and “List of” pages are common in many languages.

20 Thank you!

