Cross-Lingual Named Entity Recognition via Wikification
Chen-Tse Tsai, Stephen Mayhew, and Dan Roth
University of Illinois at Urbana-Champaign
Multilingual NER

We want NER for many languages, tagging Person, Organization, Location, and Miscellaneous entities. The challenge: there is no training data for most languages. Our approach is to directly transfer a model trained on another language, using language-independent features.

For example, NER models trained on English data ("U.N. official Ekeus heads for Baghdad") or German data ("... Verstehen Albrecht Lehmann läßt Flüchtlinge und ..."), connected to other languages through a cross-lingual wikifier, can be applied to:
- Turkish: "Tayvan, ABD ve İngiltere'de hukuk okuması, Tsai'ye bir LL.B. kazandırdı" ("Studying law in Taiwan, the USA, and England earned Tsai an LL.B.")
- Bengali: "এটি মূলত তুরস্কে কথিত হয়, তবে সাইপ্রাস, গ্রিস, ও পূর্ব ইউরোপের বহু দেশে তুর্কীভাষী সম্প্রদায় আছে" ("It is mainly spoken in Turkey, but there are Turkish-speaking communities in Cyprus, Greece, and many countries of Eastern Europe.")
- Yoruba: "Indonésíà jẹ́ orílẹ̀-èdè olómìnira, pẹ̀lú aṣòfin àti ààrẹ adìbòyàn. Olúìlú rẹ̀ ni Jakarta." ("Indonesia is a republic, with a legislature and an elected president. Its capital is Jakarta.")
Cross-Lingual Wikification [Tsai and Roth, NAACL’16]
Given mentions in a non-English document, find the corresponding titles in the English Wikipedia — for example, grounding "Tayvan" in the Turkish sentence "Tayvan, ABD ve İngiltere'de hukuk okuması, Tsai'ye bir LL.B. kazandırdı" to the English title Taiwan.
- Only requires a multilingual Wikipedia dump.
- Mentions are grounded to titles in the intersection of the target-language Wikipedia and the English Wikipedia, so a smaller target-language Wikipedia means poorer coverage.
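A minimal sketch of this grounding step, assuming title candidates and inter-language links have been pre-extracted from a multilingual Wikipedia dump. The dictionaries anchor_index and langlinks are hypothetical toy stand-ins for those indexes, not the authors' actual implementation:

```python
# Toy indexes: (language, anchor text) -> target-language titles, and
# (language, title) -> English title via inter-language links.
anchor_index = {("tr", "tayvan"): ["Tayvan"]}
langlinks = {("tr", "Tayvan"): "Taiwan"}

def wikify(mention, lang):
    """Return candidate English Wikipedia titles for a non-English mention."""
    # Candidate generation: look the surface string up in the
    # anchor-text index of the target-language Wikipedia.
    candidates = anchor_index.get((lang, mention.lower()), [])
    # Grounding: map each candidate to English via inter-language links.
    # Titles with no English counterpart are dropped, which is why a
    # smaller target-language Wikipedia means poorer coverage.
    return [langlinks[(lang, t)] for t in candidates if (lang, t) in langlinks]

print(wikify("Tayvan", "tr"))  # ['Taiwan']
```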
Outline
- The Key Idea
- Cross-Lingual NER Model
- Wikifier-based Features
- Evaluation and Analysis
Key Idea

A cross-lingual wikifier generates good language-independent features for NER by grounding n-grams:
- Words in any language are grounded to English Wikipedia titles.
- Features extracted from those titles can be used across languages.
- This inverts the traditional pipeline (NER, then wikification): here, wikified n-grams provide features for the NER model.

Example (German): in "nachvollziehenden Verstehen Albrecht Lehmann läßt Flüchtlinge und Vertriebene in Westdeutschland" ("... comprehending understanding. Albrecht Lehmann lets refugees and expellees in West Germany ..."), the words ground to English titles — "Verstehen" to Understanding, "Albrecht" to Albert,_Duke_of_Prussia, "Lehmann" to Jens_Lehmann, "Flüchtlinge" to Refugee, "Westdeutschland" to Western_Germany — whose Freebase types (media_common, quotation_subject, field_of_study, literature_subject, person, noble_person, athlete, location, country) give the NER model evidence for Person and Location tags.
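A toy illustration (not the authors' code) of why grounded titles give language-independent features: surface words from different languages that ground to the same English title receive identical type features. The type table below is a hypothetical fragment in the spirit of the example above:

```python
freebase_types = {
    "Jens_Lehmann": ["person", "athlete"],
    "Albert,_Duke_of_Prussia": ["person", "noble_person"],
    "Western_Germany": ["location", "country"],
}

def title_features(title):
    # Features depend only on the English title, not on the language
    # of the surface word that was grounded to it.
    return ["FB_TYPE=" + t for t in freebase_types.get(title, [])]

# German "Westdeutschland" and English "Western Germany" both ground to
# Western_Germany, so both yield ['FB_TYPE=location', 'FB_TYPE=country'].
print(title_features("Western_Germany"))
```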
Cross-Lingual NER Model
We use Illinois NER [Ratinov and Roth, 2009] as the base model. The wikifier features facilitate direct transfer!

Feature group   Description
Base features   Word forms and affixes; word type (contains capitals? digits?); previous tag pattern
Tag context     Previous tags
Gazetteers      Wikipedia titles in multiple languages
Wikifier        Freebase types and Wikipedia categories
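A minimal sketch of the "Base features" row above (word forms, affixes, word type), assuming a simple per-token extractor; the tag-context and gazetteer groups are omitted. This is an illustration, not Illinois NER's actual feature code:

```python
def base_features(words, i):
    w = words[i]
    return {
        "form": w.lower(),                           # word form
        "prefix3": w[:3], "suffix3": w[-3:],         # affixes
        "has_capital": any(c.isupper() for c in w),  # word type
        "has_digit": any(c.isdigit() for c in w),
    }

print(base_features(["U.N.", "official", "Ekeus"], 2))
```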
Wikifier Features

We ground every n-gram (n < 5) to the English Wikipedia. The cross-lingual wikifier is modified in two ways:
- In the candidate generation step, we query title candidates only with the whole mention string; otherwise the bigram "in Germany" would be linked to the title Germany.
- In the ranking step, we exclude the ranking features that use "other mentions", since we do not know what the other mentions are.

NER features for each word (a sketch follows):
- The Freebase types and Wikipedia categories of the top two titles of the current, previous, and next word.
- The Freebase types of the n-grams covering the word.
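A sketch of these per-word features, assuming hypothetical interfaces to the modified wikifier: top_titles(word) returns ranked English titles for a single word, types and categories map a title to its Freebase types and Wikipedia categories, and grounded_ngrams maps (start, end) spans to the English title of the grounded n-gram:

```python
def wikifier_features(words, i, top_titles, types, categories, grounded_ngrams):
    feats = []
    # Freebase types and Wikipedia categories of the top two titles
    # of the previous, current, and next word.
    for j in (i - 1, i, i + 1):
        if 0 <= j < len(words):
            for title in top_titles(words[j])[:2]:
                feats += ["TYPE=" + t for t in types(title)]
                feats += ["CAT=" + c for c in categories(title)]
    # Freebase types of the titles of all n-grams (n < 5) covering word i.
    for (start, end), title in grounded_ngrams.items():
        if start <= i < end:
            feats += ["NGRAM_TYPE=" + t for t in types(title)]
    return feats
```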
Data Sets

The model can be used on all 293 languages in Wikipedia. We evaluate on nine:
- English, Spanish, German, Dutch: CoNLL 2002/2003 shared tasks.
- Turkish, Tagalog, Bengali, Tamil, Yoruba: LORELEI and REFLEX packages.

Number of name mentions:

            English  Spanish  German  Dutch
# Training  23.5K    18.8K    11.9K   13.3K
# Test      5.6K     3.6K     3.7K    3.9K

            Turkish  Tagalog  Bengali  Tamil  Yoruba
# Training  5.1K     4.6K     8.8K     7.0K   4.1K
# Test      2.2K     3.4K     3.5K     –      –
Results

We run monolingual experiments (train and test on the same language) and direct transfer experiments (train on English, test on the target language).

[Results table: per-language F1 for English, Dutch, German, Spanish, Turkish, Tagalog, Yoruba, Bengali, and Tamil, alongside Wikipedia size (from 5.1M articles for English down to 31K for the smallest low-resource language). Only the direct-transfer baseline rows survived extraction: Täckström'12 scores 58.4, 40.4, and 59.3 on the CoNLL transfer languages (one entry unavailable), and Zhang'16 scores 43.6, 51.3, 36.0, 34.8, and 26.0 on the low-resource, including non-Latin-script, languages.]
Multiple Training Languages
The previous transfer model was trained on English only. Here we add other training languages (EN: English, ES: Spanish, NL: Dutch, DE: German; ALL: EN, ES, NL, DE, Turkish, Tagalog, Yoruba). Adding training data from other languages helps. A sketch of the setup follows the table.

Training languages    Turkish  Tagalog  Yoruba  Average
EN                    47.12    65.44    36.65   49.74
EN+ES                 44.85    66.61    37.57   49.68
EN+NL                 48.34    66.09    36.87   50.43
EN+DE                 49.47    64.10    35.14   49.57
EN+ES+NL+DE           49.00    66.37    38.02   51.13
ALL (w/o test lang)   49.83    67.12    37.56   51.50
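Because all features are language-independent, multi-source training is just a matter of concatenating annotated data from several languages. In this sketch, load_conll and train_ner are hypothetical stand-ins for the data reader and the NER trainer:

```python
def train_multisource(source_langs, test_lang, load_conll, train_ner):
    data = []
    for lang in source_langs:
        if lang != test_lang:  # the "ALL (w/o test lang)" setting
            data += load_conll(lang)
    return train_ner(data)

# e.g. train_multisource(["en", "es", "nl", "de", "tr", "tl", "yo"], "tr", ...)
```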
Domain Adaptation

Transfer is good, but not as good as monolingual results. Can we improve the monolingual results using transfer?
- Target: training on the target language only.
- Target + Source: adding English training data.

Target + Source does slightly better, which is consistent with the analysis in Chang et al. (EMNLP'10). A sketch of the settings follows the table.

Approach               Spanish  Dutch  Turkish  Tagalog  Average
Target                 83.87    84.49  73.86    77.64    79.96
Target + Source        84.17    84.81  74.52    77.80    80.33
FrustEasy [Daumé, 07]  83.89    84.08  73.73    77.04    79.69
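A sketch of the compared settings. "Target + Source" simply pools the two training sets; FrustEasy is Daumé's (2007) feature augmentation, where every feature is copied into a shared version and a domain-specific version so the learner can separate general from domain-specific regularities. The feature dicts and trainer here are hypothetical:

```python
def augment(feats, domain):
    # Each feature appears twice: once in the shared namespace,
    # once in the domain-specific namespace.
    out = {"shared:" + k: v for k, v in feats.items()}
    out.update({domain + ":" + k: v for k, v in feats.items()})
    return out

# Target + Source:  train_ner(target_feats + english_feats)
# FrustEasy:        train_ner([augment(f, "target") for f in target_feats] +
#                             [augment(f, "source") for f in english_feats])
```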
Conclusion

We propose a language-independent NER model based on a cross-lingual wikifier. We study a wide range of languages in both monolingual and cross-lingual settings, and show significant improvements over strong baselines. We are building an end-to-end cross-lingual wikification demo for most languages in Wikipedia, and will release the system.

Thank you!