
This research is supported by NIH grant U54-GM114838, a grant from the Allen Institute for Artificial Intelligence (allenai.org), and Contract HR0011-15-2-0025 with the US Defense Advanced Research Projects Agency (DARPA). Approved for Public Release, Distribution Unlimited. The views expressed are those of the authors and do not reflect the official policy or position of the Department of Defense or the U.S. Government.

Cross-lingual Wikification is the task of grounding mentions written in non-English documents to entries in the English Wikipedia. It involves comparing textual clues across languages, which requires a notion of similarity between text snippets written in different languages. In this paper, we address this problem by jointly training multilingual embeddings for words and Wikipedia titles. The proposed method can be applied to all languages represented in Wikipedia, including those for which no machine translation technology is available. We create a challenging dataset in 12 languages and show that our proposed approach outperforms various baselines. Moreover, our model compares favorably with the best systems on the TAC KBP 2015 Entity Linking task, including those that relied on the availability of translation from the target language to English.

Task: given mentions in a non-English document, find the corresponding titles in the English Wikipedia. For example, every mention in the Turkish sentence "Tayvan, ABD ve İngiltere'de hukuk okuması, Tsai'ye bir LL.B. kazandırdı" ("Studying law in Taiwan, the US, and the UK earned Tsai an LL.B.") must be grounded in the English Wikipedia. The main challenge is comparing non-English words to English Wikipedia titles, e.g., Amerika_Birleşik_Devletleri to United_States, Teksas to Texas, Türkiye to Turkey, and İstanbul to Istanbul.

Our approach has two steps.

1. Learning monolingual word and title embeddings [Wang et al., 2014]. We replace every hyperlinked string in Wikipedia text with the title it links to, so that "It is led by and mainly composed of Sunni Arabs from Iraq…" becomes "It is led by and mainly composed of en.wikipedia.org/wiki/Sunni_Islam Arabs from en.wikipedia.org/wiki/Iraq…". We then train skip-gram with negative sampling [Mikolov et al., 2013] on the transformed text. Since each title appears as a token, the model yields an embedding for every word and every title (see the first sketch below).

2. Aligning the embeddings of the two languages with a model based on canonical correlation analysis (CCA) [Hotelling, 1936; Faruqui and Dyer, 2014]. Instead of a dictionary that maps words between the two languages, we use the title mapping obtained from the inter-language links in Wikipedia:

    P_en, P_tr = CCA(E'_en, E'_tr)
    M_en = E_en P_en,  M_tr = E_tr P_tr

where E denotes the monolingual embeddings, E' the embeddings of the title pairs connected by inter-language links, and M the resulting multilingual embeddings. Unlike other multilingual embedding models, this method can be applied to all languages in Wikipedia (see the second sketch below).
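A minimal sketch of Step 1, assuming gensim's Word2Vec as the skip-gram implementation; the poster specifies only skip-gram with negative sampling, so the library choice, the hyperparameter values, and the toy corpus are our illustrative assumptions.

```python
# Step 1 (sketch): train word + title embeddings on Wikipedia text in which
# every hyperlinked string has already been replaced by the title it links to.
# gensim and all hyperparameters are assumptions, not the authors' setup.
from gensim.models import Word2Vec

# Toy "transformed" corpus: titles occur as ordinary tokens.
corpus = [
    ["it", "is", "led", "by", "and", "mainly", "composed", "of",
     "en.wikipedia.org/wiki/Sunni_Islam", "arabs", "from",
     "en.wikipedia.org/wiki/Iraq"],
]

model = Word2Vec(
    sentences=corpus,
    vector_size=100,  # embedding dimensionality (hypothetical choice)
    sg=1,             # skip-gram
    negative=5,       # negative sampling
    window=5,
    min_count=1,
)

# Because titles are tokens, we obtain one embedding per word AND per title.
title_vec = model.wv["en.wikipedia.org/wiki/Iraq"]
word_vec = model.wv["arabs"]
```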
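Step 2 can be sketched with an off-the-shelf CCA implementation. Here scikit-learn's CCA stands in for the CCA-based method of Faruqui and Dyer (2014); the random matrices and the random index pairs are placeholders for real embeddings and real inter-language links.

```python
# Step 2 (sketch): align two monolingual embedding spaces with CCA, using
# Wikipedia inter-language links as the "dictionary" of aligned title pairs.
# scikit-learn and the toy data are illustrative assumptions.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
E_en = rng.normal(size=(5000, 50))  # all English word/title embeddings
E_tr = rng.normal(size=(4000, 50))  # all Turkish word/title embeddings

# E'_en / E'_tr: embeddings of title pairs connected by inter-language
# links, e.g. (Istanbul, İstanbul). Random indices stand in for real links.
idx_en = rng.integers(0, 5000, size=500)
idx_tr = rng.integers(0, 4000, size=500)
Ep_en, Ep_tr = E_en[idx_en], E_tr[idx_tr]

cca = CCA(n_components=40, scale=False, max_iter=1000)
cca.fit(Ep_en, Ep_tr)
P_en, P_tr = cca.x_rotations_, cca.y_rotations_  # projection matrices

# Project ALL embeddings (words and titles) into the shared space,
# centering with the means of the aligned pairs as the fitted model does.
M_en = (E_en - Ep_en.mean(axis=0)) @ P_en
M_tr = (E_tr - Ep_tr.mean(axis=0)) @ P_tr
```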
Features. For a (foreign mention, English title) pair, we use the multilingual embeddings to compute various features. A foreign mention is represented by the embeddings of (i) the other mentions in the same document, (ii) its context words, and (iii) the disambiguated English titles before the mention; a candidate title is represented by its English title embedding.

Ranking. The features are Pr(c|m) and Pr(m|c), plus the cosine similarities between the candidate title embedding and each of the mention representations above. Candidates are ranked with a linear Ranking SVM (see the first sketch after the results tables).

Dataset. One third of the test mentions are hard: they cannot be resolved by simply choosing the most common title for the mention.

Candidate generation. We focus on the English titles in the intersection of the English and the foreign-language Wikipedia title spaces, and build two dictionaries:
1. hyperlinked foreign string → all possible English titles;
2. foreign word → all possible English titles.
We query the first dictionary with the full mention string; if that fails, we query the second dictionary with each word in the mention (see the second sketch after the results tables).

Results (%):

Language  Method      Hard   Easy   Total
Spanish   EsWikifier  40.11  99.28  79.56
Spanish   MonoEmb     38.46  96.12  76.90
Spanish   WordAlign   48.75  95.78  80.10
Spanish   WikiME      54.46  94.83  81.37
German    WordAlign   52.39  95.32  81.01
German    WikiME      53.28  95.53  81.45
French    WordAlign   41.70  96.08  77.96
French    WikiME      47.51  95.72  79.65
Italian   WikiME      48.28  95.52  79.79
Chinese   WikiME      57.61  98.03  84.55
Hebrew    WikiME      56.67  97.71  84.03
Thai      WikiME      70.02  99.17  89.46
Arabic    WikiME      62.05  98.17  86.13
Turkish   WikiME      60.18  97.55  85.10
Tamil     WikiME      54.13  99.13  84.15
Tagalog   WikiME      56.70  98.46  84.54
Urdu      WikiME      74.51  99.35  91.07

TAC KBP 2015 Entity Linking (%):

Approach                  Spanish  Chinese
Translation + EnWikifier  79.35    N/A
EsWikifier                79.04    N/A
WikiME                    82.43    85.07
+Typing:
Top TAC'15 System         80.4     83.1
WikiME                    80.93    83.63
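A minimal sketch of the ranking features described above; the function names, the dictionary of mention representations, and the weight-vector usage are our illustrative assumptions, and learning w with pairwise constraints (SVM-rank style) is only indicated in a comment.

```python
# Ranking (sketch): score each (mention, candidate title) pair with
# Pr(c|m), Pr(m|c), and cosine similarities between the candidate's title
# embedding and the mention representations. Names are illustrative.
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def features(title_vec, mention_reprs, p_c_given_m, p_m_given_c):
    """mention_reprs: multilingual embeddings representing the mention,
    e.g. {"other_mentions": v1, "context_words": v2, "prev_titles": v3}."""
    feats = [p_c_given_m, p_m_given_c]
    feats += [cosine(title_vec, v) for v in mention_reprs.values()]
    return np.array(feats)

# A linear Ranking SVM scores each candidate as w @ f; w would be learned
# from pairwise constraints ranking the gold title above the others.
def best_candidate(candidates, w):
    return max(candidates, key=lambda c: float(w @ c["features"]))
```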
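And a sketch of the two-dictionary candidate generation; the dictionary entries are toy examples, and lowercasing as string normalization is our assumption.

```python
# Candidate generation (sketch): two dictionaries restricted to English
# titles that also exist in the foreign-language Wikipedia. The entries
# below are toy examples; lowercasing is an assumed normalization.
surface_to_titles = {            # hyperlinked foreign string -> titles
    "abd": {"United_States"},
    "tayvan": {"Taiwan"},
}
word_to_titles = {               # foreign word -> titles
    "hukuk": {"Law"},
    "abd": {"United_States"},
}

def candidates(mention: str) -> set:
    m = mention.lower()
    if m in surface_to_titles:   # 1) query with the full mention string
        return set(surface_to_titles[m])
    cands = set()                # 2) fall back to word-by-word lookup
    for word in m.split():
        cands |= word_to_titles.get(word, set())
    return cands

print(candidates("ABD"))            # {'United_States'}
print(candidates("hukuk okuması"))  # {'Law'}
```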

