Published by Susan Marsh. Modified over 9 years ago.
1
Multilingual Word Sense Disambiguation using Wikipedia
Bharath Dandala (University of North Texas)
Rada Mihalcea (University of North Texas)
Razvan Bunescu (Ohio University)
IJCNLP, Oct 16, 2013
2
Word Sense Disambiguation
Select the correct sense of a word based on the context:
–The word bar has multiple senses: bar (counter), bar (law), bar (landform), bar (establishment), bar (music).
–Example: Sumner was admitted to the bar at the age of twenty-three, and entered private practice in Boston.
3
Word Sense Disambiguation
Use a repository of senses such as WordNet:
–Static resource, short glosses, too fine-grained.
Unsupervised:
–Similarity between context and sense definition or gloss.
Supervised:
–Train on text manually tagged with word senses.
–Limited amount of manually labeled data.
Use Wikipedia for WSD:
–Large sense repository, continuously growing.
–Large training dataset.
–Support for multilingual WSD.
4
Three WSD Systems
WikiMonoSense:
–Address the sense-tagged data bottleneck by using Wikipedia hyperlinks as a source of sense annotations.
–Use the sense-annotated corpora to train monolingual WSD classifiers.
WikiTransSense:
–The sense-tagged corpus extracted for the reference language is machine translated into a number of supporting languages.
–The word alignments between the reference sentences and the supporting translations are used to generate complementary features in our first approach to multilingual WSD.
5
Three WSD Systems
WikiMuSense:
–The reliance on machine translation (MT) is significantly reduced during training for this second approach to multilingual WSD.
–Sense-tagged corpora in the supporting languages are created through the interlingual links available in Wikipedia.
6
Wikipedia for WSD
Wiki markup: '''Palermo''' is a city in [[Southern Italy]], the [[capital city | capital]] of the [[autonomous area | autonomous region]] of [[Sicily]].
Rendered HTML: Palermo is a city in Southern Italy, the capital of the autonomous region of Sicily.
Candidate senses for the anchor word capital: capital city, capital (economics), financial capital, human capital, capital (architecture).
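The link markup above can be read as a sense annotation: the anchor text is the ambiguous word and the link target is its sense label. The following is a minimal sketch of that idea, assuming a simplified link syntax; the regular expression and the function name `extract_annotations` are illustrative and not taken from the presentation.

```python
import re

# Matches [[target]] or [[target | anchor]]. This is a simplification:
# it ignores nested links, templates, and namespace prefixes.
WIKI_LINK = re.compile(r"\[\[([^\[\]|]+?)(?:\|([^\[\]]+?))?\]\]")

def extract_annotations(wikitext):
    """Return (anchor, sense_title) pairs for every link in the text."""
    pairs = []
    for match in WIKI_LINK.finditer(wikitext):
        target = match.group(1).strip()
        # A link without a pipe uses the target title as its anchor text.
        anchor = (match.group(2) or target).strip()
        pairs.append((anchor, target))
    return pairs

text = (
    "'''Palermo''' is a city in [[Southern Italy]], the "
    "[[capital city | capital]] of the "
    "[[autonomous area | autonomous region]] of [[Sicily]]."
)
annotations = extract_annotations(text)
for anchor, sense in annotations:
    print(f"{anchor!r} -> {sense!r}")
```

Here the occurrence of capital is labeled with the sense capital city, which is the kind of training example the monolingual dataset is built from.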
7
A Monolingual Dataset through Wikipedia Links
1) Collect all WP titles that are linked from the anchor word bar.
=> Bar (law), Bar (music), Bar (establishment), …
2) Create a sense repository from all titles that have sufficient support in WP (ignore named entities, resolve redirects):
=> { Bar (law), Bar (music), Bar (establishment), Bar (counter), Bar (landform) }
Use a subset of ambiguous words from Senseval 2 & 3:
–Avoid words with only one Wikipedia label.
=> English (30), Spanish (25), Italian (25), German (25).
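The two steps above can be sketched as a count-and-filter pass over link targets. This is only an illustration: the slide does not say what counts as "sufficient support", so the `min_support` threshold is a hypothetical parameter, the redirect table and link counts are toy data, and named-entity filtering is left out.

```python
from collections import Counter

def build_sense_repository(link_targets, redirects=None, min_support=2):
    """Keep link targets that occur at least min_support times.

    link_targets: titles linked from one anchor word (one entry per link).
    redirects: optional mapping from redirect title to canonical title.
    min_support: hypothetical threshold; the slide gives no number.
    """
    redirects = redirects or {}
    # Resolve redirects before counting so that support is pooled.
    counts = Counter(redirects.get(title, title) for title in link_targets)
    return {title for title, n in counts.items() if n >= min_support}

# Toy data for the anchor word "bar"; counts and the redirect are invented.
targets = [
    "Bar (law)", "Bar (law)",
    "Bar (music)", "Measure (music)",
    "Bar (establishment)", "Bar (establishment)",
    "Bar (band)",
]
toy_redirects = {"Measure (music)": "Bar (music)"}
repository = build_sense_repository(targets, toy_redirects, min_support=2)
print(sorted(repository))
```

In this toy run the title with a single link falls below the threshold and is dropped, while the redirect contributes to the support of its canonical title.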
8
The WikiMonoSense Learning Framework
For each word, use WP links as examples and train a classifier to distinguish between alternative senses:
–Each WP sense acts as a different label in the classification model.
Each word context is represented as a vector of features:
–Current word and its part-of-speech.
–Local context of three words to the left and to the right.
–Parts-of-speech of the surrounding words.
–Verb and noun before and after the ambiguous word.
–A global context implemented through sense-specific keywords, determined as a list of all words occurring at least three times in the contexts defining a certain word sense.
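The feature list above can be sketched as a function over a POS-tagged sentence. This is a rough illustration under stated assumptions: the input is already tokenized and tagged with Penn Treebank style tags, and the feature names (`w-1`, `noun_before`, and so on) are invented for this example rather than taken from the authors' system.

```python
from collections import Counter

def local_features(tagged, i, window=3):
    """Features for the ambiguous word at index i.

    tagged: list of (word, pos) pairs for one sentence.
    """
    word, pos = tagged[i]
    feats = {"w0": word.lower(), "p0": pos}
    # Words and POS tags in a window of three to the left and right.
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = i + offset
        w, p = tagged[j] if 0 <= j < len(tagged) else ("<pad>", "<pad>")
        feats[f"w{offset:+d}"] = w.lower()
        feats[f"p{offset:+d}"] = p
    # Nearest noun and verb before and after the ambiguous word.
    for prefix, name in (("N", "noun"), ("V", "verb")):
        before = [w for w, p in tagged[:i] if p.startswith(prefix)]
        after = [w for w, p in tagged[i + 1:] if p.startswith(prefix)]
        feats[f"{name}_before"] = before[-1].lower() if before else "<none>"
        feats[f"{name}_after"] = after[0].lower() if after else "<none>"
    return feats

def sense_keywords(contexts, min_count=3):
    """Words occurring at least min_count times in one sense's contexts."""
    counts = Counter(w.lower() for ctx in contexts for w in ctx)
    return {w for w, n in counts.items() if n >= min_count}

sentence = [
    ("Sumner", "NNP"), ("was", "VBD"), ("admitted", "VBN"), ("to", "TO"),
    ("the", "DT"), ("bar", "NN"), ("at", "IN"), ("the", "DT"), ("age", "NN"),
]
feats = local_features(sentence, 5)
keywords = sense_keywords(
    [["court", "law"], ["court", "judge"], ["court", "law"]], min_count=3
)
```

The resulting dictionary can be passed to any classifier that accepts sparse named features; which learner the authors used is not stated on this slide.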
9
A Multilingual Dataset through Machine Translation
Treat each of the 4 languages as a reference language:
–Use Google Translate to translate the data from the reference language into the other 3 supporting languages.
–Translate into French as an additional supporting language.
=> each reference sentence is translated into 4 supporting languages.
En: An airline seat is a chair on an airliner in which passengers are accommodated for the duration of the journey.
De: Ein Flugzeugsitz ist ein Stuhl auf einem Flugzeug, in dem Passagiere für die Dauer der Reise untergebracht sind.
En: For a year after graduation, Stanley served as chair of belles-lettres at Christian College in Hustonville.
De: Seit einem Jahr nach dem Abschluss, diente Stanley als Vorsitzender Belletristik bei Christian College in Hustonville.
10
Benefits of Machine Translation
1) Knowledge of the target word translation can help in disambiguation:
–Two different senses of the target ambiguous word may be translated into different words in the supporting language.
–This assumes access to word alignments.
2) Features extracted from the translated sentence can be used to enrich the feature space:
–For example, the two senses "(unit)" and "(establishment)" of the English word "bar" translate to the same German word "bar".
–In cases like this, words in the context of the German translation may help in identifying the correct English meaning.
11
The WikiTransSense Learning Framework
Extract the same type of features $\Phi$ as in WikiMonoSense.
Append features from the supporting languages to the vector of features from the reference language:
–$\Phi'_{EN} = [\Phi_{EN} \mid \Phi_{SP}; \Phi_{IT}; \Phi_{DE}; \Phi_{FR}]$.
Train a multilingual WSD classifier using the augmented feature vectors.
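The augmented vector can be pictured as a concatenation of per-language feature sets. The sketch below assumes features are stored as name-to-value dictionaries and keeps the languages apart by prefixing each feature name; the function name `augment_features` and the prefixes are hypothetical.

```python
def augment_features(reference_feats, supporting_feats):
    """Concatenate reference features with features from each translation.

    reference_feats: dict of features from the reference sentence.
    supporting_feats: dict mapping language code -> feature dict extracted
        from the translation into that language.
    """
    # Prefix every feature name so languages never collide.
    combined = {f"ref:{name}": value for name, value in reference_feats.items()}
    for lang, feats in supporting_feats.items():
        for name, value in feats.items():
            combined[f"{lang}:{name}"] = value
    return combined

phi_en = {"w0": "chair", "w-1": "a"}
phi_sup = {
    "de": {"w0": "stuhl", "w-1": "ein"},
    "fr": {"w0": "chaise", "w-1": "une"},
}
phi_augmented = augment_features(phi_en, phi_sup)
```

With four supporting languages the augmented vector holds five feature blocks, one for the reference sentence and one per translation.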
12
A Multilingual Dataset through Wikipedia Interlingua Links
Wikipedia articles on the same topic in different languages are often connected through interlingual links.
Use interlingua links to project the sense repository in the reference language onto a sense repository in the supporting language:
–Given that the reference sense repository for the word "bar" in English is: EN = {bar (establishment), bar (landform), bar (law), bar (music)}
–The projected supporting sense repository in German will be: DE = {Bar (Lokal), Sandbank, NIL, Takt (Musik)}
Use the projected repositories in the supporting languages to train additional WSD classifiers for the reference language senses.
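The projection above amounts to a lookup through the interlingual link table, with NIL for senses that have no counterpart. A minimal sketch, using the slide's own English to German example and representing NIL as `None` (the function name is illustrative):

```python
def project_repository(reference_senses, interlingual_links):
    """Map each reference sense to its title in the supporting language.

    Senses without an interlingual link map to None (NIL on the slide).
    """
    return {sense: interlingual_links.get(sense) for sense in reference_senses}

en_senses = [
    "bar (establishment)", "bar (landform)", "bar (law)", "bar (music)",
]
# English -> German links taken from the slide; bar (law) has none.
en_to_de = {
    "bar (establishment)": "Bar (Lokal)",
    "bar (landform)": "Sandbank",
    "bar (music)": "Takt (Musik)",
}
de_repository = project_repository(en_senses, en_to_de)
missing = [sense for sense, title in de_repository.items() if title is None]
print(de_repository)
print("No interlingual link:", missing)
```

Keeping the mapping keyed by the reference sense means that every example collected in the supporting language is still labeled with a reference language sense.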
13
Two Problematic Issues for Interlingua Links
1) There may be reference language senses that do not have interlingua links to the supporting language:
–Randomly sample a number of examples for that sense in the reference language.
–Use Google Translate (GT) to create examples in the supporting language.
2) The distribution of examples per sense in the corpus for the supporting language may be different from the corresponding distribution for the reference language:
–Use the distribution of the reference language as the true distribution, and calculate the number of examples to be considered per sense from the supporting languages using [Agirre & Martinez, 2004].
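The second issue can be illustrated with a simple proportional resampling rule. The slide defers to [Agirre & Martinez, 2004] for the actual calculation, so the rule below is an assumption made for illustration only: pick the largest supporting corpus whose per-sense proportions match the reference distribution without needing more examples than are available for any sense.

```python
import math
from fractions import Fraction

def target_counts(reference_counts, supporting_counts):
    """Number of supporting examples to keep per sense.

    Assumed rule (not necessarily the cited method): keep the reference
    proportions and use the largest total that no sense runs out of.
    """
    total_ref = sum(reference_counts.values())
    # Largest total size allowed by each sense, in exact arithmetic.
    budget = min(
        Fraction(supporting_counts.get(sense, 0) * total_ref, n)
        for sense, n in reference_counts.items()
        if n > 0
    )
    return {
        sense: math.floor(budget * n / total_ref)
        for sense, n in reference_counts.items()
    }

# Toy counts, invented for the example.
ref_counts = {"law": 60, "music": 30, "establishment": 10}
sup_counts = {"law": 120, "music": 90, "establishment": 50}
targets = target_counts(ref_counts, sup_counts)
print(targets)
```

Under this rule a sense with no supporting examples forces every target to zero, which is why such senses would first need the translation-based backfill described in the first issue.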
14
The WikiMuSense Learning Framework
Given an ambiguous word in the reference language, at training time:
–Train a probabilistic classifier $P_R$ for the reference language: use the same WP sense repository developed for WikiMonoSense and WikiTransSense.
–Train a probabilistic classifier $P_S$ for each supporting language: use the reference sense repository projected into the supporting language.
–Use the same types of features as in WikiMonoSense for each classifier.
Five probabilistic classifiers:
–One from the reference language ($P_R$).
–Four from the supporting languages ($P_S$).
15
The WikiMuSense Learning Framework
Given an ambiguous word in the reference language, at test time:
–Use GT to translate the reference sentence into all supporting languages.
–Run the probabilistic classifier $P_R$ on the reference sentence and the classifiers $P_S$ on the supporting sentences.
–Combine the 5 probabilistic outputs into one disambiguation score, where $D_R$ = the set of training examples in reference language R and $D_S$ = the set of training examples in supporting language S.
–WSD = select the sense that maximizes score P.
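The formula for the combined score is not preserved in this transcript, so the sketch below is only one plausible reading and not the authors' definition. It assumes the score adds the reference classifier's probability to a weighted sum of the supporting classifiers' probabilities, with each weight set to min(1, |D_S| / |D_R|); the weighting scheme, the function names, and the toy numbers are all assumptions.

```python
def combine_scores(p_ref, p_sup, n_ref, n_sup):
    """Combine per-sense probabilities from several classifiers.

    p_ref: dict sense -> probability from the reference classifier P_R.
    p_sup: dict language -> (dict sense -> probability from P_S).
    n_ref: number of reference training examples, |D_R|.
    n_sup: dict language -> number of supporting examples, |D_S|.
    The min(1, |D_S| / |D_R|) weight is an assumed scheme.
    """
    scores = dict(p_ref)
    for lang, dist in p_sup.items():
        weight = min(1.0, n_sup[lang] / n_ref)
        for sense, prob in dist.items():
            scores[sense] = scores.get(sense, 0.0) + weight * prob
    return scores

def disambiguate(scores):
    """Select the sense with the highest combined score."""
    return max(scores, key=scores.get)

p_r = {"law": 0.5, "music": 0.5}
p_s = {
    "de": {"law": 0.75, "music": 0.25},
    "fr": {"law": 0.25, "music": 0.75},
}
scores = combine_scores(p_r, p_s, n_ref=100, n_sup={"de": 100, "fr": 50})
best = disambiguate(scores)
print(scores, best)
```

In this toy run the reference classifier is undecided, and the supporting classifier with more training data tips the decision toward law.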
16
WikiMuSense vs. WikiTransSense
WikiMuSense significantly reduces the number of sentence translations required to create the multilingual dataset.
Features extracted from each supporting language are more diverse, as the sentences are natural rather than translated:
–although this may lead to a mismatch between the training and testing distributions.
17
Experimental Evaluation
Used a subset of ambiguous words from Senseval 2 & 3:
–Avoid words with only one Wikipedia label.
=> English (30), Spanish (25), Italian (25), German (25).
18
Experimental Evaluation: Macro & Micro
[slide content not preserved]
19
Experimental Evaluation: Macro Results
WikiMonoSense better than MFS on 76 out of 105 words:
–Average relative error reduction of 44%, 38%, 44%, and 28%.
WikiTransSense better than MFS on 83 out of 105 words:
–Average relative error reduction over WikiMonoSense of 13.7%.
=> utility of using features from translated contexts.
WikiMuSense better than MFS on 89 out of 105 words:
–Average relative error reduction over WikiMonoSense of 16.5%.
=> multilingual WP data can successfully replace the MT component during training.
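The quantities reported above can be computed as follows. The sketch assumes the usual definitions: relative error reduction compares error rates (1 minus accuracy), macro accuracy averages per-word accuracies, and micro accuracy pools all test instances. The slides do not spell these definitions out, and the numbers below are toy values, not results from the presentation.

```python
def relative_error_reduction(baseline_acc, system_acc):
    """Fraction of the baseline's error removed by the system."""
    baseline_err = 1.0 - baseline_acc
    system_err = 1.0 - system_acc
    return (baseline_err - system_err) / baseline_err

def macro_accuracy(per_word):
    """Average of per-word accuracies; every word counts equally."""
    return sum(c / t for c, t in per_word.values()) / len(per_word)

def micro_accuracy(per_word):
    """Accuracy over all pooled instances; frequent words count more."""
    correct = sum(c for c, _ in per_word.values())
    total = sum(t for _, t in per_word.values())
    return correct / total

# Toy results: word -> (correct predictions, test instances).
results = {"bar": (90, 100), "chair": (30, 50)}
macro = macro_accuracy(results)
micro = micro_accuracy(results)
rer = relative_error_reduction(baseline_acc=0.5, system_acc=0.75)
print(f"macro={macro:.2f} micro={micro:.2f} rer={rer:.0%}")
```

For example, raising accuracy from 50% to 75% halves the error rate, which is a relative error reduction of 50%.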
20
Varying the Number of Supporting Languages
21
Varying the Amount of Supporting Language Data
Dip likely due to suboptimal combination of classifiers in the disambiguation score.
[Future Work]: train weights for each supporting language.
22
Varying the Amount of Supporting Language Data
Peak likely due to suboptimal combination of classifiers in the disambiguation score.
[Future Work]: train weights for each supporting language.
Number of supporting examples = number of reference examples.
23
Future Work
1) Train weights for each supporting language when combining classifier outputs in WikiMuSense.
2) Reduce the number of translations in WikiMuSense by choosing, from the 280 languages in WP, those supporting languages with the largest number of examples per sense.
3) Exploit directly the distributions used inside an MT system: eliminate MT altogether from WikiMuSense.
24
Conclusion
WikiMonoSense:
–Use Wikipedia hyperlinks to train monolingual WSD classifiers.
WikiTransSense:
–The sense-tagged corpus extracted for the reference language is machine translated into a number of supporting languages.
–Use aligned sentences to generate additional features in a first approach to multilingual WSD.
WikiMuSense:
–Use the Wikipedia interlingual links to reduce reliance on MT.
–Train and combine multiple probabilistic classifiers in a second approach to multilingual WSD.
25
Questions?