Multilingual Word Sense Disambiguation using Wikipedia Bharath Dandala (University of North Texas) Rada Mihalcea (University of North Texas) Razvan Bunescu.

Presentation transcript:

Multilingual Word Sense Disambiguation using Wikipedia
Bharath Dandala (University of North Texas), Rada Mihalcea (University of North Texas), Razvan Bunescu (Ohio University)
IJCNLP, Oct 16, 2013

Word Sense Disambiguation
Select the correct sense of a word based on the context:
–The word "bar" has multiple senses: bar (counter), bar (law), bar (landform), bar (establishment), bar (music).
–Example: "Sumner was admitted to the bar at the age of twenty-three, and entered private practice in Boston." Here the intended sense is bar (law).

Word Sense Disambiguation
Use a repository of senses such as WordNet:
–Static resource, short glosses, too fine grained.
Unsupervised:
–Similarity between context and sense definition or gloss.
Supervised:
–Train on text manually tagged with word senses.
–Limited amount of manually labeled data.
Use Wikipedia for WSD:
–Large sense repository, continuously growing.
–Large training dataset.
–Support for multilingual WSD.

Three WSD Systems
WikiMonoSense:
–Address the sense-tagged data bottleneck by using Wikipedia hyperlinks as a source of sense annotations.
–Use the sense-annotated corpora to train monolingual WSD classifiers.
WikiTransSense:
–The sense-tagged corpus extracted for the reference language is machine translated into a number of supporting languages.
–The word alignments between the reference sentences and the supporting translations are used to generate complementary features in our first approach to multilingual WSD.

Three WSD Systems (continued)
WikiMuSense:
–In this second approach to multilingual WSD, the reliance on machine translation (MT) during training is significantly reduced.
–Sense-tagged corpora in the supporting languages are created through the interlingual links available in Wikipedia.

Wikipedia for WSD
Wiki markup: '''Palermo''' is a city in [[Southern Italy]], the [[capital city | capital]] of the [[autonomous area | autonomous region]] of [[Sicily]].
Rendered HTML: Palermo is a city in Southern Italy, the capital of the autonomous region of Sicily.
Candidate senses for the anchor word "capital": capital city, capital (economics), financial capital, human capital, capital (architecture).
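The link markup on this slide is what makes Wikipedia usable as sense-tagged data: the visible anchor text is the ambiguous word, and the link target is its sense label. A minimal sketch of that extraction follows. The regular expression and the function name `extract_links` are illustrative assumptions, not the authors' tooling, and the pattern only handles simple `[[target]]` and `[[target | anchor]]` links.

```python
import re

# Matches simple internal links: [[target]] or [[target | anchor]].
# Nested links, templates, and namespaces are deliberately not handled.
LINK_RE = re.compile(r"\[\[([^\[\]|]+?)(?:\|([^\[\]]+?))?\]\]")


def extract_links(wikitext):
    """Return (anchor, target) pairs for every simple internal link."""
    pairs = []
    for match in LINK_RE.finditer(wikitext):
        target = match.group(1).strip()
        # Without a pipe, the visible anchor text is the target title itself.
        anchor = (match.group(2) or target).strip()
        pairs.append((anchor, target))
    return pairs


text = (
    "'''Palermo''' is a city in [[Southern Italy]], the "
    "[[capital city | capital]] of the "
    "[[autonomous area | autonomous region]] of [[Sicily]]."
)
links = extract_links(text)
# The occurrence of "capital" is labeled with the sense "capital city".
```

Each pair can then be stored together with its sentence, giving one sense-annotated training example per link.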

A Monolingual Dataset through Wikipedia Links
1) Collect all WP titles that are linked from the anchor word "bar".
=> Bar (law), Bar (music), Bar (establishment), …
2) Create a sense repository from all titles that have sufficient support in WP (ignore named entities, resolve redirects):
=> { Bar (law), Bar (music), Bar (establishment), Bar (counter), Bar (landform) }
Use a subset of ambiguous words from Senseval 2 & 3:
–Avoid words with only one Wikipedia label.
=> English (30), Spanish (25), Italian (25), German (25).
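The two steps above can be sketched as a counting pass over (anchor, target) link pairs. This is only an illustration: the slide does not say what "sufficient support" means, so the `min_support` threshold is an assumption, the `redirects` mapping is a hypothetical stand-in for redirect resolution, the link data is made up, and named-entity filtering is left out.

```python
from collections import Counter


def build_sense_repository(pairs, word, redirects=None, min_support=2):
    """Collect link targets for `word` that occur at least `min_support` times.

    pairs:       iterable of (anchor, target) tuples taken from wiki links
    redirects:   hypothetical mapping from redirect titles to canonical titles
    min_support: assumed threshold for "sufficient support"
    """
    redirects = redirects or {}
    counts = Counter()
    for anchor, target in pairs:
        if anchor.lower() != word.lower():
            continue
        # Resolve redirects so that all links to one article share one label.
        counts[redirects.get(target, target)] += 1
    return {title for title, n in counts.items() if n >= min_support}


# Toy data, invented for illustration only.
toy_pairs = [
    ("bar", "Bar (law)"),
    ("Bar", "Bar (law)"),
    ("bar", "Bar (music)"),
    ("bar", "Measure (music)"),  # treated as a redirect in this toy example
    ("bar", "Bar (heraldry)"),   # single link: below the support threshold
    ("capital", "capital city"),
]
toy_redirects = {"Measure (music)": "Bar (music)"}

repository = build_sense_repository(toy_pairs, "bar", toy_redirects)
# A word is kept for the experiments only if it has more than one label.
is_ambiguous = len(repository) > 1
```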

The WikiMonoSense Learning Framework
For each word, use WP links as examples and train a classifier to distinguish between alternative senses:
–Each WP sense acts as a different label in the classification model.
Each word context is represented as a vector of features:
–Current word and its part-of-speech.
–Local context of three words to the left and to the right.
–Parts-of-speech of the surrounding words.
–Verb and noun before and after the ambiguous word.
–A global context implemented through sense-specific keywords, determined as a list of all words occurring at least three times in the contexts defining a certain word sense.
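Two of the feature types listed above can be sketched without any external tools: the local context window of three words on each side, and the sense-specific keywords that occur at least three times in the contexts of a sense. The part-of-speech and verb/noun features need a tagger and are omitted. The function names, the `<pad>` token, and the toy contexts are assumptions made for this illustration.

```python
from collections import Counter, defaultdict


def local_context_features(tokens, index, window=3):
    """Features for the current word and `window` words on each side."""
    feats = {"word": tokens[index].lower()}
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        pos = index + offset
        inside = 0 <= pos < len(tokens)
        # Positions outside the sentence get a padding token.
        feats[f"w{offset:+d}"] = tokens[pos].lower() if inside else "<pad>"
    return feats


def sense_keywords(examples, min_count=3):
    """Map each sense to the words seen at least `min_count` times in its contexts.

    examples: iterable of (tokens, sense) pairs
    """
    counts = defaultdict(Counter)
    for tokens, sense in examples:
        counts[sense].update(t.lower() for t in tokens)
    return {
        sense: {w for w, n in counter.items() if n >= min_count}
        for sense, counter in counts.items()
    }


tokens = "Sumner was admitted to the bar at the age".split()
feats = local_context_features(tokens, tokens.index("bar"))

# Toy contexts, invented for illustration only.
toy_examples = [
    ("the court admitted him to the bar".split(), "Bar (law)"),
    ("she left the court and the bar".split(), "Bar (law)"),
    ("a court asked the bar to respond".split(), "Bar (law)"),
    ("the song has one bar of rest".split(), "Bar (music)"),
]
keywords = sense_keywords(toy_examples)
```

In practice function words such as "the" would also pass the frequency threshold, so a stop-word filter is a natural addition; the slide does not say whether one was used.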

A Multilingual Dataset through Machine Translation
Treat each of the 4 languages as a reference language:
–Use Google Translate to translate the data from the reference language into the other 3 supporting languages.
–Translate into French as an additional supporting language.
=> each reference sentence is translated into 4 supporting languages.
Example translations (English to German) for two senses of "chair":
En: An airline seat is a chair on an airliner in which passengers are accommodated for the duration of the journey.
De: Ein Flugzeugsitz ist ein Stuhl auf einem Flugzeug, in dem Passagiere für die Dauer der Reise untergebracht sind.
En: For a year after graduation, Stanley served as chair of belles-lettres at Christian College in Hustonville.
De: Seit einem Jahr nach dem Abschluss, diente Stanley als Vorsitzender Belletristik bei Christian College in Hustonville.

Benefits of Machine Translation
1) Knowledge of the target word translation can help in disambiguation:
–Two different senses of the target ambiguous word may be translated into different words in the supporting language.
–Assuming access to word alignments.
2) Features extracted from the translated sentence can be used to enrich the feature space:
–For example, the two senses "(unit)" and "(establishment)" of the English word "bar" translate to the same German word "Bar".
–In cases like this, words in the context of the German translation may help in identifying the correct English meaning.

The WikiTransSense Learning Framework
Extract the same type of features $\Phi$ as in WikiMonoSense.
Append features from supporting languages to the vector of features from the reference language:
–$\Phi'_{EN} = [\Phi_{EN} \mid \Phi_{SP}; \Phi_{IT}; \Phi_{DE}; \Phi_{FR}]$.
Train a multilingual WSD classifier using the augmented feature vectors.
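The feature augmentation can be pictured as merging one feature dictionary per language into a single dictionary, with language prefixes keeping the feature spaces apart. This is a sketch under assumed names (`augment_features`, the `ref:` prefix, the toy feature values); the slides do not describe the actual feature encoding.

```python
def augment_features(reference_feats, supporting_feats):
    """Concatenate reference features with features from each supporting language.

    reference_feats:  {feature_name: value} for the reference sentence
    supporting_feats: {language_code: {feature_name: value}} for its translations
    """
    combined = {f"ref:{name}": value for name, value in reference_feats.items()}
    for lang, feats in supporting_feats.items():
        for name, value in feats.items():
            # The language prefix prevents collisions between feature spaces.
            combined[f"{lang}:{name}"] = value
    return combined


# Toy feature values, invented for illustration only.
phi_en = {"word": "chair", "w-1": "a"}
phi_support = {
    "de": {"word": "stuhl", "w-1": "ein"},
    "fr": {"word": "chaise", "w-1": "une"},
}
phi_augmented = augment_features(phi_en, phi_support)
```

The aligned translation of the target word (here "stuhl" as opposed to "vorsitzender") is exactly the kind of complementary feature that can separate the two senses of "chair".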

A Multilingual Dataset through Wikipedia Interlingual Links
Wikipedia articles on the same topic in different languages are often connected through interlingual links.
Use interlingual links to project the sense repository in the reference language to a sense repository in the supporting language:
–Given that the reference sense repository for the word "bar" in English is: EN = {bar (establishment), bar (landform), bar (law), bar (music)}
–The projected supporting sense repository in German will be: DE = {Bar (Lokal), Sandbank, NIL, Takt (Musik)}
Use the projected repositories in supporting languages to train additional WSD classifiers for reference language senses.
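The projection step amounts to a dictionary lookup per sense, with a missing link yielding NIL. The sketch below reuses the English and German titles from the slide; the function name and the representation of interlingual links as a plain dictionary are assumptions, and `None` stands in for NIL.

```python
def project_repository(reference_senses, interlingual_links):
    """Map each reference sense to its title in the supporting language.

    Senses without an interlingual link map to None (NIL on the slide).
    """
    return {sense: interlingual_links.get(sense) for sense in reference_senses}


en_senses = [
    "bar (establishment)",
    "bar (landform)",
    "bar (law)",
    "bar (music)",
]
# English-to-German links as given on the slide; "bar (law)" has none.
en_to_de = {
    "bar (establishment)": "Bar (Lokal)",
    "bar (landform)": "Sandbank",
    "bar (music)": "Takt (Musik)",
}
de_repository = project_repository(en_senses, en_to_de)
missing = [sense for sense, title in de_repository.items() if title is None]
```

Note that the projected labels need not share a surface word: links to "Sandbank" and "Takt (Musik)" both count as training examples for senses of the English word "bar".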

Two Problematic Issues for Interlingual Links
1) There may be reference language senses that do not have interlingual links to the supporting language:
–Randomly sample a number of examples for that sense in the reference language.
–Use Google Translate (GT) to create examples in the supporting language.
2) The distribution of examples per sense in the corpus for the supporting language may differ from the corresponding distribution for the reference language:
–Use the distribution of the reference language as the true distribution, and calculate the number of examples to be considered per sense from the supporting languages using [Agirre & Martinez, 2004].
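To make the second issue concrete, the sketch below picks how many supporting-language examples to keep per sense so that their proportions match the reference distribution. It is a simplified stand-in, not the procedure of [Agirre & Martinez, 2004], whose details are not given on the slide: it assumes the goal is the largest subsample that preserves the reference proportions exactly. All names and counts are made up for illustration.

```python
from fractions import Fraction


def examples_per_sense(reference_counts, supporting_counts):
    """Largest per-sense sample sizes that follow the reference distribution.

    reference_counts:  {sense: number of examples in the reference language}
    supporting_counts: {sense: number of examples available in the supporting language}
    """
    total_ref = sum(reference_counts.values())
    # Exact fractions avoid floating-point rounding when truncating below.
    priors = {
        sense: Fraction(n, total_ref)
        for sense, n in reference_counts.items()
        if n > 0
    }
    # Largest total N with N * prior(sense) <= available(sense) for all senses.
    budget = min(
        Fraction(supporting_counts.get(sense, 0)) / prior
        for sense, prior in priors.items()
    )
    return {sense: int(budget * prior) for sense, prior in priors.items()}


# Toy counts, invented for illustration only.
ref_counts = {"Bar (law)": 60, "Bar (music)": 40}
sup_counts = {"Bar (law)": 300, "Bar (music)": 100}
quota = examples_per_sense(ref_counts, sup_counts)
```

A sense with no supporting examples drives every quota to zero in this simplified version, which is why the first issue (creating examples through translation) has to be handled before rebalancing.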

The WikiMuSense Learning Framework
Given an ambiguous word in the reference language, at training time:
–Train a probabilistic classifier $P_R$ for the reference language: use the same WP sense repository developed for WikiMonoSense and WikiTransSense.
–Train a probabilistic classifier $P_S$ for each supporting language: use the reference sense repository projected in the supporting language.
–Use the same types of features as in WikiMonoSense, for each classifier.
=> Five probabilistic classifiers:
–One from the reference language ($P_R$).
–Four from the supporting languages ($P_S$).

The WikiMuSense Learning Framework
Given an ambiguous word in the reference language, at test time:
–Use GT to translate the reference sentence into all supporting languages.
–Run the probabilistic classifier $P_R$ on the reference sentence and the classifiers $P_S$ on the supporting sentences.
–Combine the 5 probabilistic outputs into one disambiguation score $P$, where:
$D_R$ = the set of training examples in reference language R.
$D_S$ = the set of training examples in supporting language S.
–WSD = select the sense that maximizes score $P$.
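The combination formula itself did not survive in this transcript, so the following is only one plausible reading, stated as an assumption: each classifier's probability is weighted by the size of its training set ($|D_R|$ or $|D_S|$), and the weighted sum is normalized. It should not be taken as the authors' exact score. Function names and all numbers are hypothetical.

```python
def combine_scores(ref_probs, ref_size, supporting):
    """Training-size-weighted average of classifier outputs (assumed scheme).

    ref_probs:  {sense: probability} from the reference classifier
    ref_size:   number of reference training examples
    supporting: list of ({sense: probability}, training_size) pairs,
                one per supporting language
    """
    total = ref_size + sum(size for _, size in supporting)
    combined = {}
    for sense, prob in ref_probs.items():
        score = ref_size * prob
        for probs, size in supporting:
            # A sense missing from a supporting classifier contributes zero.
            score += size * probs.get(sense, 0.0)
        combined[sense] = score / total
    return combined


def disambiguate(ref_probs, ref_size, supporting):
    """Select the sense that maximizes the combined score."""
    combined = combine_scores(ref_probs, ref_size, supporting)
    return max(combined, key=combined.get)


# Toy probabilities, invented for illustration only.
p_ref = {"Bar (law)": 0.4, "Bar (music)": 0.6}
p_sup = [({"Bar (law)": 0.9, "Bar (music)": 0.1}, 100)]
scores = combine_scores(p_ref, 100, p_sup)
best = disambiguate(p_ref, 100, p_sup)
```

In this toy case the reference classifier alone would pick the music sense, while the combined score favors the law sense because the supporting classifier is confident and equally well trained.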

WikiMuSense vs. WikiTransSense
WikiMuSense significantly reduces the number of sentence translations required to create the multilingual dataset.
Features extracted from each supporting language are more diverse, as sentences are natural, as opposed to translated:
–although this may lead to a potential mismatch between training and testing distributions.

Experimental Evaluation
Used a subset of ambiguous words from Senseval 2 & 3:
–Avoid words with only one Wikipedia label.
=> English (30), Spanish (25), Italian (25), German (25).

Experimental Evaluation: Macro & Micro

Experimental Evaluation: Macro Results
WikiMonoSense better than MFS on 76 out of 105 words:
–Average relative error reduction of 44%, 38%, 44%, and 28%.
WikiTransSense better than MFS on 83 out of 105 words:
–Average relative error reduction over WikiMonoSense of 13.7%.
=> utility of using features from translated contexts.
WikiMuSense better than MFS on 89 out of 105 words:
–Average relative error reduction over WikiMonoSense of 16.5%.
=> multilingual WP data can successfully replace the MT component during training.
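For readers unfamiliar with the metrics, the sketch below shows the usual definitions: micro accuracy pools all test instances, macro accuracy averages the per-word accuracies, and relative error reduction compares the error rates (one minus accuracy) of a system and a baseline. These are the standard definitions and are assumed to match the ones used in the evaluation; the counts are toy values, not results from the paper.

```python
def micro_accuracy(per_word):
    """Accuracy over all instances pooled together.

    per_word: {word: (number_correct, number_of_instances)}
    """
    correct = sum(c for c, _ in per_word.values())
    total = sum(t for _, t in per_word.values())
    return correct / total


def macro_accuracy(per_word):
    """Unweighted mean of the per-word accuracies."""
    return sum(c / t for c, t in per_word.values()) / len(per_word)


def relative_error_reduction(baseline_acc, system_acc):
    """Fraction of the baseline's errors that the system removes."""
    baseline_err = 1.0 - baseline_acc
    system_err = 1.0 - system_acc
    return (baseline_err - system_err) / baseline_err


# Toy counts, invented for illustration only.
toy_results = {"bar": (90, 100), "chair": (5, 10)}
micro = micro_accuracy(toy_results)
macro = macro_accuracy(toy_results)
reduction = relative_error_reduction(0.50, 0.75)
```

With these toy counts the frequent word dominates the micro score, while the macro score treats both words equally, which is why the two can differ substantially.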

Varying the Number of Supporting Languages

Varying the Amount of Supporting Language Data
Dip likely due to suboptimal combination of classifiers when computing the disambiguation score.
[Future Work]: train weights for each supporting language.

Varying the Amount of Supporting Language Data
Peak likely due to suboptimal combination of classifiers when computing the disambiguation score.
[Future Work]: train weights for each supporting language.
Number of supporting examples = number of reference examples.

Future Work
1. Train weights for each supporting language, when combining classifier outputs in WikiMuSense.
2. Reduce the number of translations in WikiMuSense by choosing, from the 280 languages in WP, those supporting languages with the largest number of examples per sense.
3. Exploit directly the distributions used inside an MT system:
=> eliminate MT altogether from WikiMuSense.

Conclusion
WikiMonoSense:
–Use Wikipedia hyperlinks to train monolingual WSD classifiers.
WikiTransSense:
–The sense-tagged corpus extracted for the reference language is machine translated into a number of supporting languages.
–Use aligned sentences to generate additional features in a first approach to multilingual WSD.
WikiMuSense:
–Use the Wikipedia interlingual links to reduce reliance on MT.
–Train and combine multiple probabilistic classifiers, in a second approach to multilingual WSD.

Questions?