Presentation transcript:

Iterative Translation Disambiguation for Cross-Language Information Retrieval. Christof Monz and Bonnie J. Dorr, Institute for Advanced Computer Studies, University of Maryland. SIGIR 2005.

INTRODUCTION Query translation requires access to some form of translation resource: a machine translation system can be used to translate the entire query into the target language; a dictionary can be used to produce a number of target-language translations for each word or phrase in the source language; or a parallel corpus can be used to estimate the probability that a word w in the source language translates into a word w' in the target language.

INTRODUCTION This paper presents an approach that does not require a parallel corpus to induce translation probabilities; it needs only a machine-readable dictionary (without any rankings or frequency statistics) and a monolingual corpus in the target language.

TRANSLATION SELECTION Translation ambiguity is very common. One remedy is to apply word sense disambiguation, but for most languages the appropriate resources do not exist, and word-sense disambiguation is a non-trivial enterprise.

TRANSLATION SELECTION Our approach uses co-occurrences between terms to model context for the problem of word selection. Example: source term s1 has translation candidates t1,1, t1,2, t1,3; s2 has candidates t2,1, t2,2; and s3 has only t3,1 (see the sketch below).
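As a hedged illustration (the notation and variable names are ours, not the paper's), the candidate sets for this example can be represented as a simple mapping from each source-language term to its target-language translation candidates:

```python
# Hypothetical candidate sets: source term s1 has three translation
# candidates, s2 has two, and s3 has only one.
candidates = {
    "s1": ["t1_1", "t1_2", "t1_3"],
    "s2": ["t2_1", "t2_2"],
    "s3": ["t3_1"],
}
```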

TRANSLATION SELECTION Computing co-occurrence statistics for a larger number of terms induces a data-sparseness issue. Possible remedies: use very large corpora (the Web), or apply smoothing techniques.

ITERATIVE DISAMBIGUATION We only examine pairs of terms in order to gather partial evidence for the likelihood of a translation in a given context.

ITERATIVE DISAMBIGUATION Assume that ti,1 occurs more frequently with tj,1 than any other pair of candidate translations for si and sj. On the other hand, assume that ti,1 and tj,1 do not co-occur with tk,1 at all, but ti,2 and tj,2 do. Which should be preferred: ti,1 and tj,1, or ti,2 and tj,2?

ITERATIVE DISAMBIGUATION Associate with each translation candidate a weight (t is a translation candidate for source term si). Each term weight is recomputed based on two inputs: the weights of the terms that link to it, and the link weights wL(t, t') between t and each linked term t'.

ITERATIVE DISAMBIGUATION Term weights are then normalized. The iteration stops when the changes in term weights become smaller than some threshold, where wT is the vector of all term weights and vk is the kth element of that vector.
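Below is a minimal Python sketch of this iterative reweighting, reusing the candidate structure from the earlier example and assuming an association function assoc(t, t') that returns the link weight wL between two target-language terms (estimated from the monolingual corpus). Names, iteration cap, and stopping details are illustrative assumptions, not taken from the paper.

```python
def iterative_disambiguation(candidates, assoc, max_iter=20, threshold=1e-4):
    """candidates: dict mapping each source term to its list of translation candidates.
    assoc(t, t2): link weight wL between two target-language terms.
    Returns weights keyed by (source term, candidate), normalized per source term."""
    # Initialize: uniform weight over each source term's candidate set.
    weights = {(s, t): 1.0 / len(ts) for s, ts in candidates.items() for t in ts}

    for _ in range(max_iter):
        updated = {}
        # Reinforce each candidate by the link-weighted weights of the
        # candidates belonging to the other source terms.
        for s, ts in candidates.items():
            for t in ts:
                link_mass = sum(
                    assoc(t, t2) * weights[(s2, t2)]
                    for s2, ts2 in candidates.items() if s2 != s
                    for t2 in ts2
                )
                updated[(s, t)] = weights[(s, t)] + link_mass
        # Normalize so that the candidates of each source term sum to one.
        for s, ts in candidates.items():
            total = sum(updated[(s, t)] for t in ts) or 1.0
            for t in ts:
                updated[(s, t)] /= total
        # Stop once the largest change in any term weight drops below the threshold.
        delta = max(abs(updated[k] - weights[k]) for k in weights)
        weights = updated
        if delta < threshold:
            break
    return weights
```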

ITERATIVE DISAMBIGUATION There are a number of ways to compute the association strength between two terms: mutual information (MI), the Dice coefficient, and the log-likelihood ratio.
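The following hedged sketches compute these three measures from raw (co-)occurrence counts in a target-language corpus; the paper's exact estimation details (window size, smoothing) are not reproduced here.

```python
import math

def mutual_information(n_xy, n_x, n_y, n):
    """Pointwise mutual information: observed vs. expected co-occurrence."""
    if n_xy == 0:
        return 0.0
    return math.log((n_xy * n) / (n_x * n_y))

def dice(n_xy, n_x, n_y):
    """Dice coefficient: co-occurrences relative to the terms' total occurrences."""
    return 2.0 * n_xy / (n_x + n_y) if (n_x + n_y) else 0.0

def log_likelihood_ratio(n_xy, n_x, n_y, n):
    """Log-likelihood ratio over the 2x2 contingency table of terms x and y."""
    def ll(k, m):  # contribution k * log(k / m), with 0 * log 0 treated as 0
        return k * math.log(k / m) if k > 0 and m > 0 else 0.0
    # Observed cell counts of the contingency table.
    k11, k12, k21 = n_xy, n_x - n_xy, n_y - n_xy
    k22 = n - n_x - n_y + n_xy
    row1, row2 = k11 + k12, k21 + k22
    col1, col2 = k11 + k21, k12 + k22
    # Expected counts under independence.
    e11, e12 = row1 * col1 / n, row1 * col2 / n
    e21, e22 = row2 * col1 / n, row2 * col2 / n
    return 2.0 * (ll(k11, e11) + ll(k12, e12) + ll(k21, e21) + ll(k22, e22))
```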

ITERATIVE DISAMBIGUATION Example (worked example shown as a figure on the slide).

EXPERIMENT Set-Up Test Data: the CLEF 2003 English-to-German bilingual data. It contains 60 topics, four of which were removed by the CLEF organizers because they have no relevant documents. Each topic has a title, a description, and a narrative field; for our experiments we used only the title field to formulate the queries.

EXPERIMENT Set-Up Morphological normalization: since the dictionary only contains base forms, the words in the topics must be mapped to their respective base forms as well. Compounds are very frequent in German; instead of de-compounding, we use character 5-grams, an approach that yields almost the same retrieval performance as decompounding (see the sketch below).
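A minimal sketch of character 5-gram generation as an alternative to decompounding; lowercasing and the handling of words shorter than n are assumptions, not taken from the paper.

```python
def char_ngrams(word, n=5):
    """Return overlapping character n-grams of a word; short words are kept whole."""
    word = word.lower()
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]

# Example: a German compound is covered by n-grams of its parts without explicit splitting.
print(char_ngrams("autobahn"))
# ['autob', 'utoba', 'tobah', 'obahn']
```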

EXPERIMENT Set-Up Example topics and intermediate results of the query formulation process (shown as a figure on the slide).

EXPERIMENT Set-Up Retrieval Model: the Lnu.ltc weighting scheme. We used sl = 0.1, pv = the average number of unique words per document, uw_d = the number of unique words in document d, and w(i) = the weight of term i.
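As a hedged sketch, the Lnu document term weight with pivoted unique-term normalization and slope sl = 0.1 can be computed roughly as follows; this follows the standard SMART formulation and may differ in detail from the authors' implementation.

```python
import math

def lnu_weight(tf, avg_tf_in_doc, unique_words_in_doc, pivot, slope=0.1):
    """Lnu document term weight: log-averaged term frequency divided by a
    pivoted unique-term normalization factor.
    tf: raw frequency of the term in the document
    avg_tf_in_doc: average term frequency within the document
    unique_words_in_doc: uw_d, number of unique words in the document
    pivot: pv, average number of unique words per document in the collection"""
    if tf == 0:
        return 0.0
    l_part = (1.0 + math.log(tf)) / (1.0 + math.log(avg_tf_in_doc))
    norm = (1.0 - slope) * pivot + slope * unique_words_in_doc
    return l_part / norm
```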

Experimental Results

Individual average precision decreases for a number of queries. 6% of all English query terms were not in the dictionary; unknown words are treated as proper names, and the original word from the source language is included in the target-language query. For example, the word "Women" is falsely treated as a proper noun. Although faulty translations of this type affect both the baseline system and the run using term weights, the latter is affected more severely.

RELATED WORK Pirkola's approach does not consider disambiguation at all. Jang's approach uses MI to re-compute translation probabilities for cross-language retrieval, but it only considers mutual information between consecutive terms in the query, and the translation probabilities are not computed in an iterative fashion.

RELATED WORK Adriani's approach is similar to Jang's and does not benefit from using multiple iterations. Gao et al. use a decaying mutual-information score in combination with syntactic dependency relations; in contrast, we did not consider distances between words.

RELATED WORK Maeda et al. compare a number of co-occurrence statistics with respect to their usefulness for improving retrieval effectiveness. They consider all pairs of possible translations of words in the query and use co-occurrence information to select translations of words from the topic for query formulation, instead of re-weighting them.

RELATED WORK Kikui's approach also needs only a dictionary and monolingual resources in the target language; it computes the coherence between all possible combinations of translation candidates of the source terms.

CONCLUSIONS We introduced a new algorithm for computing topic-dependent translation probabilities for cross-language information retrieval. We experimented with different term association measures; experimental results show that the log-likelihood ratio has the strongest positive impact on retrieval effectiveness.

CONCLUSIONS An important advantage of our approach is that it only requires a bilingual dictionary and a monolingual corpus. An issue that remains open at this point is the handling of query terms that are not covered by the bilingual dictionary.