Intelligent Database Systems Lab
國立雲林科技大學 National Yunlin University of Science and Technology

Iterative Translation Disambiguation for Cross-Language Information Retrieval
Advisor: Dr. Hsu
Presenter: Yu-San Hsieh
Authors: Christof Monz and Bonnie J. Dorr (SIGIR 2005)
Outline
─ Motivation
─ Objective
─ Approach
─ Experiment Result
─ Introduction
─ Experiment
─ Conclusions
Motivation
Many words or phrases in one language can be translated into another language in a number of ways, so translation ambiguity is very common, and it affects the effectiveness of information retrieval.
Example: Penalty (English) => Elfmeter (soccer) or Strafe (punishment)
Objective
Find a proper distribution of translation probabilities that resolves the translation ambiguity problem.
Approach
Find a proper distribution of translation probabilities.
Computing term weights
─ Initialization step
─ Iteration step
─ Normalization step
─ All term weights are collected in a vector
─ Iteration stop
Example query: trade union europe
Candidate translations: gewerbe, geschaeft, handel, europa, union, gewerkschaft
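The steps listed on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the update rule (old weight plus association-weighted support from the candidates of the other query terms), the stopping test (largest weight change below a tolerance), the grouping of candidates under each source term, and every association value in `TOY_ASSOC` are assumptions made for the example.

```python
def disambiguate(candidates, assoc, max_iter=100, tol=1e-6):
    """candidates: dict mapping each source term to its target-language candidates.
    assoc: function (t1, t2) -> non-negative association strength."""
    # Initialization step: uniform weights over the candidates of each source term.
    w = {s: {t: 1.0 / len(ts) for t in ts} for s, ts in candidates.items()}
    for _ in range(max_iter):
        new_w = {}
        for s, ts in candidates.items():
            scores = {}
            for t in ts:
                # Iteration step: add support from candidates of the other source terms.
                support = sum(assoc(t, t2) * w[s2][t2]
                              for s2, ts2 in candidates.items() if s2 != s
                              for t2 in ts2)
                scores[t] = w[s][t] + support
            # Normalization step: weights of one source term sum to 1.
            total = sum(scores.values())
            new_w[s] = {t: v / total for t, v in scores.items()}
        # Iteration stop: compare the old and new weight vectors.
        change = max(abs(new_w[s][t] - w[s][t]) for s in w for t in w[s])
        w = new_w
        if change < tol:
            break
    return w

# Hypothetical grouping of the slide's candidates and made-up association values.
CANDIDATES = {
    "trade": ["gewerbe", "geschaeft", "handel"],
    "union": ["union", "gewerkschaft"],
    "europe": ["europa"],
}
TOY_ASSOC = {
    frozenset(["handel", "europa"]): 0.6,
    frozenset(["handel", "union"]): 0.3,
    frozenset(["handel", "gewerkschaft"]): 0.2,
    frozenset(["gewerkschaft", "europa"]): 0.4,
    frozenset(["gewerbe", "gewerkschaft"]): 0.3,
    frozenset(["geschaeft", "union"]): 0.1,
    frozenset(["union", "europa"]): 0.5,
}

def toy_assoc(t1, t2):
    # Symmetric lookup; unseen pairs get zero association.
    return TOY_ASSOC.get(frozenset([t1, t2]), 0.0)

weights = disambiguate(CANDIDATES, toy_assoc)
```

In a real setting the association function would come from co-occurrence statistics in a target-language corpus rather than a hand-written table.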
Approach
Measuring association strength
─ Pointwise mutual information
─ Dice coefficient
─ Log Likelihood ratio
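For reference, the three measures can be computed from co-occurrence counts as in the sketch below. It uses the standard textbook definitions; the slide does not say how co-occurrence is counted, so the count arguments (`n_xy` units containing both terms, `n_x` and `n_y` units containing each term, `n` units in total) are assumptions for illustration.

```python
import math

def pmi(n_xy, n_x, n_y, n):
    # Pointwise mutual information: log of p(x, y) / (p(x) * p(y)).
    if n_xy == 0:
        return float("-inf")
    return math.log((n_xy * n) / (n_x * n_y))

def dice(n_xy, n_x, n_y):
    # Dice coefficient: 2 * f(x, y) / (f(x) + f(y)).
    return 2.0 * n_xy / (n_x + n_y)

def log_likelihood_ratio(n_xy, n_x, n_y, n):
    # G^2 statistic over the 2x2 contingency table of x and y.
    k11 = n_xy                   # both terms
    k12 = n_x - n_xy             # x without y
    k21 = n_y - n_xy             # y without x
    k22 = n - n_x - n_y + n_xy   # neither term
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    cells = ((k11, rows[0], cols[0]), (k12, rows[0], cols[1]),
             (k21, rows[1], cols[0]), (k22, rows[1], cols[1]))
    g2 = 0.0
    for k, r, c in cells:
        if k > 0:  # empty cells contribute nothing
            g2 += k * math.log(k * n / (r * c))
    return 2.0 * g2
```

When two terms occur independently, both PMI and the log likelihood ratio are zero; stronger association pushes them above zero.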
Experiment Result
[Figure: differences from the baseline for individual queries (topics); label: improvement]
Introduction
Two techniques for cross-language retrieval
─ Translate the document collection into the target language and apply monolingual retrieval
─ Translate the query into the target language and retrieve with the translated query
Three approaches may be used to produce the translations
─ Machine translation system
─ Dictionary
─ Parallel corpus to estimate the translation probabilities
Introduction
A word in one language can be translated into another language in a number of ways.
─ Penalty (English) => Elfmeter (soccer) or Strafe (punishment)
Introduction
One approach to the word selection problem is to use co-occurrences between terms.
Problem (with a larger number of terms)
─ Data sparseness
  ─ Use very large corpora for counting co-occurrence frequencies
  ─ Use internet search engines
  ─ Smoothing
Experiment
Test data
─ CLEF 2003 English-to-German bilingual data
─ 56 topics selected (title, description, narrative)
Morphological normalization
─ Source-language words (topic) are normalized to match entries in the bilingual dictionary
─ De-compounding: 5-grams
─ Weights are assigned to the 5-gram substrings
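As a small illustration of the 5-gram step, the sketch below splits a word into overlapping character 5-grams. The function name `char_ngrams`, the lowercasing, and the handling of short words are assumptions; the slide does not say how the 5-gram weights are assigned, so no weighting is shown.

```python
def char_ngrams(word, n=5):
    """Return the overlapping character n-grams of a word."""
    word = word.lower()
    # Words no longer than n are kept whole (an assumption for this sketch).
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]

grams = char_ngrams("gewerkschaft")
```

Matching on substrings like these lets a long German compound match documents that contain only one of its parts, without an explicit compound splitter.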
Experiment
Retrieval model
─ Lnu.ltc weighting scheme
─ Weighted document similarity
Statistical significance
─ Bootstrap method
  ─ Bootstrap samples
  ─ One-tailed significance testing (to compare two retrieval methods)
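A one-tailed bootstrap comparison of two retrieval methods could look like the sketch below. It is one common variant (a paired test that shifts the per-query differences to zero mean before resampling) and is not necessarily the exact procedure used in the paper; the function name, the number of samples, and the add-one correction of the p-value are assumptions.

```python
import random

def paired_bootstrap_pvalue(scores_a, scores_b, n_samples=2000, seed=0):
    """One-tailed p-value for 'method A is better than method B'.
    scores_a, scores_b: per-query average precision, in the same query order."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = sum(diffs) / n
    # Under the null hypothesis the mean difference is zero, so shift the data.
    shifted = [d - observed for d in diffs]
    hits = 0
    for _ in range(n_samples):
        # Bootstrap sample: draw n differences with replacement.
        sample = [rng.choice(shifted) for _ in range(n)]
        if sum(sample) / n >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from exactly zero.
    return (hits + 1) / (n_samples + 1)
```

A small p-value suggests that the observed improvement of A over B is unlikely to come from query-to-query variation alone.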
Experiment
Problems found in the experiment
─ With the Log Likelihood ratio, the average precision of individual queries decreases for a number of queries.
Unknown words
─ The original word from the source language is included in the target-language query.
Example: Women's Conference Beijing
─ Women (proper noun) => Women, assigned weight = 1
─ Normalized form Woman: not found
Result
1. Woman dominates document similarity
2. Most top-ranked documents contain Women as the only matching term.
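The unknown-word effect can be reproduced with a toy dictionary lookup, sketched below. The dictionary entries and the uniform split of weight among candidates are assumptions for illustration only; the point is that a word missing from the dictionary passes through unchanged with weight 1, while each translated candidate of an ambiguous word receives only a share.

```python
def translate_query(source_terms, dictionary):
    """Map each source term to weighted target terms.
    Terms missing from the dictionary are kept as-is with weight 1."""
    translated = {}
    for term in source_terms:
        candidates = dictionary.get(term.lower())
        if candidates:
            # Known word: split the weight evenly among its candidates.
            for cand in candidates:
                translated[cand] = 1.0 / len(candidates)
        else:
            # Unknown word: the source word enters the target query unchanged.
            translated[term] = 1.0
    return translated

# Hypothetical dictionary without an entry for "women".
TOY_DICTIONARY = {
    "conference": ["konferenz", "tagung"],
    "beijing": ["peking"],
}

query = translate_query(["Women", "Conference", "Beijing"], TOY_DICTIONARY)
```

Here the untranslated word carries a full weight of 1 into a German collection, which matches the slide's observation that such a term can dominate the similarity score.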
Conclusions
The approach improves retrieval effectiveness compared to a baseline that uses bilingual dictionary lookup.
Experimental results show that the Log Likelihood ratio has the strongest positive impact.
My opinion
Advantage: It only requires a bilingual dictionary and a monolingual corpus in the target language.
Disadvantage: Unknown words
Application: