Multilingual Search using Query Translation and Collection Selection
Jacques Savoy, Pierre-Yves Berger
University of Neuchatel, Switzerland
www.unine.ch/info/clef/

Our IR Strategy
1. Effective monolingual IR system (combining indexing schemes?)
2. Combining query translation tools
3. Effective collection selection and merging strategies

Effective Monolingual IR
- Define a stopword list
- Have a « good » stemmer: remove only the feminine and plural suffixes (inflections)
- Focus on nouns and adjectives
- See www.unine.ch/info/clef/
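As an illustration, a minimal sketch of such a light stemmer for French. The rules below are hypothetical simplifications; the actual UniNE stemmer documented at www.unine.ch/info/clef/ has more rules.

```python
def light_stem_fr(word: str) -> str:
    # Hypothetical light-stemming rules in the spirit described above:
    # strip only plural and feminine inflections, never derivational
    # suffixes. The real UniNE rule set is more complete.
    w = word.lower()
    if len(w) > 5 and w.endswith("aux"):       # chevaux -> cheval
        return w[:-3] + "al"
    if len(w) > 4 and w.endswith(("s", "x")):  # plural: grandes -> grande
        w = w[:-1]
    if len(w) > 4 and w.endswith("e"):         # feminine: grande -> grand
        w = w[:-1]
    return w

print(light_stem_fr("grandes"))  # -> grand
```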

Indexing Units
- Indo-European languages: set of significant words
- Finnish: character 4-grams
- CJK: bigrams with stoplist, without stemming
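A sketch of such character n-gram indexing, assuming overlapping n-grams within each word (the usual CLEF practice; the slides do not state this detail):

```python
def char_ngrams(text: str, n: int = 4) -> list[str]:
    # Overlapping character n-grams per word: n=4 for Finnish,
    # n=2 (bigrams) for CJK. Words shorter than n are kept whole.
    grams = []
    for word in text.lower().split():
        if len(word) <= n:
            grams.append(word)
        else:
            grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams

print(char_ngrams("saunastani"))
# -> ['saun', 'auna', 'unas', 'nast', 'asta', 'stan', 'tani']
```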

IR Models
- Vector-space: Lnu-ltc, tf-idf (ntc-ntc), binary (bnn-bnn)
- Probabilistic: Okapi, Prosit (deviation from randomness)
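For reference, the Okapi model scores a query term with the standard BM25 weight; a textbook sketch follows (k1 and b are common defaults, not necessarily the constants used in these runs):

```python
import math

def okapi_weight(tf: float, df: int, n_docs: int,
                 dl: float, avg_dl: float,
                 k1: float = 1.2, b: float = 0.75) -> float:
    # Textbook Okapi BM25 term weight: idf times a saturating,
    # length-normalized term-frequency component.
    idf = math.log((n_docs - df + 0.5) / (df + 0.5))
    norm = k1 * ((1 - b) + b * dl / avg_dl)
    return idf * tf * (k1 + 1) / (tf + norm)
```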

Monolingual Evaluation (MAP, Q=TD, word indexing)

Model     English   French    Portug.
Okapi     0.5422    0.4685    0.4835
Prosit    0.5313    0.4568    0.4695
Lnu-ltc   0.4979    0.4349    0.4579
tf-idf    0.3706    0.3309    0.3708
binary    0.3005    0.2017    0.1834

Monolingual Evaluation (MAP, Q=TD, word indexing)

Model     Finnish   Russian
Okapi     0.4773    0.3800
Prosit    0.4620    0.3448
Lnu-ltc   0.4643    0.3794
tf-idf    0.3862    0.2716
binary    0.1859    0.1512

Monolingual Evaluation (MAP, Q=TD)

          Finnish             Russian
Model     word      4-gram    word      4-gram
Okapi     0.4773    0.5386    0.3800    0.2890
Prosit    0.4620    0.5357    0.3448    0.2879
Lnu-ltc   0.4643    0.5022    0.3794    0.2852
tf-idf    0.3862    0.4466    0.2716    0.1916
binary    0.1859    0.2387    0.1512    0.0373

Problems with Finnish
Finnish can be characterized by:
1. A large number of cases (12+)
2. Agglutinative (saunastani) & compound constructions (rakkauskirje)
3. Variations in the stem (matto, maton, mattoja, …)
Do we need a deeper morphological analysis?

Improvement using… Relevance feedback (Rocchio)?

Monolingual Evaluation with blind relevance feedback (MAP, Q=TD)

            English   French    Finnish   Russian
Okapi       0.5422    0.4685    0.5386    0.3800
 +PRF       0.5704    0.4851    0.5308    0.3945
 change     +5.2%     +3.5%     -1.5%     +3.8%
Prosit      0.5313    0.4568    0.5357    0.3448
 +PRF       0.5742    0.4643    0.5684    0.3736
 change     +8.1%     +1.6%     +6.1%     +8.4%

Relevance Feedback
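The Rocchio approach forms a new query vector from the original query plus the centroid of the top-ranked documents; a minimal sketch (alpha and beta are illustrative weights, and the negative-feedback term is omitted):

```python
def rocchio(query_vec: dict, top_docs: list[dict],
            alpha: float = 1.0, beta: float = 0.75) -> dict:
    # Blind relevance feedback a la Rocchio: move the query vector
    # toward the centroid of the top-ranked documents. alpha/beta
    # are illustrative, not the weights used in these experiments.
    new_q = {t: alpha * w for t, w in query_vec.items()}
    for doc in top_docs:
        for t, w in doc.items():
            new_q[t] = new_q.get(t, 0.0) + beta * w / len(top_docs)
    return new_q
```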

Improvement using…
Data fusion approaches (see the paper):
- Improves MAP for the English, Finnish and Portuguese corpora
- Hurts MAP for the French and Russian collections

Bilingual IR
1. Machine-readable dictionaries (MRD), e.g., Babylon
2. Machine translation services (MT), e.g., FreeTranslation, BabelFish
3. Parallel and/or comparable corpora (not used in this evaluation campaign)

Bilingual IR: a translation example
English topic: “Tour de France Winner. Who won the Tour de France in 1995?”
Babylon1 output: “France, vainqueur Organisation Mondiale de la Santé, le, France,* 1995” (note how « Who » was rendered as « Organisation Mondiale de la Santé », the World Health Organization)
Correct translation: “Le vainqueur du Tour de France. Qui a gagné le tour de France en 1995 ?”

Bilingual EN->FR/FI/RU/PT (MAP, TD, Okapi)

          French    Finnish   Russian   Portug.
          word      4-gram    word      word
Manual    0.4685    0.5386    0.3800    0.4835
Baby1     0.3706    0.1965    0.2209    0.3071
FreeTra   0.3845    N/A       0.3067    0.4057
InterTra  0.2664    0.2653    0.1216    0.3277
Reverso   0.3830    N/A       0.2960    N/A
Combi.    0.4066    0.3042    0.3888    0.4204

Bilingual EN->FR/FI/RU/PT (TD, Okapi), as a percentage of the manual translation

          French    Finnish   Russian   Portug.
          word      4-gram    word      word
Manual    0.4685    0.5386    0.3800    0.4835
Baby1     79.1%     36.5%     58.1%     63.5%
FreeTra   82.1%     N/A       80.7%     83.9%
InterTra  56.9%     49.3%     32.0%     67.8%
Reverso   81.8%     N/A       77.9%     N/A
Combi.    86.8%     56.5%     102.3%    86.9%

Bilingual IR Improvement…
- Query combination (concatenation; see the sketch below)
- Relevance feedback (yes for PT, FR, RU; no for FI)
- Data fusion (yes for FR; no for RU, FI, PT)
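Query combination by concatenation is straightforward: the translations produced by each tool are appended into one long query, so a term proposed by several translators occurs several times and is implicitly up-weighted. A sketch, where the translator wrappers are hypothetical:

```python
def combine_translations(query: str, translators: list) -> str:
    # Concatenate the output of every available translation tool.
    # `translators` holds hypothetical wrapper functions around
    # tools such as Babylon, FreeTranslation or Reverso.
    return " ".join(translate(query) for translate in translators)
```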

Multilingual EN -> EN FI FR RU
1. Create a common index: document translation (DT)
2. Search each language separately and merge the result lists: query translation (QT)
3. Mix QT and DT
4. No translation

Merging Problem (figure): four separate ranked lists (EN, FI, FR, RU) feed into a single merging step.

Merging Problem: example ranked lists with retrieval scores
EN list: 1. EN120 (1.2), 2. EN200 (1.0), 3. EN050 (0.7), 4. EN705 (0.6), …
FR list: 1. FR043 (0.8), 2. FR120 (0.75), 3. FR055 (0.65), …
RU list: 1. RU050 (1.6), 2. RU005 (1.1), 3. RU120 (0.9), …

Merging Strategies
- Round-robin (baseline; sketch below)
- Raw-score merging
- Normalize
- Biased round-robin
- Z-score
- Logistic regression
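Round-robin, the baseline, simply interleaves the per-collection lists: rank 1 of each collection, then rank 2 of each, and so on. A minimal sketch:

```python
from itertools import zip_longest

def round_robin(result_lists: list[list[str]]) -> list[str]:
    # Interleave ranked lists rank by rank. (A biased round-robin
    # would visit presumably better collections more often.)
    merged = []
    for tier in zip_longest(*result_lists):
        merged.extend(doc for doc in tier if doc is not None)
    return merged

print(round_robin([["EN120", "EN200"], ["FR043"], ["RU050", "RU005"]]))
# -> ['EN120', 'FR043', 'RU050', 'EN200', 'RU005']
```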

Merging Problem (figure): the per-language lists above are combined into one merged list, e.g. 1. RU…, 2. EN…, 3. RU…, 4. …

Test-Collection (CLEF-2004)

            EN        FI        FR        RU
size        154 MB    137 MB    244 MB    68 MB
doc         56,472    55,344    90,261    16,716
topic       42        45        49        34
rel./query  8.9       9.2       18.7      3.6

Z-Score Normalization
Compute the mean m and standard deviation s of each list's scores, then:
new score = ((old score - m) / s) + d

Before: 1. EN120 (1.2), 2. EN200 (1.0), 3. EN050 (0.7), 4. EN765 (0.6), …
After:  1. EN120 (7.0), 2. EN200 (5.0), 3. EN050 (2.0), 4. EN765 (1.0), …
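A sketch of this normalization applied to one result list; the offset d is illustrative, as its actual value is not given in the slides:

```python
import statistics

def z_score_normalize(run: list[tuple[str, float]], d: float = 3.0):
    # run: (doc_id, score) pairs from one collection.
    # New score = ((old - m) / s) + d, so scores from different
    # collections become comparable; d keeps most scores positive.
    scores = [s for _, s in run]
    m, s = statistics.mean(scores), statistics.stdev(scores)
    return [(doc, (score - m) / s + d) for doc, score in run]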

MLIR Evaluation
- Condition A (meta-search): Prosit/Okapi (+PRF, data fusion for FR and EN)
- Condition C (same search engine): Prosit only (+PRF)

Multilingual Evaluation EN -> EN FR FI RU (MAP)

              Cond. A   Cond. C
Round-robin   0.2386    0.2358
Raw-score     0.0642    0.3067
Norm          0.2899    0.2646
Biased RR     0.2639    0.2613
Z-score wt    0.2669    0.2867
Logistic      0.3090    0.3393
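The best-performing logistic regression approach maps each document's rank and retrieval score to a probability of relevance and sorts the union of all lists by that probability. A sketch with placeholder coefficients; in practice they would be fitted on earlier CLEF data, one model per collection:

```python
import math

def prob_relevance(rank: int, score: float,
                   b0: float = -1.0, b1: float = -0.05,
                   b2: float = 1.2) -> float:
    # P(relevant | rank, score) under a logistic model;
    # b0, b1, b2 are hypothetical fitted coefficients.
    x = b0 + b1 * math.log(rank) + b2 * score
    return 1.0 / (1.0 + math.exp(-x))

def merge_logistic(result_lists: list[list[tuple[str, float]]]):
    # Score every (doc, score) pair in every list, then sort the
    # union by estimated probability of relevance.
    scored = [(doc, prob_relevance(r, s))
              for run in result_lists
              for r, (doc, s) in enumerate(run, start=1)]
    return sorted(scored, key=lambda p: p[1], reverse=True)
```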

Collection Selection
General idea:
1. Use logistic regression to compute a probability of relevance for each document;
2. Sum these scores over the first 15 best-ranked documents of each list;
3. If this sum ≥ d, keep the whole list; otherwise select only its first m documents.
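A sketch of this selection rule; delta stands for the threshold d above, and both delta and m would be tuned on training data:

```python
def select_collection(run: list[tuple[str, float]],
                      delta: float = 5.0, m: int = 3):
    # run: (doc_id, probability of relevance) sorted by probability.
    # Keep the whole list when the 15 best probabilities sum to at
    # least delta; otherwise keep only the first m documents
    # (m=0 drops the collection entirely). delta/m are illustrative.
    if sum(p for _, p in run[:15]) >= delta:
        return run
    return run[:m]
```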

Multilingual Evaluation EN -> EN FR FI RU (MAP)

                 Cond. A   Cond. C
Round-robin      0.2386    0.2358
Logistic reg.    0.3090    0.3393
Optimal select   0.3234    0.3558
Select (m=0)     0.2957    0.3405
Select (m=3)     0.2953    0.3378

Conclusion (monolingual)
- Finnish is still a difficult language
- Blind query expansion: different patterns with different IR models
- Data fusion (?)

Conclusion (bilingual)
- Translation resources are not available for all language pairs
- Combining query translations brings an improvement, but can we find a better combination scheme?

Conclusion (multilingual)
- Selection and merging are still hard problems
- Overfitting is not always the best choice (see results with Condition C)
