SINAI at CLEF 2004: Using Machine Translation resources with mixed 2-step RSV merging algorithm. Fernando Martínez Santiago, Miguel Ángel García Cumbreras.

Presentation transcript:

SINAI at CLEF 2004: Using Machine Translation resources with mixed 2-step RSV merging algorithm
Fernando Martínez Santiago, Miguel Ángel García Cumbreras, Manuel Carlos Díaz Galiano, Alfonso Ureña López
University of Jaén - SPAIN

University of Jaén -- SINAI at CLEF
SINAI at CLEF 2004
• Multilingual CLEF task
• Only queries are translated, so merging algorithms are required
• Main interest: testing Machine Translation (MT) with the mixed 2-step RSV merging algorithm.
The 2-step RSV calculation requires that, for each word of the original query, its translation into each of the other languages be known. Because MT translates a whole phrase better than it translates word by word, MT cannot be used directly with the 2-step RSV merging algorithm. We propose a straightforward and effective algorithm to align the original query and its translation at term level.

Content
• How does the 2-step RSV merging algorithm work?
• An algorithm to align parallel text at term level based on MT
• Experimentation framework
• Results
• Global pseudo-relevance feedback
• Conclusions and future work

2-step RSV method

Query: Conflict of Interest in Italy?
Translations: Conflicto intereses Italia (Spanish), Conflitto interessi Italia (Italian), Conflits intérêt Italie (French)

STEP 1: Spanish IR, Italian IR and French IR → Spanish, Italian, French retrieved documents
STEP 2: Concept IR, indexing concepts (#1, #2, #3) → multilingual list of ranked documents

Spanish            Italian            French            Concept
Term        df     Term        df     Term       df     Id.   df
conflicto   1000   conflitto   1500   conflits   1250   #1    3750
intereses   5000   interessi   6000   intérêt    4000   #2    15000
Italia      3500   Italia      6000   Italie     4500   #3    14000

#1 (conflicto, conflitto, conflits)
#2 (intereses, interessi, intérêt)
#3 (Italia, Italia, Italie)
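The concept document frequencies in the table above are the sums of the per-language document frequencies of the aligned terms. A minimal Python sketch of that bookkeeping follows. The function names, the language codes, and the lowercased terms are illustrative assumptions, not part of the SINAI system.

```python
# Sketch only: names and data layout are hypothetical, values come from the slide.
CONCEPTS = {
    "#1": {"es": "conflicto", "it": "conflitto", "fr": "conflits"},
    "#2": {"es": "intereses", "it": "interessi", "fr": "intérêt"},
    "#3": {"es": "italia", "it": "italia", "fr": "italie"},
}

DF = {
    "es": {"conflicto": 1000, "intereses": 5000, "italia": 3500},
    "it": {"conflitto": 1500, "interessi": 6000, "italia": 6000},
    "fr": {"conflits": 1250, "intérêt": 4000, "italie": 4500},
}


def concept_df(concepts, df_by_language):
    """Document frequency of a concept = sum of the df of its aligned terms."""
    return {
        cid: sum(df_by_language[lang][term] for lang, term in terms.items())
        for cid, terms in concepts.items()
    }


def to_concept_ids(tokens, language, concepts):
    """Map the query terms found in a retrieved document to concept ids,
    which is the unit the second-step index works on."""
    lookup = {terms[language]: cid
              for cid, terms in concepts.items() if language in terms}
    return [lookup[t] for t in tokens if t in lookup]


print(concept_df(CONCEPTS, DF))
print(to_concept_ids(["conflitto", "di", "interessi"], "it", CONCEPTS))
```

With concept ids and concept-level frequencies in place, documents from all languages can be scored by one retrieval model over one index, which is what makes their scores comparable.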

An algorithm to align at term level a phrase and its translations by using machine translation resources

P_en → Pesticides in baby food
P_sp → Pesticidas en alimento para niños (machine translation of the whole phrase)

Unigrams(P_en) = {Pesticides, baby, food}
Bigrams(P_en) = {Pesticides baby, baby food}
U(P_sp) = {Pesticidas, alimento, bebés}
B(P_sp) = {Pesticidas bebés, alimento para niños}

After removing stopwords and stemming:
P_sp → Pesticida alimento niño
U(P_sp) = {Pesticida, alimento, bebé}
B(P_sp) = {Pesticida bebé, alimento niño}

STEP 5, align unigrams:
Pesticida alimento niño, {Pesticida, alimento, bebé}
Pesticide baby food, {Pesticide, food, baby}

STEP 6, align bigrams:
Pesticida alimento niño, {Pesticida bebé, alimento niño}
Pesticide baby food, {Pesticide baby, baby food}
...

(Steps 1 to 4 in the original figure cover machine translation, n-gram extraction, stopword removal and stemming.)
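The unigram and bigram alignment steps can be sketched in Python as follows. This is an illustrative reading of the slide, not the authors' implementation. The function name, the exact-match rule, and the way a bigram resolves a leftover term are assumptions, and the MT output is hard-coded from the example.

```python
def align_terms(phrase_tr, unigram_tr, bigram_tr):
    """Align source terms with target terms of the translated phrase.

    phrase_tr : stemmed, stopword-free tokens of the MT of the whole phrase
    unigram_tr: source term -> stemmed MT of that term alone
    bigram_tr : (source term, source term) -> stemmed MT tokens of the bigram
    """
    alignment = {}

    # Unigram pass: a source term is aligned when its isolated translation
    # also occurs in the translation of the whole phrase.
    for src, tgt in unigram_tr.items():
        if tgt in phrase_tr:
            alignment[src] = tgt

    # Bigram pass: if a translated bigram occurs in the phrase and one of its
    # source terms is already aligned, the remaining target token is taken
    # as the translation of the other source term.
    for (s1, s2), tgt_tokens in bigram_tr.items():
        if len(tgt_tokens) != 2 or not all(t in phrase_tr for t in tgt_tokens):
            continue
        for known, unknown in ((s1, s2), (s2, s1)):
            if (known in alignment and unknown not in alignment
                    and alignment[known] in tgt_tokens):
                rest = [t for t in tgt_tokens if t != alignment[known]]
                if len(rest) == 1:
                    alignment[unknown] = rest[0]
    return alignment


phrase = ["pesticida", "alimento", "niño"]
unigrams = {"pesticide": "pesticida", "baby": "bebé", "food": "alimento"}
bigrams = {
    ("pesticide", "baby"): ["pesticida", "bebé"],
    ("baby", "food"): ["alimento", "niño"],
}
print(align_terms(phrase, unigrams, bigrams))
```

In the slide's example, "baby" is translated as "bebé" in isolation but as "niño" inside the phrase, so the unigram pass leaves it unaligned and the bigram "baby food" → "alimento niño" resolves it.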

An algorithm to align at term level a phrase and its translations by using machine translation resources

Spanish   German   French   Italian
91%       87%      86%      88%
Percentage of aligned non-empty words (CLEF2001+CLEF2002+CLEF2003 query set, Title+Description fields, Babelfish machine translation)

Finnish   French   Russian
100%      85%      80%
Percentage of aligned non-empty words (CLEF2004 query set, Title+Description fields, MT for French and Russian, MDR for Finnish)

Mixed 2-step RSV: queries partially aligned
• Raw mixed 2-step RSV:
• Logistic regression:
[Figure: the query "Conflicto de intereses en Italia" is split into aligned terms, scored against the dynamic 2-step RSV index, and non-aligned terms, scored against the monolingual (Spanish) index. Each document therefore receives two scores, an RSV for aligned terms and an RSV for non-aligned terms. Figure labels: 573, SP123, 39.]
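The formulas on this slide did not survive in the transcript, so the following is not a reconstruction of them. It is only a hedged Python illustration of the general idea of integrating the two scores of a document. The linear form, the function name `mixed_rsv`, and the weight `alpha` are all assumptions.

```python
def mixed_rsv(rsv_aligned, rsv_non_aligned, alpha):
    """Combine the two scores of one document into a single value.

    rsv_aligned     : score from the 2-step RSV (concept) index
    rsv_non_aligned : score from the monolingual index, for terms that
                      could not be aligned
    alpha           : hypothetical weight in [0, 1] given to the aligned score
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return alpha * rsv_aligned + (1.0 - alpha) * rsv_non_aligned


# Illustrative values only; they are not taken from the slide.
print(mixed_rsv(10.0, 2.0, alpha=0.75))
```

Any such combination assumes that the two scores are on comparable scales.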

Experimentation framework: language-dependent features
Languages: English, Finnish, French, Russian
• Preprocessing: stop words removed and stemming
• Additional preprocessing: compound words split into simple words (Finnish); Cyrillic → ASCII (Russian)
• Query alignment at word level: algorithm based on MT
• Translation approach: FinnPlace MDR (Finnish), Reverso MT (French), Prompt MT (Russian)

Experimentation framework: language-independent features
• ZPrise IR engine
• OKAPI probabilistic model (fixed at b = 0.75 and k1 = 1.2 for every language and for the 2-step RSV index)
• Neither blind feedback nor query expansion (no improvement except for French)
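For reference, the role of b and k1 can be seen in a common textbook form of the Okapi BM25 term weight, sketched below in Python. The exact variant implemented by ZPrise is not given in the slides, so the idf formula and the function name are assumptions. Only the values b = 0.75 and k1 = 1.2 come from the slide.

```python
import math


def bm25_term_weight(tf, df, n_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Weight of one query term in one document (common BM25 variant).

    tf      : term frequency in the document
    df      : number of documents containing the term
    n_docs  : number of documents in the collection
    doc_len, avg_doc_len : document length and average document length
    """
    # Smoothed idf that stays non-negative even for very frequent terms.
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
    # b controls length normalization, k1 controls tf saturation.
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / (tf + norm)


def bm25_score(query_terms, doc_tf, df, n_docs, doc_len, avg_doc_len):
    """Sum of the term weights over the query terms present in the document."""
    return sum(
        bm25_term_weight(doc_tf[t], df[t], n_docs, doc_len, avg_doc_len)
        for t in query_terms
        if t in doc_tf and t in df
    )
```

In the second step of 2-step RSV the same formula would be applied to concepts rather than terms, using concept-level frequencies.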

Results

Global pseudo-relevance feedback
• The idea: since 2-step RSV creates a single index for all collections and returns a single multilingual list of documents, why not apply PRF with that index?
• The implementation:
1. Merge the document rankings using 2-step RSV.
2. Apply blind relevance feedback to the top-N documents ranked in the multilingual list of documents.
3. Add the top-N most meaningful terms to the query.
4. Expand the concept query with the selected terms.
5. Apply 2-step RSV again over the ranked lists of documents, using the expanded query instead of the original query.
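Steps 2 to 4 above can be sketched in Python as follows. The slides do not say how the "most meaningful" terms are chosen, so the raw frequency count used here is a stand-in assumption. The function name and the toy documents are hypothetical as well.

```python
from collections import Counter


def expand_concept_query(query, ranked_docs, n_docs=2, n_terms=2):
    """Expand a concept query with terms from the top-ranked documents.

    query       : list of concept ids or terms in the current query
    ranked_docs : documents of the merged multilingual list, best first,
                  each given as a list of index units (concepts or terms)
    """
    counts = Counter()
    for doc in ranked_docs[:n_docs]:
        counts.update(t for t in doc if t not in query)
    # Term selection by frequency is an assumption; ties broken alphabetically.
    best = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n_terms]
    return list(query) + [term for term, _ in best]


merged = [
    ["#1", "#2", "ocse", "politica"],
    ["#3", "ocse", "broadcasting"],
    ["#1", "sport"],
]
print(expand_concept_query(["#1", "#2", "#3"], merged))
```

The expanded query is then used to re-run the second step over the documents already retrieved for each language.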

[Figure: the 2-step RSV diagram with global PRF. Query: Conflict of Interest in Italy? Translations: Conflicto intereses Italia, Conflitto interessi Italia, Conflits intérêt Italie. Spanish IR, Italian IR and French IR → Spanish, Italian, French retrieved documents → Concept IR, indexing concepts (#1, #2, #3) → multilingual list of ranked documents.
#1 (conflicto, conflitto, conflits)
#2 (intereses, interessi, intérêt)
#3 (Italia, Italia, Italie)
Expansion terms shown in the figure: OCSE, concept#2, politica, concept#1, broadcasting]

Global pseudo-relevance feedback
• But global PRF doesn't work: blind relevance feedback is usually poorly suited to CLEF document collections. We use the expanded query to apply 2-step RSV, re-weighting the documents retrieved for each language, but the list of retrieved documents does not change (only the scores of those documents change).

Conclusions and future work
• The 2-step RSV merging algorithm works well with Machine Translation!
• We propose a new word-level alignment algorithm based on MT
• In the future:
  ◦ Partially aligned queries → the integration of the two document scores must be improved by using other algorithms (normalized versions of raw mixed 2-step RSV, SVM, and neural and Bayesian networks...)
  ◦ The global PRF idea needs further investigation