MIRACLE: Multilingual Information RetrievAl for the CLEF campaign
DAEDALUS – Data, Decisions and Language, S.A. (www.daedalus.es)
Universidad Carlos III de Madrid (UC3M)
Universidad Politécnica de Madrid (UPM)
Partially funded by the IST (OmniPaper) and CAM 07T/0055/2003 projects

MIRACLE – ImageCLEF: Participation in ImageCLEF 2003
Monolingual task:
- English -> English: 5 different runs
Bilingual tasks:
- Spanish -> English: 6 runs
- German -> English: 6 runs
- French -> English: 4 runs
- Italian -> English: 4 runs
TOTAL: 25 runs

MIRACLE – ImageCLEF: System Architecture
- IR engine: Xapian (based on the probabilistic IR model)
- Filtering components: text and word extraction, topic extraction, word counts, statistics calculation
- Linguistic components: tokenizers, stemmers (based on the Porter algorithm), a German word decompounding module, stopword filters
- Translation components: an API to FreeTranslation.com (full text) and the ERGANE dictionary (word by word)
- Semantic components: synonym expansion for English via WordNet (see the sketch below)
Our idea is to couple these components in different ways, so as to evaluate different approaches and compare the influence of each component on the precision/recall of the IR process for each language.
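The synonym-expansion component could look roughly like the following. This is a minimal sketch assuming NLTK's WordNet interface; the slides only say "WordNet", not which API or expansion policy was actually used.

```python
# Minimal sketch of English synonym expansion over WordNet, assuming the
# NLTK interface (requires: pip install nltk; nltk.download("wordnet")).
from nltk.corpus import wordnet as wn

def expand_with_synonyms(terms):
    """Return the original query terms plus their WordNet synonyms."""
    expanded = set(terms)
    for term in terms:
        for synset in wn.synsets(term):
            for lemma in synset.lemma_names():
                expanded.add(lemma.replace("_", " ").lower())
    return sorted(expanded)

print(expand_with_synonyms(["sunset", "coast"]))
```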

MIRACLE – ImageCLEF: IR Process: Index
- All the images are indexed in the same Xapian collection
- For each image, the HEADLINE and TEXT fields are used (without tags and IDs); a sketch follows
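As a rough illustration, indexing one image caption with the Xapian Python bindings might look like this; the database path, the stemmer choice, and the example identifier are assumptions, not taken from the slides.

```python
# Sketch of indexing one image caption into a Xapian collection via the
# Python bindings; the path and the example ID are illustrative only.
import xapian

db = xapian.WritableDatabase("imageclef.db", xapian.DB_CREATE_OR_OPEN)
termgen = xapian.TermGenerator()
termgen.set_stemmer(xapian.Stem("english"))  # Porter-family stemmer

def index_image(image_id, headline, text):
    """Index the HEADLINE and TEXT fields of one image record."""
    doc = xapian.Document()
    doc.set_data(image_id)   # keep the image identifier as document data
    termgen.set_document(doc)
    termgen.index_text(headline)
    termgen.index_text(text)
    db.add_document(doc)

index_image("img_0001", "Sunset over the bay", "A view of the bay at dusk.")
```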

MIRACLE – ImageCLEF: IR Process: Retrieval
Different runs, basically consisting of:
1. Create the query from the topic
2. Execute the query in the Xapian system
3. Retrieve the 1000 best results as a ranked list (see the sketch below)
For each topic, only the TITLE field and the first translation variant are used.
Evaluation: four relevance sets (two judges)
- Union (any assessor) / intersection (both assessors)
- Strict (relevant only) / relaxed (also partially relevant)
- In our evaluation we have considered intersection-strict, which is the most restrictive
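A single retrieval step can be sketched as follows, again with the Xapian Python bindings; the query-parser configuration is an assumption.

```python
# Sketch of one retrieval run: parse the topic TITLE, query the collection,
# and keep the 1000 best results (ranked list).
import xapian

db = xapian.Database("imageclef.db")
parser = xapian.QueryParser()
parser.set_stemmer(xapian.Stem("english"))
parser.set_stemming_strategy(xapian.QueryParser.STEM_SOME)

def retrieve(topic_title, max_hits=1000):
    enquire = xapian.Enquire(db)
    enquire.set_query(parser.parse_query(topic_title))
    mset = enquire.get_mset(0, max_hits)
    return [(match.rank, match.docid, match.weight) for match in mset]

for rank, docid, weight in retrieve("pictures of a sunset")[:5]:
    print(rank, docid, weight)
```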

MIRACLE – ImageCLEF: Monolingual Runs (en->en)
OR:
- Word extraction in topic title → stop word filtering → stemming → weighted OR operator with stems
- Intended as the baseline run
ORlem:
- Word extraction in topic title → stop word filtering → stemming → weighted OR operator with stems and original words
- Idea: measure the effect of stemming
ORlemexp:
- Word extraction in topic title → stop word filtering → synonym expansion → stemming → weighted OR operator with stems, original words, and synonyms
- Idea: measure the effect of increasing recall despite the penalty in precision
Doc:
- Index the topic title as a document → retrieve similar documents
- Idea: confirm that this is an approach similar to the vector space model
ORrf:
- Query with OR operator with stems → top 25 documents → 250 most important terms → new weighted OR query (see the sketch after this list)
- Idea: measure the effect of the simplest blind relevance feedback
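The ORrf blind relevance feedback step could be implemented along these lines with Xapian's RSet/ESet machinery; the slides only give the parameters (25 documents, 250 terms), so treat this as a sketch rather than the project's actual implementation.

```python
# Sketch of ORrf-style blind relevance feedback: treat the top-ranked
# documents as relevant, extract the most informative terms from them,
# and build a new weighted OR query.
import xapian

def blind_feedback_query(db, first_query, fb_docs=25, fb_terms=250):
    enquire = xapian.Enquire(db)
    enquire.set_query(first_query)
    rset = xapian.RSet()
    for match in enquire.get_mset(0, fb_docs):   # "blind": assume relevance
        rset.add_document(match.docid)
    eset = enquire.get_eset(fb_terms, rset)      # best expansion terms
    return xapian.Query(xapian.Query.OP_OR, [item.term for item in eset])
```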

MIRACLE – ImageCLEF: P-R curve (en->en)
1. The best runs have suspiciously high precision values ("the set of relevant documents is not complete")
2. Relevance feedback is the worst ("noise due to inappropriate parameter values for the feedback terms, when the mean length of an image description is about 50 words")
3. Any kind of term expansion reduces precision ("low number of documents, existence of ambiguity")

MIRACLE – ImageCLEF: Average Precision (en->en)
1. The best runs are the weighted OR query and Doc ("in the probabilistic IR model, a weighted OR acts like term weights in the vector space model")
2. Evaluation with the other relevance sets gives a slight increase in overall precision

MIRACLE – ImageCLEF: Bilingual Runs (fr,ge,it,sp->en)
TOR1:
- Topic title → FreeTranslation → word extraction → stop word filtering → stemming → weighted OR operator with stems
- Similar to the monolingual OR run; intended as the baseline run
TOR3:
- Topic title → FreeTranslation + ERGANE → word extraction → stop word filtering → stemming → weighted OR operator with stems
- Idea: improve translation by combining different sources (see the sketch after this list)
Tdoc:
- Topic title → FreeTranslation → index as document → retrieve similar documents
TOR3exp:
- Topic title → FreeTranslation + ERGANE → word extraction → stop word filtering → synonym expansion → stemming → weighted OR operator with stems, original words, and synonyms
TOR3full:
- The same as TOR3, but also including the topic title in the original language
- Idea: evaluate the effect of text that cannot be translated, or is translated incorrectly
TOR3fullexp:
- Combination of TOR3exp and TOR3full
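Combining the two translation sources, as in TOR3, might look like the sketch below. Both translate_full_text() (a stand-in for the FreeTranslation.com call) and the toy ERGANE_ES_EN dictionary are hypothetical placeholders, not the project's actual code or data.

```python
# Sketch of TOR3-style merging of full-text MT output with word-by-word
# dictionary translations. translate_full_text() and ERGANE_ES_EN are
# hypothetical placeholders for the real translation resources.
def translate_full_text(title):
    return title  # stub: the real system called FreeTranslation.com here

ERGANE_ES_EN = {"puesta": ["setting", "sunset"], "sol": ["sun"]}  # toy data

def tor3_terms(title):
    """Merge full-text MT output with word-by-word dictionary lookups."""
    terms = set(translate_full_text(title).lower().split())
    for word in title.lower().split():
        terms.update(ERGANE_ES_EN.get(word, []))
    return sorted(terms)

print(tor3_terms("puesta de sol"))
```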

MIRACLE – ImageCLEF: P-R curve (fr,ge,it,sp->en) [figure-only slide]

MIRACLE – ImageCLEF: P-R curve (fr,ge,it,sp->en)
1. Although all results are similar, TOR1 and Tdoc are the best ones in all cases
2. Using word-by-word translation with ERGANE has proved to be worse: either the translation is not adequate, or the expansion makes the queries wider, thus reducing precision
3. Again, as in the monolingual task, any kind of term expansion reduces precision if ambiguity is not handled
4. Spanish, German, and Italian have similar results, but French is slightly worse: either FreeTranslation is worse for French, or the French topics are harder to translate
5. Spanish -> English gives our best individual results!
6. Comparing bilingual and monolingual results, a difference of about 10-15% arises (similar to our participation in the CLEF tasks this year)

MIRACLE – ImageCLEF: Average Precision (fr,ge,it,sp->en) [figure-only slide]

MIRACLE – ImageCLEF: Conclusions and Future Work
- As newcomers to CLEF, we have worked hard to build the infrastructure needed to easily execute different runs
- The simplest approaches have proved to be the best when the ambiguity introduced by term expansion is not handled
Next time:
- POS filtering for syntactic disambiguation, to handle ambiguity
- Evaluate the effect of using stemming (and its quality) or not, in highly inflectional languages like Spanish/French/Italian
- More focus on Spanish: a better stemmer, better synonym expansion (performed directly in Spanish)
- Evaluate the quality of the translation engines with respect to the IR process