IIIT Hyderabad’s CLIR experiments for FIRE-2008
Sethuramalingam S & Vasudeva Varma
IIIT Hyderabad, India

Outline
– Introduction
– Related Work in Indian Language IR
– Our CLIR experiments
– Evaluation & Analysis
– Future Work

Introduction
– Cross-language information retrieval (CLIR) is a subfield of information retrieval dealing with retrieving information written in a language different from the language of the user's query (courtesy: Wikipedia)
– Information: text, audio, video, speech, geographical information, etc.

CLIR – Indian languages (IL) scenario
தமிழ் हिन्दी తెలుగు বাংলা मराठी
To retrieve documents written in any IL when the user queries in one language
(Modified from source: D. Oard’s Cross-Language IR presentation)

Why CLIR for IL?

Internet user growth in India between 2000 to ,100.0 % Source :
Growth in Indian language content on the web between 2000 and 2007: 700%
So, CLIR for IL becomes mandatory!

RELATED WORK IN INDIAN LANGUAGE IR

Related Work in ILIR
ACM TALIP: The surprise language exercises
– Task was to build a CLIR system for English to Hindi and Cebuano
“The surprise language exercises”, Douglas W. Oard. ACM Transactions on Asian Language Information Processing (TALIP), 2(2):79–84, 2003.

Related Work in ILIR
CLEF: Ad-hoc bilingual track including two Indian languages, Hindi and Telugu
– Our team from IIIT-H participated in the Hindi and Telugu to English CLIR task
“Hindi and Telugu to English Cross Language Information Retrieval”, Prasad Pingali and Vasudeva Varma. CLEF

Related Work in ILIR
CLEF: Indian language subtask consisting of Hindi, Bengali, Marathi, Telugu and Tamil
– Five teams including ours participated
– Hindi and Telugu to English CLIR
“IIIT Hyderabad at CLEF Adhoc Indian Language CLIR task”, Prasad Pingali and Vasudeva Varma. CLEF

Related Work in ILIR
Google’s CLIR system for 34 languages including Hindi

OUR CLIR EXPERIMENTS

Our CLIR experiments
– Ad-hoc cross-lingual runs: Hindi to English, and English to Hindi
– Ad-hoc monolingual runs in Hindi and English
– 12 runs in total were submitted for the above 4 tasks

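Problem statement
The CLIR system should take a set of 50 topics in the source language and return the top 1000 documents for each topic in the target language.
Example topic (Hindi, with English gloss):
Title: ईरान का परमाणु कार्यक्रम (Iran's nuclear programme)
Description: ईरान का कार्यक्रम और उसकी परमाणु नीति के बारे में विश्व की राय। (World opinion on Iran's programme and its nuclear policy.)
Narrative: ईरान की परमाणु नीति और ऐसे कार्यक्रम के विरुद्ध ईरान पर यूएसए का निरंतर दबाव और धमकी के बारे में सूचना संबंधित प्रलेख में रहनी चाहिए। परमाणु नीति के समझौते के लिए ईरान और यूरोपीय संघ के बीच वार्ता और विश्व दृष्टि भी रुचिकर होगी (Relevant documents should contain information on Iran's nuclear policy and on the USA's continued pressure and threats against Iran over such a programme. Talks between Iran and the European Union on a nuclear policy agreement, and world views, will also be of interest.)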

CLIR System architecture
Query Processing module
– Named Entities identification
– Query translation using lexicons
– Transliteration
– Query Scoring
Indexing module
– Stop-word remover
– A typical Indexer using Lucene

Named entities identification
– Used for identifying the named entities present in the queries for transliteration
– We used
  – Our CRF-based NER system (as a binary classifier) for Hindi queries
  – Stanford English NER system for English queries
– Identifies Person, Organization and Location names
“Experiments in Telugu NER: A Conditional Random Field Approach”, Praneeth M Shishtla, Prasad Pingali, Vasudeva Varma. NERSSEAL-08, IJCNLP-08, Hyderabad, 2008.

Query translation
Using bilingual lexicons
– “Shabdanjali”, an English-Hindi dictionary containing 26,633 entries
– IIT Bombay Hindi Wordnet
– Manually collected Hindi-English dictionary with 6,685 entries
Shabdanjali -
Hindi Wordnet -
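
A minimal sketch of lexicon-based query translation, assuming each lexicon has been loaded into a plain Python dict that maps a source word to a list of target-language candidates. The toy entries, the function name `translate_query`, and the decision to set aside out-of-vocabulary words (for example, for transliteration) are illustrative assumptions, not the system described in the slides.

```python
from typing import Dict, List, Tuple

# Hypothetical toy lexicon; real entries would come from Shabdanjali,
# the Hindi Wordnet, or the manually collected dictionary.
TOY_LEXICON: Dict[str, List[str]] = {
    "परमाणु": ["nuclear", "atomic"],
    "कार्यक्रम": ["programme", "program"],
    "नीति": ["policy"],
}


def translate_query(
    words: List[str], lexicons: List[Dict[str, List[str]]]
) -> Tuple[Dict[str, List[str]], List[str]]:
    """Look each query word up in every lexicon, in order.

    Returns (translated, untranslated): a dict of source word ->
    de-duplicated candidate translations, and the list of words
    that no lexicon covers.
    """
    translated: Dict[str, List[str]] = {}
    untranslated: List[str] = []
    for word in words:
        candidates: List[str] = []
        for lexicon in lexicons:
            for candidate in lexicon.get(word, []):
                if candidate not in candidates:
                    candidates.append(candidate)
        if candidates:
            translated[word] = candidates
        else:
            untranslated.append(word)
    return translated, untranslated
```

Keeping every candidate translation, rather than picking one, fits a Boolean OR query; words left in `untranslated` are where dictionary coverage runs out.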

Transliteration
Mapping-based approach
For a given named entity in the source language
– Derive the Compressed Word Format (CWF)
  E.g. academia – cdm
  E.g. abullah – bll
– Generate the list of named entities and their CWFs on the target language side
– Search and map the CWF of the source language NE to the CWF of the right target language equivalent within the minimum modified edit distance

Transliteration
Implementation
– Named entities present in the Hindi and English corpora are identified and listed
– Their CWFs are generated using a set of heuristic, rewrite and remove rules
– CWFs are added to the list of NEs
“Named Entity Transliteration for Cross-Language Information Retrieval using Compressed Word Format Mapping algorithm”, Srinivasan C Janarthanam, Sethuramalingam S, Udhyakumar Nallasamy. iNEWS-08, CIKM
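
The mapping idea can be sketched roughly as follows. The slides do not list the actual heuristic, rewrite and remove rules, so the `cwf` function below uses an assumed toy rule (lowercase, drop vowels and "h") chosen only because it reproduces the two examples above. Plain Levenshtein distance stands in for the "modified edit distance", and the sketch works on romanized strings; a Hindi source word would first need a script-aware step that is not shown.

```python
from typing import List, Optional


def cwf(word: str) -> str:
    """Toy Compressed Word Format: lowercase, keep letters only,
    drop vowels and 'h'. This rule is an assumption, not the paper's."""
    return "".join(
        ch for ch in word.lower() if ch.isalpha() and ch not in "aeiouh"
    )


def edit_distance(a: str, b: str) -> int:
    """Standard Levenshtein distance (stand-in for the modified variant)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def best_match(
    source_cwf: str, target_entities: List[str], max_distance: int = 1
) -> Optional[str]:
    """Return the target named entity whose CWF is closest to source_cwf,
    or None if nothing lies within max_distance."""
    best: Optional[str] = None
    best_dist = max_distance + 1
    for entity in target_entities:
        dist = edit_distance(source_cwf, cwf(entity))
        if dist < best_dist:
            best, best_dist = entity, dist
    return best
```

For instance, `best_match(cwf("Iran"), ["Iran", "Europe", "India"])` picks the entity whose compressed form matches exactly; `max_distance` is a hypothetical threshold that would need tuning on real data.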

Query Scoring
We generate a Boolean OR query with scored query words
Query scoring is based on
– Position of occurrence of the word in the topic
– Number of occurrences of the word
– Numbers and years are given greater weights
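
One way such a scored Boolean OR query could be assembled is sketched below. The slides name the scoring signals but not the formula, so everything numeric here is an assumption: earlier words get a higher position weight, the weight is multiplied by the occurrence count, and purely numeric tokens get a hypothetical `number_bonus`. The output uses the `term^boost` form of the Lucene classic query syntax.

```python
from collections import Counter
from typing import Dict, List


def score_terms(tokens: List[str], number_bonus: float = 2.0) -> Dict[str, float]:
    """Assign each distinct token a weight from its first position,
    its frequency, and whether it is a number (assumed scheme)."""
    counts = Counter(tokens)
    first_pos: Dict[str, int] = {}
    for i, tok in enumerate(tokens):
        first_pos.setdefault(tok, i)
    n = len(tokens)
    scores: Dict[str, float] = {}
    for tok, count in counts.items():
        # Position weight lies in (1, 2]; earlier tokens score higher.
        position_weight = 1.0 + (n - first_pos[tok]) / n
        score = position_weight * count
        if tok.isdigit():  # numbers and years get extra weight
            score *= number_bonus
        scores[tok] = round(score, 2)
    return scores


def to_boolean_or_query(scores: Dict[str, float]) -> str:
    """Join boosted terms with OR, in the order they first appeared."""
    return " OR ".join(f"{term}^{weight}" for term, weight in scores.items())
```

A topic title would typically come first in the token list, so title words naturally receive the largest position weight under this scheme.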

Indexing module
– For the English corpus, stop words are removed and the remaining words are stemmed using Lucene
– For the Hindi corpus, a list of 246 stop words is generated from the given corpus based on frequency
– Documents are indexed using the Lucene Indexer and ranked using the BM-25 algorithm in Lucene
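
The two ideas on this slide, a frequency-based stop-word list and BM25 ranking, can be illustrated with a small self-contained sketch. It is not Lucene code: `build_stopword_list` simply takes the most frequent tokens (the slides do not say how the 246 words were selected beyond "based on frequency"), and `bm25_score` is the textbook BM25 formula with commonly used defaults `k1=1.2` and `b=0.75`, which are assumptions here.

```python
import math
from collections import Counter
from typing import List


def build_stopword_list(tokenized_docs: List[List[str]], size: int) -> List[str]:
    """Return the `size` most frequent tokens across the corpus."""
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    return [tok for tok, _ in counts.most_common(size)]


def bm25_score(
    query_terms: List[str],
    doc: List[str],
    tokenized_docs: List[List[str]],
    k1: float = 1.2,
    b: float = 0.75,
) -> float:
    """Textbook BM25 score of one document for a bag of query terms."""
    n_docs = len(tokenized_docs)
    avg_len = sum(len(d) for d in tokenized_docs) / n_docs
    tf = Counter(doc)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in tokenized_docs if term in d)
        if df == 0 or tf[term] == 0:
            continue  # term contributes nothing to this document
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
        score += idf * tf[term] * (k1 + 1) / norm
    return score
```

In practice the stop-word list would be applied before indexing, so that very frequent function words neither bloat the index nor dominate the OR query.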

EVALUATION & ANALYSIS

Evaluation
English-Hindi cross-lingual run
Columns: Run, MAP, GMAP, R-Prec, Bpref
Runs: Title + Desc; Title + Narr; Title + Desc + Narr

Evaluation
Hindi-English cross-lingual run
Columns: Run, MAP, GMAP, R-Prec, Bpref
Runs: Title + Desc; Title + Narr; Title + Desc + Narr

Evaluation
Hindi-Hindi monolingual run
Columns: Run, MAP, GMAP, R-Prec, Bpref
Runs: Title + Desc; Title + Narr; Title + Desc + Narr

Evaluation
English-English monolingual run
Columns: Run, MAP, GMAP, R-Prec, Bpref
Runs: Title + Desc; Title + Narr; Title + Desc + Narr

English-Hindi vs. Hindi-Hindi

Hindi-English vs. English-English

Evaluation
Summary
– Our English-Hindi CLIR performance was 58% of the monolingual run
– Our Hindi-English CLIR performance was 25% of the monolingual run
– Our Hindi-Hindi monolingual run retrieved 52% of total relevant documents
– Our English-English monolingual run retrieved 91% of total relevant documents
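
As a rough illustration of how such summary figures are usually derived, the sketch below computes mean average precision (MAP) and recall from ranked runs and relevance judgments. The slides do not state which measure the "% of monolingual" comparison uses; treating it as a ratio of MAP values is an assumption, and all function names and data structures here are hypothetical.

```python
from typing import Dict, List, Set


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Average of precision values at each rank where a relevant doc appears."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked, 1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(
    runs: Dict[str, List[str]], qrels: Dict[str, Set[str]]
) -> float:
    """Mean of per-topic average precision over all judged topics."""
    return sum(
        average_precision(runs.get(topic, []), rel) for topic, rel in qrels.items()
    ) / len(qrels)


def recall(runs: Dict[str, List[str]], qrels: Dict[str, Set[str]]) -> float:
    """Fraction of all relevant documents that appear anywhere in the runs."""
    retrieved = sum(len(set(runs.get(t, [])) & rel) for t, rel in qrels.items())
    total = sum(len(rel) for rel in qrels.values())
    return retrieved / total if total else 0.0


def relative_performance(cross_lingual_map: float, monolingual_map: float) -> float:
    """Cross-lingual MAP as a percentage of the monolingual MAP."""
    return 100.0 * cross_lingual_map / monolingual_map
```

Under this reading, the cross-lingual percentages compare two MAP values, while the monolingual percentages are recall over the top 1000 documents per topic.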

Analysis
Our English-Hindi CLIR performance can be attributed to factors like
– Exact matching of English named entities
– Good coverage of English words in our lexicons
A relatively lower performance on Hindi-English CLIR is due to
– Low dictionary coverage
– Query formulation was not complex enough

FUTURE WORK

Future Work
– Error analysis on a per-topic basis
– Work on more complex query formulations
– Work on other possible query translation techniques like
  – Building dictionaries from parallel corpora
  – Using the web
  – Using Wikipedia

THANK YOU!!!