IIIT Hyderabad’s CLIR experiments for FIRE-2008
Sethuramalingam S & Vasudeva Varma
IIIT Hyderabad, India
Outline
– Introduction
– Related Work in Indian Language IR
– Our CLIR experiments
– Evaluation & Analysis
– Future Work
Introduction
Cross-language information retrieval (CLIR) is a subfield of information retrieval dealing with retrieving information written in a language different from the language of the user’s query (courtesy: Wikipedia)
Information – text, audio, video, speech, geographical information, etc.
CLIR – Indian languages (IL) scenario
Goal: to retrieve documents written in any IL when the user queries in one language
(Languages shown: தமிழ் Tamil, हिन्दी Hindi, తెలుగు Telugu, বাংলা Bengali, मराठी Marathi)
Modified from source: D. Oard’s Cross-Language IR presentation
Why CLIR for IL?
– Internet user growth in India between 2000 and 2007: 1,100.0%
– Growth in Indian-language content on the web between 2000 and 2007: 700%
So, CLIR for IL becomes mandatory!
RELATED WORK IN INDIAN LANGUAGE IR
Related Work in ILIR
ACM TALIP, the surprise language exercises
– Task was to build a CLIR system for English to Hindi and Cebuano
“The surprise language exercises”, Douglas W. Oard. ACM Transactions on Asian Language Information Processing (TALIP), 2(2):79–84, 2003
Related Work in ILIR
CLEF ad-hoc bi-lingual track including two Indian languages, Hindi and Telugu
– Our team from IIIT-H participated in the Hindi and Telugu to English CLIR task
“Hindi and Telugu to English Cross Language Information Retrieval”, Prasad Pingali and Vasudeva Varma. CLEF
Related Work in ILIR
CLEF Indian language subtask consisting of Hindi, Bengali, Marathi, Telugu and Tamil
– Five teams including ours participated
– Hindi and Telugu to English CLIR
“IIIT Hyderabad at CLEF Adhoc Indian Language CLIR task”, Prasad Pingali and Vasudeva Varma. CLEF
Related Work in ILIR
Google’s CLIR system for 34 languages including Hindi
OUR CLIR EXPERIMENTS
Our CLIR experiments
– Ad-hoc cross-lingual: Hindi to English, and English to Hindi
– Ad-hoc monolingual runs in Hindi and English
12 runs in total were submitted for the above 4 tasks
Problem statement
The CLIR system should take a set of 50 topics in the source language and return the top 1000 documents for each topic in the target language
Sample Hindi topic:
ईरान का परमाणु कार्यक्रम
ईरान का कार्यक्रम और उसकी परमाणु नीति के बारे में विश्व की राय। ईरान की परमाणु नीति और ऐसे कार्यक्रम के विरुद्ध ईरान पर यूएसए का निरंतर दबाव और धमकी के बारे में सूचना संबंधित प्रलेख में रहनी चाहिए। परमाणु नीति के समझौते के लिए ईरान और यूरोपीय संघ के बीच वार्ता और विश्व दृष्टि भी रुचिकर होगी
(Translation: Iran’s nuclear programme. World opinion about Iran’s programme and its nuclear policy. Relevant documents should contain information about Iran’s nuclear policy and the continuous pressure and threats from the USA against Iran over such a programme. Negotiations between Iran and the European Union for a nuclear-policy agreement, and the world view, would also be of interest.)
CLIR System architecture
Query Processing module
– Named Entities identification
– Query translation using lexicons
– Transliteration
– Query Scoring
Indexing module
– Stop-word remover
– A typical Indexer using Lucene
Named Entities identification
Used for identifying the named entities present in the queries for transliteration
We used:
– Our CRF-based NER system (as a binary classifier) for Hindi queries
– The Stanford English NER system for English queries
Identifies Person, Organization and Location names
“Experiments in Telugu NER: A Conditional Random Field Approach”, Praneeth M Shishtla, Prasad Pingali, Vasudeva Varma. NERSSEAL-08, IJCNLP-08, Hyderabad, 2008.
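The role of NER here is simply to route query tokens: tokens flagged as named entities go to transliteration, the rest to dictionary-based translation. A minimal sketch of that routing, where `is_named_entity` stands in for the CRF-based (Hindi) or Stanford (English) tagger used as a binary classifier:

```python
def split_query(tokens, is_named_entity):
    """Partition query tokens into named entities (for transliteration)
    and common words (for lexicon-based translation).
    `is_named_entity` is a stand-in for the actual NER tagger."""
    nes, common = [], []
    for tok in tokens:
        (nes if is_named_entity(tok) else common).append(tok)
    return nes, common
```

For illustration, a toy classifier such as `lambda t: t[0].isupper()` would send “Iran” to transliteration and “nuclear” to the lexicons.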
CLIR System architecture
Query Processing module
– Named Entities identification
– Query translation using lexicons
– Transliteration (mapping-based)
– Query Scoring
Indexing module
– Stop-word remover
– A typical Indexer using Lucene
Query translation
Using bi-lingual lexicons:
– “Shabdanjali”, an English-Hindi dictionary containing 26,633 entries
– IIT Bombay Hindi WordNet
– A manually collected Hindi-English dictionary with 6,685 entries
Transliteration
Mapping-based approach. For a given named entity in the source language:
– Derive its Compressed Word Format (CWF), e.g. academia → cdm, abullah → bll
– Generate the list of named entities and their CWFs on the target-language side
– Search for and map the CWF of the source-language NE to the CWF of the right target-language equivalent within the minimum modified edit distance
Transliteration
Implementation:
– Named entities present in the Hindi and English corpora are identified and listed
– Their CWFs are generated using a set of heuristic, rewrite and remove rules
– The CWFs are added to the list of NEs
“Named Entity Transliteration for Cross-Language Information Retrieval using Compressed Word Format Mapping algorithm”, Srinivasan C Janarthanam, Sethuramalingam S, Udhyakumar Nallasamy. iNEWS-08, CIKM 2008.
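The two steps above can be sketched as follows. This is a simplified reading, not the paper’s algorithm: the actual heuristic rewrite/remove rules are richer than the single vowel-and-‘h’-dropping rule assumed here, and a plain Levenshtein distance stands in for the paper’s *modified* edit distance.

```python
import re

def cwf(word):
    """Compressed Word Format, simplified: drop vowels and 'h'.
    Consistent with the slide's examples: academia -> cdm, abullah -> bll."""
    return re.sub(r"[aeiouh]", "", word.lower())

def edit_distance(a, b):
    """Plain Levenshtein distance (stand-in for the modified variant)."""
    m, n = len(a), len(b)
    d = [[i + j if i * j == 0 else 0 for j in range(n + 1)] for i in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))  # substitution
    return d[m][n]

def map_ne(source_cwf, target_nes):
    """Map a source-language NE's CWF to the target-side NE whose CWF
    lies at the minimum edit distance."""
    return min(target_nes, key=lambda ne: edit_distance(source_cwf, cwf(ne)))
```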
Query Scoring
We generate a Boolean OR query with scored query words
Query scoring is based on:
– The position of occurrence of the word in the topic
– The number of occurrences of the word
– Numbers and years are given greater weights
CLIR System architecture
Query Processing module
– Named Entities identification
– Query translation using lexicons
– Transliteration (mapping-based)
– Query Scoring
Indexing & Ranking module
– Stop-word remover
– A typical Indexer using Lucene
Indexing module
– For the English corpus, stop words are removed and words are stemmed using Lucene
– For the Hindi corpus, a stop-word list of 246 words is generated from the given corpus based on frequency
– Documents are indexed using the Lucene Indexer and ranked using the BM25 algorithm in Lucene
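The ranking function named above can be written out as a small sketch. The constants k1 = 1.2 and b = 0.75 are the usual textbook defaults, assumed here, not necessarily the values used in the experiments:

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each document (a list of tokens) against the query with BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N          # average document length
    df = Counter()                                  # document frequencies
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for q in query_terms:
            if q not in tf:
                continue
            idf = math.log((N - df[q] + 0.5) / (df[q] + 0.5) + 1)
            s += idf * tf[q] * (k1 + 1) / (tf[q] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores
```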
EVALUATION & ANALYSIS
Evaluation – English-Hindi cross-lingual run
Runs: Title + Desc, Title + Narr, Title + Desc + Narr
Metrics: MAP, GMAP, R-Prec, Bpref
Evaluation – Hindi-English cross-lingual run
Runs: Title + Desc, Title + Narr, Title + Desc + Narr
Metrics: MAP, GMAP, R-Prec, Bpref
Evaluation – Hindi-Hindi monolingual run
Runs: Title + Desc, Title + Narr, Title + Desc + Narr
Metrics: MAP, GMAP, R-Prec, Bpref
Evaluation – English-English monolingual run
Runs: Title + Desc, Title + Narr, Title + Desc + Narr
Metrics: MAP, GMAP, R-Prec, Bpref
English-Hindi vs. Hindi-Hindi
Hindi-English vs. English-English
Evaluation Summary
– Our English-Hindi CLIR performance was 58% of the monolingual run
– Our Hindi-English CLIR performance was 25% of the monolingual run
– Our Hindi-Hindi monolingual run retrieved 52% of the total relevant documents
– Our English-English monolingual run retrieved 91% of the total relevant documents
Analysis
Our English-Hindi CLIR performance can be attributed to factors like:
– Exact matching of English named entities
– Good coverage of English words in our lexicons
The relatively lower performance on Hindi-English CLIR is due to:
– Low dictionary coverage
– Query formulation that was not complex enough
FUTURE WORK
Future Work
– Error analysis on a per-topic basis
– Work on more complex query formulations
– Work on other possible query translation techniques, like:
  – Building dictionaries from parallel corpora
  – Using the web
  – Using Wikipedia
THANK YOU!!!