
1 IIIT Hyderabad’s CLIR experiments for FIRE-2008
Sethuramalingam S & Vasudeva Varma, IIIT Hyderabad, India

2 Outline
– Introduction
– Related Work in Indian Language IR
– Our CLIR experiments
– Evaluation & Analysis
– Future Work

3 Introduction
Cross-language information retrieval (CLIR) is a subfield of information retrieval dealing with retrieving information written in a language different from the language of the user's query (courtesy: Wikipedia).
Information: text, audio, video, speech, geographical information, etc.

4 CLIR – Indian languages (IL) scenario
To retrieve documents written in any IL when the user queries in one language.
[Diagram: a single query fanning out to documents in Tamil (தமிழ்), Hindi (हिन्दी), Telugu (తెలుగు), Bengali (বাংলা) and Marathi (मराठी). Modified from D. Oard's Cross-Language IR presentation.]

5 Why CLIR for IL?

6 Why CLIR for IL?

7 Internet user growth in India between 2000 and 2008: 1,100% (source: www.internetworldstats.com)
Growth in Indian-language content on the web between 2000 and 2007: 700%
So CLIR for IL becomes essential!

8 RELATED WORK IN INDIAN LANGUAGE IR

9 Related Work in ILIR
– ACM TALIP, 2003: the surprise language exercises. The task was to build CLIR systems for English to Hindi and English to Cebuano.
Douglas W. Oard. "The surprise language exercises". ACM Transactions on Asian Language Information Processing (TALIP), 2(2):79–84, 2003.

10 Related Work in ILIR
– CLEF 2006: ad-hoc bilingual track including two Indian languages, Hindi and Telugu. Our team from IIIT-H participated in the Hindi and Telugu to English CLIR task.
Prasad Pingali and Vasudeva Varma. "Hindi and Telugu to English Cross Language Information Retrieval". CLEF 2006.

11 Related Work in ILIR
– CLEF 2007: Indian language subtask covering Hindi, Bengali, Marathi, Telugu and Tamil. Five teams, including ours, participated; we ran Hindi and Telugu to English CLIR.
Prasad Pingali and Vasudeva Varma. "IIIT Hyderabad at CLEF 2007 – Adhoc Indian Language CLIR task". CLEF 2007.

12 Related Work in ILIR
– Google's CLIR system, covering 34 languages including Hindi

13 OUR CLIR EXPERIMENTS

14 Our CLIR experiments
– Ad-hoc cross-lingual runs: Hindi to English and English to Hindi
– Ad-hoc monolingual runs: Hindi and English
– 12 runs in total were submitted across these four tasks

15 Problem statement
The CLIR system should take a set of 50 topics in the source language and return the top 1,000 documents for each topic in the target language.
Sample Hindi topic (no. 28), translated:
– Title: ईरान का परमाणु कार्यक्रम (Iran's nuclear programme)
– Description: World opinion about Iran's programme and its nuclear policy.
– Narrative: Relevant documents should contain information about Iran's nuclear policy and the USA's continual pressure and threats against Iran over such a programme. Talks between Iran and the European Union toward a nuclear-policy agreement, and the world's view of them, are also of interest.

16 CLIR System architecture
Query Processing module:
– Named Entities identification
– Query translation using lexicons
– Transliteration (mapping-based)
– Query Scoring
Indexing & Ranking module:
– Stop-word removal
– A typical indexer using Lucene

17 CLIR System architecture (recap; next: Named Entities identification)

18 Named entities identification
– Used to identify the named entities present in the queries, which are then transliterated rather than translated
– We used our CRF-based NER system (as a binary classifier) for Hindi queries, and the Stanford English NER system for English queries
– Identifies person, organization and location names
Praneeth M Shishtla, Prasad Pingali and Vasudeva Varma. "Experiments in Telugu NER: A Conditional Random Field Approach". NERSSEAL-08 workshop, IJCNLP-08, Hyderabad, 2008.
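As a rough illustration of the role NER plays here, the sketch below routes query terms tagged as named entities to the transliteration module and everything else to lexicon translation. The tagger is passed in as a callable, since the actual taggers (our CRF-based NER, Stanford NER) are external systems; only the routing logic reflects the slide.

```python
# Hedged sketch: `tag_named_entities` stands in for the CRF-based Hindi NER
# or the Stanford English NER, used here as a binary per-term classifier.

def route_query_terms(terms, tag_named_entities):
    """Split query terms into those to transliterate (NEs) and those to
    translate via the bilingual lexicons (everything else)."""
    flags = tag_named_entities(terms)  # one True/False per term
    nes = [t for t, is_ne in zip(terms, flags) if is_ne]
    rest = [t for t, is_ne in zip(terms, flags) if not is_ne]
    return nes, rest

# Example with a toy tagger that flags capitalized words:
toy_tagger = lambda ts: [t[:1].isupper() for t in ts]
print(route_query_terms(["Iran", "nuclear", "programme"], toy_tagger))
# -> (['Iran'], ['nuclear', 'programme'])
```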

19 CLIR System architecture (recap; next: query translation)

20 Query translation
Using bilingual lexicons:
– "Shabdanjali", an English-Hindi dictionary containing 26,633 entries
– The IIT Bombay Hindi WordNet
– A manually collected Hindi-English dictionary with 6,685 entries
Shabdanjali: http://www.shabdkosh.com/shabdanjali
Hindi WordNet: http://www.cfilt.iitb.ac.in/wordnet/webhwn/
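A minimal sketch of the lookup, assuming each lexicon is loaded as a plain word-to-translations mapping (the variable names and data layout are illustrative, not the system's actual structures):

```python
def translate_word(word, lexicons):
    """Collect every target-language candidate for `word` across the
    available bilingual dictionaries, e.g. Shabdanjali, Hindi WordNet
    and the manually collected lexicon."""
    candidates = []
    for lexicon in lexicons:              # each lexicon: dict[str, list[str]]
        candidates.extend(lexicon.get(word, []))
    # De-duplicate across dictionaries while preserving order.
    return list(dict.fromkeys(candidates))

def translate_query(words, lexicons):
    return {w: translate_word(w, lexicons) for w in words}
```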

21 CLIR System architecture (recap; next: transliteration)

22 Transliteration
Mapping-based approach. For a given named entity in the source language:
– Derive its Compressed Word Format (CWF), e.g. academia → cdm, abullah → bll
– Generate the list of named entities and their CWFs on the target-language side
– Search for and map the CWF of the source-language NE to the CWF of the right target-language equivalent within the minimum modified edit distance

23 Transliteration: implementation
– Named entities present in the Hindi and English corpora are identified and listed
– Their CWFs are generated using a set of heuristic, rewrite and remove rules
– The CWFs are added to the list of NEs
Srinivasan C Janarthanam, Sethuramalingam S and Udhyakumar Nallasamy. "Named Entity Transliteration for Cross-Language Information Retrieval using Compressed Word Format Mapping algorithm". iNEWS-08, CIKM 2008.
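A sketch of the CWF mapping under simplified assumptions: the full set of heuristic, rewrite and remove rules is given in the iNEWS-08 paper, so `cwf` below keeps only two rules (drop vowels, drop 'h'), which happen to reproduce the slide's examples, and plain Levenshtein distance stands in for the modified edit distance. Source-side NEs are assumed to be already romanized.

```python
import re

def cwf(word):
    """Compressed Word Format, reduced to two sample rules."""
    w = re.sub(r"[aeiou]", "", word.lower())  # remove vowels
    return w.replace("h", "")                 # a sample "remove" rule

def edit_distance(a, b):
    """Plain Levenshtein distance (the paper uses a modified variant)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def map_named_entity(source_ne, target_nes):
    """Pick the target-side NE whose CWF is closest to the source NE's CWF."""
    src = cwf(source_ne)
    return min(target_nes, key=lambda t: edit_distance(src, cwf(t)))

print(cwf("academia"), cwf("abullah"))  # -> cdm bll
```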

24 CLIR System architecture (recap; next: query scoring)

25 Query scoring
We generate a Boolean OR query with scored query words. Query scoring is based on:
– the position of occurrence of the word in the topic
– the number of occurrences of the word
– numbers and years are given greater weights
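The slide names the scoring signals but not the weights, so the constants in the sketch below are invented for illustration; only the three signals and the final Boolean OR form (rendered here in Lucene's term^boost syntax) come from the slide.

```python
import re
from collections import Counter

def score_terms(topic_words):
    """Score each distinct word using the three signals from the slide.
    The decay and boost constants are assumptions, not the paper's values."""
    counts = Counter(topic_words)
    scores = {}
    for pos, word in enumerate(topic_words):
        if word in scores:
            continue
        score = 1.0 / (1 + pos)             # earlier position -> higher weight
        score += 0.1 * counts[word]         # more occurrences -> higher weight
        if re.fullmatch(r"\d{2,4}", word):  # numbers and years boosted
            score *= 2.0
        scores[word] = round(score, 3)
    return scores

def to_boolean_or_query(scores):
    # Lucene's "term^boost" syntax, joined with OR.
    return " OR ".join(f"{word}^{score}" for word, score in scores.items())

print(to_boolean_or_query(score_terms(["iran", "nuclear", "2005", "iran"])))
# -> iran^1.2 OR nuclear^0.6 OR 2005^0.867
```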

26 CLIR System architecture (recap; next: indexing & ranking)

27 Indexing module
– For the English corpus, stop words are removed and words are stemmed using Lucene
– For the Hindi corpus, a stop-word list of 246 words is generated from the given corpus based on frequency
– Documents are indexed using the Lucene indexer and ranked using the BM25 algorithm in Lucene
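A minimal sketch of building the frequency-based Hindi stop-word list: the cutoff of 246 comes from the slide, while the naive whitespace tokenization is an assumption (the indexing and BM25 ranking themselves were done inside Lucene).

```python
from collections import Counter

def build_stopword_list(documents, n=246):
    """Take the n most frequent word forms in the corpus as stop words."""
    counts = Counter()
    for doc in documents:
        counts.update(doc.split())  # naive whitespace tokenization (assumed)
    return {word for word, _ in counts.most_common(n)}

def remove_stopwords(tokens, stopwords):
    return [t for t in tokens if t not in stopwords]
```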

28 EVALUATION & ANALYSIS

29 Evaluation: English-Hindi cross-lingual run

Run                  MAP     GMAP    R-Prec  Bpref
Title + Desc         0.1538  0.0093  0.1687  0.1905
Title + Narr         0.1516  0.0229  0.1871  0.1918
Title + Desc + Narr  0.1432  0.0215  0.1793  0.1886

30 Evaluation: Hindi-English cross-lingual run

Run                  MAP     GMAP    R-Prec  Bpref
Title + Desc         0.0907  0.0197  0.1291  0.1408
Title + Narr         0.1204  0.0366  0.1718  0.1734
Title + Desc + Narr  0.1112  0.0287  0.1541  0.1723

31 Evaluation: Hindi-Hindi monolingual run

Run                  MAP     GMAP    R-Prec  Bpref
Title + Desc         0.2579  0.0427  0.2797  0.2964
Title + Narr         0.2652  0.0534  0.2845  0.3023
Title + Desc + Narr  0.2472  0.0525  0.2558  0.2773

32 Evaluation: English-English monolingual run

Run                  MAP     GMAP    R-Prec  Bpref
Title + Desc         0.4416  0.3437  0.4579  0.4889
Title + Narr         0.4863  0.3989  0.4894  0.5218
Title + Desc + Narr  0.4690  0.3841  0.4707  0.5167

33 English-Hindi vs Hindi-Hindi [comparison chart]

34 Hindi-English vs English-English [comparison chart]

35 Evaluation summary
– Our English-Hindi CLIR performance was 58% of the corresponding monolingual run
– Our Hindi-English CLIR performance was 25% of the corresponding monolingual run
– Our Hindi-Hindi monolingual run retrieved 52% of the total relevant documents
– Our English-English monolingual run retrieved 91% of the total relevant documents

36 Analysis
Our English-Hindi CLIR performance can be attributed to factors such as:
– exact matching of English named entities
– good coverage of English words in our lexicons
The relatively lower performance on Hindi-English CLIR is due to:
– low dictionary coverage
– query formulation that was not complex enough

37 FUTURE WORK

38 Future Work
– Error analysis on a per-topic basis
– Work on more complex query formulations
– Work on other possible query translation techniques, such as:
  – building dictionaries from parallel corpora
  – using the web
  – using Wikipedia

39 THANK YOU!!!

