1
AU-KBC FIRE 2008 Submission – Cross-Lingual Information Retrieval Track: Tamil–English Pattabhi R. K. Rao and Sobha L. AU-KBC Research Centre, MIT Campus, Chennai
2
FIRE 2008 – Tamil–English CLIR: Problem Definition –The ad-hoc cross-lingual document retrieval task of FIRE –The task is to retrieve relevant documents in English for a given Indian-language query –We worked on a Tamil–English cross-lingual information retrieval system
3
Our Approach The main components in our CLIR system are –Query Language Analyser –Named Entity Recognizer –Query Translation engine –Query Expansion –Ranking
4
Query Language Analyser – Tamil Morphological Analyser The morphological analyser analyses each word to give the morphs of the word E.g.: patiwwAnY -> pati (V) + ww (Past) + AnY (3SM) For nouns, the inflections mark case, such as dative and accusative For verbs, the inflections carry person, number, gender, tense, aspect, and modal information Uses a paradigm-based approach; implemented as a finite state machine (see the sketch below)
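As a loose illustration of the paradigm-based approach, here is a minimal Python sketch. The romanized suffix table is a toy stand-in invented for this example; the actual analyser is a finite state machine over Tamil orthography with full paradigm tables.

```python
# Toy paradigm-based suffix analysis; suffixes are romanized stand-ins,
# not the real FSM or paradigm tables used by the authors' analyser.

# (suffix, features) pairs for one toy verb paradigm, longest suffix first
VERB_SUFFIXES = [
    ("wwAnY", ["ww (Past)", "AnY (3SM)"]),  # patiwwAnY -> pati + ww + AnY
    ("wwALY", ["ww (Past)", "ALY (3SF)"]),
]

def analyse(word, suffixes=VERB_SUFFIXES):
    """Strip the first matching suffix and return (root, morph features)."""
    for suffix, features in suffixes:
        if word.endswith(suffix):
            return word[: -len(suffix)], features
    return word, []

print(analyse("patiwwAnY"))  # ('pati', ['ww (Past)', 'AnY (3SM)'])
```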
5
Named Entity Recognizer (NER) The generic engine uses Conditional Random Fields (CRFs) Trained on a 100,000-word corpus from various domains Uses a hierarchical tagset Performs with 80% recall and 89% precision
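A minimal sketch of CRF-based NER training, using the sklearn-crfsuite library as a stand-in toolkit (the slides do not name the CRF implementation); the features, sentence, and BIO tags below are illustrative, not the authors' hierarchical tagset.

```python
# CRF NER sketch with sklearn-crfsuite standing in for the actual toolkit;
# feature template and tags are invented for illustration.
import sklearn_crfsuite

def token_features(sent, i):
    word = sent[i]
    return {
        "word": word,
        "is_title": word.istitle(),
        "prev": sent[i - 1] if i > 0 else "<BOS>",
        "next": sent[i + 1] if i < len(sent) - 1 else "<EOS>",
    }

# One toy training sentence with BIO tags
sents = [["Sobha", "works", "in", "Chennai"]]
labels = [["B-PERSON", "O", "O", "B-LOCATION"]]

X = [[token_features(s, i) for i in range(len(s))] for s in sents]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, labels)
print(crf.predict(X))
```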
6
Query Translation Uses a bilingual-dictionary-based approach; the Tamil–English bilingual dictionary is 150K in size For named entities, which require transliteration, a transliteration engine is used Tamil-to-English transliteration is a hard task –Tamil has few consonants, so one Tamil character can correspond to several English letters Transliteration is done using a statistical system based on an n-gram approach; it works with an accuracy of 81%
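A minimal sketch of dictionary lookup with a transliteration fallback for named entities. Both tables below are toy stand-ins invented for this example; the real system uses the 150K bilingual dictionary and a statistical n-gram transliterator.

```python
# Dictionary-based translation with a naive transliteration fallback;
# all entries here are illustrative, not from the authors' resources.
BILINGUAL_DICT = {"utavi": "assistance"}   # toy romanized Tamil -> English
CHAR_MAP = {"c": "ch"}                     # toy character mapping

def transliterate(term):
    """Character-level fallback; the real engine scores n-gram mappings."""
    return "".join(CHAR_MAP.get(ch, ch) for ch in term)

def translate(terms, named_entities):
    out = []
    for term in terms:
        if term in named_entities:                       # NEs bypass the dictionary
            out.append(transliterate(term))
        else:
            out.append(BILINGUAL_DICT.get(term, term))   # dictionary lookup
    return out

print(translate(["utavi", "cennai"], named_entities={"cennai"}))
# ['assistance', 'chennai']
```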
7
Query Expansion The query terms are expanded using –a thesaurus –an ontology Query expansion is done at two places –before query translation –after query translation Synonyms are obtained using WordNet (see the sketch below)
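One way to realize the WordNet step is through NLTK's WordNet interface, sketched below; whether the authors used NLTK is an assumption, and `nltk.download("wordnet")` must be run once beforehand.

```python
# Synonym lookup via NLTK's WordNet corpus reader (an assumed toolkit,
# not necessarily the one used in the paper).
from nltk.corpus import wordnet

def synonyms(term):
    """Collect lemma names from all synsets of the term."""
    names = set()
    for synset in wordnet.synsets(term):
        for lemma in synset.lemmas():
            names.add(lemma.name().replace("_", " "))
    names.discard(term)
    return sorted(names)

print(synonyms("assistance"))  # e.g. ['aid', 'help', ...]
```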
8
Query Expansion (2) The ontology is used to obtain more world knowledge, e.g.: Festivals –Hindu: Holi, Diwali, Dussera –Muslim: Ramazan –Christian: Christmas
9
What Is in the Ontology Descriptions of the entity –Ex: Holi – festival of colours, good over evil –Deepavali – festival of lights, crackers, etc. We have an ontology of this type for 100 entities –Festivals, Sports, Countries, Natural Calamities, Person Names, etc. (a sketch of such a structure follows below)
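A sketch of the ontology as a nested lookup used for expansion. The entries come from the slides, but the data layout and the `expand_with_ontology` helper are assumptions for illustration.

```python
# Ontology-backed expansion sketch; structure and helper are hypothetical,
# entity descriptions are taken from the slides above.
ONTOLOGY = {
    "Holi": {
        "category": ["Festivals", "Hindu"],
        "description": ["festival of colours", "good over evil"],
    },
    "Deepavali": {
        "category": ["Festivals", "Hindu"],
        "description": ["festival of lights", "crackers"],
    },
}

def expand_with_ontology(term):
    """Append category labels and description phrases to the query term."""
    entry = ONTOLOGY.get(term)
    if entry is None:
        return [term]
    return [term] + entry["category"] + entry["description"]

print(expand_with_ontology("Holi"))
```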
10
Ranking The standard Okapi BM25 ranking algorithm is used, with customization to suit our needs A parameter called the boost factor is introduced into the standard score calculation The NEs in the query are given a boost factor of 1.5 and the original query terms a boost factor of 1.25
11
Ranking (2) The boost factor parameter sets the weightage for particular terms in the query NEs get more weightage than other terms: they are given 0.5 times more weightage (boost 1.5) The original query terms are given 0.25 times more weightage (boost 1.25) to retain the importance of the user-given query terms (see the sketch below)
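A minimal sketch of Okapi BM25 with the per-term boost described above: named entities boosted 1.5x, original query terms 1.25x, other (expansion) terms 1.0x. The BM25 constants k1 and b are assumed defaults; the slides do not give them.

```python
# Boosted BM25 sketch; k1/b values and the exact boost placement are
# assumptions, the 1.5/1.25 boosts are from the slides.
import math

def bm25_boosted(query, doc, docs, k1=1.2, b=0.75,
                 named_entities=(), original_terms=()):
    avgdl = sum(len(d) for d in docs) / len(docs)
    score = 0.0
    for term in query:
        tf = doc.count(term)
        if tf == 0:
            continue
        df = sum(1 for d in docs if term in d)
        idf = math.log((len(docs) - df + 0.5) / (df + 0.5) + 1)
        tf_part = tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
        boost = 1.5 if term in named_entities else (
                1.25 if term in original_terms else 1.0)
        score += boost * idf * tf_part
    return score

docs = [["tsunami", "relief", "fund"], ["iraq", "war"]]
print(bm25_boosted(["tsunami", "relief"], docs[0], docs,
                   named_entities={"tsunami"}, original_terms={"relief"}))
```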
12
Experiments – Results (1) We submitted two runs For query 29, "assistance after Tsunami", expanding the query on the terms "assistance" and "Tsunami" yields "financial assistance, relief material, manpower help, rebuilding infrastructure, government assistance, non-governmental organizations assistance, relief fund, natural calamity, Tsunami, high sea waves" This expansion of the query helped increase recall; the MAP score for this query is 0.46 For query IDs 27 and 59, the system did not perform well
13
Experiments – Results (2) The query 27 “Sino Indian relationship” is too broad and the query expansion is not done well, due to lack of knowledge in the ontology, here what all constitute relationship needs to be defined The query 59, “Ameican citizens fight against Iraq war”, is too specific and the document collection has more number of documents on Iraq war, rather than on the particular document. The terms “Iraq War” get more weight than the terms “fight against”
14
Experiments – Results (3) Overall results of the Tamil–English cross-lingual information retrieval system:

MAP     R-prec  P@5     P@10    P@20    Recall
0.4821  0.4862  0.7280  0.6960  0.6360  0.8912
17
Conclusion A query language analyser is used The two runs differ in their MAP scores: 0.3921 vs. 0.4821 The use of the query expansion module helps increase recall The results obtained are encouraging –MAP – 0.4821 –P@10 – 0.6960 –Recall – 0.8912
18
References Mohammad Afraz and Sobha L. (2008). "English to Dravidian Language Machine Transliteration: A Statistical Approach Based on N-grams". In Proceedings of the International Seminar on Malayalam and Globalization, Thiruvananthapuram, India. Genesereth, M. R. and Nilsson, N. (1987). Logical Foundations of Artificial Intelligence. Morgan Kaufmann Publishers, San Mateo, CA. Vijayakrishna R. and Sobha L. (2008). "Domain Focused Named Entity Recognizer for Tamil Using Conditional Random Fields". In Proceedings of the IJCNLP Workshop on NER for South and South East Asian Languages, Hyderabad, India, pp. 59–66. S. Viswanathan, S. Ramesh Kumar, B. Kumara Shanmugam, S. Arulmozi and K. Vijay Shanker (2003). "A Tamil Morphological Analyser". In Proceedings of the International Conference on Natural Language Processing (ICON-2003), Mysore.
19
Thank you!