AU-KBC FIRE2008 Submission - Cross Lingual Information Retrieval Track: Tamil-English. Pattabhi R.K Rao and Sobha L, AU-KBC Research Centre, MIT Campus


AU-KBC FIRE2008 Submission - Cross Lingual Information Retrieval Track: Tamil-English
Pattabhi R.K Rao and Sobha L
AU-KBC Research Centre, MIT Campus, Chennai

FIRE 2008: Tamil-English CLIR
Problem Definition
- Ad-hoc cross-lingual document retrieval task of FIRE.
- The task is to retrieve relevant documents in English for a query given in an Indian language.
- We worked on a Tamil-English cross-lingual information retrieval system.

Our Approach
The main components in our CLIR system are:
- Query Language Analyser
- Named Entity Recognizer
- Query Translation engine
- Query Expansion
- Ranking

Query Language Analyser: Tamil Morphological Analyser
- The morphological analyser analyses each word and returns its morphs.
- E.g.: patiwwAnY -> pati (V) + ww (Past) + AnY (3SM)
- For nouns, the inflections mark case, such as dative or accusative.
- For verbs, the inflections carry person, number, gender, tense, aspect and modal information.
- Uses a paradigm-based approach.
- Implemented as a finite state machine.
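The example on this slide can be traced with a small sketch. This is not the AU-KBC analyser: the three tables below are hypothetical one-entry paradigms built only from the slide's example, and the function simply walks three states (root, tense marker, person-number-gender ending) in the manner of a tiny finite state machine.

```python
# Toy paradigm tables (assumption: entries taken only from the slide's example).
VERB_ROOTS = {"pati"}
TENSE_SUFFIXES = {"ww": "Past"}
PNG_SUFFIXES = {"AnY": "3SM"}  # person-number-gender endings


def analyse_verb(word):
    """Return a list of (morph, tag) pairs, or None if no paradigm matches."""
    for root in VERB_ROOTS:
        if not word.startswith(root):
            continue
        rest = word[len(root):]
        for tense, tense_tag in TENSE_SUFFIXES.items():
            if not rest.startswith(tense):
                continue
            ending = rest[len(tense):]
            if ending in PNG_SUFFIXES:
                return [(root, "V"), (tense, tense_tag), (ending, PNG_SUFFIXES[ending])]
    return None


analysis = analyse_verb("patiwwAnY")
print(analysis)
```

A real paradigm-based analyser would hold many paradigms per word class and handle sandhi changes at morph boundaries, which this sketch ignores.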

Named Entity Recognizer (NER)
- Generic engine that uses Conditional Random Fields (CRFs).
- Trained on a word corpus drawn from various domains.
- Uses a hierarchical tagset.
- Achieves 80% recall and 89% precision.
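The slide reports recall and precision separately. For readers who want a single figure, the standard F1 measure (the harmonic mean of the two) can be computed as below; the combined value is derived here and is not stated in the slides.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Figures from the slide: 89% precision, 80% recall.
ner_f1 = f1_score(precision=0.89, recall=0.80)
print(round(ner_f1, 3))
```

With the slide's figures this comes to roughly 0.84.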

Query Translation
- Uses a bilingual dictionary based approach.
- The Tamil-English bilingual dictionary is of size 150K.
- For named entities, which need transliteration rather than translation, a transliteration engine is used.
- Tamil-to-English transliteration is a hard task because Tamil has few consonants.
- Transliteration is done by a statistical system based on an n-gram approach.
- The statistical system works with an accuracy of 81%.
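The routing between dictionary lookup and transliteration can be sketched as follows. Everything here is illustrative: `translate_query`, the placeholder tokens such as `TA_TSUNAMI`, and the stub transliterator are hypothetical stand-ins, and sending out-of-dictionary terms to the transliterator is an assumption that the slides do not confirm.

```python
def translate_query(terms, dictionary, named_entities, transliterate):
    """Translate source-language terms into a flat list of English terms.

    Named entities go to the transliteration function; other terms are
    looked up in the bilingual dictionary (all listed senses are kept).
    """
    translated = []
    for term in terms:
        if term in named_entities:
            translated.append(transliterate(term))
        elif term in dictionary:
            translated.extend(dictionary[term])
        else:
            # Assumption: unknown terms fall back to transliteration.
            translated.append(transliterate(term))
    return translated


# Placeholder tokens stand in for Tamil words; they are not real data.
toy_dictionary = {"TA_ASSISTANCE": ["assistance", "help"]}
toy_entities = {"TA_TSUNAMI"}
toy_transliterate = lambda term: {"TA_TSUNAMI": "Tsunami"}.get(term, term)

english_terms = translate_query(
    ["TA_ASSISTANCE", "TA_TSUNAMI"], toy_dictionary, toy_entities, toy_transliterate
)
print(english_terms)
```

In the system described on the slide, the stub transliterator would be replaced by the statistical n-gram engine.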

Query Expansion
- The query terms are expanded using:
  - Thesaurus
  - Ontology
- Query expansion is done at two places:
  - Before query translation
  - After query translation
- Synonyms are obtained using WordNet.
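A minimal sketch of expanding at both places, before and after translation. Plain dictionaries stand in for the thesaurus and for WordNet, and the tokens and synonym lists are hypothetical; the function names are not from the original system.

```python
def expand_terms(terms, synonyms):
    """Append synonyms after each term, keeping order and dropping duplicates."""
    expanded = []
    seen = set()
    for term in terms:
        for candidate in [term] + synonyms.get(term, []):
            if candidate not in seen:
                seen.add(candidate)
                expanded.append(candidate)
    return expanded


def expand_translate_expand(terms, source_synonyms, translate, target_synonyms):
    """Expand in the source language, translate, then expand in English."""
    pre_expanded = expand_terms(terms, source_synonyms)
    translated = translate(pre_expanded)
    return expand_terms(translated, target_synonyms)


# Placeholder data (assumptions for illustration only).
toy_source_synonyms = {"TA_ASSISTANCE": ["TA_HELP"]}
toy_lexicon = {"TA_ASSISTANCE": "assistance", "TA_HELP": "help"}
toy_translate = lambda terms: [toy_lexicon[t] for t in terms]
toy_target_synonyms = {"assistance": ["aid"], "help": ["aid"]}

final_terms = expand_translate_expand(
    ["TA_ASSISTANCE"], toy_source_synonyms, toy_translate, toy_target_synonyms
)
print(final_terms)
```

Expanding before translation helps when the dictionary lacks the original word but has one of its synonyms; expanding afterwards widens the match against the English collection.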

Query Expansion (2)
Ontology is used to obtain more world knowledge.
Festivals
- Hindu: Holi, Diwali, Dussera
- Muslim: Ramazan
- Christian: Christmas

What is there in the Ontology
- Descriptions of the entity
  - Ex: Holi: festival of colours, good over evil
  - Deepavali (Diwali): festival of lights, crackers, etc.
- We have an ontology of this type for 100 entities
  - Festivals, sports, countries, natural calamities, person names, etc.
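One way to picture such an ontology is a small parent-linked table with descriptions attached to each entity. The layout below and the choice to add ancestor labels to the expansion are assumptions made for illustration; only the entity names and descriptions come from the slides.

```python
# Hypothetical storage layout; names and descriptions are from the slides.
ONTOLOGY = {
    "Festivals": {"parent": None, "descriptions": []},
    "Hindu": {"parent": "Festivals", "descriptions": []},
    "Holi": {"parent": "Hindu",
             "descriptions": ["festival of colours", "good over evil"]},
    "Diwali": {"parent": "Hindu",
               "descriptions": ["festival of lights", "crackers"]},
}


def ontology_terms(entity):
    """Return the entity's descriptions followed by its ancestor labels."""
    node = ONTOLOGY.get(entity)
    if node is None:
        return []
    terms = list(node["descriptions"])
    parent = node["parent"]
    while parent is not None:  # walk up to the root (assumption: ancestors help recall)
        terms.append(parent)
        parent = ONTOLOGY[parent]["parent"]
    return terms


print(ontology_terms("Holi"))
```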

Ranking
- The standard Okapi (BM25) ranking algorithm is used, customized to suit our needs.
- A parameter called the boost factor is added to the standard algorithm for calculating the score.
- The NEs in the query are given a boost factor of 1.5, and original query terms are given a boost factor of 1.25.

Ranking (2)
- The boost factor sets the weight given to particular terms in the query.
- NEs get more weight than other terms: 0.5 times more (boost factor 1.5).
- Original query terms are given 0.25 times more weight (boost factor 1.25) to retain the importance of the terms the user supplied.
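The boosted scoring can be sketched as a per-term multiplier on a textbook BM25 score. Several things here are assumptions, since the slides do not give the formula: the boost is applied multiplicatively, expansion terms get a neutral boost of 1.0, `k1=1.2` and `b=0.75` are common defaults, and the IDF uses a non-negative variant.

```python
import math


def bm25_term_score(tf, df, doc_len, avg_doc_len, num_docs, k1=1.2, b=0.75):
    """Textbook BM25 contribution of one term to one document."""
    if tf == 0:
        return 0.0
    idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
    length_norm = tf + k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / length_norm


# 1.5 and 1.25 are from the slides; 1.0 for expansion terms is an assumption.
BOOST = {"named_entity": 1.5, "original": 1.25, "expanded": 1.0}


def boosted_score(query_terms, doc_tf, doc_freq, doc_len, avg_doc_len, num_docs):
    """Sum boosted BM25 scores; query_terms is a list of (term, kind) pairs."""
    score = 0.0
    for term, kind in query_terms:
        base = bm25_term_score(
            doc_tf.get(term, 0), doc_freq.get(term, 0),
            doc_len, avg_doc_len, num_docs,
        )
        score += BOOST[kind] * base
    return score


# Toy document statistics (illustrative numbers only).
doc_tf = {"tsunami": 3, "relief": 1}
doc_freq = {"tsunami": 40, "relief": 120}
query = [("tsunami", "named_entity"), ("relief", "expanded")]
print(boosted_score(query, doc_tf, doc_freq, doc_len=250, avg_doc_len=300, num_docs=1000))
```

With this layout the same term occurrence counts 1.5 times as much when it is a named entity as when it was added by expansion.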

Experiments: Results (1)
- We submitted two runs.
- For query 29, “assistance after Tsunami”, expanding the terms “assistance” and “Tsunami” gives: “financial assistance, relief material, manpower help, rebuilding infrastructure, government assistance, non-governmental organizations assistance, relief fund, natural calamity, Tsunami, high sea waves”.
- This expansion helped increase recall; the MAP score for this query is 0.46.
- For query ids 27 and 59 the system did not perform well.

Experiments: Results (2)
- Query 27, “Sino Indian relationship”, is too broad, and the query expansion is poor because the ontology lacks the relevant knowledge; what constitutes a “relationship” needs to be defined there.
- Query 59, “American citizens fight against Iraq war”, is too specific: the document collection has many more documents on the Iraq war in general than on this particular topic. The terms “Iraq War” get more weight than the terms “fight against”.
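The per-query figures discussed in these results slides are average precision values, and MAP is their mean over all queries. A minimal sketch of the standard definitions follows; the document ids in the example are hypothetical.

```python
def average_precision(ranked_doc_ids, relevant_ids):
    """Average of precision@k taken at each rank k where a relevant doc appears."""
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant_ids)


def mean_average_precision(runs):
    """runs is a list of (ranked_doc_ids, relevant_ids) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Toy ranking: relevant documents at ranks 1 and 3.
ap = average_precision(["d1", "d2", "d3", "d4"], {"d1", "d3"})
print(round(ap, 4))
```

Because unretrieved relevant documents count as zeros, expansion that raises recall, as with query 29, can raise average precision as well.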

Experiments – Results (3) Overall Results of the Tamil – English cross lingual information retrieval system.

Conclusion
- Here a query language analyser is used.
- The difference between two runs: MAP score of and
- The use of the query expansion module helps in increasing the recall.
- The results obtained are encouraging:
  - MAP –
  - Recall –

References
- Mohammad Afraz and Sobha L (2008). “English to Dravidian Language Machine Transliteration: A Statistical Approach Based on N-grams”. In Proceedings of International Seminar on Malayalam and Globalization, Thiruvananthapuram, India.
- Genesereth, M. R. and Nilsson, N. (1987). Logical Foundations of Artificial Intelligence. Morgan Kaufmann Publishers: San Mateo, CA.
- Vijayakrishna R and Sobha L (2008). “Domain focused Named Entity Recognizer for Tamil using Conditional Random Fields”. In Proceedings of International Joint Conference on Natural Language Processing Workshop on NER for South and South East Asian Languages, Hyderabad, India. pp. 59-66.
- S. Viswanathan, S. Ramesh Kumar, B. Kumara Shanmugam, S. Arulmozi and K. Vijay Shanker (2003). “A Tamil Morphological Analyser”. In Proceedings of International Conference on Natural Language Processing-2003, Mysore.

Thank you!