Multilingual Information Access in a Digital Library
Vamshi Ambati, Rohini U, Pramod, N Balakrishnan and Raj Reddy
International Institute of Information Technology Hyderabad, India
IIIT Hyderabad - http://dli.iiit.ac.in
Context
  Digital Library of India
    155,000 English books
    145,000 books in other languages
  Literate population
    20% of India understands English
    80% does not
Multilingual Access to Information
  Retrieve a book
    By metadata
    By keyword / content
    Cross Lingual Information Retrieval
  Read a book
    Help understand sentences in a language
    Help understand sentences across languages
    Machine Translation
Approaches to Multilingual Access
  Cross Lingual Retrieval
    Translate Query to Document Language
    Translate Document to Query Language
  Machine Translation
    Knowledge Based Approaches
    Corpus Based Approaches
    Hybrid Approaches
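The "Translate Query to Document Language" option can be illustrated with a minimal sketch. It assumes a toy bilingual dictionary of romanized Hindi words (`TOY_DICTIONARY`) and two helper functions (`translate_query`, `expand_query`); all three are hypothetical names invented for this example and are not part of the system described in these slides.

```python
from itertools import product

# Hypothetical toy dictionary (romanized Hindi -> English candidates).
# The entries are illustrative only, not taken from the Universal Dictionary.
TOY_DICTIONARY = {
    "pustak": ["book"],
    "itihas": ["history"],
    "nadi": ["river", "stream"],
}


def translate_query(query, dictionary):
    """Return, for each query word, its list of candidate translations.

    Words missing from the dictionary are kept unchanged so that names
    and untranslated terms can still match documents.
    """
    return [dictionary.get(word, [word]) for word in query.lower().split()]


def expand_query(candidates):
    """Enumerate every combination of candidate translations as a query string."""
    return [" ".join(combo) for combo in product(*candidates)]


candidates = translate_query("nadi itihas", TOY_DICTIONARY)
queries = expand_query(candidates)
print(queries)  # ['river history', 'stream history']
```

Translating the query keeps the work proportional to a few words per search, whereas translating documents would have to be done for the whole collection; the price is that ambiguous words multiply the number of candidate queries.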
Challenges in Multilingual Access
  Corpus Based Approaches
    Unavailability of Parallel Corpus for pairs of languages
    Unavailability of Computational Linguistics Resources
  Dictionary Based Approaches
    Unavailability of multiple bilingual dictionaries
Resources
  Universal Dictionary
    Conceived and implemented by Michael Shamos at CMU, USA
  ITRANS
    A transcription scheme and associated tool built by IISc, IIIT and CMU
  Corpus
    Data Entry by TTD and DLI project
    TIDES project
Universal Dictionary
How are we doing it?
  Cross Lingual Search (Identify Information)
    Dictionary lookup
    User feedback based
    Lucene Search Engine
  Machine Translation (Understand Information)
    Corpus based technique (EBMT)
    Dictionary based word-to-word lookup
    Good-enough translation vs Perfect translation
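The two machine translation routes listed above, a corpus based technique and dictionary based word-to-word lookup, can be combined as in the sketch below. This is a heavily simplified stand-in and not the system's actual EBMT: it only reuses a stored translation when the whole sentence matches an example exactly, and otherwise falls back to a word-for-word gloss. `EXAMPLE_BASE`, `WORD_DICTIONARY` and `translate` are hypothetical names, and the romanized Hindi entries are invented for illustration.

```python
# Hypothetical toy resources; the entries are illustrative, not project data.
EXAMPLE_BASE = {
    "yah pustak hai": "this is a book",
}
WORD_DICTIONARY = {
    "yah": "this",
    "pustak": "book",
    "nadi": "river",
    "hai": "is",
}


def translate(sentence, examples, dictionary):
    """Translate by exact example match, else by word-for-word gloss.

    Returns a (translation, method) pair so the caller can tell which
    route produced the output.
    """
    # Normalise case and whitespace before looking up the example base.
    key = " ".join(sentence.lower().split())
    if key in examples:
        return examples[key], "example"
    # Fallback: gloss each word, leaving unknown words unchanged.
    gloss = [dictionary.get(word, word) for word in key.split()]
    return " ".join(gloss), "word-for-word"


print(translate("Yah pustak hai", EXAMPLE_BASE, WORD_DICTIONARY))
# ('this is a book', 'example')
print(translate("yah nadi hai", EXAMPLE_BASE, WORD_DICTIONARY))
# ('this river is', 'word-for-word')
```

The second output keeps the source word order, so it is not fluent English, but a reader can still recover the meaning. That is the sense of "good-enough translation" in this sketch.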
Cross Lingual Retrieval
Cross Lingual Retrieval
Reading Assistant System
Reading Assistant
Status Today
  CLIR for 6 languages
  MT for 3 languages
    Shakti (a knowledge based MT system)
    Parallel Corpus for Hindi-English
  UDICT
    About 40 Foreign Languages
    6 Indian Languages
What more is needed?
  UDICT
    Improving coverage of existing languages
    Adding new languages
  Machine Translation
    Corpus acquisition
    State-of-the-art techniques applied to Indian Languages
    Multi-way parallel corpus development
  Textual format for the books
    Books are currently in image formats
    OCR should be developed to extract textual content
Thank You
Questions?