Published by Gavin Booker. Modified over 9 years ago.
1
Multilingual Information Access in a Digital Library
Vamshi Ambati, Rohini U, Pramod, N Balakrishnan and Raj Reddy
International Institute of Information Technology, Hyderabad, India
2
IIIT Hyderabad - http://dli.iiit.ac.in
Context
- Digital Library of India
  - 155,000 English books
  - 145,000 books in other languages
- Literate population
  - 20% of India understands English
  - 80% cannot
3
Multilingual Access to Information
- Retrieve a book
  - By metadata
  - By keyword / content
  - Cross Lingual Information Retrieval
- Read a book
  - Help understand sentences in a language
  - Help understand sentences across languages
  - Machine Translation
4
Approaches to Multilingual Access
- Cross Lingual Retrieval
  - Translate query to document language
  - Translate document to query language
- Machine Translation
  - Knowledge based approaches
  - Corpus based approaches
  - Hybrid approaches
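The "translate query to document language" option can be illustrated with a minimal Python sketch. It is not the system described in these slides: the dictionary entries, the romanized Hindi documents, and the function names (`build_index`, `translate_query`, `search`) are all hypothetical, chosen only to show how a bilingual dictionary can expand a query before matching it against an index built in the document language.

```python
from collections import defaultdict

# Hypothetical English -> Hindi (romanized) dictionary; illustrative entries only.
EN_HI = {
    "book": ["pustak", "kitab"],
    "water": ["pani", "jal"],
}

# Toy collection written in the document language (romanized Hindi).
DOCS = {
    1: "yah pustak purani hai",
    2: "nadi ka jal saaf hai",
    3: "kitab aur pani",
}


def build_index(docs):
    """Build an inverted index: token -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def translate_query(query, dictionary):
    """Replace each query term with all of its dictionary translations.

    Terms missing from the dictionary are kept unchanged, so names and
    untranslatable words can still match.
    """
    translated = []
    for term in query.lower().split():
        translated.extend(dictionary.get(term, [term]))
    return translated


def search(query, dictionary, index):
    """Rank documents by how many translated query terms they contain."""
    scores = defaultdict(int)
    for term in translate_query(query, dictionary):
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    # Highest score first; ties broken by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))


if __name__ == "__main__":
    index = build_index(DOCS)
    print(search("book water", EN_HI, index))
```

Translating the query keeps the index in the original document language, so only a few words are translated per search. Translating documents instead would require running translation over the whole collection ahead of time.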
5
Challenges in Multilingual Access
- Corpus Based Approaches
  - Unavailability of parallel corpora for pairs of languages
  - Unavailability of computational linguistics resources
- Dictionary Based Approaches
  - Unavailability of multiple bilingual dictionaries
6
Resources
- Universal Dictionary
  - Conceived and implemented by Michael Shamos at CMU, USA
- ITRANS
  - A transcription scheme and associated tool built by IISc, IIIT and CMU
- Corpus
  - Data entry by TTD and DLI project
  - TIDES project
7
Universal Dictionary
8
How are we doing it
- Cross Lingual Search (Identify Information)
  - Dictionary lookup
  - User feedback based
  - Lucene search engine
- Machine Translation (Understand Information)
  - Corpus based technique (EBMT)
  - Dictionary based word-word lookup
  - Good-enough translation vs perfect translation
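A dictionary based word-word lookup of the "good-enough" kind might look like the Python sketch below. It is an assumption-laden illustration, not the Reading Assistant's code: the romanized Hindi entries in `HI_EN` and the helpers `gloss_sentence` and `render` are hypothetical. Words missing from the dictionary are shown in brackets rather than dropped, so the reader can see where coverage ends.

```python
import string

# Hypothetical Hindi (romanized) -> English dictionary; illustrative entries only.
HI_EN = {
    "yah": "this",
    "pustak": "book",
    "purani": "old",
    "hai": "is",
}


def gloss_sentence(sentence, dictionary):
    """Look up each word independently and return (word, translation) pairs.

    The translation is None when the word is not in the dictionary.
    Source word order is preserved; no reordering or grammar is applied.
    """
    glossed = []
    for raw in sentence.split():
        word = raw.strip(string.punctuation).lower()
        if not word:
            continue
        glossed.append((word, dictionary.get(word)))
    return glossed


def render(glossed):
    """Join the glosses, marking untranslated words with square brackets."""
    return " ".join(
        translation if translation is not None else f"[{word}]"
        for word, translation in glossed
    )


if __name__ == "__main__":
    print(render(gloss_sentence("Yah pustak bahut purani hai.", HI_EN)))
```

The output keeps the source word order ("this book [bahut] old is"), which is the trade-off the slide names: the gist is readable without a full translation system, but the result is not fluent.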
9
Cross Lingual Retrieval
10
Cross Lingual Retrieval
11
Reading Assistant System
12
Reading Assistant
13
Status Today
- CLIR for 6 languages
- MT for 3 languages
  - Shakti (a knowledge based MT system)
  - Parallel corpus for Hindi-Eng
- UDICT
  - About 40 foreign languages
  - 6 Indian languages
14
What more is needed?
- UDICT
  - Improving coverage of existing languages
  - Adding new languages
- Machine Translation
  - Corpus acquisition
  - State-of-the-art techniques applied to Indian languages
  - Multi-way parallel corpus development
- Textual format for the books
  - Books are currently in image formats
  - OCR should be developed for textual content
15
Thank You
Questions?