Published by Douglas Goodwin. Modified over 9 years ago.
1
सुस्वागतम् Welcome Technology Development for Indian Languages
Presentation at the LREC 2010 Conference, Malta
Swaran Lata & Somnath Chandra
Human Centred Computing Division, Department of Information Technology
Presented by: Shyam S Agrawal, Executive Director KIIT, Advisor CDAC
TDIL 1
2
Technology Development for Indian Languages (TDIL) Programme
Promotes research and development of technology, software tools and applications for Indian languages
Catalyzes proliferation of language technology products and solutions
Promotes standardization
3
Complexity for Language Technology Development
Complexity:
  A very challenging area for computer scientists due to the voluminous, informal and ambiguous nature of human languages.
  Involves interdisciplinary research: advanced computer processing involving Artificial Intelligence and Machine Learning on the one hand, and linguistic knowledge for incorporating human communication techniques on the other.
  Still at the research stage in many areas despite huge efforts by academia and scientists in India as well as abroad.
More intense challenge for Indian languages:
  Large linguistic diversity, with 22 officially recognized languages and 12 scripts.
  One language, many scripts; many languages, one script.
  The specificity of each language and script is unique and cannot be easily replicated.
  Differences in perceptions of usage among various user groups, e.g. state governments, academia and industry.
4
Official Indian Languages & Scripts
Sl. No.  Language   Script
1.       Hindi      Devanagari
2.       Sanskrit   Devanagari
3.       Marathi    Devanagari
4.       Konkani    Devanagari
5.       Nepali     Devanagari
6.       Maithili   Devanagari
7.       Sindhi     Devanagari
8.       Bodo       Devanagari
9.       Dogri      Devanagari, Sharda
10.      Bengali    Bengali
11.      Assamese   Bengali
12.      Manipuri   Bengali, Meitei (Mayak)
13.      Gujarati   Gujarati
14.      Kannada    Kannada
15.      Malayalam  Malayalam
16.      Oriya      Oriya
17.      Punjabi    Gurmukhi
18.      Tamil      Tamil
19.      Telugu     Telugu
20.      Urdu       Arabic
21.      Santhali   Ol-Chiki, Devanagari
22.      Kashmiri   Arabic, Sharda
5
Genesis of Language Technology Development in India-Early Initiatives
Pioneering effort by DIT in collaboration with IIT Kanpur in 1983:
  The Department of Electronics (now DIT) entrusted a sponsored project to IIT Kanpur to build an integrated Devanagari terminal (GIST).
  A standalone system was developed, with a computer keyboard for inputting characters in Devanagari, a monitor for display, a dot-matrix printer for printing, and a serial link for sending characters to another terminal.
  Developed the Indian Script Code for Information Interchange (ISCII standard).
C-DAC Pune adopted GIST technology to develop products and licensed it to manufacturers.
The Technology Development for Indian Languages (TDIL) Programme started in 1991 as a separate entity.
6
Phases of Language Technology developments
Seeding Phase:
  TDIL programme established in 1991
  Some linguistic resources, such as corpora, developed
  NLP training programme for computer scientists and linguists
  Some stand-alone language learning tools developed
  Exploratory work in the area of NLP
Exploratory Phase:
  Proof-of-concept machine translation systems developed for English to Indian languages (Angla-Bharti) and Indian languages to Indian languages (Anusaraka)
  Laboratory model of font-dependent Optical Character Recognition in Hindi
  Text-to-Speech for Hindi
7
Resource Centres for Indian Languages Technology Solutions (RCILTS)
Catch-up Phase: The TDIL programme gathered momentum by establishing 13 Resource Centres for Indian Languages Technology Solutions (RCILTS) and 10 CoIL-Net Centres.
Resource Centres for Indian Languages Technology Solutions (RCILTS):
  The objective was to proliferate this activity to a large number of institutions across the country, each with a specific mandate for a language or a group of languages.
  Under this project, these centres have developed several important tools, linguistic resources and technologies for Indian language support.
  Many of these tools are now being modified and upgraded to be released in the public domain under the National Roll-Out Project.
Some of the important language technology tools and resources developed under the Resource Centres project are:
8
Bilingual dictionaries: between Indian languages, with over 30,000 words [Resource Centres]
Spell-checkers in Indian languages [Resource Centres]
Ontology and WordNet: 9000 synsets with morphological analyzer and front end for Hindi; WordNet with 1100 lexical entries and X Window interface for Oriya
Proof-of-concept Optical Character Recognition (OCR) technologies in other Indian languages
Proof-of-concept TTS in other Indian languages
In addition, several other tools, operating systems and resources have been developed under various sponsored projects. Some of the notable ones are:
  INDIX-2 (localized Linux in 12 Indian languages)
  Phrasal dictionaries: in Tamil and Kannada [IIIT, Hyderabad]
  Online VishwaKosha: with 9162 topics [CDAC]
  Parallel corpora: one million pages of parallel corpora in 11 languages [CDAC]
9
CoIL-Net Centres:
  The objective was to develop localized content in the Hindi-speaking states to enhance IT proliferation.
  E-content, as HTML and dynamic pages, has been developed in the domains of health, education, tourism and agri-business.
  Content on eminent personalities, tourist places, classical works and cultural heritage of these regions has been developed.
  The developed content has been uploaded to the internet.
  The National Train Enquiry website has been localized in Hindi by CDAC.
10
Product Development and Proliferation Phase :2005-onwards
A "Roadmap for Language Technology Development in India" was evolved to formulate a short-term and long-term mission plan and strategy for the development of language technologies in India.
The focus is to synergize development efforts and develop deployable products.
The National Roll-Out Programme and six Mission Mode Projects have been initiated to facilitate speedy development and availability of language technologies.
11
Proliferation of Indian Language Technology Products : National Roll-Out Plan
Objectives of the initiative:
  To facilitate speedy development and availability of language technologies.
Broad contents of the CD:
  Common user's toolkit: content creation tools, DTP, office automation, code converters
  Productivity tools: spellchecker, domain-based dictionaries, transliteration
  Power user: OCR, Text to Speech, MAT, etc.
Distribution channels for the CD:
  Registered users of the TDIL, DIT web site, through the postal department
  IT magazines, publications, etc.
  Schools, government departments, etc.
Software tools and fonts for 12 Indian languages, namely Hindi, Tamil, Telugu, Assamese, Kannada, Malayalam, Marathi, Oriya, Punjabi, Urdu, Gujarati and Sanskrit, have been released in the public domain.
CDs for 4 more Indian languages, namely Bodo, Dogri, Maithili and Nepali, are being released on February 21, 2009 (UNESCO International Mother Language Day).
12
Software tools and fonts CD contents
1. Language True Type Fonts with Keyboard Driver (more than 200): supports INSCRIPT, Typewriter and Phonetic keyboard layouts. Allows content creation in Indian languages using applications running under Microsoft Windows.
2. Language Multi-font Keyboard Engine for True Type Fonts: allows content creation in Indian languages using applications running under Microsoft Windows in a variety of font encodings.
3. Language Unicode Compliant Open Type Fonts (more than 200): allow rendering of Indian language Unicode data.
4. Unicode Compliant Keyboard Driver: supports INSCRIPT, Typewriter and Phonetic keyboard layouts. Allows Unicode-compliant data input.
5. Generic fonts and storage code converter: lets the user convert existing data in different encodings to ISCII / Unicode.
6. Localized version of Bharateeya OO (office suite): consists of a word processor, presentation tool, spreadsheet and drawing tool.
7. Firefox browser: localized version of the Firefox browser.
13
Software tools and fonts CD contents
8. Colombo: e-mail client for Windows and Linux operating systems. Users can send and receive e-mails in Indian languages. The menus are also in the local language.
9. GAIM: multi-protocol messenger. Enables the user to use various messenger clients for communication.
10. Optical Character Recognition: with the help of OCR, one can scan printed text and convert it into editable form for further processing.
11. Typing Tutor: teaches the user to type in Indian languages.
12. Spellchecker: allows the end user to rectify spelling mistakes in a document.
13. Dictionaries: English to Indian language dictionaries, and vice versa, in general, administrative and technical domains.
14. Transliteration Tool: transliterates a given Indian language text into Roman and vice versa. Useful for a user who is not familiar with the script.
15. Text to Speech system: reads out the text.
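As an illustration of what the transliteration tool (item 14) has to do, here is a minimal Python sketch of Devanagari-to-Roman conversion. It is not the TDIL tool: the function name `to_roman`, the small character tables and the IAST-style Roman spellings are all assumptions made for this example. The key point it shows is the inherent vowel: a consonant is read with "a" unless a vowel sign or a virama follows it.

```python
# Illustrative subset only; a real tool needs the full character inventory.
CONSONANTS = {
    "\u0915": "k", "\u0917": "g", "\u0924": "t", "\u0926": "d", "\u0928": "n",
    "\u092a": "p", "\u092e": "m", "\u0930": "r", "\u0932": "l", "\u0938": "s",
}
VOWEL_SIGNS = {  # dependent vowel signs (matras)
    "\u093e": "\u0101", "\u093f": "i", "\u0940": "\u012b", "\u0941": "u",
    "\u0942": "\u016b", "\u0947": "e", "\u094b": "o",
}
INDEPENDENT_VOWELS = {
    "\u0905": "a", "\u0906": "\u0101", "\u0907": "i", "\u0908": "\u012b",
    "\u0909": "u", "\u090f": "e", "\u0913": "o",
}
VIRAMA = "\u094d"  # suppresses the inherent vowel


def to_roman(text):
    """Transliterate Devanagari text to Roman using the tables above."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch])
            nxt = text[i + 1] if i + 1 < len(text) else ""
            if nxt == VIRAMA:
                i += 1                      # bare consonant, no vowel
            elif nxt in VOWEL_SIGNS:
                out.append(VOWEL_SIGNS[nxt])
                i += 1                      # vowel sign replaces inherent "a"
            else:
                out.append("a")             # inherent vowel
        elif ch in INDEPENDENT_VOWELS:
            out.append(INDEPENDENT_VOWELS[ch])
        else:
            out.append(ch)                  # pass through anything unknown
        i += 1
    return "".join(out)
```

With these tables, the word namaste (U+0928 U+092E U+0938 U+094D U+0924 U+0947) comes out as "namaste". The reverse direction needs extra rules, for example to decide when a Roman "a" is an inherent vowel, which this sketch does not attempt.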
15
Product Development Efforts…. Mission Mode Projects Phase -I
In consortium mode, 26 premier institutes and R&D organizations are working together on six projects to develop advanced technologies and applications.
Development of English to Indian Languages Machine Translation (MT) System: 10 institutions are participating to build a deployable MT system.
  Consortium leader: CDAC, Pune
  Domains: Tourism and Health
  Six language pairs: English to Hindi / Marathi / Bengali / Oriya / Tamil / Urdu
Development of English to Indian Languages Machine Translation (MT) System with Angla-Bharti Technology: 4 institutions are participating to build a deployable MT system.
  Consortium leader: IIT Kanpur
  Domains: Tourism and Health
  Six language pairs: English to Hindi / Marathi / Bengali / Oriya / Tamil / Urdu
Development of Indian Language to Indian Language Machine Translation System: 11 institutions are participating to build a deployable bi-directional MT system.
  Consortium leader: IIIT, Hyderabad
  Domains: Tourism and Health
  Nine language pairs: Tamil-Hindi, Telugu-Hindi, Urdu-Hindi, Kannada-Hindi, Punjabi-Hindi, Marathi-Hindi, Bengali-Hindi, Tamil-Telugu, Malayalam-Tamil
16
Development Efforts…. Mission Mode Projects Phase -I
Development of Robust Document Analysis & Recognition System for Indian Languages: 11 institutions are participating to build an OCR system with improved accuracy and font- and point-size-independent recognition capability.
  Consortium leader: IIT, Delhi
  10 scripts: Bengali, Devanagari, Malayalam, Gujarati, Telugu, Tamil, Oriya, Tibetan/Nepali, Gurmukhi, Kannada
Development of On-line Handwriting Recognition System: seven institutions are participating to build an on-line handwriting recognition system.
  Consortium leader: IISc, Bangalore
  6 scripts: Devanagari, Bengali, Tamil, Telugu, Kannada and Malayalam
Development of Cross-lingual Information Access: 11 institutions are participating to develop a portal where a user can give a query in one Indian language and access documents available in (a) the language of the query, (b) Hindi (if the query language is not Hindi) and (c) English.
  Consortium leader: IIT, Bombay
  Domains: Tourism and Health
  Six languages: Bengali, Hindi, Marathi, Punjabi, Tamil and Telugu
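The language-selection rule described for the Cross-lingual Information Access portal can be written down in a few lines. The sketch below is only an illustration of that rule, not the consortium's code: the function name `document_languages` and the use of two-letter ISO 639-1 codes for the six languages are assumptions made for this example.

```python
# ISO 639-1 codes for the six query languages listed on the slide.
QUERY_LANGUAGES = {"bn", "hi", "mr", "pa", "ta", "te"}


def document_languages(query_lang):
    """Return the languages in which documents are retrieved for a query.

    Rule from the slide: (a) the query language, (b) Hindi if the query
    language is not Hindi, and (c) English.
    """
    if query_lang not in QUERY_LANGUAGES:
        raise ValueError(f"unsupported query language: {query_lang}")
    languages = [query_lang]
    for extra in ("hi", "en"):
        if extra not in languages:
            languages.append(extra)
    return languages
```

For a Marathi query this gives `["mr", "hi", "en"]`, while a Hindi query gives `["hi", "en"]`, since Hindi is not listed twice.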
17
Status of the readiness of the consortium mode projects
Columns: Sl No, Name of the product/system, Language pairs, Domains, Version, Possible date
1. English to Indian Languages Machine Translation System (E-IL). Domains: Tourism. Version: α.
2. English to Indian Languages Machine Translation System (E-IL) with Angla-Bharti approach.
3. Indian Language to Indian Language Machine Translation (IL-IL). Possible date: March 31, 2009.
4. Cross-lingual Information Access (CLIA). Languages: Marathi, Tamil and Bengali.
5. Printed Text OCR. Domains: --. Possible date: March 31, 2009.
6. On-line Handwriting Recognition system (OHWR).
18
Development Efforts…. Speech Processing
Speech corpora:
  Annotated speech corpora of approximately 50 hours developed for Hindi, Marathi, Punjabi, Bengali, Assamese and Manipuri. [CDAC]
  Speech corpora for Tamil, Malayalam, Telugu and Kannada are under development. [CDAC]
Speech recognition:
  A phonetic engine for speech recognition in Hindi and Telugu is being developed. [IIIT Hyderabad]
Text-to-Speech (TTS) and Automatic Speech Recognition in Indian languages:
  A consortium mode project has been initiated to develop a Text-to-Speech system for visually challenged persons in six Indian languages, namely Hindi, Tamil, Telugu, Marathi, Malayalam and Bengali.
  Development of Automatic Speech Processing in Indian languages is also being initiated.
19
Development Efforts…. Sanskrit Computing :
A consortium mode project has been initiated for development of a Sanskrit computational toolkit and a Sanskrit-Hindi Machine Translation System. [Univ. of Hyderabad]
Corpora:
  A consortium mode project is being initiated for development of annotated corpora in 11 Indian languages.
  The project will evolve the standards for natural language processing.
20
Development Efforts for North –Eastern Languages
Consortium mode projects have been initiated to develop linguistic resources and basic information processing tools for the North-Eastern languages Assamese, Bodo, Manipuri and Nepali. [C-DAC Pune]
A consortium mode project has also been initiated for development of WordNet in North-Eastern languages. [IIT Bombay]
Speech corpora and standardization of the International Phonetic Alphabet (IPA) for the Bodo language have been initiated. [Univ. of Guwahati]
21
Standardization
22
ISCII – Indian Script Code for Information Interchange
Since the 1970s, efforts have been made to evolve codes for the characters and symbols of the 10 Brahmi-based Indian scripts, taking advantage of their common phonetic structure. These efforts culminated in the Indian standard for the Indian Script Code for Information Interchange (ISCII) in December 1991.
The ISCII standard specifies a 7-bit code table which can be used in a 7-bit or 8-bit ISO-compatible environment.
It allows English and Indian script alphabets to be used simultaneously.
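The common structure of the Brahmi-based scripts is still visible in Unicode, whose Indic blocks were laid out on the ISCII model: each script occupies a 128-code-point block, and corresponding letters sit at the same offset in every block. The Python sketch below uses that property to move text from one script to another. The function name `transpose` is hypothetical, and the approach is a simplification: not every offset is assigned in every script (Tamil, for instance, has fewer letters), so the result is a letter-for-letter script conversion rather than a linguistically checked transliteration.

```python
# First code point of each script's Unicode block; blocks are 0x80 apart.
BLOCK_START = {
    "devanagari": 0x0900,
    "bengali": 0x0980,
    "gurmukhi": 0x0A00,
    "gujarati": 0x0A80,
    "oriya": 0x0B00,
    "tamil": 0x0B80,
    "telugu": 0x0C00,
    "kannada": 0x0C80,
    "malayalam": 0x0D00,
}

# Danda and double danda live in the Devanagari block but are shared
# by the other scripts, so they are left unchanged.
SHARED = {0x0964, 0x0965}


def transpose(text, source, target):
    """Move each character of the source block to the same offset in target."""
    src = BLOCK_START[source]
    dst = BLOCK_START[target]
    out = []
    for ch in text:
        cp = ord(ch)
        if src <= cp < src + 0x80 and cp not in SHARED:
            out.append(chr(cp - src + dst))
        else:
            out.append(ch)  # other characters (spaces, Latin, digits) pass through
    return "".join(out)
```

For example, transposing the Devanagari word for "thank you" (U+0927 U+0928 U+094D U+092F U+0935 U+093E U+0926) to Gujarati yields U+0AA7 U+0AA8 U+0ACD U+0AAF U+0AB5 U+0ABE U+0AA6, the same letters in the Gujarati block.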
23
INSCRIPT Keyboard Layout
Standardized by the Bureau of Indian Standards in 1991.
Keys are placed in such a way that a user well versed in one language can type in another without effort.
The layout is overlaid on the existing QWERTY keyboard.
Language selection is done with the help of the Caps Lock, Scroll Lock or Num Lock key.
Since it is based on the phonetic nature of Indian languages, it is very easy to learn.
Efforts have been initiated to incorporate additional characters as per the latest Unicode 5.1 standard in the modified layout.
24
UNICODE
Unicode began as a 16-bit encoding providing 65,536 code points; the current codespace extends beyond 16 bits to 1,114,112 code points, of which the first 65,536 form the Basic Multilingual Plane.
Corresponds to the ISO/IEC 10646 Universal Multiple-Octet Coded Character Set (UCS).
In sync with other international standards, such as those of the W3C.
The Unicode Standard assigns each character a unique numeric value and name.
Encodes all of the characters used for the written languages of the world.
Unicode is increasingly being accepted as a standard for information interchange worldwide, as most of the major IT companies have declared their support for it.
The Department of Information Technology has been a voting member of the Unicode Consortium since 2000, to ensure adequate representation of Indic scripts in the Unicode Standard.
DIT finalized proposed changes to the Unicode Standard, and the majority of them were accepted and incorporated in version 5.0.
Initiatives have been taken to incorporate additional languages/scripts and additional characters and symbols of Vedic Sanskrit in Unicode.
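The "unique numeric value and name" of a character can be inspected directly with Python's standard `unicodedata` module. The short sketch below is only an illustration; the helper names `describe` and `utf16_units` are made up for this example. The second helper shows why 16 bits are not always enough: a character outside the first 65,536 code points takes two 16-bit units in UTF-16.

```python
import unicodedata


def describe(text):
    """Return (code point, Unicode name) pairs for each character in text."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


def utf16_units(ch):
    """Number of 16-bit code units needed to store one character in UTF-16."""
    return len(ch.encode("utf-16-be")) // 2
```

Calling `describe("\u0915")` returns `[("U+0915", "DEVANAGARI LETTER KA")]`. `utf16_units("\u0915")` is 1, while `utf16_units("\U00011013")`, a code point in the Brahmi block above U+FFFF, is 2.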
25
UNICODE: Examples
[Figure: code chart examples. The markers indicate (a) a proposed change in the shape of a character/symbol/sign in the existing standard, (b) a change in the annotation/explanation of a particular code point, and (c) a proposed addition of a character/symbol/sign to the existing standard.]
26
W3C
The project "Web Internationalization Initiative" (WII) has been initiated with the objective of adequate representation of Indic scripts in the web technology standards being evolved by the World Wide Web Consortium (W3C).
An initiative has been taken to incorporate the key findings of the WII projects in the W3C standards/guidelines.
27
W3C: A large amount of work needs to be done
In Phase I of the WII projects, only a little exploratory work has been carried out.
W3C Internationalization has 115 recommendations covering web internationalization, XML, Cascading Style Sheets and speech synthesis.
These recommendations need to be carefully studied from the Indian language perspective, and specific recommendations need to be put forward to W3C.
A few of specific interest are: Internationalization Tag Set (ITS) Version 1.0, Voice Extensible Markup Language (VoiceXML) 2.1, Web Content Accessibility Guidelines 1.0, Cascading Style Sheets Level 1 (CSS1) Specification, and Speech Synthesis Markup Language 1.0.
There is a need for consultation with all stakeholders, such as academia, industry and the various state governments.
Industry and web service providers need to be sensitized to adopt W3C standards.
The W3C India Office has been established at DIT under the aegis of the TDIL programme.
28
International Phonetic Alphabet (IPA)
Since phonetic representation of symbols is required for present-day speech markup languages such as the W3C Speech Synthesis Markup Language (SSML), standardization of IPA symbols is necessary.
India being a multilingual country, a standardized phonetic alphabet has to be developed for the scientific study of phonetics and for SSML in Indian languages.
IPA standardization for all Indian languages, and its acceptance by the International Phonetic Association, is thus required for the development of speech technology and associated products.
Efforts have been initiated to standardize IPA symbols for Indian languages.
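To make the link between IPA and SSML concrete: SSML 1.0 provides a `phoneme` element whose `alphabet` attribute can be set to "ipa" and whose `ph` attribute carries the phonetic string. The Python sketch below builds such a fragment with the standard `xml.etree.ElementTree` module. The function name `ssml_with_ipa` is hypothetical, and any IPA string passed to it (such as the broad transcription used in the note that follows) is an illustrative assumption, not a standardized pronunciation.

```python
import xml.etree.ElementTree as ET

# Namespace of SSML 1.0 documents.
SSML_NS = "http://www.w3.org/2001/10/synthesis"


def ssml_with_ipa(text, ipa, lang):
    """Wrap text in an SSML <phoneme> element carrying its IPA pronunciation."""
    speak = ET.Element(
        "speak",
        {"version": "1.0", "xmlns": SSML_NS, "xml:lang": lang},
    )
    phoneme = ET.SubElement(speak, "phoneme", {"alphabet": "ipa", "ph": ipa})
    phoneme.text = text  # the orthographic form, kept as fallback text
    return ET.tostring(speak, encoding="unicode")
```

For instance, `ssml_with_ipa("\u0928\u092e\u0938\u094d\u0924\u0947", "n\u0259m\u0259ste\u02d0", "hi-IN")` produces a `speak` document whose single `phoneme` element has `alphabet="ipa"` and the given transcription in `ph`. A synthesizer that honours the element reads the `ph` value instead of guessing the pronunciation from the spelling, which is why an agreed IPA symbol set per language matters.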
29
Common Locale Data Repository
The Common Locale Data Repository (CLDR) is an initiative of the Unicode Consortium to develop locale data for the world's languages.
The Unicode CLDR provides key building blocks for software to support the world's languages.
CLDR is by far the largest and most extensive standard repository of locale data. This data is used by a wide spectrum of companies for their software internationalization and localization.
The Department of Information Technology has already become TC (Team Coordinator) to incorporate and modify Indian languages in CLDR.
Modification and development of CLDR data in Indian languages have been initiated in consultation with state governments and other stakeholders.
CLDR data for 6 Indian languages have been incorporated in the Unicode CLDR.
30
Language Tags
Language tags are used in most multilingual applications, such as web development, multilingual internet data exchange, language negotiation and web services.
The nomenclature of language tags is standardized under the ISO 639 standard.
The language tag standards ISO 639-x (x stands for the different parts) are used in many other international standards and best practices, such as IETF (Internet Engineering Task Force) RFC 4646 and RFC 4647 and the W3C web standards.
They are also related to ISO 3166 (region codes) and ISO 15924 (script codes).
The present form of ISO 639-2, and other current and forthcoming parts of ISO 639, have many ambiguous entries for Indian languages, which need to be corrected urgently in order to prevent propagation of incorrect nomenclature for language sets.
Modifications and additions of language tags for Indian languages have been taken up in consultation with the state governments and all stakeholders.
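How the separate code lists combine into one tag can be seen by splitting a tag into its parts. The Python sketch below handles only the three subtags discussed above: language (ISO 639), script (four-letter ISO 15924 code) and region (ISO 3166 two-letter code, or a three-digit code). The function name `parse_language_tag` is hypothetical, and the parser is a deliberate simplification: RFC 4646 also defines extended language, variant, extension and private-use subtags, which are ignored here.

```python
def parse_language_tag(tag):
    """Split a tag such as 'ks-Arab-IN' into language, script and region."""
    parts = tag.split("-")
    result = {"language": parts[0].lower(), "script": None, "region": None}
    for sub in parts[1:]:
        is_script = len(sub) == 4 and sub.isalpha()
        is_region = (len(sub) == 2 and sub.isalpha()) or (
            len(sub) == 3 and sub.isdigit()
        )
        if is_script and result["script"] is None and result["region"] is None:
            result["script"] = sub.title()   # conventional casing, e.g. 'Deva'
        elif is_region and result["region"] is None:
            result["region"] = sub.upper()   # conventional casing, e.g. 'IN'
        else:
            break  # variants, extensions, etc. are out of scope for this sketch
    return result
```

The script subtag is what distinguishes the one-language-many-scripts cases listed earlier in the presentation: `parse_language_tag("ks-Arab-IN")` gives language `ks`, script `Arab`, region `IN`, whereas Kashmiri in Devanagari would be tagged `ks-Deva-IN`.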
31
Information Dissemination
TDIL portal and ILDC portal
On the ILDC portal, a user can:
  Request a Language Tools CD
  Register on the ILDC website
  Provide feedback and access the FAQ (Frequently Asked Questions)
  Get free downloads and software for Indian language tools
TDIL half-yearly journal: 16 issues published; accessible through the TDIL website.
32
Future Activities:
All the ongoing consortium mode projects, the National Roll-Out Plan project and the Specialized Manpower Development in Language Technology project will be continued.
Phase II of consortium mode projects: Phase II projects will be taken up in the areas of Machine Translation, Cross-lingual Information Access, Optical Character Recognition and OHWR. The systems developed in Phase I will be improved and expanded to other domains.
Technology development:
  (a) Speech technology: development of Automatic Speech Processing engines to be initiated for major Indian languages.
  (b) Basic research: basic research in the area of semantic web technology will be initiated.
Web Internationalization Initiative: Phase II of the project "Web Internationalization Initiative (WII)" is to be initiated with the objective of adequate representation of Indic scripts in the web technology standards being evolved by the World Wide Web Consortium (W3C).
33
Future Activities: Establishment of Data Centre at TDIL
Setting up of the Indian Language Data Centre and a Language Technology Demonstration Facility at DIT, upgrading of the ILDC and TDIL websites, and upgrading of the Language CDs and their distribution support will be undertaken.
Web Internationalization Initiative: Phase II of the Web Internationalization Initiative (WII) programme will be initiated to finalize Indian-language-specific inputs and recommendations for W3C web technology standards.
National Localization Research and Resource Centre (NLRRC): seeding activity for the National Localization Research and Resource Centre will be initiated.
34
Government, academia and industry together, to play globally and to serve locally, for making India a global multilingual computing hub
धन्यवाद ਧਨ੍ਯਵਾਦ ધન્યવાદ Thank You