Language Technologies for Multilingual Societies META-FORUM 2011, June 27/28, 2011, Budapest, Hungary Swaran Lata Director & Head, Technology Development.


Language Technologies for Multilingual Societies
META-FORUM 2011, June 27/28, 2011, Budapest, Hungary
Swaran Lata
Director & Head, Technology Development for Indian Languages Programme & Country Manager, W3C India
Govt. of India, 6 CGO Complex, Lodi Road, New Delhi
Meta forum

Diverse Multilinguality in India and its Complexity

Organization of my talk
Why and How the TDIL Programme got initiated
Important Milestones
  • Technology Development
  • Multilingual Standards
  • Proliferation
Lessons Learnt
Problems Arising out of Multilingualism
Funding vs. Long-term Goals
Potential for Collaboration

Multilingual and Multicultural India
Constitution of India (8th Schedule covers 22 Indian Languages)
  • Emphasis on planned development of Indian languages for use in all spheres of life.
  • Development and use of Indian Languages in all domains of national life to maintain linguistic and cultural diversity
  • Development of sustainable technologies to break linguistic barriers across diverse speech communities
  • Provide equal opportunities to citizens through the use of Information Technology
Official Languages Act 1963
  • Hindi as Official Language of the Republic of India
  • 15 Indian Languages (ILs) in the 8th Schedule
  • 3 ILs added in 1992 (Konkani, Manipuri and Nepali)
  • 4 ILs added in 2003 (Bodo, Maithili, Dogri and Santali)

Why and How the TDIL Programme got initiated
DoE (1976) → MIT (Year 2000) → MCIT = DIT + DoT (Year 2002)
  DoE: Department of Electronics
  MIT: Ministry of Information Technology
  MCIT: Ministry of Communication & Information Technology
  DIT: Department of Information Technology
  DoT: Department of Telecommunications
Technology Development Council (TDC), 1988 – 1991
  • Funded a project for development of the Devanagari Graphics and Intelligence based Script Technology (GIST) UNIX terminal at IIT Kanpur
  • Exploiting the phonetic correspondence of Indian languages, GIST was extended to other Indian Languages
  • GIST Card (PC add-on card) developed at CDAC Pune (Society set up in 1988)
  • Indian Standard Code for Information Interchange (ISCII) – BIS: (1991) – 8-bit encoding and keyboard layout standard covering 15 languages.

Technology Development for Indian Languages (TDIL) Programme: Milestones
Increase in Funding & Participation, Evolving Vision and Focus on Standards
Phases: Seeding Phase → Capacity Building Phase → Multilingual Technology Development → Future Roadmap
  • PoC research in Hindi and monolingual corpora building
  • Set-up of Resource Centres in each state; mentoring through existing projects
  • Consortium Mode: multiple institutional projects in MT, OCR, OHWR, CLIA & Speech
  • Multilingual resources development based on standards
  • Free BIPKs for 22 ILs
  • Major thrust on research in the Speech and Mobile area
  • Productization efforts
  • Standards for the Multilingual Web
  • Addressing language-specific bottlenecks
  • Localization initiatives

Growth of Language Technology Research Institutions

Machine Translation System [ – Consolidation]
Beta Deployments:
The English to Hindi Machine Translation System has been deployed in Parliament to translate the Parliament Proceedings.
  • Matching efforts in integrating the MT system into the organizational workflow and in training the staff
  • Improvement in quality and speed of the translation service
The English to Indian Languages Machine Translation System covers 3 Indian Languages (Hindi, Bengali, Malayalam) and is used to translate the voluminous course material of the Vocational Training Programme:
  • Reduces cost of translation by 30%
  • Saves human effort by more than 50%

Machine Translation System [ – Consolidation]
Machine Translation Systems: English to Indian Languages, 8 Language Pairs
The Machine Translation Systems have been made available through the TDIL Data Centre for feedback and improvement through crowdsourcing.

Machine Translation System [ – Consolidation]
Machine Translation Systems: Indian Languages to Indian Languages, 6 Language Pairs

Cross-lingual Information Access [since 2006]
Across six Indian Languages: Hindi, Marathi, Bengali, Punjabi, Tamil and Telugu; Tourism Domain
Index-based searching with pre-processing of the Indian Language query [Precision = 0.4 to 0.5].
UNL-based search tried in Tamil to compare the efficacy [Precision: index-based search = 0.42; UNL-based search = 0.59].
Next 3 years target:
  • Enhance precision to 0.7
  • Addition of 3 languages [Assamese, Odia, Gujarati]
Beta trial proposed on an existing search engine.
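The precision figures on this slide follow the usual retrieval definition: the fraction of retrieved documents that are relevant. A minimal sketch of that arithmetic is given here; the document IDs and the relevance set are hypothetical, chosen only for illustration, and are not data from the CLIA evaluation.

```python
def precision(retrieved, relevant):
    """Fraction of retrieved documents that are relevant."""
    if not retrieved:
        return 0.0
    relevant = set(relevant)
    hits = sum(1 for doc in retrieved if doc in relevant)
    return hits / len(retrieved)


# Hypothetical result list for one query: 10 documents returned.
retrieved = [f"doc{i}" for i in range(10)]
# Hypothetical relevance judgments; doc42 is relevant but was not retrieved.
relevant = {"doc0", "doc3", "doc5", "doc8", "doc42"}

print(precision(retrieved, relevant))  # 4 relevant out of 10 retrieved -> 0.4
```

On this measure, the stated target of 0.7 would mean seven relevant documents in every ten returned.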

Optical Character Recognition [since 2006]
11 Indian Scripts
Accuracy: character level 97%; word level 80-85%
Working on printed documents between
Response time: 3-4 minutes
Next 3 years target:
  • Word-level accuracy > 90%
  • Handling bi-lingual documents [IL + English]
  • Multi-column layout support
  • Post-correction tools
  • Braille interface development and deployment for Indian language book publishing
  • On-line OCR service through the TDIL Data Centre
  • Deployment at a historical library
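The gap between the character-level and word-level figures can be made plausible with a rough calculation. The sketch below assumes that character errors are independent, which is a simplification introduced here and not a claim about the actual OCR engines. Under that assumption a word of n characters is fully correct with probability 0.97^n.

```python
def expected_word_accuracy(char_accuracy, word_length):
    """Probability that every character in a word is correct,
    assuming independent character errors (a simplifying assumption)."""
    return char_accuracy ** word_length


for n in (5, 6, 7):
    print(n, round(expected_word_accuracy(0.97, n), 3))
# 5 0.859
# 6 0.833
# 7 0.808
```

For words of five to seven characters this lands in the 80-85% band reported on the slide, which suggests why raising word-level accuracy above 90% requires character accuracy well above 97% or post-correction.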

On-line Handwriting Recognition System [OHWR] - since 2006
Across six Indian Languages: Devanagari, Kannada, Malayalam, Bengali, Tamil and Telugu; SDK developed
  • Stroke level: 95%
  • Character level: 84%
Census data collection stored as a Unicode database
Next 3 years target:
  • Achieve complete coverage of conjuncts & complex characters, Nukta characters
  • Integration with TTS and deployment for the speech impaired
  • Addition of new languages [Assamese, Urdu, Marathi, Manipuri, Bodo]
Beta trial proposed on an existing search engine.

Text-to-Speech in Indian Languages [since 2006]
Based on the Festvox framework
TTS engine integrated with NVDA (Windows) and ORCA (Linux) screen readers
Mean Opinion Score: Hindi 3.2; Bengali, Marathi, Telugu, Tamil, Malayalam: ~3.0
Training of visually challenged persons on screen readers.
Next 3 years target:
  • Improvement of the MOS of the TTS engine up to 3.8 – 4.0
  • TTS engine for Indian Languages for mobile Android platforms
  • Addition of 5 new Indian Languages – Odia, Gujarati, Assamese, Bodo
  • Proof of concept for adaptation to one Hindi dialect.
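A Mean Opinion Score is the arithmetic mean of listener ratings on a 1 (bad) to 5 (excellent) scale. A minimal sketch follows; the ratings are hypothetical and chosen only so that the arithmetic is easy to follow, not taken from the TDIL evaluation.

```python
from statistics import mean


def mean_opinion_score(ratings):
    """Average of listener ratings on the 1-5 opinion scale."""
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie between 1 and 5")
    return mean(ratings)


# Hypothetical ratings from five listeners for one synthesized sentence.
ratings = [3, 4, 3, 3, 3]
print(mean_opinion_score(ratings))  # 16 / 5 = 3.2
```

In practice scores are averaged over many listeners and many sentences, so moving from about 3.0 to the 3.8 – 4.0 target means most listeners rating the output as "good" rather than "fair".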

ORCA Screen Reader integrated with IL TTS

Multilingual Standards – Multi stake holders
  Encoding: UNICODE, ISO
  Web Content, Architecture and Web Based Services: W3C
  Language Tag, Ref Glyph Set, Keyboard: ISO
  Locale Data: UNICODE
  Linguistic Resources, Tools and Evaluation: ELRA, NIST, LDC
  Internet Protocol and Domain Name: ICANN, IANA, IETF, ISOC

Standardization Activity for Indian Languages
[Chart: No. of Languages/Standards components by Year]
W3C work initiated in 5 areas: Internationalization, CSS, Mobile Web, E-Gov and Speech

UNICODE
Completed for 22 Official Indian Languages and Vedic Sanskrit - Unicode 6.0
Devanagari, Bengali, Malayalam

Enabling of New Rupee Symbol in ICT environment [Govt. Notification in July 2010]
Encoding
  • Included in Unicode 6.0 – Code Point 20B9 [August 2010]
  • Included in ISO [Oct 2010]
  • Included in ISCII – Notification issued by BIS
Keyboard
  • Key combination: CTRL + ALT + 4 or AltGr + 4
  • Consensus by all stakeholders and major industry players
  • ISO Notification issued by BIS [Dec 1, 2010]
  • Software patches released by Microsoft, Redhat, C-DAC [April 2011]
Fonts
  • Sakal-Bharti font for New Rupee Symbol
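The code point on this slide can be checked from any Unicode-aware environment. The short sketch below uses only Python's standard library to look up U+20B9 and show its UTF-8 byte sequence; whether the glyph renders correctly still depends on the installed fonts.

```python
import unicodedata

RUPEE = "\u20b9"  # code point 20B9, added in Unicode 6.0

# Print the character, its code point and its official Unicode name.
print(RUPEE, f"U+{ord(RUPEE):04X}", unicodedata.name(RUPEE))
# U+20B9 INDIAN RUPEE SIGN

# In UTF-8 the symbol occupies three bytes.
print(RUPEE.encode("utf-8"))
# b'\xe2\x82\xb9'
```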

CLDR: Common Locale Data Repository
  • Completed in 9 Indian Languages - included in CLDR 2.0
  • Work for the rest of the Indian Languages in progress for their inclusion in the next version of CLDR
  • Most of the changes suggested by Govt. of India accepted by the Unicode Consortium.
[Screenshots: CLDR Hindi update; CLDR Bengali update]

Web Standards - W3C
(W3C Standard / Work Initiated / Progress So Far)

Cascading Style Sheet (CSS)
  Work initiated:
    • Hindi listing submitted to W3C
    • Akshara definition for Indic languages: requirements of text segmentation of the CSS specification
  Progress so far: Detailed testing of CSS 2.1 underway

Pronunciation Lexicon Specification (PLS) and Speech Synthesis Markup Language (SSML)
  Work initiated:
    • Reference phoneme set development
    • IPA verification in Indic languages
    • Acoustic-phonetic analysis
  Progress so far:
    • Initiated for Hindi, Bengali, Punjabi
    • IPA verification for Bengali completed

Mobile Web
  Work initiated:
    • Gap analysis for Mobile Web in Indian Languages
    • Mobile fonts and rasterization engine in Indic languages
    • Mobile OK Checker
  Progress so far: Proposed to work with Telecom Centres of Excellence in India and mobile industry associations

E-Gov Best Practices
  Work initiated:
    • Internationalization best practices for Indic languages
  Progress so far: Draft developed and under finalization.

Web Accessibility
  Work initiated: Adoption of the W3C WCAG 2.0 standard in India
  Progress so far: Incorporation of WCAG 2.0 into the National Electronic Accessibility Policy.
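To illustrate why an akshara definition matters for text segmentation, here is a deliberately simplified sketch for Devanagari. It is not the definition submitted to W3C. It assumes that combining marks attach to the preceding cluster and that a virama joins the following consonant into a conjunct; it ignores ZWJ/ZWNJ, other scripts and non-letter characters. The function name `split_aksharas` is made up for this example.

```python
import unicodedata

VIRAMA = "\u094d"  # Devanagari virama (halant)


def split_aksharas(text):
    """Group Devanagari code points into approximate aksharas.

    Simplified rule (an assumption for illustration): a character extends
    the current cluster if it is a combining mark, or if the cluster so
    far ends in a virama; otherwise it starts a new cluster.
    """
    clusters = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        joins_previous = bool(clusters) and clusters[-1].endswith(VIRAMA)
        if clusters and (is_mark or joins_previous):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


word = "\u0915\u094d\u0937\u0924\u094d\u0930\u093f\u092f"  # क्षत्रिय
print(len(word), split_aksharas(word))
# 8 ['क्ष', 'त्रि', 'य']
```

Eight code points collapse into three aksharas, which is the unit a reader perceives. Operations such as first-letter styling or line breaking that work on single code points would split conjuncts apart.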

Lessons Learnt
Language Resource Development
  • Copyright issues
  • Standardization of metadata and tag sets
  • Language specificities
  • Validation vs. time and cost investment
  • Investment in semantic and syntactic resources like WordNet, treebanks etc. respectively
Language Independent Methodologies
  • Core technology development engine identification
  • Availability of researchers and scientific manpower
  • Domain selection
  • Limited technology institutions

Lessons Learnt
Leadership Issues
  • Computer Science experts vs. Linguistic experts
  • Multi-institutional consortia project leadership
  • Development plan vs. budget plan vs. national five-year plan
  • Researchers in academia
Language dependent planning
  • Language selection criteria
  • Participation of State Language Departments
  • Availability of institutions
  • Availability of linguistic and language experts

Lessons Learnt
Standardization Issues
  • Development-level standards
  • Third-party testing
  • Software engineering practices
  • Use case scenarios
  • Integration issues
Other Issues
  • User involvement
  • Limited deployments
  • Models for proliferation
  • Lab to pilot to commercial
  • Divergent requirements of GenX and non-ICT communities

Problems Arising from Multilingualism
Multiple language speakers (native language, Hindi and English)
English is the default language of official communication and higher education, and is also a spoken language in urban and semi-urban areas
Orthographic complexity
  • Tamil has fewer letters
  • Conjunct and agglutination problems
  • Reforms in orthography
Spoken language issues:
  • Phonetic variation among Indian languages
  • Variation of Hindi spoken in 7 to 8 states
  • Dialect variation (Awadhi, Bhojpuri, Khadi boli, Braj Bhasha etc.)
The paradigm shift to statistical approaches:
  • Huge amount of speech corpora capturing dialect variation
  • Parallel text corpora and other language resources
  • Interfacing from multilingual language resources
  • Cross-lingual access

Funding vs. Long Term Goals
[Chart phases: Expl seeding, Capacity Building, Multilingual consortia, Social impact]

Funding vs. Long Term Goals
  • The graphs indicate that optimal funding is available
  • Language activities have crossed the threshold
  • Next plan (12th): higher allocation of resources targeted
  • More language groups need to be funded in each state, with special focus on small language resources
  • Multiple script issues
Time frame: future challenges for five years
  • Replication of successful technology development for newer languages
  • Improvement of language technologies:
      • Improve accuracy to bring it to a usable level
      • Productization efforts
      • Porting efforts on mobile platforms
      • Providing services on cloud-based platforms
  • Strategies for social impact

Potential for Cooperation
Enhancement and adaptation of engines like Sphinx, Festival, HTS, Nutch, HarfBuzz, FreeType etc. to bring a paradigm shift in development from Latin-centric to multilingual-centric.
Pilot projects to try methodologies applied for Indian languages on European languages and vice versa
  • The Angla-Bharati English to Indian languages MT framework may be tried for English to other European languages
  • Replicating European localization models for taking localization technologies to users in India.
  • Cross-lingual Information Retrieval between Indian Languages and European Languages.
  • Collaborative effort on speech technology development in Indo-EU languages: new research frontiers in speech modeling, speech recognition grammar, phonetic search.
  • Speech enabling of mobile devices in Indo-EU languages, involving the mobile manufacturers, and innovative product development for mass market applications
Linguistic resource sharing for research purposes.
Language technology evaluation models in Indian language technology / product / solutions based on successful European models

Thanks & Questions ক ક क ಕ കൂ क କ ਕ క గ ક ಕ କ ਕ ক क ક గ ಕ ಕ