Indo-US Workshop, June 25, 2003 XML-Unicode environment for creating and accessing of Indian language theses: Vidyanidhi experiences Shalini R. Urs Vidyanidhi.

XML-Unicode environment for creating and accessing of Indian language theses: Vidyanidhi experiences
Shalini R. Urs
Vidyanidhi Digital Library, University of Mysore, Mysore, India

Vidyanidhi Digital Library
– Vidyanidhi began as a pilot project in 2000
– Supported by NISSAT, DSIR, GOI
– The objective was to demonstrate the feasibility of an Electronic Thesis and Dissertation (ETD) initiative in the Indian context
– It is now evolving into a national effort
– Supported by the Ford Foundation

Vidyanidhi: Vision
To evolve into an information infrastructure that strengthens the research capacities of Indian universities by:
– Developing accessible digital libraries of theses and dissertations
– Sensitizing and training doctoral research students in scholarly writing, e-publishing and ETDs
– Developing appropriate policies
– Developing or making available the requisite tools and resources

Vidyanidhi: Strategies
– Policy framework: through meetings, liaison, participation
– Education and training
– Content building: full text and metadata
– Resources and tools (software, interfaces…)

Indian Academic Research Output
– Large system of higher education
– More than 300 universities: a reservoir of extensive doctoral research work
– Doctoral research output: around 30,000 theses annually
– English is the predominant language
– Increasing vernacularisation: 20-25% of the output is in Indian languages
– This trend is growing, resulting in more and more research output in Indian languages

Language Interoperability
– The Vidyanidhi approach has been guided by the language interoperability factor
– Our choice of technology and tools has to be interoperable across languages

Indian Languages: Diversity
– The rich diversity of Indian languages and scripts is simply overwhelming
– India is made up of a number of separate linguistic communities, each of which shares a common language and culture
– Number of languages listed for India: […]; of these, […] are living languages and 11 are extinct
– Many languages have no script of their own

Eighteen Indian languages
Assamese, Gujarati, Kashmiri, Malayalam, Marathi, Oriya, Punjabi, Sindhi, Telugu,
Bengali, Hindi, Kannada, Konkani, Manipuri, Nepali, Sanskrit, Tamil, Urdu

Language Families of Indian Languages
– Indo-European: North and Central India
– Dravidian: South India
– Mon-Khmer: Assam and some eastern parts of India
– Sino-Tibetan: Northern Himalayan and Burmese border area

Indian Scripts
– Interestingly, though the languages belong to four different language groups, Indian scripts have a common root/origin
– The scripts of all Indian languages are derived from Brahmi
– As a result, there is greater uniformity in the arrangement of the alphabets

Indian Alphabet: Characteristics
– Consonants
  – Five vargs (groups)
  – Non-varg
  – Have an implicit vowel
– Anuswar (a nasal consonant)
– Chandrabindu (a nasalisation sign)
– Visarg
– Vowels and vowel signs
– Vowel omission sign (halant)
– Conjuncts
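As an illustration of how these elements appear in Unicode text, the following minimal Python sketch builds a Devanagari conjunct from two consonants joined by the halant (named VIRAMA in Unicode) and a syllable from a consonant plus a vowel sign. The sample characters and the helper name `describe` are chosen for illustration only and do not come from the presentation.

```python
import unicodedata

# Devanagari conjunct: consonant KA + halant (VIRAMA) + consonant SSA.
KA, VIRAMA, SSA = "\u0915", "\u094D", "\u0937"
conjunct = KA + VIRAMA + SSA

# A consonant followed by a dependent vowel sign (matra): KA + VOWEL SIGN I.
syllable = KA + "\u093F"


def describe(text):
    """Return (code point, Unicode character name) pairs for each character."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


for code, name in describe(conjunct):
    print(code, name)

# The conjunct is displayed as one shape but stored as three code points;
# the syllable is stored as two.
print(len(conjunct), len(syllable))
```

The point of the sketch is that conjuncts need no code points of their own: the rendering engine forms the combined shape from the stored sequence.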

Indian Languages and Scripts
– Indic scripts are syllable oriented: phonetically based, with imprecise character sets
– The different scripts look different (different shapes) but have very similar, yet subtly different, alphabet bases and script grammars

Indian Languages and Scripts: Issues
– Indic characters consist of consonants, vowels, dependent vowels (called 'matras'), or a combination of any or all of them, called conjuncts
– Collation (sorting) is a contentious issue because the script is phonetically based rather than alphabet based

Handling Indian Languages: Possible approaches
– Transliteration
– Glyph-based approach
  – Indic characters are encoded in either ASCII or some other proprietary encoding
  – Glyph technologies are used to display and print Indic scripts
  – Currently the most popular approach for desktop publishing

Handling Indian Languages: Possible approaches
– Develop an encoding system for all the possible characters and combinations, running to nearly 13,000 characters in each language, with the possibility that a new combination leads to a new character. This approach was developed and adopted by the IIT Madras development team
– Adopt the ISCII/Unicode encoding

ISCII: Indian Script Code for Information Interchange
– ISCII-91 is a BIS standard, IS 13194:1991
– An outcome of the efforts of the Govt. of India, DOE, MIT, C-DAC and many other institutions
– It is an 8-bit code
– It is an extension of the 7-bit ASCII code
– The top 128 code positions cater to the 10 Indian scripts

Unicode
– The Unicode Consortium has encoded all of the world's scripts
– Unicode represents a carefully thought out, technically impressive and full-featured attempt at encoding Indic scripts
– Unicode has unique code points for all of the Indic scripts

Script       Unicode Range       Major Languages
Devanagari   U+0900 to U+097F    Hindi, Marathi, Sanskrit
Bengali      U+0980 to U+09FF    Bengali, Assamese
Gurmukhi     U+0A00 to U+0A7F    Punjabi
Gujarati     U+0A80 to U+0AFF    Gujarati
Oriya        U+0B00 to U+0B7F    Oriya
Tamil        U+0B80 to U+0BFF    Tamil
Telugu       U+0C00 to U+0C7F    Telugu
Kannada      U+0C80 to U+0CFF    Kannada
Malayalam    U+0D00 to U+0D7F    Malayalam
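Because each script occupies its own block, the script of a character can be read off its code point. The sketch below shows one way to do this in Python using the ranges listed above; the names `INDIC_BLOCKS`, `script_of` and `scripts_in` are hypothetical helpers, not part of Vidyanidhi.

```python
# Unicode block ranges for the nine Indic scripts listed in the table.
INDIC_BLOCKS = [
    ("Devanagari", 0x0900, 0x097F),
    ("Bengali", 0x0980, 0x09FF),
    ("Gurmukhi", 0x0A00, 0x0A7F),
    ("Gujarati", 0x0A80, 0x0AFF),
    ("Oriya", 0x0B00, 0x0B7F),
    ("Tamil", 0x0B80, 0x0BFF),
    ("Telugu", 0x0C00, 0x0C7F),
    ("Kannada", 0x0C80, 0x0CFF),
    ("Malayalam", 0x0D00, 0x0D7F),
]


def script_of(ch):
    """Return the Indic script whose block contains ch, or None."""
    cp = ord(ch)
    for name, start, end in INDIC_BLOCKS:
        if start <= cp <= end:
            return name
    return None


def scripts_in(text):
    """Return the set of Indic scripts used anywhere in text."""
    return {s for s in map(script_of, text) if s is not None}


# Example: a string mixing Devanagari, Kannada and Roman characters.
print(scripts_in("\u0915 \u0C95 abc"))
```

A check of this kind could, for example, decide which language-specific table a record belongs in, though the presentation does not say how Vidyanidhi makes that decision.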

Unicode implementation for Indic scripts
– Despite its robustness, technical soundness and practical viability, Unicode implementation for Indic scripts is almost nonexistent
– Our search of the major databases (LISA, INSPEC, WOS) did not turn up any initiative in this direction
– Vidyanidhi is an example of a successful implementation of Unicode for Indic scripts

Vidyanidhi approaches
Taking Indian language theses to the Web
– Full text
– Metadata

Template for thesis in MS Word
– Student submits thesis in Word
– Convert to XML using the RTF to XML converter
– MS Word to XML
– Take them to the Web

Full Text
– Vidyanidhi provides tools for the creation of theses in Indian languages
– Our approach is to provide a style sheet/template online
– When the thesis is submitted, it is converted into XML encoded in Unicode
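To make "XML encoded in Unicode" concrete, here is a minimal Python sketch that writes a small record containing Kannada text as UTF-8 XML and reads it back. It does not reproduce the RTF to XML converter used by Vidyanidhi; the element names (`thesis`, `title`, `author`), the function `thesis_to_xml` and the sample values are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def thesis_to_xml(title, author, language):
    """Serialize a minimal thesis record as UTF-8 encoded XML bytes.

    The element names are hypothetical, not Vidyanidhi's actual schema.
    """
    root = ET.Element("thesis", attrib={"lang": language})
    ET.SubElement(root, "title").text = title
    ET.SubElement(root, "author").text = author
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)


# Sample record with a Kannada-script title (illustrative values only).
xml_bytes = thesis_to_xml("ಕನ್ನಡ ಸಾಹಿತ್ಯ", "A. Author", "kn")

# The vernacular text survives the round trip unchanged.
parsed = ET.fromstring(xml_bytes)
print(parsed.find("title").text)
```

Since the text is stored as Unicode code points rather than font-specific glyph codes, any Unicode-aware tool can read the record without the original font.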

Vidyanidhi database: approach
– Each script/language will have one table
– Currently there are three separate tables for the three scripts: one each for Roman, Hindi (Devanagari) and Kannada
– Theses in Indic languages will have two records: one in the Roman script (transliterated) and the other in the vernacular
– Theses in English will have only one record (in English)

Vidyanidhi database: approach
– The two records are linked by the ThesisID number, a unique ID for the record
– The bibliographic description in Vidyanidhi follows the ThesisMS Dublin Core standard adopted by the NDLTD and OCLC
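The one-table-per-script layout with records linked by ThesisID can be sketched as follows. This is only an illustration: it uses Python's built-in sqlite3 module rather than the MS SQL 2000 server named in the presentation, and the table names, column names and sample rows are all assumptions.

```python
import sqlite3

# In-memory database standing in for the real server (an assumption).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE roman (
        thesis_id INTEGER PRIMARY KEY,
        title     TEXT NOT NULL,
        language  TEXT NOT NULL
    );
    CREATE TABLE kannada (
        thesis_id INTEGER PRIMARY KEY REFERENCES roman(thesis_id),
        title     TEXT NOT NULL
    );
    """
)

# A Kannada thesis has two records: transliterated and vernacular.
conn.execute("INSERT INTO roman VALUES (1, 'Kannada sahitya', 'Kannada')")
conn.execute("INSERT INTO kannada VALUES (1, 'ಕನ್ನಡ ಸಾಹಿತ್ಯ')")
# An English thesis has only the Roman-script record.
conn.execute(
    "INSERT INTO roman VALUES (2, 'A study of digital libraries', 'English')"
)


def lookup(thesis_id):
    """Return (roman_title, vernacular_title or None) for one thesis."""
    return conn.execute(
        "SELECT r.title, k.title FROM roman AS r "
        "LEFT JOIN kannada AS k ON k.thesis_id = r.thesis_id "
        "WHERE r.thesis_id = ?",
        (thesis_id,),
    ).fetchone()


print(lookup(1))
print(lookup(2))
```

The left join mirrors the rule stated above: every thesis appears in the Roman-script table, and a vernacular record is present only for theses written in an Indic language.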

Vidyanidhi: Platform
– Microsoft Windows XP supports all 10 Indic scripts
– Uses Windows glyph processing:
  – OpenType font format
  – Uniscribe, the Unicode Script Processor
  – OpenType Layout Services library

Vidyanidhi: Platform
– MS SQL 2000
  – A truly multilingual-capable SQL
  – Achieves satisfactory collation
– Front end: ASP
– JavaScript

Vidyanidhi: Accessing and Searching
One can search the Vidyanidhi database in either of two ways:
– In English (Roman script): the integrated (master) database has metadata records for theses in all languages
– In the vernacular: the vernacular database has records of the specific language only

Two approaches: differences
– One affords search in the English language and the other in the vernacular
– The first approach shows records in Roman script for all theses in the search output, that is, all theses that satisfy the conditions of the query, and offers the option of viewing records in the vernacular script for theses written in the vernacular

The second approach enables one to search only the vernacular database and is thus limited to records in that language. However, it allows the search to be made in the vernacular language and script.

Unicode and Indic Scripts
The Vidyanidhi implementation dispels certain misconceptions about Unicode. Supposed problems:
– Data input
– Display and printing
– Collation

Data input/Keyboard layout
Our test bed and comparison with other methods:
– The Unicode layout is as easy as the others in terms of speed
– In terms of number of keystrokes there is no difference, and sometimes the Unicode method involves fewer keystrokes
– Data input was almost comparable to English records in terms of productivity

Display and Printing
It is fairly satisfactory except for a few issues/problem areas:
– Handling of certain conjuncts
– Inability to display a non-terminating pure consonant
– Limited choice of font types
Unicode can handle conjunct clusters of four consonants

Collation issues: some observations
– Consensus with respect to Indic scripts is hard to come by
– Differences of opinion are not uncommon, as Indic languages are a cross between syllabic and phonemic writing systems
– Collation according to phonetic order would be different from alphabetic order

Collation Issues
– Some of the disorder stems from the common script base and order shared by all Indic scripts
– Despite their strong similarity, Indic scripts differ in the number and arrangement of consonants and vowels

Collation by Unicode
Given the above collation problems, the collation achieved by Unicode is fairly satisfactory and compares very well with Nudi, a more popular font-based software package.
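One reason Unicode-based collation can come out reasonably well is that each Indic block lists its consonants in the traditional varg order, so even a plain code point sort respects that order for simple words. The Python sketch below demonstrates this with four Devanagari words. It is an approximation for illustration only, not the collation performed by the database server used in Vidyanidhi, and the sample words are chosen arbitrarily.

```python
# Four Devanagari words beginning with GA, KA, KHA and GHA respectively.
words = ["गगन", "कमल", "खग", "घर"]


def code_point_key(word):
    """Sort key: the list of Unicode code points in the word."""
    return [ord(ch) for ch in word]


by_code_point = sorted(words, key=code_point_key)

# Show the first letter of each word with its code point, in sorted order.
for word in by_code_point:
    print(word, f"U+{ord(word[0]):04X}")
```

The first letters come out in the order U+0915, U+0916, U+0917, U+0918, which matches the traditional sequence of the first varg. Cases involving vowel signs, the anuswar or conjuncts are where a dedicated collation scheme is needed, which is the contentious area described in the preceding slides.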

Conclusion
Unicode admirably handles the challenges of a multi-language, multi-script database implementation, despite the complexity and minutiae of a family of Indian languages and scripts with strong commonalities and faint distinctions among themselves.
