
1 Indo-US Workshop, June 25, 2003
XML-Unicode environment for creating and accessing of Indian language theses: Vidyanidhi experiences
Shalini R. Urs
Vidyanidhi Digital Library, University of Mysore, Mysore, India
shalini@vidyanidhi.org.in

2 Vidyanidhi Digital Library
- Vidyanidhi began as a pilot project in 2000, supported by the NISSAT, DSIR, GOI
- The objective was to demonstrate the feasibility of an Electronic Thesis and Dissertation (ETD) initiative in the Indian context
- It is now evolving into a national effort, supported by the Ford Foundation

3 Vidyanidhi: Vision
To evolve into an information infrastructure that strengthens the research capacities of Indian universities by:
- Developing accessible digital libraries of theses and dissertations
- Sensitizing and training doctoral research students in scholarly writing, e-publishing and ETDs
- Developing appropriate policies
- Developing or making available the requisite tools and resources

4 Vidyanidhi: Strategies
- Policy framework, through meetings, liaison and participation
- Education and training
- Content building: full text and metadata
- Resources and tools (software, interfaces…)

5 Indian Academic Research Output
- Large system of higher education
- More than 300 universities, a reservoir of extensive doctoral research work
- Doctoral research output: around 30,000 annually
- English is the predominant language
- Increasing vernacularisation: 20-25% in Indian languages
- This trend is growing, resulting in more and more research output in Indian languages

6 Language Interoperability
- The Vidyanidhi approach has been guided by the language interoperability factor
- Our choice of technology and tools has to be interoperable across languages

7 Indian Languages: Diversity
- The diversity of Indian languages and scripts is overwhelming
- India is made up of a number of separate linguistic communities, each of which shares a common language and culture
- The number of languages listed for India is 418: 407 are living languages and 11 are extinct
- Many languages have no script of their own

8 Eighteen Indian languages
Assamese, Bengali, Gujarati, Hindi, Kannada, Kashmiri, Konkani, Malayalam, Manipuri, Marathi, Nepali, Oriya, Punjabi, Sanskrit, Sindhi, Tamil, Telugu, Urdu

9 Language Families of Indian Languages
- Indo-European: North and Central India
- Dravidian: South India
- Mon-Khmer: Assam and some eastern parts of India
- Sino-Tibetan: Northern Himalayan and Burmese border area

10 Indian Scripts
- Interestingly, though the languages belong to four different language families, Indian scripts have a common origin
- The scripts of all Indian languages are derived from Brahmi
- Greater uniformity in the arrangement of alphabets


12 Indian Alphabet: Characteristics
- Consonants
  - Five vargs (groups)
  - Non-varg
  - Have an implicit vowel
- Anuswar (a nasal consonant)
- Chandrabindu (a nasalisation sign)
- Visarg
- Vowels and vowel signs
- Vowel omission sign (halant)
- Conjuncts
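As a small illustration of the characteristics listed on this slide, the following Python sketch builds a consonant plus vowel sign and a conjunct from Devanagari code points. The choice of letters and the helper name `describe` are assumptions made here for illustration; they do not come from the slides.

```python
import unicodedata

KA = "\u0915"            # DEVANAGARI LETTER KA (a consonant with an implicit vowel)
SSA = "\u0937"           # DEVANAGARI LETTER SSA
HALANT = "\u094D"        # DEVANAGARI SIGN VIRAMA, the vowel omission sign (halant)
VOWEL_SIGN_I = "\u093F"  # DEVANAGARI VOWEL SIGN I, a dependent vowel sign


def describe(text):
    """Return the Unicode character name of every code point in text."""
    return [unicodedata.name(ch) for ch in text]


# Consonant + vowel sign: the sign replaces the implicit vowel.
ki = KA + VOWEL_SIGN_I

# Consonant + halant + consonant: the halant drops the implicit vowel of the
# first consonant, and a rendering engine may then draw the pair as a conjunct.
conjunct = KA + HALANT + SSA

print(describe(ki))
print(describe(conjunct))
print(len(conjunct))  # number of stored code points, not of displayed glyphs
```

The stored text is always a sequence of simple code points; forming the conjunct shape is left to the font and the script processor.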

13 Indian Languages and scripts
- Indic scripts are syllable-oriented and phonetically based, with imprecise character sets
- The scripts look different (different shapes) but have very similar, yet subtly different, alphabet bases and script grammars

14 Indian Languages and scripts: Issues
- Indic characters consist of consonants, vowels, dependent vowels (called 'matras'), or combinations of any or all of them, called conjuncts
- Collation (sorting) is a contentious issue because the script is phonetically based rather than alphabet based

15 Handling Indian Languages: Possible approaches
Transliteration - glyph-based approach
- Indic characters are encoded in either ASCII or some other proprietary encoding
- Glyph technologies are used to display and print Indic scripts
- Currently the most popular approach for desktop publishing

16 Handling Indian Languages: Possible approaches
- Develop an encoding system for all the possible characters and combinations, running to nearly 13,000 characters in each language, with the possibility that a new combination leads to a new character. This approach was developed and adopted by the IIT Madras development team
- Adopt the ISCII/Unicode encoding

17 ISCII: Indian Script Code for Information Interchange
- ISCII-91 is a BIS standard, IS 13194:1991
- An outcome of the efforts of the Govt. of India, DOE, MIT, C-DAC and many other institutions
- Is an 8-bit code
- Is an extension of the 7-bit ASCII code
- The top 128 characters cater to the 10 Indian scripts

18 Unicode
- The Unicode Consortium has encoded all of the world’s scripts
- Unicode represents a carefully thought out, technically impressive and full-featured attempt at encoding Indic scripts
- Unicode has unique code points for all of the Indic scripts

19
Script       Unicode range       Major languages
Devanagari   U+0900 to U+097F    Hindi, Marathi, Sanskrit
Bengali      U+0980 to U+09FF    Bengali, Assamese
Gurmukhi     U+0A00 to U+0A7F    Punjabi
Gujarati     U+0A80 to U+0AFF    Gujarati
Oriya        U+0B00 to U+0B7F    Oriya
Tamil        U+0B80 to U+0BFF    Tamil
Telugu       U+0C00 to U+0C7F    Telugu
Kannada      U+0C80 to U+0CFF    Kannada
Malayalam    U+0D00 to U+0D7F    Malayalam
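Because each script occupies its own contiguous block, the script of a character can be read off its code point. The sketch below is one way to do that in Python using only the ranges from the table on this slide; the names `INDIC_BLOCKS` and `script_of` are hypothetical and chosen here for illustration.

```python
# Inclusive code point bounds, copied from the table on this slide.
INDIC_BLOCKS = [
    ("Devanagari", 0x0900, 0x097F),
    ("Bengali", 0x0980, 0x09FF),
    ("Gurmukhi", 0x0A00, 0x0A7F),
    ("Gujarati", 0x0A80, 0x0AFF),
    ("Oriya", 0x0B00, 0x0B7F),
    ("Tamil", 0x0B80, 0x0BFF),
    ("Telugu", 0x0C00, 0x0C7F),
    ("Kannada", 0x0C80, 0x0CFF),
    ("Malayalam", 0x0D00, 0x0D7F),
]


def script_of(ch):
    """Return the Indic script whose block contains ch, or None if none does."""
    cp = ord(ch)
    for name, low, high in INDIC_BLOCKS:
        if low <= cp <= high:
            return name
    return None


print(script_of("\u0C95"))  # U+0C95 is KANNADA LETTER KA
print(script_of("\u0915"))  # U+0915 is DEVANAGARI LETTER KA
print(script_of("A"))       # a Roman letter falls outside every block listed
```

A check like this could, for instance, decide which script-specific table a metadata record belongs in, though the slides do not say how Vidyanidhi made that decision.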

20 Unicode implementation for Indic scripts
- Despite its robustness, technical soundness and practical viability, Unicode implementation for Indic scripts is almost non-existent
- Our search of the major databases (LISA, INSPEC, WOS) did not turn up any initiative in this direction
- Vidyanidhi is an example of a successful implementation of Unicode for Indic scripts

21 Vidyanidhi approaches
Taking Indian language theses to the Web
- Full text
- Metadata

22 Template for thesis in MS Word
- Student submits the thesis in Word
- Convert to XML using the RTF to XML converter (MS Word to XML)
- Take the theses to the Web

23 Full Text
- Vidyanidhi provides tools for the creation of theses in Indian languages
- Our approach is to provide a style sheet/template online
- When the thesis is submitted, it is converted into XML encoded in Unicode
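The slides do not describe the converter or the XML schema. As a minimal sketch of what "XML encoded in Unicode" can look like, the following Python example writes Kannada text into a UTF-8 XML document with the standard library. The element names (`thesis`, `title`, `body`, `p`), the `lang` attribute and the function `thesis_to_xml` are all assumptions made for illustration, and `xml_declaration` requires Python 3.8 or later.

```python
import xml.etree.ElementTree as ET


def thesis_to_xml(title, paragraphs, lang):
    """Serialize a thesis title and paragraphs as UTF-8 encoded XML bytes."""
    root = ET.Element("thesis", {"lang": lang})  # hypothetical element names
    ET.SubElement(root, "title").text = title
    body = ET.SubElement(root, "body")
    for para in paragraphs:
        ET.SubElement(body, "p").text = para
    # Indic characters are written as UTF-8 bytes, not as numeric references.
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)


kannada_title = "\u0C95\u0CA8\u0CCD\u0CA8\u0CA1"  # the word "Kannada" in Kannada script
data = thesis_to_xml(kannada_title, [kannada_title], "kn")

# Round trip: parsing the bytes gives back the same Unicode string.
parsed = ET.fromstring(data)
print(parsed.find("title").text == kannada_title)
```

Since the document carries its own encoding declaration, any Unicode-aware XML tool can read it without a script-specific font encoding.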


36 Vidyanidhi database: approach…
- Each script/language will have one table
- Currently there are three separate tables for the three scripts: one each for Roman, Hindi (Devanagari) and Kannada
- Theses in Indic languages will have two records: one in the Roman script (transliterated) and the other in the vernacular
- Theses in English will have only one record (in English)

37 Vidyanidhi database: approach…
- The two records are linked by the ThesisID number, a unique ID for the record
- The bibliographic description used by Vidyanidhi follows the ThesisMS Dublin Core standard adopted by the NDLTD and OCLC
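The table-per-script layout with records linked by ThesisID can be sketched as follows. This is an illustration only: it uses Python's built-in `sqlite3` as a stand-in for the MS SQL 2000 system named in the slides, and the table names, column names and sample titles are invented here.

```python
import sqlite3

ROMAN_TITLE = "Kannada sahitya"   # invented sample record, transliterated
KANNADA_TITLE = "ಕನ್ನಡ ಸಾಹಿತ್ಯ"      # the same invented title in Kannada script

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE roman_theses (thesis_id INTEGER PRIMARY KEY, title TEXT, language TEXT);
    CREATE TABLE kannada_theses (thesis_id INTEGER PRIMARY KEY, title TEXT);
""")

# A Kannada thesis gets two records that share one ThesisID.
conn.execute("INSERT INTO roman_theses VALUES (?, ?, ?)", (1, ROMAN_TITLE, "Kannada"))
conn.execute("INSERT INTO kannada_theses VALUES (?, ?)", (1, KANNADA_TITLE))

# An English thesis gets a single record, in the Roman table only.
conn.execute(
    "INSERT INTO roman_theses VALUES (?, ?, ?)",
    (2, "Digital libraries in India", "English"),
)

# Follow the ThesisID link from each Roman record to its vernacular record.
rows = conn.execute("""
    SELECT r.thesis_id, r.title, k.title
    FROM roman_theses AS r
    LEFT JOIN kannada_theses AS k ON k.thesis_id = r.thesis_id
    ORDER BY r.thesis_id
""").fetchall()

for row in rows:
    print(row)
```

The left join returns every Roman record and fills the vernacular column only where a linked record exists, which mirrors the one-record versus two-record rule described above.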

38 Vidyanidhi: Platform
- Microsoft Windows XP supports all the 10 Indic scripts
- Uses Windows glyph processing:
  - OpenType font format
  - Uniscribe (Unicode Script Processor)
  - OpenType Layout Services library

39 Vidyanidhi: Platform
- MS SQL 2000
  - A truly multilingual-capable SQL
  - Achieves satisfactory collation
- Front end: ASP
- JavaScript


41 Vidyanidhi: Accessing and Searching
One can search the Vidyanidhi database either in English (Roman script) or in the vernacular:
- The integrated (master) database has metadata records for theses in all languages
- A vernacular database has records of the specific language only

42 Two approaches: differences
- One affords search in the English language and the other in the vernacular
- The first approach also provides for viewing records in Roman script for all theses in the search output, that is, those that satisfy the conditions of the query, along with an option for viewing records in vernacular script for theses in the vernacular

43 The second approach enables one to search only the vernacular database and is thus limited to records in that language. However, this approach enables the search to be made in the vernacular language and script
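The difference between the two search approaches can be sketched with in-memory records. Everything below is hypothetical: the sample records, the field names and the functions `search_master` and `search_vernacular` are assumptions made for illustration, not Vidyanidhi's actual interface.

```python
KANNADA_TITLE = "ಕನ್ನಡ ಸಾಹಿತ್ಯ"  # invented sample title in Kannada script

# Stand-in for the integrated (master) database: Roman-script records for all theses.
MASTER = [
    {"thesis_id": 1, "title": "Kannada sahitya", "language": "Kannada"},
    {"thesis_id": 2, "title": "Digital libraries in India", "language": "English"},
]

# Stand-in for one vernacular database, keyed by the shared ThesisID.
KANNADA = {1: {"thesis_id": 1, "title": KANNADA_TITLE}}


def search_master(term):
    """First approach: query in Roman script across theses in all languages."""
    hits = []
    for record in MASTER:
        if term.lower() in record["title"].lower():
            # Offer the vernacular record too, when the thesis has one.
            hits.append((record, KANNADA.get(record["thesis_id"])))
    return hits


def search_vernacular(term):
    """Second approach: query in the vernacular script, one language only."""
    return [record for record in KANNADA.values() if term in record["title"]]


print(search_master("sahitya"))
print(search_vernacular(KANNADA_TITLE[:5]))  # first word of the Kannada title
```

The first function reaches every thesis but only through Roman-script terms; the second accepts vernacular input but sees only the records of that one language.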


49 Unicode and Indic Scripts
The Vidyanidhi implementation dispels certain misconceptions and misconstructions about Unicode. Supposed problems:
- Data input
- Display and printing
- Collation

50 Data input / Keyboard layout
Our test bed and comparison with other methods:
- The Unicode layout is as easy as the others in terms of speed
- In terms of number of keystrokes there is no difference, and sometimes the Unicode method involves fewer keystrokes
- Data input was almost comparable to English records in terms of productivity

51 Display and Printing
It is fairly satisfactory except for a few problem areas:
- Handling of certain conjuncts
- Inability to display a non-terminating pure consonant
- Limited choice of font types
Unicode can handle conjunct clusters of four consonants

52 Collation issues: some observations
- Consensus with respect to Indic scripts is hard to come by
- Differences of opinion are not uncommon, as Indic languages are a cross between syllabic and phonemic writing systems
- Collation according to phonetic order would be different from alphabetic order

53 Collation Issues
- A few of the disorders stem from the common script base and order used for all Indic scripts
- Despite strong similarity, Indic scripts differ in the number and arrangement of consonants and vowels

54 Collation by Unicode
Given the above collation problems, the collation achieved by Unicode is fairly satisfactory and compares very well with Nudi, a more popular font-based software package
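One reason plain Unicode ordering already gives a usable result is that the Devanagari block lists independent vowels before consonants, and consonants in varga order. The Python sketch below sorts four sample words by raw code point, which is what Python's default string comparison does. It illustrates code point order only, not the collation rules of MS SQL 2000 or of Nudi, and the sample words were chosen here for illustration.

```python
words = [
    "\u0917\u091C",        # starts with GA  (U+0917)
    "\u0915\u092E\u0932",  # starts with KA  (U+0915)
    "\u0905\u092E\u0930",  # starts with the vowel A (U+0905)
    "\u0916\u0917",        # starts with KHA (U+0916)
]

# Default string comparison in Python is by code point, left to right.
ordered = sorted(words)

for word in ordered:
    print(word, [f"U+{ord(ch):04X}" for ch in word])
# The vowel-initial word comes first, followed by KA, KHA and GA in that order.
```

Words involving conjuncts, vowel signs or nasalisation marks are where code point order and a language's expected dictionary order can diverge, which is the kind of disagreement the preceding collation slides describe.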

55 Conclusion
Unicode admirably handles the challenges of a multi-language, multi-script database implementation, despite the complexity and minutiae of a family of Indian languages and scripts with strong commonalities and faint distinctions among themselves

56 Contact
shalini@vidyanidhi.org.in

