Unicode & W3C Jataayu Software C. Kumar January 2007.

Agenda
About Jataayu
Unicode & Encoding
W3C Specification for multi-lingual authoring
Multilingual WEB Address
Indian WEB Sites: an Overview
W3C Activity

About Jataayu
Jataayu was formed with a clear focus on delivering solutions for wireless data services.
Over 60% of the data traffic in Indian mobile networks for WAP, Mobile WEB and MMS is handled by Jataayu products.
The Mobile Device Solution Division focuses on wireless data applications such as WAP, MMS, SyncML, IMPS, Web Browsing and Download.
Active participant in OMA, W3C and MWI.
Over 350 people strong, with offices in the UK, Singapore, Korea, Taiwan and the US; headquartered in India with a major development center in Bangalore.

Localization - Internationalization
Localization (l10n): adaptation of the content to meet the language, cultural and other requirements of a specific target market.
Internationalization (i18n): design and development of the content in a way that enables easy localization for target audiences that vary in culture, region or language.
The mission of the W3C i18n Activity is to ensure that the W3C's formats and protocols are usable worldwide in all languages and in all writing systems.

Need for Unicode
Early character sets were based on 7 bits, which gave 2^7 (i.e. 128) possible characters.
Adding the 8th bit gave a total of 256 possible characters. That was still not enough for all the European languages.
The code page mechanism helped a little by changing the upper cells (0xA0 to 0xFF), but was very complex.
Addressing the needs of other languages requires thousands of ideographic characters at a time.
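The code page problem can be illustrated with a short Python sketch. The byte value 0xE9 and the three ISO 8859 code pages used here are illustrative choices, not taken from the slides; the point is only that one byte in the upper range means different things under different code pages.

```python
# One byte, three code pages: the same value in the upper range (0xA0-0xFF)
# stands for a different character depending on the code page assumed.
raw = bytes([0xE9])

as_latin1 = raw.decode("iso8859_1")    # Western European
as_greek = raw.decode("iso8859_7")     # Greek
as_cyrillic = raw.decode("iso8859_5")  # Cyrillic

for name, ch in [
    ("ISO 8859-1", as_latin1),
    ("ISO 8859-7", as_greek),
    ("ISO 8859-5", as_cyrillic),
]:
    # Print the Unicode code point each code page assigns to the byte.
    print(f"{name}: byte 0xE9 -> U+{ord(ch):04X} {ch}")
```

A reader that guesses the wrong code page silently displays the wrong character, which is the complexity the slide refers to.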

Unicode & Encoding
Unicode, the universal character set, contains the characters needed for writing the majority of living languages in use on computers.
It allows for simple display and storage of multilingual content.
An encoding refers to the way that characters (code points) of the character set are mapped to actual byte sequences.
Different encodings yield different byte sequences for the same character.

Unicode & Encoding
UTF-8 (8-bit Unicode Transformation Format)
Variable-length 8-bit character encoding for Unicode.
Able to represent any character in the Unicode Standard.
Uses one to four bytes to encode a Unicode character.
Only one byte is needed to encode the US-ASCII characters.
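The one-to-four-byte rule can be sketched as a hand-written encoder in Python. The function name `utf8_encode` is hypothetical and the code is a simplified illustration (it does not reject surrogate code points); in practice `str.encode("utf-8")` does this work.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 (illustrative sketch only)."""
    if cp < 0x80:          # 7 bits: one byte, identical to US-ASCII
        return bytes([cp])
    if cp < 0x800:         # up to 11 bits: two bytes
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:       # up to 16 bits: three bytes
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    if cp <= 0x10FFFF:     # up to 21 bits: four bytes
        return bytes([
            0xF0 | (cp >> 18),
            0x80 | ((cp >> 12) & 0x3F),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    raise ValueError("code point outside the Unicode range")


for cp in (0x0041, 0x05D0, 0x597D, 0x233B4):
    encoded = utf8_encode(cp)
    print(f"U+{cp:04X}: {len(encoded)} byte(s), {encoded.hex(' ').upper()}")
```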

Unicode & Encoding
UTF-16 (16-bit Unicode Transformation Format)
Variable-length 16-bit character encoding for Unicode.
Uses a two-byte or four-byte sequence to encode a Unicode character.
Two bytes are required to encode a US-ASCII character.
UCS-2 (2-byte Universal Character Set)
Fixed-length encoding that always encodes a character as a single 16-bit value.
It can only encode characters in the range 0x0000 to 0xFFFF.
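The difference between UTF-16 and UCS-2 shows up for code points above 0xFFFF, which UTF-16 splits into a surrogate pair. The following Python sketch assumes hypothetical helper names (`utf16_units`, `fits_ucs2`) and is meant as an illustration of the arithmetic, not a replacement for the built-in codecs.

```python
def utf16_units(cp: int) -> list:
    """Return the 16-bit code units UTF-16 uses for one code point."""
    if cp < 0x10000:
        return [cp]                       # one unit: two bytes
    offset = cp - 0x10000                 # 20 bits remain after the offset
    high = 0xD800 + (offset >> 10)        # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)       # bottom 10 bits -> low surrogate
    return [high, low]                    # two units: four bytes


def fits_ucs2(cp: int) -> bool:
    """UCS-2 has no surrogate pairs, so it stops at 0xFFFF."""
    return cp <= 0xFFFF


for cp in (0x0041, 0x597D, 0x233B4):
    units = " ".join(f"{u:04X}" for u in utf16_units(cp))
    print(f"U+{cp:04X}: UTF-16 units {units}, fits UCS-2: {fits_ucs2(cp)}")
```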

Unicode & Encoding
UCS-4 / UTF-32 (32-bit Unicode Transformation Format)
Fixed-length 32-bit character encoding for Unicode.
It uses 4 bytes for every character, which makes it very space-inefficient.
It is little used in practice; UTF-8 and UTF-16 are the normal ways of encoding Unicode text.

Unicode & Encoding
Devanagari (0x0900 - 0x097F)
Bengali (0x0980 - 0x09FF)
Tamil (0x0B80 - 0x0BFF)
Kannada (0x0C80 - 0x0CFF)

Code point   U+0041        U+05D0        U+597D        U+233B4
UTF-8        41            D7 90         E5 A5 BD      F0 A3 8E B4
UTF-16       00 41         05 D0         59 7D         D8 4C DF B4
UTF-32       00 00 00 41   00 00 05 D0   00 00 59 7D   00 02 33 B4
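The byte sequences for the four sample code points can be reproduced with Python's built-in codecs. This is a small sketch; the helper name `hex_bytes` is hypothetical, and big-endian byte order without a byte order mark is assumed for UTF-16 and UTF-32.

```python
SAMPLES = [0x0041, 0x05D0, 0x597D, 0x233B4]
CODECS = ["utf-8", "utf-16-be", "utf-32-be"]  # big-endian, no BOM (assumption)


def hex_bytes(cp: int, codec: str) -> str:
    """Encode one code point and return its bytes as spaced hex."""
    return chr(cp).encode(codec).hex(" ").upper()


for codec in CODECS:
    row = " | ".join(hex_bytes(cp, codec) for cp in SAMPLES)
    print(f"{codec:10} {row}")
```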

Unicode & Encoding
An alternative way to represent a character is to use an escape value (for example, the numeric character reference &#x5D0; for א).
Not all documents have to be encoded as Unicode, but documents can only contain characters defined by the Unicode Standard.
Any encoding can be used as long as it is properly declared and it covers a subset of Unicode.
A Unicode encoding also allows many more languages to be mixed on a single page.
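Escapes of this kind can be produced and resolved with the Python standard library. The sketch below uses the Hebrew letter alef (U+05D0) from the slide; the variable names are illustrative.

```python
import html

alef = "\u05D0"  # Hebrew letter alef

# Writing the character into an ASCII-only document: characters outside
# the target encoding are replaced by decimal numeric character references.
escaped = alef.encode("ascii", "xmlcharrefreplace").decode("ascii")
print(escaped)

# Reading escapes back: hexadecimal and decimal forms name the same character.
from_hex = html.unescape("&#x5D0;")
from_dec = html.unescape("&#1488;")
print(from_hex == from_dec == alef)
```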

Other Encoding formats …
Shift_JIS (SJIS): character encoding for the Japanese language.
Single-byte encoding for the lower ASCII characters (0x00 - 0x7F).
Double-byte encoding introduced by bytes in the upper range (above 0x7F).
GB2312: character encoding for simplified Chinese characters.
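The mix of single-byte and double-byte characters can be observed with Python's bundled `shift_jis` and `gb2312` codecs. The sample characters and the helper name `byte_widths` are assumptions chosen for illustration.

```python
def byte_widths(text: str, codec: str) -> list:
    """Return how many bytes each character occupies in the given encoding."""
    return [len(ch.encode(codec)) for ch in text]


# An ASCII letter followed by one ideograph, in each legacy encoding.
print("Shift_JIS:", byte_widths("A\u65E5", "shift_jis"))  # 'A' + U+65E5
print("GB2312:   ", byte_widths("A\u4E2D", "gb2312"))     # 'A' + U+4E2D

# The same text survives a round trip only if the right codec is used.
encoded = "A\u65E5".encode("shift_jis")
print(encoded.decode("shift_jis") == "A\u65E5")
```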

W3C Specification - Encoding
W3C specification for multi-lingual authoring.
The encoding of the document needs to be declared, so that the application that consumes the document can interpret it.
In HTML the declaration goes in a meta tag; in XML it goes in the XML declaration.
The Content-Type header returned from the WEB server should also contain the character encoding of the document:
Content-Type: text/html; charset=utf-8
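A consumer can read the declared encoding from the Content-Type value with the Python standard library. In this sketch the function name `charset_from_content_type` is hypothetical, and the two markup strings are common forms of in-document declarations included as assumptions, since the slide's own examples did not survive.

```python
from email.message import Message

# Typical in-document declarations (illustrative, not taken from the slides).
META_EXAMPLE = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
XML_EXAMPLE = '<?xml version="1.0" encoding="utf-8"?>'


def charset_from_content_type(header_value: str):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()


print(charset_from_content_type("text/html; charset=utf-8"))
print(charset_from_content_type("text/html"))  # no charset declared
```

When the header carries no charset, the consumer has to fall back on the declaration inside the document, which is why both should agree.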

W3C Specification - Language
The author needs to specify the language of the document (web page content).
The browser can choose an appropriate font using the lang attribute.
A search engine can group or filter results based on the user's linguistic preferences (using meta).
Translation tools use it to recognize the sections of text that are in a particular language.

W3C Specification - Language
HTTP Content-Language header, e.g. Content-Language: hi
Language attribute on the html tag
Content language in a meta tag
Language attribute on embedded content (for example, around the text "Some English Content")
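How a tool might pick up these language declarations can be sketched with Python's `html.parser`. The sample markup and the class name `LangCollector` are assumptions made for illustration; they show a Hindi page with an embedded English span.

```python
from html.parser import HTMLParser


class LangCollector(HTMLParser):
    """Collect (tag, language) pairs for every element with a lang attribute."""

    def __init__(self):
        super().__init__()
        self.langs = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "lang":
                self.langs.append((tag, value))


# Hypothetical page: Hindi document with a span of English content.
SAMPLE = (
    '<html lang="hi"><body><p>'
    '<span lang="en">Some English Content</span>'
    "</p></body></html>"
)

collector = LangCollector()
collector.feed(SAMPLE)
print(collector.langs)
```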

What value to use for lang?
IANA (Internet Assigned Numbers Authority) provides a unique value for each language.
The value is listed as the Subtag in the new IANA Language Subtag Registry.
Hindi: hi, Kannada: kn, Tamil: ta

Bi-directional text
Information beyond the language attribute is required to support right-to-left scripts (like Arabic, Hebrew, Urdu).
In HTML, the dir attribute is used to specify the direction of the text.
The title says "פעילות הבינאום, W3C" in Hebrew.
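Each Unicode character carries a bidirectional class, which a tool can use to suggest a value for dir. The sketch below uses a simple first-strong-character heuristic; the function name `suggested_dir` is hypothetical and the heuristic is a simplification of the full Unicode bidirectional algorithm.

```python
import unicodedata

RTL_CLASSES = {"R", "AL"}  # right-to-left and Arabic letter classes


def suggested_dir(text: str) -> str:
    """Suggest 'rtl' or 'ltr' from the first strongly directional character."""
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi in RTL_CLASSES:
            return "rtl"
        if bidi == "L":
            return "ltr"
    return "ltr"  # no strong character found: assume left-to-right


title = "\u05E4\u05E2\u05D9\u05DC\u05D5\u05EA \u05D4\u05D1\u05D9\u05E0\u05D0\u05D5\u05DD, W3C"
print(suggested_dir(title))   # Hebrew title that ends with "W3C"
print(suggested_dir("W3C"))
```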

Multilingual WEB Address
A Web address is used to point to a resource on the WEB.
Web addresses are typically expressed using URIs (Uniform Resource Identifiers).
URIs are restricted to a small number of characters (upper and lower case letters of the English alphabet, numerals and a few symbols).
Users' expectations and use of the Internet have changed, and this restriction no longer fits them.
There is a growing need to use characters from any language in WEB addresses.

Multilingual WEB Address …
A Web address in your own language and alphabet is easier to create, memorize, interpret and relate to. (Ex: खोज.com)
Punycode is a way of representing Unicode code points using only ASCII characters. (Ex:
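Python ships `punycode` and `idna` codecs that show the conversion for the खोज.com example. This is a sketch under the assumption that the built-in `idna` codec (which implements the older IDNA 2003 rules) is acceptable for illustration; the output is printed rather than quoted here.

```python
label = "\u0916\u094B\u091C"   # the Devanagari label from the example
domain = label + ".com"

# Punycode turns one Unicode label into an ASCII-only byte string.
puny = label.encode("punycode")

# The idna codec applies Punycode per label and adds the "xn--" prefix
# to every label that contained non-ASCII characters.
ascii_domain = domain.encode("idna")

print(puny)
print(ascii_domain)
print(ascii_domain.decode("idna") == domain)  # conversion is reversible
```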

Indian Content: an Overview
Most Indian websites are not using Unicode.
Content is generated within the ASCII range, and the site provides proprietary fonts that map the ASCII character set to Indian-language glyphs.
Visually it looks fine, but no other software will be able to interpret it.
For each site, the user may need to download the proprietary fonts, which is not user friendly.
A search engine will not be able to interpret the content as the author intended, because it does not follow a standard encoding.
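Why font-mapped content is opaque to other software can be sketched with the script ranges listed earlier in the deck. In this Python example the function name `scripts_in` is hypothetical and the font-encoded string is an invented stand-in for what a proprietary font site might store; only the Unicode string identifies itself as Devanagari.

```python
# Script ranges as listed in the "Unicode & Encoding" slide.
SCRIPT_RANGES = {
    "Devanagari": (0x0900, 0x097F),
    "Bengali": (0x0980, 0x09FF),
    "Tamil": (0x0B80, 0x0BFF),
    "Kannada": (0x0C80, 0x0CFF),
}


def scripts_in(text: str) -> set:
    """Return the scripts a program can infer from the code points alone."""
    found = set()
    for ch in text:
        cp = ord(ch)
        for name, (lo, hi) in SCRIPT_RANGES.items():
            if lo <= cp <= hi:
                found.add(name)
                break
        else:
            if cp < 0x80 and ch.isalpha():
                found.add("ASCII letters")
    return found


unicode_text = "\u0939\u093F\u0928\u094D\u0926\u0940"  # Hindi word in Unicode
font_encoded = "fgUnh"  # hypothetical ASCII bytes shown as Hindi by a custom font

print(scripts_in(unicode_text))
print(scripts_in(font_encoded))
```

A search engine sees only the second kind of data on a font-mapped site, so it indexes what look like random Latin letters.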

Unicode & W3C Importance
The WEB is also moving towards mobile devices.
The W3C Mobile Web Initiative (MWI) defines best practices for mobile browsing.
On a mobile device the required fonts cannot be installed at run time, as is done on the desktop.
If Unicode characters are used, the required font may already be available on the device.

Firefox
Firefox provides extensive support for Unicode and related fonts.
It provides add-ons for typing Indian languages in web pages on Linux. (Such tools are already available to Windows XP users through the language packs.)

W3C i18n activity
Core Working Group: enable universal access to the World Wide Web by providing adequate support to other W3C Working Groups.
GEO (Guidelines, Education & Outreach): make the internationalization aspects of W3C technology better understood and more widely and consistently used.
ITS (Internationalization Tag Set): develop a set of elements and attributes that can be used with new DTDs/Schemas to support the internationalization and localization of documents.

Thanks