Unicode & W3C Jataayu Software C. Kumar January 2007.

Agenda: About Jataayu; Unicode & Encoding; W3C Specification for multilingual authoring; Multilingual Web Address; Indian Web Sites, an Overview; W3C Activity.

About Jataayu: Jataayu was formed with a clear focus on delivering solutions for wireless data services. Over 60% of the data traffic in Indian mobile networks for WAP, Mobile Web and MMS is handled by Jataayu products. The Mobile Device Solution Division focuses on wireless data applications such as WAP, MMS, SyncML, IMPS, Email, Web Browsing and Download. Jataayu is an active participant in OMA, W3C and MWI. It is over 350 people strong, with offices in the UK, Singapore, Korea, Taiwan and the US, and is headquartered in India with its major development center in Bangalore.

Localization - Internationalization: Localization (l10n) is the adaptation of content to meet the language, cultural and other requirements of a specific target market. Internationalization (i18n) is the design and development of content in a way that enables easy localization for target audiences that vary in culture, region or language. The mission of the W3C i18n Activity is to ensure that the W3C's formats and protocols are usable worldwide in all languages and in all writing systems.

Need for Unicode: Early character sets were based on 7 bits, which gave 2^7 (i.e. 128) possible characters. Adding the 8th bit gave a total of 256 possible characters, still not enough for all the European languages. The code page mechanism helped a little by changing the meaning of the upper cells (0xA0 to 0xFF), but it was very complex. Addressing the needs of other languages requires thousands of ideographic characters at a time.
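The code page problem can be sketched in a few lines of Python. This is only an illustration, not part of the original slides: the byte value 0xE9 and the three code pages are arbitrary choices, picked to show that one upper-range byte stands for a different character under each code page.

```python
# One byte from the upper range, interpreted under three different code pages.
raw = bytes([0xE9])

# Codec names are Python's standard aliases for these legacy code pages.
CODE_PAGES = ("latin-1", "iso8859_7", "cp1251")

def decode_under_code_pages(data, code_pages=CODE_PAGES):
    """Return {code page: decoded text} for the same raw bytes."""
    return {cp: data.decode(cp) for cp in code_pages}

for code_page, text in decode_under_code_pages(raw).items():
    print(f"{code_page:10} U+{ord(text):04X} {text}")
```

Without knowing which code page the author used, a reader of the byte cannot tell which of these characters was intended.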

Unicode & Encoding: Unicode is a universal character set that contains the characters needed for writing the majority of living languages in use on computers. It allows simple display and storage of multilingual content. An encoding is the way the code points of the character set are mapped to actual byte values. Different encodings yield different byte sequences for the same character.

Unicode & Encoding: UTF-8 (8-bit Unicode Transformation Format) is a variable-length 8-bit character encoding for Unicode. It can represent any character in the Unicode Standard, using one to four bytes per character. Only one byte is needed to encode each US-ASCII character.
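As a rough sketch of how the one-to-four-byte scheme works, the function below packs a code point into UTF-8 by hand. The name utf8_encode is hypothetical, and the sketch skips validation (surrogates, values above 0x10FFFF); in practice Python's built-in str.encode("utf-8") does this job.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point as UTF-8 (no validation; illustration only)."""
    if code_point < 0x80:        # 7 bits  -> 1 byte:  0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:       # 11 bits -> 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:     # 16 bits -> 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 21 bits -> 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])

for cp in (0x41, 0x5D0, 0x597D, 0x233B4):
    encoded = utf8_encode(cp)
    # Cross-check the hand-written packing against the built-in codec.
    assert encoded == chr(cp).encode("utf-8")
    print(f"U+{cp:04X} -> {len(encoded)} byte(s)")
```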

Unicode & Encoding: UTF-16 (16-bit Unicode Transformation Format) is a variable-length 16-bit character encoding for Unicode. It uses a two-byte or four-byte sequence to encode each character, so two bytes are required even for a US-ASCII character. UCS-2 (2-byte Universal Character Set) is a fixed-length encoding that always encodes a character as a single 16-bit value. It can therefore encode only characters in the range 0x0000 to 0xFFFF.
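The four-byte case in UTF-16 is a surrogate pair. The sketch below shows the arithmetic, assuming a valid code point as input; the function name utf16_units is hypothetical and not taken from the slides.

```python
def utf16_units(code_point: int) -> list:
    """Return the 16-bit code units that UTF-16 uses for one code point."""
    if code_point < 0x10000:
        # Fits in a single 16-bit unit (this is all that UCS-2 can express).
        return [code_point]
    offset = code_point - 0x10000         # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)        # top 10 bits    -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)       # bottom 10 bits -> low surrogate
    return [high, low]

for cp in (0x41, 0x597D, 0x233B4):
    units = utf16_units(cp)
    # Cross-check against the built-in big-endian UTF-16 codec.
    expected = chr(cp).encode("utf-16-be")
    assert b"".join(u.to_bytes(2, "big") for u in units) == expected
    print(f"U+{cp:04X} ->", " ".join(f"{u:04X}" for u in units))
```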

Unicode & Encoding: UCS-4 / UTF-32 (32-bit Unicode Transformation Format) is a fixed-length 32-bit character encoding for Unicode. Every character uses 4 bytes, which makes it very space inefficient. It is little used in practice, with UTF-8 and UTF-16 being the normal ways of encoding Unicode text. See http://www.unicode.org/

Unicode & Encoding: Code point ranges of some Indian scripts: Devanagari (0x0900 – 0x097F), Bengali (0x0980 – 0x09FF), Tamil (0x0B80 – 0x0BFF), Kannada (0x0C80 – 0x0CFF). The same code points in each encoding (byte values in hexadecimal):

Code Point   U+0041        U+05D0        U+597D        U+233B4
UTF-8        41            D7 90         E5 A5 BD      F0 A3 8E B4
UTF-16       00 41         05 D0         59 7D         D8 4C DF B4
UTF-32       00 00 00 41   00 00 05 D0   00 00 59 7D   00 02 33 B4
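The byte values for these four code points can be reproduced with Python's built-in codecs. This is a small sketch; the helper name encoding_row is hypothetical, and the big-endian codec variants are chosen so that no byte order mark is emitted.

```python
# Built-in codecs; the "-be" variants fix big-endian order and emit no BOM.
CODECS = ("utf-8", "utf-16-be", "utf-32-be")

def encoding_row(code_point: int) -> dict:
    """Return {codec: hex bytes} for a single code point."""
    ch = chr(code_point)
    return {codec: ch.encode(codec).hex(" ").upper() for codec in CODECS}

for cp in (0x0041, 0x05D0, 0x597D, 0x233B4):
    row = encoding_row(cp)
    print(f"U+{cp:04X}", "|", " | ".join(row[c] for c in CODECS))
```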

Unicode & Encoding: An alternative way to represent a character is to use an escape value, for example the numeric character reference &#x5D0; for א. Not all documents have to be encoded in a Unicode encoding, but documents can only contain characters defined by the Unicode Standard. Any encoding can be used as long as it is properly declared and its character set is a subset of Unicode. A Unicode encoding also allows many more languages to be mixed on a single page.
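A short sketch of escapes in practice, using only the Python standard library: html.unescape resolves a numeric character reference, and the xmlcharrefreplace error handler produces such references when the target encoding cannot hold a character. The helper name to_ascii_with_escapes is hypothetical.

```python
import html

def to_ascii_with_escapes(text: str) -> str:
    """Keep ASCII as-is and turn every other character into a decimal escape."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")

# Hexadecimal and decimal references resolve to the same character, U+05D0.
print(html.unescape("&#x5D0;") == html.unescape("&#1488;"))

# A document stored in a non-Unicode encoding can still carry the character.
escaped = to_ascii_with_escapes("alef: \u05d0")
print(escaped)
print(html.unescape(escaped))
```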

Other Encoding formats: Shift_JIS (SJIS) is a character encoding for the Japanese language. It uses a single byte for the lower-ASCII characters (0x00 – 0x7F) and double-byte sequences that start with an upper-range byte (apart from the single-byte half-width katakana in 0xA1 – 0xDF). GB2312 is a character encoding for simplified Chinese characters.
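The single-byte and double-byte behaviour can be checked with Python's bundled shift_jis and gb2312 codecs. This is a minimal sketch; the sample characters and the helper name byte_lengths are arbitrary choices for illustration.

```python
def byte_lengths(text: str, codec: str) -> list:
    """Return the number of bytes each character occupies in the given codec."""
    return [len(ch.encode(codec)) for ch in text]

# An ASCII letter followed by one ideograph, in each legacy encoding.
print("shift_jis:", byte_lengths("A\u65e5", "shift_jis"))   # 'A' + U+65E5
print("gb2312:   ", byte_lengths("A\u597d", "gb2312"))      # 'A' + U+597D

# Bytes written in one legacy encoding are not valid text in the other,
# which is why the encoding has to be declared.
sjis_bytes = "\u65e5\u672c".encode("shift_jis")
print(sjis_bytes.decode("shift_jis") == "\u65e5\u672c")
```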

W3C Specification - Encoding: Under the W3C specifications for multilingual authoring, the encoding of a document needs to be declared so that the application that consumes it can interpret it. In HTML this is done with a meta tag, and in XML with the XML declaration. The Content-Type header returned by the Web server should also contain the character encoding of the document, for example: Content-Type: text/html; charset=utf-8
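The sketch below shows how a consumer might read the declared encoding from a Content-Type header, using the standard library's email.message.Message to parse the header parameters. The meta tag and XML declaration strings are typical forms given as assumptions, since the slide's own markup did not survive in the transcript. The helper name charset_from_content_type and the UTF-8 fallback are likewise illustrative choices.

```python
from email.message import Message

# Typical in-document declarations (assumed; not recovered from the slide).
HTML_META = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
XML_DECL = '<?xml version="1.0" encoding="utf-8"?>'

def charset_from_content_type(header_value: str):
    """Return the declared charset in lowercase, or None if none is declared."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()

body = "\u0928\u092e\u0938\u094d\u0924\u0947".encode("utf-8")  # Hindi sample text
declared = charset_from_content_type("text/html; Charset=utf-8")

# Falling back to UTF-8 when nothing is declared is an assumption of this sketch.
text = body.decode(declared or "utf-8")
print(declared, text)
```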

W3C Specification - Language: The author needs to specify the language of the document (the web page content). A browser can use the lang attribute to select an appropriate font. A search engine can group or filter results based on the user's linguistic preferences (using the meta information). Translation tools use it to recognize the sections of text that are in a particular language.

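W3C Specification - Language: The language can be declared in several places: the HTTP Content-Language header (Content-Language: hi), the language attribute on the html tag, the Content-Language meta tag, and a language attribute on embedded content in another language (for example, an element wrapping the text "Some English Content").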
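To make the in-document declarations concrete, the sketch below embeds a small sample page and collects its lang attributes with the standard library's html.parser. The markup in SAMPLE_PAGE is an assumed reconstruction of what such a page could look like, not the slide's original example, and LangCollector is a hypothetical helper.

```python
from html.parser import HTMLParser

# Assumed sample page: Hindi document with one embedded English phrase.
SAMPLE_PAGE = """<html lang="hi">
<head><meta http-equiv="Content-Language" content="hi"></head>
<body><p>\u0939\u093f\u0928\u094d\u0926\u0940 <span lang="en">Some English Content</span></p></body>
</html>"""

class LangCollector(HTMLParser):
    """Record every (tag, lang) pair declared in a document."""

    def __init__(self):
        super().__init__()
        self.declared = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with lowercased names.
        lang = dict(attrs).get("lang")
        if lang:
            self.declared.append((tag, lang))

collector = LangCollector()
collector.feed(SAMPLE_PAGE)
print(collector.declared)
```

A tool such as a search engine or translation aid could use the collected pairs to decide which language each part of the page is in.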

What value to use for lang? IANA (Internet Assigned Numbers Authority) provides a unique value for each language. It is the Subtag value in the new IANA Language Subtag Registry: http://www.iana.org/assignments/language-subtag-registry. Examples: Hindi – hi, Kannada – kn, Tamil – ta.

Bi-directional text: Beyond the language attribute, additional information is required to support right-to-left scripts (such as Arabic, Hebrew and Urdu). In HTML, the dir attribute is used to specify the direction of the text. The title says “פעילות הבינאום, W3C” in Hebrew.
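One way to choose a value for dir is to look at the first strongly directional character in the text. The sketch below does this with unicodedata.bidirectional; it is a deliberate simplification of the Unicode Bidirectional Algorithm, and the names base_direction and wrap_with_dir are hypothetical.

```python
import unicodedata

def base_direction(text: str) -> str:
    """Guess 'rtl' or 'ltr' from the first strongly directional character."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in ("R", "AL"):   # Hebrew-type and Arabic-type letters
            return "rtl"
        if bidi_class == "L":           # Latin, Devanagari and other LTR letters
            return "ltr"
    return "ltr"                        # no strong character: default to ltr

def wrap_with_dir(text: str) -> str:
    """Wrap the text in a span that carries an explicit dir attribute."""
    return f'<span dir="{base_direction(text)}">{text}</span>'

print(wrap_with_dir("\u05e4\u05e2\u05d9\u05dc\u05d5\u05ea \u05d4\u05d1\u05d9\u05e0\u05d0\u05d5\u05dd, W3C"))
print(wrap_with_dir("W3C Activity"))
```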

Multilingual Web Address: A Web address is used to point to a resource on the Web. Web addresses are typically expressed using URIs (Uniform Resource Identifiers), which are restricted to a small number of characters (upper and lower case letters of the English alphabet, numerals and a few symbols). Users' expectations and use of the Internet have outgrown this restriction. There is a growing need to use characters from any language in Web addresses.

Multilingual Web Address …: A Web address in your own language and alphabet is easier to create, memorize, interpret and relate to (Ex: http://खोज.com). Punycode is a way of representing Unicode code points using only ASCII characters (Ex: http://xn--21bm4l.com).
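Python's standard library includes an "idna" codec that applies Punycode to each label of a host name and adds the xn-- prefix. The sketch below uses it for a round trip; note that this built-in codec follows the older IDNA 2003 rules, and the helper names are hypothetical.

```python
def to_ascii_host(host: str) -> str:
    """Convert an internationalized host name to its ASCII (xn--) form."""
    return host.encode("idna").decode("ascii")

def to_unicode_host(host: str) -> str:
    """Convert an ASCII (xn--) host name back to its Unicode form."""
    return host.encode("ascii").decode("idna")

unicode_host = "\u0916\u094b\u091c.com"   # the Devanagari example host
ascii_host = to_ascii_host(unicode_host)

print(ascii_host)                          # ASCII-only form for DNS
print(to_unicode_host(ascii_host) == unicode_host)
```

Only the non-ASCII label is transformed; a plain ASCII label such as "com" passes through unchanged.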

Indian Content, an Overview: Most Indian websites are not using Unicode. Content is generated within the ASCII range and served with proprietary fonts that map the ASCII character set to Indian-language glyphs. Visually it looks fine, but no other software is able to interpret it. For each site the user may need to download the proprietary font, which is not user friendly. A search engine is not able to interpret the content as the author intended, because it does not follow a standard encoding.
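The effect on software can be sketched as follows. The string font_hack_text is a made-up stand-in for what a proprietary font scheme might store (the real mappings differ per font and are not given in the slides); the point is only that the stored characters are Latin letters, whatever glyphs the font draws for them.

```python
import unicodedata

def scripts_used(text: str) -> set:
    """Return the script names (first word of each Unicode character name)."""
    return {unicodedata.name(ch, "UNKNOWN").split()[0]
            for ch in text if not ch.isspace()}

unicode_text = "\u0916\u094b\u091c"   # real Devanagari code points
font_hack_text = "Kaoj"               # hypothetical ASCII stored by a font-hack page

print(scripts_used(unicode_text))     # software sees Devanagari
print(scripts_used(font_hack_text))   # software sees Latin, whatever the font shows

# A search for the Devanagari word finds only the Unicode page.
query = "\u0916\u094b\u091c"
print(query in unicode_text, query in font_hack_text)
```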

Unicode & W3C Importance: The Web is also moving towards mobile. The W3C Mobile Web Initiative (MWI) defines best practices for mobile browsing. On a mobile device, the required fonts cannot be installed at run time as is done on the desktop. If Unicode characters are used, the required font may already be available on the device.

Firefox: Firefox (http://www.getfirefox.com) provides extensive support for Unicode and related fonts. It provides add-ons for typing in Indian languages in web pages on Linux (such tools are already available to Windows XP users through the language packs). https://addons.mozilla.org/firefox/5484/author/

W3C i18n activity: The Core Working Group enables universal access to the World Wide Web by providing adequate support to other W3C Working Groups. GEO (Guidelines, Education & Outreach) works to make the internationalization aspects of W3C technology better understood and more widely and consistently used. ITS (Internationalization Tag Set) develops a set of elements and attributes that can be used with new DTDs/Schemas to support the internationalization and localization of documents.

Thanks kumarc@jataayusoft.com
