1 Adrian Rissoné Information Systems Manager Department of Palaeontology The Natural History Museum Introduction ISO 10646 and the.


1 Adrian Rissoné, Information Systems Manager, Department of Palaeontology, The Natural History Museum. Introduction; ISO 10646 and the UCS; Unicode and UTF; Support in common products; Sorting & Searching; Data management products. International & Special Characters in Scientific Data. The Taxonomic Database Working Group. International & Special Characters. ©The Natural History Museum, London, SW7 5BD, October 2002

2 Introduction. Until a few years ago the only text characters that could be used widely were the 128 (including control characters) contained within the 7-bit ANSI/ASCII character set; in practice, only a limited range of characters from North American English. Later, use of the 8th bit extended the range to 256 characters, so as to include most Western European characters and some graphical characters. Non-Western characters could only be displayed using Windows Code Pages, and characters from different code pages could not be displayed together. More recently, support for multi-byte characters has been gradually introduced into operating systems (MacOS since version 8.5, Windows NT/2000), but restricted to certain fonts (e.g. Arial Unicode MS). Inclusion into application products has been slow.

3 ISO 10646 and UCS. The international standard ISO 10646 defines the Universal Character Set (UCS). UCS is a superset of all other character set standards. It guarantees round-trip compatibility with other character sets. UCS contains the characters required to represent practically all known languages. This includes not only the Latin, Greek, Cyrillic, Hebrew, Arabic, Armenian, and Georgian scripts, but also Chinese, Japanese and Korean Han ideographs as well as scripts such as Hiragana, Katakana, Hangul, Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Thai, Lao, Khmer, Bopomofo, Tibetan, Runic, Ethiopic, Canadian Syllabics, Cherokee, Mongolian, Ogham, Myanmar, Sinhala, Thaana, Yi, and others.

4 ISO 10646 and UCS. ISO 10646 formally defines a 31-bit character set. The most commonly used characters, including all those found in older encoding standards, have been placed in the first 65,534 positions (0x0000 to 0xFFFD). This 16-bit subset of UCS is called the Basic Multilingual Plane (BMP). The characters that were later added outside the 16-bit BMP are mostly for specialist applications such as historic scripts and scientific notation. Current plans are that there will never be characters assigned outside the 21-bit code space from 0x000000 to 0x10FFFF, which covers a bit over one million potential future characters.
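
The ranges on this slide can be checked with a little arithmetic. The following is a minimal Python sketch; the names MAX_CODE, plane and in_bmp are illustrative and do not come from the standard or the slides.

```python
# Upper end of the 21-bit code space mentioned on the slide.
MAX_CODE = 0x10FFFF


def plane(code_point: int) -> int:
    """Return the plane number: the bits above the low 16 bits."""
    if not 0 <= code_point <= MAX_CODE:
        raise ValueError("code point outside 0x000000..0x10FFFF")
    return code_point >> 16


def in_bmp(code_point: int) -> bool:
    """True when the code point lies in the Basic Multilingual Plane."""
    return plane(code_point) == 0


# Size of the whole code space: "a bit over one million" positions.
total_positions = MAX_CODE + 1

print(in_bmp(0x0041), in_bmp(0x10FFFD), total_positions)
```

Under these assumptions the BMP is simply plane 0, and the code space holds 17 planes of 65,536 positions each.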

5 ISO 10646 and UCS. UCS assigns to each character not only a code number but also an official name. A hexadecimal number that represents a UCS or Unicode value is commonly preceded by "U+", as in U+0041 for the character "Latin capital letter A". The UCS characters U+0000 to U+007F are identical to those in US ASCII, and the range U+0000 to U+00FF is identical to ISO 8859-1 (Latin-1).
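
As an illustration of the "U+" notation and official names, here is a small Python sketch using the standard unicodedata module. The helper name describe is an assumption made for this example only.

```python
import unicodedata


def describe(ch: str) -> str:
    """Format one character as 'U+XXXX OFFICIAL NAME'."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


print(describe("A"))        # U+0041 LATIN CAPITAL LETTER A
print(describe("\u00e9"))   # U+00E9 LATIN SMALL LETTER E WITH ACUTE

# The first 128 code points coincide with US ASCII and the first 256
# with ISO 8859-1 (Latin-1): decoding byte i yields code point i.
ascii_matches = all(bytes([i]).decode("ascii") == chr(i) for i in range(0x80))
latin1_matches = all(bytes([i]).decode("latin-1") == chr(i) for i in range(0x100))
print(ascii_matches, latin1_matches)
```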

6 ISO 10646 and UCS. Combining characters: these are similar to the non-spacing accent keys on a typewriter. A combining character is not a full character by itself. It is an accent or other diacritical mark that is added to the previous character. This way, it is possible to place any accent on any character. Combining characters follow the character which they modify. Precomposed characters: accented characters that have their own code position, but could also be represented as a pair of another character followed by a combining character.
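
The difference between a precomposed character and a base character plus combining mark can be seen in a short Python sketch. It relies on Unicode normalization (forms NFC and NFD) from the standard unicodedata module, which the slides do not discuss; same_text is a hypothetical helper name.

```python
import unicodedata

precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE, one code position
combined = "e\u0301"     # 'e' followed by COMBINING ACUTE ACCENT


def same_text(a: str, b: str) -> bool:
    """Compare two strings after composing them to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# The raw strings differ in length and content, yet denote the same text.
print(len(precomposed), len(combined))
print(precomposed == combined, same_text(precomposed, combined))

# A non-zero combining class marks U+0301 as a combining character.
print(unicodedata.combining("\u0301"))
```

A system that compares or searches strings without some such normalization step may treat the two spellings as different values.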

7 UCS Implementation Levels. Level 1: combining characters and Hangul Jamo characters are not supported. Level 2: like level 1, except that in some scripts a fixed list of combining characters is allowed (e.g. for Hebrew, Arabic, Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Thai and Lao). These scripts cannot be represented adequately in UCS without support for at least certain combining characters. Level 3: all UCS characters are supported, such that, for example, mathematicians can place a tilde or an arrow (or both) on any arbitrary character.

8 UCS as a national standard? A number of countries have published national adoptions of ISO 10646-1:1993, sometimes after adding additional annexes with cross-references to older national standards and specifications of various national implementation subsets. China: GB 13000.1-93. Japan: JIS X 0221-1:2001. Korea: KS X 1005-1:1995 (includes ISO 10646-1:1993 amendments 1-7). Vietnam: TCVN 6909:2001.

9 What is Unicode? The ISO 10646 standard was a project of the International Organization for Standardization (ISO). The Unicode Project was organized by a consortium of (initially mostly US) manufacturers of multi-lingual software. Fortunately, the participants of both projects realized in around 1991 that two different unified character sets are not what the world needs. They joined their efforts and worked together on creating a single code table. Both projects still exist and publish their respective standards independently, but they have agreed to keep the code tables of the Unicode and ISO 10646 standards compatible.

10 What is Unicode? The Unicode Standard published by the Unicode Consortium corresponds to ISO 10646 at implementation level 3. All characters are at the same positions and have the same names in both standards. The Unicode Standard also defines many more semantics associated with some of the characters and is in general a better reference for implementers of high-quality typographic publishing systems. Unicode specifies algorithms for rendering presentation forms of some scripts (say Arabic), handling of bi-directional texts that mix, for instance, Latin and Hebrew, algorithms for sorting and string comparison, and much more.

11 What is Unicode? The ISO 10646 standard, on the other hand, is not much more than a simple character set table. However, a nice feature of the ISO 10646 standard is that it provides CJK example glyphs in five different style variants, while the Unicode standard shows the CJK ideographs only in a Chinese variant.

12 What does Unicode look like? Characters are denoted in the Unicode Standard as an optional U+ followed by their hexadecimal number, using at least 4 digits, such as "U+1234" or "U+10FFFD". In XML or HTML this could be expressed as "&#x1234;" or "&#x10FFFD;".
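
Converting between characters and numeric character references is mechanical. The sketch below uses Python's standard html module; to_reference is an illustrative helper name, and the hexadecimal form of the reference is an assumption about how the slide originally wrote it.

```python
import html


def to_reference(ch: str) -> str:
    """Write one character as a hexadecimal numeric character reference."""
    return f"&#x{ord(ch):X};"


# From character to reference ...
print(to_reference("\u1234"))
print(to_reference("\U0010FFFD"))

# ... and back again; decimal and hexadecimal references are equivalent.
from_hex = html.unescape("&#x1234;")
from_decimal = html.unescape("&#4660;")
print(from_hex == from_decimal == "\u1234")
```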

13 UTF-8 (UCS Transformation Format). UCS and Unicode are just code tables that assign integer numbers to characters. There exist several alternatives for how a sequence of such characters or their respective integer values can be represented as a sequence of bytes. The two most obvious encodings store Unicode text as sequences of either 2-byte or 4-byte units. The official terms for these encodings are UCS-2 and UCS-4 respectively. An ASCII or Latin-1 file can be transformed into a UCS-2 file by simply inserting a 0x00 byte in front of every ASCII byte. If we want to have a UCS-4 file, we have to insert three 0x00 bytes instead before every ASCII byte.
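
The zero-byte padding described on this slide can be reproduced in a few lines of Python. This is a sketch under two assumptions: big-endian byte order, and Python's "utf-16-be" and "utf-32-be" codecs standing in for UCS-2 and UCS-4 (they give the same bytes for the ASCII text used here). The function pad_ascii is a hypothetical helper.

```python
def pad_ascii(data: bytes, width: int) -> bytes:
    """Prefix every byte with (width - 1) zero bytes."""
    return b"".join(b"\x00" * (width - 1) + bytes([b]) for b in data)


text = "Hi"
ascii_bytes = text.encode("ascii")

ucs2 = text.encode("utf-16-be")   # two bytes per character
ucs4 = text.encode("utf-32-be")   # four bytes per character

print(ucs2 == pad_ascii(ascii_bytes, 2))
print(ucs4 == pad_ascii(ascii_bytes, 4))
```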

14 UTF-8. Using UCS-2 (or UCS-4) under some operating systems (e.g. Unix) would lead to very severe problems. Some bytes and byte sequences have a special meaning in filenames and other C library function parameters. The UTF-8 encoding defined in ISO 10646-1:2000 Annex D (and also described in section 3.8 of the Unicode 3.0 standard) does not have these problems.
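
Two bytes that commonly carry special meaning are 0x00 (the C string terminator) and 0x2F (the "/" path separator on Unix). The Python sketch below looks for them in encoded text; it again assumes "utf-16-be" as a stand-in for UCS-2, and troublesome_bytes is an illustrative name.

```python
# 'A' and U+012F (LATIN SMALL LETTER I WITH OGONEK) as sample characters.
samples = ["A", "\u012f"]


def troublesome_bytes(text: str, encoding: str) -> set:
    """Return which of the byte values 0x00 and 0x2F occur in the encoding."""
    return {b for b in text.encode(encoding) if b in (0x00, 0x2F)}


ucs2_hits = {s: troublesome_bytes(s, "utf-16-be") for s in samples}
utf8_hits = {s: troublesome_bytes(s, "utf-8") for s in samples}

print(ucs2_hits)   # 'A' contains a 0x00 byte, U+012F contains a 0x2F byte
print(utf8_hits)   # neither byte value appears in the UTF-8 forms
```

A C function that stops at the first 0x00 byte, or a filesystem that splits paths on 0x2F, would misread the 2-byte forms but not the UTF-8 forms.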

15 UTF-8. UTF-8 has the following properties. UCS characters U+0000 to U+007F (ASCII) are encoded simply as bytes 0x00 to 0x7F (ASCII compatibility). This means that files and strings which contain only 7-bit ASCII characters have the same encoding under both ASCII and UTF-8. All UCS characters above U+007F are encoded as a sequence of several bytes, each of which has the most significant bit set. Therefore, no ASCII byte (0x00-0x7F) can appear as part of any other character.
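
Both properties can be checked character by character. The following Python sketch uses the built-in "utf-8" codec; the function name has_ascii_properties and the sample string are assumptions chosen for illustration.

```python
def has_ascii_properties(text: str) -> bool:
    """Check the two UTF-8 properties listed above for every character."""
    for ch in text:
        encoded = ch.encode("utf-8")
        if ord(ch) <= 0x7F:
            # ASCII characters are encoded as the identical single byte.
            if encoded != bytes([ord(ch)]):
                return False
        else:
            # Other characters use several bytes, all with the top bit set.
            if len(encoded) < 2 or any(b < 0x80 for b in encoded):
                return False
    return True


sample = "Vis\u00e9an /tmp \u041c\u043e\u0441\u043a\u0432\u0430"
print(has_ascii_properties(sample))

# Pure ASCII text has the same bytes under ASCII and UTF-8.
print("Moscow".encode("ascii") == "Moscow".encode("utf-8"))
```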

16 UTF-8. The first byte of a multibyte sequence that represents a non-ASCII character is always in the range 0xC0 to 0xFD and it indicates how many bytes follow for this character. All further bytes in a multibyte sequence are in the range 0x80 to 0xBF. All possible 2^31 UCS codes can be encoded. UTF-8 encoded characters may theoretically be up to six bytes long. The sorting order of big-endian UCS-4 byte strings is preserved.
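
The byte layout described on this slide can be written out as a short encoder. This is a sketch of the original one-to-six-byte scheme for 31-bit codes, written for illustration only; utf8_encode is a hypothetical name, and present-day UTF-8 (RFC 3629) stops at four bytes and U+10FFFF, which is why Python's own codec is only used here to cross-check values in that range.

```python
def utf8_encode(code: int) -> bytes:
    """Encode one UCS code (0 .. 2**31 - 1) using 1 to 6 bytes."""
    if not 0 <= code < 2**31:
        raise ValueError("code outside the 31-bit UCS range")
    if code < 0x80:
        return bytes([code])          # ASCII: a single identical byte
    # (sequence length, first code NOT representable, lead-byte marker bits)
    layouts = (
        (2, 0x800, 0xC0),
        (3, 0x10000, 0xE0),
        (4, 0x200000, 0xF0),
        (5, 0x4000000, 0xF8),
        (6, 0x80000000, 0xFC),
    )
    for length, limit, marker in layouts:
        if code < limit:
            break
    tail = []
    for _ in range(length - 1):
        tail.append(0x80 | (code & 0x3F))   # continuation byte 0x80..0xBF
        code >>= 6
    return bytes([marker | code] + tail[::-1])


codes = [0x41, 0xE9, 0x434, 0x1234, 0x10FFFD, 0x7FFFFFFF]
encoded = [utf8_encode(c) for c in codes]
print([e.hex() for e in encoded])

# Byte-wise ordering of the encoded forms matches the numeric ordering.
print(sorted(encoded) == encoded)
```

The lead byte carries as many leading 1 bits as the sequence has bytes, which is how a decoder knows how many continuation bytes to read.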

17 Storing Unicode/UTF-8. Most compliant applications store characters 0 – 255 (0xFF) as a single byte. A few store Unicode as Unicode text strings (&#nnnn;). UTF-8 characters outside of 0x00 – 0xFF are stored as a multibyte sequence, the first byte of which indicates how many bytes follow.

18 HTML & XML. For document and data interchange, the Internet and the World Wide Web are more and more making use of marked-up text such as HTML and XML. Because Unicode and UTF-8 “text” characters are interpreted by the browser, they may be included naturally in HTML or XML documents. However, in many instances, markup provides the same, or essentially similar, features to those provided by format characters in the Unicode Standard for use in plain text, and there may be conflict. Another special character category provided by Unicode is that of compatibility characters. While there may be valid reasons to support these characters and their specifications in plain text, their use in marked-up text can conflict with the rules of the markup language.

19 Support for Unicode/UTF-8. Support for Unicode/UTF-8 is very variable, even within product families. For example, Microsoft Office 2000 supports most UTF-8 characters in data but FrontPage 2000 does not (it does support Unicode). Some major players do not offer, or have only recently introduced, compatibility. Database-level support is slowly becoming commonplace but interfaces, programming-level support and clients are lagging behind. For example, the latest version of the PHP scripting language does not formally support UTF-8 (but encoding/decoding functions are available). Application software depends on the underlying operating system. Unicode/UTF-8 versions of products are therefore only available on later versions of the operating system (e.g. Windows NT/2000 onwards, MacOS 8.5 onwards). This is a real problem for application developers.

20 Support for Unicode/UTF-8. However clever an application is, it can’t display Unicode/UTF-8 if a capable font is not available! The Arial Unicode MS font shipped with Microsoft Office 2000 can display 51,180 characters. Arial Unicode MS is 23 MB and will have a significant impact on the performance of your computer. Code 2000 contains many characters that are difficult to find elsewhere. Apple computers need a compatible font installed, but it should handle most Microsoft fonts. Bitstream Font Fusion is a new technology which promises to be able to construct scalable Unicode fonts with much less impact, perhaps only one sixteenth the size of current fonts.

21 Support for Unicode/UTF-8. An HTML document should include a meta tag defining the character set as UTF-8. A compatible application, such as a browser with the default font set to Arial Unicode MS, should then display Unicode correctly. Using only the &#nnnn; construct to display a UTF-8 character or Unicode string is not enough: for example, Internet Explorer and Netscape 7 will display the character correctly but Word will not.
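
The exact tag shown on the original slide did not survive in this transcript. The Python sketch below uses one commonly seen form of the declaration (an assumption, held in META_TAG) and writes the page bytes as UTF-8 so that the declaration and the actual encoding agree. The helper names build_page and as_references are illustrative.

```python
import html

# Assumption: a widely used HTML 4 style charset declaration.
META_TAG = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'


def build_page(body_text: str) -> bytes:
    """Return a small HTML page, declared and encoded as UTF-8."""
    page = "\n".join([
        "<html>", "<head>", META_TAG, "<title>Sample</title>", "</head>",
        "<body>", f"<p>{html.escape(body_text)}</p>", "</body>", "</html>",
    ])
    return page.encode("utf-8")


def as_references(text: str) -> str:
    """Alternative: replace non-ASCII characters with &#nnnn; references."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


page = build_page("\u0434")          # Cyrillic small letter de
print(page.decode("utf-8"))
print(as_references("\u0434"))
```

As the slide notes, the reference form alone is not understood by every application, so declaring and using UTF-8 directly is the safer route where the receiving software supports it.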

22 Support for Unicode/UTF-8. Can one trust the products? Maybe not. The next slide shows a Microsoft Internet Explorer 6 representation of an HTML document with the document encoding set to UTF-8 and the browser default font set to Arial Unicode MS. Some of my sample characters are not displayed at all where the font is defined as a non-Unicode font (rectangular boxes are displayed instead). Curiously, Microsoft Word displays some of those that IE6 does not. Netscape 7 (the latest version) displays all characters regardless of what the default font is set to. This involves automatic font substitution, which can be seen in the example (if a suitable substitution font is available!).

23 Support for Unicode/UTF-8 Microsoft Internet Explorer 6

24 Support for Unicode/UTF-8 Netscape 7.0

25 Support for Unicode/UTF-8. More importantly, note that setting the font to be italicised can result in the wrong character being displayed! Look at the characters (should be д), highlighted in red. The results are the same with IE6 and Netscape 7.0, so the problem looks likely to be in Windows font management.

26 Sorting Unicode/UTF-8. Sorting is not quite so straightforward as one might hope! A few products have taken the approach that each character set “locale” should be sorted independently. The effect of this is to separate the “Latin” sets. Microsoft products (and others) sort all the Latin sets as a whole, followed by each other set (Greek, then Cyrillic, etc.), one after another. There is no known way to sort Unicode “phonetically”.

27 Sorting Unicode/UTF-8. Examples of data sorted by Microsoft Word: Moscou, Moscov, Moscow, Moskow, Москва, Москов; Visean, Viséan; Roemer, Römer. The six different representations of the city “Moscow” were found in less than 1 minute using Google.
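
To see why sorting needs care, the words from this slide can be sorted in two ways in Python. The first is a plain sort by code point; the second strips combining marks before comparing, which is only a rough approximation and not the Unicode collation algorithm or the method used by any product named here. The word "Rose" is a hypothetical addition, not from the slide, included because it makes the two orders differ; fold_key is an illustrative name.

```python
import unicodedata

words = [
    "Moscou", "Moscov", "Moscow", "Moskow",
    "\u041c\u043e\u0441\u043a\u0432\u0430",   # Москва
    "\u041c\u043e\u0441\u043a\u043e\u0432",   # Москов
    "Visean", "Vis\u00e9an", "Roemer", "R\u00f6mer",
    "Rose",                                    # hypothetical extra entry
]


def fold_key(word: str):
    """Sort key that ignores accents and case, with the word as tie-breaker."""
    decomposed = unicodedata.normalize("NFD", word)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    return (base.casefold(), word)


by_code_point = sorted(words)
by_folded = sorted(words, key=fold_key)

print(by_code_point)   # "Rose" sorts before "Römer"
print(by_folded)       # "Römer" sorts before "Rose"
```

In both orders the Cyrillic entries follow all the Latin ones, matching the "Latin sets as a whole, then each other set" behaviour described on the previous slide.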

28 Support in Data Management Systems. A survey conducted in September 2002 of the thirteen Collection Management Systems listed on the United Kingdom Museums Documentation Association (MDA) web site revealed only two (ADLIB and MUSIMS) which claimed full Unicode or UTF-8 compatibility. A further three (CALM, KE EMu and Questor ARGUS) are actively working on multibyte solutions. One (Specify) did not reply but, given the core database (SQL Server) in use, should be able to handle UTF-8.

29 Support in Data Management Systems. This has important implications for portal developers. Unmodified, a query may return data in a variety of formats: plain ASCII, non-Western Windows Code Pages, bespoke fonts, Unicode text strings, UTF-8/UTF-16, etc. The onus will be on the provider to supply data mapped to UTF, but there are still likely to be inconsistencies and mapping errors.
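
A portal might normalise incoming fields along the following lines. This Python sketch assumes that each provider declares the encoding of its data; the function name normalise_field and its parameters are hypothetical, and real feeds would need more checks than shown. Note that html.unescape also converts named references such as &amp;, which may or may not be wanted.

```python
import html


def normalise_field(raw: bytes, declared_encoding: str = "utf-8") -> str:
    """Decode a field with its declared encoding and expand &#nnnn; strings.

    A wrong declaration raises UnicodeDecodeError (or silently produces the
    wrong characters for single-byte code pages), which is the kind of
    mapping error the slide warns about.
    """
    text = raw.decode(declared_encoding)
    return html.unescape(text)


moscow = "\u041c\u043e\u0441\u043a\u0432\u0430"   # Москва

from_code_page = normalise_field(moscow.encode("cp1251"), "cp1251")
from_utf8 = normalise_field(moscow.encode("utf-8"))
from_references = normalise_field(
    b"&#1052;&#1086;&#1089;&#1082;&#1074;&#1072;", "ascii"
)

# All three source formats end up as the same Unicode string,
# which can then be delivered to the client as UTF-8.
print(from_code_page == from_utf8 == from_references == moscow)
```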

30 Links & Acknowledgements. Much of the explanation of Unicode & UTF-8 originated in the UTF-8 and Unicode FAQ for Unix/Linux by Markus Kuhn and is reproduced by permission. The original document (and the whole site) also contains many useful links. Other resources: ordering information for ISO 10646; the Unicode Project; Unicode in XML and other Markup Languages; a useful list of the capabilities of various Windows and Apple (OS X) fonts; a lighthearted, but informative, UTF-8 sampler; Bitstream Font Fusion.

31 In Conclusion …..

32 The Taxonomic Database Working Group. International & Special Characters. Adrian Rissoné, Information Systems Manager, Department of Palaeontology, The Natural History Museum. We’re getting there, but it’s a slow process. It will be some time before the majority of applications have full Unicode/UTF-8 compatibility, especially at the client interface. The development of web-based products (including portals enabling searching over multiple datasets) is more promising. There are certainly still problems, mainly at the database interface and in programming languages, but delivery to a browser client (with the correct fonts available) is very nearly a reality. Proper handling of multiple human languages, rather than characters, is another story …..