Lecture 3: ISO/IEC 10646 and Unicode

Presentation transcript:

Lecture 3, Slide 1: ISO/IEC 10646 and Unicode
It is a coded character set (codeset)
– Designed for text processing and exchange
Features:
– Universal: covers the characters in almost all national standards
– Framework: the coding architecture is fixed, and code points can be filled in later
– Uniform and efficient: fixed-width encoding, so there is no need to work out the code length of each character (unlike mixed ASCII, Big5, GB text)
– Unambiguous: any given 16-bit (32-bit) value always represents the same character

Slide 2: UCS-4 (canonical form of ISO 10646)
Fixed 32-bit (actually 31-bit) coding: assignments run from 00 00 00 00 to 7F FF FF FF
Each plane: 2^16 = 65,536 code points
BMP (the Basic Multilingual Plane)
– Both Group No. and Plane No. are 00 (the first two bytes are zero)
Before ISO/IEC 10646 Part 2 came out (end of 2001), only the BMP contained characters
Four-byte structure:
Group No. (total: 128) | Plane No. (total: 256) | High Byte (total: 256) | Low Byte (total: 256)
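The four-byte layout can be illustrated with a few shifts and masks. This is a minimal sketch, not part of the original slides; the function names `split_ucs4` and `in_bmp` are made up for illustration.

```python
def split_ucs4(code):
    """Split a UCS-4 value into (group, plane, high_byte, low_byte)."""
    if not 0 <= code <= 0x7FFFFFFF:
        raise ValueError("UCS-4 values occupy 31 bits: 0 to 0x7FFFFFFF")
    group = (code >> 24) & 0x7F   # 7 bits  -> 128 groups
    plane = (code >> 16) & 0xFF   # 8 bits  -> 256 planes per group
    high = (code >> 8) & 0xFF     # 8 bits  -> 256 high-byte values per plane
    low = code & 0xFF             # 8 bits  -> 256 low-byte values
    return group, plane, high, low


def in_bmp(code):
    """A code is in the BMP when both group and plane are 00."""
    group, plane, _, _ = split_ucs4(code)
    return group == 0 and plane == 0


print(split_ucs4(0x4E2D), in_bmp(0x4E2D))     # a BMP ideograph
print(split_ucs4(0x20000), in_bmp(0x20000))   # first code point of Plane 2
```

Group 0, Plane 0 values are the only ones that fit in two bytes, which is why UCS-2 can carry the BMP directly.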

Slide 3: Code architecture of UCS-4
[Figure: 128 groups (Group 0 to Group 127), 256 planes per group; Plane 00 of Group 0 is the BMP]

Slide 4: UCS-2
UCS-2: 2-byte representation of UCS-4
– Covers the Basic Multilingual Plane (BMP)
– A switching mechanism uses a reserved code range of the BMP to reach another 16 planes (surrogate pairs, the mechanism standardized as UTF-16)
BMP zones:
– A-zone: alphabets, symbols, CJK miscellaneous
– I-zone: CJK ideographs
– O-zone: Hangul
– S-zone: surrogates
– R-zone: private use, compatibility, Arabic presentation forms

Slide 5: Unicode
Unicode is the implementation of ISO/IEC 10646 with a 16-bit representation using UCS-2
It also defines behavior associated with certain characters:
– Control character behavior
– Rendering behavior: combining characters
Examples:
– The control character BEL (bell) should cause a sound on the system
– The sequence U+0061 (a) U+0300 (combining grave accent) is rendered as the single symbol à
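The combining-character example can be checked with Python's standard `unicodedata` module. This is a small illustration added here, not part of the slides; it uses NFC normalization to show that the two-code sequence and the precomposed character U+00E0 are treated as equivalent.

```python
import unicodedata

decomposed = "a\u0300"   # U+0061 followed by U+0300 COMBINING GRAVE ACCENT
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed))                       # number of code points before composing
print(len(composed))                         # number of code points after composing
print(composed == "\u00e0")                  # compare with precomposed U+00E0
print(unicodedata.combining("\u0300") != 0)  # non-zero class marks a combining character
```

A renderer draws both forms as one glyph; normalization is what lets programs compare them as equal strings.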

Slide 6: Extensions of ISO/IEC 10646
– Extension A (in the BMP) has 6,582 characters, published in ISO/IEC 10646-1, Second Edition (2000)
– Extension B: all characters in 康熙字典 (Kangxi Dictionary) and 漢語大字典 (Hanyu Da Zidian), plus other characters such as those in the HK Supplementary Character Set; published in ISO/IEC 10646-2 (2001), a total of 43,253 characters in Plane 2 of UCS-4
– How would Extension B be supported in UCS-2? => By using an additional encoding scheme

Slide 7: Surrogate pairs
A surrogate pair is two UCS-2 codes, H followed by L, where
– H is in the range D800 to DBFF
– L is in the range DC00 to DFFF
For a given UCS-2 code (or code pair) U, the corresponding UCS-4 code point value N (scalar value) is
– $N = U$ if U is a single, non-surrogate value
– $N = (H - \mathrm{D800}_{16}) \times 400_{16} + (L - \mathrm{DC00}_{16}) + 10000_{16}$ if U is a surrogate pair
– Undefined for any other U in UCS-2
N is in the range 0 to $\mathrm{10FFFF}_{16}$
Exercise: given a surrogate pair, compute N; given N, find the surrogate pair
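The surrogate-pair arithmetic can be tried out directly. The sketch below is an illustration added to the transcript; the function names are hypothetical, and the inverse mapping (scalar value back to a pair) is the standard UTF-16 rule rather than something stated on the slide.

```python
def decode_surrogates(h, l):
    """Map a surrogate pair (H, L) to its scalar value N."""
    if not (0xD800 <= h <= 0xDBFF and 0xDC00 <= l <= 0xDFFF):
        raise ValueError("H must be in D800..DBFF and L in DC00..DFFF")
    return (h - 0xD800) * 0x400 + (l - 0xDC00) + 0x10000


def encode_surrogates(n):
    """Map a scalar value N outside the BMP to its surrogate pair (H, L)."""
    if not 0x10000 <= n <= 0x10FFFF:
        raise ValueError("only values 10000..10FFFF need a surrogate pair")
    offset = n - 0x10000              # 20-bit offset into the 16 extra planes
    h = 0xD800 + (offset >> 10)       # upper 10 bits go into H
    l = 0xDC00 + (offset & 0x3FF)     # lower 10 bits go into L
    return h, l


h, l = encode_surrogates(0x20000)     # first code point of Plane 2
print(hex(h), hex(l))
print(hex(decode_surrogates(h, l)))
```

Each half carries 10 bits, so a pair addresses 2^20 code points, which is exactly 16 planes of 65,536.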

Slide 8: UTF (UCS Transformation Format)
– Allows code values in UCS that correspond to some other coding standard (e.g. ASCII) to be transmitted exactly as they would be in that standard, a property known as transparency, while other code values are represented through an escape mechanism
– Variable-length encoding to achieve greater efficiency

Slide 9: UTF-8
UTF-8: 8-bit encoding for the 8-bit UNIX environment
– ASCII transparent
– The first byte indicates the number of bytes in the character's encoding
– Shortest-encoding principle, so that encoding/decoding is invertible (bijective)
– Saves storage space for ASCII and other non-ideographic characters
– Example: Unicode A A43 => UTF-8:
– Example: UTF CE 82 => UCS-4:
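The UTF-8 properties listed above can be seen in a hand-written encoder. This is a sketch added for illustration, not the slides' own material. It assumes the modern limit of U+10FFFF (at most four bytes), whereas the original UTF-8 definition for 31-bit UCS-4 allowed longer sequences. The names `utf8_encode` and `utf8_length` are hypothetical.

```python
def utf8_encode(cp):
    """Encode one scalar value using the shortest UTF-8 form."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x80:                      # 1 byte: identical to ASCII
        return bytes([cp])
    if cp < 0x800:                     # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:                   # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),   # 4 bytes: 11110xxx then three 10xxxxxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def utf8_length(first_byte):
    """Read the sequence length from the first byte alone."""
    if first_byte < 0x80:
        return 1
    if 0xC2 <= first_byte <= 0xDF:
        return 2
    if 0xE0 <= first_byte <= 0xEF:
        return 3
    if 0xF0 <= first_byte <= 0xF4:
        return 4
    raise ValueError("not a valid UTF-8 lead byte")


for cp in (0x41, 0xE0, 0x4E2D, 0x20000):
    encoded = utf8_encode(cp)
    print(f"U+{cp:04X} -> {encoded.hex(' ')} ({utf8_length(encoded[0])} bytes)")
```

Always choosing the smallest branch that fits is what the shortest-encoding principle means: each code point then has exactly one valid byte sequence.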

Slide 10: Character vs. glyph
Character: the smallest component of written language that has semantic value
Glyph: a shape that a character can take when it is rendered or displayed
Example: the letter A set in different typefaces (A, A, A, A) is one character with one code. The concrete shapes can be very different, yet they are given a single code point.
Coding of variants

Slide 11: ISO 10646/Unicode features for Chinese
– Han unification (Chinese, Japanese and Korean)
– Unification problems: different sources, non-cognate characters
– Three-dimensional conceptual model: semantics (x), abstract shape (y), actual shape (z)
Examples

Slide 12: Unification rules (認同規則)
R1, Source Separation Rule: if two ideographs are distinct in a primary source standard, then they are not unified. Why?
R2, Non-cognate (非同源) Rule: in general, if two ideographs are unrelated in historical derivation (non-cognate characters), then they are not unified.
R3: The abstract shape of each ideograph is determined by means of a two-level classification. Any two ideographs that possess the same abstract shape are unified unless disallowed by R1 or R2.
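The order in which R1 to R3 apply can be written as a small decision function. This is a toy model added for illustration only. The `Ideograph` fields are hypothetical stand-ins: `abstract_shape` for the result of the two-level classification, `cognate_group` for historical derivation, and `source_codes` for where each source standard encodes the ideograph. The labels used are made up and do not describe real characters.

```python
from dataclasses import dataclass, field


@dataclass
class Ideograph:
    name: str
    abstract_shape: str                 # hypothetical label from the two-level classification
    cognate_group: str                  # hypothetical label for historical derivation
    source_codes: dict = field(default_factory=dict)  # source standard -> code there


def should_unify(a, b):
    """Apply R1, then R2, then R3 to a pair of ideographs."""
    # R1: distinct codes in the same primary source standard block unification.
    for standard in a.source_codes.keys() & b.source_codes.keys():
        if a.source_codes[standard] != b.source_codes[standard]:
            return False
    # R2: non-cognate ideographs are not unified.
    if a.cognate_group != b.cognate_group:
        return False
    # R3: otherwise unify exactly when the abstract shapes match.
    return a.abstract_shape == b.abstract_shape


x = Ideograph("x", "shape-1", "origin-1", {"STD-A": 0x1234})
y = Ideograph("y", "shape-1", "origin-1", {"STD-B": 0x5678})
z = Ideograph("z", "shape-1", "origin-1", {"STD-A": 0x4321})

print(should_unify(x, y))   # same shape, cognate, never distinguished by one source
print(should_unify(x, z))   # STD-A encodes them separately, so R1 blocks unification
```

R1 and R2 act as vetoes on R3, which is why they are tested first.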

Slide 13: Example: component structure analysis

Slide 14: Sources of Unified Han Characters

Slide 15: Wide characters vs. multi-byte characters
Text information needs to be represented by the right data types.
– Multi-byte characters: data are processed on a per-byte basis: Big5, GB, EUC, even UTF-8
– Wide characters: fixed-width encoding, so no testing of the high bit is needed
Processing representation for wide characters:
– Big endian vs. little endian
– Data type dependent
– System architecture dependent
– Distinction: the byte order mark U+FEFF is stored as the bytes FE FF in big-endian data and FF FE in little-endian data
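The byte order distinction can be checked with Python's standard `codecs` constants. This sketch is an addition to the transcript; `detect_utf16_byte_order` is a hypothetical helper, and it only inspects the leading byte order mark rather than guessing from content.

```python
import codecs


def detect_utf16_byte_order(data):
    """Return 'big', 'little', or 'unknown' from a leading byte order mark."""
    if data.startswith(codecs.BOM_UTF16_BE):    # bytes FE FF
        return "big"
    if data.startswith(codecs.BOM_UTF16_LE):    # bytes FF FE
        return "little"
    return "unknown"


text = "\u4e2d"                                  # one ideograph, U+4E2D
be = codecs.BOM_UTF16_BE + text.encode("utf-16-be")
le = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

print(be.hex(), detect_utf16_byte_order(be))
print(le.hex(), detect_utf16_byte_order(le))
print(len(text.encode("utf-8")))                 # the same character as a multi-byte sequence
```

The 16-bit value is the same in both streams; only the order of its two bytes differs, which is what the mark lets a reader discover.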