UNICODE
- Character Sets and Coding Standards
- Han Unification and ISO 10646
- Encoding Evolution and Unicode
- Programming Unicode

Character Sets
- Character set: a complete group of characters for one or more writing systems.
- Coded character set: a mapping from a set of abstract characters to a set of integers.
- Character encoding scheme: a mapping from a coded character set to a set of octets.
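
The difference between a coded character set and a character encoding scheme can be sketched in a few lines of C. The coded character set assigns a character an integer (in Unicode, 'A' is 0x0041); the encoding scheme then decides which octets represent that integer. The function name ucs2_to_octets and the big_endian flag are assumptions made for this illustration and do not come from the slides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encoding-scheme step: serialize one 16-bit code value as two octets.
   big_endian != 0 writes the high octet first, otherwise the low octet first. */
void ucs2_to_octets(uint16_t code, int big_endian, unsigned char out[2])
{
    unsigned char hi = (unsigned char)(code >> 8);
    unsigned char lo = (unsigned char)(code & 0xFF);

    assert(out != NULL);
    out[0] = big_endian ? hi : lo;
    out[1] = big_endian ? lo : hi;
}
```

The same integer 0x0041 yields the octets 00 41 in one byte order and 41 00 in the other, which is why the two mappings are defined separately.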

Standards
- International standards: RFC 2130, ISO.
- RFCs (Requests for Comments) come out of the Internet Engineering Task Force (IETF).
- ISO is a worldwide federation of national standards bodies such as ANSI, JISC, KISI, CSA (GB) and ECMA.

Coded character sets (1)
ASCII (American Standard Code for Information Interchange), or ISO 646, is a 7-bit code that defines:
- the Latin alphabet;
- Arabic numerals;
- punctuation marks;
- some computer control codes.

Coded character sets (2)
- ISO 8859 standards: ISO 8859-1 to 15.
- ISO 8859-1 (Latin1) is the most widely used; it contains the characters necessary for writing Western European and Scandinavian languages. It is also an ANSI standard, known as the ANSI character set.
- The first 128 positions are identical to ASCII.
- All parts use an 8-bit byte.
- Windows uses ISO 8859-1 as its character set.

Coded character sets (3)
Asian language character sets
- Traditional Chinese (Taiwan)
  - BIG5: 8-16 bit, 94x157 matrix, 13002 characters.
  - CNS X5012: 16 bit.
- Simplified Chinese (China)
  - GB 2312: 8-16 bit, 94x94 matrix, 6763 characters.
  - GBK (extended GB 2312): 8-16 bit, 94x94 matrix.

Coded character sets (3)
- Japanese (Japan)
  - JIS X0208: 16 bit, 94x94 matrix, 6879 characters.
  - JIS X0212: 16 bit, 5801 supplemental kanji.
  - JIS X0201: 7 bit, JIS-Roman plus half-width katakana.
- Korean (Korea)
  - KSX-1001 (KSC 5601): 8-16 bit, 94x94 matrix, 4888 hanja.
  - KSC-5636: 7 bit, Korean version of ASCII.

Coded character sets (4)
- EUC (Extended UNIX Code).
- ISO 2022 escape sequences.
- Windows code pages (8-16 bit):
  - cp1252 (English) is ISO 8859-1, also called Latin1 or Windows ANSI;
  - cp1200 is the Unicode code page;
  - cp932 (Japanese) is Shift-JIS;
  - cp936 (Simplified Chinese) is close to GB 2312;
  - cp949 (Korean) is in KSC order;
  - cp950 (Traditional Chinese) is the same as Big 5.

Han Unification
- The Han standards consist of JIS X0208 (6349 Kanji), GB 2312 (6763 Hanzi), CNS (13951 Hanzi) and KSC 5601 (4888 Hanja).
- All characters in these standards must be included in ISO 10646.
- Without unification, more than 100,000 characters would be encoded separately.
- Characters from these standards should be unified in ISO 10646; that is, identical characters from two or more of these standards may share the same code point.
- The unification of Han characters allows for approximately 40,000 characters.

Coded character sets (5)
- Unicode: 16 bit, covers most languages, developed by the Unicode Consortium.
- Unicode 1.0 → Unicode 2.0, which contains 38,885 characters. Unicode 3.0 is expected this year.
- Unicode is similar to ISO 10646, the UCS (Universal Character Set) encoding:
  - UCS-2: i.e. Unicode, 16 bit, a fixed-width 2-byte format.
  - UCS-4: also called ISO 10646, 32 bit, a 4-byte format; 32,000 planes, each with a capacity of 65,000 characters, for a total of 2,080 million characters. Only the first plane is in use (that is Unicode).
  - UTF-7: 7-40 bit, UCS Transformation Format, a Unicode character encoding scheme using 7-bit units.
  - UTF-8: 7-48 bit, UCS Transformation Format, a Unicode character encoding scheme using 8-bit units.
  - UTF-16 (UCS-2E).
- Windows NT uses Unicode and therefore has a single character set.

UTF-8
- UCS Transformation Format.
- A variable-width (multi-byte) encoding format.
- In UTF-8, the standard ASCII characters occupy only one byte; other Unicode characters occupy two or three bytes.

Table: The UTF-8 Encoding
Start Character   End Character   Required Data Bits   Binary Byte Sequence (x = data bits)
\u0000            \u007F          7                    0xxxxxxx
\u0080            \u07FF          11                   110xxxxx 10xxxxxx
\u0800            \uFFFF          16                   1110xxxx 10xxxxxx 10xxxxxx
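
The three rows of the UTF-8 table translate directly into bit operations. The following is a minimal sketch in C, limited to the 16-bit range the table covers; the function name utf8_encode_bmp is an assumption for this example, and the sketch does not reject surrogate code values (0xD800 to 0xDFFF), which a production encoder would need to handle.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one 16-bit code point (\u0000..\uFFFF) as UTF-8.
   Writes 1 to 3 bytes into out and returns the number of bytes written. */
size_t utf8_encode_bmp(uint16_t cp, unsigned char out[3])
{
    assert(out != NULL);

    if (cp <= 0x007F) {                 /* 7 data bits:  0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp <= 0x07FF) {                 /* 11 data bits: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    /* 16 data bits: 1110xxxx 10xxxxxx 10xxxxxx */
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
}
```

For example, 'A' (0x0041) stays a single byte 41, while 0x00E9 becomes the two bytes C3 A9 and 0x9EC3 becomes the three bytes E9 BB 83.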

Encoding Evolution (diagram)
- 7 bits: ASCII, GB, JIS, KSC
- 8 bits: EUC, PC code pages, Shift-JIS, ISO 2022, ISO 8859, EBCDIC, Apple code pages
- 16 bits: Unicode, ISO 10646 (UCS-2)

Programming Unicode
- How does Win32 support it?
- How to write apps for Unicode
- How to be backward compatible

Unicode in Win32 (diagram)
- Components shown: Win32 App (Unicode); Win32 App (non-Unicode) with ANSI-to-Unicode conversion; Win32 Client; client-server boundary; Win32 Server; Windows NT Base.
- The system is fully Unicode internally.

- Separate Unicode data type
  - Wide character: Unicode
  - 8-bit char: ANSI, DBCS
- Parallel Unicode and ANSI APIs
  - Unicode and ANSI window classes
  - Implicit code conversion
- Resources always in Unicode

Programming Unicode
- Basic techniques
  - How to migrate an existing code base
- Special topics
  - Interaction with non-Unicode apps
  - Untyped file system
- Foundation for script/language-specific functionality
- Migration example: Win32

Windows Unicode Programming (diagram)
- Labels shown: Common Source; Win32 EXE (non-Unicode); Win32 EXE (Unicode); Code Conversion; Data Exchange.

Generic data types in C

Generic data types
Type     Unicode build   Non-Unicode build
TCHAR    wchar_t         char
LPTSTR   wchar_t *       char *

Explicit data types
Type     Definition
WCHAR    wchar_t
CHAR     char
LPWSTR   wchar_t *
LPSTR    char *

Macros and literals
- String literals: TEXT("hello") resolves to "hello" or L"hello".
- Numeric equivalence: 'A' = 0x41, L'A' = 0x0041.
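
The mechanism behind the generic type and the literal macro can be approximated in portable C. The sketch below does not use the Windows headers; DEMO_UNICODE, DEMO_TCHAR, DEMO_TEXT and demo_tcslen are hypothetical stand-ins that only mimic the TCHAR/TEXT pattern, so that the example compiles anywhere.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical stand-ins that mimic the Win32 TCHAR / TEXT() pattern. */
#ifdef DEMO_UNICODE
typedef wchar_t DEMO_TCHAR;
#define DEMO_TEXT(s) L##s      /* "hello" becomes L"hello" */
#else
typedef char DEMO_TCHAR;
#define DEMO_TEXT(s) s         /* "hello" stays "hello" */
#endif

/* Count characters (not bytes) in a generic string. */
size_t demo_tcslen(const DEMO_TCHAR *s)
{
    size_t n = 0;

    assert(s != NULL);
    while (s[n] != 0) {
        n++;
    }
    return n;
}
```

With or without DEMO_UNICODE defined, demo_tcslen(DEMO_TEXT("hello")) compiles unchanged and counts five characters; only the width of each character differs. The numeric equivalence also holds in plain C on ASCII-based platforms: 'A' and L'A' both compare equal to 0x41.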

Dual function prototypes
- Generic API prototype: SetWindowText(HWND, LPTSTR);
- Resolved, depending on #ifdef UNICODE, to one of the explicit prototypes:
  SetWindowTextA(HWND, LPSTR);
  SetWindowTextW(HWND, LPWSTR);
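
How a generic name resolves to an explicit A or W function can be illustrated without the Windows headers. In this sketch the functions DemoTitleLengthA and DemoTitleLengthW and the macros DemoTitleLength, DEMO_TEXT and DEMO_UNICODE are all hypothetical; they only imitate the naming scheme and are not part of Win32.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical pair of explicit entry points, named in the A/W style. */
size_t DemoTitleLengthA(const char *title)
{
    assert(title != NULL);
    return strlen(title);
}

size_t DemoTitleLengthW(const wchar_t *title)
{
    assert(title != NULL);
    return wcslen(title);
}

/* The generic name is only a macro; the build flag selects the real function. */
#ifdef DEMO_UNICODE
#define DemoTitleLength DemoTitleLengthW
#define DEMO_TEXT(s) L##s
#else
#define DemoTitleLength DemoTitleLengthA
#define DEMO_TEXT(s) s
#endif
```

Source code calls DemoTitleLength(DEMO_TEXT("Unicode")) once, and the same line builds against either explicit function depending on whether DEMO_UNICODE is defined at compile time.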

Basic conversion steps
- Use generic data types (TCHAR, LPTSTR) for text.
- Use explicit types for BYTE pointers (data buffers).
- Use the TEXT() macro for literal constants.
- Adjust pointer arithmetic.
- Use generic function prototypes.

Conversion metrics
- About 10% of source lines need to be modified by global replace: LPSTR → LPTSTR, strstr() → wcswcs().
- About 2-5% need small modifications, typically where character counts and byte counts are mixed: lstrlen(s) returns a count of characters, so a byte count becomes lstrlen(s) * sizeof(TCHAR).
- Less than 1% need a revised algorithm: OpenFile() → SearchPath(); CreateFile().
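
The small modifications mostly concern places where code silently treats one character as one byte. A portable sketch of the two usual corrections follows; the macro DEMO_COUNTOF and the function demo_wide_bytes are names invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Number of elements (characters) a fixed-size array can hold.
   sizeof(a) alone gives bytes, which is too large for wide characters. */
#define DEMO_COUNTOF(a) (sizeof(a) / sizeof((a)[0]))

/* Bytes needed to store a wide string, including its terminator. */
size_t demo_wide_bytes(const wchar_t *s)
{
    assert(s != NULL);
    return (wcslen(s) + 1) * sizeof(wchar_t);
}
```

Passing DEMO_COUNTOF(buf) where an API expects a length in characters, and demo_wide_bytes(s) where it expects a size in bytes, keeps the two units from being confused when the character type widens.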

Summary: Migrating a Windows-based Program to Unicode
1. Modify your code to use generic data types.
2. Modify your code to use generic function prototypes.
3. Surround any character or string literal with the TEXT macro.
4. Create generic versions of your data structures.
5. Change your make process.
6. Adjust pointer arithmetic.
7. Check for any code that assumes a character is always 1 byte long.
8. Add Unicode-specific code if necessary.
9. Add code to support special Unicode characters.
10. Determine how using Unicode will affect file I/O.
11. Double-check the way in which you retrieve command line arguments.
12. Debug your port by enabling your compiler's type checking.