1 Unicode Introduction Ken Zook November, 2006. Unicode Introduction 2 Unicode properties 0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061; Code point:

Slides:



Advertisements
Similar presentations
Globalization Gotchas
Advertisements

Unicode Normalization Mark Davis
21 st International Unicode Conference Dublin, Ireland, May Optimizing the Usage of Normalization Vladimir Weinstein Globalization.
Tafseer Ahmed Department of Computer Science University of Karachi Urdu on Linux International Support.

מבנה מחשב תרגול 2 ייצוג תווים בחומרה. A programmer that doesn’t care about characters encoding in not much better than a medical doctor who doesn’t believe.
1/25 Writing Character sets Unicode Input methods.
Data Representation in Computers
Lecture 3 1 ISO/IEC and Unicode It is a coded character set(codeset) –Designed for text processing and exchange Features: –Universal: characters.
Review1 What is multilingual computing? Bilingual, trilingual, vs. Multilingual What are the fundamental issues in multi-lingual computing? –Representation.
CCE-EDUSAT SESSION FOR COMPUTER FUNDAMENTALS Date: Session III Topic: Number Systems Faculty: Anita Kanavalli Department of CSE M S Ramaiah.
COMPUTER FUNDAMENTALS David Samuel Bhatti
ASCII and Unicode. ASCII Inside a computer, EVERYTHING is a number – that includes music, sound, and text. In the early days of computers, every manufacturer.
2.1.4 BINARY ASCII CHARACTER SETS A451: COMPUTER SYSTEMS AND PROGRAMMING.
Unicode, character sets, and a a little history. Historical Perspective First came EBCIDIC (6 Bits?) Then in the early 1960s came ASCII – Most computers.
CHARACTERS Data Representation. Using binary to represent characters Computers can only process binary numbers (1’s and 0’s) so a system was developed.
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky Veronika.
1 © 2000, Cisco Systems, Inc. DNSSEC IDN Patrik Fältström
Introduction to Human Language Technologies Tomaž Erjavec Karl-Franzens-Universität Graz Tomaž Erjavec Lecture: Character sets
Sophia Antipolis, September 2006 Multilinguality, localization and internationalization Miruna Bădescu Finsiel Romania.
Unicode & W3C Jataayu Software C. Kumar January 2007.
Globalisation & Computer Systems week 5 1. Localisation presentations 2.Character representation and UNICODE UNICODE design principles UNICODE character.
Chapter Four Documents: The raw material How to Build a Digital Library Ian H. Witten and David Bainbridge.
Unicode (and Java) Brice Giesbrecht.
ASCII and Unicode.
Encoding and fonts Edward Garrett Software Developer, ELAR.
1 herbert van de sompel CS 502 Computing Methods for Digital Libraries Cornell University – Computer Science Herbert Van de Sompel
Digital Multimedia, 2nd edition Nigel Chapman & Jenny Chapman Chapter 10 This presentation © 2004, MacAvon Media Productions Characters & Fonts.
Dale Roberts Department of Computer and Information Science, School of Science, IUPUI CSCI N305 Information Representation: Characters and Images.
1 CS 502: Computing Methods for Digital Libraries Lecture 4 Text.
Using the Unicode Standard for Linguistic Data: Preliminary Guidelines Deborah Anderson Researcher Dept. of Linguistics, UC Berkeley.
Dale Roberts Department of Computer and Information Science, School of Science, IUPUI CSCI 230 Dale Roberts, Lecturer Information.
UNICODE DAY /22/14 LING 3820 & 6820 Natural Language Processing Harry Howard Tulane University.
21 st International Unicode Conference Dublin, Ireland, May Folded Trie: Efficient Data Structure for All of Unicode Vladimir Weinstein
Implementation Issues Mark Davis Properties.
Globalisation & Computer systems Week 5/6 Character representation ACII and code pages UNICODE.
Anlab ( ) Kim, Yangjung Characters & Fonts.
UNICODE & Indic Scripts
The Information School of the University of Washington Oct 13fit digital1 Digital Representation INFO/CSE 100, Fall 2006 Fluency in Information Technology.
Representation of Characters
Text Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See
Understanding Character Encodings Basics of Character Encodings that all Programmers should Know. Pritam Barhate, Cofounder and CTO Mobisoft Infotech.
Data Encoding COSC Computers and Data Computers store information as sequences of bits Computers store many types of data: numbers text audio images.
M204 - Data Representation
© 2001, Penn State University Encoding on the Internet Elizabeth J. Pyatt CETS.
Cupertino, CA, USA / September, 2000First ICU DeveloperWorkshop1 Transformation Support Alan Liu Globalization Center of Competency IBM Emerging Technology.
Information Coding Schemes Group Member : Yvonne Tiffany Jurifah bt Junaidi Clara Jane George.
Character representation in the computers Home Assignment 1 Assigned. Deadline 2016 January 24th, Sunday.
Writing System Implementation On-the-Fly Extensibility for the common man Sharon Correll, SIL International Copyright © 2001.
Unicode WTF is UTF? (for Secondary School Students) Jan Zidek Tieto Czech s.r.o. ☺ U+263A.
Objectives  Explain the basic Unicode concepts in plain language  Install SILConverters 4.0  Install the converters for your branch  Convert several.
Basics of Unicode (base upon a presentation by NRSI, SIL International)
1.4 Representation of data in computer systems Character.
Introduction to computer science Lec2 cs111. Extended Binary Coded Decimal Interchange Code (EBCDIC) is an 8- bit character encoding used mainly on.
Complex Text Layout Issues with examples from Myanmar
Binary Representation in Text
Binary Representation in Text
Unit 2.6 Data Representation Lesson 2 ‒ Characters
Machine level representation of data Character representation
Binary 1 Basic conversions.
Data Encoding Characters.
TOPICS Information Representation Characters and Images
Representing Characters
Presenting information as bit patterns
COMS 161 Introduction to Computing
Chapter 2 Data Representation.
COMS 161 Introduction to Computing
C Programming Language
ASCII LP1.
ASCII and Unicode.
Presentation transcript:

1 Unicode Introduction Ken Zook November, 2006

Unicode Introduction 2 Unicode properties 0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061; Code point: 0041 Name: LATIN CAPITAL LETTER A General category: Uppercase letter (Lu) Canonical combining class: Standard spacing (0) Bidirectional category: Left-to-right (L) Mirrored: no (N) Lowercase mapping: 0061 Representative glyph Semantic properties A

November, 2006Unicode Introduction 3 Unicode code space Basic multilingual plane (BMP) Private Use Area (PUA) Surrogates General scripts Symbols & punctuation East Asian Compatibility & specials Planes 1-16 accessed by surrogates when using UTF FFFF 0000FFFF

November, 2006Unicode Introduction 4 Encoding Unicode UTF-16 Surrogates: D800-DFFF High: D800-DBFF, Low: DC00-DFFF 0000FFFF Surrogates used to access FFFF in UTF-16 D800 DF UTF-32 = (1 32-bit value / code point) UTF-16 = D800 DF31 (FW/Win) ( bit values / code point) UTF-8 = F0 90 8C B1 (XML) (1-4 8-bit values / code point) U GOTHIC LETTER BAIRKAN

November, 2006Unicode Introduction 5 Private Use Area (SIL) International PUA: F100-F8FF (2,047) Entity PUA: E000-EFFF (4,095) E010 (Philippines) maps to F2010 E010 (Russia) maps to F1010 Unique entity mappings in upper PUA PUA: E000-F8FF (6,400) PUA: F0000-FFFFD, FFFD (131K)

November, 2006Unicode Introduction 6 Canonical equivalence 01FA 212B C A 0301 LATIN CAPITAL LETTER A WITH RING ABOVE AND ACUTE ANGSTROM SIGN COMBINING ACUTE ACCENT LATIN CAPITAL LETTER A WITH RING ABOVE COMBINING ACUTE ACCENT LATIN CAPITAL LETTER A COMBINING RING ABOVE COMBINING ACUTE ACCENT

November, 2006Unicode Introduction 7 Normalization (NFD) 006F F ≡ 006F D 0328 ≡ 006F ≡ 006F ED ≡ 01EB 0304 ≡ 006F D;LATIN SMALL LETTER O WITH MACRON;;0;;006F 0304… 01ED;LATIN SMALL LETTER O WITH OGONEK AND MACRON;;0;;01EB 0304… 01EB;LATIN SMALL LETTER O WITH OGONEK;;0;;006F 0328… 0304;COMBINING MACRON;;230… 0328;COMBINING OGONEK;;202…

November, 2006Unicode Introduction 8 Normalization (NFC) 006F ≡ 01EB 0304 ≡ 01ED 006F ≡ 006F ≡ 01EB 0304 ≡ 01ED 014D 0328 ≡ 006F ≡ 01EB 0304 ≡ 01ED 01ED ≡ 006F ≡ 01EB 0304 ≡ 01ED 014D;LATIN SMALL LETTER O WITH MACRON;;0;;006F 0304… 01ED;LATIN SMALL LETTER O WITH OGONEK AND MACRON;;0;;01EB 0304… 01EB;LATIN SMALL LETTER O WITH OGONEK;;0;;006F 0328… 0304;COMBINING MACRON;;230… 0328;COMBINING OGONEK;;202…

November, 2006Unicode Introduction 9 Case mapping SpecialCasing.txt + UnicodeData.txt Unicode digraphs require title casing Case mapping is not reversible McConnel  mcconnel  MCCONNEL 01F1;LATIN CAPITAL LETTER DZ;Lu;;;;;;;01F3;01F2 01F2;LATIN CAPITAL LETTER D WITH SMALL LETTER Z;Lt;;;;;;;;;01F1;01F3; 01F3;LATIN SMALL LETTER DZ;Ll;;;;;;;;;01F1;;01F2

November, 2006Unicode Introduction 10 Case mapping Case mapping may produce strings of different length 01F0  004A 030C Case mapping may depend on the locale English0069  0049 Turkish/Azeri0069  0130

November, 2006Unicode Introduction 11 Case mapping Case mapping may depend on context 03A3  03C3 03A3  03C2

November, 2006Unicode Introduction 12 Case mapping Some characters require special handling 1F80  1F88 or...1F … 03B  1F08 03B9 Case mapping may not preserve normalization 01F  004A 030C 0323 ≡ 004A C NFC NFC

November, 2006Unicode Introduction 13 babibu b Smart rendering: Arabic bbabababbabbabibabibabibbabib Screen: Keyboard: babibubabibu e f Code points: e f e f e e e e 0628

November, 2006Unicode Introduction 14 Smart rendering: Burmese kkrkrkrukru Screen: Keyboard: kruikrui b 102f 102d Code points: b 102f b1000

November, 2006Unicode Introduction 15 Smart rendering: Tamil UUrUrUr rUr rUUr rU yUr rU yUUr rU yU NUr rU yU NUUr rU yU NU mUr rU yU NU mUUr rU yU NU mU kUr rU yU NU mU kUUr rU yU NU mU kU j Screen: Keyboard: Ur rU yU NU mU kU jU Code points: b9c bc2 b95 bc2bae bc2 ba3 bc2 baf bc2bb0bb0 bc2b8a bb0b8abaf ba3 baeb95 b9c