Basics of Unicode (based upon a presentation by NRSI, SIL International)


By the Numbers How is text stored in a computer? As a sequence of numbers that represent characters
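As a minimal illustration in Python (the sample string is only an assumption for the demo), text can be turned into its sequence of numbers and back again:

```python
# Each character of a string corresponds to one number (its code point).
text = "Peace"
numbers = [ord(ch) for ch in text]
print(numbers)

# The numbers alone are enough to rebuild the text.
restored = "".join(chr(n) for n in numbers)
print(restored == text)
```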

Encoding The Heart of Data Processing:

Unicode Universally agreed Every character has its own number 100,000+ already defined Space for 1,100,000+ Everyone is using it Windows, Mac, Linux, etc. All fonts All current applications

Unicode Character Set Unicode codespace: U+0000 to U+10FFFF 17 planes of 64K (0x10000) code points each Plane 0: BMP (Basic Multilingual Plane), U+0000 to U+FFFF Includes Latin, Ethiopic, Arabic, Hebrew & Greek Other planes not used in Africa
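The plane arithmetic can be sketched in a few lines of Python. The helper name `plane_of` is hypothetical and introduced only for this illustration:

```python
def plane_of(code_point: int) -> int:
    """Return the plane number (0 to 16) of a Unicode code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode codespace")
    # Each plane holds 0x10000 code points, so the plane is the value
    # of the bits above the low 16 bits.
    return code_point >> 16

print(plane_of(0x0041))    # Latin capital A, in the BMP
print(plane_of(0x1200))    # an Ethiopic syllable, also in the BMP
print(plane_of(0x10FFFF))  # last code point of the last plane
print(17 * 0x10000)        # total size of the codespace
```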

Unicode: BMP layout, in order: General Scripts; Symbols; CJK Misc.; CJKV Ideographs; Yi; Hangul; reserved for surrogate code values; PUA; Compatibility

Unicode Design Principles Encode Characters – not glyphs e.g. Arabic contextual forms: U+0628 ARABIC LETTER BEH: ﺏ ﺐ ﺒ ﺑ U+006A LATIN SMALL LETTER J: j j Encode Characters – not graphemes multi-graphs encoded as sequences: “ch” ↔ c + h, “ny” ↔ n + y

Unicode Design Principles Character Semantics General Category Letter, punctuation, number, symbol Case CAPITAL, small, Title A - U+0041 a - U+0061
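These properties can be inspected with Python's standard `unicodedata` module. The helper `describe` below is a hypothetical name used only for this sketch:

```python
import unicodedata

def describe(ch: str) -> tuple:
    """Return (code point label, character name, general category)."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))

# Lu = uppercase letter, Ll = lowercase letter, Lt = titlecase letter,
# Nd = decimal digit, Po = other punctuation, Sm = math symbol.
for ch in ["A", "a", "\u01C5", "5", ",", "+"]:
    print(describe(ch))

# Case is a property of the character, so case mapping needs no font.
print("a".upper(), "A".lower())
```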

Unicode Design Principles Unification Unify across languages within the same script French é = U+00E9 e + high tone: é = U+00E9 Different scripts don't unify characters Latin capital B: B = U+0042 Cyrillic capital ve: В = U+0412 ● DO NOT MIX AND MATCH ACROSS SCRIPTS
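A short Python sketch shows why look-alike letters from different scripts must not be mixed. The helper `scripts_used` is hypothetical, and taking the first word of the character name as a script label is only a rough heuristic assumed for this demo, not a real script lookup:

```python
import unicodedata

latin_b = "\u0042"      # LATIN CAPITAL LETTER B
cyrillic_ve = "\u0412"  # CYRILLIC CAPITAL LETTER VE

print(unicodedata.name(latin_b))
print(unicodedata.name(cyrillic_ve))
print(latin_b == cyrillic_ve)  # they look alike but are different characters

def scripts_used(text: str) -> set:
    """Rough heuristic: first word of each letter's Unicode name."""
    return {unicodedata.name(ch).split()[0] for ch in text if ch.isalpha()}

print(scripts_used("BOX"))        # all Latin
print(scripts_used("\u0412OX"))   # Cyrillic ve mixed into a Latin word
```

A word that reports more than one script here would fail to match in search or sorting even though it looks correct on screen.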

Unicode Design Principles Logical Order Store in reading order, not visual order Peace שלם is stored as: P 0050, e 0065, a 0061, c 0063, e 0065, then ש 05E9, ל 05DC, ם 05DD
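The stored order can be checked in Python. This sketch assumes the two words are separated by an ordinary space (U+0020), which the slide does not list:

```python
import unicodedata

text = "Peace \u05E9\u05DC\u05DD"  # Latin word, space, Hebrew word

# Code points in the order they are stored (logical, i.e. reading order).
stored = [f"{ord(ch):04X}" for ch in text]
print(stored)

# The bidirectional class tells the renderer which way each run flows:
# L = left-to-right, R = right-to-left, WS = whitespace.
classes = [unicodedata.bidirectional(ch) for ch in text]
print(classes)
```

The Hebrew letters are stored first-read-first; reversing them for display is the job of the rendering system, not of the stored data.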

Unicode Design Principles Dynamic Composition Base characters + combining marks: a + combining diaeresis ⇒ “ä”, n + combining tilde ⇒ “ñ”, c + combining mark below ⇒ “c̱”
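A small Python sketch of dynamic composition follows. The specific marks chosen (U+0308, U+0303, U+0331) are assumptions made for the illustration, and `parts` is a hypothetical helper name:

```python
import unicodedata

def parts(s: str) -> list:
    """List (name, combining class) for every code point in s."""
    return [(unicodedata.name(ch), unicodedata.combining(ch)) for ch in s]

a_diaeresis = "a" + "\u0308"  # base + COMBINING DIAERESIS
n_tilde = "n" + "\u0303"      # base + COMBINING TILDE
c_below = "c" + "\u0331"      # base + COMBINING MACRON BELOW (assumed mark)

for s in (a_diaeresis, n_tilde, c_below):
    # A base character has combining class 0; marks have a non-zero class.
    print(s, len(s), parts(s))
```

Each string displays as one accented letter but is stored as two code points, a base followed by its mark.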

Unicode Design Principles Multiple combining marks Any number of combining marks can follow a base character ⇒ “ũ̥̕” Each combining mark has its own relative order ⇒ “ũ̥̕” Marks usually stack vertically outward from the base character ⇒ “ü̥̯̃”

Unicode Design Principles Unicode has a dedicated area for private use, called the PUA (Private Use Area) Unicode is extensible, but adding new characters is a slow process Use what is already there
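Private use code points can be recognised by their general category, `Co`. The helper name `is_private_use` is hypothetical and used only for this sketch:

```python
import unicodedata

def is_private_use(ch: str) -> bool:
    """True if the character lies in a Private Use Area (category Co)."""
    return unicodedata.category(ch) == "Co"

print(is_private_use("\uE000"))       # first code point of the BMP PUA
print(is_private_use("\uF8FF"))       # last code point of the BMP PUA
print(is_private_use("\U000F0000"))   # plane 15 is also private use
print(is_private_use("A"))            # an ordinary letter
```

Data that contains such characters only makes sense together with the private agreement (usually a specific font) that defines them.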

Unicode Normalization Canonical Equivalence Different ways of storing the same thing: é = U+00E9 (precomposed); é = U+0065 U+0301 (e + combining acute) U+00E9 ≡ U+0065 U+0301 To be treated as identical in all respects Not all programs do this

Unicode Normalization Normal Forms NFC (Normalization Form C, composed): é = U+00E9 Sequences reduced to minimal length NFD (Normalization Form D, decomposed): é = U+0065 U+0301 Sequences expanded to maximal length
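Canonical equivalence and the two normal forms can be tried out with Python's `unicodedata.normalize`. The variable names are assumptions for this sketch:

```python
import unicodedata

composed = "\u00E9"          # é as a single code point
decomposed = "\u0065\u0301"  # e followed by COMBINING ACUTE ACCENT

# A plain comparison sees two different strings.
print(composed == decomposed)

# After normalizing both sides to the same form they compare equal.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)
print(nfc == composed, len(nfc))
print(nfd == decomposed, len(nfd))
```

A program that compares or searches text would typically normalize both inputs to one form first; which form to pick is a design choice, as long as it is applied consistently.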

Unicode Normalization Canonical Combining Order In which order should non-interacting marks be stored? ⇒ “ữ̥” Important for comparison and sorting
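Canonical ordering can be observed with NFD in Python. The marks used here (tilde above, ring below, diaeresis) are assumptions chosen for the demo:

```python
import unicodedata

# Tilde sits above (combining class 230), ring sits below (class 220).
# They do not interact, so both storage orders mean the same thing.
tilde_then_ring = "u\u0303\u0325"
ring_then_tilde = "u\u0325\u0303"

nfd_a = unicodedata.normalize("NFD", tilde_then_ring)
nfd_b = unicodedata.normalize("NFD", ring_then_tilde)
print(nfd_a == nfd_b)                    # both end up in one canonical order
print([f"{ord(c):04X}" for c in nfd_a])  # lower class number comes first

# Two marks of the same class do interact, so their order is preserved.
tilde_diaeresis = unicodedata.normalize("NFD", "u\u0303\u0308")
diaeresis_tilde = unicodedata.normalize("NFD", "u\u0308\u0303")
print(tilde_diaeresis == diaeresis_tilde)
```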

Unicode Storage ● Ways of storing Unicode codepoints ● UTF32 – Single 32 bit number ● UTF16 – Single 16 bit number for BMP, pairs for higher values ● UTF8 – Single byte for ASCII; sequences of 2 to 4 'upper ASCII' bytes (values 0x80 and above) for everything else
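To compare the three storage forms, the following Python sketch encodes a few sample characters and counts the bytes. The helper `encoded_sizes` and the sample characters are assumptions for the demo; the big-endian codec variants are used so that no byte order mark is added to the count:

```python
def encoded_sizes(ch: str) -> dict:
    """Number of bytes needed for one character in each encoding form."""
    return {enc: len(ch.encode(enc)) for enc in ("utf-32-be", "utf-16-be", "utf-8")}

samples = ["A", "\u00E9", "\u1200", "\U0001F600"]  # ASCII, Latin-1 range, Ethiopic, outside the BMP
for ch in samples:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```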

Unicode Storage BOM (Byte Order Mark) U+FEFF: Zero Width Non-Breaking Space U+FFFE: Noncharacter (never valid, so a byte-swapped BOM is detectable) Identifies Encoding Scheme: UTF8: 0xEF 0xBB 0xBF (shows as "ï»¿" when misread as Latin-1) UTF16BE: 0xFE 0xFF UTF16LE: 0xFF 0xFE UTF32BE: 0x00 0x00 0xFE 0xFF UTF32LE: 0xFF 0xFE 0x00 0x00
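BOM detection can be sketched with the constants from Python's standard `codecs` module. The function name `detect_bom` and the returned labels are hypothetical; real files without a BOM need other means of detection:

```python
import codecs

# UTF32LE must be tested before UTF16LE because its BOM (FF FE 00 00)
# begins with the UTF16LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_BE, "UTF32BE"),
    (codecs.BOM_UTF32_LE, "UTF32LE"),
    (codecs.BOM_UTF8, "UTF8"),
    (codecs.BOM_UTF16_BE, "UTF16BE"),
    (codecs.BOM_UTF16_LE, "UTF16LE"),
]

def detect_bom(data: bytes):
    """Return the encoding scheme named by a leading BOM, or None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None

print(detect_bom(b"\xef\xbb\xbfhello"))
print(detect_bom("hi".encode("utf-16")))  # Python's utf-16 codec writes a BOM
print(detect_bom(b"plain ascii"))
```

Note that FF FE 00 00 is strictly ambiguous: it could also be a UTF16LE BOM followed by U+0000, so this sketch simply prefers the longer match.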

Unicode Storage UTF8 Used for file storage Plays well with old 8-bit applications Bit structures:
0000-007F: 0xxx xxxx
0080-07FF: 110x xxxx 10xx xxxx
0800-FFFF: 1110 xxxx 10xx xxxx 10xx xxxx
10000-10FFFF: 1111 0xxx 10xx xxxx 10xx xxxx 10xx xxxx
● BOM is redundant, but some files still get it.
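The UTF8 bit structures translate directly into shifts and masks. The function `utf8_encode` below is a hand-written sketch for illustration only (in practice `str.encode("utf-8")` does this), and its name is an assumption:

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point to UTF-8 by filling in the bit patterns."""
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded")
    if 0 <= cp <= 0x7F:        # 0xxx xxxx
        return bytes([cp])
    if cp <= 0x7FF:            # 110x xxxx 10xx xxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:           # 1110 xxxx 10xx xxxx 10xx xxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp <= 0x10FFFF:         # 1111 0xxx 10xx xxxx 10xx xxxx 10xx xxxx
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("outside the Unicode codespace")

# Cross-check the sketch against Python's built-in encoder.
for cp in (0x41, 0xE9, 0x1200, 0x1F600):
    print(f"U+{cp:04X}", utf8_encode(cp).hex(), utf8_encode(cp) == chr(cp).encode("utf-8"))
```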

Unicode Storage UTF16 Used for pure Unicode storage Good average storage performance and speed Bit Structures:
0000-FFFF: xxxxxxxx xxxxxxxx
10000-10FFFF: 110110xx xxxxxxxx 110111xx xxxxxxxx
Where the 20 x bits are USV – 0x10000
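The surrogate pair calculation can be sketched as follows. The function name `utf16_code_units` is hypothetical, and the result is cross-checked against Python's own `utf-16-be` codec:

```python
def utf16_code_units(cp: int) -> list:
    """Return the 16-bit code unit(s) that represent one code point."""
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate code points are not characters")
    if 0 <= cp <= 0xFFFF:
        return [cp]                    # BMP: a single 16-bit number
    if cp > 0x10FFFF:
        raise ValueError("outside the Unicode codespace")
    v = cp - 0x10000                   # 20 bits remain after subtracting
    high = 0xD800 | (v >> 10)          # 110110xx xxxxxxxx (top 10 bits)
    low = 0xDC00 | (v & 0x3FF)         # 110111xx xxxxxxxx (bottom 10 bits)
    return [high, low]

units = utf16_code_units(0x1F600)
print([f"{u:04X}" for u in units])

# Cross-check against the built-in encoder (big-endian, no BOM).
as_bytes = b"".join(u.to_bytes(2, "big") for u in units)
print(as_bytes == "\U0001F600".encode("utf-16-be"))
```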

Unicode Storage UTF16 UTF16BE Store most significant byte first UTF16LE Store least significant byte first Use BOM to resolve

Questions?