Basics of Unicode (based upon a presentation by NRSI, SIL International)

By the Numbers How is text stored in a computer? As a sequence of numbers that represent characters
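As a small illustration of text being stored as numbers, the sketch below uses Python's built-in `ord` and `chr` to go from characters to numbers and back. The sample word is only an example.

```python
# Each character in a string corresponds to one number (its code point).
text = "Unicode"
numbers = [ord(ch) for ch in text]
print(numbers)

# Going the other way: the same numbers give back the same text.
restored = "".join(chr(n) for n in numbers)
print(restored)
```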

Encoding The Heart of Data Processing:

Unicode Universally agreed Every character has its own number 100,000+ already defined Space for 1,100,000+ Everyone is using it Windows, Mac, Linux, etc. All fonts All current applications

Unicode Character Set Unicode codespace U+0000..U+10FFFF 17 planes of 64K (0x10000) code points each Plane 0: BMP (Basic Multilingual Plane), U+0000..U+FFFF Includes Latin, Ethiopic, Arabic, Hebrew & Greek Other planes not used in Africa
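Because each plane holds 0x10000 code points, the plane number is simply the code point divided by 0x10000. A minimal sketch, where `plane_of` and `is_bmp` are illustrative helper names and not part of any library:

```python
def plane_of(ch: str) -> int:
    """Return the plane (0..16) that a single character belongs to."""
    cp = ord(ch)
    return cp >> 16  # same as cp // 0x10000


def is_bmp(ch: str) -> bool:
    """True when the character lies in plane 0, U+0000..U+FFFF."""
    return plane_of(ch) == 0


for ch in ["A", "\u1200", "\u0628", "\U0001F600", chr(0x10FFFF)]:
    print(f"U+{ord(ch):04X} plane {plane_of(ch)} BMP={is_bmp(ch)}")
```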

Unicode: BMP General Scripts Symbols CJK Misc. CJKV Ideographs Yi Hangul res’d for surrogate code values PUA Compatibility

Unicode Design Principles Encode Characters – not glyphs e.g. Arabic contextual forms: U+0628 ARABIC LETTER BEH: ﺏ ﺐ ﺒ ﺑ U+006A LATIN SMALL LETTER J: j j Encode Characters – not graphemes multi-graphs encoded as sequences: “ch” ↔ “ny” ↔

Unicode Design Principles Character Semantics General Category Letter, punctuation, number, symbol Case CAPITAL, small, Title A - U+0041 a - U+0061
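These character properties can be inspected with Python's standard `unicodedata` module. The sketch below looks up the General Category of a few sample characters (chosen here only for illustration): `Lu`, `Ll` and `Lt` are uppercase, lowercase and titlecase letters, `Nd` is a decimal digit, `Po` is punctuation and `Sm` is a math symbol.

```python
import unicodedata

samples = ["A", "a", "\u01C5", "7", ",", "+"]

# Map each sample character to its two-letter General Category.
categories = {ch: unicodedata.category(ch) for ch in samples}

for ch, cat in categories.items():
    print(f"U+{ord(ch):04X} {cat} {unicodedata.name(ch)}")

# Case is a property too: A (U+0041) and a (U+0061) map to each other.
print("A".lower() == "a", "a".upper() == "A")
```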

Unicode Design Principles Unification Unify across languages within the same script French é = U+00E9 e + high tone: é = U+00E9 Different scripts don't unify characters Latin capital b: B = U+0042 Cyrillic capital ve: В = U+0412 ● DO NOT MIX AND MATCH ACROSS SCRIPTS
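Look-alike letters from different scripts are different characters, which is why mixing them causes trouble in searching and sorting. The sketch below compares Latin B with Cyrillic ve. The helper `script_guess` is a rough assumption made for this example: it takes the first word of the character name, which works for these letters but is not a real script lookup.

```python
import unicodedata

latin_b = "\u0042"      # LATIN CAPITAL LETTER B
cyrillic_ve = "\u0412"  # CYRILLIC CAPITAL LETTER VE, looks the same in most fonts


def script_guess(ch: str) -> str:
    """Crude guess at the script: first word of the Unicode character name."""
    return unicodedata.name(ch).split()[0]


print(latin_b == cyrillic_ve)  # the two characters never compare equal
print(script_guess(latin_b), script_guess(cyrillic_ve))
```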

Unicode Design Principles Logical Order Store in reading order, not visual order Peace: P 0050 e 0065 a 0061 c 0063 e 0065 שלם: ש 05E9 ל 05DC ם 05DD
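Logical order can be checked by listing the stored code points. In the sketch below the Hebrew word is written with escapes so that the order in the source code is unambiguous: the first letter read (shin, U+05E9) is the first one stored, even though it is displayed at the right-hand end. The bidirectional class (`L` or `R`) is what tells a renderer which way to lay the characters out.

```python
import unicodedata

english = "Peace"
hebrew = "\u05E9\u05DC\u05DD"  # shin, lamed, final mem, in reading order

# Code points in storage order, as four-digit hex strings.
english_codes = [f"{ord(c):04X}" for c in english]
hebrew_codes = [f"{ord(c):04X}" for c in hebrew]
print(english_codes)
print(hebrew_codes)

# Direction is a property of the characters, not of the storage order.
print(unicodedata.bidirectional("P"), unicodedata.bidirectional(hebrew[0]))
```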

Unicode Design Principles Dynamic Composition Base characters + combining marks ⇒ “ä” ⇒ “ñ” ⇒ “c ̱ ”

Unicode Design Principles Multiple combining marks any number of combining marks ⇒ “ũ ̥̕ ” each combining mark has its own relative order ⇒ “ũ ̥̕ ” Usually stack vertically outward from base character ⇒ “ü ̥̯ ̃”
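Dynamic composition can be tried out by concatenating a base character with combining marks. The sequences below are examples chosen to resemble the ones on the slides. `marks` is an illustrative helper that lists the combining marks in a string, using the fact that `unicodedata.combining` returns a non-zero class for them.

```python
import unicodedata

a_diaeresis = "a" + "\u0308"     # a + COMBINING DIAERESIS
n_tilde = "n" + "\u0303"         # n + COMBINING TILDE
c_macron_below = "c" + "\u0331"  # c + COMBINING MACRON BELOW

# One base character carrying three marks: tilde, ring below, comma above right.
stacked = "u" + "\u0303" + "\u0325" + "\u0315"


def marks(seq: str) -> list[str]:
    """Names of the combining marks in seq, in stored order."""
    return [unicodedata.name(c) for c in seq if unicodedata.combining(c) != 0]


print(len(stacked), marks(stacked))
```

What the user sees as one letter is stored as four code points here, so string length and "number of letters" are not the same thing.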

Unicode Design Principles Unicode has a dedicated area for Private Use called PUA Unicode is Extensible But it takes forever to add anything Use what is there

Unicode Normalization Canonical Equivalence Different ways of storing the same thing: é = U+00E9 é = U+0065 U+0301 U+00E9 ≡ U+0065 U+0301 To be treated as identical in all respects Not all programs do this

Unicode Normalization Normal Forms NFC (Normal Form Composed) é = U+00E9 Sequences reduced to minimal length NFD (Normal Form Decomposed) é = U+0065 U+0301 Sequences expanded to maximal length
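Both normal forms are available through `unicodedata.normalize` in Python's standard library. The sketch below shows that the two spellings of é differ as raw strings but agree once normalized. `canonically_equal` is an illustrative helper name, not a library function.

```python
import unicodedata

composed = "\u00E9"     # é as a single code point
decomposed = "e\u0301"  # e followed by COMBINING ACUTE ACCENT

# A plain comparison sees two different strings.
print(composed == decomposed)

nfc = unicodedata.normalize("NFC", decomposed)  # composed form
nfd = unicodedata.normalize("NFD", composed)    # decomposed form
print(len(nfc), len(nfd))


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after bringing both to the same normal form."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(canonically_equal(composed, decomposed))
```

Normalizing both sides before comparing is one way for a program to treat canonically equivalent text as identical.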

Unicode Normalization Canonical Combining Order In which order should non-interacting marks be stored? ⇒ “ũ ̥ ̛ ” Important for comparison and sorting
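The ordering is driven by each mark's canonical combining class, which `unicodedata.combining` reports. A sketch with example marks: a tilde (above, class 230) and a ring below (class 220) do not interact, so normalization sorts them into one fixed order. Two marks of the same class do interact, and their order is kept as typed.

```python
import unicodedata

TILDE = "\u0303"       # above the base, combining class 230
RING_BELOW = "\u0325"  # below the base, combining class 220
DIAERESIS = "\u0308"   # above the base, combining class 230

print(unicodedata.combining(TILDE), unicodedata.combining(RING_BELOW))

# Non-interacting marks typed in either order normalize to the same sequence.
typed_one_way = "u" + TILDE + RING_BELOW
typed_other_way = "u" + RING_BELOW + TILDE
same_after_nfd = (
    unicodedata.normalize("NFD", typed_one_way)
    == unicodedata.normalize("NFD", typed_other_way)
)
print(same_after_nfd)

# Marks of the same class stack on each other, so order is significant.
differ_after_nfd = (
    unicodedata.normalize("NFD", "u" + TILDE + DIAERESIS)
    != unicodedata.normalize("NFD", "u" + DIAERESIS + TILDE)
)
print(differ_after_nfd)
```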

Unicode Storage ● Ways of storing Unicode codepoints ● UTF32 – Single 32 bit number ● UTF16 – Single 16 bit number for BMP, pairs for higher values ● UTF8 – Single byte for ASCII, sequences of 'upper ASCII' bytes (values 0x80 and above) for everything else
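To compare the three storage forms, the sketch below encodes a few sample characters and counts the bytes each form needs. The big-endian codec names are used so that Python does not prepend a byte order mark to the output.

```python
samples = {
    "LATIN A": "A",
    "E ACUTE": "\u00E9",
    "ETHIOPIC HA": "\u1200",
    "EMOJI (plane 1)": "\U0001F600",
}

# For each character: (UTF8 bytes, UTF16 bytes, UTF32 bytes).
sizes = {
    label: (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-be")),
        len(ch.encode("utf-32-be")),
    )
    for label, ch in samples.items()
}

for label, size in sizes.items():
    print(label, size)
```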

Unicode Storage BOM (Byte Order Mark) U+FEFF: Zero Width Non-Breaking Space U+FFFE: Noncharacter (what a byte-swapped BOM looks like) Identifies Encoding Scheme: UTF8: 0xEF 0xBB 0xBF ("ï»¿") UTF16BE: 0xFE 0xFF UTF16LE: 0xFF 0xFE UTF32BE: 0x00 0x00 0xFE 0xFF UTF32LE: 0xFF 0xFE 0x00 0x00
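A BOM check can be sketched with the constants in Python's `codecs` module. `sniff_bom` is a hypothetical helper written for this example. It tests the UTF32 marks before the UTF16 ones because the UTF32LE mark begins with the same two bytes as the UTF16LE mark. This is a heuristic: the bytes FF FE 00 00 could also be UTF16LE text that starts with U+0000.

```python
import codecs

# Longest marks first, so UTF32LE is not mistaken for UTF16LE.
BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(data: bytes):
    """Return the encoding scheme suggested by a leading BOM, or None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


print(sniff_bom(b"\xef\xbb\xbfhello"))
print(sniff_bom(b"\xff\xfeh\x00i\x00"))
print(sniff_bom(b"plain ascii"))
```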

Unicode Storage UTF8 Used for file storage Plays well with old 8-bit applications Bit structures 0000-007F: 0xxx xxxx 0080-07FF: 110x xxxx 10xx xxxx 0800-FFFF: 1110 xxxx 10xx xxxx 10xx xxxx 10000-10FFFF: 1111 0xxx 10xx xxxx 10xx xxxx 10xx xxxx ● BOM is redundant, but some files still get it.
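The bit structures above translate directly into shifts and masks. The function below is a teaching sketch (`utf8_encode` is an illustrative name); in real code the built-in `str.encode("utf-8")` does this job, and it is used here only to check the result.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point following the UTF8 bit patterns."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x80:            # 0xxx xxxx
        return bytes([cp])
    if cp < 0x800:           # 110x xxxx 10xx xxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:         # 1110 xxxx 10xx xxxx 10xx xxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    return bytes([           # 1111 0xxx 10xx xxxx 10xx xxxx 10xx xxxx
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


for cp in (0x41, 0xE9, 0x1200, 0x1F600):
    print(f"U+{cp:04X}", utf8_encode(cp).hex(" "))
```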

Unicode Storage UTF16 Used for pure Unicode storage Good average storage performance and speed Bit Structures 0000-FFFF: xxxxxxxx xxxxxxxx 10000-10FFFF: 110110xx xxxxxxxx 110111xx xxxxxxxx Where the x bits are USV – 0x10000 (USV = Unicode scalar value)
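The surrogate pair calculation can be sketched as follows: subtract 0x10000, then split the remaining 20 bits into two halves of 10 bits each. `utf16_units` is an illustrative name, and the built-in codec is used only to confirm the answer.

```python
def utf16_units(cp: int) -> list[int]:
    """Return the 16 bit code units that represent one code point."""
    if cp < 0x10000:
        return [cp]                  # BMP: a single unit
    v = cp - 0x10000                 # 20 bits remain
    high = 0xD800 | (v >> 10)        # 110110xx xxxxxxxx (top 10 bits)
    low = 0xDC00 | (v & 0x3FF)       # 110111xx xxxxxxxx (bottom 10 bits)
    return [high, low]


units = utf16_units(0x1F600)
print([f"{u:04X}" for u in units])

# Check against Python's own big-endian UTF16 encoder.
as_bytes = b"".join(u.to_bytes(2, "big") for u in units)
print(as_bytes == "\U0001F600".encode("utf-16-be"))
```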

Unicode Storage UTF16 UTF16BE Store most significant byte first UTF16LE Store least significant byte first Use BOM to resolve
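The two byte orders are easy to see side by side with Python's codecs. In this sketch the `utf-16-be` and `utf-16-le` codecs write no BOM, while the plain `utf-16` decoder reads a leading BOM and picks the byte order from it.

```python
big = "A".encode("utf-16-be")     # most significant byte first
little = "A".encode("utf-16-le")  # least significant byte first
print(big.hex(" "), "|", little.hex(" "))

# With a BOM in front, the generic decoder resolves the order by itself.
from_be = (b"\xfe\xff" + big).decode("utf-16")
from_le = (b"\xff\xfe" + little).decode("utf-16")
print(from_be, from_le)
```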

Questions?
