Published by Randolph Henry. Modified over 8 years ago.
1
Basics of Unicode (based upon a presentation by NRSI, SIL International)
2
By the Numbers How is text stored in a computer? As a sequence of numbers that represent characters
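A minimal sketch of the idea on this slide, using Python's built-in `ord` and `chr`. The sample string is an arbitrary choice for illustration.

```python
# Text is stored as a sequence of numbers; ord() exposes the number
# behind each character and chr() goes the other way.
text = "Peace"
numbers = [ord(ch) for ch in text]
print(numbers)  # [80, 101, 97, 99, 101]

# Rebuilding the string from the numbers gives back the same text.
rebuilt = "".join(chr(n) for n in numbers)
print(rebuilt == text)  # True
```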
3
Encoding The Heart of Data Processing:
4
Unicode Universally agreed Every character has its own number 100,000+ already defined Space for 1,100,000+ Everyone is using it Windows, Mac, Linux, etc. All fonts All current applications
5
Unicode Character Set Unicode codespace U+0000..U+10FFFF 17 planes of 64K (0x10000) characters Plane 0: BMP (Basic Multilingual Plane) U+0000..U+FFFF Includes Latin, Ethiopic, Arabic, Hebrew & Greek Other planes not used in Africa
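The plane arithmetic can be sketched as follows. The helper name `plane_of` is hypothetical and only serves the illustration; each plane holds 0x10000 code points, so the plane number is the code point shifted right by 16 bits.

```python
CODESPACE_MAX = 0x10FFFF   # last code point in the Unicode codespace
PLANE_SIZE = 0x10000       # 64K code points per plane


def plane_of(code_point: int) -> int:
    """Return the plane (0..16) that a code point belongs to."""
    if not 0 <= code_point <= CODESPACE_MAX:
        raise ValueError(f"outside the Unicode codespace: {code_point:#x}")
    return code_point >> 16


# 17 planes of 0x10000 code points cover the whole codespace.
print((CODESPACE_MAX + 1) // PLANE_SIZE)  # 17
print(plane_of(0x1208))    # 0  (Ethiopic, in the BMP)
print(plane_of(0x1F600))   # 1
```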
6
Unicode: BMP General Scripts, Symbols, CJK Misc., CJKV Ideographs, Yi, Hangul, reserved for surrogate code values, PUA, Compatibility
7
Unicode Design Principles Encode Characters – not glyphs e.g. Arabic contextual forms: U+0628 ARABIC LETTER BEH: ﺏ ﺐ ﺒ ﺑ U+006A LATIN SMALL LETTER J: j j Encode Characters – not graphemes multi-graphs encoded as sequences: “ch” ↔ “ny” ↔
8
Unicode Design Principles Character Semantics General Category Letter, punctuation, number, symbol Case CAPITAL, small, Title A - U+0041 a - U+0061
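Character properties such as General Category and case can be inspected with Python's standard `unicodedata` module. This is only a sketch; the sample characters are chosen for illustration, and U+01C5 is included as one example of a titlecase letter.

```python
import unicodedata

# Two-letter General Category codes: Lu = uppercase letter,
# Ll = lowercase letter, Lt = titlecase letter, Nd = decimal digit,
# Po = other punctuation, Sm = math symbol.
samples = ["A", "a", "\u01C5", "5", ",", "+"]
categories = {ch: unicodedata.category(ch) for ch in samples}

for ch, cat in categories.items():
    print(f"U+{ord(ch):04X} {cat} {unicodedata.name(ch)}")

# Case mappings are character properties too.
print("A".lower() == "a")  # True
```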
9
Unicode Design Principles Unification Unify across languages within the same script French é = U+00E9 e + high tone: é = U+00E9 Different scripts don't unify characters Latin capital b: B = U+0042 Cyrillic capital ve: В = U+0412 ● DO NOT MIX AND MATCH ACROSS SCRIPTS
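A short sketch of why look-alike characters from different scripts must not be mixed: they are distinct code points, so string comparison and searching treat them as different. The variable names are illustrative only.

```python
import unicodedata

latin_b = "\u0042"      # Latin capital B
cyrillic_ve = "\u0412"  # Cyrillic capital Ve, usually drawn identically

# Identical appearance, different characters.
print(latin_b == cyrillic_ve)         # False
print(unicodedata.name(latin_b))      # LATIN CAPITAL LETTER B
print(unicodedata.name(cyrillic_ve))  # CYRILLIC CAPITAL LETTER VE
```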
10
Unicode Design Principles Logical Order Store in reading order, not visual order P 0050 e 0065 a 0061 ש 05E9 ל 05DC ם 05DD c 0063 e 0065 Peace שלם
11
Unicode Design Principles Dynamic Composition Base characters + combining marks ⇒ “ä” ⇒ “ñ” ⇒ “c ̱ ”
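Dynamic composition can be sketched in Python by appending a combining mark to a base character. The marks used here (U+0308 COMBINING DIAERESIS and U+0303 COMBINING TILDE) are assumed to be the ones behind the slide's first two examples.

```python
import unicodedata

# Base character followed by a combining mark.
a_diaeresis = "a" + "\u0308"
n_tilde = "n" + "\u0303"

# Each displays as one character but is stored as two code points.
print(len(a_diaeresis), len(n_tilde))  # 2 2

# A non-zero combining class identifies a combining mark.
print(unicodedata.combining("\u0308") > 0)  # True
print(unicodedata.combining("a"))           # 0
```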
12
Unicode Design Principles Multiple combining marks any number of combining marks ⇒ “ũ ̥̕ ” each combining mark has its own relative order ⇒ “ũ ̥̕ ” Usually stack vertically outward from base character ⇒ “ü ̥̯ ̃”
13
Unicode Design Principles Unicode has a dedicated area for Private Use called PUA Unicode is Extensible But it takes forever to add anything Use what is there
14
Unicode Normalization Canonical Equivalence Different ways of storing the same thing: é = U+00E9 (precomposed) é = U+0065 U+0301 (e + combining acute) U+00E9 ≡ U+0065 U+0301 To be treated as identical in all respects Not all programs do this
15
Unicode Normalization Normal Forms NFC (Normal Form Composed) é = U+00E9 Sequences reduced to minimal length NFD (Normal Form Decomposed) é = U+0065 U+0301 Sequences expanded to maximal length
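The two normal forms can be tried out with `unicodedata.normalize` from the Python standard library. This sketch also shows that a plain string comparison does not honour canonical equivalence, so text is normally normalized to one form before comparing.

```python
import unicodedata

composed = "\u00E9"          # é as a single code point
decomposed = "\u0065\u0301"  # e followed by COMBINING ACUTE ACCENT

# Canonically equivalent, yet a raw comparison sees different sequences.
print(composed == decomposed)  # False

# NFC composes, NFD decomposes.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)
print(nfc == composed)    # True
print(nfd == decomposed)  # True
```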
16
Unicode Normalization Canonical Combining Order In which order should non-interacting marks be stored? ⇒ “ũ ̥ ̛ ” Important for comparison and sorting
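A sketch of canonical ordering, assuming the slide's example is u with a tilde (U+0303), a ring below (U+0325) and a horn (U+031B). Normalization sorts non-interacting marks by their canonical combining class, so two different typing orders end up as the same stored sequence.

```python
import unicodedata

TILDE, RING_BELOW, HORN = "\u0303", "\u0325", "\u031B"

# Canonical combining classes: lower values sort first.
classes = [unicodedata.combining(m) for m in (HORN, RING_BELOW, TILDE)]
print(classes)  # [216, 220, 230]

# The same marks typed in two different orders.
typed_one_way = "u" + TILDE + RING_BELOW + HORN
typed_another_way = "u" + RING_BELOW + HORN + TILDE
print(typed_one_way == typed_another_way)  # False

# After NFD both are stored in canonical order and compare equal.
nfd_one = unicodedata.normalize("NFD", typed_one_way)
nfd_other = unicodedata.normalize("NFD", typed_another_way)
print(nfd_one == nfd_other)  # True
```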
17
Unicode Storage ● Ways of storing Unicode codepoints ● UTF32 – Single 32 bit number ● UTF16 – Single 16 bit number for BMP, pairs for higher values ● UTF8 – Single byte for ASCII, sequences of 2 to 4 'upper ASCII' bytes (values 0x80 and above) for everything else
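The size differences between the three storage forms can be observed with Python's built-in codecs. The helper `storage_sizes` and the sample characters are illustrative assumptions, not part of the original slides.

```python
def storage_sizes(ch: str) -> tuple:
    """Return (UTF-8 bytes, UTF-16 code units, UTF-32 code units)."""
    utf8_bytes = len(ch.encode("utf-8"))
    utf16_units = len(ch.encode("utf-16-be")) // 2
    utf32_units = len(ch.encode("utf-32-be")) // 4
    return utf8_bytes, utf16_units, utf32_units


# ASCII letter, Latin-1 letter, Ethiopic syllable, non-BMP character.
for ch in ["A", "\u00E9", "\u1208", "\U0001F600"]:
    print(f"U+{ord(ch):04X}", storage_sizes(ch))
```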
18
Unicode Storage BOM (Byte Order Mark) U+FEFF: Zero Width No-Break Space U+FFFE: Noncharacter (never valid text, so a byte-swapped BOM is detectable) Identifies Encoding Scheme: UTF8: 0xEF 0xBB 0xBF ("ï»¿" when misread as Latin-1) UTF16BE: 0xFE 0xFF UTF16LE: 0xFF 0xFE UTF32BE: 0x00 0x00 0xFE 0xFF UTF32LE: 0xFF 0xFE 0x00 0x00
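BOM detection along the lines of this slide can be sketched with the BOM constants from Python's `codecs` module. The function name `sniff_bom` is hypothetical. The UTF-32 marks are checked before the UTF-16 ones because the UTF32LE mark begins with the same two bytes as the UTF16LE mark.

```python
import codecs

# Ordered so that longer marks sharing a prefix are tested first.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(data: bytes):
    """Return the encoding scheme named by a leading BOM, or None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


print(sniff_bom(b"\xef\xbb\xbfhello"))  # utf-8
print(sniff_bom(b"\xff\xfeh\x00"))      # utf-16-le
print(sniff_bom(b"hello"))              # None
```

Note that this is only a heuristic: UTF16LE data that begins with a BOM followed by U+0000 is indistinguishable from a UTF32LE BOM.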
19
Unicode Storage UTF8 Used for file storage Plays well with old 8-bit applications Bit structures 0000-007F: 0xxx xxxx 0080-07FF: 110x xxxx 10xx xxxx 0800-FFFF: 1110 xxxx 10xx xxxx 10xx xxxx 10000-10FFFF: 1111 0xxx 10xx xxxx 10xx xxxx 10xx xxxx ● BOM is redundant, but some files still get it.
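The bit structures above can be turned into a small hand-written encoder. This is a teaching sketch only (the function name `utf8_encode` is hypothetical); real code would simply call `str.encode("utf-8")`, which the sketch uses as a cross-check.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point as UTF-8 following the slide's bit layouts."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {cp:#x}")
    if cp <= 0x7F:       # 0xxx xxxx
        return bytes([cp])
    if cp <= 0x7FF:      # 110x xxxx  10xx xxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:     # 1110 xxxx  10xx xxxx  10xx xxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 1111 0xxx  10xx xxxx  10xx xxxx  10xx xxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# Cross-check against Python's built-in encoder.
for cp in (0x41, 0xE9, 0x1208, 0x1F600):
    print(f"U+{cp:04X}", utf8_encode(cp).hex(),
          utf8_encode(cp) == chr(cp).encode("utf-8"))
```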
20
Unicode Storage UTF16 Used for pure Unicode storage Good average storage performance and speed Bit Structures 0000-FFFF (except surrogates D800-DFFF): xxxxxxxx xxxxxxxx 10000-10FFFF: 110110xx xxxxxxxx 110111xx xxxxxxxx Where the 20 x bits hold USV – 0x10000 (USV = Unicode scalar value)
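The surrogate-pair layout can be worked through in a few lines. The helper `to_surrogates` is a hypothetical name for this sketch: it subtracts 0x10000, then splits the remaining 20 bits into a high half (prefixed 110110) and a low half (prefixed 110111).

```python
def to_surrogates(cp: int) -> tuple:
    """Split a code point above the BMP into a UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError(f"not a supplementary code point: {cp:#x}")
    value = cp - 0x10000              # 20-bit value
    high = 0xD800 | (value >> 10)     # 110110xx xxxxxxxx
    low = 0xDC00 | (value & 0x3FF)    # 110111xx xxxxxxxx
    return high, low


high, low = to_surrogates(0x1F600)
print(f"{high:04X} {low:04X}")  # D83D DE00

# Cross-check against Python's built-in UTF-16 encoder.
expected = "\U0001F600".encode("utf-16-be")
print(high.to_bytes(2, "big") + low.to_bytes(2, "big") == expected)  # True
```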
21
Unicode Storage UTF16 UTF16BE Store most significant byte first UTF16LE Store least significant byte first Use BOM to resolve
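Byte order and the role of the BOM can be illustrated with Python's codecs. In this sketch the `utf-16-be` and `utf-16-le` codecs fix the byte order and write no BOM, while the plain `utf-16` codec reads a leading BOM to decide how to decode.

```python
# Same character, two byte orders.
big_endian = "A".encode("utf-16-be")     # most significant byte first
little_endian = "A".encode("utf-16-le")  # least significant byte first
print(big_endian.hex(), little_endian.hex())  # 0041 4100

# With a BOM in front, the generic "utf-16" codec resolves the order itself.
from_be = (b"\xfe\xff" + big_endian).decode("utf-16")
from_le = (b"\xff\xfe" + little_endian).decode("utf-16")
print(from_be == from_le == "A")  # True
```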
22
Questions?