Published by Hailey Gorman. Modified over 11 years ago.
1
Bits of Unicode
Data structures for a large character set
Mark Davis, IBM Emerging Technologies
2
Caution
"Characters" is ambiguous; it sometimes means:
– Graphemes: x̣ (also ch, …)
– Code points: 0078 0323
– Code units: 0078 0323 (or UTF-8: 78 CC A3)
For programmers
– Unicode associates codepoints (or sequences of codepoints) with properties
– See UTR #17
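The distinction between code points and code units can be checked directly. The following is a minimal Java sketch; the class and method names are made up for illustration. It dumps the grapheme x̣ from the slide as code points and as UTF-8 bytes.

```java
import java.nio.charset.StandardCharsets;

class CodeUnitsDemo {
    // Hex dump of the code points in a string, e.g. "0078 0323".
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("%04X ", cp)));
        return sb.toString().trim();
    }

    // Hex dump of the UTF-8 bytes of a string, e.g. "78 CC A3".
    static String utf8Bytes(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }
}
```

For the string "x\u0323", `length()` reports 2 UTF-16 code units even though a reader sees one grapheme.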
3
The Problem
Programs often have to do lookups
– Look up properties by codepoint
– Map codepoints to values
– Test codepoints for inclusion in a set (e.g. value == true/false)
Easy with 256 codepoints: just use an array
4
Size Matters
Not so easy with Unicode!
Unicode 3.0
– subset (except PUA)
– up to FFFF (hex) = 65,535 (decimal)
Unicode 3.1
– full range
– up to 10FFFF (hex) = 1,114,111 (decimal)
5
Array Lookup
With ASCII
– Simple
– Fast
– Compact
  – codepoint → bit: 32 bytes
  – codepoint → short: ½ K
With Unicode
– Simple
– Fast
– Huge (esp. v3.1)
  – codepoint → bit: 136 K
  – codepoint → short: 2.2 M
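The sizes on this slide are entries times bits per entry. The small figures match a 256-entry table (256 bits = 32 bytes, 256 shorts = 512 bytes), and the large ones match the full range of 0x110000 codepoints. A small sketch of the arithmetic, with a hypothetical helper name:

```java
class TableSizes {
    // Bytes needed for a flat table with one entry per code point,
    // rounding up to a whole byte.
    static long bytesNeeded(long codePointCount, int bitsPerEntry) {
        return (codePointCount * bitsPerEntry + 7) / 8;
    }
}
```

For the full range this gives 139,264 bytes (136 K) for a bit table and 2,228,224 bytes for a table of shorts, which is the slide's "2.2 M" when M is read as a million bytes.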
6
Further complications
Mappings, tests, properties often must be for sequences of codepoints.
– Human languages don't just use single codepoints.
– "ch" in Spanish, Slovak; etc.
7
First step: Avoidance
Properties from libraries often suffice
– Test for (Character.getType(c) == Nd) instead of a long list of codepoints
  – Easier
  – Automatically updated with new versions
Data structures from libraries often suffice
– Java Hashtable
– ICU (Java or C++) CompactArray
– JavaScript properties
Consult http://www.unicode.org
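In Java, the "Nd" test on this slide is spelled with the constant `Character.DECIMAL_DIGIT_NUMBER`. A minimal sketch, with a wrapper name chosen for illustration:

```java
class DigitCheck {
    // True if the code point has General_Category Nd (decimal digit number).
    static boolean isDecimalDigit(int codePoint) {
        return Character.getType(codePoint) == Character.DECIMAL_DIGIT_NUMBER;
    }
}
```

The same call covers ASCII digits and digits in other scripts, with no hand-maintained list of codepoints.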
8
Data structures: criteria
Speed
– Read (static)
– Write (dynamic)
– Startup
Memory footprint
– RAM
– Disk
Multi-threading
9
Hashtables
Advantages
– Easy to use out-of-the-box
– Reasonably fast
– General
Disadvantages
– High overhead
– Discrete (no range lookup)
– Much slower than array lookup
10
Overhead
[Diagram: layout of a hashtable entry. Labels: hash, key, value, next; a key object holding char1, char2, …; every object carries its own overhead.]
11
Trie
Advantages
– Nearly as fast as array lookup
– Much smaller than arrays or Hashtables
– Takes advantage of repetition
Disadvantages
– Not suited for rapidly changing data
– Best for static, preformed data
12
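Trie structure
[Diagram: the codepoint is split into a high part M1 and a low part M2. M1 selects an entry in Index, which points to a block in Data; M2 selects the value within that block.]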
13
Trie code
5 Operations
– Shift, Lookup, Mask, Add, Lookup
    v = data[index[c >> S1] + (c & M2)]
[Diagram: codepoint split into M1 (high bits, obtained by shifting right by S1) and M2 (low bits)]
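A single-index trie can be built from a flat table by cutting it into fixed-size blocks and storing each distinct block once. The following Java sketch assumes 256-entry blocks (S1 = 8) and byte values; the class name and build strategy are illustrative, not ICU's CompactArray implementation. The lookup is the slide's five operations.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

class SimpleTrie {
    static final int S1 = 8;               // assumed: low 8 bits address within a block
    static final int M2 = (1 << S1) - 1;   // mask for those low bits
    final int[] index;                     // start offset in data for each block
    final byte[] data;                     // distinct blocks, concatenated

    // flat.length must be a multiple of the block size (0x110000 is).
    SimpleTrie(byte[] flat) {
        int blockSize = 1 << S1;
        index = new int[flat.length >> S1];
        Map<String, Integer> seen = new HashMap<>();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int b = 0; b < index.length; b++) {
            byte[] block = Arrays.copyOfRange(flat, b * blockSize, (b + 1) * blockSize);
            String key = Arrays.toString(block);   // simple content key for deduplication
            Integer start = seen.get(key);
            if (start == null) {                   // first time this block content appears
                start = out.size();
                out.write(block, 0, block.length);
                seen.put(key, start);
            }
            index[b] = start;
        }
        data = out.toByteArray();
    }

    // Shift, lookup, mask, add, lookup.
    byte get(int c) {
        return data[index[c >> S1] + (c & M2)];
    }
}
```

With a table that is mostly zero, almost every index entry points at the same all-zero block, which is where the compaction comes from.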
14
Trie: double indexed
Double, for more compaction:
– Slightly slower than single index
– Smaller chunks of data, so more compaction
15
Trie: double indexed
[Diagram: the codepoint is split into three parts, M1, M2 and M3, which address Index1, Index2 and Data in turn.]
16
Trie code: double indexed
    b1 = index1[c >> S1]
    b2 = index2[b1 + ((c >> S2) & M2)]
    v = data[b2 + (c & M3)]
[Diagram: codepoint split into M1, M2, M3 at shifts S1 and S2]
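The double-indexed form can be sketched by applying the same block deduplication twice: once to the data, and once to the resulting table of block offsets. The bit split (10 + 6 + 5), the class name and the `compact` helper are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TwoStageTrie {
    static final int S2 = 5, S1 = 11;             // assumed split: 10 + 6 + 5 bits
    static final int M3 = (1 << S2) - 1;          // low 5 bits
    static final int M2 = (1 << (S1 - S2)) - 1;   // middle 6 bits
    final int[] index1, index2, data;

    TwoStageTrie(int[] flat) {
        List<Integer> dataOut = new ArrayList<>();
        int[] stage2 = compact(flat, 1 << S2, dataOut);        // offsets of 32-entry data blocks
        List<Integer> index2Out = new ArrayList<>();
        index1 = compact(stage2, 1 << (S1 - S2), index2Out);   // offsets of 64-entry index2 blocks
        index2 = toArray(index2Out);
        data = toArray(dataOut);
    }

    // Cut src into blocks, append each distinct block to out once,
    // and return the start offset of every block.
    static int[] compact(int[] src, int blockSize, List<Integer> out) {
        int[] starts = new int[src.length / blockSize];
        Map<List<Integer>, Integer> seen = new HashMap<>();
        for (int b = 0; b < starts.length; b++) {
            List<Integer> block = new ArrayList<>();
            for (int i = 0; i < blockSize; i++) block.add(src[b * blockSize + i]);
            Integer start = seen.get(block);
            if (start == null) {
                start = out.size();
                out.addAll(block);
                seen.put(block, start);
            }
            starts[b] = start;
        }
        return starts;
    }

    static int[] toArray(List<Integer> list) {
        int[] a = new int[list.size()];
        for (int i = 0; i < a.length; i++) a[i] = list.get(i);
        return a;
    }

    int get(int c) {
        int b1 = index1[c >> S1];
        int b2 = index2[b1 + ((c >> S2) & M2)];
        return data[b2 + (c & M3)];
    }
}
```

Smaller data blocks mean fewer values are dragged along with each distinct value, at the cost of one more array access per lookup.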
17
Inversion List
Compaction of set of codepoints
Advantages
– Simple
– Very compact
– Faster write than trie
– Very fast boolean operations
Disadvantages
– Slower read than trie or hashtable
18
Inversion List Structure
Structure
– Index (optional)
– List of codepoints in ascending order
Example Set: [ 0020-0061, 0135, 19A3-201B ]
Index:  0     1     2     3     4     5
Data:   0020  0062  0135  0136  19A3  201C
        in    out   in    out   in    out
19
Inversion List Example
Find smallest i such that c < data[i]
– If no i, i = length
Then c ∈ List ⟺ odd(i)
Examples:
– In: 0023, 0135
– Out: 001A, 0136, A357
Index:  0     1     2     3     4     5
Data:   0020  0062  0135  0136  19A3  201C
        in    out   in    out   in    out
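The membership rule can be written as a binary search for the smallest i with c < data[i], followed by a parity test. A minimal sketch with illustrative names:

```java
class InversionList {
    // Ascending boundaries: ranges are [data[0], data[1]), [data[2], data[3]), ...
    final int[] data;

    InversionList(int... boundaries) {
        data = boundaries;
    }

    // Smallest i such that c < data[i], or data.length if there is none.
    int indexOf(int c) {
        int lo = 0, hi = data.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (c < data[mid]) hi = mid; else lo = mid + 1;
        }
        return lo;
    }

    // An odd index means c lies inside a range.
    boolean contains(int c) {
        return (indexOf(c) & 1) == 1;
    }
}
```

With the slide's set, `new InversionList(0x20, 0x62, 0x135, 0x136, 0x19A3, 0x201C)`, the codepoints 0023 and 0135 land on odd indices (in) and 001A, 0136 and A357 on even ones (out).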
20
Inversion List Operations
Fast Boolean Operations
Example: Negation
Before:
Index:  0     1     2     3     4     5
Data:   0020  0062  0135  0136  19A3  201C
After (0000 inserted at the front, so every index shifts by one):
Index:  0     1     2     3     4     5     6
Data:   0000  0020  0062  0135  0136  19A3  201C
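Negation only toggles a leading 0000 boundary, because shifting every index by one flips the parity that decides in and out. A sketch with a hypothetical helper name:

```java
import java.util.Arrays;

class InversionListNegation {
    // Complement of the set: drop a leading 0 boundary if present,
    // otherwise insert one at the front.
    static int[] negate(int[] data) {
        if (data.length > 0 && data[0] == 0) {
            return Arrays.copyOfRange(data, 1, data.length);
        }
        int[] result = new int[data.length + 1];
        result[0] = 0;
        System.arraycopy(data, 0, result, 1, data.length);
        return result;
    }
}
```

Negating twice returns the original list.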
21
Inversion List: Binary Search
from Programming Pearls
Completely unrolled, precalculated parameters
    int index = startIndex;
    if (x >= data[auxStart]) {
        index += auxStart;
    }
    switch (power) {
        case 21: if (x < data[t = index-0x10000]) index = t;
        case 20: if (x < data[t = index-0x8000]) index = t;
        …
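The idea behind the unrolled search is to make the search window a power of two, so the remaining steps are a fixed sequence of halvings. The sketch below keeps the halvings in a loop rather than a fall-through switch and derives its parameters on each call; it is a variant written for this example, not the slide's exact code, and its starting values differ from the slide's startIndex.

```java
class UnrolledSearch {
    // Returns the smallest i such that x < data[i], or data.length if none.
    // data must be sorted ascending.
    static int search(int[] data, int x) {
        int n = data.length;
        if (n == 0) return 0;
        int p = Integer.highestOneBit(n);   // largest power of two <= n (precalculated in practice)
        int auxStart = n - p;
        // First probe picks one of two overlapping windows of exactly p candidates.
        int index = (x >= data[auxStart]) ? n : p - 1;
        // Each pass halves the window; unrolling this loop gives the switch form.
        for (int step = p >> 1; step > 0; step >>= 1) {
            int t = index - step;
            if (x < data[t]) index = t;
        }
        return index;
    }
}
```

Because the step sizes depend only on the array length, they can be computed once when the list is built, which is what "precalculated parameters" refers to.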
22
Inversion Map
Inversion List plus Associated Values
– Lookup index just as in Inversion List
– Take corresponding value
Index:  0     1     2     3     4     5     6
Data:   0020  0062  0135  0136  19A3  201C
Value:  0     5     3     9     8     3     0
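An inversion map reuses the inversion-list search and returns a value instead of a parity. The sketch below reads the slide's value row as one value per gap, so there is one more value than there are boundaries; the class name is illustrative.

```java
class InversionMap {
    final int[] bounds;   // ascending boundaries
    final int[] values;   // values[i] applies to c with bounds[i-1] <= c < bounds[i]

    InversionMap(int[] bounds, int[] values) {
        if (values.length != bounds.length + 1) {
            throw new IllegalArgumentException("need one value per gap");
        }
        this.bounds = bounds;
        this.values = values;
    }

    int get(int c) {
        int lo = 0, hi = bounds.length;
        while (lo < hi) {                       // smallest i such that c < bounds[i]
            int mid = (lo + hi) >>> 1;
            if (c < bounds[mid]) hi = mid; else lo = mid + 1;
        }
        return values[lo];
    }
}
```

With the slide's data, every codepoint in 0020-0061 maps to 5 and the single codepoint 0135 maps to 9.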
23
Key → String Value
Problem
– Often almost all values are 1 codepoint
– But, must map to strings in a few cases
– Don't want overhead for strings always
Solution
– Exception values indicate extra processing
– Can use same solution for UTF-16 code units
24
Example
Get a character ch
Find its value v
If v is in [D800..E000], may be string
– check v2 = valueException[v - D800]
– if v2 not null, process it, continue
Process v
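The steps above can be sketched as a table of single code units in which values from the surrogate range act as indices into a side table of strings. Everything beyond the names `valueException` and the D800 offset is an assumption for illustration: the flat `values` array stands in for a trie, the range is treated as half-open, and surrogate input is copied through untouched to avoid colliding with the markers.

```java
class ExceptionValues {
    static final int EXC_START = 0xD800, EXC_END = 0xE000;   // marker range for exceptions
    final char[] values = new char[0x10000];                  // one value per code unit
    final String[] valueException = new String[EXC_END - EXC_START];
    int nextException = 0;

    ExceptionValues() {
        for (int c = 0; c < values.length; c++) values[c] = (char) c;   // identity by default
    }

    // key is assumed not to be a surrogate code unit.
    void put(char key, String value) {
        boolean plain = value.length() == 1
                && (value.charAt(0) < EXC_START || value.charAt(0) >= EXC_END);
        if (plain) {
            values[key] = value.charAt(0);                    // common case: no string overhead
        } else {
            valueException[nextException] = value;            // rare case: store the string aside
            values[key] = (char) (EXC_START + nextException++);
        }
    }

    String map(String input) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char ch = input.charAt(i);
            if (Character.isSurrogate(ch)) { out.append(ch); continue; }   // bypass (assumption)
            char v = values[ch];
            if (v >= EXC_START && v < EXC_END) {
                String v2 = valueException[v - EXC_START];
                if (v2 != null) { out.append(v2); continue; }
            }
            out.append(v);
        }
        return out.toString();
    }
}
```

After `put('ß', "SS")`, only ß pays for a string; every other code unit stays a single array read.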
25
String Key → Value
Problem
– Often almost all keys are 1 codepoint
– Must have string keys in a few cases
– Don't want overhead for strings always
Solution
– Exception values indicate possible follow-on codepoints
– Can use same solution for UTF-16 code units
– Use key closure!
26
Closure
If (X + Y) is a key, then X is a key
Before:
  s → x
  sh → y
  shch → z
  c → w
After:
  s → x
  sh → y
  shc → yw
  shch → z
  c → w
27
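Why Closure?
[Diagram: matching the input "shcha" one codepoint at a time. s → x, sh → y, shc → yw, shch → z; shcha is not found, so use the last value found.]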
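Closure lets a matcher extend the key one codepoint at a time and stop at the first miss. The sketch below uses plain maps instead of a trie, and the rule for valuing an added prefix (apply the original map to it) is an assumption that happens to reproduce the slide's shc → yw.

```java
import java.util.Map;
import java.util.TreeMap;

class KeyClosure {
    // Longest match that stops at the first prefix that is not a key.
    // Only correct when the key set is prefix-closed.
    static String apply(Map<String, String> map, String input) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < input.length()) {
            int len = 0;   // length of the last prefix found
            while (i + len < input.length()
                    && map.containsKey(input.substring(i, i + len + 1))) len++;
            if (len == 0) { out.append(input.charAt(i)); i++; }   // no key: copy through
            else { out.append(map.get(input.substring(i, i + len))); i += len; }
        }
        return out.toString();
    }

    // Reference matcher that tries every length, longest first; needs no closure.
    static String slowApply(Map<String, String> map, String input) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < input.length()) {
            int len = input.length() - i;
            while (len > 0 && !map.containsKey(input.substring(i, i + len))) len--;
            if (len == 0) { out.append(input.charAt(i)); i++; }
            else { out.append(map.get(input.substring(i, i + len))); i += len; }
        }
        return out.toString();
    }

    // Add every missing prefix of every key, valued by the original mapping.
    static Map<String, String> close(Map<String, String> original) {
        Map<String, String> closed = new TreeMap<>(original);
        for (String key : original.keySet()) {
            for (int n = 1; n < key.length(); n++) {
                String prefix = key.substring(0, n);
                if (!closed.containsKey(prefix)) {
                    closed.put(prefix, slowApply(original, prefix));
                }
            }
        }
        return closed;
    }
}
```

Without closure, the incremental matcher gives up on "shcha" at "shc" and never reaches "shch"; with the closed key set it walks s, sh, shc, shch, fails on shcha, and uses the last value.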
28
Bitpacking
Squeeze information into value
Example: Character Properties
– category: 5 bits
– bidi: 4 bits (+ exceptions)
– canonical category: 6 bits + expansion
    compressCanon = (bits >> SHIFT) & MASK;
    canon = expansionArray[compressCanon];
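A sketch of packing the three fields into one int. The field widths come from the slide; the bit positions, names and the sample expansion array are assumptions for illustration.

```java
class PackedProps {
    // Assumed layout: bits 0-4 category, bits 5-8 bidi, bits 9-14 compressed canonical class.
    static final int CAT_SHIFT = 0, CAT_MASK = 0x1F;
    static final int BIDI_SHIFT = 5, BIDI_MASK = 0xF;
    static final int CANON_SHIFT = 9, CANON_MASK = 0x3F;

    static int pack(int category, int bidi, int compressCanon) {
        return ((category & CAT_MASK) << CAT_SHIFT)
             | ((bidi & BIDI_MASK) << BIDI_SHIFT)
             | ((compressCanon & CANON_MASK) << CANON_SHIFT);
    }

    static int category(int bits) { return (bits >> CAT_SHIFT) & CAT_MASK; }

    static int bidi(int bits) { return (bits >> BIDI_SHIFT) & BIDI_MASK; }

    // The 6-bit field is an index; the expansion array holds the real class value.
    static int canon(int bits, int[] expansionArray) {
        int compressCanon = (bits >> CANON_SHIFT) & CANON_MASK;
        return expansionArray[compressCanon];
    }
}
```

The expansion step works because the values actually in use are few enough to be numbered in 6 bits, even though the values themselves range up to 8 bits.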
29
Statetables
Classic:
– entry = stateTable[ state, ch ];
– state = entry.state;
– doSomethingWith( entry.action );
– until (state < 0);
30
Statetables
Unicode:
– type = trie[ch];
– entry = stateTable[ state, type ];
– state = entry.state;
– doSomethingWith( entry.action );
– until (state < 0);
Also, String Key → Value
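Indexing the state table by character type instead of by character keeps the table small regardless of the size of the character set. The sketch below recognizes a letter followed by letters or digits. The types, the tables and the `typeOf` function (a stand-in for the trie lookup) are all invented for this example.

```java
class TypeStateMachine {
    static final int OTHER = 0, LETTER = 1, DIGIT = 2;   // character types (assumed)
    static final int STOP = -1;

    // NEXT[state][type]: next state, or STOP.
    static final int[][] NEXT = {
        { STOP, 1, STOP },   // state 0: start, needs a letter
        { STOP, 1, 1 },      // state 1: inside an identifier
    };
    // ACCEPT[state][type]: the action, here "extend the match".
    static final boolean[][] ACCEPT = {
        { false, true, false },
        { false, true, true },
    };

    // Stand-in for type = trie[ch].
    static int typeOf(int ch) {
        if (Character.isLetter(ch)) return LETTER;
        if (Character.getType(ch) == Character.DECIMAL_DIGIT_NUMBER) return DIGIT;
        return OTHER;
    }

    // Length in code units of the identifier at the start of s, or 0 if none.
    static int identifierLength(String s) {
        int state = 0, i = 0, length = 0;
        while (state >= 0 && i < s.length()) {
            int ch = s.codePointAt(i);
            int type = typeOf(ch);
            i += Character.charCount(ch);
            if (ACCEPT[state][type]) length = i;   // perform the action for this entry
            state = NEXT[state][type];             // until (state < 0)
        }
        return length;
    }
}
```

The tables here have 2 × 3 entries; indexing by raw codepoint would need 2 × 0x110000.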
31
Sample Data Structures: ICU
Trie: CompactArray
– Customized for each datatype
– Automatic expansion
– Compact after setting
Character Properties
– use CompactArray, Bitpacking
Inversion List: UnicodeSet
– Boolean Operations
32
Sample Usage #1: ICU
Collation
– Trie lookup
– Expanding character: Key → String Value
– Contracting character: String Key → Value
Break Iterators
– For grapheme, word, line, sentence break
– Statetable
33
Sample Usage #2: ICU
Transliteration
– Requires
  – Mapping codepoints in context to others
  – Rearranging codepoints
  – Controlling the choice of mapping
– Character Properties
– Inversion List
– Exception values
34
Sample Usage #3: ICU
Character Conversion
– From Unicode to bytes
  – Trie
– From bytes to Unicode
  – Arrays for simple maps
  – Statetables for complex maps
    – recognizes valid / invalid mappings
    – provides compaction
Complications
– Invalid vs. Valid mapped vs. Valid unmapped
– Fallbacks
35
References
Unicode Open Source: ICU
– http://oss.software.ibm.com/icu
– ICU4J: Java API
– ICU4C: C and C++ APIs
Other references: see Mark's website:
– http://www.macchiato.com
36
Q & A