Folded Trie: Efficient Data Structure for All of Unicode
Vladimir Weinstein, vweinste@us.ibm.com
Globalization Center of Competency, San Jose, CA
21st International Unicode Conference, Dublin, Ireland, May 2002

Introduction
- A lot of data for each code point
- Need appropriate data structures
- Unicode version 3.1 introduced code points into supplementary space; the addressable range grew to more than a million
- Repetitive data
- Sparsely populated range, especially the supplementary space

Data Structures
- Arrays
  - Advantages: very fast access time, fast write time
  - Disadvantage: unacceptable memory consumption
- Hash tables
  - Advantages: easy to use, reasonably fast, general
  - Disadvantages: high overhead, complicated sequential access, slower than array lookup, data within ranges is not shared

Data Structures (continued)
- Inversion maps
  - Advantages: simple, very compact, fast boolean operations
  - Disadvantages: worse access time than arrays and possibly hash tables
- For more details see "Bits of Unicode" at http://www.macchiato.com/slides/Bits_of_Unicode.ppt
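The access-time trade-off of an inversion map can be seen in a minimal sketch. The `starts` and `values` tables and the function name below are hypothetical and only illustrate the idea: each lookup is a binary search over range boundaries rather than a direct array access.

```python
from bisect import bisect_right

# Hypothetical inversion map: starts[i] is the first code point of range i,
# values[i] is the value shared by every code point in that range.
starts = [0x0000, 0x0041, 0x005B, 0x0061, 0x007B]
values = ["other", "upper", "other", "lower", "other"]

def inversion_map_lookup(cp: int) -> str:
    """Binary-search for the range containing cp: O(log n) per lookup."""
    return values[bisect_right(starts, cp) - 1]
```

The table stays very small because only the boundaries are stored, but every lookup pays for the search.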

Tries
- A trie is a structure with one or more indexes and one data storage
- Name comes from "information retrieval" (reTRIEval)
- Shares repetitive data
- Good compaction
- Not appropriate for frequently changing data

Single-Index Trie
- A trie structure with an index array and a data array
- Advantages
  - Excellent size
  - Very good access performance (two array accesses, shift, mask and addition)
- Disadvantages
  - Not appropriate for frequently changing data
  - Index array gets too big when dealing with supplementary code points

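Single-Index Trie Diagram
[Diagram: a 16-bit BMP code point (bits 15..0) is split into an upper part of UPPER_WIDTH bits and a lower part of LOWER_WIDTH bits. The upper part selects an entry in the index array, which points to a block in the data array; the lower part, extracted with LOWER_MASK, selects the data element within that block.]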
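A single-index trie for the BMP might be sketched as follows. The block size (`LOWER_WIDTH = 5`) and the function names are assumptions made for illustration, not values taken from the presentation; the builder only shares identical blocks and does none of the further compaction a production implementation could apply.

```python
LOWER_WIDTH = 5                        # assumed: 32 code points per data block
LOWER_MASK = (1 << LOWER_WIDTH) - 1
BLOCK_SIZE = 1 << LOWER_WIDTH

def build_single_index_trie(values):
    """Build (index, data) from a flat list holding one value per code point.
    Identical blocks are stored once in data and shared through index."""
    index, data, seen = [], [], {}
    for start in range(0, len(values), BLOCK_SIZE):
        block = tuple(values[start:start + BLOCK_SIZE])
        if block not in seen:
            seen[block] = len(data)    # offset of this block in data
            data.extend(block)
        index.append(seen[block])
    return index, data

def lookup(index, data, cp):
    # Two array accesses, a shift, a mask and an addition.
    return data[index[cp >> LOWER_WIDTH] + (cp & LOWER_MASK)]

# Example data: value 1 for a..z, 0 everywhere else in the BMP.
flat = [0] * 0x10000
for cp in range(ord("a"), ord("z") + 1):
    flat[cp] = 1
index, data = build_single_index_trie(flat)
```

With this example data the 65,536-entry flat array shrinks to a 2,048-entry index plus two shared data blocks.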

Double-Index Trie
- Two index arrays and a data block
- Compared to single-index trie:
  1. Provides better compression of the index array
  2. Worse performance, but still very fast
  3. Feasible for supplementary code points

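Double-Index Trie Diagram
[Diagram: a code point (bits 20..0) is split into upper, middle and lower parts of UPPER_WIDTH, MIDDLE_WIDTH and LOWER_WIDTH bits. The upper part selects an entry in Index 1, which points to a block in Index 2; the middle part, extracted with MIDDLE_MASK, selects an entry in that block, which points to a data block; the lower part, extracted with LOWER_MASK, selects the data element.]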
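The two-level lookup can be sketched like this. The widths (5 lower bits, 6 middle bits) and all function names are assumptions for illustration; the same block-sharing step is simply applied twice, first to the data and then to the array of block offsets.

```python
LOWER_WIDTH, MIDDLE_WIDTH = 5, 6       # assumed widths, not from the slides
LOWER_MASK = (1 << LOWER_WIDTH) - 1
MIDDLE_MASK = (1 << MIDDLE_WIDTH) - 1

def dedupe_blocks(values, size):
    """Split values into blocks of `size` and store each distinct block once.
    Returns (offsets, storage); offsets[i] is where block i starts in storage."""
    offsets, storage, seen = [], [], {}
    for start in range(0, len(values), size):
        block = tuple(values[start:start + size])
        if block not in seen:
            seen[block] = len(storage)
            storage.extend(block)
        offsets.append(seen[block])
    return offsets, storage

def build_double_index_trie(values):
    block_offsets, data = dedupe_blocks(values, 1 << LOWER_WIDTH)
    # Compress the (large) array of block offsets the same way.
    index1, index2 = dedupe_blocks(block_offsets, 1 << MIDDLE_WIDTH)
    return index1, index2, data

def lookup(index1, index2, data, cp):
    upper = cp >> (LOWER_WIDTH + MIDDLE_WIDTH)
    middle = (cp >> LOWER_WIDTH) & MIDDLE_MASK
    return data[index2[index1[upper] + middle] + (cp & LOWER_MASK)]

# Example data covering all of Unicode (0..0x10FFFF).
flat = [0] * 0x110000
flat[ord("a")] = 1
flat[0x1D11E] = 7                      # MUSICAL SYMBOL G CLEF
index1, index2, data = build_double_index_trie(flat)
```

For this example the three arrays hold 544, 192 and 96 entries, whereas a single index over the same range would need 34,816 entries. The price is one extra array access per lookup.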

Folded Trie
- Fast access for BMP code points
- Slower access for supplementary code points, but these are far less frequent
- Compacts the supplementary index
- Needs additional build-time processing
- Fast addressing with UTF-16 code units
  - No need to construct the code point

Folded Trie: Supplementary Access Diagram
[Diagram, six numbered steps: the lead surrogate (110110.., bits 15..0) is looked up in the folded trie's index and data to get the lead surrogate data. If there is no data for the surrogate block, the data is the same for the whole surrogate block. Otherwise, the lead surrogate data is combined with the low bits (9..0) of the trail surrogate (110111..) into a pseudo code point, which is looked up to produce the final data.]
- BMP code point access is the same as with the single-index trie
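The supplementary access path might look roughly like the sketch below. It is a simplified illustration under stated assumptions, not the UTrie layout: the block size, the function names, and the choice to store a folding offset directly as the lead surrogate's data (with 0 meaning "no data for this surrogate block") are all hypothetical.

```python
SHIFT = 5                              # assumed: 32 entries per data block
MASK = (1 << SHIFT) - 1
BLOCK = 1 << SHIFT
BMP_INDEX_LEN = 0x10000 >> SHIFT       # 2048 index entries cover the BMP

def build_folded_trie(values, default=0):
    """values: {code point: value}. Returns (index, data)."""
    data, seen = [], {}

    def add_block(block):
        block = tuple(block)
        if block not in seen:          # share identical blocks
            seen[block] = len(data)
            data.extend(block)
        return seen[block]

    bmp = [values.get(cp, default) for cp in range(0x10000)]
    for lead in range(0xD800, 0xDC00):
        bmp[lead] = 0                  # 0 = no data for this surrogate block
    leads = {0xD800 + ((cp - 0x10000) >> 10) for cp in values if cp >= 0x10000}
    fold_index = []
    for lead in sorted(leads):
        # Lead surrogate data: where this block's folded index entries start.
        bmp[lead] = BMP_INDEX_LEN + len(fold_index)
        first = 0x10000 + ((lead - 0xD800) << 10)
        for start in range(first, first + 0x400, BLOCK):
            fold_index.append(add_block(
                values.get(cp, default) for cp in range(start, start + BLOCK)))
    bmp_index = [add_block(bmp[s:s + BLOCK]) for s in range(0, 0x10000, BLOCK)]
    return bmp_index + fold_index, data    # folded index sits right above the BMP index

def get_bmp(index, data, unit):
    return data[index[unit >> SHIFT] + (unit & MASK)]

def get_from_pair(index, data, lead, trail, default=0):
    offset = get_bmp(index, data, lead)    # lead surrogate data
    if offset == 0:
        return default                     # same data for the whole surrogate block
    low10 = trail & 0x3FF                  # low 10 bits of the trail surrogate
    return data[index[offset + (low10 >> SHIFT)] + (low10 & MASK)]

index, data = build_folded_trie({ord("a"): 1, 0x1D11E: 7})
```

Here U+1D11E is the surrogate pair D834 DD1E, so `get_from_pair(index, data, 0xD834, 0xDD1E)` reaches its value without ever assembling the code point. Only lead surrogates that actually have data contribute index entries, which is what keeps the supplementary index compact. One simplification to keep in mind: in this sketch, looking up an unpaired lead surrogate as a BMP code unit returns the folding offset rather than a real data value.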

ICU Implementation: UTrie
- ICU implementation is called UTrie
- Stores either 16-bit or 32-bit wide data (extensible in the future)
- Up to 256K different data elements
- Can be frozen and reused as a memory-mapped image for fast startup
- Using UTrie requires custom code
- More about ICU at the end of the presentation

Range Enumeration
- Allows enumerating over a set of contiguous maximal ranges of same data elements
- Elements can be preprocessed by an additional callback
- Saves time when processing the whole Unicode range by efficiently walking the trie structure
[Diagram: consecutive ranges labeled Element 1, Element 2 and Element 3; each range is bounded by a start and a limit, with the positions start-1 and limit-1 also marked.]
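One way to walk a trie range by range is sketched below for a single-index trie. The names `enum_ranges` and `map_value` are hypothetical and this is not the ICU enumeration API; `map_value` plays the role of the preprocessing callback, and a block that repeats a uniform predecessor is skipped without touching its data.

```python
SHIFT = 5                              # assumed: 32 entries per data block
BLOCK = 1 << SHIFT

def build_trie(values):
    """Minimal single-index builder that shares identical blocks."""
    index, data, seen = [], [], {}
    for start in range(0, len(values), BLOCK):
        block = tuple(values[start:start + BLOCK])
        if block not in seen:
            seen[block] = len(data)
            data.extend(block)
        index.append(seen[block])
    return index, data

def enum_ranges(index, data, map_value=lambda v: v):
    """Yield (start, limit, value) for maximal runs of equal mapped values.
    limit is exclusive."""
    start, prev_value, prev_block = 0, None, None
    for i, block in enumerate(index):
        if block == prev_block:
            continue                   # same uniform block again: the run just grows
        prev_block = block
        cp = i << SHIFT
        for j in range(BLOCK):
            value = map_value(data[block + j])
            if value != prev_value:
                if cp + j > start:
                    yield (start, cp + j, prev_value)
                if j > 0:
                    prev_block = None  # block is not uniform, never skip it
                start, prev_value = cp + j, value
    yield (start, len(index) << SHIFT, prev_value)

flat = [0] * 0x10000
for c in range(ord("a"), ord("z") + 1):
    flat[c] = 1
index, data = build_trie(flat)
ranges = list(enum_ranges(index, data))
```

For the example data, `ranges` holds three entries: everything before `a`, the run `a` to `z`, and everything after it. Only a handful of the 2,048 blocks are actually scanned.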

Latin-1 Fast Path
- Build-time option
- Allows direct array access for the Latin-1 range (0x00-0xFF)
- Latin-1 range is not compressed if this option is used
- Appropriate when access for the Latin-1 range is critical, for example in collation

Example: Normalization Data
- Normalization data is stored using UTries
- For example, main data has the following format:

  Bits    Field             Meaning
  31..16  Extra data index  Can be either: index to variable-length data; first part of supplementary lookup value; special handling indicator (Hangul, Jamo)
  15..8   Combining class
  7       BCK               Combines back
  6       FWD               Combines forward
  5..4    QC_MAYBE          Values for normalization quick check
  3..0    QC_NO             Values for normalization quick check

- Variable-length data contains composition and decomposition info
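Unpacking such a 32-bit word could look like the sketch below. The field boundaries are an assumption read off the bit labels on the slide (31, 15, 7, 6, 5, 3, 0), and the `Norm32` type and `unpack_norm32` function are hypothetical names, not ICU identifiers.

```python
from typing import NamedTuple

class Norm32(NamedTuple):
    extra_index: int        # bits 31..16 (assumed boundary)
    combining_class: int    # bits 15..8
    combines_back: bool     # bit 7 (BCK)
    combines_forward: bool  # bit 6 (FWD)
    qc_maybe: int           # bits 5..4
    qc_no: int              # bits 3..0

def unpack_norm32(word: int) -> Norm32:
    """Split one 32-bit trie value into its fields with shifts and masks."""
    return Norm32(
        extra_index=(word >> 16) & 0xFFFF,
        combining_class=(word >> 8) & 0xFF,
        combines_back=bool(word & 0x80),
        combines_forward=bool(word & 0x40),
        qc_maybe=(word >> 4) & 0x3,
        qc_no=word & 0xF,
    )

# Made-up example word: extra index 0x1234, combining class 230, BCK set.
example = unpack_norm32((0x1234 << 16) | (230 << 8) | 0x80 | (1 << 4) | 0x4)
```

Packing everything into one word means a single trie lookup answers the common questions (combining class, quick check) and the variable-length data is consulted only when needed.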

Example: Character Properties Data
- The result of UTrie lookup is an index
- Double indexing allows for even better compression, since many code points have the same property value
- UTrie data width is 16 bits (thousands of data entries), while the property data width is 32 bits (a few hundred unique data words)
[Diagram: the folded trie's index leads to 16-bit trie data, which in turn indexes the 32-bit property data.]
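The extra level of indirection can be illustrated with a small sketch. A plain dictionary stands in for the trie here, and the property words and function names are made up for the example; the point is only that the per-code-point value is a small index into a table of unique 32-bit words.

```python
def intern_properties(props_by_cp):
    """Replace each code point's 32-bit property word with a small index
    into a table of unique words."""
    table, slot, index_by_cp = [], {}, {}
    for cp, word in props_by_cp.items():
        if word not in slot:
            slot[word] = len(table)    # first time this word is seen
            table.append(word)
        index_by_cp[cp] = slot[word]   # this small index is what the trie would store
    return table, index_by_cp

def get_properties(table, index_by_cp, cp):
    return table[index_by_cp[cp]]      # trie result is an index, not the data

# Hypothetical property words: all lowercase letters share one word.
props = {cp: 0x00010002 for cp in range(ord("a"), ord("z") + 1)}
props[ord("0")] = 0x00000009
table, index_by_cp = intern_properties(props)
```

Twenty-seven code points end up sharing two 32-bit words, and each stored index fits comfortably in 16 bits.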

International Components for Unicode
- International Components for Unicode (ICU) is a library that provides robust and full-featured Unicode support
- Several library services use the common UTrie implementation
- Wide variety of supported platforms
- Open source (X license, non-viral)
- C/C++ and Java versions
- http://oss.software.ibm.com/icu/

Conclusion
- UTrie data structure provides good compression with fast access
- The main constraint for usage is the nature of the data that needs to be stored
- Designed for repetitive and sparse data

Q & A

Folding and Surrogate Access
- Folding process compacts the index for supplementaries and moves it right above the BMP index
- Access in ICU4C:
  - Define a C callback, invoked when a special lead surrogate is detected
  - Manually detect special lead surrogates
- In ICU4J, provide a subclass with a method that detects special lead surrogates

Summary
- Introduction: storing Unicode data
- Types of data structures
- Tries
- Single-index trie
- Double-index trie
- Folded trie
- Usage of folded trie in normalization
- Usage of folded trie for character properties
