The Wavelet Trie: Maintaining an Indexed Sequence of Strings in Compressed Space
Roberto Grossi, Giuseppe Ottaviano*
Università di Pisa
* Part of the work done while at Microsoft Research Cambridge
Disclaimer THIS HAS NOTHING TO DO WITH WAVELETS!
Indexed String Sequences
(foo, bar, foobar, foo, bar, bar, foo)
Queries
  – Access(i): access the i-th element
      Access(2) = foobar
  – Rank(s, pos): count occurrences of s before pos
      Rank(bar, 5) = 2
  – Select(s, i): find the i-th occurrence of s
      Select(foo, 2) =
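
As a reference point for the three queries, here is a minimal linear-scan sketch in Python. The function names are illustrative, and the 0-based convention for both positions and occurrence indices is an assumption inferred from Access(2) = foobar. Under that assumption it reproduces Access(2) = foobar and Rank(bar, 5) = 2.

```python
S = ["foo", "bar", "foobar", "foo", "bar", "bar", "foo"]

def access(seq, i):
    """Return the i-th element (0-based, assumed convention)."""
    return seq[i]

def rank(seq, s, pos):
    """Count occurrences of s strictly before position pos."""
    return sum(1 for x in seq[:pos] if x == s)

def select(seq, s, i):
    """Position of the i-th occurrence of s (0-based, assumed), or None."""
    seen = 0
    for p, x in enumerate(seq):
        if x == s:
            if seen == i:
                return p
            seen += 1
    return None

print(access(S, 2), rank(S, "bar", 5), select(S, "foo", 2))
```

Every query here costs time linear in the sequence length; the rest of the talk is about answering the same queries in compressed space without scanning.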
Prefix operations
(foo, bar, foobar, foo, bar, bar, foo)
Queries
  – RankPrefix(p, pos): count strings prefixed by p before pos
      RankPrefix(foo, 5) = 3
  – SelectPrefix(p, i): find the i-th string prefixed by p
      SelectPrefix(foo, 2) =
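
The prefix variants can be sketched the same way, replacing equality with a prefix test. This is again a naive scan with illustrative names and an assumed 0-based convention, meant only to pin down the semantics; it reproduces RankPrefix(foo, 5) = 3.

```python
S = ["foo", "bar", "foobar", "foo", "bar", "bar", "foo"]

def rank_prefix(seq, p, pos):
    """Count strings having prefix p strictly before position pos."""
    return sum(1 for x in seq[:pos] if x.startswith(p))

def select_prefix(seq, p, i):
    """Position of the i-th string having prefix p (0-based, assumed), or None."""
    hits = [pos for pos, x in enumerate(seq) if x.startswith(p)]
    return hits[i] if i < len(hits) else None

print(rank_prefix(S, "foo", 5), select_prefix(S, "foo", 2))
```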
Example: storing relations
Write the columns as string sequences
  – Store them separately
  – Reduce relational operations to sequence queries
User: Leonard, Penny, Sheldon, Penny, Leonard, Sheldon
Likes URL: battle.net/wow/, tmz.com, battle.net/wow/, thecheesecakefactory.com, wikipedia.org/Star_Trek, wikipedia.org/String_theory, marvel.com
What does Sheldon like?
Who likes pages from domain wikipedia.org?
Other operations: range counting, …
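
A rough sketch of how the two questions reduce to sequence queries. The rows below are hypothetical (made-up users and URLs, not the table on the slide), and the helper names are illustrative. Select on one column yields row positions, and Access on the other column reads the matching value.

```python
# Hypothetical relation, stored column by column (not the slide's data).
users = ["alice", "bob", "carol", "bob", "carol"]
urls = ["example.org/a", "example.com/x", "example.org/b",
        "example.net/", "example.com/y"]

def rank_by(seq, match, pos):
    """Count elements satisfying match strictly before pos."""
    return sum(1 for x in seq[:pos] if match(x))

def select_by(seq, match, i):
    """Position of the i-th element satisfying match (0-based, assumed)."""
    return [p for p, x in enumerate(seq) if match(x)][i]

def likes_of(user):
    """'What does <user> like?': Select on users, then Access on urls."""
    same = lambda x: x == user
    count = rank_by(users, same, len(users))
    return [urls[select_by(users, same, i)] for i in range(count)]

def fans_of_domain(domain):
    """'Who likes pages from <domain>?': SelectPrefix on urls, Access on users."""
    under = lambda x: x.startswith(domain)
    count = rank_by(urls, under, len(urls))
    return [users[select_by(urls, under, i)] for i in range(count)]

print(likes_of("carol"))
print(fans_of_domain("example.org"))
```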
Dynamic sequences
We want to support the following operations:
  – Insert(s, pos): insert the string s immediately before position pos
  – Append(s): append the string s at the end of the sequence (special case of Insert)
  – Delete(pos): delete the string at position pos
If the data structure supports only Append, we call it append-only; otherwise dynamic (or fully dynamic)
Requirements
Store the sequence in as little space as possible
  – Close to the information-theoretic lower bound
But still be able to support all the described operations (query and update) efficiently
  – Aim for worst-case polylogarithmic time per operation
Some notation
(foo, bar, foobar, foo, bar, bar, foo)
Sequence S, |S| = n
  – In the example, n = 7
String set S_set: the unordered set of distinct strings appearing in S
  – In the example, {foo, bar, foobar}, |S_set| = 3
  – Also called the alphabet
Sequence symbols can also be integers, characters, …
  – As long as they are binarized to strings
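
One possible binarization, sketched below as an assumption rather than the authors' scheme: write each UTF-8 byte as 8 bits and close the string with a NUL byte. Since NUL occurs only at the end, no encoded string is a prefix of another, even when one original string is a prefix of another (foo and foobar). The names binarize and debinarize are illustrative.

```python
def binarize(s):
    """Encode s as a binary string: 8 bits per UTF-8 byte plus a NUL terminator."""
    data = s.encode("utf-8")
    if b"\x00" in data:
        raise ValueError("NUL bytes would break prefix-freeness")
    return "".join(format(byte, "08b") for byte in data + b"\x00")

def debinarize(bits):
    """Invert binarize: regroup bits into bytes and drop the terminator."""
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return data[:-1].decode("utf-8")

print(binarize("a"))
print(debinarize(binarize("foobar")))
```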
Wavelet Trees
Introduced in 2003 to represent Compressed Suffix Arrays
Support Access/Rank/Select on sequences over a finite alphabet (of integers)
  – Reduces to operations on bitvectors by recursively partitioning the alphabet
String sequences can be reduced to integer sequences
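Wavelet Trees
S = (a, b, r, a, c, a, d, a, b, r, a), S_set = {a, b, c, d, r}
[Figure: wavelet tree of abracadabra. The root splits the alphabet into {a, b} and {c, d, r}. The {a, b} child holds the subsequence abaaaba and has leaves a and b. The {c, d, r} child holds rcdr with bitvector 1011 and splits into leaf c and node {d, r}. The {d, r} node holds rdr with bitvector 101 and has leaves d and r.]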
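
A minimal wavelet tree sketch for the abracadabra example. It assumes the sorted alphabet is split in the middle at every node, which happens to give the same partition as the figure ({a, b} / {c, d, r}, then c / {d, r}). Bitvectors are plain Python lists with linear-time rank, so this shows only the structure, not the succinct bitvectors a real implementation would use.

```python
class WaveletTree:
    def __init__(self, seq, alphabet=None):
        self.alphabet = sorted(set(seq)) if alphabet is None else alphabet
        self.bits = None                      # None marks a leaf
        if len(self.alphabet) <= 1:
            return
        mid = len(self.alphabet) // 2         # assumed splitting rule
        self.left_syms = set(self.alphabet[:mid])
        # 0 = symbol goes to the left child, 1 = to the right child
        self.bits = [0 if c in self.left_syms else 1 for c in seq]
        self.left = WaveletTree([c for c in seq if c in self.left_syms],
                                self.alphabet[:mid])
        self.right = WaveletTree([c for c in seq if c not in self.left_syms],
                                 self.alphabet[mid:])

    def access(self, i):
        """Symbol at position i: follow the bits down to a leaf."""
        if self.bits is None:
            return self.alphabet[0]
        b = self.bits[i]
        child = self.right if b else self.left
        return child.access(self.bits[:i].count(b))   # naive bitvector rank

    def rank(self, c, pos):
        """Occurrences of symbol c strictly before position pos."""
        if self.bits is None:
            return pos if self.alphabet and self.alphabet[0] == c else 0
        b = 0 if c in self.left_syms else 1
        child = self.right if b else self.left
        return child.rank(c, self.bits[:pos].count(b))

wt = WaveletTree("abracadabra")
print(wt.right.bits, wt.right.right.bits)     # bitvectors of rcdr and rdr
print(wt.access(4), wt.rank("a", 11))
```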
Wavelet Trees
Space equal to the entropy of the sequence
  – Plus negligible terms
Supports Access/Rank/Select in O(log |S_set|)
Later extended to support Insert/Delete…
  – … but the tree structure is fixed a priori
  – The string set S_set cannot be changed!
  – Unrealistic restriction in many database applications
The Wavelet Trie
The Wavelet Trie is a Wavelet Tree on sequences of binary strings (S_set ⊂ {0, 1}*)
Supports Access/Rank(Prefix)/Select(Prefix)
Fully dynamic…
  – … or append-only (with better bounds)
The string set need not be known in advance
Wavelet Trie: Construction
[Figure: a sequence of binary strings and the root node built from it, with α: 010]
Common prefix: α
Branching bit: β
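Wavelet Trie: Construction
[Figure: the complete Wavelet Trie of the example sequence. The root has α: 010; each internal node carries a common prefix α and a bitvector β (surviving values: 101, 110); each leaf carries only an α (ε, 01 or 10).]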
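
The recursive construction can be sketched as follows. At each node, α is the longest common prefix of the strings, β collects the bit following α in each string in sequence order, and the remainders go to the 0-child or the 1-child. The input sequence is an illustrative prefix-free example, not the one in the figure, and the tuple representation is an assumption made for brevity.

```python
def common_prefix(strings):
    """Longest common prefix; comparing min and max is enough."""
    lo, hi = min(strings), max(strings)
    k = 0
    while k < len(lo) and k < len(hi) and lo[k] == hi[k]:
        k += 1
    return lo[:k]

def build(strings):
    """Return (alpha, beta, left, right); a leaf has beta = left = right = None.
    Assumes a non-empty sequence over a prefix-free set of binary strings."""
    alpha = common_prefix(strings)
    k = len(alpha)
    if all(len(s) == k for s in strings):     # all strings equal alpha: leaf
        return (alpha, None, None, None)
    beta = "".join(s[k] for s in strings)     # branching bits, in sequence order
    left = build([s[k + 1:] for s in strings if s[k] == "0"])
    right = build([s[k + 1:] for s in strings if s[k] == "1"])
    return (alpha, beta, left, right)

def labels(node):
    """Preorder list of (alpha, beta) labels, for inspection."""
    alpha, beta, left, right = node
    if beta is None:
        return [(alpha, None)]
    return [(alpha, beta)] + labels(left) + labels(right)

S = ["0001", "0011", "0100", "00100", "0100", "00100", "0100"]
for alpha, beta in labels(build(S)):
    print(repr(alpha), beta)
```

An empty α (printed as '') corresponds to ε in the figures.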
Wavelet Trie: Access
[Figure: the Wavelet Trie of the example sequence and the nodes visited by Access(5), starting at the root (α: 010)]
Access(5) =
Rank is similar
Wavelet Trie: Select
[Figure: the Wavelet Trie of the example sequence and the nodes visited by the Select query]
Select( , 1) = 4
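
A static sketch of the queries on a Wavelet Trie. It uses an illustrative prefix-free sequence rather than the one in the figures, 0-based indices (an assumption), and Python lists standing in for the bitvectors. Access and Rank walk top-down, mapping the position through a bitvector rank at each node. Select recurses to the leaf first and maps the position back up with a bitvector select.

```python
class WaveletTrie:
    def __init__(self, strings):
        lo, hi = min(strings), max(strings)
        k = 0
        while k < len(lo) and k < len(hi) and lo[k] == hi[k]:
            k += 1
        self.alpha, self.n = lo[:k], len(strings)
        self.beta = None                          # None marks a leaf
        if any(len(s) > k for s in strings):      # prefix-free input assumed
            self.beta = [int(s[k]) for s in strings]
            self.child = [WaveletTrie([s[k + 1:] for s in strings if s[k] == b])
                          for b in "01"]

    def access(self, i):
        if self.beta is None:
            return self.alpha
        b = self.beta[i]
        return self.alpha + str(b) + self.child[b].access(self.beta[:i].count(b))

    def rank(self, s, pos):
        k = len(self.alpha)
        if self.beta is None:
            return pos if s == self.alpha else 0
        if not s.startswith(self.alpha) or len(s) == k:
            return 0
        b = int(s[k])
        return self.child[b].rank(s[k + 1:], self.beta[:pos].count(b))

    def rank_prefix(self, p, pos):
        if self.alpha.startswith(p):              # p used up: whole subtree matches
            return pos
        if self.beta is None or not p.startswith(self.alpha):
            return 0
        k = len(self.alpha)
        b = int(p[k])
        return self.child[b].rank_prefix(p[k + 1:], self.beta[:pos].count(b))

    def select(self, s, i):
        k = len(self.alpha)
        if self.beta is None:
            return i if s == self.alpha and i < self.n else None
        if not s.startswith(self.alpha) or len(s) == k:
            return None
        b = int(s[k])
        j = self.child[b].select(s[k + 1:], i)    # position inside the child
        if j is None:
            return None
        # bitvector select: position of the j-th occurrence of b in beta
        return [p for p, x in enumerate(self.beta) if x == b][j]

S = ["0001", "0011", "0100", "00100", "0100", "00100", "0100"]
wt = WaveletTrie(S)
print(wt.access(3), wt.rank("0100", 5), wt.select("00100", 1))
```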
Wavelet Trie: Append
[Figure: the Wavelet Trie of the example sequence updated by an Append; the nodes added by the operation are labeled α: ε, α: 0 and α: ε β: 11 0]
Insert/Delete are similar
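
A sketch of Append under stated assumptions: Python lists stand in for the dynamic bitvectors, the input set stays prefix-free, and there is no rollback on invalid input. The Node class and function names are illustrative. When the new string leaves a node's α early, the node is split. The new parent's β must be initialized to a run of n equal bits (one per string already below the node) before the new bit is appended, which is where a bitvector created as 0^n or 1^n is needed.

```python
class Node:
    def __init__(self, alpha, n):
        self.alpha, self.n = alpha, n        # n = number of strings below this node
        self.beta, self.child = None, None   # a leaf until it is split

def append(root, s):
    """Append binary string s to the sequence; return the (possibly new) root."""
    if root is None:
        return Node(s, 1)
    node, parent, side = root, None, None
    while True:
        a, k = node.alpha, 0
        while k < len(a) and k < len(s) and a[k] == s[k]:
            k += 1
        if k < len(a):                       # s diverges inside alpha: split the node
            if k == len(s):
                raise ValueError("string set would not be prefix-free")
            old_bit, new_bit = int(a[k]), int(s[k])
            top = Node(a[:k], node.n + 1)
            # initialize beta to old_bit repeated n times, then add the new bit
            top.beta = [old_bit] * node.n + [new_bit]
            top.child = [None, None]
            node.alpha = a[k + 1:]
            top.child[old_bit] = node
            top.child[new_bit] = Node(s[k + 1:], 1)
            if parent is None:
                return top
            parent.child[side] = top
            return root
        if node.beta is None:                # reached a leaf: s must equal alpha
            if len(s) != k:
                raise ValueError("string set would not be prefix-free")
            node.n += 1
            return root
        node.n += 1
        b = int(s[k])
        node.beta.append(b)                  # record the branching bit, then descend
        parent, side, node, s = node, b, node.child[b], s[k + 1:]

def access(node, i):
    """Read back the i-th string (0-based), to check the structure."""
    out = ""
    while node.beta is not None:
        b = node.beta[i]
        out += node.alpha + str(b)
        i = node.beta[:i].count(b)
        node = node.child[b]
    return out + node.alpha

S = ["0001", "0011", "0100", "00100", "0100", "00100", "0100"]
root = None
for s in S:
    root = append(root, s)
print(root.alpha, root.beta, access(root, 3))
```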
Space analysis
Information-theoretic lower bound
  – LB(S) = LT(S_set) + nH_0(S)
  – LT is the information-theoretic lower bound for storing a set of strings
Static WT: LB(S) + o(ĥn)
Append-only WT: LB(S) + PT(S_set) + o(ĥn)
  – PT(S_set): space taken by the Patricia Trie
Fully dynamic WT: LB(S) + PT(S_set) + O(nH_0(S))
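
To make the nH_0(S) term concrete, the sketch below computes the zero-order empirical entropy of the running example (foo, bar, foobar, foo, bar, bar, foo), using the standard definition H_0 = -Σ (n_s / n) log2(n_s / n) over the distinct strings s. It covers only this term; LT and PT are not computed.

```python
from collections import Counter
from math import log2

def h0(seq):
    """Zero-order empirical entropy in bits per element."""
    n = len(seq)
    return -sum(c / n * log2(c / n) for c in Counter(seq).values())

S = ["foo", "bar", "foobar", "foo", "bar", "bar", "foo"]
print(h0(S), len(S) * h0(S))   # bits per string, and the n * H_0(S) term in bits
```

For this sequence H_0 is about 1.45 bits per string, so nH_0(S) is roughly 10 bits.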
Operations time complexity
Need new dynamic bitvectors to support initialization (create a bitvector 0^n or 1^n)
Static and Append-only Wavelet Trie
  – All supported operations in O(|s| + h_s)
  – h_s is the number of nodes traversed by string s
Fully dynamic Wavelet Trie
  – All supported operations in O(|s| + h_s log n)
  – Deletion may take O(|ŝ| + h_s log n), where ŝ is the longest string in the trie
Thanks for your attention! Questions?