Presentation transcript: "Interplay between Stringology and Data Structure Design" by Roberto Grossi

1 Interplay between Stringology and Data Structure Design Roberto Grossi

2 Interplay between Stringology and Data Structure Design (limited view: my own experience) Roberto Grossi

3 Interplay between Stringology and Data Structure Design (limited view: my own experience) Roberto Grossi (advertising)

4 Interaction between stringology and data structures
Case studies:
- Compressed text indexing [G., Gupta, Vitter]
- Multi-key data structures [Crescenzi, G., Italiano] [Franceschini, G.] [G., Italiano]
- Order vs. disorder in searching [Franceschini, G.]
- In-place vector sorting [Franceschini, G.]

5 Compressed text indexing
Replace text ∈ Σ^n ⇒ self-indexing binary string [Ferragina, Manzini] [Sadakane] [G., Gupta, Vitter]
n log σ bits ⇒ n H_h + ... bits (where H_h = h-th order empirical entropy)
Unique algorithmic framework: wavelet tree + finite set model + succinct dictionaries + ...
Text indexing: new implementation of the CSA (compressed suffix array)

6 Compressed text indexing
Replace text ∈ Σ^n ⇒ self-indexing binary string [Ferragina, Manzini] [Sadakane] [G., Gupta, Vitter]
n log σ bits ⇒ n H_h + ... bits (where H_h = h-th order empirical entropy)
Unique algorithmic framework: wavelet tree + finite set model + succinct dictionaries + ...
Compression: new analysis of the BWT (Burrows-Wheeler transform)
Text indexing: new implementation of the CSA (compressed suffix array)

7 Suffix arrays, BWT, and H_h (high-order empirical entropy)
Equivalently, use contexts x of order h for cx instead of xc.
Example: T = mississippi#
Sorted suffixes: #, i#, ippi#, issippi#, ississippi#, mississippi#, pi#, ppi#, sippi#, sissippi#, ssippi#, ssissippi#
Suffix array: 12 11 8 5 2 1 10 9 7 4 6 3
BWT: ipssm#pissii
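The suffix array and BWT columns of this example can be reproduced with a short check (my own sketch, not part of the slides); it assumes the terminator # sorts before every letter:

    # Minimal sketch: suffix array and BWT of the slide's example text.
    T = "mississippi#"
    n = len(T)

    # Suffix array: starting positions (1-based, as on the slide) of the suffixes in sorted order.
    sa = sorted(range(n), key=lambda i: T[i:])
    print([i + 1 for i in sa])           # [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]

    # BWT: for each sorted suffix, take the character that precedes it (cyclically).
    bwt = "".join(T[i - 1] for i in sa)  # i = 0 wraps around to T[-1] = '#'
    print(bwt)                           # ipssm#pissii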

8 Suffix arrays, BWT, and H_h (high-order empirical entropy)
Context x = i, h = 1; chars c = p, s, m (same mississippi# example as above).
Store "pssm" using just the log of the multinomial coefficient, log (4 choose 1,1,2), bits.
Get n H_h bits (summed over all contexts)!
Add bits to encode the partition.

9 Incremental representation
Example:
- mark p: pssm → 1000
- remove p; mark m: ssm → 001
- remove m; mark s: ss → 11
We obtain 3 subsets. Encode each subset, containing t items out of n, using ⌈log (n choose t)⌉ bits.
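A quick check of the resulting bit counts for this example (my own sketch; math.comb needs Python 3.8+):

    import math

    def subset_bits(n, t):
        # Bits to encode a subset of t marked positions out of n: ceil(log2 C(n, t)).
        return math.ceil(math.log2(math.comb(n, t)))

    # pssm: mark the single p among 4 positions, then the single m among the
    # remaining 3, then the two s's among the remaining 2.
    steps = [(4, 1), (3, 1), (2, 2)]
    print([subset_bits(n, t) for n, t in steps])     # [2, 2, 0]
    print(sum(subset_bits(n, t) for n, t in steps))  # 4 bits; without ceilings it is log2(12), about 3.58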

10 Getting the multinomial coefficient: the sum of the log binomial coefficients of the subsets' sizes telescopes to the log of the multinomial coefficient.
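In LaTeX form, with the pssm example plugged in (my own worked arithmetic, not from the slides):

    \log\binom{n}{t_1} + \log\binom{n-t_1}{t_2} + \cdots + \log\binom{t_r}{t_r}
      \;=\; \log\frac{n!}{t_1!\,t_2!\cdots t_r!}
      \;=\; \log\binom{n}{t_1,\ldots,t_r}

    % Example (pssm, n = 4):
    %   \log\binom{4}{1} + \log\binom{3}{1} + \log\binom{2}{2}
    %     = \log(4 \cdot 3 \cdot 1) = \log 12 = \log\binom{4}{1,1,2}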

11 Wavelet trees
Generalize the idea from the linear list to any tree shape.
Cost is independent of the shape (e.g., assign access frequencies).
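For readers who have not seen the structure, a minimal wavelet-tree sketch (my own illustration: balanced shape, plain Python lists in place of succinct rank/select dictionaries) supporting rank queries over a sequence:

    class WaveletTree:
        """Balanced wavelet tree over a sequence drawn from a sorted alphabet."""
        def __init__(self, seq, alphabet=None):
            self.alphabet = sorted(set(seq)) if alphabet is None else alphabet
            if len(self.alphabet) == 1:
                self.bits = None                # leaf: all symbols here are equal
                return
            mid = len(self.alphabet) // 2
            left_set = set(self.alphabet[:mid])
            # 0 = symbol goes to the left child, 1 = to the right child
            self.bits = [0 if c in left_set else 1 for c in seq]
            self.left = WaveletTree([c for c in seq if c in left_set], self.alphabet[:mid])
            self.right = WaveletTree([c for c in seq if c not in left_set], self.alphabet[mid:])

        def rank(self, c, i):
            """Number of occurrences of symbol c in seq[:i]; assumes c is in the alphabet."""
            if self.bits is None:
                return i
            bit = 0 if c in self.alphabet[:len(self.alphabet) // 2] else 1
            # how many of the first i positions go down to the child holding c
            j = sum(1 for b in self.bits[:i] if b == bit)
            child = self.left if bit == 0 else self.right
            return child.rank(c, j)

    wt = WaveletTree("ipssm#pissii")
    print(wt.rank("s", 9))   # occurrences of 's' among the first 9 characters -> 3

Changing the tree shape only changes how the alphabet is split at each node, which is why the cost bound on the slide does not depend on the shape.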

12 Bound on bits of space
Crucial encoding for the BWT partition: r (positive) subsets' sizes t_1, ..., t_r, with Σ_i t_i = n.
Let enc(t_1, ..., t_r) be the number of bits for encoding the sequence of these r sizes.
Let g' = σ^(h+1) and g = σ^(h+1) log σ, both independent of n → ∞.
Then r ≤ g', and storing the BWT takes nH_h + [enc(t_1, ..., t_r) - 1/2 Σ_i log t_i] + O(r log σ) bits.

13 Bound on bits of space
Crucial encoding for the BWT partition: r (positive) subsets' sizes t_1, ..., t_r, with Σ_i t_i = n.
Let enc(t_1, ..., t_r) be the number of bits for encoding the sequence of these r sizes.
Let g' = σ^(h+1) and g = σ^(h+1) log σ, both independent of n → ∞.
Then r ≤ g', and storing the BWT takes nH_h + [enc(t_1, ..., t_r) - 1/2 Σ_i log t_i] + O(r log σ) ≤ nH_h + g' log(n/g') + O(g) bits.

14 Interaction between stringology and data structures
Case studies:
- Compressed text indexing
- Multi-key data structures [Crescenzi, G., Italiano] [Franceschini, G.] [G., Italiano]
- Order vs. disorder in searching
- In-place vector sorting

15 Why multi-key data? Strings are everywhere...
Keys are arbitrarily long:
- multi-dimensional points
- multiple precision numbers
- textual data
- XML paths
- URLs and IP addresses
- ...
Modeled as strings in Σ^k, for unbounded alphabets.
Q: How to avoid an O(k) slowdown factor in the cost of the operations supported by known data structures?

16 I. Ad hoc data structures
Some examples:
- ternary search trees [Clampett] [Bentley, Sedgewick]
- tries [...]
- lexicographic D-trees [Mehlhorn]
- multi-dimensional B-trees [Gueting, Kriegel]
- multi-dimensional AVL trees [Vaishnavi]
- lexicographic splay trees [Sleator, Tarjan]
- multi-dimensional BST [Gonzalez] [Roura]
- multi-BB-trees [Vaishnavi]
- ...
Search, insert, delete in O(k + log n) time.
Split and concatenate in O(k + log n) time.
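As a concrete instance of the first item, a minimal ternary-search-tree sketch (my own illustration, insert and membership only; expected O(k + log n) search for random insertion orders):

    class TSTNode:
        __slots__ = ("ch", "lo", "eq", "hi", "is_key")
        def __init__(self, ch):
            self.ch, self.lo, self.eq, self.hi, self.is_key = ch, None, None, None, False

    def tst_insert(node, s, i=0):
        # Insert non-empty string s into the ternary search tree rooted at node.
        if node is None:
            node = TSTNode(s[i])
        if s[i] < node.ch:
            node.lo = tst_insert(node.lo, s, i)
        elif s[i] > node.ch:
            node.hi = tst_insert(node.hi, s, i)
        elif i + 1 < len(s):
            node.eq = tst_insert(node.eq, s, i + 1)
        else:
            node.is_key = True
        return node

    def tst_search(node, s, i=0):
        # Return True iff s was inserted.
        if node is None:
            return False
        if s[i] < node.ch:
            return tst_search(node.lo, s, i)
        if s[i] > node.ch:
            return tst_search(node.hi, s, i)
        if i + 1 == len(s):
            return node.is_key
        return tst_search(node.eq, s, i + 1)

    root = None
    for w in ["sissippi", "sippi", "ssippi", "pi"]:
        root = tst_insert(root, w)
    print(tst_search(root, "sippi"), tst_search(root, "sip"))   # True False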

17 II. Augmenting access paths
Reuse many data structures for 1-dim keys:
- AVL trees, red-black trees
- skip lists
- (a,b)-trees
- BB[α]-trees
- self-adjusting trees
- random search trees (treaps, ...)
- ...
Inherit their combinatorial properties.
Traversing is driven by comparisons.

18 III. Using an oracle for strings
Data structure D = black box performing comparisons on pairs of 1-dim keys.
General theorem for transforming D into a data structure D' for strings (no efficiency loss).
Oracle DS_lcp for maintaining order in a linked list of strings, along with their lcps (extends the Dietz-Sleator list).

19 The general technique
New data structure D' = old data structure D + oracle DS_lcp.
Method: a comparison is O(1)-time if we know lcp(x, y) = min { j : x[j+1] ≠ y[j+1] } (x < y iff x[lcp+1] < y[lcp+1]):
- use DS_lcp for storing and comparing pairs of strings in D' in constant time
- use the predecessors and lcps computed so far to insert a new string y into D' (and DS_lcp)
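A small sketch (my own illustration; a naive lcp function stands in for the DS_lcp oracle) of the two facts the technique relies on: comparing strings in O(1) once their lcp is known, and propagating lcps instead of recomputing them:

    def compare_with_lcp(x, y, lcp):
        """Return -1/0/1 for x<y, x==y, x>y, looking only at position lcp (0-based)."""
        # lcp = length of the longest common prefix of x and y, assumed already known.
        if lcp == len(x) and lcp == len(y):
            return 0
        if lcp == len(x):                 # x is a proper prefix of y
            return -1
        if lcp == len(y):
            return 1
        return -1 if x[lcp] < y[lcp] else 1

    def lcp_of(a, b):
        """Naive lcp, used only to seed the oracle in this toy example."""
        i = 0
        while i < min(len(a), len(b)) and a[i] == b[i]:
            i += 1
        return i

    # Standard identity used when inserting a new string y between its neighbours:
    # if x <= y <= z lexicographically, then lcp(x, z) = min(lcp(x, y), lcp(y, z)),
    # so many lcps can be propagated instead of recomputed character by character.
    x, y, z = "sippi", "sissippi", "ssippi"
    assert lcp_of(x, z) == min(lcp_of(x, y), lcp_of(y, z))
    print(compare_with_lcp(x, y, lcp_of(x, y)))   # -1, i.e. "sippi" < "sissippi"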

20 Theorem for general transformation
Comparison-driven data structure D for n keys: ins identifies pred or succ.
(This does not necessarily imply Ω(log n) per ins in a sequence of operations; e.g., finger search trees.)

21 Theorem for general transformation
Comparison-driven data structure D for n keys (ins identifies pred or succ) ⇒ string data structure D' for n strings.
(Again, the assumption does not necessarily imply Ω(log n) per ins in a sequence of operations; e.g., finger search trees.)

22 Theorem for general transformation
Comparison-driven data structure D for n keys (ins identifies pred or succ) ⇒ string data structure D' for n strings:
- space S(n) ⇒ space S(n) + O(n)

23 Theorem for general transformation
Comparison-driven data structure D for n keys (ins identifies pred or succ) ⇒ string data structure D' for n strings:
- space S(n) ⇒ space S(n) + O(n)
- operation op on O(1) keys in D in T(n) time ⇒ operation op on O(1) strings in D' in O(T(n)) time

24 Theorem for general transformation
Comparison-driven data structure D for n keys (ins identifies pred or succ) ⇒ string data structure D' for n strings:
- space S(n) ⇒ space S(n) + O(n)
- operation op on O(1) keys in D in T(n) time ⇒ operation op on O(1) strings in D' in O(T(n)) time
- operation op involving a string y not in D, in T(n) time ⇒ operation op involving y not in D', in O(T(n) + k) time

25 Some features
No need to reinvent the wheel for data structure designers.
Better than using compact trie + Dietz-Sleator list + dynamic LCA when T(n) = o(log n), e.g.:
- weighted search O(log((Σ_i w_i)/w))
- finger search O(log d)
- set manipulation O(n log(m/n))

26 Interaction between stringology and data structures
Case studies:
- Compressed text indexing
- Multi-key data structures
- Order vs. disorder in searching [Franceschini, G.]
- In-place vector sorting

27 Searching In-Place a Sorted(?) Array of Strings
"Imagine how hard it would be to use a dictionary if its words were not alphabetized!" -- D.E. Knuth, The Art of Comp. Prog., vol. 3, 1998

28 Order vs. Disorder: An experiment
Think of your desk...
1. Are the papers on your desk in sorted order?
2. Probably not!
3. Unsorted data seems to provide more informative content than sorted data...
4. Can we formalize this intuition in the comparison model?

29 Preprocessing by sorting
In-place search of the lexicographically sorted array [Andersson, Hagerup, Håstad, Petersson, '94, '95, '01]: matching upper/lower bounds on the search time, which can be asymptotically larger than k + log n; the classical Θ(log n) when k = 1.

30 Permuting is better!
For any key length k, there exists an "unsorted" permutation attaining simultaneously:
- Θ(k + log n) search time
- O(1) extra space
Optimal among all possible permutations, better than those resulting from sorting.
Warning: suffix array search is not in-place (since the LCP array takes more than O(1) extra cells).

31 Basic tool: Bit stealing
Simple, yet effective, idea on pairwise sorted keys: implicit bits are encoded by pairwise exchanging keys!
For keys of length k ⇒ O(k) slowdown factor (decoding one bit compares two length-k keys).
Q: Can we get O(1) decoding time?
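A toy sketch of bit stealing (my own illustration): a pair of distinct keys whose sorted order is known carries one implicit bit, read back by comparing the two keys.

    def encode_bit(a, i, j, bit):
        """Store one bit in the pair (a[i], a[j]), assuming a[i] != a[j].
        Convention: ascending pair = 0, descending pair = 1."""
        lo, hi = sorted((a[i], a[j]))
        a[i], a[j] = (lo, hi) if bit == 0 else (hi, lo)

    def decode_bit(a, i, j):
        # One comparison of the two keys; for length-k string keys this costs O(k).
        return 0 if a[i] < a[j] else 1

    a = [5, 2, 9, 1, 7, 4]          # any distinct, comparable keys
    encode_bit(a, 0, 1, 1)
    encode_bit(a, 2, 3, 0)
    print(decode_bit(a, 0, 1), decode_bit(a, 2, 3))   # 1 0
    print(sorted(a) == [1, 2, 4, 5, 7, 9])            # the multiset of keys is unchanged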

32 K-dimensional bit stealing: Digging a ditch!
Using d = lcp(x_i, x_j) + 1, decode a bit in O(1) time by checking the mismatching characters x_i[d] and x_j[d].
Idea exploited for digging a ditch, in O(k + r) time:
  DIGGING(x_1 ... x_r)
    d ← 1, i ← 1, j ← r
    while i < j do                            // twin positions i and j
      while d ≤ k and x_i[d] = x_j[d] do d ← d + 1
      i ← i + 1, j ← j - 1
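A direct, 0-indexed Python transcription of the pseudocode (my own sketch): for each twin pair (i, j) it records the depth at which x_i and x_j first differ, advancing d monotonically so the total work is O(k + r).

    def digging(x):
        """x: list of r pairwise-distinct, sorted strings of equal length k.
        Returns the digging depth (first mismatch position) for each twin pair (i, j)."""
        k = len(x[0])
        d, i, j = 0, 0, len(x) - 1
        depths = {}
        while i < j:                       # twin positions i and j
            while d < k and x[i][d] == x[j][d]:
                d += 1
            depths[(i, j)] = d             # x[i][d] != x[j][d] here (when d < k)
            i, j = i + 1, j - 1
        return depths

    x = sorted(["aaab", "aabb", "abab", "abba", "bbaa", "bbab"])
    print(digging(x))   # {(0, 5): 0, (1, 4): 0, (2, 3): 2}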

33 Ditch: twin positions and twin intervals
Create twin intervals with the same digging depth; bit stealing is O(1) time with keys in twin positions.

34 Large DITCH
Encode information for the twin intervals in O(k log n) distinct keys (which are still searchable).
(Figure: these twin positions can encode 3 bits.)

35 Inside each twin interval T
Searching A reduces to searching in a specific twin interval T.
Use a modified Manber-Myers search, accessing just O(log n) stolen bits in T for lcp information (instead of O(log n) × O(log n) bits).
⇒ It is provably more efficient to keep data "unsorted" rather than "sorted" for in-place searching.
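For background, a simplified sketch of the classic Manber-Myers binary-search idea (my own illustration, not the modified in-place variant on the slide): keep the lcp of the query with both boundaries and resume each comparison from the smaller of the two, instead of from scratch.

    def lcp(a, b, start=0):
        i = start
        while i < min(len(a), len(b)) and a[i] == b[i]:
            i += 1
        return i

    def mm_search(A, q):
        """Leftmost index where q could be inserted in the sorted string array A."""
        lo, hi = -1, len(A)               # virtual sentinels; invariant A[lo] < q <= A[hi]
        l = r = 0                         # l = lcp(q, A[lo]), r = lcp(q, A[hi])
        while hi - lo > 1:
            mid = (lo + hi) // 2
            m = lcp(q, A[mid], start=min(l, r))   # lcp(q, A[mid]) >= min(l, r) is guaranteed
            if m == len(q) or (m < len(A[mid]) and q[m] <= A[mid][m]):
                hi, r = mid, m            # q <= A[mid]
            else:
                lo, l = mid, m            # q > A[mid]
        return hi

    A = ["aaab", "aabb", "abab", "abba", "bbaa", "bbab"]
    print(mm_search(A, "abba"), mm_search(A, "b"))   # 3 4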

36 Interaction between stringology and data structures
Case studies:
- Compressed text indexing
- Multi-key data structures
- Order vs. disorder in searching
- In-place vector sorting [Franceschini, G.]

37 Logical order ≡ physical layout
Knuth's indirect addressing:
1. permute the records' pointers to find their ranks
2. permute the records according to the ranks
What if records are scrambled during merging? Irregular access pattern to the records.
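A small sketch (my own illustration) of step 2, permuting records in place according to their ranks by following permutation cycles: O(n) record moves and a single temporary record buffer, plus the rank array and a visited array as bookkeeping (which the truly in-place algorithms must themselves avoid).

    def permute_by_rank(records, rank):
        """Move records so that records[rank[i]] ends up holding the old records[i].
        Follows each cycle of the permutation once; O(n) moves, one temporary record."""
        n = len(records)
        done = [False] * n
        for start in range(n):
            if done[start] or rank[start] == start:
                done[start] = True
                continue
            carried = records[start]        # record currently being relocated
            i = start
            while not done[i]:
                j = rank[i]                 # destination of the carried record
                records[j], carried = carried, records[j]
                done[i] = True
                i = j

    recs = ["delta", "alpha", "charlie", "bravo"]
    ranks = [3, 0, 2, 1]                    # sorted position of each record
    permute_by_rank(recs, ranks)
    print(recs)                             # ['alpha', 'bravo', 'charlie', 'delta']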

38 In-place model for vector sorting: GVSP(m, p, h)
Comparison model extended to keys of length k, using O(1) extra memory cells:
- m vectors of length k to be sorted
- p vectors for internal buffering
- h stolen bits with 2h vectors
Initially, m = n and p = h = 0.

39 Optimal time-space bounds
Reduce GVSP(m, p, h) recursively to simpler instances.
Use internal implicit data structures for strings in some of the instances.
Sorting cost is time-space optimal:
- O(nk + n log n) time/comparisons
- O(n) vector moves
- O(1) words of memory for extra space

40 Conclusions
Joint work on the "reverse" contribution, from stringology to data structure/algorithm design.
Fruitful interplay between the two areas:
- Compressed text indexing
- Multi-key data structures
- Order vs. disorder in searching
- In-place vector sorting

