Wavelet Trees. Ankur Gupta, Butler University.


1 Wavelet Trees Ankur Gupta Butler University

2 Text Dictionary Problem
The input is a text T of n symbols drawn from an alphabet Σ. We want to support the following queries:
– char(i): returns the symbol at position i
– rank_c(i): returns the number of occurrences of c in T up to position i
– select_c(i): returns the position of the i-th occurrence of symbol c in T
Text T can be compressed to nH_0 space, answering queries in
– O(log |Σ|) time using the wavelet tree [GGV03]
– O(log log |Σ|) time using [GMR06], at the cost of more space
When |Σ| = polylog(n), queries can be answered in O(1) time [FMMN04]
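To pin down the query semantics before any compression enters the picture, here is a minimal, uncompressed reference sketch in Python. It assumes 1-based positions (as the later slides appear to use); the function names simply mirror the slide, and nothing here reflects how [GGV03] or the other cited structures are implemented.

```python
def char(T, i):
    """Symbol at 1-based position i of T."""
    return T[i - 1]

def rank(T, c, i):
    """Number of occurrences of c among the first i symbols of T."""
    return T[:i].count(c)

def select(T, c, j):
    """1-based position of the j-th occurrence of c in T, or None if there is none."""
    seen = 0
    for pos, sym in enumerate(T, start=1):
        if sym == c:
            seen += 1
            if seen == j:
                return pos
    return None

T = "preparedpeppers"
print(char(T, 7), rank(T, "r", 10), select(T, "r", 2))  # prints: e 2 6
```

These linear-time scans are only a specification to test faster structures against.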

3 [Figure: wavelet tree for the text "preparedpeppers". Each node shows its string and its bitvector; a 0 bit sends a symbol to the left child, a 1 bit to the right child.]
preparedpeppers 110101001011011
  0: eaedee 101111
    0: a 1
    1: eedee 11011
      0: d 1
      1: eeee 1111
  1: prprppprs 010100011
    0: ppppp 11111
    1: rrrs 0001
      0: rrr 111
      1: s 1
Compute rank_r(10) (answer is 2). Walking down from the root:
– at the root, actually compute rank_1(10) = 5
– at prprppprs, actually compute rank_1(5) = 2
– at rrrs, actually compute rank_0(2) = 2
– at the leaf rrr, actually compute rank_1(2) = 2

4 [Figure: the same wavelet tree for "preparedpeppers" as on slide 3.]
Compute select_r(2) (answer is 6). Walking up from the leaf:
– at the leaf rrr, actually compute select_1(2) = 2
– at rrrs, where r is a 0 bit, compute select_0(2) = 2
– at prprppprs, actually compute select_1(2) = 4
– at the root, actually compute select_1(4) = 6

5 [Figure: the same wavelet tree for "preparedpeppers" as on slide 3.]
Compute char(7) (answer is e). Walking down from the root:
– at the root, bit 7 is 0, so compute rank_0(7) = 3 and go left
– at eaedee, bit 3 is 1, so compute rank_1(3) = 2 and go right
– at eedee, bit 2 is 1, so compute rank_1(2) = 2 and go right
– at the leaf eeee, bit 2 is 1; the symbol is e
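The three walkthroughs (rank, select, char) can be reproduced with a small pointer-based sketch in Python. It assumes the alphabet is sorted and split in half at each node, which happens to match the tree drawn for "preparedpeppers". The class name `WaveletTree` and the helpers `_rank_bit` and `_select_bit` are illustrative, and the bitvectors are plain lists scanned linearly rather than the succinct rank/select structures the talk cites.

```python
class WaveletTree:
    """Pointer-based wavelet tree; bit 0 = left half of the alphabet, 1 = right half."""

    def __init__(self, text, alphabet=None):
        self.alphabet = sorted(set(text)) if alphabet is None else alphabet
        self.n = len(text)
        self.bits = None                      # stays None at a leaf (single symbol)
        if len(self.alphabet) <= 1:
            return
        mid = len(self.alphabet) // 2
        self.left_syms = set(self.alphabet[:mid])
        self.bits = [0 if c in self.left_syms else 1 for c in text]
        self.left = WaveletTree([c for c in text if c in self.left_syms],
                                self.alphabet[:mid])
        self.right = WaveletTree([c for c in text if c not in self.left_syms],
                                 self.alphabet[mid:])

    def _rank_bit(self, b, i):
        """Number of bits equal to b among the first i bits."""
        return sum(1 for x in self.bits[:i] if x == b)

    def _select_bit(self, b, j):
        """1-based position of the j-th bit equal to b."""
        seen = 0
        for pos, x in enumerate(self.bits, start=1):
            if x == b:
                seen += 1
                if seen == j:
                    return pos
        raise IndexError("fewer than j matching bits")

    def rank(self, c, i):
        """Occurrences of c among the first i symbols (top-down walk)."""
        if self.bits is None:
            return i if self.alphabet and c == self.alphabet[0] else 0
        if c in self.left_syms:
            return self.left.rank(c, self._rank_bit(0, i))
        return self.right.rank(c, self._rank_bit(1, i))

    def select(self, c, j):
        """Position of the j-th occurrence of c (resolved bottom-up via recursion)."""
        if self.bits is None:
            if not 1 <= j <= self.n:
                raise IndexError("fewer than j occurrences")
            return j
        if c in self.left_syms:
            return self._select_bit(0, self.left.select(c, j))
        return self._select_bit(1, self.right.select(c, j))

    def access(self, i):
        """Symbol at 1-based position i (the slides call this char)."""
        if self.bits is None:
            return self.alphabet[0]
        b = self.bits[i - 1]
        child = self.left if b == 0 else self.right
        return child.access(self._rank_bit(b, i))


wt = WaveletTree("preparedpeppers")
print(wt.rank("r", 10), wt.select("r", 2), wt.access(7))  # prints: 2 6 e
```

Each query touches one node per level, so with constant-time bitvector rank and select it would take time proportional to the tree height; this sketch does not achieve that because of the linear scans.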

6 Some comments
Don’t have to store any of the “all 1s” nodes
– That’s just to help for the example.
What does the wavelet tree imply?
– Converts representation of a finite string on an alphabet to representation of many bitvectors.
– Useful to achieve, ultimately, high-order compression.
– Easy to implement: very simple structure and query pattern

7 Shapin’ Up To Something Special
What about the shape of a wavelet tree?
– Does it affect space? No. (You will see why in a bit.)
– Does it affect time? Yes.
Good news! Reorganize it to optimize query time...
– Use a Huffman orientation based on query access.
– If you choose symbol frequency, you can now search in O(H_0) time on average instead of O(log |Σ|).
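As a rough illustration of the Huffman-shape remark, the Python sketch below builds Huffman depths from symbol frequencies and compares the frequency-weighted average depth with H_0 and with log |Σ| for the running example. The function name `huffman_depths` is made up for this sketch, and the example assumes queries are distributed like the symbols in the text.

```python
import heapq
from collections import Counter
from math import log2

def huffman_depths(freq):
    """Return {symbol: depth in a Huffman tree} for the given symbol counts."""
    # Heap entries are (weight, tie_breaker, {symbol: depth so far}); the unique
    # tie_breaker guarantees the dictionaries themselves are never compared.
    heap = [(w, k, {s: 0}) for k, (s, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**d1, **d2}.items()}  # one level deeper
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]

T = "preparedpeppers"
freq = Counter(T)
n = len(T)
depths = huffman_depths(freq)

avg_depth = sum(freq[s] * depths[s] for s in freq) / n   # levels visited per query, on average
h0 = sum((c / n) * log2(n / c) for c in freq.values())   # zero-order empirical entropy
print(f"average Huffman depth = {avg_depth:.3f}")
print(f"H_0 = {h0:.3f}, log2 |alphabet| = {log2(len(freq)):.3f}")
```

The average depth of a Huffman tree lies between H_0 and H_0 + 1, which is the sense in which a frequency-shaped wavelet tree answers a typical query in about H_0 steps.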

8 Wavelet Tree Space/Time
Simple bitvectors
– n bits per level and log |Σ| levels, so n log |Σ| bits overall
– O(n log log n / log n) extra bits for rank/select [J89]
– Same space as original text but can now support rank_c/select_c/char in O(log |Σ|) time. (RAM)
Fancy
– [RRR02] gets O(nH_0) + O(n log log n / log n) bits of space with O(log |Σ|) query time
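To give a feel for what "extra bits for rank" buys, here is a deliberately simplified Python sketch: it stores one cumulative count per block and scans at most one block per query. This is not the structure of [J89], which uses two levels of counts plus table lookups to get constant time with o(n) extra bits; the class name `RankBitvector` and the block size are assumptions made for illustration.

```python
class RankBitvector:
    """Bitvector with sampled prefix counts; a rank query scans at most one block."""

    def __init__(self, bits, block=8):
        self.bits = list(bits)
        self.block = block
        # samples[k] = number of 1s among the first k * block bits
        self.samples = [0]
        for start in range(0, len(self.bits), block):
            self.samples.append(self.samples[-1] + sum(self.bits[start:start + block]))

    def rank1(self, i):
        """Number of 1s among the first i bits (0 <= i <= len(bits))."""
        k, rem = divmod(i, self.block)
        begin = k * self.block
        return self.samples[k] + sum(self.bits[begin:begin + rem])

    def rank0(self, i):
        """Number of 0s among the first i bits."""
        return i - self.rank1(i)


root = RankBitvector(int(b) for b in "110101001011011")  # root bitvector of the example tree
print(root.rank1(10), root.rank0(7))  # prints: 5 3
```

The two printed values are the root-level steps of the rank_r(10) and char(7) walkthroughs.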

9 Even Skewed Is a Shape

10 Empirical Entropy
Text T of n symbols drawn from alphabet Σ (n lg |Σ| bits)
Entropy: measure to assess compression size based on text
Higher-order entropy H_h (of order h)
– Considers context x of neighboring h symbols
– Each Prob[y|x] term is thus conditioned on context x of h symbols
– Note that H_h(T) ≤ lg |Σ|
– Now the text takes nH_h ≤ n lg |Σ| bits of space to encode
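The transcript does not preserve the formulas from this slide. For reference, the usual definitions of empirical entropy from the compressed-indexing literature are sketched below; the notation n_c (number of occurrences of symbol c in T) and T_x (the string of symbols that follow occurrences of context x in T) is introduced here and may differ from what the slide showed.

```latex
H_0(T) = \sum_{c \in \Sigma,\; n_c > 0} \frac{n_c}{n} \lg \frac{n}{n_c},
\qquad
H_h(T) = \frac{1}{n} \sum_{x \in \Sigma^h} |T_x| \, H_0(T_x)
       = - \sum_{x \in \Sigma^h} \sum_{y \in \Sigma} \mathrm{Prob}[x, y] \, \lg \mathrm{Prob}[y \mid x]
```

Here Prob[x, y] is the fraction of positions where y follows context x, and Prob[y | x] is the fraction of occurrences of x that are followed by y, which is the conditioning the slide refers to.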

11 One Text Indexing Result Because Frankly, There Are Lots
Main Results (using CSA [GGV03])
– Space usage: nH_h + o(n log |Σ|) bits
– Search time: O(m log |Σ| + polylog(n)) for a pattern of length m
– Can improve to o(m) time with constant factor more space
When the text T is highly compressible (i.e. nH_h = o(n)), we achieve the first index sublinear in both space and search time
Second-order terms represent the space needed for
– Fast indexing
– Storing count statistics for the text
Obtain nearly tight bounds on the encoding length of the Burrows-Wheeler Transform (BWT)

12 Tell Me More! How Do You Do It?
[Figure: suffix array SA_0 = 4 7 1 5 8 6 3 2 over the text positions, shown together with the sampled array SA_1, the Rank_red counts (even entries are marked red), and the neighbor function Φ_0.]
For this example, suppose we know SA_1.
– For an even entry, use SA_1. Example: SA_0[5] = 2·SA_1[Rank_red(5)] = 8.
– For an odd entry, use neighbor function Φ_0. Example: SA_0[2] = SA_0[Φ_0(2)] - 1 = SA_0[5] - 1 = 7.
Perform these steps recursively for a compressed suffix array.
Neighbor function Φ_0 tells the position in the suffix array of the next suffix (in text order).
It turns out that the neighbor function Φ is the primary bottleneck for space.
– Encode increasing subsequences together to get zero-order entropy
– Subdivide subsequences and encode to get high-order entropy
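One level of this decomposition can be sketched in Python using the SA_0 values from the slide. The sketch keeps SA_1, the even/odd marks, the rank counts, and Φ_0 as ordinary lists and dictionaries, so it shows only the lookup logic and none of the compression. The names `is_even`, `rank_red`, `Phi0`, and `lookup` are illustrative, and the other arrays are derived from SA_0 here rather than copied from the figure.

```python
SA0 = [4, 7, 1, 5, 8, 6, 3, 2]   # suffix array from the slide (1-based text positions)

# Marks for the even entries (drawn in red on the slide).
is_even = [v % 2 == 0 for v in SA0]

# SA1 keeps the even entries of SA0, halved, in their original order.
SA1 = [v // 2 for v in SA0 if v % 2 == 0]

# rank_red[i - 1] = number of even entries among SA0[1..i].
rank_red = []
count = 0
for flag in is_even:
    count += flag
    rank_red.append(count)

# Neighbor function for odd entries: Phi0[i] = j such that SA0[j] = SA0[i] + 1.
position_of = {v: i for i, v in enumerate(SA0, start=1)}
Phi0 = {i: position_of[v + 1] for i, v in enumerate(SA0, start=1) if v % 2 == 1}

def lookup(i):
    """Recover SA0[i] (1-based) using only is_even, rank_red, SA1 and Phi0."""
    if is_even[i - 1]:
        return 2 * SA1[rank_red[i - 1] - 1]
    return lookup(Phi0[i]) - 1        # an odd entry is one less than its even neighbor

print(lookup(5), lookup(2))  # prints: 8 7
```

In the recursive scheme the slide describes, SA_1 would itself be represented the same way one level down instead of being stored explicitly.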

13 Burrows-Wheeler Transform (BWT) and the Neighbor Function Φ
The Φ function has a strong relationship to the Burrows-Wheeler Transform (BWT)
The BWT has had a profound impact on a myriad of fields.
– Simply put, it pre-processes an input text T by a reversible transform.
– The result is easily compressible using simple methods.
The BWT (and the Φ function) are at the heart of many text compression and indexing techniques, such as bzip2.
We also call the Φ function the FL mapping from the BWT.
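For readers who have not seen the transform, here is a textbook-style Python sketch of the BWT and its inverse using sorted rotations. It assumes a sentinel character `$` that does not occur in the text and sorts before every other symbol; both functions are quadratic-or-worse teaching code, not what bzip2 or a compressed index would do.

```python
def bwt(text, sentinel="$"):
    """Last column of the sorted rotations of text + sentinel."""
    assert sentinel not in text
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

def inverse_bwt(last, sentinel="$"):
    """Rebuild the text by repeatedly prepending the last column and re-sorting."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(last[i] + table[i] for i in range(len(last)))
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]

transformed = bwt("preparedpeppers")
print(transformed)
print(inverse_bwt(transformed))  # prints: preparedpeppers
```

The round trip illustrates the "reversible" claim; the transformed string tends to group equal symbols into runs, which is what makes simple compressors effective on it.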

14 Burrows-Wheeler Transform (BWT)

15 A Shifty Little BWT
[Figure labels: list i, list s]

16 Where Oh Where Is My Wavelet Tree?
For each list from the previous slide, we store a wavelet tree to achieve 0th-order entropy
– The collection of 0th-order compressors gives high-order entropy based on the context (not shown in this talk).
Technical point: number of alphabet symbols cannot be more than text length
– We “rerank” symbols to match this requirement (negligible extra cost in space, O(1) time)

17 Any questions?

