Divide the encoded file into blocks of size b. Use an auxiliary bit vector to indicate the beginning of each block. Time: O(b). Time vs. memory.

Presentation transcript:

1

2
- Divide the encoded file into blocks of size b
- Use an auxiliary bit vector to indicate the beginning of each block
- Time: O(b)
- Time vs. memory storage tradeoff

3
- Grossi, Gupta and Vitter, 2003
[Figure: wavelet tree; node bit vectors shown: 110010100, 10100, 0101, 00110001, 01001, 00010011101010011, 010, 10010, 01, 10]

4
- Grossi and Ottaviano: wavelet trees based on Patricia tries
- Brisaboa, Ladra, Navarro (IPM 2013): wavelet tree for byte codes
- Kulekci (DCC 2014): Elias and Rice codes
- P. Prochazka, J. Holub (DCC 2014): compression for similar biological sequences

5
- Fibonacci Codes
- Rank and Select
- Random Access using auxiliary index
- Random Access using Wavelet trees
- Improved Wavelet trees for Random Access
- Experimental Results

6
- Fibonacci Codes
- Rank and Select
- Random Access using auxiliary index
- Random Access using Wavelet trees
- Improved Wavelet trees for Random Access
- Experimental Results

7
Basis elements of a numeration system: 0 1 1 2 3 5 8 13 21 34 55 89 144 …

8
Binary basis:     128  64  32  16   8   4   2   1
73 =                0   1   0   0   1   0   0   1
Fibonacci basis:   55  34  21  13   8   5   3   2   1
73 =                1   0   0   1   0   1   0   0   0
No adjacent 1's
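The Fibonacci representation on this slide can be reproduced with a greedy scan over the basis, largest element first. The following is a minimal sketch; the function names `fib_basis` and `zeckendorf` are illustrative and do not come from the presentation.

```python
def fib_basis(n):
    """Fibonacci basis elements 1, 2, 3, 5, ... that are <= n, ascending."""
    basis = [1, 2]
    while basis[-1] + basis[-2] <= n:
        basis.append(basis[-1] + basis[-2])
    return [f for f in basis if f <= n]


def zeckendorf(n):
    """Bits of n over the Fibonacci basis, largest basis element first."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    bits = []
    for f in reversed(fib_basis(n)):
        if f <= n:
            # Taking the largest fitting element leaves a remainder smaller
            # than the next element down, so two 1s are never adjacent.
            bits.append("1")
            n -= f
        else:
            bits.append("0")
    return "".join(bits)


print(zeckendorf(73))  # basis 55 34 21 13 8 5 3 2 1
print(zeckendorf(19))  # basis 13 8 5 3 2 1
```

The greedy choice is what guarantees the "no adjacent 1's" property that the code on the following slides relies on.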

9 EExample: 19 = 101001 PProblem: Not instantaneous Solution: Reverse the codeword EExample: 19 = {{11, 011, 0011, 1011, 00011, 10011, 01011, 000011, 100011, 010011, 001011, 101011, 0000011, …} 1101001 1 1001011

10 SSet of strings ending in 11 with no other adjacent 1’s {{11, 011, 0011, 1011, 00011, 10011, 01011, 000011, 100011, 010011, 001011, 101011, 0000011, …}

11
- Fibonacci Codes
- Rank and Select
- Random Access using auxiliary index
- Random Access using Wavelet trees
- Improved Wavelet trees for Random Access
- Experimental Results

12
- Given a bit vector B of length n
- rank_1(B, i) (resp. rank_0(B, i)): the number of 1s (resp. 0s) up to and including position i in B
- select_1(B, i) (resp. select_0(B, i)): the index of the i-th 1 (resp. 0) in B

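13
- rank_1(B, i) = i - rank_0(B, i)
  › so it suffices to compute rank_1(B, i)
- Naive solution: store the rank answer for every position
- Example:
  i       1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
  B       0  1  0  0  0  1  0  1  1  0  0  0  0  1  1  1  1  0  0  1
  rank_1  0  1  1  1  1  2  2  3  4  4  4  4  4  5  6  7  8  8  8  9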
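The naive solution amounts to a prefix-sum table over B. A minimal sketch follows, using the example bit vector from this slide; `rank_table`, `rank1` and `rank0` are illustrative names.

```python
def rank_table(bits):
    """Prefix counts of 1s: table[i - 1] is rank_1(B, i) for 1-based i."""
    table, count = [], 0
    for bit in bits:
        count += bit == "1"
        table.append(count)
    return table


def rank1(table, i):
    """Number of 1s up to and including position i (1-based)."""
    return table[i - 1] if i > 0 else 0


def rank0(table, i):
    """Number of 0s up to and including position i, derived from rank_1."""
    return i - rank1(table, i)


B = "01000101100001111001"
table = rank_table(B)
print("".join(str(r) for r in table))
print(rank1(table, 9), rank0(table, 9))
```

Each answer is a single lookup, but the table stores one counter per position, which is the cost the sampled structure on the next slide is designed to avoid.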

14
- Store rank answers every lg^2 n bits of B
  › Use lg n bits for each answer
- Divide each chunk into sub-chunks of (lg n)/2 bits
- Store rank answers relative to the last sample every (lg n)/2 bits
  › Use 2 lg lg n bits per sub-sample
- Bottom level: use a simple lookup table
- Space complexity: n/lg n bits for the samples plus 4n lg lg n/lg n bits for the sub-samples plus the lookup table, i.e. o(n) bits in addition to B
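The two sampling levels can be sketched as below. This is an illustration under simplifying assumptions, not the structure from the slides verbatim: the counters are plain Python integers rather than packed lg n and 2 lg lg n bit fields, and the bottom-level lookup table is replaced by counting inside one sub-chunk. The class name `TwoLevelRank` is hypothetical.

```python
import math


class TwoLevelRank:
    def __init__(self, bits):
        self.bits = bits
        lg = max(2, int(math.log2(max(len(bits), 4))))
        self.small = max(1, lg // 2)      # sub-chunk of about (lg n)/2 bits
        self.big = self.small * 2 * lg    # chunk of about lg^2 n bits
        self.samples = []                 # absolute rank before each chunk
        self.sub = []                     # rank relative to the chunk start
        total = rel = 0
        for start in range(0, len(bits), self.small):
            if start % self.big == 0:
                self.samples.append(total)
                rel = 0
            self.sub.append(rel)
            ones = bits.count("1", start, start + self.small)
            total += ones
            rel += ones

    def rank1(self, i):
        """Number of 1s in positions 1..i (1-based, inclusive)."""
        i = max(0, min(i, len(self.bits)))
        if not self.sub:
            return 0
        k = min(i // self.small, len(self.sub) - 1)
        start = k * self.small
        # Stand-in for the lookup table: count inside a single sub-chunk.
        tail = self.bits.count("1", start, i)
        return self.samples[start // self.big] + self.sub[k] + tail


r = TwoLevelRank("01000101100001111001")
print(r.rank1(9), r.rank1(20))
```

A query adds three numbers: one sample, one sub-sample and one bottom-level count, so the work per query does not grow with the length of B.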

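15
[Figure: rank query example over blocks; values shown: 7041, 21627, ..., 613, 950. Output = 7041 + 613 + lookup-table entry. Lookup table: 000…00 → 0, 000…01 → 1, 000…10 → 1, 000…11 → 2, …, 111…10, 111…11]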

16
- Fibonacci Codes
- Rank and Select
- Random Access using auxiliary index
- Random Access using Wavelet trees
- Improved Wavelet trees for Random Access
- Experimental Results

17
1. E(T) ← compress T
2. Generate a bit vector B of size |E(T)| so that B[i] = 1 iff E(T)[i] is the first bit of a codeword
3. Construct a rank/select data structure for B
Space complexity: B adds |E(T)| bits, plus o(|E(T)|) bits for its rank/select structure
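With B in place, the i-th codeword is the slice of E(T) between the i-th and (i+1)-th 1 of B. The sketch below uses a linear-scan `select1` as a stand-in for a real select structure. The codeword assignment is the one from the COMPRESSORS example later in the presentation, and the function names are illustrative.

```python
CODE = {"S": "11", "R": "011", "O": "0011", "E": "1011",
        "P": "00011", "M": "10011", "C": "01011"}
T = "COMPRESSORS"


def build_marker(codewords):
    """Return E(T) and B, where B[i] = 1 iff E(T)[i] starts a codeword."""
    encoded = "".join(codewords)
    marker = "".join("1" + "0" * (len(c) - 1) for c in codewords)
    return encoded, marker


def select1(bits, i):
    """1-based position of the i-th 1 (naive scan, for illustration only)."""
    pos = -1
    for _ in range(i):
        pos = bits.index("1", pos + 1)
    return pos + 1


def access(encoded, marker, i):
    """Codeword of the i-th symbol (1-based), located with two select queries."""
    start = select1(marker, i)
    if i < marker.count("1"):
        end = select1(marker, i + 1)
    else:
        end = len(marker) + 1  # last codeword runs to the end of E(T)
    return encoded[start - 1:end - 1]


encoded, marker = build_marker([CODE[ch] for ch in T])
print(access(encoded, marker, 5))  # codeword of the 5th symbol, R
```

For Fibonacci codes the second select could be dropped by decoding from `start` until the first 11; the two-select form is shown because it works for any code.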

18
- Fibonacci Codes
- Rank and Select
- Random Access using auxiliary index
- Random Access using Wavelet trees
- Improved Wavelet trees for Random Access
- Experimental Results

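19
- T = COMPRESSORS
- Σ = {C, M, P, E, O, R, S}
- Occ = {1, 1, 1, 1, 2, 2, 3}
- E(T) = 01011 0011 10011 00011 011 1011 11 11 0011 011 11
[Figure: wavelet tree of E(T). Node bit vectors: root 00100111001; second level 100101 and 00111; third level 101, 011 and 01; the remaining deeper nodes contain only 1s (11, 1, 1, 1, 1, 1, 1, 1)]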

20
extract(V_root, i) {
    code ← ε
    v ← V_root
    while v is not a leaf
        if B_v[i] = 0
            i ← rank_0(B_v, i)
            code ← code · 0
            v ← left(v)
        else
            i ← rank_1(B_v, i)
            code ← code · 1
            v ← right(v)
    return D(code)
}
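The extract procedure can be tried out on the COMPRESSORS example with a small pointer-based wavelet tree. This is a sketch under stated assumptions: `Node`, `rank` and `extract` are illustrative names, rank is computed by scanning instead of a constant-time structure, and the rank is taken on the parent's bit vector before descending.

```python
CODE = {"S": "11", "R": "011", "O": "0011", "E": "1011",
        "P": "00011", "M": "10011", "C": "01011"}
DECODE = {c: s for s, c in CODE.items()}
T = "COMPRESSORS"


class Node:
    """Wavelet tree node holding the depth-th bit of every codeword routed to it."""

    def __init__(self, codewords, depth=0):
        # In a prefix code, a fully consumed codeword means every codeword
        # routed here is that same codeword, so the node is a leaf.
        self.is_leaf = len(codewords[0]) == depth
        if self.is_leaf:
            return
        self.bits = "".join(c[depth] for c in codewords)
        left = [c for c in codewords if c[depth] == "0"]
        right = [c for c in codewords if c[depth] == "1"]
        self.left = Node(left, depth + 1) if left else None
        self.right = Node(right, depth + 1) if right else None


def rank(bits, bit, i):
    """Occurrences of bit among the first i positions (1-based, inclusive)."""
    return bits.count(bit, 0, i)


def extract(root, i):
    """Codeword of the i-th symbol of T (1-based)."""
    code, v = "", root
    while not v.is_leaf:
        bit = v.bits[i - 1]
        i = rank(v.bits, bit, i)  # position inside the child's bit vector
        code += bit
        v = v.left if bit == "0" else v.right
    return code


root = Node([CODE[ch] for ch in T])
print(root.bits)
print("".join(DECODE[extract(root, i)] for i in range(1, len(T) + 1)))
```

The dictionary `DECODE` plays the role of D(code) in the pseudocode.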

21
select_x(T, i) {
    w ← leaf corresponding to f(x)
    while w ≠ V_root
        v ← father of w
        if w is a left child of v
            i ← index of the i-th 0 in B_v
        else
            i ← index of the i-th 1 in B_v
        w ← v
    return i
}
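The upward walk of select can be sketched without parent pointers by recursing down the codeword's path and applying the select step while the recursion unwinds, which visits the same nodes in leaf-to-root order. As before, `Node`, `select_bit` and `select` are illustrative names, and select on a bit vector is a naive scan.

```python
CODE = {"S": "11", "R": "011", "O": "0011", "E": "1011",
        "P": "00011", "M": "10011", "C": "01011"}
T = "COMPRESSORS"


class Node:
    """Wavelet tree node holding the depth-th bit of every codeword routed to it."""

    def __init__(self, codewords, depth=0):
        self.is_leaf = len(codewords[0]) == depth
        if self.is_leaf:
            return
        self.bits = "".join(c[depth] for c in codewords)
        left = [c for c in codewords if c[depth] == "0"]
        right = [c for c in codewords if c[depth] == "1"]
        self.left = Node(left, depth + 1) if left else None
        self.right = Node(right, depth + 1) if right else None


def select_bit(bits, bit, i):
    """1-based position of the i-th occurrence of bit (naive scan)."""
    pos = -1
    for _ in range(i):
        pos = bits.index(bit, pos + 1)
    return pos + 1


def select(node, code, i, depth=0):
    """Position in T of the i-th occurrence of the symbol whose codeword is code."""
    if depth == len(code):
        return i  # reached the leaf: i is the rank among equal symbols
    bit = code[depth]
    child = node.left if bit == "0" else node.right
    i = select(child, code, i, depth + 1)  # position inside the child
    return select_bit(node.bits, bit, i)   # map it back into this node


root = Node([CODE[ch] for ch in T])
print(select(root, CODE["S"], 3))  # position of the third S in T
```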

22
- Redundant information for single-child nodes.
  › Similar to the collapsing strategy of suffix trees

23
[Figure: the wavelet tree of E(T) = 01011 0011 10011 00011 011 1011 11 11 0011 011 11 before and after pruning. Before: 00100111001; 100101, 00111; 101, 011, 01; plus deeper nodes containing only 1s. After: 00100111001; 100101, 00111; 101, 011, 01]

24
if suffix of code = 0
    code ← code · 11
if suffix of code ≠ 11
    code ← code · 1
return D(code)
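The completion step restores the bits that pruning removed from the end of a codeword. A minimal sketch, assuming the prefix read from the pruned tree is non-empty; `complete` is an illustrative name, and the sample prefixes are the pruned paths of the COMPRESSORS example.

```python
def complete(code):
    """Rebuild a full Fibonacci codeword from the prefix read in the pruned tree."""
    if code.endswith("0"):
        code += "11"  # pruned tail was the closing pair 11
    if not code.endswith("11"):
        code += "1"   # prefix ends in a single 1: add the terminating 1
    return code


for prefix in ["000", "001", "010", "011", "100", "101", "11"]:
    print(prefix, "->", complete(prefix))
```

The second test is deliberately not an `else`: after the first branch fires, the code already ends in 11 and is left unchanged.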

25
- Recursive definition of a FWT of depth h+1
- Assumption: if the tree is of depth h+1, then all the F_h codewords of length h+1 are in the alphabet.

26
- N_{h+1} = N_h + N_{h-1} + 3
[Figure: T_{h+1} built from T_h and T_{h-1}]

27
- N_{h+1} = N_h + 3F_h
- N_{h+1} = 3F_{h+2} - 3
- P_{h-1} = 2F_{h+2} - 3
- P_{h-1} / N_{h+1} = (2F_{h+2} - 3) / (3F_{h+2} - 3) → 2/3 as h → ∞
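As a consistency check, the closed form agrees with both recurrences from the two slides above. The derivation below only uses F_{h+2} = F_{h+1} + F_h and assumes the closed form already holds for N_h and N_{h-1}; base cases are not given in the transcript and are not verified here. The last line works out the limit of the ratio.

```latex
\begin{align*}
N_h + N_{h-1} + 3 &= (3F_{h+1} - 3) + (3F_h - 3) + 3
                   = 3(F_{h+1} + F_h) - 3 = 3F_{h+2} - 3 = N_{h+1},\\
N_h + 3F_h &= (3F_{h+1} - 3) + 3F_h
            = 3(F_{h+1} + F_h) - 3 = 3F_{h+2} - 3 = N_{h+1},\\
\frac{2F_{h+2} - 3}{3F_{h+2} - 3} &= \frac{2 - 3/F_{h+2}}{3 - 3/F_{h+2}}
            \longrightarrow \frac{2}{3} \quad (h \to \infty).
\end{align*}
```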

28

29
- English (Heaps): distribution of 26 characters and 371 bigrams
- Finnish (Pesonen): 29 letters
- French (Trésor de la Langue Française): 26 letters
- German (Bauer & Goos): 30 letters
- Hebrew and Aramaic (The Responsa Retrieval Project): 30 letters, 735 bigrams
- Italian: 26 letters
- Spanish: 26 letters
- Portuguese: 26 letters

30
File         n    Height  FWT   Pruned  Huffman
English      26   8       4.90  4.43    4.19
Finnish      29   8       4.76  4.44    4.04
French       26   8       4.53  4.14    4.00
German       30   8       4.70  4.37    4.15
Hebrew       30   8       4.82  4.42    4.29
Italian      26   8       4.70  4.32    4.00
Portuguese   26   8       4.67  4.28    4.01
Spanish      26   8       4.71  4.30    4.05
Russian      32   8       5.13  4.76    4.47
English-2    378  14      8.78  8.56    7.44
Hebrew-2     743  15      9.13  8.97    8.04

31

