Advanced Seminar in Data Structures

1 Advanced Seminar in Data Structures
28/12/2004: An Analysis of the Burrows-Wheeler Transform (Giovanni Manzini) Presented by Assaf Oren Data Structures Seminar

2 Data Structures Seminar
Topics: Introduction; Burrows-Wheeler Transform; Move–to–Front; Empirical Entropy; Order0 coder; Analysis of the BW0 algorithm; Run-Length encoding; Analysis of the BW0RL algorithm

3 Data Structures Seminar
Introduction. A BWT-based algorithm: takes the input string s; transforms it to bwt(s), where |bwt(s)| = |s|; compresses bwt(s) with a compressor A. The compressed string is A(bwt(s)).

4 Data Structures Seminar
Introduction (cont'd). Notation. Recoding scheme, tra(): a transformation with no compression. Coding scheme, A(): an algorithm designed to reduce the size of the input.

5 BWT-based Alg’ Properties
Even with a simple compression algorithm, A(bwt(s)) has a good compression ratio. The very simple and clean algorithm from Nelson [1996] outperforms the PkZip package. Other, more advanced BWT compressors are Bzip [Seward 1997] and Szip [Schindler 1997]. BWT-based compressors achieve a very good compression ratio using relatively small resources (Arnold and Bell [2000], Fenwick [1996a]).

6 Data Structures Seminar
nova 25% man bzip2
bzip2(1) bzip2(1)
NAME
bzip2, bunzip2 - a block-sorting file compressor, v1.0.2
bzcat - decompresses files to stdout
bzip2recover - recovers data from damaged bzip2 files
SYNOPSIS
bzip2 [ -cdfkqstvzVL ] [ filenames ... ]
bunzip2 [ -fkvsVL ] [ filenames ... ]
bzcat [ -s ] [ filenames ... ]
bzip2recover filename
DESCRIPTION
bzip2 compresses files using the Burrows-Wheeler block sorting text compression algorithm, and Huffman coding. Compression is generally considerably better than that achieved by more conventional LZ77/LZ78-based compressors, and approaches the performance of the PPM family of statistical compressors.

7 BWT-based Alg’ Properties (cont’)
BWT-based algorithms work very well in practice, but no satisfactory proof has been given for their compression ratio. Previous analyses: Sadakane [1997; 1998] assumes the input string is generated by a finite-order Markov source; Effros [1999] gives bounds on the speed at which the average compression ratio approaches the entropy.

8 The Burrows-Wheeler Transform
Background: part of a research report for DIGITAL, released in 1994; based on a previously unpublished transformation discovered by Wheeler in 1983. Technical: the resulting output block contains exactly the same data elements that it started with; performed on an entire block of data at once; reversible.

9 The Burrows-Wheeler Transform (cont’)
Append # to the end of s; # is unique and smaller than any other character. Form a matrix M whose rows are the cyclic shifts of s#. Sort the rows in right-to-left lexicographic order.

10 The Burrows-Wheeler Transform (cont’)
For s = "mississippi", the output of BWT is the column F = "msspipissii" and the number 3 (the position of # in F).
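The construction on the two slides above can be sketched in a few lines of Python. This is only an illustration that builds the whole matrix M in memory, not a practical implementation. The function name bwt and the use of "\0" as a stand-in for # are assumptions made here; "\0" is chosen because it sorts below every ordinary character.

```python
def bwt(s, sentinel="\0"):
    """BWT variant from the slides: rows sorted right to left, output is column F.

    Returns (F without the sentinel, 1-based position of the sentinel in F).
    Assumes the sentinel does not occur in s.
    """
    t = s + sentinel
    # All cyclic shifts of s#.
    rows = [t[i:] + t[:i] for i in range(len(t))]
    # Right-to-left lexicographic order is ordinary order on the reversed rows.
    rows.sort(key=lambda row: row[::-1])
    first_column = "".join(row[0] for row in rows)
    position = first_column.index(sentinel) + 1
    return first_column.replace(sentinel, ""), position


print(bwt("mississippi"))  # ('msspipissii', 3)
```

Sorting on the reversed rows keeps the code short; a real compressor would avoid materializing the rotations.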

11 The Burrows-Wheeler Transform (cont’)
Observations: (1) Every column of M is a permutation of s#. (2) Each character in L is followed in s# by the corresponding character in F. (3) For any character c, the ith occurrence of c in F corresponds to the ith occurrence of c in L. How to reconstruct s: sort bwt(s) to get column L (column F is bwt(s)). F1 is the first character of s. By observation 3, this 'm' is the same 'm' as L6, and observation 2 tells us that F6 is the next character of s.
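The reconstruction procedure can be sketched as follows. The name inverse_bwt and the "\0" stand-in for # are assumptions of this sketch. It relies on a stable sort of F to match the ith occurrence of each character in F with its ith occurrence in L (observation 3), then follows observation 2 from row to row.

```python
def inverse_bwt(f, position, sentinel="\0"):
    """Rebuild s from column F (sentinel removed) and the sentinel's 1-based position."""
    # Put the end marker back where it was in F.
    full = f[:position - 1] + sentinel + f[position - 1:]
    # L is F sorted. A stable sort keeps equal characters in their F order,
    # so sorted_rows[j] is the row of F holding the same occurrence as L[j].
    sorted_rows = sorted(range(len(full)), key=lambda i: full[i])
    row_in_l = [0] * len(full)
    for l_row, f_row in enumerate(sorted_rows):
        row_in_l[f_row] = l_row
    out = []
    row = 0  # Row 1 of M ends with the marker, so F1 is the first character of s.
    while full[row] != sentinel:
        out.append(full[row])
        row = row_in_l[row]  # Jump to the row whose L entry is this occurrence.
    return "".join(out)


print(inverse_bwt("msspipissii", 3))  # mississippi
```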

12 The Burrows-Wheeler Transform (cont’)
Data Structures Seminar

13 The Burrows-Wheeler Transform (cont’)
Why is this transform so helpful? BWT collects together the symbols following a given context. Formally: for each substring w of s, the characters following w in s are grouped together inside bwt(s). More formally!!!

14 Data Structures Seminar
Move–to–Front (mtf): another recoding scheme, suggested by B&W to be applied after BWT on the string s. s' = mtf(bwt(s)); |mtf(bwt(s))| = |bwt(s)| = |s|. If s is over {a1, a2, … , ah} then s' is over {0, 1, …, h-1}.

15 Move–to–Front (cont’)
For each letter (left to right): write the number of distinct other letters seen since the last time the current letter appeared. Example (initial list a, b, c): input a a b a c a a c c b a; output 0 0 1 1 2 1 0 1 0 2 2.
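A minimal Python sketch of move-to-front, written to match the description above. The function name mtf and the explicit alphabet argument (the initial list order) are assumptions of this sketch; the slide does not say how the list is initialized.

```python
def mtf(s, alphabet):
    """Move-to-front: emit each symbol's current list position, then move it to the front."""
    order = list(alphabet)
    out = []
    for ch in s:
        i = order.index(ch)            # number of distinct symbols seen since ch last appeared
        out.append(i)
        order.insert(0, order.pop(i))  # move ch to the front of the list
    return out


print(mtf("aabacaaccba", "abc"))  # [0, 0, 1, 1, 2, 1, 0, 1, 0, 2, 2]
```

Repeated symbols become zeros and recently seen symbols become small numbers, which is what makes the output easy for an order-0 coder.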

16 Move–to–Front (cont’)
Why is this transform helpful? It transforms the local homogeneity of bwt(s) into global homogeneity. Formally, if we had … After mtf, both strings will probably have the same small numbers.

17 Data Structures Seminar
Huffman coding: assigns binary codewords to letters according to their frequency. For example: A = {a, b, c}. In our string the frequencies are a = 300, b = 150, c = 150. The coding will be a = `0`, b = `10`, c = `11`.
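The codeword lengths in this example can be reproduced with a small Huffman construction. This is a sketch using Python's heapq; the function name huffman_code_lengths is made up here, and it returns only the lengths because the exact bit patterns depend on how ties are broken.

```python
import heapq
from itertools import count


def huffman_code_lengths(freqs):
    """Return {symbol: codeword length} for a Huffman code over the given frequencies."""
    tiebreak = count()  # unique counter so the heap never has to compare dicts
    heap = [(f, next(tiebreak), {sym: 0}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, depths1 = heapq.heappop(heap)
        f2, _, depths2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {sym: d + 1 for sym, d in {**depths1, **depths2}.items()}
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]


lengths = huffman_code_lengths({"a": 300, "b": 150, "c": 150})
print(lengths)  # a has length 1, b and c have length 2
```

With these lengths the 600-symbol string takes 300·1 + 150·2 + 150·2 = 900 bits.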

18 Data Structures Seminar
Arithmetic coding

19 The Empirical Entropy of a string
s = our string; n = |s|; A = our alphabet; h = |A|; ni = number of occurrences of the symbol ai inside s; H0(s) = the zeroth order empirical entropy of s.
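The formula itself did not survive in this transcript. The usual definition of the zeroth order empirical entropy, which is assumed here, is H0(s) = -Σ (ni/n) log2(ni/n), summed over the symbols that occur in s. A short Python sketch (the name h0 is chosen for illustration):

```python
from collections import Counter
from math import log2


def h0(s):
    """Zeroth order empirical entropy of s, in bits per symbol."""
    n = len(s)
    if n == 0:
        return 0.0
    # Counter(s) gives the counts ni; symbols that never occur contribute nothing.
    return -sum((ni / n) * log2(ni / n) for ni in Counter(s).values())


# 300 a's, 150 b's, 150 c's: 0.5*1 + 0.25*2 + 0.25*2 = 1.5 bits per symbol
print(h0("a" * 300 + "b" * 150 + "c" * 150))
```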

20 Intuition for the Empirical Entropy
For each symbol, and for each appearance of this symbol in the text: the number of bits needed to represent it with an ideal uniquely decodable code.

21 The kth order Empirical Entropy
We can achieve greater compression if the codeword depends on the k symbols that precede the coded symbol. For example, for s = "abcabcabd", the symbols following 'ab' in s form the string ab_s = ccd. And formally we can define:

22 Data Structures Seminar
Examples of Hk(s). Example 1: k = 1, s = mississippi: m_s = i, i_s = ssp, s_s = sisi, p_s = pi. Example 2: k = 1, s = cc(ab)^n: a_s = b^n, b_s = a^(n-1), c_s = ca.
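The strings w_s in these examples, and the resulting kth order entropy, can be checked with a short Python sketch. The definition assumed here is the standard one, Hk(s) = (1/|s|) Σ |w_s| H0(w_s) over all length-k contexts w; the formula is missing from this transcript, so treat it as an assumption. The names following, h0 and hk are chosen for illustration.

```python
from collections import Counter
from math import log2


def h0(s):
    """Zeroth order empirical entropy of s, in bits per symbol."""
    n = len(s)
    if n == 0:
        return 0.0
    return -sum((c / n) * log2(c / n) for c in Counter(s).values())


def following(s, w):
    """The string w_s: the characters that follow each occurrence of w in s."""
    k = len(w)
    return "".join(s[i + k] for i in range(len(s) - k) if s[i:i + k] == w)


def hk(s, k):
    """kth order empirical entropy: weighted average of H0 over the strings w_s."""
    if not s:
        return 0.0
    contexts = {s[i:i + k] for i in range(len(s) - k)}
    return sum(len(ws) * h0(ws) for ws in (following(s, w) for w in contexts)) / len(s)


print(following("mississippi", "s"))  # sisi
print(hk("cc" + "ab" * 5, 1))         # only c_s = "ca" contributes: 2 / 12
```

For Example 2, a_s and b_s each contain a single distinct symbol and contribute zero, so H1(s) = 2/|s|, which shrinks as n grows.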

23 The modified Empirical Entropy
Modified in order to avoid cases of

24 Empirical Entropy and BWT
We saw that … We know that … If we had an ideal algorithm A … We get: we reduced the problem of compressing up to kth order entropy to the problem of compressing distinct portions of the input string up to their zeroth order entropy.

25 Data Structures Seminar
An Order0 coder: a coder with a compression ratio that is close to the zeroth order empirical entropy. Formally: For static Huffman coding, μ = 1. For a simple arithmetic coder, μ ≈ 10^-2 (Howard and Vitter [1992a]).

26 Analysis of the BW0 algorithm
BW0(s) = Order0(mtf(bwt(s))). We would like to achieve: For now, let's assume Theorem 4.1 on mtf(s):
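To make the composition concrete, here is a hedged Python sketch of the first two stages of BW0, followed by an idealized size estimate. It does not implement a real Order0 coder: it reports |x|·H0(x) for x = mtf(bwt(s)), and an actual Huffman or arithmetic coder would add some overhead on top. All function names, the "\0" end marker and the sorted initial MTF list are assumptions of this sketch. The position of the end marker would also have to be stored and is ignored here.

```python
from collections import Counter
from math import log2


def bwt(s, sentinel="\0"):
    """Right-to-left sorted BWT; returns (column F without sentinel, 1-based sentinel position)."""
    t = s + sentinel
    rows = sorted((t[i:] + t[:i] for i in range(len(t))), key=lambda r: r[::-1])
    first = "".join(r[0] for r in rows)
    return first.replace(sentinel, ""), first.index(sentinel) + 1


def mtf(s, alphabet):
    """Move-to-front recoding of s, starting from the given list order."""
    order = list(alphabet)
    out = []
    for ch in s:
        i = order.index(ch)
        out.append(i)
        order.insert(0, order.pop(i))
    return out


def h0(seq):
    """Zeroth order empirical entropy of a sequence, in bits per symbol."""
    n = len(seq)
    if n == 0:
        return 0.0
    return -sum((c / n) * log2(c / n) for c in Counter(seq).values())


def bw0_ideal_bits(s):
    """Idealized output size of BW0 in bits: |x| * H0(x) with x = mtf(bwt(s))."""
    transformed, _position = bwt(s)
    codes = mtf(transformed, sorted(set(s)))
    return len(codes) * h0(codes)


print(mtf(bwt("mississippi")[0], "imps"))  # [1, 3, 0, 3, 3, 1, 1, 2, 0, 1, 0]
print(bw0_ideal_bits("mississippi"))
```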

27 Data Structures Seminar
Proof of BW0. We saw that if then for t ≤ h^k For combined with Theorem 4.1: With our knowledge of Order0: We get:

28 Data Structures Seminar
Proof of Theorem 4.1 Lemma 4.3 Lemma 4.4

29 Proof of Theorem 4.1 (cont’)
Lemma 4.5 Lemma 4.6 Lemma 4.7

30 Proof of Theorem 4.1 (cont’)
Lemma 4.8 It is sufficient to prove that:

31 Proof of Theorem 4.1 (cont’)
By applying Lemma 4.3 and 4.5 we get: And:

32 Analysis of BW0RL algorithm
BW0RL(s) = Order0(RLE(mtf(bwt(s)))). RLE(s): let 0 and 1 be two symbols that do not belong to the alphabet. For m ≥ 1, B(m) = m+1 written in binary using 0 and 1, discarding the most significant bit: B(1) = 0, B(2) = 1, B(3) = 00, B(4) = 01, B(5) = 10, … RLE(s) replaces each run 0^m of m zeros in s with B(m). Given s = “ ”, RLE(s) = “ ”. |RLE(s)| ≤ |s|, since |B(m)| ≤ log(m+1) ≤ m.
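The run-length step can be sketched in Python as below. The two new symbols are represented by the strings "0" and "1" so that they stay distinct from the integer MTF codes; that representation, and the names B and rle, are choices made for this sketch only.

```python
from itertools import groupby


def B(m):
    """m + 1 written in binary with the most significant bit discarded (m >= 1)."""
    return bin(m + 1)[3:]  # bin() yields '0b1...'; drop '0b' and the leading 1


def rle(values):
    """Replace every maximal run of m zeros with the symbols of B(m)."""
    out = []
    for value, run in groupby(values):
        run_length = len(list(run))
        if value == 0:
            out.extend(B(run_length))           # adds the characters '0' / '1'
        else:
            out.extend([value] * run_length)    # nonzero codes pass through unchanged
    return out


print([B(m) for m in range(1, 6)])              # ['0', '1', '00', '01', '10']
print(rle([0, 0, 1, 1, 2, 1, 0, 1, 0, 2, 2]))   # ['1', 1, 1, 2, 1, '0', 1, '0', 2, 2]
```

The output is never longer than the input because a run of m zeros is replaced by |B(m)| ≤ m symbols.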

33 Analysis of BW0RL (cont’)
Theorem 5.1 Theorem 5.8

34 Analysis of BW0RL (cont’)
Locally λ-optimal algorithm: for all t > 0, there exists a constant ct such that for any partition s1, s2, … , st of the string s we have: A locally λ-optimal algorithm combined with BWT is bounded by:

35 Data Structures Seminar
A bit of practicality: a nice article by Mark Nelson; includes source code and measurements. Usage:
RLE input-file | BWT | MTF | RLE | ARI > output-file
UNARI input-file | UNRLE | UNMTF | UNBWT | UNRLE > output-file

36 A bit of practicality (cont’)
File Name   Raw Size    PKZIP Size   PKZIP Bits/Byte   BWT Size   BWT Bits/Byte
bib         111,261     35,821       2.58              29,567     2.13
book1       768,771     315,999      3.29              275,831    2.87
book2       610,856     209,061      2.74              186,592    2.44
geo         102,400     68,917       5.38              62,120     4.85
news        377,109     146,010      3.10              134,174    2.85
obj1        21,504      10,311       3.84              10,857     4.04
obj2        246,814     81,846       2.65              81,948     2.66
paper1      53,161      18,624       2.80              17,724     2.67
paper2      82,199      29,795       2.90              26,956     2.62
paper3      46,526      18,106       3.11              16,995     2.92
paper4      13,286      5,509        3.32              5,529      3.33
paper5      11,954      4,962        -                 5,136      3.44
paper6      38,105      13,331       -                 13,159     2.76
pic         513,216     54,188       0.84              50,829     0.79
progc       39,611      13,340       -                 13,312     2.69
progl       71,646      16,227       1.81              16,688     1.86
progp       49,379      11,248       1.82              11,404     1.85
trans       93,695      19,691       1.68              19,301     1.65
total:      3,251,493   1,072,986    2.64              978,122    2.41
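The Bits/Byte figures appear to be 8 × compressed size / raw size, which agrees with the rows that are fully listed. A quick check in Python (the helper name bits_per_byte is made up for this sketch):

```python
def bits_per_byte(compressed_size, raw_size):
    """Average number of output bits spent per input byte, rounded to two decimals."""
    return round(8 * compressed_size / raw_size, 2)


print(bits_per_byte(29_567, 111_261))       # bib, BWT: 2.13
print(bits_per_byte(35_821, 111_261))       # bib, PKZIP: 2.58
print(bits_per_byte(978_122, 3_251_493))    # total, BWT: 2.41
print(bits_per_byte(1_072_986, 3_251_493))  # total, PKZIP: 2.64
```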

37 A bit of practicality (cont’)
The End

