
1 A Simpler Analysis of Burrows-Wheeler Based Compression. Haim Kaplan, Shir Landau, Elad Verbin

2 Our Results
1. Improve the bounds of one of the main BWT-based compression algorithms
2. New technique for worst-case analysis of BWT-based compression algorithms using the Local Entropy
3. Interesting results concerning compression of integer strings

3 The Burrows-Wheeler Transform (1994). Given a string S, the Burrows-Wheeler Transform creates a permutation S' = BWT(S) of S that is locally homogeneous.

4 Empirical Entropy - Intuition. The problem: given a string S, encode each symbol in S using a fixed codeword…

5 Order-0 Entropy (Shannon 1948). H_0(s): the maximum compression we can get using only frequencies and no context information. Example: a Huffman code.
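Not from the slides, but as a concrete companion to this definition: a minimal Python sketch of computing the order-0 empirical entropy from symbol frequencies alone (the function name is my own).

```python
from collections import Counter
from math import log2

def order0_entropy(s: str) -> float:
    """Empirical order-0 entropy H_0(s) in bits per symbol:
    only symbol frequencies are used, no context information."""
    n = len(s)
    return sum(-(c / n) * log2(c / n) for c in Counter(s).values())

# n * H_0(s) lower-bounds the output (in bits) of any encoder that
# assigns one fixed codeword per symbol, e.g. a Huffman code.
print(order0_entropy("mississippi"))  # about 1.82 bits per symbol
```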

6 Order-k Entropy. H_k(s): a lower bound for compression with order-k contexts; the codeword representing each symbol depends on the k symbols preceding it. Example: in MISSISSIPPI, the length-1 contexts of the occurrences of "i" spell "mssp", and those of "s" spell "isis". Traditionally, the compression ratio of compression algorithms is measured using H_k(s).
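Again purely as an illustration (not code from the talk): a sketch of the empirical order-k entropy under the convention stated on the slide, where each symbol's codeword may depend on the k symbols preceding it; the helper names are mine.

```python
from collections import Counter, defaultdict
from math import log2

def h0(s: str) -> float:
    """Order-0 empirical entropy in bits per symbol."""
    n = len(s)
    return sum(-(c / n) * log2(c / n) for c in Counter(s).values())

def order_k_entropy(s: str, k: int) -> float:
    """Empirical order-k entropy H_k(s): group each symbol by the k symbols
    preceding it and average the order-0 entropies of the groups."""
    if k == 0:
        return h0(s)
    groups = defaultdict(list)
    for i in range(k, len(s)):
        groups[s[i - k:i]].append(s[i])
    return sum(len(g) * h0("".join(g)) for g in groups.values()) / len(s)

# In MISSISSIPPI the symbols that follow each length-1 context are highly
# skewed, so H_1 is much smaller than H_0.
print(h0("mississippi"), order_k_entropy("mississippi", 1))
```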

7 History. The main Burrows-Wheeler compression algorithm (Burrows, Wheeler 1994): string S → BWT (Burrows-Wheeler Transform) → MTF (move-to-front) → RLE (run-length encoding) → order-0 encoding → compressed string S'.

8 MTF. Given a string S = baacb over the alphabet {a, b, c, d}, each symbol is replaced by its position in a working list, and that symbol is then moved to the front of the list. Starting from the list (a, b, c, d): b → 1, list becomes (b, a, c, d); a → 1, list (a, b, c, d); a → 0; c → 2, list (c, a, b, d); b → 2, list (b, c, a, d). So MTF(S) = 1, 1, 0, 2, 2.
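A minimal sketch of move-to-front encoding that reproduces the slide's example; the function name and the explicit alphabet argument are my own choices.

```python
def mtf_encode(s: str, alphabet: str) -> list[int]:
    """Move-to-front: output the current position of each symbol in a
    working list, then move that symbol to the front of the list."""
    work = list(alphabet)
    out = []
    for ch in s:
        i = work.index(ch)
        out.append(i)
        work.insert(0, work.pop(i))
    return out

# Slide example: S = "baacb" over the alphabet {a, b, c, d}.
print(mtf_encode("baacb", "abcd"))  # [1, 1, 0, 2, 2]
```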

9 Main Bounds (Manzini 1999): the compressed output is bounded by 8nH_k(s) + 0.08n + g_k, where g_k is a constant dependent on the context k and the size of the alphabet; these are worst-case bounds.

10 Now we are ready to begin…

11 Some Intuition… H_0 "measures" frequency; H_k "measures" frequency and context. → We want a statistic that measures local similarity in a string, and specifically in the BWT of the string.

12 Some Intuition… The more similar the contexts are in the original string, the more local similarity its BWT will exhibit. The more local similarity in the BWT, the smaller the numbers we get from MTF. → The solution: Local Entropy.

13 The Local Entropy - Definition. Given a string s = s_1 s_2 … s_n, the local entropy of s is LE(s) = Σ_{i=1..n} log2(MTF(s)_i + 1) (Bentley, Sleator, Tarjan, Wei, 1986). MTF turns the original string into an integer sequence.

14 The Local Entropy - Definition. Note: LE(s) is the number of bits needed to write the MTF sequence in binary. Example: MTF(s) = 3, 1, 1 → LE(s) = 4 → MTF(s) in binary = 1111. In a dream world we would like to compress S to LE(S)…
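The LE formula itself appears only as an image in this transcript; the sketch below assumes the usual Bentley-Sleator-Tarjan-Wei form LE(s) = Σ log2(m_i + 1) over the MTF values, which matches the 3, 1, 1 → 4 example above.

```python
from math import log2

def local_entropy(mtf_values: list[int]) -> float:
    """LE as the sum of log2(m + 1) over the MTF values: roughly the
    number of bits needed to write each MTF value in binary."""
    return sum(log2(m + 1) for m in mtf_values)

# Slide example: MTF(s) = 3, 1, 1  ->  LE(s) = 2 + 1 + 1 = 4 bits.
print(local_entropy([3, 1, 1]))  # 4.0
```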

15 The Local Entropy - Properties. We use two properties of LE: 1. the entropy hierarchy, 2. convexity.

16 The Local Entropy - Property 1. The entropy hierarchy: we prove that for each k, LE(BWT(s)) ≤ nH_k(s) + O(1). → Any upper bound that we get for BWT with LE holds for H_k(s) as well.

17 The Local Entropy - Property 2. Convexity: → this means that partitioning a string s does not improve the Local Entropy of s.

18 Convexity. Cutting the input string into parts doesn't influence LE much: only a bounded number of positions per part are affected.

19 Convexity - Why do we need it? Ferragina, Giancarlo, Manzini and Sciortino (JACM 2005) use it in their compression booster: string S → BWT (Burrows-Wheeler transform) → BWT(S) → booster (which partitions BWT(S)) → RHC (a variation of Huffman encoding) → compressed string S'.

20 Using LE and its properties we get our bounds. Theorem: for every μ > 1 we obtain our LE bound and our H_k bound, both stated in terms of μ and ζ(μ).

21 Our bounds. We get an improvement over the known bound of Manzini (1999), 8nH_k(s) + 0.08n + g_k.

22 Our Test Results
File name     Manzini's bound 8nH_k(s)+0.08n+g_k  Our H_k bound  Our bound using LE  bzip2
alice29.txt   2328219                             766940         396813              345568
asyoulik.txt  2141646                             683171         367874              316552
cp.html       295714                              105033         69858               61056
fields.c      119210                              43379          25713               24312
grammar.lsp   45134                               16054          10234               10264
lcet10.txt    5867291                             1967240        1021440             861184
plrabn12.txt  8198976                             2464440        1391310             1164360
xargs.1       64673                               22317          13858               14096
*The files are non-binary files from the Canterbury corpus. gzip results are also taken from the corpus. The size is indicated in bytes.

23 How is LE related to compression of integer sequences? We mentioned a "dream world", but what about reality? How close can we come to LE(S)? Problem: compress an integer sequence S close to its sum of logs, SL(S) = Σ log(s_i + 1). Notice that for any s, LE(s) = SL(MTF(s)).

24 Compressing Integer Sequences Universal Encodings of Integers: prefix-free encoding for integers (e.g. Fibonacci encoding). Doing some math, it turns out that order-0 encoding is good. Not only good: It is best!
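The slide names Fibonacci encoding as one example of a universal prefix-free integer code; here is a small sketch of it (my own helper, not code from the talk). Its codeword length grows roughly like the logarithm of the encoded value, which is why the sum-of-logs statistic is the natural benchmark.

```python
def fibonacci_code(n: int) -> str:
    """Fibonacci (Zeckendorf) code of a positive integer: a universal,
    prefix-free code whose codeword length grows proportionally to log n."""
    assert n >= 1
    fibs = [1, 2]                            # F(2), F(3), ...
    while fibs[-1] <= n:
        fibs.append(fibs[-1] + fibs[-2])
    fibs.pop()                               # keep Fibonacci numbers <= n
    bits = [0] * len(fibs)
    rest = n
    for i in range(len(fibs) - 1, -1, -1):   # greedy Zeckendorf decomposition
        if fibs[i] <= rest:
            bits[i] = 1
            rest -= fibs[i]
    return "".join(map(str, bits)) + "1"     # the trailing "11" ends a codeword

for n in (1, 2, 3, 4, 11):
    print(n, fibonacci_code(n))
# 1 -> 11, 2 -> 011, 3 -> 0011, 4 -> 1011, 11 -> 001011
```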

25 The order-0 math. Theorem: for any string s of length n over the integer alphabet {1, 2, …, h} and for any μ > 1, nH_0(s) ≤ μ∙SL(s) + log(ζ(μ))∙n. Strange conclusion: we get an upper bound on the order-0 algorithm with a term dependent on the values of the integers. This is true for all strings, but it is especially interesting for strings with small integers.
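The theorem's formula is an image in the original transcript, so the numerical check below only assumes a statement of the shape nH_0(s) ≤ μ∙SL(s) + log2(ζ(μ))∙n (consistent with the matching lower bound on the next slide); the exact constants and the definition of SL used here are my assumptions, and the helper names are mine.

```python
from collections import Counter
from math import log2, pi

def h0_bits(seq):
    """Total order-0 cost n * H_0(seq) in bits."""
    n = len(seq)
    return sum(-c * log2(c / n) for c in Counter(seq).values())

def sum_of_logs(seq):
    """Assumed form of SL(seq): sum of log2(x + 1) over the integers."""
    return sum(log2(x + 1) for x in seq)

# An integer sequence dominated by small values, as MTF output tends to be.
seq = [1, 1, 2, 1, 3, 1, 1, 2, 5, 1, 1, 2, 1, 1, 4, 1]
mu = 2.0
zeta_mu = pi * pi / 6                  # zeta(2) = pi^2 / 6
lhs = h0_bits(seq)
rhs = mu * sum_of_logs(seq) + len(seq) * log2(zeta_mu)
print(round(lhs, 2), "<=", round(rhs, 2), lhs <= rhs)
```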

26 A lower bound for SL. Theorem: for any algorithm A, any μ > 1, and any C such that C < log(ζ(μ)), there exists a string S of length n for which |A(S)| > μ∙SL(S) + C∙n.

27 Our Results - Summary
New improved bounds for the BW-MTF compression algorithm
A new analysis technique based on the Local Entropy (LE)
New bounds for compression of integer strings

28 Open Issues. We question the effectiveness of these statistics. Is there a better statistic?

29 Anybody want to guess??

30 Creating a Huffman encoding. For each encoding unit (a letter, in this example), associate a frequency (the number of times it occurs). Create a binary tree whose children are the two encoding units with the smallest frequencies; the frequency of the root is the sum of the frequencies of the leaves. Repeat this procedure until all the encoding units are in the binary tree. (A code sketch follows the worked example below.)

31 Example Assume that relative frequencies are: A: 40 B: 20 C: 10 D: 10 R: 20

32 Example, cont.

33 Assign 0 to left branches, 1 to right branches. Each encoding is a path from the root: A = 0, B = 100, C = 1010, D = 1011, R = 11.
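A compact sketch of the procedure just described, using a binary heap; note that Huffman trees are not unique, so the codewords printed here can differ from the ones on the slide while having the same total encoded length (220 bits for these frequencies).

```python
import heapq
from itertools import count

def huffman_codes(freqs: dict[str, int]) -> dict[str, str]:
    """Build a Huffman code by repeatedly merging the two lowest-frequency
    trees; the tie-breaking counter keeps the heap entries comparable."""
    tiebreak = count()
    heap = [(f, next(tiebreak), {sym: ""}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}       # left branch gets 0
        merged.update({s: "1" + c for s, c in right.items()})  # right branch gets 1
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]

# Frequencies from the slide: A:40, B:20, C:10, D:10, R:20.
print(huffman_codes({"A": 40, "B": 20, "C": 10, "D": 10, "R": 20}))
```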

34 The Burrows-Wheeler Transform (1994). Given a string S = banana#, write down all cyclic rotations of S: banana#, anana#b, nana#ba, ana#ban, na#bana, a#banan, #banana. Sort the rows: #banana, a#banan, ana#ban, anana#b, banana#, na#bana, nana#ba. The Burrows-Wheeler Transform is the last column of the sorted rows: BWT(S) = annb#aa.
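A direct transcription of this construction into code (my own sketch, quadratic and only for illustration): sort the cyclic rotations and read off the last column.

```python
def bwt_by_rotations(s: str) -> str:
    """Textbook construction: sort all cyclic rotations of s and take the
    last column. Assumes s ends with a unique sentinel such as '#'."""
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(row[-1] for row in rotations)

print(bwt_by_rotations("banana#"))  # annb#aa
```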

35 Suffix Arrays and the BWT. The sorted rotations of banana# correspond to its sorted suffixes, so the suffix array 7 6 4 2 1 5 3 already gives the order of the rows; the BWT is read off at the positions 6 5 3 1 7 4 2, i.e. one position before each suffix start (wrapping around for the first suffix). So all we need to get the BWT is the suffix array!
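The same point in code form (again my own sketch, with a naive quadratic suffix sort standing in for a real linear-time suffix array construction): because the string ends with a unique smallest sentinel '#', sorting suffixes orders the rotations, and the BWT is read off just before each suffix start.

```python
def bwt_from_suffix_array(s: str) -> str:
    """With a unique smallest sentinel at the end of s, sorting suffixes
    gives the same order as sorting rotations, so BWT[i] = s[SA[i] - 1]
    (index -1 wraps to the last character when SA[i] = 0)."""
    sa = sorted(range(len(s)), key=lambda i: s[i:])   # toy suffix array
    return "".join(s[i - 1] for i in sa)

print(bwt_from_suffix_array("banana#"))  # annb#aa
```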

