1
Optimal Partitions of Strings: A New Class of Burrows-Wheeler Compression Algorithms
Raffaele Giancarlo (raffaele@math.unipa.it) and Marinella Sciortino (mari@math.unipa.it)
University of Palermo, Italy
2
Outline of the Talk
- The Burrows-Wheeler Transform [BW94] (example: abraca → (bacraa, 1))
- BWT compression algorithms (pipeline: Input → BWT → MTF → Huffman/Arithmetic coding → Output)
- A New Class of Algorithms
- Combinatorial Dependency [BCCFM00, BFG02]
- Lower Bound on Compression Performance: a conjecture by Manzini [M01]
- Universal Encoding of Integers [L68, E75]
3
Burrows-Wheeler Transform
TRANSFORM: abraca → (bacraa, 1)
ANTI-TRANSFORM: (bacraa, 1) → abraca
4
The Transform
INPUT: w = abraca
Right-to-left lexicographic sorting of the cyclic rotations of w:

    F           L
0   b r a c a a
1   a b r a c a   <- I
2   c a a b r a
3   r a c a a b
4   a a b r a c
5   a c a a b r

OUTPUT: BWT(w) = bacraa (the first column F, read top to bottom) and the index I = 1 of the row holding w.
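A minimal sketch (not the authors' implementation) of the transform as described on this slide: the cyclic rotations are sorted by their reversed reading, which gives the right-to-left lexicographic order, and the first column F is output together with the index of the row holding w.

```python
def bwt_rtl(w):
    n = len(w)
    rotations = [w[i:] + w[:i] for i in range(n)]                # all cyclic rotations of w
    order = sorted(range(n), key=lambda i: rotations[i][::-1])   # right-to-left lexicographic order
    F = "".join(rotations[i][0] for i in order)                  # first column of the sorted matrix
    I = order.index(0)                                           # row containing w itself
    return F, I

print(bwt_rtl("abraca"))  # expected: ('bacraa', 1)
```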
5
The Transform: Essential Properties
(In the sorted matrix above, F is the first column, L is the last column, and I is the row holding w.)
1. The character L[i] is followed in w by F[i].
2. For each character x, the i-th occurrence of x in L corresponds to the i-th occurrence of x in F.
6
The Anti-Transform
Given F = BWT(w) = bacraa and I = 1:
1. Construct L by lexicographically sorting the elements of F:
   F: b a c r a a
   L: a a a b c r
2. Match the i-th occurrence of each character in F with the i-th occurrence of that character in L.
3. Start from row I and repeatedly follow this correspondence, reading one character of w off F at each step: w = abraca.
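A matching sketch of the anti-transform, under the same conventions as the transform sketch above; the array `nxt`, which records the occurrence correspondence between F and the sorted column, is a helper introduced here, not notation from the slides.

```python
from collections import defaultdict

def ibwt_rtl(F, I):
    n = len(F)
    L = sorted(F)                               # lexicographically sorted column
    positions_in_L = defaultdict(list)
    for j, c in enumerate(L):
        positions_in_L[c].append(j)
    seen = defaultdict(int)
    nxt = [0] * n
    for i, c in enumerate(F):                   # i-th occurrence of c in F ...
        nxt[i] = positions_in_L[c][seen[c]]     # ... matches the i-th occurrence of c in L
        seen[c] += 1
    out, i = [], I
    for _ in range(n):
        out.append(F[i])                        # F[I] = w[0]; then L[nxt[i]] is followed in w by F[nxt[i]]
        i = nxt[i]
    return "".join(out)

print(ibwt_rtl("bacraa", 1))  # expected: 'abraca'
```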
7
Why Useful: Intuition
Consider the effect on a single letter in a common word in a block of English text:
w = … The…the… The… the…those…the…the…that…the…
The characters following "th" are grouped together inside BWT(w): the rows of the sorted matrix ending in "th" are adjacent, so F contains a run such as e, a, e, e, e, e, o, e.
Extensive experimental work confirms this "clustering effect" [BW94, F96].
8
Why Useful: "Clustering" of Symbols and MTF
Move-To-Front coding (MTF) [BeSTaWe86]: encodes an occurrence of a character x by an integer that counts the number of distinct symbols seen since the latest occurrence of x.
EXAMPLE: abaaaabbbbbcccccaaaaa → 011000100002000020000
BWT + MTF = many runs of zeroes, which is good for order-0 encoders.
There is a relation between the compressibility of files and a high percentage of zeroes [F96].
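A minimal sketch of MTF coding as just described, assuming the list is initialized with the alphabet in sorted order (the convention that reproduces the example above):

```python
def mtf_encode(s, alphabet):
    table = list(alphabet)              # self-organizing list, initially the sorted alphabet
    out = []
    for c in s:
        i = table.index(c)              # number of distinct symbols currently in front of c
        out.append(i)
        table.pop(i)                    # move c to the front
        table.insert(0, c)
    return out

print("".join(map(str, mtf_encode("abaaaabbbbbcccccaaaaa", "abc"))))
# expected: 011000100002000020000
```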
9
Two Main Research Questions
1. Is MTF an essential step for the successful use of BWT [F96]? Experiments: [AM97, BK98, WM01]. Theory: ?
2. Analysis of the compression performance of BWT-based algorithms. Experiments (see DCC); Information Theory [Ef99, Sa98]; worst-case setting via the empirical entropy of strings [M01], with no assumptions.
10
Zero-th Order Empirical Entropy
Let s be a string of length n over the alphabet Σ = {a_1, a_2, …, a_h}, and let n_i be the number of occurrences of a_i in s. Assume that n_i ≥ n_{i+1}.
The zero-th order empirical entropy of s:
H_0(s) = - Σ_{i=1}^{h} (n_i / n) log(n_i / n).
The zero-th order modified empirical entropy [M01]:
H_0^*(s) = 0 if n = 0; (1 + ⌊log n⌋) / n if n > 0 and H_0(s) = 0; H_0(s) otherwise.
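A small sketch of both quantities, following the definitions above (logs are base 2; the function names H0 and H0_star are ours):

```python
import math

def H0(s):
    n = len(s)
    counts = {c: s.count(c) for c in set(s)}
    return -sum((ni / n) * math.log2(ni / n) for ni in counts.values())

def H0_star(s):
    if len(s) == 0:
        return 0.0
    h0 = H0(s)
    if h0 == 0:                                           # s is a single repeated symbol
        return (1 + math.floor(math.log2(len(s)))) / len(s)
    return h0

print(H0("abracadabra"), H0_star("aaaa"))
```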
11
k-th Order Empirical Entropy
Let Σ^k be the set of all strings of length k over Σ, and Σ^{≤k} the set of all strings of length at most k.
Fix an integer k ≥ 0. For any string y in Σ^k, let y_s be the string consisting of the characters following the occurrences of y in s.
The k-th order empirical entropy of s:
H_k(s) = (1/|s|) Σ_{y ∈ Σ^k} |y_s| H_0(y_s).
The k-th order modified empirical entropy:
H_k^*(s) = (1/|s|) Σ_{y ∈ T_k} |y_s| H_0^*(y_s),
where T_k is a set of strings in Σ^{≤k} such that every string of Σ^k has a unique suffix in T_k and, among the sets with this property, T_k is the one minimizing the right-hand side.
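A small sketch of H_k as defined above: group the characters of s by the length-k context that precedes them, obtaining the strings y_s, and combine their zero-th order entropies (H0 is repeated so the snippet runs on its own):

```python
import math
from collections import defaultdict

def H0(s):
    n = len(s)
    return -sum((s.count(c) / n) * math.log2(s.count(c) / n) for c in set(s))

def Hk(s, k):
    followers = defaultdict(list)
    for i in range(k, len(s)):
        followers[s[i - k:i]].append(s[i])   # character of s following context y = s[i-k..i-1]
    return sum(len(ys) * H0("".join(ys)) for ys in
               ("".join(v) for v in followers.values())) / len(s)

print(Hk("mississippi", 1))
```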
12
Results by Manzini
Let BW0 be a BWT-based algorithm with an Arithmetic coder as zero-th order compressor. Then, for every k ≥ 0,
|BW0(s)| ≤ 8 |s| H_k(s) + (2/25 + μ) |s|.
Let BW0_RL be a BWT-based algorithm using run-length encoding with an Arithmetic coder as zero-th order compressor. Then, for every k ≥ 0 there exists g_k' ≥ 0 such that
|BW0_RL(s)| ≤ (5 + 3μ) |s| H_k^*(s) + g_k',
where μ = 10^{-2}.
13
Insights by Manzini
THEOREM (Manzini): Let s be a string. For each k ≥ 0, there exist an f ≤ h^k and a partition s'_1, s'_2, …, s'_f of BWT(s) such that
Σ_{i=1}^{f} |s'_i| H_0(s'_i) ≤ |s| H_k(s).
An analogous result holds for H_k^*(s).
REMARK: If there existed an ideal compressor A such that, for any partition s_1, s_2, …, s_p of a string s,
|A(s)| ≤ Σ_{i=1}^{p} |s_i| H_0(s_i),
then |A(BWT(s))| ≤ |s| H_k(s). Analogously for H_k^*(s).
We show that A does not exist. Fortunately, we can approximate it.
14
Open Problems by Manzini
Conjectures by Manzini:
1. No BWT-based compression method can reach a bound of the form |s| H_k^*(s) + g_k, for every k ≥ 0 and constants g_k ≥ 0.
2. The ideal algorithm A does not exist.
We prove that both conjectures are true.
15
Our Contributions
- We provide a new class of BWT-based algorithms, based on partitions of strings, that do not use MTF as part of the compression process.
- We analyze two of these new methods in the worst-case setting and obtain better theoretical bounds than Manzini's.
- Under a natural hypothesis on the inner workings of the algorithm, no BWT-based compressor of this type can achieve |w| H_k^*(w) + g_k.
16
Algorithms That Use Optimal Partitions of Strings (rather than MTF)
1. Compute BWT(s);
2. Optimally partition the transformed string;
3. Compress each piece separately.
17
Combinatorial Dependency
Techniques by Buchsbaum et al. [BCCFM00, BFG02] for table compression; surprisingly, they specialize to strings.
Fix a data compressor C that adds a special end-of-string symbol # before compressing a string.
DEFINITION: Two strings x and y are combinatorially dependent with respect to the data compressor C if |C(xy#)| < |C(x#)| + |C(y#)|.
OPTIMAL PARTITION IN TERMS OF THE BASE COMPRESSOR C: computed by dynamic programming, as sketched below.
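A minimal sketch (not the authors' code) of that dynamic program: opt[j] is the cheapest cost of partitioning the prefix s[:j] into pieces that are compressed separately by the base compressor. The cost function below uses zlib purely as a stand-in for |C(x#)|; the paper's compressors HC and RHC, or any other C, can be plugged in. The quadratic number of subproblems is what drives the Ω(n²) running time mentioned on the next slide.

```python
import zlib

def cost(piece):
    # Stand-in for |C(piece#)|: bytes used by zlib, for illustration only.
    return len(zlib.compress((piece + "#").encode()))

def optimal_partition(s):
    n = len(s)
    opt = [0] + [float("inf")] * n          # opt[j]: best cost for s[:j]
    cut = [0] * (n + 1)                     # cut[j]: start of the last piece in an optimal partition of s[:j]
    for j in range(1, n + 1):
        for i in range(j):
            c = opt[i] + cost(s[i:j])
            if c < opt[j]:
                opt[j], cut[j] = c, i
    pieces, j = [], n                       # recover the pieces from the cut points
    while j > 0:
        pieces.append(s[cut[j]:j])
        j = cut[j]
    return opt[n], pieces[::-1]

print(optimal_partition("aaaaabbbbbabababab"))
```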
18
The New Class BWT_OPT
Given the input string s:
1. Compute BWT(s);
2. Optimally partition BWT(s) using C as the base compressor;
3. Compress each piece of the partition separately.
TIME COMPLEXITY of BWT_OPT: it depends critically on that of C and is Ω(n²). Fortunately, if C has a linear-time decompression algorithm, then BWT_OPT also admits a linear-time decompression algorithm.
ASSUMPTIONS: C is a data compressor that, given an input string x, adds a special end-of-string symbol # and compresses x#; either # is actually appended at the end of the string, or the length of x is explicitly stored as a prefix of the compressed string (via a universal encoding of integers).
19
A Prefix-Code Compressor HC
# is an end-of-string marker. The base compressor C is a modification of Huffman encoding so that # can be encoded basically for free.
THEOREM: Consider a string s and let p_1, p_2, …, p_h be the empirical probability distribution of its symbols. Then:
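The precise modification of Huffman coding behind HC is not spelled out on this slide; the sketch below is ordinary Huffman coding over the symbols of s plus '#', giving the marker a minimal weight, only to illustrate the idea that # costs next to nothing. It is an assumed illustration, not the paper's construction.

```python
import heapq
from collections import Counter

def huffman_code(freqs):
    # Heap items: (weight, tie_breaker, {symbol: code_so_far}); the tie_breaker
    # keeps tuple comparison away from the dicts.
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (w1 + w2, counter, merged))
        counter += 1
    return heap[0][2]

def hc_encode(s):
    freqs = dict(Counter(s))
    freqs["#"] = 1                      # the marker gets a (nearly) negligible weight
    code = huffman_code(freqs)
    return "".join(code[c] for c in s + "#")

print(hc_encode("abracadabra"))
```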
20
A Compressor RHC Based on Prefix and Run-Length Encoding
RHC combines Huffman encoding with run-length encoding (RLE) and uses knowledge of the symbol frequencies in the string. For low-entropy strings it is essential to use RLE.
THEOREM: Consider a string s and let p_1, p_2, …, p_h be the empirical probability distribution of its symbols. Then:
The RLE scheme we use depends critically on a variable-length encoding of a sequence of integers. The solution we propose works well in conjunction with combinatorial dependency, where the strings to be compressed may consist of only a few symbols.
PROBLEM: Given two positive integers t and w, with t < w, and an increasing sequence of integers d_1, d_2, …, d_t in [1, w], find an algorithm that produces a binary encoding of d_1, d_2, …, d_t and w.
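The slides leave the encoding scheme for this problem unspecified; the sketch below is one textbook possibility, not necessarily the scheme adopted in the paper: encode w, t and then the gaps d_1, d_2 - d_1, …, d_t - d_{t-1} with Elias gamma codes, so that small gaps cost only a few bits.

```python
def elias_gamma(n):
    assert n >= 1
    b = bin(n)[2:]                       # binary representation of n
    return "0" * (len(b) - 1) + b        # unary length prefix, then the bits

def encode_increasing(d, w):
    out = [elias_gamma(w), elias_gamma(len(d))]
    prev = 0
    for x in d:
        out.append(elias_gamma(x - prev))   # gaps are >= 1 since d is strictly increasing
        prev = x
    return "".join(out)

print(encode_increasing([2, 3, 7, 11], 16))
```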
21
Lower Bound
ASSUMPTIONS: Given a compressor C, we assume that {C(a^n) | n > 0} is a codeword set for the integers. For technical reasons we also assume that |C(a^n)| is a non-decreasing function of n.
The lower bound comes from a theorem in [Levenshtein, 1968], which we restate in our notation:
THEOREM: There exist countably many strings s such that |C(s)| ≥ |s| H_k^*(s) + f(|s|), where f(n) is a diverging function of n.
COROLLARY: No compression algorithm satisfying the assumptions above can achieve the bound formulated in Manzini's conjecture, i.e. |s| H_k^*(s) + g_k for every k ≥ 0 and constant g_k ≥ 0.
This result holds independently of whether or not BWT is applied as a preprocessing step.