1
BWT-Based Compression Algorithms Haim Kaplan and Elad Verbin Tel-Aviv University Presented at CPM '07, July 8, 2007
2
Results ● Cannot show a constant c < 2 s.t. |BW0(s)| ≤ c·nH_k(s) + lower-order terms ● Similarly, no c < 1.26 for BW RL, no c < 1.3 for BW DC ● Proofs use a probabilistic technique
3
Outline ● Part I: Definitions ● Part II: Results ● Part III: Proofs ● Part IV: Experimental Results
4
Part I: Definitions
5
BW0, the main Burrows-Wheeler compression algorithm:
String S → BWT (Burrows-Wheeler Transform) → MTF (Move-to-Front) → Order-0 Encoding → Compressed String S'
Text in English (similar contexts → similar characters) → text with local uniformity → integer string with many small numbers
6
The BWT ● Invented by Burrows and Wheeler ('94) ● Analogous to the Fourier Transform (smooth!): string with context-regularity → BWT → string with spikes (close repetitions), e.g. mississippi → ipssmpissii [Fenwick]
7
The BWT: T = mississippi#. Form all rotations of T, then sort the rows. F is the first column; L = BWT(T) is the last column.

Rotations:          Sorted rows (F … L):
mississippi#        # mississipp i
ississippi#m        i #mississip p
ssissippi#mi        i ppi#missis s
sissippi#mis        i ssippi#mis s
issippi#miss        i ssissippi# m
ssippi#missi        m ississippi #
sippi#missis        p i#mississi p
ippi#mississ        p pi#mississ i
ppi#mississi        s ippi#missi s
pi#mississip        s issippi#mi s
i#mississipp        s sippi#miss i
#mississippi        s sissippi#m i

BWT sorts the characters by their post-context.
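The construction above can be sketched in a few lines. This is a naive O(n² log n) version (real implementations build the BWT from a suffix array), assuming the text ends with a unique sentinel '#' that sorts before all letters:

```python
# Naive BWT: sort all rotations of the text, take the last column.
# Assumes a unique sentinel '#' terminates the text and sorts
# before every letter (true for ASCII lowercase).
def bwt(t: str) -> str:
    n = len(t)
    rotations = sorted(t[i:] + t[:i] for i in range(n))
    return ''.join(row[-1] for row in rotations)

print(bwt("mississippi#"))  # ipssm#pissii
```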
8
BWT Facts: 1. It permutes the text. 2. It is a (≤ n+1)-to-1 function.
9
Move To Front ● By Bentley, Sleator, Tarjan and Wei ('86) ● string with spikes (close repetitions) → move-to-front → integer string with small numbers: ipssmpissii → 0,0,0,0,0,2,4,3,0,1,0
10
Move to Front: trace on "abracadabra" with initial list a,b,r,c,d

Input char   List after step   Output so far
(start)      a,b,r,c,d
a            a,b,r,c,d         0
b            b,a,r,c,d         0,1
r            r,b,a,c,d         0,1,2
a            a,r,b,c,d         0,1,2,2
c            c,a,r,b,d         0,1,2,2,3
a            a,c,r,b,d         0,1,2,2,3,1
d            d,a,c,r,b         0,1,2,2,3,1,4
a            a,d,c,r,b         0,1,2,2,3,1,4,1
b            b,a,d,c,r         0,1,2,2,3,1,4,1,4
r            r,b,a,d,c         0,1,2,2,3,1,4,1,4,4
a            a,r,b,d,c         0,1,2,2,3,1,4,1,4,4,2
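The step-by-step example above can be reproduced with a short sketch:

```python
# Move-to-front: emit each character's current index in the table,
# then move that character to the front of the table.
def mtf_encode(s: str, alphabet: str) -> list:
    table = list(alphabet)
    out = []
    for c in s:
        i = table.index(c)
        out.append(i)
        table.insert(0, table.pop(i))
    return out

print(mtf_encode("abracadabra", "abrcd"))
# [0, 1, 2, 2, 3, 1, 4, 1, 4, 4, 2]
```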
18
After MTF ● Now we have a string of small numbers: lots of 0s, many 1s, … ● Skewed frequencies: run arithmetic coding! (Chart: character frequencies.)
19
BW0, the main Burrows-Wheeler compression algorithm:
String S → BWT (Burrows-Wheeler Transform) → MTF (Move-to-Front) → Order-0 Encoding → Compressed String S'
Text in English (similar contexts → similar characters) → text with local uniformity → integer string with many small numbers
20
BW RL (e.g. bzip):
String S → BWT (Burrows-Wheeler Transform) → MTF (Move-to-Front) → RLE (Run-Length Encoding) → Order-0 Encoding → Compressed String S'
21
Many more BWT-based algorithms ● BW DC: encodes using distance coding instead of MTF ● BW with inversion-frequencies coding ● Booster-based [Ferragina-Giancarlo-Manzini-Sciortino] ● Block-based compressor of Effros et al.
22
order-0 entropy = Lower bound for compression without context information. S = "ACABBA": 1/2 'A's, each represented by 1 bit; 1/3 'B's, each by log(3) bits; 1/6 'C's, each by log(6) bits. 6·H_0(S) = 3·1 + 2·log(3) + 1·log(6)
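The slide's arithmetic can be checked with a small sketch of the empirical order-0 entropy:

```python
from collections import Counter
from math import log2

# Empirical order-0 entropy: H0(s) = sum over characters c of
# (n_c/n) * log2(n/n_c), where n_c counts occurrences of c.
def h0(s: str) -> float:
    n = len(s)
    return sum((nc / n) * log2(n / nc) for nc in Counter(s).values())

s = "ACABBA"
print(len(s) * h0(s))  # 3*1 + 2*log2(3) + 1*log2(6), about 8.755 bits
```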
23
order-k entropy = Lower bound for compression with order-k contexts
24
order-k entropy. mississippi (k = 1; each character's context is the character preceding it): Context for i: "mssp". Context for s: "isis". Context for p: "ip".
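The contexts listed above can be collected mechanically; a sketch for order k, taking each character's context to be the k characters immediately before it (as in the slide's example):

```python
from collections import defaultdict

# For each character, collect the string formed by its order-k
# contexts (the k characters preceding each occurrence).
def contexts(s: str, k: int = 1) -> dict:
    by_char = defaultdict(str)
    for i in range(k, len(s)):
        by_char[s[i]] += s[i - k:i]
    return dict(by_char)

print(contexts("mississippi"))
# {'i': 'mssp', 's': 'isis', 'p': 'ip'}
```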
25
Part II: Results
26
Measuring against H_k ● When performing worst-case analysis of lossless text compressors, we usually measure against H_k ● The goal is a bound of the form |A(s)| ≤ c·nH_k(s) + lower-order terms ● Optimal: |A(s)| ≤ nH_k(s) + lower-order terms
27
Bounds (lower / upper):
BW0: 2 [KaplanVerbin07] / 3.33 [ManziniGagie07]
BW DC: 1.3 [KaplanVerbin07] / 1.7 [KaplanLandauVerbin06]
BW RL: 1.26 [KaplanVerbin07] / 5 [Manzini99]
gzip: 1 / 1
PPM: 1 / 1
30
Lower bounds: BW0 2 [KaplanVerbin07]; BW DC 1.3 [KaplanVerbin07]; BW RL 1.26 [KaplanVerbin07]; gzip 1; PPM 1. Surprising! BWT-based compressors work better than gzip in practice!
31
Possible Explanations 1. Asymptotics: the bounds are asymptotic, and real compressors cut the input into blocks, so the lower-order terms matter. 2. English text is not Markovian! Analyzing on a different model might show BWT's superiority.
32
Part III: Proofs
33
Lower bound ● Wish to analyze BW0 = BWT + MTF + Order0 ● Need to show a string s s.t. |BW0(s)| ≈ 2·nH_0(s) ● Consider string s: 10^3 'a's, 10^6 'b's. Entropy of s: nH_0(s) ≈ 10^3·log(10^3) bits ● BWT(s): same frequencies. MTF(BWT(s)) has ≈ 2·10^3 '1's, 10^6 − 10^3 '0's. Compressed size: about 2·10^3·log(10^3) ≈ 2·nH_0(s); need BWT(s) to have many isolated 'a's
34
Many isolated 'a's ● Goal: find s such that in BWT(s), most 'a's are isolated ● Solution: probabilistic. BWT is a (≤ n+1)-to-1 function. ● A random string s' has ≥ 1/(n+1) chance of being a BWT-image ● A random string has ≥ 1 − 1/n^2 chance of having "many" isolated 'a's ● Since 1/(n+1) > 1/n^2, some string is both a BWT-image and has many isolated 'a's. Therefore, such a string exists.
35
General Calculation ● s contains pn 'a's, (1−p)n 'b's. Entropy of s: nH(p), where H(p) = −p·log p − (1−p)·log(1−p) ● MTF(BWT(s)) contains ≈ 2p(1−p)n '1's, rest '0's; compressed size of MTF(BWT(s)): ≈ nH(2p(1−p)) ● Ratio: H(2p(1−p)) / H(p), which tends to 2 as p → 0
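The ratio in the calculation above can be evaluated numerically; a sketch, with H the binary entropy function (a reconstruction of the slide's dropped formulas):

```python
from math import log2

def H(p: float) -> float:
    """Binary entropy in bits."""
    return -p * log2(p) - (1 - p) * log2(1 - p)

# Per the slide: nH0(s) = nH(p), while the order-0 cost of
# MTF(BWT(s)) is about nH(2p(1-p)). The ratio tends to 2 as
# p -> 0, though convergence is slow (logarithmic factors).
def ratio(p: float) -> float:
    return H(2 * p * (1 - p)) / H(p)

for p in (0.1, 0.01, 0.001, 1e-6):
    print(p, ratio(p))
```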
36
Lower bounds on BW DC, BW RL ● Similar technique. Infinitesimally small p gives a compressible string, so instead maximize the ratio over p. ● Gives weird-looking constants (1.26, 1.3), but quite strong bounds
37
Experimental Results. Sanity check: picking texts from the above Markov models really does show this behavior in practice. Picking text from "realistic" Markov sources ("realistic" = generated from actual texts) also shows non-optimal behavior. On long Markov text, gzip works better than BWT.
38
Bottom Line ● BWT compressors are not optimal (vs. order-k entropy) ● We believe they do well because English text is not Markovian ● Find a theoretical justification! ● Also: improve the constants, find BWT algorithms with better ratios, …
39
Thank You!
40
Additional Slides (taken out for lack of time)
41
BWT - Invertibility ● Go forward, one character at a time
42
Main Property: LF mapping ● The i-th occurrence of c in L corresponds to the i-th occurrence of c in F. ● This happens because the occurrences of c in L are sorted by their post-context, and the occurrences of c in F are sorted by their post-context. (Table: the sorted rotations of mississippi#, with first column F and last column L; T is unknown during inversion.)
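The LF mapping gives the inversion directly; a minimal quadratic-time sketch (assuming the '#' sentinel convention from the earlier slides):

```python
# Invert the BWT via the LF mapping: the i-th occurrence of c in L
# corresponds to the i-th occurrence of c in F = sorted(L).
def inverse_bwt(L: str) -> str:
    F = sorted(L)
    seen = {}          # occurrences of each character so far in L
    lf = []            # lf[i] = row in F matching L[i]
    for c in L:
        r = seen.get(c, 0)
        lf.append(F.index(c) + r)   # F.index: first occurrence of c
        seen[c] = r + 1
    # Row 0 is the rotation starting with '#'; L[i] precedes F[i]
    # in the text, so following lf walks through T backwards.
    out, i = [], 0
    for _ in range(len(L)):
        out.append(L[i])
        i = lf[i]
    t = ''.join(reversed(out))      # T rotated so '#' comes first
    return t[1:] + t[0]

print(inverse_bwt("ipssm#pissii"))  # mississippi#
```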
43
BW0 vs. Lempel-Ziv ● BW0 dynamically takes advantage of context-regularity ● A robust, smooth alternative to Lempel-Ziv
44
BW0 vs. Statistical Coding ● Statistical coding (e.g. PPM): builds a model for each context; prediction → compression
PPM: explicit partitioning (produces a model for each context); optimally models each context
BW0: no explicit partitioning into contexts; exploits similarities between similar contexts
45
Compressed Text Indexing ● An application of BWT ● A compressed representation of text that supports: fast pattern matching (without decompression!) and partial decompression ● So there is no need to ever decompress! Space usage: |BW0(s)| + o(n) ● See more in [Ferragina-Manzini]
46
Musings ● On one hand: BWT-based algorithms are not optimal, while Lempel-Ziv is. ● On the other hand: BWT compresses much better in practice. ● Reasons: 1. The results are asymptotic (EE reason). 2. English text was not generated by a Markov source (the real reason?) ● Goal: get a more honest way to analyze ● Use a statistic different from H_k?