
A Simpler Analysis of Burrows-Wheeler Based Compression Haim Kaplan Shir Landau Elad Verbin

Our Results
1. Improved bounds for one of the main BWT-based compression algorithms
2. A new technique for worst-case analysis of BWT-based compression algorithms, using the Local Entropy
3. Interesting results concerning compression of integer strings

The Burrows-Wheeler Transform (1994)
Given a string S, the Burrows-Wheeler Transform creates a permutation S' = BWT(S) of S that is locally homogeneous.

Empirical Entropy – Intuition
The problem: given a string S, encode each symbol in S using a fixed codeword…

Order-0 Entropy (Shannon '48)
H_0(s): the maximum compression we can get using only symbol frequencies and no context information.
Example: Huffman coding.
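As a quick illustration, here is a minimal Python sketch of the empirical order-0 entropy (the function name is ours, not from the talk):

```python
from collections import Counter
from math import log2

def order0_entropy(s: str) -> float:
    """Empirical order-0 entropy H_0(s) in bits per symbol."""
    n = len(s)
    counts = Counter(s)
    # H_0(s) = sum over symbols c of (n_c / n) * log2(n / n_c)
    return sum((c / n) * log2(n / c) for c in counts.values())

# A string with skewed frequencies compresses below log2(alphabet size):
print(order0_entropy("mississippi"))  # ~1.82 bits/symbol, vs log2(4) = 2
```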

Order-k Entropy
H_k(s): a lower bound for compression with order-k contexts, where the codeword representing each symbol depends on the k symbols preceding it.
Example: in MISSISSIPPI, the order-1 context of i (the symbols preceding its occurrences) is "mssp", and the order-1 context of s is "isis".
Traditionally, the compression ratio of compression algorithms is measured using H_k(s).
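Similarly, a sketch of the empirical H_k, grouping each symbol by the k symbols that precede it (the function name and the treatment of the first k symbols are our choices; conventions vary):

```python
from collections import Counter, defaultdict
from math import log2

def order_k_entropy(s: str, k: int) -> float:
    """H_k(s): group the symbols of s by the k symbols preceding them,
    then take the length-weighted average of each group's order-0 entropy."""
    groups = defaultdict(list)
    for i in range(k, len(s)):
        groups[s[i - k:i]].append(s[i])
    total_bits = 0.0
    for g in groups.values():
        m, counts = len(g), Counter(g)
        total_bits += sum(c * log2(m / c) for c in counts.values())
    return total_bits / len(s)

print(order_k_entropy("mississippi", 1))  # ~0.80 bits/symbol, well below H_0
```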

History
The main Burrows-Wheeler compression algorithm (Burrows, Wheeler 1994):
String S → BWT (Burrows-Wheeler Transform) → MTF (move-to-front) → RLE (run-length encoding; optional) → Order-0 encoding → Compressed string S'
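To make the pipeline concrete, here is a toy end-to-end sketch (skipping the optional RLE step); the function name is ours, and an ideal order-0 size estimate stands in for a real entropy coder:

```python
from collections import Counter
from math import ceil, log2

def bw0_size_estimate(s: str) -> int:
    """Toy sketch of the pipeline: BWT, then MTF, then an ideal order-0
    coder whose output size is estimated as n*H_0 bits of the MTF ranks."""
    # 1. BWT: last column of the sorted rotations (s must end with a
    #    unique sentinel such as '#').
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    last_col = "".join(row[-1] for row in rotations)
    # 2. MTF: 1-based position of each symbol in a move-to-front list.
    table = sorted(set(s))
    ranks = []
    for ch in last_col:
        i = table.index(ch)
        ranks.append(i + 1)
        table.insert(0, table.pop(i))
    # 3. Order-0 encoding of the ranks, estimated as n*H_0 bits.
    n, counts = len(ranks), Counter(ranks)
    return ceil(sum(c * log2(n / c) for c in counts.values()))

print(bw0_size_estimate("banana#"))  # size estimate in bits
```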

MTF
Given a string S = baacb over the alphabet Σ = {a,b,c,d}, output each symbol's (1-based) position in a list, then move that symbol to the front:

symbol   list before   output
b        a b c d       2
a        b a c d       2
a        a b c d       1
c        a b c d       3
b        c a b d       3

Final list: b c a d.  MTF(S) = 2 2 1 3 3
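A minimal sketch of the encoder, producing the 1-based ranks of the table above (function name ours):

```python
def mtf_encode(s: str, alphabet: str) -> list[int]:
    """Move-to-front: output each symbol's 1-based position in the list,
    then move that symbol to the front. Recently seen symbols get small
    ranks, so locally homogeneous input yields small integers."""
    table = list(alphabet)
    out = []
    for ch in s:
        i = table.index(ch)
        out.append(i + 1)
        table.insert(0, table.pop(i))
    return out

print(mtf_encode("baacb", "abcd"))  # [2, 2, 1, 3, 3]
```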

Main Bounds (Manzini 1999)
|A(s)| ≤ 8·nH_k(s) + 0.08·n + g_k, where g_k is a constant that depends on the context length k and on the size of the alphabet. These are worst-case bounds.

Now we are ready to begin…

Some Intuition…
H_0 "measures" frequency.
H_k "measures" frequency and context.
→ We want a statistic that measures local similarity in a string, and specifically in the BWT of the string.

Some Intuition…
The more similar the contexts are in the original string, the more its BWT will exhibit local similarity…
The more local similarity in the BWT of the string, the smaller the numbers we get from MTF…
→ The solution: Local Entropy

The Local Entropy – Definition
Given a string s = s_1 s_2 … s_n, MTF maps the original string to an integer sequence a_1 a_2 … a_n. The local entropy of s (Bentley, Sleator, Tarjan, Wei, '86) is LE(s) = Σ_i log2(a_i + 1).

The Local Entropy – Definition
Note: LE(s) is the number of bits needed to write the MTF sequence in binary.
Example: MTF(s) = 3,1,1 → in binary: 11,1,1 → LE(s) = 4.
In a dream world, we would like to compress s down to LE(s) bits…
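A sketch that computes LE exactly as this slide describes, as the total number of bits needed to write the 1-based MTF ranks in binary (equivalently, Σ_i ⌈log2(a_i + 1)⌉):

```python
from math import floor, log2

def local_entropy(ranks: list[int]) -> int:
    """LE as bits: each 1-based MTF rank r takes floor(log2 r) + 1 bits
    when written in plain binary."""
    return sum(floor(log2(r)) + 1 for r in ranks)

print(local_entropy([3, 1, 1]))  # '11' + '1' + '1' -> 4 bits
```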

The Local Entropy – Properties
We use two properties of LE:
1. The entropy hierarchy
2. Convexity

The Local Entropy – Property 1
1. The entropy hierarchy: we prove that for each k, LE(BWT(s)) ≤ nH_k(s) + O(1).
→ Any upper bound that we get for BWT with LE holds for H_k(s) as well.

The Local Entropy – Property 2
2. Convexity: for any partition of a string s into substrings s_1, …, s_t, LE(s) ≤ Σ_i LE(s_i) + O(t) (the hidden constant depends on the alphabet size).
→ This means that a partition of a string s does not (significantly) improve the Local Entropy of s.

Convexity
Cutting the input string into parts doesn't influence LE much: a cut only resets the MTF list, which affects at most the first occurrence of each alphabet symbol in a part, i.e., O(|Σ|) positions per part.
Example: a a a b | a b a b
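A quick numeric illustration (the example string and alphabet are ours): cutting the string resets the MTF list, so the parts' LE values can differ from the whole string's LE, but only by a few bits per part.

```python
from math import floor, log2

def le(s: str, alphabet: str) -> int:
    """LE(s): bits to write the 1-based MTF ranks of s in binary."""
    table, bits = list(alphabet), 0
    for ch in s:
        i = table.index(ch)
        bits += floor(log2(i + 1)) + 1
        table.insert(0, table.pop(i))
    return bits

s = "aaababab"
print(le(s, "ab"))                        # whole string: 13 bits
print(le(s[:4], "ab") + le(s[4:], "ab"))  # two parts: 5 + 7 = 12 bits
```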

Convexity – Why do we need it?
Ferragina, Giancarlo, Manzini and Sciortino, JACM 2005 (the compression booster):
String S → BWT (Burrows-Wheeler transform) → Booster (partition of BWT(S)) → RHC (a variation of Huffman encoding) on each part → Compressed string S'

Using LE and its properties we get our bounds
Theorem: for every μ > 1,
|A(s)| ≤ μ·LE(BWT(s)) + log2(ζ(μ))·n + O(log n)   (our LE bound)
|A(s)| ≤ μ·nH_k(s) + log2(ζ(μ))·n + g_k   (our H_k bound)
where ζ is the Riemann zeta function.

Our bounds
We get an improvement of the known bounds: |A(s)| ≤ μ·nH_k(s) + log2(ζ(μ))·n + g_k for every μ > 1, as opposed to the known bound (Manzini, 1999): |A(s)| ≤ 8·nH_k(s) + 0.08·n + g_k.

Our Test Results
[Per-file results table: columns are File Name, bzip2, our bound using LE, our H_k bound, and Manzini's bound 8nH_k(s) + 0.08n + g_k; rows are alice29.txt, asyoulik.txt, cp.html, fields.c, grammar.lsp, lcet10.txt, plrabn12.txt, xargs.1.]
*The files are non-binary files from the Canterbury corpus. gzip results are also taken from the corpus. Sizes are indicated in bytes.

How is LE related to compression of integer sequences?
We mentioned the "dream world", but what about reality? How close can we come to LE(s)?
Problem: compress an integer sequence s close to its sum of logs: SL(s) = Σ_i log2(s_i + 1).
Notice that for any s: SL(MTF(s)) = LE(s).
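The statistic in code (a one-line sketch; the name is ours):

```python
from math import log2

def sum_of_logs(seq: list[int]) -> float:
    """SL(s) = sum of log2(s_i + 1): the target bit count for an
    integer sequence."""
    return sum(log2(v + 1) for v in seq)

print(sum_of_logs([3, 1, 1]))  # 4.0, matching LE of the earlier MTF example
```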

Compressing Integer Sequences
Universal encodings of integers: prefix-free encodings for the integers (e.g., the Fibonacci encoding).
Doing some math, it turns out that order-0 encoding is good. Not only good: it is best!
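The slide cites Fibonacci encoding; as a sketch of the same idea, here is Elias gamma coding, another universal prefix-free integer code, whose codeword for v has 2⌊log2 v⌋ + 1 bits, within a constant factor of log2(v + 1):

```python
def elias_gamma(v: int) -> str:
    """Elias gamma code: floor(log2 v) zeros, then v in binary.
    Prefix-free over the positive integers."""
    assert v >= 1
    b = bin(v)[2:]              # binary representation of v
    return "0" * (len(b) - 1) + b

for v in [1, 2, 3, 4, 11]:
    print(v, elias_gamma(v))    # 1 -> '1', 2 -> '010', 4 -> '00100', ...
```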

The order-0 math
Theorem: for any string s of length n over the integer alphabet {1, 2, …, h} and for any μ > 1:
nH_0(s) ≤ μ·SL(s) + log2(ζ(μ))·n
Strange conclusion… we get an upper bound on the order-0 algorithm with a term that depends on the values of the integers. This is true for all strings, but is especially interesting for strings of small integers.
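A numeric sanity check of this inequality on a toy sequence (the truncated zeta sum and the helper name are our own illustration):

```python
from collections import Counter
from math import log2

def check_bound(seq: list[int], mu: float = 2.0) -> tuple[float, float]:
    """Compare nH_0(seq) with mu*SL(seq) + log2(zeta(mu))*n."""
    n, counts = len(seq), Counter(seq)
    nH0 = sum(c * log2(n / c) for c in counts.values())
    sl = sum(log2(v + 1) for v in seq)
    zeta = sum(k ** -mu for k in range(1, 10**6))  # truncated Riemann zeta
    return nH0, mu * sl + log2(zeta) * n

lhs, rhs = check_bound([3, 1, 1, 2, 1, 1, 5])
print(lhs <= rhs, lhs, rhs)  # True: the order-0 bound holds here
```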

A lower bound for SL
Theorem: For any algorithm A, any μ > 1, and any C such that C < log2(ζ(μ)), there exists a string S of length n for which: |A(S)| > μ·SL(S) + C·n

Our Results – Summary
New improved bounds for BWT-based compression (BWT + MTF), via the Local Entropy (LE).
New bounds for compression of integer strings.

Open Issues
We question the effectiveness of H_k(s) as a statistic. Is there a better statistic?

Anybody want to guess??

Creating a Huffman encoding
For each encoding unit (a letter, in this example), associate a frequency (the number of times it occurs).
Create a binary tree whose children are the encoding units with the smallest frequencies; the frequency of the root is the sum of the frequencies of the leaves.
Repeat this procedure until all the encoding units are in the binary tree.

Example Assume that relative frequencies are: A: 40 B: 20 C: 10 D: 10 R: 20

Example, cont. [figure: the Huffman tree built from these frequencies]

Assign 0 to left branches, 1 to right branches; each encoding is the path from the root:
A = 0, B = 100, C = 1010, D = 1011, R = 11
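The procedure in code, using a binary heap (heapq); with different tie-breaking among equal frequencies it may build a different tree than the slide's, but any Huffman tree gives the same optimal total length (220 bits for this input):

```python
import heapq
from itertools import count

def huffman_codes(freqs: dict[str, int]) -> dict[str, str]:
    """Repeatedly merge the two lowest-frequency nodes; then read each
    code as the root-to-leaf path (0 = left, 1 = right)."""
    tick = count()  # tie-breaker so heapq never compares tree tuples
    heap = [(f, next(tick), sym) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tick), (left, right)))
    codes = {}
    def walk(node, prefix=""):
        if isinstance(node, str):
            codes[node] = prefix or "0"  # single-symbol edge case
        else:
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
    walk(heap[0][2])
    return codes

print(huffman_codes({"A": 40, "B": 20, "C": 10, "D": 10, "R": 20}))
```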

The Burrows-Wheeler Transform (1994)
Given a string S = banana#, write down all of its rotations:
banana#
anana#b
nana#ba
ana#ban
na#bana
a#banan
#banana
Sort the rows:
#banana
a#banan
ana#ban
anana#b
banana#
na#bana
nana#ba
The last column of the sorted matrix is the Burrows-Wheeler Transform: BWT(S) = annb#aa.
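The same construction in a few lines (a naive O(n² log n) sketch; the function name is ours):

```python
def bwt(s: str) -> str:
    """Naive BWT: sort all rotations of s and take the last column.
    Assumes s already ends with a unique sentinel such as '#'."""
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(row[-1] for row in rotations)

print(bwt("banana#"))  # annb#aa
```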

Suffix Arrays and the BWT
Sorting the rotations of S = banana# is the same as sorting its suffixes (the sentinel # ends every comparison), so the suffix array is exactly the index of the BWT: the i-th row of the sorted matrix starts at suffix SA[i], and its last column is the character preceding that suffix. So all we need to get the BWT is the suffix array!
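A sketch of that connection, with a toy quadratic suffix-array construction standing in for a real linear-time one:

```python
def bwt_from_suffix_array(s: str) -> str:
    """BWT via the suffix array: row i of the sorted rotation matrix
    starts at suffix SA[i], and its last character is the character
    just before that suffix (wrapping to the end when SA[i] = 0)."""
    sa = sorted(range(len(s)), key=lambda i: s[i:])  # toy O(n^2 log n) build
    return "".join(s[i - 1] for i in sa)  # s[-1] wraps to the sentinel

print(bwt_from_suffix_array("banana#"))  # annb#aa, same as the rotation method
```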