Advanced Seminar in Data Structures


Advanced Seminar in Data Structures, 28/12/2004
An Analysis of the Burrows-Wheeler Transform (Giovanni Manzini)
Presented by Assaf Oren

Topics
- Introduction
- Burrows-Wheeler Transform
- Move–to–Front
- Empirical Entropy
- Order0 coder
- Analysis of the BW0 algorithm
- Run-Length encoding
- Analysis of the BW0RL algorithm

Introduction
A BWT-based algorithm:
- Takes the input string s
- Transforms it to bwt(s), with |bwt(s)| = |s|
- Compresses bwt(s) with a compressor A
- The compressed string is A(bwt(s))

Introduction (cont’)
Notations:
- Recoding scheme tra(): a transformation that performs no compression
- Coding scheme A(): an algorithm designed to reduce the size of the input

BWT-based Algorithm Properties
- Even when combined with a simple compression algorithm, A(bwt(s)) achieves a good compression ratio: the very simple and clean algorithm from Nelson [1996] outperforms the PkZip package.
- Other, more advanced BWT compressors are Bzip [Seward 1997] and Szip [Schindler 1997].
- BWT-based compressors achieve a very good compression ratio using relatively modest resources (Arnold and Bell [2000], Fenwick [1996a]).

nova 25% man bzip2

bzip2(1)                                                        bzip2(1)

NAME
       bzip2, bunzip2 - a block-sorting file compressor, v1.0.2
       bzcat - decompresses files to stdout
       bzip2recover - recovers data from damaged bzip2 files

SYNOPSIS
       bzip2 [ -cdfkqstvzVL123456789 ] [ filenames ... ]
       bunzip2 [ -fkvsVL ] [ filenames ... ]
       bzcat [ -s ] [ filenames ... ]
       bzip2recover filename

DESCRIPTION
       bzip2 compresses files using the Burrows-Wheeler block sorting
       text compression algorithm, and Huffman coding. Compression is
       generally considerably better than that achieved by more
       conventional LZ77/LZ78-based compressors, and approaches the
       performance of the PPM family of statistical compressors.

BWT-based Algorithm Properties (cont’)
BWT-based compressors work very well in practice, but no satisfactory proof had been given for their compression ratio. Previous analyses:
- Sadakane [1997; 1998]: assuming the input string is generated by a finite-order Markov source.
- Effros [1999]: bounds on the speed at which the average compression ratio approaches the entropy.

The Burrows-Wheeler Transform
Background:
- Part of research done for DIGITAL, released in 1994
- Based on a previously unpublished transformation discovered by Wheeler in 1983
Technical:
- The resulting output block contains exactly the same data elements it started with
- Performed on an entire block of data at once
- Reversible

The Burrows-Wheeler Transform (cont’)
- Append # to the end of s; # is unique and smaller than any other character
- Form a matrix M whose rows are the cyclic shifts of s#
- Sort the rows right to left
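A minimal Python sketch of this variant (sorting rows by reading them right to left and emitting the first column F; the classic formulation instead sorts left to right and emits the last column):

    def bwt_rtl(s, sentinel="#"):
        """BWT variant from these slides: sort the cyclic shifts right to
        left, output the first column F (with the sentinel removed) and
        the 1-based position where the sentinel occurred in F."""
        t = s + sentinel
        rows = [t[i:] + t[:i] for i in range(len(t))]   # all cyclic shifts
        rows.sort(key=lambda r: r[::-1])                # compare right to left
        f = "".join(r[0] for r in rows)                 # first column F
        pos = f.index(sentinel)
        return f[:pos] + f[pos + 1:], pos + 1

    print(bwt_rtl("mississippi"))  # -> ('msspipissii', 3), as on the next slide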

The Burrows-Wheeler Transform (cont’)
The output of BWT is the column F = “msspipissii” and the number 3 (the position of #).

The Burrows-Wheeler Transform (cont’)
Observations:
1. Every column of M is a permutation of s#.
2. Each character in L is followed in s# by the corresponding character in F.
3. For any character c, the ith occurrence of c in F corresponds to the ith occurrence of c in L.
How to reconstruct s:
- Sort bwt(s) to get column L (column F is bwt(s)).
- F1 is the first character of s.
- By observation 3, this ‘m’ is the same ‘m’ as L6; observation 2 then tells us that F6 is the next character of s. Continue until all of s is recovered.
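A sketch of the inverse, following the three observations directly (match each occurrence in F to the same-ranked occurrence in L = sorted(F), then repeatedly step from L to F):

    from collections import defaultdict

    def ibwt_rtl(f_out, pos, sentinel="#"):
        """Invert bwt_rtl: f_out is F without the sentinel, pos its 1-based slot."""
        f = f_out[:pos - 1] + sentinel + f_out[pos - 1:]  # reinsert '#'
        l = "".join(sorted(f))                 # observation 1: L is sorted F
        # observation 3: i-th occurrence of c in F <-> i-th occurrence in L
        slots, seen, f_to_l = defaultdict(list), defaultdict(int), []
        for i, c in enumerate(l):
            slots[c].append(i)
        for c in f:
            f_to_l.append(slots[c][seen[c]])
            seen[c] += 1
        out, i = [], 0       # the first row in right-to-left order is s# itself
        for _ in range(len(f) - 1):
            out.append(f[i])                   # F[i] is the next character of s
            i = f_to_l[i]                      # observation 2: L[i] is followed by F[i]
        return "".join(out)

    print(ibwt_rtl("msspipissii", 3))  # -> 'mississippi'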

The Burrows-Wheeler Transform (cont’)
(figure omitted in the transcript)

The Burrows-Wheeler Transform (cont’)
Why is this transform so helpful? BWT collects together the symbols following a given context: for each substring w of s, the characters following w in s are grouped together inside bwt(s). More formally, for every k, bwt(s) is (up to a few boundary characters) the concatenation of the strings w_s over the contexts w of length k, where w_s denotes the characters that follow w in s.

Move–to–Front (mtf)
- Another recoding scheme
- Suggested by B&W to be used after applying BWT on string s: s’ = mtf(bwt(s))
- |mtf(bwt(s))| = |bwt(s)| = |s|
- If s is over {a1, a2, …, ah} then s’ is over {0, 1, …, h-1}

Move–to–Front (cont’)
For each letter (left to right): write the number of distinct other letters seen since the last time the current letter appeared (equivalently, its current position in a list that is reordered by moving each coded letter to the front).
Example, with initial list (a, b, c):
  a a b a c a a c c b a
  0 0 1 1 2 1 0 1 0 2 2
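A minimal sketch of the list-based formulation, assuming the initial list order is given:

    def mtf_encode(s, alphabet):
        table = list(alphabet)
        out = []
        for c in s:
            i = table.index(c)             # distinct symbols seen since c's last use
            out.append(i)
            table.insert(0, table.pop(i))  # move c to the front
        return out

    print(mtf_encode("aabacaaccba", "abc"))  # -> [0, 0, 1, 1, 2, 1, 0, 1, 0, 2, 2]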

Move–to–Front (cont’)
Why is this transform helpful? It turns the local homogeneity of bwt(s) into global homogeneity. For example, if one stretch of bwt(s) uses mostly {a, b} and a later stretch uses mostly {c, d}, after mtf both stretches become runs of the same small numbers (mostly 0s and 1s).

Huffman coding
Assigns binary codewords to letters according to their frequency. For example, with A = {a, b, c} and frequencies a = 300, b = 150, c = 150, the coding will be a = ‘0’, b = ‘10’, c = ‘11’.
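A sketch of the greedy construction (repeatedly merge the two least frequent subtrees; the single-symbol case degenerates to an empty codeword):

    import heapq

    def huffman_codes(freqs):
        """freqs: symbol -> count. Returns symbol -> codeword (bit string)."""
        heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(sorted(freqs.items()))]
        heapq.heapify(heap)
        tie = len(heap)                      # tie-breaker so dicts are never compared
        while len(heap) > 1:
            f1, _, c1 = heapq.heappop(heap)  # two least frequent subtrees
            f2, _, c2 = heapq.heappop(heap)
            merged = {s: "0" + w for s, w in c1.items()}
            merged.update({s: "1" + w for s, w in c2.items()})
            heapq.heappush(heap, (f1 + f2, tie, merged))
            tie += 1
        return heap[0][2]

    print(huffman_codes({"a": 300, "b": 150, "c": 150}))
    # -> {'a': '0', 'b': '10', 'c': '11'} (up to swapping 0/1 at each node)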

Arithmetic coding
Encodes the whole string as a single number inside a subinterval of [0, 1): each symbol narrows the current interval in proportion to its probability, so the output length approaches the entropy even more closely than Huffman coding.
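A toy sketch of the interval narrowing (floating point, for illustration only; real coders use integer arithmetic with renormalization):

    def arith_interval(s, probs):
        """Return the final subinterval [lo, hi) encoding string s.
        probs: symbol -> probability (summing to 1)."""
        start, cum = {}, 0.0
        for c in sorted(probs):              # cumulative start of each symbol's slice
            start[c] = cum
            cum += probs[c]
        lo, hi = 0.0, 1.0
        for c in s:
            width = hi - lo                  # narrow the interval to c's slice
            lo, hi = lo + width * start[c], lo + width * (start[c] + probs[c])
        return lo, hi                        # any number in [lo, hi) identifies s

    print(arith_interval("aab", {"a": 0.5, "b": 0.25, "c": 0.25}))
    # interval width 0.5 * 0.5 * 0.25 = 0.0625 -> about -log2(0.0625) = 4 bits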

The Empirical Entropy of a string
- s = our string, n = |s|
- A = our alphabet, h = |A|
- n_i = number of occurrences of the symbol a_i inside s
- H0(s) = the zeroth order empirical entropy of s:
  H0(s) = - Σ_{i=1..h} (n_i / n) log2(n_i / n)
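A direct sketch of the definition:

    import math
    from collections import Counter

    def h0(s):
        """Zeroth order empirical entropy, in bits per symbol."""
        n = len(s)
        return -sum((ni / n) * math.log2(ni / n) for ni in Counter(s).values())

    print(h0("mississippi"))  # ~1.82 bits/symbol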

Intuition for the Empirical Entropy
For each symbol, and for each appearance of this symbol in the text, count the number of bits needed to represent it with an ideal uniquely decodable code: log2(n / n_i) bits for each occurrence of a_i. Summing over all occurrences and dividing by n gives H0(s).

The kth order Empirical Entropy
We can achieve greater compression if the codeword depends on the k symbols that precede the coded symbol. For a substring w, let w_s denote the string of characters that follow the occurrences of w in s. For example, for s = “abcabcabd” and w = “ab” we have ab_s = “ccd”. Formally:
  Hk(s) = (1/n) Σ_{w in A^k} |w_s| H0(w_s)
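A sketch that groups the followers of each length-k context, reusing h0 from the sketch above:

    def hk(s, k):
        """kth order empirical entropy: collect the characters that follow
        each length-k context w, then average their zeroth order entropies."""
        followers = {}
        for i in range(len(s) - k):
            w = s[i:i + k]
            followers.setdefault(w, []).append(s[i + k])
        return sum(len(ws) * h0(ws) for ws in followers.values()) / len(s)

    print(hk("mississippi", 1))
    # groups m_s = 'i', i_s = 'ssp', s_s = 'sisi', p_s = 'pi' (next slide)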

Examples of Hk(s)
Example 1: k = 1, s = mississippi: m_s = i, i_s = ssp, s_s = sisi, p_s = pi.
Example 2: k = 1, s = cc(ab)^n: a_s = b^n, b_s = a^(n-1), c_s = ca. Here H0(a_s) = H0(b_s) = 0 and only c_s contributes, with H0(c_s) = 1, so |s| H1(s) = 2 bits.

The modified Empirical Entropy
Hk is modified (to H*k) in order to avoid cases where Hk(s) = 0 or is vanishingly small, e.g. s = a^n has H0(s) = 0, yet any compressor still needs at least enough bits to encode the length n; a bound of the form λ|s|Hk(s) would be vacuous there.

Empirical Entropy and BWT
- We saw that bwt(s) groups together, for every context w of length k, the characters w_s that follow w in s.
- We know that Hk(s) = (1/n) Σ_w |w_s| H0(w_s).
- If we had an ideal algorithm A that compresses any string x to |x| H0(x) bits, then running A separately on each group would use Σ_w |w_s| H0(w_s) = |s| Hk(s) bits.
We get: we reduced the problem of compressing up to kth order entropy to the problem of compressing distinct portions of the input string up to their zeroth order entropy.

An Order0 coder
A coder with a compression ratio that is close to the zeroth order empirical entropy. Formally, for some small constant μ:
  |Order0(x)| ≤ |x| H0(x) + μ|x|
For static Huffman coding, μ = 1; for a simple arithmetic coder, μ ≈ 10^-2 (Howard and Vitter [1992a]).

Analysis of the BW0 algorithm
BW0(s) = Order0(mtf(bwt(s)))
We would like to bound |BW0(s)| in terms of |s| Hk(s). For now let’s assume the Order0 coder satisfies the bound from the previous slide. Theorem 4.1 bounds the output of an order-0 coder applied to mtf(s) against any partition s1, …, st of s: roughly, |Order0(mtf(s))| ≤ 8 Σ_i |s_i| H0(s_i) plus lower-order terms.
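Tying the earlier sketches together (Huffman standing in for the Order0 coder; bwt_rtl, mtf_encode and huffman_codes are the sketches from the previous slides):

    from collections import Counter

    def bw0(s, alphabet):
        """BW0(s) = Order0(mtf(bwt(s))), with Huffman as the Order0 stand-in."""
        f, pos = bwt_rtl(s)                    # slides' right-to-left BWT
        ranks = mtf_encode(f, alphabet)        # recode as small numbers
        codes = huffman_codes(Counter(ranks))  # order-0 code for the ranks
        return "".join(codes[r] for r in ranks), pos

    bits, pos = bw0("mississippi", "imps")
    print(len(bits), pos)  # compressed size in bits, plus the '#' position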

Proof of BW0
We saw that bwt(s) can be partitioned into t ≤ h^k substrings s1, …, st (one per context) with Σ_i |s_i| H0(s_i) ≤ |s| Hk(s). Combining this partition with Theorem 4.1, and then with our bound on Order0, we get the desired bound on |BW0(s)|: 8|s| Hk(s) plus lower-order terms.

Proof of Theorem 4.1
Lemma 4.3, Lemma 4.4 (formulas omitted in the transcript)

Proof of Theorem 4.1 (cont’)
Lemma 4.5, Lemma 4.6, Lemma 4.7 (formulas omitted in the transcript)

Proof of Theorem 4.1 (cont’)
Lemma 4.8. It is sufficient to prove that: (formula omitted in the transcript)

Proof of Theorem 4.1 (cont’)
By applying Lemmas 4.3 and 4.5 we get the two required inequalities (formulas omitted in the transcript).

Analysis of the BW0RL algorithm
BW0RL(s) = Order0(RLE(mtf(bwt(s))))
RLE(s):
- Let 0' and 1' be two symbols that do not belong to the alphabet.
- For m ≥ 1, B(m) = m+1 written in binary over 0' and 1', discarding the most significant bit: B(1) = 0', B(2) = 1', B(3) = 0'0', B(4) = 0'1', B(5) = 1'0', …
- RLE(s) replaces each run 0^m of m zeros in s with B(m).
- Given s = “110022013000”, RLE(s) = “11 1' 22 0' 13 0'0'”.
- |RLE(s)| ≤ |s|, since |B(m)| = ⌊log2(m+1)⌋ ≤ m.
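A sketch of this run-length encoder, rendering the two new symbols 0' and 1' as the characters 'O' and 'I' (a hypothetical rendering, just so each symbol stays one character):

    def b(m):
        """m+1 in binary, MSB discarded; digits 0/1 stand for the new symbols."""
        return bin(m + 1)[3:]        # bin() yields '0b1...'; drop '0b' and the MSB

    def rle(s):
        out, i = [], 0
        while i < len(s):
            if s[i] == "0":
                j = i
                while j < len(s) and s[j] == "0":
                    j += 1           # find the whole run 0^m
                out.append(b(j - i).translate(str.maketrans("01", "OI")))
                i = j
            else:
                out.append(s[i])
                i += 1
        return "".join(out)

    print(rle("110022013000"))  # -> '11I22O13OO' (I, O standing for 1', 0')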

Analysis of BW0RL (cont’)
Theorem 5.1 … Theorem 5.8 (statements omitted in the transcript; the chain of theorems culminates, roughly, in a bound of the form |BW0RL(s)| ≤ (5 + ε)|s| H*k(s) + g_k, with g_k depending only on k and the alphabet size).

Analysis of BW0RL (cont’)
Locally λ-Optimal Algorithm: for all t > 0 there exists a constant c_t such that for any partition s1, s2, …, st of the string s we have
  |A(s)| ≤ λ Σ_i |s_i| H0(s_i) + c_t
A locally λ-optimal algorithm combined with BWT is bounded (using the context partition, t ≤ h^k) by
  |A(bwt(s))| ≤ λ|s| Hk(s) + c_(h^k)

A bit of practicality
A nice article by Mark Nelson: http://www.dogma.net/markn/articles/bwt/bwt.htm
Includes source code and measurements.
Usage:
  RLE input-file | BWT | MTF | RLE | ARI > output-file
  UNARI input-file | UNRLE | UNMTF | UNBWT | UNRLE > output-file

A bit of practicality (cont’)

File Name    Raw Size   PKZIP Size   PKZIP Bits/Byte   BWT Size   BWT Bits/Byte
bib           111,261       35,821        2.58           29,567        2.13
book1         768,771      315,999        3.29          275,831        2.87
book2         610,856      209,061        2.74          186,592        2.44
geo           102,400       68,917        5.38           62,120        4.85
news          377,109      146,010        3.10          134,174        2.85
obj1           21,504       10,311        3.84           10,857        4.04
obj2          246,814       81,846        2.65           81,948        2.66
paper1         53,161       18,624        2.80           17,724        2.67
paper2         82,199       29,795        2.90           26,956        2.62
paper3         46,526       18,106        3.11           16,995        2.92
paper4         13,286        5,509        3.32            5,529        3.33
paper5         11,954        4,962        3.32            5,136        3.44
paper6         38,105       13,331        2.80           13,159        2.76
pic           513,216       54,188        0.84           50,829        0.79
progc          39,611       13,340        2.69           13,312        2.69
progl          71,646       16,227        1.81           16,688        1.86
progp          49,379       11,248        1.82           11,404        1.85
trans          93,695       19,691        1.68           19,301        1.65
total:      3,251,493    1,072,986        2.64          978,122        2.41

The End