Lecture #1: From 0-th order entropy compression to k-th order entropy compression


Entropy (Shannon, 1948). For a source S emitting symbols with probability p(s), the self-information of s is i(s) = log_2(1/p(s)) bits: lower probability → higher information. Entropy is the weighted average of i(s), that is H = ∑_s p(s) log_2(1/p(s)). H_0 = 0-th order empirical entropy of a string, obtained by taking p(s) = freq(s)/|T|.
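
To make the definition concrete, here is a minimal Python sketch (the function name empirical_entropy is mine, not from the slides) computing H_0 with p(s) = freq(s)/|T|:

```python
import math
from collections import Counter

def empirical_entropy(text):
    """0-th order empirical entropy H_0 (bits per symbol),
    computed with p(s) = freq(s) / |text|."""
    n = len(text)
    return sum((c / n) * math.log2(n / c) for c in Counter(text).values())

# For "mississippi", H_0 is about 1.82 bits/symbol, so ~20 bits for the whole string.
print(empirical_entropy("mississippi"))
```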

Performance. Compression ratio = #bits in output / #bits in input. Compression performance: we relate entropy to the compression ratio achieved, i.e. we compare |C(T)| against n H_0(T), or equivalently |C(T)|/|T| against H_0(T).

Huffman Code. Invented by Huffman as a class assignment in '50. Used in most compression algorithms: gzip, bzip, jpeg (as option), fax compression, … Properties: it generates optimal prefix codes and is fast to encode and decode. We can prove that (n = |T|): n H(T) ≤ |Huff(T)| < n H(T) + n. This means that it loses < 1 bit per symbol on average! Good or bad?
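
As a rough illustration of how such a code can be built, here is a minimal sketch of Huffman's greedy merging (not the tuned implementations used by gzip or bzip):

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build an optimal prefix code: repeatedly merge the two least-frequent
    subtrees, prepending a 0/1 to every codeword of each merged subtree."""
    heap = [(freq, i, {sym: ""}) for i, (sym, freq) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (f1 + f2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

codes = huffman_codes("mississippi")
print(codes)
print(sum(len(codes[s]) for s in "mississippi"))   # 21 bits, vs n*H_0 of about 20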

Arithmetic coding. Given a text of n symbols, it takes n H_0 + 2 bits vs. the (n H_0 + n) bits of Huffman. Used in PPM, JPEG/MPEG (as option), … More time-costly than Huffman, but the integer implementation is "not bad".

Symbol interval. Assign each symbol s an interval [f(s), f(s) + p(s)) within [0, 1), where f(s) is the cumulative probability of the symbols preceding s. With p(a) = .2, p(b) = .5, p(c) = .3 we get f(a) = .0, f(b) = .2, f(c) = .7; e.g., the symbol interval for b is [.2, .7).

Encoding a sequence of symbols. Coding the sequence bac: start from [0, 1); symbol b narrows it to its interval [.2, .7), of size .5; symbol a narrows it to [.2, .3), of size .5 * .2 = .1; symbol c narrows it to [.27, .3), of size .1 * .3 = .03. The final sequence interval is [.27, .3).

The algorithm. To code a sequence of symbols T_1 … T_n (here with p(a) = .2, p(b) = .5, p(c) = .3), maintain the current interval [l_i, l_i + s_i): start with l_0 = 0 and s_0 = 1, and for each symbol set l_i = l_{i-1} + s_{i-1} * f(T_i) and s_i = s_{i-1} * p(T_i). At the end, pick a number inside the final interval [l_n, l_n + s_n).

Decoding example. Decoding the number .49, knowing that the input text to be decoded has length 3: .49 falls in [.2, .7) → b; rescaling, (.49 - .2)/.5 = .58 falls in [.2, .7) → b; rescaling again, (.58 - .2)/.5 = .76 falls in [.7, 1) → c. The message is bbc.
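
Putting the last three slides together, here is a sketch of both directions, using exact fractions to sidestep rounding (the model p, f comes from the example above; the function names are mine, and real coders use the integer implementation mentioned earlier):

```python
from fractions import Fraction as F

# Model of the example: p(a)=.2, p(b)=.5, p(c)=.3 and cumulative
# starts f(a)=0, f(b)=.2, f(c)=.7.
p = {"a": F(2, 10), "b": F(5, 10), "c": F(3, 10)}
f = {"a": F(0), "b": F(2, 10), "c": F(7, 10)}

def final_interval(text):
    """Narrow [l, l+s): for each symbol, l += s*f(sym) and s *= p(sym)."""
    l, s = F(0), F(1)
    for sym in text:
        l, s = l + s * f[sym], s * p[sym]
    return l, s

def decode(x, n):
    """Recover n symbols from a number x inside the final interval."""
    out = []
    for _ in range(n):
        sym = max((c for c in p if f[c] <= x), key=lambda c: f[c])
        out.append(sym)
        x = (x - f[sym]) / p[sym]   # rescale back to [0, 1) and continue
    return "".join(out)

l, s = final_interval("bac")
print(l, l + s)                     # 27/100 and 3/10, i.e. the interval [.27, .3)
print(decode(F(49, 100), 3))        # bbc, as in the decoding example
```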

How do we encode that number? Binary fractional representation, FractionalEncode(x): 1. x = 2 * x; 2. if x < 1, output 0 and go to 1; 3. otherwise x = x - 1, output 1 and go to 1. E.g., for x = 1/3: 2 * (1/3) = 2/3 < 1, output 0; 2 * (2/3) = 4/3 > 1, output 1 and 4/3 - 1 = 1/3; and so on. This is an incremental generation of the bits.

Which number do we encode? The middle point l_n + s_n/2 of the final interval, truncating its binary expansion to the first d = ⌈log_2(2/s_n)⌉ bits. Truncation gets a smaller number… how much smaller? At most 2^(-d) ≤ s_n/2, so the truncated number still falls inside [l_n, l_n + s_n). Compression = Truncation.
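
A runnable sketch of the previous slide's FractionalEncode, stopped after the d bits prescribed here (the function name and the Fraction-based arithmetic are mine):

```python
from fractions import Fraction as F

def fractional_encode(x, d):
    """First d bits of the binary fractional expansion of x in [0, 1):
    double x; emit 0 if still < 1, else emit 1 and subtract 1."""
    bits = []
    for _ in range(d):
        x *= 2
        if x < 1:
            bits.append("0")
        else:
            bits.append("1")
            x -= 1
    return "".join(bits)

print(fractional_encode(F(1, 3), 6))      # 010101  (1/3 = 0.010101... in binary)

# For bac: s_n = .03, so d = ceil(log2(2/.03)) = 7; encoding the midpoint
# l_n + s_n/2 = .285 gives 0100100 = .28125, which still lies in [.27, .3).
print(fractional_encode(F(285, 1000), 7))
```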

Bound on code length. Theorem: for a text of length n, the Arithmetic encoder generates at most ⌈log_2(2/s_n)⌉ bits. Indeed, recalling that s_n = ∏_{i=1..n} p(T_i) (e.g., for T = aaba it is p(a)^3 * p(b)), we have ⌈log_2(2/s_n)⌉ < 1 + log_2(2/s_n) = 1 + (1 - log_2 s_n) = 2 - log_2(∏_{i=1..n} p(T_i)) = 2 - ∑_{i=1..n} log_2 p(T_i) = 2 - ∑_{σ∈Σ} occ(σ) log_2 p(σ) = 2 + n * ∑_{σ∈Σ} p(σ) log_2(1/p(σ)) = 2 + n H_0(T) bits. In practice slightly more than n H_0 bits, because of rounding.

Where is the problem? Take the text T = a^n b^n; hence H_0 = (1/2) log_2 2 + (1/2) log_2 2 = 1 bit per symbol, so the compression ratio would be 1/8 if a, b are ASCII-encoded, or no compression if a, b are already encoded in 1 bit. We would like to exploit repetitions, wherever they occur and whatever length they have; instead, any permutation of T, even a random one, gets the same bound.

Data Compression Can we use simpler repetition-detectors?

Simple compressors: too simple? Move-to-Front (MTF): as a freq-sorting approximator, as a caching strategy, as a compressor. Run-Length Encoding (RLE): used in FAX compression.

γ code for integer encoding. For x > 0, write x in binary; its Length is ⌊log_2 x⌋ + 1. The γ code emits Length - 1 zeros followed by the binary representation of x; e.g., 9 = 1001 is represented as 000 1001. The γ code for x takes 2⌊log_2 x⌋ + 1 bits, i.e. a factor of 2 from the optimal log_2 x.

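A minimal sketch of this code (it is the Elias γ code; function names are mine):

```python
def gamma_encode(x):
    """Elias gamma code: (Length-1) zeros, then x in binary (x > 0)."""
    assert x > 0
    b = bin(x)[2:]                   # binary representation, no leading zeros
    return "0" * (len(b) - 1) + b    # 2*floor(log2 x) + 1 bits overall

def gamma_decode_stream(bits):
    """Decode a concatenation of gamma codes back into the integers."""
    out, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":
            zeros, i = zeros + 1, i + 1
        out.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return out

print(gamma_encode(9))                                          # 0001001
print(gamma_decode_stream(gamma_encode(9) + gamma_encode(3)))   # [9, 3]
```
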
Move-to-Front Coding. Transforms a char sequence into an integer sequence, which can then be var-length coded. Start with the list of symbols L = [a, b, c, d, …]; for each input symbol s: 1) output the position of s in L, 2) move s to the front of L. Properties: it is a dynamic code, with memory (unlike Arithmetic). For X = 1^n 2^n 3^n … n^n we get Huff = Θ(n^2 log n) bits, MTF = O(n log n) + n^2 bits: in fact Huffman takes log n bits per symbol, the symbols being equi-probable, while MTF uses O(1) bits per symbol occurrence but O(log n) bits for the first occurrence of each symbol.
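
A minimal sketch of the two steps above (list positions are 0-based here, so runs of repeated symbols become runs of zeros):

```python
def mtf_encode(text, alphabet):
    """For each symbol: output its current position in L, then move it to front."""
    L, out = list(alphabet), []
    for s in text:
        pos = L.index(s)
        out.append(pos)
        L.insert(0, L.pop(pos))
    return out

def mtf_decode(positions, alphabet):
    L, out = list(alphabet), []
    for pos in positions:
        s = L.pop(pos)
        out.append(s)
        L.insert(0, s)
    return "".join(out)

ranks = mtf_encode("aaaabbbbcccc", "abcd")
print(ranks)                           # [0, 0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0]
print(mtf_decode(ranks, "abcd"))       # aaaabbbbcccc
```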

Run-Length Encoding (RLE). If spatial locality is very high, then abbbaacccca ⇒ (a,1),(b,3),(a,2),(c,4),(a,1). In the case of binary strings ⇒ just the run lengths and one bit (the value of the first run). Properties: it is a dynamic code, with memory (unlike Arithmetic). For X = 1^n 2^n 3^n … n^n we get Huff(X) = Θ(n^2 log n) > Rle(X) = O(n (1 + log n)): RLE uses O(log n) bits per symbol-block, using the γ-code for its length.
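
The corresponding sketch, reproducing the example above:

```python
def rle_encode(text):
    """Collapse each maximal run of equal symbols into (symbol, run length)."""
    runs = []
    for s in text:
        if runs and runs[-1][0] == s:
            runs[-1][1] += 1
        else:
            runs.append([s, 1])
    return [(s, l) for s, l in runs]

def rle_decode(runs):
    return "".join(s * l for s, l in runs)

print(rle_encode("abbbaacccca"))   # [('a',1), ('b',3), ('a',2), ('c',4), ('a',1)]
print(rle_decode(rle_encode("abbbaacccca")))
```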

Data Compression Burrows-Wheeler Transform

The big (unconscious) step...

The Burrows-Wheeler Transform (1994). Given the text T = mississippi#, build the matrix of all cyclic rotations of T (mississippi#, ississippi#m, ssissippi#mi, sissippi#mis, sippi#missis, ippi#mississ, ppi#mississi, pi#mississip, i#mississipp, #mississippi, ssippi#missi, issippi#miss) and sort the rows lexicographically:

#mississipp i
i#mississip p
ippi#missis s
issippi#mis s
ississippi# m
mississippi #
pi#mississi p
ppi#mississ i
sippi#missi s
sissippi#mi s
ssippi#miss i
ssissippi#m i

F is the first column of the sorted matrix and L is the last column (shown detached above): L = Bwt(T) = ipssm#pissii.
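
The definition translates directly into a (quadratic-space, illustration-only) sketch:

```python
def bwt(text):
    """Sort all cyclic rotations of text and output the last column L."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(row[-1] for row in rotations)

print(bwt("mississippi#"))   # ipssm#pissii
```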

A famous example Much longer...

Compressing L seems promising… Key observation: L is locally homogeneous ⇒ L is highly compressible. Algorithm Bzip: 1. Move-to-Front coding of L; 2. Run-Length coding; 3. Statistical coder. Bzip vs. Gzip: 20% vs. 33% compression ratio, but Bzip is slower in (de)compression!

BWT matrix and Suffix Array: how to compute the BWT? Each row of the BWT matrix, if we drop what follows the end-marker #, is a suffix of T, and the rows are sorted; so the i-th row corresponds to the i-th entry of the suffix array SA of T. We said that L[i] precedes F[i] in T; hence, given SA and T, we have L[i] = T[SA[i] - 1]. This is one of the main reasons for the number of publications spurred in '94-'10 on Suffix Array construction.
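
In code, using a naive comparison-based suffix array purely for illustration (real constructions are far more efficient, which is the point of the slide):

```python
def bwt_from_sa(text):
    """BWT via the suffix array: L[i] = T[SA[i]-1]; in Python, T[-1] is the
    last char, which is correct because the rotation wraps around."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])   # naive O(n^2 log n) SA
    return "".join(text[i - 1] for i in sa)

print(bwt_from_sa("mississippi#"))   # ipssm#pissii, as with sorted rotations
```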

A useful tool: the L → F mapping. Take two equal chars of L: can we map L's chars onto F's chars? We need to distinguish equal chars; rotate their rows rightward by one position: the rotated rows start with that char and keep the same relative order, so equal chars occur in the same relative order in L and in F. Rank(char, pos) and Select(char, pos) are the key operations used nowadays to compute this mapping.

The BWT is invertible. Two key properties: 1. the LF mapping sends each char of L to its copy in F; 2. L[i] precedes F[i] in T. So we can reconstruct T backward: start from the first row (the one beginning with #), whose last char is the symbol preceding # in T; output it, move via LF to the row starting with that char occurrence, and repeat (for mississippi# the first reconstructed chars are i, p, p, i, …). Several issues arise about efficiency in time and space.
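
A sketch of the backward reconstruction via the LF mapping (a stable sort of L stands in for the Rank/Select machinery of the previous slide; this version starts from the row whose last char is #):

```python
def inverse_bwt(L, eos="#"):
    """Invert the BWT: equal chars keep the same relative order in L and in F
    (the sorted L), so the k-th 'c' of L maps to the k-th 'c' of F."""
    n = len(L)
    order = sorted(range(n), key=lambda i: (L[i], i))   # L-positions in F-order
    LF = [0] * n
    for f_pos, l_pos in enumerate(order):
        LF[l_pos] = f_pos
    # The row whose last char is the end-marker is T itself; from there,
    # repeatedly output L[row] (the char preceding F[row] in T) and jump via LF.
    out, row = [], L.index(eos)
    for _ in range(n):
        out.append(L[row])
        row = LF[row]
    return "".join(reversed(out))

print(inverse_bwt("ipssm#pissii"))   # mississippi#
```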

You find this in your Linux distribution

Suffix Array construction

Data Compression What about achieving high-order entropy ?

Recall that: Compression ratio = #bits in output / #bits in input. Compression performance: we relate entropy to the compression ratio achieved, i.e. we compare |C(T)| against n H_0(T), or |C(T)|/|T| against H_0(T).

The empirical entropy H_k. H_k(T) = (1/|T|) ∑_{|ω|=k} |T[ω]| H_0(T[ω]), where T[ω] = string of symbols that precede the substring ω in T. Example: given T = "mississippi", T["is"] = ms. Compressing T up to H_k(T) ⇔ compressing each T[ω] up to its H_0. How much is this "operational"? Use Huffman or Arithmetic. The distinct substrings ω for H_2(T) are {i_ (1, p), ip (1, s), is (2, ms), pi (1, p), pp (1, i), mi (1, _), si (2, ss), ss (2, ii)}, so H_2(T) = (1/11) * [1 * H_0(p) + 1 * H_0(s) + 2 * H_0(ms) + 1 * H_0(p) + …].
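
A sketch following the slides' convention (T[ω] = symbols preceding each occurrence of the k-gram ω); the function names are mine, and the boundary contexts that the example pads with "_" are ignored here, which does not change the value in this case since those groups are singletons:

```python
import math
from collections import Counter, defaultdict

def h0(s):
    """0-th order empirical entropy of a string (bits per symbol)."""
    n = len(s)
    return sum((c / n) * math.log2(n / c) for c in Counter(s).values()) if n else 0.0

def hk(text, k):
    """k-th order empirical entropy: for each length-k context w, collect the
    symbols of text that precede an occurrence of w, then average their H_0
    weighted by the group sizes."""
    groups = defaultdict(list)
    for i in range(len(text) - k):
        groups[text[i + 1:i + 1 + k]].append(text[i])
    return sum(len(g) * h0("".join(g)) for g in groups.values()) / len(text)

T = "mississippi"
print(h0(T), hk(T, 2))   # about 1.82 and 0.18 = 2/11, matching the computation above
```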

BWT versus H_k. Bwt(T) groups together, for every context ω, the symbols that precede ω in T: they are the last chars of the block of rows beginning with ω (e.g., for T = mississippi#, T[ω=is] = "ms"). This block is a permutation of T[ω], and H_0 does not change under permutation!!! So compressing each such piece of the BWT up to its H_0 achieves ∑_{|ω|=2} |T[ω]| H_0(T[ω]) = |T| H_2(T): we have a workable way to approximate H_k via BWT-partitions.

Compression booster [J. ACM '05]. Let C be a compressor achieving H_0, e.g. |Arithmetic(s)| ≤ |s| H_0(s) + 2 bits. An interesting approach: compute bwt(T) and take the partition induced by the length-k contexts, then apply C to each piece. The space used is ∑_{|ω|=k} |C(T[ω])| ≤ ∑_{|ω|=k} (|T[ω]| H_0(T[ω]) + 2) ≤ |T| H_k(T) + 2 g_k, where g_k is the number of pieces. The partition depends on k, and the quality of the approximation of H_k(T) depends on C and on g_k. Operationally, the booster computes in O(n) time the optimal partition, i.e. the one minimizing the total compressed size ∑ |C(·)|, so that the H_k-bound holds simultaneously for all k ≥ 0.