Lempel-Ziv methods.

Slides:



Advertisements
Similar presentations
More on Canonical Huffman coding. Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a As we have seen canonical Huffman coding allows.
Advertisements

Michael Alves, Patrick Dugan, Robert Daniels, Carlos Vicuna
Greedy Algorithms Amihood Amir Bar-Ilan University.
Arithmetic Coding. Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a How we can do better than Huffman? - I As we have seen, the.
An introduction to Data Compression
SIMS-201 Compressing Information. 2  Overview Chapter 7: Compression Introduction Entropy Huffman coding Universal coding.
Data Compressor---Huffman Encoding and Decoding. Huffman Encoding Compression Typically, in files and messages, Each character requires 1 byte or 8 bits.
Processing of large document collections
Algorithms for Data Compression
Lecture 6 Source Coding and Compression Dr.-Ing. Khaled Shawky Hassan
Lossless Compression - II Hao Jiang Computer Science Department Sept. 18, 2007.
Algorithm Programming Some Topics in Compression Bar-Ilan University תשס"ח by Moshe Fresko.
Lempel-Ziv Compression Techniques Classification of Lossless Compression techniques Introduction to Lempel-Ziv Encoding: LZ77 & LZ78 LZ78 Encoding Algorithm.
Lempel-Ziv Compression Techniques
Text Operations: Coding / Compression Methods. Text Compression Motivation –finding ways to represent the text in fewer bits –reducing costs associated.
1 Lempel-Ziv algorithms Burrows-Wheeler Data Compression.
CS336: Intelligent Information Retrieval
Algorithm Programming Some Topics in Compression
Lempel-Ziv-Welch (LZW) Compression Algorithm
Lempel-Ziv Compression Techniques
Source Coding Hafiz Malik Dept. of Electrical & Computer Engineering The University of Michigan-Dearborn
Data Compression Basics & Huffman Coding
Lossless Compression Multimedia Systems (Module 2 Lesson 3)
Data Compression Algorithms for Energy-Constrained Devices in Delay Tolerant Networks Christopher M. Sadler and Margaret Martonosi In: Proc. of the 4th.
Noiseless Coding. Introduction Noiseless Coding Compression without distortion Basic Concept Symbols with lower probabilities are represented by the binary.
Source Coding-Compression
Information and Coding Theory Heuristic data compression codes. Lempel- Ziv encoding. Burrows-Wheeler transform. Juris Viksna, 2015.
Page 110/6/2015 CSE 40373/60373: Multimedia Systems So far  Audio (scalar values with time), image (2-D data) and video (2-D with time)  Higher fidelity.
Huffman Encoding Veronica Morales.
Fundamental Structures of Computer Science Feb. 24, 2005 Ananda Guna Lempel-Ziv Compression.
Fundamental Data Structures and Algorithms Aleks Nanevski February 10, 2004 based on a lecture by Peter Lee LZW Compression.
1 Strings CopyWrite D.Bockus. 2 Strings Def: A string is a sequence (possibly empty) of symbols from some alphabet. What do we use strings for? 1) Text.
Basics of Data Compression Paolo Ferragina Dipartimento di Informatica Università di Pisa.
Cryptographic Attacks on Scrambled LZ-Compression and Arithmetic Coding By: RAJBIR SINGH BIKRAM KAHLON.
Multimedia Data Introduction to Lossless Data Compression Dr Sandra I. Woolley Electronic, Electrical.
Compression.  Compression ratio: how much is the size reduced?  Symmetric/asymmetric: time difference to compress, decompress?  Lossless; lossy: any.
The LZ family LZ77 LZ78 LZR LZSS LZB LZH – used by zip and unzip
Lossless Compression CIS 465 Multimedia. Compression Compression: the process of coding that will effectively reduce the total number of bits needed to.
Adaptive Huffman Coding. Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a Why Adaptive Huffman Coding? Huffman coding suffers.
Fundamental Data Structures and Algorithms Margaret Reid-Miller 24 February 2005 LZW Compression.
1 Chapter 7 Skip Lists and Hashing Part 2: Hashing.
Lecture 7 Source Coding and Compression Dr.-Ing. Khaled Shawky Hassan
Index construction: Compression of documents Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading Managing-Gigabytes: pg 21-36, 52-56,
Page 1KUT Graduate Course Data Compression Jun-Ki Min.
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval" Basics
Lampel ZIV (LZ) code The Lempel-Ziv algorithm is a variable-to-fixed length code Basically, there are two versions of the algorithm LZ77 and LZ78 are the.
15-853Page :Algorithms in the Real World Data Compression III Lempel-Ziv algorithms Burrows-Wheeler Introduction to Lossy Compression.
Chapter 11. Chapter Summary  Introduction to trees (11.1)  Application of trees (11.2)  Tree traversal (11.3)  Spanning trees (11.4)
Compression and Huffman Coding. Compression Reducing the memory required to store some information. Lossless compression vs lossy compression Lossless.
Lossless Compression-Statistical Model Lossless Compression One important to note about entropy is that, unlike the thermodynamic measure of entropy,
Page 1 Algorithms in the Real World Lempel-Ziv Burroughs-Wheeler ACB.
CSE 589 Applied Algorithms Spring 1999
Huffman Codes ASCII is a fixed length 7 bit code that uses the same number of bits to define each character regardless of how frequently it occurs. Huffman.
HUFFMAN CODES.
COMP261 Lecture 22 Data Compression 2.
Data Coding Run Length Coding
Data Compression.
Succinct Data Structures
Information and Coding Theory
Andrzej Ehrenfeucht, University of Colorado, Boulder
Applied Algorithmics - week7
Lempel-Ziv Compression Techniques
Lempel-Ziv-Welch (LZW) Compression Algorithm
Advanced Algorithms Analysis and Design
Lempel-Ziv Compression Techniques
Chapter 11 Data Compression
CSE 589 Applied Algorithms Spring 1999
Huffman Coding Greedy Algorithm
CPS 296.3:Algorithms in the Real World
Presentation transcript:

Lempel-Ziv methods

Dictionary models - I Dictionary-based compression methods use the principle of replacing substrings in a message with a codeword that identifies each substring in a dictionary, or codebook The dictionary contains a list of substrings and their associated codewords Unlike symbolwise methods, dictionary methods often use fixed codewords rather than explicit probability distribution Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2004-2005

Dictionary models - II For example, we can insert into the dictionary the full set of 8-bit ASCII characters How many? and the 256 most common pairs of characters If we use fixed length codeword, how many bits does we need to index dictionary entries? SOL. 9 bits What about the performances in bits/character in the best and in the worst case? SOL. best:4.5b/char worst:9b/char!! Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2004-2005

Dictionary models - III Another possibility is to use longer words in the dictionary, perhaps common words like the or and or common components of words like tion. This strings are the phrases of the dictionary A dictionary with a predefined set of phrases does not achieve good compression Performances are better if we tune the dictionary on input source, i.e. if we loose input indipendence Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2004-2005

Dictionary models - IV For istance common phrases for an italian sport newspaper are very rare in a business management book To avoid the problem of dictionary being unsuitable for the text at hand we can build a new dictionary for each message to be compressed.... .... but there is a significant overhead for transmitting and storing it Deciding the size of the dictionary in order to maximize compression is a very difficult problem Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2004-2005

The Lempel-Ziv methods The only efficient solution to the problem is to use an adaptive dictionary scheme Pratically all adaptive dictionary compression methods are based on one of the two methods developed by two israely researchers, Abraham Lempel and Jacob Ziv in 1977 e 1978, and called LZ77 and LZ78 "A Universal Algorithm for Sequential Data Compression" in the IEEE Transactions on Information Theory, May 1977 Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2004-2005

The key idea - I The key insight of the method is that it is possible to automatically build a dictionary of previously seen strings in the text being compressed The prior text makes a very good dictionary, since it has usually the same style and language of the upcoming text Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2004-2005

The key idea - II The dictionary does not have to be transmitted with the compressed text, since the decompressor can build it the same way the compressor does The many variants of Lempel-Ziv methods differ in how pointers are represented and in the limitations on what the pointers are able to refer to The presence of so many variants is also caused by same patents, and by the disputes over patenting Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2004-2005

The LZ77 family Quite easy to implement Fast decoding with little use of memory The output of the encoding consists of a series of triples the first component indicates how far back to look in the previously decoded text the second component is the length of the phrase the third is next character for the input Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2004-2005

An example - encoding alphabet {a,b} a b a a b a b aabb <0,0,a> Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2004-2005

An example - decoding <0,0,x><0,0,y><2,1,z><2,1,x><5,3,z> <6,3,z><5,2,z> SOL. x y xz xx yxzz xxyz zxz Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2004-2005

A recursive example <0,0,a><0,0,c><2,1,a><4,2,b><1,10,a> Despite the recursive references, each character is available when needed a c aa acb ?? bbbbbbbbbba Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2004-2005

Further details on LZ77 LZ77 algorithm places limitations on how far back a pointer can refer (i.e. on the length of the first component of the triple) and on the maximum size of the string referred to (i.e. on the length of the second component) For example, in English text there is no gain in using a sliding windows of more than a few thousand characters We can use a windows of 8.192 characters, i.e. 13 bits Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2004-2005

Further details on LZ77 At the same time, the length of the match is rarely over 16 characters, so the extra cost to allow longer match usually is not justified Exercise: encode the sequence 010020$0110$$0111 with a sliding window of 7 symbols and a maximal match length of 3. Calculate the compression ratio SOL. <0,0,0><0,0,1><2,1,0><0,0,2><2,1,$><7,2,1><5,2,$>,<6,3,1> 0000000 0000001 0100100 0000010 0100111 1111001 1011011 1101101 C=(17*2)/7*8=0.607 << 1!!!

LZ77 - encoding Encode the text S[1..N] using LZ77, with a sliding window of W characters p=1 WHILE p<N { search for the longest match for S[p...] in S[p-W ... p-1]. Suppose the match occurs at position p-m, with length l output the triple < p-m, l, S[p+l ] > p=p + l + 1 } Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2004-2005

LZ77 - decoding Decode the text S[1..N] using LZ77, with a sliding window of W characters p=1 FOREACH triple < f, l, c> { S[p ... p + l - 1] = S[ p - f ... p – f + l - 1] Suppose the match occurs at position p-m, with length l S[p+l ] = c p=p + l + 1 } Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2004-2005

LZ77 - improvements The LZ77 has been gradually refined first component of the triple: it is useful to use variable length, assigning shorter codewords to recent matches (that are more common) second component of the triple: variable length codes that uses less bits to represent smaller numbers third component of the triple: in some variants it is added only when needed (when?), with a 1-bit flag to indicate the presence or the absence of this third component Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2004-2005

gzip algorithm - I gzip is one of the more effective variants of LZ77 It is distributed by the Gnu Free Software Foundation home page of gzip project: www.gzip.org Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2004-2005

gzip algorithm - II gzip uses a hash tables: the next 3 characters to be coded are hashed, and the return value is used as index to lookup a table entry This entry is the head of a list that contains the places of occurrence of the 3 characters in the window The list is searched for the longest match If there is no match the string is coded as raw characters Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2004-2005

gzip algorithm - III If the match exists, we have a length and a distance, otherwise we have a zero length and a raw character the sliding window has dimension W=32KB, lengths are limited to 258 bytes List lengths are limited to avoid time consuming researches tradeoff accuracy/time  user’s choice Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2004-2005

gzip algorithm - IV Lengths, distances and raw characters are coded with two Huffman trees, one for distances and the other for lengths and raw characters Huffman codes are generated processing blocks of up to 64KB (with canonical Huffman) so gzip it is not really one-pass. From a pratical point of view it is one-pass because blocks are small, so they are read only one time and kept in main memory Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2004-2005

gzip - example abacbcaab <length, distance/character> <0,a><0,b><1,2><0,c><1,3><1,2><1,4><2,7> Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2004-2005

gzip – best compression abacbcaab <length, distance/character> <0,a><0,b><1,2><0,c><1,3><1,2><1,4><2,7> two solutions ab + caab a + bcaab The first is greedy, it uses longest possible match. But sometimes the second is better. If best compression is selected, gzip takes a longer time but chooses the best of the two, eventually coding raw characters even if matches are possible, if it gives better compression in the long run ...... abcaab

LZ78 family it has restrictions on which substring can be referenced (but this avoids some inefficiency) decoding is slower than LZ77 and require more memory does not have a window to limit how far back substring can be referenced one of its variant, LZW, is widely used in many popular compression systems Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2004-2005

Referentiable strings The text prior to the current coding position is parsed in substrings, and only parsed phrases can be referenced Previous phrases are numbered in sequence, and the output is a list of pairs <previous phrase, next character> This unseen combination is stored as a new phrase Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2004-2005

An example a b a a b a b a a Phrases Output 0 <null> <0,a> Only this phrases can be referenced This avoids the inefficiency of having more than one coded representation for the same string, as usual in LZ77

How to store the phrases It is crucial for algorithm efficiency, to store the phrases in a clever way This can be obtained using a trie a b Phrases 0 <null> 1 2 1 a a a 2 b 3 4 3 aa a b 4 ba 5 baa 5 6

How to store the phrases The character of each phrase specify a path from the root to a leaf The character to be encoded are used to traverse the trie until the path is blocked The last node contains the phrase number to output A new node is added with next input character, to form a new phrase a b 1 2 a a 3 4 a b 5 6 b 7 baab <5,b>

A problem The trie data structure continues to grow during coding, and eventually growth must be stopped to avoid an eccessive use of memory There are various strategies the trie can be reinitialized from scratch the trie can be used as is, without further updates the trie can be partially rebuild using last part of the text (this avoids the penalties of starting form scratch) Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2004-2005

LZ78 vs. LZ77 LZ78 encoding can be faster LZ78 decoding is slower because the decoder must also store the parsed phrases Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2004-2005

LZ78 - exercise Code the sequence with LZ78 e show the trie that store the phrases 0100101110101001011 SOL. <0,0><0,1><1,0><2,0><2,1><4,1><1,1><3,1><7,1> 1 PHRASES 0 <null> 1 0 2 1 3 00 4 10 5 11 6 101 7 01 8 001 9 011 1 2 1 1 3 7 4 5 1 1 1 8 9 6

LZW variant - I One of the most popular variants of Lempel-Ziv coding (Welch 1984) It forms the basis for Unix utility compress and many other popular compressors The main difference between LZW and LZ78 is that LZW encodes only phrase numbers without any ending characters This scheme works fine because we initialize the dictionary with a phrase for each character of the source alphabet (e.g. the 256 characters of the 8-bit ASCII) Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2004-2005

LZW variant - II A new phrase is constructed from a coded one by appending the first character of the next phrase Suppose we use 7-bit ASCII: dictionary is initialized with 128 phrases (0-127) Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2004-2005

LZW - encoding input: a b a ab ab ba aba abaa output: 97 98 97 128 128 129 131 134 new phrases added 128 ab 129 ba 130 aa 131 aba 132 abb 133 baa 134 abaa Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2004-2005

? LZW - decoding input: output: a b a ab ab ba aba aba? abaa 97 98 97 128 128 129 131 134 output: a b a ab ab ba aba aba? abaa it is not ready!! ? new phrases added 128 129 130 131 132 133 134 ab a? b? ba a? aa aba ab? ab? abb baa ba? aba? abaa abaa? The delay in phrase creation is not a problem unless the encoder uses the phrase immediately after its creation. In this case, if decoder inserts the phrases only when they are completed, cannot decode, because it doesn’t have phrase 134

LZW - exercise Hint. 048, 149, 250, $36 Code with LZW the sequence 0102$00$10111$02$ using 8-bit ASCII codes Hint. 048, 149, 250, $36 SOL. 48 49 48 50 36 48 48 36 257 49 265 260 259 PHRASES 256 01 257 10 258 02 259 2$ 260 $0 261 00 262 0$ 263 $1 264 101 265 11 11$ 267 $02 268 2$?

Lempel-Ziv methods: summary <pointer,length,character> gzip <length, distance/character> LZ78 <phrase,character> LZW <phrase> Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2004-2005

Lempel-Ziv methods: exercise Code the message using all studied methods abbb010cac0bb0abb10111b1a using a sliding window of 7 bits and a max match length of 7. No limit is given with respect to the dictionary dimension You can use the following 8-bit ASCII codes a97 b98 c99 048 149 Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2004-2005