
15-211 Fundamental Data Structures and Algorithms
LZW Compression
Aleks Nanevski, February 10, 2004, based on a lecture by Peter Lee

Last Time…

Problem: data compression
- Convert a string into a shorter string.
- Lossless: represents exactly the same information.
- Lossy: approximates the original information.
- Uses of compression: images over the web (JPEG), music (MP3), general-purpose (ZIP, GZIP, JAR, ...)

Huffman trees

Huffman's algorithm
- Huffman's algorithm gives the optimal prefix code.
- For a nice online demo, see

Huffman compression
- Huffman trees provide a straightforward method for file compression.
  1. Scan the file and compute frequencies
  2. Build the code tree
  3. Write the code tree to the output file as a header
  4. Scan the input, encode, and write into the output file
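As a sketch of how these four steps might look in code (Python here; the function names and the heapq-based tree representation are my own illustration, not the course's implementation):

import heapq
from collections import Counter

def build_code_tree(text):
    # Step 1: scan the input and compute character frequencies.
    freq = Counter(text)
    # Step 2: repeatedly merge the two least frequent subtrees.
    # A tree is either a leaf character or a (left, right) pair;
    # the middle tuple element is a tiebreaker for the heap.
    heap = [(f, i, ch) for i, (ch, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, count, (t1, t2)))
        count += 1
    return heap[0][2]

def code_table(tree, prefix=""):
    # Left edges emit '0', right edges emit '1'.
    if isinstance(tree, str):                  # leaf
        return {tree: prefix or "0"}
    left, right = tree
    table = code_table(left, prefix + "0")
    table.update(code_table(right, prefix + "1"))
    return table

codes = code_table(build_code_tree("abracadabra"))
encoded = "".join(codes[ch] for ch in "abracadabra")

Steps 3 and 4 then amount to serializing the tree as a header and writing out the encoded bits.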

Huffman decompression
- Read the header in the compressed file, and build the code tree
- Read the rest of the file, decode using the tree
- Write to output

Beating Huffman
- How about doing better than Huffman!
- Impossible! Huffman's algorithm gives the optimal prefix code!
- Right. But who says we have to use a prefix code?

Example
- Suppose we have a file containing
  abcdabcdabcdabcdabcdabcd… abcdabcd
- This could be expressed very compactly as
  abcd^1000
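For instance (a Python illustration of the idea, not part of the original slides): the whole 4000-character file can be regenerated from a 4-character pattern and a count.

pattern, count = "abcd", 1000
data = pattern * count            # the original 4000-character file
compact = (pattern, count)        # the "abcd^1000" representation
assert data == compact[0] * compact[1]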

Dictionary-Based Compression

Dictionary-based methods
- Here is a simple idea: keep track of "words" that we have seen, and replace them with a code number when we see them again.
- The code is typically shorter than the word.
- We can maintain dictionary entries (word, code) and make additions to the dictionary as we read the input file.

Lempel & Ziv (1977/78)

Fred Hacker's algorithm…
- Fred now knows what to do…
- Create the dictionary with a single entry: (the entire file, 1)
- Transmit 1, done.

Right?
- Fred's algorithm provides excellent compression, but…
- …the receiver does not know what is in the dictionary!
- And sending the dictionary is the same as sending the entire uncompressed file.
- Thus, we can't decompress the "1".

Hence…
…we need to build our dictionary in such a way that the receiver can rebuild the dictionary easily.

LZW Compression: The Binary Version
LZW is a variant of Lempel-Ziv compression, due to Terry Welch (1984).

Maintaining a dictionary
- We need a way of incrementally building up a dictionary during compression in such a way that…
- …someone who wants to uncompress can "rediscover" the very same dictionary.
- And we already know that a convenient way to build a dictionary incrementally is to use a trie.

Binary LZW
- In this method, we build up binary tries.
- In a binary trie, each internal node has two children.
- In addition:
  - each left edge is marked 0
  - each right edge is marked 1
  - each leaf has a label from the set {0,…,n}

A binary trie

Binary LZW: Compression
1. We start with a binary trie consisting of a root node and two children: the left child labeled 0, and the right labeled 1.
2. We read the bits of the input file, and follow the trie.
3. When a leaf is reached, we emit the label at the leaf.
4. Then, add two new children to that leaf (converting it into an internal node).

Binary LZW: Compression, pt. 2
5. The new left child takes the old label.
6. The new right child takes a new label value that is one greater than the current maximum label value.
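A minimal Python sketch of these steps (the representation and function name are my own illustration): the trie is nested dicts, and a leaf is simply its integer label.

def binary_lzw_compress(bits):
    # Root with two leaf children labeled 0 and 1.
    trie = {"0": 0, "1": 1}
    next_label = 2
    output = []
    node = trie
    for b in bits:
        child = node[b]
        if isinstance(child, int):
            # Reached a leaf: emit its label, then split the leaf.
            # The old label moves to the left child; the right child
            # gets the next unused label.
            output.append(child)
            node[b] = {"0": child, "1": next_label}
            next_label += 1
            node = trie            # restart at the root
        else:
            node = child
    # Note: the input may end while we are mid-trie; see the
    # termination problem discussed below.
    return output

binary_lzw_compress("0011")        # for example, yields [0, 2, 1]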

Binary LZW: Compression example
[Seven animation frames stepped through the input bits, the growing trie, and the emitted labels; the trie figures did not survive in this transcript.]

Binary LZW output
- So from the input we get a sequence of labels as output. [The concrete bit string and label sequence were shown on the slide but are missing from this transcript.]
- To represent this output we can keep track of the number of labels n each time we emit a code, and use log(n) bits for that code.
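As an illustration of the growing code width (my own sketch; it assumes, as in the binary scheme above, that each emitted code adds exactly one new label):

from math import ceil, log2

def pack_codes(codes, initial_labels=2):
    # Write each code with ceil(log2(n)) bits, where n is the number
    # of labels in the dictionary at the moment the code is emitted.
    bits = []
    n = initial_labels
    for c in codes:
        width = max(1, ceil(log2(n)))
        bits.append(format(c, "0%db" % width))
        n += 1    # each emitted code creates one new label
    return "".join(bits)

pack_codes([0, 2, 1])   # '0' + '10' + '01' = '01001'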

Binary LZW output
- We started with a short input and encoded it as a sequence of labels; written out in bits, the result is longer than the input.
- This looks like an expansion instead of a compression.
- But what if we have a larger input, with more repeating sequences? Try it!

Binary LZW output
- One can also use Huffman compression on the output…

Binary LZW termination
- Note that binary LZW has a serious problem: the input might end while we are in the middle of the trie (instead of at a leaf node).
- This is a nasty problem, which is why we won't use this binary method.
- But this is still good for illustration purposes…

Binary LZW: Uncompress
- To uncompress, we need to read the compressed file and rebuild the same trie as we go along.
- To do this, we need to maintain the trie and also the maximum label value.
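A sketch of this rebuilding in Python (my own illustration). Instead of an explicit trie, it keeps, for each label, the bit string at the leaf currently carrying that label; reading a label outputs that string and then splits the leaf exactly as the compressor did.

def binary_lzw_decompress(codes):
    strings = {0: "0", 1: "1"}    # label -> bit string at its leaf
    next_label = 2
    out = []
    for c in codes:
        p = strings[c]
        out.append(p)
        # Mirror the compressor's growth step: the old label moves
        # to the left child, a fresh label goes to the right child.
        strings[c] = p + "0"
        strings[next_label] = p + "1"
        next_label += 1
    return "".join(out)

binary_lzw_decompress([0, 2, 1])   # '0011'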

Binary LZW: Uncompress example
[Seven animation frames stepped through the code numbers and the rebuilt trie; the trie figures did not survive in this transcript.]

LZW Compression: The Byte Version

Byte method
- The binary LZW method doesn't really work; we showed it for illustrative purposes.
- Instead, we use a slightly more complicated version that works on bytes or characters.
- We can think of each byte as a "character" in the range {0…255}.

Byte method trie
- Instead of a binary trie, we use a more general trie in which:
  - each node can have up to n children (where n is the size of the alphabet), one for each byte/character
  - every node (not just the leaves) has an integer label from the set {0…m}, for some m, except the root node, which has no label

Byte method LZW
- We start with a trie that contains a root and n children, one child for each possible character, labeled 0 … n-1.
- We compress as before, by walking down the trie.
- But, after emitting a code and growing the trie, we must restart from the root's child for the character c that caused us to grow the trie.

LZW: Byte method example
- Suppose our entire character set consists only of the four letters {a, b, c, d}.
- Let's consider the compression of the string baddad.

Byte LZW: Compress example
[Six animation frames; the trie figures are not preserved. Reconstructed from the slides, the steps for baddad, starting from the dictionary a:0 b:1 c:2 d:3, are:]
- read b, then a: ba is not in the dictionary; output 1, add ba:4
- read d: ad is not in the dictionary; output 0, add ad:5
- read d: dd is not in the dictionary; output 3, add dd:6
- read a: da is not in the dictionary; output 3, add da:7
- read d: ad is in the dictionary; the input ends, so output 5
Output: 1 0 3 3 5

Byte LZW output
- So, the input baddad compresses to 1 0 3 3 5,
- which again can be given in bit form, just like in the binary method…
- …or compressed again using Huffman.

Byte LZW: Uncompress example
- The uncompress step for byte LZW is the most complicated part of the entire process, but is largely similar to the binary method.

Byte LZW: Uncompress example
[Six animation frames; the trie figures are not preserved. Reconstructed, the decoding of 1 0 3 3 5 proceeds:]
- read 1: output b
- read 0: output a (so far: ba), add ba:4
- read 3: output d (bad), add ad:5
- read 3: output d (badd), add dd:6
- read 5: output ad (baddad), add da:7

LZW Byte method: An alternative presentation

Getting off the ground Suppose we want to compress a file containing only letters a, b, c and d. It seems reasonable to start with a dictionary a:0 b:1 c:2 d:3 At least we can then deal with the first letter. And the receiver knows how to start.

Growing pains Now suppose the file starts like so: a b b a b b … We scan the a, look it up and output a 0. After scanning the b, we have seen the word ab. So, we add it to the dictionary a:0 b:1 c:2 d:3 ab:4

Growing pains We already scanned the first b. a b b a b b … Then we get another b. We output a 1 for the first b, and add bb to the dictionary a:0 b:1 c:2 d:3 ab:4 bb:5

So? Right, so far zero compression. We already scanned the second b. a b b a b b … After scanning a, we output 1 for the b, and put ba in the dictionary … d:3 ab:4 bb:5 ba:6 Still zero compression.

But now… We already scanned a. a b b a b b … We scan the next b, and ab : 4 is in the dictionary. We scan the next b, output 4, and put abb into the dictionary. … d:3 ab:4 bb:5 ba:6 abb:7 We got compression, because 4 is shorter than ab.

We already scanned the last b: a b b a b b … Suppose the input continues a b b a b b b b a … We scan the next b, and bb:5 is in the dictionary. We scan the next b, output 5, and put bbb into the dictionary. … ab:4 bb:5 ba:6 abb:7 bbb:8 And so on.

More Hits As our dictionary grows, we are able to replace longer and longer blocks by short code numbers. a b b a b b b b a … And we increase the dictionary at each step by adding another word.

Summary We scan a sequence of symbols a_1 a_2 a_3 … a_k, where each prefix is in the dictionary. We stop when we fall out of the dictionary: a_1 a_2 a_3 … a_k b.

Summary (cont'd) We output the code for a_1 a_2 a_3 … a_k and put a_1 a_2 a_3 … a_k b into the dictionary. Then we set a_1 = b and start all over.

More importantly
- Since we extend our dictionary in such a simple way, it can be easily reconstructed on the other end.
- Start with the same initialization, then read one code number after the other, look up each one in the dictionary, and extend the dictionary as you go along.

Sort of Let's take a closer look at an example. Assume alphabet {a,b,c}. The code for aabbaabb is 0 0 1 1 3 5. The decoding starts with dictionary D: 0:a, 1:b, 2:c.

Moving along The first 4 code words are already in D and produce output a a b b. As we go along, we extend D: 0:a, 1:b, 2:c, 3:aa, 4:ab, 5:bb. For the code numbers 3 and 5, we get aa and bb.

Done We have also added to D: 6:ba, 7:aab But these entries are never used. Everything is easy, since there is already an entry in D for each code number when we encounter it.

Is this it? Unfortunately, no. It may happen that we run into a code word without having an appropriate entry in D. But it can only happen in very special circumstances, and we can manufacture the missing entry.

A Bad Run Consider input a a b b b a a ==> 0 0 1 5 3. After reading 0 0 1, we output a a b and extend D with codes for aa and ab: 0:a, 1:b, 2:c, 3:aa, 4:ab.

Disaster We have read 0 0 1 from the input. The dictionary is 0:a, 1:b, 2:c, 3:aa, 4:ab. The next code number to read is 5, but it's not in D. How could this have happened? Can we recover?

… narrowly averted This problem only arises when, on the compressor end, the input contains a substring …sαsαs…: the compressor read sα, output code c for sα, and added c+1: sαs to the dictionary. Here s is a single symbol, and α a (possibly empty) word.

… narrowly averted (pt. 2) On the decompressor end, D contains c: sα but does not contain c+1: sαs. The decompressor has already output x = sα and is now looking at the unknown code number c+1.

… narrowly averted (pt. 3) But then the fix is to output x + first(x), where x is the last decompressed word and first(x) is the first symbol of x. Because x = sα was already output, we get the required sαsαs. We also update the dictionary to contain the new entry x + first(x) = sαs.

Example In our example we have read 0 0 1 from the input. The last decompressed word is b, and the next code number to read is 5. Thus s = b and α = empty. The next word to output and add to D is sαs = bb.

Summary Let x be the last decompressed word. Ordinarily, D contains a word y matching the input code number. We output y and extend D with x + first(y). But sometimes we must immediately use x. Then it must be that x = sα, and we output x + first(x) = sαs.

Example (extended) Decoding 0 0 1 5 3 6 7 9 5, the code for aabbbaabbaaabaababb:

Input   Output   add to D
0       a        -
0       a        3:aa
1       b        4:ab
5       bb       5:bb    (hard case: 5 not yet in D)
3       aa       6:bba
6       bba      7:aab
7       aab      8:bbaa
9       aaba     9:aaba  (hard case: 9 not yet in D)
5       bb       10:aabab

Pseudo Code: Compression Initialize dictionary D to all words of length 1. Read all input characters: output code numbers from D, extend D whenever a new word appears. New code words: just an integer counter.

Less Pseudo

initialize D;
c = nextchar;          // next input character
W = c;                 // a string
while ( c = nextchar ) {
    if ( W+c is in D )           // dictionary
        W = W + c;
    else {
        output code(W);
        add W+c to D;
        W = c;
    }
}
output code(W);
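A direct Python transcription of this pseudocode (a sketch; the function name is mine, and it is seeded with the alphabet {a,b,c} of the running example):

def lzw_compress(data, alphabet="abc"):
    d = {ch: i for i, ch in enumerate(alphabet)}   # word -> code
    out = []
    w = data[0]
    for c in data[1:]:
        if w + c in d:
            w = w + c             # keep extending the current word
        else:
            out.append(d[w])      # fell out of the dictionary:
            d[w + c] = len(d)     # emit the code, add the new word,
            w = c                 # and restart with the new symbol
    out.append(d[w])
    return out

lzw_compress("aabbbaabbaaabaababb")   # [0, 0, 1, 5, 3, 6, 7, 9, 5]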

Pseudo Code: Decompression Initialize dictionary D with all words of length 1. Read all code numbers and - output corresponding words from D, - extend D at each step. This time the dictionary is of the form ( integer, word ) Keys are integers, values words.

Less Pseudo

initialize D;
pc = nextcode;         // first code number
x = word(pc);          // corresponding word
output x;

The first code number is easy: it codes only a single symbol. Remember it as pc (previous code) and its word as x (previous word).

More Less Pseudo

while ( c = nextcode ) {
    if ( c is in D ) {           // the ordinary case
        y = word(c);
        ww = x + first(y);
        insert ww in D;
        output y;
    }
    else {                       // the hard case: c is not yet in D
        y = x + first(x);
        insert y in D;
        output y;
    }
    pc = c;
    x = y;
}
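And a Python transcription of the decompressor, including the hard case (again a sketch, with names of my own choosing); running it on the code stream above reproduces the extended example:

def lzw_decompress(codes, alphabet="abc"):
    d = {i: ch for i, ch in enumerate(alphabet)}   # code -> word
    x = d[codes[0]]        # the first code denotes a single symbol
    out = [x]
    for c in codes[1:]:
        if c in d:         # ordinary case
            y = d[c]
        else:              # hard case: c is the entry about to be
            y = x + x[0]   # created, so it must denote x + first(x)
        d[len(d)] = x + y[0]    # extend D with x + first(y)
        out.append(y)
        x = y
    return "".join(out)

lzw_decompress([0, 0, 1, 5, 3, 6, 7, 9, 5])   # 'aabbbaabbaaabaababb'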

One more detail… One detail remains: how to build the dictionary for compression (decompression is easy). We need to be able to scan through a sequence of symbols and check if they form a prefix of a word already in the dictionary. We use tries for dictionaries.

Tries!
[Figure: the trie for the baddad example. The root's children a, b, c, d carry labels 0, 1, 2, 3; the added nodes carry labels 4 (under b, edge a), 5 (under a, edge d), and 6 (under d, edge d).]
Corresponds to dictionary a:0 b:1 c:2 d:3 ba:4 ad:5 dd:6.

Tries In the LZW situation, we can add the new word to the trie dictionary in O(1) steps after discovering that the string is no longer a prefix of a dictionary word. Just add a new leaf to the last node touched.
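A sketch of the compressor built on an explicit trie (class and function names are my own), where extending the dictionary really is a constant-time leaf insertion:

class TrieNode:
    def __init__(self, code):
        self.code = code        # code number of the word ending here
        self.children = {}      # one child per character

def lzw_compress_trie(data, alphabet="abcd"):
    root = TrieNode(None)
    for i, ch in enumerate(alphabet):     # seed the single characters
        root.children[ch] = TrieNode(i)
    next_code = len(alphabet)
    out = []
    node = root
    for c in data:
        if c in node.children:            # still a dictionary prefix
            node = node.children[c]
        else:
            out.append(node.code)         # emit code for the prefix
            node.children[c] = TrieNode(next_code)   # O(1) leaf insert
            next_code += 1
            node = root.children[c]       # restart at the child for c
    out.append(node.code)
    return out

lzw_compress_trie("baddad")   # [1, 0, 3, 3, 5]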

LZW details In reality, one usually restricts the code words to be 12- or 16-bit integers. Hence, one may have to flush the dictionary every so often (i.e., proceed to compress the rest of the input with a freshly initialized dictionary). But we won't bother with this.

LZW details Lastly, LZW generates as output a stream of integers. It makes perfect sense to try to compress these further, e.g., by Huffman.

Summary of LZW LZW is an adaptive, dictionary-based compression method. Encoding is easy in LZW, but uses a special data structure (a trie). Decoding is slightly more complicated, but requires no special data structures.