Fundamental Structures of Computer Science Feb. 24, 2005 Ananda Guna Lempel-Ziv Compression

Recap

Huffman Trees Huffman trees can be used to construct an optimal prefix code. (What does optimal mean?) A greedy algorithm assembles the Huffman tree: locally optimal steps lead to a globally optimal code. It requires symbol frequencies, so we read the file twice – once for counting, once for encoding.
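
To make the greedy step concrete, here is a minimal Python sketch (not from the lecture; the function name and the example frequencies are made up, and at least two symbols are assumed) that computes code lengths by repeatedly merging the two lightest subtrees:

import heapq

def huffman_code_lengths(freqs):
    # Greedy: repeatedly merge the two lightest subtrees; every symbol in a
    # merged subtree ends up one level deeper, i.e., one code bit longer.
    # Heap entries: (weight, tiebreak, {symbol: code length so far}).
    heap = [(w, i, {sym: 0}) for i, (sym, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)          # dicts aren't comparable, so break ties by counter
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)
        w2, _, t2 = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**t1, **t2}.items()}
        heapq.heappush(heap, (w1 + w2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]             # symbol -> code length

print(huffman_code_lengths({'a': 45, 'b': 13, 'c': 12, 'd': 16, 'e': 9, 'f': 5}))
# a gets a 1-bit code; b, c, d get 3 bits; e, f get 4 bits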

Huffman Encoding Process

Adaptive Huffman or Dynamic Huffman Clearly, having to read the data twice (first for the frequency count, then for the actual compression) is a bit cumbersome. Perhaps the data is only available in blocks (streaming data). We can build an adaptive Huffman tree that adjusts itself as more frequency data become available.

Adaptive Huffman ctd. • Mapping from source messages to code words is based upon a running estimate of the source message probabilities • Change the tree to remain optimal for the current estimates • Adaptive Huffman codes respond to locality • Requires only a single pass over the data

Beating Huffman How about beating the compression achieved by Huffman? Impossible! It produces an optimal prefix code. Right. But who says we have to use a prefix code?

Dictionary-Based Compression

Dictionary-based methods • Here is a simple idea: • Keep track of “words” that we have seen, and replace them with a code number when we see them again. • We can maintain dictionary entries (word, code) and make additions to the dictionary as we read the input file.
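
As a toy, word-level illustration of this idea (not yet LZW; toy_dict_encode and the sample sentence are made up):

def toy_dict_encode(text):
    D = {}                         # word -> code
    out = []
    for word in text.split():
        if word in D:
            out.append(D[word])    # seen before: emit its code number
        else:
            D[word] = len(D)       # new word: remember it ...
            out.append(word)       # ... and emit it literally, once
    return out

print(toy_dict_encode("the cat sat on the mat the cat"))
# ['the', 'cat', 'sat', 'on', 0, 'mat', 0, 1]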

Lempel & Ziv (1977/78)

Fred Hacker’s algorithm… • Fred now knows what to do… • Put the whole file into the dictionary as a single entry: ( &lt;entire file&gt;, 1 ) • Transmit 1, done.

Right? • Fred’s algorithm provides excellent compression, but… • …the receiver does not know what is in the dictionary! • And sending the dictionary is the same as sending the entire uncompressed file. • Thus, we can’t decompress the “1”.

Hence… • …we need to build our dictionary in such a way that the receiver can rebuild the dictionary easily.

LZW Compression: The Byte Version

Byte method LZW • We start with a trie that contains a root and n children • one child for each possible character • each child labeled 0…n–1 • We compress as before, by walking down the trie • but, after emitting a code and growing the trie, we must start from the root’s child labeled c, where c is the character that caused us to grow the trie

LZW: Byte method example • Suppose our entire character set consists only of the four letters: {a, b, c, d} • Let’s consider the compression of the string baddad

Byte LZW: Compress example Compressing baddad, starting from the trie whose root has children a:0, b:1, c:2, d:3. Each time we fall out of the trie we emit a code and add one node:

read b, then a – ba is not in the trie: output 1, add node 4 (ba), restart at a
read d – ad is not in the trie: output 0, add node 5 (ad), restart at d
read d – dd is not in the trie: output 3, add node 6 (dd), restart at d
read a – da is not in the trie: output 3, add node 7 (da), restart at a
read d – ad is in the trie (node 5); end of input: output 5

Output: 1 0 3 3 5

Byte LZW output • So, the input baddad compresses to 1 0 3 3 5 • which again can be given in bit form, just like in the binary method… • …or compressed again using Huffman

Byte LZW: Uncompress example • The uncompress step for byte LZW is the most complicated part of the entire process, but is largely similar to the binary method

Byte LZW: Uncompress example Reading the codes 1 0 3 3 5 with the same initial dictionary a:0, b:1, c:2, d:3:

read 1 – output b
read 0 – output a; add entry 4 (ba)
read 3 – output d; add entry 5 (ad)
read 3 – output d; add entry 6 (dd)
read 5 – output ad; add entry 7 (da)

Output: baddad

LZW Byte method: An alternative presentation

Getting off the ground Suppose we want to compress a file containing only the letters a, b, c and d. It seems reasonable to start with the dictionary a:0 b:1 c:2 d:3. At least we can then deal with the first letter. And the receiver knows how to start.

Growing pains Now suppose the file starts like so: a b b a b b … We scan the a, look it up and output a 0. After scanning the b, we have seen the word ab. So, we add it to the dictionary a:0 b:1 c:2 d:3 ab:4

Growing pains We output a 1 for the b. Then we get another b. a b b a b b … output 1, and add bb to the dictionary a:0 b:1 c:2 d:3 ab:4 bb:5

So? Right, so far zero compression. (The next b likewise outputs a 1 and adds ba:6.) But now we get a followed by b, and ab is in the dictionary a b b a b b … so we output 4, and put abb into the dictionary … d:3 ab:4 bb:5 ba:6 abb:7

And so on Suppose the input continues a b b a b b b b a … We output 5 for bb, and put bbb into the dictionary … ab:4 bb:5 ba:6 abb:7 bbb:8

More Hits As our dictionary grows, we are able to replace longer and longer blocks by short code numbers. a b b a b b b b a … And we increase the dictionary at each step by adding another word.

More importantly • Since we extend our dictionary in such a simple way, it can be easily reconstructed on the other end. • Start with the same initialization, then • Read one code number after the other, look up each one in the dictionary, and extend the dictionary as you go along.

Again: Extending We scan a sequence of symbols a1 a2 a3 … ak where each prefix is in the dictionary. We stop when we fall out of the dictionary: a1 a2 a3 … ak b

Again: Extending We output the code for a1 a2 a3 … ak and put a1 a2 a3 … ak b into the dictionary. Then we set a1 = b and start all over.

Sort of Let's take a closer look at an example. Assume the alphabet {a,b,c}. The code for aabbaabb is 0 0 1 1 3 5. The decoding starts with dictionary a:0, b:1, c:2

Moving along The first 4 code words are already in D and produce output a a b b. As we go along, we extend D: a:0, b:1, c:2, aa:3, ab:4, bb:5 For the rest (codes 3 and 5) we get aa and bb.

Done We have also added to D: ba:6, aab:7 But these entries are never used. Everything is easy, since there is already an entry in D for each code number when we encounter it.

Is this it? Unfortunately, no. It may happen that we run into a code word without having an appropriate entry in D. But, it can only happen in very special circumstances, and we can manufacture the missing entry.

A Bad Run Consider input a a b b b a a ==> 0 0 1 5 3 After reading 0 0 1, D looks like this: a:0, b:1, c:2, aa:3, ab:4

Disaster The next code is 5, but it’s not in D. a:0, b:1, c:2, aa:3, ab:4 How could this have happened? Can we recover?

… narrowly averted This problem only arises when the input contains a substring … s ω s ω s … where s ω s was just added to the dictionary. Here s is a single symbol, but ω a (possibly empty) word.

… narrowly averted But then the fix is to output x + first(x) where x is the last decompressed word, and first(x) the first symbol of x. And, we also update the dictionary to contain this new entry.

Example In our example we had s = b, ω = empty. The output and new dictionary word is bb.

Another Example aabbbaabbaaabaababb ==> 0 0 1 5 3 6 7 9 5 Decoding (dictionary size: initial 3, final 11):

code   output   new word in D
0      a        –
0      a        aa
1      b        ab
5      bb       bb      (code not yet in D)
3      aa       bba
6      bba      aab
7      aab      bbaa
9      aaba     aaba    (code not yet in D)
5      bb       aabab

The problem cases

code   output   new word   position in D
0      a        –          –
0      a        aa         3
1      b        ab         4
5      bb       bb         5    (hard case: 5 was not yet in D)
3      aa       bba        6
6      bba      aab        7
7      aab      bbaa       8
9      aaba     aaba       9    (hard case: 9 was not yet in D)
5      bb       aabab      10

Old vs. new Ordinarily, we use an old dictionary word for the next code word. But sometimes we immediately use what was last added to the dictionary. But then it must be of the form s ω s, and we can still decompress.

Pseudo Code: Compression Initialize dictionary D to all words of length 1. Read all input characters: output code words from D, extend D whenever a new word appears. New code words: just an integer counter.

Less Pseudo
initialize D;
c = nextchar;          // next input character
W = c;                 // a string
while( c = nextchar ) {
    if( W+c is in D )  // still a dictionary word
        W = W + c;
    else {
        output code(W);
        add W+c to D;
        W = c;
    }
}
output code(W);
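
A runnable Python version of this sketch (lzw_compress and the default alphabet are illustrative names, not from the lecture):

def lzw_compress(text, alphabet="abcd"):
    # initialize D with all words of length 1: a:0, b:1, c:2, d:3, ...
    D = {ch: i for i, ch in enumerate(alphabet)}
    out = []
    W = text[0]
    for c in text[1:]:
        if W + c in D:           # still a dictionary word: keep extending
            W = W + c
        else:
            out.append(D[W])     # fell out: emit the code for W
            D[W + c] = len(D)    # new code word: just the next integer
            W = c
    out.append(D[W])             # emit the code for the final word
    return out

print(lzw_compress("baddad"))           # [1, 0, 3, 3, 5]
print(lzw_compress("aabbaabb", "abc"))  # [0, 0, 1, 1, 3, 5]

Both test strings reproduce the codes computed earlier in the lecture.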

Pseudo Code: Decompression Initialize dictionary D with all words of length 1. Read all code words and - output corresponding words from D, - extend D at each step. This time the dictionary is of the form ( integer, word ) Keys are integers, values words.

Less Pseudo
initialize D;
pc = nextcode;   // first code word
pw = word(pc);   // corresponding word
output pw;
First code word is easy: codes only a single symbol. Remember as pc (previous code) and pw (previous word).

More Less Pseudo
while( c = nextcode ) {
    if( c is in D ) {
        cw = word(c);
        pw = word(pc);
        ww = pw + first(cw);
        insert ww in D;
        output cw;
    }
    else {

The hard case
    else {
        pw = word(pc);
        cw = pw + first(pw);
        insert cw in D;
        output cw;
    }
    pc = c;
}
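
Putting both cases together, a runnable Python sketch (lzw_decompress is an illustrative name; the dictionary is keyed by integers, as above):

def lzw_decompress(codes, alphabet="abc"):
    # initialize D with all words of length 1, keyed by their codes
    D = {i: ch for i, ch in enumerate(alphabet)}
    pw = D[codes[0]]                  # first code word: a single symbol
    out = [pw]
    for c in codes[1:]:
        if c in D:                    # the easy case
            cw = D[c]
            D[len(D)] = pw + cw[0]    # insert pw + first(cw)
        else:                         # the hard case: c is not yet in D
            cw = pw + pw[0]           # manufacture pw + first(pw)
            D[len(D)] = cw
        out.append(cw)
        pw = cw
    return "".join(out)

print(lzw_decompress([0, 0, 1, 1, 3, 5]))           # aabbaabb
print(lzw_decompress([0, 0, 1, 5, 3]))              # aabbbaa
print(lzw_decompress([0, 0, 1, 5, 3, 6, 7, 9, 5]))  # aabbbaabbaaabaababb

The three calls replay the decoding examples above, including both hard cases.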

Implementation - Tries Tries are the best way to implement LZW. In the LZW situation, we can add the new word to the trie dictionary in O(1) steps after discovering that the string is no longer a prefix of a dictionary word: just add a new leaf below the last node touched.
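
A minimal trie sketch in Python (class and function names are made up), showing the O(1) leaf insertion and the restart at the root’s child:

class TrieNode:
    def __init__(self, code):
        self.code = code
        self.children = {}            # symbol -> TrieNode

def trie_compress(text, alphabet="abcd"):
    root = TrieNode(None)
    for i, ch in enumerate(alphabet): # one child per character, labeled 0..n-1
        root.children[ch] = TrieNode(i)
    next_code = len(alphabet)
    out, node = [], root
    for c in text:
        if c in node.children:        # still a prefix of a dictionary word
            node = node.children[c]
        else:
            out.append(node.code)     # emit, then grow: one new leaf, O(1)
            node.children[c] = TrieNode(next_code)
            next_code += 1
            node = root.children[c]   # restart from the root's child labeled c
    out.append(node.code)
    return out

print(trie_compress("baddad"))        # [1, 0, 3, 3, 5]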

LZW details In reality, one usually restricts the code words to be 12 or 16 bit integers. Hence, one may have to flush the dictionary every so often. But we won’t bother with this.
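
As a rough illustration (not from the lecture; pack12 is a made-up helper), fixed 12-bit codes could be packed into bytes like so, with the dictionary reset once it holds 2^12 entries:

MAX_CODES = 1 << 12                   # flush (reset) D when it reaches this size

def pack12(codes):
    # pack each code into a 12-bit field, then split the bitstring into bytes
    bits = "".join(format(c, "012b") for c in codes)
    bits += "0" * (-len(bits) % 8)    # pad to a whole number of bytes
    return bytes(int(bits[i:i+8], 2) for i in range(0, len(bits), 8))

print(pack12([1, 0, 3, 3, 5]).hex())  # 5 codes -> 60 bits -> 8 bytes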

LZW details Lastly, LZW generates as output a stream of integers. It makes perfect sense to try to compress these further, e.g., by Huffman.

Summary of LZW LZW is an adaptive, dictionary-based compression method. Encoding is easy in LZW, but uses a special data structure (a trie). Decoding is slightly more complicated, but requires no special data structures.