Fundamental Structures of Computer Science March 23, 2006 Ananda Guna Lempel-Ziv Compression

In this lecture
- Recap of Huffman
  - Frequency based
- LZW Compression
  - Dictionary based
  - Compression
  - Decompression
- Lossy Compression
  - Singular value decomposition

Compression so far
- Devise an algorithm that encodes characters according to their frequencies
- That is, low-frequency characters get longer codes and high-frequency characters get shorter codes
- This is the idea of the Huffman algorithm

Huffman Compression Process
Example file:  abbabbcbdba cddcccddaaa cccbbaadddd bbdbbabbccb
1. Count frequencies:  a: 10   b: 20   c: 35   d: 40   (total 105)
2. Build the Huffman tree from the counts (root weight 105).
3. Read off the code table:  a: 000   b: 001   c: 01   d: 1
4. Write the header (shown on the slide as "12a30b31c21d…"), then write the encoded data.
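A quick numeric check of this code table (a sketch in Python, not from the original slides; it assumes the frequency counts shown above):

  # Average code length under the example table (a:10, b:20, c:35, d:40).
  freq = {'a': 10, 'b': 20, 'c': 35, 'd': 40}
  code = {'a': '000', 'b': '001', 'c': '01', 'd': '1'}
  total_bits = sum(freq[ch] * len(code[ch]) for ch in freq)   # 200 bits
  print(total_bits / sum(freq.values()))  # about 1.90 bits/char instead of 8

So the example data needs roughly 200 bits instead of 840, a better-than-4x reduction before header overhead.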

Huffman Decompression Process
1. Read the header ("12a30b31c21d…") and rebuild the code table:  a: 000   b: 001   c: 01   d: 1
2. Read the encoded data and decode it against the table.
3. Result: the original file  abbabbcbdba cddcccddaaa cccbbaadddd bbdbbabbccb

Questions about Huffman
- Is the Huffman tree unique?
- How do we know we get the optimal compression using a Huffman tree?
- What are the compression ratios of the following files when Huffman compression is applied? (ignore header info; two cases are checked in the sketch below)
  - 1 MB file with all the same character
  - 1 MB file made up of only two distinct characters
  - 1 MB file with 4 distinct characters, all with the same probability
  - 1 MB file with ASCII characters randomly distributed
- Is Huffman the only way to compress a file?
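Back-of-envelope answers for two of these (a hedged sketch; assumes 8-bit characters and ignores the header):

  # Rough compression ratios for two of the quiz cases.
  MB = 2**20
  bits_two_chars  = MB * 1   # 2 distinct chars -> 1-bit codes
  bits_four_chars = MB * 2   # 4 equally likely chars -> 2-bit codes
  print(bits_two_chars  / (MB * 8))   # 0.125: compressed to 1/8 of original
  print(bits_four_chars / (MB * 8))   # 0.25:  compressed to 1/4 of original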

Dictionary-Based Compression

Dictionary-based methods
- The idea is simple: keep track of "words" we have seen and assign each one a unique code; when we see a word again, simply replace it with its code.
- When we see new "words", expand the dictionary by adding them.
- We can maintain dictionary entries (word, code) and make new additions to the dictionary as we read the input file (see the toy sketch below).
- Selecting a data structure:
  - What data structures are good for dictionaries?
  - What data structure is good if we don't know the dictionary's words in advance?
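To make the idea concrete before LZW proper, here is a toy word-level sketch (illustration only, not the course's algorithm): each new word gets the next free code, and repeats are replaced by their codes.

  # Toy word-level dictionary coder (illustration, not LZW itself).
  def toy_encode(words):
      dictionary = {}                           # word -> code
      out = []
      for w in words:
          if w not in dictionary:
              dictionary[w] = len(dictionary)   # new word: assign next free code
          out.append(dictionary[w])
      return out, dictionary

  print(toy_encode("the cat and the hat and the cat".split()))
  # ([0, 1, 2, 0, 3, 2, 0, 1], {'the': 0, 'cat': 1, 'and': 2, 'hat': 3})

Note that a receiver could not decode this without also getting the dictionary; LZW's trick, below, is to grow the dictionary so predictably that the receiver can rebuild it on its own.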

Lempel & Ziv (1977/78) LZW Compression

Lempel-Ziv-Welch (LZW) Algorithm
- Suppose we have n possible characters; label them 1, 2, …, n.
- We start with a trie that contains a root and n children
  - one child for each possible character
  - each child labeled 1…n
- We read the file, and when we fall off the trie we add the new string to it and emit the code for the longest match.
- We continue until the whole file is read and we have a dictionary of "words" and codewords.

LZW example
- Suppose our alphabet consists of only four letters: {a, b, c, d}
- We start by assigning a=1, b=2, c=3, d=4
  - How many bytes are needed to encode a, b, c, d?
- Let's consider the compression of the string baddad

LZW: Compression example
Input: baddad, starting dictionary a:1, b:2, c:3, d:4.

  step  match W  output  new entry
  1     b        2       ba:5
  2     a        1       ad:6
  3     d        4       dd:7
  4     d        4       da:8
  5     ad       6       -

The trie starts with children a, b, c, d (codes 1-4) and grows one new leaf per step (codes 5-8).

LZW output
- So, the input baddad compresses to 2 1 4 4 6,
- which can be given in bit form,
- …or compressed again using Huffman (cool idea!)

Extending the dictionary
- So what if we continue to compress more of the string?
- Suppose we have baddadbaddadbaddad
- What is the encoded file?

LZW: Uncompress example
Input: 2 1 4 4 6, starting dictionary a:1, b:2, c:3, d:4.

  code  output  new entry
  2     b       -
  1     a       ba:5
  4     d       ad:6
  4     d       dd:7
  6     ad      da:8

Output: baddad, the original string, and the decoder has rebuilt the same dictionary entries 5-8.

LZW Algorithm
An alternative presentation (without tries)

Getting off the ground
Suppose we want to compress a file containing only the letters a, b, c and d. It seems reasonable to start with the dictionary
  a:0  b:1  c:2  d:3
At least we can then deal with the first letter. And the receiver knows how to start.

Growing pains
Now suppose the file starts like so:  a b b a b b …
We scan the a, look it up and output a 0. After scanning the b, we have seen the word ab. So, we add it to the dictionary:
  a:0  b:1  c:2  d:3  ab:4

Growing pains
We output a 1 for the b. Then we get another b:  a b b a b b …
Output 1, and add bb to the dictionary:
  a:0  b:1  c:2  d:3  ab:4  bb:5

So?
Right, so far zero compression. But now we get a followed by b, and ab is in the dictionary:  a b b a b b …
So we output 4, and put abb into the dictionary (ba:6 was added in the previous step, when the b was followed by a):
  …  d:3  ab:4  bb:5  ba:6  abb:7

And so on
Suppose the input continues:  a b b a b b b b a …
We output 5, and put bbb into the dictionary:
  …  ab:4  bb:5  ba:6  abb:7  bbb:8

More Hits As our dictionary grows, we are able to replace longer and longer blocks by short code numbers. a b b a b b b b a … And we increase the dictionary at each step by adding another word.

More importantly
- Since we extend our dictionary in such a simple way, it can easily be reconstructed on the other end.
- Start with the same initialization, then read one code number after the other, look up each one in the dictionary, and extend the dictionary as you go along.
- So we don't need to send the dictionary (or codes) with the compressed file (unlike in Huffman).

LZW Compression, formally
We scan a sequence of symbols a_1 a_2 a_3 … a_k, where each prefix is in the dictionary. We stop when we fall out of the dictionary:
  a_1 a_2 a_3 … a_k b

Again: Extending
We output the code for a_1 a_2 a_3 … a_k and put a_1 a_2 a_3 … a_k b into the dictionary.
Then we set a_1 = b and start all over.

Another Example
Let's take a closer look at an example. Assume the alphabet {a, b, c}. The code for aabbaabb is
  0 0 1 1 3 5
The decoding starts with the dictionary a:0, b:1, c:2.

Moving along
The first 4 code words are already in D and produce output a a b b. As we go along, we extend D:
  a:0, b:1, c:2, aa:3, ab:4, bb:5
For the remaining codes 3 and 5 we get aa and bb.

Done
We have also added to D: ba:6, aab:7. But these entries are never used.
Everything is easy, since there is already an entry in the dictionary for each code number when we encounter it.

Is this it?
Unfortunately, no. It may happen that we run into a code word without having an appropriate entry in the dictionary.
But this can only happen in very special circumstances, and we can manufacture the missing entry.

A Bad Run
Consider input  a a b b b a a  ==>  0 0 1 5 3
After reading 0 0 1, the dictionary looks like this:
  a:0, b:1, c:2, aa:3, ab:4

Disaster
The next code is 5, but it's not in D:
  a:0, b:1, c:2, aa:3, ab:4
How could this have happened? Can we recover?

… narrowly averted
This problem only arises when the input contains a substring  … s w s w s …  where s w was just added to the dictionary.
Here s is a single symbol, but w a (possibly empty) word.

… narrowly averted But then the fix is to output x + first(x) where x is the last decompressed word, and first(x) the first symbol of x. And, we also update the dictionary to contain this new entry.

Example
In our example we had s = b, w = empty.
The output and new dictionary word is bb.

Another Example
  aabbbaabbaaabaababb  ==>  0 0 1 5 3 6 7 9 5
Decoding (dictionary size: initial 3, final 11):

  code  output  new entry  position in D
  0     a       -          -
  0     a       aa         3
  1     b       ab         4
  5 *   bb      bb         5
  3     aa      bba        6
  6     bba     aab        7
  7     aab     bbaa       8
  9 *   aaba    aaba       9
  5     bb      aabab      10

The rows marked * are the problem cases: the code is not yet in D when it arrives, and the missing word is manufactured as the last word plus its own first symbol.

Pseudo Code: Compression
- Initialize dictionary D to all words of length 1, that is, the alphabet.
- Read all input characters:
  - output code words from D,
  - extend D whenever a new word appears.
- New code words: just an integer counter.

Compression Algorithm

  initialize D;
  c = nextchar;            // next input character
  W = c;                   // current word (a string)
  while ( c = nextchar ) {
      if ( W+c is in D )   // still a dictionary word?
          W = W + c;
      else {
          output code(W);
          add W+c to D;
          W = c;
      }
  }
  output code(W);
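A direct Python transcription of this pseudocode (a sketch; a plain dict stands in for the trie, and codes start at 0):

  # LZW compression, following the pseudocode above.
  def lzw_compress(text, alphabet):
      D = {ch: i for i, ch in enumerate(alphabet)}   # all words of length 1
      out = []
      W = text[0]
      for c in text[1:]:
          if W + c in D:
              W = W + c              # still a dictionary word: keep growing
          else:
              out.append(D[W])       # emit code of the longest match
              D[W + c] = len(D)      # extend the dictionary
              W = c                  # start over at the mismatch
      out.append(D[W])               # flush the final match
      return out

  print(lzw_compress("baddad", "abcd"))   # [1, 0, 3, 3, 5] with a=0..d=3
                                          # (the earlier slide used a=1..d=4: 2 1 4 4 6)
  print(lzw_compress("aabbbaa", "abc"))   # [0, 0, 1, 5, 3], the "bad run" above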

Pseudo Code: Decompression
- Initialize dictionary D with all words of length 1.
- Read all code words and
  - output corresponding words from D,
  - extend D at each step.
- This time the dictionary is of the form (integer, word): keys are integers, values words.

Decompression Algorithm

  initialize D;
  pc = nextcode;       // first code word
  pw = word(pc);       // corresponding word
  output pw;

The first code word is easy: it codes only a single symbol. Remember it as pc (previous code) and pw (previous word).

Decompression Algorithm

  while ( c = nextcode ) {
      if ( c is in D ) {
          cw = word(c);
          pw = word(pc);
          ww = pw + first(cw);
          insert ww in D;
          output cw;
      }
      else {           // the hard case, next slide

The hard case

      else {
          pw = word(pc);
          cw = pw + first(pw);   // manufacture the missing entry
          insert cw in D;
          output cw;
      }
      pc = c;
  }
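And a Python transcription of the decoder, including the hard case (a sketch; word(c) becomes a dict lookup):

  # LZW decompression with the pw + first(pw) fix for the hard case.
  def lzw_decompress(codes, alphabet):
      D = {i: ch for i, ch in enumerate(alphabet)}   # code -> word
      pw = D[codes[0]]               # first code is always a single symbol
      out = [pw]
      for c in codes[1:]:
          if c in D:
              cw = D[c]              # normal case: code already known
          else:
              cw = pw + pw[0]        # hard case: code was only just created
          D[len(D)] = pw + cw[0]     # extend D exactly as the encoder did
          out.append(cw)
          pw = cw
      return "".join(out)

  print(lzw_decompress([0, 0, 1, 5, 3], "abc"))   # "aabbbaa", exercising the hard case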

Implementation - Tries
Tries are the best way to implement LZW. In the LZW situation, we can add the new word to the trie dictionary in O(1) steps after discovering that the string is no longer a prefix of a dictionary word. Just add a new leaf to the last node touched.
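A minimal sketch of that O(1) extension (a hypothetical node structure; children keyed by character):

  # LZW trie node: adding a leaf after a failed match is O(1).
  class TrieNode:
      def __init__(self, code):
          self.code = code       # code number of the word ending here
          self.children = {}     # char -> TrieNode

  def extend(last_node, c, next_code):
      # last_node is where the longest match W ended; W+c gets a new leaf.
      last_node.children[c] = TrieNode(next_code)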

LZW details
In reality, one usually restricts the code words to 12- or 16-bit integers. Hence, one may have to flush the dictionary every so often. But we won't bother with this.

LZW details Lastly, LZW generates as output a stream of integers. It makes perfect sense to try to compress these further, e.g., by Huffman.

Summary of LZW
LZW is an adaptive, dictionary-based compression method. Encoding is easy in LZW, but uses a special data structure (a trie).
Decoding is slightly complicated, but requires no special data structure, just a table of words indexed by code number.
Further Reading at:

Lossy Compression with SVD

Data Compression
- We have studied two important lossless data compression algorithms:
  - Huffman Code
  - Lempel-Ziv Dictionary Method
- Lossy Compression
  - What if we can compress an image by "degrading" the image a bit?
  - Lossy compression techniques are used in image formats such as JPEG.
- Next we will discuss a method to do lossy compression using a matrix decomposition method known as SVD.

Singular Value Decomposition (SVD)
- Suppose A is an m×n matrix.
- We can find a decomposition A = U S V^T, where
  - U and V are orthonormal matrices (i.e. U U^T = I and V V^T = I, where I is the identity matrix),
  - S is a diagonal matrix, S = diag(s_1, s_2, s_3, …, s_k, 0, 0, …, 0), where the s_i's are called the singular values of A and k is the rank of A.
- It is possible to choose U and V such that s_1 ≥ s_2 ≥ … ≥ s_k.
- Note: do not worry about all this math if you have not done linear algebra.

Expressing A as a sum
- A = s_1 U_1 V_1^T + s_2 U_2 V_2^T + … + s_k U_k V_k^T, where U_i and V_i are the i-th columns of U and V respectively.
- A bit of knowledge about block matrix multiplication will convince you that this sum is indeed equal to A.
- The key idea in SVD compression is that we can select any number of terms we need from the above sum to "approximate" A.
- Thinking of the image as a matrix A: the more terms we pick, the more clarity we get in the image.
- Compression comes from storing as few vectors as possible while still getting a decent image (see the sketch below).
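A NumPy sketch of this truncated sum (an illustration; compare the Matlab on the last slide). A rank-k approximation of an m×n matrix stores only k(m+n+1) numbers instead of mn:

  import numpy as np

  # Rank-k approximation: A_k = s_1 U_1 V_1^T + ... + s_k U_k V_k^T
  def rank_k(A, k):
      U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U diag(s) Vt
      return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

  A = np.random.rand(128, 128)
  A8 = rank_k(A, 8)
  print(np.linalg.norm(A - A8))            # approximation error
  print(8 * (128 + 128 + 1), 128 * 128)    # 2056 numbers stored vs. 16384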

Breaking down an image

Red, Green and Blue Images

The red matrix representation of the image (a 16×16 matrix)
- Apply SVD to this matrix and get a close enough approximation using as few columns of U and V as possible.
- Do the same for the green and blue parts and reconstruct the matrix.

Implementation (compression)
SVD COMPRESSION STEP
The compressed file stores U, S and V for the rank selected, for each of the colors R, G and B, plus header bytes.

Implementation (decompression)
DECOMPRESSION
The compressed file stores U, S and V for the rank selected, for each of the colors R, G and B, plus the bmp header; decompression rebuilds each color matrix from them.

Some Samples (128x128)
Original image: 49K.  Rank 1 approximation: 825 bytes.

Samples ctd…
Rank 8 approximation: 7K.  Rank 16 approximation: 13K.

SVD compression using Matlab

  A = imread('image.bmp');
  imagesc(A);
  R = A(:,:,1); G = A(:,:,2); B = A(:,:,3);
  [U,S,V] = svd(double(R));                 % svd needs double, imread gives uint8
  Ar = U(:,1:k) * S(1:k,1:k) * V(:,1:k)';   % rank-k sum of S(i,i)*U(:,i)*V(:,i)'
  % similarly find Ag and Ab
  A(:,:,1) = Ar; A(:,:,2) = Ag; A(:,:,3) = Ab;
  imagesc(A);                               % rank k approximation