Data Compression Gabriel Laden CS146 – Dr. Sin-Min Lee Spring 2004.

Data Compression Gabriel Laden CS146 – Dr. Sin-Min Lee Spring 2004

What is Data Compression? There is lossless and lossy compression, either way, file size is reduced This saves both time and space (premium) Data Compression Algorithms are more successful if they are based on statistical analysis of the frequency of the data and the accuracy needed to represent the data.

Examples in computers jpeg is a compressed image file mp3 is a compressed audio file zip is a compressed archive of files there are lots of encoding algorithms, we will look at Huffman’s Algorithm (see our textbook pp.357-362)

What is Greedy Algorithm Solve a problem in stages Make a locally optimum decision Algorithm is good if local optimum is equal is to the global optimum

Examples of Greedy Dijkstra, Prim, Kruskal Bin Packing problem Huffman Code

Problem with Greedy Greedy Algorithm does not always work with the set of data, there can be some conflicts What if all characters are equally distributed? What if characters are very unequally distributed? A problem from our text book: If we had such a thing as a 12cent coin, and we are asked to make 15cents change, Greedy Algorithm would produce : 1(12cent) + 3(penny) = 15 incorrect answer 1(dime) + 1(nickel) = 15 correct answer

David Huffman Paper published in 1952 “A Method for the Construction of Minimum Redundancy Codes” What we call “Data Compression” is what he termed “Minimum Redundancy”

ASCII Code 128 characters includes punctuation log 128 = 7 bits 1 byte = 8 bits All characters are 8 bits long “Fixed-Length Encoding” “Etaoin Shrdlu” most common letters!!!

Intro to Huffman Algorithm Method of construction for an encoding tree Full Binary Tree Representation Each edge of the tree has a value, (0 is the left child, 1 is the right child) Data is at the leaves, not internal nodes Result: encoding tree “Variable-Length Encoding”

Huffman Algorithm (English) 1. Maintain a forest of trees 2. Weight of tree = sum frequency of leaves 3. For 0 to N-1 –Select two smallest weight trees –Form a new tree

Huffman Algorithm (Technical) n  |C| Q  C For i  1 to n – 1 –Do z  AllocateNode() –x  left[z]  ExtractMin(Q) –y  right[z]  ExtractMin(Q) –f[z]  f[x] + f[y] –Insert(Q, z) Return Extract-Min(Q)

Ambiguity in using code? What if you have an encoded string: 000010101101011000110001110 How do you know where to break it up? Prefix Coding Rule –No code is a prefix of another –The way the tree is built disallows this –If there is a “00” code, there cannot be a “0”

Step0: 10 20 5 15 25 1 16 (q) (w) (e) (r) (t) (y) (u) Step1: ( )6 10 20 15 25 16 / \ (q) (w) (r) (t) (u) (y) (e) Step2: ( )16 / \ ( ) (q) 20 15 25 16 / \ (w) (r) (t) (u) (y) (e) Step3: ( )16 / \ ( )31 ( ) (q) 20 25 / \ / \ (w) (t) (r) (u) (y) (e)

Step3: ( )16 / \ ( )31 ( ) (q) 20 25 / \ / \ (w) (t) (r) (u) (y) (e) Step4 : ( )36 / \ ( ) (w) / \ ( )31 ( ) (q) 25 / \ / \ (t) (r) (u) (y) (e) Step5 : ( )36 / \ ( )56 ( ) (w) / \ / \ (t) ( ) ( ) (q) / \ / \ (r) (u) (y) (e) Step6 : ( )92 / \ ( ) ( ) / \ / \ ( ) (w) (t) ( ) / \ / \ ( ) (q) (r) (u) / \ (y) (e)

Step6 : ( )92 / \ ( ) ( ) / \ / \ ( ) (w) (t) ( ) / \ / \ ( ) (q) (r) (u) / \ (y) (e) When tree is used to encode a file it is written as a header above the body of the encoded bits of text. 0 is left, 1 is right edge use a stack to do this Table: 01 w 0000 y 10 t 0001 e 110 r 001 q 111 u Header: 0000y0001e001q01w10t110r111u

Proof: part 1 Lemma: –Let C be an alphabet in which each character c in C has frequency f[c] –Let x and y be two characters in C having lowest frequencies –There exists an optimal prefix code in C in which the codes for x and y have the same length and differ only in last bit

Proof: part 2 Lemma: –Let T be a full binary tree representing an optimal prefix code over an alphabet C –Let z be the parent of two leaves x and y –Then T” = T – {x,y} represents an optimal prefix code for C” = C – {x,y}U{z}

Lengths of Encoding Set root / \ / \ / \ / \ / \ / \ / \ 1 2 3 4 5 6 7 8 Length of set is: (8 nodes) * (3 edges) = 24bits This is what you would get if the nodes are mostly random and equal in probability

Lengths of Encoding Set root / \ / \ 8 / \ 7 / \ 6 / \ 5 / \ 4 / \ 3 1 2 Length of set is: 7+7+6+5+4+3+2+1 = 35bits This is what you would get if the nodes vary the most in probability.

Expected Value / character In example 1: 8 * (1/2^3) * 3) = 3 bits In example 2: 2 * (1/2^7 * 7) + (1/2^6 * 6) + (1/2^5 * 5) + (1/2^4 * 4) + (1/2^3 * 3) + (1/2^2 * 2) + (1/2^1 * 1) = 1.98 bits

Main Point Statistical methods work better when the symbols in the data set have varying probabilities. Otherwise you need to use a different method for compression. (Example jpeg)

Image Compression “Lossy” – meaning details are lost An approximation of original image is made where large areas of similar color are combined into a single block This introduces a certain amount of error, which is a tradeoff

Steps to Image Compression Specify requested output file size Divide image into several areas Divide file size by the # of areas Quantize each area (information lost here) Encode each area separately, write to file

Image Decomposition

References Data Structures & Algorithm Analysis - Mark Allen Weiss Introduction to Algorithms – Thomas H. Cormen

Data Compression Gabriel Laden CS146 – Dr. Sin-Min Lee Spring 2004.

Similar presentations

Presentation on theme: "Data Compression Gabriel Laden CS146 – Dr. Sin-Min Lee Spring 2004."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Data Compression Gabriel Laden CS146 – Dr. Sin-Min Lee Spring 2004.

Similar presentations

Presentation on theme: "Data Compression Gabriel Laden CS146 – Dr. Sin-Min Lee Spring 2004."— Presentation transcript:

Similar presentations

About project

Feedback