4.8 Huffman Codes These lecture slides are supplied by Mathijs de Weerd.

Data Compression

Q. Given a text that uses 32 symbols (26 different letters, space, and some punctuation characters), how can we encode this text in bits?

Q. Some symbols (e, t, a, o, i, n) are used far more often than others. How can we use this to reduce our encoding?

Q. How do we know when the next symbol begins?
Ex. c(a) = 01, c(b) = 010, c(e) = 1. What is 0101?

Data Compression

Q. Given a text that uses 32 symbols (26 different letters, space, and some punctuation characters), how can we encode this text in bits?
A. We can encode 2^5 = 32 different symbols using a fixed length of 5 bits per symbol. This is called fixed-length encoding.

Q. Some symbols (e, t, a, o, i, n) are used far more often than others. How can we use this to reduce our encoding?
A. Encode these characters with fewer bits, and the others with more bits.

Q. How do we know when the next symbol begins?
A. Use a separation symbol (like the pause in Morse code), or make sure that there is no ambiguity by ensuring that no code is a prefix of another one.
Ex. c(a) = 01, c(b) = 010, c(e) = 1. What is 0101? (Ambiguous: it could be "aa" or "be".)
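The fixed-length answer can be sketched in a few lines of Python. The concrete 32-symbol alphabet below (26 letters, space, and 5 punctuation marks) is an assumed example, not from the slides:

```python
# Fixed-length encoding sketch: 2^5 = 32 symbols, so 5 bits per symbol.
# The exact alphabet is an assumption for illustration.
import string

symbols = list(string.ascii_lowercase) + [" ", ".", ",", "?", "!", "-"]
assert len(symbols) == 32

# Assign each symbol a fixed 5-bit pattern, e.g. 'a' -> '00000'.
code = {s: format(i, "05b") for i, s in enumerate(symbols)}

encoded = "".join(code[ch] for ch in "hello world")
print(len(encoded))  # 11 characters x 5 bits = 55 bits
```

Decoding is trivial here: cut the bit string into consecutive 5-bit chunks, which is exactly why fixed-length codes need no separator.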

Prefix Codes

Definition. A prefix code for a set S is a function c that maps each x ∈ S to a string of 1s and 0s in such a way that for x, y ∈ S, x ≠ y, c(x) is not a prefix of c(y).

Ex. c(a) = 11, c(e) = 01, c(k) = 001, c(l) = 10, c(u) = 000
Q. What is the meaning of 1001000001?

Suppose frequencies are known in a text of 1G: fa = 0.4, fe = 0.2, fk = 0.2, fl = 0.1, fu = 0.1
Q. What is the size of the encoded text?

Prefix Codes

Definition. A prefix code for a set S is a function c that maps each x ∈ S to a string of 1s and 0s in such a way that for x, y ∈ S, x ≠ y, c(x) is not a prefix of c(y).

Ex. c(a) = 11, c(e) = 01, c(k) = 001, c(l) = 10, c(u) = 000
Q. What is the meaning of 1001000001?
A. "leuk"

Suppose frequencies are known in a text of 1G: fa = 0.4, fe = 0.2, fk = 0.2, fl = 0.1, fu = 0.1
Q. What is the size of the encoded text?
A. 2·fa + 2·fe + 3·fk + 2·fl + 3·fu = 2.3G
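That size computation is just frequency times code length, summed over the alphabet; a quick sketch with the code and frequencies above:

```python
# Encoded-text size for the prefix code above: sum of frequency x code length.
code = {"a": "11", "e": "01", "k": "001", "l": "10", "u": "000"}
freq = {"a": 0.4, "e": 0.2, "k": 0.2, "l": 0.1, "u": 0.1}

bits_per_letter = sum(freq[x] * len(code[x]) for x in code)
print(bits_per_letter)  # about 2.3 bits per letter, i.e. 2.3G bits for a 1G text
```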

Optimal Prefix Codes

Definition. The average bits per letter (ABL) of a prefix code c is the sum over all symbols of (its frequency) × (the number of bits of its encoding):

ABL(c) = Σ_{x ∈ S} f_x · |c(x)|

GOAL: find a prefix code that has the lowest possible average bits per letter. We can model a code as a binary tree…

Representing Prefix Codes using Binary Trees

Ex. c(a) = 11, c(e) = 01, c(k) = 001, c(l) = 10, c(u) = 000
Q. How does the tree of a prefix code look?

[Figure: binary tree with 0/1-labeled edges and leaves a, e, l, u, k]

Representing Prefix Codes using Binary Trees

Ex. c(a) = 11, c(e) = 01, c(k) = 001, c(l) = 10, c(u) = 000
Q. How does the tree of a prefix code look?
A. Only the leaves have a label.
Proof. An encoding of x is a prefix of an encoding of y iff the path of x is a prefix of the path of y.

[Figure: binary tree with 0/1-labeled edges and leaves a, e, l, u, k]
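The leaf-labeled tree view gives a direct decoding procedure: follow 0/1 edges from the root and emit a letter at each leaf. A minimal sketch, where nested dicts stand in for tree nodes (my representation choice, not the slides'), building the tree from the code table itself:

```python
# Build the code tree for the prefix code above, then decode by walking it.
code = {"a": "11", "e": "01", "k": "001", "l": "10", "u": "000"}

root = {}
for letter, bits in code.items():
    node = root
    for b in bits:
        node = node.setdefault(b, {})  # create/follow the 0- or 1-child
    node["leaf"] = letter              # only leaves carry a label

def decode(bits):
    out, node = [], root
    for b in bits:
        node = node[b]
        if "leaf" in node:             # reached a leaf: emit and restart at root
            out.append(node["leaf"])
            node = root
    return "".join(out)

print(decode("1001000001"))  # -> "leuk", as in the earlier slide
```

Because no code word is a prefix of another, the walk can never pass through a labeled leaf, so decoding is unambiguous.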

Representing Prefix Codes using Binary Trees

Q. What is the meaning of 1110 10 001 1111 01 000 ?

[Figure: binary tree with 0/1-labeled edges and leaves e, i, l, m, p, s]

Representing Prefix Codes using Binary Trees

Q. What is the meaning of 111010001111101000 ?
A. "simpel"
Q. How can this prefix code be made more efficient?

[Figure: binary tree with 0/1-labeled edges and leaves e, i, l, m, p, s]

Representing Prefix Codes using Binary Trees

Q. What is the meaning of 111010001111101000 ?
A. "simpel"
Q. How can this prefix code be made more efficient?
A. Change the encodings of p and s to shorter ones. The tree is then full.

[Figure: the same tree with s and p moved up so every internal node has two children]

Representing Prefix Codes using Binary Trees

Definition. A tree is full if every node that is not a leaf has two children.
Claim. The binary tree corresponding to an optimal prefix code is full.
Pf. [Figure: a node u with only one child v, and parent w]

Representing Prefix Codes using Binary Trees

Definition. A tree is full if every node that is not a leaf has two children.
Claim. The binary tree corresponding to an optimal prefix code is full.
Proof. (by contradiction) Suppose T is the binary tree of an optimal prefix code and is not full. This means there is a node u with only one child v.
Case 1: u is the root; delete u and use v as the root.
Case 2: u is not the root; let w be the parent of u, delete u, and make v a child of w in place of u.
In both cases the number of bits needed to encode any leaf in the subtree of v decreases, and the rest of the tree is not affected. Clearly this new tree T' has a smaller ABL than T. Contradiction.

[Figure: a node u with only one child v, and parent w]

Optimal Prefix Codes: False Start

Q. Where in the tree of an optimal prefix code should letters with a high frequency be placed?
Q. How can we greedily build a tree?

Optimal Prefix Codes: False Start

Q. Where in the tree of an optimal prefix code should letters with a high frequency be placed?
A. Near the top! Use the recursive structure of trees.

Q. How can we greedily build a tree?
Greedy template. [Shannon-Fano, 1949] Create the tree top-down: split S into two sets S1 and S2 with (almost) equal frequencies, then recursively build trees for S1 and S2.

Ex. fa = 0.32, fe = 0.25, fk = 0.20, fl = 0.18, fu = 0.05
Shannon-Fano is not optimal: for these frequencies there is a better tree than the one the greedy top-down split produces.

[Figures: the greedy Shannon-Fano tree vs. a better tree for these frequencies]

Optimal Prefix Codes: Huffman Encoding

Observation 1. Lowest-frequency items should be at the lowest level in the tree of an optimal prefix code.
Observation 2. For n > 1, the lowest level always contains at least two leaves (optimal trees are full!).
Observation 3. The order in which items appear in a level does not matter.
Claim 1. There is an optimal prefix code with tree T* in which the two lowest-frequency letters are assigned to leaves that are brothers in T*.

Huffman Code

Greedy template. [Huffman, 1952] Create tree bottom-up.
a) Make two leaves for the two lowest-frequency letters y and z.
b) Recursively build the tree for the rest, using a meta-letter for yz.

Optimal Prefix Codes: Huffman Encoding

Q. What is the time complexity?

Huffman(S) {
  if |S| = 2 {
    return tree with root and 2 leaves
  } else {
    let y and z be the lowest-frequency letters in S
    S' = S
    remove y and z from S'
    insert new letter ω in S' with fω = fy + fz
    T' = Huffman(S')
    T = add two children y and z to leaf ω in T'
    return T
  }
}

Optimal Prefix Codes: Huffman Encoding

Q. What is the time complexity?
A. T(n) = T(n-1) + O(n), so O(n^2).

Q. How can finding the lowest-frequency letters be implemented efficiently?
A. Use a priority queue for S: T(n) = T(n-1) + O(log n), so O(n log n).

Huffman(S) {
  if |S| = 2 {
    return tree with root and 2 leaves
  } else {
    let y and z be the lowest-frequency letters in S
    S' = S
    remove y and z from S'
    insert new letter ω in S' with fω = fy + fz
    T' = Huffman(S')
    T = add two children y and z to leaf ω in T'
    return T
  }
}
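The priority-queue variant can be sketched with Python's heapq. This is an iterative rendering of the recursive pseudocode above (the tuple representation of tree nodes and the tie-breaking counter are my implementation choices); the frequencies reuse the Shannon-Fano example from an earlier slide:

```python
# Huffman's algorithm with a priority queue: repeatedly merge the two
# lowest-frequency items, as in the pseudocode above.
import heapq
from itertools import count

def huffman(freqs):
    """Return a code table {letter: bitstring} for {letter: frequency}."""
    tick = count()  # tie-breaker so equal frequencies never compare payloads
    heap = [(f, next(tick), letter) for letter, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # lowest-frequency item (y)
        f2, _, right = heapq.heappop(heap)   # second lowest (z)
        # The meta-letter for yz is an internal node with frequency fy + fz.
        heapq.heappush(heap, (f1 + f2, next(tick), (left, right)))
    _, _, tree = heap[0]

    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):          # internal node: recurse on children
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                                # leaf: record the accumulated bits
            codes[node] = prefix
    walk(tree, "")
    return codes

freqs = {"a": 0.32, "e": 0.25, "k": 0.20, "l": 0.18, "u": 0.05}
codes = huffman(freqs)
abl = sum(freqs[x] * len(codes[x]) for x in codes)
print(abl)  # about 2.23 bits per letter for these frequencies
```

Each of the n-1 merges costs O(log n) heap work, which is the O(n log n) bound stated above.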

Huffman Encoding: Greedy Analysis

Claim. The Huffman code for S achieves the minimum ABL of any prefix code.
Pf. By induction, based on optimality of T' (y and z removed, ω added). (See next slide.)

Claim. ABL(T') = ABL(T) - fω
Pf.

Huffman Encoding: Greedy Analysis

Claim. The Huffman code for S achieves the minimum ABL of any prefix code.
Proof. By induction, based on optimality of T' (y and z removed, ω added). (See next slide.)

Claim. ABL(T') = ABL(T) - fω
Proof. T' is more efficient because ω sits one level higher than y and z did: every occurrence of y or z saves one bit, for a total saving of fy + fz = fω bits per letter.

Huffman Encoding: Greedy Analysis

Claim. The Huffman code for S achieves the minimum ABL of any prefix code.
Proof. (by induction over n = |S|)

Huffman Encoding: Greedy Analysis

Claim. The Huffman code for S achieves the minimum ABL of any prefix code.
Pf. (by induction over n = |S|)
Base: For n = 2 there is no shorter code than a root with two leaves.
Hypothesis: Suppose the Huffman tree T' for S' of size n-1, with ω instead of y and z, is optimal.
Step: (by contradiction)

Huffman Encoding: Greedy Analysis

Claim. The Huffman code for S achieves the minimum ABL of any prefix code.
Pf. (by induction)
Base: For n = 2 there is no shorter code than a root with two leaves.
Hypothesis: Suppose the Huffman tree T' for S' of size n-1, with ω instead of y and z, is optimal. (IH)
Step: (by contradiction)
Idea of proof: Suppose some other tree Z of size n is better. Delete the lowest-frequency items y and z from Z, creating Z'. By the IH, Z' cannot be better than T'.

Huffman Encoding: Greedy Analysis

Claim. The Huffman code for S achieves the minimum ABL of any prefix code.
Pf. (by induction)
Base: For n = 2 there is no shorter code than a root with two leaves.
Hypothesis: Suppose the Huffman tree T' for S', with ω instead of y and z, is optimal. (Inductive Hyp.)
Step: (by contradiction)
Suppose the Huffman tree T for S is not optimal, so there is some tree Z such that ABL(Z) < ABL(T). Then there is also such a tree Z in which the lowest-frequency leaves y and z are brothers (see Claim 1). Let Z' be Z with y and z deleted and their former parent labeled ω. Similarly, T' is derived from S' in our algorithm. We know that ABL(Z') = ABL(Z) - fω, as well as ABL(T') = ABL(T) - fω. But ABL(Z) < ABL(T), so ABL(Z') < ABL(T'). Contradiction with the IH.

Steps of the Proof

Step: (by contradiction)
Suppose the Huffman tree T for S is not optimal, so there is some tree Z such that ABL(Z) < ABL(T). Then there is also such a tree Z in which the lowest-frequency leaves y and z are brothers (see Obs. 1-2: fullness!). Let Z' be Z with y and z deleted and their former parent labeled ω. Similarly, T' is derived from S' in our algorithm. We know that ABL(Z') = ABL(Z) - fω, as well as ABL(T') = ABL(T) - fω. But also (by the absurd hypothesis) ABL(Z) < ABL(T), so ABL(Z') < ABL(T'). Contradiction with the IH.