Running Time of Kruskal’s Algorithm & Huffman Codes
Monday, July 14th
Outline For Today
1. Runtime of Kruskal’s Algorithm (Union-Find Data Structure)
2. Data Encodings & Finding an Optimal Prefix-free Encoding
3. Prefix-free Encodings ↔ Binary Trees
4. Huffman Codes
Recap: Kruskal’s Algorithm Simulation
[Figure: step-by-step simulation on the example graph with vertices A–H. Edges are considered in increasing weight order (1, 2, 2.5, 3, 4, 5, 6, 7, 7.5, 8, 9); each edge whose endpoints are already connected creates a cycle and is skipped. The final tree is the same as the tree produced by Prim’s algorithm.]
Recap: Kruskal’s Algorithm Pseudocode
procedure kruskal(G(V, E)):
    sort E in order of increasing weights
    rename E so w(e_1) < w(e_2) < … < w(e_m)
    T = {}  // final tree edges
    for i = 1 to m:
        if T ∪ {e_i = (u, v)} doesn’t create a cycle:
            add e_i to T
    return T
Recap: For Correctness We Proved 2 Things
1. Kruskal’s outputs a spanning tree T_krsk
2. T_krsk is a minimum spanning tree
1: Kruskal Outputs a Spanning Tree
Need to prove T_krsk is spanning AND acyclic.
Acyclic: by definition of the algorithm (it never adds a cycle-creating edge).
Why is T_krsk spanning (i.e., connected)?
Recall the Empty Cut Lemma: a graph is not connected iff ∃ a cut (X, Y) with no crossing edges.
So if every cut has a crossing edge, the graph is connected!
2: Kruskal is Optimal (by the Cut Property)
Let (u, v) be any edge added by Kruskal’s algorithm. When it is added, u and v are in different components (because Kruskal checks for cycles).
[Figure: the cut separating u’s current component from the rest of the graph (vertices u, x, y, v, t, z, w).]
Claim: (u, v) is the minimum-weight edge crossing this cut!
Kruskal’s Runtime
procedure kruskal(G(V, E)):
    sort E in order of increasing weights              // O(m log n)
    rename E so w(e_1) < w(e_2) < … < w(e_m)
    T = {}  // final tree edges
    for i = 1 to m:                                    // m iterations
        if T ∪ {e_i = (u, v)} doesn’t create a cycle:  // cost?
            add e_i to T
    return T
Can we speed up cycle checking?
Option 1: check if a u⇝v path already exists in T!
Run a BFS/DFS from u or v => O(|T| + n) = O(n) per edge.
***BFS/DFS Total Runtime: O(mn)***
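A minimal Python sketch of Option 1 (the name tree_adj, an adjacency map over the current tree edges, is illustrative): to test whether adding (u, v) creates a cycle, BFS through the tree edges from u and check whether v is reachable.

```python
from collections import deque

def creates_cycle(tree_adj, u, v):
    """True iff v is already reachable from u using only tree edges,
    i.e., adding (u, v) would close a cycle. BFS costs O(|T| + n) = O(n)."""
    seen = {u}
    queue = deque([u])
    while queue:
        x = queue.popleft()
        if x == v:
            return True
        for y in tree_adj.get(x, ()):
            if y not in seen:
                seen.add(y)
                queue.append(y)
    return False
```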
Speeding Up Kruskal’s Algorithm
Goal: check for cycles in log(n) time.
Observation: (u, v) creates a cycle iff u and v are in the same connected component.
Option 2: check whether u’s component = v’s component.
More specific goal: find the component of each vertex in log(n) time.
Union-Find Data Structure
Operation 1 (Union): maintain the component structure of T as we add new edges to it.
Operation 2 (Find): query the component of a vertex v.
Kruskal’s With Union-Find (Conceptually)
[Figure: the same example graph; each vertex is labeled with the leader of its current component as edges are processed in increasing weight order (1, 2, 2.5, 3, 4, 5, 6, 7, 7.5, 8, 9).]
Find(A) = A, Find(D) = D → Union(A, D)
Find(D) = A, Find(E) = E → Union(A, E)
Find(C) = C, Find(F) = F → Union(C, F)
Find(E) = A, Find(F) = C → Union(A, C)
Find(A) = A, Find(B) = B → Union(A, B)
Find(D) = A, Find(C) = A → skip (D, C)
Find(A) = A, Find(C) = A → skip (A, C)
Find(C) = A, Find(H) = H → Union(A, H)
Find(F) = A, Find(G) = G → Union(A, G)
Find(B) = A, Find(C) = A → skip (B, C)
Find(H) = A, Find(G) = A → skip (H, G)
Union-Find Implementation Simulation
[Figure: each vertex A–H starts as its own component of size 1, pointing to itself. Replaying the unions above: D and then E join A’s component (sizes 2, 3); F joins C’s (size 2); C’s component merges into A’s (size 5); then B, H, and G join, growing A’s component to sizes 6, 7, and 8.]
Linked Structure Per Connected Component
[Figure: vertices C, A, W, Z, Y, T all point to their leader X, whose component has size 7.]
Union Operation
Union: **make the leader of the small component point to the leader of the large component.**
[Figure: the component with leader E (size 3, members F, G) merges into the component with leader X (size 7): E now points to X, and X’s size is updated to 10.]
Cost: O(1) (1 pointer update, 1 increment)
Find Operation
Find: “pointer chase” from v up to the leader.
Cost: the number of pointers on the path to the leader. How large can that be?
Cost of Find Operation
Claim: for any v, #-pointers to leader(v) ≤ log₂(|component(v)|) ≤ log₂(n).
Proof: each time v’s path to its leader grows by 1, v’s component was just merged into one at least as large, so |component(v)| at least doubles! |component(v)| starts at 1 and is at most n, so it can double at most log₂(n) times.
Summary of Union-Find
Initialization: each v is a component of size 1 and points to itself.
Union(u, v): the leader of the smaller component points to the leader of the larger one (break ties arbitrarily). Cost: 1 pointer update, 1 increment => O(1).
Find(v): pointer chasing to the leader. Cost: O(log₂(|component|)) = O(log₂(n)).
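Putting the summary together, a minimal Python sketch of this data structure (the class name is illustrative; union by size only, no path compression):

```python
class UnionFind:
    def __init__(self, vertices):
        # Initialization: each v is a component of size 1 and points to itself.
        self.parent = {v: v for v in vertices}
        self.size = {v: 1 for v in vertices}

    def find(self, v):
        # Pointer-chase to the leader: O(log n) by the doubling argument.
        while self.parent[v] != v:
            v = self.parent[v]
        return v

    def union(self, u, v):
        # u, v are leaders; the smaller component's leader points to the larger's.
        if self.size[u] < self.size[v]:
            u, v = v, u
        self.parent[v] = u             # 1 pointer update
        self.size[u] += self.size[v]   # 1 increment
```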
Kruskal’s Runtime With Union-Find
procedure kruskal(G(V, E)):
    sort E in order of increasing weights       // O(m log n)
    rename E so w(e_1) < w(e_2) < … < w(e_m)
    init Union-Find                             // O(n)
    T = {}  // final tree edges
    for i = 1 to m:                             // m iterations
        e_i = (u, v)
        if find(u) != find(v):                  // O(log n)
            add e_i to T
            Union(find(u), find(v))             // O(1)
    return T
***Total Runtime: O(m log n)***, the same as Prim’s with heaps.
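And the full algorithm as a sketch in Python, using the UnionFind class above (the (weight, u, v) edge format is an assumption):

```python
def kruskal(vertices, edges):
    """edges: list of (weight, u, v) tuples. Returns the MST as an edge list."""
    uf = UnionFind(vertices)
    tree = []
    for w, u, v in sorted(edges):        # O(m log m) = O(m log n)
        lu, lv = uf.find(u), uf.find(v)  # two O(log n) finds
        if lu != lv:                     # (u, v) doesn't create a cycle
            tree.append((u, v, w))
            uf.union(lu, lv)             # O(1)
    return tree
```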
Outline For Today
1. Runtime of Kruskal’s Algorithm (Union-Find Data Structure)
2. Data Encodings & Finding an Optimal Prefix-free Encoding
3. Prefix-free Encodings ↔ Binary Trees
4. Huffman Codes
Data Encodings and Compression All data in the digital world gets represented as 0s and 1s
Encoding-Decoding Protocol
[Figure: a document passes through an encoder to become a binary blob, and back through a decoder.]
Goal of data compression: make the binary blob as small as possible while satisfying the protocol.
Option 1: Fixed Length Codes
Alphabet A = {a, b, c, …, z}; assume |A| = 32.
[Table: each of the 32 letters a, b, …, z mapped to a distinct 5-bit string.]
Each letter is mapped to exactly 5 bits.
Example: ASCII encoding.
Example: Fixed Length Codes
[Figure: “cat” is encoded letter-by-letter with the fixed-length code for A = {a, b, c, …, z}, and the decoder recovers “cat”.]
Output Size of Fixed Length Codes
Input: alphabet A, text document of length n.
Each letter is mapped to log₂(|A|) bits.
Output size: n·log₂(|A|) bits (e.g., 5n bits for |A| = 32).
This is optimal if letters appear with equal frequencies in the text!
In practice, letters appear with different frequencies.
Ex: in English, the letters a, t, e are much more frequent than q, z, x.
Question: can we do better?
Option 2: Variable Length Binary Codes
Goal is to assign:
Frequently appearing letters → short bit strings
Infrequently appearing ones → long bit strings
Hope: on average use ≤ n·log₂(|A|) encoded bits for documents of size n (i.e., ≤ log₂(|A|) bits per letter).
Example 1: Morse Code (not binary)
Two symbols: dot (●) and dash (−), or light and dark.
But the end of a letter is indicated with a pause P (effectively a third symbol).
Frequent letters: e => ●, t => −, a => ●−
Infrequent letters: c => −●−●, j => ●−−−
Ex: “cat” encodes to −●−●P ●−P −P, and the decoder recovers “cat”.
Can We Have a Morse Code with 2 Symbols?
Goal: same idea as Morse code but with only 2 symbols (no pause).
Frequent letters: e => 0, t => 1, a => 01
Infrequent letters: c => 1010, j => 0111
Ex: “cat” encodes to 1010 01 1 = 1010011, which could decode as “taeett”, “teteat”, or “cat”.
**Decoding is Ambiguous**
Why Was There Ambiguity?
The encoding of one letter was a prefix of another letter’s.
Ex: e => 0 is a prefix of a => 01.
Goal: use a “prefix-free” encoding, i.e., no letter’s encoding is a prefix of another’s!
Note: the fixed-length encoding was naturally prefix-free.
Ex: Variable Length Prefix-free Encoding
Ex: A = {a, b, c, d}
[Figure: a bit string is decoded one codeword at a time; because the code is prefix-free, each codeword is recognized as soon as its last bit is read. The decoder first recovers “cab”, then, on a second input, “dacca”.]
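A small Python sketch of prefix-free decoding (the code table below is hypothetical, since the slide’s actual codewords are in the lost figure): because no codeword is a prefix of another, the decoder can emit a letter the moment its accumulated bits match a codeword.

```python
def decode(bits, code):
    """Decode a bit string with a prefix-free code (letter -> codeword)."""
    inverse = {cw: letter for letter, cw in code.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:   # unambiguous: no codeword is a prefix of another
            out.append(inverse[current])
            current = ""
    return "".join(out)

code = {"a": "0", "b": "10", "c": "110", "d": "111"}  # hypothetical prefix-free code
print(decode("110010", code))        # -> "cab"
print(decode("11101101100", code))   # -> "dacca"
```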
Benefits of Variable Length Codes
Ex: A = {a, b, c, d}, frequencies: a: 45%, b: 40%, c: 10%, d: 5%.
For a document of length 100K letters:
Fixed length code: 2 bits/letter => 200K bits.
Variable length code (1, 2, 3, and 3 bits for a, b, c, d): a: 45K + b: 80K + c: 30K + d: 15K = 170K bits total (1.7 bits/letter).
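The 1.7 bits/letter figure is just the frequency-weighted average of the codeword lengths; a quick check in Python (the variable-length code lengths 1, 2, 3, 3 are read off the slide’s per-letter totals):

```python
freqs = {"a": 0.45, "b": 0.40, "c": 0.10, "d": 0.05}
fixed_len = {x: 2 for x in freqs}           # 2 bits per letter
var_len = {"a": 1, "b": 2, "c": 3, "d": 3}  # variable-length code

def avg_bits(freqs, lengths):
    return sum(freqs[x] * lengths[x] for x in freqs)

print(avg_bits(freqs, fixed_len))  # 2.0 bits/letter -> 200K bits for 100K letters
print(avg_bits(freqs, var_len))    # 1.7 bits/letter -> 170K bits
```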
Formal Problem Statement
Input: an alphabet A and the frequencies of the letters in A.
Output: a prefix-free encoding Ɣ, i.e., a mapping A -> {0,1}*, that minimizes the average number of bits per letter.
Outline For Today
1. Runtime of Kruskal’s Algorithm (Union-Find Data Structure)
2. Data Encodings & Finding an Optimal Prefix-free Encoding
3. Prefix-free Encodings ↔ Binary Trees
4. Huffman Codes
Prefix-free Encodings ↔ Binary Trees
We can represent each prefix-free code Ɣ as a binary tree T as follows:
[Figure: Code 1 for A = {a, b, c, d} drawn as a binary tree with 0/1-labeled branches and the letters at the leaves.]
Encoding of letter x = the 0/1 path from the root to the leaf labeled x.
Prefix-free Encodings ↔ Binary Trees
We can represent each prefix-free code Ɣ as a binary tree T as follows:
[Figure: a second code, Code 2, for A = {a, b, c, d} drawn as a different binary tree.]
Reverse is Also True
Each labeled binary tree T corresponds to a prefix-free code for an alphabet A, where |A| = # leaves of T.
[Figure: a binary tree with 0/1-labeled branches and leaves a, b, c, d, e.]
Why is this code prefix-free?
Reverse is Also True
Claim: each labeled binary tree T corresponds to a prefix-free code for an alphabet A, where |A| = # leaves of T.
Proof: take the path P ∈ {0,1}* from the root to leaf x as x’s encoding. Since each letter x sits at a leaf, the path from the root to x is a dead end and cannot be a prefix of the path to another letter y.
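A sketch of this tree-to-code direction in Python, representing a tree as nested (left, right) pairs with letters at the leaves (this representation is an assumption for illustration):

```python
def codewords(tree, path=""):
    """Letter -> codeword: the codeword is the 0/1 root-to-leaf path."""
    if not isinstance(tree, tuple):        # leaf: the path so far is its codeword
        return {tree: path}
    left, right = tree
    result = codewords(left, path + "0")   # left branch contributes a 0
    result.update(codewords(right, path + "1"))  # right branch contributes a 1
    return result

print(codewords(("a", ("b", ("c", "d")))))
# {'a': '0', 'b': '10', 'c': '110', 'd': '111'}  (prefix-free: letters are leaves)
```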
Number of Bits for Letter x?
Let A be an alphabet and T a binary tree whose leaves are the letters of A.
Question: what is the number of bits used for a letter x in the encoding corresponding to T?
Answer: depth_T(x).
Formal Problem Statement Restated
Input: an alphabet A and the frequencies f(x) of the letters x in A.
Output: a binary tree T, with the letters of A as its leaves, that has the minimum average bit length (ABL):
ABL(T) = Σ_{x ∈ A} f(x) · depth_T(x)
Outline For Today
1. Runtime of Kruskal’s Algorithm (Union-Find Data Structure)
2. Data Encodings & Finding an Optimal Prefix-free Encoding
3. Prefix-free Encodings ↔ Binary Trees
4. Huffman Codes
Observation 1 About Optimal T
Claim: the optimal binary tree T is full, i.e., each non-leaf vertex u has exactly 2 children.
Why?
Exchange argument: if some internal vertex u has only one child, replace u with its only child. This decreases the depths of the leaves below u, giving a better tree T`.
[Figure: a tree T with a one-child internal vertex, and the improved tree T` obtained by contracting that vertex.]
First Algorithm: Shannon-Fano Codes
From 1948. A top-down, divide-and-conquer type approach:
1. Divide the alphabet into A_0 and A_1 s.t. the total frequencies of the letters in A_0 and in A_1 are each roughly 50%.
2. Recursively find an encoding Ɣ_0 for A_0 and Ɣ_1 for A_1.
3. Prepend 0 to the encodings in Ɣ_0 and 1 to those in Ɣ_1.
First Algorithm: Shannon-Fano Codes
Ex: A = {a, b, c, d}, frequencies: a: 45%, b: 40%, c: 10%, d: 5%.
Split: A_0 = {a, d}, A_1 = {b, c} (each side 50%).
[Figure: the resulting tree has a and d under the 0-branch and b and c under the 1-branch, so every letter gets 2 bits.]
This is the fixed-length encoding, which we saw was suboptimal!
Observation 2 About Optimal T
Claim: in any optimal tree T, if leaf x has depth i and leaf y has depth j with i < j, then f(x) ≥ f(y).
Why?
Exchange argument: otherwise, swap x and y to get a better tree T`.
Observation 2 About Optimal T
Ex: A = {a, b, c, d}, frequencies: a: 45%, b: 40%, c: 10%, d: 5%.
[Figure: tree T puts the low-frequency letter c near the root and the high-frequency letter a deep in the tree; swapping a and c gives T`.]
T => 2.4 bits/letter; T` => 1.7 bits/letter.
Corollary
In any optimal tree T, the two lowest-frequency letters both appear at the lowest level of the tree!
Huffman’s Key Insight
There is an optimal tree T in which the two lowest-frequency letters are siblings (at the lowest level of the tree).
Observation 1 => optimal trees are full => each leaf has a sibling.
Corollary => the 2 lowest-frequency letters x, y are at the lowest level.
Swapping letters within the same level does not change the cost of T, so we can make x and y siblings.
Possible Greedy Algorithm
1. If x, y are siblings, treat them as a single meta-letter xy.
2. Find an optimal tree T* for the alphabet A − {x, y} + {xy}.
3. Expand xy back into x and y in T*.
Possible Greedy Algorithm (Example)
Ex: A = {x, y, z, t}, and let x, y be the two lowest-frequency letters. Let A` = {xy, z, t}.
[Figure: an optimal tree T* for A` with leaves xy, z, t; expanding the leaf xy into an internal node with children x and y yields T.]
The Weight of the Meta-letter?
Q: What weight should be attached to the meta-letter xy?
A: f(x) + f(y)
Huffman’s Algorithm (1951)
procedure Huffman(A, f):
    if |A| = 2: return the tree T whose branches 0 and 1 point to A[0] and A[1], respectively
    let x, y be the two lowest-frequency letters
    let A` = A − {x, y} + {xy}
    let f` = f − {x, y} + {xy: f(x) + f(y)}
    T* = Huffman(A`, f`)
    expand x, y in T* to get T
    return T
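A direct Python transcription of this recursion, as a sketch: representing the meta-letter xy as the pair (x, y) makes the “expand” step automatic, since the meta-leaf already is the two-leaf subtree over x and y.

```python
def huffman(freqs):
    """freqs: dict (meta-)letter -> frequency. Returns a nested-pair tree."""
    if len(freqs) == 2:
        a, b = freqs                      # base case: branches 0, 1 point to a, b
        return (a, b)
    x, y = sorted(freqs, key=freqs.get)[:2]    # two lowest-frequency letters
    merged = {k: f for k, f in freqs.items() if k not in (x, y)}
    merged[(x, y)] = freqs[x] + freqs[y]       # meta-letter with weight f(x)+f(y)
    return huffman(merged)                # the meta-leaf (x, y) is already expanded

tree = huffman({"a": 0.45, "b": 0.40, "c": 0.10, "d": 0.05})
print(tree)  # ('a', (('d', 'c'), 'b')): codeword lengths 1, 3, 3, 2 -> ABL 1.7
```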
Huffman’s Algorithm Correctness (1)
By induction on |A|.
Base case: |A| = 2 => return the simple full tree with 2 leaves, which is optimal.
IH: assume the claim holds for all alphabets of size k−1.
For |A| = k, Huffman recursively obtains an optimal tree T_{k−1}^{opt} for A` (which contains the meta-letter xy) and expands xy.
Huffman’s Algorithm Correctness (2)
[Figure: T_{k−1}^{opt} with leaf xy, and T obtained by expanding xy into x and y.]
In T_{k−1}^{opt}, the leaf xy contributes f(xy)·depth(xy) = (f(x) + f(y))·depth(xy).
In T, the leaves x and y contribute (f(x) + f(y))·(depth(xy) + 1).
Total difference: ABL(T) = ABL(T_{k−1}^{opt}) + f(x) + f(y).
Huffman’s Algorithm Correctness (3)
Take any optimal tree Z; we’ll argue ABL(T) ≤ ABL(Z).
By the corollary, we can assume that in Z, x and y are also siblings at the lowest level.
Form Z` by merging x and y => Z` is a valid prefix-free code tree for A`, an alphabet of size k−1.
ABL(Z) = ABL(Z`) + f(x) + f(y)
ABL(T) = ABL(T`) + f(x) + f(y), where T` = T_{k−1}^{opt}
By IH: ABL(T`) ≤ ABL(Z`) => ABL(T) ≤ ABL(Z). Q.E.D.
Huffman’s Algorithm Runtime
Exercise: can you make Huffman run in O(|A|·log(|A|)) time? (One sketch follows below.)
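One standard answer, sketched with Python’s heapq: keep the (meta-)letters in a min-heap keyed by frequency, so each of the |A|−1 merges does O(log |A|) heap work. (The counter breaks frequency ties so the heap never tries to compare trees.)

```python
import heapq
from itertools import count

def huffman_fast(freqs):
    """O(|A| log |A|): heapify once, then |A|-1 pop-pop-push merges."""
    tiebreak = count()
    heap = [(f, next(tiebreak), letter) for letter, f in freqs.items()]
    heapq.heapify(heap)                       # O(|A|)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)     # the two lowest frequencies;
        f2, _, right = heapq.heappop(heap)    # each pop is O(log |A|)
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))
    return heap[0][2]                         # the root of the Huffman tree

print(huffman_fast({"a": 0.45, "b": 0.40, "c": 0.10, "d": 0.05}))
# ('a', (('d', 'c'), 'b')), the same tree as the recursive version above
```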