Data Structures and Algorithms

Data Structures and Algorithms course slides: Radix Search, Radix Sort, Bucket Sort, Huffman Compression

Radix Searching. For many applications, keys can be thought of as numbers. Searching methods that take advantage of the digital properties of these keys are called radix searches. Radix searches treat keys as numbers in base M (the radix) and work with individual digits.

Radix Searching. Radix searches provide reasonable worst-case performance without the complication of balanced trees, and they provide a way to handle variable-length keys. However, biased data can lead to degenerate data structures with bad performance.

The Simplest Radix Search: Digital Search Trees. These are like BSTs, but they branch according to the key's bits; the key comparison is replaced by a function that accesses the key's next bit.

Digital Search Example.

Digital Search Trees. Consider BST search for key K. For each node T in the tree there are four possible outcomes:
- T is empty (or a sentinel node), indicating the item was not found
- K matches T.key, and the item is found
- K < T.key, and we go to the left child
- K > T.key, and we go to the right child
Now consider the same basic technique, but proceeding left or right based on the current bit within the key.

Digital Search Trees. Call this tree a Digital Search Tree (DST). A DST search for key K has, at each node T, four possible outcomes:
- T is empty (or a sentinel node), indicating the item was not found
- K matches T.key, and the item is found
- the current bit of K is a 0, and we go to the left child
- the current bit of K is a 1, and we go to the right child
A sketch of this search follows (the slides work through an example on the board).
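
A minimal sketch of DST search under these rules. The Node class, its field names, and the fixed key width are illustrative assumptions, not part of the slides.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def dst_search(root, key, bits=32):
    """Branch on successive bits of the key (most significant first),
    comparing the full key at each node, as in a digital search tree."""
    node = root
    for i in range(bits):
        if node is None:
            return None                      # ran off the tree: not found
        if node.key == key:
            return node                      # full-key match
        bit = (key >> (bits - 1 - i)) & 1    # current bit of the key
        node = node.left if bit == 0 else node.right
    return node if node is not None and node.key == key else None
```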

Digital Search Trees. Run times? Given N random keys, the height of a DST should average O(log2 N). Think of it this way: if the keys are random, at each branch it should be equally likely that a key will have a 0 bit or a 1 bit, so the tree should be well balanced. In the worst case, we are bounded by the number of bits in the key (say it is b). So in a sense we can say that this tree has a constant run time, if the number of bits in the key is a constant. This is an improvement over the BST.

Digital Search Trees. But DSTs have drawbacks. Bitwise operations are not always easy: some languages do not provide them at all, and in others they are costly. Handling duplicates is problematic: where would we put a duplicate object? Following the bits to a new position will work, but Find will always find the first one (actually this problem exists with BSTs as well). We could instead have nodes store a collection of objects rather than a single object.

Digital Search Trees. There is a similar problem with keys of different lengths: what if a key is a prefix of another key that is already present? The data is not sorted; if we want sorted data, we would need to extract all of the data from the tree and sort it. We may do b comparisons (of the entire key) to find a key; if a key is long and comparisons are costly, this can be inefficient.

Digital Search. Requires O(log N) comparisons on average, and b comparisons in the worst case, for a tree built with N random b-bit keys.

Digital Search. Problem: at each node we make a full key comparison, which may be expensive (e.g. very long keys). Solution: store keys only at the leaves, and use radix expansion to do the intermediate key comparisons.

Radix Tries. The name "trie" comes from retrieval. Internal nodes are used for branching; external nodes are used for the final key comparison and to store the data.

Radix Trie Example. Keys and their 5-bit codes: A = 00001, C = 00011, E = 00101, H = 01000, R = 10010, S = 10011 (trie diagram not reproduced here).

Radix Tries. The left subtree has all keys whose leading bit is 0; the right subtree has all keys whose leading bit is 1. An insert or search requires O(log N) bit comparisons in the average case, and b bit comparisons in the worst case.

Radix Tries. Problem: lots of extra nodes for keys that differ only in low-order bits (see the R and S nodes in the example above). This is addressed by Patricia trees, which allow "lookahead" to the next relevant bit. Patricia stands for Practical Algorithm To Retrieve Information Coded In Alphanumeric. In the slides that follow, the entire alphabet would be included in the indexes.

Radix Search Tries. Benefit of simple radix search tries: fewer comparisons of the entire key than DSTs. Drawbacks: the tree will have more nodes overall than a DST, since each external node with a key needs a unique bit path to it; internal and external nodes are of different types; and insert is somewhat more complicated, since some insert situations require new internal as well as external nodes to be created (we need the new internal nodes to ensure that each object has a unique path to it). See the example.

Radix Search Tries. The run time is similar to a DST. Since the tree is binary, the average tree height for N keys is O(log2 N); however, paths for nodes with many bits in common will tend to be longer. The worst-case path length is again b, but now at worst b bit comparisons are required, and we only need one comparison of the entire key. So, again, the benefit of an RST is that the entire key must be compared only one time.

Improving Tries. How can we improve tries? Can we reduce the heights somehow (the average height now is O(log2 N))? Can we simplify the data structures needed (so different node types are not required)? Can we simplify the insert? We will examine a couple of variations that improve over the basic trie.

Bucket-Sort. Let S be a sequence of n (key, element) entries with keys in the range [0, N - 1]. Bucket-sort uses the keys as indices into an auxiliary array B of sequences (buckets).
Phase 1: empty sequence S by moving each entry (k, o) into its bucket B[k].
Phase 2: for i = 0, ..., N - 1, move the entries of bucket B[i] to the end of sequence S.
Analysis: Phase 1 takes O(n) time and Phase 2 takes O(n + N) time, so bucket-sort takes O(n + N) time.

Algorithm bucketSort(S, N)
  Input: sequence S of (key, element) items with keys in the range [0, N - 1]
  Output: sequence S sorted by increasing keys
  B ← array of N empty sequences
  while ¬S.isEmpty()
    f ← S.first()
    (k, o) ← S.remove(f)
    B[k].insertLast((k, o))
  for i ← 0 to N - 1
    while ¬B[i].isEmpty()
      f ← B[i].first()
      (k, o) ← B[i].remove(f)
      S.insertLast((k, o))
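
A minimal runnable sketch of the same algorithm in Python; the function name and the list-of-lists bucket representation are illustrative, not from the slides.

```python
def bucket_sort(S, N):
    B = [[] for _ in range(N)]        # N empty buckets
    for k, o in S:                    # Phase 1: scatter entries into buckets
        B[k].append((k, o))
    out = []
    for i in range(N):                # Phase 2: gather buckets in key order
        out.extend(B[i])              # appending in order keeps the sort stable
    return out

# bucket_sort([(7,'d'), (1,'c'), (3,'a'), (7,'g'), (3,'b'), (7,'e')], 10)
# -> [(1,'c'), (3,'a'), (3,'b'), (7,'d'), (7,'g'), (7,'e')]
```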

Bucket Sort. Each element of the array is put into one of the N "buckets".

Bucket Sort. Now, pull the elements from the buckets back into the array; at last, we have the sorted array (sorted in a stable way).

Example. Sorting a sequence of 4-bit integers: 1001, 0010, 1101, 0001, 1110.

Example. Key range [0, 9]; input: (7, d), (1, c), (3, a), (7, g), (3, b), (7, e). Phase 1 scatters the entries into buckets: B[1] = (1, c); B[3] = (3, a), (3, b); B[7] = (7, d), (7, g), (7, e). Phase 2 gathers them back in order: (1, c), (3, a), (3, b), (7, d), (7, g), (7, e).

Properties and Extensions. Key-type property: the keys are used as indices into an array and cannot be arbitrary objects; there is no external comparator. Stable-sort property: the relative order of any two items with the same key is preserved after the execution of the algorithm. Extensions: for integer keys in the range [a, b], put entry (k, o) into bucket B[k - a]. For string keys from a set D of possible strings, where D has constant size (e.g., the names of the 50 U.S. states), sort D, compute the rank r(k) of each string k of D in the sorted sequence, and put entry (k, o) into bucket B[r(k)].

Lexicographic Order. A d-tuple is a sequence of d keys (k1, k2, …, kd), where key ki is said to be the i-th dimension of the tuple. Example: the Cartesian coordinates of a point in space are a 3-tuple. The lexicographic order of two d-tuples is recursively defined as follows: (x1, x2, …, xd) < (y1, y2, …, yd) ⇔ x1 < y1 ∨ (x1 = y1 ∧ (x2, …, xd) < (y2, …, yd)). That is, the tuples are compared by the first dimension, then by the second dimension, and so on.

Lexicographic-Sort. Let Ci be the comparator that compares two tuples by their i-th dimension, and let stableSort(S, C) be a stable sorting algorithm that uses comparator C. Lexicographic-sort sorts a sequence of d-tuples in lexicographic order by executing stableSort d times, once per dimension, from the last dimension to the first. Lexicographic-sort runs in O(d T(n)) time, where T(n) is the running time of stableSort.

Algorithm lexicographicSort(S)
  Input: sequence S of d-tuples
  Output: sequence S sorted in lexicographic order
  for i ← d downto 1
    stableSort(S, Ci)

Example: (7,4,6) (5,1,5) (2,4,6) (2,1,4) (3,2,4)
  after sorting on dimension 3: (2,1,4) (3,2,4) (5,1,5) (7,4,6) (2,4,6)
  after sorting on dimension 2: (2,1,4) (5,1,5) (3,2,4) (7,4,6) (2,4,6)
  after sorting on dimension 1: (2,1,4) (2,4,6) (3,2,4) (5,1,5) (7,4,6)
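
A small sketch of lexicographic-sort in Python; sorted() is stable, so it stands in for the slides' stableSort(S, Ci).

```python
def lexicographic_sort(S, d):
    for i in range(d - 1, -1, -1):            # dimensions d-1 down to 0
        S = sorted(S, key=lambda t: t[i])     # stable sort on dimension i
    return S

# lexicographic_sort([(7,4,6), (5,1,5), (2,4,6), (2,1,4), (3,2,4)], 3)
# -> [(2,1,4), (2,4,6), (3,2,4), (5,1,5), (7,4,6)]
```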

Radix-Sort. Radix-sort is a specialization of lexicographic-sort that uses bucket-sort as the stable sorting algorithm in each dimension. Radix-sort is applicable to tuples where the keys in each dimension i are integers in the range [0, N - 1]. Radix-sort runs in O(d(n + N)) time.

Algorithm radixSort(S, N)
  Input: sequence S of d-tuples such that (0, …, 0) ≤ (x1, …, xd) ≤ (N - 1, …, N - 1) for each tuple (x1, …, xd) in S
  Output: sequence S sorted in lexicographic order
  for i ← d downto 1
    bucketSort(S, N)   (using the i-th dimension as the key)

Radix-Sort for Binary Numbers. Consider a sequence of n b-bit integers x = x_{b-1} … x_1 x_0. We represent each element as a b-tuple of integers in the range [0, 1] and apply radix-sort with N = 2. This application of the radix-sort algorithm runs in O(bn) time. For example, we can sort a sequence of 32-bit integers in linear time.

Algorithm binaryRadixSort(S)
  Input: sequence S of b-bit integers
  Output: sequence S sorted
  replace each element x of S with the item (0, x)
  for i ← 0 to b - 1
    replace the key k of each item (k, x) of S with bit x_i of x
    bucketSort(S, 2)
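
A compact sketch of the same idea in Python: b stable bucket-sort passes on single bits, least significant bit first.

```python
def binary_radix_sort(values, b):
    for i in range(b):                         # bit i, least significant first
        buckets = [[], []]                     # N = 2 buckets
        for x in values:
            buckets[(x >> i) & 1].append(x)    # stable scatter on bit i
        values = buckets[0] + buckets[1]       # gather in order
    return values

# binary_radix_sort([0b1001, 0b0010, 0b1101, 0b0001, 0b1110], 4)
# -> [1, 2, 9, 13, 14]
```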

Does it Work for Real Numbers? What if the keys are not integers? Assumption: the input is n reals from [0, 1). Basic idea: create N linked lists (buckets) to divide the interval [0, 1) into subintervals of size 1/N, add each input element to the appropriate bucket, and sort the buckets with insertion sort. A uniform input distribution gives O(1) expected bucket size, so the expected total time is O(n). The distribution of keys into buckets is similar to …?
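
A minimal sketch of this bucket sort for reals in [0, 1), assuming N = n buckets; the helper insertion sort is included to keep it self-contained.

```python
def bucket_sort_reals(A):
    n = len(A)
    buckets = [[] for _ in range(n)]
    for x in A:
        buckets[int(n * x)].append(x)     # x in [0,1) lands in bucket floor(n*x)
    out = []
    for b in buckets:
        out.extend(insertion_sort(b))     # buckets are expected to be tiny
    return out

def insertion_sort(b):
    for i in range(1, len(b)):
        key, j = b[i], i - 1
        while j >= 0 and b[j] > key:      # shift larger elements right
            b[j + 1] = b[j]
            j -= 1
        b[j + 1] = key
    return b
```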

Radix Sort. What sort will we use to sort on digits? Bucket sort is a good choice: it sorts n numbers on digits that range over 1..N in O(n + N) time. Each pass over n numbers with d digits takes O(n + k) time, so the total time is O(dn + dk); when d is constant and k = O(n), this takes O(n) time.

Radix Sort Example. Problem: sort 1 million 64-bit numbers. Treat them as four-digit radix-2^16 numbers; then we can sort in just four passes with radix sort. Running time: 4(1 million + 2^16) ≈ 4 million operations. Compare with a typical O(n lg n) comparison sort, which requires approximately lg n = 20 operations per number being sorted, for a total running time of about 20 million operations.
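
A sketch of those four passes in Python, splitting each 64-bit value into four 16-bit digits (least significant digit first); the names and structure are illustrative.

```python
def radix_sort_64(values):
    RADIX = 1 << 16                                # 2**16 possible digit values
    for shift in (0, 16, 32, 48):                  # four 16-bit digits, LSD first
        buckets = [[] for _ in range(RADIX)]
        for x in values:
            buckets[(x >> shift) & (RADIX - 1)].append(x)
        values = [x for b in buckets for x in b]   # stable gather
    return values
```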

Radix Sort. In general, radix sort based on bucket sort is asymptotically fast (i.e., O(n)), simple to code, and a good choice. Can radix sort be used on floating-point numbers?

Summary: Radix Sort. Assumption: the input has d digits, each ranging from 0 to k. Basic idea: sort the elements by digit, starting with the least significant, using a stable sort (like bucket sort) for each stage. Each pass over n numbers with one digit takes O(n + k) time, so the total time is O(dn + dk); when d is constant and k = O(n), this is O(n). Radix sort is fast, stable, and simple, but it doesn't sort in place.

Multiway Tries. The RST that we have seen considers the key 1 bit at a time. This causes a maximum height in the tree of up to b, and gives an average height of O(log2 N) for N keys. If we considered m bits at a time, then we could reduce the worst and average heights. The maximum height is now b/m, since m bits are consumed at each level. Let M = 2^m; the average height for N keys is now O(logM N), since we branch in M directions at each node.

Multiway Tries. Let's look at an example. Consider 2^20 (1 meg) keys of length 32 bits. A simple RST will have worst-case height 32 and average-case height O(log2(2^20)) ≈ 20. A multiway trie using 8 bits per level would have worst-case height 32/8 = 4 and average-case height O(log256(2^20)) ≈ 2.5. This is a considerable improvement. Let's also look at an example using character data, considering a single character (8 bits) at each level (worked through on the board; a sketch follows).
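
A hypothetical multiway-trie sketch that consumes one character (8 bits) per level. A dict replaces the M = 256 pointer array described in the slides; the class and function names are illustrative.

```python
class TrieNode:
    def __init__(self):
        self.children = {}           # maps one character -> child TrieNode
        self.value = None
        self.present = False         # True if a key ends at this node

def trie_insert(root, key, value):
    node = root
    for ch in key:                   # one character consumed per level
        node = node.children.setdefault(ch, TrieNode())
    node.present, node.value = True, value

def trie_search(root, key):
    node = root
    for ch in key:
        node = node.children.get(ch)
        if node is None:
            return None              # path does not exist: key not present
    return node.value if node.present else None
```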

Multiway Tries. So what is the catch (or cost)? Memory. Multiway tries use considerably more memory than simple tries: each node in the multiway trie contains M pointers/references (in the example with ASCII characters, M = 256). Many of these are unused, especially along common paths (prefixes) where there is no branching (or "one-way" branching), e.g. through and throughout, and at the lower levels of the tree, where previous branching has likely separated the keys already.

Patricia Trees. Idea: save memory and height by eliminating all nodes in which no branching occurs (see the example on the board). Note that since some nodes are now missing, level i does not necessarily correspond to bit (or character) i, so to do a search we need to store in each node which bit (character) the node corresponds to. However, the savings from the removed nodes is still considerable.

Patricia Trees. Also keep in mind that a key can match at every character that is checked, but still not actually be in the tree. Example for the tree on the board: if we search for TWEEDLE, we will only compare the characters T**E**E; however, the next node after the E is at index 8, which is past the end of TWEEDLE, so it is not found. Run time? Similar to those of the RST and the multiway trie, depending on how many bits are used per node.

Patricia Trees. So Patricia trees reduce the tree height by removing "one-way" branching nodes. The text also shows how "upward" links enable us to use only one node type: the text's version makes the nodes homogeneous by storing keys within the nodes and using "upward" links from the leaves to access the nodes, so every node contains a valid key. However, the keys are not checked on the way "down" the tree, only after an upward link is followed. Thus Patricia saves memory but makes the insert rather tricky, since new nodes may have to be inserted between other nodes (see the text).

PATRICIA TREE. A particular type of "trie". Example: a trie and a PATRICIA tree with content '010', '011', and '101' (diagrams not reproduced here).

PATRICIA TREE. Therefore, a PATRICIA tree has the following attributes in its internal nodes: an index bit (check bit) and child pointers (each internal node must contain exactly 2 children). On the other hand, leaf nodes store the actual content for the final comparison.

SISTRING. Sistring is short for 'semi-infinite string'. A string, whatever it actually represents, is a binary bit pattern (e.g. 11001). One of the sistrings in this example is 11001000…; there are 5 sistrings in total in this example.

SISTRING. Sistrings are theoretically of infinite length: 110010000…, 10010000…, 0010000…, 010000…, 10000…. In practice we cannot store infinite strings; for the above example, we only need to store each sistring up to 5 bits, which is descriptive enough to distinguish each from the others.

SISTRING. The bit level is too abstract; depending on the application, we rarely work at the bit level. The character level is a better idea. E.g. for CUHK, the corresponding sistrings would be CUHK000…, UHK000…, HK000…, K000…. We require that each be at least 4 characters long. (Why do we pad 0/NULL at the end of a sistring?)

SISTRING (USAGE). Sistrings are efficient for storing substring information. A string with n characters has n(n+1)/2 substrings, and the longest one has size n, so the storage requirement for all substrings would be O(n^2) × max(length) → O(n^3). E.g. 'CUHK' is 4 characters long and consists of 4(5)/2 = 10 different substrings: C, U, …, CU, UK, …, CUH, UHK, CUHK.

SISTRING (USAGE). We may instead store the sistrings of 'CUHK', which requires O(n^2) storage:
- CUHK represents C, CU, CUH, CUHK at the same time
- UHK0 represents U, UH, UHK at the same time
- HK00 represents H, HK at the same time
- K000 represents K only
A prefix match on the sistrings is equivalent to an exact match on the substrings. Conclusion: sistrings are a better representation for storing substring information (a small sketch follows).
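
A tiny sketch of generating fixed-length sistrings by padding each suffix, as in the 'CUHK' example; the function name and the '0' padding character are illustrative.

```python
def sistrings(s, pad='0'):
    n = len(s)
    return [s[i:] + pad * i for i in range(n)]   # each suffix, padded to length n

# sistrings('CUHK') -> ['CUHK', 'UHK0', 'HK00', 'K000']
# A prefix match against any sistring covers every substring starting at that
# position, e.g. the prefix 'UH' of 'UHK0' stands for the substring 'UH' of 'CUHK'.
```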

PAT Tree. Now it is time for the PAT tree again. A PAT tree is a PATRICIA tree that stores every sistring of a document. What if the document contains simply 'CUHK'? We prefer characters at this point, but PATRICIA works on bits, so we have to know the bit pattern of each sistring in order to know the actual shape of the resulting PAT tree. It looks frustrating even for a small example, but this is how a PAT tree works.

PAT Tree (Example). By digitizing the string, we can manually visualize what the PAT tree will look like; the actual bit patterns of the four sistrings are shown on the slide (figure not reproduced here). Once we understand how the PAT tree works, we won't detail this in later examples.

PAT Tree. We don't view a document as a packed string of characters; a document consists of words, e.g. "Hello. This is a simple document." In this case, sistrings can be applied at the 'document level': the document is treated as one big string, and we may tokenize it word by word instead of character by character.

PAT Tree (Example). This works, but we still need O(n^2) memory for storing those sistrings. We can reduce the memory to O(n) by making use of pointers.

PAT Tree (Actual Structure). We need to maintain only the document itself; the PAT tree acts as an index structure. Memory requirement: document, O(n); PAT tree index, O(n); leaf pointers, O(n). Therefore, the PAT tree is a linear, O(n), data structure that captures the O(n^3) substring information.

Structure Modification. We can see that the node structures for internal nodes and leaf nodes are not the same; the tree will be more flexible if its nodes are generic (have a universal node structure). Trade-off: a generic node structure will enlarge the individual node size, but memory is cheap now; even a low-end computer can support hundreds of MB of RAM. The modified tree is still an O(n) structure.

Structure of the modified node: a check bit, a frequency count, a link to a sistring, and pointers to the child nodes.

Conclusion. The PAT tree is an O(n) data structure for document indexing, and it is good for solving the substring matching problem. The Chinese PAT tree has sistrings at the sentence level. A frequency count is introduced to overcome the duplicate-sistring problem. By generalizing the node structure, the modified version increases the PAT tree's capability for various applications.

Huffman Compression. Background: Huffman works with arbitrary bytes, but the ideas are most easily explained using character data, so we will discuss it in those terms. Consider the extended ASCII character set: 8 bits per character. This is a BLOCK code, since all codewords are the same length, and 8 bits yield 256 characters. In general, block codes give: for K bits, 2^K characters; for N characters, log2 N bits are required. Block codes are easy to encode and decode.

Huffman Compression. What if we could use variable-length codewords; could we do better than ASCII? The idea is that different characters would use different numbers of bits. If all characters have the same frequency of occurrence, we cannot improve over ASCII. But what if characters had different frequencies of occurrence? Ex: in English text, letters like E, A, I, S appear much more frequently than letters like Q, Z, X. Can we somehow take advantage of these differences in our encoding?

Huffman Compression. First we need to make sure that variable-length coding is feasible. Decoding a block code is easy: take the next 8 bits. Decoding a variable-length code is not so obvious. In order to decode unambiguously, variable-length codes must meet the prefix property: no codeword is a prefix of any other. (See the example on the board showing the ambiguity if the prefix property is not met.) OK, so now how do we compress? Let's use fewer bits for our more common characters, and more bits for our less common characters.

Huffman Compression. The Huffman algorithm: assume we have K characters and that each uncompressed character has some weight associated with it (i.e. its frequency).

Initialize a forest, F, to have K single-node trees in it, one tree per character, each also storing the character's weight
while (|F| > 1)
  Find the two trees, T1 and T2, with the smallest weights
  Create a new tree, T, whose weight is the sum of T1 and T2
  Remove T1 and T2 from F, and add them as the left and right children of T
  Add T to F
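
A minimal sketch of this forest-merging loop in Python, using a heap to find the two smallest-weight trees; the nested-tuple tree representation and the names are illustrative assumptions.

```python
import heapq
from itertools import count

def build_huffman_tree(weights):
    """weights: dict mapping each character to its frequency."""
    tiebreak = count()                       # keeps the heap from comparing trees
    # forest of single-node trees: (weight, tiebreak, tree); a leaf is just a char
    F = [(w, next(tiebreak), ch) for ch, w in weights.items()]
    heapq.heapify(F)
    while len(F) > 1:
        w1, _, t1 = heapq.heappop(F)         # two smallest-weight trees
        w2, _, t2 = heapq.heappop(F)
        heapq.heappush(F, (w1 + w2, next(tiebreak), (t1, t2)))
    return F[0][2]                           # root: nested (left, right) pairs
```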

Huffman Compression. Huffman issues (see the example on the board): Is the code correct? Does it satisfy the prefix property? Does it give good compression? How do we decode? How do we encode? How do we determine the weights/frequencies?

Huffman Compression. Is the code correct? Based on the way the tree is formed, it is clear that the codewords are valid, and the prefix property is assured, since each codeword ends at a leaf (all original nodes corresponding to the characters end up as leaves). Does it give good compression? For a block code of N different characters, log2 N bits are needed per character; thus for a file containing M ASCII characters, 8M bits are needed.

Huffman Compression. Given Huffman codes {C0, C1, …, CN-1} for the N characters in the alphabet, each of length |Ci|, and given frequencies {F0, F1, …, FN-1} in the file (where the sum of all frequencies is M), the total number of bits required for the file is the sum over i = 0 to N-1 of |Ci| * Fi. The overall total depends on the differences in frequencies: the more extreme the differences, the better the compression; if the frequencies are all the same, there is no compression. (See the example from the board.)

Huffman Compression. How to decode? This is fairly straightforward, given that we have the Huffman tree available:

start at the root of the tree and the first bit of the file
while not at end of file
  if the current bit is a 0, go left in the tree
  else go right in the tree            // the bit is a 1
  if we are at a leaf
    output the character
    go back to the root
  read the next bit of the file

Each character is a path from the root to a leaf. If we are not at the root when the end of the file is reached, there was an error in the file.
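
A sketch of that decode loop, assuming the nested-tuple tree from build_huffman_tree above and a bit stream given as an iterable of '0'/'1' characters.

```python
def huffman_decode(tree, bits):
    out, node = [], tree
    for b in bits:
        node = node[0] if b == '0' else node[1]   # go left on 0, right on 1
        if not isinstance(node, tuple):           # reached a leaf
            out.append(node)                      # output the character
            node = tree                           # go back to the root
    if node is not tree:
        raise ValueError("ended mid-codeword: corrupt or truncated input")
    return ''.join(out)
```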

Huffman Compression. How to encode? This is trickier, since we are starting with characters and outputting codewords. Using the tree, we would have to start at a leaf (first finding the correct leaf), then move up to the root, and finally reverse the resulting bit pattern. Instead, let's process the tree once (using a traversal) to build an encoding TABLE. (The traversal is demonstrated on the board; a sketch of building the table follows.)
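
A sketch of building the table with one depth-first traversal of the nested-tuple tree, recording the 0/1 path from the root to each leaf; the names are illustrative.

```python
def build_code_table(tree, prefix='', table=None):
    if table is None:
        table = {}
    if isinstance(tree, tuple):                      # internal node: recurse
        build_code_table(tree[0], prefix + '0', table)
        build_code_table(tree[1], prefix + '1', table)
    else:                                            # leaf: a single character
        table[tree] = prefix
    return table

def huffman_encode(table, text):
    return ''.join(table[ch] for ch in text)         # characters -> codewords
```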

Huffman Compression. How to determine the weights/frequencies? A 2-pass algorithm: process the original file once to count the frequencies, then build the tree/code and process the file again, this time compressing. This ensures that each Huffman tree will be optimal for each file. However, to decode, the tree/frequency information must be stored in the file, most likely at the front, so the decompressor first reads the tree info and then uses it to decompress the rest of the file. This adds extra space to the file, reducing the overall compression quality.

Huffman Compression. The overhead especially reduces quality for smaller files, since the tree/frequency info may add a significant percentage to the file size; thus larger files have a higher potential for compression with Huffman than smaller ones do. However, just because a file is large does NOT mean it will compress well: the most important factor in the compression remains the relative frequencies of the characters. Using a static Huffman tree: process a lot of "sample" files and build a single tree that will be used for all files. This saves the overhead of the tree information, but generally is NOT a very good approach.

Huffman Compression. There are many different file types that have very different frequency characteristics, e.g. a .cpp file vs. a .txt file containing an English essay: the .cpp file will have many ;, {, }, (, ), while the .txt file will have many a, e, i, o, u, ., etc. A tree that works well for one file may work poorly for another (perhaps even expanding it). An adaptive single-pass algorithm builds the tree as it encodes the file, thereby not requiring the tree information to be separately stored, and it processes the file only one time. We will not look at the details of this algorithm, but the LZW algorithm and the self-organizing search algorithm we will discuss next are also adaptive.

Huffman Shortcomings. What is Huffman missing? Although OPTIMAL for single-character (word) compression, Huffman does not take into account patterns / repeated sequences in a file. Ex: a file with 1000 As followed by 1000 Bs, etc. for every ASCII character will not compress AT ALL with Huffman, yet it seems like this file should be compressible. We can use run-length encoding in this case (see the text); however, run-length encoding is very specific and not generally effective for most files (since they do not typically have long runs of each character).