CSE 30331 Lecture 22 – Huffman Codes


CSE 30331 Lecture 22 – Huffman Codes
Topics: binary files; bit operations; using trees, bit operations, and binary files; Huffman compression.

File Structure
A text file contains ASCII characters, with a newline sequence separating lines. A binary file consists of data objects that vary from a single character (byte) to more complex structures, including integers, floating-point values, programmer-defined class objects, and arrays. Each data object in a file is a record.

Direct File Access
The functions seekg() and seekp() reposition the read and write positions, respectively. They take an offset argument indicating the number of bytes from the beginning (ios::beg), end (ios::end), or current position (ios::cur) in the file. The functions tellg() and tellp() return the current read and write positions.

Reading & writing
To read from a binary file, use read(char *p, int num), which reads num bytes of data from the file beginning at the current read position. Example:

// read the accountType record at index 5 from the file
accountType acct;
int n = 5;
ifstream infile;
infile.open("accounts.dat", ios::in | ios::binary);
infile.seekg(n * sizeof(accountType), ios::beg);
infile.read((char *)&acct, sizeof(accountType));

Reading & writing
To write to a binary file, use write(char *p, int num), which writes num bytes of data to the file beginning at the current write position. Example:

// write the accountType record at index 5 to the file
accountType acct;
int n = 5;
ofstream outfile;
outfile.open("accounts.dat", ios::out | ios::binary);
outfile.seekp(n * sizeof(accountType), ios::beg);
outfile.write((char *)&acct, sizeof(accountType));

Bit operations (a reminder)
Bitwise operators:
And (&): 0101 & 0110 -> 0100
Or  (|): 0101 | 0110 -> 0111
Xor (^): 0101 ^ 0110 -> 0011
Not (~): ~0101 -> 1010

Implementing a bitVector Class
bitMask(i) returns an unsigned character value containing a 1 in the bit position representing i.

Lossless Compression
Data compression that loses no information: the original data can be recovered exactly from the compressed data. Normally applied to "discrete" data such as text, word processing files, computer applications, and so forth.

Lossy Compression
Loses some information during compression, so the data cannot be recovered exactly. Shrinks the data further than lossless compression techniques. Sound files often use this type of compression.

Huffman Compression
A lossless compression technique. It counts the occurrences of eight-bit characters in the data and uses the counts to construct variable-length codes, shorter for more frequently occurring characters. The codes are prefix-free: no code is a prefix of any other. The encoding (compression) process creates an "optimal" binary tree representing these prefix codes. It uses a greedy approach, making the best choice from the data on hand at each step (Dijkstra's algorithm is another greedy approach). Achieves compression ratios of at least 1.8 (45% reduction) on text; not as good on binary data.

Example Huffman Tree
Leaves contain the original letters and their frequencies; each internal node contains the sum of its children's frequencies. An edge to a left child is a 0 bit and an edge to a right child is a 1 bit. In the example tree: the root (57) has left child 21 and right child 36; 21 has children c:8 and 13; 13 has children d:6 and 7; 7 has children f:3 and b:4; 36 has children e:20 and a:16.
Codes: a = 11, b = 0111, c = 00, d = 010, e = 10, f = 0110

Building Huffman Code Trees
Read the file and determine the frequency of each letter. Store nodes (letters and frequencies) in a minimum priority queue, typically implemented as a heap with ordering based on frequency. Loop until only one node is left in the queue: remove the two smallest-valued nodes from the queue, make them the two children of a new node whose value is their sum, and add the new node back to the queue. The result is a tree rooted at the last node remaining in the queue. The codes are prefix-free and are derived for each letter (leaf node) by traversing the links in the tree from root to leaf: left is a 0 bit, right is a 1 bit. The length of each code is the depth of its leaf in the tree, so the most frequently occurring values get the shortest codes and the least frequently occurring values get the longest codes.

Huffman tree
The Huffman code tree is optimal in this sense: all internal nodes have two children, so there are no unused prefixes, and the number of shorter codes is the maximum possible given the frequencies in the data. The size of the compressed data is the sum, over all characters ch, of f(ch) * d(ch), where f(ch) is the frequency of ch and d(ch) is the number of bits in its code.

Building a Huffman Tree

Building a Huffman Tree (after first pass)
(f:3) and (b:4) were the lowest-frequency nodes, so they were joined under a parent (7), which was then added back to the queue. Priority queue: d:6, 7, c:8, a:16, e:20.

Building a Huffman Tree (after second pass)
(d:6) and (7) were the lowest-frequency nodes, so they were joined under a parent (13), which was then added back to the queue. Priority queue: c:8, 13, a:16, e:20.

Building a Huffman Tree (after third pass)
(c:8) and (13) were the lowest-frequency nodes, so they were joined under a parent (21), which was then added back to the queue. Priority queue: a:16, e:20, 21.

Building a Huffman Tree (after fourth pass)
(a:16) and (e:20) were the lowest-frequency nodes, so they were joined under a parent (36), which was then added back to the queue. Priority queue: 21, 36.

Building a Huffman Tree (after last pass)
(21) and (36) were the last two nodes, so they were joined under a parent (57), which is now the only node in the queue and is therefore the root of the finished tree.

The Huffman tree in memory
Each node stores an ID, character (ch), frequency, parent ID (pID), left and right child IDs, and, for leaves, a code:

ID  ch   freq  pID  left  right  code
0   a    16    9    -1    -1     11
1   b    4     6    -1    -1     0111
2   c    8     8    -1    -1     00
3   d    6     7    -1    -1     010
4   e    20    9    -1    -1     10
5   f    3     6    -1    -1     0110
6   int  7     7    5     1
7   int  13    8    3     6
8   int  21    10   2     7
9   int  36    10   4     0
10  int  57    -1   8     9

Sample compression: "face" = 0110 11 00 10. Number of bits: 4*8 = 32 uncompressed vs. 4+2+2+2 = 10 compressed.

The Huffman tree in file
Only the ch, left, and right fields of each node (the gray fields on the slide) are written to store the tree in the compressed file. The tree can then be rebuilt from ch and the left and right child indices read from the file. The last node is the root, and the codes can be rediscovered as bits are read from the file and the tree is followed from root to leaf.

Format of compressed file
There are four parts: (1) the size of the tree, (2) the tree, a vector of (ch, leftID, rightID) records, (3) the size of the compressed data, and (4) the compressed data.

Uncompressing tree in file
Read the size of the tree, then read the tree from the file into a vector or array, then read the size of the compressed data. Decoding starts at the root (the last node in the table). For each bit b read: if b == 0, move to the left child; if b == 1, move to the right child; if now at a leaf, append the leaf's letter to the uncompressed data and return to the root.

Uncompressing "face"
Bit data: 0110110010. Starting at the root (node 10):
0 -> node 8, 1 -> node 7, 1 -> node 6, 0 -> node 5: leaf 'f'
1 -> node 9, 1 -> node 0: leaf 'a'
0 -> node 8, 0 -> node 2: leaf 'c'
1 -> node 9, 0 -> node 4: leaf 'e'

Summary: Binary File
A sequence of 8-bit characters, with no requirement that a character be printable and no concern for a newline sequence terminating lines. Often organized as a sequence of records: record 0, record 1, record 2, ..., record n-1. Used for both input and output; the C++ header <fstream> contains the operations that support these files. The open() function must use the attribute ios::binary.

Summary: Binary File (cont.)
For direct access to a file record, use the function seekg(), which moves the file pointer to the record. It accepts an argument that specifies motion from the beginning of the file (ios::beg), from the current position of the file pointer (ios::cur), or from the end of the file (ios::end). Use the read() function to input a sequence of bytes from the file into a block of memory, and the write() function to output from a block of memory to a binary file.

Summary: Bit Manipulation Operators
| (OR), & (AND), ^ (XOR), ~ (NOT), << (shift left), and >> (shift right) perform operations on specific bits within a character or integer value. The bitVector class uses operator overloading to treat a sequence of bits as an array, with bit 0 the left-most bit of the sequence; bit(), set(), and clear() allow access to specific bits. The class has I/O operations for binary files and a stream operator << that outputs a bit vector as an ASCII sequence of 0 and 1 values.

Summary: File Compression Algorithm
Encodes a file as a sequence of characters that consumes less disk space than the original file. There are two types of compression algorithms. 1) Lossless compression restores the original file exactly. Approach: count the frequency of occurrence of each character in the file and assign a prefix bit code to each character. The compressed file size is the sum of the products of each bit-code length and the frequency of occurrence of the corresponding character.

Summary: File Compression Algorithm (cont.)
2) Lossy compression loses some information during compression, so the data cannot be recovered exactly; it is normally used with sound and video files. The Huffman compression algorithm is a lossless process that builds optimal prefix codes by constructing a tree in which the most frequently occurring characters, with shorter bit codes, are leaves close to the root, and the less frequently occurring characters, with longer bit codes, are farther from the root.

Summary: File Compression Algorithm (cont.)
If the file contains n distinct characters, the loop concludes after n-1 iterations, having built a Huffman tree containing n-1 internal nodes. Implementation requires a minimum priority queue (heap), bit operations, and binary files. The bitVector class simplifies the construction of the classes hCompress and hDecompress, which perform Huffman compression and decompression. The algorithm works better on text files, which tend to have fewer unique characters than binary files.