Compression and Huffman Coding

Compression

Compression: reducing the memory required to store some information.
Lossless compression vs lossy compression.
Lossless compression is only possible if there is redundancy in the original
– eg, repeated patterns
– compression identifies and removes some of the redundant elements.

original text/image/sound → compress → compressed text/image/sound

Lempel-Ziv

Lossless compression.
LZ77 = simple compression, using repeated patterns
– basis for many later, more sophisticated compression schemes.
Key idea: if you find a repeated pattern, replace the later occurrences by a link back to the first occurrence:

  a contrived text containing riveting contrasting
  a contrived text [15,5] ain [2,2] g [22,4] t [9,3][35,5] as [12,4]

Lempel-Ziv

The text

  a contrived text containing riveting contrasting t

encodes to the tuples

  [0, 0, a][0, 0, ][0, 0, c][0, 0, o][0, 0, n][0, 0, t][0, 0, r][0, 0, i][0, 0, v][0, 0, e][0, 0, d][10, 1, t]
  [4, 1, x][3, 1, ][15, 4, a][15, 1, n][2, 2, g][11, 1, r][22, 3, t][9, 4, c][35, 4, a][0, 0, s][12, 5, t]

Lempel-Ziv 77

Outputs a string of tuples
– tuple = [offset, length, nextCharacter], or [0, 0, character] when there is no match.
Moves a cursor through the text one character at a time
– the cursor points at the next character to be encoded.
Drags a "sliding window" behind the cursor
– searches for matches only in this sliding window.
Expands a lookahead buffer from the cursor
– this is the string it wants to match in the sliding window.
Searches for a match for the longest possible lookahead
– stops expanding when there isn't a match.
Outputs a tuple of the match position, the match length, and the next character.

Lempel-Ziv 77 skljsadf lkjhwep oury d dmsmesjkh fjdhfjdfjdpppdjkhf sdjkh fjdhfjds fjksdh kjjjfiuiwe dsd fdsf sdsa Algorithm cursor  0 windowSize  100 // some suitable size while cursor < text.size lookahead  0 prevMatch  0 loop match  stringMatch(text[cursor.. cursor+lookahead], text[(cursor<windowSize)?0:cursor-windowSize.. cursor-1] ) if match succeeded then prevMatch = match lookahead++ else output( [suitable value for prevMatch, lookahead - 1, text[cursor+lookahead - 1]]) cursor  cursor + lookahead break Cursor – WindowSize should never point before 0, cursor+lookahead mustn't go past end of text

Decompression

Decompression decodes each tuple in turn:

  cursor ← 0
  for each tuple
      [0, 0, ch]:
          output[cursor++] ← ch
      [offset, length, ch]:
          for j = 0 to length-1
              output[cursor++] ← output[cursor - offset]
          output[cursor++] ← ch
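A Python decoder following this pseudocode directly (a sketch, paired with the lz77_compress sketch above):

  def lz77_decompress(tuples):
      """Rebuild the original text from (offset, length, nextChar) tuples."""
      out = []
      for offset, length, ch in tuples:
          for _ in range(length):
              # copy one character from `offset` positions back
              out.append(out[-offset])
          out.append(ch)
      return "".join(out)

  text = "a contrived text containing riveting contrasting t"
  assert lz77_decompress(lz77_compress(text)) == text

Copying one character at a time also handles LZ77 variants where a match is allowed to run past the cursor into the output it is itself producing.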

Huffman Encoding

Problem:
– given a set of symbols/messages
– encode them as bit strings
– minimising the total number of bits.
Messages:
– characters
– numbers.

Equal Length Codes

msg:   0     1     2     3    …
code:  0000  0001  0010  0011 …

msg:   a      b      c      …  z  _
code:  00000  00001  00010  …

N different messages need ceil(log2 N) bits per message:
– 10 numbers: message length = 4
– 26 letters: message length = 5
If there are many repeated messages, can we do better?
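A quick check of those two numbers in Python:

  import math

  # bits per message for an equal-length code over N different messages
  for n in (10, 26):
      print(n, "messages:", math.ceil(math.log2(n)), "bits each")   # 4, then 5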

Frequency Based Encoding

Suppose 0 occurs 50% of the time, 1 occurs 20% of the time, 2–5 occur 5% each, and 6–10 occur 2% each.
Encode with a variable length code, so that the most common messages have the shorter codes:

  0  → '0'
  1  → '1'
  2  → '10'
  3  → '11'
  4  → '100'
  5  → '101'
  10 → '1010'

Variable Length Encoding

More efficient to have variable length codes.
Problem: where are the boundaries between codes?
– eg, with the codes above, "1010" could decode as 10, or as 2 2, or as 1 0 1 0.
Need codes that tell you where the end is:
– a "prefix" coding scheme: no code is the prefix of another code.
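A small Python helper (illustrative, not from the slides) that tests whether a set of codes has the prefix property:

  def is_prefix_free(codes):
      """True if no code is a prefix of another code."""
      codes = sorted(codes)
      # after sorting, a code that prefixes another sorts immediately before it
      return all(not nxt.startswith(c) for c, nxt in zip(codes, codes[1:]))

  print(is_prefix_free(['0', '1', '10', '11']))     # False: '1' is a prefix of '10'
  print(is_prefix_free(['0', '10', '110', '111']))  # True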

Example: Building the Tree

Message frequencies:
0: 50%, 1: 20%, 2: 5%, 3: 5%, 4: 5%, 5: 5%, 6: 2%, 7: 2%, 8: 2%, 9: 2%, 10: 2%

New nodes are added in the order indicated by their letters (the letters themselves don't mean anything):

  a (4%)   = 9 (2%)  + 10 (2%)
  b (4%)   = 7 (2%)  + 8 (2%)
  c (6%)   = 6 (2%)  + a (4%)
  d (9%)   = 5 (5%)  + b (4%)
  e (10%)  = 4 (5%)  + 3 (5%)
  f (11%)  = 2 (5%)  + c (6%)
  g (19%)  = d (9%)  + e (10%)
  h (30%)  = f (11%) + g (19%)
  j (50%)  = 1 (20%) + h (30%)
  root (100%) = 0 (50%) + j (50%)

Example: Assigning the Codes

Assign:
– parent code + 0 to the left child
– parent code + 1 to the right child

Reading the code lengths off the tree: 0 gets 1 bit, 1 gets 2 bits, 2 gets 4 bits, 3, 4, 5 and 6 get 5 bits, and 7, 8, 9 and 10 get 6 bits.

average msg length = (1 × .5) + (2 × .2) + (4 × .05) + (5 × .17) + (6 × .08) = 2.43 bits
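The same arithmetic in Python, spelling out where the .17 and .08 groupings come from:

  # (code length, total probability of the messages with that length)
  lengths_and_probs = [
      (1, .50),                    # msg 0
      (2, .20),                    # msg 1
      (4, .05),                    # msg 2
      (5, .05 + .05 + .05 + .02),  # msgs 3, 4, 5 (5% each) and 6 (2%)
      (6, .02 * 4),                # msgs 7, 8, 9, 10
  ]
  print(round(sum(l * p for l, p in lengths_and_probs), 2))  # 2.43 bits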

Huffman Coding

Generates the best set of codes, given frequencies/probabilities on all the messages.
Creates a binary tree to construct the codes:

  construct a leaf node for each message, with the given probability
  create a priority queue of nodes (lowest probability = highest priority)
  while there is more than one node in the queue:
      remove the top two nodes
      create a new tree node with these two nodes as children
      – node probability = sum of the two nodes' probabilities
      add the new node to the queue
  the final node is the root of the tree.

Traverse the tree to assign codes: if a node has code c, assign c0 to its left child and c1 to its right child.

Video on YouTube: Text compression with Huffman coding
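A compact Python version of this algorithm, using the standard heapq module as the priority queue (a sketch: the tuple-based tree nodes and the tie-breaking counter are implementation choices, not from the slides, and it assumes the messages themselves are not tuples):

  import heapq

  def huffman_codes(freqs):
      """freqs maps message -> probability; returns message -> bit string."""
      # heap entries are (probability, counter, node); the counter breaks
      # ties so heapq never has to compare the nodes themselves
      heap = [(p, i, msg) for i, (msg, p) in enumerate(freqs.items())]
      heapq.heapify(heap)
      counter = len(heap)
      while len(heap) > 1:
          p1, _, left = heapq.heappop(heap)   # remove the top two nodes
          p2, _, right = heapq.heappop(heap)
          heapq.heappush(heap, (p1 + p2, counter, (left, right)))
          counter += 1
      _, _, root = heap[0]                    # final node is the root

      codes = {}
      def assign(node, code):
          if isinstance(node, tuple):         # internal node: (left, right)
              assign(node[0], code + "0")     # c0 to the left child
              assign(node[1], code + "1")     # c1 to the right child
          else:
              codes[node] = code              # leaf: a message
      assign(root, "")
      return codes

  # the frequencies from the worked example
  freqs = {0: .50, 1: .20, 2: .05, 3: .05, 4: .05, 5: .05,
           6: .02, 7: .02, 8: .02, 9: .02, 10: .02}
  codes = huffman_codes(freqs)
  avg = sum(freqs[m] * len(c) for m, c in codes.items())
  print(round(avg, 2))                        # 2.43 bits

Tie-breaking may give a different tree shape than the one on the slides, but every Huffman tree for these frequencies achieves the same 2.43-bit average.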