
COMP261 Lecture 22 Data Compression 2

Data/Text Compression
Reducing the memory required to store some information.
(Diagram: original text/image/sound → compress → compressed text/image/sound)

Coding Problem
Given a set of symbols/messages (characters, numbers, ...):
- Encode them as bit strings.
- Need a separate code for each message.
- Try to minimise the total number of bits.

Equal Length Codes
Use the same number of bits for every value/message to be encoded. With N bits, we can have up to 2^N different codes.
E.g. digits:
  msg:  0    1    2    3    4    5    6    7    8    9
  code: 0000 0001 0010 0011 0100 0101 0110 0111 1000 1001
E.g. letters:
  msg:  a     b     c     d     e     f     g     ...  z
  code: 00001 00010 00011 00100 00101 00110 00111 ...  11010

Equal Length Codes: how many bits are needed?
- Digits: 10 messages, 4 bits per message.
- Letters: 26 messages, 5 bits per message.
- In general: N different messages need ⌈log2 N⌉ bits per message.
If there are many repeated messages, can we do better?
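
That bit count is just the ceiling of log2 of the number of messages. A minimal Python check (the function name is illustrative):

    import math

    def fixed_code_length(num_messages):
        # Smallest b such that 2^b >= num_messages.
        return math.ceil(math.log2(num_messages))

    print(fixed_code_length(10))   # digits:  4 bits
    print(fixed_code_length(26))   # letters: 5 bits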

Frequency-Based Encoding
Vary the code length, using fewer bits for more common messages.
E.g. digits:
  msg:  0 1 2  3  4   5   6   7   8    9    10
  code: 0 1 10 11 100 101 110 111 1000 1001 1010
Suppose 0 occurs 50% of the time, 1 occurs 20% of the time, 2-5 occur 5% each, and 6-10 occur 2% each:
encode 0 by '0', 1 by '1', 2 by '10', 3 by '11', 4 by '100', 5 by '101', ..., 10 by '1010'.

Variable Length Encoding
More efficient to have variable length codes. Problem: where are the boundaries?
  000010010101001101001100001111000111100001
Need codes that tell you where the end is:
  msg:  0 1  2    3     4     5     6     7      ...
  code: 0 10 1100 11101 11100 11111 11010 110110 ....
Prefix-free/instantaneous coding: no code is the prefix of another code.
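
A quick way to check the prefix-free property: sort the code words; if any code is a prefix of another, some adjacent pair in sorted order exhibits it. A small Python sketch (the table is the full one built over the next two slides):

    # Code table from the worked example (messages 0-10).
    codes = {0: '0', 1: '10', 2: '1100', 3: '11101', 4: '11100',
             5: '11111', 6: '11010', 7: '110110', 8: '110111',
             9: '111100', 10: '111101'}

    def is_prefix_free(codes):
        words = sorted(codes.values())
        # If a is a prefix of b, some adjacent sorted pair shows it.
        return not any(b.startswith(a) for a, b in zip(words, words[1:]))

    print(is_prefix_free(codes))   # True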

Example: Building the Tree
(View the PowerPoint animation for the step-by-step construction.)
Message frequencies:
  0: 50%  1: 20%  2: 5%  3: 5%  4: 5%  5: 5%  6: 2%  7: 2%  8: 2%  9: 2%  10: 2%
New internal nodes are added in the order indicated by their letters (the letters don't mean anything):
  a = 9+10 (4%),  b = 7+8 (4%),  c = 6+b (6%),  d = 5+a (9%),  e = 4+3 (10%),
  f = 2+c (11%),  g = e+d (19%),  h = f+g (30%),  j = 1+h (50%),  root = 0+j (100%)

Example: Assigning the Codes
Assign parent code + 0 to the left child, parent code + 1 to the right child.
Resulting codes:
  msg:  0  1   2     3      4      5      6      7       8       9       10
  code: 0  10  1100  11101  11100  11111  11010  110110  110111  111100  111101
average msg length = (1*.5) + (2*.2) + (4*.05) + (5*.17) + (6*.08) = 2.43 bits
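
The quoted average is easy to verify: weight each code's length by its message's probability. A short Python check using the table above:

    freqs = {0: .50, 1: .20, 2: .05, 3: .05, 4: .05, 5: .05,
             6: .02, 7: .02, 8: .02, 9: .02, 10: .02}
    codes = {0: '0', 1: '10', 2: '1100', 3: '11101', 4: '11100',
             5: '11111', 6: '11010', 7: '110110', 8: '110111',
             9: '111100', 10: '111101'}

    avg = sum(p * len(codes[m]) for m, p in freqs.items())
    print(round(avg, 2))   # 2.43 bits per message, vs 4 bits fixed-length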

Huffman Coding
Generates the best set of codes, given frequencies/probabilities for all the messages. Creates a binary tree to construct the codes:
- Construct a leaf node for each message, with the given probability.
- Create a priority queue of messages/nodes (lowest probability = highest priority).
- While there is more than one node in the queue (i.e. more than one tree):
  - remove the top two nodes;
  - create a new tree node with these two nodes as children, with probability = the sum of the two nodes' probabilities;
  - add the new node to the queue.
- The final node is the root of the tree.
- Traverse the tree to assign codes: if a node has code c, assign c0 to its left child and c1 to its right child.
This is a "greedy" algorithm – always combining the two lowest-probability nodes leads to the best code. See the video on YouTube: Text compression with Huffman coding.
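
This translates almost line for line into Python, with heapq as the priority queue. A minimal sketch (a tree is either a bare message at a leaf or a (left, right) tuple, so messages must not themselves be tuples; the counter only breaks ties between equal probabilities):

    import heapq
    import itertools

    def huffman(freqs):
        """freqs: {message: probability} -> {message: bit-string code}."""
        counter = itertools.count()   # tie-breaker so heap entries never compare trees
        heap = [(p, next(counter), msg) for msg, p in freqs.items()]
        heapq.heapify(heap)
        while len(heap) > 1:                     # more than one tree left
            p1, _, left = heapq.heappop(heap)    # the two lowest-probability nodes
            p2, _, right = heapq.heappop(heap)
            heapq.heappush(heap, (p1 + p2, next(counter), (left, right)))
        codes = {}
        def assign(tree, code):                  # left child gets c0, right gets c1
            if isinstance(tree, tuple):
                assign(tree[0], code + '0')
                assign(tree[1], code + '1')
            else:
                codes[tree] = code
        assign(heap[0][2], '')
        return codes

    print(huffman({'a': .5, 'b': .3, 'c': .2}))
    # {'a': '0', 'c': '10', 'b': '11'} -- exact assignment depends on tie-breaking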

Huffman Coding: decoding
To decode, we need a table of the codes used. If we label the edges of the tree with 0's and 1's, as they were added at each level, we get a trie which can be used like a scanner to split the coded string/file into separate codes to be decoded.
(The last 3 slides were added later and are covered at the start of lecture 23.)
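
A sketch of that trie-scanner idea, reusing the nested-tuple trees from the sketch above: follow one edge per bit; on reaching a leaf, emit its message and restart at the root.

    def decode(tree, bits):
        """tree: nested (left, right) tuples with messages at the leaves."""
        out, node = [], tree
        for bit in bits:
            node = node[0] if bit == '0' else node[1]   # follow the 0/1 edge
            if not isinstance(node, tuple):             # reached a leaf:
                out.append(node)                        # emit its message...
                node = tree                             # ...and restart at the root
        return out

    tree = ('a', ('c', 'b'))        # the example tree from the previous sketch
    print(decode(tree, '011100'))   # ['a', 'b', 'c', 'a']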

Lempel-Ziv 77 revisited
Basic idea: store a pointer (offset, length) to a maximal substring that has occurred earlier, within a given window size.
Basic algorithm:
  for cursor = 0 to maxCursorVal:
    look for the longest prefix of text[cursor..text.length] occurring in
      text[max(cursor-windowSize, 0)..cursor-1]
    if found, add [offset, length, text[cursor+length]] to output
    else add [0, 0, text[cursor]] to output
Can use various approaches to find the substring.

Lempel-Ziv 77
Note corrections: cursor - windowSize should never point before 0, and cursor + lookahead mustn't go past the end of the text.

  cursor ← 0
  windowSize ← 100   // some suitable size
  while cursor < text.size:
    lookahead ← 0
    prevMatch ← 0
    loop:
      match ← stringMatch(text[cursor .. cursor+lookahead],
                          text[(cursor < windowSize ? 0 : cursor-windowSize) .. cursor-1])
      if match succeeded then
        prevMatch ← match
        lookahead ← lookahead + 1
      else
        output([suitable value for prevMatch, lookahead, text[cursor+lookahead]])
        cursor ← cursor + lookahead + 1
        break

This looks for an occurrence of text[cursor..cursor+lookahead] in text[start..cursor-1], for increasing values of lookahead, until none is found, then outputs a triple.
This is wasteful – we know there is no match before prevMatch, so there's no point looking there again! Could we improve by starting the search from prevMatch? Or find the longest match starting at each position in the window and record the longest?
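
For concreteness, here is a runnable Python sketch of the whole scheme, using a naive search (brute force over every window position, rather than the improvements the slide asks about; windowSize = 100 and the (offset, length, next-char) triple format follow the slides):

    def lz77_compress(text, window_size=100):
        """Encode text as a list of (offset, length, next_char) triples."""
        out, cursor = [], 0
        while cursor < len(text):
            start = max(cursor - window_size, 0)   # window never points before 0
            best_off, best_len = 0, 0
            for pos in range(start, cursor):       # brute-force scan of the window
                length = 0
                # Stop one short of the end so a literal next_char always remains;
                # the match may run past the cursor (overlapping copies are legal).
                while (cursor + length < len(text) - 1 and
                       text[pos + length] == text[cursor + length]):
                    length += 1
                if length > best_len:
                    best_off, best_len = cursor - pos, length
            out.append((best_off, best_len, text[cursor + best_len]))
            cursor += best_len + 1
        return out

    def lz77_decompress(triples):
        chars = []
        for offset, length, ch in triples:
            for _ in range(length):                # copy, possibly overlapping
                chars.append(chars[-offset])
            chars.append(ch)
        return ''.join(chars)

    data = "abracadabra abracadabra"
    assert lz77_decompress(lz77_compress(data)) == data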