Compsci 100, Fall 2009: Compression and Huffman Coding


Review: Why Compression
- What gets compressed?
  - Save on storage: why is this a good idea?
  - Save on data transmission: how and why?
  - Where does lots of data come from?
- What is information, and how is it compressible?
  - Exploit redundancy; without it, data is hard to compress
- Representing information: codes (Morse cf. Huffman)
  - Dots and dashes vs. 0's and 1's; how do we construct a code?

Typoglycemia: Wikipedia (orphan)
- "In a puiltacibon of New Scnieitst you could ramdinose all the letetrs, keipeng the first two and last two the same, and reibadailty would hadrly be aftcfeed. My ansaylis did not come to much beucase the thoery at the time was for shape and senqeuce retigcionon. Saberi's work sugsegts we may have some pofrweul palrlael prsooscers at work. The resaon for this is suerly that idnetiyfing coentnt by paarllel prseocsing speeds up regnicoiton. We only need the first and last two letetrs to spot chganes in meniang." (Graham Rawlinson, PhD thesis 1976)
- Scrmabling, an anagram of scrambling, is the verb for rearranging the letters in a word but leaving the first and last letters as is.

Huffman Coding
- D.A. Huffman, early 1950s: story of invention
  - Analyze and process the data before compressing
  - Not developed to compress data "on the fly"
- Represent data using variable-length codes
  - Each letter/chunk is assigned a codeword/bitstring
  - The codeword for a letter/chunk is produced by traversing the Huffman tree
  - Property: no codeword produced is the prefix of another
  - Frequent letters/chunks get short encodings; rare ones get longer encodings
- Huffman coding is an optimal per-character coding method
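A small illustration of the prefix property (example mine, not from the slides): with codewords a = 0, b = 10, c = 110, d = 111, no codeword begins another, so the bit string 0110111100 splits only one way, 0 | 110 | 111 | 10 | 0, decoding to "acdba" with no separators needed between codewords.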

Coding/Compression/Concepts
- For ASCII we use 8 bits, for Unicode 16 bits
  - Minimum number of bits to represent N values?
  - Representation of genomic data (a, c, g, t)?
  - What about noisy genomic data?
- We can use a variable-length encoding, e.g., Huffman
  - How do we decide on lengths? How do we decode?
  - Why do Morse code encodings have the values they do?
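For reference, an answer to the first sub-bullet above (not spelled out on the slide): N distinct values need ceil(log2 N) bits, so the four genomic letters (a, c, g, t) fit in ceil(log2 4) = 2 bits each, versus 8 bits per character in ASCII.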

Mary Shaw
- Software engineering and software architecture
  - Tools for constructing large software systems
  - Development is a small piece of total cost; maintenance is larger and depends on well-designed and well-developed techniques
- Interested in computer science, programming, curricula, canoeing, and health-care costs
- ACM Fellow; Alan Perlis Professor of Computer Science at CMU

Huffman coding: go go gophers
- Choose the two smallest weights
  - Combine the nodes and their weights
  - Repeat
  - Priority queue?
- Encoding uses the tree:
  - 0 left / 1 right
  - How many bits?
- [Slide shows a code table (ASCII, 3-bit, Huffman) for g, o, p, h, e, r, s, space, and the first merge steps over the counts g:3, o:3, *:2 (space), p:1, h:1, e:1, r:1, s:1]

Huffman coding: go go gophers
- Encoding uses the tree/trie:
  - 0 left / 1 right
  - How many bits? 37!!
  - Savings? Worth it?
- [Slide shows the completed Huffman tree and the finished code table (ASCII, 3-bit, Huffman) for g, o, p, h, e, r, s, space]
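To check the 37 on the slide (arithmetic not shown there): "go go gophers" is 13 characters, so ASCII costs 13 x 8 = 104 bits and a fixed 3-bit code costs 13 x 3 = 39 bits. For the Huffman tree, the total equals the sum of the weights of the internal nodes created by the merges, since each merge adds one bit to the code of every character beneath it: 2 + 2 + 3 + 4 + 6 + 7 + 13 = 37 bits.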

Building a Huffman tree
- Begin with a forest of single-node trees/tries (leaves)
  - Each node/tree/leaf is weighted with its character count
  - A node stores two values: character and count
- Repeat until only one node is left: the root of the tree
  - Remove the two minimally weighted trees from the forest
  - Create a new tree/internal node with the minimal trees as children; its weight is the sum of its children's weights (no character)
- How does the process terminate? How do we find the minimum?
  - Remove minimal trees, hmmm...

How do we create the Huffman tree/trie?
- Insert weighted values into a priority queue
  - What are the initial weights? Why?
- Remove minimal nodes, weight by sums, re-insert
  - Total number of nodes?

    PriorityQueue<TreeNode> pq = new PriorityQueue<TreeNode>();
    for (int k = 0; k < freq.length; k++) {
        // one leaf per character value k, weighted by its count
        pq.add(new TreeNode(k, freq[k], null, null));
    }
    while (pq.size() > 1) {
        // the two smallest trees become children of a new internal node
        TreeNode left = pq.remove();
        TreeNode right = pq.remove();
        pq.add(new TreeNode(0, left.weight + right.weight, left, right));
    }
    TreeNode root = pq.remove();
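The slide's code presumes a TreeNode class that is not shown; a minimal sketch consistent with the calls above (the field names, constructor order, and compareTo choice are assumptions, not the course's actual class):

    class TreeNode implements Comparable<TreeNode> {
        int value;   // character (or chunk); unused for internal nodes
        int weight;  // occurrence count, or sum of children's weights
        TreeNode left, right;

        TreeNode(int value, int weight, TreeNode left, TreeNode right) {
            this.value = value;
            this.weight = weight;
            this.left = left;
            this.right = right;
        }

        // PriorityQueue orders nodes by weight, smallest first
        public int compareTo(TreeNode other) {
            return this.weight - other.weight;
        }
    }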

Creating the compressed file
- Once we have the new encodings, read every character
  - Write the encoding, not the character, to the compressed file
  - Why does this save bits?
  - What other information is needed in the compressed file?
- How do we uncompress?
  - How do we know foo.hf represents a compressed file?
  - Is the suffix sufficient? Alternatives?
- Why is Huffman coding a two-pass method?
  - Alternatives?
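One common way to build the table of encodings the slides refer to (a sketch under the TreeNode assumptions above, not the assignment's required code): traverse the tree once, extending the path with '0' going left and '1' going right, and record the path at each leaf.

    static void buildCodes(TreeNode node, String path, String[] codes) {
        if (node == null) return;
        if (node.left == null && node.right == null) {
            codes[node.value] = path; // leaf: the root-to-leaf path is the code
            return;
        }
        buildCodes(node.left, path + "0", codes);
        buildCodes(node.right, path + "1", codes);
    }

Called as buildCodes(root, "", codes) with codes sized to the alphabet (e.g., new String[256]), this yields the codeword written out for each character during compression.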

Uncompression with Huffman
- We need the trie to uncompress
- As we read a bit, what do we do?
  - Go left on 0, go right on 1
  - When do we stop? What do we do then?
- How do we get the trie?
  - How did we get it originally? Store 256 ints/counts; how do we read the counts?
  - How do we store a trie? 20 Questions relevance? Reading a trie? Leaf indicator? Node values?
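A sketch of the loop the slide describes (illustrative only: here the bits arrive as a String of '0'/'1' characters, whereas the real program reads them from a bit stream; TreeNode is the sketch above):

    // Walk the trie: left on 0, right on 1; emit a character at each leaf
    // and restart at the root for the next codeword.
    static String decode(TreeNode root, String bits) {
        StringBuilder out = new StringBuilder();
        TreeNode node = root;
        for (int i = 0; i < bits.length(); i++) {
            node = (bits.charAt(i) == '0') ? node.left : node.right;
            if (node.left == null && node.right == null) { // leaf reached
                out.append((char) node.value);
                node = root;
            }
        }
        return out.toString();
    }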

Other Huffman Issues
- What do we need to decode?
  - How did we encode? How will we decode?
  - What information is needed for decoding?
- Reading and writing bits: chunks and stopping
  - Can you write 3 bits? Why not? Why?
  - PSEUDO_EOF
  - BitInputStream and BitOutputStream: API
- What should happen when the file won't compress?
  - Silently compress it bigger? Warn the user? Alternatives?
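Why can't you write just 3 bits? Underlying streams deal only in whole bytes, so bits must be buffered and the final byte zero-padded, which is exactly why the encoding needs a PSEUDO_EOF codeword. A minimal sketch of that buffering (an illustration; this is not the API of the course's BitOutputStream):

    import java.io.IOException;
    import java.io.OutputStream;

    // Buffers single bits into whole bytes, since OutputStream writes bytes only.
    class BitWriter {
        private final OutputStream out;
        private int buffer = 0; // bits accumulated so far
        private int count = 0;  // number of bits currently in the buffer

        BitWriter(OutputStream out) { this.out = out; }

        void writeBit(int bit) throws IOException {
            buffer = (buffer << 1) | (bit & 1);
            if (++count == 8) {   // a full byte: flush it
                out.write(buffer);
                buffer = 0;
                count = 0;
            }
        }

        // Pads the last partial byte with zeros; this padding is why the
        // decoder needs PSEUDO_EOF to know where the real data ends.
        void close() throws IOException {
            while (count != 0) writeBit(0);
            out.close();
        }
    }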

Huffman Complexities
- How do we measure? Size of the input file and size of the alphabet
  - Which is typically bigger?
- Accumulating character counts: ______
  - How can we do this in O(1) time (though not really)?
- Building the heap/priority queue from the counts: ____
  - Initializing the heap: guaranteed
- Building the Huffman tree: ____
  - Why?
- Creating the table of encodings from the tree: ____
  - Why?
- Writing the tree and the compressed file: _____
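For reference, one consistent way to fill in the blanks above (the slide leaves them for discussion): with n = input file size and k = alphabet size, counting characters is O(n); building the heap from the counts is O(k) with bottom-up heapify; building the Huffman tree is O(k log k), since each of the k - 1 merges does O(log k) heap work; creating the table of encodings is one tree traversal over O(k) nodes; writing the tree and the compressed file is O(n).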

Good Compsci 100 Assignment?
- Array of character/chunk counts, or is this a map?
  - Map character/chunk to count; why an array?
- Priority queue for generating the tree/trie
  - Do we need a heap implementation? Why?
- Tree traversals for code generation and uncompression
  - One recursive, one not; why, and which?
- Deal with bits and chunks rather than ints and chars
  - The good, the bad, the ugly
- Create a working compression program
  - How would we deploy it? Make it better?
- Benchmark for analysis
  - What's a corpus?

Other methods
- Adaptive Huffman coding
- Lempel-Ziv algorithms
  - Build the coding table on the fly while reading the document
  - The coding table changes dynamically
  - Protocol between encoder and decoder so that everyone is always using the right coding scheme
  - Works well in practice (compress, gzip, etc.)
- More complicated methods
  - Burrows-Wheeler (bzip2)
  - PPM statistical methods

Robert Tarjan
- Graph algorithms
  - LCA, connected components, planarity, persistence
  - Turing Award, 1986, with Hopcroft; Nevanlinna Prize, 1983; Kanellakis Theory and Practice Award, 1999
- "Interaction is very important and having cooperative spirit. One of the greatest joys of being a professor is having graduate students… it's very motivating because you've got people whose minds are fresh and open and eager to learn." (Shasha, Out of Their Minds)