Programming Abstractions Cynthia Lee CS106X

Topics:
 Today we’re going to be talking about your next assignment: Huffman coding
› It’s a compression algorithm
› It’s provably optimal (take that, Pied Piper)
› It involves binary tree data structures, yay!
› (assignment goes out Wednesday)

Getting Started on Huffman
Before we talk about the algorithm, let’s set the scene a bit and talk about BINARY.

In a computer, everything is numbers! Specifically, everything is binary.
 Images (gif, jpg, png): binary numbers
 Integers (int): binary numbers
 Non-integer real numbers (double): binary numbers
 Letters and words (ASCII, Unicode): binary numbers
 Music (mp3): binary numbers
 Movies (streaming): binary numbers
 Doge pictures: binary numbers
 Messages: binary numbers
Encodings are what tell us how to translate:
› “if we interpret these binary digits as an image, it would look like this”
› “if we interpret these binary digits as a song, it would sound like this”

ASCII is an old-school encoding for characters
 The “char” type in C++ is based on ASCII
 You interacted with this a bit in WordLadder (e.g., ‘A’ + 1 = ‘B’)
 It’s a leftover from C in the 1970s
 It doesn’t play nice with other languages, and today’s software can’t afford to be so America-centric, so Unicode is more common
 ASCII is simple, so we use it for this assignment

ASCII Table
[Table of ASCII codes 32–74, with DEC, OCT, HEX, and BIN columns for each symbol.]
Notice each symbol is encoded as 8 binary digits (8 bits).
There are 256 unique sequences of 8 bits, so the numbers 0–255 each correspond to one character (this table shows only 32–74).
Example: 00111100 = ‘<’

ASCII Example
“happy hip hop” = 104 97 112 112 121 32 104 105 112 32 104 111 112 (decimal)
Or this in binary: 01101000 01100001 01110000 01110000 01111001 00100000 01101000 01101001 01110000 00100000 01101000 01101111 01110000
FAQ: Why does 104 = ‘h’? Answer: it’s arbitrary, like most encodings. Some people in the 1970s just decided to make it that way.

[Aside] Unplugged programming: The Binary Necklace
[Excerpt of the ASCII table shown on slide]
 Choose one color to represent 0s and another color to represent 1s
 Write your name in beads by looking up each letter’s ASCII encoding
 For extra bling factor, this one uses glow-in-the-dark beads as delimiters between letters

ASCII
 ASCII’s uniform encoding size makes it easy
› Don’t really need those glow-in-the-dark beads as delimiters, because we know every 9th bead starts a new 8-bit letter encoding
 Key insight: it’s also a bit wasteful (ha! get it? a “bit”)
› What if we took the most commonly used characters (according to Wheel of Fortune, some of these are RSTLNE) and encoded them with just 2 or 3 bits each?
› We let seldom-used characters, like &, have encodings that are longer, say 12 bits.
› Overall, we would save a lot of space!

Non-ASCII (variable-length) encoding example
“happy hip hop” = [variable-length encoding shown on slide]
The variable-length encoding scheme makes a MUCH more space-efficient message than ASCII.

Huffman encoding
 Huffman encoding is a way of choosing which characters are encoded which ways, customized to the specific file you are compressing
 Example: character ‘#’
› Rarely used in Shakespeare (code could be longer, say ~10 bits)
› If you wanted to encode a Twitter feed, you’d see ‘#’ a lot (maybe only ~4 bits) #contextmatters #thankshuffman
 We store the code translation as a tree:

Your turn
What would be the binary encoding of “hippo” using this Huffman encoding tree?
[Answer choices A–D shown on slide]
E. Other/none/more than one

Okay, so how do we make the tree?
1. Read your file and count how many times each character occurs.
2. Make a collection of tree nodes, each having a key = # of occurrences and a value = the character.
› Example: “c aaa bbb”
› For now, tree nodes are not in a tree shape
› We actually store them in a Priority Queue (yay!!) based on highest priority = LOWEST # of occurrences
3. Dequeue two nodes and make them the two children of a new node, with no character and # of occurrences equal to the sum; enqueue this new node.
4. Repeat step 3 until PQ.size() == 1.

Your turn
If we start with the Priority Queue above and execute one more step, what do we get?
[Answer choices (A), (B), (C) shown as diagrams on the slide]

Last two steps

Now assign codes
We interpret the tree as:
 Left child = 0
 Right child = 1
What is the code for ‘c’?
A. 00
B. 010
C. 101
D. Other/none
[Tree with leaves c, a, b shown on slide]

Key question: How do we know when one character’s bits end and another’s begin?
[Tree with leaves c, a, b shown on slide]
Claim: Huffman needs delimiters (like the glow-in-the-dark beads), unlike ASCII, which is always 8 bits (and didn’t really need the beads).
A. TRUE
B. FALSE
Discuss/prove it: why or why not?