
Presentation transcript:

Is ASCII the only way?

For computers to do anything (besides sit on a desk and collect dust) they need two things: 1. PROGRAMS 2. DATA. A program is a set of instructions that specifies each step the computer should take to solve a problem or perform a task. Data is information that is stored and processed. Data may be numeric, alphabetical ("string" in computerese), or a combination of the two (alphanumeric - more computerese).

How can the computer store data? It has no eyes and hasn't been to school to learn the meaning of "4", or to understand that the symbols 4, four, and IV represent the same idea. Can computers think, anyway? (Well, that's a discussion for another day. Possibly a research topic for someone's final paper? First come, first served.) Since computers can only work with 2 states, on/off or 1/0, all data must be transmitted and stored as some combination of those 2 states. Electronically, we are talking about on/off. (That we will leave for the computer engineering majors.) In computer science we abstract a bit: we use the 1/0 representation.

Let's focus on fixed-length bit strings. "Bit" is more computerese, formed from parts of the phrase "binary digit", referring to 1 and 0. In a binary number system only 2 digits are needed, 1 and 0, unlike our decimal system which needs 10 (0, 1, 2, ..., 9).

For many years, the computer world agreed to use the American Standard Code for Information Interchange, referred to as ASCII (pronounced "as ski"). Part of the code is below:

ASCII (partial table)

Char  Dec  Binary
A     65   01000001
B     66   01000010
C     67   01000011
D     68   01000100
E     69   01000101
F     70   01000110
G     71   01000111
H     72   01001000

Each string of 8 bits is called a byte. (Remember, computers don't bite!)
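The table above can be reproduced with Python's built-in ord(), which returns a character's decimal code (a quick check I've added, not part of the original slides):

```python
# Print each character with its decimal ASCII code and its 8-bit byte.
for ch in "ABCDEFGH":
    print(ch, ord(ch), format(ch and ord(ch), "08b"))
```

The "08b" format spec pads the binary form to a full 8-bit byte, matching the table.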

Question: What is the maximum number of codes that can be created using 8 bits, from 00000000 to 11111111? That is, how many characters can be represented by 1 byte? (That sounds like mathematics. Oh no! Oh yes, there is a considerable amount of discrete mathematics - that's the fun type - in computer science.)
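If you want to check your answer, it follows from the multiplication principle: each of the 8 bit positions independently has 2 choices.

```python
# Each of the 8 bits can be 0 or 1, so the number of distinct bytes is 2**8.
print(2 ** 8)  # 256
```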

Note: today, another coding system, called Unicode, is used to store and transmit data. ASCII is a subset of Unicode and therefore is still used. Unicode might be another research topic, but it is a narrow one, which might take some creativity to make interesting and grade-worthy.

A fixed-length code, such as ASCII, wastes time and space. For example, consider the following sentence: THE QUICK GREY FOX JUMPED OVER THE LAZY BROWN COWS. We will use all capital letters and no punctuation for simplicity. Why is this a great sentence for studying codes? How many characters are in this sentence?

THE QUICK GREY FOX JUMPED OVER THE LAZY BROWN COWS

The code for our sentence takes 50 bytes, or 400 bits. We know where each new character starts by counting off every 8 bits, since each character takes the same amount of space. This method is wasteful, and the text can be significantly compressed. One approach, commonly used for text, is to reduce the number of bits for the most frequent characters. In English, E is the most common letter, so it is reasonable to use the fewest bits for E and a larger number of bits for a letter seldom used, such as Z. This produces a variable-length code that takes less space.
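The 50-byte, 400-bit figure can be verified in a couple of lines of Python (a sketch I've added, not from the original slides):

```python
# Every character costs 8 bits in a fixed-length ASCII encoding.
sentence = "THE QUICK GREY FOX JUMPED OVER THE LAZY BROWN COWS"
print(len(sentence), len(sentence) * 8)  # 50 400
```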

Data compression is important in many situations, such as sending data over the Internet, where transmission can take a long time, especially over a dial-up connection. In a sentence such as SUSIE SAYS IT IS EASY, E is not the most common character. In the Java language, the semicolon (;) appears most often. Therefore, to achieve optimal compression for each message we need to create a new code. This is the method developed by David Huffman in 1952. Depending on the message, savings of 20% to 90% are achieved.

Here is Huffman's algorithm: 1. Make a table showing the frequency of each letter in the message. Ex (for the Susie sentence): A-2 E-2 I-3 S-6 T-1 U-1 Y-2 space-4 linefeed-1
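Step 1 can be automated with Python's collections.Counter (a sketch I've added; the linefeed is written as "\n"):

```python
from collections import Counter

# Count how often each character occurs in the Susie message.
message = "SUSIE SAYS IT IS EASY\n"
freq = Counter(message)
for ch, count in freq.most_common():
    print(repr(ch), count)
```

most_common() already returns the characters in descending order of frequency, which is exactly step 2 below.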

2. Sort the table by frequency in descending order: S-6 space-4 I-3 A-2 E-2 Y-2 T-1 U-1 linefeed-1

3. Give the shortest code to the character with the highest frequency. Starting with 2 bits, increase the number of bits as needed. One way to encode the Susie sentence is: S-10 space-00 I-110 A-010 E-1111 Y-1110 T-0110 U-01111 linefeed-01110

Encode the Susie sentence using ASCII, then encode the message again using the above Huffman code. What is the % savings in the number of bits?

Solution: ASCII - (8 bits)*(21 characters) = 168 bits. Huffman - 60 bits. (The linefeed is left out of both counts.) Difference in bits = 168 - 60 = 108. % savings = (108/168)*100 ≈ 64%.
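The arithmetic can be checked by encoding the sentence character by character with the code table from step 3 (the linefeed is left out here too; the variable names are my own):

```python
# The variable-length code from the slides, applied character by character.
code = {"S": "10", " ": "00", "I": "110", "A": "010",
        "E": "1111", "Y": "1110", "T": "0110", "U": "01111"}
message = "SUSIE SAYS IT IS EASY"
bits = "".join(code[ch] for ch in message)
ascii_bits = len(message) * 8
savings = round((ascii_bits - len(bits)) / ascii_bits * 100)
print(ascii_bits, len(bits), savings)  # 168 60 64
```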

Question: How do we know where one character ends and another character begins? To decode a message, we use a device called a Huffman tree. Trees are data structures that are used in many ways in computer science. Trees are usually drawn upside down.

Here is the Huffman tree needed to decode the Susie message:

                     22
               /           \
              9             13
            /   \          /   \
       (sp)4     5     (S)6     7
                / \            /  \
            (A)2   3       (I)3    4
                  / \             /  \
              (T)1   2        (Y)2  (E)2
                    / \
               (lf)1  (U)1

The characters of the message appear in the terminal (or leaf) nodes of the tree. The number outside a terminal node is the frequency of that character. (The number outside a non-terminal node will be explained when we learn to create a Huffman tree.) In the coded message, a '0' means go left and a '1' means go right. Use the tree to decode the Susie message. Each time a leaf is reached, record the character, return to the root (the top!) of the tree, and begin following the next path.

Since a Huffman code is different for each message, the tree for that code must be transmitted with the message. (Programmers create the tree with the programming language they use, but first they must create it with paper and pencil. That is how we will do it.)

Here is the algorithm to create a Huffman tree: 1. Make a node for each character and list the nodes in increasing order of frequency. Ex: lf(1) U(1) T(1) Y(2) E(2) A(2) I(3) sp(4) S(6) 2. Remove the 2 lowest-frequency nodes and create a tree with them, labeling the root node with the sum of those frequencies:

        2
       / \
  (lf)1   (U)1

3. Re-insert the new tree in the list of nodes in the position that keeps the list ordered:

  (T)1    2    (Y)2  (E)2  (A)2  (I)3  (sp)4  (S)6
         / \
    (lf)1   (U)1

4. Repeat these steps until there is one large tree.

Next, remove the two lowest again, (T)1 and the new 2-node, and merge them into a tree with root 3, which is re-inserted into the list:

  (Y)2  (E)2  (A)2    3    (I)3  (sp)4  (S)6
                     / \
                 (T)1   2
                       / \
                  (lf)1   (U)1

Continue until you have the tree we saw earlier.
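The whole build-and-merge loop can be sketched in Python with the heapq module standing in for the sorted list (an illustration I've added, with my own helper names). Ties may be broken differently than on the slides, so individual codes can differ, but the total encoded length comes out the same, because every Huffman tree for the same frequencies is optimal:

```python
import heapq
import itertools
from collections import Counter

def huffman_codes(message):
    """Build a Huffman code by repeatedly merging the two lowest-frequency trees."""
    order = itertools.count()  # tie-breaker so the heap never compares tree nodes
    heap = [(f, next(order), ch) for ch, f in Counter(message).items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # the two smallest frequencies...
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(order), (left, right)))  # ...merged
    codes = {}
    def walk(node, path):
        if isinstance(node, str):
            codes[node] = path or "0"        # leaf: record the 0/1 path
        else:
            walk(node[0], path + "0")        # left edge is a 0
            walk(node[1], path + "1")        # right edge is a 1
    walk(heap[0][2], "")
    return codes

message = "SUSIE SAYS IT IS EASY\n"
codes = huffman_codes(message)
total = sum(len(codes[ch]) for ch in message)
print(total)  # 65, the optimal total (60 visible-character bits plus the linefeed)
```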

Once a Huffman tree is complete, it can be searched for each character: When “00” is read, go left then left again - ‘space’ is found. When “10” is read, go right then left - ‘S’ is found. When “010” is read, go left, then right, then left - ‘A’ is found. etc.

A Huffman tree is made for each message, but the tree is not necessarily unique; some variation in the tree is possible for a given set of frequencies. However, there is an important restriction on the coding scheme, because a Huffman code must be a prefix code: no character's code may be the beginning (prefix) of another character's code.

Suppose ‘01’ were used for ‘S’ and ‘11’ for ‘space’ instead of the codes we used. What problem would arise in attempting to decode the ‘Susie’ message? Since ‘01’ is the beginning of the code for ‘A’ as well as the code for ‘T’, the system would not work.
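The prefix requirement is easy to test mechanically: sort the codewords, and any prefix clash shows up between lexicographic neighbors. Here is a sketch (the function name is my own) comparing our code with the broken variant above:

```python
def is_prefix_free(codes):
    """True if no codeword is the start of another codeword."""
    words = sorted(codes.values())           # a prefix sorts next to an extension of it
    return all(not b.startswith(a) for a, b in zip(words, words[1:]))

good = {" ": "00", "A": "010", "T": "0110", "\n": "01110",
        "U": "01111", "S": "10", "I": "110", "Y": "1110", "E": "1111"}
bad = dict(good)
bad["S"], bad[" "] = "01", "11"              # the substitution suggested above
print(is_prefix_free(good), is_prefix_free(bad))  # True False
```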

Assignment: Write a 4 or 5 word sentence on a sheet of paper. Code the sentence with ASCII on another sheet of paper. Code the sentence with a Huffman code on a third paper. Include the Huffman tree on the third paper.