AP CSP: Bytes, File Sizes, and Text Compression

Size Terminology: Recall that a single character of ASCII text requires 8 bits. The technical term for 8 bits of data is a byte. A byte is the standard fundamental unit (or “chunk size”) underlying most computing systems today. Larger file sizes are measured in multiples of bytes: “kilobyte”, “megabyte”, “gigabyte”. Now let’s compare how data is stored and saved depending on the type of file!
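As a rough illustration of the size terminology above, the Python sketch below counts the bytes in an ASCII string and expresses a byte count in larger units. The function names are made up for this example, and it assumes the decimal convention (1 kilobyte = 1,000 bytes); some systems count 1,024 bytes per kilobyte instead.

```python
def ascii_size_in_bytes(text):
    """Return the number of bytes needed when each character is one ASCII byte."""
    return len(text.encode("ascii"))


def describe_size(num_bytes):
    """Express a byte count in the largest unit that keeps the number at 1 or more.

    Assumption: decimal units, so 1 kilobyte = 1,000 bytes.
    """
    units = ["bytes", "kilobytes", "megabytes", "gigabytes"]
    value = float(num_bytes)
    for unit in units:
        # Stop at the first unit where the value is under 1,000,
        # or at gigabytes if the value is still large.
        if value < 1000 or unit == units[-1]:
            return f"{value:g} {unit}"
        value /= 1000


print(ascii_size_in_bytes("hello"))   # 5
print(describe_size(2_500_000))       # 2.5 megabytes
```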

Plain Text v. MS Word Doc: We learned that in addition to the actual text of a document, it is usually necessary to store the formatting information that allows the text to be displayed correctly. How much extra information, i.e. how many extra bytes, do we need to store when we include all of this formatting? If a single ASCII character is one byte, then how many bytes are required to store the word hello in a .txt (Notepad) file vs. a .docx (MS Word) file? Make a prediction now! Now we’ll find out and see if your prediction was close!
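One way to check the plain-text half of the prediction is to write the word to a file and ask the operating system for its size. This is only a sketch: the helper name `size_of_text_file` is hypothetical, and it writes to a temporary folder so nothing is left behind. A .docx saved from Word can be measured the same way with `os.path.getsize`, but this example does not create one.

```python
import os
import tempfile


def size_of_text_file(text):
    """Write text to a temporary .txt file as ASCII and return its size in bytes."""
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "hello.txt")
        # newline="" stops Python from translating line endings on any platform.
        with open(path, "w", encoding="ascii", newline="") as f:
            f.write(text)
        return os.path.getsize(path)


print(size_of_text_file("hello"))  # 5
```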

Bytes and File Sizes Worksheet: Complete this worksheet for homework and make sure you understand the difference between the main file size units and bits v. bytes. Think about this problem too, and we will talk about it more next class: If you want to transmit a lot of data, you are limited by the speed of your Internet connection. Even if you have a fast Internet connection, there is a physical limit to how fast you can transmit bits. What if the data you want to send is big enough that it takes an unreasonable amount of time to transmit, even with a really fast Internet connection? Assuming you can't make the Internet connection any faster, could you still transmit the data faster somehow?
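To get a feel for the numbers in this problem, here is a small sketch that estimates how long a transfer takes. The function name and the example speeds are assumptions chosen for illustration; real connections also have overhead that this ignores.

```python
def transfer_time_seconds(size_bytes, bits_per_second):
    """Estimate transfer time: convert bytes to bits, then divide by the link speed."""
    return size_bytes * 8 / bits_per_second


one_gigabyte = 1_000_000_000   # assuming 1 gigabyte = 1,000,000,000 bytes
link_speed = 100_000_000       # a hypothetical 100 megabit-per-second connection

print(transfer_time_seconds(one_gigabyte, link_speed))       # 80.0
print(transfer_time_seconds(one_gigabyte // 2, link_speed))  # 40.0
```

With the connection speed fixed, the only number left to change is the size of the data being sent.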

Compression Warmup: At some point we reach a physical limit on how fast we can send bits, and if we want to send a large amount of information faster, we have to find a way to represent the same information with fewer bits: we must compress the data. When you send text messages to a friend, do you spell every word correctly? Do you use abbreviations for common words? List as many as you can. Why do you use these abbreviations? What is the benefit?

Compression: Same Data, Fewer Bits: We compress data to save time and space. The art and science of compression is about figuring out how to represent the SAME DATA with FEWER BITS. Why is this important? One reason is that storage space is limited and you'd always prefer to use fewer bits if you could. A much more compelling reason is that there is an upper limit to how fast bits can be transmitted over the Internet. Now you’re going to practice decompressing a message before you try and figure out your own encoding scheme! DECODE THE MESSAGE

The full compressed text includes BOTH the compressed text and the key needed to decode it. Thus, you must account for the total number of characters in the message plus the total number of characters in the key to see how much you've compressed it relative to the original.
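The accounting described above can be sketched in a few lines. This assumes each key entry costs one character per symbol plus the characters of the phrase it stands for; the widget used in class may count the key slightly differently, and the function names here are hypothetical.

```python
def total_compressed_size(compressed_text, key):
    """Count the characters in the compressed message plus every key entry.

    key maps each symbol to the text it stands for.
    """
    key_size = sum(len(symbol) + len(phrase) for symbol, phrase in key.items())
    return len(compressed_text) + key_size


def percent_saved(original_text, compressed_text, key):
    """Percentage of the original size saved (negative if the result grew)."""
    original = len(original_text)
    return 100 * (original - total_compressed_size(compressed_text, key)) / original


original = "hello hello hello hello"   # 23 characters
compressed = "# # # #"                 # 7 characters
key = {"#": "hello"}                   # 1 + 5 = 6 characters

print(total_compressed_size(compressed, key))               # 13
print(round(percent_saved(original, compressed, key), 1))   # 43.5
```

Note that replacing a phrase that appears only once makes things worse: "hello" alone becomes "#" plus a 6-character key entry, 7 characters in total instead of 5.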

Compressing Songs: You and a partner will try to compress a song by finding patterns you can exploit. Compare your results with other groups that tried to compress the same song. As you are compressing songs, try to develop a general strategy that will lead to a good compression rate every time. Think about: What makes doing this compression hard?

Compression Challenges: You can start in lots of different ways. Early choices affect later ones. Once you find one set of patterns, others emerge. There is a tipping point: you might be making progress compressing, but at some point the scale tips and the dictionary starts to get so big that you lose the benefit of having it. But then you might start re-thinking the dictionary to tweak some bits out. Do we think that these compression amounts that we’ve found are the best? Is there a way to know what the best compression is? Brute force. Is there a process a person can follow to find the best (or a pretty good) compression for a piece of text?

Developing a Heuristic: A heuristic is a problem-solving approach (typically an algorithm) to find a satisfactory solution where finding an optimal or exact solution is impractical or impossible. Continue working on compressing your song using the Text Compression Widget. As you do so, develop a set of rules, or a “heuristic”, that generally seems to provide good results. Record your heuristic as a list of steps that someone else unfamiliar with the problem could follow and still end up with decent compression. Don’t be specific to the song you are compressing.
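As one possible example of a heuristic written as steps, the sketch below uses a greedy rule: repeatedly replace whichever repeated piece of text saves the most characters, and stop when nothing saves anything. This is just one heuristic among many, not the "right" answer. The function names, the symbol list, and the length limits are all assumptions made for illustration, and the key cost is counted as one symbol plus the characters of the phrase.

```python
def best_substring(text, min_len=2, max_len=12):
    """Return the piece of text whose replacement saves the most characters, or None."""
    best, best_saving = None, 0
    for length in range(min_len, max_len + 1):
        seen = set()
        for start in range(len(text) - length + 1):
            piece = text[start:start + length]
            if piece in seen:
                continue
            seen.add(piece)
            count = text.count(piece)  # non-overlapping occurrences
            # Each occurrence shrinks from `length` characters to 1 symbol,
            # but the key entry costs 1 symbol + `length` characters.
            saving = count * (length - 1) - (length + 1)
            if saving > best_saving:
                best, best_saving = piece, saving
    return best


def greedy_compress(text, symbols="#$%&*@"):
    """Greedy heuristic: keep replacing the most valuable piece with a fresh symbol."""
    key = {}
    unused = [s for s in symbols if s not in text]  # never reuse a character in the text
    for symbol in unused:
        piece = best_substring(text)
        if piece is None:   # nothing left that saves characters
            break
        key[symbol] = piece
        text = text.replace(piece, symbol)
    return text, key


def decompress(text, key):
    """Undo the replacements newest-first, since later entries may contain earlier symbols."""
    for symbol, piece in reversed(list(key.items())):
        text = text.replace(symbol, piece)
    return text
```

Because each choice is made without looking ahead, this rule can miss a better overall dictionary, which is exactly the "early choices affect later ones" problem.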

Heuristic Discussion: Do you think it’s possible to describe (or write) a specific set of instructions that a person could follow that would always result in better text compression than your heuristic? Why or why not? Some compression programs (like zip) do a great job if the file is sufficiently large and has reasonable amounts of repetition. However, it is also possible to create a “compressed file” that is larger than the original, because the heuristic does not work in every single case. Is there a way to know that a compressed piece of text is compressed the most possible? If yes, describe how you could determine it. If no, why not? There is no perfect solution. The size and shape of the data will determine what the “best” answer is, and we often cannot even be sure it is the best answer (only that it is better than other answers we have tried).
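The claim that a "compressed file" can end up larger than the original is easy to try out. The sketch below builds zip archives in memory with Python's standard `zipfile` module; the helper name `zipped_size` and the sample texts are made up for this example.

```python
import io
import zipfile


def zipped_size(filename, data):
    """Return the size in bytes of a zip archive that holds one file."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(filename, data)
    # Read the size only after the archive is closed, so its directory is included.
    return len(buffer.getvalue())


tiny = "hello"
repetitive = "row, row, row your boat, gently down the stream. " * 200

print(len(tiny), zipped_size("hello.txt", tiny))            # the archive is larger
print(len(repetitive), zipped_size("song.txt", repetitive)) # the archive is much smaller
```

The tiny file grows because the archive has to store the file name and other bookkeeping, which outweighs any savings on five characters.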

LZW Compression (ZIP): There is a compression algorithm called LZW, part of the Lempel-Ziv family of dictionary-based algorithms that also underlies the common “zip” utility (most zip files today use a related method called DEFLATE). Zip compression does something very similar to what you did today with the Text Compression Widget. Zip works really well for text, but only on large files. If you try to compress the simple hello.txt file we used in a previous lesson, you'll see the resulting file is actually bigger. Zip is meant for text. It might not work well on non-text files because they are already compressed or don’t have the same kinds of embedded patterns that text documents do.
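For the curious, here is a minimal sketch of textbook LZW. It is not the code inside any particular zip tool, and it assumes every character has a code below 256 (plain ASCII text qualifies). Unlike the widget, LZW builds its dictionary automatically while reading the text, and the decoder rebuilds the same dictionary, so no separate key needs to be sent.

```python
def lzw_compress(text):
    """Encode text as a list of integer codes, growing the dictionary as patterns repeat."""
    dictionary = {chr(i): i for i in range(256)}  # start with every single character
    next_code = 256
    current = ""
    codes = []
    for ch in text:
        candidate = current + ch
        if candidate in dictionary:
            current = candidate            # keep extending the known pattern
        else:
            codes.append(dictionary[current])
            dictionary[candidate] = next_code  # remember the new, longer pattern
            next_code += 1
            current = ch
    if current:
        codes.append(dictionary[current])
    return codes


def lzw_decompress(codes):
    """Rebuild the text from the codes, reconstructing the same dictionary on the fly."""
    if not codes:
        return ""
    dictionary = {i: chr(i) for i in range(256)}
    next_code = 256
    previous = dictionary[codes[0]]
    pieces = [previous]
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        elif code == next_code:
            # Special case: the code refers to the pattern being defined right now.
            entry = previous + previous[0]
        else:
            raise ValueError("invalid LZW code")
        pieces.append(entry)
        dictionary[next_code] = previous + entry[0]
        next_code += 1
        previous = entry
    return "".join(pieces)


message = "TOBEORNOTTOBEORTOBEORNOT"
codes = lzw_compress(message)
print(len(message), len(codes))  # fewer codes than characters
```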

Wrap-Up: You should recognize that we can exploit patterns to compress our data. When we find patterns, we can use abstraction (patterns referring to other patterns) to compress our message/data even further. What is Lossless Compression v. Lossy Compression? What is a heuristic?