Presentation is loading. Please wait.

Presentation is loading. Please wait.

AP CSP: Bytes, File Sizes, and Text Compression

Similar presentations


Presentation on theme: "AP CSP: Bytes, File Sizes, and Text Compression"— Presentation transcript:

1 AP CSP: Bytes, File Sizes, and Text Compression

2 Size Terminology: Recall that a single character of ASCII text requires 8 bits. The technical term for 8 bits of data is a Byte. A byte is the standard fundamental unit (or “chunk size”) underlying most computing systems today. “megabyte", "kilobyte", "gigabyte” Now let’s compare how data is stored and saved depending on the type of file!

3 Plain Text v. MS Word Doc We learned that in addition to the actual text of a document, it is usually necessary to store the formatting information that allows the text to be displayed correctly. How much extra information, i.e. how many extra bytes, we need to store when we include all of this formatting.  If a single ASCII character is one byte then how many bytes are required to store the word hello in a .txt(Notepad) file vs a .docx(MS Word) file. Make a prediction now! Now we’ll find out and see if your prediction was close!

4 Bytes and File Sizes Worksheet:
Complete this worksheet for homework and make sure you understand the difference between the main file size types and bits v. bytes Think about this problem too and we will talk about it next class more: If you want to transmit a lot of data you are limited by the speed of your internet connection. Even if you have a fast Internet connection there is a physical limit to how fast you can transmit bits. What if the data you want to send is big enough that it takes an unreasonable amount of time to transmit it, even with a really fast internet connection. Assuming you can't make the Internet connection any faster, could you still transmit the data faster somehow?

5 Compression Warmup: At some point we reach a physical limit of how fast we can send bits and if we want to send a large amount of information faster, we have to find a way to represent the same information with fewer bits - we must compress the data. When you send text messages to a friend, do you spell every word correctly? Do you use abbreviations for common words? List as many as you can. Why do you use these abbreviations? What is the benefit?

6 Compression: Same Data, Fewer Bits:
We compress data to save time and space. The art and science of compression is about figuring out how to represent the SAME DATA with FEWER BITS. Why is this important? One reason is that storage space is limited and you'd always prefer to use fewer bits if you could. A much more compelling reason is that there is an upper limit to how fast bits can be transmitted over the Internet. Now you’re going to practice decompressing a message before you try and figure out your own encoding scheme! DECODE THE MESSAGE

7 The full compressed text includes BOTH the compressed text and the key to solve it.
Thus, you must account for the total number of characters in the message plus the total number of characters in the key to see how much you've compressed it over the original.

8 Compressing Songs: You and a partner will try to compress a song by finding patterns you can exploit. Compare your results with other groups that tried to compress the same song As you are compressing songs, try to develop a general strategy that will lead to a good compression rate every time. Think about What makes doing this compression hard?

9 Compression Challenges
You can start in lots of different ways. Early choices affect later ones. Once you find one set of patterns, others emerge. There is a tipping point: you might be making progress compressing, but at some point the scale tips and the dictionary starts to get so big that you lose the benefit of having it. But then you might start re-thinking the dictionary to tweak some bits out. Do we think that these compression amounts that we’ve found are the best? Is there a way to know what the best compression is? BRUTE FORCE Is there a process a person can follow to find the best (or a pretty good) compression for a piece of text?

10 Developing a Heuristic:
 Heuristic a problem solving approach (typically an algorithm) to find a satisfactory solution where finding an optimal or exact solution is impractical or impossible. Continue working on compressing your poem using the Text Compression Widget. As you do so, develop a set of rules, or a “heuristic” that generally seems to provide good results. Record your heuristic as a list of steps that someone else unfamiliar with the problem could follow and still end up with decent compression. Don’t be specific to the song you are compressing

11 Heuristic Discussion:
Do you think it’s possible to describe (or write) a specific set of instructions that a person could follow that would always result in better text compression than your heuristic? Why or why not? Some compression programs (like zip) do a great job if the file is sufficiently large and has reasonable amounts of repetition. However, it is also possible to create a “compressed file” that is larger than the original because the heuristic does work in every single case. Is there a way to know that a compressed piece of text is compressed the most possible? If yes, describe how you could determine it. If no, why not? No perfect solution. The size and shape of the data will determine what the “best” answer is and we often cannot even be sure it is the best answer (only that it is better than other answers we have tried.)

12 LZW Compression(ZIP):
There is a compression algorithm called LZW compression upon which the common “zip” utility is based. Zip compression does something very similar to what you did today with the text compression widget. Zip works really well for text, but only on large files. If you try to compress the simple hello.txt file we used in a previous lesson, you'll see the resulting file is actually bigger. Zip is meant for text. It might not work well on non-text files very well because they are already compressed or don’t have the same kinds of embedded patterns that text documents do.

13 Wrap-Up: You should recognize that we can exploit patterns to compress our data. When we find patterns we can use abstraction(patterns referring to other patterns) to compress our message/data even further. What is Lossless Compression v. Lossy Compression? What is a heuristic?


Download ppt "AP CSP: Bytes, File Sizes, and Text Compression"

Similar presentations


Ads by Google