Data Representation :: Compression jamie@drfrostmaths.com www.drfrostmaths.com @DrFrostMaths Last modified: 19th July 2019
www.drfrostmaths.com Everything is completely free. Why not register? Registering on the DrFrostMaths platform allows you to save all the code and progress in the various Computer Science mini-tasks. It also gives you access to the maths platform, allowing you to practise GCSE and A Level questions from Edexcel, OCR and AQA. Your code on any mini-tasks will be preserved. Note: The Tiffin/DFM Computer Science course uses JavaScript as its core language. Most code examples are therefore in JavaScript. Using these slides: Green question boxes can be clicked while in Presentation mode to reveal. Slides are intentionally designed to double up as revision notes for students, while being optimised for classroom usage. The Mini-Tasks on the DFM platform are purposely ordered to correspond to these slides, giving you flexibility over your lesson structure.
Learning Objectives Directly from the OCR specification: Not in the syllabus: Compression algorithms, i.e. how data is compressed (although we will touch upon these just for your interest)
The need for compression Compression is reducing the amount of data needed for a file/data stream. There are many reasons why we’d want to use compression: 1 Web pages load more quickly I run some of the larger JavaScript files on DrFrostMaths through a tool at www.jscompress.com. This removes whitespace, renames variables to single letters and uses various clever programming syntax to reduce code length. The file size ends up being reduced by more than 50%. By convention, ‘minified’ JavaScript files are named to end with .min.js
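To make the idea of minification concrete, here is a small hand-written sketch. It is not the output of jscompress.com: the function `averageOfList`, the hand-minified version and the helper `loadFunction` are all made up for illustration. It simply compares the length of the two source strings and checks that both versions still behave the same.

```javascript
// Readable source code, as a programmer would write it (hypothetical example).
const originalSource = `function averageOfList(numbers) {
    let total = 0;
    for (let i = 0; i < numbers.length; i++) {
        total += numbers[i];
    }
    return total / numbers.length;
}`;

// The same function minified by hand: whitespace removed and
// local variables renamed to single letters.
const minifiedSource =
  "function averageOfList(n){let t=0;for(let i=0;i<n.length;i++)t+=n[i];return t/n.length}";

// Hypothetical helper: turns a source string into a callable function.
function loadFunction(source) {
  return new Function(source + "; return averageOfList;")();
}

const originalFn = loadFunction(originalSource);
const minifiedFn = loadFunction(minifiedSource);

// Fraction of characters saved by minifying.
const saving = 1 - minifiedSource.length / originalSource.length;

console.log(originalFn([2, 4, 9]), minifiedFn([2, 4, 9])); // both versions give the same answer
console.log("Characters saved: " + Math.round(saving * 100) + "%");
```

Note that nothing the code needs in order to run has been lost: only the parts that help humans read it have been removed.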
2 Files take up less storage space ‘zip’ files are a compressed collection of files. Zipping has the advantage of treating a directory as a single file (making it easier to send), and the zip file also takes up less overall space.
3 Files/data take up less bandwidth Bandwidth is the amount of data transferred in a fixed amount of time. Having to download less data may save on your mobile phone bill! Chrome on Android phones uses a ‘compression proxy’. All requests for web data go via Google’s servers, which compress the data before delivering it to your phone.
4 Emails have limited attachment size While email standards such as MIME don’t have a theoretical maximum file attachment size, in practice most email services have a limit.
Lossy vs Lossless Compression For some data, any compression must allow the full original data to be reconstructed, e.g. compressed code would not function correctly if we lost code. Compressed files similarly might be corrupted if we couldn’t recover some of the original data after uncompressing. ! Lossless compression allows the original data to be reconstructed in full. Data is only temporarily removed while the data is in compressed form. However, for audio or image files, we sometimes tolerate some of the original data being lost, at the expense of quality. [Figure: reducing the audio sample size, plotted against time] ! Lossy compression permanently discards some of the data.
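The difference between the two definitions can be seen in a short sketch. The schemes used here are chosen purely for illustration and are not taken from the slides or the specification: run-length encoding stands in for a lossless method, and cutting 8-bit audio samples down to 4 bits stands in for a lossy one. All function names and sample values are made up.

```javascript
// LOSSLESS (illustrative): run-length encoding stores each run of
// repeated characters as a [character, count] pair.
function rleEncode(text) {
  const runs = [];
  for (const ch of text) {
    const last = runs[runs.length - 1];
    if (last && last[0] === ch) {
      last[1]++;            // extend the current run
    } else {
      runs.push([ch, 1]);   // start a new run
    }
  }
  return runs;
}

function rleDecode(runs) {
  return runs.map(([ch, count]) => ch.repeat(count)).join("");
}

// LOSSY (illustrative): keep only the top 4 bits of each 8-bit sample.
function reduceSampleSize(samples) {
  return samples.map((s) => s >> 4);
}

// Scale the 4-bit values back up. The discarded low bits cannot be recovered.
function restoreSampleSize(reduced) {
  return reduced.map((s) => s << 4);
}

const pixels = "aaabccdddd";
const roundTrip = rleDecode(rleEncode(pixels));
console.log(roundTrip === pixels); // true: the original is reconstructed in full

const audio = [200, 203, 17, 90];
const restored = restoreSampleSize(reduceSampleSize(audio));
console.log(restored); // [192, 192, 16, 80]: close to the original, but not identical
```

The lossy version needs only half the bits per sample, but the two different samples 200 and 203 both come back as 192, so the original data is permanently gone.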
Lossy vs Lossless Compression
Lossless
Advantages: No reduction in quality: image will look exactly the same/audio sound exactly the same.
Disadvantages: Relatively small reduction in file size.
Example audio/visual file types: png (image), gif (image), wav (audio)
Lossy
Advantages: Larger reduction in file size/reduced bandwidth. Commonly used, therefore most software can read such data.
Disadvantages: Loses data, so can’t reconstruct original. Can’t be used on files which must preserve all data. Loss of quality may be noticeable if compression high.
Example audio/visual file types: jpg (image), mp3 (audio)
JPEGs We saw on the previous slide that JPEGs result in a permanent loss in quality of the image. For images with blocks of colour, e.g. the above, we tend to get quite a lot of ‘noise’, so PNGs/GIFs tend to be better for ‘graphic art’. JPEG compression tends to work much better on photos, and JPEG is the file format typically output by cameras. [Figure: the same image at a decreasing compression rate] We can customise the amount of JPEG compression. Higher compression reduces file size but also reduces quality, as demonstrated above.
For your interest :: Image Compression Algorithms (Not in the syllabus) PNG compression [Source: Pink Kitty 111] This image shows the ‘relative cost’ (in terms of number of bits) required for each pixel, with blue the fewest bits and red the most bits. As you can see, areas of the same colour take up less space. Repeating textures (e.g. the ends of the bananas) also take up less space, due to how the compression algorithm works.
PNG compression There is a two-stage compression process, part of a compression algorithm known as DEFLATE: #1 :: LZSS Compression This identifies repeating sequences of characters. For images, this corresponds to efficiently compressing repeating segments within the image. (5,3) means we reuse the sequence that starts at position 5 (the ‘S’) and is 3 characters long.
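The (position, length) idea can be sketched in a few lines. This is a simplified illustration and not the real DEFLATE format: it assumes positions are counted from 0 at the start of the text (real implementations store a distance backwards within a sliding window), it only replaces repeats of at least 3 characters, and the function names `lzssEncode` and `lzssDecode` are made up.

```javascript
// Encode text as a list of tokens. Each token is either a single literal
// character, or a [position, length] pair pointing at an earlier copy
// of the same characters.
function lzssEncode(text, minLength = 3) {
  const tokens = [];
  let i = 0;
  while (i < text.length) {
    let bestPos = -1;
    let bestLen = 0;
    // Search everything before position i for the longest match.
    for (let start = 0; start < i; start++) {
      let len = 0;
      while (
        start + len < i &&              // the match must lie entirely in earlier text
        i + len < text.length &&
        text[start + len] === text[i + len]
      ) {
        len++;
      }
      if (len > bestLen) {
        bestLen = len;
        bestPos = start;
      }
    }
    if (bestLen >= minLength) {
      tokens.push([bestPos, bestLen]);  // reuse an earlier sequence
      i += bestLen;
    } else {
      tokens.push(text[i]);             // too short to be worth it: store the character itself
      i++;
    }
  }
  return tokens;
}

// Rebuild the original text by copying from what has already been decoded.
function lzssDecode(tokens) {
  let output = "";
  for (const token of tokens) {
    if (typeof token === "string") {
      output += token;
    } else {
      const [position, length] = token;
      output += output.slice(position, position + length);
    }
  }
  return output;
}

const lzssTokens = lzssEncode("abcabcabc");
console.log(JSON.stringify(lzssTokens)); // ["a","b","c",[0,3],[0,3]]
console.log(lzssDecode(lzssTokens));     // abcabcabc
```

Because decoding rebuilds the text exactly, this stage is lossless.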
#2 :: Huffman Coding Typically we would use the same number of bits for each character, e.g. 8 bits. But it would be more space efficient to use a varying number of bits for each character, so that more common characters use fewer bits and less common characters use more bits. Suppose we had just 4 letters used in our data: a, b, c, d. [Figure: the proportion of the time each letter appears, e.g. ‘a’ 40% of the time, so ‘a’ should have the fewest bits]
Symbol  Code
a       0
b       10
c       110
d       111
Because no code is a prefix of any other code (e.g. 10 doesn’t appear as the first two digits of any other code), there is no ambiguity in how the string is split up. e.g. 0110101110110101110 splits up as 0 110 10 111 0 110 10 111 0, which decodes to acbdacbda.
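The decoding in the example can be checked with a short sketch. It uses the code table from this slide, taking the code for ‘a’ to be 0 as the worked decoding implies. It does not build the codes from the letter frequencies; it only encodes and decodes with a table that is already given, and the names `codeTable`, `huffmanEncode` and `huffmanDecode` are made up for illustration.

```javascript
// Prefix-free code table from the slide ('a' is the most common letter,
// so it gets the shortest code).
const codeTable = { a: "0", b: "10", c: "110", d: "111" };

// Replace each character with its code and join the bits together.
function huffmanEncode(text, table) {
  return [...text].map((ch) => table[ch]).join("");
}

// Read the bits one at a time. Because no code is a prefix of another,
// as soon as the bits read so far match a code, that must be the next symbol.
function huffmanDecode(bits, table) {
  const symbolForCode = {};
  for (const [symbol, code] of Object.entries(table)) {
    symbolForCode[code] = symbol;
  }
  let output = "";
  let current = "";
  for (const bit of bits) {
    current += bit;
    if (current in symbolForCode) {
      output += symbolForCode[current];
      current = "";
    }
  }
  return output;
}

console.log(huffmanDecode("0110101110110101110", codeTable)); // acbdacbda
console.log(huffmanEncode("acbdacbda", codeTable));           // 0110101110110101110
```

The 9 characters take 19 bits here, compared with 72 bits if every character used 8 bits.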
JPEG compression JPEG compression is considerably more complicated and uses a large amount of mathematics. But a summary:
The colour model is converted from RGB (Red-Green-Blue) to Y’CbCr, where Y’ is the brightness and Cb and Cr are two colour components. Because the brightness is confined to a single value (rather than spread across R, G and B), and because human visual perception is dominated by brightness over colour, we can compress colour information more efficiently.
The brightness (Y’) of each pixel is initially preserved, but each of the colour values Cb and Cr is ‘downsampled’, such that each 2×2 block is replaced with a single colour value.
Each 8×8 block undergoes something called a Discrete Cosine Transform for each of Y’, Cb and Cr, which means approximating it as a sum of 2D cosine curves. [Figure: a single 2D cosine wave]
We use bit encoding techniques similar to those for PNGs, e.g. Huffman encoding.
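The first two steps of the summary can be sketched as follows. The conversion uses the full-range constants commonly used with JPEG files (the JFIF convention). The slide only says each 2×2 block is replaced with a single value, so averaging the four values is an assumption, as is the requirement that the image has an even width and height. The function names are made up for illustration.

```javascript
// Convert one RGB pixel (each value 0 to 255) to brightness (y)
// and two colour components (cb, cr).
function rgbToYCbCr(r, g, b) {
  // Round to a whole number and keep the result within 0 to 255.
  const clamp = (v) => Math.max(0, Math.min(255, Math.round(v)));
  return {
    y: clamp(0.299 * r + 0.587 * g + 0.114 * b),
    cb: clamp(128 - 0.168736 * r - 0.331264 * g + 0.5 * b),
    cr: clamp(128 + 0.5 * r - 0.418688 * g - 0.081312 * b),
  };
}

// Downsample one colour channel (a 2D array of numbers) by replacing
// each 2x2 block with a single value. ASSUMPTION: we take the average
// of the four values, and the width and height are both even.
function downsample2x2(channel) {
  const result = [];
  for (let row = 0; row < channel.length; row += 2) {
    const resultRow = [];
    for (let col = 0; col < channel[row].length; col += 2) {
      const sum =
        channel[row][col] + channel[row][col + 1] +
        channel[row + 1][col] + channel[row + 1][col + 1];
      resultRow.push(Math.round(sum / 4));
    }
    result.push(resultRow);
  }
  return result;
}

console.log(rgbToYCbCr(255, 255, 255)); // white: full brightness, neutral colour (y 255, cb 128, cr 128)
console.log(downsample2x2([
  [10, 20, 100, 100],
  [30, 40, 100, 100],
])); // [[25, 100]]
```

After downsampling, each colour channel holds a quarter as many values as before, while the brightness channel is untouched. This is where the lossy part of the saving begins.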
Exam Question (OCR Sample Question Paper) Notice that you need to give the practical implication of each benefit (even if it’s really obvious!)
Coding Mini-Tasks Return to the DrFrostMaths site to complete the various tasks on compression.