Data Compression
How Is This Possible? Entire King James Bible : 4,834,757 bytes Zip Archive Containing It: 1,339,843 bytes
More Questions Why does this file: Compress differently than:
Behind The Scenes Compression used for: – ~50% of web traffic – Most audio/video files – Sometimes for every file on a drive
Trick 1: Describe the contents of this file in as few words as possible…
Trick 1: Run Length Encoding : – Describe repetition as: (How many times)What to repeat – A
RLE Examples ABABABABABAB → 6AB; AAABBBBBAAACC → 3A,5B,3A,2C; (5)1,(1)0,(6)01
RLE Fail ABCDEF 1A,1B,1C,1D,1E,1F
RLE Fail 2 This file doesn't just have A's: 80A,1newline,80A,1newline,80A,1newline…
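The run length encoding slides can be sketched in a few lines. This is a minimal single-character version, assuming the slides' notation of a count followed by the symbol, with commas between runs. The name `rle_encode` is made up for illustration, and the sketch does not look for multi-character repeats, so it would not produce the `6AB` form.

```python
from itertools import groupby


def rle_encode(text):
    """Describe each run of identical characters as <count><symbol>."""
    runs = []
    for symbol, group in groupby(text):
        count = sum(1 for _ in group)
        # Spell out newlines the way the slide does, so the output stays readable.
        name = "newline" if symbol == "\n" else symbol
        runs.append(f"{count}{name}")
    return ",".join(runs)


print(rle_encode("AAABBBBBAAACC"))          # 3A,5B,3A,2C
print(rle_encode("ABCDEF"))                 # 1A,1B,1C,1D,1E,1F (longer than the input)
print(rle_encode("A" * 80 + "\n" + "A" * 80 + "\n"))  # 80A,1newline,80A,1newline
```

The second call shows the failure case: with no repetition, every character costs a count plus a separator, so the "compressed" text is longer than the original.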
Trick 2 Same As Earlier: – Describe patterns with instructions to go back x and copy y characters ABCDEFG-b7c7 "Write down ABCDEFG, then go back 7 characters and copy the next 7 characters to the end of what you have"
Same As Earlier ABCDEFG-b7c7
Same As Earlier ABCDEFG-b7c7 ABCDEFG
Same As Earlier ABCDEFG-b7c7 ABCDEFGA
Same As Earlier ABCDEFG-b7c7 ABCDEFGAB
Same As Earlier ABCDEFG-b7c7 ABCDEFGABC
Same As Earlier ABCDEFG-b7c7 ABCDEFGABCD
Same As Earlier ABCDEFG-b7c7 ABCDEFGABCDE
Same As Earlier ABCDEFG-b7c7 ABCDEFGABCDEF
Same As Earlier ABCDEFG-b7c7 ABCDEFGABCDEFG
Same As Earlier AB-b2c6
Same As Earlier AB-b2c6 AB
Same As Earlier AB-b2c6 ABA
Same As Earlier AB-b2c6 ABAB
Same As Earlier AB-b2c6 ABABA
Same As Earlier AB-b2c6 ABABAB
Same As Earlier AB-b2c6 ABABABA
Same As Earlier AB-b2c6 ABABABAB
Same As Earlier AB-b2c2-C-b2c5
Same As Earlier AB-b2c2-C-b2c5 AB
Same As Earlier AB-b2c2-C-b2c5 ABA
Same As Earlier AB-b2c2-C-b2c5 ABAB
Same As Earlier AB-b2c2-C-b2c5 ABABC
Same As Earlier AB-b2c2-C-b2c5 ABABCB
Same As Earlier AB-b2c2-C-b2c5 ABABCBC
Same As Earlier AB-b2c2-C-b2c5 ABABCBCB
Same As Earlier AB-b2c2-C-b2c5 ABABCBCBC
Same As Earlier AB-b2c2-C-b2c5 ABABCBCBCB
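The step-by-step frames above can be reproduced with a small decoder. This is a sketch under assumptions: pieces are separated by `-`, a piece of the form `b<x>c<y>` means "go back x characters and copy y", and anything else is literal text. The function name `expand` is hypothetical, and the sketch assumes literal text never contains `-` or looks like `b7c7`.

```python
import re


def expand(encoded):
    """Expand literals and 'go back x, copy y' instructions into plain text."""
    out = []
    for piece in encoded.split("-"):
        match = re.fullmatch(r"b(\d+)c(\d+)", piece)
        if match is None:
            out.extend(piece)  # literal characters, written down as-is
            continue
        back, count = int(match.group(1)), int(match.group(2))
        start = len(out) - back
        # Copy one character at a time so a copy may read characters
        # that this same instruction has just written (as in AB-b2c6).
        for i in range(count):
            out.append(out[start + i])
    return "".join(out)


print(expand("ABCDEFG-b7c7"))    # ABCDEFGABCDEFG
print(expand("AB-b2c6"))         # ABABABAB
print(expand("AB-b2c2-C-b2c5"))  # ABABCBCBCB
```

Copying character by character is the design choice that matters: `b2c6` asks for six characters when only two exist, and it works because each copied character becomes available to copy again.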
Shorter Symbol Trick Normally text is 8-bit ASCII – 8 bits = 256 possibilities
Shorter Symbol Trick If the message is just A's and B's we are wasting space: A = 01000001, B = 01000010 Why not: A = 0, B = 1
Shorter Symbol Trick Shorter Symbol Trick: – Use minimum number of bits to represent different symbols in message – More common symbols get shorter representation
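The first bullet can be made concrete with a short calculation. This sketch assumes every symbol gets the same number of bits; `bits_per_symbol` is an illustrative name, not something from the slides.

```python
def bits_per_symbol(message):
    """Smallest fixed bit width that gives every distinct symbol its own code."""
    distinct = len(set(message))
    # n distinct symbols need enough bits to count from 0 to n - 1;
    # even a one-symbol message is given 1 bit here.
    return max(1, (distinct - 1).bit_length())


print(bits_per_symbol("ABBABAAB"))  # 1  (just A's and B's)
print(bits_per_symbol("AAAABAAC"))  # 2  (three symbols)
print(bits_per_symbol([chr(i) for i in range(256)]))  # 8  (all 256 byte values)
```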
More Common This message: AAAABAAC Three symbols, so each needs 2 bits – Could do AAAABAAC (8 symbols × 2 bits = 16 bits)
More Common But A is more common: AAAABAAC So maybe we can use a shorter code for it AAAABAAC (10 bits)
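The 16-bit and 10-bit figures can be checked directly. The two code tables below are assumptions chosen to match the bit counts on the slides (the fixed table is arbitrary, and the variable one follows the "0 means A, 1 means keep going" idea); `encode` is an illustrative helper.

```python
# Assumed code tables, not taken from the slides.
FIXED = {"A": "00", "B": "01", "C": "10"}    # every symbol costs 2 bits
VARIABLE = {"A": "0", "B": "10", "C": "11"}  # the common symbol A costs 1 bit


def encode(message, code):
    """Replace each symbol with its bit string and join the results."""
    return "".join(code[symbol] for symbol in message)


fixed_bits = encode("AAAABAAC", FIXED)
variable_bits = encode("AAAABAAC", VARIABLE)
print(fixed_bits, len(fixed_bits))        # 0000000001000010 16
print(variable_bits, len(variable_bits))  # 0000100011 10
```

The saving comes entirely from the six A's: each costs one bit instead of two, which more than pays for nothing being lost on B and C.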
Why Does it Work No code is a prefix for another – 0 : it is an A – 1 : keep going ABCAAB
Why Does it Work A BAD code – 0 : is it an A? is it the start of a D? ABDA
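The difference between the good and bad codes can be sketched as follows. Both tables are assumptions: `GOOD` matches the "0 is an A, 1 means keep going" rule, and `BAD` adds a hypothetical D = 01 so that A's code is the start of D's code. The helper names are illustrative.

```python
GOOD = {"A": "0", "B": "10", "C": "11"}            # assumed prefix-free code
BAD = {"A": "0", "B": "10", "C": "11", "D": "01"}  # assumed: A is a prefix of D


def is_prefix_free(code):
    """True if no code word is the beginning of a different code word."""
    words = list(code.values())
    for i, first in enumerate(words):
        for j, second in enumerate(words):
            if i != j and second.startswith(first):
                return False
    return True


def encode(message, code):
    return "".join(code[symbol] for symbol in message)


def decode(bits, code):
    """Read bits left to right, emitting a symbol as soon as one matches."""
    lookup = {word: symbol for symbol, word in code.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in lookup:
            out.append(lookup[current])
            current = ""
    if current:
        raise ValueError("leftover bits that match no symbol")
    return "".join(out)


print(is_prefix_free(GOOD), decode(encode("ABCAAB", GOOD), GOOD))  # True ABCAAB
print(is_prefix_free(BAD), decode(encode("ABDA", BAD), BAD))       # False ABAB
```

With the bad table, ABDA becomes 010010, and the decoder reads it back as ABAB: once it sees a 0 it cannot tell whether that is an A or the start of a D.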
Building a Code CS160 Reader… – Huffman Code Building
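The reader's own walkthrough is not reproduced here, so the following is the standard textbook Huffman procedure as a rough sketch: repeatedly merge the two least common entries, prefixing one side with 0 and the other with 1. The function name `huffman_code` and the tie-breaking order are choices made for this example, so the exact bit patterns may differ from the reader's while the code lengths agree.

```python
import heapq
from collections import Counter


def huffman_code(message):
    """Build a prefix-free code where common symbols get shorter bit strings."""
    if not message:
        return {}
    counts = Counter(message)
    # Heap entries: (weight, unique id to break ties, {symbol: code so far}).
    heap = [(weight, i, {symbol: ""})
            for i, (symbol, weight) in enumerate(sorted(counts.items()))]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {symbol: "0" for symbol in heap[0][2]}
    next_id = len(heap)
    while len(heap) > 1:
        weight1, _, table1 = heapq.heappop(heap)  # least common entry
        weight2, _, table2 = heapq.heappop(heap)  # next least common entry
        merged = {s: "0" + c for s, c in table1.items()}
        merged.update({s: "1" + c for s, c in table2.items()})
        heapq.heappush(heap, (weight1 + weight2, next_id, merged))
        next_id += 1
    return heap[0][2]


code = huffman_code("AAAABAAC")
total_bits = sum(len(code[symbol]) for symbol in "AAAABAAC")
print(code, total_bits)
```

For AAAABAAC this gives A a 1-bit code and B and C 2-bit codes, for 10 bits in total, the same length as the hand-built code earlier in the deck.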
Lossy Compression Lossless compression: – Can recreate original perfectly – Algorithms: run length encoding, same as earlier, shorter symbol – Examples: zip files, web traffic
Lossy Compression Lossy compression – Original can NOT be recreated perfectly
My Kids Kb
Every Other Line/Column Removed
Remaining pixels packed back down : 320Kb
Blown back up vs original: Original, Compressed
Only keep every 4th line/column : 81 Kb
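The line/column dropping trick can be sketched on a tiny grid of numbers standing in for pixels. This assumes the image is a list of rows and that "blowing back up" simply repeats each kept pixel; the function names are illustrative. Keeping every 2nd line and column keeps 1/4 of the pixels, and keeping every 4th keeps 1/16, which lines up with the drop from 320Kb to 81Kb.

```python
def downsample(pixels, step):
    """Keep only every step-th row and every step-th column."""
    return [row[::step] for row in pixels[::step]]


def upsample(small, step):
    """Blow the image back up by repeating each pixel step times each way."""
    big = []
    for row in small:
        wide = [pixel for pixel in row for _ in range(step)]
        big.extend(list(wide) for _ in range(step))
    return big


original = [[r * 4 + c for c in range(4)] for r in range(4)]
small = downsample(original, 2)
restored = upsample(small, 2)
print(small)                 # [[0, 2], [8, 10]]
print(restored)              # [[0, 0, 2, 2], [0, 0, 2, 2], [8, 8, 10, 10], [8, 8, 10, 10]]
print(restored == original)  # False: the dropped pixels are gone for good
```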
Real JPEG Image broken into blocks of pixels
Real JPEG Each block processed separately
Real JPEG Block processed, to look for compressible patterns
Real JPEG Patterns can more or less recreate image
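The block steps can be sketched roughly as follows. JPEG works on 8×8 pixel blocks and describes each one as a mix of cosine patterns (the discrete cosine transform). This sketch covers only those two steps on a grayscale grid. It leaves out color conversion, quantization (the step that actually discards detail), and the final lossless coding, and it assumes the image width and height are multiples of 8. The function names are illustrative.

```python
import math

BLOCK = 8  # JPEG processes the image in 8x8 pixel blocks


def split_into_blocks(pixels, size=BLOCK):
    """Cut a grid of pixels into size x size blocks, left to right, top to bottom."""
    blocks = []
    for top in range(0, len(pixels), size):
        for left in range(0, len(pixels[0]), size):
            blocks.append([row[left:left + size] for row in pixels[top:top + size]])
    return blocks


def dct2(block):
    """Describe a square block as strengths of cosine patterns (orthonormal DCT-II)."""
    n = len(block)

    def scale(k):
        return math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)

    out = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            total = 0.0
            for x in range(n):
                for y in range(n):
                    total += (block[x][y]
                              * math.cos((2 * x + 1) * u * math.pi / (2 * n))
                              * math.cos((2 * y + 1) * v * math.pi / (2 * n)))
            out[u][v] = scale(u) * scale(v) * total
    return out


flat_image = [[100] * 16 for _ in range(16)]  # a 16x16 patch of one flat shade
blocks = split_into_blocks(flat_image)
patterns = dct2(blocks[0])
print(len(blocks))              # 4
print(round(patterns[0][0], 6)) # 800.0, the only pattern a flat block needs
```

A flat block needs just one number instead of 64, which is the kind of compressible pattern the block processing is looking for; blocks with more detail need more of the patterns to be recreated well.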
JPEG at 200% zoom: no compression, low compression, medium compression, high compression