Lecture 10: data compression
Outline
Basics of Data Compression
Text & Numeric Compression
Image Compression
Audio Compression
Video Compression
Data Security Through Encryption
Learning Outcomes
Differentiate between lossless and lossy data compression processes.
Basics of Data Compression
Digital compression concepts
Compression techniques replace a file with another that is smaller.
Compressed data requires less storage and can be transmitted in less time.
Decompression techniques expand the compressed file to recover the original data, either exactly or in facsimile.
A pair of compression and decompression techniques that work together is called a codec for short.
Basics of Data Compression (cont)
Compress / Decompress = codec
Main function of a codec: to reduce the redundancy in data.
How? By replacing definable patterns with shorter sequences of symbols.
[Diagram: uncompressed data -> compression / coder -> compressed data -> decompression / decoder; the coder and decoder together form the codec]
Basics of Data Compression (cont)
Types of codecs
Lossy codecs: produce only an approximation of the original data, e.g. audio and digital images.
Lossless codecs: upon decompression always reproduce the original file exactly, e.g. text and numeric data.
Basics of Data Compression (cont)
Speed of compression / decompression
Describes the time required to compress and decompress data; a codec is either symmetric or asymmetric.
Symmetric codec: takes approximately the same amount of time to compress and to decompress a file, e.g. video teleconferencing transmission.
Asymmetric codec: decompression is simple and fast, but compression is more complicated and significantly slower, e.g. storing and accessing data on CD-ROM / DVD.
Basics of Data Compression (cont)
Codec methods
Codecs can be distinguished in 3 ways:
Syntactic method (entropy encoding): attempts to reduce the redundancy of symbolic patterns without any attention to the type of information represented (it ignores the source of the information).
Semantic method (source coding): considers special properties of the type of information represented, which helps to transform or reduce the amount of non-essential information in the original.
Hybrid method: combines the syntactic and semantic approaches. The data is first prepared using a semantic method and then reduced further with entropy encoding.
Entropy encoding: run-length coding, Huffman coding, arithmetic coding
Source coding:
  Prediction: DPCM, DM
  Transformation: FFT, DCT
  Layered coding: bit position, subsampling, sub-band coding
Hybrid coding: JPEG, MPEG, H.261, DVI RTV, DVI PLV
Text and Numeric Compression
Several methods exist for compressing files representing text and numeric data.
1) Run Length Encoding (RLE)
A simple and direct form of compression.
Based on the assumption that a great deal of redundancy is present in the repetition of particular sequences of symbols.
Example: the text sequence A BB CC DDDDDDDDD EE F GGGGG becomes A BB CC D#9 EE F G#5 after compression.
Fax transmissions use RLE for data reduction.
Text and Numeric Compression (cont)
Run Length Encoding (RLE)
Some data contain sequences of identical bytes.
The RLE technique replaces these runs with a marker and a counter that indicates the number of occurrences.
3-byte encoding:
Uncompressed data: AAAAOOOOOOOOBBBBCCCDCC
Compressed data: A#4 O#8 B#4 C C C D C C
The # acts as the marker and is followed by the number of occurrences.
Each coded run costs 3 bytes (symbol, marker, count), so only runs of 4 or more are replaced; the shorter runs CCC and CC are left as they are.
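The marker scheme can be sketched in Python as follows. This is a minimal illustration rather than a standard format: the names `rle_marker_encode`, `rle_marker_decode` and `encoded_size` are hypothetical, the count is assumed to fit in one byte, and the marker # is assumed never to occur in the data itself.

```python
from itertools import groupby

MARKER = "#"   # assumed never to appear in the data
MIN_RUN = 4    # a coded run costs 3 bytes, so shorter runs stay as literals


def rle_marker_encode(text):
    """Return a list of tokens: 'X#n' for a long run, 'X' for a literal."""
    tokens = []
    for symbol, group in groupby(text):
        run = len(list(group))
        if run >= MIN_RUN:
            tokens.append(f"{symbol}{MARKER}{run}")
        else:
            tokens.extend(symbol * run)  # one literal token per character
    return tokens


def rle_marker_decode(tokens):
    """Rebuild the original text from the token list."""
    parts = []
    for token in tokens:
        if MARKER in token:
            symbol, run = token.split(MARKER)
            parts.append(symbol * int(run))
        else:
            parts.append(token)
    return "".join(parts)


def encoded_size(tokens):
    """Size in bytes, assuming symbol, marker and count take one byte each."""
    return sum(3 if MARKER in token else 1 for token in tokens)


data = "AAAAOOOOOOOOBBBBCCCDCC"
tokens = rle_marker_encode(data)
print(" ".join(tokens))                  # A#4 O#8 B#4 C C C D C C
print(len(data), encoded_size(tokens))   # 22 15
```

With three bytes per coded run, the 22-byte example shrinks to 15 bytes, and decoding the tokens gives back the original string.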
Text and Numeric Compression (cont)
Run Length Encoding (RLE)
RLE can also be coded using 2 bytes per run: the first byte gives the number of occurrences and the second gives the data value.
2-byte encoding:
Uncompressed data: AAAAOOOOOOOOBBBBCCCDCC
Compressed data: 4A 8O 4B 3C D C C
The original data occupies 22 bytes (AAAAOOOOOOOOBBBBCCCDCC).
RLE compresses it to 11 bytes (4A 8O 4B 3C D C C).
Text and Numeric Compression (cont)
Run Length Encoding (RLE)
RLE compresses more efficiently when the runs are long.
Example: AAAAAAAAAAAAAAAAAAAA becomes 20A.
Instead of 20 bytes, the storage is brought down to just 2 bytes (1 byte for the count 20 and 1 byte for ‘A’).
The RLE compression ratio can be measured by the formula: (original size / compressed size) : 1
For the 22-byte example above, the compression ratio is 22/11 : 1, which is 2:1.
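A short Python sketch of the 2-byte scheme and the ratio formula follows. It makes one simplifying assumption that differs from the slide: every run, even a single character, is stored as a (count, symbol) pair, so the output is always decodable. Under that assumption the 22-byte example takes 12 bytes rather than 11, because the slide leaves D and CC as plain characters. The function names are hypothetical and counts are assumed to fit in one byte.

```python
from itertools import groupby


def rle_pairs(text):
    """Encode every run, even of length 1, as a (count, symbol) pair."""
    return [(len(list(group)), symbol) for symbol, group in groupby(text)]


def compression_ratio(original_size, compressed_size):
    """Original size divided by compressed size; 2.0 means 2:1."""
    return original_size / compressed_size


long_run = "A" * 20
pairs = rle_pairs(long_run)
print(pairs)                                             # [(20, 'A')]
print(compression_ratio(len(long_run), 2 * len(pairs)))  # 10.0

mixed = "AAAAOOOOOOOOBBBBCCCDCC"
print(rle_pairs(mixed))
print(compression_ratio(len(mixed), 2 * len(rle_pairs(mixed))))
```

The 20-character run reaches 10:1, while the mixed example stays a little below 2:1 under the strict pair scheme.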
Text and Numeric Compression (cont)
Run Length Encoding (RLE)
RLE on a repetitive data source
Consider this: 1, 3, 4, 1, 3, 4, 1, 3, 4, 1, 3, 4
RLE gives 4(1,3,4), which translates to 4 occurrences of the pattern 1, 3, 4.
RLE on differencing
Consider this: 1, 2, 4, 5, 7, 8, 10
RLE can also take the differences between adjacent values and encode them.
In this case the difference between 1 and 2 is 1, between 2 and 4 is 2, between 4 and 5 is 1, and so on.
The differences are 1, 2, 1, 2, 1, 2, which compress further to 3(1,2).
The first value (1) must be kept so that the sequence can be rebuilt from the differences.
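Both ideas, repeated patterns and differencing, can be tried out with a few lines of Python. This is only a sketch: the helper names are hypothetical, and the pattern search assumes that the whole sequence is one pattern repeated a whole number of times.

```python
def differences(values):
    """Differences between adjacent values; the first value is kept separately."""
    return [b - a for a, b in zip(values, values[1:])]


def shortest_repeating_pattern(seq):
    """Return (count, pattern) such that pattern repeated count times equals seq."""
    n = len(seq)
    for period in range(1, n + 1):
        if n % period == 0 and seq[:period] * (n // period) == seq:
            return n // period, seq[:period]
    return 1, seq  # only reached for an empty sequence


print(shortest_repeating_pattern([1, 3, 4] * 4))   # (4, [1, 3, 4])

steps = differences([1, 2, 4, 5, 7, 8, 10])
print(steps)                                       # [1, 2, 1, 2, 1, 2]
print(shortest_repeating_pattern(steps))           # (3, [1, 2])
```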
Text and Numeric Compression (cont)
Run Length Encoding (RLE)
RLE is only good for long runs!
As the previous examples show, only long runs of identical data are worth compressing.
If the data has no long runs, no compression might be achieved. In some cases the data might even expand!
Consider this: “SAMSUNGNOKIASONYACER” = 20 bytes
No character is immediately repeated, so every run has length 1. With 2-byte encoding of every run the result is 1S 1A 1M 1S 1U 1N 1G 1N 1O 1K 1I 1A 1S 1O 1N 1Y 1A 1C 1E 1R = 40 bytes, twice the original size.
Text and Numeric Compression (cont)
2) Huffman Codes
A form of statistical encoding that exploits the overall distribution or frequency of symbols in a source.
Produces an optimal code for a passage-based source by assigning the fewest bits to each symbol given the probability of its occurrence.
e.g. if a passage contains a lot of the character “e”, it makes sense to replace it with the shortest sequence of bits possible; less frequent characters receive longer codes.
Refer to the Huffman tree.
Text and Numeric Compression (cont)
Huffman Codes
This technique is based on the probability distribution of symbols or characters.
Characters with the highest number of occurrences are assigned the shortest codes.
The code length increases as the frequency of occurrence decreases.
Huffman codes are determined by successively constructing a binary tree.
The leaves of the tree represent the characters to be coded.
Text and Numeric Compression (cont)
Huffman Codes
Characters are arranged in descending order of probability.
The tree is built by repeatedly adding the two lowest probabilities and re-sorting.
This process goes on until the last two probabilities sum to 1.
Once this process is complete, a Huffman binary tree can be generated.
Text and Numeric Compression (cont)
Huffman Codes
After 0s and 1s are assigned to the branches, the code words are formed by tracing the path from the root node to each end node (leaf).
If the last two probabilities do not sum to 1, there is most likely a mistake in the process.
This probability of 1, which forms the last symbol, is the root of the binary tree.
Text and Numeric Compression (cont)
Huffman Codes (example)
Let’s say you have this probability distribution: A = 0.10; B = 0.35; C = 0.16; D = 0.2; E = 0.19
1. The characters are listed in order of decreasing probability: B = 0.35; D = 0.2; E = 0.19; C = 0.16; A = 0.10
2. The two characters with the lowest probability are combined: A = 0.10 and C = 0.16 -> AC = 0.26
3. Re-sort, and the new list is: B = 0.35; AC = 0.26; D = 0.2; E = 0.19
4. Repeat what was done in step 2: D = 0.2 and E = 0.19 -> DE = 0.39
5. Re-sort the list again: DE = 0.39; B = 0.35; AC = 0.26
Text and Numeric Compression (cont)
Huffman Codes (example, continued)
Again, take the two lowest probabilities and repeat the process: B = 0.35 and AC = 0.26 -> BAC = 0.61
Re-sort, and the new list is: BAC = 0.61; DE = 0.39
Finally, BAC and DE are combined to give BACDE = 1.0
From all the combinations of probabilities formed along the way, a binary tree is constructed.
Each edge from a node to a sub-node is assigned either a 1 or a 0.
Text and Numeric Compression (cont)
Huffman Codes (resulting binary tree)
P(BACDE) = 1.0
  0: P(BAC) = 0.61
    0: P(B) = 0.35
    1: P(AC) = 0.26
      0: P(C) = 0.16
      1: P(A) = 0.10
  1: P(DE) = 0.39
    0: P(D) = 0.2
    1: P(E) = 0.19
Huffman code for each character
Character  Probability  Code word
A          0.10         011
B          0.35         00
C          0.16         010
D          0.20         10
E          0.19         11
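The merge-and-re-sort procedure can be sketched in Python with a priority queue in place of the repeated sorting. This is one possible implementation, not the only one: the function name `huffman_code` is hypothetical, at least two symbols are assumed, and the choice of which branch gets 0 or 1 is arbitrary, so the individual code words may differ from the table while the code lengths stay the same.

```python
import heapq
from itertools import count


def huffman_code(probabilities):
    """Build a prefix code from a {symbol: probability} mapping (2+ symbols)."""
    tiebreak = count()  # keeps heap entries comparable when probabilities tie
    heap = [(p, next(tiebreak), {symbol: ""}) for symbol, p in probabilities.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p_low, _, codes_low = heapq.heappop(heap)     # lowest probability
        p_next, _, codes_next = heapq.heappop(heap)   # second lowest
        # Merging two subtrees prepends one more bit to every code inside them.
        merged = {s: "0" + c for s, c in codes_low.items()}
        merged.update({s: "1" + c for s, c in codes_next.items()})
        heapq.heappush(heap, (p_low + p_next, next(tiebreak), merged))
    return heap[0][2]


probabilities = {"A": 0.10, "B": 0.35, "C": 0.16, "D": 0.20, "E": 0.19}
codes = huffman_code(probabilities)
lengths = {s: len(c) for s, c in sorted(codes.items())}
average = sum(probabilities[s] * len(c) for s, c in codes.items())
print(lengths)             # {'A': 3, 'B': 2, 'C': 3, 'D': 2, 'E': 2}
print(round(average, 2))   # 2.26
```

The average of 2.26 bits per character compares with the 3 bits a fixed-length code would need for five symbols.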
Text and Numeric Compression (cont)
3) LZW compression (Lempel-Ziv-Welch)
Based on recognizing common string patterns.
Basic strategy: replace strings in a file with bit codes rather than replacing individual characters with bit codes.
Typically achieves a greater compression rate on text than both previous methods.
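A compact Python sketch of the LZW idea is given below. It assumes 8-bit characters (codes 0 to 255 stand for single characters), lets the dictionary grow without limit, and returns the codes as a list of integers instead of packing them into bits; practical implementations differ in these details. The function names are hypothetical.

```python
def lzw_compress(text):
    """Replace repeated strings by dictionary codes (list of integers)."""
    dictionary = {chr(i): i for i in range(256)}
    next_code = 256
    current = ""
    output = []
    for ch in text:
        candidate = current + ch
        if candidate in dictionary:
            current = candidate               # keep extending the match
        else:
            output.append(dictionary[current])
            dictionary[candidate] = next_code  # learn the new string
            next_code += 1
            current = ch
    if current:
        output.append(dictionary[current])
    return output


def lzw_decompress(codes):
    """Rebuild the text; the dictionary is reconstructed while decoding."""
    if not codes:
        return ""
    dictionary = {i: chr(i) for i in range(256)}
    next_code = 256
    previous = dictionary[codes[0]]
    result = [previous]
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        elif code == next_code:               # string defined by this very step
            entry = previous + previous[0]
        else:
            raise ValueError(f"invalid LZW code: {code}")
        result.append(entry)
        dictionary[next_code] = previous + entry[0]
        next_code += 1
        previous = entry
    return "".join(result)


text = "TOBEORNOTTOBEORTOBEORNOT"
codes = lzw_compress(text)
print(len(text), len(codes))          # 24 16
print(lzw_decompress(codes) == text)  # True
```

The decoder needs no copy of the dictionary, because it can rebuild the same entries from the codes it has already seen.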
Image Compression
Image compression involves reducing the size of image data files while retaining the necessary information.
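Image Compression (cont’d)
Besides the compression ratio, another way to state the compression is in bits per pixel.
For an NxN image: bits per pixel = (number of bits in the compressed file) / (number of pixels) = (8 x number of bytes) / (N x N)
Example: say that we have a 256x256 image which is compressed to 6,554 bytes. Then bits per pixel = (8 x 6,554) / (256 x 256) = 52,432 / 65,536, which is approximately 0.8.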
Image Compression (cont’d)
The importance of reducing file size:
To reduce the amount of storage needed.
To reduce the bandwidth required when sending the file over a network.
Image Compression (cont’d)
The amount of data required for a digital image is enormous.
A single 512x512, 8-bit image requires 2,097,152 bits (256 KB) for storage.
A single 512x512 RGB color image (24 bits per pixel) requires 786,432 bytes (768 KB) for storage.
Transmitting the RGB image over a 56.6 kbps modem would take about 1.85 minutes.
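The storage, transmission and bits-per-pixel figures can be checked with a short calculation. The sketch assumes 1 KB = 1,024 bytes, 1 kbps = 1,000 bits per second, 24 bits per pixel for the RGB image, and bits per pixel defined as compressed bits divided by the number of pixels; the function names are hypothetical.

```python
def image_bits(rows, cols, bits_per_pixel):
    """Uncompressed image size in bits."""
    return rows * cols * bits_per_pixel


def transmission_seconds(n_bits, bits_per_second):
    """Ideal transfer time, ignoring protocol overhead."""
    return n_bits / bits_per_second


def bits_per_pixel(compressed_bytes, rows, cols):
    """Compressed size expressed per pixel."""
    return 8 * compressed_bytes / (rows * cols)


gray_bits = image_bits(512, 512, 8)
rgb_bits = image_bits(512, 512, 24)
print(gray_bits, gray_bits // 8 // 1024)                       # 2097152 256
print(rgb_bits // 8, rgb_bits // 8 // 1024)                    # 786432 768
print(round(transmission_seconds(rgb_bits, 56_600) / 60, 2))   # 1.85
print(round(bits_per_pixel(6554, 256, 256), 2))                # 0.8
```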
Huffman Coding (Image example)
The Huffman algorithm can be described in five steps:
1. Find the gray-level probabilities for the image by finding the histogram.
2. Order the input probabilities (histogram magnitudes) by size.
3. Combine the smallest two by addition.
4. Repeat from step 2 until only two probabilities are left.
5. Working backward along the tree, generate the code by alternating assignment of 0 and 1.
Huffman Coding (Image example)
We have an image with 2 bits/pixel, giving 4 possible gray levels. The image is 10 rows by 10 columns.
[Figure: histogram of the image; vertical axis: number of pixels, horizontal axis: gray level]
Huffman Coding (Image example)
Step 1: Find the gray-level probabilities.
g0 = 20/100 = 0.2
g1 = 30/100 = 0.3
g2 = 10/100 = 0.1
g3 = 40/100 = 0.4
Step 2: Order the probabilities (listed here from largest to smallest).
g3 0.4
g1 0.3
g0 0.2
g2 0.1
Huffman Coding (Image example)
Step 3: Combine the smallest two by addition.
0.4, 0.3, 0.2, 0.1 -> 0.4, 0.3, 0.3 (0.2 + 0.1 = 0.3)
Step 4: Reorder and add until two values remain.
0.4, 0.3, 0.2, 0.1 -> 0.4, 0.3, 0.3 -> 0.6, 0.4 (0.3 + 0.3 = 0.6)
Huffman Coding (Image example)
Step 5: Generate the code. The final code is given in the following table (all codes are binary).
Gray level  Natural code  Probability  Huffman code
g0          00            0.2          010
g1          01            0.3          00
g2          10            0.1          011
g3          11            0.4          1
Huffman Coding (Image example)
Note that the gray level with the highest probability is assigned the fewest bits.
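The saving in this example can be quantified by computing the average number of bits per pixel under the Huffman code. The sketch below takes the probabilities and code words from the table and, as an addition not on the slides, also prints the entropy of the distribution, which is the theoretical lower bound for any symbol-by-symbol code.

```python
from math import log2

# Probabilities and Huffman code words taken from the image example.
probabilities = {"g0": 0.2, "g1": 0.3, "g2": 0.1, "g3": 0.4}
huffman = {"g0": "010", "g1": "00", "g2": "011", "g3": "1"}

# Expected code length: each gray level weighted by how often it occurs.
average_bits = sum(probabilities[g] * len(huffman[g]) for g in probabilities)

# Entropy in bits per pixel (added here for comparison).
entropy = -sum(p * log2(p) for p in probabilities.values())

print(round(average_bits, 2))   # 1.9
print(round(entropy, 2))        # 1.85
```

For the 10x10 image this means about 190 bits with the Huffman code instead of 200 bits with the natural 2-bit code.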
Run-Length Coding (Image example)
RLC is an image compression method that works by counting the number of adjacent pixels with the same gray-level value.
This count, called the run length, is then coded and stored.
There are many variations of RLC:
Basic methods: used for binary images
Extended versions: used for gray-scale images
Run-Length Coding (Image example)
There can be two types of RLC:
Horizontal RLC: count along rows
Vertical RLC: count along columns
The number of bits used for the coding depends on the number of pixels in a row:
If the row has 2^n pixels, then the required number of bits is n.
A 256x256 image requires 8 bits, since 2^8 = 256.
Run-Length Coding (Image example)
The next step is to define a convention for the first RLC number in a row. Does it represent a run of 0’s or 1’s?
Consider the following binary image:
[Figure: 8x8 binary image]
Run-Length Coding (Image example)
Apply RLC to this image, using:
Horizontal RLC
The first RLC number represents a run of 0’s
The RLC numbers are:
First row: 8
Second row: 0,4,4
Third row: 1,2,5
Fourth row: 1,5,2
Fifth row: 1,3,2,1,1
Sixth row: 2,1,2,2,1
Seventh row: 0,4,1,1,2
Eighth row: 8
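Horizontal RLC for a binary image, with the convention that the first number is a run of 0's, can be sketched in Python as below. The function names are hypothetical. Decoding the run lengths listed above yields an 8x8 image, and encoding that image again returns the same numbers.

```python
def rlc_decode_row(runs, first_value=0):
    """Expand run lengths into pixels; runs alternate, starting with first_value."""
    row = []
    value = first_value
    for run in runs:
        row.extend([value] * run)
        value = 1 - value
    return row


def rlc_encode_row(row, first_value=0):
    """Count runs along a row; a leading 0 is emitted if the row starts with 1."""
    runs = []
    value = first_value
    run = 0
    for pixel in row:
        if pixel == value:
            run += 1
        else:
            runs.append(run)
            value = pixel
            run = 1
    runs.append(run)
    return runs


rlc_numbers = [
    [8], [0, 4, 4], [1, 2, 5], [1, 5, 2],
    [1, 3, 2, 1, 1], [2, 1, 2, 2, 1], [0, 4, 1, 1, 2], [8],
]
image = [rlc_decode_row(runs) for runs in rlc_numbers]
for row in image:
    print("".join(str(pixel) for pixel in row))

print([rlc_encode_row(row) for row in image] == rlc_numbers)   # True
```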
Image Compression
Popular formats for compressing digital images:
1) GIF (Graphics Interchange Format) Compression
Uses the LZW codec for lossless compression (8-bit images).
Looks for repeated horizontal patterns along each scan line.
Can handle multiple images.
2) TIFF (Tagged Image File Format) Compression
Commonly based on the LZW method.
Widely used by a variety of applications and hardware platforms.
Image Compression (cont)
3) PNG (Portable Network Graphics) Compression
Designed as a replacement for GIF, using a lossless method for transmitting single bitmap images over computer networks.
Cannot handle multiple images, but improves compression rates and can handle 24-bit true color.
Looks for repeated patterns both horizontally and vertically.
Image Compression (cont)
4) JPEG (Joint Photographic Experts Group) Compression
A general-purpose standard for still images with continuous-tone graphics.
The user can choose the compression rate, but image quality is sacrificed in proportion to it: greater compression rates mean poorer image quality.
Advantage: its wide acceptance and support in a variety of applications.
Audio Compression
The choice of sampling parameters (sampling rate for frequency, sample resolution for amplitude) is very important for controlling the size of an audio file.
Higher sampling rates mean higher fidelity, but cost more in storage space and transmission time.
A widely used method is ADPCM (Adaptive Differential Pulse Code Modulation).
Video Compression
Transmitting standard full-screen color imagery as video at 30 fps requires a data rate of nearly 28 MB per second, so video compression is absolutely essential!
One idea is to reduce the frame rate (from 30 fps to 15 fps), but this sacrifices a lot of the motion in the video.
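The figure of nearly 28 MB per second can be reproduced with a simple calculation. The frame size is not stated on the slide; the sketch below assumes a 640x480 frame with 24-bit color (3 bytes per pixel), which is an assumption made only for illustration.

```python
def video_bytes_per_second(width, height, bytes_per_pixel, fps):
    """Uncompressed video data rate in bytes per second."""
    return width * height * bytes_per_pixel * fps


# Assumed frame format: 640x480 pixels, 3 bytes per pixel.
full_rate = video_bytes_per_second(640, 480, 3, 30)
half_rate = video_bytes_per_second(640, 480, 3, 15)
print(full_rate, half_rate)   # 27648000 13824000
```

Halving the frame rate only halves the data rate, which is still far too high for most transmission channels.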
Video Compression (cont)
Intraframe (spatial) compression:
Reduces the redundant information contained within a single image or frame.
On its own it is not sufficient for achieving the data rates needed to transmit video in practical applications.
Video Compression (cont)
Interframe (temporal) compression
The idea is that much of the data in video images is repeated frame after frame.
This technique eliminates the redundancy of information between frames.
A key frame (master frame) must be identified.
Key frame: the basis for deciding how much motion or how many changes take place in succeeding frames.
Video Compression (cont)
Interframe (temporal) compression
Example: assume that the background (sky, road, and grass) stays the same and only the car is moving.
The first frame is stored as the key frame; it holds enough information to be reconstructed independently.
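The key-frame idea can be illustrated with a toy Python sketch in which a frame is a flat list of pixel values and a later frame is stored only as the list of pixels that differ from the key frame. This is a deliberately simplified picture of temporal redundancy, not how MPEG or any real codec works; the function names and the sample frames are invented for the example.

```python
def frame_delta(reference, frame):
    """List (index, new_value) for every pixel that changed since the reference."""
    return [(i, new) for i, (old, new) in enumerate(zip(reference, frame)) if old != new]


def apply_delta(reference, delta):
    """Rebuild a frame from the reference frame plus the stored changes."""
    frame = list(reference)
    for index, value in delta:
        frame[index] = value
    return frame


key_frame = [0, 0, 0, 0, 9, 9, 0, 0]    # 9 marks the moving object
next_frame = [0, 0, 0, 0, 0, 9, 9, 0]   # the object has moved one pixel right
delta = frame_delta(key_frame, next_frame)
print(delta)                                         # [(4, 0), (6, 9)]
print(apply_delta(key_frame, delta) == next_frame)   # True
```

Only two of the eight pixels need to be stored for the second frame, because the background is unchanged.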
Video Compression (cont)
MPEG (Moving Picture Experts Group) Compression
Prediction approach with three picture types: I pictures (intra pictures), P pictures (predicted pictures) and B pictures (bi-directional pictures).
Some compressed frames are the difference results of predictions based on past frames used as a reference, and others are based on both past and future frames from the sequence.
Video Compression (cont) Spatial vs temporal compression
Summary
Compressing data means reducing the effective size of a data file for storage or transmission.
Particular paired compression/decompression methods are called codecs.
Codecs that cannot reproduce the original file exactly are called lossy methods; those that reproduce the original exactly are called lossless methods.
Text and numbers usually require lossless methods.
Image, video and sound codecs are usually lossy.