Lecture 7 Source Coding and Compression Dr.-Ing. Khaled Shawky Hassan Room: C3-222, ext: 1204, Email: khaled.shawky@guc.edu.eg 1 1 1
Dictionary Techniques Statistical compression methods Use the statistical model of the data, and the quality of compression they achieve depends on how good that model is. Dictionary-based compression methods Do not use a statistical model The dictionary holds strings of symbols and it may be static or dynamic (adaptive). Static is permanent (allowing the addition of strings but no deletions). Dynamic holds strings previously found in the input stream, allowing for additions and deletions of strings as new input is being read. 2 2
Dictionary Techniques So far we assumed independent symbol with some statistical model: Not true for many common data types, e.g.: text, images, code the quality of compression they achieve depends on how good that model is Basic idea (ADAPTIVE) Identify frequent symbol patterns. Encode those frequent symbols more efficiently. Use these encoded pattern as a default (less efficient) encoding for the rest! Notes This looks reasonable for things like “text, image” … Also, it looks reasonable for things which is not very random. 3 3
Lempel-Ziv Compression Techniques Static coding requires two-passes: one pass to compute probabilities (or frequencies) and determine the mapping, and a second pass to encode. Examples of Static techniques: Static Huffman Coding All of the adaptive methods are one-pass methods; only one scan of the message (to encode) is required. Examples of adaptive techniques: LZ77?, LZ78?, and Adaptive Huffman Coding 4 4
Lempel-Ziv Compression Techniques LZ77 (Sliding Window) Variants: LZSS (Lempel-Ziv-Storer-Szymanski) Applications: gzip, Squeeze, LHA, PKZIP, ZOO LZ78 (Dictionary Based) Variants: LZW (Lempel-Ziv-Welch), LZC (Lempel-Ziv-Compress) Applications: compress, ARC , V.42bis (LZ78), gzip, zlib, GIF, CCITT (modems), PAK (LZ77) Traditionally: LZ77 was better but slower, but the gzip version is almost as fast as any LZ78. 5 5
Example Dictionary All/some letters from the alphabet + As many diagrams(pairs/group of letters) as possible Example: A= {a, b, c, d, r} 3 pairs: ab, ac, ad Encode the following: abracadabra 6 6
Example 7 7
Example 8 8
Example 9 9
Example 10 10
Example 11 11
Example 12 12
Example 13 13
Example 14 14
Example 15 15
LZ78 Compression Algorithm LZ78 inserts one- or multi-character, distinct patterns of the message to be encoded in a Dictionary. The multi-character patterns are of the form: C0C1 . . . Cn-1Cn. The prefix of a pattern refers to all the pattern characters except the last: C0C1 . . . Cn-1 LZ78 Output: Note: The dictionary is usually implemented as a hash table. 16
LZ78 Compression Algorithm (cont’d) Dictionary empty ; Prefix empty ; DictionaryIndex 1; while(characterStream is not empty) { Char next character in characterStream; if(Prefix + Char exists in the Dictionary) Prefix Prefix + Char ; else if(Prefix is empty) CodeWordForPrefix 0 ; CodeWordForPrefix DictionaryIndex for Prefix ; Output: (CodeWordForPrefix, Char) ; insertInDictionary( ( DictionaryIndex , Prefix + Char) ); DictionaryIndex++ ; } if(Prefix is not empty) CodeWordForPrefix DictionaryIndex for Prefix; Output: (CodeWordForPrefix , ) ; 17
Example 1: LZ78 Compression Encode (i.e., compress) the string ABBCBCABABCAABCAAB using the LZ78 algorithm. The compressed message is: (0,A)(0,B)(2,C)(3,A)(2,A)(4,A)(6,B) Note: The above is just a representation, the commas and parentheses are not transmitted; we will discuss the actual form of the compressed message later! 18
Example 1: LZ78 Compression (cont’d) How to solve it :-) 1. A is not in the Dictionary; insert it 2. B is not in the Dictionary; insert it 3. B is in the Dictionary. BC is not in the Dictionary; insert it. 4. B is in the Dictionary. BC is in the Dictionary. BCA is not in the Dictionary; insert it. 5. B is in the Dictionary. BA is not in the Dictionary; insert it. 6. B is in the Dictionary. BCA is in the Dictionary. BCAA is not in the Dictionary; insert it. 7. B is in the Dictionary, BC is in the Dictionary, BCA is in the Dictionary, BCAA is in the Dictionary. However, BCAAB is not in the Dictionary; insert it. 19
Example 2: LZ78 Compression Encode (i.e., compress) the string BABAABRRRA using the LZ78 algorithm. The compressed message is: (0,B)(0,A)(1,A)(2,B)(0,R)(5,R)(2, ) 20
Example 2: LZ78 Compression (cont’d) 1. B is not in the Dictionary; insert it 2. A is not in the Dictionary; insert it 3. B is in the Dictionary. BA is not in the Dictionary; insert it. 4. A is in the Dictionary. AB is not in the Dictionary; insert it. 5. R is not in the Dictionary; insert it. 6. R is in the Dictionary. RR is not in the Dictionary; insert it. 7. A is in the Dictionary and it is the last input character; output a pair containing its index: (2, ) 21
Example 3: LZ78 Compression Encode (i.e., compress) the string AAAAAAAAA using the LZ78 algorithm. 1. A is not in the Dictionary; insert it 2. A is in the Dictionary AA is not in the Dictionary; insert it 3. A is in the Dictionary. AA is in the Dictionary. AAA is not in the Dictionary; insert it. 4. A is in the Dictionary. AAA is in the Dictionary and it is the last pattern; output a pair containing its index: (3, ) 22
LZ78 Compression: Number of bits transmitted Example: Uncompressed String: ABBCBCABABCAABCAAB Number of bits = Total number of characters * 8 = 18 * 8 = 144 bits Suppose the codewords are indexed starting from 1: Compressed string( codewords): (0, A) (0, B) (2, C) (3, A) (2, A) (4, A) (6, B) Codeword index 1 2 3 4 5 6 7 Each code word consists of an integer and a character: The character is represented by 8 bits. The number of bits n required to represent the integer part of the codeword with index i is given by: Alternatively number of bits required to represent the integer part of the codeword with index i is the number of significant bits required to represent the integer i – 1 23
LZ78 Compression: Number of bits transmitted Codeword (0, A) (0, B) (2, C) (3, A) (2, A) (4, A) (6, B) index 1 2 3 4 5 6 7 Bits: (1 + 8) + (1 + 8) + (2 + 8) + (2 + 8) + (3 + 8) + (3 + 8) + (3 + 8) = 70 bits The actual compressed message is: 0A0B10C11A010A100A110B where each character is replaced by its binary 8-bit ASCII code. 24
LZ78 Decompression Algorithm (self study) Dictionary empty ; DictionaryIndex 1 ; while(there are more (CodeWord, Char) pairs in codestream){ CodeWord next CodeWord in codestream ; Char character corresponding to CodeWord ; if(CodeWord = = 0) String empty ; else String string at index CodeWord in Dictionary ; Output: String + Char ; insertInDictionary( (DictionaryIndex , String + Char) ) ; DictionaryIndex++; } 25
LZ78 Decompression Algorithm (self study) input: (CI, character) pairs output: if(CI == 0) output: currentCharacter else output: stringAtIndex CI + currentCharacter Insert: current output in dictionary 26
Example 1: LZ78 Decompression Decode (i.e., decompress) the sequence (0, A) (0, B) (2, C) (3, A) (2, A) (4, A) (6, B) The decompressed message is: ABBCBCABABCAABCAAB 27
Example 2: LZ78 Decompression Decode (i.e., decompress) the sequence (0, B) (0, A) (1, A) (2, B) (0, R) (5, R) (2, ) The decompressed message is: BABAABRRRA 28
Example 3: LZ78 Decompression Decode (i.e., decompress) the sequence (0, A) (1, A) (2, A) (3, ) The decompressed message is: AAAAAAAAA 29
Exercises 1. Use LZ78 to trace encoding the string SATATASACITASA. 2. Write a MATLAB program that encodes a given string using LZ78. 3. Write a MATLAB program that decodes a given set of encoded codewords using LZ78. 30
LZ77: Sliding Window Lempel-Ziv (o)ffset= search_ptr – match_ptr = 7 (l)ength= number of consecutive letters matched = 4 (c)odeword(r) Encoding Notes <o, l, c> = <7, 4, C(‘r’)> |search buff| = S, S + |LA buff| = |W(indow)| Page31 31
LZ77: Example (0,0,a) (1,1,c) (3,3,a) (0,0,b) (3,3,a) (1,1,a) |LA buf| a c b (1,1,c) a c b (3,3,a) |search buff| a c b (0,0,b) O = 3 a c b (3,3,a) | search buff | | LA buff | O=1 a c b (1,1,a) (S)earch Buffer (size = 6) Page32 l (Length of matched Letters) Next character 32
LZ77 Decoding Decoder keeps the same dictionary window as encoder. For each message it looks it up in the dictionary and inserts a copy (which one at the initialization?) What if l > o? (only part of the message is in the dictionary.) E.g. input = abcd, codeword = (2,9,e) Simply copy starting at the cursor for (i = 0; i < length; i++) out[cursor+i] = out[cursor-offset+i] Out = abcdcdcdcdcdce Page33 33
Comparison to Lempel-Ziv 78 Both LZ77 and LZ78 and their variants keep a “dictionary” of recent strings that have been seen. The differences are: How the dictionary is stored (LZ78 is a table! However, LZ77 is decoded online) How it is extended (both extends an existing entry by one character) How it is indexed (LZ78 indexes the nodes of the table) Page34 34