Gzip Compression and Decompression

1 Gzip Compression and Decompression
1. Gzip file format
2. Gzip compression algorithm
   - LZ77 algorithm
   - Dynamic Huffman coding algorithm
3. Gzip decompression algorithm
4. Other methods of data compression and open questions

2 Gzip file format
1. A gzip file consists of a series of "members". The members simply appear one after another in the file, with no additional information before, between, or after them.
2. Member format. Each member has the following format:

   +---+---+---+---+---+---+---+---+---+---+
   |ID1|ID2|CM |FLG|     MTIME     |XFL|OS | (more->)
   +---+---+---+---+---+---+---+---+---+---+

   if FLG.FEXTRA set

   +---+---+=================================+
   | XLEN  |...XLEN bytes of "extra field"...| (more->)
   +---+---+=================================+

3
   if FLG.FNAME set

   +=========================================+
   |...original file name, zero-terminated...| (more->)
   +=========================================+

   if FLG.FCOMMENT set

   +===================================+
   |...file comment, zero-terminated...| (more->)
   +===================================+

   if FLG.FHCRC set

   +---+---+
   | CRC16 |
   +---+---+

   +=======================+
   |...compressed blocks...| (more->)
   +=======================+

4
   +---+---+---+---+---+---+---+---+
   |     CRC32     |     ISIZE     |
   +---+---+---+---+---+---+---+---+

ID1 = 31, ID2 = 139: these identify the file as being in gzip format.
CM (compression method): identifies the compression method used in the file. CM = 0-7 are reserved. CM = 8 denotes the "deflate" compression method, which is the one customarily used by gzip and which is documented elsewhere.
FLG (flags):
   bit 0 FTEXT      bit 1 FHCRC
   bit 2 FEXTRA     bit 3 FNAME
   bit 4 FCOMMENT   others reserved
CRC32: CRC-32 of the uncompressed data.
ISIZE: size of the original (uncompressed) data mod 2^32.
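The member layout above can be walked through in a few lines of Python. This is a minimal sketch, not a full reader: the function name `parse_gzip_member` and the keys of the returned dictionary are made up for illustration, and it assumes the buffer holds exactly one member, so that the CRC32 and size fields are its last 8 bytes. All multi-byte fields are little-endian.

```python
import gzip
import struct
import zlib

# FLG bit masks, in bit order 0..4
FTEXT, FHCRC, FEXTRA, FNAME, FCOMMENT = 1, 2, 4, 8, 16

def parse_gzip_member(buf):
    """Parse the header and trailer of a buffer holding ONE gzip member."""
    # Fixed 10-byte header: ID1 ID2 CM FLG MTIME(4 bytes) XFL OS
    id1, id2, cm, flg, mtime, xfl, os_code = struct.unpack_from("<BBBBIBB", buf, 0)
    if (id1, id2) != (31, 139):
        raise ValueError("not a gzip member")
    info = {"cm": cm, "flg": flg, "mtime": mtime, "xfl": xfl, "os": os_code}
    pos = 10
    if flg & FEXTRA:                      # XLEN, then XLEN bytes of extra field
        (xlen,) = struct.unpack_from("<H", buf, pos)
        pos += 2 + xlen
    if flg & FNAME:                       # zero-terminated original file name
        end = buf.index(b"\x00", pos)
        info["name"] = buf[pos:end]
        pos = end + 1
    if flg & FCOMMENT:                    # zero-terminated file comment
        end = buf.index(b"\x00", pos)
        info["comment"] = buf[pos:end]
        pos = end + 1
    if flg & FHCRC:                       # 2-byte header CRC16
        pos += 2
    info["data_offset"] = pos             # where the compressed blocks start
    # Trailer: CRC32 and uncompressed size mod 2^32 (single-member assumption)
    info["crc32"], info["isize"] = struct.unpack_from("<II", buf, len(buf) - 8)
    return info

member = gzip.compress(b"hello")
print(parse_gzip_member(member)["isize"])  # 5, the uncompressed length
```

The optional fields are checked in the order in which they appear in the file (extra field, name, comment, header CRC), which is not the same as the numeric order of the flag bits.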

5 2. Gzip compression algorithm
Introduction
Gzip combines the LZ77 algorithm and dynamic Huffman coding to compress data. It first compresses the data with LZ77, then compresses the result with dynamic Huffman coding.
2.1 LZ77 compression algorithm
Terms used in the algorithm:
- input stream: the sequence of characters to be compressed.
- character: the basic element of the input stream.
- coding position: the position in the input stream currently being coded (the beginning of the lookahead buffer).
- lookahead buffer: the character sequence from the coding position to the end of the input stream.

6
- window: of size w; contains the w characters before the coding position, i.e. the last w characters processed.
- pointer: points to the match in the window and also specifies its length.
The principle of encoding
The algorithm searches the window for the longest match with the beginning of the lookahead buffer and outputs a pointer to that match. When a match is found, the pair (offset, length) takes the place of the matched text.
Offset: the offset of the beginning of the match from the window's left bound. (Other descriptions of LZ77, and the deflate format itself, measure the distance backwards from the coding position instead; the example that follows counts from the left bound.)
Length: the length of the match.
The encoding algorithm

7
step1: set the coding position to the beginning of the input stream.
step2: if the coding position is not at the end of the input stream, search the window for the longest match with the lookahead buffer; otherwise the algorithm terminates.
step3: if a match is found, output (off, length, c), where c is the character following the match; move the coding position and the window length+1 bytes forward and go to step2. Otherwise go to step4.
step4: output the current character at the coding position; move the coding position and the window 1 byte forward; go to step2.
The following example illustrates the algorithm. Assume the window size is 10, the window contains "abcdbbccaa", and the string to be coded is "abaeaaabaee". The encoding steps are:

8
step1: the longest match between the string and the window is "ab" (offset 0), so output (0,2,a); the window and the coding position then move 3 bytes forward.
step2: the character at the current coding position is 'e'. The window now contains "dbbccaaaba", which has no match for 'e', so output 'e'. The window and the coding position move 1 byte forward.
step3: the window contains "bbccaaabae" and the lookahead buffer is "aaabaee". The longest match is "aaabae", at offset 4 with length 6, followed by 'e'. Output (4,6,e).
Many other details need to be considered in practice; refer to the gzip source code and documentation.
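The four encoding steps and the worked example can be reproduced with a short Python sketch. It is an illustration of the simplified algorithm on these slides, not of gzip's real matcher: the function names, the token representation (a tuple for a match, a bare character for a literal) and the choice to preload the window are assumptions. Offsets are counted from the window's left bound, matches are searched only inside the window, and a match is never allowed to swallow the last character, so that a following character c always exists.

```python
def lz77_encode(data, window="", window_size=10):
    """Encode data into (off, length, c) triples and literal characters."""
    out = []
    history = window[-window_size:] if window else ""
    pos = 0
    while pos < len(data):                       # step 2: stop at end of input
        lookahead = data[pos:]
        max_len = len(lookahead) - 1             # keep one character for c
        best_off, best_len = 0, 0
        for off in range(len(history)):          # try every start in the window
            length = 0
            while (length < max_len
                   and off + length < len(history)
                   and history[off + length] == lookahead[length]):
                length += 1
            if length > best_len:                # first longest match wins
                best_off, best_len = off, length
        if best_len > 0:                         # step 3: emit a pointer
            out.append((best_off, best_len, lookahead[best_len]))
            step = best_len + 1
        else:                                    # step 4: emit a literal
            out.append(lookahead[0])
            step = 1
        history = (history + data[pos:pos + step])[-window_size:]
        pos += step
    return out

def lz77_decode(tokens, window="", window_size=10):
    """Inverse of lz77_encode, given the same initial window."""
    history = window[-window_size:] if window else ""
    out = []
    for tok in tokens:
        if isinstance(tok, tuple):
            off, length, c = tok
            text = history[off:off + length] + c
        else:
            text = tok
        out.append(text)
        history = (history + text)[-window_size:]
    return "".join(out)

print(lz77_encode("abaeaaabaee", window="abcdbbccaa"))
# [(0, 2, 'a'), 'e', (4, 6, 'e')]
```

The decoder only needs the same starting window and window size, because it rebuilds the window from its own output as it goes.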

9 Dynamic Huffman Coding
Static Huffman coding: given a set of characters and their frequencies, the Huffman algorithm assigns a code to each character.
Dynamic Huffman coding builds the Huffman tree incrementally while the data is processed; the characters and their frequencies are not known in advance.
Note that the "dynamic Huffman" blocks of the deflate format work differently: the encoder derives a Huffman code from the symbol frequencies of each block and stores the code lengths at the start of that block. The procedure shown here updates a single tree after every character instead.
The following example introduces the dynamic Huffman algorithm. String: TENNESSEE
While the tree is being built, one rule must be obeyed: maintain the sibling property. A tree has the sibling property if each node (except the root) has a sibling and the nodes can be numbered in order of nondecreasing weight with each node adjacent to its sibling. Moreover, the parent of a node is higher in the numbering.
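For comparison with the dynamic procedure, the static case described on this slide can be sketched in Python. This is a generic textbook construction using a heap, not code taken from gzip; the function name `huffman_codes` is hypothetical, and the input is assumed to be non-empty. The exact bit patterns depend on how ties between equal weights are broken, so only the code lengths are meaningful.

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build a static Huffman code for the characters of a non-empty text."""
    # Heap entries are (weight, tie_breaker, node); a node is either a
    # character (leaf) or a (left, right) tuple (internal node).
    heap = [(w, i, ch) for i, (ch, w) in enumerate(sorted(Counter(text).items()))]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        # Repeatedly merge the two lightest nodes.
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next_id, (left, right)))
        next_id += 1
    codes = {}

    def walk(node, prefix):
        if isinstance(node, tuple):
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            codes[node] = prefix or "0"   # single-symbol input gets code "0"

    walk(heap[0][2], "")
    return codes

codes = huffman_codes("TENNESSEE")
print(sum(len(codes[ch]) for ch in "TENNESSEE"))  # 17 bits in total
```

With the frequencies E=4, N=2, S=2, T=1 the most frequent character E receives a 1-bit code, which is also where E ends up (directly under the root) once the dynamic construction has seen the whole string.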

10 T Stage 1 (First occurrence of t)

       r 9
      /   \
   7 0     t(1) 8

Order: 0,t(1)
* r represents the root
* 0 represents the null node
* t(1) denotes the occurrence of T with a frequency of 1
* the number written beside each node (7, 8, 9) is its node number

11 TE Stage 2 (First occurrence of e)

          r 9
         /   \
      7 1     t(1) 8
     /   \
  5 0     e(1) 6

Order: 0,e(1),1,t(1)

12 TEN Stage 3 (First occurrence of n)

             r 9
            /   \
         7 2     t(1) 8
        /   \
     5 1     e(1) 6
    /   \
 3 0     n(1) 4

Order: 0,n(1),1,e(1),2,t(1)
This is not a Huffman tree; it must be adjusted into one.

13 Reorder: TEN

        r 9
       /   \
  7 t(1)    2 8
           /   \
        5 1     e(1) 6
       /   \
    3 0     n(1) 4

Order: 0,n(1),1,e(1),t(1),2

14 TENN Stage 4 (Repetition of n)

        r 9
       /   \
  7 t(1)    3 8
           /   \
        5 2     e(1) 6
       /   \
    3 0     n(2) 4

Order: 0,n(2),2,e(1),t(1),3
The sibling property is no longer valid, so the tree must be rebuilt. Swap the updated node with the node whose number is the biggest in its block.
Block: a set of nodes whose weights are the same.
To maintain the sibling property, swap node n with node t. If a swapped node has a subtree, the subtree moves with it.
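The check and the block rule on this slide can be expressed over the "Order" lists alone. The sketch below is only an illustration and not a full adaptive Huffman implementation: it represents a tree by the list of node weights in numbering order, and both function names are hypothetical. The three lists are the orders shown for the reordered TEN tree, for TENN before the swap, and for TENN after the swap.

```python
def has_sibling_property(weights):
    """True if the weights, listed in node-number order, never decrease."""
    return all(a <= b for a, b in zip(weights, weights[1:]))

def block_leader(weights, i):
    """Index of the highest-numbered node with the same weight as node i.

    Assumes the weights are currently in nondecreasing order, so a block
    is a run of equal values.
    """
    j = i
    while j + 1 < len(weights) and weights[j + 1] == weights[i]:
        j += 1
    return j

# Order 0,n(1),1,e(1),t(1),2 : the tree before the second N arrives
before = [0, 1, 1, 1, 1, 2]
# Order 0,n(2),2,e(1),t(1),3 : weights incremented without swapping
naive = [0, 2, 2, 1, 1, 3]
# Order 0,t(1),1,e(1),n(2),2 : n swapped with t, then weights updated
swapped = [0, 1, 1, 1, 2, 2]

print(block_leader(before, 1))        # 4, the position of t(1)
print(has_sibling_property(naive))    # False
print(has_sibling_property(swapped))  # True
```

Node n sits at index 1 of the first list and the leader of its block is index 4, which is t(1); that is why n and t are the pair that gets swapped.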

15 Reorder: TENN

        r 9
       /   \
  7 n(2)    2 8
           /   \
        5 1     e(1) 6
       /   \
    3 0     t(1) 4

Order: 0,t(1),1,e(1),n(2),2
t(1) and n(2) are swapped.

16 TENNE Stage 5 (Repetition of e)

        r 9
       /   \
  7 n(2)    3 8
           /   \
        5 1     e(2) 6
       /   \
    3 0     t(1) 4

Order: 0,t(1),1,e(2),n(2),3

17 TENNES Stage 6 (First occurrence of s)

            r 9
           /   \
      7 n(2)    4 8
               /   \
            5 2     e(2) 6
           /   \
        3 1     t(1) 4
       /   \
    1 0     s(1) 2

Order: 0,s(1),1,t(1),2,e(2),n(2),4

18 TENNESS Stage 7 (Repetition of s)

            r 9
           /   \
      7 n(2)    5 8
               /   \
            5 3     e(2) 6
           /   \
        3 2     t(1) 4
       /   \
    1 0     s(2) 2

Order: 0,s(2),2,t(1),3,e(2),n(2),5
The sibling property is not valid. Adjust the tree to maintain the sibling property.

19 Reorder: TENNESS

                 r 9
               /     \
          7 3           4 8
         /   \         /   \
      3 1   s(2) 4  5 n(2)  e(2) 6
     /   \
  1 0   t(1) 2

s(2) and t(1) are swapped. e(2) and node 3 also need to be swapped.

20 TENNESSE Stage 8 (Second repetition of e)

                 r 9
               /     \
          7 3           5 8
         /   \         /   \
      3 1   s(2) 4  5 n(2)  e(3) 6
     /   \
  1 0   t(1) 2

Order: 0,t(1),1,s(2),e(3),3,n(2),6

21 TENNESSEE (fourth e added, before reordering)

                 r 9
               /     \
          7 3           6 8
         /   \         /   \
      3 1   s(2) 4  5 n(2)  e(4) 6
     /   \
  1 0   t(1) 2

The sibling property is not valid (e(4) now outweighs node 7, whose weight is 3), so the Huffman tree needs to be rebuilt.

22 TENNESSEE Stage 9 (Third repetition of e), after reordering

        r 9
       /   \
  7 e(4)    5 8
           /   \
      5 n(2)    3 6
               /   \
            3 1     s(2) 4
           /   \
        1 0     t(1) 2

Adaptive Huffman decoding is the inverse procedure of encoding.

