LZ77 Compression

Introduced by Abraham Lempel and Jacob Ziv in 1977.
Widely used today:
- Zip deflate algorithm (LZ77 + Huffman coding)
- LZMA, used in xz and 7-Zip (LZ77 + arithmetic coding)
Compresses by eliminating repeated byte sequences in the input file.
Aside: Lempel and Ziv wrote another paper the next year; LZ78 (in its LZW variant) is used by GIF, Unix compress, and the zip shrink algorithm.

Decompression is Simple
    $out .= substr($out, -$distance, $length);

Reach back $distance, copy $length bytes. That's it!
In LZ77-based systems, $distance usually has an upper limit, so this is called a “sliding window” or “sliding dictionary.” In Compress::Zlib, this limit is set by the WindowBits parameter.

Decompressor Skeleton
A simple program might look like this:

    $out = '';
    while (...) {
        if (...) {
            $out .= chr($literal);
        }
        else {
            $out .= substr($out, -$dist, $len);
        }
    }
    print $out;

We'll see how to inflate zipped data, but first...

Compression Example
Message: 'The rain in Spain stays mainly in the plain'

    Type      Distance  Length  Text
    Literal                     'The rain '
    Pointer    3         3      'in '
    Literal                     'Sp'
    Pointer    9         4      'ain '
    Literal                     'stays m'
    Pointer   11         3      'ain'
    Literal                     'ly'
    Pointer   22         4      ' in '
    Literal                     't'
    Pointer   34         3      'he '
    Literal                     'pl'
    Pointer   15         3      'ain'
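The token list above can be replayed with the one-line copy rule from the earlier slide. This is a minimal sketch: the token representation (plain strings for literals, [distance, length] array refs for pointers) and the name replay_tokens are assumptions made for illustration, not part of the talk.

```perl
use strict;
use warnings;

# Tokens from the example: a literal string, or a [distance, length] pointer.
my @tokens = (
    'The rain ', [3, 3], 'Sp', [9, 4], 'stays m', [11, 3],
    'ly', [22, 4], 't', [34, 3], 'pl', [15, 3],
);

# Rebuild the message by appending literals and copying back-references.
sub replay_tokens {
    my @toks = @_;
    my $out = '';
    for my $tok (@toks) {
        if (ref $tok) {
            my ($dist, $len) = @$tok;
            $out .= substr($out, -$dist, $len);   # reach back, copy
        }
        else {
            $out .= $tok;                         # literal text
        }
    }
    return $out;
}

print replay_tokens(@tokens), "\n";
```

None of these pointers has a length greater than its distance, so the simple substr copy is enough here.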

Implementation Trick: Ring Buffer
$out .= is a bit inefficient; the entire contents of $out may need to be copied to a new memory location. A ring buffer avoids this: add to the head until it reaches the tail, then write to the output file and advance the tail.
[Figure: ring buffer, with the used region running from Tail to Head and the rest unused]
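One way such a ring buffer might look is sketched below. Everything here is hypothetical (the names ring_new, ring_put, ring_copy and ring_flush, the field names, and the choice to flush only when the buffer is full); the talk does not show its own implementation. The tail is tracked implicitly as head minus the number of unflushed bytes.

```perl
use strict;
use warnings;

# Hypothetical fixed-size ring buffer that writes to a filehandle when full.
sub ring_new {
    my ($size, $fh) = @_;
    return { size => $size, buf => "\0" x $size, head => 0, pending => 0, fh => $fh };
}

# Write the unflushed bytes (tail .. head, possibly wrapping) to the output.
sub ring_flush {
    my ($ring) = @_;
    my ($size, $n) = ($ring->{size}, $ring->{pending});
    my $tail  = ($ring->{head} - $n) % $size;   # Perl's % is non-negative here
    my $first = $size - $tail;                  # bytes available before the wrap
    $first = $n if $first > $n;
    print { $ring->{fh} } substr($ring->{buf}, $tail, $first),
                          substr($ring->{buf}, 0, $n - $first);
    $ring->{pending} = 0;
}

# Store one byte at the head; flush once the head has caught up with the tail.
sub ring_put {
    my ($ring, $byte) = @_;
    substr($ring->{buf}, $ring->{head}, 1) = $byte;
    $ring->{head} = ($ring->{head} + 1) % $ring->{size};
    ring_flush($ring) if ++$ring->{pending} == $ring->{size};
}

# Copy $len bytes starting $dist bytes back, one byte at a time.
sub ring_copy {
    my ($ring, $dist, $len) = @_;
    die "distance exceeds window\n" if $dist > $ring->{size};
    for (1 .. $len) {
        my $src = ($ring->{head} - $dist) % $ring->{size};
        ring_put($ring, substr($ring->{buf}, $src, 1));
    }
}

open my $fh, '>', \my $written or die "cannot open in-memory file: $!";
my $ring = ring_new(8, $fh);
ring_put($ring, $_) for split //, 'abc';
ring_copy($ring, 3, 9);
ring_flush($ring);
close $fh;
print "$written\n";
```

Copying byte by byte is slower than substr but handles overlapping copies without a special case.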

Preliminaries: Bitstream
In the Huffman talk, I used strings of 0s and 1s. That's easy, but uses 8 times the required amount of memory. The following functions are more like what you'd see in a “real” implementation.
Using “bit-bashing” operators instead of substr to get at the bits: | (or), & (and), << (left shift), >> (right shift)
We'll keep some useful information in here:

    open $IN, '<', $filename or die;
    $inf = { file=>$IN };

Get a Bit
Read one byte (8 bits) at a time. Remember the other bits for the next call to get_bit.

    sub get_bit {
        my ($inf) = @_;
        unless ($inf->{bits}) {
            my $c = getc($inf->{file});
            die 'unexpected end of file' unless defined $c;
            $inf->{byte} = ord $c;
            $inf->{bits} = 8;
        }
        my $bit = $inf->{byte} & 1;
        $inf->{byte} >>= 1;
        $inf->{bits}--;
        return $bit;
    }

Get Several Bits

    sub get_bits {
        my ($inf, $nbits) = @_;
        my $bits = 0;
        for my $i (0 .. $nbits-1) {
            $bits |= get_bit($inf) << $i;
        }
        return $bits;
    }

This works, but it breaks the input into individual bits, then puts it back together again. Unnecessary work!

Several Bits Take Two
Read 8 bits at a time, then peel off the required number of bits.

    sub get_bits {
        my ($inf, $nbits) = @_;
        while ($inf->{bits} < $nbits) {
            my $c = getc($inf->{file});
            die 'unexpected end of file' unless defined $c;
            $inf->{byte} |= ord($c) << $inf->{bits};
            $inf->{bits} += 8;
        }
        my $mask = (1 << $nbits) - 1;
        my $bits = $inf->{byte} & $mask;
        $inf->{byte} >>= $nbits;
        $inf->{bits} -= $nbits;
        return $bits;
    }

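Mask Construction

    $mask = (1 << $nbits) - 1;

Builds a mask of consecutive 1-bits. For $nbits = 5 (low 8 bits shown):

    1               00000001
    1 << 5          00100000
    (1 << 5) - 1    00011111

Alternatively, using ~ (bitwise not):

    ~0              11111111
    ~0 << 5         11100000
    ~(~0 << 5)      00011111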
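Both mask constructions can be compared directly. A small sketch, with mask_shift and mask_not as illustrative names that do not appear in the talk:

```perl
use strict;
use warnings;

# Two ways to build a mask of $n consecutive low 1-bits.
sub mask_shift { my ($n) = @_; return (1 << $n) - 1; }
sub mask_not   { my ($n) = @_; return ~(~0 << $n); }

for my $n (0, 1, 5, 12) {
    printf "%2d bits: %012b %012b\n", $n, mask_shift($n), mask_not($n);
}
```

The two columns should agree for every width used by the decompressor.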

Bit-bashing Headache?
Example: want 12 bits, already have 3.

    old bits              3 bits
    1st byte << 3         8 more bits (11 so far)
    2nd byte << 11        8 more bits (19 bits in total)
    & mask (12 ones)      the 12 bits we wanted
    >> 12                 7 bits left over
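The same situation can be reproduced on a three-byte input. The sketch below reuses the second version of get_bits; the sample bytes and the in-memory filehandle are assumptions chosen for illustration. Reading 5 bits first leaves 3 bits buffered, so the following 12-bit read has to pull in two more bytes.

```perl
use strict;
use warnings;

# get_bits as in "Several Bits Take Two": refill a byte at a time, then mask.
sub get_bits {
    my ($inf, $nbits) = @_;
    while ($inf->{bits} < $nbits) {
        my $c = getc($inf->{file});
        die 'unexpected end of file' unless defined $c;
        $inf->{byte} |= ord($c) << $inf->{bits};
        $inf->{bits} += 8;
    }
    my $bits = $inf->{byte} & ((1 << $nbits) - 1);
    $inf->{byte} >>= $nbits;
    $inf->{bits} -= $nbits;
    return $bits;
}

# Sample input (arbitrary bytes), read through an in-memory filehandle.
my $data = "\xB5\xCA\x53";
open my $fh, '<', \$data or die "cannot open in-memory file: $!";
my $inf = { file => $fh, byte => 0, bits => 0 };

my $first  = get_bits($inf, 5);    # leaves 3 bits of the first byte buffered
my $second = get_bits($inf, 12);   # 3 old bits + 8 + 8 = 19, take 12
printf "first=%d second=%d left over: %d bits\n", $first, $second, $inf->{bits};
```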

Inflate Stream
A deflated data stream contains a series of blocks.

    sub inflate {
        my ($IN, $OUT) = @_;
        my $inf = { file=>$IN, out=>'' };
        my $last;
        do {
            $last = get_bit($inf);
            my $mode = get_bits($inf, 2);
            if ($mode == 0) {
                uncompressed_block($inf);
            }
            elsif ($mode == 1) {
                our $fixed_tree ||= rebuild_huff_tree(
                    [(8) x 144, (9) x 112, (7) x 24, (8) x 8]);
                our $fixed_dist ||= rebuild_huff_tree([(5) x 32]);
                inflate_block($inf, $fixed_tree, $fixed_dist);
            }
            elsif ($mode == 2) {
                my ($huff_tree, $huff_dist) = read_huff_trees($inf);
                inflate_block($inf, $huff_tree, $huff_dist);
            }
            else {
                die 'invalid mode';
            }
        } until $last;
        print $OUT $inf->{out};
    }
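The three header bits that inflate reads for each block can be picked out of a byte by hand. This sketch assumes the header starts at the lowest bit of the byte, which holds for the first block of a stream; block_header and the sample bytes are illustrative only.

```perl
use strict;
use warnings;

# Split the first three bits of a block into (last-block flag, mode).
sub block_header {
    my ($byte) = @_;
    my $last = $byte & 1;          # first bit read: is this the final block?
    my $mode = ($byte >> 1) & 3;   # next two bits: 0 stored, 1 fixed, 2 dynamic
    return ($last, $mode);
}

my @names = ('uncompressed', 'fixed Huffman', 'dynamic Huffman', 'invalid');
for my $byte (0x01, 0x0B, 0x05) {
    my ($last, $mode) = block_header($byte);
    printf "0x%02X: last=%d mode=%d (%s)\n", $byte, $last, $mode, $names[$mode];
}
```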

Inflate Block
Similar to the decompressor skeleton, plus a stop code.

    sub inflate_block {
        my ($inf, $huff_tree, $huff_dist) = @_;
        while (1) {
            my $code = read_huff($inf, $huff_tree);
            if ($code < 256) {
                $inf->{out} .= chr($code);
            }
            elsif ($code == 256) {
                last;
            }
            else {
                # compute $len and $dist from $code: see next slide
                $inf->{out} .= substr($inf->{out}, -$dist, $len);
            }
        }
    }

Length and Distance

    $code -= 257;
    my $len = $len[$code] + get_bits($inf, $len_bits[$code]);
    $code = read_huff($inf, $huff_dist);
    my $dist = $dist[$code] + get_bits($inf, $dist_bits[$code]);

Use a set of tables to get $len and $dist from $code.

    my @len = (
        3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 15, 17, 19, 23, 27, 31,
        35, 43, 51, 59, 67, 83, 99, 115, 131, 163, 195, 227, 258 );
    my @len_bits = (
        0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2,
        3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 0 );
    my @dist = (
        1, 2, 3, 4, 5, 7, 9, 13, 17, 25, 33, 49, 65, 97, 129, 193,
        257, 385, 513, 769, 1025, 1537, 2049, 3073, 4097, 6145,
        8193, 12289, 16385, 24577 );
    my @dist_bits = (
        0, 0, 0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6,
        7, 7, 8, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13, 13 );
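To see what the tables encode, the sketch below lists the range of lengths or distances that a single code can stand for once its extra bits are added. The table values follow the deflate specification (30 distance entries, ending in 24577); the helper names len_range and dist_range are assumptions for illustration.

```perl
use strict;
use warnings;

my @len = (
    3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 15, 17, 19, 23, 27, 31,
    35, 43, 51, 59, 67, 83, 99, 115, 131, 163, 195, 227, 258 );
my @len_bits = (
    0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2,
    3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 0 );
my @dist = (
    1, 2, 3, 4, 5, 7, 9, 13, 17, 25, 33, 49, 65, 97, 129, 193,
    257, 385, 513, 769, 1025, 1537, 2049, 3073, 4097, 6145,
    8193, 12289, 16385, 24577 );
my @dist_bits = (
    0, 0, 0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6,
    7, 7, 8, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13, 13 );

# Smallest and largest match length a literal/length code (257..285) covers.
sub len_range {
    my ($code) = @_;
    my $i = $code - 257;
    return ($len[$i], $len[$i] + (1 << $len_bits[$i]) - 1);
}

# Smallest and largest distance a distance code (0..29) covers.
sub dist_range {
    my ($code) = @_;
    return ($dist[$code], $dist[$code] + (1 << $dist_bits[$code]) - 1);
}

printf "length code 266: %d..%d\n", len_range(266);
printf "distance code 9: %d..%d\n", dist_range(9);
```

The largest distance code reaches 32768, which matches the 32 KB sliding window of deflate.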

Be Careful
It's possible for $len to be greater than $dist. For example, if $dist is 1, repeat the previous byte $len times. To handle this correctly, add a loop:

    while ($len > $dist) {
        $inf->{out} .= substr($inf->{out}, -$dist);
        $len -= $dist;
    }
    $inf->{out} .= substr($inf->{out}, -$dist, $len);
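The effect of the loop is easiest to see on a tiny string. In this sketch copy_back is a hypothetical wrapper around the loop above that works on a plain string instead of $inf->{out}; the naive single substr is shown alongside for comparison.

```perl
use strict;
use warnings;

# Append $len bytes copied from $dist bytes back, allowing $len > $dist.
sub copy_back {
    my ($out, $dist, $len) = @_;
    die "distance reaches before start of output\n" if $dist > length $out;
    while ($len > $dist) {
        $out .= substr($out, -$dist);      # repeat the whole tail
        $len -= $dist;
    }
    $out .= substr($out, -$dist, $len);    # remaining partial copy
    return $out;
}

my $naive = 'x' . substr('x', -1, 4);      # substr stops at the end of the string
print "naive:  $naive\n";
print "looped: ", copy_back('x', 1, 4), "\n";
print "looped: ", copy_back('ab', 2, 5), "\n";
```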

Make Some Test Data

    use IO::Compress::RawDeflate qw( rawdeflate );
    rawdeflate($ARGV[0], $ARGV[1]);

    > cat spain.txt
    The rain in Spain stays mainly in the plain
    > perl deflate spain.txt spain.defl
    > ls -l spain.txt spain.defl
    -rw-r--r-- 1 bob bob 36 Aug 12 07:30 spain.defl
    -rw-r--r-- 1 bob bob 44 Aug 12 07:29 spain.txt

Saved 8 bytes... not bad for such a short file.

Inflate Test

    open my $IN, '<', $ARGV[0] or die;
    open my $OUT, '>', $ARGV[1] or die;
    binmode $IN;
    binmode $OUT;
    inflate($IN, $OUT);

    > perl inflate spain.defl spain.out
    > cat spain.out
    The rain in Spain stays mainly in the plain

That's the expected output.

Bigger Test File

    > perl deflate ch1.txt ch1.defl
    > ls -l ch1.txt ch1.defl
    -rw-r--r-- 1 bob bob Aug 12 07:33 ch1.defl
    -rw-r--r-- 1 bob bob Aug 12 07:32 ch1.txt

File size reduced by 52%.

    > perl inflate ch1.defl ch1.out
    > diff ch1.txt ch1.out
    >

No differences--good!

Checkpoint
That's the basics of LZ77 decompression. But we left some pieces out.

Uncompressed Block
Just copy some bytes from input to output.

    sub uncompressed_block {
        my ($inf) = @_;
        $inf->{bits} = 0;   # discard partial byte
        $inf->{byte} = 0;   # also clear the stale bits, since get_bits ORs new bytes in
        my $len = get_bits($inf, 16);
        my $check = get_bits($inf, 16);
        die 'bad length' unless $check == ($len ^ 0xffff);
        my $count = read $inf->{file}, (my $buf), $len;
        die 'incomplete block' unless $count == $len;
        $inf->{out} .= $buf;
    }
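The length check can be tried on a hand-built block. This sketch works on a string rather than the bit reader, so stored_payload is a hypothetical stand-in; it assumes the block has already been aligned to a byte boundary. Two 16-bit reads through get_bits correspond to two little-endian values, hence the 'v v' template.

```perl
use strict;
use warnings;

# Parse the length header of a stored block held in a string and return its payload.
sub stored_payload {
    my ($block) = @_;
    my ($len, $check) = unpack 'v v', $block;   # two little-endian 16-bit values
    die "bad length\n" unless $check == ($len ^ 0xffff);
    die "incomplete block\n" unless length($block) >= 4 + $len;
    return substr($block, 4, $len);
}

# Build a block: length 5, its one's complement, then the 5 payload bytes.
my $block = pack('v v', 5, 5 ^ 0xffff) . 'hello';
print stored_payload($block), "\n";
```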

Huffman Tree Example
Huffman tree from the previous talk. Each code is a path from the root of the tree to a leaf.

    a  00
    e  01
    i  110
    o  10
    u  1110
    y  1111

[Figure: the tree, with leaves a, e, o, i, u, y]

Huffman Code Review
The following functions are from the Huffman talk, with a few changes.
Read a Huffman code from the data stream:

    sub read_huff {
        my ($inf, $tree) = @_;
        while (ref $tree) {
            $tree = $tree->[get_bit($inf)];
        }
        return $tree;
    }
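The tree walk can be tried on the vowel codes from the example slide. In this sketch the tree is written out by hand as nested two-element arrays, and the bits come from a string rather than from get_bit; read_symbol and decode_bits are illustrative names, not functions from the talk.

```perl
use strict;
use warnings;

# Vowel tree: index 0 or 1 at each level; a non-reference is a leaf.
# Codes: a 00, e 01, o 10, i 110, u 1110, y 1111
my $tree = [ ['a', 'e'], ['o', ['i', ['u', 'y']]] ];

# Follow one bit per level until a leaf is reached.
sub read_symbol {
    my ($bits, $node) = @_;
    while (ref $node) {
        die "ran out of bits\n" unless @$bits;
        $node = $node->[ shift @$bits ];
    }
    return $node;
}

# Decode a whole string of 0s and 1s.
sub decode_bits {
    my ($string) = @_;
    my @bits = split //, $string;
    my $out = '';
    $out .= read_symbol(\@bits, $tree) while @bits;
    return $out;
}

print decode_bits('00110111101'), "\n";   # codes 00, 110, 1111, 01
```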

Another Trick: Tree Flattening

    Root table           Second table (reached via 11)
    00  a                0*  i
    01  e                10  u
    10  o                11  y
    11  (second table)

This tree uses two bits per transition instead of one, so fewer iterations are needed. The star * means that we have to give a bit back in that situation.
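A two-bit lookup of this kind might be sketched as follows. The table layout (hashes keyed by two-character bit strings, with each leaf recording how many bits it really uses) is an assumption made for illustration; the talk does not show its flattened representation. "Giving a bit back" becomes advancing the position by one instead of two.

```perl
use strict;
use warnings;

# Each leaf is [symbol, bits actually consumed]; a hash ref is a further table.
# The 0* entry for i appears twice because only its first bit belongs to the code.
my %second = ('00' => ['i', 1], '01' => ['i', 1], '10' => ['u', 2], '11' => ['y', 2]);
my %root   = ('00' => ['a', 2], '01' => ['e', 2], '10' => ['o', 2], '11' => \%second);

sub decode_flat {
    my ($bits) = @_;
    my $padded = $bits . '0';          # so a final 0* entry can still peek two bits
    my ($pos, $out) = (0, '');
    while ($pos < length $bits) {
        my $table = \%root;
        while (1) {
            my $entry = $table->{ substr($padded, $pos, 2) };
            if (ref($entry) eq 'HASH') {   # interior entry: consume both bits, descend
                $pos += 2;
                $table = $entry;
                next;
            }
            my ($sym, $used) = @$entry;
            $pos += $used;                 # used == 1 gives the second bit back
            $out .= $sym;
            last;
        }
    }
    return $out;
}

print decode_flat('00110111101'), "\n";
```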

Rebuild Codes from Lengths
- Sort by length
- First code is all 0
- Count
- Zero-pad when length increases
- Last code is all 1

    a  00
    e  01
    o  10
    i  110
    u  1110
    y  1111

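Rebuild Huffman Codes

    sub rebuild_huff_codes {
        my ($len) = @_;
        # The name @order is reconstructed; the transcript lost the original.
        # The "or $a <=> $b" tie-break is added so equal lengths stay in symbol order.
        my @order = sort { $len->[$a] <=> $len->[$b] or $a <=> $b }
                    grep { $len->[$_] } 0 .. $#$len;
        my $prev_len = 0;
        my $code = 0;
        my @codes = (undef) x @$len;
        foreach my $i (@order) {
            $code <<= $len->[$i] - $prev_len;
            $codes[$i] = $code;
            $code++;
            $prev_len = $len->[$i];
        }
        die 'invalid lengths' unless $code == 1 << $prev_len;
        return \@codes;
    }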
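Applied to the vowel example, the procedure should give back the codes from the earlier slide. The sketch below is a compact variant that returns the codes as strings of 0s and 1s for easy reading; canonical_codes is an illustrative name, and the symbol order a, e, i, o, u, y is an assumption about how the symbols are indexed.

```perl
use strict;
use warnings;

# Assign codes from code lengths: sort by length, count up, shift when length grows.
sub canonical_codes {
    my ($len) = @_;
    my @order = sort { $len->[$a] <=> $len->[$b] or $a <=> $b }
                grep { $len->[$_] } 0 .. $#$len;
    my ($code, $prev_len) = (0, 0);
    my @codes;
    for my $i (@order) {
        $code <<= $len->[$i] - $prev_len;                   # zero-pad on longer codes
        $codes[$i] = sprintf '%0*b', $len->[$i], $code;     # binary, padded to length
        $code++;
        $prev_len = $len->[$i];
    }
    return \@codes;
}

my @symbols = qw(a e i o u y);
my $codes   = canonical_codes([2, 2, 3, 2, 4, 4]);
print "$symbols[$_] $codes->[$_]\n" for 0 .. $#symbols;
```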

Huffman Tree Rebuilding
Start with an empty tree...
… and add branches, one code at a time.
[Figures: the tree after inserting a, then i, then y (code 1111)]

Rebuild Huffman Tree

    sub rebuild_huff_tree {
        my ($len) = @_;
        my $codes = rebuild_huff_codes($len);
        my $tree = [undef, undef];
        for my $i (0 .. $#$len) {
            my $length = $len->[$i] or next;
            my $code = $codes->[$i];
            my $pos = $tree;
            for (my $j = $length-1; $j > 0; $j--) {
                my $bit = ($code >> $j) & 1;
                $pos->[$bit] ||= [undef, undef];
                $pos = $pos->[$bit];
            }
            $pos->[$code & 1] = $i;
        }
        return $tree;
    }
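The branch-adding step can be watched one code at a time. This sketch takes the codes as strings of 0s and 1s instead of integers, and prints the tree after each insertion; insert_code and show_tree are hypothetical helpers, and the insertion order is chosen only to mirror the slides.

```perl
use strict;
use warnings;

# Walk down the tree along all but the last bit, creating nodes as needed,
# then store the symbol in the slot selected by the last bit.
sub insert_code {
    my ($tree, $code, $sym) = @_;
    my @bits = split //, $code;
    my $last = pop @bits;
    my $pos  = $tree;
    for my $bit (@bits) {
        $pos->[$bit] ||= [undef, undef];
        $pos = $pos->[$bit];
    }
    $pos->[$last] = $sym;
}

# Render a tree as nested parentheses, with - for an empty slot.
sub show_tree {
    my ($node) = @_;
    return (defined($node) ? $node : '-') unless ref $node;
    return '(' . show_tree($node->[0]) . ' ' . show_tree($node->[1]) . ')';
}

my $tree  = [undef, undef];
my @steps = (['a', '00'], ['i', '110'], ['y', '1111'],
             ['e', '01'], ['o', '10'], ['u', '1110']);
for my $step (@steps) {
    my ($sym, $code) = @$step;
    insert_code($tree, $code, $sym);
    print "after $sym: ", show_tree($tree), "\n";
}
```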

Read Huffman Trees
The final piece of the puzzle: read the stored Huffman trees from the compressed datastream.

    sub read_huff_trees {
        my ($inf) = @_;
        my $nlen = get_bits($inf, 5) + 257;
        my $ndist = get_bits($inf, 5) + 1;
        my $len = read_huff_len($inf, $nlen + $ndist);
        # The name @dist_len is reconstructed; the transcript lost the original.
        my @dist_len = splice(@$len, $nlen);
        my $huff_tree = rebuild_huff_tree($len);
        my $huff_dist = rebuild_huff_tree(\@dist_len);
        return ($huff_tree, $huff_dist);
    }

Run-Length Encoding
If we peek in the code length array, we might see something like this:

    0, 0, 11, 0, 11, 10, 9, 8, 8, 8, 7, 7, 6, 6, 6, 6, 5, 5, 4, 5, 4, 4, 4, 4, 4, 4, 3, 4, 3, 4

There's a lot of repetition in there. We can shrink it by replacing runs of the same number:

    0, 0, 11, 0, 11, 10, 9, (8)x3, 7, 7, (6)x4, 5, 5, 4, 5, (4)x6, 3, 4, 3, 4

Zip has a clever way to do this, of course.

Read Huffman Code Lengths
Read the run-length encoded Huffman code lengths from the data stream.

    sub read_huff_len {
        my ($inf, $num) = @_;
        my $huff = read_huff_huff($inf);
        my @len = (0) x $num;
        my $i = 0;
        while ($i < $num) {
            my $code = read_huff($inf, $huff);
            if ($code < 16) {
                $len[$i] = $code;
                $i++;
            }
            elsif ($code == 16) {
                my $v = $len[$i-1];
                my $n = get_bits($inf,2)+3;
                for (1 .. $n) {
                    $len[$i] = $v;
                    $i++;
                }
            }
            elsif ($code == 17) {
                $i += get_bits($inf,3)+3;
            }
            elsif ($code == 18) {
                $i += get_bits($inf,7)+11;
            }
        }
        die 'invalid lengths' unless $i == $num;
        return \@len;
    }
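The repeat codes can be tried without a bit stream by feeding in symbols that have already been decoded. In this sketch each item is a [code, extra] pair, where extra is the value of the extra bits (0 if omitted); expand_lengths is a hypothetical helper, and the symbol list is one possible encoding of the code length array from the run-length slide.

```perl
use strict;
use warnings;

# Expand code-length symbols: 0..15 literal, 16 repeat previous 3..6 times,
# 17 run of 3..10 zeros, 18 run of 11..138 zeros.
sub expand_lengths {
    my @len;
    for my $pair (@_) {
        my ($code, $extra) = @$pair;
        $extra //= 0;
        if ($code < 16) {
            push @len, $code;
        }
        elsif ($code == 16) {
            die "no previous length\n" unless @len;
            push @len, ($len[-1]) x ($extra + 3);
        }
        elsif ($code == 17) {
            push @len, (0) x ($extra + 3);
        }
        elsif ($code == 18) {
            push @len, (0) x ($extra + 11);
        }
        else {
            die "invalid code $code\n";
        }
    }
    return @len;
}

my @symbols = (
    [0], [0], [11], [0], [11], [10], [9], [8], [8], [8], [7], [7],
    [6], [16, 0],          # 6 followed by three more 6s
    [5], [5], [4], [5],
    [4], [16, 2],          # 4 followed by five more 4s
    [3], [4], [3], [4],
);
print join(', ', expand_lengths(@symbols)), "\n";
```

A run of three 8s is written out literally here because code 16 repeats the previous length at least three more times.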

Self-Referentiality
This reads the Huffman tree that's used to compress the other Huffman trees.

    my @huff_order = (
        16, 17, 18, 0, 8, 7, 9, 6, 10, 5, 11, 4, 12, 3, 13, 2, 14, 1, 15 );

    sub read_huff_huff {
        my ($inf) = @_;
        my $num = get_bits($inf, 4) + 3;
        my @len = (0) x 19;
        for my $i (0 .. $num) {
            $len[$huff_order[$i]] = get_bits($inf, 3);
        }
        return rebuild_huff_tree(\@len);
    }

Conclusion
This completes a decompressor for zipped data in about 200 lines of Perl.
