Page 1: Algorithms in the Real World. Lempel-Ziv, Burrows-Wheeler, ACB

Page 2: Compression Outline
Introduction: Lossy vs. Lossless, Benchmarks, …
Information Theory: Entropy, etc.
Probability Coding: Huffman + Arithmetic Coding
Applications of Probability Coding: PPM + others
Lempel-Ziv Algorithms:
– LZ77, gzip
– LZ78, compress (not covered in class)
Other Lossless Algorithms: Burrows-Wheeler
Lossy algorithms for images: JPEG, MPEG, …
Compressing graphs and meshes: BBK

Page 3: Lempel-Ziv Algorithms
LZ77 (Sliding Window)
– Variants: LZSS (Lempel-Ziv-Storer-Szymanski)
– Applications: gzip, Squeeze, LHA, PKZIP, ZOO
LZ78 (Dictionary Based)
– Variants: LZW (Lempel-Ziv-Welch), LZC
– Applications: compress, GIF, CCITT (modems), ARC, PAK
Traditionally LZ77 compressed better but ran slower; the gzip implementation, however, is almost as fast as any LZ78 variant.

Page 4: LZ77: Sliding Window Lempel-Ziv
The dictionary and buffer "windows" have fixed length and slide with the cursor.
Repeat:
– Output (o, l, c), where
  o = position (offset) of the longest match that starts in the dictionary (relative to the cursor),
  l = length of the longest match,
  c = next character in the buffer beyond the longest match.
– Advance the window by l + 1.
Figure: aacaacabcababac, with the dictionary (previously coded text) to the left of the cursor and the lookahead buffer to the right.

Page 5: LZ77: Example
Input aacaacabcabaaac, dictionary size = 6, buffer size = 4 ("|" marks the cursor; a brute-force encoder sketch follows below):
1. |aacaacabcabaaac: empty dictionary, no match → (_, 0, a)
2. a|acaacabcabaaac: longest match "a" at offset 1 → (1, 1, c)
3. aac|aacabcabaaac: longest match "aaca" at offset 3 (the match extends past the dictionary into the buffer) → (3, 4, b)
4. aacaacab|cabaaac: longest match "cab" at offset 3 → (3, 3, a)
5. aacaacabcaba|aac: longest match "aa" at offset 1 → (1, 2, c)
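A minimal C sketch of this greedy matching loop, using brute-force search over all offsets (no hashing). The window sizes and triple format follow the slide; the function name and the use of offset 0 for "no match" (written "_" on the slide) are illustrative choices.

    #include <stdio.h>
    #include <string.h>

    #define DICT_SIZE 6
    #define BUF_SIZE  4

    /* Emit (offset, length, next-char) triples for the input string. */
    static void lz77_encode(const char *s) {
        int n = (int)strlen(s), cur = 0;
        while (cur < n) {
            int best_len = 0, best_off = 0;
            int max_off = cur < DICT_SIZE ? cur : DICT_SIZE;
            /* A match may run past the dictionary into the buffer, but must
               leave room for the literal character that follows it. */
            int limit = BUF_SIZE < n - cur - 1 ? BUF_SIZE : n - cur - 1;
            for (int off = 1; off <= max_off; off++) {
                int len = 0;
                while (len < limit && s[cur - off + len] == s[cur + len])
                    len++;
                if (len > best_len) { best_len = len; best_off = off; }
            }
            printf("(%d, %d, %c)\n", best_off, best_len, s[cur + best_len]);
            cur += best_len + 1;
        }
    }

    int main(void) {
        lz77_encode("aacaacabcabaaac");   /* reproduces the five triples above */
        return 0;
    }

Note that step 3's overlap needs no special handling here: the encoder compares the input against itself, so the characters the match "copies from the future" are already present.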

Page 6: LZ77 Decoding
The decoder keeps the same dictionary window as the encoder. For each codeword it looks the match up in the dictionary and appends a copy to the end of the output string.
What if l > o? (Only part of the message is in the dictionary.) E.g. dict = abcd, codeword = (2, 9, e).
Simply copy from left to right:
    for (i = 0; i < length; i++)
        out[cursor + i] = out[cursor - offset + i]
Out = abcdcdcdcdcdce
The first character of the match is always in the dictionary, and each time a character is read another has already been written, so the copy never runs out of characters to read.
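The copy loop above extends to a complete decoder; here is a self-contained C sketch. The triple layout matches the slides, while the struct and function names are illustrative. The inner loop is exactly the left-to-right copy, so overlapping matches (l > o) work without any special case.

    #include <stdio.h>

    typedef struct { int offset, length; char next; } Triple;

    /* Decode (offset, length, next-char) triples into out[].
       offset == 0 means "no match, literal only". Returns output length. */
    static int lz77_decode(const Triple *code, int ncodes, char *out) {
        int cursor = 0;
        for (int k = 0; k < ncodes; k++) {
            /* Left-to-right copy: correct even when length > offset,
               because each byte is written before it is needed again. */
            for (int i = 0; i < code[k].length; i++)
                out[cursor + i] = out[cursor - code[k].offset + i];
            cursor += code[k].length;
            out[cursor++] = code[k].next;
        }
        out[cursor] = '\0';
        return cursor;
    }

    int main(void) {
        /* The triples produced for "aacaacabcabaaac" on the previous slide. */
        Triple code[] = { {0,0,'a'}, {1,1,'c'}, {3,4,'b'}, {3,3,'a'}, {1,2,'c'} };
        char out[64];
        lz77_decode(code, 5, out);
        printf("%s\n", out);   /* prints aacaacabcabaaac */
        return 0;
    }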

Page 7: LZ77 Optimizations used by gzip
LZSS: Output one of the following two formats:
– (0, position, length)
– (1, char)
Use the second format when the match length is < 3.
Example (first steps on the same input):
|aacaacabcabaaac → (1, a)
a|acaacabcabaaac → (1, a)
aa|caacabcabaaac → (1, c)
aac|aacabcabaaac → (0, 3, 4)

Page 8: Optimizations used by gzip (cont.)
1. Huffman code the positions, lengths, and chars.
2. Non-greedy matching: possibly use a shorter match so that the next match is better.
3. Use a hash table to store the dictionary:
– Hash keys are all strings of length 3 in the dictionary window.
– Find the longest match within the correct hash bucket.
– Put a limit on the length of the search within a bucket.
– Within each bucket, store entries in order of position to make deletion easier when the window moves.

Page 9: The Hash Table
Window: aacaacabcabaaac. Each length-3 substring hashes to a bucket holding its positions, most recent first, e.g.:
aac → 19, 10, 7
aca → 11, 8
caa → 9
cab → 15, 12
…
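A sketch of this hash-chaining scheme in C, patterned on (but much simpler than) what gzip actually does: head[] holds the most recent position for each 3-byte key and prev[] chains to older positions, so buckets are naturally stored most recent first. All names and sizes here are illustrative.

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    #define WINDOW    32768          /* dictionary window size */
    #define HBITS     15
    #define HSIZE     (1 << HBITS)
    #define MAX_CHAIN 128            /* limit on search length within a bucket */

    static int32_t head[HSIZE];      /* most recent position for each hash */
    static int32_t prev[WINDOW];     /* previous position with the same hash */

    static void init_table(void) { memset(head, 0xFF, sizeof head); } /* all -1 */

    /* Hash the 3 bytes starting at s[pos] (any 3-byte hash works here). */
    static uint32_t hash3(const uint8_t *s, int pos) {
        return ((uint32_t)s[pos] << 10 ^ (uint32_t)s[pos+1] << 5 ^ s[pos+2])
               & (HSIZE - 1);
    }

    /* Walk the chain for the 3-byte key at pos; return the longest match
       length found, with its position stored in *match_pos. */
    static int longest_match(const uint8_t *s, int pos, int end, int *match_pos) {
        int best = 0, chain = MAX_CHAIN;
        for (int32_t p = head[hash3(s, pos)];
             p >= 0 && pos - p <= WINDOW && chain-- > 0;
             p = prev[p % WINDOW]) {
            int len = 0;
            while (pos + len < end && s[p + len] == s[pos + len]) len++;
            if (len > best) { best = len; *match_pos = p; }
        }
        return best;
    }

    /* After coding position pos, insert it at the front of its bucket. */
    static void insert(const uint8_t *s, int pos) {
        uint32_t h = hash3(s, pos);
        prev[pos % WINDOW] = head[h];
        head[h] = pos;
    }

    int main(void) {
        const uint8_t *s = (const uint8_t *)"aacaacabcabaaac";
        init_table();
        for (int i = 0; i <= 6; i++) insert(s, i);   /* index coded text */
        int mp = 0, len = longest_match(s, 8, 15, &mp);
        printf("match at %d, length %d\n", mp, len);  /* "cab" at 5: offset 3 */
        return 0;
    }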

Page 10: Theory behind LZ77
"The Sliding Window Lempel-Ziv Algorithm is Asymptotically Optimal," A. D. Wyner and J. Ziv, Proceedings of the IEEE, Vol. 82, No. 6, June 1994.
Will compress long enough strings to the source entropy as the window size goes to infinity.
Special case of proof: Assume an infinite dictionary window, characters generated one at a time independently, and look only for matches of length one.
If the i-th character in the alphabet occurs with probability p_i, the expected distance o_i to the nearest match satisfies (like tossing a coin until heads appears)
    E[o_i] = p_i \cdot 1 + (1 - p_i)(1 + E[o_i]).
Solving, we find that E[o_i] = 1/p_i, which is intuitive. If we can encode o_i using ≈ log o_i bits, then the expected codeword length is
    \sum_i p_i \log(1/p_i) = H,
which is the entropy of the single-character distribution.

Page 11: Logarithmic Length Encoding
How can we encode o_i using ≈ log o_i bits?
The gamma code uses two parts:
– First ⌈log o_i⌉ is encoded in unary, using ⌈log o_i⌉ bits.
– Then o_i is encoded in binary notation, using ⌈log o_i⌉ bits.
Total length of codeword: 2⌈log o_i⌉. Off by a factor of 2!
Improvement: code ⌈log o_i⌉ in binary using ⌈log⌈log o_i⌉⌉ bits instead, and first code ⌈log⌈log o_i⌉⌉ in unary using ⌈log⌈log o_i⌉⌉ bits.
Total length of codeword: 2⌈log⌈log o_i⌉⌉ + ⌈log o_i⌉.
Etc.: 2⌈log⌈log⌈log o_i⌉⌉⌉ + ⌈log⌈log o_i⌉⌉ + ⌈log o_i⌉ …
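A sketch of the gamma code in C: ⌊log o⌋ zeros announcing the length, then o in binary, for 2⌊log o⌋ + 1 bits in total, matching the slide's 2 log o_i accounting up to rounding. Bits are emitted as '0'/'1' characters for readability; a real coder would pack them.

    #include <stdio.h>

    /* Append the gamma code for v (v >= 1) to out as '0'/'1' characters. */
    static void gamma_encode(unsigned v, char *out) {
        int nbits = 0;
        for (unsigned t = v; t > 1; t >>= 1) nbits++;   /* nbits = floor(log2 v) */
        for (int i = 0; i < nbits; i++) *out++ = '0';   /* length in unary */
        for (int i = nbits; i >= 0; i--)                /* v in binary, nbits+1 bits */
            *out++ = (v >> i & 1) ? '1' : '0';
        *out = '\0';
    }

    int main(void) {
        char buf[64];
        for (unsigned v = 1; v <= 6; v++) {
            gamma_encode(v, buf);
            printf("%u -> %s\n", v, buf);   /* e.g. 4 -> 00100, 6 -> 00110 */
        }
        return 0;
    }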

Page 12: Theory behind LZ77
General case: for any n, there is a window size w (typically exponential in n) such that on average a match of size n is found, and the number of bits needed to encode the position and length of the match approaches the source entropy for substrings of length n:
    H_n = \frac{1}{n} H(X_1, \ldots, X_n) bits per character.
Optimal in the situation in which each character might depend on the previous n - 1 characters, even though n is unknown.
Uses a logarithmic code for the position and match length. (Note that the match length is typically short compared to the position.)
Problem: a "long enough" window is really, really long.

Page 13: Comparison to Lempel-Ziv 78
Both LZ77 and LZ78 and their variants keep a "dictionary" of recent strings that have been seen. The differences are (a trie sketch follows below):
– How the dictionary is stored (LZ78 uses a trie)
– How it is extended (LZ78 only extends an existing entry by one character)
– How it is indexed (LZ78 indexes the nodes of the trie)
– How elements are removed
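To make the trie-based differences concrete, here is a minimal LZ78 encoder sketch in C: each step outputs (index of the longest matching dictionary entry, next character) and then adds that entry, extended by one character, as a new trie node. The array layout and linear child search are illustrative simplifications.

    #include <stdio.h>

    #define MAXNODES 4096

    /* LZ78 dictionary as a trie: node 0 is the empty string; every other
       node is its parent's string extended by one character. */
    static int  parent_[MAXNODES];
    static char edge_[MAXNODES];
    static int  nnodes = 1;

    static int find_child(int node, char c) {
        for (int i = 1; i < nnodes; i++)           /* linear search: a sketch */
            if (parent_[i] == node && edge_[i] == c) return i;
        return -1;
    }

    static void lz78_encode(const char *s) {
        int node = 0;
        for (int i = 0; s[i]; i++) {
            int child = find_child(node, s[i]);
            if (child >= 0 && s[i+1]) { node = child; continue; }
            /* If input ends inside a match, the final char is emitted
               as part of the last pair. */
            printf("(%d, %c)\n", node, s[i]);      /* (longest match, next char) */
            if (nnodes < MAXNODES) {               /* extend entry by one char */
                parent_[nnodes] = node; edge_[nnodes] = s[i]; nnodes++;
            }
            node = 0;
        }
    }

    int main(void) {
        lz78_encode("aacaacabcabaaac");  /* phrases: a ac aa c ab ca b aaa c */
        return 0;
    }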

Page 14: Lempel-Ziv Algorithms Summary
Adapts well to changes in the file (e.g. a tar file with many file types within it).
Initial algorithms did not use probability coding and performed poorly in terms of compression.
More modern versions (e.g. gzip) do use probability coding as a "second pass" and compress much better.
The algorithms are becoming outdated, but their ideas are used in many of the newer algorithms.

Page 15: Compression Outline
Introduction: Lossy vs. Lossless, Benchmarks, …
Information Theory: Entropy, etc.
Probability Coding: Huffman + Arithmetic Coding
Applications of Probability Coding: PPM + others
Lempel-Ziv Algorithms: LZ77, gzip, compress, …
Other Lossless Algorithms:
– Burrows-Wheeler
– ACB
Lossy algorithms for images: JPEG, MPEG, …
Compressing graphs and meshes: BBK

Page 16: Burrows-Wheeler
Currently near the best "balanced" algorithm for text.
Breaks the file into fixed-size blocks and encodes each block separately. For each block:
– First sort each character by its full context. This is called the block sorting transform.
– Then use the move-to-front transform to encode the sorted characters (a sketch follows below).
The ingenious observation is that the decoder only needs the sorted characters and a pointer to the first character of the original sequence.
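A sketch of the move-to-front transform in C (the second stage above): each symbol is replaced by its current position in a list of all 256 byte values, and the symbol is then moved to the front of the list. Runs of similar characters coming out of the block sort become runs of small numbers, which the final probability coder handles well.

    #include <stdio.h>
    #include <string.h>

    /* Replace each character of s with its position in a move-to-front list. */
    static void mtf_encode(const char *s, int *out) {
        unsigned char list[256];
        for (int i = 0; i < 256; i++) list[i] = (unsigned char)i;
        for (int k = 0; s[k]; k++) {
            unsigned char c = (unsigned char)s[k];
            int pos = 0;
            while (list[pos] != c) pos++;
            out[k] = pos;
            memmove(list + 1, list, pos);   /* shift, then move c to front */
            list[0] = c;
        }
    }

    int main(void) {
        const char *s = "oeecdd";           /* BWT output from the next slide */
        int out[16];
        mtf_encode(s, out);
        for (int k = 0; s[k]; k++) printf("%d ", out[k]);  /* repeats -> 0 */
        printf("\n");
        return 0;
    }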

Page 17: Burrows-Wheeler: Example
Let's encode: decode
The context of each character is everything preceding it, "wrapping" around the input (as in a rotation); the last character of the context is the most significant. Sorting the characters using the context as the key:

Context  Output
dedec    o
coded    e
decod    e
odede    c
ecode    d   ← start (d is the first character of the input)
edeco    d

In the output, characters with similar contexts are near each other.
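A tiny C sketch of the block sorting transform exactly as this slide defines it: sort the character positions by their wrapped contexts, nearest preceding character most significant. The quadratic comparison sort is purely illustrative; real implementations use suffix sorting.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    static const char *text;
    static int n;

    /* Compare two positions by their wrapped contexts, with the character
       immediately preceding each position the most significant. */
    static int by_context(const void *a, const void *b) {
        int i = *(const int *)a, j = *(const int *)b;
        for (int k = 1; k < n; k++) {
            char ci = text[(i - k + n) % n], cj = text[(j - k + n) % n];
            if (ci != cj) return ci - cj;
        }
        return 0;
    }

    int main(void) {
        text = "decode";
        n = (int)strlen(text);
        int pos[16];
        for (int i = 0; i < n; i++) pos[i] = i;
        qsort(pos, n, sizeof pos[0], by_context);
        for (int i = 0; i < n; i++)       /* output column: o e e c d d */
            printf("%c%s\n", text[pos[i]], pos[i] == 0 ? "  <- start" : "");
        return 0;
    }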

Page 18: Burrows-Wheeler Decoding
Key idea: the decoder can construct the entire sorted table from the sorted column alone!
First: sorting the output gives the last column of the context:

Context  Output
c        o
d        e
d        e
e        c
e        d
o        d

Page 19: Burrows-Wheeler Decoding
Now sort the pairs (last context column, output) from the previous table to form the last two columns of the context:

Context  Output
ec       o
ed       e
od       e
de       c
de       d
co       d

Page 20: Burrows-Wheeler Decoding
Repeat until the entire table is complete. The pointer to the first character provides unique decoding.
The message had d in the first position, preceded in wrapped fashion by ecode: decode.

Page 21: Burrows-Wheeler Decoding
Optimization: we don't really have to rebuild the whole context table.
What character comes after the first character, d₁? Just find d₁ in the last column of the context and see what follows it: e₁.
Observation: instances of the same output character appear in the same order in the last column of the context. (Proof is an exercise.)

Page 22: Burrows-Wheeler: Decoding

Output  Context  Rank
o       c        6
e       d        4
e       d        5
c       e        1
d       e        2   ← start
d       o        3

The "rank" is the position a character would have if the output were sorted using a stable sort.

Page 23: Burrows-Wheeler Decode

    function BW_Decode(In, Start, n)
        S = MoveToFrontDecode(In, n)
        R = Rank(S)
        j = Start
        for i = 1 to n do
            Out[i] = S[j]
            j = R[j]

Rank gives the position of each character of S in sorted order. (A runnable version follows below.)
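A runnable C version of BW_Decode (minus the move-to-front stage), with Rank computed by a counting pass, which is exactly the placement a stable sort would produce. Indices are 0-based here, so Start = 4 marks the row holding d₁; the function and variable names are illustrative.

    #include <stdio.h>
    #include <string.h>

    /* Decode the BWT output column S, given the index of the row that
       holds the first character of the message (start, 0-based). */
    static void bw_decode(const char *S, int start, char *out) {
        int n = (int)strlen(S);            /* assumes n <= 64 for this sketch */
        int rank[64], before[256], count[256] = {0}, seen[256] = {0};

        /* rank[j] = position S[j] would take under a stable sort of S. */
        for (int j = 0; j < n; j++) count[(unsigned char)S[j]]++;
        int total = 0;
        for (int c = 0; c < 256; c++) { before[c] = total; total += count[c]; }
        for (int j = 0; j < n; j++) {
            unsigned char c = (unsigned char)S[j];
            rank[j] = before[c] + seen[c]++;
        }

        /* Follow the rank pointers, as in the BW_Decode pseudocode. */
        int j = start;
        for (int i = 0; i < n; i++) { out[i] = S[j]; j = rank[j]; }
        out[n] = '\0';
    }

    int main(void) {
        char out[64];
        bw_decode("oeecdd", 4, out);   /* row 4 (0-based) holds d1 */
        printf("%s\n", out);           /* prints decode */
        return 0;
    }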

Page 24: Decode Example

S     Rank(S)
o₄    6
e₂    4
e₆    5
c₃    1
d₁    2   ← start
d₅    3

Page 25: Overview of Text Compression
PPM and Burrows-Wheeler both encode a single character based on the immediately preceding context.
LZ77 and LZ78 encode multiple characters based on matches found in a block of preceding text.
Can you mix these ideas, i.e., code multiple characters based on the immediately preceding context?
– BZ does this, but they don't give details on how it works.
– ACB also does this, and is close to BZ.

Page 26: ACB (Associative Coder of Buyanovsky)
Keep the dictionary sorted by context (the last character of the context is the most significant).
– Find the longest match of the current context in the context part of the dictionary.
– Find the longest match of the look-ahead buffer in the contents part of the dictionary.
– Code: the distance between the two matches in the sorted order, the length of the contents match, and the next character that doesn't match.
– Shift the look-ahead window; update context and contents.
Has aspects of both Burrows-Wheeler and LZ77.

Page 27: ACB Example
We have seen "decode" so far. Sort (by context) all places in the dictionary (contents) where a match for the text in the look-ahead buffer could occur.
Suppose the current position is decode|odcoe (context to the left of "|", look-ahead buffer to the right).
Best matches: "decode" in the context part; "odc" in the contents part (2 characters of the buffer match, then c mismatches).
Output: (-4, 2, c).
The decoder can find the best context match itself and work from there.