Processing of large document collections


Processing of large document collections Fall 2002, Part 3

Text compression Despite a continuous increase in storage and transmission capacities, more and more effort has been put into using compression to increase the amount of data that can be handled: no matter how much storage space or transmission bandwidth is available, someone always finds ways to fill it.

Text compression Efficient storage and representation of information is an old problem, predating the computer era. Morse code uses shorter representations for common characters. The Braille code for the blind includes contractions, which represent common words with 2 or 3 characters.

Text compression On a computer: changing the representation of a file so that it takes less space to store or less time to transmit, such that the original file can be reconstructed exactly from the compressed representation. This differs from data compression in general: text compression has to be lossless. Compare with sound and images, where small changes and noise are tolerated.

Text compression methods Huffman coding (1950s): compressing English takes about 5 bits/character. Ziv-Lempel compression (1970s): 4 bits/character. Arithmetic coding: 2 bits/character (more processing power needed). Prediction by partial matching (1980s).

Text compression methods Since the 1980s the compression rate has stayed about the same; improvements have been made in processor and memory utilization during compression. Also, the amount of compression may increase when more memory (for compression and decompression) is available.

Text compression methods Most text compression methods can be placed in one of two classes: symbolwise methods and dictionary methods.

Symbolwise methods Work by estimating the probabilities of symbols (often characters), coding one symbol at a time, and using shorter codewords for the most likely symbols (in the same way as Morse code does).

Symbolwise methods Variations differ mainly in how they estimate probabilities for symbols. The more accurate these estimates are, the greater the compression that can be achieved. To obtain good compression, the probability estimate is usually based on the context in which a symbol occurs.

Dictionary methods Compress by replacing words and other fragments of text with an index to an entry in a ”dictionary”. Compression is achieved if the index is stored in fewer bits than the string it replaces.

Symbolwise methods Two components: modeling and coding. Modeling: estimating probabilities; there does not appear to be any single ”best” method. Coding: converting the probabilities into a bitstream for transmission; this part is well understood and can be performed effectively.

Models Compression methods obtain high compression by forming good models of the data that is to be coded. The function of a model is to predict symbols: e.g. during the encoding of a text, the ”prediction” for the next symbol might include a probability of 2% for the letter ’u’, based on its relative frequency in a sample of text.

Models The set of all possible symbols is called the alphabet. The probability distribution provides an estimated probability for each symbol in the alphabet.

Encoding, decoding The model provides the probability distribution to the encoder, which uses it to encode the symbol that actually occurs. The decoder uses an identical model together with the output of the encoder to find out what the encoded symbol was.

Information content of a symbol The number of bits in which a symbol s should be coded is called the information content I(s) of the symbol. The information content I(s) is directly related to the symbol’s predicted probability P(s) by the function I(s) = -log2 P(s) bits.

Information content of a symbol The average amount of information per symbol over the whole alphabet is known as the entropy of the probability distribution, denoted by H: H = Σ P(s) I(s) = -Σ P(s) log2 P(s), where the sum is taken over all symbols s in the alphabet.

Information content of a symbol Provided that the symbols appear independently and with the assumed probabilities, H is a lower bound on compression, measured in bits per symbol, that can be achieved by any coding method

Information content of a symbol If the probability of symbol ’u’ is estimated to be 2%, the corresponding information content is 5.6 bits. If ’u’ happens to be the next symbol that is to be coded, it should be transmitted in 5.6 bits.

Information content of a symbol Predictions can usually be improved by taking account of the previous symbol. If a ’q’ has just occurred, the probability of ’u’ may jump to 95%, based on how often ’q’ is followed by ’u’ in a sample of text; the information content of ’u’ in this case is 0.074 bits.
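
The relationship between probability, information content, and entropy can be checked with a small script. A minimal sketch in Python (the function names and the example distribution at the end are illustrative, not from the lecture):

import math

def information_content(p):
    """I(s) = -log2 P(s), in bits."""
    return -math.log2(p)

def entropy(probabilities):
    """H = sum over the alphabet of P(s) * I(s), in bits per symbol."""
    return sum(p * information_content(p) for p in probabilities if p > 0)

# The estimates used above: P('u') = 2% in general, 95% after a 'q'
print(round(information_content(0.02), 2))   # 5.64 bits
print(round(information_content(0.95), 3))   # 0.074 bits

# Entropy of a small assumed distribution: the lower bound in bits/symbol
print(round(entropy([0.5, 0.25, 0.125, 0.125]), 2))  # 1.75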

Information content of a symbol Models that take a few immediately preceding symbols into account to make a prediction are called finite-context models of order m, where m is the number of previous symbols used to make the prediction.

Static models There are many ways to estimate the probabilities in a model. We could use static modelling: always use the same probabilities for symbols, regardless of what text is being coded. The compressing system may not perform well if it receives a different kind of text, e.g. a model built for English used on a file of numbers.

Semi-static models One solution is to generate a model specifically for each file that is to be compressed: an initial pass is made through the file to estimate symbol probabilities, and these are transmitted to the decoder before transmitting the encoded symbols. This is called semi-static modelling.

Semi-static models Semi-static modelling has the advantage that the model is invariably better suited to the input than a static one, but the penalty paid is having to transmit the model first, as well as making a preliminary pass over the data to accumulate the symbol probabilities.

Adaptive models An adaptive model begins with a bland probability distribution and gradually alters it as more symbols are encountered. As an example, assume a zero-order model, i.e., no context is used to predict the next symbol.

Adaptive models Assume that an encoder has already encoded a long text and has come to the sentence: ”It migh”. Now the probability that the next character is ’t’ is estimated to be 49,983/768,078 = 6.5%, since in the previous text 49,983 of the total of 768,078 characters were ’t’.

Adaptive models Using the same system, ’e’ has probability 9.4% and ’x’ has probability 0.11%. The model provides this estimated probability distribution to the encoder. The decoder is able to generate the same model, since it has the same probability estimates as the encoder.

Adaptive models For a higher-order model, such as a first-order model, the probability is estimated by how often that character has occurred in the current context. In the zero-order model earlier, the symbol ’t’ occurred in the context ”It migh”, but the model made no use of the characters of the phrase.

Adaptive models A first-order model would use the final ’h’ as the context with which to condition the probability estimates. The letter ’h’ has occurred 37,525 times in the prior text, and 1,133 of these times it was followed by a ’t’. The probability of ’t’ occurring after an ’h’ can therefore be estimated to be 1,133/37,525 = 3.02%.

Adaptive models For ’t’, a prediction of 3.02% is actually worse than in the zero-order model, because ’t’ is rare in this context (’e’ follows ’h’ much more often). A second-order model would use the relative frequency with which the context ’gh’ is followed by ’t’, which is the case 64.6% of the time.
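
A minimal sketch of how such zero-order and first-order adaptive estimates could be maintained as text is processed. The class name, method names, and the short example text are illustrative assumptions, not part of the lecture material:

from collections import Counter, defaultdict

class AdaptiveCharModel:
    """Sketch of zero-order and first-order adaptive character models.
    Both start out empty and update their counts as each symbol is seen,
    so an encoder and a decoder processing the same symbols stay in sync."""

    def __init__(self):
        self.counts = Counter()               # zero-order: count per character
        self.context = defaultdict(Counter)   # first-order: previous char -> counts
        self.total = 0
        self.prev = None

    def prob(self, c):
        """Zero-order estimate P(c): relative frequency seen so far."""
        return self.counts[c] / self.total if self.total else 0.0

    def prob_given_prev(self, c, prev):
        """First-order estimate P(c | prev): relative frequency of c after prev."""
        ctx = self.context[prev]
        n = sum(ctx.values())
        return ctx[c] / n if n else 0.0

    def update(self, c):
        """Record the symbol that actually occurred (done identically by encoder and decoder)."""
        self.counts[c] += 1
        self.total += 1
        if self.prev is not None:
            self.context[self.prev][c] += 1
        self.prev = c

model = AdaptiveCharModel()
for ch in "It might be that this is the night":
    model.update(ch)

print(round(model.prob('t'), 3))                  # zero-order estimate for 't': 0.206
print(round(model.prob_given_prev('t', 'h'), 3))  # first-order estimate for 't' after 'h': 0.4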

Adaptive models Good: robust, reliable, flexible. Bad: not suitable for random access to compressed files. A text can be decoded only from the beginning: the model used for coding a particular part of the text is determined by all the preceding text -> not suitable for full-text retrieval.

Coding Coding is the task of determining the output representation of a symbol, based on a probability distribution supplied by a model. The general idea: the coder should output short codewords for likely symbols and long codewords for rare ones. Symbolwise methods depend heavily on a good coder to achieve compression.

Huffman coding A phrase is coded by replacing each of its symbols with the codeword given by a table. Huffman coding generates the codewords for a set of symbols, given some probability distribution for the symbols. The resulting code is a prefix-free code: no codeword is a prefix of another symbol’s codeword.

Huffman coding The codewords can be stored in a tree (a decoding tree). Huffman’s algorithm works by constructing the decoding tree from the bottom up.

Huffman coding Algorithm: create for each symbol a leaf node containing the symbol and its probability; the two nodes with the smallest probabilities become siblings under a new parent node, which is given a probability equal to the sum of its two children’s probabilities; the combining operation is repeated until there is only one node without a parent; the two branches from every nonleaf node are then labeled 0 and 1.
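
A minimal sketch of this construction in Python. Instead of building an explicit decoding tree, it grows the codewords directly: each time the two least probable nodes are merged, every symbol under the first gets a 0 prepended and every symbol under the second a 1. The example distribution is assumed for illustration:

import heapq

def huffman_codes(probabilities):
    """Build prefix-free codewords from a {symbol: probability} mapping by
    repeatedly merging the two nodes with the smallest probabilities."""
    heap = [(p, i, [sym]) for i, (sym, p) in enumerate(probabilities.items())]
    heapq.heapify(heap)
    codes = {sym: "" for sym in probabilities}
    counter = len(heap)                      # tie-breaker so tuples never compare lists
    while len(heap) > 1:
        p1, _, syms1 = heapq.heappop(heap)   # smallest probability
        p2, _, syms2 = heapq.heappop(heap)   # second smallest
        for s in syms1:                      # branch labeled 0
            codes[s] = "0" + codes[s]
        for s in syms2:                      # branch labeled 1
            codes[s] = "1" + codes[s]
        heapq.heappush(heap, (p1 + p2, counter, syms1 + syms2))
        counter += 1
    return codes

print(huffman_codes({'a': 0.4, 'b': 0.3, 'c': 0.2, 'd': 0.1}))
# {'a': '0', 'b': '10', 'c': '111', 'd': '110'} (or an equivalent prefix-free code)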

Huffman coding Huffman coding is generally fast for both encoding and decoding, provided that the probability distribution is static. Adaptive Huffman coding is possible, but it either needs a lot of memory or is slow. Coupled with a word-based model (rather than a character-based model), Huffman coding gives good compression.

Dictionary models Dictionary-based compression methods use the principle of replacing substrings in a text with a codeword that identifies that substring in a dictionary. The dictionary contains a list of substrings and a codeword for each substring. Fixed codewords are often used; reasonable compression is obtained even if the coding is simple.

Dictionary models The simplest dictionary compression methods use small dictionaries. For instance, in digram coding, selected pairs of letters are replaced with codewords. A dictionary for the ASCII character set might contain the 128 ASCII characters, as well as 128 common letter pairs.

Dictionary models Digram coding: the output codewords are eight bits each. The presence of the full ASCII character set ensures that any (ASCII) input can be represented. At best, every pair of characters is replaced with a codeword, reducing the input from 7 bits/character to 4 bits/character; at worst, each 7-bit character will be expanded to 8 bits.
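
A minimal sketch of such a digram coder, assuming a hypothetical dictionary of common letter pairs: the 128 ASCII characters keep codes 0-127 and each pair gets a code in 128-255, so every output codeword is one byte. The tiny pair list in the example is purely illustrative:

def digram_encode(text, pairs):
    """Replace dictionary pairs with one-byte codes 128-255; any other
    ASCII character is passed through as its own (7-bit) code."""
    pair_codes = {p: 128 + i for i, p in enumerate(pairs[:128])}
    out = []
    i = 0
    while i < len(text):
        if text[i:i + 2] in pair_codes:      # best case: 2 characters -> 1 byte
            out.append(pair_codes[text[i:i + 2]])
            i += 2
        else:                                # worst case: 7-bit char -> 8-bit byte
            out.append(ord(text[i]))
            i += 1
    return bytes(out)

print(digram_encode("the then", ["th", "he", "e ", " t"]))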

Dictionary models A natural extension is to put even larger entries in the dictionary, e.g. common words like ’and’, ’the’, … or common components of words like ’pre’, ’tion’, … However, a predefined set of dictionary phrases makes the compression domain-dependent, or very short phrases have to be used, and then good compression is not achieved.

Dictionary models One way to avoid the problem of the dictionary being unsuitable for the text at hand is to use a semi-static dictionary scheme: construct a new dictionary for every text that is to be compressed. The overhead of transmitting or storing the dictionary is significant, and deciding which phrases should be included is a difficult problem.

Dictionary models Solution: use an adaptive dictionary scheme, such as the Ziv-Lempel coders (LZ77 and LZ78). A substring of text is replaced with a pointer to where it has occurred previously. The dictionary is all the text prior to the current position; the codewords are pointers.

Dictionary models Ziv-Lempel: the prior text makes a very good dictionary, since it is usually in the same style and language as the upcoming text. The dictionary is transmitted implicitly at no extra cost, because the decoder has access to all previously encoded text.

LZ77 Key benefits: relatively easy to implement; decoding can be performed extremely quickly using only a small amount of memory. Suitable when the resources required for decoding must be minimized, e.g. when data is distributed or broadcast from a central source to a number of small computers.

LZ77 The output of the encoder consists of a sequence of triples, e.g. <3,2,b>. The first component of a triple indicates how far back to look in the previous (decoded) text to find the next phrase; the second component records how long the phrase is; the third component gives the next character from the input.

LZ77 Components 1 and 2 constitute a pointer back into the text. Component 3 is actually necessary only when the character to be coded does not occur anywhere in the previous input.

LZ77 Encoding: for the text from the current point ahead, search for the longest match in the previous text and output a triple that records the position and length of the match. The search may return a length of zero, in which case the position of the match is not relevant. The search can be accelerated by indexing the prior text with a suitable data structure.

LZ77 There are limitations on how far back a pointer can refer and on the maximum length of the string referred to: e.g. for English text, a window of a few thousand characters, and a phrase length of at most 16 characters or so; otherwise too much space is wasted without benefit.
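
A minimal LZ77 sketch along these lines, with a bounded window and maximum match length. The brute-force search stands in for the indexed search mentioned above, and the triples follow the <how far back, length, next character> form used earlier; the parameter values are illustrative:

def lz77_encode(text, window=4096, max_len=16):
    """Emit (distance, length, next_char) triples using a bounded window."""
    triples, i = [], 0
    while i < len(text):
        best_dist, best_len = 0, 0
        for j in range(max(0, i - window), i):       # brute-force longest-match search
            length = 0
            while (length < max_len and i + length < len(text) - 1
                   and text[j + length] == text[i + length]):
                length += 1
            if length > best_len:
                best_dist, best_len = i - j, length
        triples.append((best_dist, best_len, text[i + best_len]))
        i += best_len + 1
    return triples

def lz77_decode(triples):
    """Copy 'length' characters from 'distance' back, then append the next character."""
    out = []
    for dist, length, ch in triples:
        for _ in range(length):
            out.append(out[-dist])
        out.append(ch)
    return "".join(out)

coded = lz77_encode("abababba")
print(coded)               # [(0, 0, 'a'), (0, 0, 'b'), (2, 4, 'b'), (0, 0, 'a')]
print(lz77_decode(coded))  # abababba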

LZ77 The decoding program is very simple, so it can be included with the data at very little cost. In fact, the compressed data can be stored as part of the decoder program, which makes the data self-expanding, a common way to distribute files.