1 Processing of large document collections Part 4

2 Text compression
- Despite a continuous increase in storage and transmission capacities, more and more effort has been put into using compression to increase the amount of data that can be handled.
- No matter how much storage space or transmission bandwidth is available, someone always finds a way to fill it.

3 Text compression
- Efficient storage and representation of information is an old problem, predating the computer era.
  - Morse code uses shorter representations for common characters.
  - The Braille code for the blind includes contractions, which represent common words with 2 or 3 characters.

4 Text compression
- On a computer, text compression means changing the representation of a file so that it takes less space to store or less time to transmit.
  - The original file can be reconstructed exactly from the compressed representation.
- This differs from data compression in general:
  - text compression has to be lossless
  - compare with sound and images, where small changes and noise are tolerated

5 Text compression methods
- Huffman coding (in the 1950s)
  - compressing English: 5 bits/character
- Ziv-Lempel compression (in the 1970s)
  - 4 bits/character
- Arithmetic coding
  - 2 bits/character (more processing power needed)
- Prediction by partial matching (1980s)

6 Text compression methods
- Since the 1980s, the compression rate has stayed about the same.
- Improvements are made in processor and memory utilization during compression.
- Also, the amount of compression may increase when more memory (for compression and decompression) is available.

7 Text compression methods
- Most text compression methods can be placed in one of two classes:
  - symbolwise methods
  - dictionary methods

8 Symbolwise methods
- Work by estimating the probabilities of symbols (often characters),
  - coding one symbol at a time,
  - using shorter codewords for the most likely symbols (in the same way as Morse code does).

9 Dictionary methods
- Achieve compression by replacing words and other fragments of text with an index to an entry in a "dictionary".
  - The Braille code is a dictionary method, since special codes are used to represent whole words.

10 Symbolwise methods
- Usually based on Huffman coding or arithmetic coding.
- Variations differ mainly in how they estimate probabilities for symbols:
  - the more accurate these estimates are, the greater the compression that can be achieved
  - to obtain good compression, the probability estimate is usually based on the context in which a symbol occurs

11 Symbolwise methods
- Modeling: estimating probabilities
  - there does not appear to be any single "best" method
- Coding: converting the probabilities into a bitstream for transmission
  - well understood, can be performed effectively

12 Dictionary methods
- Generally use quite simple representations to code references to entries in the dictionary.
- Obtain compression by representing several symbols as one output codeword.

13 Dictionary methods
- Often based on Ziv-Lempel coding, which replaces strings of characters with a reference to a previous occurrence of a string.
  - adaptive
  - effective: most characters can be coded as part of a string that has occurred earlier in the text
  - compression is achieved if the reference is stored in fewer bits than the string it replaces

14 Models
- Compression methods obtain high compression by forming good models of the data that is to be coded.
- The function of a model is to predict symbols.
  - E.g. during the encoding of a text, the "prediction" for the next symbol might include a probability of 2% for the letter 'u', based on its relative frequency in a sample of text.

15 Models
- The set of all possible symbols is called the alphabet.
- The probability distribution provides an estimated probability for each symbol in the alphabet.

16 Encoding, decoding
- The model provides the probability distribution to the encoder, which uses it to encode the symbol that actually occurs.
- The decoder uses an identical model, together with the output of the encoder, to find out what the encoded symbol was.

17 Information content of a symbol
- The number of bits in which a symbol s should be coded is called the information content I(s) of the symbol.
- The information content I(s) is directly related to the symbol's predicted probability P(s) by the function
  - I(s) = -log2 P(s) bits

18 Information content of a symbol
- The average amount of information per symbol over the whole alphabet is known as the entropy of the probability distribution, denoted by H:
  - H = sum over all symbols s of P(s) I(s) = - sum over all symbols s of P(s) log2 P(s) bits per symbol

19 Information content of a symbol
- Provided that the symbols appear independently and with the assumed probabilities, H is a lower bound on the compression, measured in bits per symbol, that can be achieved by any coding method (a small numeric sketch follows below).
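
To make the formulas concrete, here is a small numeric sketch in Python (not from the slides) that computes I(s) and H for an assumed probability distribution; the function names are chosen only for illustration.

    import math

    def information_content(p):
        """Information content I(s) = -log2 P(s), in bits."""
        return -math.log2(p)

    def entropy(probabilities):
        """Entropy H = - sum of P(s) * log2 P(s) over the alphabet, in bits per symbol."""
        return sum(p * information_content(p) for p in probabilities if p > 0)

    # Examples from the slides: P('u') = 2% gives about 5.6 bits,
    # and P('u' | 'q') = 95% gives about 0.074 bits.
    print(round(information_content(0.02), 1))   # 5.6
    print(round(information_content(0.95), 3))   # 0.074
    print(entropy([0.5, 0.25, 0.125, 0.125]))    # 1.75 bits/symbol for this toy distribution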

20 Information content of a symbol
- If the probability of the symbol 'u' is estimated to be 2%, the corresponding information content is 5.6 bits.
- If 'u' happens to be the next symbol that is to be coded, it should be transmitted in 5.6 bits.
- Predictions can usually be improved by taking account of the previous symbol.

21 Information content of a symbol
- Predictions can usually be improved by taking account of the previous symbol.
- If a 'q' has just occurred, the probability of 'u' may jump to 95%, based on how often 'q' is followed by 'u' in a sample of text.
- The information content of 'u' in this case is 0.074 bits.

22 Information content of a symbol
- Models that take a few immediately preceding symbols into account to make a prediction are called finite-context models of order m.
  - m is the number of previous symbols used to make the prediction.

23 Adaptive models
- There are many ways to estimate the probabilities in a model.
- We could use static modelling:
  - always use the same probabilities for symbols, regardless of what text is being coded
  - the compression system may not perform well if a different kind of text is received, e.g. a model for English applied to a file of numbers

24 Adaptive models
- One solution is to generate a model specifically for each file that is to be compressed.
- An initial pass is made through the file to estimate symbol probabilities, and these are transmitted to the decoder before transmitting the encoded symbols.
- This is called semi-static modelling.

25 Adaptive models
- Semi-static modelling has the advantage that the model is invariably better suited to the input than a static one, but the penalty paid is
  - having to transmit the model first,
  - as well as making a preliminary pass over the data to accumulate symbol probabilities.

26 Adaptive models
- An adaptive model begins with a bland probability distribution and gradually alters it as more symbols are encountered.
- As an example, assume a zero-order model, i.e., no context is used to predict the next symbol.

27 Adaptive models
- Assume that an encoder has already encoded a long text and has come to the sentence: It migh
- Now the probability that the next character is 't' is estimated to be 49,983/768,078 = 6.5%, since in the previous text 49,983 of the total of 768,078 characters were 't'.

28 Adaptive models
- Using the same system, 'e' has probability 9.4% and 'x' has probability 0.11%.
- The model provides this estimated probability distribution to the encoder.
- The decoder is able to generate the same model, since it has the same probability estimates as the encoder (a sketch of such a model follows below).
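
As an illustration (not from the slides), here is a minimal sketch of an adaptive zero-order model that encoder and decoder could each maintain by counting the symbols seen so far; the class and method names are hypothetical.

    from collections import Counter

    class ZeroOrderModel:
        """Adaptive zero-order model: probabilities come from counts of all symbols seen so far."""

        def __init__(self):
            self.counts = Counter()
            self.total = 0

        def probability(self, symbol):
            # A real coder would smooth this estimate so that unseen symbols get a non-zero probability.
            return self.counts[symbol] / self.total if self.total else 0.0

        def update(self, symbol):
            # Called by both encoder and decoder after each symbol, so their models stay identical.
            self.counts[symbol] += 1
            self.total += 1

    model = ZeroOrderModel()
    for ch in "it might be that it might rain":
        model.update(ch)
    print(round(model.probability('t'), 3))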

29 Adaptive models
- For a higher-order model, such as a first-order model, the probability is estimated from how often that character has occurred in the current context.
- In the zero-order model earlier, the symbol 't' occurred in the context It migh, but the model made no use of the characters of that phrase.

30 Adaptive models
- A first-order model would use the final 'h' as a context with which to condition the probability estimates.
- The letter 'h' has occurred 37,525 times in the prior text, and 1,133 of those times it was followed by a 't'.
- The probability of 't' occurring after an 'h' can therefore be estimated to be 1,133/37,525 = 3.02%.

31 Adaptive models
- For 't', a prediction of 3.0% is actually worse than in the zero-order model, because 't' is rare in this context ('e' follows 'h' much more often).
- A second-order model would use the relative frequency with which the context 'gh' is followed by 't', which is the case 64.6% of the time (see the sketch below).
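
A corresponding sketch (again hypothetical, not from the slides) of a first-order model, which conditions its counts on the single preceding character:

    from collections import Counter, defaultdict

    class FirstOrderModel:
        """Adaptive first-order model: counts how often each character follows each one-character context."""

        def __init__(self):
            self.context_counts = defaultdict(Counter)

        def update(self, prev_char, char):
            self.context_counts[prev_char][char] += 1

        def probability(self, prev_char, char):
            counts = self.context_counts[prev_char]
            total = sum(counts.values())
            return counts[char] / total if total else 0.0

    text = "it might be night right now"
    model = FirstOrderModel()
    for prev, cur in zip(text, text[1:]):
        model.update(prev, cur)
    # Estimated probability that 't' follows 'h' in this tiny sample.
    print(round(model.probability('h', 't'), 2))

A second-order model would be built the same way, except that the context key would be the two preceding characters (e.g. 'gh') instead of one.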

32 Adaptive models
- Good: robust, reliable, flexible.
- Bad: not suitable for random access to compressed files.
  - A text can be decoded only from the beginning: the model used for coding a particular part of the text is determined by all the preceding text.
  - Therefore adaptive models are not suitable for full-text retrieval.

33 Coding
- Coding is the task of determining the output representation of a symbol, based on a probability distribution supplied by a model.
- General idea: the coder should output short codewords for likely symbols and long codewords for rare ones.
- Symbolwise methods depend heavily on a good coder to achieve compression.

34 Huffman coding
- A phrase is coded by replacing each of its symbols with the codeword given by a table.
- Huffman coding generates the codewords for a set of symbols, given some probability distribution for the symbols.
- The resulting code is a prefix-free code:
  - no codeword is the prefix of another symbol's codeword.

35 Huffman coding
- The codewords can be stored in a tree (a decoding tree).
- Huffman's algorithm works by constructing the decoding tree from the bottom up.

36 Huffman coding
- Algorithm (a sketch follows below):
  - create for each symbol a leaf node containing the symbol and its probability
  - the two nodes with the smallest probabilities become siblings under a new parent node, which is given a probability equal to the sum of its two children's probabilities
  - the combining operation is repeated until there is only one node without a parent
  - the two branches from every nonleaf node are then labeled 0 and 1
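
A minimal sketch of this bottom-up construction in Python (not from the slides); it uses a heap to repeatedly combine the two least probable nodes, and the symbol probabilities are assumed to be given.

    import heapq
    import itertools

    def huffman_codes(probabilities):
        """Build Huffman codewords from a dict {symbol: probability}."""
        counter = itertools.count()  # tie-breaker so the heap never has to compare the node dicts
        heap = [(p, next(counter), {"symbol": s}) for s, p in probabilities.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            # The two nodes with the smallest probabilities become siblings under a new parent.
            p1, _, left = heapq.heappop(heap)
            p2, _, right = heapq.heappop(heap)
            heapq.heappush(heap, (p1 + p2, next(counter), {"left": left, "right": right}))
        codes = {}

        def walk(node, prefix):
            if "symbol" in node:
                codes[node["symbol"]] = prefix or "0"  # degenerate case: single-symbol alphabet
            else:
                walk(node["left"], prefix + "0")   # label one branch 0 ...
                walk(node["right"], prefix + "1")  # ... and the other 1
        walk(heap[0][2], "")
        return codes

    print(huffman_codes({"a": 0.5, "b": 0.25, "c": 0.15, "d": 0.10}))

The result is a prefix-free code: more probable symbols end up closer to the root and therefore receive shorter codewords.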

37 Huffman coding
- Huffman coding is generally fast for both encoding and decoding, provided that the probability distribution is static.
  - Adaptive Huffman coding is possible, but it either needs a lot of memory or is slow.
- Coupled with a word-based model (rather than a character-based model), it gives good compression.

38 Canonical Huffman codes
- The frequency of each word is counted.
- The codewords are chosen for each word to minimize the size of the compressed file, giving a static zero-order word-level model.
- The codewords are listed in decreasing order of length (and therefore in increasing order of word frequency).
  - Within each block of codes of the same length, words are ordered alphabetically.

39 Canonical Huffman codes
- The list begins with the thousands of words (and numbers) that appear only once.
- When the codewords are sorted in lexical order, they are also in order from the longest to the shortest codeword.

40 Canonical Huffman codes
- A word's encoding can be determined quickly from the length of its codeword, how far through the list it is, and the codeword for the first word of that length.

41 Canonical Huffman codes
- It is not necessary to store a decoding tree.
- All that is required is
  - a list of the symbols, ordered according to the lexical order of the codewords, and
  - an array storing the first codeword of each distinct length.
- A decoding sketch follows below.
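
A decoding sketch (hypothetical, not from the slides). It uses the common canonical convention in which codeword values are assigned from the shortest length upwards, which is the mirror image of the longest-first arrangement described above, but the same length/offset idea applies.

    def canonical_tables(code_lengths):
        """Build canonical Huffman tables from {symbol: codeword length in bits}."""
        max_len = max(code_lengths.values())
        counts = [0] * (max_len + 1)            # counts[l]: how many codewords have length l
        for length in code_lengths.values():
            counts[length] += 1
        first_code = [0] * (max_len + 2)        # first_code[l]: value of the first codeword of length l
        for l in range(1, max_len + 1):
            first_code[l + 1] = (first_code[l] + counts[l]) << 1
        # Symbols ordered by (length, symbol), i.e. by the lexical order of their codewords.
        symbols = sorted(code_lengths, key=lambda s: (code_lengths[s], s))
        first_index = [0] * (max_len + 1)       # first_index[l]: position of the first symbol of length l
        index = 0
        for l in range(1, max_len + 1):
            first_index[l] = index
            index += counts[l]
        return symbols, first_code, first_index, counts

    def canonical_decode(bits, symbols, first_code, first_index, counts):
        """Decode a string of '0'/'1' bits one codeword at a time, without any decoding tree."""
        out, value, length = [], 0, 0
        for bit in bits:
            value = (value << 1) | int(bit)
            length += 1
            # A codeword of this length exists if value falls inside that length's block of codes.
            if length < len(counts) and 0 <= value - first_code[length] < counts[length]:
                out.append(symbols[first_index[length] + (value - first_code[length])])
                value, length = 0, 0
        return "".join(out)

    tables = canonical_tables({"a": 1, "b": 2, "c": 3, "d": 3})
    print(canonical_decode("010110111", *tables))   # prints: abcd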

42 Dictionary models
- Dictionary-based compression methods use the principle of replacing substrings in a text with a codeword that identifies that substring in a dictionary.
- The dictionary contains a list of substrings and a codeword for each substring.
- Often fixed codewords are used:
  - reasonable compression is obtained even if the coding is simple.

43 Dictionary models
- The simplest dictionary compression methods use small dictionaries.
- For instance, digram coding:
  - selected pairs of letters are replaced with codewords
  - a dictionary for the ASCII character set might contain the 128 ASCII characters, as well as 128 common letter pairs

44 Dictionary models
- Digram coding, continued (see the sketch below):
  - the output codewords are eight bits each
  - the presence of the full ASCII character set ensures that any (ASCII) input can be represented
  - at best, every pair of characters is replaced with a codeword, reducing the input from 7 bits/character to 4 bits/character
  - at worst, each 7-bit character will be expanded to 8 bits
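
A minimal digram-coding sketch in Python (hypothetical, not from the slides). The pair table and the codeword layout, values 0 to 127 for single ASCII characters and 128 to 255 for the chosen pairs, are assumptions made for the example.

    # Hypothetical dictionary: 128 single ASCII characters plus up to 128 common letter pairs.
    COMMON_PAIRS = ["e ", " t", "th", "he", "s ", " a", "in", "er"]   # toy list; a real table would have 128 entries
    PAIR_CODE = {pair: 128 + i for i, pair in enumerate(COMMON_PAIRS)}
    CODE_PAIR = {code: pair for pair, code in PAIR_CODE.items()}

    def digram_encode(text):
        """Greedy digram coding: emit one byte per dictionary pair when possible, else one byte per character."""
        out = bytearray()
        i = 0
        while i < len(text):
            pair = text[i:i + 2]
            if pair in PAIR_CODE:
                out.append(PAIR_CODE[pair])   # codewords 128..255: a dictionary pair
                i += 2
            else:
                out.append(ord(text[i]))      # codewords 0..127: a plain ASCII character
                i += 1
        return bytes(out)

    def digram_decode(data):
        return "".join(CODE_PAIR[b] if b >= 128 else chr(b) for b in data)

    encoded = digram_encode("the cat in the hat")
    print(len(encoded), digram_decode(encoded))   # 13 bytes instead of 18 characters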

45 Dictionary models
- Natural extension:
  - put even larger entries in the dictionary, e.g. common words like 'and', 'the', … or common components of words like 'pre', 'tion', …
- A predefined set of dictionary phrases makes the compression domain-dependent,
  - or very short phrases have to be used, in which case good compression is not achieved.

46 Dictionary models
- One way to avoid the problem of the dictionary being unsuitable for the text at hand is to use a semi-static dictionary scheme:
  - construct a new dictionary for every text that is to be compressed
  - however, the overhead of transmitting or storing the dictionary is significant
  - and deciding which phrases should be included is a difficult problem

47 Dictionary models
- Solution: use an adaptive dictionary scheme.
- Ziv-Lempel coders (LZ77 and LZ78):
  - a substring of text is replaced with a pointer to where it has occurred previously
  - dictionary: all the text prior to the current position
  - codewords: pointers

48 Dictionary models
- Ziv-Lempel, continued:
  - the prior text makes a very good dictionary, since it is usually in the same style and language as the upcoming text
  - the dictionary is transmitted implicitly at no extra cost, because the decoder has access to all previously encoded text

49 LZ77
- Key benefits:
  - relatively easy to implement
  - decoding can be performed extremely quickly using only a small amount of memory
- Suitable when the resources required for decoding must be minimized, e.g. when data is distributed or broadcast from a central source to a number of small computers.

50 LZ77
- The output of the encoder consists of a sequence of triples:
  - the first component of a triple indicates how far back to look in the previous (decoded) text to find the next phrase
  - the second component records how long the phrase is
  - the third component gives the next character from the input

51 LZ77
- Components 1 and 2 constitute a pointer back into the text.
- Component 3 is actually necessary only when the character to be coded does not occur anywhere in the previous input.

52 LZ77
- Encoding (a sketch follows below):
  - for the text from the current point ahead:
    - search for the longest match in the previous text
    - output a triple that records the position and length of the match
    - the search for a match may return a length of zero, in which case the position of the match is not relevant
  - the search can be accelerated by indexing the prior text with a suitable data structure
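
A minimal LZ77 encoding sketch in Python (not from the slides); the brute-force search, the window size, and the maximum match length are assumptions made for the example, and a real implementation would index the prior text instead.

    def lz77_encode(text, window=4096, max_len=16):
        """Encode text as (distance, length, next_char) triples using a brute-force longest-match search."""
        triples = []
        pos = 0
        while pos < len(text):
            best_dist, best_len = 0, 0
            start = max(0, pos - window)
            # Search the window of previous text for the longest match with the upcoming text.
            for candidate in range(start, pos):
                length = 0
                while (length < max_len and pos + length < len(text) - 1
                       and text[candidate + length] == text[pos + length]):
                    length += 1
                if length > best_len:
                    best_dist, best_len = pos - candidate, length
            # If no match was found, best_len is zero and the distance is irrelevant.
            triples.append((best_dist, best_len, text[pos + best_len]))
            pos += best_len + 1
        return triples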

53 LZ77
- In practice there are limits on how far back a pointer can refer and on the maximum size of the string referred to:
  - e.g. for English text, a window of a few thousand characters
  - for the length of the phrase, e.g. a maximum of 16 characters
- Otherwise too much space is wasted without benefit.

54 LZ77
- The decoding program is very simple (a sketch follows below), so it can be included with the data at very little cost.
- In fact, the compressed data can be stored as part of the decoder program, which makes the data self-expanding.
- This is a common way to distribute files.
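
To illustrate how little the decoder needs, here is a matching decoding sketch (again hypothetical): it keeps only the text produced so far and copies characters from it, with no search and no explicit dictionary.

    def lz77_decode(triples):
        """Rebuild the text from (distance, length, next_char) triples."""
        out = []
        for distance, length, next_char in triples:
            start = len(out) - distance
            for i in range(length):
                out.append(out[start + i])   # copy character by character so overlapping matches work
            out.append(next_char)
        return "".join(out)

    text = "abracadabra abracadabra"
    assert lz77_decode(lz77_encode(text)) == text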

55 Synchronization
- Good compression methods perform best when compressing large files.
- This tends to preclude random access, because the decompression algorithms are sequential by nature.
- Full-text retrieval systems require random access, so special measures need to be taken to facilitate it.

56 Synchronization
- There are two reasons why the better compression methods require files to be decoded from the beginning:
  - they use variable-length codes
  - their models are adaptive

57 Synchronization
- With variable-length codes, it is not possible to begin decoding at an arbitrary position in the file:
  - we cannot be sure of starting on the boundary between two codewords.
- With adaptive modelling, even if the codeword boundary is known, the model required for decoding can be constructed only by using all the preceding text.

58 Synchronization
- To achieve synchronization with adaptive modelling, large files must be broken up into small sections of a few kilobytes.
- Good compression cannot be obtained by compressing the small sections separately, so the compression model has to be constructed from a sample text of the whole data.

59 Synchronization
- For full-text retrieval, it is usually preferable to use a static model rather than an adaptive one.
- Static models are more appropriate given the static nature of a large textual database.

60 Synchronization
- In a full-text retrieval system, the main text usually consists of a number of documents, which represent the smallest unit to which random access is required.
- In an uncompressed text, a document can be identified simply by specifying how many bytes it is from the start of the file.
  - When a variable-length code is used, a document may not start on a byte boundary.

61 Synchronization
- A solution is to insist that documents begin on byte boundaries, which means that the last byte of a document may contain some wasted bits.
- There has to be some way to tell the decoder that it should not interpret those padding bits (a sketch follows below).
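
As an illustration (hypothetical, not from the slides), a minimal bit-writer that pads each compressed document out to a byte boundary; the decoder can ignore the padding bits if, for example, it knows from an index how many symbols each document contains.

    class BitWriter:
        """Accumulates variable-length codewords and pads to a byte boundary after each document."""

        def __init__(self):
            self.data = bytearray()
            self.current = 0
            self.nbits = 0

        def write_bits(self, bitstring):
            for bit in bitstring:
                self.current = (self.current << 1) | int(bit)
                self.nbits += 1
                if self.nbits == 8:
                    self.data.append(self.current)
                    self.current, self.nbits = 0, 0

        def pad_to_byte(self):
            # The zero bits added here are the "wasted" bits; the decoder must know not to interpret them.
            if self.nbits:
                self.data.append(self.current << (8 - self.nbits))
                self.current, self.nbits = 0, 0
            return len(self.data)   # byte offset at which the next document can start

    writer = BitWriter()
    writer.write_bits("110101")        # some variable-length codewords for document 1
    offset = writer.pad_to_byte()      # document 2 will start at this byte offset
    print(offset, writer.data.hex())   # 1 d4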

