Text Operations: Coding / Compression Methods
Text Compression
Motivation
  – finding ways to represent the text in fewer bits
  – reducing costs associated with space requirements, I/O overhead, and communication delays
  – obstacle: IR systems need to access text randomly
      in some forms of compressed text, reaching a given word requires decoding the entire text from the beginning until the desired word is reached
Two strategies
  – statistical methods
  – dictionary methods
Statistical Methods
Basic concepts
  – Modeling: a probability is estimated for each symbol
  – Coding: a code is assigned to each symbol based on the model
  – shorter codes are assigned to the most likely symbols
Relationship between probabilities and codes
  – Source coding theorem (Claude Shannon): a symbol that occurs with probability p should be assigned a code of length $\log_2(1/p)$ bits
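To make the link between probabilities and code lengths concrete, here is a minimal Python sketch. It assumes the model is simply the relative frequency of each symbol in the input; the function name `ideal_code_lengths` and the sample string are illustrative, not from the slides.

```python
import math
from collections import Counter

def ideal_code_lengths(symbols):
    """Return {symbol: log2(1/p)}, with p estimated as a relative frequency."""
    counts = Counter(symbols)
    total = sum(counts.values())
    # log2(1/p) with p = count / total is log2(total / count).
    return {s: math.log2(total / c) for s, c in counts.items()}

# Assumed sample input: "a" has p = 1/2, "b" has p = 1/4, "c" and "d" have p = 1/8.
lengths = ideal_code_lengths("aaaabbcd")
print(lengths)
```

The most likely symbol gets the shortest ideal length (1 bit for "a"), and the rarest symbols get the longest (3 bits for "c" and "d").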
Statistical Methods
Compression models
  – adaptive model: progressively learns the statistical distribution as the compression process goes on
      decompression of a file has to start from its beginning
  – static model: assumes an average distribution for all input texts
      poor compression ratios when the data deviates from the initial distribution assumptions
  – semi-static model: learns a distribution in a first pass, then compresses the data in a second pass using a fixed code derived from the learned distribution
      information on the data distribution must be stored
Huffman Coding
Building a Huffman tree
  – for each symbol of the alphabet, create a node containing the symbol and its probability
  – the two nodes with the smallest probabilities become children of a newly created parent node
  – the parent node is assigned a probability equal to the sum of the probabilities of its two children
  – the operation is repeated, ignoring nodes that are already children, until only one node remains
Huffman Coding
Example: “for each rose, a rose is a rose”
[Figure: Huffman tree with leaves “rose”, “is”, “, ”, “a”, “each”, “for”]
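The tree-building procedure can be sketched in Python on the example sentence. This is a minimal sketch under stated assumptions: the symbols are the words plus the separator ", ", raw counts stand in for probabilities, and ties are broken by insertion order, so the resulting tree may differ in shape from the one drawn on the slide. The name `huffman_code_lengths` is illustrative.

```python
import heapq
from collections import Counter
from itertools import count

def huffman_code_lengths(freqs):
    """Return {symbol: code length} by repeatedly merging the two lightest nodes."""
    tiebreak = count()  # unique index so equal weights never compare the dicts
    heap = [(w, next(tiebreak), {sym: 0}) for sym, w in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        # Every symbol under the new parent moves one level deeper.
        merged = {s: d + 1 for s, d in {**left, **right}.items()}
        heapq.heappush(heap, (w1 + w2, next(tiebreak), merged))
    return heap[0][2]

# Assumed tokenization of the example sentence (words plus the ", " separator).
tokens = ["for", "each", "rose", ", ", "a", "rose", "is", "a", "rose"]
freqs = Counter(tokens)
lengths = huffman_code_lengths(freqs)
total_bits = sum(freqs[s] * lengths[s] for s in freqs)
print(lengths, total_bits)
```

Only code lengths are tracked here; assigning the actual 0/1 labels along each branch is a separate step.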
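Canonical Huffman Coding
The height of the left subtree of any node is never smaller than that of the right subtree.
S: ordered sequence of pairs $(x_i, y_i)$, one for each level i of the tree, where
  – $x_i$ = number of symbols at level i
  – $y_i$ = numerical value of the first code at level i
[Figure: canonical Huffman tree for “for each rose, a rose is a rose” with leaves “rose”, “is”, “, ”, “a”, “each”, “for”]
S = ((1, 1), (1, 1), (0, ∞), (4, 0))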
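The sequence S on the slide can be reproduced from the number of symbols per level alone. The sketch below assumes a common canonical-code convention that the slide does not spell out: the deepest level starts at code value 0, and the first code of each shallower level is the ceiling of half of (first code + symbol count) of the level below. Levels without symbols are reported as ∞, as on the slide. The function name `canonical_levels` is illustrative.

```python
import math

def canonical_levels(symbols_per_level):
    """symbols_per_level[i] is the number of symbols with code length i + 1.

    Returns S as a list of (x_i, y_i) pairs, top level first."""
    depth = len(symbols_per_level)
    first = [0] * depth  # the deepest level starts at code value 0 (assumption)
    # Work upward from the deepest level.
    for i in range(depth - 2, -1, -1):
        first[i] = math.ceil((first[i + 1] + symbols_per_level[i + 1]) / 2)
    return [(x, y if x > 0 else math.inf)
            for x, y in zip(symbols_per_level, first)]

# One symbol at level 1, one at level 2, none at level 3, four at level 4.
S = canonical_levels([1, 1, 0, 4])
print(S)
```

Under this convention the four level-4 codes are 0000 to 0011, the level-2 code is 01, and the level-1 code is 1, which agrees with the pairs listed in S.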
Byte-Oriented Huffman Coding
Tree has a branching factor of 256
Ensure there are no empty nodes in the higher levels of the tree
  – number of bottom-level elements = 1 + ((v – 256) mod 255), where v is the number of symbols (vocabulary size)
Characteristics
  – Decompression is faster than for plain Huffman coding.
  – Compression ratios are better than for the Ziv-Lempel family of codings.
  – Allows direct searching on compressed text.
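The bottom-level formula can be checked numerically. The sketch below assumes v is the vocabulary size and v ≥ 256; the helper `remaining_merges_are_full` is a hypothetical check, not part of the slides, that verifies every merge after the first one combines exactly 256 nodes.

```python
def bottom_level_elements(v, branching=256):
    """Number of symbols merged first, following the formula on the slide."""
    return 1 + ((v - branching) % (branching - 1))

def remaining_merges_are_full(v, branching=256):
    """True if, after the first merge, the node count reaches 1 using only
    merges of exactly `branching` nodes (each such merge removes branching - 1)."""
    nodes = v - bottom_level_elements(v, branching) + 1
    return (nodes - 1) % (branching - 1) == 0

# Assumed vocabulary size for illustration.
v = 1000
print(bottom_level_elements(v), remaining_merges_are_full(v))
```

For v = 1000 the first merge takes 235 symbols, leaving 766 nodes, which three full 256-way merges reduce to a single root.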
Dictionary Methods
Basic concepts
  – replacing groups of consecutive symbols with a pointer to an entry in a dictionary
  – the pointers are references to entries in a dictionary composed of a list of symbols that are expected to occur frequently
  – pointers to the dictionary entries are chosen so that they need less space than the phrase they replace
  – there is no distinction between modeling and coding
  – there are no explicit probabilities associated with phrases
Dictionary Methods
Static dictionary methods
  – selected pairs of letters are replaced with codewords
  – ex) Digram coding
      at each step, the next two characters are inspected to see whether they correspond to a digram in the dictionary
      if so, they are coded together and the coding position is shifted by two characters; otherwise, the single character is represented by its normal code and the position is shifted by one character
  – main problem
      the dictionary might be suitable for one text, but unsuitable for another
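The digram coding step can be sketched as follows. This is a minimal sketch assuming 7-bit ASCII text, so that values 128 and above are free to serve as digram codewords; the dictionary `DIGRAMS` and the function name are hypothetical.

```python
def digram_encode(text, digrams):
    """Greedy left-to-right digram coding.

    `digrams` maps two-character strings to codeword values; a single
    character falls back to its ordinary character code (ord)."""
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in digrams:
            out.append(digrams[pair])   # code both characters together
            i += 2
        else:
            out.append(ord(text[i]))    # normal code for one character
            i += 1
    return out

# Hypothetical static dictionary, chosen only for illustration.
DIGRAMS = {"th": 128, "he": 129, "e ": 130, "in": 131}
codes = digram_encode("the thin", DIGRAMS)
print(codes)
```

Here the eight input characters become four codewords. A text whose frequent pairs are missing from `DIGRAMS` would see little or no gain, which is the main problem noted above.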
Dictionary Methods
Adaptive dictionary methods
  – Ziv-Lempel
      replacing strings of characters with a reference to a previous occurrence of the string
      if the pointer to an earlier occurrence of a string is stored in fewer bits than the string it replaces, compression is achieved
Ziv-Lempel Code
Characteristics
  – identifies each text segment the first time it appears and then simply points back to this first occurrence rather than repeating the segment
  – an adaptive model of coding, with increasingly long text segments encoded as the text is scanned
  – the pointers require less space than the repeated text segments
  – higher compression than character-based Huffman codes
  – codes of roughly 4 bits per character
Ziv-Lempel Code
LZ77 – Gzip encoding
  – the code consists of a sequence of triples (a, b, c)
      a: identifies how far back in the decoded text to look for the upcoming text segment
      b: tells how many characters to copy for the upcoming segment
      c: a new character to add to complete the next segment
Ziv-Lempel Code
An example (decoding)
  p
  pe
  pet
  peter
  peter_
  peter_pi
  peter_piper
  peter_piper_pic
  peter_piper_pick
  peter_piper_picked
  ………..
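The decoding steps above can be reproduced with a small LZ77-style decoder. The slides do not list the triples themselves, so the sequence `TRIPLES` below is an assumption, chosen because it yields exactly the intermediate strings shown; each triple is (how far back, how many characters to copy, new character).

```python
def lz77_decode(triples):
    """Decode (back, length, char) triples, yielding the text after each one."""
    text = []
    for back, length, char in triples:
        start = len(text) - back
        for k in range(length):
            # Copy one character at a time so a copy may overlap its own output.
            text.append(text[start + k])
        text.append(char)
        yield "".join(text)

# Assumed triples, reconstructed to match the decoding steps in the example.
TRIPLES = [(0, 0, "p"), (0, 0, "e"), (0, 0, "t"), (2, 1, "r"), (0, 0, "_"),
           (6, 1, "i"), (8, 2, "r"), (6, 3, "c"), (0, 0, "k"), (7, 1, "d")]
steps = list(lz77_decode(TRIPLES))
for step in steps:
    print(step)
```

Because every triple refers back into text that has already been decoded, decoding has to proceed from the start of the stream.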
Dictionary Methods
Adaptive dictionary methods
  – Disadvantages compared with the Huffman method
      no random access: decoding cannot start in the middle of a compressed file
Dictionary schemes are popular for their speed and low memory use, but statistical methods are more common in an IR environment
Inverted File Compression
An inverted file is a
  – vector of all words in the collection (vocabulary)
  – for each word
      list of documents that include that word
Assuming each list holds its documents in ascending order
  – it can be compressed by storing the sizes of the gaps rather than the document numbers
Unary code: (x – 1) one-bits followed by a zero bit
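A minimal Python sketch of gap encoding with the unary code follows. It assumes document numbers start at 1 and that the first gap is measured from 0; the posting list is made up for illustration.

```python
def to_gaps(doc_ids):
    """Turn an ascending list of document numbers into gaps."""
    gaps, previous = [], 0
    for doc_id in doc_ids:
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps

def unary(x):
    """Unary code for x >= 1: (x - 1) one-bits followed by a zero bit."""
    return "1" * (x - 1) + "0"

# Hypothetical posting list for one word.
postings = [3, 5, 6, 10]
gaps = to_gaps(postings)
bits = "".join(unary(g) for g in gaps)
print(gaps, bits)
```

Frequent words have many small gaps, which is where a code that favors small numbers pays off.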
Golomb Number Compression
Value x encoded as:
  – q + 1 in unary, where $q = \lfloor (x - 1) / b \rfloor$
  – followed by
  – r in binary, where $r = (x - 1) - q \cdot b$
  – b is set based on the size and distribution of the numbers being encoded
For gap compression
  – $b = 0.69 \cdot (N / f_t)$, where
  – N is the total number of documents
  – $f_t$ is the number of documents containing term t
Implies that the compression coding varies for each term
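The encoding rule can be sketched in Python. Two simplifications are assumed here and are not stated on the slide: b is rounded up to an integer of at least 1, and the remainder r is written with a fixed width of ceil(log2 b) bits rather than the truncated binary code used by full Golomb coders. Function names are illustrative.

```python
import math

def golomb_parameter(num_docs, doc_freq):
    """b = 0.69 * (N / f_t), rounded up to an integer of at least 1 (assumption)."""
    return max(1, math.ceil(0.69 * num_docs / doc_freq))

def golomb_encode(x, b):
    """Encode x >= 1 as a bit string: q + 1 in unary, then r in binary."""
    q = (x - 1) // b
    r = (x - 1) - q * b
    unary_part = "1" * q + "0"  # q + 1 in unary: q one-bits, then a zero bit
    # Fixed-width remainder (assumption); no remainder bits are needed when b == 1.
    width = math.ceil(math.log2(b)) if b > 1 else 0
    binary_part = format(r, f"0{width}b") if width else ""
    return unary_part + binary_part

# Hypothetical term: appears in 100 of 1000 documents.
b = golomb_parameter(1000, 100)
print(b, golomb_encode(9, 3))
```

Since b depends on $f_t$, each term's posting list is coded with its own parameter, which is why the coding varies per term.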
Comparing Text Compression Techniques
                             Arithmetic  Character Huffman  Word Huffman  Ziv-Lempel
Compression ratio            very good   poor               very good     good
Compression speed            slow        fast               fast          very fast
Decompression speed          slow        fast               very fast     very fast
Memory space                 low         low                high          moderate
Compressed pattern matching  no          yes                yes           yes
Random access                no          yes                yes           no