Information Retrieval: Data compression
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
Architectural features...
The memory hierarchy (figure): CPU registers and L1/L2 caches (few MBs, some nanosecs, few words fetched per access); RAM (few GBs, tens of nanosecs, some words fetched); hard disk (few TBs, few millisecs, one B = 32K page fetched); network (many TBs, even secs, packets fetched).
How much can we compress?
For lossless compression, assuming all input messages are valid, if even one string is compressed then some other string must expand. Take all messages of length n: is it possible to compress them into fewer bits? No: there are 2^n such messages, but fewer than 2^n shorter bit strings available. To say anything more, we need to talk about stochastic sources.
Entropy (Shannon, 1948)
For a set of symbols S, where symbol s occurs with probability p(s), the self-information of s is i(s) = log2 (1/p(s)) bits: the lower the probability, the higher the information. Entropy is the weighted average of the self-information: H(S) = ∑s p(s) log2 (1/p(s)).
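To make the definitions concrete, here is a minimal Python sketch (the distribution is the one used in the Huffman example below):

```python
import math

def self_information(p):
    # i(s) = log2(1/p(s)): the rarer the symbol, the more bits of information
    return math.log2(1.0 / p)

def entropy(probs):
    # H(S) = weighted average of the self-information of the symbols
    return sum(p * self_information(p) for p in probs.values())

probs = {'a': 0.1, 'b': 0.2, 'c': 0.2, 'd': 0.5}
print(entropy(probs))   # about 1.76 bits per symbol
```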
Statistical Coding
How do we use the p(s) to encode s? Three topics: prefix codes and their relationship to entropy, Huffman codes, and arithmetic codes.
Uniquely Decodable Codes
A variable-length code assigns a bit string (codeword) of variable length to every symbol, e.g. a = 1, b = 01, c = 101, d = 011. What if you receive the sequence 1011? Is it aba, ca, or ad? A uniquely decodable code is a variable-length code in which every bit string can be decomposed into codewords in at most one way.
Prefix Codes
A prefix code is a variable-length code in which no codeword is a prefix of another, e.g. a = 0, b = 100, c = 101, d = 11. A prefix code can be viewed as a binary trie whose leaves are the symbols (left branch = 0, right branch = 1).
Average Length
For a code C with probability p(s) and codeword length |C[s]| for each symbol s, the average length is defined as La(C) = ∑s p(s) |C[s]|. We say that a prefix code C is optimal if La(C) ≤ La(C') for all prefix codes C'. Fact (Kraft-McMillan): for any optimal uniquely decodable code there exists a prefix code with the same codeword lengths, and thus the same optimal average length; and vice versa.
A property of optimal codes
Theorem. If C is an optimal prefix code for the source S = {p1, ..., pn}, then pi < pj implies l(ci) ≥ l(cj). Golden rule: assign shorter codewords to more frequent symbols.
Relationship to Entropy
Theorem (lower bound, Shannon). For any probability distribution p(S) and any associated uniquely decodable code C, H(S) ≤ La(C).
Theorem (upper bound). For any probability distribution p(S) and its optimal prefix code C, La(C) ≤ H(S) + 1.
Huffman coding meets this upper bound; arithmetic coding gets arbitrarily close to the entropy.
Huffman Codes
Invented by Huffman as a class assignment in the early '50s. Used in many, if not most, compression algorithms: gzip, bzip, jpeg (as an option), fax compression, ...
Properties: generates optimal prefix codes; cheap to generate; cheap to encode and decode; La = H if the probabilities are powers of 2, otherwise at most 1 extra bit per symbol!
Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5. Merge a(.1) and b(.2) into a tree of weight .3; merge it with c(.2) into weight .5; merge that with d(.5) into weight 1. Resulting codewords: a = 000, b = 001, c = 01, d = 1. What about ties? What about tree depth?
Huffman Algorithm
Start with a set of singleton trees (leaves), one per symbol s, each with weight p(s). Then repeat until one tree remains: select the two trees whose roots have minimum weights p1 and p2, and join them into a single tree by adding a new root with weight p1 + p2. A sketch follows.
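A minimal Python sketch of this loop, using a binary heap for the repeated minimum-weight selections (tie-breaking may differ from the slide's figure, but the codeword lengths stay optimal):

```python
import heapq
from itertools import count

def huffman(probs):
    """Return optimal prefix codewords for {symbol: probability}."""
    ids = count()                        # tie-breaker: heapq needs a total order
    heap = [(p, next(ids), s) for s, p in probs.items()]   # singleton trees
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, t1 = heapq.heappop(heap)  # the two minimum-weight roots...
        p2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, next(ids), (t1, t2)))  # ...joined under a new root
    codes = {}
    def assign(tree, prefix):
        if isinstance(tree, tuple):      # internal node: 0 = left, 1 = right
            assign(tree[0], prefix + '0')
            assign(tree[1], prefix + '1')
        else:
            codes[tree] = prefix or '0'  # edge case: one-symbol alphabet
    assign(heap[0][2], '')
    return codes

print(huffman({'a': .1, 'b': .2, 'c': .2, 'd': .5}))
# e.g. {'d': '0', 'c': '10', 'a': '110', 'b': '111'}: average length 1.8 vs H = 1.76
```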
Encoding and Decoding
Encoding: emit the bits along the root-to-leaf path leading to the symbol to be encoded. Decoding: start at the root and take the branch indicated by each bit received; upon reaching a leaf, output its symbol and return to the root.
A property on tree contraction
Huffman codes are optimal
Bounding the Huffman codes' length
We derive the bound on lx and ly by cases (px < py and vice versa).
Optimum vs. Huffman
Canonical Huffman Codes
Huffman codes can be made succinct in the representation of the codeword tree, and fast in (de)coding. Normal Huffman codes are static; to be applied in a dynamic model, we need a ...
You find this at ...
Byte-aligned Huffword [Moura et al, 98]
Compressed text derived from a word-based Huffman coder: the symbols of the Huffman tree are the words of T, and the tree has fan-out 128, so every codeword is a sequence of 7-bit chunks. Codewords are byte-aligned and tagged: each byte carries 7 bits of the codeword plus 1 tag bit marking whether the byte starts a new codeword. The slide's figure encodes T = "bzip or not bzip" word by word (including the space symbol).
CGrep and other ideas...
To search a pattern, e.g. P = bzip, encode it with the same word-based code and byte-search for it directly in C(T), using the tag bits to accept only matches aligned at codeword boundaries. Note: (1) no need to decompress the text, the search runs directly over C(T).
You find it under my Software projects.
Problem with Huffman Coding
Consider a message with probability .999. Its self-information is log2 (1/.999) ≈ .00144 bits. If we were to send 1000 such messages we might hope to use about 1000 × .00144 ≈ 1.44 bits. Using Huffman we take at least one bit per message, so we would require 1000 bits. It might seem like we cannot do better than this. Assuming Huffman codes, how could we improve? Assuming there is only one possible other message (with probability .001), what would the expected length be for sending 1000 messages picked from this distribution? About 1.44 + 10 ≈ 11.4 bits: the rare message has self-information log2 (1/.001) ≈ 10 bits and is expected to appear once.
What can we do?
Huffman may lose up to 1 bit per symbol, and this can be a lot for highly probable symbols. We can "enlarge" the symbol: a macro-symbol = block of k symbols brings the inefficiency down to 1/k bits per symbol, but requires a larger model to be transmitted — possibly too large, since the message has bounded length. Shannon considered infinite sequences, with k → ∞! Arithmetic coding does better, achieving almost the optimum among 0-order entropy encoders: nH0 + 2 bits in theory, roughly nH0 + 0.02 n bits in practice.
Arithmetic Coding: Introduction
Allows using "fractional" parts of bits! Used in PPM, JPEG/MPEG (as an option), bzip. More time-costly than Huffman, but the integer implementation is not too bad.
Arithmetic Coding (message intervals)
Assign each symbol an interval within [0,1): e.g. with p(a) = .2, p(b) = .5, p(c) = .3, the cumulative values are f(a) = 0, f(b) = .2, f(c) = .7, so a → [0,.2), b → [.2,.7), c → [.7,1). The interval for a particular symbol is called the symbol interval (e.g. for b it is [.2,.7)). We will distinguish between symbol, sequence, and code intervals; it is important to keep them straight.
Arithmetic Coding: Encoding Example
Coding the message sequence bac: start from [0,1); b narrows it to [.2,.7); a narrows it to [.2,.3); c narrows it to [.27,.3). The final sequence interval is [.27,.3). Even if the notation gets confusing, the intuition stays clear, as this example shows.
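The interval narrowing is a two-line recurrence; a minimal sketch, where the model maps each symbol to its cumulative value f(s) and probability p(s):

```python
def sequence_interval(msg, model):
    """Return the sequence interval [low, high) of msg."""
    low, size = 0.0, 1.0
    for s in msg:
        f, p = model[s]
        low, size = low + size * f, size * p   # narrow by a factor p(s)
    return low, low + size

model = {'a': (0.0, 0.2), 'b': (0.2, 0.5), 'c': (0.7, 0.3)}
print(sequence_interval('bac', model))   # (0.27, 0.3), as in the slide
```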
Arithmetic Coding
To code a sequence of symbols with probabilities pi (i = 1..n), each symbol narrows the current interval by a factor pi, so the final interval size is s = p1 · p2 · ... · pn. The interval for a message sequence is called the sequence interval. Note that the interval size equals exactly the probability of that message sequence.
Uniquely defining an interval
Important property: the intervals for distinct messages of length n never overlap, since at every step distinct symbols get disjoint sub-intervals. Therefore specifying any number in the final interval uniquely determines the message. Decoding is similar to encoding, but at each step we determine the symbol from the number and then reduce the interval accordingly. How should we pick a number in the interval, and how should we represent that number?
Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message has length 3: .49 falls in b's interval [.2,.7) → b; within that, it falls in b's sub-interval [.3,.55) → b; within that, in c's sub-interval [.475,.55) → c. The message is bbc. We are basically running the encoding, but paying attention to the number instead of the symbol names.
Representing an Interval
Binary fractional representation: a number in [0,1) is written as bits after the point (e.g. .75 = .11). How about just using the shortest binary fractional number lying inside the sequence interval? E.g. [0,.33) → .01, [.33,.66) → .1, [.66,1) → .11. To emit the bits of x, repeat: x = 2x; if x < 1 output 0, else set x = x − 1 and output 1. But the resulting code is not a prefix code: .1 is a prefix of .11.
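The doubling rule turns directly into code; a small sketch emitting the first bits of the binary fractional expansion of x:

```python
def fractional_bits(x, nbits):
    """First nbits of the binary fractional expansion of x in [0,1)."""
    out = []
    for _ in range(nbits):
        x *= 2                    # shift the fraction left by one bit
        if x < 1:
            out.append('0')
        else:
            out.append('1')
            x -= 1
    return '.' + ''.join(out)

print(fractional_bits(0.75, 2))   # .11
print(fractional_bits(1 / 3, 6))  # .010101
```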
Representing an Interval (continued)
A binary fractional number .b1...bk can be viewed as an interval by considering all of its completions: it denotes the code interval [.b1...bk, .b1...bk + 2^-k). We have now met all three kinds of intervals: message, sequence, and code. Can you give an intuition for the lemma?
Selecting the Code Interval
To obtain a prefix code, pick a binary fractional number whose code interval is contained in the sequence interval (a dyadic number). One can use L + s/2 truncated to 1 + ⌈log2 (1/s)⌉ bits, where L is the left endpoint and s the size of the sequence interval: truncating to that many bits yields a code interval of size at most s/2 that starts within the sequence interval, hence lies inside it. Example: sequence interval [.61,.79); the code interval of .101 is [.625,.75), which is contained in it.
Bound on Arithmetic length
Note that −log2 s + 1 = log2 (1/s) + 1.
Bound on Length
Theorem. For a text of length n, the arithmetic encoder generates at most
1 + ⌈log2 (1/s)⌉ ≤ 2 + log2 (1/s) = 2 + log2 ∏i=1..n (1/pi) = 2 + ∑i=1..n log2 (1/pi) = 2 + ∑k=1..|Σ| n pk log2 (1/pk) = 2 + n H0 bits,
where pk is the empirical frequency of the k-th alphabet symbol. (The 1 + log2 (1/s) comes from the previous slide; note that s is overloaded, denoting both the size of the sequence interval and the probability of the message.) In practice the cost is about nH0 + 0.02 n bits because of rounding.
Integer Arithmetic Coding
The problem: operations on arbitrary-precision real numbers are expensive — s gets very small and requires as many bits of precision as the size of the final code. Key ideas of the integer version: keep the interval endpoints as integers in [0..R) with R = 2^k; use rounding to generate the integer sub-intervals; whenever the sequence interval falls into the top, bottom, or middle half of [0..R), expand it by a factor of 2. The integer algorithm is thus an approximation. (If the input probabilities are binary floats, extended floats can be used; if they are rationals, arbitrary-precision rationals.)
Integer Arithmetic (scaling)
If l ≥ R/2 (interval in the top half): output 1 followed by m 0s, reset m = 0, and expand the interval by 2.
If u < R/2 (bottom half): output 0 followed by m 1s, reset m = 0, and expand by 2.
If l ≥ R/4 and u < 3R/4 (middle half): increment m and expand by 2.
In all other cases, just continue. The hard part is when the interval keeps narrowing down on the middle: the counter m defers those bits until the ambiguity is resolved.
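A sketch of just this renormalization step, assuming a half-open integer interval [l, u) and a list out collecting the emitted bits (m counts the deferred middle-half bits):

```python
def renormalize(l, u, m, R, out):
    """Expand [l, u) while it lies within a half of [0, R), emitting bits."""
    half, quarter = R // 2, R // 4
    while True:
        if u <= half:                            # bottom half: the next bit is 0
            out.append('0'); out.extend('1' * m); m = 0
        elif l >= half:                          # top half: the next bit is 1
            out.append('1'); out.extend('0' * m); m = 0
            l -= half; u -= half
        elif l >= quarter and u <= 3 * quarter:  # middle half: defer the bit
            m += 1
            l -= quarter; u -= quarter
        else:
            return l, u, m
        l *= 2; u *= 2                           # expand by a factor of 2
```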
You find this at ...
Using Conditional Probabilities: PPM
Use the previous k characters as the context, and base the conditional probabilities on counts: e.g. if th has been seen 12 times, followed by e 7 of those times, then p(e|th) = 7/12. Keep k small so that the dictionary does not get too large (typically k < 8). (An 8-gram entropy is the entropy obtained by blocking letters into groups of 8.)
PPM: Partial Matching
Problem: what do we do if the current context has never been seen followed by the next character? We cannot code 0-probability events! The key idea of PPM is to reduce the context size when the previous match has not been seen: if the character has not been seen in the current context of size 3, send an escape message and try the context of size 2, then again an escape and size 1, and so on, keeping statistics for each context size up to k. The escape is a special symbol with some probability; different PPM variants use different heuristics for it. This not only solves the 0-probability problem, but also allows much better compression than 2.3 bits/char, because small contexts yield better statistics when little data is available.
PPM: Example Contexts
String = ACCBACCACBA, k = 2 ($ denotes the escape symbol).
Order 0 (empty context): A = 4, B = 2, C = 5.
Order 1: after A: C = 3; after B: A = 2; after C: A = 1, B = 2, C = 2.
Order 2: after AC: B = 1, C = 2; after BA: C = 1; after CA: C = 1; after CB: A = 2; after CC: A = 1, B = 1.
Each context also carries its own escape count $.
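These tables can be rebuilt by a few lines of Python that scan the string once per context length (escape counts are a modeling choice of each PPM variant and are left out here):

```python
from collections import defaultdict

def ppm_counts(text, kmax):
    """For every context of length 0..kmax, count which characters follow it."""
    counts = defaultdict(lambda: defaultdict(int))
    for i, ch in enumerate(text):
        for k in range(min(i, kmax) + 1):
            counts[text[i - k:i]][ch] += 1   # context = the k chars before position i
    return counts

c = ppm_counts("ACCBACCACBA", 2)
print(dict(c[""]))    # {'A': 4, 'C': 5, 'B': 2}  (order-0 counts)
print(dict(c["A"]))   # {'C': 3}
print(dict(c["AC"]))  # {'C': 2, 'B': 1}
```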
You find this at: compression.ru/ds/
Lempel-Ziv Algorithms
Keep a "dictionary" of recently seen strings. The variants differ in how the dictionary is stored, how it is extended, how it is indexed, and how elements are removed. LZ algorithms are asymptotically optimal: their compression ratio approaches H(S) as n → ∞, with no explicit frequency estimation!
LZ77: Sliding Window
Each step outputs a triple <d, len, c>, where d = backward distance of the copied string from the current position, len = length of the longest match, and c = the next character in the text beyond the longest match; then the cursor advances by len + 1. A fixed-length buffer "window" slides over the text, and the dictionary consists of all substrings starting in the window before the cursor. In the slide's figure the cursor emits <2,3,c>. A sketch follows.
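A compact sketch of this parsing (greedy longest match, fixed window, overlapping copies allowed); it reproduces the example on the next slide:

```python
def lz77_encode(text, window=6):
    """Emit <d, len, c> triples over a sliding window."""
    out, i, n = [], 0, len(text)
    while i < n:
        best_d, best_len = 0, 0
        for d in range(1, min(i, window) + 1):          # candidate copy distances
            l = 0
            while i + l < n - 1 and text[i + l - d] == text[i + l]:
                l += 1                                  # may run past the cursor (overlap)
            if l > best_len:
                best_d, best_len = d, l
        out.append((best_d, best_len, text[i + best_len]))
        i += best_len + 1                               # advance past match + next char
    return out

print(lz77_encode("aacaacabcabaaac"))
# [(0, 0, 'a'), (1, 1, 'c'), (3, 4, 'b'), (3, 3, 'a'), (1, 2, 'c')]
```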
Example: LZ77 with window
Window size = 6, T = aacaacabcabaaac. The parsing emits: (0,0,a) — empty match, then a; (1,1,c) — copy a at distance 1, then c; (3,4,b) — copy aaca at distance 3 (the copy overlaps the cursor), then b; (3,3,a) — copy cab at distance 3, then a; (1,2,c) — copy aa at distance 1, then c. At each step the longest match must start within the window W.
LZ77 Decoding
The decoder keeps the same dictionary window as the encoder: it finds the referenced substring and inserts a copy of it. What if len > d (the copy overlaps the text still to be written)? E.g. seen = abcd, next codeword is (2,9,e). Simply copy left to right starting at the cursor:
for (i = 0; i < len; i++) out[cursor + i] = out[cursor - d + i];
The output is correct: abcdcdcdcdcdce.
LZ77 Optimizations used by gzip
LZSS: output one of the two formats (0, distance, length) or (1, char), typically using the second when the match length is < 3. On T = aacaacabcabaaac the parsing starts: (1,a), (1,a), (1,c), (0,3,4), ...
Optimizations used by gzip (cont.)
The emitted triples are themselves coded with Huffman codes. Special greedy parsing: possibly use a shorter match so that the next match is better. The dictionary is a hash table: hashing is based on strings of length 3, and the longest match is found by extending the positions lying in the current hash bucket; within a bucket, positions are stored in decreasing order.
LZ78: Dictionary Lempel-Ziv
Basic algorithm: keep a dictionary of strings, each with an integer id, stored in a trie. Coding loop: find the longest match S in the dictionary; output the id of S and the next character c after the match; add the string Sc to the dictionary. Decoding keeps the same dictionary and simply looks the ids up. A sketch follows.
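A minimal sketch of the coding loop, with the trie replaced by a Python dict from strings to ids (the final flush convention, for a text ending in the middle of a match, is one common choice):

```python
def lz78_encode(text):
    """Output (id, c) pairs; id 0 denotes the empty string."""
    dictionary = {'': 0}
    out, S = [], ''
    for c in text:
        if S + c in dictionary:
            S += c                                   # extend the current match
        else:
            out.append((dictionary[S], c))
            dictionary[S + c] = len(dictionary)      # add Sc with the next free id
            S = ''
    if S:                                            # flush a pending match
        out.append((dictionary[S[:-1]], S[-1]))
    return out

print(lz78_encode("aabaacabcabcb"))
# [(0, 'a'), (1, 'b'), (1, 'a'), (0, 'c'), (2, 'c'), (5, 'b')] -- as in the example below
```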
LZ78: Coding Example
T = aabaacabcabcb parses as a | ab | aa | c | abc | abcb:
Output (0,a) → dictionary 1 = a
Output (1,b) → 2 = ab
Output (1,a) → 3 = aa
Output (0,c) → 4 = c
Output (2,c) → 5 = abc
Output (5,b) → 6 = abcb
LZ78: Decoding Example
Input (0,a) → output a, dictionary 1 = a
Input (1,b) → output ab, 2 = ab
Input (1,a) → output aa, 3 = aa
Input (0,c) → output c, 4 = c
Input (2,c) → output abc, 5 = abc
Input (5,b) → output abcb, 6 = abcb
The decoded text is aabaacabcabcb.
LZW (Lempel-Ziv-Welch) ['84]
Don't send the extra character c, but still add Sc to the dictionary. The dictionary is initialized with the 256 byte values as its first entries (the slide's example uses a = 112), otherwise there is no way to start it up. The decoder is one step behind the coder, since it does not know c. There is an issue for strings of the form SSc where S[0] = c; these are handled specially! A sketch follows.
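A sketch of the encoder (here the dictionary is seeded with the real ASCII codes, so a = 97 rather than the slide's a = 112):

```python
def lzw_encode(text):
    dictionary = {chr(i): i for i in range(256)}    # byte values seed the dictionary
    out, S = [], ''
    for c in text:
        if S + c in dictionary:
            S += c                                  # keep extending the match
        else:
            out.append(dictionary[S])               # emit the id only: no extra char...
            dictionary[S + c] = len(dictionary)     # ...but still add Sc
            S = c
    out.append(dictionary[S])                       # flush the last match
    return out

print(lzw_encode("aabaacababac"))
# [97, 97, 98, 256, 99, 257, 261, 99] -- the slide's 112, 112, 113, 256, 114, 257, 261, ...
```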
LZW: Encoding Example (a = 112, b = 113, c = 114)
T = aabaacababac parses as a | a | b | aa | c | ab | aba | c:
Output 112 → dictionary 256 = aa
Output 112 → 257 = ab
Output 113 → 258 = ba
Output 256 → 259 = aac
Output 114 → 260 = ca
Output 257 → 261 = aba
Output 261 → 262 = abac
LZW: Decoding Example
The decoder receives 112, 112, 113, 256, 114, 257, 261, ... and rebuilds the dictionary one step later:
Input 112 → output a
Input 112 → output a, add 256 = aa
Input 113 → output b, add 257 = ab
Input 256 → output aa, add 258 = ba
Input 114 → output c, add 259 = aac
Input 257 → output ab, add 260 = ca
Input 261 → 261 is not yet in the dictionary! This is the SSc case: the unknown string must be the previous output plus its own first character, i.e. ab + a = aba; output aba and add 261 = aba.
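A sketch of the decoder, one step behind the coder and handling the SSc case explicitly:

```python
def lzw_decode(codes):
    dictionary = {i: chr(i) for i in range(256)}
    prev = dictionary[codes[0]]
    out = [prev]
    for code in codes[1:]:
        if code in dictionary:
            cur = dictionary[code]
        else:                      # SSc case: the code was defined by the very
            cur = prev + prev[0]   # last step, so it is prev + prev's first char
        dictionary[len(dictionary)] = prev + cur[0]   # one step behind the coder
        out.append(cur)
        prev = cur
    return ''.join(out)

print(lzw_decode([97, 97, 98, 256, 99, 257, 261, 99]))   # aabaacababac
```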
LZ78 and LZW issues
How do we keep the dictionary small? Throw the dictionary away when it reaches a certain size (used in GIF); throw it away when it is no longer effective at compressing (used in Unix compress); or discard the least-recently-used (LRU) entry when the dictionary reaches a certain size (used in BTLZ, the British Telecom standard).
You find this at: www.gzip.org/zlib/
Run Length Encoding (RLE)
Code by specifying runs <symbol, #occurrences>: e.g. abbbaacccca ⇒ (a,1), (b,3), (a,2), (c,4), (a,1). For binary strings the symbols can be dropped: the run lengths plus one starting bit suffice. Properties: takes advantage of spatial locality; the counts can themselves be coded based on frequency, possibly using < 1 bit per symbol if the text is highly repetitive. Should each character have its own frequency distribution for its run lengths, or one shared by all of them?
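The transform itself is a one-liner over maximal runs; a sketch:

```python
from itertools import groupby

def rle(text):
    # one (symbol, run length) pair per maximal run
    return [(ch, len(list(g))) for ch, g in groupby(text)]

print(rle("abbbaacccca"))   # [('a', 1), ('b', 3), ('a', 2), ('c', 4), ('a', 1)]
```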
Move to Front Coding
Transforms the character sequence into an integer sequence, which can then be statistically coded. Start with the symbols in a total order, L = [a,b,c,d,...]; for each symbol s, output the position of s in L and then move s to the front of L. E.g. for s = d, L = [a,b,c,d,e,...] ⇒ output 3, new L = [d,a,b,c,e,...]; then for s = a, L = [d,a,b,c,e,...] ⇒ output 1, new L = [a,d,b,c,e,...]. Properties: takes advantage of temporal locality; it is a dynamic code; never much worse than Huffman, but it may be far better. (One can also implement the move-to-front heuristic with splay trees: output the 0/1 root-to-node path, then splay the node to the root.)
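A sketch of the encoder (positions are 0-based, matching the slide's outputs):

```python
def mtf_encode(text, alphabet):
    L = list(alphabet)            # the running total order
    out = []
    for s in text:
        i = L.index(s)
        out.append(i)             # position of s in L
        L.insert(0, L.pop(i))     # move s to the front
    return out

print(mtf_encode("dabddd", "abcde"))   # [3, 1, 2, 2, 0, 0]: repeats become 0s
```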
The Burrows-Wheeler Transform (1994)
Given the text T = mississippi#, form all of its cyclic rotations and sort them lexicographically. F is the first column of the sorted matrix and L the last. Sorted rotations of mississippi#:
#mississippi → L = i
i#mississipp → p
ippi#mississ → s
issippi#miss → s
ississippi#m → m
mississippi# → #
pi#mississip → p
ppi#mississi → i
sippi#missis → s
sissippi#mis → s
ssippi#missi → i
ssissippi#mi → i
So bwt(T) = L = ipssm#pissii.
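Conceptually the transform is three lines of Python (quadratic space as written, so fine only for illustration, not for real use):

```python
def bwt(T):
    """Last column of the sorted rotation matrix; T must end with a unique '#'."""
    rotations = sorted(T[i:] + T[:i] for i in range(len(T)))
    return ''.join(row[-1] for row in rotations)

print(bwt("mississippi#"))   # ipssm#pissii
```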
A famous example
Suffix Array vs. BW-transform
For T = mississippi# the suffix array is SA = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3], i.e. the sorted order of the suffixes #, i#, ippi#, issippi#, ississippi#, mississippi#, pi#, ppi#, sippi#, sissippi#, ssippi#, ssissippi#. The BWT row order coincides with the suffix order, and L[i] is the character preceding the i-th smallest suffix: L[i] = T[SA[i] − 1] (and L[i] = # when SA[i] = 1). Build SA via an "indirect qsort": SA[i] < SA[j] iff T[SA[i], N] <L T[SA[j], N].
An elegant/inefficient construction algorithm
BWT construction: Manzini's web page
The skew algorithm (divide step)
The skew algorithm (conquer step)
T(n) = T(2n/3) + O(n) = O(n), provided that the merging takes O(n) time.
The skew algorithm (recombine)
When merging a suffix Tk from SA1 with a suffix Ti from SA2,0, the key issue is: if i mod 3 = 2, compare one character and then Ti+1 and Tk+1, which both belong to SA2,0; if i mod 3 = 0, compare two characters and then Ti+2 and Tk+2, which again both belong to SA2,0. Hence the order of each pair can be derived from SA2,0 in O(1) time.
How to compress L?
A key observation on T = mississippi#: L is locally homogeneous, hence highly compressible. Algorithm bzip: (1) Move-to-Front coding of L; (2) Run-Length coding (Wheeler's code); (3) a statistical coder, Arithmetic or Huffman. Bzip vs. Gzip: about 20% vs. 33% of the original size, but bzip is slower in (de)compression!
An encoding example
T = mississippimississippi, S = {i, m, p, s}, L = ippssssmmi#ppiissssiiii (# at position 11). Pipeline: Mtf coding of L; then each Mtf value x is written as bin(x+1) without the front 1; RLE0 codes the runs of zeros, with an integer coding of the run lengths; finally, Arithmetic coding over |S|+1 symbols.
A useful tool: L → F mapping
How do we map L's characters onto F's characters? We need to distinguish equal characters in F. Take two equal characters in L and rotate their rows rightward by one position: we obtain two rows starting with that character, and they keep the same relative order. Hence equal characters preserve their relative order between L and F: the k-th occurrence of a character c in L corresponds to the k-th occurrence of c in F.
The BW-Transform is invertible
Two key properties: (1) we can map L's characters to F's characters (the LF-mapping just described); (2) L[i] precedes F[i] in T, i.e. T = ... L[i] F[i] .... Hence we can reconstruct T backward: start at the row of # and repeatedly prepend L[row], jumping each time to the row given by the LF-mapping.
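A sketch of the inversion: build the LF-mapping from the rank of each character (the k-th c in L is the k-th c in F), then walk backward from the row starting with #:

```python
def ibwt(L):
    """Invert the BWT; '#' is the unique end-marker of T."""
    F = sorted(L)
    first = {}                         # row of the first occurrence of c in F
    for i, c in enumerate(F):
        if c not in first:
            first[c] = i
    seen, LF = {}, []
    for c in L:                        # k-th c in L -> k-th c in F
        LF.append(first[c] + seen.get(c, 0))
        seen[c] = seen.get(c, 0) + 1
    row, chars = 0, []                 # row 0 starts with '#': L[0] is T's last char
    for _ in range(len(L) - 1):
        chars.append(L[row])           # L[row] precedes F[row] in T
        row = LF[row]
    return ''.join(reversed(chars)) + '#'

print(ibwt("ipssm#pissii"))   # mississippi#
```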
You find this at: sources.redhat.com/bzip2/
What is a "compression booster"?
It is a technique that takes a poor compressor A and turns it into a compressor with a better performance guarantee. A memoryless compressor is poor in that it assigns codewords to symbols according only to their frequencies (e.g. Huffman). This incurs some obvious limitations: T = a^n b^n and T' = a random string of length 2n with the same number of a's and b's get compressed to the same size.
The empirical entropy Hk
Hk(T) = (1/|T|) ∑|w|=k |T[w]| H0(T[w]), where T[w] = the string of symbols that precede the occurrences of w in T. Example: given T = mississippi, we have T[i] = mssp and T[is] = ms. Compressing T up to Hk(T) amounts to compressing each T[w] up to its H0, using Huffman or Arithmetic coding on each piece. Problems with this approach: how do we go from all the T[w] back to the string T? (the BWT answers this) And how do we choose the best k efficiently? (a suffix tree or suffix array helps)
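The definition translates directly into code; a sketch that groups, for each context w, the symbols preceding its occurrences:

```python
from collections import Counter, defaultdict
import math

def H0(s):
    n = len(s)
    return sum(c / n * math.log2(n / c) for c in Counter(s).values())

def Hk(T, k):
    """Hk(T) = (1/|T|) * sum over w of |T[w]| * H0(T[w])."""
    groups = defaultdict(list)
    for i in range(1, len(T) - k + 1):
        groups[T[i:i + k]].append(T[i - 1])   # symbol preceding this occurrence of w
    return sum(len(g) * H0(g) for g in groups.values()) / len(T)

T = "mississippi"
print(Hk(T, 2))   # uses e.g. T["is"] = "ms", as in the slide's example
```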
Use BWT to approximate Hk
Remember that Hk(T) = (1/|T|) ∑|w|=k |T[w]| H0(T[w]). In the sorted rotation matrix of T = mississippi#, the rows starting with the same context w are contiguous, so each T[w] is a permutation of a piece of bwt(T). Hence compressing each piece of bwt(T) up to its H0 compresses T up to Hk(T). There is a way to compute the optimal partition.
You find this at Manzini's home page