1
Information Retrieval Space occupancy evaluation
2
Storage analysis
First we will consider the space for the postings (recall that access to them is sequential).
Then we will do the same for the dictionary (recall that access to it is random).
Finally we will analyze the storage of the documents… random access is here crucial for “snippet retrieval”.
3
Information Retrieval Postings storage
4
Recall that… each term maps to the sorted list of the docIDs in which it occurs:
Brutus → 1 2 3 5 8 13 21 34
the → 2 4 8 16 32 64 128
Calpurnia → 13 16
5
Postings: two conflicting forces
A term like Calpurnia occurs in maybe one doc out of a million; hence we would like to store each pointer using about log₂(#docs) bits.
A term like the occurs in virtually every doc, so that number of bits is too much: we would prefer a 0/1 bit vector in this case.
6
Gap-coding for postings
Store the gaps between consecutive docIDs:
Brutus: 33, 47, 154, 159, 202 … → 33, 14, 107, 5, 43 …
Advantages: we store smaller integers, and smaller and smaller ones if the docIDs are clustered.
How much is the saving given by the γ-encoding?
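A minimal Python sketch of the gap transform just described (function names are illustrative, not from the slides):

```python
# Sketch: turn a sorted postings list into gaps and back.

def to_gaps(docids):
    """First docID as-is, then the difference between consecutive docIDs."""
    return [docids[0]] + [b - a for a, b in zip(docids, docids[1:])]

def from_gaps(gaps):
    """Rebuild the docIDs by prefix-summing the gaps."""
    out, acc = [], 0
    for g in gaps:
        acc += g
        out.append(acc)
    return out

brutus = [33, 47, 154, 159, 202]
assert to_gaps(brutus) == [33, 14, 107, 5, 43]
assert from_gaps(to_gaps(brutus)) == brutus
```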
7
γ code for integer encoding
For x > 0, Length = ⌊log₂ x⌋ + 1.
The γ code of x consists of (Length − 1) zeros followed by the binary representation of x; e.g., 9 (binary 1001, Length = 4) is represented as 000 1001.
The γ code of x takes 2⌊log₂ x⌋ + 1 bits (i.e., a factor of 2 from optimal).
Optimal for Pr(x) ≈ 1/(2x²), and i.i.d. integers.
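A small sketch of the γ-encoder as defined above; the helper name is an assumption for illustration.

```python
def gamma_encode(x: int) -> str:
    """γ code of x > 0: (Length - 1) zeros, then the binary representation of x."""
    assert x > 0
    b = bin(x)[2:]                    # Length = len(b) = floor(log2 x) + 1
    return "0" * (len(b) - 1) + b     # 2*floor(log2 x) + 1 bits in total

assert gamma_encode(9) == "0001001"   # 000 followed by 1001
assert gamma_encode(1) == "1"
```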
8
It is a prefix-free encoding…
Given the following sequence of γ-coded integers, reconstruct the original sequence:
0001000001100110000011101100111
Answer: 8, 6, 3, 59, 7 (parsed as 0001000 | 00110 | 011 | 00000111011 | 00111).
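For the exercise above, a sketch of the corresponding γ-decoder (hypothetical helper, same bit convention as the encoder) confirms the reconstructed sequence 8, 6, 3, 59, 7:

```python
def gamma_decode(bits: str) -> list:
    """Decode a concatenation of γ codes (as produced by gamma_encode above)."""
    out, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":                      # count the (Length - 1) leading zeros
            zeros += 1
            i += 1
        out.append(int(bits[i:i + zeros + 1], 2))  # read Length bits: binary(x)
        i += zeros + 1
    return out

assert gamma_decode("0001000001100110000011101100111") == [8, 6, 3, 59, 7]
```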
9
A rough analysis on the Zipf law and γ-coding…
Zipf law: the k-th most frequent term occurs c·n/k times, for a proper constant c depending on the data collection.
γ-coding the k-th term costs: n/k gaps, using about 2 log₂ k + 1 bits for each gap (the average gap is ≈ k); since log is concave, the maximum total cost is (2n/k) log₂ k + (n/k) bits.
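The concavity step spelled out, assuming the n/k gaps g_j of the k-th term sum to at most n (the number of documents), so their average is at most k:

```latex
% Jensen's inequality on the concave log, for the m = n/k gaps g_1,...,g_m with sum at most n:
\sum_{j=1}^{n/k} \bigl( 2\lfloor \log_2 g_j \rfloor + 1 \bigr)
  \;\le\; \frac{n}{k} \Bigl( 2 \log_2 \frac{\sum_j g_j}{n/k} + 1 \Bigr)
  \;\le\; \frac{n}{k} \bigl( 2 \log_2 k + 1 \bigr)
  \;=\; \frac{2n}{k} \log_2 k + \frac{n}{k} \ \text{bits.}
```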
10
Sum over k from 1 to m = 500K
Do this by breaking the values of k into groups: group i consists of 2^(i−1) ≤ k < 2^i. Group i has 2^(i−1) components of the sum, each contributing ≈ (2n/k) log₂ k ≈ 2ni / 2^(i−1). Summing over i from 1 to 19 (500K terms), we get a net estimate of 340 Mbits.
Then add 1 bit per occurrence (there are about 1G of them) because of the +1 in the γ code: 1.34 Gbits ≈ 170 MB.
A flat 20-bit coding would have required 20 Gbits [≈ 2.5 GB].
12
δ code for integer encoding
Use γ-coding to reduce the length of the first field: the δ code of x consists of γ(Length) followed by the binary representation of x without its leading 1; e.g., 19 (binary 10011, Length = 5) is represented as 00101 0011.
Useful for medium-sized integers.
δ-coding x takes about log₂ x + 2 log₂(log₂ x) + 2 bits.
Optimal for Pr(x) ≈ 1/(2x (log x)²), and i.i.d. integers.
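A sketch of the δ-encoder implied by the definition above (γ on the Length field, then the binary representation without its leading 1); names are illustrative:

```python
def gamma_encode(x: int) -> str:          # same helper as in the earlier sketch
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + b

def delta_encode(x: int) -> str:
    """δ code of x > 0: γ(Length), then binary(x) without its leading 1."""
    assert x > 0
    b = bin(x)[2:]                        # Length = floor(log2 x) + 1
    return gamma_encode(len(b)) + b[1:]   # ~ log2 x + 2 log2 log2 x + 2 bits

assert delta_encode(19) == "001010011"    # γ(5) = 00101, then 0011
```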
13
Variable-byte codes [10.2 bits on TREC12]
Wish: very fast (de)compression → byte-alignment.
Given the binary representation of an integer:
append 0s to the front, to get a multiple-of-7 number of bits;
form groups of 7 bits each;
tag each group with a leading bit: 0 for the last group, 1 for the others.
e.g., v = 2^14 + 1, binary(v) = 100000000000001 → 10000001 10000000 00000001
Note: we waste 1 bit per byte (and 4 bits on average in the first byte), but it is a prefix code and it also encodes the value 0!
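A sketch that follows the slide's byte layout (7-bit groups, leading tag bit 1 on every group except the last); it mirrors the v = 2^14 + 1 example and is not meant as the canonical VByte variant:

```python
def vbyte_encode(v: int) -> list:
    """Variable-byte code in the slide's convention: 7-bit groups, each prefixed by a
    tag bit that is 1 on every byte except the last one (which gets 0)."""
    b = bin(v)[2:]                              # binary representation (also handles v = 0)
    b = b.zfill(-(-len(b) // 7) * 7)            # pad with 0s at the front to a multiple of 7
    groups = [b[i:i + 7] for i in range(0, len(b), 7)]
    return ["1" + g for g in groups[:-1]] + ["0" + groups[-1]]

assert vbyte_encode(2**14 + 1) == ["10000001", "10000000", "00000001"]
assert vbyte_encode(0) == ["00000000"]          # the value 0 is encodable too
```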
14
Fixed binary codewords [~8 bits on TREC12]
Goal: fast (de)compression with less wasted space: a fixed number of bits [width] for a varying number of items [span].
Example: 38, 17, 13, 34, 6, 2, 1, 3, 1, 2, 3, 1. A flat binary code needs 6 bits per item, 72 bits in total; the γ-code would need 36 bits.
Greedily split and use fixed-binary codes: <12; (6,4: 38,17,13,34), (3,1: 6), (2,7: 2,1,3,1,2,3,1)>, where width = #bits and span = #items.
What widths and spans are preferable? Every group is forced to fit in 1 machine word!
15
Golomb codes [7.54 bits on TREC12]
It is a parametric code: it depends on k. Set d = 2^(⌊log₂ k⌋+1) − k, so any remainder smaller than d fits in ⌊log₂ k⌋ bits.
For v ≥ 1, compute the quotient q = ⌊(v−1)/k⌋ and the rest r = v − k·q − 1.
Encode q in unary (q zeros followed by a 1); then, if r < d, use ⌊log₂ k⌋ bits to write r (enough), else use ⌈log₂ k⌉ bits to write r + d.
In the second case r ≥ d, so r + d ≥ 2d; consequently the first ⌊log₂ k⌋ bits discriminate the two cases during decompression.
Useful when the integers are concentrated around k. Example: k = 3, v = 9 → d = 1, q = 2, r = 2, code 001 11.
Usually k ≈ 0.69 · mean(v) [Bernoulli model]. Optimal for Pr(x) = p(1−p)^(x−1), where mean(x) = 1/p, and i.i.d. integers.
If k is a power of 2, we get the simpler Rice code.
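A sketch of a Golomb encoder following the formulas above (assuming v ≥ 1 and k ≥ 2); it reproduces the k = 3, v = 9 example:

```python
import math

def golomb_encode(v: int, k: int) -> str:
    """Golomb code of v >= 1 with parameter k >= 2, per the slide's formulas."""
    q, r = (v - 1) // k, (v - 1) % k           # quotient and rest r = v - k*q - 1
    b = math.floor(math.log2(k)) + 1
    d = 2 ** b - k                             # remainders below d fit in b-1 bits
    unary = "0" * q + "1"                      # quotient in unary: q zeros, then a 1
    if r < d:
        return unary + format(r, f"0{b - 1}b")   # short case: floor(log2 k) bits
    return unary + format(r + d, f"0{b}b")       # long case: r + d >= 2d, in b bits

assert golomb_encode(9, 3) == "00111"          # k=3: d=1, q=2, r=2 -> unary 001, then 11
assert golomb_encode(9, 4) == "00100"          # power-of-2 k behaves like a Rice code
```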
16
PFor and PForDelta
[Figure: decompression speed; thick line = decompression R2c, thin line = decompression R2R]
17
Inverted-list (IL) compression
S9 and S16 pack several integers into one 32-bit word, choosing among 9 (resp. 16) possible configurations.
Storing positions costs about a factor of 4 in space, and even more in decompression time because of worse cache usage (measured on 1000 queries).
18
Integer coding vs postings length
S9 and S16 pack several integers into one 32-bit word, choosing among 9 (resp. 16) possible configurations.
19
Information Retrieval Dictionary storage (several incremental solutions)
20
Recall the Heaps Law…
Empirically validated model: V = k·N^b, where b ≈ 0.5, k ≈ 30–100, and N = # tokens.
Some considerations:
V is decreased by case-folding and stemming;
indexing all numbers could make it extremely large (so usually we don't);
spelling errors contribute a fair bit to its size.
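A toy check of Heaps' law; the constants k = 44 and b = 0.49 below are illustrative assumptions within the quoted ranges, not values from the slides:

```python
def heaps_vocabulary(N: float, k: float = 44.0, b: float = 0.49) -> int:
    """Predicted vocabulary size V = k * N^b for a collection of N tokens."""
    return round(k * N ** b)

# about one million distinct terms for a billion-token collection (with these constants)
print(heaps_vocabulary(1_000_000_000))
```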
21
1st solution: basic idea…
Array of fixed-width entries: 500,000 terms × 28 bytes/term = 14 MB (20 bytes for the term, plus 4 bytes each for two fixed fields such as the document frequency and the postings pointer). Lookup by binary search over the array.
Wasteful: the average word is ~8 characters, yet we reserve 20 bytes per term.
22
2nd solution: raw term sequence
Store the dictionary as one (long) string of characters:
….systilesyzygeticsyzygialsyzygyszaibelyiteszczecinszomo….
Keep a pointer to each term and binary search these (sorted) pointers; the pointer to the next word marks the end of the current word.
Hope to save up to 60% of the dictionary space.
23
3rd solution: blocking
Store a pointer to every k-th term of the string (example below: k = 4), and store each term's length (1 extra byte):
….7systile9syzygetic8syzygial6syzygy11szaibelyite8szczecin9szomo….
We save 12 bytes on 3 pointers and lose 4 × 1 bytes on term lengths: a net saving of 8 bytes every 4 dictionary words.
24
Impact of k on search & space
Searching for a dictionary word: binary search down to the 4-term block, then linear search through the terms in the block.
Increasing k slows down the linear scan but reduces the dictionary space occupancy (here to ~8 MB).
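A sketch of the blocked dictionary string described in the last two slides: one length byte per term, one pointer per block of k terms, and a lookup that binary searches the block pointers and then scans linearly inside the block. Layout and names are assumptions for illustration:

```python
def build_blocked(terms, k=4):
    """One long byte string: a 1-byte length before each term; one pointer per k terms."""
    s, ptrs = bytearray(), []
    for i, t in enumerate(sorted(terms)):
        if i % k == 0:
            ptrs.append(len(s))                # pointer to the first term of the block
        s.append(len(t))                       # 1 length byte (terms shorter than 256 chars)
        s += t.encode()
    return bytes(s), ptrs

def lookup(s, ptrs, k, query):
    """Binary search on the block pointers, then a linear scan inside the k-term block."""
    def first_term(p):
        return s[p + 1 : p + 1 + s[p]].decode()
    lo, hi = 0, len(ptrs) - 1
    while lo < hi:                             # find the last block whose first term <= query
        mid = (lo + hi + 1) // 2
        if first_term(ptrs[mid]) <= query:
            lo = mid
        else:
            hi = mid - 1
    p = ptrs[lo]
    for _ in range(k):                         # linear scan of (at most) k terms
        if p >= len(s):
            return False
        term, p = s[p + 1 : p + 1 + s[p]].decode(), p + 1 + s[p]
        if term == query:
            return True
    return False

terms = ["systile", "syzygetic", "syzygial", "syzygy",
         "szaibelyite", "szczecin", "szomo"]
s, ptrs = build_blocked(terms, k=4)
assert lookup(s, ptrs, 4, "szczecin") and not lookup(s, ptrs, 4, "syzygies")
```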
25
4th solution: front coding
Idea: sorted words commonly share a long common prefix, so within a block of k terms store only the difference with respect to the previous term; the first term of the block is written in full.
Example: 8automata 8automate 9automatic 10automation becomes 8automata 1e 1ic 1on, where each later entry stores the suffix length to drop from the previous term followed by the new suffix to append (e.g., 1e encodes automate: drop 1 character from automata, append e).
Lucene stores blocks of k = 128.
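A sketch of front coding matching the example above (first term of the block in full, then drop-count/suffix pairs); the function names are made up:

```python
def front_encode(terms):
    """One block: the first term in full, then (chars-to-drop, new-suffix) pairs."""
    first, entries, prev = terms[0], [], terms[0]
    for t in terms[1:]:
        lcp = 0
        while lcp < min(len(prev), len(t)) and prev[lcp] == t[lcp]:
            lcp += 1                                 # longest common prefix with prev term
        entries.append((len(prev) - lcp, t[lcp:]))   # suffix length to drop, suffix to add
        prev = t
    return first, entries

def front_decode(first, entries):
    terms, prev = [first], first
    for drop, suffix in entries:
        prev = prev[:len(prev) - drop] + suffix
        terms.append(prev)
    return terms

block = ["automata", "automate", "automatic", "automation"]
first, entries = front_encode(block)
assert entries == [(1, "e"), (1, "ic"), (1, "on")]   # as on the slide: 1e, 1ic, 1on
assert front_decode(first, entries) == block
```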