Indexing. Overview of the Talk: Inverted File Indexing; Compression of inverted files; Signature files and bitmaps; Comparison of indexing methods; Conclusion.


1 Indexing

2 Overview of the Talk
- Inverted File Indexing
- Compression of inverted files
- Signature files and bitmaps
- Comparison of indexing methods
- Conclusion

3 Inverted File Indexing
- An inverted file index
  - contains a list of terms that appear in the document collection (called a lexicon or vocabulary)
  - and, for each term in the lexicon, stores a list of pointers to all occurrences of that term in the document collection. This list is called an inverted list.
- The granularity of an index determines how accurately it represents the location of a word
  - A coarse-grained index requires less storage but more query processing to eliminate false matches
  - A word-level index enables queries involving adjacency and proximity, but has higher space requirements
Usual granularity is document-level, unless a significant fraction of the queries are expected to be proximity-based.

4 Inverted File Index: Example

Doc  Text
1    Pease porridge hot, pease porridge cold,
2    Pease porridge in the pot,
3    Nine days old.
4    Some like it hot, some like it cold,
5    Some like it in the pot,
6    Nine days old.

Term      Documents
---------------------------
cold      1, 4
days      3, 6
hot       1, 4
in        2, 5
it        4, 5
like      4, 5
nine      3, 6
old       3, 6
pease     1, 2
porridge  1, 2
pot       2, 5
some      4, 5
the       2, 5

Notation:
N: number of documents (= 6)
n: number of distinct terms (= 13)
f: number of index pointers (= 26)
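
The document-level index for the six example documents can be rebuilt in a few lines. This is a minimal Python sketch and not part of the original talk: the names `DOCS` and `build_index` and the tokenization rule (lowercase, runs of letters only) are assumptions made for illustration.

```python
import re
from collections import defaultdict

# The six example documents from the slide.
DOCS = {
    1: "Pease porridge hot, pease porridge cold,",
    2: "Pease porridge in the pot,",
    3: "Nine days old.",
    4: "Some like it hot, some like it cold,",
    5: "Some like it in the pot,",
    6: "Nine days old.",
}

def build_index(docs):
    """Map each term to the sorted list of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Assumed tokenization: lowercase, keep runs of letters.
        for term in re.findall(r"[a-z]+", text.lower()):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in sorted(index.items())}

index = build_index(DOCS)
N = len(DOCS)                                 # number of documents
n = len(index)                                # number of distinct terms
f = sum(len(ids) for ids in index.values())   # number of index pointers

for term, ids in index.items():
    print(f"{term:<10}{ids}")
print(N, n, f)
```

Because each term stores one pointer per document rather than per occurrence, this corresponds to the document-level granularity described on the previous slide.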

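5 Inverted File Compression
Each inverted list has the form $\langle f_t;\ d_1, d_2, \ldots, d_{f_t} \rangle$, where $f_t$ is the number of documents containing term $t$ and $d_1 < d_2 < \cdots < d_{f_t}$ are their document numbers.
A naïve representation stores every pointer in $\lceil \log_2 N \rceil$ bits, a storage overhead of $f \cdot \lceil \log_2 N \rceil$ bits.
This can also be stored as $\langle f_t;\ d_1, d_2 - d_1, \ldots, d_{f_t} - d_{f_t - 1} \rangle$. Each difference is called a d-gap.
Since the d-gaps of a list sum to at most $N$, each pointer requires fewer than $\lceil \log_2 N \rceil$ bits on average.
Assume d-gap representation for the rest of the talk, unless stated otherwise.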
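
The d-gap transformation is easy to check on a small list. The following Python sketch is illustrative only: the function names and the sample list `postings` are hypothetical and do not come from the talk.

```python
from itertools import accumulate

def to_gaps(doc_ids):
    """Replace ascending document numbers by differences (d-gaps)."""
    gaps = []
    previous = 0
    for d in doc_ids:
        gaps.append(d - previous)
        previous = d
    return gaps

def from_gaps(gaps):
    """Recover document numbers as running sums of the d-gaps."""
    return list(accumulate(gaps))

postings = [3, 8, 9, 11, 12, 13, 17]   # hypothetical inverted list
gaps = to_gaps(postings)
print(gaps)
print(from_gaps(gaps) == postings)
```

The gaps are never larger than the document numbers they replace, and in a clustered list most of them are small, which is what the codes on the following slides exploit.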

6 Text Compression
Two classes of text compression methods
- Symbolwise (or statistical) methods
  - Estimate probabilities of symbols (modeling step)
  - Code one symbol at a time (coding step)
  - Use shorter codes for the most likely symbols
  - Usually based on either arithmetic or Huffman coding
- Dictionary methods
  - Replace fragments of text with a single code word (typically an index to an entry in the dictionary), e.g. Ziv-Lempel coding, which replaces strings of characters with a pointer to a previous occurrence of the string
  - No probability estimates needed
Symbolwise methods are more suited for coding d-gaps.

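7 Models
[Figure: text -> encoder -> compressed text -> decoder -> text; the encoder and the decoder each use a model]
The information content of a symbol $s$, denoted by $I(s)$, is given by Shannon's formula: $I(s) = -\log_2 \Pr[s]$ bits.
Entropy, or the average amount of information per symbol over the whole alphabet, denoted $H$, is given by $H = \sum_s \Pr[s] \cdot I(s) = -\sum_s \Pr[s] \log_2 \Pr[s]$.
Models can be static, semi-static or adaptive.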

8 Huffman Coding: Example
Symbol probabilities: A 0.05, B 0.05, C 0.1, D 0.2, E 0.3, F 0.2, G 0.1

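9 Huffman Coding: Example
[Figure: first step of the tree construction. The two least probable symbols, A (0.05) and B (0.05), are joined under a new node of weight 0.1; C 0.1, D 0.2, E 0.3, F 0.2 and G 0.1 are still unmerged]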

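10 Huffman Coding: Example
[Figure: completed Huffman tree. Internal node weights: 0.1 (A, B), 0.2 (A, B, C), 0.4 (A, B, C, D), 0.3 (F, G), 0.6 (E, F, G) and 1.0 (root); the two branches below each node are labelled 0 and 1]

Symbol  Code  Probability
A       0000  0.05
B       0001  0.05
C       001   0.1
D       01    0.2
E       10    0.3
F       110   0.2
G       111   0.1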
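
The construction in this example can be reproduced with a priority queue. The sketch below is an illustration in Python, not the speaker's code; `PROBS` restates the slide's probabilities as exact fractions, and `huffman_code` is a hypothetical helper. Ties between equal weights may be broken differently from the slide, so individual codewords can differ from the table, but any Huffman code for these probabilities has the same average length.

```python
import heapq
import math
from fractions import Fraction

# Probabilities from the slide, as exact fractions of 100.
PROBS = {s: Fraction(p, 100) for s, p in
         {"A": 5, "B": 5, "C": 10, "D": 20, "E": 30, "F": 20, "G": 10}.items()}

def huffman_code(probs):
    """Repeatedly merge the two lightest nodes; return {symbol: codeword}."""
    # Heap entries are (weight, tie_breaker, {symbol: partial codeword}).
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(sorted(probs.items()))]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        w0, _, left = heapq.heappop(heap)
        w1, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w0 + w1, counter, merged))
        counter += 1
    return heap[0][2]

code = huffman_code(PROBS)
average_length = sum(PROBS[s] * len(c) for s, c in code.items())
entropy = -sum(float(p) * math.log2(float(p)) for p in PROBS.values())

print(code)
print(float(average_length), entropy)
```

The average length of 2.6 bits per symbol matches the code in the table above and sits slightly above the entropy of the distribution, as expected for a code that spends a whole number of bits on each symbol.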

11 Huffman Coding: Conclusions

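12 Arithmetic Coding
Alphabet = {a, b, c}; String = bccb; Code = 0.64
[Figure: the current interval is narrowed once per coded symbol, and the symbol probabilities are updated after each step]
- Pr[a] = 1/3, Pr[b] = 1/3, Pr[c] = 1/3: the interval 0.0000 to 1.0000 is split at 0.3333 and 0.6667. Interval used to code b: 0.3333 to 0.6667.
- Pr[a] = 1/4, Pr[b] = 2/4, Pr[c] = 1/4: split at 0.4167 and 0.5834. Interval used to code c: 0.5834 to 0.6667.
- Pr[a] = 1/5, Pr[b] = 2/5, Pr[c] = 2/5: split at 0.6001 and 0.6334. Interval used to code c: 0.6334 to 0.6667.
- Pr[a] = 1/6, Pr[b] = 2/6, Pr[c] = 3/6: split at 0.6390 and 0.6501. Interval used to code b: 0.6390 to 0.6501. This final interval represents the whole output.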
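
The interval narrowing in this example can be checked with exact rational arithmetic. The Python sketch below assumes the adaptive model implied by the slide's probabilities: every symbol starts with a count of 1, and the count of a symbol is incremented after it is coded. The function name `narrow` is hypothetical, and the sketch only computes the final interval; it does not emit a bit stream.

```python
from fractions import Fraction

def narrow(message, alphabet):
    """Return the final [low, high) interval for message under an adaptive model."""
    counts = {s: 1 for s in alphabet}       # assumed initial counts
    low, high = Fraction(0), Fraction(1)
    for symbol in message:
        total = sum(counts.values())
        width = high - low
        # Cumulative count of the symbols that precede `symbol` in the alphabet.
        below = 0
        for s in alphabet:
            if s == symbol:
                break
            below += counts[s]
        # Compute the new upper end first, because it depends on the old low.
        high = low + width * Fraction(below + counts[symbol], total)
        low = low + width * Fraction(below, total)
        counts[symbol] += 1                 # adapt the model
    return low, high

low, high = narrow("bccb", "abc")
print(low, high, float(low), float(high))
```

The final interval is 23/36 to 13/20, about 0.6389 to 0.65, and the code 0.64 from the slide lies inside it.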

13 Arithmetic Coding: Conclusions
- High-probability events do not reduce the size of the interval in the next step very much, whereas low-probability events do.
- A small final interval requires many digits to specify a number guaranteed to be in the interval.
- Number of bits required is proportional to the negative logarithm of the size of the interval.
- A symbol s of probability Pr[s] contributes -log Pr[s] bits to the output.
Arithmetic coding produces near-optimal codes, given an accurate model.

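14 Inverted File Compression
Each inverted list has the form $\langle f_t;\ d_1, d_2, \ldots, d_{f_t} \rangle$, where $f_t$ is the number of documents containing term $t$ and $d_1 < d_2 < \cdots < d_{f_t}$ are their document numbers.
A naive representation stores every pointer in $\lceil \log_2 N \rceil$ bits, a storage overhead of $f \cdot \lceil \log_2 N \rceil$ bits.
This can also be stored as $\langle f_t;\ d_1, d_2 - d_1, \ldots, d_{f_t} - d_{f_t - 1} \rangle$. Each difference is called a d-gap.
Since the d-gaps of a list sum to at most $N$, each pointer requires fewer than $\lceil \log_2 N \rceil$ bits on average.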

15 Methods for Inverted File Compression
- Methods for compressing d-gap sizes can be classified into
  - global: each list is compressed using the same model
  - local: the model for compressing an inverted list is adjusted according to some parameter, like the frequency of the term
- Global methods can be divided into
  - non-parameterized: probability distribution for d-gap sizes is predetermined
  - parameterized: probability distribution is adjusted according to certain parameters of the collection
- By definition, local methods are parameterized.

16 Non-parameterized models
- Unary code: An integer $x > 0$ is coded as $(x - 1)$ '1' bits followed by a '0' bit.
- γ code: Number $x$ is coded as a unary code for $1 + \lfloor \log_2 x \rfloor$ followed by a code of $\lfloor \log_2 x \rfloor$ bits that represents $x - 2^{\lfloor \log_2 x \rfloor}$ in binary.
- δ code: The number of bits in binary, $1 + \lfloor \log_2 x \rfloor$, is represented using the γ code instead of unary.
For small integers, δ codes are longer than γ codes, but for large integers, the situation reverses.
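
The three codes can be written down directly from their definitions. This is a minimal Python sketch producing bit strings for readability; the helper names are assumptions, and a real index would pack the bits rather than build strings.

```python
def unary(x):
    """x > 0 coded as (x - 1) one-bits followed by a zero-bit."""
    return "1" * (x - 1) + "0"

def to_bits(value, width):
    """value as a binary string of exactly `width` bits (empty if width is 0)."""
    return format(value, "b").zfill(width) if width > 0 else ""

def gamma(x):
    """Unary code for 1 + floor(log2 x), then the low floor(log2 x) bits of x."""
    k = x.bit_length() - 1            # floor(log2 x)
    return unary(k + 1) + to_bits(x - (1 << k), k)

def delta(x):
    """Like gamma, but the bit count 1 + floor(log2 x) is itself gamma-coded."""
    k = x.bit_length() - 1
    return gamma(k + 1) + to_bits(x - (1 << k), k)

for x in (1, 2, 9, 1000):
    print(x, gamma(x), delta(x))
```

For x = 2 the δ code (4 bits) is longer than the γ code (3 bits), while for x = 1000 it is shorter (16 bits against 19), which illustrates the crossover mentioned on the slide.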

17 Non-parameterized models Each code has an underlying probability distribution, which can be derived using Shannon’s formula. Probability assumed by unary is too small.
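
As a worked illustration of this point (not taken from the slides): a code that spends $l(x)$ bits on $x$ is ideal for the distribution $\Pr[x] = 2^{-l(x)}$. Applying this to the code lengths of the unary, γ and δ codes, and dropping the floor operations in the approximations, gives the following implied distributions.

```latex
\begin{align*}
\text{unary:}\quad  & l(x) = x
  && \Rightarrow\ \Pr[x] = 2^{-x} \\
\gamma\text{:}\quad & l(x) = 1 + 2\lfloor \log_2 x \rfloor
  && \Rightarrow\ \Pr[x] \approx \frac{1}{2x^2} \\
\delta\text{:}\quad & l(x) = 1 + 2\lfloor \log_2 (1 + \lfloor \log_2 x \rfloor) \rfloor + \lfloor \log_2 x \rfloor
  && \Rightarrow\ \Pr[x] \approx \frac{1}{2x\,(\log_2 2x)^2}
\end{align*}
```

The unary probabilities fall off exponentially, so large gaps are assumed to be far rarer than they are in practice; the γ and δ distributions decay only polynomially.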

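18 Global parameterized models
Probability that a random document contains a random term: $p = f / (N \cdot n)$.
Assuming a Bernoulli process, the probability of a d-gap of size $x$ is $\Pr[x] = (1 - p)^{x - 1}\, p$.
The gaps can then be coded with arithmetic coding or with a Huffman-style code (Golomb coding).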
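
A Golomb code with parameter b codes a gap x as a unary quotient followed by a short binary remainder. The Python sketch below is an illustration under stated assumptions: the helper names are hypothetical, the remainder uses a minimal (truncated) binary code, and `golomb_parameter` uses the standard optimality condition for a geometric distribution, which is not given on the slide.

```python
import math

def unary(x):
    return "1" * (x - 1) + "0"

def to_bits(value, width):
    return format(value, "b").zfill(width) if width > 0 else ""

def golomb(x, b):
    """Golomb code of x > 0: unary(q + 1) followed by the remainder r."""
    q, r = divmod(x - 1, b)
    k = (b - 1).bit_length()          # ceil(log2 b)
    short = (1 << k) - b              # how many remainders get k - 1 bits
    if r < short:
        tail = to_bits(r, k - 1)
    else:
        tail = to_bits(r + short, k)
    return unary(q + 1) + tail

def golomb_parameter(p):
    """Assumed rule: smallest b with (1 - p)**b + (1 - p)**(b + 1) <= 1."""
    return max(1, math.ceil(math.log(2 - p) / -math.log(1 - p)))

# Global p for the six-document example: f = 26 pointers, N = 6, n = 13.
p = 26 / (6 * 13)
b = golomb_parameter(p)
print(b, [golomb(x, b) for x in range(1, 6)])
print(golomb(9, 3), golomb(9, 6))
```

With b = 1 the code reduces to unary; for small p the parameter grows roughly as 0.69 / p.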

19 Global observed frequency model
- Use the observed distribution of d-gap values as the model, and then use arithmetic or Huffman coding
- Only slightly better than γ or δ code
- Reason: pointers are not scattered randomly in the inverted file
- Need local methods for any improvement

20 Local methods
- Local Bernoulli
  - Use a different p for each inverted list
  - Use γ code for storing the parameter of each list
- Skewed Bernoulli
  - Local Bernoulli model is bad for clusters
  - Use a cross between γ and Golomb, with b = median gap size
  - Need to store b (use γ representation)
  - This is still a static model
Need an adaptive model that is good for clusters.

21 Interpolative code
Consider an inverted list in which documents 8, 9, 11, 12 and 13 form a cluster.
Can do better with a minimal binary code.
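
One way to see why clusters help is to code the middle pointer of a list first and then code each half within the range that remains. The Python sketch below illustrates this recursive scheme under several assumptions: the sample list and the collection size of 20 are hypothetical, the function names are invented, and each value is written in a plain fixed-width binary code for its range rather than the minimal binary code the slide mentions.

```python
def width(range_size):
    """Bits needed to distinguish range_size values (0 when only one is possible)."""
    return (range_size - 1).bit_length()

def to_bits(value, nbits):
    return format(value, "b").zfill(nbits) if nbits > 0 else ""

def encode(docs, lo, hi):
    """Code the middle pointer within its feasible range, then recurse on both halves."""
    if not docs:
        return ""
    m = len(docs) // 2
    x = docs[m]
    low = lo + m                          # m smaller pointers must fit below x
    high = hi - (len(docs) - 1 - m)       # the rest must fit above x
    bits = to_bits(x - low, width(high - low + 1))
    return bits + encode(docs[:m], lo, x - 1) + encode(docs[m + 1:], x + 1, hi)

def decode(bits, count, lo, hi, pos=0):
    """Inverse of encode; returns (pointers, next bit position)."""
    if count == 0:
        return [], pos
    m = count // 2
    low = lo + m
    high = hi - (count - 1 - m)
    nbits = width(high - low + 1)
    x = low + (int(bits[pos:pos + nbits], 2) if nbits else 0)
    pos += nbits
    left, pos = decode(bits, m, lo, x - 1, pos)
    right, pos = decode(bits, count - 1 - m, x + 1, hi, pos)
    return left + [x] + right, pos

postings = [3, 8, 9, 11, 12, 13, 17]      # hypothetical list containing the cluster
code = encode(postings, 1, 20)            # assumes documents are numbered 1..20
print(code, len(code))
print(decode(code, len(postings), 1, 20)[0])
```

Once 11 and 13 are known, document 12 is the only possibility left between them and costs zero bits; the whole list takes 17 bits here. The decoder needs the number of pointers and the collection size, which the sketch passes in explicitly.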

22 Performance of index compression methods
[Table: compression of inverted files, in bits per pointer]

23 Signature Files
- Each document is given a signature that captures its content
  - Hash each document term to get several hash values
  - Bits corresponding to those values are set to 1
- Query processing:
  - Hash each query term to get several hash values
  - If a document has all bits corresponding to those values set to 1, it may contain the query term
- False matches can be reduced by
  - setting several bits for each term
  - making the signatures sufficiently long
- Naïve representation: may have to read the entire signature file for each query term
- Use bitslicing to save on disk transfer time
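
The signature scheme above can be sketched in a few lines of Python. Everything specific here is an assumption made for illustration: the signature width of 64 bits, three bits per term, SHA-256 as the hash function, and the helper names. Signatures are stored one per document (the naïve layout), not bitsliced.

```python
import hashlib
import re

WIDTH = 64    # assumed signature width in bits
HASHES = 3    # assumed number of hash values per term

DOCS = {
    1: "Pease porridge hot, pease porridge cold,",
    2: "Pease porridge in the pot,",
    3: "Nine days old.",
    4: "Some like it hot, some like it cold,",
    5: "Some like it in the pot,",
    6: "Nine days old.",
}

def term_bits(term):
    """Bit mask with the (up to) HASHES positions selected by the term's hashes."""
    mask = 0
    for i in range(HASHES):
        digest = hashlib.sha256(f"{i}:{term}".encode()).digest()
        mask |= 1 << (int.from_bytes(digest[:4], "big") % WIDTH)
    return mask

def signature(text):
    """OR together the bit masks of every term in the document."""
    sig = 0
    for term in re.findall(r"[a-z]+", text.lower()):
        sig |= term_bits(term)
    return sig

def candidates(signatures, term):
    """Documents whose signature has all of the term's bits set (may include false matches)."""
    mask = term_bits(term)
    return [doc for doc, sig in signatures.items() if sig & mask == mask]

signatures = {doc: signature(text) for doc, text in DOCS.items()}
print(candidates(signatures, "porridge"))
```

The result always includes every document that really contains the term, but it can also include others, so each candidate still has to be checked against the text.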

24 Signature files: Conclusion
- Design involves many tradeoffs
  - wide, sparse signatures reduce number of false matches
  - short, dense signatures require more disk accesses
- For reasonable query times, requires more space than compressed inverted file
- Inefficient for documents of varying sizes
  - Blocking makes simple queries difficult to answer
- Text is not random

25 Bitmaps
- Simple representation: for each term in the lexicon, store a bitvector of length N. A bit is set if and only if the corresponding document contains the term.
- Efficient for boolean queries
- Enormous amount of storage required, even after removing stop words
- Have been used to represent common words
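
With one bitvector per term, boolean queries become bitwise operations. The Python sketch below is illustrative and not from the talk: it stores each bitvector as an integer in which bit (d - 1) stands for document d, and the helper names and tokenization are assumptions.

```python
import re

DOCS = {
    1: "Pease porridge hot, pease porridge cold,",
    2: "Pease porridge in the pot,",
    3: "Nine days old.",
    4: "Some like it hot, some like it cold,",
    5: "Some like it in the pot,",
    6: "Nine days old.",
}

def build_bitmaps(docs):
    """Map each term to an integer bitvector; bit (d - 1) is set if document d has the term."""
    bitmaps = {}
    for doc_id, text in docs.items():
        for term in re.findall(r"[a-z]+", text.lower()):
            bitmaps[term] = bitmaps.get(term, 0) | (1 << (doc_id - 1))
    return bitmaps

def docs_of(bitmap):
    """List the document numbers whose bits are set."""
    return [d for d in range(1, bitmap.bit_length() + 1) if bitmap >> (d - 1) & 1]

bitmaps = build_bitmaps(DOCS)
hot_and_cold = bitmaps["hot"] & bitmaps["cold"]
some_not_hot = bitmaps["some"] & ~bitmaps["hot"]
print(docs_of(hot_and_cold), docs_of(some_not_hot))
```

Each term costs N bits regardless of how few documents contain it, which is the storage problem noted on the slide.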

26 Compression of signature files and bitmaps
- Signature files are already in compressed form
  - Decompression affects query time substantially
  - Lossy compression results in false matches
- Bitmaps can be compressed by a significant amount
Example (hierarchical compression, 4 bits per group):
  Bitmap:  0000 0010 0000 0011 1000 0000 0100 0000 0000 0000 0000 0000 0000 0000 0000 0000
  Level 2: 0101 1010 0000 0000
  Level 3: 1100
Compressed code: 1100 : 0101, 1010 : 0010, 0011, 1000, 0100
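
The compressed code in this example can be reproduced by summarising the bitmap in groups of four bits, level by level, and keeping only the non-zero groups. The Python sketch below is an assumption-laden illustration: the function names are hypothetical, bitmaps are plain strings, and the length is assumed to be a power of the group size.

```python
def groups_of(bits, size):
    return [bits[i:i + size] for i in range(0, len(bits), size)]

def compress(bitmap, size=4):
    """Return the non-zero groups of each level, top level first."""
    levels = [bitmap]
    # Each higher level has one bit per group below: 1 if that group is non-zero.
    while len(levels[-1]) > size:
        groups = groups_of(levels[-1], size)
        levels.append("".join("1" if "1" in g else "0" for g in groups))
    # All-zero groups are implied by a 0 bit one level up, so they are dropped.
    return [[g for g in groups_of(level, size) if "1" in g]
            for level in reversed(levels)]

bitmap = ("0000" "0010" "0000" "0011" "1000" "0000" "0100" "0000"
          "0000" "0000" "0000" "0000" "0000" "0000" "0000" "0000")
code = compress(bitmap)
print(" : ".join(", ".join(level) for level in code))
```

Here 64 bits shrink to seven 4-bit groups (28 bits); the saving depends on the bitmap being sparse.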

27 Comparison of indexing methods
- All indexing methods are variations of the same basic idea
- Signature files and inverted files require an order of magnitude less secondary storage than bitmaps
- Signature files cause unnecessary access to the document collection unless signature width is large
- Signature files are disastrous when record lengths vary a lot
- Advantages of signature files
  - no need to keep lexicon in memory
  - better for conjunctive queries involving common terms
Compressed inverted files are the most useful for indexing a collection of variable length text documents.

28 Conclusion
- For practical purposes, the best index compression algorithm is the local Bernoulli method (using Golomb coding)
- Compressed inverted indices are better than signature files and bitmaps in most practical situations, in terms of both space and response time for queries

