Indexing

Overview of the Talk
- Inverted file indexing
- Compression of inverted files
- Signature files and bitmaps
- Comparison of indexing methods
- Conclusion

Inverted File Indexing
- An inverted file index:
  - contains a list of the terms that appear in the document collection (called the lexicon or vocabulary), and
  - for each term in the lexicon, stores a list of pointers to all occurrences of that term in the document collection. This list is called an inverted list.
- The granularity of an index determines how accurately it represents the location of a word:
  - A coarse-grained index requires less storage but more query processing to eliminate false matches.
  - A word-level index enables queries involving adjacency and proximity, but has higher space requirements.
- The usual granularity is document-level, unless a significant fraction of the queries are expected to be proximity-based.

Inverted File Index: Example

Doc  Text
1    Pease porridge hot, pease porridge cold,
2    Pease porridge in the pot,
3    Nine days old.
4    Some like it hot, some like it cold,
5    Some like it in the pot,
6    Nine days old.

Term      Documents
cold      1, 4
days      3, 6
hot       1, 4
in        2, 5
it        4, 5
like      4, 5
nine      3, 6
old       3, 6
pease     1, 2
porridge  1, 2
pot       2, 5
some      4, 5
the       2, 5

Notation:
N: number of documents (= 6)
n: number of distinct terms (= 13)
f: number of index pointers (= 26)
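
As a sketch of how such a document-level index might be built (the code and names below are illustrative, not from the talk), the collection above can be inverted in a few lines of Python:

```python
from collections import defaultdict

# Terms are already case-folded and stripped of punctuation.
docs = {
    1: "pease porridge hot pease porridge cold",
    2: "pease porridge in the pot",
    3: "nine days old",
    4: "some like it hot some like it cold",
    5: "some like it in the pot",
    6: "nine days old",
}

# Map each term to the set of documents containing it; document-level
# granularity means duplicates within a document collapse.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

inverted_lists = {t: sorted(ds) for t, ds in sorted(index.items())}
print(inverted_lists["cold"])  # [1, 4]
```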

Inverted File Compression

Each inverted list has the form ⟨f_t; d_1, d_2, ..., d_{f_t}⟩. A naive representation, storing each document number in ⌈log₂ N⌉ bits, results in a storage overhead of f · ⌈log₂ N⌉ bits.

The list can also be stored as ⟨f_t; d_1, d_2 - d_1, ..., d_{f_t} - d_{f_t-1}⟩. Each difference is called a d-gap. Since the d-gaps in a list sum to at most N, each pointer can be coded in fewer than ⌈log₂ N⌉ bits using a suitable variable-length code.

Assume the d-gap representation for the rest of the talk, unless stated otherwise.
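
A minimal sketch of the d-gap transformation, with illustrative helper names:

```python
def to_dgaps(postings):
    """Convert ascending document numbers to d-gaps: d1, d2-d1, ..."""
    prev, gaps = 0, []
    for d in postings:
        gaps.append(d - prev)
        prev = d
    return gaps

def from_dgaps(gaps):
    """Invert the transformation by taking a running sum."""
    postings, total = [], 0
    for g in gaps:
        total += g
        postings.append(total)
    return postings

assert to_dgaps([8, 9, 11, 12, 13]) == [8, 1, 2, 1, 1]
assert from_dgaps([8, 1, 2, 1, 1]) == [8, 9, 11, 12, 13]
```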

Text Compression

Two classes of text compression methods:
- Symbolwise (or statistical) methods
  - Estimate probabilities of symbols (the modeling step)
  - Code one symbol at a time (the coding step)
  - Use shorter codes for the most likely symbols
  - Usually based on either arithmetic or Huffman coding
- Dictionary methods
  - Replace fragments of text with a single code word, typically an index to an entry in the dictionary, e.g. Ziv-Lempel coding, which replaces strings of characters with a pointer to a previous occurrence of the string
  - No probability estimates needed

Symbolwise methods are better suited for coding d-gaps.

Models

[Diagram: text → (model, encoder) → compressed text → (model, decoder) → text]

The information content of a symbol s, denoted I(s), is given by Shannon's formula: I(s) = -log₂ Pr[s]. Entropy, the average amount of information per symbol over the whole alphabet, denoted H, is H = Σ_s Pr[s] · I(s) = -Σ_s Pr[s] · log₂ Pr[s].

Models can be static, semi-static, or adaptive.
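
For instance, Shannon's formula can be applied directly to the probabilities used in the Huffman example below (a small illustrative computation):

```python
import math

def information(p):
    """Shannon information content of a symbol with probability p, in bits."""
    return -math.log2(p)

def entropy(probs):
    """Average information per symbol over the whole alphabet."""
    return sum(p * information(p) for p in probs if p > 0)

# Probabilities of A..G from the Huffman example:
probs = [0.05, 0.05, 0.1, 0.2, 0.3, 0.2, 0.1]
print(round(entropy(probs), 3))  # about 2.546 bits per symbol
```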

Huffman Coding: Example

[Figure: step-by-step construction of a Huffman tree for the alphabet A-G with probabilities A 0.05, B 0.05, C 0.1, D 0.2, E 0.3, F 0.2, G 0.1. The two least probable subtrees are merged repeatedly until a single tree remains; the final slide tabulates each symbol with its probability and resulting codeword.]
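
A minimal sketch of the greedy construction behind the example, assuming the standard merge-the-two-least-probable algorithm (helper names are mine):

```python
import heapq
from itertools import count

def huffman_codes(freqs):
    """Repeatedly merge the two least probable subtrees.
    Returns a {symbol: bitstring} map."""
    tiebreak = count()  # avoids comparing dicts when probabilities tie
    heap = [(p, next(tiebreak), {s: ""}) for s, p in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, left = heapq.heappop(heap)
        p2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (p1 + p2, next(tiebreak), merged))
    return heap[0][2]

codes = huffman_codes({"A": 0.05, "B": 0.05, "C": 0.1, "D": 0.2,
                       "E": 0.3, "F": 0.2, "G": 0.1})
```

The exact codewords depend on how ties between equal probabilities are broken, but any valid Huffman code for this distribution has the same optimal expected length.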

Huffman Coding: Conclusions

Arithmetic Coding: Example

[Figure: interval subdivision for coding the string "bccb" over the alphabet {a, b, c} with an adaptive model. Initially Pr[a] = Pr[b] = Pr[c] = 1/3; the interval used to code b leaves probabilities 1/4, 2/4, 1/4; the interval used to code the first c leaves 1/5, 2/5, 2/5; after the second c the probabilities are 1/6, 2/6, 3/6. The final interval represents the whole output, transmitted as the code 0.64.]

Arithmetic Coding: Conclusions
- High-probability events do not reduce the size of the interval in the next step very much, whereas low-probability events do.
- A small final interval requires many digits to specify a number guaranteed to be in the interval.
- The number of bits required is proportional to the negative logarithm of the size of the interval.
- A symbol s of probability Pr[s] contributes -log Pr[s] bits to the output.

Arithmetic coding produces near-optimal codes, given an accurate model.
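
A sketch of the interval narrowing used in the example, assuming the adaptive model with all symbol counts initialized to 1 (helper names are mine):

```python
from fractions import Fraction

def narrow_interval(message, alphabet):
    """Adaptive arithmetic coding, modelling step only: narrow [low, high)
    for each symbol, with counts that start at 1 and grow as symbols are
    seen (the model used in the example above)."""
    counts = {s: 1 for s in alphabet}
    low, high = Fraction(0), Fraction(1)
    for sym in message:
        total = sum(counts.values())
        span = high - low
        # cumulative probability mass strictly below `sym`
        below = sum(counts[s] for s in alphabet if s < sym)
        low, high = (low + span * Fraction(below, total),
                     low + span * Fraction(below + counts[sym], total))
        counts[sym] += 1  # adapt the model
    return low, high

low, high = narrow_interval("bccb", "abc")
print(float(low), float(high))
```

Running this reproduces the example's final interval, approximately [0.639, 0.65), which contains the transmitted code 0.64.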

Inverted File Compression This can also be stored as Each difference is called a d-gap. Since Each inverted list has the form A naive representation results in a storage overhead of each pointer requires fewer than bits.

Methods for Inverted File Compression
- Methods for compressing d-gap sizes can be classified into:
  - global: every list is compressed using the same model
  - local: the model for compressing an inverted list is adjusted according to some parameter, like the frequency of the term
- Global methods can be divided into:
  - non-parameterized: the probability distribution for d-gap sizes is predetermined
  - parameterized: the probability distribution is adjusted according to certain parameters of the collection
- By definition, local methods are parameterized.

Non-parameterized models

Unary code: an integer x > 0 is coded as (x - 1) '1' bits followed by a '0' bit.

γ code: x is coded as a unary code for 1 + ⌊log₂ x⌋, followed by a code of ⌊log₂ x⌋ bits that represents x - 2^⌊log₂ x⌋ in binary.

δ code: like γ, except that the number of bits in the binary representation of x is itself represented using the γ code rather than unary. For small integers, δ codes are longer than γ codes, but for large integers the situation reverses.
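
A bitstring sketch of the unary and γ codes as defined above (a δ coder would be identical except for γ-coding the prefix):

```python
def unary(x):
    """Code x > 0 as (x - 1) '1' bits followed by a '0' bit."""
    return "1" * (x - 1) + "0"

def gamma(x):
    """Elias gamma: unary code for 1 + floor(log2 x), followed by
    floor(log2 x) bits giving x - 2**floor(log2 x) in binary."""
    b = x.bit_length() - 1                      # floor(log2 x)
    tail = format(x - (1 << b), "b").zfill(b) if b else ""
    return unary(b + 1) + tail

assert [gamma(x) for x in (1, 2, 3, 4, 5)] == ["0", "100", "101", "11000", "11001"]
```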

Non-parameterized models

Each code has an implied probability distribution, which can be derived using Shannon's formula: a codeword of length ℓ(x) implies Pr[x] = 2^(-ℓ(x)). Unary thus implies Pr[x] = 2^(-x), γ implies Pr[x] ≈ 1/(2x²), and δ implies Pr[x] ≈ 1/(2x(log x)²). The probability assumed by unary is too small: it decays far too quickly for typical d-gap distributions.

Global parameterized models

Probability that a random document contains a random term: p = f / (N · n).

Assuming a Bernoulli process, the probability of a d-gap of size x is geometric: Pr[x] = (1 - p)^(x-1) · p.

Arithmetic coding: code each gap directly against this distribution.

Huffman-style coding (Golomb coding): choose a parameter b (roughly 0.69/p); code x > 0 as the quotient q = ⌊(x - 1)/b⌋ in unary, followed by the remainder r = x - q·b - 1 in minimal binary.
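
A sketch of a Golomb coder under these definitions; the parameter choice below uses the common b ≈ 0.69/p approximation, and the collection statistics are taken from the earlier example (helper names are mine):

```python
import math

def golomb(x, b):
    """Golomb code for x > 0: quotient q = (x-1)//b in unary,
    remainder r = x - q*b - 1 in minimal binary."""
    q, r = divmod(x - 1, b)
    out = "1" * q + "0"                 # unary part: q ones, then a zero
    k = (b - 1).bit_length()            # ceil(log2 b); 0 when b == 1
    threshold = (1 << k) - b            # first `threshold` remainders get k-1 bits
    if k == 0:
        tail = ""                       # b == 1: remainder needs no bits
    elif r < threshold:
        tail = format(r, "b").zfill(k - 1)
    else:
        tail = format(r + threshold, "b").zfill(k)
    return out + tail

p = 26 / (6 * 13)                       # p = f / (N * n) from the example
b = max(1, round(math.log(2) / p))      # b ~ 0.69 / p, here b = 2
print(b, golomb(8, b))                  # "2 11101"
```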

Global observed frequency model
- Use the exact observed distribution of d-gap values, coded with arithmetic or Huffman coding
- Only slightly better than the γ or δ code
- Reason: pointers are not scattered randomly in the inverted file
- Local methods are needed for any real improvement

Local methods
- Local Bernoulli
  - Use a different p for each inverted list
  - Use the γ code for storing the list's parameter (the term frequency f_t, from which p is derived)
- Skewed Bernoulli
  - The local Bernoulli model is bad for clusters
  - Use a cross between γ and Golomb, with b = median gap size
  - Need to store b (use the γ representation)
  - This is still a static model

What is needed is an adaptive model that is good for clusters.

Interpolative code

Consider an inverted list in which documents 8, 9, 11, 12 and 13 form a cluster. The interpolative code processes the list recursively from the middle outward: each pointer is coded with a minimal binary code over the range of values it can possibly take given the pointers already coded. Within a dense cluster these ranges shrink rapidly, and a pointer whose range contains exactly as many values as there are pointers left to code costs zero bits, so one can do better than the static codes with a minimal binary code.
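
A sketch of the recursion, with the minimal binary output step left abstract; the outer bounds (documents numbered 1..20) are an illustrative assumption:

```python
def interpolative(postings, lo, hi):
    """Binary interpolative coding: emit the middle pointer first over
    its feasible range, then recurse on each half. When the range equals
    the number of remaining pointers, nothing need be emitted."""
    out = []
    def encode(lst, lo, hi):
        if not lst:
            return
        if hi - lo + 1 == len(lst):
            return                      # pointers are forced: 0 bits
        mid = len(lst) // 2
        x = lst[mid]
        # x must leave room for `mid` pointers below it and the rest above
        out.append((x, lo + mid, hi - (len(lst) - mid - 1)))
        encode(lst[:mid], lo, x - 1)
        encode(lst[mid + 1:], x + 1, hi)
    encode(postings, lo, hi)
    return out  # each (value, low, high) is then coded in minimal binary

# Cluster from the example, in a collection of N = 20 documents:
print(interpolative([8, 9, 11, 12, 13], 1, 20))
```

On the cluster above, document 12 is coded in zero bits: once 11 and 13 are known, its value is forced.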

Performance of index compression methods

[Table: compression of inverted files, in bits per pointer, for the global and local methods above across several test collections.]

Signature Files
- Each document is given a signature that captures its content:
  - Hash each document term to get several hash values
  - Set the bits corresponding to those values to 1
- Query processing:
  - Hash each query term to get several hash values
  - If a document has all the bits corresponding to those values set to 1, it may contain the query term
- Reducing false matches:
  - Set several bits for each term
  - Make the signatures sufficiently long
- A naive representation may have to read the entire signature file for each query term
- Use bitslicing (storing the signature matrix column-wise) to save on disk transfer time
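
A toy sketch of signature construction and querying; the hash choice, signature width, and bits-per-term below are illustrative assumptions:

```python
import hashlib

WIDTH = 64   # signature width in bits
K = 3        # bits set per term

def term_bits(term):
    """Hash a term to up to K bit positions (illustrative hash choice)."""
    h = hashlib.md5(term.encode()).digest()
    return {int.from_bytes(h[2 * i:2 * i + 2], "big") % WIDTH for i in range(K)}

def signature(doc_terms):
    """OR together the bit patterns of all terms in the document."""
    sig = 0
    for t in doc_terms:
        for b in term_bits(t):
            sig |= 1 << b
    return sig

def may_contain(sig, term):
    """True if every bit for `term` is set; false matches are possible."""
    return all(sig & (1 << b) for b in term_bits(term))

sig = signature(["pease", "porridge", "hot"])
print(may_contain(sig, "porridge"))  # True
print(may_contain(sig, "nine"))      # False, unless it is a false match
```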

Signature files: Conclusion
- Design involves many tradeoffs:
  - Wide, sparse signatures reduce the number of false matches
  - Short, dense signatures require more disk accesses
- For reasonable query times, signature files require more space than a compressed inverted file
- Inefficient for documents of varying sizes:
  - Blocking makes simple queries difficult to answer
- Text is not random

Bitmaps
- Simple representation: for each term in the lexicon, store a bitvector of length N; a bit is set if and only if the corresponding document contains the term
- Efficient for boolean queries
- Enormous storage requirement, even after removing stop words
- Have been used to represent common words
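
A tiny sketch of why boolean queries are cheap over bitmaps, using documents from the Pease porridge example:

```python
# Bitvectors as Python ints: bit d-1 is set iff document d contains the term.
bitmap = {
    "some": 0b011000,   # documents 4 and 5 (bit 0 = document 1)
    "hot":  0b001001,   # documents 1 and 4
}

# The conjunctive query "some AND hot" is a single bitwise AND:
hits = bitmap["some"] & bitmap["hot"]
print([d + 1 for d in range(6) if hits >> d & 1])  # [4]
```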

Compression of signature files and bitmaps
- Signature files are already in compressed form:
  - Decompression affects query time substantially
  - Lossy compression results in false matches
- Bitmaps can be compressed by a significant amount, e.g. with a hierarchical code in which each level records which blocks of the level below are non-zero. Example compressed code: 1100 : 0101, 1010 : 0010, 0011, 1000, 0100

Comparison of indexing methods
- All indexing methods are variations of the same basic idea!
- Signature files and inverted files require an order of magnitude less secondary storage than bitmaps
- Signature files cause unnecessary accesses to the document collection unless the signature width is large
- Signature files are disastrous when record lengths vary a lot
- Advantages of signature files:
  - No need to keep the lexicon in memory
  - Better for conjunctive queries involving common terms

Compressed inverted files are the most useful method for indexing a collection of variable-length text documents.

Conclusion
- For practical purposes, the best index compression algorithm is the local Bernoulli method, implemented with Golomb coding
- In practice, compressed inverted indices almost always beat signature files and bitmaps, in terms of both space and query response time