Positional Data Organization and Compression in Web Inverted Indexes

Presentation transcript:

Positional Data Organization and Compression in Web Inverted Indexes
Leonidas Akritidis, Panayiotis Bozanis
Department of Computer & Communication Engineering, University of Thessaly, Greece
23rd International Conference on Database and Expert Systems Applications (DEXA 2012), September 3-7, Vienna, Austria

Introduction
We present a method for organizing and compressing the positional data in Web inverted indexes. The problem is important because the positions enable:
- evaluation of ranked proximity queries;
- evaluation of exact phrase queries.


Block-based index setup
- It allows partial decompression of the docID blocks without touching the rest of the index data (frequencies, positions, etc.).
- This is useful during document-at-a-time (DAAT) and term-at-a-time (TAAT) query processing, where we are interested in locating docIDs smaller than or equal to a given value.
- It requires a skip table to jump forward in the inverted list, and a positions look-up structure (discussed later).
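To make the role of the skip table concrete, here is a minimal sketch. It assumes, for illustration only, that the skip table keeps the largest docID of every block and that blocks hold four postings; the names `build_blocks`, `build_skip_table` and `locate_block` are hypothetical and do not come from the slides. In a real index the blocks would be compressed, and only the located block would be decompressed.

```python
from bisect import bisect_left

def build_blocks(doc_ids, block_size=4):
    """Split a sorted docID list into fixed-size blocks."""
    return [doc_ids[i:i + block_size] for i in range(0, len(doc_ids), block_size)]

def build_skip_table(blocks):
    """One entry per block: the largest docID stored in that block (an assumption)."""
    return [block[-1] for block in blocks]

def locate_block(skip_table, target):
    """Index of the only block that can contain target, or None if target is too large."""
    i = bisect_left(skip_table, target)
    return i if i < len(skip_table) else None

doc_ids = [3, 7, 12, 18, 25, 31, 40, 44, 52, 60]
blocks = build_blocks(doc_ids)
skips = build_skip_table(blocks)
block_index = locate_block(skips, 31)  # only blocks[block_index] needs decoding
```

The binary search touches the skip table only, so frequencies and positions of the skipped blocks are never read.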

Inverted index compression
Many algorithms exist, divided into two categories:
- Individual integer compression methods, applicable to single integer values: Variable Byte, Elias Gamma/Delta, Golomb/Rice.
- Group integer compression methods, applicable to entire bundles of integers: VSEncoding, PForDelta (P4D), Optimized P4D. These are fast, but require decoding the entire block.
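As an illustration of the first category, the sketch below implements one common variant of Variable Byte coding. The byte layout (7 data bits per byte, low-order group first, high bit set on every byte except the last) is an assumption chosen for the example; other layouts are in use, and the slides do not say which one the authors had in mind.

```python
def vbyte_encode(n):
    """Encode a non-negative integer, 7 bits per byte, low-order group first.
    A set high bit means that more bytes follow."""
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)

def vbyte_decode(data, pos=0):
    """Decode one integer starting at byte offset pos; return (value, next_pos)."""
    value, shift = 0, 0
    while True:
        b = data[pos]
        pos += 1
        value |= (b & 0x7F) << shift
        if b < 0x80:  # high bit clear: this was the last byte
            return value, pos
        shift += 7

encoded = vbyte_encode(300)
decoded, next_pos = vbyte_decode(encoded)
```

Each integer is self-delimiting, so a single value can be decoded without decoding a whole block, which is the property that group methods give up in exchange for speed.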

Positional data
- The group encoding methods are ideal for docIDs, but what about the positional data?
- The positional data dominate the index: Each posting includes on average positional values.
- Hence, decoding all the positions for all of the candidate postings is very expensive.

Positional data access during query processing
Query processing optimization: divide the processing into two stages.
- In the first stage, quickly identify the most suitable results by using docIDs and frequencies only.
- In the second stage, perform a more refined processing by accessing the positional data only for the K best results of the first stage.
Evidently, we are only interested in accessing the positions of specific postings, so decoding entire blocks is redundant.
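The two stages can be sketched as follows. The scoring functions are stand-ins chosen for brevity and are assumptions, not the ranking used by the authors: the first stage sums term frequencies, and the second stage ranks by the smallest gap between the two query terms. The `fetch_positions` callback is hypothetical and represents whatever mechanism retrieves the positions of one posting.

```python
import heapq

def first_stage(postings, k):
    """postings: {doc_id: {term: frequency}}. Keep the k documents with the
    highest summed frequency (a placeholder for a real ranking function)."""
    scored = ((sum(freqs.values()), doc_id) for doc_id, freqs in postings.items())
    return [doc_id for _, doc_id in heapq.nlargest(k, scored)]

def min_pair_distance(a, b):
    """Smallest absolute distance between any position in a and any in b."""
    return min(abs(x - y) for x in a for y in b)

def second_stage(candidates, fetch_positions, terms):
    """Re-rank the candidates by term proximity; smaller gaps rank first."""
    t1, t2 = terms
    ranked = [(min_pair_distance(fetch_positions(d, t1), fetch_positions(d, t2)), d)
              for d in candidates]
    return [d for _, d in sorted(ranked)]

postings = {1: {"web": 3, "index": 1}, 2: {"web": 1, "index": 1},
            3: {"web": 2, "index": 4}, 4: {"web": 5, "index": 2}}
positions = {(3, "web"): [4, 20], (3, "index"): [5, 9, 30, 41],
             (4, "web"): [1, 8, 15, 22, 29], (4, "index"): [40, 60]}
fetched = []  # records which postings had their positions accessed

def fetch_positions(doc_id, term):
    fetched.append((doc_id, term))
    return positions[(doc_id, term)]

candidates = first_stage(postings, k=2)
final = second_stage(candidates, fetch_positions, ("web", "index"))
```

In this toy run only documents 3 and 4 ever have their positions fetched, which is why a layout that permits per-posting access pays off.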

Positional Data Organization
- Consequently, the group encoding methods are inappropriate for the positional data.
- Moreover, we do not know exactly where the positional values of a posting are stored.
- For this reason, two strategies have been proposed: creating a tree-based positions look-up structure, or constructing indexed lists.

Positional data look-ups
- The look-up structure requires one look-up per posting, which slows down processing.
- The indexed list requires one pointer per posting, pointing to the respective positional data, which makes it prohibitively expensive.

PFBC: Positions Fixed-Bit Compression
- PFBC is designed to overcome all these problems.
- For each inverted list block, it uses a fixed number of bits to binary-encode each positional value.
- Then, by using the frequency values, we can calculate the exact location of the positions of each posting, without look-ups.
- We decode only the data actually required.
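The location calculation can be written down explicitly. The formula below is a sketch under stated assumptions: the block's pointer R_{B_i} is taken to be a bit offset, postings within a block are numbered from 1, f_k denotes the frequency of the k-th posting of the block, and C_{B_i} is the block's fixed bit width. The symbols start(j) and length(j) are introduced here for illustration only.

```latex
\[
  \mathrm{start}(j) \;=\; R_{B_i} + C_{B_i} \sum_{k=1}^{j-1} f_k ,
  \qquad
  \mathrm{length}(j) \;=\; C_{B_i} \, f_j
\]
```

For example, with frequencies (2, 3, 1), C_{B_i} = 5 and R_{B_i} = 0, the positions of the third posting start at bit 5 * (2 + 3) = 25 and occupy 5 * 1 = 5 bits.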

PFBC: Index organization
- PFBC is compatible with all state-of-the-art inverted list partitioning approaches: Compressed Embedded Skip Lists, skips every 128 postings, skips every postings (F_t: number of documents containing t), etc.
- We store all the positional values contiguously.
- For each inverted list block B_i we store a pointer R_{B_i} to the beginning of the positional data of the postings of block B_i, and a value C_{B_i} representing the number of bits used to encode the positions of block B_i.
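A possible way to derive the two per-block values is sketched below. It assumes that positions are stored as given (not as gaps), that R_{B_i} is a bit offset into one contiguous positions vector, and that C_{B_i} is the bit length of the largest position in the block. `BlockMeta` and `build_block_meta` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class BlockMeta:
    r: int  # bit offset where the block's positional data begins (R_Bi)
    c: int  # bits used per encoded position in this block (C_Bi)

def build_block_meta(blocks):
    """blocks: list of blocks; each block is a list of postings;
    each posting is the list of its positional values."""
    metas, offset = [], 0
    for block in blocks:
        values = [p for posting in block for p in posting]
        # Width needed for the largest value; at least 1 bit so that 0 is storable.
        c = max(max(values).bit_length(), 1) if values else 0
        metas.append(BlockMeta(r=offset, c=c))
        offset += c * len(values)  # blocks are laid out back to back
    return metas

blocks = [
    [[3, 17], [5]],             # block 0: two postings, largest position 17
    [[100, 250, 900], [12]],    # block 1: two postings, largest position 900
]
metas = build_block_meta(blocks)
```

Two small integers per block are enough here because the position of every posting inside the block follows from the frequencies.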


PFBC: Compression Phase
Encode a bundle of K positional values.
- Steps 3-8: Identify the largest positional value p_max.
- Step 9: Calculate C, i.e. the number of bits required to represent all K values.
- Step 13: The function write() stores an integer into a compressed sequence P by using C bits.
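The algorithm listing itself is not part of the transcript, so the following is only a sketch of the steps named above. It assumes that C is the bit length of p_max and that bits are appended most significant bit first; `BitVector` is a hypothetical stand-in for the compressed sequence P, backed by a Python integer for simplicity.

```python
class BitVector:
    """Append-only bit sequence backed by a Python int (most significant bit first)."""

    def __init__(self):
        self.value = 0
        self.length = 0

    def write(self, x, nbits):
        """Append x using exactly nbits bits."""
        if x < 0 or x >= (1 << nbits):
            raise ValueError("value does not fit in the given number of bits")
        self.value = (self.value << nbits) | x
        self.length += nbits

    def to_bitstring(self):
        return format(self.value, "0{}b".format(self.length)) if self.length else ""

def pfbc_encode(positions, out):
    """Encode a non-empty bundle of positions with one fixed width; return that width."""
    p_max = max(positions)            # largest positional value
    c = max(p_max.bit_length(), 1)    # bits needed to represent every value
    for p in positions:
        out.write(p, c)               # every value takes exactly c bits
    return c

P = BitVector()
C = pfbc_encode([3, 17, 5], P)
```

With the bundle [3, 17, 5] the width is 5 bits, and the sequence holds 00011, 10001 and 00101 back to back.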

PFBC: Decompression Phase
Decode the positional data of posting j of block B_i.
- Steps 2-5: Sum up all the frequency values of the previous postings of B_i.
- Step 6: Locate the bit where the positions of j start.
- Step 9: The read() function reads each encoded position from the P vector.
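A matching sketch of the decoding steps follows, again under assumptions: P is modelled as a Python integer with a known total bit length, bits are laid out most significant bit first, r is a bit offset, and postings are indexed from 0. The helper names `pack`, `read` and `pfbc_decode_posting` are hypothetical.

```python
def pack(values, c):
    """Concatenate values using c bits each; return (bits_as_int, total_bits)."""
    acc = 0
    for v in values:
        acc = (acc << c) | v
    return acc, c * len(values)

def read(p, total_bits, start, c):
    """Read c bits of p beginning at bit offset start (counted from the front)."""
    shift = total_bits - start - c
    return (p >> shift) & ((1 << c) - 1)

def pfbc_decode_posting(p, total_bits, r, c, freqs, j):
    """Decode only the positions of posting j (0-based) of one block."""
    preceding = sum(freqs[:j])        # frequencies of the previous postings
    start = r + c * preceding         # bit where the positions of j begin
    return [read(p, total_bits, start + k * c, c) for k in range(freqs[j])]

block = [[3, 17], [5], [9, 20, 31]]           # positions of three postings
freqs = [len(posting) for posting in block]   # [2, 1, 3]
flat = [x for posting in block for x in posting]
p, n = pack(flat, 5)                          # 31 fits in 5 bits
third = pfbc_decode_posting(p, n, r=0, c=5, freqs=freqs, j=2)
```

Only `freqs[j]` reads are issued for the requested posting; the positions of the other postings in the block are never decoded.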

PFBC Gains
- It provides direct access to the positional data without expensive look-ups, so query processing is accelerated.
- It saves the space cost of maintaining a separate look-up structure; the required data is stored within the skip table.
- It needs far fewer pointers than the indexed lists.
- It enables decoding only the information actually needed, without decompressing entire blocks of integers.
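The pointer saving can be illustrated with simple arithmetic. The list length and block size below are made-up figures for the example, and `pointer_counts` is a hypothetical helper; the only facts taken from the slides are that indexed lists keep one pointer per posting while PFBC keeps one pointer per block.

```python
import math

def pointer_counts(num_postings, block_size):
    """Return (pointers for an indexed list, pointers for PFBC)."""
    indexed_list = num_postings                     # one pointer per posting
    pfbc = math.ceil(num_postings / block_size)     # one pointer per block
    return indexed_list, pfbc

indexed, pfbc = pointer_counts(1_000_000, 128)
```

For a list of one million postings split into blocks of 128, this gives 1,000,000 pointers against 7,813.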

Experiments
- Compression experiments: PFBC against OptP4D and VSEncoding.
- Organization experiments: PFBC against the tree look-up structure of [Yan et al.] and the indexed list of [Transier and Sanders].
- Document collection: ClueWeb-09B dataset (~50 million documents, ~1.5 TB).

Experiments: Index size
- PFBC slightly increases the overall index size (by 1%-2%).
- This loss is amortized by the reduced size of the accompanying data structures (skip table, pointers to positions).

Experiments: Query throughput
Recall that query processing is performed in two phases; the positions are retrieved only for the K best results of the first phase.
- PFBC touches far less data.
- PFBC is look-up free.
- PFBC offers ~5 times faster decompression.

Conclusions
- We presented PFBC, a method for effective organization and compression of the positional data in Web inverted indexes.
- PFBC uses a fixed number of bits to encode the positions of an inverted list block.
- The slight increase in the overall index size is amortized by the slight decrease in the sizes of the auxiliary data structures.
- Decompression is much faster.
- No look-ups for the positions are required.
- We decode only the data actually needed.

The end
Thank you for watching! Any questions?