Compression of Inverted Indexes for Fast Query Evaluation Falk Scholer Hugh Williams John Yiannis Justin Zobel (RMIT University, Melbourne, Australia)

Compression of Inverted Indexes for Fast Query Evaluation Falk Scholer Hugh Williams John Yiannis Justin Zobel (RMIT University, Melbourne, Australia) URL: Published 2002

To conserve storage space and improve query performance, an inverted index can be compressed. An uncompressed inverted index typically consumes over 30% of the space required to store the uncompressed collection. A compressed index can consume between 10% and 15% of the uncompressed collection.

Bitwise and bytewise compression schemes were considered. Of the bitwise compression algorithms, three were considered: Elias gamma coding, Elias delta coding, and Golomb-Rice coding.

Gamma coding is relatively inefficient for storing integers larger than 15. Delta coding is more suited to larger integers. Golomb-Rice coding offers generally more compact storage and faster retrieval of integers than the Elias codes.
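To make the three bitwise codes concrete, here is a minimal sketch that produces each code as a string of bits. The function names are illustrative, and the unary convention for the Rice code (ones terminated by a zero) and its use of non-negative inputs are assumptions, not details taken from the slides.

```python
def elias_gamma(n: int) -> str:
    """Elias gamma code of a positive integer, as a bit string."""
    if n < 1:
        raise ValueError("gamma codes are defined for integers >= 1")
    binary = bin(n)[2:]                       # binary form, always starts with '1'
    return "0" * (len(binary) - 1) + binary   # unary length prefix, then the binary form


def elias_delta(n: int) -> str:
    """Elias delta code: the bit length is gamma-coded, then the low-order bits follow."""
    if n < 1:
        raise ValueError("delta codes are defined for integers >= 1")
    binary = bin(n)[2:]
    return elias_gamma(len(binary)) + binary[1:]   # leading '1' of the value is implied


def rice(n: int, k: int) -> str:
    """Rice code (Golomb code with divisor 2**k) of a non-negative integer."""
    if n < 0 or k < 0:
        raise ValueError("n and k must be non-negative")
    quotient, remainder = n >> k, n & ((1 << k) - 1)
    low_bits = format(remainder, f"0{k}b") if k > 0 else ""
    return "1" * quotient + "0" + low_bits    # unary quotient, then k-bit remainder


# Gamma is shorter for small values, delta for large ones.
for value in (3, 15, 1000):
    print(value, len(elias_gamma(value)), len(elias_delta(value)))
```

Under these definitions, 1000 takes 19 bits in gamma code but 16 bits in delta code, which illustrates why delta coding suits larger integers.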

These three coding schemes can be combined: Golomb codes for document numbers, gamma codes for frequencies, and delta codes for offsets.

Bytewise coding schemes compress each integer to an integral number of bytes. The bytewise enhancements that were tested included aligning integers on byte boundaries and using a signature block to indicate the number of bytes comprising an integer. The use of signature blocks was shown to reduce performance.

The experiments showed that a variable-byte bytewise compression scheme gave better overall performance than the more compact bitwise schemes: query evaluation was twice as fast.
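A variable-byte code can be sketched as follows. The slides do not give the byte layout, so this sketch assumes one common convention: seven data bits per byte, low-order groups first, with the high bit set only on the last byte of each integer. The function names are illustrative.

```python
def vbyte_encode(n: int) -> bytes:
    """Encode a non-negative integer using 7 data bits per byte."""
    if n < 0:
        raise ValueError("only non-negative integers can be encoded")
    out = bytearray()
    while n >= 128:
        out.append(n & 0x7F)    # low 7 bits; high bit clear means more bytes follow
        n >>= 7
    out.append(n | 0x80)        # last byte: high bit set marks the end of the integer
    return bytes(out)


def vbyte_decode(data: bytes) -> list[int]:
    """Decode a concatenation of variable-byte integers."""
    values = []
    current, shift = 0, 0
    for byte in data:
        if byte & 0x80:         # terminating byte: finish this integer
            values.append(current | ((byte & 0x7F) << shift))
            current, shift = 0, 0
        else:                   # continuation byte: accumulate 7 more bits
            current |= byte << shift
            shift += 7
    return values


numbers = [1, 127, 128, 300, 2**20]
encoded = b"".join(vbyte_encode(n) for n in numbers)
print(len(encoded), vbyte_decode(encoded) == numbers)
```

Decoding needs only shifts and masks on whole bytes, with no bit-level buffer management, which is consistent with the speed advantage reported above.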

Incremental Updates of Inverted Lists for Text Document Retrieval Anthony Tomasic, Stanford University Hector Garcia-Molina, Stanford University Kurt Shoens, IBM Almaden URL: Published 1994

The Internet presents us with large, rapidly growing repositories of information, so efficient methods for indexing them and for updating those indexes are necessary. The article presents the properties of, and recommendations for, variations of a dynamic indexing scheme.

The algorithm presented is as follows. Two data structures are present: an inverted index in memory and an inverted index on disk. The in-memory lists are called short lists; they are held in buckets, each of which is allocated a fixed amount of space.

The on-disk lists are called long lists; the amount of space each term's long list occupies is not fixed in advance.

Algorithm: suppose an in-memory list L for word w must be moved to disk. First, if w already has a long list on disk, L is appended to the long list. Otherwise, we assume L is a short list and insert it into bucket h(w). If the bucket is not already in memory, it is read in and L is inserted. If the bucket overflows, we pick a longest short list in the bucket, remove it, and make it a long list, writing it to disk.
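The update step can be sketched as follows. This is a toy model under stated assumptions, not the authors' implementation: the class and method names are hypothetical, bucket capacity is measured in postings, the hash h(w) is taken to be a CRC modulo the number of buckets, a plain dictionary stands in for the long lists on disk, and eviction repeats until the bucket fits.

```python
import zlib


class DualIndex:
    """Toy model of short lists in fixed-size buckets plus long lists 'on disk'."""

    def __init__(self, num_buckets: int, bucket_capacity: int):
        self.bucket_capacity = bucket_capacity               # assumed unit: postings
        self.buckets = [dict() for _ in range(num_buckets)]  # each maps word -> short list
        self.long_lists = {}                                 # stand-in for disk storage

    def _h(self, word: str) -> int:
        # Assumed hash function h(w); any deterministic hash would do.
        return zlib.crc32(word.encode("utf-8")) % len(self.buckets)

    def flush(self, word: str, postings: list) -> None:
        """Move an in-memory list L (postings) for word w out of memory."""
        if word in self.long_lists:                # w already has a long list: append
            self.long_lists[word].extend(postings)
            return
        bucket = self.buckets[self._h(word)]       # otherwise treat L as a short list
        bucket.setdefault(word, []).extend(postings)
        # On overflow, promote a longest short list to a long list.
        while sum(len(p) for p in bucket.values()) > self.bucket_capacity:
            longest = max(bucket, key=lambda w: len(bucket[w]))
            self.long_lists[longest] = bucket.pop(longest)


index = DualIndex(num_buckets=1, bucket_capacity=4)
index.flush("a", [1, 2])
index.flush("b", [3])
index.flush("a", [5, 6])   # bucket overflows, so "a" becomes a long list
index.flush("a", [9])      # now appended directly to the long list
print(index.long_lists, index.buckets)
```

Frequent words quickly outgrow their bucket and migrate to long lists, while rare words stay in the buckets.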

When building indexes, there is a tradeoff between update performance and query performance.

Two index-building styles are described and tested: new and whole. The new style of building an index is best if query performance is not critical. As short lists fill up, they are written to available free blocks on disk. For common words, several long lists may exist. No effort is made to consolidate these lists on disk.

The whole style keeps the long list for each word together. Every time a list is written to disk, it is appended to the existing long list for that word, and the entire long list is copied to a different location if necessary. This style is better for applications where query performance is critical.
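The tradeoff between the two styles can be illustrated with a toy model. The class below is hypothetical: each "chunk" stands for one contiguous region on disk, and the whole style is modelled in its worst case, where every append rewrites the full list. It counts postings written (update cost) and chunks read per query (query cost).

```python
from collections import defaultdict


class LongListStore:
    """Toy model of long lists on disk under the 'new' or 'whole' style."""

    def __init__(self, style: str):
        if style not in ("new", "whole"):
            raise ValueError("style must be 'new' or 'whole'")
        self.style = style
        self.chunks = defaultdict(list)   # word -> list of contiguous chunks
        self.postings_written = 0         # proxy for update cost

    def append(self, word: str, postings: list) -> None:
        if self.style == "new" or not self.chunks[word]:
            # New style (or first write): put the postings in a fresh chunk.
            self.chunks[word].append(list(postings))
            self.postings_written += len(postings)
        else:
            # Whole style, worst case: rewrite the full list as one chunk.
            merged = self.chunks[word][0] + list(postings)
            self.chunks[word] = [merged]
            self.postings_written += len(merged)

    def chunks_to_read(self, word: str) -> int:
        """Proxy for query cost: separate disk regions holding the word's list."""
        return len(self.chunks[word])


new_store, whole_store = LongListStore("new"), LongListStore("whole")
for batch in ([1, 2], [3], [4, 5, 6]):
    new_store.append("w", batch)
    whole_store.append("w", batch)
print(new_store.chunks_to_read("w"), new_store.postings_written)
print(whole_store.chunks_to_read("w"), whole_store.postings_written)
```

In this model the new style writes less but leaves a word's postings scattered across several chunks, while the whole style pays extra on update to keep each list in a single chunk.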