Published by Beverly Flynn. Modified over 9 years ago.
1
Performance of Compressed Inverted Indexes
2
Reasons for Compression
- Compression reduces the size of the index
- Compression can increase the performance of query evaluation operations
3
Factors Affecting Index Performance
- Retrieval time for index lists (index size)
- Complexity of decoding index lists
4
Standard Techniques
- Translate the absolute locations of terms into differences between successive locations
- Use bitwise encoding schemes such as Golomb-Rice or Elias coding
- These techniques usually reduce an index to about 15% of the size of the collection
- Performance is generally equal to or better than that of an uncompressed index
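The two steps above can be sketched in a few lines of Python. This is only an illustration, not the code used in the reviewed articles: the function names are hypothetical, the bit stream is modelled as a string of '0' and '1' characters for readability, and Elias gamma is picked as one example of the Elias family of codes.

```python
def to_gaps(positions):
    """Turn sorted absolute positions into differences between neighbours."""
    gaps = []
    previous = 0
    for position in positions:
        gaps.append(position - previous)
        previous = position
    return gaps


def from_gaps(gaps):
    """Rebuild absolute positions from the stored differences."""
    positions = []
    total = 0
    for gap in gaps:
        total += gap
        positions.append(total)
    return positions


def elias_gamma_encode(n):
    """Elias gamma code of a positive integer, as a string of bits."""
    if n < 1:
        raise ValueError("Elias gamma codes are defined for integers >= 1")
    binary = bin(n)[2:]                       # binary form, leading bit is 1
    return "0" * (len(binary) - 1) + binary   # length prefix in unary, then value


def elias_gamma_decode(bits):
    """Decode a concatenation of Elias gamma codes back into integers."""
    values = []
    i = 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":                 # count the unary length prefix
            zeros += 1
            i += 1
        values.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return values


positions = [3, 8, 9, 20]                     # assumed example data
gaps = to_gaps(positions)                     # [3, 5, 1, 11]
encoded = "".join(elias_gamma_encode(g) for g in gaps)
assert from_gaps(elias_gamma_decode(encoded)) == positions
```

Small gaps get short codes, which is why taking differences first pays off. Decoding has to walk the stream bit by bit, and that is the decoding cost the later slides refer to.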
5
Articles Reviewed
- Compression of Inverted Indexes For Fast Query Evaluation. Scholer, Williams, Yiannis and Zobel, 2002. School of Computer Science and Information Technology, RMIT University, Melbourne, Australia
- Index Compression vs. Retrieval Time of Inverted Files for XML Documents. Fuhr and Gövert, 2002. University of Dortmund, Germany
6
Article 1: Improving Performance
Two techniques were chosen to attempt to improve the performance of compressed indexes:
- Optimization of existing bitwise compression routines
- Implementation of bytewise compression routines
7
Optimized Bitwise Compression Routines
- Improved existing code developed by Williams and Zobel
- Optimized for the Intel / Linux platform
- Decoding speed improved to 60% of that achieved by Williams and Zobel
8
Bytewise Compression Routines
- Integers are stored in standard binary form using only 7 bits of each byte
- Each integer takes up only as many bytes as are needed to store it
- 1 bit per byte is used as a flag to indicate whether that byte is the final byte of the integer
- Decoding the integers is much simpler than decoding the complex bitwise encodings
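The scheme described above can be sketched as follows. The slides do not state the byte order or which bit serves as the flag, so this sketch assumes the least significant 7-bit group comes first and that a set high bit marks the final byte. The function names are hypothetical.

```python
def vbyte_encode(n):
    """Encode a non-negative integer using 7 data bits per byte."""
    if n < 0:
        raise ValueError("only non-negative integers can be encoded")
    out = bytearray()
    while n >= 128:
        out.append(n & 0x7F)      # flag bit clear: more bytes follow
        n >>= 7
    out.append(n | 0x80)          # flag bit set: this is the final byte
    return bytes(out)


def vbyte_decode(data):
    """Decode a concatenation of encoded integers."""
    values = []
    current = 0
    shift = 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:           # final byte of this integer
            values.append(current)
            current = 0
            shift = 0
        else:
            shift += 7
    return values


gaps = [3, 5, 1, 11, 130, 20000]  # assumed example data
stream = b"".join(vbyte_encode(g) for g in gaps)
assert vbyte_decode(stream) == gaps
```

The decoder only needs a mask, a shift and one test per byte, with no bit-level bookkeeping across byte boundaries. The price is that even the smallest value occupies a full byte.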
9
Bitwise vs. Bytewise
- A bytewise-encoded index takes up nearly 20% of the size of the original collection (33% more than with bitwise encodings)
- Bytewise encoding provides query performance that is double that of the optimized bitwise encodings
- Even when the index is small enough to be stored in memory, bytewise encoding shows small improvements over uncompressed indexes
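As a quick consistency check, the size figures agree with one another if the bitwise index is taken to be the "about 15%" quoted under Standard Techniques. Using that figure as the bitwise baseline is an assumption made for this check only:

```latex
\[
  0.15 \times (1 + 0.33) \approx 0.20
\]
```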
10
Article 2: Structured Indexes
- Most IR approaches in the past have ignored the structure and formatting of documents
- The widespread adoption of HTML and XML has created the need for improvements in structured IR
11
Inverted Indexes of XML Documents
- The document structure must be stored in or referenced from the inverted index
- Standard schemes use a Path-In-List (PIL) approach: structure data is stored within the inverted list for each term
- Indexes are generally much larger than the original text when uncompressed
12
Compression of Inverted Lists
Problem: the uncompressed PIL approach generates an index that is too large
Two possible solutions were explored:
- Use bitwise compression schemes to compress the existing PIL representation
- Store only a pointer in the list that points into another data structure that models the document structure
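The difference between the two representations can be made concrete with a small sketch. It does not reproduce the actual data layout used by Fuhr and Gövert. The class names, the `structure` table and `resolve_path` are hypothetical, and the per-document structure is modelled as a simple parent-pointer table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PathPosting:
    """PIL-style posting: the structural path is stored in the list itself."""
    doc_id: int
    path: tuple        # e.g. ("book", "chapter", "title")


@dataclass(frozen=True)
class PointerPosting:
    """Pointer-style posting: only a node id into a separate structure."""
    doc_id: int
    node_id: int


# Hypothetical structure table: doc_id -> list of (parent_node_id, element_name).
# A parent id of -1 marks the root element.
structure = {
    1: [(-1, "book"), (0, "chapter"), (1, "title"), (1, "para")],
}


def resolve_path(posting, structure):
    """Follow parent pointers to rebuild the path a PIL posting stores directly."""
    nodes = structure[posting.doc_id]
    names = []
    node = posting.node_id
    while node != -1:
        parent, name = nodes[node]
        names.append(name)
        node = parent
    return tuple(reversed(names))


pil = PathPosting(doc_id=1, path=("book", "chapter", "title"))
ptr = PointerPosting(doc_id=1, node_id=2)
assert resolve_path(ptr, structure) == pil.path
```

In this sketch the pointer posting is smaller because the path is stored once per document and not once per occurrence, but each posting then needs an extra lookup at query time.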
13
XML Structure (XS) Tree
- The XS Tree is a compact representation of the structure of an XML document
- The size of an XS Tree is generally 1-2% of the original document size
- XS Trees for an entire document collection can usually be kept in memory
14
Performance of PIL vs. XS Trees
- The XS Tree index, including the XS Trees themselves, is generally 2-3 times smaller than the compressed PIL index
- Both approaches yield indexes that are smaller than the document collection
- In both cases, compression results in retrieval performance that is far worse than that of the uncompressed PIL
- Retrieval performance of the XS Tree approach is 10-100 times worse than that of the uncompressed PIL
15
Conclusions
Retrieval performance is dependent on:
- the retrieval time of the index (index size)
- the complexity of decoding the index entries
Scholer et al. find the ideal balance with bytewise compression, which results in optimal retrieval times
16
Conclusions
- The XS Tree achieves its goal of reducing the size of the index
- The complexity of decoding the XS Tree structure results in nearly unusable performance
- Future research should be undertaken to find a structure that is quicker to decode than the XS Tree