« Performance of Compressed Inverted List Caching in Search Engines », Proceedings of the International World Wide Web Conference Committee (Beijing, 2008). J. Zhang, X. Long, T. Suel. Paper presentation: Konstantinos Zacharis, Dept. of Computer & Communication Engineering, UTH.

Paper Outline
- Introduction
- Technical background: compressed index
- Technical background: caching in search engines
- Paper contributions
- Inverted index compression
- List caching policies
- Compression plus caching
- Conclusion

Introduction
Observation: in today's large search engines, it is crucial to improve query throughput and response time. The authors comparatively study two important techniques for achieving that goal: inverted index compression and inverted index caching (also called list caching).

Compressed index organization
Two cases are considered: (i) we have docIDs and frequencies, i.e., each posting is of the form (d_i, f_i), and (ii) we also store positions, i.e., each posting is of the form (d_i, f_i, p_{i,0}, …, p_{i,f_i - 1}). Word-oriented positions are used, i.e., p_{i,j} = k if the word is the k-th word in the document.
Figure 1: Disk-based inverted index structure with blocks for caching and chunks for skipping. DocIDs and positions are shown as differences to the preceding values.
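To make the "differences to the preceding values" in Figure 1 concrete, here is a minimal sketch (not from the paper) of d-gap encoding and decoding of a sorted docID list:

```python
def gaps(doc_ids):
    """Convert sorted docIDs to d-gaps: [23, 25, 30, 44] -> [23, 2, 5, 14]."""
    prev = 0
    out = []
    for d in doc_ids:
        out.append(d - prev)  # small gaps compress much better than raw IDs
        prev = d
    return out

def ungaps(deltas):
    """Invert gaps(): prefix-sum the d-gaps back to absolute docIDs."""
    total = 0
    out = []
    for g in deltas:
        total += g
        out.append(total)
    return out

assert ungaps(gaps([23, 25, 30, 44])) == [23, 25, 30, 44]
```

The gaps are then fed to one of the integer compression schemes discussed later, which is why schemes that favor small integers work well here.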

Caching in Search Engines
Result caching: if a query is identical to another query (possibly from a different user) that was recently processed, we can simply return the previous result. Thus, a result cache stores the top-k results of all queries recently computed by the engine.
List caching (or inverted index caching): caches in main memory those parts of the inverted index that are frequently accessed. Here, list caching is based on blocks: if a particular term occurs frequently in queries, then the blocks containing its inverted list are most likely already in cache and do not have to be fetched from disk.
Cached data is kept in compressed form in memory (uncompressed data would typically be about 3 to 8 times larger, depending on index format and compression method), so decompression speed must also be considered.
Figure 2: Two-level architecture with result caching at the query integrator and inverted index caching in the memory of each server.
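As a toy illustration of the result-caching idea only (a hedged sketch, not the engine's actual data structure; the key normalization and the FIFO-style eviction are assumptions made here for brevity):

```python
class ResultCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.store = {}  # normalized query -> cached top-k result list

    def _key(self, query):
        # assumption: order-insensitive, case-insensitive query normalization
        return " ".join(sorted(query.lower().split()))

    def get(self, query):
        return self.store.get(self._key(query))  # None on a cache miss

    def put(self, query, topk):
        key = self._key(query)
        if key not in self.store and len(self.store) >= self.capacity:
            self.store.pop(next(iter(self.store)))  # naive FIFO-style eviction
        self.store[key] = topk
```

Queries that miss this level fall through to the servers, where list caching (the focus of the paper) takes over.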

Paper Contributions
(1) Detailed experimental evaluation of fast state-of-the-art index compression techniques, including Variable-Byte coding, Simple9 coding, Rice coding, and PForDelta coding.
(2) New variations of these techniques, including an extension of Simple9 called Simple16, discovered during implementation and experimental evaluation.
(3) Comparison of several caching policies for inverted list caching on real search engine query traces (AOL and Excite); LRU turns out not to be a good policy in this case.
(4) Study of the benefits of combining index compression and index caching. The conclusion: for almost the entire range of system parameters, PForDelta compression with LFU caching achieves the best performance, except for small cache sizes and fairly slow disks, where the authors' optimized Rice code is slightly better.

Inverted index compression
Algorithms for list compression (limited to those that allow fast decompression):
1. Variable-byte encoding (see the sketch after this slide)
2. Simple9 (S9) coding
3. Simple16 (S16) coding
4. PForDelta coding
5. Rice coding
Figure 4: Total index size under different compression schemes. Results are shown for docIDs, frequencies, positions, an index with docIDs and positions only, and an index with all three fields.
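For concreteness, a minimal sketch of variable-byte coding (conventions differ across implementations; here the high bit marks the last byte of each number, one common variant):

```python
def vbyte_encode(n):
    """Encode one non-negative integer as a variable-byte sequence."""
    out = []
    while True:
        out.insert(0, n & 0x7F)  # low 7 bits first in the list
        if n < 128:
            break
        n >>= 7
    out[-1] |= 0x80              # stop bit on the final byte of the number
    return bytes(out)

def vbyte_decode(data):
    """Decode a concatenation of variable-byte-encoded integers."""
    nums, n = [], 0
    for b in data:
        if b & 0x80:             # stop bit set: this byte ends the number
            nums.append((n << 7) | (b & 0x7F))
            n = 0
        else:
            n = (n << 7) | b
    return nums

assert vbyte_decode(vbyte_encode(777) + vbyte_encode(5)) == [777, 5]
```

Variable-byte is simple and fast but byte-aligned, which is why bit- and word-aligned schemes such as Rice and PForDelta can achieve smaller indexes.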

Inverted index compression evaluation
Figure 5: Comparison of algorithms for compressing docIDs on inverted lists with different lengths.
Figure 6: Times for decompressing the entire inverted index.
Figure 7: Average compressed size of the data required for processing a query from the Excite trace.
Figure 8: Average decompression time (CPU only) per query, based on 100,000 queries from the Excite trace.

Experimental setup
Data set: ~7.4M web pages (selected randomly from a crawl of 120M pages in October 2002)
Total compressed size: ~36 GB
Total # of distinct words: ~20M (269 distinct words per document)
Total # of postings: ~2B
Query log: ~1.2M queries issued to the Excite SE, ~36M issued to AOL

List caching policies
The inverted index consists of 64KB blocks of compressed data (a block is the basic caching unit). The goal is to maximize the cache hit rate, measured in one of three ways:
(a) query-oriented, where we count a cache hit whenever all inverted lists for a query are in cache, and a miss otherwise;
(b) list-oriented, where we count a cache hit for each inverted list needed for a query that is found in cache;
(c) block- or data-size-oriented, where we count the number of blocks or the amount of data served from cache versus fetched from disk during query processing (more appropriate for very large collections).
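The first two metrics can be pinned down in a few lines; a hedged sketch (function names are illustrative, not from the paper):

```python
def query_oriented_hit(lists_needed, in_cache):
    """Hit only if every inverted list the query needs is cached."""
    return all(in_cache(t) for t in lists_needed)

def list_oriented_hits(lists_needed, in_cache):
    """Count one hit per needed list found in cache; the rest are misses."""
    hits = sum(1 for t in lists_needed if in_cache(t))
    return hits, len(lists_needed) - hits
```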

List caching policies
1) Least Recently Used (LRU): the baseline approach.
2) Least Frequently Used (LFU): performance depends on the length of the history list that we keep (see the sketch after this list).
3) Optimized Landlord: when an object is inserted into the cache, it is assigned a deadline given by the ratio between its benefit and its size. When we evict an object, we choose the one with the smallest deadline. Whenever an object already in cache is accessed, its deadline is reset to some suitable value. (In our case, every object has the same size and benefit, since we have an unweighted caching problem; so if we reset each deadline back to the same original value upon access, Landlord becomes identical to LRU.)
4) Multi-Queue (MQ): uses m = 8 different LRU queues, assigning each block to a queue according to its access count.
5) Adaptive Replacement Cache (ARC): tries to balance between LRU and LFU.
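Since LFU is the policy the paper ultimately pairs with PForDelta, here is a minimal counter-based LFU sketch. Two caveats: it keeps an unbounded frequency history, whereas the paper studies bounded history lengths (Figure 9), and the lazy-deletion heap is an implementation choice of this sketch, not the paper's.

```python
import heapq
import itertools

class LFUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.freq = {}                 # block id -> lifetime access count
        self.cached = set()            # block ids currently resident
        self.heap = []                 # (freq, tiebreak, block); stale entries allowed
        self.counter = itertools.count()

    def access(self, block):
        """Record one access; return True on a cache hit, False on a miss."""
        self.freq[block] = self.freq.get(block, 0) + 1
        hit = block in self.cached
        if not hit:
            if len(self.cached) >= self.capacity:
                self._evict()
            self.cached.add(block)
        # push an entry reflecting the block's current frequency
        heapq.heappush(self.heap, (self.freq[block], next(self.counter), block))
        return hit

    def _evict(self):
        # pop until we find a live entry whose frequency is still current
        while self.heap:
            f, _, block = heapq.heappop(self.heap)
            if block in self.cached and f == self.freq[block]:
                self.cached.remove(block)
                return
```

A usage example: replaying a trace of block accesses through `LFUCache(n_blocks)` and averaging `access()` over the tail of the trace approximates the hit-ratio measurements in the figures below.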

Comparison of list caching policies
Figure 9: Impact of history size on the performance of LFU, using an index compressed with PForDelta.
Figure 10: Cache hit rates for different caching policies and relative cache sizes. The graphs for LFU and MQ are basically on top of each other and largely indistinguishable. Optimal is the clairvoyant algorithm that knows all future accesses and thus provides an upper bound on the hit ratio. Each data point is based on a complete run over our query trace, where we measure the hit ratio for the last 100,000 queries.

Compression plus caching
Figure 12: Comparison of compression algorithms with different memory sizes, using 100,000 queries and LFU for caching. Total index size varied between 4.3 GB for variable-byte and 2.8 GB for entropy coding.
Figure 13: Comparison of PForDelta, S16, and Rice coding, using LFU caching and assuming 10 MB/s disk speed.
Figure 14: Comparison of PForDelta, S16, and Rice coding, using LFU caching and assuming 50 MB/s disk speed.
Figure 15: Comparison of query processing costs for different disk speeds, using LFU and a fixed 128 MB cache.

Conclusions
- Detailed experiments were conducted for both inverted index compression and caching (with more attention given to the compression methods).
- All figure numbers correspond to those in the original paper.
- Open question 1: improving the compression speed of PForDelta.
- Open question 2: compression of position data (applied to web-based collections).

Any questions? Thank you for your attention!