Efficient and Flexible Information Retrieval Using MonetDB/X100
Sándor Héman, CWI, Amsterdam
Marcin Zukowski, Arjen de Vries, Peter Boncz
January 08, 2007

Presentation transcript:

Background
Process query-intensive workloads over large datasets efficiently within a DBMS.
Application areas: information retrieval, data mining, scientific data analysis.

MonetDB/X100 Highlights
Vectorized query engine.
Transparent, light-weight compression.

Keyword Search
Inverted index: TD(termid, docid, score)
TopN(
  Project(
    MergeJoin(
      RangeSelect( TD1=TD, TD1.termid=10 ),
      RangeSelect( TD2=TD, TD2.termid=42 ),
      TD1.docid = TD2.docid ),
    [docid = TD1.docid, score = TD1.scoreQ + TD2.scoreQ] ),
  [score DESC], 20 )
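
The plan above intersects the posting lists of two terms on docid, adds their scores, and keeps the 20 best documents. A minimal sketch of that data flow in C follows. The `posting` struct and the function names are hypothetical, and the sketch assumes both posting lists are sorted by ascending docid, which is what a merge join needs. It is not the X100 implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One (docid, score) entry of a posting list; hypothetical layout. */
typedef struct { uint32_t docid; uint32_t score; } posting;

/* Insert hit into a result array kept sorted by descending score and
 * holding at most k entries. Returns the new number of entries. */
static size_t topn_insert(posting *top, size_t len, size_t k, posting hit)
{
    size_t pos;
    if (len == k) {
        if (k == 0 || hit.score <= top[k - 1].score)
            return len;              /* not better than the current worst */
        len--;                       /* drop the current worst */
    }
    pos = len;
    while (pos > 0 && top[pos - 1].score < hit.score) {
        top[pos] = top[pos - 1];     /* shift lower scores down */
        pos--;
    }
    top[pos] = hit;
    return len + 1;
}

/* MergeJoin on docid, Project (sum of both scores), TopN (k best).
 * top must have room for k entries. Returns the number of results. */
size_t keyword_search(const posting *a, size_t na,
                      const posting *b, size_t nb,
                      posting *top, size_t k)
{
    size_t i = 0, j = 0, len = 0;
    assert(k == 0 || top != NULL);
    while (i < na && j < nb) {
        if (a[i].docid < b[j].docid) {
            i++;                     /* docid only in list a */
        } else if (a[i].docid > b[j].docid) {
            j++;                     /* docid only in list b */
        } else {
            posting hit = { a[i].docid, a[i].score + b[j].score };
            len = topn_insert(top, len, k, hit);
            i++;
            j++;
        }
    }
    return len;
}
```

Calling `keyword_search` with the posting lists for termid 10 and termid 42 and `k = 20` corresponds to the plan on the slide.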

Vectorized Execution [CIDR05]
Volcano-based iterator pipeline.
Each next() call returns a collection of column-vectors of tuples.
This amortizes overheads, introduces parallelism, and keeps the vectors in the CPU cache.

Light-Weight Compression
Compressed buffer-manager pages: increase I/O bandwidth, increase BM capacity.
Favor speed over compression ratio: CPU-efficient algorithms, >1 GB/s decompression speed.
Minimize main-memory overhead: RAM-CPU cache decompression.

Naïve Decompression
1. Read and decompress page
2. Write back to RAM
3. Read for processing

RAM-Cache Decompression
1. Read and decompress page at vector granularity, on-demand
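
A rough sketch of decompressing at vector granularity is given below. The page stays compressed in RAM, and each `scan_next` call decompresses only one vector into a small buffer that the consumer processes immediately. Byte-sized deltas stand in for the real compression scheme, and `scan_state`, `scan_next`, `scan_sum` and the tiny `VECTOR_SIZE` are assumptions made for illustration, not X100 names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VECTOR_SIZE 4   /* tiny for illustration only */

/* Scan state over one compressed page. The page holds byte-sized deltas,
 * a stand-in for the real compression scheme. */
typedef struct {
    const uint8_t *page;   /* compressed data, stays compressed in RAM */
    size_t n;              /* number of values on the page */
    size_t pos;            /* next value to decompress */
    uint32_t prev;         /* running value for delta decoding */
} scan_state;

/* Decompress at most VECTOR_SIZE values into vec (a small buffer meant to
 * stay cache resident). Returns how many were produced; 0 means done. */
size_t scan_next(scan_state *s, uint32_t *vec)
{
    size_t produced = 0;
    assert(s != NULL && vec != NULL);
    while (produced < VECTOR_SIZE && s->pos < s->n) {
        s->prev += s->page[s->pos++];
        vec[produced++] = s->prev;
    }
    return produced;
}

/* Consumer: pulls one vector at a time and processes it right away, so the
 * decompressed page is never written back to RAM as a whole. */
uint64_t scan_sum(scan_state *s)
{
    uint32_t vec[VECTOR_SIZE];
    uint64_t sum = 0;
    size_t got, i;
    while ((got = scan_next(s, vec)) > 0)
        for (i = 0; i < got; i++)
            sum += vec[i];
    return sum;
}
```

The only decompressed data alive at any time is the `vec` buffer, which is the point of the on-demand approach compared with the three-step naïve variant.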

2006 TREC TeraByte Track
X100 compared to custom IR systems (others prune the index).
[Table: throughput (q/s) and throughput per CPU for X100, Wumpus, MPI, and Melbourne Univ; the values did not survive extraction.]

Thanks!

MonetDB/X100 in Action
Corpus: 25M text documents, 427GB
docid + score: 28GB, 9GB compressed
Hardware: 3GHz Intel Xeon, 4GB RAM, 10 disk RAID, 350 MB/s

MonetDB/X100 [CIDR’05]
Vector-at-a-time instead of tuple-at-a-time Volcano
Vector = Array of Values ( )
Vectorized primitives: array computations; loop-pipelinable, hence very fast; less function call overhead
Vectors are cache resident
RAM considered secondary storage
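
To make "vectorized primitive" concrete, here is a small sketch of two such primitives in C. Each is called once per vector rather than once per tuple, and its body is a plain loop over arrays. The names `map_add_u32` and `select_eq_u32` and the selection-vector convention are assumptions for illustration, not the actual X100 primitive set.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Map primitive: res[i] = a[i] + b[i] for a whole vector. One function
 * call per vector; the loop body has no calls and no dependencies
 * between iterations. */
void map_add_u32(size_t n, uint32_t *res, const uint32_t *a, const uint32_t *b)
{
    size_t i;
    assert(res != NULL && a != NULL && b != NULL);
    for (i = 0; i < n; i++)
        res[i] = a[i] + b[i];
}

/* Select primitive: write the positions where col[i] == val into sel and
 * return how many matched. Written without an if: always store the
 * position, advance the output index only on a match.
 * sel needs room for n entries. */
size_t select_eq_u32(size_t n, size_t *sel, const uint32_t *col, uint32_t val)
{
    size_t i, found = 0;
    assert(sel != NULL && col != NULL);
    for (i = 0; i < n; i++) {
        sel[found] = i;
        found += (col[i] == val);
    }
    return found;
}
```

With vectors small enough to fit in the CPU cache, the output of one primitive can be consumed by the next without a round trip to RAM.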

Vector Size vs Execution Time

Compression
docid: PFOR-DELTA. Encode deltas as a b-bit offset from an arbitrary base value: deltas within the range that b bits can represent are encoded; deltas outside that range are stored as uncompressed exceptions.
score: Okapi -> quantize -> PFOR compress
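
The docid encoding step can be sketched as follows. This is a simplified illustration under several assumptions: codes are left unpacked (one per `uint32_t`, no bit-packing), the first delta is taken relative to 0, exceptions are recorded as separate position/value arrays, and the function name `pfor_delta_encode` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Delta-encode ascending docids, then split each delta into either a
 * b-bit code (delta - base) or an uncompressed exception.
 * code needs room for n entries; exc_pos and exc_val need room for n
 * entries in the worst case. Returns the number of exceptions. */
size_t pfor_delta_encode(const uint32_t *docid, size_t n,
                         uint32_t base, unsigned b,
                         uint32_t *code, size_t *exc_pos, uint32_t *exc_val)
{
    size_t i, n_exc = 0;
    uint32_t prev = 0;
    uint32_t limit;
    assert(b >= 1 && b <= 31);
    limit = (uint32_t)1 << b;          /* codes lie in [0, 2^b) */
    for (i = 0; i < n; i++) {
        uint32_t delta = docid[i] - prev;
        prev = docid[i];
        if (delta >= base && delta - base < limit) {
            code[i] = delta - base;    /* fits in b bits */
        } else {
            code[i] = 0;               /* placeholder; value is in exc_val */
            exc_pos[n_exc] = i;
            exc_val[n_exc] = delta;
            n_exc++;
        }
    }
    return n_exc;
}
```

For docids 3, 5, 6, 106, 108 with base 1 and b = 2, the deltas are 3, 2, 1, 100, 2, so only the delta 100 becomes an exception.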

Compressed Block Layout
Forward growing section of bit-packed b-bit code words
Backwards growing exception list

Naïve Decompression
Mark ( ) exception positions
for(i=0; i < n; i++) {
  if (in[i] == ) {
    out[i] = exc[--j];
  } else {
    out[i] = DECODE(in[i]);
  }
}
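
A compilable version of the naïve loop might look like the sketch below. The marker symbol is missing from the slide text, so the sketch takes it as a parameter `mark`, which is an assumption. It also assumes unpacked codes, `DECODE(x) = base + x`, and an exception array stored in reverse order of position so that `exc[--j]` yields exceptions front to back.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Naive decode: a reserved code value (mark, an assumption of this sketch)
 * flags exception positions. exc holds the exceptions in reverse order of
 * position, so reading exc[--j] from j = n_exc yields them front to back. */
void naive_decode(const uint32_t *in, size_t n, uint32_t base, uint32_t mark,
                  const uint32_t *exc, size_t n_exc, uint32_t *out)
{
    size_t i, j = n_exc;
    for (i = 0; i < n; i++) {
        if (in[i] == mark) {           /* data-dependent branch per value */
            assert(j > 0);
            out[i] = exc[--j];
        } else {
            out[i] = base + in[i];     /* DECODE(in[i]) */
        }
    }
}
```

The test inside the loop runs for every value; that per-value branch is what the patched variant removes.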

Patched Decompression
Link exceptions into patch-list
Decode:
for(i=0; i < n; i++) {
  out[i] = DECODE(in[i]);
}
Patch:
for(i=first_exc; i<n; i += in[i]) {
  out[i] = exc[--j];
}
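
The two loops can be turned into a small compilable sketch. It assumes unpacked codes, `DECODE(x) = base + x`, and the convention that the code word at an exception position holds the distance to the next exception, with the last one pointing at or past `n`. The function name and parameter list are hypothetical. A real encoder also has to keep every distance representable in b bits, which this sketch ignores.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Patched decode. At an exception position, in[i] holds the distance to
 * the next exception (at least 1); the last exception's distance must
 * reach n or beyond. first_exc == n means there are no exceptions.
 * exc is read backwards as exc[--j], starting from j = n_exc. */
void patched_decode(const uint32_t *in, size_t n, uint32_t base,
                    size_t first_exc, const uint32_t *exc, size_t n_exc,
                    uint32_t *out)
{
    size_t i, j = n_exc;

    /* Decode: branch-free loop over all positions, exceptions included. */
    for (i = 0; i < n; i++)
        out[i] = base + in[i];

    /* Patch: walk the linked list and overwrite the wrongly decoded values. */
    for (i = first_exc; i < n; i += in[i]) {
        assert(j > 0 && in[i] > 0);
        out[i] = exc[--j];
    }
}
```

The decode loop does some wasted work at exception positions, but it contains no branch on the data, and the patch loop only touches the exceptions.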

Patch Bandwidth