Bitmap Indices for Speeding Up End User Physics Analysis Main Results of Ph.D. Thesis Kurt Stockinger Database Group, IT-Division, CERN Formerly affiliated with: Institute of Computer Science and Business Informatics, University of Vienna, Austria

February 6

Outline
- Brief Overview of Index Data Structures
- Conventional Bitmap Indices:
  - Simple Bitmap Indices
  - Bitmap Encoding Techniques
  - Bitmap Compression
- Bitmap Indices for Scientific Data
- A Novel Bitmap Algorithm
- Towards a Cost Model for a Query Optimiser
- Features of My Bitmap Index Implementation
- Performance Benchmarks on Synthetic Data:
  - Verbatim Bitmap Indices
  - Compressed Bitmap Indices
- Performance Benchmarks on Real Data:
  - High Energy Physics
  - Sloan Digital Sky Server
- Conclusions

Brief Overview of Index Data Structures
- One-dimensional index data structures:
  - Total order in one dimension
  - Hash-based: optimised for exact-match queries, e.g. jetE = 106
  - Tree-based: optimised for range queries, e.g. jetE < 106
  - Most widely used: B+-tree (1972)
- Multidimensional index data structures:
  - No total order across all dimensions
  - Hash-based: Grid-File, Bang-File, …
  - Tree-based: R-Trees, Pyramid-Tree, …
- Bitmap indices:
  - Applied in data warehouses, typically in read-only environments

Simple Bitmap Indices (Equality Encoding)
[Figure: a) list of attributes; b) bitmap index (equality encoding)]
- a) List of 12 attribute values, 10 of them distinct, i.e. attribute cardinality = 10
- b) One bit slice is created for each distinct attribute value, i.e. the bitmap index consists of 10 bit slices (E0 to E9)
- Bit slice E2 encodes the attributes with value 2
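A minimal sketch of equality encoding in Python. The column values below are hypothetical (the figure's actual values are not part of this transcript), and `build_equality_index` is an illustrative helper, not the thesis implementation.

```python
def build_equality_index(values, cardinality):
    """Build one bit slice per distinct value; slice v has bit r set if values[r] == v."""
    slices = [[0] * len(values) for _ in range(cardinality)]
    for row, value in enumerate(values):
        slices[value][row] = 1
    return slices


# Hypothetical column: 12 attribute values, 10 distinct values (0..9).
values = [2, 1, 2, 9, 0, 7, 5, 6, 4, 3, 8, 2]
index = build_equality_index(values, cardinality=10)

# Bit slice E2 marks the rows whose value is 2.
e2 = index[2]
rows_with_value_2 = [row for row, bit in enumerate(e2) if bit]
print(e2)                 # [1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
print(rows_with_value_2)  # [0, 2, 11]
```

Each row sets exactly one bit across the 10 slices, which is why the index grows linearly with the attribute cardinality.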

Various Bitmap Encoding Techniques
[Figure: a) list of attributes; b) equality encoding; c) range encoding]
- Attribute cardinality = 10
- Range encoding is optimised for one-sided range queries, e.g. a0 <= 2

Equality Encoding (EE) vs Range Encoding (RE)
- Index size: |A| bit slices, where |A| is the attribute cardinality, i.e. the number of distinct attribute values
- One-sided range queries can be handled more efficiently with range-encoded bitmap indices!
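To illustrate why one-sided range queries favour range encoding, here is a small Python sketch under stated assumptions: the column is hypothetical, slice `R_v` is taken to mark rows with value <= v, and "slices read" is used as a rough stand-in for I/O cost.

```python
values = [2, 1, 2, 9, 0, 7, 5, 6, 4, 3, 8, 2]  # hypothetical column, cardinality 10
CARDINALITY = 10

# Equality encoding: slice E_v marks rows with value == v.
equality = [[int(x == v) for x in values] for v in range(CARDINALITY)]
# Range encoding (assumed convention): slice R_v marks rows with value <= v.
range_enc = [[int(x <= v) for x in values] for v in range(CARDINALITY)]


def query_le_equality(bound):
    """value <= bound with equality encoding: OR together bound + 1 slices."""
    result = [0] * len(values)
    for v in range(bound + 1):
        result = [a | b for a, b in zip(result, equality[v])]
    return result, bound + 1  # (answer bitmap, slices read)


def query_le_range(bound):
    """value <= bound with range encoding: a single slice is the answer."""
    return range_enc[bound], 1


eq_result, eq_reads = query_le_equality(2)
re_result, re_reads = query_le_range(2)
print(re_result)           # [1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1]
print(eq_reads, re_reads)  # 3 1
```

Both encodings return the same answer for a0 <= 2, but the equality-encoded index has to combine three slices where the range-encoded index reads one.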

Pros and Cons of Bitmap Indices
- Pros:
  - Easy to build and to maintain
  - Easy to identify records that satisfy a complex multi-attribute predicate (multi-dimensional ad-hoc queries)
  - Very space efficient for attributes with low cardinality (few distinct attribute values, e.g. “Yes”, “No”)
- Cons:
  - Space inefficient for attributes with high cardinality
  - A possible solution: bitmap compression
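The multi-attribute case can be sketched with Python integers used as bitsets. The attribute names and values are hypothetical and only serve to show that a compound predicate reduces to bitwise AND/OR over bit slices.

```python
def equality_index(column):
    """Map each distinct value to an int used as a bitset (bit r = row r)."""
    index = {}
    for row, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << row)
    return index


def rows_of(bitset):
    """List the row numbers whose bit is set."""
    return [row for row in range(bitset.bit_length()) if bitset >> row & 1]


# Hypothetical low-cardinality attributes of six records.
triggered = ["Yes", "No", "Yes", "Yes", "No", "Yes"]
detector = ["A", "B", "C", "A", "A", "B"]

idx_triggered = equality_index(triggered)
idx_detector = equality_index(detector)

# triggered = "Yes" AND detector IN ("A", "B")
hits = idx_triggered["Yes"] & (idx_detector["A"] | idx_detector["B"])
print(rows_of(hits))  # [0, 3, 5]
```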

Bitmap Compression
- Advantages:
  - Less disk space for storing indices
  - Indices can be read from disk into memory faster
  - More indices can be cached in memory
- Possible problems:
  - Bitmap compression is difficult to combine with the optimal index designs reported in the literature
  - If bitmaps must be decompressed before Boolean operations are performed, the decompression overhead might outweigh the advantages of compression

Various Bitmap Compression Algorithms
- Run Length Encoding (RLE): one-sided (asymmetric) vs. two-sided (symmetric)
- Gzip (Lempel-Ziv, LZ): the verbatim (uncompressed) bitmap is compressed via zlib
- ExpGol: variable bit-length encoding (the RLE bitmap is compressed)
- Byte-Aligned Bitmap Compression (BBC): variable byte-length encoding (Oracle patent); one-sided vs. two-sided (BBC1 vs. BBC2)
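As a rough illustration of the simplest scheme in this list, the following Python sketch run-length encodes a bitmap in a two-sided fashion, i.e. runs of both 0s and 1s are stored. The `(bit, length)` pair format is an assumption made for readability; it is not the on-disk layout of RLE, ExpGol or BBC.

```python
from itertools import groupby


def rle_encode(bits):
    """Two-sided RLE: one (bit value, run length) pair per run of 0s or 1s."""
    return [(bit, len(list(run))) for bit, run in groupby(bits)]


def rle_decode(runs):
    """Expand (bit value, run length) pairs back into a verbatim bitmap."""
    return [bit for bit, length in runs for _ in range(length)]


# Hypothetical sparse bit slice: 20 bits, only three of them set.
bitmap = [0] * 10 + [1] * 3 + [0] * 7
runs = rle_encode(bitmap)
print(runs)  # [(0, 10), (1, 3), (0, 7)]
```

Long runs, as found in sparse or well-clustered bit slices, collapse into a handful of pairs; a bitmap that alternates on every bit would not compress at all.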

Algorithms for Boolean Operations on Compressed Bitmaps [Johnson VLDB99]
- Basic:
  - Input: two verbatim bitmaps
  - Output: one verbatim bitmap
- Inplace:
  - Input: one verbatim bitmap + one RLE, ExpGol or BBC bitmap
  - Output: one verbatim bitmap
- Direct:
  - Input: two compressed bitmaps (RLE or BBC)
  - Output: one compressed bitmap (RLE or BBC)
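The "Direct" variant can be sketched as follows: two run-length encoded bitmaps are ANDed run by run and the output is again run-length encoded, so no verbatim bitmap is materialised. The `(bit, length)` run format and the function names are assumptions for illustration; this is not the algorithm from [Johnson VLDB99] verbatim.

```python
def rle_and(a, b):
    """AND two RLE bitmaps of equal total length; the result is RLE as well."""
    out = []
    ia = ib = 0          # index of the next run to fetch from a and b
    rem_a = rem_b = 0    # bits left in the current run of a and b
    bit_a = bit_b = 0
    while True:
        if rem_a == 0:
            if ia == len(a):
                break
            bit_a, rem_a = a[ia]
            ia += 1
        if rem_b == 0:
            if ib == len(b):
                break
            bit_b, rem_b = b[ib]
            ib += 1
        step = min(rem_a, rem_b)   # overlap of the two current runs
        bit = bit_a & bit_b
        if out and out[-1][0] == bit:
            out[-1] = (bit, out[-1][1] + step)   # extend the previous run
        else:
            out.append((bit, step))
        rem_a -= step
        rem_b -= step
    return out


def expand(runs):
    """Helper used only to check the result against a verbatim AND."""
    return [bit for bit, length in runs for _ in range(length)]


a = [(0, 4), (1, 4)]           # 00001111
b = [(1, 2), (0, 2), (1, 4)]   # 11001111
print(rle_and(a, b))           # [(0, 4), (1, 4)]
```

The loop advances by whole runs, so its cost depends on the number of runs rather than on the number of bits.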

Bitmap Indices for Scientific Data
- The bitmap indices of commercial products (Oracle, Sybase, Informix) are optimised for discrete attribute values, e.g. integers
- However, scientific data is mostly non-discrete, e.g. floating-point values
- Using commercial bitmap indices for non-discrete values would produce one bit slice per distinct attribute value!
- Possible solutions:
  - Build function-based indices on top of commercial indices:
    - See the DB Group's evaluation of Oracle’s bitmap indices
    - However, Oracle uses equality-encoded bitmap indices (not optimised for range queries)!
  - Develop your own range-based bitmap indices (topic of my Ph.D. thesis)

Range Encoding for Non-Discrete Attribute Values
- Encoding of attribute ranges over [0;140) rather than individual attribute values (7 logical but 6 physical bins)
- Query processing: see next slide

A Novel Bitmap Algorithm - GenericRangeEncoding
- Example query: x < 63
- Extract the candidate objects from the “candidate slice” via XOR with the “previous” bit slice
- Only these candidates need to be checked, rather than all candidates in the “candidate slice”
[Figure: “candidate slice” XOR “previous” bit slice, showing hits, candidate objects and the result after the “candidate check”]
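A sketch of this query strategy in Python, under assumptions that are not stated on the slides: equi-width bins of width 20 over [0;140), slice k marking objects with value below the k-th bin boundary, and the all-ones slice for "< 140" left unstored (one possible reading of "7 logical but 6 physical bins"). The data values and function names are hypothetical.

```python
import bisect

# Assumed bin boundaries; the slice for "< 140" would be all ones and is not stored.
BOUNDARIES = [20, 40, 60, 80, 100, 120]


def build_slices(values):
    """Range-encoded slices: slice k marks objects with value < BOUNDARIES[k]."""
    return [[int(v < b) for v in values] for b in BOUNDARIES]


def query_less_than(values, slices, q):
    """Evaluate value < q; return (result bitmap, number of candidates checked)."""
    n = len(values)
    k = bisect.bisect_left(BOUNDARIES, q)        # bin that contains q
    candidate_slice = slices[k] if k < len(slices) else [1] * n
    hits = slices[k - 1] if k > 0 else [0] * n   # "previous" slice: sure hits
    # XOR leaves only the objects that fall into the bin containing q.
    candidates = [c ^ h for c, h in zip(candidate_slice, hits)]
    # Candidate check: compare the stored value of each candidate with q.
    checked = [int(c and values[row] < q) for row, c in enumerate(candidates)]
    result = [h | c for h, c in zip(hits, checked)]
    return result, sum(candidates)


values = [5.0, 61.5, 62.9, 63.0, 79.9, 59.9, 130.0, 64.2]  # hypothetical data
slices = build_slices(values)
result, n_candidates = query_less_than(values, slices, 63)
print(result)        # [1, 1, 1, 0, 0, 1, 0, 0]
print(n_candidates)  # 5
```

For x < 63 the slice "< 60" already delivers the sure hits, and only the five objects in [60;80) are compared against their stored values; the candidate slice "< 80" on its own would have flagged seven objects.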

Towards a Cost Model for a Query Optimiser
- Basic idea:
  - Before a query is executed, the Query Optimiser calculates the I/O costs of both access paths, namely the sequential scan and the query based on the bitmap index
  - Given these costs, the Query Optimiser selects the access path with the lowest expected cost (cost-based Query Optimiser)
- Approach for a cost model based on GenericRangeEncoding:
  - Given the query range and the binning strategy, calculate the expected I/O costs of checking the candidate objects against the query constraint
  - Use a stochastic model
- Note: We do not attempt to discuss the whole approach. For details refer to

Cost Model #1: #Candidates per Dimension
- For discrete attribute values the main bottleneck is the “index scan”
- For non-discrete attribute values the main bottleneck is the “candidate check”, i.e. all candidate objects must be checked against the query constraint
- Simplifying assumption: equally (uniformly) distributed and independent data values
- Max. number of expected candidates ($E_c$) per indexed attribute: $E_c = O/b$, where $O$ is the total number of objects and $b$ is the number of bit slices
- E.g. 1,000,000 objects with 100 bins => 10,000 candidate objects
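The estimate E_c = O/b can be checked with a short simulation. The sketch below assumes uniformly distributed values and equi-width bins; the data set size, seed and query bound are arbitrary choices for illustration.

```python
import random


def expected_candidates(total_objects, bit_slices):
    """E_c = O / b under the uniform, independent-values assumption."""
    return total_objects / bit_slices


print(expected_candidates(1_000_000, 100))  # 10000.0

# Empirical check on a smaller, hypothetical data set.
rng = random.Random(0)
objects, bins, width = 100_000, 100, 1.0
values = [rng.uniform(0, bins * width) for _ in range(objects)]

query_bound = 62.5                          # falls into the bin [62, 63)
candidate_bin = int(query_bound // width)
observed = sum(1 for v in values if int(v // width) == candidate_bin)
print(observed, expected_candidates(objects, bins))
```

The observed number of objects in the candidate bin should scatter around O/b = 1,000; skewed data would break this estimate, which is why the binning strategy matters.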

Cost Model #2: Page I/O for Candidates per Dimension
- The access granularity of the database is one page rather than one object
- Thus, if one object is accessed, the whole page is read
- Costs for page I/O [O’Neil, Quass 1997]: $C = p_{tot} \left[ 1 - e^{-E_c / p_{tot}} \right]$, where $p_{tot}$ is the total number of pages of all objects and $E_c$ is the expected number of candidate objects
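A small numeric sketch of this formula in Python. The page counts are hypothetical, and `choose_access_path` is an illustrative simplification of a cost-based decision (it assumes the sequential scan reads all `total_pages` and adds an assumed `index_pages` term for reading the bit slices); it is not the optimiser described in the thesis.

```python
import math


def candidate_page_io(expected_candidates, total_pages):
    """C = p_tot * (1 - exp(-E_c / p_tot))."""
    return total_pages * (1.0 - math.exp(-expected_candidates / total_pages))


def choose_access_path(index_pages, expected_candidates, total_pages):
    """Hypothetical rule: pick the path with the lower estimated page I/O."""
    bitmap_cost = index_pages + candidate_page_io(expected_candidates, total_pages)
    scan_cost = total_pages  # assumption: a sequential scan reads every page once
    if bitmap_cost < scan_cost:
        return "bitmap index", bitmap_cost
    return "sequential scan", scan_cost


# 10,000 candidates spread over 5,000 pages touch about 86 % of the pages.
print(round(candidate_page_io(10_000, 5_000), 1))   # 4323.3
print(choose_access_path(100, 10_000, 5_000)[0])    # bitmap index
print(choose_access_path(100, 10_000, 1_000)[0])    # sequential scan
```

The cost saturates at p_tot: once the candidates outnumber the pages by a wide margin, almost every page is read and the index no longer pays off.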

My Bitmap Indices
- Bitmap indices are built on top of Objectivity/DB
- Single bit slices are based on a new version of HepODBMS Tags:
  - Persistent, scalable, segmented VArrays called “sliced Tag” (column-wise clustering, see next slide)
  - Prefetch optimisation for concurrent reading
  - “Base objects”, i.e. non-indexed data, are also stored as sliced Tags
- Query Preprocessor:
  - With Koen Holtman (Caltech/CMS): “any” mathematical (query) expression can be evaluated
  - E.g. Bitmaps “jet1E 0.3 && jet2E > 5.5”
- Bitmap Compression:
  - With Theodore Johnson (AT&T Labs-Research) [VLDB99/00] + own enhancements of Boolean operations for two-sided BBC

Clustering of Generic vs. Sliced Tags in HepODBMS
[Figure: Generic Tags (PAW: row-wise), “old” version, vs. Sliced Tags (PAW: column-wise), “new” version: not released yet; attributes attr 1 to attr 3 (a1, a2, a3) for tag0 to tag3]

Definitions and Assumptions for Verbatim Bitmap Indices
- The first set of tests is based on 1,000,000 base objects with 25 attributes (dimensions)
- Attributes are clustered together (sliced Tag, alias column-wise clustering)
- Attribute values are equally distributed and independent, and in the range [0;100]
- Bitmap Index (BMI):
  - 100 equi-width bins per dimension
  - => Size of BMI ~3 times the size of the base objects
- Query selectivity per attribute (dimension):
  - #selected_attribute_values/#total_attribute_values (per dimension)
  - e.g. a3 30 % selectivity
- Total query selectivity:
  - #selected_objects/#total_objects
  - e.g. a3 40 => 12 % selectivity
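A short worked example of these definitions in Python. It assumes independent attributes, so that per-dimension selectivities multiply, and uses hypothetical per-dimension selectivities of 30 % and 40 %, which reproduces the 12 % quoted above. The size estimate additionally assumes that each attribute value is stored as a 4-byte number, which the slides do not state.

```python
import math


def total_selectivity(per_dimension):
    """Total selectivity of a conjunctive query over independent attributes."""
    return math.prod(per_dimension)


TOTAL_OBJECTS = 1_000_000

sel = total_selectivity([0.30, 0.40])   # hypothetical two-dimensional query
print(round(sel, 4))                    # 0.12
print(round(sel * TOTAL_OBJECTS))       # 120000 expected hits

# A 5-dimensional query with selectivity x = 0.5 per dimension selects x**5.
print(total_selectivity([0.5] * 5))     # 0.03125

# Rough index size: 100 bins = 100 bits per value vs. an assumed 32-bit value.
BITS_PER_VALUE_INDEX = 100
BITS_PER_VALUE_DATA = 32                # assumption: 4-byte attribute values
print(BITS_PER_VALUE_INDEX / BITS_PER_VALUE_DATA)  # 3.125
```

Under the 4-byte assumption the verbatim index is about three times the size of the base data, in line with the "~3 times" figure on the slide.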

5-Dimensional Query - Page I/O & Response Time
[Figure: page I/O and response time of the bitmap index vs. the sequential scan; total query selectivity = x^5]
- Max. speed-up of BMI relative to the sequential scan: ~ factor 2
- Note: All benchmarks in this talk are performed on a cold disk cache!

10-Dimensional Query - Page I/O & Response Time
[Figure: page I/O and response time of the bitmap index vs. the sequential scan; total query selectivity = x^10]
- Max. speed-up of BMI relative to the sequential scan: ~ factor 3

25-Dimensional Query - Page I/O & Response Time
[Figure: page I/O and response time of the bitmap index vs. the sequential scan; total query selectivity = x^25]
- Max. speed-up of BMI relative to the sequential scan: ~ factor 5

Assumptions for Compressed Bitmap Indices
- 1,000,000 base objects with 25 attributes (dimensions)
- Attribute values are exponentially distributed and independent
- Bitmap Index (BMI):
  - 100 equi-width bins per dimension
  - => Size of BMI ~3 times the size of the base objects

2-Sided Byte Aligned Bitmap Compression (BBC2)
[Figure: range-encoded bitmap index on exponentially distributed data; good compression ratio]

Verbatim vs Compressed (BBC2) Bitmap Indices
[Figure: advantage of the compressed bitmap index]

Specific HEP Data
- Physics data: 1,401,020 Tags with 37 attributes (in Objectivity)
- Data size: 262 MB
- Index size: 790 MB (37 dimensions with 100 bins each)

Distribution Functions of Specific HEP Data
[Figure: data distribution of 4 different physics attributes; range-encoded BMIs with 100 bins]

BMI Results for Specific HEP Data
- For the particular queries we studied, bitmap indices with 100 bins (~3 times the size of the base objects) improved performance by a factor of two for 10-dimensional queries, compared to the sequential scan
- Tests are based on real data with synthetic queries
- However, as we have seen, all results are relative and highly dependent on:
  - a) Data distribution
  - b) Access patterns
  - c) Binning strategy, which should reflect a) and b)
- For higher-dimensional queries the performance improvement can be even more significant!

Specific Sloan Digital Sky Server (SDSS) Data
- Sloan Digital Sky Server: 6,182,527 real astronomy objects (on top of Objectivity)
- These objects were extracted and ported to sliced Tags with bitmap indices
- In total: 65 bitmap indices (one index for each attribute)
- Data size (base objects): ~2 GB
- Index size: ~5.2 GB

SDSS Sample Queries
- From 357 query logs of 41 users, 49 queries are based on this data set (sxGalaxy)
- 3 typical multi-dimensional ones:

Q1: SELECT g,r,i FROM sxGalaxy
    WHERE ((RA() between 180 and 185) && (DEC() between 1. and 1.2)
        && (r between 10 and 18) && (i between 10 and 18) && (g between 10 and 18))

Q2: SELECT g,r,i FROM sxGalaxy
    WHERE ((g-r between 1.05 and 1.13) && (r-i between 0.42 and 0.51)
        && (r between and 19.68))

Q3: SELECT u,g,r FROM sxGalaxy
    WHERE ((u-g between 0.0 and 0.75) && (g-r between 0.0 and 0.5)
        && (u between 18 and 23) && (g between 18 and 23) && (r between 18 and 23)
        && ((u-g)/(g-r) between 0.8 and 1.2))

BMI Results for Specific SDSS Data
- Speed-up factor of queries against bitmap indices over queries against the Sloan Sky Server:
  - Q1: speed-up factor ~10
  - Q2: speed-up factor ~20
  - Q3: speed-up factor ~15
- Reasons for the better performance of bitmap indices:
  - Better clustering of base objects: attribute-wise rather than object-wise
  - Low-selectivity queries require fewer page I/Os than Sloan queries

Conclusions
- Depending on the data distribution, the query access pattern and the binning strategy, bitmap indices can significantly improve the response time of high-dimensional queries
- Detailed results can be found in the Ph.D. thesis:
- Future work:
  - Collaboration with Arie Shoshani and John Wu from Berkeley to further improve query response time and bitmap compression
  - Improve the cost model for the Query Optimiser to increase the accuracy of I/O cost predictions for queries against real data with various binning strategies