Bitmap Indices for Speeding Up End User Physics Analysis
Main Results of Ph.D. Thesis
Kurt Stockinger
Database Group, IT-Division, CERN
Formerly affiliated with: Institute of Computer Science and Business Informatics, University of Vienna, Austria

February 6, 2002, Kurt.Stockinger@cern.ch

Outline
- Brief Overview of Index Data Structures
- Conventional Bitmap Indices:
  - Simple Bitmap Indices
  - Bitmap Encoding Techniques
  - Bitmap Compression
- Bitmap Indices for Scientific Data
- A Novel Bitmap Algorithm
- Towards a Cost Model for a Query Optimiser
- Features of My Bitmap Index Implementation
- Performance Benchmarks on Synthetic Data:
  - Verbatim Bitmap Indices
  - Compressed Bitmap Indices
- Performance Benchmarks on Real Data:
  - High Energy Physics
  - Sloan Digital Sky Server
- Conclusions

Brief Overview of Index Data Structures
- One-dimensional index data structures:
  - Total order for one dimension
  - Hash-based:
    - Optimised for exact match queries, e.g. jetE = 106
  - Tree-based:
    - Optimised for range queries, e.g. jetE < 106
    - Most widely used: B+-tree (1972)
- Multidimensional index data structures:
  - No total order for all dimensions
  - Hash-based:
    - Grid-File, Bang-File, …
  - Tree-based:
    - R-Trees, Pyramid-Tree, …
- Bitmap Indices:
  - Applied in Data Warehouses for typical read-only environments

Simple Bitmap Indices (Equality Encoding)
[Figure: a) list of attributes; b) bitmap index (equality encoding)]
a) List of 12 attributes with 10 distinct attribute values, i.e. attribute cardinality = 10
b) For each distinct attribute value, one bit slice is created, i.e. the bitmap index consists of 10 bit slices (E0 to E9)
Bit slice E2 encodes attributes with value 2
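The construction described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the thesis implementation: the function name `build_equality_index` and the 12 sample values are assumptions chosen only to match the slide's shape (12 entries, cardinality 10).

```python
def build_equality_index(values, cardinality):
    """Build one bit slice per distinct value (equality encoding).

    slices[v][row] is 1 exactly when values[row] == v.
    """
    slices = [[0] * len(values) for _ in range(cardinality)]
    for row, value in enumerate(values):
        slices[value][row] = 1
    return slices


# Hypothetical attribute list: 12 entries, 10 distinct values (0..9).
values = [3, 2, 1, 2, 8, 2, 9, 0, 7, 5, 6, 4]
slices = build_equality_index(values, cardinality=10)

# Bit slice E2 marks the rows whose attribute value is 2.
e2 = slices[2]
print(e2)
```

Each row has exactly one bit set across the 10 slices, which is why the index grows linearly with the attribute cardinality.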

Various Bitmap Encoding Techniques
[Figure: a) list of attributes; b) equality encoding; c) range encoding]
Attribute cardinality = 10
Range encoding is optimised for one-sided range queries, e.g. a0 <= 2

Equality (EE) vs Range Encoding (RE)
Index size: |A| bit slices, where |A| is the attribute cardinality, i.e. the number of distinct attribute values
One-sided range queries can be handled more efficiently with range-encoded bitmap indices!
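To make the difference concrete, the sketch below answers the one-sided query a <= 2 with both encodings and counts how many bit slices each has to read. It assumes the usual definition of range encoding (slice R_v has a bit set for every row whose value is <= v); the function names and the sample values are hypothetical.

```python
def equality_slices(values, cardinality):
    """E[v][row] = 1 iff values[row] == v."""
    return [[1 if x == v else 0 for x in values] for v in range(cardinality)]


def range_slices(values, cardinality):
    """R[v][row] = 1 iff values[row] <= v."""
    return [[1 if x <= v else 0 for x in values] for v in range(cardinality)]


def query_le_equality(eq, bound):
    """Answer 'a <= bound' by OR-ing slices E0..E_bound."""
    result = [0] * len(eq[0])
    for v in range(bound + 1):
        result = [a | b for a, b in zip(result, eq[v])]
    return result, bound + 1  # (matching rows, slices read)


def query_le_range(rng, bound):
    """Answer 'a <= bound' by reading the single slice R_bound."""
    return list(rng[bound]), 1  # (matching rows, slices read)


# Hypothetical attribute list with cardinality 10.
values = [3, 2, 1, 2, 8, 2, 9, 0, 7, 5, 6, 4]
eq_result, eq_reads = query_le_equality(equality_slices(values, 10), 2)
re_result, re_reads = query_le_range(range_slices(values, 10), 2)
print(eq_reads, re_reads)
```

Both encodings return the same rows; the range-encoded index reads one slice where the equality-encoded index reads three, and the gap widens as the query bound grows.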

Pros and Cons of Bitmap Indices
- Pros:
  - Easy to build and to maintain
  - Easy to identify records that satisfy a complex multi-attribute predicate (multi-dimensional ad-hoc queries)
  - Very space efficient for attributes with low cardinality (number of distinct attribute values, e.g. “Yes”, “No”)
- Cons:
  - Space inefficient for attributes with high cardinality
  - A possible solution: Bitmap Compression
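The multi-attribute point can be illustrated with a small sketch: each single-attribute predicate yields one bitmap, and the combined predicate is a bitwise AND of those bitmaps. The attribute names `jet_e` and `n_jets`, their values, and the helper functions are hypothetical; Python integers stand in for bitmaps here.

```python
def to_bitmap(flags):
    """Pack an iterable of booleans into an int; bit i is set when item i is true."""
    bitmap = 0
    for row, flag in enumerate(flags):
        if flag:
            bitmap |= 1 << row
    return bitmap


def rows_of(bitmap):
    """Return the row numbers whose bit is set."""
    rows = []
    row = 0
    while bitmap:
        if bitmap & 1:
            rows.append(row)
        bitmap >>= 1
        row += 1
    return rows


# Hypothetical base data: two attributes for five objects.
jet_e = [106, 80, 120, 95, 106]
n_jets = [2, 3, 2, 1, 2]

# One bitmap per single-attribute predicate, combined with a bitwise AND.
match = to_bitmap(x == 106 for x in jet_e) & to_bitmap(n == 2 for n in n_jets)
print(rows_of(match))
```

Adding a further attribute to the predicate costs one more bitmap and one more AND, independent of which attributes an ad-hoc query happens to combine.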

Bitmap Compression
- Advantages:
  - Less disk space for storing indices
  - Indices can be read from disk into memory faster
  - More indices can be cached in memory
- Possible problems:
  - Difficult to combine bitmap compression with the optimal index designs reported in the literature
  - If bitmaps must be decompressed before performing Boolean operations, the decompression overhead might outweigh the advantages of compression

Various Bitmap Compression Algorithms
- Run Length Encoding (RLE):
  - one-sided (asymmetric) vs. two-sided (symmetric)
- Gzip (Lempel-Ziv, LZ):
  - verbatim (uncompressed) bitmap is compressed via zlib
- ExpGol:
  - variable bit length encoding (RLE-bitmap is compressed)
- Byte-Aligned Bitmap Compression (BBC):
  - variable byte length encoding (Oracle patent)
  - one-sided vs. two-sided (BBC1 vs. BBC2)
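Two of the listed schemes are easy to try out. The sketch below compresses a sparse verbatim bitmap with Python's standard `zlib` module and, separately, turns it into a two-sided run-length list that records runs of both 0s and 1s. The helper names and the test bitmap are assumptions for illustration; ExpGol and BBC are not shown.

```python
import zlib


def pack_bits(bits):
    """Pack a list of 0/1 values into bytes, most significant bit first."""
    out = bytearray((len(bits) + 7) // 8)
    for i, bit in enumerate(bits):
        if bit:
            out[i // 8] |= 0x80 >> (i % 8)
    return bytes(out)


def two_sided_rle(bits):
    """Two-sided RLE: (bit value, run length) pairs for runs of 0s and of 1s."""
    runs = []
    for bit in bits:
        if runs and runs[-1][0] == bit:
            runs[-1][1] += 1
        else:
            runs.append([bit, 1])
    return [(value, length) for value, length in runs]


# Hypothetical sparse bit slice: 10,000 bits with four bits set.
bits = [0] * 10_000
for position in (10, 500, 501, 9000):
    bits[position] = 1

verbatim = pack_bits(bits)
compressed = zlib.compress(verbatim)
runs = two_sided_rle(bits)
print(len(verbatim), len(compressed), len(runs))
```

A one-sided scheme would store run lengths for only one of the two bit values, which pays off when one value dominates.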

Algorithms for Boolean Operations on Compressed Bitmaps [Johnson VLDB99]
- Basic:
  - Input (I): two verbatim bitmaps
  - Output (O): one verbatim bitmap
- Inplace:
  - I: one verbatim bitmap + one RLE, ExpGol or BBC bitmap
  - O: one verbatim bitmap
- Direct:
  - I: two compressed bitmaps (RLE or BBC)
  - O: one compressed bitmap (RLE or BBC)
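The "Direct" variant can be illustrated on two-sided run-length lists: the AND below walks both run lists in step and emits a run list, never materialising a verbatim bitmap. It is a simplified sketch under the assumption that both inputs cover the same number of bits; it is not the algorithm from the cited paper, and all function names are hypothetical.

```python
def rle_encode(bits):
    """Two-sided RLE: list of (bit value, run length) pairs."""
    runs = []
    for bit in bits:
        if runs and runs[-1][0] == bit:
            runs[-1][1] += 1
        else:
            runs.append([bit, 1])
    return [(value, length) for value, length in runs]


def rle_decode(runs):
    """Expand a run list back into a verbatim list of bits."""
    bits = []
    for value, length in runs:
        bits.extend([value] * length)
    return bits


def rle_and(left, right):
    """AND two run lists of equal total length without decompressing them."""
    out = []
    i = j = 0
    left_rem = left[0][1] if left else 0
    right_rem = right[0][1] if right else 0
    while i < len(left) and j < len(right):
        step = min(left_rem, right_rem)      # overlap of the two current runs
        bit = left[i][0] & right[j][0]
        if out and out[-1][0] == bit:
            out[-1][1] += step               # extend the previous output run
        else:
            out.append([bit, step])
        left_rem -= step
        right_rem -= step
        if left_rem == 0:                    # advance to the next left run
            i += 1
            if i < len(left):
                left_rem = left[i][1]
        if right_rem == 0:                   # advance to the next right run
            j += 1
            if j < len(right):
                right_rem = right[j][1]
    return [(value, length) for value, length in out]


a = [1, 1, 0, 0, 1, 1, 1, 0]
b = [1, 0, 0, 1, 1, 1, 0, 0]
result_runs = rle_and(rle_encode(a), rle_encode(b))
print(result_runs)
```

The work is proportional to the number of runs rather than the number of bits, which is the attraction of operating directly on compressed bitmaps.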

Bitmap Indices for Scientific Data
- Bitmap indices of commercial products (Oracle, Sybase, Informix) are optimised for discrete attribute values, e.g. integers
- However, scientific data is mostly non-discrete, e.g. floating-point values
- Using commercial bitmap indices for non-discrete values would produce one bit slice per distinct attribute value!
- Possible solutions:
  - Build function-based indices on top of commercial indices:
    - See evaluation of DB-Group on Oracle’s bitmap indices
    - However, Oracle uses equality encoded bitmap indices (not optimised for range queries)!
  - Develop your own range-based bitmap indices (topic of my Ph.D. thesis)

Range Encoding for Non-Discrete Attribute Values
- Encoding of attribute ranges [0;140) rather than attribute values (7 logical but 6 physical bins)
Query processing: see next slide

A Novel Bitmap Algorithm - GenericRangeEncoding
- Extract candidate objects from the “candidate slice” via XOR with the “previous” bit slice, for the query x < 63
- Only these candidates need to be checked, rather than all candidates in the “candidate slice”
[Figure: candidate slice XOR previous slice; hits; objects; result after “candidate check”]

Towards a Cost Model for a Query Optimiser
- Basic idea:
  - Before a query is executed, the Query Optimiser calculates the I/O costs for both access paths, namely the sequential scan and the query based on the bitmap index
  - Given these costs, the Query Optimiser selects the access path with the lowest expected cost (cost-based Query Optimiser)
- Approach for a cost model based on GenericRangeEncoding:
  - Given the query range and the binning strategy, calculate the expected I/O costs for checking the candidate objects against the query constraint
  - Use a stochastic model
- Note: We do not attempt to discuss the whole approach. For details refer to http://kurts.home.cern.ch/kurts/research/diss.ps

Cost Model #1: #Candidates per Dimension
- For discrete attribute values the main bottleneck is the “index scan”
- For non-discrete attribute values the main bottleneck is the “candidate check”, i.e. all candidate objects must be checked against the query constraint
- Simplifying assumption: equally distributed and independent data values
- Max. number of expected candidates (E_c) per indexed attribute: E_c = O/b, where O … #total_objects, b … #bit_slices
  - e.g. 1,000,000 objects with 100 bins => 10,000 candidate objects

Cost Model #2: Page I/O for Candidates per Dimension
- Access granularity of the database is one page rather than one object
- Thus, if one object is accessed, the whole page is read
- Costs for page I/O [O’Neil, Quass 1997]:
  - C = p_tot * [1 - e^(-E_c / p_tot)], where p_tot … total #pages of all objects, E_c … expected #candidate objects
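The two cost-model formulas are simple enough to evaluate directly. The sketch below computes E_c = O/b and then the expected page I/O C for those candidates. The object and bin counts are the example from the previous slide; the page count of 5,000 is an assumed value for illustration only, and the function names are hypothetical.

```python
import math


def expected_candidates(total_objects, bit_slices):
    """E_c = O / b: maximum expected candidates per indexed attribute."""
    return total_objects / bit_slices


def candidate_page_io(total_pages, candidates):
    """C = p_tot * (1 - exp(-E_c / p_tot)): expected pages read for the candidates."""
    return total_pages * (1.0 - math.exp(-candidates / total_pages))


e_c = expected_candidates(total_objects=1_000_000, bit_slices=100)

# Assumed page count; the real value depends on page size and object size.
p_tot = 5_000
cost = candidate_page_io(p_tot, e_c)
print(e_c, round(cost, 1))
```

Because the cost saturates at p_tot, a candidate set that is large relative to the number of pages touches nearly every page, which is when the bitmap index loses its advantage over a sequential scan.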

My Bitmap Indices
- Bitmap indices are built on top of Objectivity/DB
- Single bit slices are based on a new version of HepODBMS Tags:
  - Persistent, scalable segmented VArrays called “sliced Tag” (column-wise clustering, see next slide)
  - Prefetch optimisation for concurrent reading
  - “Base objects”, i.e. non-indexed data, are also stored as sliced Tag
- Query Preprocessor:
  - with Koen Holtman (Caltech/CMS): “any” mathematical (query) expression can be evaluated
  - E.g. Bitmaps “jet1E 0.3 && jet2E > 5.5”
- Bitmap Compression:
  - with Theodore Johnson (AT&T Labs-Research) – [VLDB99/00] + own enhancements of Boolean operations for two-sided BBC

Clustering of Generic vs. Sliced Tags in HepODBMS
[Figure: Generic Tags (PAW: row-wise), “old” version, vs. Sliced Tags (PAW: column-wise), “new” version: not released yet]

Definitions and Assumptions for Verbatim Bitmap Indices
- First set of tests is based on 1,000,000 base objects with 25 attributes (dimensions)
- Attributes are clustered together (sliced Tag alias column-wise clustering)
- Attribute values are equally distributed and independent, and in the range of [0;100]
- Bitmap Index (BMI):
  - 100 equi-width bins per dimension
  - => Size of BMI ~3 times the size of the base objects
- Query selectivity per attribute (dimension):
  - #selected_attribute_values/#total_attribute_values (per dimension)
  - e.g. a3 30 % selectivity
- Total query selectivity:
  - #selected_objects/#total_objects
  - e.g. a3 40 => 12 % selectivity
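Under the stated assumption of independent attributes, the total selectivity of a conjunctive query is the product of the per-dimension selectivities. The short sketch below shows that arithmetic; the function name and the sample selectivities are illustrative assumptions (0.3 and 0.4 are chosen because their product matches the 12 % figure above).

```python
import math


def total_selectivity(per_dimension):
    """Total selectivity of an AND query over independent attributes."""
    return math.prod(per_dimension)


# Two dimensions with 30 % and 40 % selectivity give 12 % in total.
two_dim = total_selectivity([0.3, 0.4])

# Five dimensions, each with selectivity x = 0.5, give x**5.
five_dim = total_selectivity([0.5] * 5)
expected_hits = five_dim * 1_000_000  # out of the 1,000,000 base objects

print(round(two_dim, 2), five_dim, expected_hits)
```

The total selectivity drops quickly with the number of dimensions, so high-dimensional queries select only a small fraction of the base objects.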

5-Dimensional Query - Page I/O & Response Time
[Figure: page I/O and response time; total query sel. = x^5; sequential scan shown for comparison]
Max. speed-up of BMI relative to seq. scan: ~ factor 2
Note: All benchmarks in this talk are performed on cold disk cache!

10-Dimensional Query - Page I/O & Response Time
[Figure: page I/O and response time; total query sel. = x^10; sequential scan shown for comparison]
Max. speed-up of BMI relative to seq. scan: ~ factor 3

25-Dimensional Query - Page I/O & Response Time
[Figure: page I/O and response time; total query sel. = x^25; sequential scan shown for comparison]
Max. speed-up of BMI relative to seq. scan: ~ factor 5

Assumptions for Compressed Bitmap Indices
- 1,000,000 base objects with 25 attributes (dimensions)
- Attribute values are exponentially distributed and independent
- Bitmap Index (BMI):
  - 100 equi-width bins per dimension
  - => Size of BMI ~3 times the size of the base objects

2-Sided Byte-Aligned Bitmap Compression (BBC2)
[Figure: range-encoded bitmap index on exponentially distributed data; good compression ratio]

Verbatim vs Compressed (BBC2) Bitmap Indices
[Figure: advantage of compressed bitmap index]

Specific HEP Data
- Physics data: 1,401,020 Tags with 37 attributes (in Objectivity)
- Data size: 262 MB
- Index size: 790 MB (37 dimensions with 100 bins each)

Distribution Functions of Specific HEP Data
[Figure: data distribution of 4 different physics attributes; range-encoded BMIs with 100 bins]

BMI Results for Specific HEP Data
- For the particular queries we studied, we got a performance improvement of a factor of two for 10-dimensional queries (compared to the sequential scan), based on bitmap indices with 100 bins (~3 times the size of the base objects)
- Tests are based on real data with synthetic queries
- However, as we have seen, all the results are relative and highly dependent on:
  a) Data distribution
  b) Access patterns
  c) Binning strategy, which should reflect a) and b)
- For higher-dimensional queries the performance improvement can be even more significant!

Specific Sloan Digital Sky Server (SDSS) Data
- Sloan Digital Sky Server: 6,182,527 real astronomy objects (on top of Objectivity)
- Extraction of these objects and porting to sliced tags with bitmap indices
- In total: 65 bitmap indices (one index for each attribute)
- Data size (base objects): ~2 GB
- Index size: ~5.2 GB

SDSS Sample Queries
- From 357 query logs of 41 users, 49 queries are based on this data set (sxGalaxy)
- 3 typical multi-dimensional ones:

Q1: SELECT g,r,i FROM sxGalaxy WHERE ((RA() between 180 and 185) && (DEC() between 1. and 1.2) && (r between 10 and 18) && (i between 10 and 18) && (g between 10 and 18))

Q2: SELECT g,r,i FROM sxGalaxy WHERE ((g-r between 1.05 and 1.13) && (r-i between 0.42 and 0.51) && (r between 15.68 and 19.68))

Q3: SELECT u,g,r FROM sxGalaxy WHERE ((u-g between 0.0 and 0.75) && (g-r between 0.0 and 0.5) && (u between 18 and 23) && (g between 18 and 23) && (r between 18 and 23) && ((u-g)/(g-r) between 0.8 and 1.2))

BMI Results for Specific SDSS Data
- Speedup factor of queries against bitmap indices over queries against Sloan Sky Server:
  - Q1: speedup factor ~10
  - Q2: speedup factor ~20
  - Q3: speedup factor ~15
- Reasons for better performance of bitmap indices:
  - Better clustering of base objects: attribute-wise rather than object-wise
  - Low selectivity queries require fewer page I/Os than Sloan queries

Conclusions
- Depending on the data distribution, the query access pattern and the binning strategy, bitmap indices can significantly improve the response time of high-dimensional queries
- Detailed results can be found in the Ph.D. thesis: http://kurts.home.cern.ch/kurts/research/diss.ps
- Future work:
  - Collaboration with Arie Shoshani and John Wu from LBNL @ Berkeley to further improve query response time & bitmap compression
  - Improve the Cost Model for the Query Optimiser to increase the accuracy of predictions of I/O costs for queries against real data with various binning strategies
