Efficient Bitmap Indexes for Very Large Datasets
John Wu, Ekow Otoo, Arie Shoshani
Lawrence Berkeley National Laboratory
September, 2002


Outline
- Introduction
  - Example application: high-energy physics data
  - Task: range queries on high-dimensional data
  - Approach: bitmap index
  - To make it work: compression, encoding, binning
- New compression scheme
  - Best known scheme (BBC): CPU bound
  - Improve CPU efficiency: 10X
- Compressed bitmap index
  - Index size smaller than B-tree
  - Answers queries faster than B-tree, …
- Applying bitmaps to a feature tracking problem

Example I: High-energy Physics
- Selected attributes of STAR summary data (tags); table columns: OID, Run, Event, NLb, tpcTracks, Particles, Vertex, qxb[2], Energy
- Actual size (January 2002): 20 million objects, 502 attributes
- Typical data processing steps:
  - Collect raw data: collision events, … (done once)
  - Generate summary data (done once): attributes per event
  - Access data according to summary attributes (performed by many scientists): <=Run & 200<Energy<300 …

Range Queries on High-dimensional Data
- Typical query: partial range query
  - <=Run & 200<Energy<300 …
- Characteristics of data
  - Large: millions or billions of records
  - High-dimensional: hundreds of attributes per object
  - Appends in batches
  - Most attributes are not categorical (integer, floating-point values)
- Known solutions
  - Sequential scan
  - R-tree etc. are usually slower than sequential scan
  - Bitmap index is faster in some cases

Basic Bitmap Index
- Bitmap index is efficient for processing range queries on read-only data (P. O'Neil, 1987)
- [Figure: the basic bitmap index. A data table with columns NLb, Qxb[2], eventTime, shown next to one bitmap per NLb value (NLb=0, NLb=1, NLb=6)]
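
A minimal sketch of the basic (equality-encoded) bitmap index described above. It assumes Python integers stand in for bitmaps, with bit i representing row i; the NLb values and the function names are made up for illustration and are not taken from the talk.

```python
from collections import defaultdict

def build_equality_index(values):
    """Return {value: bitmap}; bit i of a bitmap is set when values[i] == value."""
    index = defaultdict(int)
    for row, v in enumerate(values):
        index[v] |= 1 << row
    return dict(index)

def range_query(index, lo, hi):
    """OR together the bitmaps of every indexed value v with lo <= v < hi."""
    result = 0
    for v, bitmap in index.items():
        if lo <= v < hi:
            result |= bitmap
    return result

def rows(bitmap):
    """List the row numbers whose bit is set."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

# Hypothetical NLb column with seven rows.
nlb = [0, 1, 6, 1, 0, 6, 3]
nlb_index = build_equality_index(nlb)

# Rows satisfying 1 <= NLb < 4.
hits = range_query(nlb_index, 1, 4)
print(rows(hits))  # [1, 3, 6]
```

The range query touches one bitmap per distinct value inside the range, which is why the number of bitmaps matters for attributes with many distinct values.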

Features of Bitmap Index
- Main operations are bitwise logical operations, which are fast
- Index sizes are small for categorical attributes with low cardinality
- Each individual bitmap is small, and frequently used ones can be cached in memory
- Drawback: scientific datasets have mostly non-categorical attributes
  - Index size may be large
  - Query processing may be slow

Effective Bitmap Index
To make the bitmap index effective for scientific datasets:
1. Binning: reduce the number of bitmaps
   - Say 0 <= NLb < 4000; we can use 20 equal-size bins: [0,200), [200,400), [400,600), …
2. Encoding: reduce the number of bitmaps or reduce the number of operations
   - Basic: equality encoding generates one bitmap for each bin (shown above)
   - Other: range encoding, interval encoding, …
3. Compression: reduce the size of each bitmap; may also speed up the logical operations
   - Find an efficient compression scheme to reduce query processing time
   - This talk only addresses the issue of compression
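
The binning step can be sketched as follows, using the slide's example of 0 <= NLb < 4000 split into 20 equal-size bins. The function names are illustrative. The split into "full" and "edge" bins is an assumption about how a binned index would typically be used: bins lying wholly inside the query range can be answered from bitmaps alone, while rows in partially overlapping bins would need a check against the raw values for an exact answer.

```python
def bin_number(value, lo=0, hi=4000, nbins=20):
    """Map a value in [lo, hi) to its equal-width bin number."""
    if not lo <= value < hi:
        raise ValueError("value outside the binned range")
    width = (hi - lo) // nbins
    return (value - lo) // width

def bins_for_query(qlo, qhi, lo=0, hi=4000, nbins=20):
    """Classify bins against the query qlo <= x < qhi.

    Returns (full, edge): bins wholly inside the query range, and bins
    that only partly overlap it.
    """
    width = (hi - lo) // nbins
    full, edge = [], []
    for b in range(nbins):
        start, end = lo + b * width, lo + (b + 1) * width
        if qlo <= start and end <= qhi:
            full.append(b)
        elif start < qhi and qlo < end:
            edge.append(b)
    return full, edge

print(bin_number(250))           # 1, i.e. the bin [200, 400)
print(bins_for_query(250, 600))  # ([2], [1])
```

With 20 bins the index holds 20 bitmaps regardless of how many distinct NLb values occur, at the cost of extra work on the edge bins.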

Efficient Compression Schemes: Word-aligned Hybrid Code

Efficient Compression Schemes
- Best known compression scheme for bitmap indexes: byte-aligned bitmap code (BBC)
  - Uses run-length encoding
  - Encodes/decodes bitmaps 8 bits (one byte) at a time
  - Compresses nearly as well as LZ77 (gzip)
  - Bitwise logical operations can be performed on compressed bitmaps directly
  - Operations are usually faster than with other compression schemes, e.g., ExpGol, …
  - Even faster than operating on uncompressed bitmaps in some cases
  - Used in ORACLE
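
To illustrate what operating on compressed bitmaps directly can look like, here is a sketch of a bitwise AND over two run-length encoded bitmaps. This is a deliberately simplified representation, a list of (bit, run_length) pairs, and not BBC's byte-aligned format; it assumes both inputs describe the same number of bits and that all run lengths are positive.

```python
def rle_and(a, b):
    """AND two run-length encoded bitmaps without expanding them.

    Each input is a list of (bit, run_length) pairs covering the same
    total number of bits. The result uses the same representation.
    """
    out = []
    ia = ib = 0          # index of the next run to fetch from a and b
    left_a = left_b = 0  # bits remaining in the current run of a and b
    bit_a = bit_b = 0
    while True:
        if left_a == 0:
            if ia == len(a):
                break
            bit_a, left_a = a[ia]
            ia += 1
        if left_b == 0:
            if ib == len(b):
                break
            bit_b, left_b = b[ib]
            ib += 1
        # Consume the overlap of the two current runs in one step.
        n = min(left_a, left_b)
        bit = bit_a & bit_b
        if out and out[-1][0] == bit:
            out[-1] = (bit, out[-1][1] + n)  # extend the previous run
        else:
            out.append((bit, n))
        left_a -= n
        left_b -= n
    return out

a = [(0, 5), (1, 10), (0, 5)]   # ones at positions 5..14
b = [(1, 8), (0, 4), (1, 8)]    # ones at positions 0..7 and 12..19
print(rle_and(a, b))
# [(0, 5), (1, 3), (0, 4), (1, 3), (0, 5)]
```

The loop does work proportional to the number of runs, not the number of bits, which is how a compressed operation can beat the uncompressed one on bitmaps with long runs.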

Operations With BBC Are CPU Bound
- Bitwise logical operations on BBC compressed bitmaps are CPU bound
  - Goal: reduce CPU time
- CPU time is about 80% of total time on a system with a 20 MB/s disk suite
- Two independent implementations of BBC show similar behavior
- Operation measured: read two files from disk and perform one logical operation in memory

Word-Aligned Hybrid Code
- Word-aligned hybrid code (WAH)
  - Uses run-length encoding for long sequences of identical bits
  - Encodes/decodes bitmaps in word-size chunks
  - Designed for minimal decoding to gain speed

Word-Aligned Hybrid Code
- Group bits into 31-bit groups
- Encode each group using one word
- Merge neighboring groups with identical bits
- [Figure: a bitmap of … bits, split into 31 bits, 31*31 bits, and 31 bits, is encoded in three WAH words: a literal word (01000…), a fill word (100…11111) whose run length is 31, and a literal word (001…111)]
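
The grouping and merging steps above can be sketched as follows. The word layout is an assumption based on the commonly described 32-bit WAH format: a literal word has its top bit 0 and carries 31 bitmap bits, while a fill word has its top bit 1, the next bit giving the fill value, and the low 30 bits counting how many 31-bit groups the fill covers. For brevity the sketch requires the input length to be a multiple of 31, and the function names are illustrative.

```python
GROUP = 31                   # bitmap bits carried per 32-bit word
FILL_FLAG = 1 << 31          # top bit set marks a fill word
FILL_BIT = 1 << 30           # value of the bit being repeated
MAX_RUN = (1 << 30) - 1      # low 30 bits hold the run length in groups
ALL_ONES = (1 << GROUP) - 1

def wah_encode(bits):
    """Encode a list of 0/1 values into a list of 32-bit WAH words."""
    if len(bits) % GROUP:
        raise ValueError("sketch requires a multiple of 31 bits")
    words = []
    for start in range(0, len(bits), GROUP):
        group = 0
        for b in bits[start:start + GROUP]:
            group = (group << 1) | b
        if group in (0, ALL_ONES):
            fill_bit = FILL_BIT if group else 0
            prev = words[-1] if words else 0
            same_fill = (prev & FILL_FLAG) and (prev & FILL_BIT) == fill_bit
            if same_fill and (prev & MAX_RUN) < MAX_RUN:
                words[-1] = prev + 1          # merge into the previous fill
            else:
                words.append(FILL_FLAG | fill_bit | 1)
        else:
            words.append(group)               # mixed bits: literal word
    return words

def wah_decode(words):
    """Expand WAH words back into a list of 0/1 values."""
    bits = []
    for w in words:
        if w & FILL_FLAG:
            bit = 1 if w & FILL_BIT else 0
            bits.extend([bit] * (GROUP * (w & MAX_RUN)))
        else:
            bits.extend((w >> s) & 1 for s in range(GROUP - 1, -1, -1))
    return bits

# 31 literal bits, then 31 groups of zeros (31*31 bits), then 31 literal bits.
bitmap = [0, 1] + [0] * 29 + [0] * (31 * 31) + [0, 0] + [1] * 29
encoded = wah_encode(bitmap)
print([f"{w:08X}" for w in encoded])
# ['20000000', '8000001F', '1FFFFFFF']
```

The 1023-bit input collapses into three words, matching the shape of the slide's example: literal, fill with run length 31, literal.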

Information About the Test Setup
- Hardware and system
  - Sun Enterprise 450 (UltraSPARC II, 400 MHz)
  - VERITAS Volume Manager (striped disk), measured I/O speed 20 MB/s
- Real application data from STAR
  - About 2.2 million records, 500 attributes
- Synthetic data
  - 100 million records, 10 attributes
- Terms
  - Compression ratio: ratio of compressed bitmap size to uncompressed bitmap size
  - Times reported are wall-clock times in seconds

Fraction of Time Spent in CPU
- [Figure panels: on a 2 MB/s disk system; on a 20 MB/s disk system]
- Compared to two implementations of BBC, WAH spends a smaller fraction of time in CPU

Logical Operation Time
- Synthetic data, 100 million records
- WAH is 2-20 times faster than BBC

Logical Operation Time
- STAR data, 2.2 million records
- WAH is 2-60 times faster than BBC

Trade-off of Compression Schemes
- [Figure: space versus speed trade-off for uncompressed, WAH, gzip, BBC, ExpGol, and PacBits]

Performance of the Full Queries Using the Basic Bitmap Index
- Bitmap index setup:
  - One bitmap per value (no bins)
  - Equality encoding
- What is being measured:
  - Time to answer range queries (not individual logical operations)
  - High cardinality attributes from STAR

WAH index scales linearly with data size
- Range queries over different datasets: STAR (2.2 mil), Combustion (25), Synthetic (100)
- Query processing time is proportional to index size: 1 sec corresponds to 100 MB

Multi-attribute Range Queries: High Cardinality Attributes
- [Figure panels: 2 attributes per query; 5 attributes per query]
- WAH compressed indexes are 10X faster than ORACLE, 5X faster than our BBC
- P scan scans a vertical projection of the data table; it is the simplest option for processing partial range queries on high-dimensional data
- Queries on the 12 most queried attributes, average cardinality 222,000

Summary of Tests on STAR Data
- [Table: columns are Indexing Method, Size (X data), and Time (sec) plus time relative to p scan for both exact answers and approximate answers; row labels are Native vertical partition (WAH), P Scan, bins, bins, bins, No bins, WAH vs. BBC, ORACLE, Scan, B-tree, Bitmap (no bins)]
- Our bitmap index can be 100X faster than ORACLE: 10X due to the compression scheme, 10X due to binning

Using Bitmaps for Feature Tracking: Adapting Compressed Bitmaps to Operations Outside of the Bitmap Index

Example II: Combustion
- Direct numerical simulation of auto-ignition process (solution of complex partial differential equations; data computed once but never modified)
- A simple model has 12 variables per cell; a realistic model may have hundreds
- Number of grid points: 2D 600 X 600 >>> 3D 1000 X 1000 X 1000
- Time steps: 100 >>> 1000s
- Data size: 1 GB >>> 10 TB
- Task: identify features and track them across time steps

Tasks
- Cell identification
  - Identify cells with values satisfying specified conditions
  - Typically a partial range query, like, “ ”
- Region growing (feature identification)
  - Connect neighboring cells into connected regions
- Feature tracking
  - Identify common cells in connected regions from different time steps

Basic Approach
- Cell identification
  - Scan data and perform comparisons
  - Solution is represented as a list of cell IDs
- Region growing
  - For each cell in the above list, search all its neighbors
  - Each region is a list of cell IDs
- Feature tracking
  - Sort cell IDs of each region and match cell IDs to identify common cells
  - Use bounding boxes to reduce unnecessary operations

Our Approach
- Cell identification
  - Vertically partition the data
  - Use bitmap index to speed up searches
  - Solutions are represented as compressed bitmaps
- Region growing
  - Convert the compressed bitmaps into line segments
  - Connect neighboring line segments into regions
  - Convert each region into a compressed bitmap
- Feature tracking
  - Use bitwise AND to identify common cells
  - Use bounding boxes to reduce unnecessary operations
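
The region growing and feature tracking steps above can be sketched on a tiny 2D grid. Several things here are assumptions made for illustration: bitmaps are uncompressed Python integers with cell ID = row * width + column, segments in adjacent rows are joined when their column ranges overlap (4-connectivity), and the bounding-box shortcut is left out. All function names are hypothetical.

```python
def bitmap_from_rows(rows):
    """Build a bitmap from strings of '0'/'1', one string per grid row."""
    width, bitmap = len(rows[0]), 0
    for y, row in enumerate(rows):
        for x, ch in enumerate(row):
            if ch == "1":
                bitmap |= 1 << (y * width + x)
    return bitmap

def line_segments(bitmap, width, height):
    """Return (row, x_start, x_end) runs of set cells, x_end exclusive."""
    segs = []
    for y in range(height):
        x = 0
        while x < width:
            if (bitmap >> (y * width + x)) & 1:
                start = x
                while x < width and (bitmap >> (y * width + x)) & 1:
                    x += 1
                segs.append((y, start, x))
            else:
                x += 1
    return segs

def grow_regions(bitmap, width, height):
    """Connect overlapping segments in adjacent rows; one bitmap per region."""
    segs = line_segments(bitmap, width, height)
    parent = list(range(len(segs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, (y1, s1, e1) in enumerate(segs):
        for j in range(i + 1, len(segs)):
            y2, s2, e2 = segs[j]
            if y2 - y1 == 1 and s1 < e2 and s2 < e1:
                parent[find(i)] = find(j)

    regions = {}
    for i, (y, s, e) in enumerate(segs):
        mask = ((1 << (e - s)) - 1) << (y * width + s)
        root = find(i)
        regions[root] = regions.get(root, 0) | mask
    return list(regions.values())

def track(regions_t0, regions_t1):
    """Pairs (i, j) of regions from two time steps that share a cell."""
    return [(i, j) for i, a in enumerate(regions_t0)
            for j, b in enumerate(regions_t1) if a & b]

t0 = bitmap_from_rows(["1100", "0101", "0001"])
t1 = bitmap_from_rows(["0000", "0110", "0000"])
r0 = grow_regions(t0, 4, 3)
r1 = grow_regions(t1, 4, 3)
print(track(r0, r1))  # [(0, 0)]
```

Once each region is a bitmap, checking whether two regions share cells is a single AND, which replaces the sort-and-match of cell ID lists in the basic approach.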

Preliminary Performance Data
Total time (sec); … time steps, 600 X 600 grid, condition HO2 > 10^-7

Task                 Method                 Time
Cell identification  Horizontal partition   75 seconds
                     Vertical partition     5 seconds
                     Bitmap index           0.1 seconds
Region growing       Point based algorithm  8 seconds
                     Line based algorithm   1.7 seconds
Feature tracking     Comparing cell IDs     10 seconds
                     Bitmap operations      0.2 seconds

- Compressed bitmaps can be efficiently used for feature tracking

Summary
- The size of a WAH compressed bitmap index is modest even in the worst case
  - For most high cardinality attributes with N records, the index size is about 2N words; never more than 4N words
- The WAH compressed index is efficient on attributes of any cardinality
  - On range queries, it is faster than an uncompressed bitmap index (3X), a BBC compressed index (2~20X), a B+-tree index (20~200X), and scanning a vertically partitioned table (4~50X)
- Compressed bitmaps can also be efficiently used for feature tracking

Sizes of Compressed Bitmap Indexes
- 10^8 records
- Test attribute: 1,2,3,…,1,2,3,… (worst case in terms of index size)
- B+-tree size (observed): 3~4 x 10^8 words
- WAH compressed index is not larger than B+-tree

Summary of Tests on STAR Data (I)
- [Table: columns are B+-tree, P scan, and bitmap index (Oracle, BBC, WAH); for the low cardinality case and the high cardinality case, rows give Size (MB) and query processing time (seconds) for 1-attribute, …-attribute, and …-attribute queries]
- Compressed bitmap index is more efficient for range queries than B+-tree or no index (p scan)
- A WAH compressed index uses more space than a BBC compressed index, but is more efficient

Multi-attribute Range Queries: Low Cardinality Attributes
- [Figure panels: 2 attributes per query; 5 attributes per query]
- WAH compressed indexes are faster than BBC compressed indexes (3X) and uncompressed indexes (3X)
- Query box is the relative volume of the box formed by the query condition
- 12 lowest cardinality attributes of STAR, average attribute cardinality 26

Total Effect of Compression and Encoding Schemes
- Bottom line on queries
  - Compression scheme determines efficiency of logical operations
  - Encoding scheme determines number of operations
    - Range & interval: only one logical operation over 2 bitmaps
    - Equality: many operations depending on number of bins
  - But, space may be a consideration
- What is the trade-off?
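
The difference in operation counts between equality and range encoding can be sketched as follows. Bitmaps are plain Python integers, the bin numbers are made up, and the range-encoded layout used here (bitmap i marks rows whose bin number is <= i) is one common convention assumed for illustration; interval encoding is not shown.

```python
def equality_encode(bins, nbins):
    """Bitmap b marks the rows whose bin number equals b."""
    bitmaps = [0] * nbins
    for row, b in enumerate(bins):
        bitmaps[b] |= 1 << row
    return bitmaps

def range_encode(bins, nbins):
    """Bitmap i marks the rows whose bin number is <= i."""
    bitmaps = [0] * nbins
    for row, b in enumerate(bins):
        for i in range(b, nbins):
            bitmaps[i] |= 1 << row
    return bitmaps

def query_equality(bitmaps, lo, hi):
    """Rows with lo <= bin <= hi; returns (bitmap, logical operations used)."""
    result, ops = bitmaps[lo], 0
    for b in range(lo + 1, hi + 1):
        result |= bitmaps[b]
        ops += 1
    return result, ops

def query_range(bitmaps, lo, hi):
    """Same query on range-encoded bitmaps: at most one operation."""
    if lo == 0:
        return bitmaps[hi], 0
    # bitmaps[lo - 1] is a subset of bitmaps[hi], so XOR removes it.
    return bitmaps[hi] ^ bitmaps[lo - 1], 1

# Hypothetical bin numbers for eight rows, five bins.
row_bins = [0, 3, 1, 4, 2, 3, 0, 1]
eq = equality_encode(row_bins, 5)
rg = range_encode(row_bins, 5)

print(query_equality(eq, 1, 3))  # (182, 2)
print(query_range(rg, 1, 3))     # (182, 1)
```

Both encodings return the same rows, but the equality-encoded query needs one OR per extra bin in the range while the range-encoded query needs a single operation over two bitmaps. The cost is space: each row sets a bit in many range-encoded bitmaps, so they tend to be denser.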

Interval Encoding Is Better Overall (WAH Compression)
- Points on the graphs represent 10, 20, 30, 50, and 100 bins
- Average time for random range queries

Storing Bitmaps As Files Is Efficient
- BMI: stores bitmaps in Objectivity
- IBIS: stores bitmaps in files
- IBIS answers queries about 4 times faster than BMI using WAH
- BMI with WAH is up to ten times faster than BMI with BBC
- Joint work with Kurt Stockinger (CERN)