March, 2002 Efficient Bitmap Indexing Techniques for Very Large Datasets Kesheng John Wu Ekow Otoo Arie Shoshani

Problem Statement
Main objective: map logical requests to qualified objects
- A logical request: <=eventTime & 200<energy<300 …
- Objects: set of object ids; set of files containing the objects; offsets within the files, …

Application: STAR
[Table: column headers OID, dsthist, mEventNumber, mEventTime, mRunNumber, NLb, n_clus_tpc_in[13], numberOfPrimaryTracks, ChargedParticles_Means[1], PrimaryVertexX, qxb[2], zdc2Energy]
A portion of the STAR tag dataset: 3 events with 12 attributes, out of millions of events with 502 attributes.

Application: Combustion
Direct numerical simulation of the auto-ignition process (solution of complex partial differential equations)
- A dozen or more variables are computed at each time step and each grid point
- Number of grid points: 2D 600 X 600 >>> 3D 1000 X 1000 X 1000
- Time steps: 100 >>> 1000s
- Data size: 1 GB >>> 10 TB
Task: identify features and track them across time steps
- e.g. find the flame front across time
- Find “600<temp<700” for 1 billion points per time step, and discover the overlap between time steps
Use compressed bitmaps to accelerate both feature extraction and feature tracking
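
The tracking task above can be sketched in a few lines. This is only an illustration under stated assumptions: each time step is a flat list of temperatures indexed by grid point, the sample data is made up, the names `feature_bitmap` and `set_positions` are hypothetical, and uncompressed Python integers stand in for the compressed bitmaps.

```python
def feature_bitmap(temps, low=600.0, high=700.0):
    """Return an int whose bit i is 1 iff low < temps[i] < high."""
    bits = 0
    for i, t in enumerate(temps):
        if low < t < high:
            bits |= 1 << i
    return bits


def set_positions(bits):
    """List the grid-point indices whose bit is set."""
    return [i for i in range(bits.bit_length()) if (bits >> i) & 1]


# Hypothetical temperatures at 8 grid points for two consecutive time steps.
step0 = [550.0, 620.0, 680.0, 710.0, 640.0, 500.0, 699.0, 600.0]
step1 = [560.0, 590.0, 650.0, 690.0, 660.0, 610.0, 720.0, 605.0]

b0 = feature_bitmap(step0)   # feature extraction, time step 0
b1 = feature_bitmap(step1)   # feature extraction, time step 1
overlap = b0 & b1            # feature tracking: one bitwise AND

print(set_positions(b0))       # [1, 2, 4, 6]
print(set_positions(b1))       # [2, 3, 4, 5, 7]
print(set_positions(overlap))  # [2, 4]
```

Once the per-step bitmaps exist, the overlap costs a single logical operation regardless of how the feature was defined.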

Building a Bitmap Index
1. Partition each property into bins (binning)
   - e.g. for 0<NLb<4000, 20 equal-size bins: [0, 200) [200, 400) …
2. Generate a bit vector for each bin (encoding)
   - Bit i of bit vector j is 1 iff NLb[i] is in bin j
3. Compress each bit vector
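
Steps 1 and 2 can be sketched as follows. The function name `build_bitmap_index` and the sample NLb values are illustrative assumptions, each bit vector is held as an uncompressed Python integer, and step 3 (compression) is deliberately left out.

```python
def build_bitmap_index(values, low, high, num_bins):
    """Equal-width binning plus one bit vector (a Python int) per bin.

    Bit i of bitmaps[j] is 1 iff values[i] falls in bin j.
    """
    width = (high - low) / num_bins
    bitmaps = [0] * num_bins
    for i, v in enumerate(values):
        if not (low <= v < high):
            raise ValueError(f"value {v} outside [{low}, {high})")
        # Guard against floating-point rounding at the upper boundary.
        j = min(int((v - low) / width), num_bins - 1)
        bitmaps[j] |= 1 << i
    return bitmaps


# Hypothetical NLb values for six objects; 20 bins of width 200 over [0, 4000).
nlb = [15, 250, 399, 3999, 200, 0]
index = build_bitmap_index(nlb, 0, 4000, 20)

print([j for j, bm in enumerate(index) if bm])  # [0, 1, 19]
print(format(index[1], "06b"))                  # 010110 (objects 1, 2 and 4)
```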

Advantages of Bitmap Index
Bitmap index: a specialized index that takes advantage of
- Read-mostly data: data produced from scientific experiments can be appended in large groups
Fast operations
- “Predicate queries” can be performed with bitwise logical operations
  Predicate ops: =, inequality comparisons, range
  Logical ops: AND, OR, XOR, NOT
- They are well supported by hardware
Easy to compress, potentially small index size
Each individual bitmap is small, and frequently used ones can be cached in memory
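
To make the "predicate queries as bitwise operations" point concrete, here is a small sketch. The two indexes, their bin contents, and the helper `or_bins` are all hypothetical; bitmaps are plain Python integers with bit i standing for object i.

```python
from functools import reduce
from operator import or_


def or_bins(bitmaps, first, last):
    """OR together the bitmaps of bins first..last (inclusive)."""
    return reduce(or_, bitmaps[first:last + 1], 0)


# Hypothetical per-bin bitmaps over 6 objects, 4 bins per attribute.
energy_bins = [0b000011, 0b001100, 0b010000, 0b100000]
ntrack_bins = [0b100001, 0b000110, 0b011000, 0b000000]

n_objects = 6
all_ones = (1 << n_objects) - 1

energy_hit = or_bins(energy_bins, 1, 2)   # range predicate: energy in bins 1..2
ntrack_hit = all_ones & ~ntrack_bins[0]   # NOT: ntrack outside bin 0
result = energy_hit & ntrack_hit          # AND of the two predicates

print(format(result, "06b"))     # 011100 (objects 2, 3 and 4)
print(bin(result).count("1"))    # 3
```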

Operation-efficient Compression Methods
Best known: byte-aligned bitmap code (BBC)
- Uses run-length encoding (next slide)
- Byte alignment, optimized for space efficiency
- Decoding at the bit level, not optimal for operations
- Used in Oracle
We developed a new word-aligned scheme: WAH
- Uses run-length encoding
- Word alignment
- Designed for minimal decoding to gain speed

Operation-efficient Compression Methods
[Figure: an uncompressed bit sequence and its compressed form: 12, 4, 1000, 1, 8, 1000]
Store very short sequences as-is
Advantage: can perform AND, OR, COUNT operations on compressed data
Based on variations of run-length compression
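
The idea of operating directly on compressed data can be illustrated with plain run-length encoding. This sketch is neither BBC nor WAH: it ignores byte or word alignment and does not store short sequences as-is. The `(bit, run_length)` representation and all function names are assumptions made for the example.

```python
def rle_encode(bits):
    """Encode a list of 0/1 values as [(bit, run_length), ...]."""
    runs = []
    for b in bits:
        if runs and runs[-1][0] == b:
            runs[-1] = (b, runs[-1][1] + 1)
        else:
            runs.append((b, 1))
    return runs


def rle_and(a, b):
    """AND two run-length encoded bitmaps of equal length, run by run."""
    out = []
    i = j = 0
    rem_a = a[0][1] if a else 0
    rem_b = b[0][1] if b else 0
    while i < len(a) and j < len(b):
        step = min(rem_a, rem_b)          # overlap of the two current runs
        bit = a[i][0] & b[j][0]
        if out and out[-1][0] == bit:
            out[-1] = (bit, out[-1][1] + step)
        else:
            out.append((bit, step))
        rem_a -= step
        rem_b -= step
        if rem_a == 0:
            i += 1
            rem_a = a[i][1] if i < len(a) else 0
        if rem_b == 0:
            j += 1
            rem_b = b[j][1] if j < len(b) else 0
    return out


def rle_count(runs):
    """Count the 1 bits without expanding the runs."""
    return sum(n for bit, n in runs if bit == 1)


# Two hypothetical sparse bitmaps, 1016 bits each.
x = [0] * 12 + [1] * 4 + [0] * 1000
y = [0] * 10 + [1] * 5 + [0] * 1001

cx, cy = rle_encode(x), rle_encode(y)
print(cx)                       # [(0, 12), (1, 4), (0, 1000)]
print(rle_and(cx, cy))          # [(0, 12), (1, 3), (0, 1001)]
print(rle_count(rle_and(cx, cy)))  # 3
```

The AND touches only a handful of runs even though each bitmap is 1016 bits long, which is why long runs make operations on compressed bitmaps cheap.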

Trade-off of Compression Schemes
[Figure: space versus speed trade-off for uncompressed, WAH, BBC, gzip, ExpGol, and PacBits bitmaps]

Information About the Test Machines
Hardware and system
- Sun Enterprise 450 (UltraSPARC II, 400 MHz)
- 4 GB RAM
- VERITAS Volume Manager (striped disk)
Real application data from STAR
- More than 2 million objects, 12 attributes
Synthetic data
- 100 million objects, 10 attributes
Terms
- Compression ratio: ratio of compressed bitmap size to uncompressed bitmap size
- Times reported are wall-clock times in seconds

Logical Operation Time (Synthetic Data)
10X improvement

Logical Operation Time (STAR Data)
Also 10X improvement

Encoding Schemes – Main Idea
[Figure: equality encoding, range encoding, and interval encoding illustrated with 12 bins]
Interval and range encoding: a query operates on 2 bitmaps only!

Total Effect of Compression and Encoding Schemes
Bottom line on queries
- Compression scheme determines the efficiency of logical operations
- Encoding scheme determines the number of operations
  Range & interval: only one logical operation over 2 bitmaps
  Equality: many operations, depending on the number of bins
- But space may be a consideration
What is the trade-off?
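
The difference in operation counts can be seen in a small sketch comparing equality encoding with range encoding (interval encoding is not implemented here). It assumes the usual definition of range encoding, in which bitmap j marks every object whose bin is at most j; the function names and the sample bin assignments are hypothetical, and the last range bitmap (all ones) is kept only for simplicity.

```python
def equality_encode(bin_ids, num_bins):
    """E[j] has bit i set iff object i is in bin j."""
    bitmaps = [0] * num_bins
    for i, b in enumerate(bin_ids):
        bitmaps[b] |= 1 << i
    return bitmaps


def range_encode(bin_ids, num_bins):
    """R[j] has bit i set iff object i is in a bin <= j."""
    bitmaps = [0] * num_bins
    for i, b in enumerate(bin_ids):
        for j in range(b, num_bins):
            bitmaps[j] |= 1 << i
    return bitmaps


def query_equality(E, lo, hi):
    """Bins lo..hi: ORs hi - lo + 1 bitmaps together."""
    result = 0
    for j in range(lo, hi + 1):
        result |= E[j]
    return result


def query_range(R, lo, hi):
    """Bins lo..hi: reads at most two bitmaps."""
    if lo == 0:
        return R[hi]
    # R[lo - 1] is a subset of R[hi], so a single XOR removes it.
    return R[hi] ^ R[lo - 1]


# Hypothetical bin assignment of 8 objects over 12 bins.
bin_ids = [0, 3, 7, 11, 5, 3, 9, 4]
E = equality_encode(bin_ids, 12)
R = range_encode(bin_ids, 12)

print(format(query_equality(E, 3, 7), "08b"))  # 10110110 (five bitmaps read)
print(format(query_range(R, 3, 7), "08b"))     # 10110110 (two bitmaps read)
```

The price of the cheaper query is denser bitmaps, which is where the space consideration on the slide comes in.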

Interval Encoding Is Better Overall (WAH Compression)
[Figure: average time for random range queries; points on the graphs represent 10, 20, 30, 50, and 100 bins]

Timing Results
[Table: columns Method, Index (X data), Time (sec), Speed; row labels ORACLE (scan, B-tree), native vertical partition scan, and three rows given by number of bins; surviving cell text on the ORACLE scan row: "060.1"]

Summary
Compressed bitmap indices are effective for range queries
Better compression scheme
- 50% more space, but 12 times faster!
Among the different encoding schemes
- Interval encoding is the overall winner

Future Work
- Support NULL values and categorical values
- On-line update: add new data and update the index without interrupting request processing
- Recovery mechanism for robustness
- Potential new applications: climate, astrophysics, biology (microarrays)
- Study non-uniform binning strategies
- Study more encoding schemes
- Integrate with a conventional database system: to better handle metadata, to provide a more versatile front-end

How Many Bins for Continuous Domains?
[Figure: a 2-D query over Range(x) and Range(y), with the edge bins marked]
More bins: fewer objects in edge bins
Searching edge bins: skip-scan over the “attribute vertical partition”
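
The effect of the bin count on edge bins can be sketched for a single attribute. The sketch assumes equal-width bins, uses a direct lookup into the raw value list as a stand-in for the skip-scan over the attribute's vertical partition, and all names and sample values are hypothetical.

```python
def build_bins(values, low, width, num_bins):
    """One bitmap (Python int) per equal-width bin."""
    bitmaps = [0] * num_bins
    for i, v in enumerate(values):
        j = min(int((v - low) / width), num_bins - 1)
        bitmaps[j] |= 1 << i
    return bitmaps


def range_query(values, bitmaps, low, width, qlo, qhi):
    """Return (hits bitmap, raw values checked) for qlo <= v < qhi."""
    hits = 0
    checked = 0
    for j, bm in enumerate(bitmaps):
        b_lo = low + j * width
        b_hi = b_lo + width
        if b_hi <= qlo or b_lo >= qhi:
            continue              # bin entirely outside the query
        if qlo <= b_lo and b_hi <= qhi:
            hits |= bm            # bin entirely inside: index alone suffices
            continue
        # Edge bin: only some objects qualify, so check their raw values.
        i = 0
        while bm >> i:
            if (bm >> i) & 1:
                checked += 1
                if qlo <= values[i] < qhi:
                    hits |= 1 << i
            i += 1
    return hits, checked


# Hypothetical attribute values in [0, 50); query 13 <= v < 41.
values = [5, 12, 18, 25, 31, 38, 44, 49, 27, 33]

hits5, checked5 = range_query(values, build_bins(values, 0, 10, 5), 0, 10, 13, 41)
hits10, checked10 = range_query(values, build_bins(values, 0, 5, 10), 0, 5, 13, 41)

print(hits5 == hits10)       # True: same answer with either binning
print(checked5, checked10)   # 4 2: more bins, fewer raw values to check
```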