Published by Alan Harvey Payne
1
Efficient Bitmap Indexing Techniques for Very Large Datasets
Kesheng John Wu, Ekow Otoo, Arie Shoshani
March, 2002
2
Problem Statement
Main objective: map logical requests to qualifying objects
- A logical request: 20001015<=eventTime & 200<energy<300 …
- Objects: set of object ids; set of files containing the objects; offsets within the files, …
3
Application: STAR
A portion of the STAR tag dataset: 3 events with 12 attributes, out of millions of events with 502 attributes.
Columns: OID, dst, hist, mEventNumber, mEventTime, mRunNumber, NLb
0159625159627263520000827.0 11759 12390291341
1159625159627263620000827.0 11759 12390291470
2159625159627263720000827.0 11759 12390291663
Columns: OID, n_clus_tpc_in[13], numberOfPrimaryTracks, ChargedParticles_Means[1], PrimaryVertexX, qxb[2], zdc2Energy
09091228266.56-26.4048
112431415317.46-29.0853
212851533281.53-6.7548
4
Application: Combustion
- Direct numerical simulation of the auto-ignition process (solution of complex partial differential equations)
- A dozen or more variables are computed at each time step and each grid point
  - Number of grid points: 2D 600 X 600 >>> 3D 1000 X 1000 X 1000
  - Time steps: 100 >>> 1000s
  - Data size: 1 GB >>> 10 TB
- Task: identify features and track them across time steps
  - e.g. find the flame front across time
  - Find “600<temp<700” for 1 billion points per time step, and discover overlap between time steps
- Use compressed bitmaps to accelerate both feature extraction and feature tracking
5
Building a Bitmap Index
1. Partition each property into bins (binning)
   - e.g. for 0<NLb<4000, 20 equal-size bins: [0, 200) [200, 400) …
2. Generate a bit vector for each bin (encoding)
   - Bit i of bit vector j is 1 iff NLb[i] is in bin j
3. Compress each bit vector
[Figure: one group of bit vectors per property, labeled property 1, property 2, ..., property n]
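The binning and encoding steps can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `build_bitmap_index`, the use of Python integers as bit vectors, and the sample values are all assumptions made for the example, and the compression step is left out.

```python
def build_bitmap_index(values, lo, hi, num_bins):
    """Return one bit vector (a Python int) per equal-width bin over [lo, hi)."""
    width = (hi - lo) / num_bins
    bitmaps = [0] * num_bins
    for i, v in enumerate(values):
        if not lo <= v < hi:
            raise ValueError(f"value {v} is outside [{lo}, {hi})")
        # Guard against floating-point rounding at the top edge.
        j = min(int((v - lo) / width), num_bins - 1)
        # Bit i of bit vector j is 1 iff values[i] falls in bin j.
        bitmaps[j] |= 1 << i
    return bitmaps

# Made-up attribute values (not taken from the STAR data), 20 bins over [0, 4000).
sample = [150, 200, 399, 3999]
index = build_bitmap_index(sample, 0, 4000, 20)
```

Each row sets exactly one bit across all bins, so the bit vectors of one property are disjoint and together cover every row.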
6
Advantages of Bitmap Index
- Bitmap index: a specialized index that takes advantage of read-mostly data
  - Data produced by scientific experiments can be appended in large groups
- Fast operations
  - “Predicate queries” can be performed with bitwise logical operations
    Predicate ops: =, comparison operators, range
    Logical ops: AND, OR, XOR, NOT
  - They are well supported by hardware
- Easy to compress, potentially small index size
- Each individual bitmap is small, and frequently used ones can be cached in memory
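To make the "predicate queries as bitwise operations" point concrete, here is a small sketch that assumes uncompressed bit vectors stored as Python integers. The six-row table, the bin contents, and the helper names `or_bins` and `row_ids` are hypothetical and exist only for this example.

```python
from functools import reduce
from operator import or_

def or_bins(bitmaps, first, last):
    """OR bit vectors first..last (inclusive): rows whose value is in any of those bins."""
    return reduce(or_, bitmaps[first:last + 1], 0)

def row_ids(bitmap):
    """Positions of the set bits, i.e. the qualifying row numbers."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

NUM_ROWS = 6
ALL_ROWS = (1 << NUM_ROWS) - 1  # mask needed because Python ints are unbounded

# Hypothetical index: two attributes, three bins each, six rows.
energy_bins = [0b000011, 0b001100, 0b110000]  # rows 0-1, rows 2-3, rows 4-5
time_bins = [0b010101, 0b101010, 0b000000]    # rows 0,2,4 and rows 1,3,5

# "energy in bin 1 or 2" AND "time in bin 1"
hits = or_bins(energy_bins, 1, 2) & time_bins[1]

# NOT "time in bin 1", restricted to the rows that exist
not_time1 = ~time_bins[1] & ALL_ROWS
```

A range condition on one attribute becomes an OR over its bins, and conditions on different attributes are combined with AND.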
7
Operation-efficient Compression Methods
- Best known: byte-aligned bitmap code (BBC)
  - Uses run-length encoding (next slide)
  - Byte alignment, optimized for space efficiency
  - Decoding at the bit level, not optimal for operations
  - Used in Oracle
- We developed a new word-aligned scheme: WAH
  - Uses run-length encoding
  - Word alignment
  - Designed for minimal decoding to gain speed
8
Operation-efficient Compression Methods
Based on variations of run-length compression
Uncompressed: 0000000000001111000000000......0000001000000001111111100000000.... 000000
Compressed: 12, 4, 1000, 1, 8, 1000
- Store very short sequences as-is
- Advantage: AND, OR, and COUNT operations can be performed on compressed data
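The idea of operating directly on compressed data can be illustrated with plain run-length encoding. The sketch below is a deliberately simplified stand-in: it is neither the BBC nor the WAH format, it does not store short sequences as-is, and the names `rle_encode`, `rle_and`, and `rle_count` are invented for this example.

```python
def rle_encode(bits):
    """Collapse a sequence of 0/1 values into [bit, run_length] pairs."""
    runs = []
    for b in bits:
        if runs and runs[-1][0] == b:
            runs[-1][1] += 1
        else:
            runs.append([b, 1])
    return runs

def rle_and(a, b):
    """AND two run lists of equal total length without expanding them."""
    out = []
    i = j = 0
    left_a = a[0][1] if a else 0
    left_b = b[0][1] if b else 0
    while i < len(a) and j < len(b):
        step = min(left_a, left_b)   # longest stretch where both runs are constant
        bit = a[i][0] & b[j][0]
        if out and out[-1][0] == bit:
            out[-1][1] += step       # extend the previous output run
        else:
            out.append([bit, step])
        left_a -= step
        left_b -= step
        if left_a == 0:
            i += 1
            left_a = a[i][1] if i < len(a) else 0
        if left_b == 0:
            j += 1
            left_b = b[j][1] if j < len(b) else 0
    return out

def rle_count(runs):
    """COUNT: number of 1 bits, read straight off the run lengths."""
    return sum(length for bit, length in runs if bit)

x = [0] * 12 + [1] * 4 + [0] * 1000
y = [0] * 14 + [1] * 10 + [0] * 992
both = rle_and(rle_encode(x), rle_encode(y))
```

The AND loop advances by whole runs, so its cost depends on the number of runs rather than the number of bits.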
9
Trade-off of Compression Schemes
[Figure: compression schemes compared on space and speed axes: uncompressed, WAH, BBC, gzip, ExpGol, PacBits]
10
Information About the Test Machines
- Hardware and system
  - Sun Enterprise 450 (UltraSPARC II, 400 MHz)
  - 4 GB RAM
  - VERITAS Volume Manager (striped disk)
- Real application data from STAR
  - More than 2 million objects, 12 attributes
- Synthetic data
  - 100 million objects, 10 attributes
- Terms
  - Compression ratio: ratio of compressed bitmap size to uncompressed bitmap size
  - Times reported are wall-clock times in seconds
11
Logical Operation Time (Synthetic Data)
10X improvement
12
Logical Operation Time (STAR Data)
Also 10X improvement
13
Encoding Schemes – Main Idea
- Equality encoding
- Range encoding
- Interval encoding
12 bins: 1 2 3 4 5 6 7 8 9 10 11 12
Interval and range encoding: each query operates on only 2 bitmaps!
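The difference between equality and range encoding can be sketched as follows. This is an illustrative sketch under assumptions: bins are numbered from 0, bit vectors are Python integers, all function names are invented here, and interval encoding is omitted because its query evaluation has more cases than fit in a short example.

```python
def equality_encode(bin_ids, num_bins):
    """E[j] has bit i set iff row i falls in bin j."""
    bitmaps = [0] * num_bins
    for i, b in enumerate(bin_ids):
        bitmaps[b] |= 1 << i
    return bitmaps

def range_encode(bin_ids, num_bins):
    """R[j] has bit i set iff row i falls in a bin <= j.

    R[num_bins - 1] would be all ones, so it is not stored.
    """
    bitmaps = [0] * (num_bins - 1)
    for i, b in enumerate(bin_ids):
        for j in range(b, num_bins - 1):
            bitmaps[j] |= 1 << i
    return bitmaps

def query_equality(E, lo, hi):
    """Rows with lo <= bin <= hi: ORs together (hi - lo + 1) bit vectors."""
    result = 0
    for j in range(lo, hi + 1):
        result |= E[j]
    return result

def query_range(R, lo, hi, num_rows):
    """Same query on range encoding: reads at most 2 bit vectors."""
    all_rows = (1 << num_rows) - 1
    upper = R[hi] if hi < len(R) else all_rows  # rows with bin <= hi
    lower = R[lo - 1] if lo > 0 else 0          # rows with bin <= lo - 1
    return upper & ~lower

# Hypothetical bin assignments for six rows over 12 bins.
bin_ids = [0, 3, 11, 5, 3, 7]
E = equality_encode(bin_ids, 12)
R = range_encode(bin_ids, 12)
```

For the query "bins 3 through 7", the equality-encoded index ORs five bit vectors, while the range-encoded index combines `R[7]` and `R[2]` in a single AND-NOT step.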
14
Total Effect of Compression and Encoding Schemes
Bottom line on queries
- Compression scheme determines efficiency of logical operations
- Encoding scheme determines number of operations
  - Range & interval: only one logical operation over 2 bitmaps
  - Equality: many operations, depending on number of bins
- But space may be a consideration
What is the trade-off?
15
Interval Encoding Is Better Overall (WAH Compression)
Average time for random range queries
Points on the graphs represent 10, 20, 30, 50, and 100 bins.
16
Timing Results
Method                            Index (X data)   Time (sec)   Speed
ORACLE Scan                       0                6            0.1
ORACLE B-tree                     3.6              0.95         0.6
Native vertical partition Scan    0                0.57         1
20 bins                           0.18             0.11         5
50 bins                           0.43             0.07         8
100 bins                          0.90             0.05         11
17
Summary
- Compressed bitmap indices are effective for range queries
- Better compression scheme
  - 50% more space, but 12 times faster
- Among the different encoding schemes
  - Interval encoding is the overall winner
18
Future Work
- Support NULL values and categorical values
- On-line update: add new data and update the index without interrupting request processing
- Recovery mechanism for robustness
- Potential new applications: climate, astrophysics, biology (microarrays)
- Study non-uniform binning strategies
- Study more encoding schemes
- Integrate with a conventional database system: to better handle metadata, to provide a more versatile front-end
19
How Many Bins for Continuous Domains?
[Figure: bins over Range(x) and Range(y), with an edge bin labeled]
- More bins: fewer objects in edge bins
- Searching edge bins: skip-scan over the “attribute vertical partition”
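The edge-bin issue can be sketched for a single attribute. In this hedged example, bins fully inside the query range are accepted from the index alone, while rows in partially covered (edge) bins are checked against the raw values, standing in for the skip-scan over the vertical partition. All names, bin edges, and temperature values are made up for illustration.

```python
import bisect

def build_bins(values, edges):
    """One bit vector (Python int) per bin; bin j covers [edges[j], edges[j+1])."""
    bitmaps = [0] * (len(edges) - 1)
    for i, v in enumerate(values):
        j = bisect.bisect_right(edges, v) - 1
        if not 0 <= j < len(bitmaps):
            raise ValueError(f"value {v} is outside the binned domain")
        bitmaps[j] |= 1 << i
    return bitmaps

def set_bits(bitmap):
    """Yield the positions of the 1 bits in increasing order."""
    i = 0
    while bitmap:
        if bitmap & 1:
            yield i
        bitmap >>= 1
        i += 1

def open_range_query(values, bitmaps, edges, lo, hi):
    """Rows with lo < value < hi, plus how many raw values had to be checked."""
    hits = 0
    checked = 0
    for j, bm in enumerate(bitmaps):
        a, b = edges[j], edges[j + 1]
        if b <= lo or a >= hi:
            continue                 # bin lies entirely outside the query
        if a > lo and b <= hi:
            hits |= bm               # bin lies entirely inside: index alone suffices
            continue
        for i in set_bits(bm):       # edge bin: candidates need a raw-value check
            checked += 1
            if lo < values[i] < hi:
                hits |= 1 << i
    return sorted(set_bits(hits)), checked

temps = [50, 150, 250, 610, 650, 699, 700, 950]
coarse_edges = [0, 200, 400, 600, 800, 1000]   # 5 bins
fine_edges = list(range(0, 1001, 100))         # 10 bins

coarse = open_range_query(temps, build_bins(temps, coarse_edges), coarse_edges, 600, 700)
fine = open_range_query(temps, build_bins(temps, fine_edges), fine_edges, 600, 700)
```

Both runs return the same rows, but the finer binning leaves fewer candidates in edge bins, which is the trade-off the slide points at.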