1
Efficient Bitmap Indexes for Very Large Datasets
John Wu, Ekow Otoo, Arie Shoshani
Lawrence Berkeley National Laboratory
September, 2002
2
Outline
Introduction
  - Example application: high-energy physics data
  - Task: range queries on high-dimensional data
  - Approach: bitmap index
  - To make it work: compression, encoding, binning
New compression scheme
  - Best known scheme (BBC): CPU bound
  - Improve CPU efficiency: 10X
Compressed bitmap index
  - Index size smaller than B-tree
  - Answers queries faster than B-tree, …
Applying bitmaps to a feature tracking problem
3
Example I: High-energy Physics
Selected attributes of STAR summary data (tags). Actual size (January 2002): 20 million objects, 502 attributes.
OIDRunEventNLbtpcTracksParticlesVertexqxb[2]Energy
01239029263513419091228266.56-26.4048
112390292636147012431415317.46-29.0853
212390292637166312851533281.53-6.7548
Typical data processing steps:
  - Collect raw data: collision events, … (done once)
  - Generate summary data (done once): 10-100 attributes per event
  - Access data according to summary attributes (performed by many scientists): 20001015 <= Run & 200 < Energy < 300 …
4
Range Queries on High-dimensional Data
Typical query: partial range query
  - 20001015 <= Run & 200 < Energy < 300 …
Characteristics of data
  - Large: millions or billions of records
  - High-dimensional: hundreds of attributes per object
  - Appends in batches
  - Most attributes are not categorical (integer, floating-point values)
Known solutions
  - Sequential scan
  - R-tree etc. are usually slower than sequential scan
  - Bitmap index is faster in some cases
5
Basic Bitmap Index
Bitmap index is efficient for processing range queries on read-only data (P. O'Neil, 1987).
[Figure: the basic bitmap index. One group of bitmaps for each of the attributes NLb, Qxb[2], and eventTime; within a group there is one bitmap per attribute value (NLb=0, NLb=1, …, NLb=6), with one bit per record.]
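A minimal sketch of the basic (equality-encoded) bitmap index may help here. It is an illustration under stated assumptions, not the authors' implementation: Python integers stand in for uncompressed bitmaps, and the function names and the tiny `nlb` and `energy` columns are hypothetical.

```python
from collections import defaultdict

def build_equality_index(values):
    """Return {value: bitmap}; bit i of a bitmap is set when values[i] == value."""
    index = defaultdict(int)
    for row, v in enumerate(values):
        index[v] |= 1 << row
    return dict(index)

def range_query(index, low, high):
    """Bitmap of rows with low <= value <= high: OR the bitmaps of qualifying values."""
    result = 0
    for v, bitmap in index.items():
        if low <= v <= high:
            result |= bitmap
    return result

def rows_of(bitmap):
    """Decode a bitmap into a sorted list of row numbers."""
    rows, row = [], 0
    while bitmap:
        if bitmap & 1:
            rows.append(row)
        bitmap >>= 1
        row += 1
    return rows

# Hypothetical toy columns (8 records each); not taken from the STAR data.
nlb = [0, 1, 6, 1, 0, 3, 6, 2]
energy = [5, 7, 2, 9, 4, 8, 6, 1]

nlb_index = build_equality_index(nlb)
energy_index = build_equality_index(energy)

# Partial range query "1 <= NLb <= 3 AND 6 <= Energy <= 9":
# one OR pass per attribute, then a single bitwise AND across attributes.
hits = range_query(nlb_index, 1, 3) & range_query(energy_index, 6, 9)
print(rows_of(hits))
```

Each attribute is indexed independently, so a query touching only two of several hundred attributes reads only the bitmaps for those two.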
6
Features of Bitmap Index
Main operations are bitwise logical operations, and they are fast
Index sizes are small for categorical attributes with low cardinality
Each individual bitmap is small, and frequently used ones can be cached in memory
However, scientific datasets have mostly non-categorical attributes
  - Index size may be large
  - Query processing may be slow
7
Effective Bitmap Index
To make the bitmap index effective for scientific datasets:
1. Binning: reduce the number of bitmaps
  - Say 0 <= NLb < 4000; we can use 20 equal-size bins: [0,200), [200,400), [400,600), …
2. Encoding: reduce the number of bitmaps or reduce the number of operations
  - Basic: equality encoding generates one bitmap for each bin (shown above)
  - Other: range encoding, interval encoding, …
3. Compression: reduce the size of each bitmap; may also speed up the logical operations
  - Find an efficient compression scheme to reduce query processing time
  - This talk only addresses the issue of compression
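The binning step can be sketched as follows, using the slide's example of 20 equal-size bins over 0 <= NLb < 4000. This is a hedged illustration: the function names and sample values are hypothetical, Python integers stand in for bitmaps, and the candidate check against the raw values for partially covered edge bins is an assumption about how exact answers would be obtained from a binned index.

```python
def build_binned_index(values, low, high, nbins):
    """One bitmap per equal-width bin over [low, high); returns (bitmaps, bin width)."""
    width = (high - low) / nbins
    bitmaps = [0] * nbins
    for row, v in enumerate(values):
        b = min(int((v - low) // width), nbins - 1)
        bitmaps[b] |= 1 << row
    return bitmaps, width

def query_binned(values, bitmaps, low, width, q_lo, q_hi):
    """Answer q_lo <= v < q_hi; returns (approximate bitmap, exact bitmap)."""
    sure, candidates = 0, 0
    for b, bitmap in enumerate(bitmaps):
        b_lo, b_hi = low + b * width, low + (b + 1) * width
        if q_lo <= b_lo and b_hi <= q_hi:
            sure |= bitmap            # bin lies entirely inside the query range
        elif b_lo < q_hi and q_lo < b_hi:
            candidates |= bitmap      # edge bin: only partly inside the range
    exact, row, pending = sure, 0, candidates
    while pending:                    # assumed candidate check on the raw values
        if pending & 1 and q_lo <= values[row] < q_hi:
            exact |= 1 << row
        pending >>= 1
        row += 1
    return sure | candidates, exact

def rows_of(bitmap):
    """Decode a bitmap into a sorted list of row numbers."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

# Hypothetical NLb values; 20 bins of width 200 as in the slide.
nlb = [10, 250, 399, 450, 610, 3999, 205, 590]
bitmaps, width = build_binned_index(nlb, 0, 4000, 20)
approx, exact = query_binned(nlb, bitmaps, 0, width, 250, 600)
print(rows_of(approx), rows_of(exact))
```

Only 20 bitmaps are stored regardless of how many distinct values the attribute has; the price is that edge bins may contain rows that do not satisfy the query.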
8
Efficient Compression Schemes: Word-Aligned Hybrid Code
9
Efficient Compression Schemes
Best known compression scheme for bitmap indexes: byte-aligned bitmap code (BBC)
  - Uses run-length encoding
  - Encodes/decodes bitmaps 8 bits (one byte) at a time
  - Compresses nearly as well as LZ77 (gzip)
  - Bitwise logical operations can be performed on compressed bitmaps directly
  - Operations are usually faster than with other compression schemes, e.g., ExpGol, …
  - Even faster than operating on uncompressed bitmaps in some cases
  - Used in ORACLE
10
Operations With BBC Are CPU Bound
Bitwise logical operations on BBC compressed bitmaps are CPU bound, so the goal is to reduce CPU time
CPU time is about 80% of total time on a system with a 20 MB/s disk system
Two independent implementations of BBC show similar behavior
Operation measured: read two files from disk and perform one logical operation in memory
11
Word-Aligned Hybrid Code
Word-aligned hybrid code (WAH)
  - Uses run-length encoding for long sequences of identical bits
  - Encodes/decodes bitmaps in word-size chunks
  - Designed for minimal decoding to gain speed
12
Word-Aligned Hybrid Code
Example bitmap (1023 bits):
  10000000000000000000011100000000000000000000000000000 … 00000000000000000000000000000001111111111111111111111111
Group the bits into 33 31-bit groups: 31 bits | 31*31 bits | 31 bits
Merge neighboring groups with identical bits
Encode each group using one word:
  - Literal word: 01000…
  - Fill word: 100…11111 (run length is 31)
  - Literal word: 001…111
The WAH code includes three words
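The slide's example can be reproduced with a short encoder. This is a sketch under assumptions rather than the authors' code: it uses 32-bit words with the top bit as the literal/fill flag, the next bit as the fill value, and the low 30 bits as the run length counted in 31-bit groups (which matches the words shown on the slide). It also requires the input length to be a multiple of 31 and stores a lone uniform group as a fill of length 1. The exact bit counts in `bits` are chosen to mirror the slide's 1023-bit example.

```python
GROUP = 31                      # payload bits per word
ALL_ONES = (1 << GROUP) - 1
FILL_FLAG = 1 << 31             # top bit set: this word is a fill
FILL_BIT = 1 << 30              # value of the repeated bit in a fill
MAX_RUN = (1 << 30) - 1         # low 30 bits: run length in 31-bit groups

def wah_encode(bits):
    """Encode a '0'/'1' string (length a multiple of 31) into 32-bit WAH-style words."""
    if len(bits) % GROUP:
        raise ValueError("this sketch needs a length that is a multiple of 31")
    words = []
    for i in range(0, len(bits), GROUP):
        group = int(bits[i:i + GROUP], 2)
        if group in (0, ALL_ONES):
            fill = FILL_FLAG | (FILL_BIT if group else 0)
            same_fill = words and (words[-1] & ~MAX_RUN) == fill
            if same_fill and (words[-1] & MAX_RUN) < MAX_RUN:
                words[-1] += 1              # merge into the neighboring fill
            else:
                words.append(fill | 1)      # start a new fill of one group
        else:
            words.append(group)             # literal word: flag bit is 0
    return words

def wah_decode(words):
    """Expand the words back into the original bit string."""
    out = []
    for w in words:
        if w & FILL_FLAG:
            bit = "1" if w & FILL_BIT else "0"
            out.append(bit * (GROUP * (w & MAX_RUN)))
        else:
            out.append(format(w, "031b"))
    return "".join(out)

# 1023 bits = 33 groups: one mixed group, 31 all-zero groups, one mixed group.
bits = "1" + "0" * 20 + "111" + "0" * 7 + "0" * (31 * 31) + "0" * 6 + "1" * 25
words = wah_encode(bits)
for w in words:
    print(format(w, "032b"))
```

Because every word is either a literal or a whole number of 31-bit groups, a logical operation can walk two compressed bitmaps word by word without bit-level unpacking, which is the "minimal decoding" the previous slide refers to.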
13
Information About the Test Setup
Hardware and system
  - Sun Enterprise 450 (UltraSPARC II, 400 MHz)
  - VERITAS volume manager (striped disk), measured I/O speed 20 MB/s
Real application data from STAR
  - About 2.2 million records, 500 attributes
Synthetic data
  - 100 million records, 10 attributes
Terms
  - Compression ratio: ratio of compressed bitmap size to uncompressed bitmap size
  - Times reported are wall-clock times in seconds
14
Fraction of Time Spent in CPU
[Figures: fraction of time spent in CPU, on a 2 MB/s disk system and on a 20 MB/s disk system]
Compared to two implementations of BBC, WAH spends a smaller fraction of time in CPU
15
Logical Operation Time
Synthetic data, 100 million records
WAH is 2-20 times faster than BBC
16
Logical Operation Time
STAR data, 2.2 million records
WAH is 2-60 times faster than BBC
17
Trade-off of Compression Schemes
[Figure: space versus speed trade-off, with the "better" direction marked; schemes plotted: uncompressed, WAH, gzip, BBC, ExpGol, PacBits]
18
Performance of the Full Queries Using the Basic Bitmap Index
Bitmap index setup:
  - One bitmap per value (no bins)
  - Equality encoding
What is being measured:
  - Time for answering range queries (not individual logical operations) on high-cardinality attributes from STAR
19
Range Queries Over Different Datasets
WAH index scales linearly with data size
  - STAR: 2.2 mil; Combustion: 25; Synthetic: 100
Query processing time is proportional to index size
[Figure: query processing time versus index size; plot annotations: 1 sec, 100 MB]
20
Multi-attribute Range Queries: High Cardinality Attributes
[Figures: 2 attributes per query; 5 attributes per query]
WAH compressed indexes are 10X faster than ORACLE, 5X faster than our BBC
P scan means scanning a vertical projection of the data table; it is the simplest option for processing partial range queries on high-dimensional data
Queries on the 12 most queried attributes, average cardinality 222,000
21
Summary of Tests on STAR Data

Indexing method       Size (X data)   Exact answers                      Approximate answers
                                      Time (sec)   Relative to p scan    Time (sec)   Relative to p scan
Native vertical partition (WAH)
  P scan              0               0.57         1
  20 bins             0.18            0.11         5                     0.01         60
  50 bins             0.43            0.07         8                     0.01         60
  100 bins            0.90            0.05         11                    0.01         60
  No bins             1.65            0.05         11
ORACLE
  Scan                0               6.5          0.09
  B-tree              3.6             0.95         0.6
  Bitmap (no bins)    0.98            0.66         0.86

Our bitmap index can be 100X faster than ORACLE: 10X due to the compression scheme (WAH vs. BBC), 10X due to binning
22
Using Bitmaps for Feature Tracking
Adopting Compressed Bitmaps to Operations Outside of the Bitmap Index
23
Example II: Combustion
Direct numerical simulation of the auto-ignition process (solution of complex partial differential equations; data computed once but never modified)
A simple model has 12 variables per cell; a realistic model may have hundreds
Number of grid points: 2D 600 X 600 >>> 3D 1000 X 1000 X 1000
Time steps: 100 >>> 1000s
Data size: 1 GB >>> 10 TB
Task: identify features and track them across time steps
24
Tasks
Cell identification
  - Identify cells with values satisfying specified conditions
  - Typically a partial range query, like “600 10 -7 ”
Region growing (feature identification)
  - Connect neighboring cells into connected regions
Feature tracking
  - Identify common cells in connected regions from different time steps
25
Basic Approach
Cell identification
  - Scan data and perform comparisons
  - Solution is represented as a list of cell IDs
Region growing
  - For each cell in the above list, search all its neighbors
  - Each region is a list of cell IDs
Feature tracking
  - Sort cell IDs of each region and match cell IDs to identify common cells
  - Use bounding boxes to reduce unnecessary operations
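The point-based baseline described above might look like the following sketch. The details are assumptions made for illustration: a 2D grid with row-major cell IDs, 4-connectivity between neighbors, and hypothetical function names. The bounding-box shortcut is omitted.

```python
from collections import deque

def grow_regions(cell_ids, ncols):
    """Group cell IDs (row-major, id = row * ncols + col) into 4-connected regions."""
    remaining = set(cell_ids)
    regions = []
    while remaining:
        seed = remaining.pop()
        region, queue = [seed], deque([seed])
        while queue:
            cid = queue.popleft()
            col = cid % ncols
            neighbors = [cid - ncols, cid + ncols]   # cells above and below
            if col > 0:
                neighbors.append(cid - 1)            # left, same row only
            if col < ncols - 1:
                neighbors.append(cid + 1)            # right, same row only
            for n in neighbors:
                if n in remaining:                   # search each neighbor once
                    remaining.remove(n)
                    region.append(n)
                    queue.append(n)
        regions.append(sorted(region))
    return sorted(regions)

def common_cells(a, b):
    """Match two sorted ID lists and return the IDs present in both."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

# Hypothetical selected cells on a 4 x 4 grid at two consecutive time steps.
regions_t = grow_regions([0, 1, 5, 10, 11], ncols=4)
regions_t1 = grow_regions([1, 5, 9, 15], ncols=4)
print(regions_t, regions_t1)
print(common_cells(regions_t[0], regions_t1[0]))
```

Every selected cell is visited individually and every pair of candidate regions needs a list merge, which is the cost the bitmap-based approach aims to cut.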
26
Our Approach
Cell identification
  - Vertically partition the data
  - Use bitmap index to speed up searches
  - Solutions are represented as compressed bitmaps
Region growing
  - Convert the compressed bitmaps into line segments
  - Connect neighboring line segments into regions
  - Convert each region into a compressed bitmap
Feature tracking
  - Use bitwise AND to identify common cells
  - Use bounding boxes to reduce unnecessary operations
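The line-based region growing and AND-based tracking steps can be sketched as below. This is an illustration under assumptions, not the authors' code: plain Python integers stand in for compressed bitmaps, the grid is 2D with row-major bit positions, segments in adjacent rows are joined when their column ranges overlap (4-connectivity), and a small union-find is one possible way to connect segments. All names are hypothetical, and the bounding-box shortcut is omitted.

```python
def row_segments(bitmap, nrows, ncols):
    """List (row, start_col, end_col) for each maximal run of set bits; end is exclusive."""
    segments = []
    row_mask = (1 << ncols) - 1
    for r in range(nrows):
        row_bits = (bitmap >> (r * ncols)) & row_mask
        c = 0
        while c < ncols:
            if (row_bits >> c) & 1:
                start = c
                while c < ncols and (row_bits >> c) & 1:
                    c += 1
                segments.append((r, start, c))
            else:
                c += 1
    return segments

def grow_regions_by_segments(bitmap, nrows, ncols):
    """Connect overlapping segments in adjacent rows; return one bitmap per region."""
    segments = row_segments(bitmap, nrows, ncols)
    parent = list(range(len(segments)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, (r1, s1, e1) in enumerate(segments):
        for j in range(i + 1, len(segments)):
            r2, s2, e2 = segments[j]
            if r2 > r1 + 1:
                break                       # segments are in row order
            if r2 == r1 + 1 and s1 < e2 and s2 < e1:
                parent[find(i)] = find(j)   # column ranges overlap: same region

    regions = {}
    for i, (r, s, e) in enumerate(segments):
        mask = ((1 << (e - s)) - 1) << (r * ncols + s)
        root = find(i)
        regions[root] = regions.get(root, 0) | mask
    return sorted(regions.values())

def track(regions_a, regions_b):
    """Feature tracking: a bitwise AND finds the cells two regions share."""
    return [(i, j, a & b)
            for i, a in enumerate(regions_a)
            for j, b in enumerate(regions_b)
            if a & b]

def to_bitmap(cell_ids):
    bitmap = 0
    for cid in cell_ids:
        bitmap |= 1 << cid
    return bitmap

# Hypothetical selected cells on a 4 x 4 grid at two consecutive time steps.
regions_t = grow_regions_by_segments(to_bitmap([0, 1, 5, 10, 11]), 4, 4)
regions_t1 = grow_regions_by_segments(to_bitmap([1, 5, 9, 15]), 4, 4)
matches = track(regions_t, regions_t1)
print(matches)
```

Working on whole segments rather than single cells means the work grows with the number of runs, not the number of selected cells, and the tracking step reduces to one AND per pair of regions.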
27
Preliminary Performance Data
Cell identification
  - Horizontal partition: 75 seconds
  - Vertical partition: 5 seconds
  - Bitmap index: 0.1 seconds
Region growing
  - Point-based algorithm: 8 seconds
  - Line-based algorithm: 1.7 seconds
Feature tracking
  - Comparing cell IDs: 10 seconds
  - Bitmap operations: 0.2 seconds
Total time (sec): 93 (75 + 8 + 10), 23 (5 + 8 + 10), 2.0 (0.1 + 1.7 + 0.2)
Test case: 69 time steps, 600 X 600 grid, condition HO2 > 10^-7
Compressed bitmaps can be efficiently used for feature tracking
28
Summary
The size of the WAH compressed bitmap index is modest even in the worst case
  - For most high-cardinality attributes with N records, the index size is about 2N words; it is never more than 4N words
The WAH compressed index is efficient on attributes of any cardinality
  - On range queries, it is faster than the uncompressed bitmap index (3X), the BBC compressed index (2~20X), the B+-tree index (20~200X), and scanning a vertically partitioned table (4~50X)
Compressed bitmaps can also be efficiently used for feature tracking
29
Sizes of Compressed Bitmap Indexes
10^8 records
Test attribute: 1,2,3,…,1,2,3,… (worst case in terms of index size)
B+-tree size (observed): 3~4 x 10^8 words
WAH compressed index is not larger than B+-tree
30
Summary of Tests on STAR Data (I)

                                B+-tree   P scan   Bitmap index
                                                   Oracle   BBC     WAH
Low cardinality case
  Size (MB)                     370       0        7        4       7
  Query processing (seconds)
    1-attribute                 0.90      0.51     0.005    0.015   0.004
    2-attribute                 2.10      0.56     0.024    0.026   0.006
    5-attribute                 2.14      0.67     0.043    0.083   0.017
High cardinality case
  Size (MB)                     408       0        111      118     186
  Query processing (seconds)
    1-attribute                 0.95      0.51     0.01     0.03    0.05
    2-attribute                 2.15      0.56     0.39     0.17    0.04
    5-attribute                 2.23      0.67     2.42     0.76    0.17

Compressed bitmap index is more efficient for range queries than B+-tree or no index (p scan)
A WAH compressed index uses more space than a BBC compressed index, but is more efficient
31
Multi-attribute Range Queries: Low Cardinality Attributes
[Figures: 2 attributes per query; 5 attributes per query]
WAH compressed indexes are faster than BBC compressed indexes (3X) and uncompressed indexes (3X)
Query box is the relative volume of the box formed by the query condition
12 lowest cardinality attributes of STAR, average attribute cardinality 26
32
Total Effect of Compression and Encoding Schemes
Bottom line on queries
  - Compression scheme determines efficiency of logical operations
  - Encoding scheme determines number of operations
      Range & interval: only one logical operation over 2 bitmaps
      Equality: many operations, depending on number of bins
  - But space may be a consideration
What is the trade-off?
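The difference in operation counts between equality and range encoding can be illustrated with a small sketch. The assumptions: Python integers stand in for bitmaps, bin numbers are given directly, range-encoded bitmap i marks rows whose bin is at most i, and the last (all-ones) range bitmap is kept for simplicity. Interval encoding is not shown, and the function names and sample data are hypothetical.

```python
def equality_encode(bins, nbins):
    """Bitmap i marks rows whose bin number equals i."""
    bitmaps = [0] * nbins
    for row, b in enumerate(bins):
        bitmaps[b] |= 1 << row
    return bitmaps

def range_encode(bins, nbins):
    """Bitmap i marks rows whose bin number is <= i."""
    bitmaps = [0] * nbins
    for row, b in enumerate(bins):
        for i in range(b, nbins):
            bitmaps[i] |= 1 << row
    return bitmaps

def query_equality(eq, a, b):
    """Rows with a <= bin <= b; returns (bitmap, number of logical operations)."""
    result, ops = eq[a], 0
    for i in range(a + 1, b + 1):
        result |= eq[i]          # one OR per additional bin in the range
        ops += 1
    return result, ops

def query_range(rg, a, b):
    """Same query on range encoding: at most one operation over two bitmaps."""
    if a == 0:
        return rg[b], 0
    # rg[a - 1] is a subset of rg[b], so XOR removes exactly the rows below bin a.
    return rg[b] ^ rg[a - 1], 1

# Hypothetical bin numbers for 8 records, 5 bins.
bins = [0, 3, 1, 4, 2, 3, 0, 4]
eq = equality_encode(bins, 5)
rg = range_encode(bins, 5)

eq_hits, eq_ops = query_equality(eq, 1, 3)
rg_hits, rg_ops = query_range(rg, 1, 3)
print(bin(eq_hits), eq_ops, bin(rg_hits), rg_ops)
```

The space consideration shows up here too: each range-encoded bitmap has many more set bits than its equality-encoded counterpart, so it tends to compress less well.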
33
Interval Encoding Is Better Overall (WAH Compression)
Points on the graphs represent 10, 20, 30, 50, 100 bins
Average time for random range queries
34
Storing Bitmaps As Files Is Efficient
BMI: store bitmaps in Objectivity
IBIS: store bitmaps in files
IBIS answers queries about 4 times faster than BMI using WAH
BMI with WAH is up to ten times faster than BMI with BBC
Joint work with Kurt Stockinger (CERN)