Download presentation
Presentation is loading. Please wait.
1
CS561-S2004 strategies for processing ad hoc queries 1 Strategies for Processing Ad Hoc Queries on Large Data Warehouses Presented by Fan Wu Instructor: Prof. Elke Rundensteiner April 8, 2004
2
CS561-S2004 strategies for processing ad hoc queries 2 Outline Motivation for designing software Many large scientific data warehouses need to process ad hoc queries Lack of efficient indices Issues to discuss Vertical partitioning Bitmap index Compression – how to store the bitmaps Persistent storage – where to store the bitmaps
3
CS561-S2004 strategies for processing ad hoc queries 3 Example: High-Energy Physics Experiment STAR Current data size 20 million collision events each event ~10 KB in size Production data rate 100 million records / year ~ 1 TB per year Scientists may query any of the 500 or so attributes Each query may involve conditions on 5 ~ 8 attributes Energy > 100 & Particles > 500 & … Near real-time evaluation desired
4
CS561-S2004 strategies for processing ad hoc queries 4 Many Scientific Applications Involve Large Datasets Sloan Digital Sky Survey: http://www.sdss.orghttp://www.sdss.org Earth Observing System: http://eos.nasa.govhttp://eos.nasa.gov Large Hadron Collider: http://lhc.web.cern.chhttp://lhc.web.cern.ch Genomes to life: http://doegenomestolife.orghttp://doegenomestolife.org Combustion: http://scidac.psc.eduhttp://scidac.psc.edu PCMDI: http://www-pcmdi.llnl.govhttp://www-pcmdi.llnl.gov
5
CS561-S2004 strategies for processing ad hoc queries 5 Searching and Indexing Requirements Some common features of the large scientific datasets Read-mostly: data warehouses Large high-dimensional data: millions or billions of records, each record with tens or hundreds of attributes Many queries are high-dimensional partial range queries Most users desire to modify queries interactively Existing database software not specialized for these tasks: slow Need new special purpose software BMI: bitmap index, CERN IBIS: independent bitmap index and search, LBNL
6
CS561-S2004 strategies for processing ad hoc queries 6 Issues to Be Discussed Organization of the primary data, i.e., the user data Viewing the primary data as a 2-D table Horizontal partition: used in transactional systems Vertical partition: good for partial range queries Indexing strategies: Tree based schemes: not effective for dimensions > 10 Bitmap index: well suited for partial range queries Storage scheme for the index data BMI: Store bitmaps as objects in an object-oriented database (ODBMS) IBIS: Store bitmaps as simple files
7
CS561-S2004 strategies for processing ad hoc queries 7 Horizontal vs. Vertical Partitioning Horizontal partitioning Data elements of a record are stored consecutively Good for accessing one record at a time Used in relational DBMS systems where records are frequently updated Typically 60~70% of bytes of each page is used Vertical partitioning All records of an attribute are stored consecutively Good for accessing multiple records by attribute selection Suitable for data warehousing systems where records are rarely modified May use 100% of bytes of each page
8
CS561-S2004 strategies for processing ad hoc queries 8 Performance Advantage of Vertical Partitioning Experiment with 2.2 million records of STAR data (10 attributes only) The figure on the right shows the time to search without an index Query box size is the relative volume of the hypercube formed by range conditions The disk system supports about 20 MB/s sustained reading For answering a query like “A > 5”, the time used by a relational DBMS is proportional to number of attributes in the table 500 attributes, 500 times slower Vertical partitioning is effective for partial range queries
9
CS561-S2004 strategies for processing ad hoc queries 9 Brief Overview of Index Data Structures One dimensional index data structures: Total order for one-dimension Hash-based: Optimized for exact match queries, e.g. E = 106 Tree-based: Optimized for range queries, e.g. E < 106 Most widely used: B+-tree (1972): Multidimensional index data structures No total order for all dimensions Hash-based: Grid-File, Bang-File, … Tree based: R-Trees, Pyramid-Tree, … Bitmap Indices: Effective for data warehousing environments Linearize to introduce total order, then use one-dimensional indices
10
CS561-S2004 strategies for processing ad hoc queries 10 Basic Bitmap Index a) List of attributes b) Bitmap Index (equality encoding) a) List of 12 attributes with 10 distinct attribute values, i.e attribute cardinality = 10 b) For each distinct attribute value, one bit slice is created, i.e bitmap index consists of 10 bitmaps (E0 to E9) Bit Slice E2 encodes attributes with value 2
11
CS561-S2004 strategies for processing ad hoc queries 11 Pros and Cons of Bitmap Indices Pros: Easy to build and to maintain Easy to identify records that satisfy a complex multi- attribute predicate (multi-dimensional ad-hoc queries) Very space efficient for attributes with low cardinality (number of distinct attribute values, e.g. “Yes”, “No”) Cons: Space inefficient for attributes with high cardinality An effective strategy: Bitmap Compression Other strategies: binning, encoding
12
CS561-S2004 strategies for processing ad hoc queries 12 Bitmap Compression Advantages: Less disk space for storing indices Indices can be read from disk faster More indices can be cached in memory Possible problems: Increases the complexity of the software If bitmaps must be decompressed before performing Boolean operations, the decompression overhead might outweigh the advantages of compression Use compression schemes that work directly on compressed data
13
CS561-S2004 strategies for processing ad hoc queries 13 Various Bitmap Compression Algorithms Run Length Encoding (RLE): one-sided (asymmetric) vs. two-sided (symmetric) Gzip (Lempel-Ziv, LZ): verbatim (uncompressed) bitmap is compressed via zlib ExpGol: Variable bit length encoding (RLE-bitmap is compressed) Byte-Aligned Bitmap Compression (BBC): Variable byte length encoding (Oracle patent) One-sided vs. two-sided (BBC1 vs. BBC2) Word-Aligned Hybrid (WAH): Fixed word based encoding
14
CS561-S2004 strategies for processing ad hoc queries 14 Relative Strength of Different Compression Schemes uncompressed WAH space speed better gzip BBC ExpGol PacBits
15
CS561-S2004 strategies for processing ad hoc queries 15 WAH Compression & Bitmap Index Implementations Compression Schemes Designed for reducing the CPU-complexity of logical operations when compared to BBC, 10 X speedup However, lower compression factor, i.e. the sizes of the WAH-compressed bitmaps are some 40-60% larger than BBC-compressed bitmaps Storage scheme BMI: Bitmap Index implementation on top of ODBMS (CERN) IBIS: Bitmap Index implementation based on plain files (LBL)
16
CS561-S2004 strategies for processing ad hoc queries 16 Test Setup Real application data (STAR) : 2.2 million records Synthetic dataset I: 100 million records Synthetic dataset II: 5 million records Only the performance of the bitwise logical operation “AND” is reported Other logical operations such as OR, XOR, etc. show similar relative differences Most of the benchmarks were executed on three different machines with various CPU and I/O subsystems
17
CS561-S2004 strategies for processing ad hoc queries 17 In Memory Logical Operation “AND” WAH is always the fastest, 2X – 20X On dm, 450MHz UltraSPARC
18
CS561-S2004 strategies for processing ad hoc queries 18 Search Time (Including File IO) On dm, 20MB/s IOOn tin, 2MB/s IO To answer the queries: read two bitmaps from files, perform one logical “AND” Unless using a very slow disk, it is worth-while to use WAH compression
19
CS561-S2004 strategies for processing ad hoc queries 19 With BBC, Searching Operation Spends Little Time in IO On dm, 20MB/s IOOn tin, 2MB/s IO The percentage of time spent in IO on different bitmaps This percentage is expected to be high, but it is actually low with BBC WAH reduce CPU time, and searching is again IO bound
20
CS561-S2004 strategies for processing ad hoc queries 20 Sizes of Compressed Bitmaps The total size of a bitmap index compressed with WAH is typically 40-60% larger than that compressed with BBC BBC-s: simplified (LBL) BBC-f: full (AT&T + CERN)
21
CS561-S2004 strategies for processing ad hoc queries 21 Sizes of Compressed Bitmaps The figure on the right plot the maximum size of the bitmap index against the attribute cardinality of an attribute with 100 million (10 8 ) records In the worst case, the size of the compressed bitmap index is about 400 million words, 4 times the size of the primary data For most high-cardinality attributes, the compressed bitmap index size is smaller than that of a typical B-tree index(~ 3X primary data) The compressed bitmap index sizes are usually smaller than B-tree B-tree
22
CS561-S2004 strategies for processing ad hoc queries 22 Query Performance IBIS vs. RDBMS Size(MB)Create(sec)Query(sec) IBIS WAH166910.7 IBIS BBC-s1171162.9 RDBMS123(247)28903.1 Accessing bitmaps in files (IBIS) has about the same efficiency as accessing bitmaps within an RDBMS The DBMS tested uses a BBC compressed bitmap index similar to BBC compressed index Used real application data WAH compressed index is 4X more efficient than BBC compressed index
23
CS561-S2004 strategies for processing ad hoc queries 23 Query Performance File (IBIS) vs. ODBMS (BMI) b) “warm” filesa) “cold” files Figures on the left time needed to process 5- dimensional queries on tin Queries on synthetic data IBIS with WAH uses the least amount of time ODBMS overhead 4X Due to file system caching, IBIS is ~10X faster on files that have been accessed before (“warm” files)
24
CS561-S2004 strategies for processing ad hoc queries 24 Conclusions We have shown that BBC is CPU-bound rather than I/O- bound as assumed in the past WAH is much more (10X) CPU-efficient than BBC Building bitmap indices on top of ODBMS introduces about 4X overhead when compared to using plain files Building bitmap indices inside DBMS (as in many commercial systems) shows higher efficiency Processing multi-dimensional range queries is efficient with WAH compressed bitmap indices Read-only data should be vertically partitioned
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.