CS561-S2004: Strategies for Processing Ad Hoc Queries on Large Data Warehouses
Presented by Fan Wu


Slide 1: Strategies for Processing Ad Hoc Queries on Large Data Warehouses
Course: CS561-S2004
Presented by Fan Wu
Instructor: Prof. Elke Rundensteiner
April 8, 2004

Slide 2: Outline
- Motivation for designing software
  - Many large scientific data warehouses need to process ad hoc queries
  - Lack of efficient indices
- Issues to discuss
  - Vertical partitioning
  - Bitmap index
  - Compression: how to store the bitmaps
  - Persistent storage: where to store the bitmaps

Slide 3: Example: High-Energy Physics Experiment STAR
- Current data size
  - 20 million collision events
  - Each event is ~10 KB in size
- Production data rate
  - 100 million records / year
  - ~1 TB per year
- Scientists may query any of the 500 or so attributes
- Each query may involve conditions on 5 to 8 attributes
  - Example: Energy > 100 & Particles > 500 & …
- Near real-time evaluation desired
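
The data volumes on the slide above can be checked with a few lines of arithmetic. This is only a sanity check, assuming decimal units (1 KB = 1000 bytes) and treating each record as one ~10 KB event; the function name is illustrative.

```python
KB = 10**3  # decimal units (an assumption; the slide does not say)

def dataset_bytes(num_events, bytes_per_event=10 * KB):
    """Approximate raw size of a collection of events."""
    return num_events * bytes_per_event

current = dataset_bytes(20_000_000)   # current data size
yearly = dataset_bytes(100_000_000)   # production rate per year

print(current / 10**9, "GB")   # 200.0 GB
print(yearly / 10**12, "TB")   # 1.0 TB
```

So 100 million records per year at ~10 KB each is consistent with the ~1 TB per year figure.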

Slide 4: Many Scientific Applications Involve Large Datasets
- Sloan Digital Sky Survey
- Earth Observing System
- Large Hadron Collider
- Genomes to Life
- Combustion
- PCMDI

Slide 5: Searching and Indexing Requirements
- Some common features of large scientific datasets
  - Read-mostly: data warehouses
  - Large high-dimensional data: millions or billions of records, each record with tens or hundreds of attributes
  - Many queries are high-dimensional partial range queries
  - Most users want to modify queries interactively
- Existing database software is not specialized for these tasks, so it is slow
- New special-purpose software is needed
  - BMI: bitmap index, CERN
  - IBIS: independent bitmap index and search, LBNL

Slide 6: Issues to Be Discussed
- Organization of the primary data, i.e., the user data
  - Viewing the primary data as a 2-D table
  - Horizontal partition: used in transactional systems
  - Vertical partition: good for partial range queries
- Indexing strategies
  - Tree-based schemes: not effective for dimensions > 10
  - Bitmap index: well suited for partial range queries
- Storage scheme for the index data
  - BMI: store bitmaps as objects in an object-oriented database (ODBMS)
  - IBIS: store bitmaps as simple files

Slide 7: Horizontal vs. Vertical Partitioning
Horizontal partitioning
- Data elements of a record are stored consecutively
- Good for accessing one record at a time
- Used in relational DBMSs where records are frequently updated
- Typically 60-70% of the bytes of each page are used
Vertical partitioning
- All records of an attribute are stored consecutively
- Good for accessing multiple records by attribute selection
- Suitable for data warehousing systems where records are rarely modified
- May use 100% of the bytes of each page
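
The two layouts can be sketched in a few lines. This is a toy in-memory illustration, not how STAR, BMI or IBIS store data; the records and attribute names are made up.

```python
# Hypothetical records; the attribute names are illustrative only.
records = [
    {"energy": 120.0, "particles": 510, "run": 1},
    {"energy": 95.5, "particles": 640, "run": 1},
    {"energy": 210.0, "particles": 480, "run": 2},
]
ATTRS = ("energy", "particles", "run")

# Horizontal partitioning: all attributes of one record sit together.
row_store = [tuple(r[a] for a in ATTRS) for r in records]

# Vertical partitioning: all values of one attribute sit together.
column_store = {a: [r[a] for r in records] for a in ATTRS}

def select_from_rows(attr, pred):
    """Walk whole records; on disk this reads every attribute of each record."""
    col = ATTRS.index(attr)
    return [i for i, row in enumerate(row_store) if pred(row[col])]

def select_from_columns(attr, pred):
    """Walk only the one column named in the condition."""
    return [i for i, v in enumerate(column_store[attr]) if pred(v)]

print(select_from_rows("energy", lambda e: e > 100))     # [0, 2]
print(select_from_columns("energy", lambda e: e > 100))  # [0, 2]
```

Both functions return the same record numbers; the difference is how much data a disk-based version of each would have to read to get there.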

Slide 8: Performance Advantage of Vertical Partitioning
- Experiment with 2.2 million records of STAR data (10 attributes only)
- The figure on the right shows the time to search without an index
- Query box size is the relative volume of the hypercube formed by the range conditions
- The disk system supports about 20 MB/s sustained reading
- For answering a query like “A > 5”, the time used by a relational DBMS is proportional to the number of attributes in the table
  - With 500 attributes, it is 500 times slower
- Vertical partitioning is effective for partial range queries
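
A rough cost model shows where the "500 times slower" figure comes from. The sketch below assumes 4-byte attribute values and a scan that is purely limited by the 20 MB/s disk; both are assumptions for illustration, and it ignores the partially filled pages of a horizontal layout.

```python
def scan_seconds(num_records, attrs_read, bytes_per_value=4, disk_mb_per_s=20):
    """Time to read `attrs_read` attributes of every record from disk."""
    bytes_read = num_records * attrs_read * bytes_per_value
    return bytes_read / (disk_mb_per_s * 10**6)

# Query "A > 5" touches one attribute.
vertical = scan_seconds(2_200_000, attrs_read=1)      # read one column
horizontal = scan_seconds(2_200_000, attrs_read=500)  # read whole 500-attribute rows

print(f"{vertical:.2f} s")    # 0.44 s
print(f"{horizontal:.0f} s")  # 220 s
```

Under these assumptions the row-wise scan reads 500 times as many bytes, so it takes 500 times as long.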

Slide 9: Brief Overview of Index Data Structures
- One-dimensional index data structures
  - Total order for one dimension
  - Hash-based: optimized for exact match queries, e.g. E = 106
  - Tree-based: optimized for range queries, e.g. E < 106
  - Most widely used: B+-tree (1972)
- Multidimensional index data structures
  - No total order for all dimensions
  - Hash-based: Grid-File, Bang-File, …
  - Tree-based: R-Trees, Pyramid-Tree, …
  - Bitmap indices: effective for data warehousing environments
  - Linearize to introduce a total order, then use one-dimensional indices
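
The difference between the two one-dimensional families can be illustrated with standard-library stand-ins: a dictionary plays the role of a hash index, and a sorted list searched with `bisect` plays the role of a tree index. This is only an analogy for a B+-tree, and the energy values are hypothetical.

```python
import bisect
from collections import defaultdict

energies = [98, 106, 250, 106, 17, 300]  # hypothetical attribute E, one value per record

# Hash-style index: value -> record numbers. Fast for E = 106, no help for E < 106.
hash_index = defaultdict(list)
for row, e in enumerate(energies):
    hash_index[e].append(row)

# Ordered index: (value, record) pairs kept in sorted order, like B+-tree leaves.
sorted_pairs = sorted((e, row) for row, e in enumerate(energies))
keys = [e for e, _ in sorted_pairs]

def exact_match(e):
    """Records with E equal to e, answered by the hash-style index."""
    return hash_index.get(e, [])

def less_than(e):
    """Records with E below e, answered by a binary search on the ordered index."""
    end = bisect.bisect_left(keys, e)
    return sorted(row for _, row in sorted_pairs[:end])

print(exact_match(106))  # [1, 3]
print(less_than(106))    # [0, 4]
```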

Slide 10: Basic Bitmap Index
[Figure panels: a) list of attribute values; b) bitmap index (equality encoding)]
- a) A list of 12 attribute values with 10 distinct values, i.e. attribute cardinality = 10
- b) For each distinct attribute value, one bit slice is created, i.e. the bitmap index consists of 10 bitmaps (E0 to E9)
- Bit slice E2 marks the records whose attribute value is 2
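
An equality-encoded bitmap index like the one on the slide above can be built in a few lines. This is a minimal sketch: each bitmap is held in a Python integer, and the 12 values are made up (the figure's actual values are not in the transcript).

```python
def build_equality_index(values):
    """Return {value: bitmap}; bit i of a bitmap is 1 when record i has that value."""
    index = {}
    for row, v in enumerate(values):
        index[v] = index.get(v, 0) | (1 << row)
    return index

def rows_of(bitmap):
    """Record numbers whose bit is set."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

def as_bit_string(bitmap, num_records):
    """Render a bitmap with record 0 on the left."""
    return "".join(str((bitmap >> i) & 1) for i in range(num_records))

# Hypothetical column: 12 records, 10 distinct values (0..9).
values = [3, 2, 1, 2, 8, 2, 9, 0, 7, 5, 6, 4]
index = build_equality_index(values)

print(len(index))                              # 10 bitmaps, E0 to E9
print(rows_of(index[2]))                       # [1, 3, 5]
print(as_bit_string(index[2], len(values)))    # 010101000000
```

Here `index[2]` corresponds to the bit slice E2.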

Slide 11: Pros and Cons of Bitmap Indices
- Pros
  - Easy to build and to maintain
  - Easy to identify records that satisfy a complex multi-attribute predicate (multi-dimensional ad hoc queries)
  - Very space efficient for attributes with low cardinality (number of distinct attribute values, e.g. “Yes”, “No”)
- Cons
  - Space inefficient for attributes with high cardinality
  - An effective strategy: bitmap compression
  - Other strategies: binning, encoding
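
Both points can be made concrete with a small sketch: a multi-attribute predicate becomes an OR within each attribute followed by an AND across attributes, and the uncompressed index needs one bit per record per distinct value. The data and helper names below are hypothetical.

```python
def build_equality_index(values):
    """Return {value: bitmap}; bit i is 1 when record i has that value."""
    index = {}
    for row, v in enumerate(values):
        index[v] = index.get(v, 0) | (1 << row)
    return index

def bitmap_where(index, pred):
    """OR together the bitmaps of all values that satisfy the condition."""
    result = 0
    for value, bitmap in index.items():
        if pred(value):
            result |= bitmap
    return result

def rows_of(bitmap):
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

# Hypothetical columns for five records.
energy = [90, 120, 150, 80, 200]
particles = [600, 400, 510, 700, 501]
energy_idx = build_equality_index(energy)
particles_idx = build_equality_index(particles)

# Energy > 100 & Particles > 500: one AND combines the two conditions.
hits = (bitmap_where(energy_idx, lambda v: v > 100)
        & bitmap_where(particles_idx, lambda v: v > 500))
print(rows_of(hits))  # [2, 4]

def uncompressed_index_bytes(num_records, cardinality):
    """One bit per record for each distinct value."""
    return num_records * cardinality // 8

print(uncompressed_index_bytes(100_000_000, 2))       # 25000000
print(uncompressed_index_bytes(100_000_000, 10_000))  # 125000000000
```

For 100 million records, a two-valued attribute needs about 25 MB uncompressed, while an attribute with 10,000 distinct values needs about 125 GB, which is why compression, binning or encoding is needed for high cardinality.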

Slide 12: Bitmap Compression
- Advantages
  - Less disk space for storing indices
  - Indices can be read from disk faster
  - More indices can be cached in memory
- Possible problems
  - Increases the complexity of the software
  - If bitmaps must be decompressed before performing Boolean operations, the decompression overhead might outweigh the advantages of compression
- Remedy: use compression schemes that work directly on compressed data

Slide 13: Various Bitmap Compression Algorithms
- Run Length Encoding (RLE)
  - One-sided (asymmetric) vs. two-sided (symmetric)
- Gzip (Lempel-Ziv, LZ)
  - Verbatim (uncompressed) bitmap is compressed via zlib
- ExpGol
  - Variable bit length encoding (the RLE bitmap is compressed)
- Byte-Aligned Bitmap Compression (BBC)
  - Variable byte length encoding (Oracle patent)
  - One-sided vs. two-sided (BBC1 vs. BBC2)
- Word-Aligned Hybrid (WAH)
  - Fixed word based encoding
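
As a minimal illustration of the simplest scheme in the list, the sketch below run-length encodes a bitmap as (bit, run length) pairs and then ANDs two encoded bitmaps without expanding them. It compresses runs of both 0s and 1s, and it is not the format used by BBC, ExpGol or WAH; the function names are illustrative.

```python
from itertools import groupby

def rle_encode(bits):
    """Turn a list of 0/1 into [(bit, run_length), ...]."""
    return [(b, len(list(g))) for b, g in groupby(bits)]

def rle_decode(runs):
    return [b for b, n in runs for _ in range(n)]

def rle_and(a, b):
    """AND two run lists of equal total length, working run by run."""
    out = []
    ia = ib = 0          # next run to fetch from each input
    ra = rb = 0          # bits left in the current run of each input
    bit_a = bit_b = 0
    while True:
        if ra == 0:
            if ia == len(a):
                break
            bit_a, ra = a[ia]
            ia += 1
        if rb == 0:
            if ib == len(b):
                break
            bit_b, rb = b[ib]
            ib += 1
        n = min(ra, rb)              # overlap of the two current runs
        bit = bit_a & bit_b
        if out and out[-1][0] == bit:
            out[-1] = (bit, out[-1][1] + n)   # extend the previous run
        else:
            out.append((bit, n))
        ra -= n
        rb -= n
    return out

x = [1, 1, 0, 0, 0, 1]
y = [1, 0, 0, 0, 1, 1]
print(rle_encode(x))                        # [(1, 2), (0, 3), (1, 1)]
print(rle_and(rle_encode(x), rle_encode(y)))  # [(1, 1), (0, 4), (1, 1)]
```

The cost of `rle_and` depends on the number of runs rather than the number of bits, which is the property that makes operating directly on compressed bitmaps attractive.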

Slide 14: Relative Strength of Different Compression Schemes
[Figure: compression schemes placed by space and speed; schemes shown: uncompressed, WAH, gzip, BBC, ExpGol, PacBits]

Slide 15: WAH Compression & Bitmap Index Implementations
- Compression scheme (WAH)
  - Designed to reduce the CPU cost of logical operations; about 10X faster than BBC
  - However, the compression factor is lower, i.e. WAH-compressed bitmaps are some 40-60% larger than BBC-compressed bitmaps
- Storage scheme
  - BMI: bitmap index implementation on top of an ODBMS (CERN)
  - IBIS: bitmap index implementation based on plain files (LBL)
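
The word-aligned idea can be sketched as follows. This is a simplified illustration that assumes 32-bit words: the bitmap is cut into 31-bit groups, a group with mixed bits is stored as a literal word (top bit 0), and consecutive all-0 or all-1 groups are merged into one fill word (top bit 1, next bit the fill value, low 30 bits the group count). It requires the input length to be a multiple of 31 and omits the handling of a trailing partial group and the compressed-domain logical operations of a full implementation.

```python
GROUP = 31                    # payload bits per 32-bit word
FILL_FLAG = 1 << 31           # top bit set: fill word
FILL_BIT = 1 << 30            # fill value (0 or 1)
MAX_COUNT = (1 << 30) - 1     # low 30 bits: number of groups in the fill
ALL_ONES = (1 << GROUP) - 1

def wah_encode(bits):
    """Encode a list of 0/1 (length a multiple of 31) into 32-bit words."""
    if len(bits) % GROUP:
        raise ValueError("this sketch needs a length that is a multiple of 31")
    words = []
    for start in range(0, len(bits), GROUP):
        value = int("".join(map(str, bits[start:start + GROUP])), 2)
        if value in (0, ALL_ONES):
            header = FILL_FLAG | (FILL_BIT if value else 0)
            same_fill = words and (words[-1] & ~MAX_COUNT) == header
            if same_fill and (words[-1] & MAX_COUNT) < MAX_COUNT:
                words[-1] += 1             # extend the previous fill word
            else:
                words.append(header | 1)   # start a new fill of one group
        else:
            words.append(value)            # literal word, top bit is 0
    return words

def wah_decode(words):
    bits = []
    for w in words:
        if w & FILL_FLAG:
            bit = 1 if w & FILL_BIT else 0
            bits.extend([bit] * (GROUP * (w & MAX_COUNT)))
        else:
            bits.extend(int(c) for c in format(w, "031b"))
    return bits

# 62 zeros, one mixed group, 93 ones: 186 bits become 3 words.
bitmap = [0] * 62 + [1] + [0] * 30 + [1] * 93
encoded = wah_encode(bitmap)
print([hex(w) for w in encoded])  # ['0x80000002', '0x40000000', '0xc0000003']
```

Because every unit is a whole machine word, a logical operation can inspect one word at a time without bit shifting across byte boundaries, which is the source of the CPU advantage the slide describes, at the price of larger output than a byte-aligned scheme.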

Slide 16: Test Setup
- Real application data (STAR): 2.2 million records
- Synthetic dataset I: 100 million records
- Synthetic dataset II: 5 million records
- Only the performance of the bitwise logical operation “AND” is reported
  - Other logical operations such as OR, XOR, etc. show similar relative differences
- Most of the benchmarks were executed on three different machines with various CPU and I/O subsystems

Slide 17: In-Memory Logical Operation “AND”
- WAH is always the fastest, by 2X to 20X
- Measured on dm, a 450 MHz UltraSPARC

Slide 18: Search Time (Including File I/O)
[Figure panels: on dm, 20 MB/s I/O; on tin, 2 MB/s I/O]
- To answer the queries: read two bitmaps from files, perform one logical “AND”
- Unless using a very slow disk, it is worthwhile to use WAH compression

Slide 19: With BBC, the Search Operation Spends Little Time in I/O
[Figure panels: on dm, 20 MB/s I/O; on tin, 2 MB/s I/O]
- The figures show the percentage of time spent in I/O on different bitmaps
- This percentage is expected to be high, but it is actually low with BBC
- WAH reduces CPU time, so searching is again I/O-bound

Slide 20: Sizes of Compressed Bitmaps
- The total size of a bitmap index compressed with WAH is typically 40-60% larger than that compressed with BBC
- BBC-s: simplified (LBL)
- BBC-f: full (AT&T + CERN)

Slide 21: Sizes of Compressed Bitmaps
- The figure on the right plots the maximum size of the bitmap index against the attribute cardinality, for an attribute with 100 million (10^8) records
- In the worst case, the size of the compressed bitmap index is about 400 million words, 4 times the size of the primary data
- For most high-cardinality attributes, the compressed bitmap index is smaller than a typical B-tree index (~3X the primary data)
- The compressed bitmap index sizes are usually smaller than B-tree indices

Slide 22: Query Performance: IBIS vs. RDBMS
[Table: columns Size (MB), Create (sec), Query (sec); rows IBIS WAH, IBIS BBC-s, RDBMS. The only value that survives in the transcript is "123(247)" in the RDBMS row.]
- Accessing bitmaps in files (IBIS) has about the same efficiency as accessing bitmaps within an RDBMS
  - The DBMS tested uses a BBC-compressed bitmap index, similar to the IBIS BBC-compressed index
  - Used real application data
- The WAH-compressed index is 4X more efficient than the BBC-compressed index

Slide 23: Query Performance: File (IBIS) vs. ODBMS (BMI)
[Figure panels: a) “cold” files; b) “warm” files]
- The figures on the left show the time needed to process 5-dimensional queries on tin
- Queries are on synthetic data
- IBIS with WAH uses the least amount of time
- ODBMS overhead is about 4X
- Due to file system caching, IBIS is ~10X faster on files that have been accessed before (“warm” files)

Slide 24: Conclusions
- We have shown that BBC is CPU-bound rather than I/O-bound as assumed in the past
- WAH is much more (10X) CPU-efficient than BBC
- Building bitmap indices on top of an ODBMS introduces about 4X overhead when compared to using plain files
- Building bitmap indices inside a DBMS (as in many commercial systems) shows higher efficiency
- Processing multi-dimensional range queries is efficient with WAH-compressed bitmap indices
- Read-only data should be vertically partitioned