How To Build a Compressed Bitmap Index


How To Build a Compressed Bitmap Index
Theodore Johnson

What Are Bitmap Indices?
- Let R be a relation (a table of records), with records 1 .. N.
- A bitmap of predicate P on R is: bit i is set to 1 if P(Ri) holds, else bit i is set to zero.
- Typical use: P is of the form “attribute x has value y”.

Name  State  Has_cell_phone
Bob   NY     Y
Mary  NJ     N
Pete  CT     N
Sue   CT     Y
Anne  NJ     Y
Gus   NY     Y
Ken   NJ     N

Bitmaps on State (one bit per record, in table order): NY = 1000010, CT = 0011000
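As a minimal sketch of the definition above, the following builds one bitmap per distinct value of a column, using the example table from the slide. The function name `build_bitmap_index` and the list-of-bits representation are illustrative assumptions, not part of the original talk.

```python
from collections import defaultdict

# Example relation from the slide: (Name, State, Has_cell_phone)
RECORDS = [
    ("Bob", "NY", "Y"),
    ("Mary", "NJ", "N"),
    ("Pete", "CT", "N"),
    ("Sue", "CT", "Y"),
    ("Anne", "NJ", "Y"),
    ("Gus", "NY", "Y"),
    ("Ken", "NJ", "N"),
]

def build_bitmap_index(records, column):
    """Return {value: list of 0/1 bits}; bit i is 1 if record i has that value.

    Hypothetical helper for illustration only.
    """
    index = defaultdict(lambda: [0] * len(records))
    for i, rec in enumerate(records):
        index[rec[column]][i] = 1
    return dict(index)

state_index = build_bitmap_index(RECORDS, column=1)
```

Each bitmap here is uncompressed ("verbatim" in the talk's terminology); a real index would pack the bits into machine words.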

Why Use Bitmap Indices?
- Efficient representation of duplicates in an index.
  - A record ID requires 32 bits per indexed record; a bitmap index requires 1 bit per record per attribute value.
- Efficient evaluation of complex Boolean selection conditions.
  - Word-wise Boolean operations.
  - “New England states” vs. “Tri-state area” vs. “Coastal states”.
  - “People who work in a state different than their residence, drive a BMW or a Chevy, are married, and do not have a cell phone”.
- Other special tricks: fast COUNT aggregates.
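A small sketch of word-wise Boolean evaluation and a COUNT aggregate, assuming each bitmap is packed into a Python integer with bit i standing for record i. The bit values are taken from the example table (State and Has_cell_phone); the helper `to_word` is a name made up for this illustration.

```python
def to_word(bits):
    """Pack a list of 0/1 bits into one integer (bit i = record i)."""
    word = 0
    for i, b in enumerate(bits):
        if b:
            word |= 1 << i
    return word

# Bitmaps for the example table: Bob, Mary, Pete, Sue, Anne, Gus, Ken
ny = to_word([1, 0, 0, 0, 0, 1, 0])     # State = NY
ct = to_word([0, 0, 1, 1, 0, 0, 0])     # State = CT
cell = to_word([1, 0, 0, 1, 1, 1, 0])   # Has_cell_phone = Y

# "Lives in NY or CT and has a cell phone": two word-wise operations.
selected = (ny | ct) & cell

# COUNT(*) is a population count of the result bitmap.
count = bin(selected).count("1")
```

The selection touches no record data at all: the answer (Bob, Sue, and Gus) comes purely from combining bitmaps.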

Why NOT Use Bitmap Indices?
- High-cardinality attributes pose a problem.
  - More than 32 attribute values => high space overhead.
  - Inherently sequential bitmaps => expensive to perform large range queries.

Is There a Solution?
- Bitslice indices
  - Represent integer values as bits: 5 = 101.
  - Create a bitmap for each bit: B4, B2, B1.
  - Special algorithms for range queries, etc.
  - Problem: useful only in highly specialized cases.
- Compression
  - Compress the bitmaps using, e.g., gzip or one of the special bitmap compression algorithms.
  - Oracle B-tree indices: all duplicates represented with a compressed bitmap.
  - Problem: I have no guidelines for optimizing the index. What do I compress? Using which algorithm? How do I perform Boolean operations? HELP!
- Reference: S. Sarawagi, “Indexing OLAP Data”, Data Engineering Bulletin, 1997.
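To make the bitslice idea concrete, here is a rough sketch that stores one bitmap per binary digit of an integer attribute and answers an equality query by combining the slices. The function names and the sample values are assumptions for illustration; the talk does not give an implementation.

```python
def build_bit_slices(values, num_bits):
    """slices[k] is a bitmap (Python int) whose bit i is bit k of values[i]."""
    slices = [0] * num_bits
    for i, v in enumerate(values):
        for k in range(num_bits):
            if (v >> k) & 1:
                slices[k] |= 1 << i
    return slices

def equals(slices, target, num_records):
    """Bitmap of records whose value equals target (target must fit in the slices)."""
    all_rows = (1 << num_records) - 1
    result = all_rows
    for k, s in enumerate(slices):
        # Keep rows whose k-th bit matches the k-th bit of the target.
        result &= s if (target >> k) & 1 else (all_rows & ~s)
    return result

values = [5, 3, 6, 5, 1]                 # hypothetical attribute values
slices = build_bit_slices(values, 3)     # slices[0]=B1, slices[1]=B2, slices[2]=B4
rows_equal_5 = equals(slices, 5, len(values))
```

With three slices the index covers eight distinct values, instead of needing one bitmap per value.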

The Performance of Compressed Bitmap Indices
- There are many bitmap compression algorithms:
  - No compression (verbatim).
  - Gzip (using the zlib library).
  - Run Length Encoding (list of the number of 0’s between 1’s).
  - Variable bit length encoding (compressed RLE).
  - Variable byte length encoding (a hybrid technique).
- Which algorithm is best?
  - Best compression.
  - Fastest for Boolean operations.
- Is there an overall best algorithm, or should I choose on a per-bitmap basis?
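The run length encoding described above (the number of 0's between 1's) can be sketched as follows. Storing the total bitmap length alongside the gap list is an assumption made here so that trailing zeros can be restored on decoding.

```python
def rle_encode(bits):
    """Return (gaps, length): gaps[k] is the number of 0s before the k-th 1."""
    gaps, run = [], 0
    for b in bits:
        if b:
            gaps.append(run)
            run = 0
        else:
            run += 1
    return gaps, len(bits)

def rle_decode(gaps, length):
    """Rebuild the verbatim list of bits from the gap list."""
    bits = []
    for g in gaps:
        bits.extend([0] * g)
        bits.append(1)
    bits.extend([0] * (length - len(bits)))  # restore trailing zeros
    return bits

ct_bits = [0, 0, 1, 1, 0, 0, 0]
ct_gaps, ct_len = rle_encode(ct_bits)
```

A sparse bitmap turns into a short list of mostly large gaps, which is what the variable bit length codes then compress further.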

Special Bitmap Compression Codes
- Variable bit length codes (ExpGol).
  - Compress a run length encoding.
  - Use fewer bits for smaller runs.
  - Gamma code: 1, 010, 011, 00100, 00101, 00110, 00111, 0001000, ...
  - ExpGol: near optimal, N of M bits set => N(log(M/N)+2) bits.
- Variable byte length codes (BBC).
  - Compress the bitmap in units of bytes, create code words that are byte sequences.
  - Hybrid: represent long runs of zeros with a “gap” code, short runs with a piece of the verbatim bitmap.
  - Lots of code word packing tricks.
  - Operations can be fast because you avoid bit manipulations.
  - 1-pass compression => you can perform Boolean operations directly on the compressed representation.
  - But the algorithms are very complex and hard to code.
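The gamma code sequence listed above matches the Elias gamma code: a number whose binary form has k digits is written as k-1 zeros followed by those k digits. The sketch below encodes and decodes it using strings of "0" and "1" for readability. This illustrates only the gamma code, not the ExpGol or BBC formats, and since the code covers integers from 1 upward, a run length of zero would need an offset (for example, encoding run + 1), which is an assumption left to the caller.

```python
def gamma_encode(n):
    """Elias gamma code word for an integer n >= 1, as a string of bits."""
    if n < 1:
        raise ValueError("gamma code is defined for n >= 1")
    binary = bin(n)[2:]
    return "0" * (len(binary) - 1) + binary

def gamma_decode(code):
    """Decode a concatenation of gamma code words into a list of integers."""
    values, i = [], 0
    while i < len(code):
        zeros = 0
        while code[i] == "0":      # count the leading zeros
            zeros += 1
            i += 1
        values.append(int(code[i:i + zeros + 1], 2))
        i += zeros + 1
    return values

first_eight = [gamma_encode(n) for n in range(1, 9)]
```

Small run lengths get short code words, so clustered bitmaps with many short gaps compress well.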

Algorithms for Boolean Operations
- Bitmaps are used in data warehouses to perform complex Boolean selections.
  - Compression and decompression time is secondary.
  - I want fast algorithms to perform an operation between a compressed operand and a foundset.
- Four main algorithms:
  - Basic: Uncompress the bitmap, byte-wise operations with the foundset.
  - Inplace: Extract the list of set bits from the compressed bitmap, operate on the foundset in place.
  - Merge: The foundset and the bitmap are in RLE form. The list of bits is merged.
  - Direct: Encoding specific.

Basic
- Output: verbatim. Foundset: verbatim. Operand: verbatim.
- [Figure: the operand is applied to the foundset with &= or |=.]

Inplace
- Output: verbatim. Foundset: verbatim. Operand: RLE, ExpGol, or BBC.
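A rough sketch of the Inplace idea, assuming the operand is held as an RLE gap list (the number of 0's before each 1) and the foundset is a verbatim `bytearray` with record i stored in byte i >> 3 at bit i & 7. The bit layout and all function names are assumptions for illustration.

```python
def set_bit_positions(gaps):
    """Yield the positions of the 1 bits described by an RLE gap list."""
    pos = -1
    for g in gaps:
        pos += g + 1
        yield pos

def inplace_or(foundset, gaps):
    """foundset |= operand, touching only the operand's set bits."""
    for p in set_bit_positions(gaps):
        foundset[p >> 3] |= 1 << (p & 7)

def clear_range(foundset, start, stop):
    """Clear bits start .. stop-1 of the foundset (bit by bit, for clarity)."""
    for p in range(start, stop):
        foundset[p >> 3] &= ~(1 << (p & 7)) & 0xFF

def inplace_and(foundset, gaps, nbits):
    """foundset &= operand: clear every bit that falls in a zero run of the operand."""
    prev = -1
    for p in set_bit_positions(gaps):
        clear_range(foundset, prev + 1, p)
        prev = p
    clear_range(foundset, prev + 1, nbits)

foundset = bytearray([0b00100001])          # State = NY: records 0 and 5
inplace_or(foundset, [2, 0])                # OR with State = CT (records 2, 3)
after_or = bytes(foundset)
inplace_and(foundset, [0, 2, 0, 0], 7)      # AND with Has_cell_phone = Y
after_and = bytes(foundset)
```

The operand is never expanded into a full bitmap; for OR the work is proportional to the number of set bits in the operand. A production version would clear whole bytes at a time in `clear_range`.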

Merge
- Output: RLE. Foundset: RLE. Operand: RLE.
- [Figure: the foundset and the operand are merged with AND / OR to produce the output foundset.]
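The Merge algorithm can be sketched as a sorted-list merge. For simplicity the lists below hold absolute positions of set bits, which is equivalent to an RLE gap list after a running sum; that representation, and the function names, are assumptions rather than the talk's actual data layout.

```python
def merge_and(a, b):
    """Intersection of two sorted lists of set-bit positions."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def merge_or(a, b):
    """Union of two sorted lists of set-bit positions."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])   # at most one of these two tails is non-empty
    out.extend(b[j:])
    return out

ny_or_ct = merge_or([0, 5], [2, 3])
with_cell = merge_and(ny_or_ct, [0, 3, 4, 5])
```

Both inputs and the output stay in compressed form, so the cost depends on the number of set bits rather than on the number of records.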

Direct
- Output: BBC. Foundset: BBC. Operand: BBC.
- The idea is to create a new merged code word from the code words of the foundset and the bitmap.
- Creating BBC code words is a completely local process, so extensive partial results do not need to be saved.
- Unfortunately, the details are very complex.

Compression, Synthetic Data

Compression, Real Bitmap Indices
- ExpGol: 7.3 bits per tuple
- BBC 2S: 1.04 bits per tuple

Performing an Operation

Performing an Operation

Trends

What Is The Best Compression?
- Space: It depends on the properties of the bitmap.
  - Density, clusteredness.
  - Using the best compressor for all bitmaps in an index gives a 10% to 20% space savings.
- Time: It depends on the bitmap and the operations.
  - Simple analysis: I can store bitmaps in the Verbatim or the ExpGol format.
  - Assume the use of Inplace, 3X as many ORs as ANDs.
  - Include the cost of reading the bitmap from disk.
  - Compress the bitmap if the bit density is .05 or less.

What is the best way to evaluate a Boolean expression?
- Parse the Boolean expression into an operator tree.
- Assign an evaluation algorithm to each node.
  - Requires a global optimization.
- Rewrite the expression?
- Joint work with Sihem Amer-Yahia.

Assigning Evaluation Algorithms
- Assume that the expression tree and bitmap index encodings are fixed.
- We can estimate algorithm costs from the properties of the operand bitmaps.
- But different algorithms expect and produce results in different formats.
  - An additional translation step might be required.
- Global effects of local decisions => we need a global optimizer.

Algorithm  Foundset  Operand           Output
Basic      verbatim  verbatim          verbatim
Inplace    verbatim  RLE, ExpGol, BBC  verbatim
Merge      RLE       RLE               RLE
Direct     BBC       BBC               BBC

Bitmap Format Translations

Dynamic Programming Algorithm
- Take advantage of the fact that the decisions are localized.
- The cost to use an algorithm to evaluate an operator is:
  - The cost of the algorithm.
  - The cost of generating bitmaps from the subtrees in the necessary format.
  - The cost of translating the output to the desired output format.
- [Figure: an op node annotated with “cost for each algorithm”, “translation”, and “cost for each output format”; each of its two subtrees is annotated with “cost for each output format”.]
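The three cost components above suggest a recurrence over (node, output format) pairs. The sketch below is one possible reading of it, not the author's optimizer: every cost number is a made-up placeholder, ExpGol is left out for brevity, Inplace is shown with an RLE operand only, and the left child is always treated as the foundset.

```python
from functools import lru_cache

# algorithm -> (foundset format, operand format, output format, op cost)
# All costs are hypothetical placeholders, not measurements from the talk.
ALGORITHMS = {
    "Basic":   ("verbatim", "verbatim", "verbatim", 4),
    "Inplace": ("verbatim", "RLE",      "verbatim", 2),
    "Merge":   ("RLE",      "RLE",      "RLE",      1),
    "Direct":  ("BBC",      "BBC",      "BBC",      3),
}

def translate_cost(src, dst):
    """Hypothetical flat cost for converting a bitmap between formats."""
    return 0 if src == dst else 2

@lru_cache(maxsize=None)
def best_cost(node, out_format):
    """Cheapest cost to produce node's bitmap in out_format.

    node is ("leaf", stored_format) or ("op", left_subtree, right_subtree).
    Memoizing on (node, out_format) is what makes this dynamic programming.
    """
    if node[0] == "leaf":
        return translate_cost(node[1], out_format)
    _, left, right = node
    return min(
        op_cost
        + best_cost(left, f_fmt)                 # foundset in the needed format
        + best_cost(right, o_fmt)                # operand in the needed format
        + translate_cost(res_fmt, out_format)    # translate the result
        for f_fmt, o_fmt, res_fmt, op_cost in ALGORITHMS.values()
    )

inner = ("op", ("leaf", "RLE"), ("leaf", "RLE"))
outer = ("op", inner, ("leaf", "verbatim"))
```

With these placeholder costs, the cheapest plan for `outer` keeps everything in RLE and pays for translations at the edges, which shows how a locally cheap choice (Merge) interacts with format conversion costs elsewhere in the tree.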

Rewriting the Boolean Expression
- Eliminating the NOT operation:
  - X And Not Y ==> X And_Not Y
  - X Or Not Y ==> X Or_Not Y
- NOT can be an expensive operation, because it can generally be implemented using a Direct algorithm only.
- However, And_Not and Or_Not have fast algorithms:
  - And_Not: Inplace and Merge
  - Or_Not: Inplace
- Rewrite collections of AND, OR clauses to encourage the use of fast algorithms:
  - OR: Convert the densest to Verbatim, then use Inplace.
  - AND: Gather sparse operands and use Merge.
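The NOT-elimination rule can be sketched as a bottom-up tree rewrite. The tuple-based expression format and the lowercase operator names are assumptions chosen for this illustration; only the two rewrite rules themselves come from the slide.

```python
def rewrite(expr):
    """Fuse a NOT child into its parent: X and (not Y) -> and_not(X, Y), same for or.

    expr is either a bitmap name (str) or a tuple (op, child, ...).
    """
    if isinstance(expr, str):
        return expr
    op, *children = expr
    children = [rewrite(c) for c in children]
    if op in ("and", "or") and len(children) == 2:
        left, right = children
        if isinstance(right, tuple) and right[0] == "not":
            return (op + "_not", left, right[1])
        if isinstance(left, tuple) and left[0] == "not":
            # AND and OR are commutative, so the operands can be swapped.
            return (op + "_not", right, left[1])
    return (op, *children)

rewritten = rewrite(("and", ("or", "A", ("not", "B")), "C"))
```

A NOT that has no AND or OR parent (or whose sibling is also a NOT) is left in place and would still need the more expensive standalone NOT.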

Current Status
- Compressed Bitmap Representation:
  - Convert between any pair of representations (verbatim, BBC, etc.).
  - Perform AND, OR, NOT, AND_NOT, OR_NOT operations using Basic, Inplace, Merge, or Direct algorithms.
- Index:
  - One index per data file in delimited ASCII format.
  - Simple load-on-demand index to bitmap blocks for attribute values.
- Optimizer:
  - Cost model for operations, transforms.
  - Dynamic programming optimizer.
  - Expression rewriting.
- In combination: Works for simple expressions, still testing.

Furthermore ...
- Test data set: 5 attributes, up to 100 unique values per attribute. 2.6 Mbytes.
  - Gzip: .8 Mbytes
  - Compressed bitmap index on all attributes: .9 Mbytes
- So, I can compress a data set almost as efficiently as gzip, but with the data fully indexed.
- There are fast algorithms for computing COUNT aggregates over bitmaps.
  - Can I use the bitmap index to answer some OLAP queries?
- Conventional OLAP structures do not handle high dimensional data well, but high dimensionality poses only minor problems for bitmap indices.

Query Processing for OLAP
- Queries are:
    Select G1, ..., Gn, count(*)
    From Fact_Table
    Where C
    Group By G1, ..., Gn
- Strategy:
  - Compute the bitmap representing C.
  - For each grouping attribute Gi, compute the bitmap of each of its values gij.
  - Compute count(C And g1j1 And ... And gnjn) for each combination of values.
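The strategy above can be sketched with bitmaps held as Python integers (bit i = record i). The function names are illustrative assumptions, and the bitmaps are those of the seven-record example table, with the selection condition C taken to be Has_cell_phone = Y and a single grouping attribute, State.

```python
from itertools import product

def popcount(word):
    """Number of set bits in a bitmap held as a Python int."""
    return bin(word).count("1")

def group_by_count(condition, group_indexes):
    """Evaluate Select G1..Gn, count(*) ... Where C Group By G1..Gn.

    condition: bitmap for C.
    group_indexes: one {value: bitmap} dict per grouping attribute.
    Returns {(g1, ..., gn): count} for the non-empty groups.
    """
    results = {}
    for combo in product(*(idx.items() for idx in group_indexes)):
        bitmap = condition
        for _, value_bitmap in combo:
            bitmap &= value_bitmap          # C And g1j1 And ... And gnjn
        count = popcount(bitmap)
        if count:
            results[tuple(value for value, _ in combo)] = count
    return results

state_index = {"NY": 0b0100001, "NJ": 0b1010010, "CT": 0b0001100}
has_cell = 0b0111001                        # records 0, 3, 4, 5
counts = group_by_count(has_cell, [state_index])
```

Enumerating every combination of group values is exponential in the number of grouping attributes; this sketch makes no attempt to prune empty partial intersections, which a real implementation would likely do.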