BITMAP INDEXES E0 261 Jayant Haritsa Computer Science and Automation

Slides:



Advertisements
Similar presentations
Bitmap Index Design and Evaluation Ariel Noy Data representation and retrieval seminar By: Chee-Yong Chan Yannis E.Ioannidis.
Advertisements

Query Optimization CS634 Lecture 12, Mar 12, 2014 Slides based on “Database Management Systems” 3 rd ed, Ramakrishnan and Gehrke.
Chapter 11 Indexing and Hashing (2) Yonsei University 2 nd Semester, 2013 Sanghyun Park.
Multidimensional Data. Many applications of databases are "geographic" = 2­dimensional data. Others involve large numbers of dimensions. Example: data.
Multidimensional Data Rtrees Bitmap indexes. R-Trees For “regions” (typically rectangles) but can represent points. Supports NN, “where­am­I” queries.
Advanced Databases: Lecture 2 Query Optimization (I) 1 Query Optimization (introduction to query processing) Advanced Databases By Dr. Akhtar Ali.
Query Evaluation. An SQL query and its RA equiv. Employees (sin INT, ename VARCHAR(20), rating INT, age REAL) Maintenances (sin INT, planeId INT, day.
BTrees & Bitmap Indexes
CSCI 5708: Query Processing I Pusheng Zhang University of Minnesota Feb 3, 2004.
1 Anna Östlin Pagh and Rasmus Pagh IT University of Copenhagen Advanced Database Technology March 25, 2004 QUERY COMPILATION II Lecture based on [GUW,
Quick Review of Apr 15 material Overflow –definition, why it happens –solutions: chaining, double hashing Hash file performance –loading factor –search.
ITIS 5160 Indexing. Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of.
CS 202, Spring 2003 Fundamental Structures of Computer Science II Bilkent University1 Sorting CS 202 – Fundamental Structures of Computer Science II Bilkent.
Database Systems Normal Forms. Decomposition Suppose we have a relation R[U] with a schema U={A 1,…,A n } – A decomposition of U is a set of schemas.
©Silberschatz, Korth and Sudarshan13.1Database System Concepts Chapter 13: Query Processing Overview Measures of Query Cost Selection Operation Sorting.
Chapter 13 Query Processing Melissa Jamili CS 157B November 11, 2004.
ITCS 6163 Lecture 5. Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of.
CSC 41/513: Intro to Algorithms Linear-Time Sorting Algorithms.
The Fundamentals: Algorithms, the Integers & Matrices.
September, 2002 Efficient Bitmap Indexes for Very Large Datasets John Wu Ekow Otoo Arie Shoshani Lawrence Berkeley National Laboratory.
Sec 14.7 Bitmap Indexes Shabana Kazi. Introduction A bitmap index is a special kind of index that stores the bulk of its data as bit arrays (commonly.
CS4432: Database Systems II Query Processing- Part 2.
Variant Indexes. Specialized Indexes? Data warehouses are large databases with data integrated from many independent sources. Queries are often complex.
Physical Database Design Purpose- translate the logical description of data into the technical specifications for storing and retrieving data Goal - create.
Chapter 5 Index and Clustering
CPSC 404, Laks V.S. Lakshmanan1 Overview of Query Evaluation Chapter 12 Ramakrishnan & Gehrke (Sections )
March, 2002 Efficient Bitmap Indexing Techniques for Very Large Datasets Kesheng John Wu Ekow Otoo Arie Shoshani.
Indexing Structures Database System Implementation CSE 507 Some slides adapted from Silberschatz, Korth and Sudarshan Database System Concepts – 6 th Edition.
1 Overview of Query Evaluation Chapter Outline  Query Optimization Overview  Algorithm for Relational Operations.
Query Optimization. overview Application Programmer (e.g., business analyst, Data architect) Sophisticated Application Programmer (e.g., SAP admin) DBA,
Linear Programming Revised Simplex Method, Duality of LP problems and Sensitivity analysis D Nagesh Kumar, IISc Optimization Methods: M3L5.
ITIS 5160 Indexing.
Query Processing Exercise Session 4.
Database Management System
Indexing ? Why ? Need to locate the actual records on disk without having to read the entire table into memory.
Database System Implementation CSE 507
The Variable-Increment Counting Bloom Filter
Database Management Systems (CS 564)
Chapter 12: Query Processing
B-Trees Disk Storage What is a multiway tree? What is a B-tree?
COST ESTIMATION FOR THE RELATIONAL ALGEBRA OPERATIONS MIT 813 GROUP 15 PRESENTATION.
Chapter 15 QUERY EXECUTION.
Database Management Systems (CS 564)
Chapter 11: Indexing and Hashing
On Spatial Joins in MapReduce
Relational Operations
Solution of Equations by Iteration
Chap 3. The simplex method
King Fahd University of Petroleum and Minerals
Chapter 3 The Simplex Method and Sensitivity Analysis
Lecture 15: Bitmap Indexes
Indexing and Hashing Basic Concepts Ordered Indices
Ch. 8 Priority Queues And Heaps
Dual Bitmap Index: Space-Time Efficient Bitmap
B-Trees Disk Storage What is a multiway tree? What is a B-tree?
Lecture 2- Query Processing (continued)
Copyright © Cengage Learning. All rights reserved.
B-Trees Disk Storage What is a multiway tree? What is a B-tree?
LINEAR HASHING E0 261 Jayant Haritsa Computer Science and Automation
Chapter 11 Indexing And Hashing (1)
Linear-Time Sorting Algorithms
The Fundamental Theorem of Calculus
CUBE MATERIALIZATION E0 261 Jayant Haritsa
Overview of Query Evaluation
DATABASE HISTOGRAMS E0 261 Jayant Haritsa
General External Merge Sort
Wavelet-based histograms for selectivity estimation
Linear Time Sorting.
Algorithm Course Algorithms Lecture 3 Sorting Algorithm-1
Chapter 2. Simplex method
Presentation transcript:

BITMAP INDEXES E0 261 Jayant Haritsa Computer Science and Automation Indian Institute of Science

Low-Cardinality Domains e.g. in IISc: Gender, Program, Home state B-trees or hashing don’t work well for low-cardinality domains, which are common in decision-support environments Intersecting lists of RIDS for satisfying conjunction of predicates is an expensive operation

Bitmap Index SRno name gender grade Bit-vector: 1 bit for each Bitmap Index is defined on attribute A as a sequence of M bitmaps, where M is the number of distinct values of A, and record ‘j’ has a 1 in the bitmap corresponding to its value, and 0 otherwise – the height of each bitmap is equal to the relation’s cardinality. Grade A+ A B C D Gender M F SRno name gender grade Bit-vector: 1 bit for each possible value. Query: Male students with A+ grade - answer derived by ANDing the Male and A+ bitvectors

Features well suited for low cardinality columns utilizes bitwise operations like AND, OR, XOR, NOT which are efficiently supported by hardware uses small amount of space performs efficiently with columns involving scalar functions (e.g.,COUNT) many variations to reduce space requirement and improve query performance

Bitmap Design Space Attribute Value Decomposition Determines structure of index - number and size of index component Index Encoding Scheme Determines how index components are encoded Value-list index corresponds to the examples thus far; bit-sliced is what is recommended in paper

(1) Attribute Value Decomposition Given C, the attribute value cardinality, and a sequence of n numbers <bn , bn-1 ,… , b1 >, such that C ≤ bi , attribute value A is decomposed into n digits AnAn-1… A1 where Ai is a base-bi digit Example C = 1000 and attribute value A = 256 <bn , bn-1 ,… , b1 > Decomposition of A <1000> 256 <50,20> 12 (20) + 16 <32,32> 8 (32) + 0 <5,20,10> 1(20)(10) + 5(10) + 6 Think of b_0 = 1 Each <bn , bn-1 ,… , b1 > base of index defines an n-component index.

AVD Details Consider an attribute value v and a sequence of (n-1) numbers < bn-1 ,… , b 1 > and define bn = Then v can be decomposed into sequence of n digits <vn , vn-1 ,… ,v 1 > : v = V1 v = 256 and index <5,20,10> = V2b1 + v1 = 25 (10) + 6 = V3(b2 b1) + v2b1 + v1 = 1 (20)(10) + 5 (10) + 6 = V4(b3 b2 b1) + v3(b2 b1) + v2b1 + v1 = …… = vn + vi + …. + v2b1 + v1 where vi = Vi mod bi, Vi = 1 < i ≤ n, and each digit vi is in the range 0 ≤ vi < bi. If bn = bn-1 =….= b 1 = b, base is called uniform and the index is called base-b.

(2) Bitmap Encoding Schemes Two basic ways to encode a base bi value x (i.e.,0≤ x< bi) using bi bits Equality Encoded Bitmap = { records with Ai = x } Range Encoded Bitmap = { records with Ai ≤ x } is not materialized since all its bits are set to 1.

Equality Encoded Base <3,3> Index The projection is with retention of duplicates; Note that 9 bitmaps of original Value list have now become 6 bitmaps with AVD

Range Encoded Base <3,3> Index

Evaluation Algorithm for Selection Queries on Range-Encoded Bitmaps RangeEval [SIGMOD97] Evaluates each range predicate by computing two bitmaps The BEQ bitmap Either BGT or BLT RangeEval-Opt (this paper): avoids the intermediate equality predicate evaluation by evaluating each range predicate in term of only ≤ . A < v == A ≤ v-1 A > v == ! (A ≤ v) A ≥ v == !(A ≤ v-1) Working with only one bitmap B which reduces number bitmap operations by 50% one less bitmap scan for range predicate evaluation

Algorithm: RangeEval BLT = BLT V (BEQ Λ ) BGT = BGT V (BEQ Λ ) BEQ = BEQ Λ Some complications due to handling null values, rare in a warehouse. Bnn is for this; Start with n and go down to 1 because you want to figure out less than and greater than from MSB Also, the “op” of less than, greater than, etc. is not shown explicitly here – but important to know that less than uses only LT and EQ, no computation of GT or GE is done at all.

RangeEval: <3,3> Example b2 = 3, b1 =3 Consider selection predicate A ≤ 5 <v2,v1> = <1,2> = [1,0][0,0] i= 2, v2 = 1 BLT V (BEQ Λ ) = BLT BEQ Λ = BEQ 1 Λ 1 = 1 1 Λ 0 = 0 0 V ( 1 Λ 0 ) = 0 0 V ( 1 Λ 1 ) = 1 Start with BLT = all 0; BEQ = all 1 (if no null values) In this slide B20 , then B21 and B20 in the exclusive-OR

RangeEval (Example continued) Consider selection predicate A ≤ 5 <v2,v1> = <1,2> = [1,0][0,0] i= 1, v1 = 2 BLT V (BEQ Λ ) = BLT BEQ Λ Bibi-2 = BEQ BLE =BLT V BEQ 0 V ( 1 Λ 1 ) = 1 1 V ( 0 Λ 0 ) = 1 1 V ( 0 Λ 1 ) = 1 0 V ( 0 Λ 1 ) = 0 0 V ( 0 Λ 0 ) = 0 1 Λ 0 = 0 0 Λ 1 = 0 0 Λ 0 = 0 0 Λ 0 = 0 0 Λ 1 = 0 1 Here, first B11, then (B11 bar)

RangeEval-Opt

RangeEval-Opt (Example) b2 = 3, b1 =3 Consider selection predicate A ≤ 5 <v2,v1> = <1,2> = [1,0][0,0] i= 2 , v2 = 1 V ( Λ B ) = B 0 V (1 Λ 1) = 1 1 V (1 Λ 1) = 1 0 V (0 Λ 1) = 0 Range-Eval-Opt is bottom-up, while Range-Eval is top-down! (i.e. i = 1 to n, vs. i = n to 1 in algo) Lines 7 and 8 done jointly in B expression; B20 and B21

Worst-case Analysis For RangeEval Algorithm For RangeEval- Opt Algorithm

Experimental comparison (a) Average Number of Bitmap Scans as a Function of (Uniform) Base Number. (b) Average Number of Bitmap Operations as a Function of (Uniform) Base Number.

Space/Time Cost Models Space(I) number of bitmaps stored Time(I) Expected number of bitmap scans for a selection query queries in the query space Q are uniformly distributed, where Q = { A op v : op ∊ {≤,≥,<,>,=, ≠}, 0 ≤ v < C } I/O and CPU are correlated, hence ignore CPU

Comparison of Encoding Schemes for Selection Queries Theorem 5.1 For equality encoded Bitmap Index For range encoded Bitmap Index [with Range-Eval-Opt] For the case where bi = 2 for equality-encoding, only one bitmap needs to be stored explicitly since the other bitmap can be obtained by complementing the materialized bitmap.

Proof of Time for Range Encoded Index with Range-Eval-Opt Computed expected number of scans for r (range predicate) and e (equality predicate) Range

Proof (contd) Equality

Space Optimal Indexes Theorem 6.1 The number of bitmaps in an n-component space-optimal index is given by n(b-2)+r, where and r is the smallest positive integer such that br (b-1)n-r ≥ C. Index with base is an n-component space-optimal index. The space-efficiency of space-optimal indexes is a non-decreasing function of the number of components.

Proof for Theorem 6.1 Proof: Lemma 1 For any n-component range-encoded bitmap index In with attribute cardinality C, there exists another n-component range-encoded bitmap index I’n with attribute cardinality C such that Space(In) = Space(I’n) the difference between any two base numbers of I’nis at most one, and the product of all the base numbers of I’n is ≥ that of In Proof: Let <bn , bn-1 ,… , b 1 > be the base of In. Let bmax = max { b i } , 1 ≤ i≤n bmin = min { b i } , 1 ≤ i≤n Consider the non-trivial case where bmax - bmin ≥ 2 . Let bp = bmax and bq = bmin for some 1 ≤ p, q ≤ n. Define another n-component bitmap index I’n with base <b’n , b’n-1 ,… , b ’1 >, where Clearly, Space(In) = Space(I’n) and b’pb’q = (bp -1)(bq +1) = bpbq + (bp - bq) - 1 and (bp – bq) ≥ 2, leading to π b’i > π bi By a finite number of applications of the above base number adjustment to In, we will obtain an n-component bitmap index I’n such that all the three properties hold.

Proof for Theorem 6.1 (contd) Main Proof of (1): For any n-component range encoded bitmap index with attribute cardinality C, its maximum base number must be at least equal to b , because (b-1)n < C ≤ (b)n Also by Lemma 1 it follows that is an n-component space optimal index with attribute cardinality C. Number of bitmaps = (n-r) (b-2) + r (b-1) = n(b-2) + r Make r as small as possible, subject to br (b-1)n-r ≥ C.

Proof for Theorem 6.1 (contd) 2) The space-efficiency of space-optimal indexes is a non-decreasing function of the number of components. Proof: Let < bn , bn-1 ,… , b 1 > be the base of an n-component space-optimal bitmap index In with attribute cardinality C, where n < n <  1 ≤ p ≤ n such that bp > 2 Define an (n+1)-component bitmap index In+1 with attribute cardinality C and base < b’n+1, b’n ,… , b’1 > where In+1 is well-defined since Using Equation Space(In ) – Space(In+1) = (since bp > 2). It follows that an (n+1)-component space-optimal bitmap index is at least as space-efficient as an n-component space-optimal bitmap index. Space(In ) – Space(In+1) = (bp – 1) – [(bn+1 – 1) + (ceil[bp /2] – 1)] = (bp – 1) – [(2 – 1) + (ceil[bp /2] – 1)] = (floor[bp /2] – 1)

Time Optimal Indexes 3) The base of the n-component time-optimal index is Proof : By the equation Time(In) is minimized when b1 ≥ bi, 1 < i ≤ n, and the base numbers are as small as possible (subject to the constraint ). Thus it follows that Time(In) is minimized when bi = 2 , 1 < i ≤ n. leading to b1 =

Time Optimal Indexes 4)The time-efficiency of time-optimal indexes is a non-increasing function of the number of components. Proof: By the equation and result (3) Time(Intime) = where bn,1 = . So Time (In+1time) – Time(Intime) = Since bn,1 ≥ bn+1,1 ≥ 2  Time (In+1time) ≥ Time(Intime) Max value of the expression in the second term (4/3 …) is 1/3. Therefore 1 – 1/3 is greater than zero.

Bitmap Index with optimal Space-time Tradeoff (Knee) knee index corresponds to the index with the best space-time tradeoff. approximate knee of all index with knee of space-optimal indexes (Fig10) Space-time tradeoff for space optimal indexes the knee of the space-time tradeoff graph for space-optimal indexes usually corresponds to a 2-component index Number of components

Theorem 7.1 The base of the most time-efficient 2-component space-optimal index is given by < b2 - δ, b1 + δ >, where b1 = b2 = and Proof : Let I be a 2- component index with base <b2,b1> as per above values. By Theorem 6.1, I is a 2-component space-optimal bitmap index. Time(I) is minimized if b1≥b2 because Also, by Theorem 8.1, Time (I) is minimized when b2 is minimized  δ is largest non-negative integer value such that (b2- δ)*(b1+ δ) ≥C The idea here is to use the gap between ceil of C and C itself. That is if, C is 1000 and the components are 32 apiece, instead of 32, 32, can use 28,36 because product is > 1000, and will be time-optimal.

IMPLEMENTATION

Bitmap Index storage schemes Bitmap-Level Storage (BS) Each bitmap is stored individually in a single bitmap file of N bits. Thus, the bitmap index is stored in n N-bit bitmap files. Component-Level Storage (CS) Each index component is stored individually in a row-major order in a single bitmap file of N * ni bits, 1≤ i ≤ k. Thus, the bitmap index is stored in k bitmap files. CS has compression better than BS and IS Index-Level Storage (IS) The entire index is stored in a row-major order in a single bitmap file of Nn bits.

Experimental Setup Dataset 1 : Dataset 2 : Small attribute cardinality Large attribute cardinality The data compression code used is from library zlib library

Compression Comparison Compressibility is the size of cSchemeX/BS; so low numbers are good – hence cCS is the best in terms of space-efficiency

Performance Comparison cCS-indexes are most space-efficient cBS-indexes are most time-efficient since no additional baggage has to be carried/decompressed Compressibility is the size of cSchemeX/BS; so low numbers are good – hence cCS is the best in terms of space-efficiency

Efficiency Metrics Space Metric Time Metric Total space of all its bitmaps (in MBytes) Time Metric Average predicate evaluation time (in seconds) The time to read the bitmaps The time for in-memory bitmap decompression (for compressed bitmap) The time for bitmap operations

Experimental Results (a) Time Vs Number of components

Experimental Results (b) Space Vs Number of components

Experimental Results (c) Time Vs Space

Conclusions This paper gives the general framework to study the design space of bitmap indexes for selection queries and examines several space-time tradeoff issues. Range-encoded bitmap indexes offer, in most cases, better space-time tradeoff than equality-encoded bitmap indexes. cBS-indexes are most time efficient and cCS-indexes are most space efficient. But BS, cBS-indexes have comparable space time tradeoff which is better than that of cCS

END BITMAP INDEXES E0 261