Lecture 15: Bitmap Indexes

Slides:



Advertisements
Similar presentations
Introduction to Database Systems1 Records and Files Storage Technology: Topic 3.
Advertisements

ICOM 6005 – Database Management Systems Design Dr. Manuel Rodríguez-Martínez Electrical and Computer Engineering Department Lecture 8 – File Structures.
Chapter 11 Indexing and Hashing (2) Yonsei University 2 nd Semester, 2013 Sanghyun Park.
Advanced Databases: Lecture 2 Query Optimization (I) 1 Query Optimization (introduction to query processing) Advanced Databases By Dr. Akhtar Ali.
BTrees & Bitmap Indexes
Database Implementation Issues CPSC 315 – Programming Studio Spring 2008 Project 1, Lecture 5 Slides adapted from those used by Jennifer Welch.
Quick Review of Apr 15 material Overflow –definition, why it happens –solutions: chaining, double hashing Hash file performance –loading factor –search.
1.1 CAS CS 460/660 Introduction to Database Systems File Organization Slides from UC Berkeley.
DBMS Internals: Storage February 27th, Representing Data Elements Relational database elements: A tuple is represented as a record CREATE TABLE.
Lecture 6 Indexing Part 2 Column Stores. Indexes Recap Heap FileBitmapHash FileB+Tree InsertO(1) O( log B n ) DeleteO(P)O(1) O( log B n ) Range Scan O(P)--
Bitmap Indices for Data Warehouse Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY.
CS4432: Database Systems II Record Representation 1.
Variant Indexes. Specialized Indexes? Data warehouses are large databases with data integrated from many independent sources. Queries are often complex.
CPSC 404, Laks V.S. Lakshmanan1 Overview of Query Evaluation Chapter 12 Ramakrishnan & Gehrke (Sections )
Indexing Structures Database System Implementation CSE 507 Some slides adapted from Silberschatz, Korth and Sudarshan Database System Concepts – 6 th Edition.
CS4432: Database Systems II
BITMAP INDEXES Barot Rushin (Id :- 108).
Database System Architecture and Implementation Execution Costs 1 Slides Credit: Michael Grossniklaus – Uni-Konstanz.
Storage and File Organization
CS222: Principles of Data Management Lecture #4 Catalogs, Buffer Manager, File Organizations Instructor: Chen Li.
DB storage architectures: Rows, Columns, LSM trees
Module 11: File Structure
How To Build a Compressed Bitmap Index
CPS216: Data-intensive Computing Systems
CHP - 9 File Structures.
Record Storage, File Organization, and Indexes
Indexing Goals: Store large files Support multiple search keys
Database Management System
CS522 Advanced database Systems
Indexing ? Why ? Need to locate the actual records on disk without having to read the entire table into memory.
Database System Implementation CSE 507
Database Management Systems (CS 564)
Methodology – Physical Database Design for Relational Databases
Database Management Systems (CS 564)
Implementation Issues & IR Systems
Searching.
Evaluation of Relational Operations
Lecture 10: Buffer Manager and File Organization
Database Implementation Issues
DB storage architectures: Rows, Columns, LSM trees
CS222P: Principles of Data Management Lecture #2 Heap Files, Page structure, Record formats Instructor: Chen Li.
國立臺北科技大學 課程:資料庫系統 fall Chapter 18
Database Implementation Issues
File organization and Indexing
Chapter 11: Indexing and Hashing
Lecture 12 Lecture 12: Indexing.
Introduction to Database Systems
Lecture 19: Data Storage and Indexes
CS222/CS122C: Principles of Data Management Lecture #4 Catalogs, File Organizations Instructor: Chen Li.
Database Management Systems (CS 564)
The Physical Design Stage of SDLC (figures 2.4, 2.5 revisited)
Lecture 2- Query Processing (continued)
Chapter 13: Data Storage Structures
Database Systems (資料庫系統)
Database Design and Programming
DATABASE IMPLEMENTATION ISSUES
CSE 544: Lecture 11 Storing Data, Indexes
CS222/CS122C: Principles of Data Management Lecture #2 Storing Data: Disks and Files Instructor: Chen Li.
CS222p: Principles of Data Management Lecture #4 Catalogs, File Organizations Instructor: Chen Li.
ICOM 5016 – Introduction to Database Systems
Introduction to Database Systems CSE 444 Lectures 19: Data Storage and Indexes May 16, 2008.
External Sorting Sorting is used in implementing many relational operations Problem: Relations are typically large, do not fit in main memory So cannot.
Database Implementation Issues
Chapter 11: Indexing and Hashing
Advance Database System
CS222/CS122C: Principles of Data Management UCI, Fall 2018 Lecture #2 Storing Data: Record/Page Formats Instructor: Chen Li.
Database Implementation Issues
Lecture 20: Representing Data Elements
Database Implementation Issues
MIS 451 Building Business Intelligence Systems
Presentation transcript:

Lecture 15: Bitmap Indexes

What you will learn about in this section Lecture 15 What you will learn about in this section Bitmap Indexes Storing a bitmap index Bitslice Indexes

Lecture 15 1. Bitmap indexes

Motivation Consider the following table: CREATE TABLE Tweets ( uniqueMsgID INTEGER, -- unique message id tstamp TIMESTAMP, -- when was the tweet posted uid INTEGER, -- unique id of the user msg VARCHAR (140), -- the actual message zip INTEGER, -- zipcode when posted retweet BOOLEAN -- retweeted? ); SELECT * FROM Tweets WHERE uid = 145; Consider the following query, Q1: And, the following query, Q2: SELECT * FROM Tweets WHERE zip BETWEEN 53000 AND 54999 Speed-up queries using a B+-tree for the uid and the zip values.

In a B+-tree, how many bytes do we use for each record? Motivation Consider the following table: CREATE TABLE Tweets ( uniqueMsgID INTEGER, -- unique message id tstamp TIMESTAMP, -- when was the tweet posted uid INTEGER, -- unique id of the user msg VARCHAR (140), -- the actual message zip INTEGER, -- zipcode when posted retweet BOOLEAN -- retweeted? ); In a B+-tree, how many bytes do we use for each record? At least key + rid, so key-size+rid-size Can we do better, i.e. an index with lower storage overhead? Especially for attributes with small domain cardinalities? Bit-based indices: Two flavors Bitmap indices and Bitslice indices

Bitmap Indices Consider building an index to answer equality queries on the retweet attribute Issues with building a B-tree: Three distinct values: True, False, NULL Lots of duplicates for each distinct value Sort of an odd B-tree with three long rid lists Bitmap Index: Build three bitmap arrays (stored on disk), one for each value. The ith bit in each bitmap correspond to the ith tuple (need to map ith position to a rid) 2

Bitmap Example Table (stored in a heapfile) Bitmap index on “retweet” uniqueMsgID … zip retweet 1 11324 Y 2 53705 3 53706 N 4 NULL 5 90210 1,0000,000,000 R-Yes 1 … R-No 1 … R-Null 1 … Could omit building one of the bitmaps and build it on-the-fly, but reconstruction could be expensive. SELECT * FROM Tweets WHERE retweet = ‘N’ Scan the R-No Bitmap file For each bit set to 1, compute the tuple # Fetch the tuple # (s)

Critical Issue Implications of #1? Need an efficient way to compute a bit position Layout the bitmap in page id order. Need an efficient way to map a bit position to a record id. How? If you fix the # records per page in the heapfile And lay the pages out so that page #s are sequential and increasing Then can construct rid (page-id, slot#) page-id = Bit-position / #records-per-page slot# = Bit-position % #records-per-page With variable length records, have to set the limit based on the size of the largest record, which may result in under-filled pages.

Other Queries Table (stored in a heapfile) Bitmap index on “retweet” uniqueMsgID … zip retweet 1 11324 Y 2 53705 3 53706 N 4 NULL 5 90210 1,0000,000,000 R-Yes 1 … R-No 1 … R-Null 1 … Count efficiently using a indirection SELECT COUNT(*) FROM Tweets WHERE retweet = ‘N’ SELECT * FROM Tweets WHERE retweet IS NOT NULL

Lecture 15 2. Storing a bitmap index

Storing the Bitmap index One bitmap for each value, and one for Nulls Need to store each bitmap Simple method: 1 file for each bitmap Can compress the bitmap! Index size? #tuples * (cardinality of the domain + 1) bits When is a bitmap index more space efficient than a B+-tree? #distinct values < data entry size in the B+-tree

Lecture 15 3. Bit-sliced Index

Bit-sliced Index: Motivation (Re)consider the following table: CREATE TABLE Tweets ( uniqueMsgID INTEGER, -- unique message id tstamp TIMESTAMP, -- when was the tweet posted uid INTEGER, -- unique id of the user msg VARCHAR (140), -- the actual message zip INTEGER, -- zipcode when posted retweet BOOLEAN -- retweeted? ); SELECT * FROM Tweets WHERE zip = 53706 There are approximately 43,000 ZIP Codes in the United States = 16 bits There are more than 45,000,000 ZIP+4 codes of U.S The number of zip codes in the US periodically changes as new codes are added in response to increasing mail volume. In a period of one year, November 2006 to 2007, around 200 zip codes were added in the U.S. Would we build a bitmap index on zipcode?

Bit-sliced index Why do we have 17 bits for zipcode? Table Bit-sliced index (1 slice per bit) Slice 0 Slice 1 Slice 16 uniqueMsgID … zip retweet 1 11324 Y 2 53705 3 53706 N 4 NULL 5 90210 1,0000,000,000 00010110000111100 01101000111001001 01101000111001010 10110000001100010 … 1 … Query evaluation: Walk through each slice constructing a result bitmap e.g. zip ≤ 11324, skip entries that have 1 in the first three slices (16, 15, 14) (Null bitmap is not shown)

Bitslice Indices Can also do aggregates with Bitslice indices E.g. SUM(attr): Add bit-slice by bit-slice. First, count the number of 1s in the slice17, and multiply the count by 217 Then, count the number of 1s in the slice16, and multiply the count by … Store each slice using methods like what you have for a bitmap. Note once again can use compression

Bitmap v/s Bitslice Bitmaps better for low cardinality domains Bitslice better for high cardinality domains Generally easier to “do the math” with bitmap indices