Database System Implementation CSE 507

Slides:



Advertisements
Similar presentations
Tuning: overview Rewrite SQL (Leccotech)Leccotech Create Index Redefine Main memory structures (SGA in Oracle) Change the Block Size Materialized Views,
Advertisements

CpSc 3220 File and Database Processing Lecture 17 Indexed Files.
©Silberschatz, Korth and Sudarshan12.1Database System Concepts Chapter 12: Part C Part A:  Index Definition in SQL  Ordered Indices  Index Sequential.
Indexing (Cont.) These slides are a modified version of the slides of the book “Database System Concepts” (Chapter 12), 5th Ed., McGraw-Hill,McGraw-Hill.
Chapter 11 Indexing and Hashing (2) Yonsei University 2 nd Semester, 2013 Sanghyun Park.
Database System Concepts, 6 th Ed. ©Silberschatz, Korth and Sudarshan See for conditions on re-usewww.db-book.com Chapter 11: Indexing.
©Silberschatz, Korth and Sudarshan12.1Database System Concepts Chapter 12: Indexing and Hashing Basic Concepts Ordered Indices B+-Tree Index Files B-Tree.
Database System Concepts, 5th Ed. ©Silberschatz, Korth and Sudarshan See for conditions on re-usewww.db-book.com Chapter 12: Indexing and.
Database System Concepts, 5th Ed. ©Silberschatz, Korth and Sudarshan See for conditions on re-usewww.db-book.com Chapter 12: Indexing and.
©Silberschatz, Korth and Sudarshan13.1Database System Concepts Chapter 13: Query Processing Overview Measures of Query Cost Selection Operation Sorting.
Database Implementation Issues CPSC 315 – Programming Studio Spring 2008 Project 1, Lecture 5 Slides adapted from those used by Jennifer Welch.
Quick Review of Apr 15 material Overflow –definition, why it happens –solutions: chaining, double hashing Hash file performance –loading factor –search.
Advanced Querying OLAP Part 2. Context OLAP systems for supporting decision making. Components: –Dimensions with hierarchies, –Measures, –Aggregation.
Query Optimization 3 Cost Estimation R&G, Chapters 12, 13, 14 Lecture 15.
Bitmap Indexes.
Lecture 6 Indexing Part 2 Column Stores. Indexes Recap Heap FileBitmapHash FileB+Tree InsertO(1) O( log B n ) DeleteO(P)O(1) O( log B n ) Range Scan O(P)--
©Silberschatz, Korth and Sudarshan12.1Database System Concepts B + -Tree Index Files Indexing mechanisms used to speed up access to desired data.  E.g.,
Oracle Data Block Oracle Concepts Manual. Oracle Rows Oracle Concepts Manual.
©Silberschatz, Korth and Sudarshan13.1Database System Concepts Chapter 13: Query Processing Overview Measures of Query Cost Selection Operation Sorting.
Computing & Information Sciences Kansas State University Tuesday, 03 Apr 2007CIS 560: Database System Concepts Lecture 29 of 42 Tuesday, 03 April 2007.
Lecture 1- Query Processing Advanced Databases Masood Niazi Torshiz Islamic Azad university- Mashhad Branch
Variant Indexes. Specialized Indexes? Data warehouses are large databases with data integrated from many independent sources. Queries are often complex.
Physical Database Design Purpose- translate the logical description of data into the technical specifications for storing and retrieving data Goal - create.
Query Execution. Where are we? File organizations: sorted, hashed, heaps. Indexes: hash index, B+-tree Indexes can be clustered or not. Data can be stored.
File Processing : Query Processing 2008, Spring Pusan National University Ki-Joune Li.
Improved Query Performance With Variant Indexes Patrick O’Neil, Dallan Quass Presented by Bo Han.
Query Processing – Implementing Set Operations and Joins Chap. 19.
Query Execution Query compiler Execution engine Index/record mgr. Buffer manager Storage manager storage User/ Application Query update Query execution.
Alon Levy 1 Relational Operations v We will consider how to implement: – Selection ( ) Selects a subset of rows from relation. – Projection ( ) Deletes.
Database System Concepts, 6 th Ed. ©Silberschatz, Korth and Sudarshan See for conditions on re-usewww.db-book.com Chapter 11: Indexing.
Indexing Structures Database System Implementation CSE 507 Some slides adapted from Silberschatz, Korth and Sudarshan Database System Concepts – 6 th Edition.
CS4432: Database Systems II
Data Indexing Herbert A. Evans.
Database Management System
Indexing ? Why ? Need to locate the actual records on disk without having to read the entire table into memory.
Database System Implementation CSE 507
COMP 430 Intro. to Database Systems
Chapter 11: Indexing and Hashing
Database Management Systems (CS 564)
Chapter 12: Query Processing
Evaluation of Relational Operations
Hashing Chapter 11.
Evaluation of Relational Operations: Other Operations
Database Implementation Issues
File organization and Indexing
Chapter 11: Indexing and Hashing
Chapter 11: Indexing and Hashing
Relational Operations
Dynamic Hashing Good for database that grows and shrinks in size
Chapter 12: Indexing and Hashing
Introduction to Database Systems
Lecture 15: Bitmap Indexes
Indexing and Hashing Basic Concepts Ordered Indices
Lecture 2- Query Processing (continued)
Chapter 12: Indexing and Hashing
BITMAP INDEXES E0 261 Jayant Haritsa Computer Science and Automation
Database Design and Programming
DATABASE IMPLEMENTATION ISSUES
INDEXING.
Chapter 12: Indexing and Hashing
Lecture 13: Query Execution
Evaluation of Relational Operations: Other Techniques
Sorting We may build an index on the relation, and then use the index to read the relation in sorted order. May lead to one disk block access for each.
Data Dictionary Storage
Database Implementation Issues
Chapter 11: Indexing and Hashing
Evaluation of Relational Operations: Other Techniques
Database Implementation Issues
Index Structures Chapter 13 of GUW September 16, 2019
Presentation transcript:

Database System Implementation CSE 507 Indexing Structures Some slides adapted from Silberschatz, Korth and Sudarshan Database System Concepts – 6th Edition. And O’Neil et. Al., “Improved Query Performance with Variant Indexes,” ACM SIGMOD 1997.

Duplicates in a B+ tree Mens* MensFormals MensJeans R200 Hugo Boss 2PC Leaf Node MensFormals MensJeans R200 Hugo Boss 2PC $600 R190 Levis- Medium $50 R150 Gap - Large $45 … RID Sales Table indexed on the Department column

Bitmaps– A better way to index duplicates A bitmap is simply an array of bits Records in a relation are numbered sequentially from 0 to n. Applicable on attributes that take on a relatively small number of distinct values E.g. gender, country, state, … E.g. income-level (income broken up into a small number of levels such as 0-9999, 10000-19999, 20000-50000, 50000- infinity) Material adapted from Silberchatz, Korth and Sudarshan

Bitmap Indices A bitmap index on an attribute has a bitmap for each value of the attribute Bitmap has as many bits as records In a bitmap for value v, the bit for a record is 1 if the record has the value v for the attribute, and is 0 otherwise. Material adapted from Silberchatz, Korth and Sudarshan

Queries on Bitmap Indices Queries are answered using bitmap operations Intersection (and) Union (or) Complementation (not) Material adapted from Silberchatz, Korth and Sudarshan

Queries on Bitmaps Each operation takes two bitmaps of the same size and applies the operation on corresponding bits to get the result bitmap E.g. 100110 AND 110011 = 100010 100110 OR 110011 = 110111 NOT 100110 = 011001 Males with income level L1: 10010 AND 10100 = 10000 Can then retrieve required tuples. Counting number of matching tuples is also fast. Material adapted from Silberchatz, Korth and Sudarshan

Queries on Bitmaps Bitmaps need to be updated after insert operations. Deletion needs to be handled properly. Renumbering rows and shifting bits in bitmaps becomes expensive. Existence bitmap to note if there is a valid record at a record location Usually one has to “AND” the result with existence bitmap to get only the valid rows. Example for complementation not(A=v): (NOT bitmap-A-v) AND ExistenceBitmap

Some Implementation Details Bitmap indices generally very small compared with relation size. Density of a bitmap index: Said to be dense if number of 1-bits are large. For a column with 32 values  avg density = 1/32 Typically we would have millions of RowIDs in a bitmap. In such a case bitmaps can be broken into fragments of equal size. Each fragment fits into a single disk page/block. If bitmaps are sparse, then convert into a RowID list representation. If a column has many unique values then put a B+ tree on top of bitmaps for a column.

Some Implementation Details A series of bitmap fragments making up the entry for “department = sports” for a bitmap index on the “department” column of a sales table.

Some Implementation Details Bitmap and RowID-list representations are interchangeable. When Bitmaps are dense, then prefer bitmap representation Else switch to a RowID representation. Indeed a Bitmap index can part RowID lists and part Bitmaps. Authors call this hybrid form as Value-List Index.

Projection Indexes Col1 Col2 v1 v2 . vk Col3 Col4 Col2 v1 v2 . vk Reminiscent of vertical partitioning of a table. A projection index for column duplicates all column values for lookup by ordinal number . Col1 Col2 v1 v2 . vk Col3 Col4 Col2 v1 v2 . vk projection index for col2

Projection Indexes Col2 v1 v2 . vk projection index for col2 If Column length = 4Bytes; Page size = 4000 Bytes then index blocking factor = 1000 Given a row number r, page p = r/1000 ; slot s = r%1000. Projection index Vs Plain layout (i.e., no index): Assume a case where we are joining two tables T1 and T2. Assume T1 is spread across X1 blocks and T2 is spread across X2 blocks. Cost of performing the join on Column c1 in T1 and Column c2 in T2 without index = X1 * X2 + X1 disk reads----(1) Cost of performing join with projection index: <blocks spanned by c1> * <blocks spanned by c2> + <blocks spanned by c1> disk reads ----(2) We have: (2) =< (1) Col2 v1 v2 . vk projection index for col2

Bit-Sliced Indexes A bitmap index on the “bit-level representation” of the column values. Consider a SALES table which contains rows for all sales made in last month. We will build a bit sliced index on the “Rupees_amount” Interpret each amount in term of N+1 bits. A function D(n,i) is defined for a row number n in the table: D(n, 0) = 1 if the 1st (LSB) bit for “Rupees_amount” in row number n is on. D(n, 1) = 1 if the 2nd bit for “Rupees_amount” in row number n is on. … D(n, i) = 1 if the ith bit for “Rupees_amount” in row number n is on.

Bit-Sliced Indexes B5 0 1 0 1 0 1 0 1 1 1 1 1 0 0 0 1 0 0 0 1 1 0 0 0 1 1 1 1 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 1 1 bit-slice B4 B3 B2 B1 B0 Col2: 20 52 20 62 10 34 1 49 Bnn: bitmap representing set of rows non null values in the indexed column; We can have a different Existence Bitmap in addition to Bnn

Bit-Sliced Indexes Now for each value of i = 0, 1,…. N. Such that D(n, i) > 0 for some row in the SALES table. We define Bitmap Bi whose nth bit would contain D(n, i) If we consider each bit to be 1 Paisa; Then for just N = 25, we can represent up to Rs 3.35 Lakhs Much more than a typical transaction in a departmental stores.

Question: Compare bit-sliced against a traditional bitmap index for “Rupees_amount”

Using Indexes for Aggregation SALES table: 100 Million rows; Each row 200 bytes; File blocking factor = 20; Size of Page/disk block = 4000 Bytes Query: Select SUM(Rupee_amount) From SALES Where condition Assume the following: The Where condition returns 2,000,000 rows. These are uniformly distributed and have already been determined. Bf denotes a bitmap of this result set. We can assume it to be in Main Mem (about 12 Mb in size)

Direct Access for Aggregation Query Plan 1: Direct access to the table to calculate the SUM Each disk page contains 20 rows Total number of pages in the file = 5,000,000 Result set is about 1/50 of the rows in the table. We will loop through the Bf and retrieve “Rupees_amount” from all the rows spread across 5,000,000 pages. Under uniform distribution, we can expect to get about 0.4 records per page. In other words we will read 2 records for every 5 pages. In worst case, we will end in looking up 2,000,000 pages (each record on a different page).

Projection Index for Aggregation Query Plan 2: Use a Projection on “Rupees_Amount”; Column width = 4Bytes. Each disk page contains 4000/4 = 1000 values of the “Rupees_Amount” Total number of pages in the index file = 100,000 Result set is about 1/50 of the rows in the SALES table. We will loop through the Bf and retrieve “Rupees_amount” from all the rows spread across 100,000 pages. Under uniform distribution, we can expect to get 20 records in each page. Thus we need to read all the 100,000 pages to get our 2Million records marked in Bf

Value-List Index for Aggregation Query Plan 3: Use a Value-List index (Bitmap) on “Rupees_Amount”; IF (COUNT(Bf AND Bnn) == 0) /*All rows in the result set have NULL for rupees*/ Return NULL; SUM = 0.0 For each non-null value v in the bitmap index of “Rupees_Amount”{ Designate the set of rows with the value v as Bv SUM += v* COUNT(Bf AND Bv); } Return SUM

Value-List Index for Aggregation (1/3) IF (COUNT(Bf AND Bnn) == 0) Return NULL; SUM = 0.0 For each non-null value v in the index of “Rupees_Amount”{ Designate the set of rows with the value v as Bv SUM += v* COUNT(Bf AND Bv); } Values in “Rupees_Amount” are counted in Paisa with 20 bits each, we can have about 10,000 distinct values 10,001 COUNTs and 10,001 ANDs If Bv is in RowIDs of 4 bytes each. Under uniform dist  each Bv 10,000 RowIDs; or 1000 per page over 10pages;

Value-List Index for Aggregation (2/3) IF (COUNT(Bf AND Bnn) == 0) Return NULL; SUM = 0.0 For each non-null value v in the index of “Rupees_Amount”{ Designate the set of rows with the value v as Bv SUM += v* COUNT(Bf AND Bv); } If Bv is in RowIDs of 4 bytes each. Under uniform dist  each Bv 10,000 RowIDs; or 1000 per page over 10pages; Algo brings in rowID rep bitmaps of all the 10000 distinct values into main mem. Total cost = 10,000 * 10 disk reads + leaf scan of B+ over 10,000 distinct values of “Rupees_Amount” + comp cost of ANDs and COUNTs If Bf is also in secondary memory then additional 3125 pages for first time

Value-List Index for Aggregation (3/3) IF (COUNT(Bf AND Bnn) == 0) Return NULL; SUM = 0.0 For each non-null value v in the index of “Rupees_Amount”{ Designate the set of rows with the value v as Bv SUM += v* COUNT(Bf AND Bv); } If Bv is in Bitmap form. Each Bv is 100 Million bits; or 12,500,000 bytes; 3125 pages. Our algo brings in bitmaps of all the 10000 distinct values into main mem. Total cost = 10,000 * 3125 disk reads + leaf scan of B+ over 10,000 distinct values of “Rupees_Amount” + comp cost of ANDs and Counts. If Bf is also in secondary memory then additional 3125 pages for first time.

Bit-Sliced Index for Aggregation (1/2) IF (COUNT(Bf AND Bnn) == 0) /*All rows in the result set have NULL for rupees*/ Return NULL; SUM = 0.0 For I = 0 to 19 { SUM += 2^I * COUNT(Bf AND Bi); } Return SUM; Adds bit-­‐slice by bit-­‐slice. First counts the number of 1s in the 2^0 slice then multiplies by 2^0 Then, counts the number of 1s in the 2^1 slice then multiplies by 2^1 …..

Bit-Sliced Index for Aggregation (2/2) IF (COUNT(Bf AND Bnn) == 0) Return NULL; SUM = 0.0 For I = 0 to 19 { SUM += 2^I * COUNT(Bf AND Bv); } Return SUM; 21 ANDs and 21 COUNTS Assuming Bf in main memory: Bv is 100 Million Bits; or 12,500,000 bytes; or 3125 pages Total cost = 21 * 3125 disk reads + comp cost of ANDs and COUNTs.

Using Indexes for Range Predicates SALES table: 100 Million rows; Each row 200 bytes; File blocking factor = 20; Size of Page/disk block = 4000 Bytes Query: Select Target-List From SALES Where C-Range and <Condition> Assume the following: C is a column in SALES <condition> is a general condition based on “equality” C-range is a range-predicate C> c1, C between c1 and c2,…, etc.

Value-List Index for Range Predicates Br = Empty Set For each entry v in the index for C that satisfies the range C { Designate the set of rows with the value v as Bv Br = Br OR Bv } BF = Bf AND Br /* Bf is the result of the <condition> */

Bit-Sliced Index for == Predicate BEQ = BNN For each Bit-Slice Bi for C in decreasing significance{ If bit i is on in the constant c1 BEQ = BEQ AND Bi else BEQ = BEQ AND (NOT Bi) } BEQ = BEQ AND Bf ;//Bf is a bit vector of result set from a // where clause (if any) ==  BEQ Not Null  BNN

Bit-Sliced Index for Range Predicates BGT = BLT = the empty set; BEQ = BNN For each Bit-Slice Bi for C in decreasing significance{ If bit i is on in the constant c1 BLT = BLT OR (BEQ AND NOT(Bi)) BEQ = BEQ AND Bi else BGT = BGT OR (BEQ AND Bi) BEQ = BEQ AND (NOT Bi) } BEQ = BEQ AND Bf; …. (similarly for BGT, BLT, …) BLE = BLT OR BEQ; BGE = BGT OR BEQ >  BGT <  BLT ==  BEQ =<  BLE >=  BGE Not Null  BNN