Bitmap Indexes.

Bitmap Indices
- Bitmap indices are a special type of index designed for efficient querying on multiple keys.
- They are very effective on attributes that take on a relatively small number of distinct values.
  - E.g. gender, country, state, ...
  - E.g. income-level (income broken up into a small number of levels such as 0-9999, 10000-19999, 20000-50000, 50000-infinity)
- A bitmap is simply an array of bits.
- For each gender, we associate a bitmap, where each bit represents whether or not the corresponding record has that gender.

Bitmap Indices (Cont.)
- In its simplest form, a bitmap index on an attribute has a bitmap for each value of the attribute.
- Each bitmap has as many bits as there are records.
- In the bitmap for value v, the bit for a record is 1 if the record has the value v for the attribute, and 0 otherwise.

Bitmap Indices (Cont.)
- Bitmap indices are useful for queries on multiple attributes; they are not particularly useful for single-attribute queries.
- Queries are answered using bitmap operations:
  - Intersection (and)
  - Union (or)
  - Complementation (not)
- Intersection and union take two bitmaps of the same size and apply the operation to corresponding bits to get the result bitmap; complementation flips every bit of a single bitmap.
  - E.g. 100110 AND 110011 = 100010
  - 100110 OR 110011 = 110111
  - NOT 100110 = 011001
- Males with income level L1: AND the Males bitmap with the Income Level L1 bitmap.
  - 10010 AND 10100 = 10000
  - Can then retrieve the required tuples.
  - Counting the number of matching tuples is even faster.
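The bitmap operations above can be sketched in a few lines of Python. This is only an illustration: the five-record relation, the column values, and the helper names `build_bitmap_index` and `show` are assumptions chosen so that the Males and Income Level L1 bitmaps match the slide's 10010 and 10100.

```python
def build_bitmap_index(values):
    """Return {value: bitmap}. Record i is the i-th bit counted from the left."""
    n = len(values)
    index = {}
    for i, v in enumerate(values):
        # Set the bit for record i in the bitmap of its attribute value.
        index[v] = index.get(v, 0) | (1 << (n - 1 - i))
    return index


def show(bitmap, n):
    """Render a bitmap as an n-character string of 0s and 1s."""
    return format(bitmap, f"0{n}b")


# Hypothetical relation with five records (one entry per record).
gender = ["m", "f", "f", "m", "f"]
income = ["L1", "L2", "L1", "L4", "L3"]
n = len(gender)

males = build_bitmap_index(gender)["m"]
level1 = build_bitmap_index(income)["L1"]

# Intersection answers "gender = m AND income-level = L1".
both = males & level1
matching_count = bin(both).count("1")  # counting needs no tuple access

print(show(males, n), "AND", show(level1, n), "=", show(both, n))
print("matching records:", matching_count)
```

Using plain integers as bitmaps keeps AND, OR, and counting to single operations, which is the reason these queries are cheap.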

Bitmap Indices (Cont.)
- Bitmap indices are generally very small compared with the relation size.
  - E.g. if a record is 100 bytes (800 bits), the space for a single bitmap is 1/800 of the space used by the relation.
  - If the number of distinct attribute values is 8, the bitmap index is only 1% of the relation size.
- Deletion needs to be handled properly.
  - An existence bitmap notes whether there is a valid record at a record location.
  - It is needed for complementation: not(A=v) is (NOT bitmap-A-v) AND ExistenceBitmap
- Should keep bitmaps for all values, even the null value.
  - To correctly handle SQL null semantics for NOT(A=v): intersect the above result with (NOT bitmap-A-Null)
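A minimal sketch of the complementation rule above, assuming a hypothetical six-slot relation in which one slot holds a deleted record and one record has a null in A. The bit patterns and variable names are made up for illustration.

```python
N = 6                      # number of record slots (assumption)
MASK = (1 << N) - 1        # keeps NOT results within N bits

bitmap_a_v = 0b100010      # records 0 and 4 have A = v
bitmap_a_null = 0b000001   # record 5 has A = null
existence = 0b110111       # slot 2 holds a deleted record


def bit_not(bitmap):
    """Complement restricted to the N record slots."""
    return ~bitmap & MASK


# Step 1: NOT bitmap-A-v, then drop slots with no valid record.
not_v = bit_not(bitmap_a_v) & existence

# Step 2: SQL semantics, a null never satisfies NOT(A = v).
result = not_v & bit_not(bitmap_a_null)

print(format(result, f"0{N}b"))
```

Without the existence bitmap the deleted slot would be reported as a match, and without the null bitmap record 5 would be returned even though its comparison is unknown.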

Index Definition in SQL
- Create a B-tree index (default in most databases):
    create index <index-name> on <relation-name> (<attribute-list>)
    -- create index b-index on branch(branch_name)
    -- create index ba-index on branch(branch_name, account)   -- concatenated index
    -- create index fa-index on branch(func(balance, amount))  -- function index
- Use create unique index to indirectly specify and enforce the condition that the search key is a candidate key.
- Hash indexes: not supported by every database (but used implicitly in joins, ...)
  - PostgreSQL has them but discourages their use due to performance
- Create a bitmap index:
    create bitmap index <index-name> on <relation-name> (<attribute-list>)
  - For attributes with few distinct values
  - Mainly for decision support (query) and not OLTP (bitmap indexes do not support updates efficiently)
- To drop any index:
    drop index <index-name>

Query Processing

General Overview
- Relational model - SQL
- Formal & commercial query languages
- Functional Dependencies
- Normalization
- Physical Design
- Indexing
- Query Processing and Optimization

Review
- Data retrieval at the physical level:
  - Indices: data structures to help with some query evaluation:
    - SELECTION queries (ssn = 123)
    - RANGE queries (100 <= ssn <= 200)
  - Index choices: primary vs secondary, dense vs sparse, ISAM vs B+-tree vs Extendible Hashing vs Linear Hashing
- But what about join queries? Or other queries not directly supported by the indices? How do we evaluate these queries?
- Sometimes indexes are not useful, even for SELECTION queries. When? What decides when to use them?
- A: Query Processing (one of the most complex components of a database system)

QP & O
[Diagram: SQL Query -> Query Processor -> Data (result of the query)]

QP & O
[Diagram: inside the Query Processor: SQL Query -> Parser -> Algebraic Expression -> Query Optimizer -> Execution plan -> Evaluator -> Data (result of the query)]

QP & O
[Diagram: inside the Query Optimizer: Algebraic Representation -> Query Rewriter -> Algebraic Representation -> Plan Generator (using Data Stats) -> Query Execution Plan]

Query Processing and Optimization
Parser / translator (1st step)
- Input: SQL query (or OQL, ...)
- Output: algebraic representation of the query (relational algebra expression)
- E.g. SELECT balance FROM account WHERE balance < 2500
  - π_{balance}(σ_{balance < 2500}(account))
  - or, as an operator tree from root to leaf: π_{balance}, then σ_{balance < 2500}, then account

QP & O
Plan Evaluator (last step)
- Input: Query execution plan
- Output: Data (query results)
- Query execution plan: algorithms of operators that read from disk:
  - Sequential scan
  - Index scan
  - Merge-sort join
  - Nested loop join
  - ...

QP & O
Query Rewriting
- Input: Algebraic representation of query
- Output: Algebraic representation of query
- Idea: apply heuristics to generate an equivalent expression that is likely to lead to a better plan
- E.g.: σ_{amount > 2500}(borrower ⋈ loan) is rewritten to borrower ⋈ (σ_{amount > 2500}(loan))
- Why is the 2nd better than the 1st?
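One way to see the effect of this rewrite is to run both forms on a toy instance and count how many tuple pairs the join has to consider. The relations, their contents, and the `join` helper below are hypothetical; only the two expression shapes come from the slide.

```python
# Hypothetical instances: borrower(cname, loan_no), loan(loan_no, amount).
borrower = [("Ann", 1), ("Bob", 2), ("Cy", 3), ("Dee", 4)]
loan = [(1, 1000), (2, 3000), (3, 5000), (4, 2000)]


def join(left, right):
    """Join on loan_no; return (rows, number of tuple pairs considered)."""
    rows = [(c, l, a) for (c, l) in left for (l2, a) in right if l == l2]
    return rows, len(left) * len(right)


# Plan 1: join first, then select amount > 2500.
joined, pairs1 = join(borrower, loan)
plan1 = [row for row in joined if row[2] > 2500]

# Plan 2: select on loan first, then join the smaller input.
big_loans = [row for row in loan if row[1] > 2500]
plan2, pairs2 = join(borrower, big_loans)

print(plan1 == plan2, pairs1, pairs2)
```

Both plans return the same rows, but the second joins against a smaller input, so the join considers fewer pairs and produces no intermediate tuples that are later thrown away.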

QP & O
Plan Generator
- Input: Algebraic representation of query
- Output: Query execution plan
- Idea:
  - Generate alternative plans for evaluating a query, e.g. σ_{amount > 2500} by sequential scan or by index scan
  - Estimate the cost of each plan
  - Choose the plan with the lowest cost

QP & O
Goal: generate the plan with minimum cost (i.e., as fast as possible)
Cost factors:
- CPU time (trivial compared to disk time)
- Disk access time: the main cost in most DBs
- Network latency: the main concern in distributed DBs
Our metric: count disk accesses

Cost Model
- How do we predict the cost of a plan? Ans: a cost model.
- For each plan operator and each algorithm we have a cost formula.
- Inputs to the formulas depend on relations and attributes.
- The database maintains statistics about relations for this (metadata).

Metadata
Given a relation r, the DBMS likely maintains the following metadata:
- nr: size (# of tuples)
- br: size (# of blocks)
- fr: block size (# of tuples per block); typically br = ⌈nr / fr⌉
- sr: tuple size (in bytes)
- V(att, r): attribute variance (for each attribute att of r, the # of different values)
- SC(att, r): selection cardinality (for each attribute att in r, the expected size of a selection σ_{att = K}(r))

Example
- naccount = 6
- saccount = 33 bytes
- faccount = ⌊4K / 33⌋
- V(balance, account) = 3
- V(acct_no, account) = 6
- SC(balance, account) = 2 (= nr / V(att, r))
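The statistics in this example can be recomputed from a concrete relation. The six account tuples below are hypothetical (the slide's table did not survive), chosen so that balance has three distinct values; "4K" is read as 4096 bytes, which is also an assumption.

```python
import math

BLOCK_BYTES = 4096   # "4K" block, assumed to mean 4096 bytes
TUPLE_BYTES = 33     # saccount from the example

# Hypothetical tuples: (acct_no, bname, balance)
account = [
    ("A-101", "Downtown", 500),
    ("A-102", "Perry", 400),
    ("A-201", "Brighton", 900),
    ("A-215", "Mianus", 400),
    ("A-217", "Brighton", 500),
    ("A-222", "Redwood", 900),
]


def distinct_values(relation, column):
    """V(att, r): number of different values in one column."""
    return len({row[column] for row in relation})


n_account = len(account)                       # nr
f_account = BLOCK_BYTES // TUPLE_BYTES         # fr, whole tuples per block
b_account = math.ceil(n_account / f_account)   # br = ceil(nr / fr)
v_balance = distinct_values(account, 2)
v_acct_no = distinct_values(account, 0)
sc_balance = n_account / v_balance             # SC = nr / V(att, r)

print(n_account, f_account, b_account, v_balance, v_acct_no, sc_balance)
```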

Some typical plans and their costs
Query: σ_{att = K}(r)
- A1 (linear search). Scan each file block and test all records to see whether they satisfy the selection condition.
  - Cost estimate (number of disk blocks scanned) = br
  - br denotes the number of blocks containing records from relation r
  - If the selection is on a key attribute, cost = br / 2 (stop on finding the record, which is on average in the middle of the file)
  - Linear search can be applied regardless of the selection condition, the ordering of records in the file, or the availability of indices

Selection Operation (Cont.)
Query: σ_{att = K}(r)
- A2 (binary search). Applicable if the selection is an equality comparison on the attribute on which the file is ordered.
  - Requires that the blocks of a relation are stored contiguously
  - Cost estimate: ⌈log2(br)⌉, the cost of locating the first tuple by a binary search on the blocks, plus the number of blocks containing records that satisfy the selection condition
  - EA2 = ⌈log2(br)⌉ + ⌈sc(att, r) / fr⌉ - 1
  - What is the cost if att is a key? EA2 = ⌈log2(br)⌉

Example
Query: σ_{bname = "Perry"}(account)
- V(bname, account) = 50
- naccount = 10K
- faccount = 20 tuples/block
- Primary index on bname
- Key: acct_no
Cost estimates:
- A1: EA1 = ⌈naccount / faccount⌉ = 500 I/Os
- A2: EA2 = ⌈log2(br)⌉ + ⌈sc(att, r) / fr⌉ - 1 = 9 + 9 = 18 I/Os
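The arithmetic behind these two estimates can be checked with a short script. The variable names are illustrative, and 10K is read as 10,000 tuples.

```python
import math

n_account = 10_000        # naccount
f_account = 20            # tuples per block
v_bname = 50              # V(bname, account)

b_account = math.ceil(n_account / f_account)   # br
sc_bname = n_account // v_bname                # SC(bname, account) = nr / V

# A1: linear search reads every block.
cost_a1 = b_account

# A2: binary search for the first match, then the remaining matching blocks.
locate = math.ceil(math.log2(b_account))
matching_blocks = math.ceil(sc_bname / f_account)
cost_a2 = locate + matching_blocks - 1

print(cost_a1, locate, matching_blocks - 1, cost_a2)
```

The `- 1` appears because the block found by the binary search already holds the first matching tuples, so it is not read again.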

More Plans for selection
- What if there is an index on att?
- We need metadata on the size of the index (i). The DBMS keeps track of:
  - Index height: HTi
  - Index "fan out": fi, the average # of children per node (not the same as the order)
  - Index leaf nodes: LBi
- Note: HTi ~ log_fi(LBi) + 1

More Plans for selection
Query: σ_{att = K}(r)
A3: Index scan, primary index
- What: follow the primary index, searching for key K
- Prereq: primary index on att, i
- Cost:
  - EA3 = HTi + 1, if att is a candidate key
  - EA3 = HTi + SC(att, r) / fr, if not

A4: Index scan, secondary index
- What: follow the corresponding index, searching for key K
- Prereq: secondary index on att, i
- Cost:
  - If att is not a key: EA4 = HTi + 1 + SC(att, r), i.e. index block reads + one bucket read + file block reads (in the worst case, each tuple is on a different block)
  - Else, if att is a key: EA4 = HTi + 1
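The primary-index and secondary-index formulas for equality selections can be wrapped in two small functions. The function names and the index height of 3 are assumptions for illustration; the other numbers reuse the "Perry" example (SC = 200, 20 tuples per block). The primary-index block count is rounded up to whole blocks, which the slide formula leaves implicit.

```python
import math


def cost_primary_index(ht, sc, f, is_key):
    """Equality selection through a primary index of height ht."""
    if is_key:
        return ht + 1                 # index levels + one file block
    return ht + math.ceil(sc / f)     # matching tuples are stored contiguously


def cost_secondary_index(ht, sc, is_key):
    """Equality selection through a secondary index of height ht."""
    if is_key:
        return ht + 1                 # index levels + one file block
    return ht + 1 + sc                # + bucket read + one block per tuple (worst case)


HT = 3   # assumed index height
primary_cost = cost_primary_index(HT, sc=200, f=20, is_key=False)
secondary_cost = cost_secondary_index(HT, sc=200, is_key=False)
key_cost = cost_secondary_index(HT, sc=1, is_key=True)

print(primary_cost, secondary_cost, key_cost)
```

With these inputs the secondary index costs far more than the primary index for a non-key attribute, because each matching tuple may sit on a different block.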

Cardinalities
- Cardinality: the size (number of tuples) of the query result
- Why do we care? Ans: the cost of every plan depends on nr, e.g.
  - Linear scan: br = ⌈nr / fr⌉
  - Primary index: HTi + 1 ~ log_fi(LBi) + 2 ≤ log_fi(nr / fr) + 2
- But what if r is the result of another query? We must know the size of query results as well as the cost.
- Size of σ_{att = K}(r)? SC(att, r)

Selections Involving Comparisons
Query: σ_{Att op K}(r), where op is a comparison (≤ or ≥)
A5 (primary index, comparison). (Relation is sorted on Att)
- For σ_{Att ≥ v}(r), use the index to find the first tuple ≥ v and scan the relation sequentially from there
- For σ_{Att ≤ v}(r), just scan the relation sequentially until the first tuple > v; do not use the index
- Cost: EA5 = HTi + c / fr (where c is the cardinality of the result)

Query: Att  K (r ) Cardinality: More metadata on r are needed: min (att, r) : minimum value of att in r max(att, r): maximum value of att in r Then the selectivity of Att = K (r ) is estimated as: (or nr /2 if min, max unknown) Intuition: assume uniform distribution of values between min and max min(attr, r) K max(attr, r)

Plan generation: Range Queries
Query: σ_{Att op K}(r), where op is a comparison
A6 (secondary index, comparison), if att is a candidate key:
- Cost: EA6 = HTi - 1 + # of leaf nodes to read + # of file blocks to read
  = HTi - 1 + LBi × (c / nr) + c

Plan generation: Range Queries
A6 (secondary index, range query), if att is NOT a candidate key:
- Cost: EA6 = HTi - 1 + # of leaf nodes to read + # of file blocks to read + # of buckets to read
  = HTi - 1 + LBi × (c / nr) + c + x (where x is the # of buckets to read)
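The two secondary-index range formulas differ only in the bucket term, so one function can cover both. All input numbers below are hypothetical, and rounding the leaf count up to whole nodes is an assumption that the slide formula does not state.

```python
import math


def cost_secondary_range(ht, lb, n_r, c, buckets=0):
    """Range query through a secondary index.

    ht: index height, lb: number of leaf nodes, n_r: tuples in the relation,
    c: result cardinality, buckets: bucket reads (0 if att is a candidate key).
    """
    leaf_reads = math.ceil(lb * c / n_r)   # fraction of the leaf level scanned
    file_reads = c                         # worst case: one block per tuple
    return ht - 1 + leaf_reads + file_reads + buckets


# Hypothetical index: height 3, 500 leaves, 10,000 tuples, 200 results.
key_cost = cost_secondary_range(3, 500, 10_000, 200)
non_key_cost = cost_secondary_range(3, 500, 10_000, 200, buckets=5)

print(key_cost, non_key_cost)
```

In both cases the file-block term c dominates, which is why a secondary index only pays off for selective ranges.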

Join Operation
Size and plans for the join operation
Running example: depositor ⋈ customer
- depositor(cname, acct_no)
- customer(cname, cstreet, ccity)
Metadata:
- ncustomer = 10,000; ndepositor = 5000
- fcustomer = 25; fdepositor = 50
- bcustomer = 400; bdepositor = 100
- V(cname, depositor) = 2500 (each customer has on average 2 accts)
- cname in depositor is a foreign key for customer

Cardinality of Join Queries
What is the cardinality (number of tuples) of the join?
- E1: Cartesian product: ncustomer × ndepositor = 50,000,000
- E2: Attribute cname is common to both relations, with 2500 different cnames in depositor
  - Size: ncustomer × (avg # of tuples in depositor with the same cname)
    = ncustomer × (ndepositor / V(cname, depositor))
    = 10,000 × (5000 / 2500)
    = 20,000

Cardinality of Join Queries
- E3: cname in depositor is a foreign key referencing customer
  - Size: ndepositor × (avg # of tuples in customer with the same cname)
    = ndepositor × 1
    = 5000
- Note: if cname is a key for customer but NOT a foreign key for depositor, then 5000 is an UPPER BOUND
  - Some customer names in depositor may not match any customer in customer
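The three estimates E1 to E3 for depositor ⋈ customer follow directly from the running-example metadata, as this short check shows. The variable names are illustrative.

```python
n_customer = 10_000
n_depositor = 5_000
v_cname_depositor = 2_500   # V(cname, depositor)

# E1: no shared attribute would give the Cartesian product.
e1 = n_customer * n_depositor

# E2: each customer tuple matches the average group of depositor tuples
# that share its cname.
e2 = n_customer * (n_depositor // v_cname_depositor)

# E3: foreign key, so every depositor tuple matches exactly one customer.
e3 = n_depositor * 1

print(e1, e2, e3)
```

E2 overestimates here because it ignores the key constraint on customer; E3 uses that extra knowledge and gives the exact size.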

Cardinality of Joins in general
Assume join: R ⋈ S
- If R, S have no common attributes: nr × ns
- If R, S have attribute A in common: min(nr × ns / V(A, s), nr × ns / V(A, r)) (take min)
- If R, S have attribute A in common and:
  - A is a candidate key for R: ≤ ns
  - A is a candidate key in R and a candidate key in S: ≤ min(nr, ns)
  - A is a key for R, foreign key for S: = ns

Nested-Loop Join
Algorithm 1: Nested Loop Join
Idea, for query R ⋈ S:
- [Diagram: blocks of R holding tuples t1, t2, t3; blocks of S holding tuples u1, u2, u3; matches go to the results]
- Compare: (t1, u1), (t1, u2), (t1, u3), ...
- Then: GET NEXT BLOCK OF S
- Repeat: for EVERY tuple of R

Nested-Loop Join
Algorithm 1: Nested Loop Join
Query: R ⋈ S

    for each tuple tr in R do
        for each tuple us in S do
            test pair (tr, us) to see if they satisfy the join condition
            if they do (a "match"), add tr • us to the result

R is called the outer relation and S the inner relation of the join.
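The pseudocode above translates almost line for line into Python. This is an in-memory sketch only: relations are lists of tuples, the join condition is passed in as a function, and the sample depositor and customer tuples are made up.

```python
def nested_loop_join(R, S, match):
    """Tuple-at-a-time nested loop join; R is the outer, S the inner relation."""
    result = []
    for tr in R:                 # every tuple of the outer relation
        for us in S:             # is compared with every tuple of the inner one
            if match(tr, us):
                result.append(tr + us)   # tr • us: concatenate the two tuples
    return result


# Hypothetical data: depositor(cname, acct_no), customer(cname, cstreet, ccity).
depositor = [("Ann", "A-1"), ("Ann", "A-2"), ("Bob", "A-3")]
customer = [("Ann", "Main", "Rye"), ("Bob", "Oak", "Troy")]

rows = nested_loop_join(depositor, customer, lambda d, c: d[0] == c[0])
for row in rows:
    print(row)
```

Because the condition is an arbitrary function, the same loop handles any join predicate, which is what makes nested loop join the fallback when no index or ordering can be used.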

Nested-Loop Join (Cont.)
Cost:
- Worst case, if the buffer size is 3 blocks: br + nr × bs disk accesses.
- Best case, if the buffer is big enough for the entire INNER relation + 2 blocks: br + bs disk accesses.
- Assuming worst-case memory availability, the cost estimate is
  - 5000 × 400 + 100 = 2,000,100 disk accesses with depositor as the outer relation, and
  - 10000 × 100 + 400 = 1,000,400 disk accesses with customer as the outer relation.
- If the smaller relation (depositor) fits entirely in memory, the cost estimate will be 500 disk accesses (actually we need 2 more blocks of memory).
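The figures above can be reproduced from the running-example metadata. The function names are illustrative.

```python
def nlj_worst_cost(b_outer, n_outer, b_inner):
    """Worst case: scan the inner relation once per outer tuple."""
    return b_outer + n_outer * b_inner


def nlj_best_cost(b_outer, b_inner):
    """Best case: the inner relation stays in memory, each block is read once."""
    return b_outer + b_inner


# Running example: customer has 10,000 tuples in 400 blocks,
# depositor has 5,000 tuples in 100 blocks.
depositor_outer = nlj_worst_cost(b_outer=100, n_outer=5_000, b_inner=400)
customer_outer = nlj_worst_cost(b_outer=400, n_outer=10_000, b_inner=100)
in_memory = nlj_best_cost(b_outer=400, b_inner=100)

print(depositor_outer, customer_outer, in_memory)
```

In the worst case the cost is dominated by the product of outer tuples and inner blocks, so the choice of outer relation changes the estimate by a factor of about two here.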

Join Algorithms
Algorithm 2: Block Nested Loop Join
Idea, for query R ⋈ S:
- [Diagram: blocks of R holding tuples t1, t2, t3; blocks of S holding tuples u1, u2, u3; matches go to the results]
- Compare:
  - (t1, u1), (t1, u2), (t1, u3)
  - (t2, u1), (t2, u2), (t2, u3)
  - (t3, u1), (t3, u2), (t3, u3)
- Then: GET NEXT BLOCK OF S
- Repeat: for EVERY BLOCK of R

Block Nested-Loop Join

    for each block BR of R do
        for each block BS of S do
            for each tuple tr in BR do
                for each tuple us in BS do
                begin
                    check if (tr, us) satisfy the join condition
                    if they do (a "match"), add tr • us to the result
                end
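A small in-memory simulation of the block nested-loop pseudocode above, which also counts simulated block reads. Blocks are modelled as list slices; the helper names, the block sizes, and the sample tuples are assumptions.

```python
def blocks(relation, tuples_per_block):
    """Split a list of tuples into consecutive blocks."""
    for i in range(0, len(relation), tuples_per_block):
        yield relation[i:i + tuples_per_block]


def block_nested_loop_join(R, S, match, f_r, f_s):
    """Return (joined tuples, number of simulated block reads)."""
    result = []
    block_reads = 0
    for BR in blocks(R, f_r):
        block_reads += 1                  # read one outer block
        for BS in blocks(S, f_s):
            block_reads += 1              # re-read each inner block per outer block
            for tr in BR:
                for us in BS:
                    if match(tr, us):
                        result.append(tr + us)
    return result, block_reads


# Hypothetical relations: 4 outer tuples in 2 blocks, 6 inner tuples in 2 blocks.
R = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
S = [(1, "x"), (2, "y"), (2, "z"), (5, "w"), (4, "v"), (3, "u")]

rows, reads = block_nested_loop_join(R, S, lambda t, u: t[0] == u[0], f_r=2, f_s=3)
print(len(rows), reads)
```

The read count equals the number of outer blocks times the number of inner blocks, plus the outer blocks themselves, since the inner relation is scanned once per outer block rather than once per outer tuple.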

Block Nested-Loop Join (Cont.)
Cost:
- Worst case estimate: br × bs + br block accesses.
- Best case: br + bs block accesses. Same as nested loop.
Improvements to nested loop and block nested loop algorithms for a buffer with M blocks:
- In block nested-loop, use M - 2 disk blocks as the blocking unit for the outer relation, where M = memory size in blocks; use the remaining two blocks to buffer the inner relation and the output
  - Cost = ⌈br / (M-2)⌉ × bs + br
- If the equi-join attribute forms a key on the inner relation, stop the inner loop on the first match
- Scan the inner loop forward and backward alternately, to make use of the blocks remaining in the buffer (with LRU replacement)
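Plugging the running-example block counts (bcustomer = 400, bdepositor = 100) into the block nested-loop cost formula gives a feel for how much the buffer size M matters. The function name and the chosen values of M are assumptions.

```python
import math


def bnl_cost(b_outer, b_inner, M=3):
    """Block nested-loop cost with M buffer blocks (M - 2 hold the outer relation)."""
    outer_chunks = math.ceil(b_outer / (M - 2))
    return outer_chunks * b_inner + b_outer


# M = 3 is the worst case: one outer block at a time.
depositor_outer = bnl_cost(b_outer=100, b_inner=400, M=3)
customer_outer = bnl_cost(b_outer=400, b_inner=100, M=3)

# A larger buffer cuts the number of passes over the inner relation.
with_52_blocks = bnl_cost(b_outer=100, b_inner=400, M=52)

# Once the whole outer relation fits (M - 2 >= b_outer), cost is br + bs.
fits_in_memory = bnl_cost(b_outer=100, b_inner=400, M=102)

print(depositor_outer, customer_outer, with_52_blocks, fits_in_memory)
```

Compared with the tuple-at-a-time nested loop, even the worst case is far cheaper, and here the smaller relation is the better choice of outer relation.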