Estimating the Cost of Operations

From l.q.p. to p.q.p.

Having parsed a query and transformed it into a logical query plan (l.q.p.), we must turn the logical plan into a physical query plan (p.q.p.).
– We do so by considering many different physical plans derived from the logical plan, and evaluating or estimating the cost of each.

[Figure: an initial logical query plan and two candidates for the best logical query plan.]

Estimating Costs

How can we estimate the number of tuples in an intermediate relation? We don’t want to execute the query in order to learn the costs, so we need to estimate them.

Rules for estimation formulas:
1. Give reasonably accurate estimates.
2. Be easy to compute.

Projection

Projection π retains duplicates, so the number of tuples in the result is the same as in the input. Result tuples are usually shorter than the input tuples. The size of a projection is the only one we can compute exactly.

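Selection

Throughout, T(R) is the number of tuples in R and V(R,A) is the number of distinct values of attribute A in R.

Let S = σ_{A=c}(R). We can estimate the size of the result as
T(S) = T(R) / V(R,A)

Let S = σ_{A<c}(R). On average, T(S) would be T(R)/2, but since such queries tend to select a small part of the range, a more realistic estimate is
T(S) = T(R)/3

Let S = σ_{A≠c}(R). Then an estimate is
T(S) = T(R) * (V(R,A) - 1) / V(R,A)
or simply T(S) = T(R).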

Selection (continued)

Let S = σ_{C AND D}(R) = σ_C(σ_D(R)) and U = σ_D(R). First estimate T(U) and then use this to estimate T(S).

Example: S = σ_{a=10 AND b<20}(R), with T(R) = 10,000 and V(R,a) = 50.
T(S) = (1/50) * (1/3) * T(R) ≈ 67

Note: Watch for contradictory selections like σ_{a=10 AND a>20}(R). No tuple can satisfy both conditions, so the result is empty and the formula above does not apply.
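
The selection rules above can be sketched in a few lines of Python. The function names (est_equality, est_range, est_not_equal) are invented for this illustration and are not part of the slides; the code simply reproduces the formulas and the AND example.

```python
def est_equality(t_r: float, v_ra: int) -> float:
    """Estimate T(S) for S = sigma_{A=c}(R) as T(R) / V(R,A)."""
    return t_r / v_ra


def est_range(t_r: float) -> float:
    """Estimate T(S) for S = sigma_{A<c}(R) with the T(R)/3 rule of thumb."""
    return t_r / 3


def est_not_equal(t_r: float, v_ra: int) -> float:
    """Estimate T(S) for S = sigma_{A!=c}(R) as T(R) * (V(R,A)-1) / V(R,A)."""
    return t_r * (v_ra - 1) / v_ra


# AND: treat sigma_{a=10 AND b<20}(R) as a cascade of two selections.
t_r = 10_000
t_u = est_range(t_r)          # U = sigma_{b<20}(R)
t_s = est_equality(t_u, 50)   # S = sigma_{a=10}(U), with V(R,a) = 50
print(round(t_s))             # 67
```

The order of the two selections does not matter here, because each rule only multiplies the running estimate by a constant factor.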

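Selection (continued)

Let S = σ_{C OR D}(R).

Simple estimate: T(S) = T(σ_C(R)) + T(σ_D(R)).
Problem: It’s possible that T(S) > T(R)!

A more accurate estimate. Let:
– T(R) = n,
– m1 = size of the selection on C, and
– m2 = size of the selection on D.
Then T(S) = n * (1 - (1 - m1/n) * (1 - m2/n))

Why? 1 - m1/n is the fraction of tuples that do not satisfy C, and 1 - m2/n is the fraction that do not satisfy D. Assuming the two conditions are independent, their product is the fraction of tuples that satisfy neither, and one minus that product is the fraction that satisfies C OR D.

Example: S = σ_{a=10 OR b<20}(R), with T(R) = 10,000 and V(R,a) = 50.
Simple estimate: T(S) = 200 + 3333 = 3533
More accurate: T(S) ≈ 3466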
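
As a rough check of the OR formulas, the sketch below computes both estimates for the example. The function names are hypothetical. With exact arithmetic the second estimate comes out near 3467; the slide's 3466 presumably results from rounding m2 to 3333 first.

```python
def est_or_simple(m1: float, m2: float) -> float:
    """Simple estimate: add the two selection sizes (may exceed T(R))."""
    return m1 + m2


def est_or(n: float, m1: float, m2: float) -> float:
    """Independence-based estimate: n * (1 - (1 - m1/n) * (1 - m2/n))."""
    return n * (1 - (1 - m1 / n) * (1 - m2 / n))


n = 10_000
m1 = n / 50   # size of sigma_{a=10}(R), using V(R,a) = 50
m2 = n / 3    # size of sigma_{b<20}(R), using the T(R)/3 rule

print(est_or_simple(m1, m2))  # roughly 3533
print(est_or(n, m1, m2))      # roughly 3467

# Unlike the simple sum, the second estimate can never exceed n.
assert est_or(n, 0.9 * n, 0.9 * n) <= n < est_or_simple(0.9 * n, 0.9 * n)
```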

Natural Join

R(X,Y) ⋈ S(Y,Z)

Anything could happen! Extremes:
– No tuples join: T(R ⋈ S) = 0
– All tuples join, i.e. every tuple has the same Y-value (R.Y = S.Y = a). Then T(R ⋈ S) = T(R)*T(S)

Two Assumptions

Containment of value sets
If V(R,Y) ≤ V(S,Y), then every Y-value in R is assumed to occur as a Y-value in S.
When can this happen? For example, when Y is a foreign key in R and a key in S.

Preservation of value sets
If A is an attribute of R but not of S, then it is assumed that V(R ⋈ S, A) = V(R, A).
This may be violated when there are dangling tuples in R.
There is no violation when Y is a foreign key in R and a key in S.

Natural Join size estimation

Let R(X,Y) and S(Y,Z), and suppose Y is a single attribute. What is T(R ⋈ S)?

Let r be a tuple in R and s be a tuple in S. What is the probability that r and s join?

Suppose V(R,Y) ≤ V(S,Y). By containment of value sets, every Y-value in R appears in S. So tuple r is sure to match some tuples of S, but what is the probability that it matches s in particular? It is 1/V(S,Y), so T(S)/V(S,Y) tuples of S are expected to match r. Hence
T(R ⋈ S) = T(R)*T(S)/V(S,Y)

By similar reasoning, for the case V(S,Y) ≤ V(R,Y) we get
T(R ⋈ S) = T(R)*T(S)/V(R,Y)

Summarizing, we have as an estimate:
T(R ⋈ S) = T(R)*T(S)/max{V(R,Y),V(S,Y)}

Example:

R(a,b), T(R)=1000, V(R,b)=20
S(b,c), T(S)=2000, V(S,b)=50, V(S,c)=100
U(c,d), T(U)=5000, V(U,c)=500

Estimate the size of R ⋈ S ⋈ U.

Joining R and S first:
T(R ⋈ S) = 1000*2000 / 50 = 40,000
T((R ⋈ S) ⋈ U) = 40,000 * 5000 / 500 = 400,000
(By preservation of value sets, V(R ⋈ S, c) = V(S,c) = 100, so the divisor is max{100, 500} = 500.)

Joining S and U first:
T(S ⋈ U) = 2000*5000 / 500 = 20,000
T(R ⋈ (S ⋈ U)) = 1000*20,000 / 50 = 400,000

The equality of results is not a coincidence.
Note 1: The estimate of the final result should not depend on the evaluation order.
Note 2: Intermediate results can be of different sizes.

Natural join with multiple join attributes

R(x,y1,y2) ⋈ S(y1,y2,z)

T(R ⋈ S) = T(R)*T(S)/(m1*m2), where
m1 = max{V(R,y1),V(S,y1)}
m2 = max{V(R,y2),V(S,y2)}

Why? Let r be a tuple in R and s be a tuple in S. What is the probability that r and s agree on y1? From the previous reasoning, it is 1/max{V(R,y1),V(S,y1)}. Similarly, the probability that r and s agree on y2 is 1/max{V(R,y2),V(S,y2)}. Assuming that agreements on y1 and y2 are independent, we estimate:
T(R ⋈ S) = T(R)*T(S)/[max{V(R,y1),V(S,y1)} * max{V(R,y2),V(S,y2)}]

Example:
T(R)=1000, V(R,b)=20, V(R,c)=100
T(S)=2000, V(S,d)=50, V(S,e)=50
R(a,b,c) ⋈_{R.b=S.d AND R.c=S.e} S(d,e,f)
T(R ⋈ S) = (1000*2000)/(50*100) = 400
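
The join formula can be written as one small function that handles any number of join attributes. This is only an illustrative sketch: est_join and its argument layout (a list of (V(R,y), V(S,y)) pairs, one per join attribute) are assumptions made here, not something the slides define.

```python
def est_join(t_r: float, t_s: float, v_pairs: list[tuple[int, int]]) -> float:
    """Estimate T(R join S).

    v_pairs holds one (V(R,y), V(S,y)) pair per join attribute y.
    Each attribute divides the product T(R)*T(S) by the larger value count,
    which assumes agreement on different attributes is independent.
    """
    size = t_r * t_s
    for v_r, v_s in v_pairs:
        size /= max(v_r, v_s)
    return size


# Single join attribute Y: T(R)*T(S) / max{V(R,Y), V(S,Y)}
print(est_join(1000, 2000, [(20, 50)]))             # 40000.0

# Two join attributes (R.b = S.d and R.c = S.e), as in the example above
print(est_join(1000, 2000, [(20, 50), (100, 50)]))  # 400.0

# With no join attributes the estimate degenerates to the Cartesian product.
print(est_join(1000, 5000, []))                     # 5000000
```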

Another example (one of the previous)

R(a,b), T(R)=1000, V(R,b)=20
S(b,c), T(S)=2000, V(S,b)=50, V(S,c)=100
U(c,d), T(U)=5000, V(U,c)=500

Estimate the size of R ⋈ S ⋈ U.

Observe that R ⋈ S ⋈ U = (R ⋈ U) ⋈ S. Since R and U have no attributes in common, R ⋈ U is a Cartesian product:
T(R ⋈ U) = 1000*5000 = 5,000,000

Note that the number of b’s in the product is 20 (= V(R,b)), and the number of c’s is 500 (= V(U,c)). The join with S is on both b and c:
T((R ⋈ U) ⋈ S) = 5,000,000 * 2000 / (50 * 500) = 400,000
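
To see that all three join orders give the same estimate, one can propagate the statistics through each intermediate result. The sketch below is an assumption-laden illustration: the Stats class and join function are made up here, only attributes with a known V value are recorded, non-join attributes keep their count (preservation of value sets), and a join attribute keeps the smaller of its two counts (which is what containment of value sets implies).

```python
from dataclasses import dataclass


@dataclass
class Stats:
    t: float            # T(R): estimated number of tuples
    v: dict[str, int]   # V(R, A) for each attribute A with a known count


def join(r: Stats, s: Stats) -> Stats:
    """Estimate the statistics of the natural join of r and s."""
    shared = r.v.keys() & s.v.keys()
    size = r.t * s.t
    for a in shared:
        size /= max(r.v[a], s.v[a])
    v = {}
    for a in r.v.keys() | s.v.keys():
        if a in shared:
            v[a] = min(r.v[a], s.v[a])     # containment of value sets
        else:
            v[a] = r.v.get(a, s.v.get(a))  # preservation of value sets
    return Stats(size, v)


R = Stats(1000, {"b": 20})
S = Stats(2000, {"b": 50, "c": 100})
U = Stats(5000, {"c": 500})

print(join(join(R, S), U).t)  # 400000.0
print(join(R, join(S, U)).t)  # 400000.0
print(join(join(R, U), S).t)  # 400000.0
```

The intermediate sizes differ (40,000, 20,000 and 5,000,000), which is exactly why the join order matters for cost even though the final size estimate does not change.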

Size estimates for other operations

Cartesian product: T(R × S) = T(R) * T(S)

Bag union: sum of the sizes.

Set union: the larger plus half the smaller. Why? Because a set union can be as large as the sum of the sizes or as small as the larger of the two arguments. Something in the middle is suggested.

Intersection: half the smaller. Why? Because an intersection can be as small as 0 or as large as the size of the smaller argument. Something in the middle is suggested.

Difference: T(R - S) = T(R) - 1/2*T(S). Why? Because the result can be between T(R) - T(S) and T(R). Something in the middle is suggested.
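
These "something in the middle" rules translate directly into code. The function names and the sizes T(R) = 1000 and T(S) = 400 below are made up for illustration only.

```python
def est_product(t_r: float, t_s: float) -> float:
    """Cartesian product: T(R) * T(S)."""
    return t_r * t_s


def est_bag_union(t_r: float, t_s: float) -> float:
    """Bag union: sum of the sizes (exact)."""
    return t_r + t_s


def est_set_union(t_r: float, t_s: float) -> float:
    """Set union: the larger plus half the smaller."""
    return max(t_r, t_s) + min(t_r, t_s) / 2


def est_intersection(t_r: float, t_s: float) -> float:
    """Intersection: half the smaller."""
    return min(t_r, t_s) / 2


def est_difference(t_r: float, t_s: float) -> float:
    """Difference R - S: T(R) - T(S)/2."""
    return t_r - t_s / 2


t_r, t_s = 1000, 400                # hypothetical sizes
print(est_set_union(t_r, t_s))      # 1200.0
print(est_intersection(t_r, t_s))   # 200.0
print(est_difference(t_r, t_s))     # 800.0
```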

Size estimates for other operations (continued)

Duplicate elimination δ(R(a1,...,an)): The size ranges from 1 to T(R).
T(δ(R)) = V(R,[a1...an]), if available (but usually it is not).
Otherwise, T(δ(R)) = min[V(R,a1)*...*V(R,an), 1/2*T(R)] is suggested. Why?
– V(R,a1)*...*V(R,an) is the upper limit on the number of distinct tuples that could exist.
– 1/2*T(R) is used because the size can be as small as 1 or as big as T(R).

Grouping and aggregation: similar to δ, but only with respect to the grouping attributes.
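
A minimal sketch of the duplicate-elimination rule follows. The function name est_distinct, the optional v_all argument (standing in for V(R,[a1...an]) when it happens to be known) and the sample numbers are assumptions for illustration.

```python
import math
from typing import Optional


def est_distinct(t_r: float, v_counts: list[int],
                 v_all: Optional[int] = None) -> float:
    """Estimate T(delta(R)).

    v_counts lists V(R,a1), ..., V(R,an). If the combined count
    V(R,[a1...an]) is known, pass it as v_all and it is used directly.
    """
    if v_all is not None:
        return v_all
    return min(math.prod(v_counts), t_r / 2)


# Few possible combinations: the product of value counts is the tighter bound.
print(est_distinct(1000, [10, 20]))    # 200

# Many possible combinations: fall back to half of T(R).
print(est_distinct(1000, [50, 100]))   # 500.0
```

For grouping, the same function could be applied with v_counts restricted to the grouping attributes.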

Computing the statistics

Computation of statistics is triggered automatically or manually. The T(R)’s and V(R,A)’s are just aggregation queries (COUNT queries). However, they are expensive to compute.

Incremental computation of statistics

Maintaining T(R): Add 1 for every insertion and subtract 1 for every deletion.
– What’s the problem?

If there is a B-Tree on any attribute of R, then just keep track of the B-Tree blocks and infer the approximate size of the relation. When does this require effort? Only on B-Tree changes, which are relatively rare compared with the rate of insertions and deletions.

Incremental computation of statistics (continued)

Maintaining V(R,A): If there is an index on attribute A of a relation R, then:
– On insert into R, we must find the A-value for the new tuple in the index anyway, and so we can determine whether there is already such a value for A. If not, increment V(R,A).
– On deletion…

If there isn’t an index on A, the system could in effect create a rudimentary index by keeping a data structure (e.g. a B-Tree) that holds every value of A.

Final option: sampling the relation.
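
The "rudimentary index" idea can be illustrated with an in-memory counter in place of a B-Tree. Everything here is an assumption made for the sketch: the class name ColumnStats, the use of collections.Counter, and in particular the deletion logic, which the slide leaves elided. A real system would keep such a structure on disk and would have to weigh its update cost.

```python
from collections import Counter


class ColumnStats:
    """Keeps T(R) and V(R,A) up to date as tuples are inserted and deleted."""

    def __init__(self) -> None:
        self.t = 0                # T(R)
        self.counts = Counter()   # occurrences of each A-value

    def insert(self, a_value) -> None:
        self.t += 1
        self.counts[a_value] += 1   # a new key means V(R,A) grows by one

    def delete(self, a_value) -> None:
        # Assumed rule: V(R,A) shrinks only when the last copy disappears.
        if self.counts[a_value] == 0:
            raise KeyError(a_value)
        self.t -= 1
        self.counts[a_value] -= 1
        if self.counts[a_value] == 0:
            del self.counts[a_value]

    @property
    def v(self) -> int:
        """V(R,A): number of distinct A-values currently present."""
        return len(self.counts)


stats = ColumnStats()
for value in [10, 20, 20, 30]:
    stats.insert(value)
print(stats.t, stats.v)   # 4 3
stats.delete(20)
print(stats.t, stats.v)   # 3 3
stats.delete(20)
print(stats.t, stats.v)   # 2 2
```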

Histograms

Two common kinds:
– Equal width
– Most frequent values (individual counts for the most frequent values, plus one count for the rest)

Advantage: more accurate estimate of the size of a join.

Example (most-frequent-values histogram)

Estimate U = R(a,b) ⋈ S(b,c)

V(R,b) = 14. Histogram for R.b: 0:150, 1:200, 5:100, rest: 550
V(S,b) = 13. Histogram for S.b: 0:100, 1:80, 2:70, rest: 250

Tuples in U:
– on 0: 100*150 = 15,000
– on 1: 200*80 = 16,000
– on 2: 70 * (550/(14-3)) = 3500
– on 5: 100 * (250/(13-3)) = 2500
– on the 9 other values: 9*(550/11)*(250/10) = 9*1250 = 11,250

Total: T(U) = 15,000 + 16,000 + 3500 + 2500 + 9*1250 = 48,250

Simple estimate (equal occurrence assumption): T(U) = 1000*500/14 = 35,714

We have 9 other values because V(S,b) < V(R,b): S has 13 distinct b-values, of which 4 (namely 0, 1, 2 and 5) are already counted, and by the containment of value sets assumption the 9 remaining values of S will be in R as well.
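
The arithmetic of this example can be reproduced with a short function. This is a sketch under the same assumptions the example makes (values outside the histogram are spread evenly over the "rest" bucket, and every value of the relation with fewer distinct values also occurs in the other one); hist_join and its parameter names are invented here.

```python
def hist_join(hist_r: dict, rest_r: float, v_r: int,
              hist_s: dict, rest_s: float, v_s: int) -> float:
    """Estimate T(R join S) from two most-frequent-values histograms.

    hist_* map a join value to its tuple count, rest_* is the tuple count of
    all other values, and v_* is the number of distinct join values.
    """
    # Average frequency of a value that is not listed individually.
    avg_r = rest_r / (v_r - len(hist_r))
    avg_s = rest_s / (v_s - len(hist_s))

    listed = hist_r.keys() | hist_s.keys()
    total = 0.0
    for value in listed:
        # Use the exact count where known, the average otherwise.
        total += hist_r.get(value, avg_r) * hist_s.get(value, avg_s)

    # Values listed in neither histogram: by containment of value sets, the
    # side with fewer distinct values determines how many of them join.
    other_values = min(v_r, v_s) - len(listed)
    total += other_values * avg_r * avg_s
    return total


r_hist = {0: 150, 1: 200, 5: 100}
s_hist = {0: 100, 1: 80, 2: 70}
print(hist_join(r_hist, 550, 14, s_hist, 250, 13))   # 48250.0
print(round(1000 * 500 / 14))                        # 35714 (simple estimate)
```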

Example (equal width histogram)

Schemas: Jan(day,temp), July(day,temp)

Query: Find the pairs of days in January and July that had the same temperature.
SELECT Jan.day, July.day FROM Jan, July WHERE Jan.temp=July.temp;

The size of the join on each band is T1*T2/Width, where T1 and T2 are the tuple counts of the two relations in that band:
– On band 40-49: 10*5/10 = 5
– On band 50-59: 5*20/10 = 10
The size of the result is thus 5+10 = 15.

Without using the histogram we would estimate the size as 245*245/100 ≈ 600!
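
A sketch of the per-band computation is given below. Only the two overlapping bands mentioned above are filled in; the remaining bands of the original histograms are not available in this transcript, so the dictionaries are deliberately partial. The function name band_join is an assumption.

```python
def band_join(hist_1: dict, hist_2: dict, width: int) -> float:
    """Estimate the join size from two equal-width histograms.

    hist_* map a band label to the tuple count in that band. Only bands
    present in both histograms can contribute, and each contributes
    T1 * T2 / width, assuming values are spread evenly within a band.
    """
    common_bands = hist_1.keys() & hist_2.keys()
    return sum(hist_1[b] * hist_2[b] / width for b in common_bands)


# Partial histograms: only the bands where both months have tuples.
jan = {"40-49": 10, "50-59": 5}
july = {"40-49": 5, "50-59": 20}

print(band_join(jan, july, 10))   # 15.0
print(round(245 * 245 / 100))     # 600, the estimate without histograms
```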