The Query Compiler. 16.1 Parsing and Preprocessing. Meghna Jain (205). Dr. T. Y. Lin
Presentation Outline 16.1 Parsing and Preprocessing 16.1.1 Syntax Analysis and Parse Tree 16.1.2 A Grammar for Simple Subset of SQL 16.1.3 The Preprocessor 16.1.4 Processing Queries Involving Views
Query compilation is divided into three steps: 1. Parsing: parse the SQL query into a parse tree. 2. Logical query plan: transform the parse tree into an expression tree of relational algebra. 3. Physical query plan: transform the logical query plan into a physical query plan, which specifies the operations performed, the order of operations, the algorithm used for each operation, and the way in which stored data is obtained and passed from one operation to another.
From a query to a logical query plan: Parser → Preprocessor → Logical query plan generator → Query rewrite → Preferred logical query plan.
Syntax Analysis and Parse Tree. The parser takes the SQL query and converts it to a parse tree. Nodes of the parse tree: 1. Atoms: lexical elements such as keywords, constants, parentheses, operators, and other schema elements. 2. Syntactic categories: subparts that play a similar role in a query, such as <Query> and <Condition>.
Grammar for Simple Subset of SQL
<Query> ::= <SFW>
<Query> ::= (<Query>)
<SFW> ::= SELECT <SelList> FROM <FromList> WHERE <Condition>
<SelList> ::= <Attribute>, <SelList>
<SelList> ::= <Attribute>
<FromList> ::= <Relation>, <FromList>
<FromList> ::= <Relation>
<Condition> ::= <Condition> AND <Condition>
<Condition> ::= <Tuple> IN <Query>
<Condition> ::= <Attribute> = <Attribute>
<Condition> ::= <Attribute> LIKE <Pattern>
<Tuple> ::= <Attribute>
Atoms (constants), <syntactic categories> (variables), ::= (can be expressed/defined as).
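To make the grammar concrete, here is a minimal sketch (not from the slides) of how the <SelList> production could be handled by a recursive-descent parser in Python; the token list and the tuple-based node format are assumptions for illustration only.

# Sketch: parse <SelList> ::= <Attribute> | <Attribute>, <SelList>
def parse_sel_list(tokens, pos):
    """Return (parse_subtree, next_position); tokens is a list of lexical atoms."""
    attr = tokens[pos]                       # an <Attribute> atom
    if pos + 1 < len(tokens) and tokens[pos + 1] == ",":
        rest, nxt = parse_sel_list(tokens, pos + 2)
        return ("<SelList>", attr, ",", rest), nxt
    return ("<SelList>", attr), pos + 1

# Example: the select-list of "SELECT title, year ..."
tree, _ = parse_sel_list(["title", ",", "year"], 0)
print(tree)   # ('<SelList>', 'title', ',', ('<SelList>', 'year'))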
Query and Parse Tree. StarsIn(title,year,starName) MovieStar(name,address,gender,birthdate) Query: Give titles of movies that have at least one star born in 1960. SELECT title FROM StarsIn WHERE starName IN ( SELECT name FROM MovieStar WHERE birthdate LIKE '%1960%' );
An equivalent query: SELECT title FROM StarsIn, MovieStar WHERE starName = name AND birthdate LIKE '%1960%' ;
[Parse tree for the equivalent query: <Query> → <SFW>, with SELECT <SelList> yielding the attribute title, FROM <FromList> yielding the relations StarsIn and MovieStar, and WHERE <Condition> being the AND of <Attribute> starName = <Attribute> name and <Attribute> birthdate LIKE <Pattern> '%1960'.]
The Preprocessor. Functions of the preprocessor: If a relation used in the query is a virtual view, then each use of this relation in the from-list must be replaced by a parse tree that describes the view. It is also responsible for semantic checking: 1. Checks relation uses: every relation mentioned in the FROM-clause must be a relation or a view in the current schema. 2. Checks and resolves attribute uses: every attribute mentioned in the SELECT- or WHERE-clause must be an attribute of some relation in the current scope. 3. Checks types: all attributes must be of a type appropriate to their uses.
StarsIn(title,year,starName) MovieStar(name,address,gender,birthdate) Query: Give titles of movies that have at least one star born in 1960 SELECT title FROM StarsIn WHERE starName IN ( SELECT name FROM MovieStar WHERE birthdate LIKE '%1960%' );
Preprocessing Queries Involving Views. When an operand in a query is a virtual view, the preprocessor needs to replace the operand by a piece of parse tree that represents how the view is constructed from base tables. Base table: Movies(title, year, length, genre, studioName, producerC#) View definition: CREATE VIEW ParamountMovies AS SELECT title, year FROM Movies WHERE studioName = 'Paramount'; Example based on the view: SELECT title FROM ParamountMovies WHERE year = 1979;
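After the view reference is expanded, the query behaves as if it were posed directly on the base table. As a sketch of the effect (written in relational algebra rather than as a parse tree), the expanded query is equivalent to πtitle(σyear=1979(πtitle,year(σstudioName='Paramount'(Movies)))), which a later rewrite step can simplify, for example to πtitle(σyear=1979 AND studioName='Paramount'(Movies)).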
Query Compiler [Figure 16.16 & Figure 16.18]. CS 257 [section 1]. Aakanksha Pendse. Roll number: 127
Problem Statement: to verify the claim given in Figure 16.18 using Figure 16.16.
What is figure 16.16? The tree in this diagram represents the following query. “find movies with movie-stars born in 1960” Database tables with its attributes are: StarsIn(movieTitle, movieYear, starName) MovieStar(name, address, gender, birthdate) SQL Query: SELECT movieTitle FROM StarsIn WHERE starName IN ( SELECT name FROM MovieStar WHERE birthdate LIKE ‘%1960’ );
This diagram (Fig. 16.16) is derived from Figure 16.14. This is done by applying the rule which handles the two-argument selection with a condition involving IN. In this query the sub-query is uncorrelated.
Figure 16.14
Figure 16.16 : applying the rule for IN condition
Rule
What is Figure 16.18? The tree in this diagram represents the following query: "Find the movies where the average age of the stars was at most 40 when the movie was made." SQL Query: SELECT DISTINCT m1.movieTitle, m1.movieYear FROM StarsIn m1 WHERE m1.movieYear - 40 <= (SELECT AVG(birthdate) FROM StarsIn m2, MovieStar s WHERE m2.starName = s.name AND m1.movieTitle = m2.movieTitle AND m1.movieYear = m2.movieYear); This is a correlated sub-query.
Rules for translating a correlated sub-query to relational algebra. Correlated sub-queries contain unknown values defined outside themselves. For this reason, correlated sub-queries cannot be translated in isolation. These sub-queries need to be translated so that they produce a relation in which certain extra attributes appear; these attributes must later be compared with the externally defined attributes. Conditions that relate attributes from the sub-query to attributes outside are then applied to this relation, and the extra attributes that are no longer necessary can be projected out. In this strategy, care must be taken not to form duplicate tuples in the end result.
Steps of formation of the tree in Figure 16.18: The tree in Figure 16.18 is formed by parsing the query and partially translating it to relational algebra. The WHERE-clause of the sub-query is split into two; one part is used to convert the product of relations into an equijoin. The aliases m1, m2 and s are made the nodes of the tree.
Figure 16.18 : partially transformed parse tree
16.2 ALGEBRAIC LAWS FOR IMPROVING QUERY PLANS Ramya Karri ID: 206
Optimizing the Logical Query Plan. The translation rules converting a parse tree to a logical query tree do not always produce the best logical query tree. It is often possible to optimize the logical query tree by applying relational algebra laws to convert the original tree into a more efficient one. Optimizing a logical query tree using relational algebra laws is called heuristic optimization.
Relational Algebra Laws These laws often involve the properties of: commutativity - operator can be applied to operands independent of order. E.g. A + B = B + A - The “+” operator is commutative. associativity - operator is independent of operand grouping. E.g. A + (B + C) = (A + B) + C - The “+” operator is associative.
Associative and Commutative Operators. The relational algebra operators of cross-product (×), join (⋈), union (∪), and intersection (∩) are all associative and commutative. Commutative: R × S = S × R; R ⋈ S = S ⋈ R; R ∪ S = S ∪ R; R ∩ S = S ∩ R. Associative: (R × S) × T = R × (S × T); (R ⋈ S) ⋈ T = R ⋈ (S ⋈ T); (R ∪ S) ∪ T = R ∪ (S ∪ T); (R ∩ S) ∩ T = R ∩ (S ∩ T).
Laws Involving Selection. Complex selections involving AND or OR can be broken into two or more selections (splitting laws): σC1 AND C2(R) = σC1(σC2(R)); σC1 OR C2(R) = σC1(R) ∪S σC2(R), where ∪S denotes set union (the law holds if R is a set). Example: R = {a,a,b,b,b,c}, p1 satisfied by a,b, p2 satisfied by b,c. σp1 OR p2(R) = {a,a,b,b,b,c}; σp1(R) = {a,a,b,b,b}; σp2(R) = {b,b,b,c}; σp1(R) ∪ σp2(R) = {a,a,b,b,b,c}.
Laws Involving Selection (contd.). Selection is pushed through both arguments for union: σC(R ∪ S) = σC(R) ∪ σC(S). Selection is pushed to the first argument and optionally the second for difference: σC(R - S) = σC(R) - S; σC(R - S) = σC(R) - σC(S).
Laws Involving Selection (Contd..) All other operators require selection to be pushed to only one of the arguments. For joins, may not be able to push selection to both if argument does not have attributes selection requires. σC(R × S) = σC(R) × S σC(R ∩ S) = σC(R) ∩ S σC(R ⋈ S) = σC(R) ⋈ S σC(R ⋈D S) = σC(R) ⋈D S
Laws Involving Selection (contd.). Example: consider relations R(a,b) and S(b,c) and the expression σ(a=1 OR a=3) AND b<c(R ⋈ S) = σa=1 OR a=3(σb<c(R ⋈ S)) = σa=1 OR a=3(R ⋈ σb<c(S)) = σa=1 OR a=3(R) ⋈ σb<c(S).
Laws Involving Projection. Like selections, it is also possible to push projections down the logical query tree. However, the performance gain is usually smaller than for selections, because projections only reduce the number of attributes (the length of tuples) instead of reducing the number of tuples.
Laws Involving Projection. Laws for pushing projections through joins: πL(R × S) = πL(πM(R) × πN(S)); πL(R ⋈ S) = πL(πM(R) ⋈ πN(S)); πL(R ⋈D S) = πL(πM(R) ⋈D πN(S)), where M and N are the attributes of L coming from R and S respectively, together with the join attributes.
Laws Involving Projection. Laws for pushing projections through set operations. Projection can be performed entirely before union: πL(R ∪B S) = πL(R) ∪B πL(S). Projection can be pushed below selection as long as we also keep all attributes needed for the selection (M = L ∪ attributes mentioned in C): πL(σC(R)) = πL(σC(πM(R))).
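As a small illustration (not from the slides), take R(a, b, c) and the expression πa(σb<5(R)). Here L = {a} and C mentions b, so M = {a, b}, and the law gives πa(σb<5(R)) = πa(σb<5(πa,b(R))); the inner projection lets the attribute c be dropped before the selection is evaluated.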
Laws Involving Join We have previously seen these important rules about joins: Joins are commutative and associative. Selection can be distributed into joins. Projection can be distributed into joins.
Laws Involving Duplicate Elimination. The duplicate elimination operator (δ) can be pushed through many operators, but not through bag union. For example, if R has two copies of a tuple t and S has one copy of t, then δ(R ∪ S) has one copy of t, but δ(R) ∪ δ(S) has two copies of t.
Laws Involving Duplicate Elimination. Laws for pushing the duplicate elimination operator (δ): δ(R × S) = δ(R) × δ(S); δ(R ⋈ S) = δ(R) ⋈ δ(S); δ(R ⋈D S) = δ(R) ⋈D δ(S); δ(σC(R)) = σC(δ(R)).
Laws Involving Duplicate Elimination The duplicate elimination operator (δ) can also be pushed through bag intersection, but not across union, difference, or projection in general. δ(R ∩ S) = δ(R) ∩ δ(S)
Laws Involving Grouping. The grouping operator (γ) laws depend on the aggregate operators used. There is one general rule, however: grouping subsumes duplicate elimination, δ(γL(R)) = γL(R), because the result of γ contains at most one tuple per group and therefore has no duplicates. Whether a δ can instead be pushed below the γ depends on the aggregates: MIN and MAX are unaffected by duplicates, while SUM, COUNT, and AVG are.
Query Compiler. By: Payal Gupta, Roll No: 106 (225). Professor: Tsau Young Lin
Pushing Selections. That is, replacing the left side of one of the rules by its right side. In pushing selections we first push a selection as far up the tree as it will go, and then push the selections down all possible branches.
Let's take an example: StarsIn(title, year, starName) Movie(title, year, length, inColor, studioName, producerC#) Define the view MoviesOf1996 by: CREATE VIEW MoviesOf1996 AS SELECT * FROM Movie WHERE year = 1996;
"which stars worked for which studios in 1996 "which stars worked for which studios in 1996?“ can be given by a SQL Query: SELECT starName, studioName FROM MoviesOfl996 NATURAL JOIN StarsIn;
[Figure: logical query plan constructed from the definition of the query and the view: πstarName,studioName(σyear=1996(Movie) ⋈ StarsIn).]
[Figure: improving the query plan by moving selections up and down the tree: πstarName,studioName(σyear=1996(StarsIn) ⋈ σyear=1996(Movie)).]
Laws Involving Projection. "Pushing" projections really involves introducing a new projection somewhere below an existing projection. A projection keeps the number of tuples the same and only reduces the length of tuples. To describe the transformations of extended projection, consider a term E → x on the list for a projection, where E is an attribute or an expression involving attributes and constants and x is an output attribute.
Example. Let R(a, b, c) and S(c, d, e) be two relations. Consider the expression πa+e→x, b→y(R ⋈ S). The input attributes of the projection are a, b, and e, and c is the only join attribute. We may apply the law for pushing projections below joins to get the equivalent expression πa+e→x, b→y(πa,b,c(R) ⋈ πc,e(S)). Eliminating the projection onto R gives a third equivalent expression: πa+e→x, b→y(R ⋈ πc,e(S)).
In addition, we can perform a projection entirely before a bag union. That is: πL(R ∪B S) = πL(R) ∪B πL(S).
Laws About Joins and Products. Laws that follow directly from the definition of the join: R ⋈C S = σC(R × S); R ⋈ S = πL(σC(R × S)), where C is the condition that equates each pair of attributes from R and S with the same name, and L is a list that includes one attribute from each equated pair and all the other attributes of R and S. We often identify a product followed by a selection as a join of some kind.
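As a concrete instance of the second law (using schemas not in the slides), for R(a, b) and S(b, c) the natural join unfolds as R ⋈ S = πa, R.b, c(σR.b = S.b(R × S)).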
Laws Involving Duplicate Elimination. The operator δ, which eliminates duplicates from a bag, can be pushed through many but not all operators. In general, moving a δ down the tree reduces the size of intermediate relations and may therefore be beneficial. Moreover, sometimes we can move δ to a position where it can be eliminated altogether, because it is applied to a relation that is known not to possess duplicates.
δ(R) = R if R has no duplicates. Important cases of such a relation R include: a) a stored relation with a declared primary key, and b) a relation that is the result of a γ operation, since grouping creates a relation with no duplicates.
Several laws that "push" δ through other operators are: δ(R × S) = δ(R) × δ(S); δ(R ⋈ S) = δ(R) ⋈ δ(S); δ(R ⋈C S) = δ(R) ⋈C δ(S); δ(σC(R)) = σC(δ(R)). We can also move the δ to either or both of the arguments of an intersection: δ(R ∩B S) = δ(R) ∩B S = R ∩B δ(S) = δ(R) ∩B δ(S).
Laws Involving Grouping and Aggregation When we consider the operator γ, we find that the applicability of many transformations depends on the details of the aggregate operators used. Thus we cannot state laws in the generality that we used for the other operators. One exception is that a γ absorbs a δ . Precisely: δ(γL(R))=γL(R)
Let us call an operator γL duplicate-impervious if the only aggregations in L are MIN and/or MAX. Then γL(R) = γL(δ(R)) provided γL is duplicate-impervious.
Example. Suppose we have the relations MovieStar(name, addr, gender, birthdate) StarsIn(movieTitle, movieYear, starName) and we want to know, for each year, the birthdate of the youngest star to appear in a movie that year. We can express this query as: SELECT movieYear, MAX(birthdate) FROM MovieStar, StarsIn WHERE name = starName GROUP BY movieYear;
[Figure: initial logical query plan for the query: γmovieYear, MAX(birthdate)(σname=starName(MovieStar × StarsIn)).]
Some transformations that we can apply to this plan are: 1. Combine the selection and product into an equijoin. 2. Generate a δ below the γ, since the γ is duplicate-impervious. 3. Generate a π between the γ and the introduced δ to project onto movieYear and birthdate, the only attributes relevant to the γ.
[Figure: another query plan for the query: γmovieYear, MAX(birthdate)(πmovieYear, birthdate(δ(MovieStar ⋈name=starName StarsIn))).]
[Figure: a third query plan for the example, pushing δ and π below the join: γmovieYear, MAX(birthdate)(πmovieYear, birthdate(δ(πbirthdate,name(MovieStar)) ⋈name=starName δ(πmovieYear,starName(StarsIn)))).]
CS 255: Database System Principles slides: From Parse Trees to Logical Query Plans By:- Arunesh Joshi Id:-006538558
Agenda Conversion to Relational Algebra. Removing Sub queries From Conditions. Improving the Logical Query Plan. Grouping Associative/Commutative Operators.
Parsing Goal is to convert a text string containing a query into a parse tree data structure: leaves form the text string (broken into lexical elements) internal nodes are syntactic categories Uses standard algorithmic techniques from compilers given a grammar for the language (e.g., SQL), process the string and build the tree
Example: SQL query. Assume we have a simplified grammar for SQL. SELECT title FROM StarsIn WHERE starName IN ( SELECT name FROM MovieStar WHERE birthdate LIKE ‘%1960’ ); (Find the movies with stars born in 1960)
[Example parse tree: <Query> → <SFW>, with SELECT <SelList> yielding the attribute title, FROM <FromList> yielding the relation StarsIn, and WHERE <Condition> of the form <Tuple> IN <Query>; the <Tuple> is the attribute starName, and the nested <Query> is another <SFW> whose SELECT <SelList> is the attribute name, whose FROM <FromList> is MovieStar, and whose <Condition> is <Attribute> birthDate LIKE <Pattern> ‘%1960’.]
The Preprocessor It replaces each reference to a view with a parse (sub)-tree that describes the view (i.e., a query) It does semantic checking: are relations and views mentioned in the schema? are attributes mentioned in the current scope? are attribute types correct?
Convert Parse Tree to Relational Algebra. The complete algorithm depends on the specific grammar, which determines the forms of the parse trees. Here is a flavor of the approach.
Conversion. Suppose there are no subqueries. Then SELECT att-list FROM rel-list WHERE cond is converted into PROJatt-list(SELECTcond(PRODUCT(rel-list))), i.e., πatt-list(σcond(×(rel-list))).
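A minimal sketch of this conversion in Python (the dictionary representation of an already-parsed, subquery-free SELECT-FROM-WHERE block is an assumption for illustration; PI and SIGMA stand for π and σ):

def sfw_to_algebra(query):
    """Convert a subquery-free SELECT-FROM-WHERE block into a relational-algebra string."""
    product = " x ".join(query["from"])          # cross product of the FROM-list relations
    selection = "SIGMA[" + query["where"] + "](" + product + ")"
    return "PI[" + ", ".join(query["select"]) + "](" + selection + ")"

q = {"select": ["movieTitle"],
     "from": ["StarsIn", "MovieStar"],
     "where": "starName = name AND birthdate LIKE '%1960'"}
print(sfw_to_algebra(q))
# PI[movieTitle](SIGMA[starName = name AND birthdate LIKE '%1960'](StarsIn x MovieStar))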
SELECT movieTitle FROM StarsIn, MovieStar WHERE starName = name AND birthdate LIKE '%1960'; [Parse tree: <Query> → <SFW>, with SELECT <SelList> yielding movieTitle, FROM <FromList> yielding StarsIn and MovieStar, and WHERE <Condition> being the AND of <Attribute> starName = <Attribute> name and <Attribute> birthdate LIKE <Pattern> '%1960'.]
Equivalent Algebraic Expression Tree: πmovieTitle(σstarName = name AND birthdate LIKE '%1960'(StarsIn × MovieStar)).
Handling Subqueries Recall the (equivalent) query: SELECT title FROM StarsIn WHERE starName IN ( SELECT name FROM MovieStar WHERE birthdate LIKE ‘%1960’ ); Use an intermediate format called two-argument selection
[Figure: two-argument selection. πtitle is applied to a selection node whose left argument is StarsIn and whose right argument is the <condition> "<tuple> starName IN πname(σbirthdate LIKE ‘%1960’(MovieStar))".]
Converting Two-Argument Selection To continue the conversion, we need rules for replacing two-argument selection with a relational algebra expression Different rules depending on the nature of the sub query Here is shown an example for IN operator and uncorrelated query (sub query computes a relation independent of the tuple being tested)
Rules for IN. The two-argument selection over R with condition "t IN S" is replaced by σC(R × S), where C is the condition that equates the attributes in t with the corresponding attributes in S.
Example: Logical Query Plan. πtitle(σstarName=name(StarsIn × πname(σbirthdate LIKE ‘%1960’(MovieStar)))).
What if Subquery is Correlated? Example is when subquery refers to the current tuple of the outer scope that is being tested More complicated to deal with, since subquery cannot be translated in isolation Need to incorporate external attributes in the translation Some details are in textbook
Improving the Logical Query Plan There are numerous algebraic laws concerning relational algebra operations By applying them to a logical query plan judiciously, we can get an equivalent query plan that can be executed more efficiently Next we'll survey some of these laws
Example: Improved Logical Query Plan. πtitle(StarsIn ⋈starName=name πname(σbirthdate LIKE ‘%1960’(MovieStar))).
Associative and Commutative Operations product natural join set and bag union set and bag intersection associative: (A op B) op C = A op (B op C) commutative: A op B = B op A
Laws Involving Selection Selections usually reduce the size of the relation Usually good to do selections early, i.e., "push them down the tree" Also can be helpful to break up a complex selection into parts
Selection Splitting. σC1 AND C2(R) = σC1(σC2(R)); σC1 OR C2(R) = σC1(R) ∪set σC2(R) if R is a set; σC1(σC2(R)) = σC2(σC1(R)).
Selection and Binary Operators. Must push selection to both arguments: σC(R ∪ S) = σC(R) ∪ σC(S). Must push to the first argument, optional for the second: σC(R - S) = σC(R) - S; σC(R - S) = σC(R) - σC(S). Push to at least one argument having all attributes mentioned in C: product, natural join, theta join, intersection; e.g., σC(R × S) = σC(R) × S, if R has all the attributes in C.
Pushing Selection Up the Tree Suppose we have relations StarsIn(title,year,starName) Movie(title,year,len,inColor,studioName) and a view CREATE VIEW MoviesOf1996 AS SELECT * FROM Movie WHERE year = 1996; and the query SELECT starName, studioName FROM MoviesOf1996 NATURAL JOIN StarsIn;
The Straightforward Tree: πstarName,studioName(σyear=1996(Movie) ⋈ StarsIn). Remember the rule σC(R ⋈ S) = σC(R) ⋈ S?
The Improved Logical Query Plan. Push the selection up the tree: πstarName,studioName(σyear=1996(Movie ⋈ StarsIn)); then push it down both branches: πstarName,studioName(σyear=1996(Movie) ⋈ σyear=1996(StarsIn)).
Grouping Assoc/Comm Operators. Groups together adjacent joins, adjacent unions, and adjacent intersections as siblings in the tree. Sets up the logical QP for future optimization when the physical QP is constructed: determine the best order for doing a sequence of joins (or unions or intersections). [Figure: a cascade of binary unions over A, B, C, D, E, F grouped into a single multiway union node.]
16.4 Estimating the Cost of Operations. Project guide: Dr. T. Y. Lin. Prepared by Akshay Shenoy, Computer Science Dept, San Jose State University
Introduction Possible Physical Plan Estimating Sizes of Intermediate Relations Estimating the Size of a Projection Estimating the Size of a Selection Estimating the Size of a Join Natural Joins With Multiple Join Attributes Joins of Many Relations Estimating Sizes of Other Operations
Physical Plan. For each physical plan we select: an order and grouping for associative-and-commutative operations like joins and unions; an algorithm for each operator in the logical plan (e.g., whether a nested-loop join or a hash join is to be used); additional operators that are needed for the physical plan but were not present explicitly in the logical plan (e.g., scanning, sorting); and the way in which arguments are passed from one operator to the next.
Estimating Sizes of Intermediate Relations. Rules for estimating the number of tuples in an intermediate relation should: give accurate estimates, be easy to compute, and be logically consistent. The objective of estimation is to select the best physical plan with the least cost.
Estimating the Size of a Projection The projection is different from the other operators, in that the size of the result is computable. Since a projection produces a result tuple for every argument tuple, the only change in the output size is the change in the lengths of the tuples.
Estimating the Size of a Selection (1). Let S = σA=c(R), where A is an attribute of R and c is a constant. Then we recommend as an estimate: T(S) = T(R)/V(R,A). The rule above surely holds if all values of attribute A occur equally often in the database.
Estimating the Size of a Selection (2). If S = σA<c(R), then our estimate for T(S) is: T(S) = T(R)/3. For a "not equal" comparison, S = σA≠c(R), we may use T(S) = T(R)(V(R,A) - 1)/V(R,A) as an estimate. When the selection condition C is the AND of several equalities and inequalities, we can treat the selection as a cascade of simple selections, each of which checks for one of the conditions.
Estimating the Size of a Selection (3). A less simple, but possibly more accurate, estimate of the size of S = σC1 OR C2(R) is to assume that C1 and C2 are independent. Then, if R has n tuples, m1 of which satisfy C1 and m2 of which satisfy C2, we would estimate the number of tuples in S as n(1 - (1 - m1/n)(1 - m2/n)). In explanation, 1 - m1/n is the fraction of tuples that do not satisfy C1, and 1 - m2/n is the fraction that do not satisfy C2. The product of these numbers is the fraction of R's tuples that are not in S, and 1 minus this product is the fraction that are in S.
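A minimal sketch of these selection-size rules in Python (the parameter names are assumptions for illustration):

def est_equal(T, V):
    """Estimate size of sigma_{A=c}(R): T(R)/V(R,A)."""
    return T / V

def est_less(T):
    """Estimate size of sigma_{A<c}(R): T(R)/3."""
    return T / 3

def est_or(n, m1, m2):
    """Estimate size of sigma_{C1 OR C2}(R), assuming C1 and C2 are independent."""
    return n * (1 - (1 - m1 / n) * (1 - m2 / n))

# e.g. T(R)=10000, V(R,A)=50 gives 200 tuples for A = c,
# and the OR of two conditions selecting 200 and 500 tuples gives about 690.
print(est_equal(10000, 50), est_less(10000), est_or(10000, 200, 500))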
Estimating the Size of a Join. Two simplifying assumptions: 1. Containment of value sets: if R and S are two relations with attribute Y and V(R,Y) <= V(S,Y), then every Y-value of R will be a Y-value of S. 2. Preservation of value sets: joining a relation R with another relation S on an attribute A that is in R and not in S preserves all distinct values of A: V(S ⋈ R, A) = V(R,A). Under these assumptions, we estimate T(R ⋈ S) = T(R)T(S)/max(V(R,Y), V(S,Y)).
Natural Joins With Multiple Join Attributes. Of the T(R)T(S) pairs of tuples from R and S, the expected number of pairs that match in both y1 and y2 is: T(R)T(S)/(max(V(R,y1), V(S,y1)) × max(V(R,y2), V(S,y2))). In general, the following rule can be used to estimate the size of a natural join when any number of attributes are shared between the two relations: the estimate of the size of R ⋈ S is computed by multiplying T(R) by T(S) and dividing by the largest of V(R,y) and V(S,y) for each attribute y that is common to R and S.
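A sketch of this join-size rule in Python (the statistics dictionaries are assumptions for illustration):

def est_join_size(T_R, T_S, V_R, V_S, common_attrs):
    """Estimate T(R join S): T(R)*T(S) divided by max(V(R,y), V(S,y)) for each shared attribute y."""
    size = T_R * T_S
    for y in common_attrs:
        size /= max(V_R[y], V_S[y])
    return size

# e.g. T(R)=1000, T(S)=2000, shared attributes y1 and y2:
print(est_join_size(1000, 2000,
                    {"y1": 20, "y2": 100},
                    {"y1": 50, "y2": 80},
                    ["y1", "y2"]))   # 1000*2000/(50*100) = 400.0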
Joins of Many Relations (1). The rule for estimating the size of any join: start with the product of the number of tuples in each relation; then, for each attribute A appearing at least twice, divide by all but the least of the V(R,A)'s. We can also estimate the number of values that will remain for attribute A after the join: by the preservation-of-value-sets assumption, it is the least of these V(R,A)'s.
Joins of Many Relations (2). Based on the two assumptions, containment and preservation of value sets: no matter how we group and order the terms in a natural join of n relations, the estimation rules, applied to each join individually, yield the same estimate for the size of the result. Moreover, this estimate is the same one we get if we apply the rule for the join of all n relations as a whole.
Estimating Sizes for Other Operations. Union: the average of the sum and the larger. Intersection: approach 1: take the average of the extremes, which is half the smaller; approach 2: intersection is an extreme case of the natural join, so use the formula T(R ⋈ S) = T(R)T(S)/max(V(R,Y), V(S,Y)).
Estimating Sizes for Other Operations. Difference: T(R) - (1/2)T(S). Duplicate elimination: take the smaller of (1/2)T(R) and the product of all the V(R,a)'s over the attributes a of R. Grouping and aggregation: upper-bound the number of groups by the product of the V(R,A)'s, where attribute A ranges over only the grouping attributes of L; an estimate is the smaller of (1/2)T(R) and this product.
16.5 Introduction to Cost-based plan selection Amith KC Student Id: 109
Whether selecting a logical query plan or constructing a physical query plan from a logical plan, the query optimizer needs to estimate the cost of evaluating certain expressions. We shall assume that the "cost" of evaluating an expression is approximated well by the number of disk I/O's performed.
The number of disk I/O's, in turn, is influenced by: 1. The particular logical operators chosen to implement the query, a matter decided when we choose the logical query plan. 2. The sizes of intermediate results (whose estimation we discussed in Section 16.4). 3. The physical operators used to implement logical operators, e.g., the choice of a one-pass or two-pass join, or the choice to sort or not sort a given relation. 4. The ordering of similar operations, especially joins. 5. The method of passing arguments from one physical operator to the next.
Obtaining Estimates for Size Parameter The formulas of Section 16.4 were predicated on knowing certain important parameters, especially T(R), the number of tuples in a relation R, and V(R, a), the number of different values in the column of relation R for attribute a. A modern DBMS generally allows the user or administrator explicitly to request the gathering of statistics, such as T(R) and V(R, a). These statistics are then used in subsequent query optimizations to estimate the cost of operations. By scanning an entire relation R, it is straightforward to count the number of tuples T(R) and also to discover the number of different values V(R, a) for each attribute a. The number of blocks in which R can fit, B(R), can be estimated either by counting the actual number of blocks used (if R is clustered), or by dividing T(R) by the number of tuples per block
Computation of Statistics Periodic re-computation of statistics is the norm in most DBMS's, for several reasons. First, statistics tend not to change radically in a short time. Second, even somewhat inaccurate statistics are useful as long as they are applied consistently to all the plans. Third, the alternative of keeping statistics up-to-date can make the statistics themselves into a "hot-spot" in the database; because statistics are read frequently, we prefer not to update them frequently too.
The recomputation of statistics might be triggered automatically after some period of time, or after some number of updates. However, a database administrator noticing that poor-performing query plans are being selected by the query optimizer on a regular basis might request the recomputation of statistics in an attempt to rectify the problem. Computing statistics for an entire relation R can be very expensive, particularly if we compute V(R, a) for each attribute a in the relation. One common approach is to compute approximate statistics by sampling only a fraction of the data. For example, let us suppose we want to sample a small fraction of the tuples to obtain an estimate for V(R, a).
Heuristics for Reducing the Cost of Logical Query Plans One important use of cost estimates for queries or sub-queries is in the application of heuristic transformations of the query. We have already observed previously how certain heuristics applied independent of cost estimates can be expected almost certainly to improve the cost of a logical query plan. However, there are other points in the query optimization process where estimating the cost both before and after a transformation will allow us to apply a transformation where it appears to reduce cost and avoid the transformation otherwise. In particular, when the preferred logical query plan is being generated, we may consider a number of optional transformations and the costs before and after.
Because we are estimating the cost of a logical query plan, and so have not yet made decisions about the physical operators that will be used to implement the operators of relational algebra, our cost estimate cannot be based on disk I/Os. Rather, we estimate the sizes of all intermediate results using the techniques of Section 16.4, and their sum is our heuristic estimate for the cost of the entire logical plan. For example,
Consider the initial logical query plan shown below,
Let the statistics for the relations R and S be as follows. To generate a final logical query plan, we shall insist that the selection be pushed down as far as possible. However, we are not sure whether it makes sense to push the δ below the join or not. Thus, we generate the two query plans shown on the next slide. They differ in whether we have chosen to eliminate duplicates before or after the join.
[Figure: the two alternative query plans, (a) and (b), differing in whether δ is applied before or after the join.]
We know how to estimate the size of the result of the selections: we divide T(R) by V(R, a) = 50. We also know how to estimate the size of the joins: we multiply the sizes of the arguments and divide by max(V(R, b), V(S, b)), which is 200.
Approaches to Enumerating Physical Plans Let us consider the use of cost estimates in the conversion of a logical query plan to a physical query plan. The baseline approach, called exhaustive, is to consider all combinations of choices (for each of issues like order of joins, physical implementation of operators, and so on). Each possible physical plan is assigned an estimated cost, and the one with the smallest cost is selected.
There are two broad approaches to exploring the space of possible physical plans: Top-down: Here, we work down the tree of the logical query plan from the root. Bottom-up: For each sub-expression of the logical-query-plan tree, we compute the costs of all possible ways to compute that sub-expression. The possibilities and costs for a sub-expression E are computed by considering the options for the sub-expressions for E, and combining them in all possible ways with implementations for the root operator of E.
Branch-and-Bound Plan Enumeration This approach, often used in practice, begins by using heuristics to find a good physical plan for the entire logical query plan. Let the cost of this plan be C. Then as we consider other plans for sub-queries, we can eliminate any plan for a sub-query that has a cost greater than C, since that plan for the sub-query could not possibly participate in a plan for the complete query that is better than what we already know. Likewise, if we construct a plan for the complete query that has cost less than C, we replace C by the cost of this better plan in subsequent exploration of the space of physical query plans.
Hill Climbing. In this approach we really search for a "valley" in the space of physical plans and their costs. We start with a heuristically selected physical plan, and then make small changes to the plan, e.g., replacing one method for an operator by another, or reordering joins by using the associative and/or commutative laws, to find "nearby" plans that have lower cost. When we find a plan such that no small modification yields a plan of lower cost, we make that plan our chosen physical query plan.
Dynamic Programming. In this variation of the general bottom-up strategy, we keep for each sub-expression only the plan of least cost. As we work up the tree, we consider possible implementations of each node, assuming the best plan for each sub-expression is also used.
Selinger-Style Optimization. This approach improves upon the dynamic-programming approach by keeping for each sub-expression not only the plan of least cost, but also certain other plans that have higher cost, yet produce a result that is sorted in an order that may be useful higher up in the expression tree. Examples of such interesting orders are when the result of the sub-expression is sorted on one of: the attribute(s) specified in a sort (τ) operator at the root; the grouping attribute(s) of a later group-by (γ) operator; the join attribute(s) of a later join.
Choosing an Order for Joins Chapter 16.6 by: Chiu Luk ID: 210
Introduction. This section focuses on a critical problem in cost-based optimization: selecting an order for the natural join of three or more relations. Compared to other binary operations, joins take more time and therefore need effective optimization techniques.
Significance of Left and Right Join Arguments. The argument relations in a join determine the cost of the join. The left argument of the join is called the build relation; it is assumed to be the smaller relation and is kept in main memory.
Significance of Left and Right Join Arguments. The right argument of the join is called the probe relation; it is read a block at a time, and its tuples are matched with those of the build relation. The join algorithms that distinguish between the arguments are: one-pass join, nested-loop join, and index join.
Join Trees Order of arguments is important for joining two relations Left argument, since stored in main-memory, should be smaller With two relations only two choices of join tree With more than two relations, there are n! ways to order the arguments and therefore n! join trees, where n is the no. of relations
Join Trees. The total number of tree shapes T(n) for n relations is given by the recurrence T(n) = Σi=1..n-1 T(i)·T(n-i), with T(1) = 1, T(2) = 1, T(3) = 2, T(4) = 5, etc.
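A small Python check of this recurrence (a sketch, not part of the slides):

def tree_shapes(n):
    """Number of binary tree shapes for a join of n relations: T(n) = sum of T(i)*T(n-i)."""
    T = [0, 1]                           # T(1) = 1
    for m in range(2, n + 1):
        T.append(sum(T[i] * T[m - i] for i in range(1, m)))
    return T[n]

print([tree_shapes(n) for n in range(1, 6)])   # [1, 1, 2, 5, 14]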
Left-Deep Join Trees Consider 4 relations. Different ways to join them are as follows
In fig (a) all the right children are leaves; this is a left-deep tree. In fig (c) all the left children are leaves; this is a right-deep tree. Fig (b) is a bushy tree. Considering left-deep trees is advantageous for deciding join orders.
Join order. Join order selection for A1 ⋈ A2 ⋈ ... ⋈ An, using left-deep join trees and dynamic programming: the best plan is computed for each subset of the relations. BestPlan(A1, ..., An) = the minimum-cost plan among BestPlan(A2, ..., An) ⋈ A1, BestPlan(A1, A3, ..., An) ⋈ A2, ..., BestPlan(A1, ..., An-1) ⋈ An.
Dynamic Programming to Select a Join Order and Grouping. Three choices for picking an order for the join of many relations are: consider all of the orders, consider a subset, or use a heuristic to pick one. Dynamic programming is used either to consider all orders or a subset: construct a table of costs based on relation sizes, remembering only the minimum entry, which will be required to proceed (a sketch of this procedure appears below).
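For concreteness, here is a simplified sketch (not from the slides) of dynamic programming over subsets of relations; the cost is the sum of the sizes of all join results produced, and the relation names and size estimates below are hypothetical.

from itertools import combinations

def best_join_order(rels, sizes):
    """rels: list of relation names; sizes: dict mapping a frozenset of names
    to the estimated size of joining exactly those relations.
    Returns (cost, order) where cost sums the sizes of every join result produced."""
    best = {frozenset([r]): (0, [r]) for r in rels}          # single relations cost nothing extra
    for k in range(2, len(rels) + 1):
        for subset in map(frozenset, combinations(rels, k)):
            for r in subset:                                  # r is the last relation joined in
                rest = subset - {r}
                cost = best[rest][0] + sizes[subset]          # pay for producing this result
                if subset not in best or cost < best[subset][0]:
                    best[subset] = (cost, best[rest][1] + [r])
    return best[frozenset(rels)]

# Hypothetical size estimates for every subset of {R, S, U} with two or more relations:
sizes = {frozenset({"R", "S"}): 5000, frozenset({"R", "U"}): 200000,
         frozenset({"S", "U"}): 1000, frozenset({"R", "S", "U"}): 10000}
print(best_join_order(["R", "S", "U"], sizes))
# cost 11000: join S and U first (size 1000), then R (final size 10000)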
A Greedy Algorithm for Selecting a Join Order. It is expensive to use an exhaustive method like dynamic programming. A better approach is to use a join-order heuristic for query optimization; the greedy algorithm is an example. It makes one decision at a time about the order of joins and never backtracks on a decision once made.
UPDATES. Reducing cost by heuristics: applies to the logical query plan; estimate the cost before and after a transformation, and only apply the transformation when the cost estimates show it is beneficial. Join trees: the order of arguments is important for joining two relations; the left argument, since it is stored in main memory, should be smaller.
Completing the Physical-Query-Plan and Chapter 16 Summary (16.7-16.8) CS257 Spring 2009 Professor Tsau Lin Student: Suntorn Sae-Eung Donavon Norwood
Outline 16.7 Completing the Physical-Query-Plan I. Choosing a Selection Method II. Choosing a Join Method III. Pipelining Versus Materialization IV. Pipelining Unary Operations V. Pipelining Binary Operations VI. Notation for Physical Query Plan VII. Ordering the Physical Operations 16.8 Summary of Chapter 16
Before completing the physical query plan, a query has previously been: parsed and preprocessed (16.1), converted to logical query plans (16.3), had the costs of its operations estimated (16.4), had costs determined by cost-based plan selection (16.5), and had the costs of join operations weighed by choosing an order for joins (16.6).
16.7 Completing the Physical-Query-Plan. Three topics related to turning the logical plan into a complete physical plan: choice of physical implementations such as selection and join methods; decisions regarding intermediate results (materialized or pipelined); notation for physical-query-plan operators.
I. Choosing a Selection Method. (A) Algorithms for each selection operator: 1. Can we use an existing index on an attribute? If yes, index-scan; otherwise table-scan. 2. After retrieving all tuples satisfying the indexed condition in (1), filter them with the remaining selection conditions. For σC(R) with a table-scan we access the entire relation R and see which tuples satisfy condition C. If C contains a term of the form Aθc, where A is an attribute with an index, c is a constant, and θ is a comparison operator such as '=', '<' or '>', then σC(R) can be implemented using an index-scan on that term.
Choosing a Selection Method(A) (cont.) Recall Cost of query = # disk I/O’s How costs for various plans are estimated from σC(R) operation 1. Cost of table-scan algorithm B(R) if R is clustered T(R) if R is not clustered 2. Cost of a plan picking an equality term (e.g. a = 10) w/ index-scan B(R) / V(R, a) clustering index T(R) / V(R, a) nonclustering index 3. Cost of a plan picking an inequality term (e.g. b < 20) w/ index-scan B(R) / 3 clustering index T(R) / 3 nonclustering index
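These cost formulas can be placed side by side in a small Python sketch (an illustration only; the statistics dictionary and plan names are assumptions), which also reproduces the numbers in the example that follows:

def selection_plan_costs(stats, clustered, eq_indexes, ineq_indexes):
    """stats: {'T':..., 'B':..., 'V':{attr:...}}; eq_indexes/ineq_indexes map an
    attribute to True if its index is clustering, False if nonclustering."""
    costs = {"table-scan": stats["B"] if clustered else stats["T"]}
    for a, clustering in eq_indexes.items():          # equality terms, e.g. a = 10
        base = stats["B"] if clustering else stats["T"]
        costs["index on " + a] = base / stats["V"][a]
    for a, clustering in ineq_indexes.items():        # inequality terms, e.g. b < 20
        base = stats["B"] if clustering else stats["T"]
        costs["index on " + a] = base / 3
    return costs

# Statistics from the example below: T(R)=5000, B(R)=200, V(R,x)=100, V(R,y)=500.
print(selection_plan_costs({"T": 5000, "B": 200, "V": {"x": 100, "y": 500}},
                           True, {"x": False, "y": False}, {"z": True}))
# {'table-scan': 200, 'index on x': 50.0, 'index on y': 10.0, 'index on z': 66.66...}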
Example. Selection: σx=1 AND y=2 AND z<5(R), where the parameters of R(x, y, z) are: T(R) = 5000, B(R) = 200, V(R,x) = 100, and V(R,y) = 500. Relation R is clustered; x and y have nonclustering indexes, and the only index on z is clustering.
Example (cont.) Selection options: Table-scan filter x, y, z. Cost is B(R) = 200 since R is clustered. Use index on x =1 filter on y, z. Cost is 50 since T(R) / V(R, x) is (5000/100) = 50 tuples, index is not clustering. Use index on y =2 filter on x, z. Cost is 10 since T(R) / V(R, y) is (5000/500) = 10 tuples using nonclustering index. Index-scan on clustering index w/ z < 5 filter x ,y. Cost is about B(R)/3 = 67
Example (cont.). Costs: option 1 = 200, option 2 = 50, option 3 = 10, option 4 = 67. The lowest cost is option 3. Therefore, the preferred physical plan retrieves all tuples with y = 2 and then filters on the remaining two conditions (x and z).
II. Choosing a Join Method. Determine the costs associated with each join algorithm: 1. One-pass join, or nested-loop join, when enough buffers can be devoted to the join. 2. Sort-join is preferred when the attributes are pre-sorted, or when two or more joins are on the same attribute, such as (R(a, b) ⋈ S(a, c)) ⋈ T(a, d), where sorting R and S on a will produce the result of R ⋈ S sorted on a, to be used directly in the next join.
Choosing a Join Method (cont.) 3. Index-join for a join with high chance of using index created on the join attribute such as R(a, b) S(b, c) 4. Hashing join is the best choice for unsorted or non- indexing relations which needs multipass join.
III. Pipelining Versus Materialization. Materialization (the naïve way): store the (intermediate) result of each operation on disk. Pipelining (the more efficient way): interleave the execution of several operations; the tuples produced by one operation are passed directly to the operation that uses them, and the (intermediate) result of each operation is kept in buffers in main memory.
IV. Pipelining Unary Operations. Unary = a tuple at a time or a full relation; selection and projection are the best candidates for pipelining. [Figure: a unary operation reading R through an input buffer and writing to an output buffer, with M-1 buffers available.]
Pipelining Unary Operations (cont.) Pipelining Unary Operations are implemented by iterators
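As an illustration of the iterator idea, here is a sketch in Python using generators rather than the textbook's Open/GetNext/Close interface; the relation format (lists of blocks of dictionary tuples) is assumed.

def table_scan(blocks):
    """Yield tuples of a stored relation one at a time (blocks is a list of lists of tuples)."""
    for block in blocks:
        for t in block:
            yield t

def select(pred, child):
    """Pipelined selection: pass through only tuples satisfying pred."""
    for t in child:
        if pred(t):
            yield t

def project(attrs, child):
    """Pipelined projection onto the named attributes (tuples are dicts here)."""
    for t in child:
        yield {a: t[a] for a in attrs}

# Tuples flow one at a time; no intermediate relation is ever materialized.
R = [[{"x": 1, "y": 2}, {"x": 3, "y": 4}], [{"x": 5, "y": 2}]]
print(list(project(["x"], select(lambda t: t["y"] == 2, table_scan(R)))))
# [{'x': 1}, {'x': 5}]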
V. Pipelining Binary Operations. Binary operations: ∪, ∩, −, ⋈, ×. The results of binary operations can also be pipelined: use one buffer to pass the result to its consumer, one block at a time. The extended example shows the tradeoffs and opportunities.
Example. Consider a physical query plan for the expression (R(w, x) ⋈ S(x, y)) ⋈ U(y, z). Assumptions: R occupies 5,000 blocks, S and U each 10,000 blocks; the intermediate result R ⋈ S occupies k blocks for some k; both joins will be implemented as hash-joins, either one-pass or two-pass depending on k; there are 101 buffers available.
Example (cont.). First consider the join R ⋈ S; neither relation fits in the buffers. We need a two-pass hash-join, partitioning R into 100 buckets (the maximum possible) of 50 blocks each. The second pass of the hash-join uses 51 buffers, leaving the remaining 50 buffers for joining the result of R ⋈ S with U.
Example (cont.). Case 1: suppose k ≤ 49, so the result of R ⋈ S occupies at most 49 blocks. Steps: pipeline R ⋈ S into 49 buffers; organize them for lookup as a hash table; use the one remaining buffer to read each block of U in turn; execute the second join as a one-pass join.
Example (cont.). The total number of I/O's is 55,000: 45,000 for the two-pass hash-join of R and S, and 10,000 to read U for the one-pass hash-join of (R ⋈ S) ⋈ U.
Example (cont.). Case 2: suppose k > 49 but k < 5,000. We can still pipeline, but we need another strategy, in which the intermediate result is joined with U in a 50-bucket, two-pass hash-join. The steps are: before starting on R ⋈ S, hash U into 50 buckets of 200 blocks each; perform the two-pass hash-join of R and S using 51 buffers as in case 1, placing the result in the 50 remaining buffers to form 50 buckets for the join of R ⋈ S with U; finally, join R ⋈ S with U bucket by bucket.
Example (cont.). The number of disk I/O's is: 20,000 to read U and write its tuples into buckets; 45,000 for the two-pass hash-join R ⋈ S; k to write out the buckets of R ⋈ S; k + 10,000 to read the buckets of R ⋈ S and U in the final join. The total cost is 75,000 + 2k.
Example (cont.). Compare the increase in I/O's between case 1 and case 2: for k ≤ 49 (case 1) the disk I/O count is 55,000; for 49 < k < 5,000 (case 2), k = 50 gives 75,000 + (2·50) = 75,100, k = 51 gives 75,000 + (2·51) = 75,102, and k = 52 gives 75,000 + (2·52) = 75,104. Notice that the I/O count jumps discontinuously as k increases from 49 to 50.
Example (cont.). Case 3: k > 5,000. We cannot perform the two-pass join in the 50 available buffers if the result of R ⋈ S is pipelined. The steps are: 1. compute R ⋈ S using a two-pass join and store the result on disk; 2. join the result of (1) with U, using a two-pass join.
Example (cont.). The number of disk I/O's is: 45,000 for the two-pass hash-join of R and S; k to store R ⋈ S on disk; 30,000 + 3k for the two-pass join of U with R ⋈ S. The total cost is 75,000 + 4k.
Example (cont.). In summary, the cost of the physical plan as a function of the size k of R ⋈ S is: 55,000 if k ≤ 49; 75,000 + 2k if 49 < k ≤ 5,000; 75,000 + 4k if k > 5,000.
VI. Notation for Physical Query Plans Several types of operators: Operators for leaves (Physical) operators for Selection (Physical) Sorts Operators Other Relational-Algebra Operations In practice, each DBMS uses its own internal notation for physical query plan.
Notation for Physical Query Plans (cont.) Operator for leaves A leaf operand is replaced in LQP tree TableScan(R) : read all blocks SortScan(R, L) : read in order according to L IndexScan(R, C): scan index attribute A by condition C of form Aθc. IndexScan(R, A) : scan index attribute R.A. This behaves like TableScan but more efficient if R is not clustered.
Notation for Physical Query Plans (cont.). (Physical) operators for selection: the logical operator σC(R) is often combined with access methods. If σC(R) is replaced by Filter(C), and there is no index on R or on an attribute of condition C, use TableScan(R) or SortScan(R, L) to access R. If condition C is of the form Aθc AND D for some condition D, and there is an index on R.A, then we may use the operator IndexScan(R, Aθc) to access R and use Filter(D) in place of the selection σC(R).
Notation for Physical Query Plans (cont.). (Physical) sort operators: sorting can occur at any point in the physical plan, using the notation SortScan(R, L). It is common to use an explicit operator Sort(L) to sort a relation that is not stored. It can be applied at the top of the physical-query-plan tree if the result needs to be sorted because of an ORDER BY clause (the τ operator).
Notation for Physical Query Plans (cont.). Other relational-algebra operations: descriptive text definitions and signs elaborate the operation performed, e.g. join or grouping; the necessary parameters, e.g. the theta-join condition or the list of elements in a grouping; a general strategy for the algorithm, e.g. sort-based, hash-based, or index-based; a decision about the number of passes to be used, e.g. one-pass, two-pass or multipass; and an anticipated number of buffers the operation will require.
Notation for Physical Query Plans (cont.) Example of a physical-query-plan A physical-query-plan in example 16.36 for the case k > 5000 TableScan Two-pass hash join Materialize (double line) Store operator
Notation for Physical Query Plans (cont.). Another example: a physical-query-plan in Example 16.36 for the case k ≤ 49: two TableScans, a two-pass hash join, pipelining, different buffer needs, and a store operator.
Notation for Physical Query Plans (cont.) A physical-query-plan in example 16.35 Use Index on condition y = 2 first Filter with the rest condition later on.
VII. Ordering of Physical Operations. The physical query plan is represented as a tree structure, which implies an order of operations. Still, the order of evaluation of interior nodes may not always be clear: iterators are used in a pipelined manner, and the overlapping execution of various nodes makes a strict "ordering" meaningless.
Ordering of Physical Operations (cont.). Three rules summarize the ordering of events in a physical-query-plan tree: break the tree into sub-trees at each edge that represents materialization, and execute one subtree at a time; order the execution of the subtrees bottom-up and left-to-right; all nodes of each sub-tree are executed simultaneously.
Summary of Chapter 16 In this part of the presentation I will talk about the main topics of Chapter 16.
COMPILATION OF QUERIES. Compilation means turning a query into a physical query plan, which can be implemented by the query engine. Steps of query compilation: parsing, semantic checking, selection of the preferred logical query plan, and generating the best physical plan.
THE PARSER The first step of SQL query processing. Generates a parse tree Nodes in the parse tree corresponds to the SQL constructs Similar to the compiler of a programming language
VIEW EXPANSION A very critical part of query compilation. Expands the view references in the query tree to the actual view. Provides opportunities for the query optimization.
SEMANTIC CHECKING Checks the semantics of a SQL query. Examines a parse tree. Checks : Attributes Relation names Types Resolves attribute references.
CONVERSION TO A LOGICAL QUERY PLAN. Converts a semantically checked parse tree to an algebraic expression. The conversion is straightforward, but subqueries need special handling. The two-argument selection approach can be used.
ALGEBRAIC TRANSFORMATION Many different ways to transform a logical query plan to an actual plan using algebraic transformations. The laws used for this transformation : Commutative and associative laws Laws involving selection Pushing selection Laws involving projection Laws about joins and products Laws involving duplicate eliminations Laws involving grouping and aggregation
ESTIMATING SIZES OF RELATIONS. True running time is taken into consideration when selecting the best logical plan. The two factors that affect the estimates the most are: the sizes of relations (number of tuples) and the number of distinct values for each attribute of each relation. Histograms are used by some systems.
COST BASED OPTIMIZING. The best physical query plan is the least costly plan. Factors that decide the cost of a query plan: the order and grouping of operations like joins, unions and intersections; the join algorithms used, such as nested-loop and hash joins; scanning and sorting operations; and storing intermediate results.
PLAN ENUMERATION STRATEGIES. Common approaches for searching the space for the best physical plan: dynamic programming (tabularizing the best plan for each subexpression); Selinger-style programming (also keeping plans whose results have useful sort orders in the table); greedy approaches (making a series of locally optimal decisions); branch-and-bound (using a heuristically found good plan as a cost bound to prune plans for sub-queries).
LEFT-DEEP JOIN TREES. Left-deep join trees are the binary trees with a single spine down the left edge and with leaves as right children. This strategy reduces the number of plans to be considered for the best physical plan. Restrict the search to left-deep join trees when picking a grouping and order for the join of several relations.
PHYSICAL PLANS FOR SELECTION Breaking a selection into an index-scan of relation, followed by a filter operation. The filter then examines the tuples retrieved by the index-scan. Allows only those to pass which meet the portions of selection condition.
PIPELINING VERSUS MATERIALIZING. The flow of data between operators can be controlled to implement "pipelining": an operator consumes the result of another operator as it is produced, passed through main memory. Alternatively, intermediate results can be removed from main memory to save space for other operators; this technique is implemented using "materialization". Both pipelining and materialization should be considered by the physical query plan generator.
UPDATES: Pipelining Versus Materialization. Materialization (the naïve way): store the (intermediate) result of each operation on disk. Pipelining (the more efficient way): interleave the execution of several operations; the tuples produced by one operation are passed directly to the operation that uses them, and the (intermediate) result of each operation is kept in buffers in main memory.