Chapter 7 The Query Compiler
[Figure: the query processor in three steps: (1) the query is parsed into a parse tree, (2) the parse tree becomes a logical query plan (a relational-algebra expression tree), (3) the logical plan becomes a physical query plan.]
The Stages of Query Compilation
[Figure: Query → Parser → Preprocessor → Logical query plan generator → Query rewriter → Preferred logical query plan. The diagram is labeled with § 7.1 and § 7.3.]
Parsing
Convert a SQL statement to a parse tree, whose nodes are of two kinds:
1. Atoms: lexical elements such as keywords, names of attributes or relations, constants, parentheses, operators and other schema elements.
2. Syntactic categories: names for families of query subparts that all play a similar role in a query.
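A Grammar of a Simple Subset of SQL
1. Query:
   <Query> ::= <SFW>
   <Query> ::= ( <Query> )
2. Select-From-Where:
   <SFW> ::= SELECT <SelList> FROM <FromList> WHERE <Condition>
3. Select-Lists:
   <SelList> ::= <Attribute> , <SelList>
   <SelList> ::= <Attribute>
4. From-Lists:
   <FromList> ::= <Relation> , <FromList>
   <FromList> ::= <Relation>
5. Conditions:
   <Condition> ::= <Condition> AND <Condition>
   <Condition> ::= <Tuple> IN <Query>
   <Condition> ::= <Attribute> = <Attribute>
   <Condition> ::= <Attribute> LIKE <Pattern>
6. Tuples:
   <Tuple> ::= <Attribute>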
An Example
StarsIn(title, year, starName)
MovieStar(name, address, gender, birthdate)
Find the movies with stars born in 1960:
SELECT title
FROM StarsIn
WHERE starName IN (
    SELECT name
    FROM MovieStar
    WHERE birthdate LIKE '%1960'
);
[Figure: parse tree of the query SELECT title FROM StarsIn WHERE starName IN ( subquery ), where the subquery is SELECT name FROM MovieStar WHERE birthdate LIKE '%1960'.]
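The same question asked with a join instead of a subquery:
SELECT title
FROM StarsIn, MovieStar
WHERE starName = name AND birthdate LIKE '%1960';
[Figure: parse tree of this query, with from-list StarsIn, MovieStar and condition starName = name AND birthdate LIKE '%1960'.]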
Preprocessor 1.View Expansion 2.Semantic Checking Check relation uses Check and resolve attribute uses Check types
Algebraic Laws for Improving Query Plans ▪ Commutative and Associative Laws ▪ Laws Involving Selection ▪ Laws Involving Projection ▪ Laws About Joins and Products ▪ Laws Involving Duplicate Elimination ▪ Laws Involving Grouping and Aggregation
Commutative and Associative Laws
▪ Commutative: R × S = S × R; R ∞ S = S ∞ R; R ∪ S = S ∪ R; R ∩ S = S ∩ R
▪ Associative: (R × S) × T = R × (S × T); (R ∞ S) ∞ T = R ∞ (S ∞ T); (R ∪ S) ∪ T = R ∪ (S ∪ T); (R ∩ S) ∩ T = R ∩ (S ∩ T)
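Theta Join
▪ Commutative: R ∞_C S = S ∞_C R
▪ Not always associative. Suppose R(a,b), S(b,c) and T(c,d). Then (R ∞_{R.a>S.b} S) ∞_{a<d} T cannot be rewritten as R ∞_{R.a>S.b} (S ∞_{a<d} T), because a is an attribute of neither S nor T.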
Laws Involving Selection
▪ σ_{C1 AND C2}(R) = σ_{C1}(σ_{C2}(R))
▪ σ_{C1 OR C2}(R) = (σ_{C1}(R)) ∪_S (σ_{C2}(R)), where ∪_S is set union (this law holds only when R is a set)
▪ σ_{C1}(σ_{C2}(R)) = σ_{C2}(σ_{C1}(R))
Transformation Examples
σ_{(a=1 OR a=3) AND b<c}(R)
= σ_{a=1 OR a=3}(σ_{b<c}(R))
= σ_{a=1}(σ_{b<c}(R)) ∪ σ_{a=3}(σ_{b<c}(R))
Alternatively:
σ_{(a=1 OR a=3) AND b<c}(R)
= σ_{b<c}(σ_{a=1 OR a=3}(R))
= σ_{b<c}(σ_{a=1}(R) ∪ σ_{a=3}(R))
σ Laws for Binary Operators
1. ∪: the selection must be pushed to both arguments.
2. ―: the selection must be pushed to the first argument and may optionally be pushed to the second.
3. Others (×, ∞, ∩): the selection need only be pushed to one argument.
In the laws below, suppose the relation R has all the attributes mentioned in C:
▪ σ_C(R ∪ S) = σ_C(R) ∪ σ_C(S)
▪ σ_C(R ― S) = σ_C(R) ― S = σ_C(R) ― σ_C(S)
▪ σ_C(R × S) = σ_C(R) × S
▪ σ_C(R ∞ S) = σ_C(R) ∞ S
▪ σ_C(R ∞_D S) = σ_C(R) ∞_D S
▪ σ_C(R ∩ S) = σ_C(R) ∩ S
For example, with R(a,b) and S(b,c):
σ_{a=1 OR a=3}(σ_{b<c}(R ∞ S)) → σ_{a=1 OR a=3}(R ∞ σ_{b<c}(S))
Pushing Selections
Sometimes it helps to first move a selection as far up the tree as it will go, and then push it down all possible branches.
E.g., StarsIn(title, year, starName), Movie(title, year, length, studioName)
View:
CREATE VIEW MovieOf1996 AS
    SELECT * FROM Movie WHERE year = 1996;
Query: “Which stars worked for which studios in 1996?”
SELECT starName, studioName
FROM MovieOf1996 NATURAL JOIN StarsIn;
Initial plan: П_{starName, studioName}(σ_{year=1996}(Movie) ∞ StarsIn)
∵ σ_C(R ∞ S) = σ_C(R) ∞ S
∴ σ_{year=1996}(Movie) ∞ StarsIn = σ_{year=1996}(Movie ∞ StarsIn)
∵ σ_C(R ∞ S) = σ_C(R) ∞ σ_C(S) when both R and S have all the attributes of C
∴ σ_{year=1996}(Movie ∞ StarsIn) = σ_{year=1996}(Movie) ∞ σ_{year=1996}(StarsIn)
Improved plan: П_{starName, studioName}(σ_{year=1996}(Movie) ∞ σ_{year=1996}(StarsIn))
Laws Involving Projection A projection may be introduced anywhere in an expression tree, as long as it eliminates only attributes that are never used by any of the operators above, and are not in the result of the entire expression.
Basic Laws:
▪ П_L(R ∞ S) = П_L(П_M(R) ∞ П_N(S))
▪ П_L(R ∞_C S) = П_L(П_M(R) ∞_C П_N(S))
▪ П_L(R × S) = П_L(П_M(R) × П_N(S))
where M and N are the attributes of R and S, respectively, that are join attributes, are mentioned in C, or are input attributes of L.
Suppose there are relations R(a,b,c), S(c,d,e):
П_{a+e→x, b→y}(R ∞ S)
= П_{a+e→x, b→y}(П_{a,b,c}(R) ∞ П_{c,e}(S))
= П_{a+e→x, b→y}(R ∞ П_{c,e}(S))
A projection can be pushed below a bag union: П_L(R ∪_B S) = П_L(R) ∪_B П_L(S)
Projections cannot be pushed below ∪_S, ― or ∩. For example, with R(a,b) = {(1,2)} and S(a,b) = {(1,0)}:
П_a(R ∩ S) = ∅, but П_a(R) ∩ П_a(S) = {(1)}
Projection Involving Some Computation R(a,b,c), S(c,d,e) Пa+b→x,d+e→y(R∞S) =Пx,y(Пa+b→x,c(R)∞Пd+e→y,c(S)) If x or y is c, we need a temporary name. Пa+b→c,d+e→y(R∞S) =Пz→c,y(Пa+b→z,c(R)∞Пd+e→y,c(S))
Pushing a projection below a selection
П_L(σ_C(R)) = П_L(σ_C(П_M(R))) (M: the input attributes of L plus the attributes mentioned in C)
For example, from StarsIn(title, year, starName), find the stars that worked in 1996:
SELECT starName FROM StarsIn WHERE year = 1996;
П_{starName}(σ_{year=1996}(StarsIn)) → П_{starName}(σ_{year=1996}(П_{starName, year}(StarsIn)))
Notice: if there is an index on year, the extra projection may not improve the plan, because the index can be used on the stored relation StarsIn but not on the result of the projection.
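Laws About Joins and Products
▪ R ∞_C S = σ_C(R × S)
▪ R ∞ S = П_L(σ_C(R × S)), where C equates each pair of attributes of R and S that have the same name, and L keeps one copy of each equated pair along with all the other attributes
Usually these rules are applied from right to left.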
Laws Involving Duplicate Elimination
▪ δ(R) = R if R has no duplicates, for example when R is (1) a stored relation with a declared primary key, or (2) the result of a γ operation
▪ δ(R ∪_S S) = R ∪_S S, and the same holds for ∩_S and ―_S
Several laws that push δ:
▪ δ(R × S) = δ(R) × δ(S)
▪ δ(R ∞ S) = δ(R) ∞ δ(S)
▪ δ(R ∞_C S) = δ(R) ∞_C δ(S)
▪ δ(σ_C(R)) = σ_C(δ(R))
Notice: δ cannot be moved across ∪_B, ―_B or П.
For example, R has two copies of tuple t and S has one copy of t:
▪ δ(R ∪_B S) has one copy of t, but δ(R) ∪_B δ(S) has two
▪ δ(R ―_B S) has one copy of t, but δ(R) ―_B δ(S) has none
With T(a,b) = {(1,2), (1,3)}: δ(П_a(T)) = {(1)}, but П_a(δ(T)) = {(1), (1)}
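The counterexamples for δ can be checked mechanically. The sketch below models bags with Python's `collections.Counter`; the helper names `delta` and `project_first` are illustrative and not part of the slides.

```python
from collections import Counter

def delta(bag):
    """Duplicate elimination: keep one copy of every tuple in the bag."""
    return Counter(set(bag.elements()))

def project_first(bag):
    """Bag projection onto the first attribute (duplicates are kept)."""
    out = Counter()
    for row, copies in bag.items():
        out[row[:1]] += copies
    return out

R = Counter({"t": 2})   # two copies of t
S = Counter({"t": 1})   # one copy of t

# Counter '+' is bag union; Counter '-' is bag difference.
print(delta(R + S), delta(R) + delta(S))
print(delta(R - S), delta(R) - delta(S))

T = Counter({(1, 2): 1, (1, 3): 1})
print(delta(project_first(T)), project_first(delta(T)))
```

In each printed pair the two sides differ, which is why δ cannot be moved across bag union, bag difference or projection.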
Laws Involving Grouping and Aggregation General Rules : ▪ δ(γ L ( R )) =γ L (R) ▪ γ L (R)=γ L (П M (R)) (M: attributes of R mentioned in L ) Other Rules : ▪ MIN, MAX: Not affected by duplicates γ L (R)= γ L (δ(R)) ▪ SUM, COUNT, AVG: Affected by duplicates
An Example
Relations: MovieStar(name, addr, gender, birthdate), StarsIn(title, year, starName)
Query: for each year, find the birthdate of the youngest star to appear in a movie that year.
SELECT year, MAX(birthdate)
FROM MovieStar, StarsIn
WHERE name = starName
GROUP BY year;
Initial logical query plan: γ_{year, MAX(birthdate)}(σ_{name=starName}(MovieStar × StarsIn))
Three transformations improve the plan:
1. Combine the selection and product into an equijoin.
2. Generate a δ below the γ.
3. Generate a П between the γ and the introduced δ to project onto year and birthdate.
After steps 1 and 2: γ_{year, MAX(birthdate)}(δ(MovieStar ∞_{name=starName} StarsIn))
After step 3, with δ and П pushed down: γ_{year, MAX(birthdate)}(П_{year, birthdate}(δ(П_{birthdate, name}(MovieStar)) ∞_{name=starName} δ(П_{year, starName}(StarsIn))))
From Parse Trees to Logical Query Plans
Suppose <Query> is a <SFW> construct whose <Condition> has no subqueries. Convert it into a relational-algebra expression, from bottom to top, as follows:
1. The product of all relations in the <FromList>;
2. σ_C, where C is the <Condition> expression;
3. П_L, where L is the list of attributes in the <SelList>.
Translation of a Parse Tree to an Algebraic Expression Tree
[Figure: parse tree of SELECT title FROM StarsIn, MovieStar WHERE starName = name AND birthdate LIKE '%1960'.]
П_{title}(σ_{starName=name AND birthdate LIKE '%1960'}(StarsIn × MovieStar))
Removing Subqueries From Conditions Two-argument selection Node: σ Left Child: The Relation R Right Child: The Condition C
[Figure: parse tree of SELECT title FROM StarsIn WHERE starName IN (SELECT name FROM MovieStar WHERE birthdate LIKE '%1960').]
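[Figure: expression tree with a two-argument selection.] П_{title}(σ(StarsIn, <Condition>)), where the <Condition> child is starName IN П_{name}(σ_{birthdate LIKE '%1960'}(MovieStar))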
Replacement of Two-Argument Selection by a One-Argument Selection
Uncorrelated subquery: given a two-argument selection whose left child is R and whose right child is the <Condition> t IN S:
1. Replace the <Condition> by the expression for S.
2. Replace the two-argument selection by a one-argument selection σ_C, where C equates each component of t to the corresponding attribute of S.
3. Give σ_C an argument that is the product of R and S.
Uncorrelated Subquery
П_{title}(σ_{starName=name}(StarsIn × П_{name}(σ_{birthdate LIKE '%1960'}(MovieStar))))
Correlated Subquery
Find the movies where the average age of the stars was at most 40 when the movie was made:
SELECT DISTINCT m1.title, m1.year
FROM StarsIn m1
WHERE m1.year - 40 <= (
    SELECT AVG(birthdate)
    FROM StarsIn m2, MovieStar s
    WHERE m2.starName = s.name AND
          m1.title = m2.title AND
          m1.year = m2.year
);
[Figure: δ(П_{m1.title, m1.year}(σ(StarsIn m1, <Condition>))), a two-argument selection whose condition is m1.year - 40 ≤ γ_{AVG(s.birthdate)}(σ_{m2.title=m1.title AND m2.year=m1.year}(StarsIn m2 ∞_{m2.starName=s.name} MovieStar s)).]
Translated plan: δ(П_{m1.title, m1.year}(σ_{m1.year-40≤abd}(StarsIn m1 ∞_{m2.title=m1.title AND m2.year=m1.year} γ_{m2.title, m2.year, AVG(s.birthdate)→abd}(StarsIn m2 ∞_{m2.starName=s.name} MovieStar s))))
Simplified plan, without the join with StarsIn m1: δ(П_{m2.title, m2.year}(σ_{m2.year-40≤abd}(γ_{m2.title, m2.year, AVG(s.birthdate)→abd}(StarsIn m2 ∞_{m2.starName=s.name} MovieStar s))))
Improving the Logical Query Plan Pushing down selection. Pushing down projection , or adding new projection. Removing duplicate elimination, or moving to a more convenient position. Turning selection and product into an equijoin.
1. Initial plan: П_{title}(σ_{starName=name AND birthdate LIKE '%1960'}(StarsIn × MovieStar))
2. Split the selection and push the birthdate condition down: П_{title}(σ_{starName=name}(StarsIn × σ_{birthdate LIKE '%1960'}(MovieStar)))
3. Turn the remaining selection and product into an equijoin: П_{title}(StarsIn ∞_{starName=name} σ_{birthdate LIKE '%1960'}(MovieStar))
Grouping Associative/Commutative Operators
Group nodes that have the same associative and commutative operator into a single node with many children.
In some situations, natural joins can be combined with theta-joins:
– Replace the natural joins with theta-joins;
– Add a projection;
– The theta-join conditions must be associative.
[Figure: a tree of binary ∞ and ∪ nodes over R, S, T, U, V, W, and the same tree after adjacent ∪ nodes and adjacent ∞ nodes are merged into multiway nodes.]
Estimating the Cost of Operations
When deriving physical plans from a logical plan, we need to select:
1. an order and grouping for associative-and-commutative operations;
2. an algorithm for each operator in the logical plan;
3. additional operators such as scanning and sorting;
4. the way in which arguments are passed from one operator to the next.
Estimating Sizes of Intermediate Relations
Rules for estimating sizes should:
1. Give accurate estimates
2. Be easy to compute
3. Be logically consistent
Estimating the Size of a Projection
Suppose R(a, b, c), where a and b are 4-byte integers and c is a 100-byte string. Each tuple header requires 12 bytes and each block header requires 24 bytes, so a tuple of R occupies 120 bytes and a 1024-byte block can hold (1024 - 24)/120 = 8 tuples. If T(R) = 10,000, then B(R) = 10,000/8 = 1250.
For S = П_{a+b, c}(R), each tuple of S is 116 bytes and each block can still hold only (1024 - 24)/116 = 8 tuples, so B(S) = 1250.
For U = П_{a,b}(R), each tuple of U is 20 bytes, each block can hold 1000/20 = 50 tuples, and B(U) = 10,000/50 = 200.
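The block arithmetic of this example can be replayed with a few lines of Python. This is a minimal sketch: the 1024-byte block size is an assumption inferred from the 1000 usable bytes in the example, and the function names are illustrative.

```python
def tuples_per_block(block_size, block_header, tuple_size):
    """Whole tuples that fit in one block after the block header."""
    return (block_size - block_header) // tuple_size

def blocks_needed(num_tuples, per_block):
    """Blocks needed to hold num_tuples, rounding up."""
    return -(-num_tuples // per_block)

BLOCK_SIZE = 1024    # assumption: implied by the 1000 usable bytes
BLOCK_HEADER = 24
TUPLE_HEADER = 12

tuple_sizes = {
    "R": TUPLE_HEADER + 4 + 4 + 100,  # a, b, c   -> 120 bytes
    "S": TUPLE_HEADER + 4 + 100,      # a+b, c    -> 116 bytes
    "U": TUPLE_HEADER + 4 + 4,        # a, b      -> 20 bytes
}

for name, size in tuple_sizes.items():
    per_block = tuples_per_block(BLOCK_SIZE, BLOCK_HEADER, size)
    print(name, per_block, blocks_needed(10_000, per_block))
```

The projection onto a+b and c saves only 4 bytes per tuple, which is not enough to fit a ninth tuple in a block, so B(S) equals B(R).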
Estimating the Size of a Selection
▪ For S = σ_{A=c}(R): T(S) = T(R)/V(R,A).
▪ For S = σ_{a<10}(R): T(S) = T(R)/3.
▪ For S = σ_{a≠10}(R): T(S) = T(R), or more precisely T(R) - T(R)/V(R,a).
AND of Conditions
Multiply the selectivity factors of the individual conditions:
▪ inequality (<, >): 1/3
▪ ≠: 1
▪ A = c: 1/V(R,A)
For R(a,b,c), S = σ_{a=10 AND b<20}(R), T(R) = 10,000 and V(R,a) = 50: T(S) = T(R)/(50*3) ≈ 67.
If the condition is contradictory, for example S = σ_{a=10 AND a>10}(R), then T(S) = 0.
OR of Conditions
Suppose S = σ_{C1 OR C2}(R), where R has n tuples, m1 of which satisfy C1 and m2 of which satisfy C2. Two estimates:
1) T(S) = m1 + m2, the sum of the numbers of tuples satisfying each condition (an overestimate, so cap it at n);
2) T(S) = n(1 - (1 - m1/n)(1 - m2/n)), which treats C1 and C2 as independent.
For example: R(a,b), T(R) = 10,000, V(R,a) = 50 and S = σ_{a=10 OR b<20}(R). Then n = 10,000, m1 = T(R)/V(R,a) = 200 and m2 = T(R)/3 = 3333, so T(S) = 10,000(1 - (1 - 200/10,000)(1 - 3333/10,000)) = 3466.
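The AND and OR rules reduce to a few lines of arithmetic. The sketch below uses illustrative function names and assumes, as the rules do, that the conditions are independent.

```python
def and_size(t, *selectivities):
    """AND: multiply T(R) by the selectivity factor of every condition."""
    size = t
    for s in selectivities:
        size *= s
    return size

def or_size(n, m1, m2):
    """OR: n * (1 - (1 - m1/n) * (1 - m2/n)), assuming independence."""
    return n * (1 - (1 - m1 / n) * (1 - m2 / n))

T_R, V_R_a = 10_000, 50

# An equality on a (factor 1/V(R,a)) AND an inequality (factor 1/3).
print(round(and_size(T_R, 1 / V_R_a, 1 / 3)))

# a = 10 OR b < 20, with m1 = 200 and m2 = 3333 as in the example.
print(round(or_size(T_R, 200, 3333)))
```

The OR estimate is always at most n, unlike the simple sum m1 + m2.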
Estimating the Size of a Join 1.The equijoin can be handled as the natural join ; 2.The theta-join can be handled as a selection following a product.
For R(X,Y), S(Y,Z), Y is a single attribute , X and Z represent any set of attributes Two Simplifying Assumptions: –Containment of Value Sets : If V(R,Y)≤V(S,Y), then every Y-value of R will be a Y-value of S. –Preservation of Value Sets : If A is an attribute of R but not of S, Then V(R∞S,A)=V(R,A). Let V(R,Y)≤V(S,Y) , T(R∞S)= T(R)T(S)/V(S,Y); Let V(S,Y) ≤V(R,Y), T(R∞S)= T(R)T(S)/V(R,Y). In general, T(R∞S)=T(R)T(S)/max(V(R,Y),V(S,Y))
R(a,b), S(b,c), U(c,d)
T(R) = 1000, T(S) = 2000, T(U) = 5000
V(R,b) = 20, V(S,b) = 50, V(S,c) = 100, V(U,c) = 500
Compute the natural join R ∞ S ∞ U.
If (R ∞ S) ∞ U:
T(R ∞ S) = T(R)T(S)/max(V(R,b), V(S,b)) = 1000*2000/50 = 40,000
T((R ∞ S) ∞ U) = T(R ∞ S)T(U)/max(V(R ∞ S,c), V(U,c)) = 40,000*5000/max(100,500) = 400,000
If R ∞ (S ∞ U):
T(S ∞ U) = T(S)T(U)/max(V(S,c), V(U,c)) = 2000*5000/500 = 20,000
T(R ∞ (S ∞ U)) = T(S ∞ U)T(R)/max(V(S ∞ U,b), V(R,b)) = 20,000*1000/max(50,20) = 400,000
Both orders give the same estimate.
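A quick way to confirm that both join orders give the same estimate is to code the two-relation formula once and apply it twice. The function name `join_size` is illustrative.

```python
def join_size(t_r, t_s, v_r, v_s):
    """T(R join S) = T(R) * T(S) / max(V(R,Y), V(S,Y)) for one join attribute Y."""
    return t_r * t_s / max(v_r, v_s)

T = {"R": 1000, "S": 2000, "U": 5000}

# (R join S) join U: join on b first, then on c.
rs = join_size(T["R"], T["S"], 20, 50)
rs_u = join_size(rs, T["U"], 100, 500)   # V(R join S, c) = V(S, c) = 100

# R join (S join U): join on c first, then on b.
su = join_size(T["S"], T["U"], 100, 500)
r_su = join_size(T["R"], su, 20, 50)     # V(S join U, b) = V(S, b) = 50

print(rs, rs_u, su, r_su)
```

The V values passed for the intermediate results rely on the preservation-of-value-sets assumption.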
Natural Joins With Multiple Join Attributes
R(x,y1,y2) ∞ S(y1,y2,z), for a tuple r of R and a tuple s of S:
1. Probability that r and s agree on attribute y1: 1/max(V(R,y1), V(S,y1))
2. Probability that r and s agree on attribute y2: 1/max(V(R,y2), V(S,y2))
3. Probability that r and s agree on both y1 and y2: 1/(max(V(R,y1), V(S,y1)) max(V(R,y2), V(S,y2)))
4. T(R(x,y1,y2) ∞ S(y1,y2,z)) = T(R)T(S)/(max(V(R,y1), V(S,y1)) max(V(R,y2), V(S,y2)))
In general, T(R ∞ S) is T(R)T(S) divided by max(V(R,y), V(S,y)) for each attribute y that is common to R and S.
R(a,b,c) ∞ S(d,e,f) R.b=S.d AND R.c=S.e R(a,b,c) S(d,e,f) T(R)=1000 T(S)=2000 V(R,b)=20 V(S,d)=50 V(R,c)=100 V(S,e)=50 max(V(R,b), V(S,d))= 50, max(V(R,c),V(S,e))=100 T(R ∞S) =1000*2000/50/100=400
Compute R∞S∞U. R(a,b) S(b,c) U(c,d) T(R)=1000 T(S)=2000 T(U)=5000 V(R,b)=20 V(S,b)=50 V(S,c)=100 V(U,c)=500 (R∞U)∞S T(R∞U)=T(R)T(U)=1000*5000=5,000,000 max(V(R∞U,b),V(S,b))=max(20,50)=50 max(V(R∞U,c),V(S,c))=max(500,100)=500 T(R∞S∞U)=5,000,000*2000/50/500=400,000.
Join of Many Relations
S = R1 ∞ R2 ∞ ... ∞ Rn. Suppose the attribute A appears in k of the Ri's, and the values of V(Ri,A) for those relations, in order, are v1 ≤ v2 ≤ … ≤ vk.
Select a tuple t1 from the relation having v1 values of A. A tuple ti selected from the relation having vi values has probability 1/vi of agreeing with t1 on A. Taking i = 2, 3, …, k, the probability that all k tuples agree on A is 1/(v2 v3 … vk).
The rule for estimating the size of any join: start with the product of the numbers of tuples in each relation. Then, for each attribute A appearing at least twice, divide by all but the least of the V(R,A)'s.
For example, R(a,b,c) ∞ S(b,c,d) ∞ U(b,e)
T(R) = 1000, T(S) = 2000, T(U) = 5000
V(R,a) = 100, V(R,b) = 20, V(R,c) = 200
V(S,b) = 50, V(S,c) = 100, V(S,d) = 400
V(U,b) = 200, V(U,e) = 500
The resulting estimate is 1000*2000*5000/((50*200)*200) = 5000, where 50*200 is the divisor for b and 200 is the divisor for c.
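The general rule (multiply the sizes, then for each shared attribute divide by all but the smallest V) can be sketched as follows. The dictionary layout and the name `multiway_join_size` are assumptions made for illustration.

```python
from math import prod

def multiway_join_size(sizes, distinct):
    """sizes: {relation: T(R)}; distinct: {relation: {attribute: V(R, attribute)}}."""
    estimate = prod(sizes.values())
    values_per_attr = {}
    for attrs in distinct.values():
        for attr, v in attrs.items():
            values_per_attr.setdefault(attr, []).append(v)
    for vs in values_per_attr.values():
        if len(vs) >= 2:                      # attribute shared by 2+ relations
            estimate /= prod(sorted(vs)[1:])  # all but the least V
    return estimate

sizes = {"R": 1000, "S": 2000, "U": 5000}
distinct = {
    "R": {"a": 100, "b": 20, "c": 200},
    "S": {"b": 50, "c": 100, "d": 400},
    "U": {"b": 200, "e": 500},
}
print(multiway_join_size(sizes, distinct))
```

Because the rule never looks at a join order, it returns the same estimate however the joins are grouped.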
Estimating Sizes for Other Operations
Union:
▪ ∪_B: the sum of the sizes of the arguments;
▪ ∪_S: as large as the sum of the sizes, or as small as the larger of the two arguments.
Intersection:
1. as few as 0 tuples, or as many as the smaller of the two arguments;
2. can be treated as the extreme case of the natural join.
Difference R ― S: between T(R) and T(R) - T(S); take the average, [T(R) + (T(R) - T(S))]/2 = T(R) - T(S)/2.
Duplicate elimination δ(R), for R(a1, a2, …, an):
1. (1 + T(R))/2, the midpoint between 1 and T(R);
2. V(R,a1)*V(R,a2)*…*V(R,an), an upper bound on the number of distinct tuples.
Grouping and aggregation:
1. the product of the V(R,A)'s over the grouping attributes A;
2. [1 + T(R)]/2.
The Cost of a Plan Is Influenced by
▪ The chosen logical query plan
▪ The sizes of intermediate relations
▪ The physical operators used to implement logical operators
▪ The ordering of similar operations
▪ The method of passing arguments from one physical operator to the next
Obtaining Estimates for Size Parameters T(R), V( R, a) : Scanning R and counting B(R): Counting the actual number of blocks used
The most common types of histograms
1. Equal-width: a width w is chosen along with a constant v0, and counts are kept of the number of tuples with value v in the ranges v0 ≤ v < v0+w, v0+w ≤ v < v0+2w, and so on.
2. Equal-height: for some fraction p, list the lowest value, the value that is fraction p from the lowest, the value that is fraction 2p from the lowest, and so on, up to the highest value.
3. Most-frequent-values: list the most common values and their numbers of occurrences.
With histograms, the sizes of joins can be estimated more accurately.
For example, computing R(a,b) ∞ S(b,c):
R.b: 1: 200, 0: 150, 5: 100, others: 550
S.b: 0: 100, 1: 80, 2: 70, others: 250
Suppose V(R,b) = 14 and V(S,b) = 13. In R, apart from 1, 0 and 5, each of the other eleven values appears on average 550/11 = 50 times. In S, apart from 0, 1 and 2, each of the other ten values appears on average 250/10 = 25 times.
1. For b = 1 and b = 0: 200*80 + 150*100 = 31,000
2. For b = 2: 70*50 = 3,500
3. For b = 5: 100*25 = 2,500
4. For each of the nine other shared b-values: 50*25 = 1,250
5. Sum: 31,000 + 3,500 + 2,500 + 9*1,250 = 48,250
Estimated instead by the formula in Section 7.4: T(R)T(S)/max(V(R,b), V(S,b)) = 1000*500/14 = 35,714
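The histogram-based estimate can be reproduced with a short function. This is a sketch under two stated assumptions: a value listed in only one histogram is assumed to occur in the other relation with that relation's average "others" frequency, and containment of value sets fixes the number of shared values at min(V(R,b), V(S,b)). The name `histogram_join_size` is illustrative.

```python
def histogram_join_size(hist_r, others_r, v_r, hist_s, others_s, v_s):
    """Join-size estimate from most-frequent-value histograms on the join attribute.

    hist_*: {value: tuple count} for the listed values
    others_*: number of tuples having unlisted values
    v_*: number of distinct values of the join attribute
    """
    avg_r = others_r / (v_r - len(hist_r))   # average count of an unlisted value in R
    avg_s = others_s / (v_s - len(hist_s))   # average count of an unlisted value in S
    listed = set(hist_r) | set(hist_s)
    total = sum(hist_r.get(v, avg_r) * hist_s.get(v, avg_s) for v in listed)
    # Shared values that appear in neither histogram (containment assumption).
    unlisted_shared = min(v_r, v_s) - len(listed)
    return total + unlisted_shared * avg_r * avg_s

estimate = histogram_join_size(
    {1: 200, 0: 150, 5: 100}, 550, 14,
    {0: 100, 1: 80, 2: 70}, 250, 13,
)
print(estimate)
```

With these inputs the listed values contribute 16,000 + 15,000 + 3,500 + 2,500 and the nine unlisted shared values contribute 1,250 each.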
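Given relations Jan(day, temp) and July(day, temp):
SELECT Jan.day, July.day
FROM Jan, July
WHERE Jan.temp = July.temp;
[Table: histograms of temp for Jan and July, with columns Range, Jan, July.]
Only two temperature ranges hold tuples of both relations. Each range spans 10 values, so they contribute 10*5/10 = 5 and 5*20/10 = 10 tuples.
The size of the join: 5 + 10 = 15.
Computing without the histogram: 245*245/100 = 600.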
Incremental Computation of Statistics Maintaining T(R) by adding one every time a tuple is inserted and by subtracting one every time a tuple is deleted. Estimating T(R) by counting only the number of blocks in the B-tree. Maintaining V(R,a) by using an index on attribute a of relation R. If a is a key for R, V(R,a) = T( R ).
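Heuristics for Reducing the Cost of Logical Query Plans
To choose a suitable transformation, we need to estimate the cost both before and after the transformation.
For example: R(a,b), S(b,c)
T(R) = 5000, T(S) = 2000
V(R,a) = 50, V(R,b) = 100, V(S,b) = 200, V(S,c) = 100
Initial plan: δ(σ_{a=10}(R ∞ S))
Two candidate plans after pushing the selection down:
(a) left plan tree: δ(σ_{a=10}(R)) ∞ δ(S)
(b) right plan tree: δ(σ_{a=10}(R) ∞ S)
Size estimates:
▪ T(σ_{a=10}(R)) = T(R)/V(R,a) = 5000/50 = 100
▪ T(δ(σ_{a=10}(R))) = 100/2 = 50
▪ T(δ(S)) = 2000/2 = 1000
▪ join in plan (a): 50*1000/200 = 250
▪ join in plan (b): 100*2000/200 = 1000
Cost as the sum of the intermediate sizes, excluding the stored relations and the final result:
For the left plan tree: 100 + 50 + 1000 = 1150
For the right plan tree: 100 + 1000 = 1100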
Approaches to Enumerating Physical Plans Exhaustive Top-down Bottom-up
Heuristic Selection
▪ For σ_{A=c}(R), if there is an index on attribute A of R, perform an index scan.
▪ If the selection includes other conditions as well, follow the index scan with a further selection, called a filter.
▪ If there is an index on the join attributes, perform an index-join.
▪ If one argument of a join is sorted on the join attributes, perform a sort-join.
▪ When computing the union or intersection of three or more relations, group the smallest relations first.
Branch-and-Bound Plan Enumeration
Use heuristics to find a good physical plan, with cost C, and then explore the space of physical query plans.
▪ Eliminate any plan that contains a subquery with cost greater than C.
▪ Whenever a complete plan with cost less than C is found, replace the current plan with it and continue the search with the lower bound.
Hill Climbing Use heuristics to find a good physical plan. Make small changes to the plan to find “nearby” plans that have lower cost by (1) replacing one method for an operator by another. (2) reordering joins by using the associative and/or commutative laws.
Dynamic Programming Variation of the general bottom-up strategy Keep for each subexpression only the plan of least cost. Only the best plan for each subexpression is considered during constructing the plans for a larger subexpression.
Selinger-Style Optimization Keep for each subexpression not only the plan of least cost, but certain other plans that have higher cost but produce a result that is sorted in an order that may be useful higher up in the expression tree. Produce optimal overall plans from plans that are not optimal for certain subexpressions.
Choosing an Order for Joins Selecting an order for the (natural) join of three or more relations. The same ideas can be applied to other binary operations like union or intersection.
Significance of Left and Right Join Arguments
▪ One-pass join: the left argument is held in main memory while the right argument is read a block at a time.
▪ Nested-loop join: the left argument is the relation of the outer loop.
▪ Index-join: the right argument has the index.
Join Trees
SELECT title
FROM StarsIn, MovieStar
WHERE starName = name AND birthdate LIKE '%1960';
Two join trees for this query, differing in the order of the join arguments:
▪ П(StarsIn ∞_{starName=name} σ_{birthdate LIKE '%1960'}(MovieStar))
▪ П(σ_{birthdate LIKE '%1960'}(MovieStar) ∞_{starName=name} StarsIn)
Ways to join four relations
When the join involves more than two relations, the number of possible join trees grows rapidly.
(a) left-deep tree: ((R ∞ S) ∞ T) ∞ U
(b) bushy tree: (R ∞ S) ∞ (T ∞ U)
(c) right-deep tree: R ∞ (S ∞ (T ∞ U))
Each tree shape represents 4! = 24 different trees when the possible labelings of the leaves are considered.
Left-Deep Join Trees Only considering left-deep join trees has the following advantages Limit the search space Interact well with common join algorithms
1. For n relations, there is only one left-deep tree shape, to which we may assign the relations in n! ways.
2. The total number of tree shapes T(n): T(1) = 1, T(n) = ∑_{i=1}^{n-1} T(i)T(n-i).
3. The total number of trees: T(n) × n!.
Given 6 relations, T(6) × 6! = 42 × 6! = 30,240.
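The recurrence for T(n) is easy to evaluate directly, which makes the gap between left-deep trees and all trees concrete. The function names below are illustrative.

```python
from functools import lru_cache
from math import factorial

@lru_cache(maxsize=None)
def tree_shapes(n):
    """T(1) = 1; T(n) = sum over i = 1..n-1 of T(i) * T(n - i)."""
    if n == 1:
        return 1
    return sum(tree_shapes(i) * tree_shapes(n - i) for i in range(1, n))

def all_join_trees(n):
    """Every tree shape with every assignment of the n relations to its leaves."""
    return tree_shapes(n) * factorial(n)

def left_deep_trees(n):
    """One left-deep shape, n! leaf labelings."""
    return factorial(n)

for n in range(2, 7):
    print(n, tree_shapes(n), left_deep_trees(n), all_join_trees(n))
```

For six relations the search space shrinks from 30,240 trees to 720 when only left-deep trees are considered.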
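With one-pass joins, the left-deep tree ((R ∞ S) ∞ T) ∞ U needs B(R) + B(R ∞ S) main-memory buffers at first and then B(R ∞ S) + B((R ∞ S) ∞ T), while the right-deep tree R ∞ (S ∞ (T ∞ U)) needs B(R) + B(S) + B(T).
It is possible that B(R) + B(S) + B(T) < B(R) + B(R ∞ S) or B(R ∞ S) + B((R ∞ S) ∞ T).
If R is small, however, we expect B(R ∞ S) < B(T) and B((R ∞ S) ∞ T) < B(U).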
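[Figure: the left-deep tree ((R ∞ S) ∞ T) ∞ U and the right-deep tree R ∞ (S ∞ (T ∞ U)).]
For the right-deep tree with nested-loop joins, we need to construct S ∞ (T ∞ U), and within it T ∞ U, repeatedly. If we store these intermediate results on disk instead, we use extra disk I/O's.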
Dynamic Programming to Select a Join Order and Grouping Three choices to pick an order for the join of many relations. Consider them all Consider a subset Use a heuristic to pick one
A table constructed by dynamic programming algorithm 1.The estimated size of the join of these relations. 2.The least cost of computing the join of these relations. 3.The expression that yields the least cost.
Consider the join of four relations R, S, T and U, each with 1000 tuples:
R(a,b): V(R,a) = 100, V(R,b) = 200
S(b,c): V(S,b) = 100, V(S,c) = 500
T(c,d): V(T,c) = 20, V(T,d) = 50
U(d,a): V(U,a) = 50, V(U,d) = 1000
Table for singleton sets:
            {R}    {S}    {T}    {U}
Size        1000   1000   1000   1000
Cost        0      0      0      0
Best plan   R      S      T      U
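Table for pairs of relations:
            {R,S}   {R,T}   {R,U}    {S,T}   {S,U}   {T,U}
Size        5000    1M      10,000   2000    1M      1000
Cost        0       0       0        0       0       0
Best plan   R∞S     R∞T     R∞U      S∞T     S∞U     T∞U
For example, T(R ∞ S) = T(R)T(S)/max(V(R,b), V(S,b)) = 1000*1000/200 = 5000
Table for triples of relations:
            {R,S,T}    {R,S,U}    {R,T,U}    {S,T,U}
Size        10,000     50,000     10,000     2000
Cost        2000       5000       1000       1000
Best plan   (S∞T)∞R    (R∞S)∞U    (T∞U)∞R    (T∞U)∞S
For example, T((S ∞ T) ∞ R) = T(S ∞ T)T(R)/max(V(S,b), V(R,b)) = 2000*1000/200 = 10,000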
Join groupings and their costs
Grouping           Cost
((S∞T)∞R)∞U        12,000
((R∞S)∞U)∞T        55,000
((T∞U)∞R)∞S        11,000
((T∞U)∞S)∞R        3,000
(T∞U)∞(R∞S)        6,000
(R∞T)∞(S∞U)        2,000,000
(S∞T)∞(R∞U)        12,000
For ((S∞T)∞R)∞U: B((S∞T)∞R) + B(S∞T) = 10,000 + 2000 = 12,000
For (T∞U)∞(R∞S): B(T∞U) + B(R∞S) = 1000 + 5000 = 6000
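The table-filling procedure can be sketched in Python. This is an illustration under the example's simple cost model (cost = sum of the sizes of intermediate relations, sizes from the divide-by-all-but-the-least rule); `join_size` and `best_plans` are hypothetical names, and both left-deep and bushy groupings are considered.

```python
from itertools import combinations
from math import prod

SIZES = {"R": 1000, "S": 1000, "T": 1000, "U": 1000}
DISTINCT = {
    "R": {"a": 100, "b": 200},
    "S": {"b": 100, "c": 500},
    "T": {"c": 20, "d": 50},
    "U": {"d": 1000, "a": 50},
}

def join_size(rels):
    """Product of the sizes; per attribute, divide by all but the least V."""
    size = prod(SIZES[r] for r in rels)
    values = {}
    for r in rels:
        for attr, v in DISTINCT[r].items():
            values.setdefault(attr, []).append(v)
    for vs in values.values():
        size //= prod(sorted(vs)[1:])   # empty product is 1 for unshared attributes
    return size

def best_plans(relations):
    """Map each subset of relations to (least cost, expression achieving it)."""
    table = {frozenset([r]): (0, r) for r in relations}
    for k in range(2, len(relations) + 1):
        for subset in map(frozenset, combinations(relations, k)):
            best = None
            for i in range(1, k):
                for left in map(frozenset, combinations(sorted(subset), i)):
                    right = subset - left
                    # Cost of both sides plus the size of each side that is
                    # itself an intermediate (non-stored) relation.
                    cost = table[left][0] + table[right][0]
                    cost += sum(join_size(p) for p in (left, right) if len(p) > 1)
                    if best is None or cost < best[0]:
                        best = (cost, f"({table[left][1]} JOIN {table[right][1]})")
            table[subset] = best
    return table

table = best_plans(["R", "S", "T", "U"])
print(table[frozenset("RSTU")])
```

The least cost found is 3000, the cost of joining T and U first, then S, then R. The printed expression may list the join arguments in a different left/right order than the slides, since this cost model does not distinguish the two.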
Dynamic Programming With More Detailed Cost Functions Use Disk I/O as the cost measure Compute the cost of R1 ∞ R2 by summing the cost of R1, the cost of R2, and the least cost of joining these two relations. Dynamic programming based on the Selinger-style optimization.
A Greedy Algorithm for Selecting a Join Order BASIS: Start with the pair of relations whose estimated join size is the smallest. The join of these relations becomes the current tree. INDUCTION: Find, among all those relations not yet included in the current tree, the relation that, when joined with the current tree, yields the relation of the smallest estimated size. The new current tree has the old current tree as its left argument and the selected relation as its right argument.
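Example
            {R,S}   {R,T}   {R,U}    {S,T}   {S,U}   {T,U}
Size        5000    1M      10,000   2000    1M      1000
Best plan   R∞S     R∞T     R∞U      S∞T     S∞U     T∞U
The smallest pair is T∞U (size 1000), so it becomes the current tree.
            {R,T,U}    {S,T,U}
Size        10,000     2000
Cost        1000       1000
Best plan   (T∞U)∞R    (T∞U)∞S
Joining S next gives the smaller result, so the greedy order is ((T∞U)∞S)∞R.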
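The basis and induction steps of the greedy algorithm translate almost directly into code. The sketch below reuses the statistics of the four-relation example; `join_size` and `greedy_order` are illustrative names.

```python
from itertools import combinations
from math import prod

SIZES = {"R": 1000, "S": 1000, "T": 1000, "U": 1000}
DISTINCT = {
    "R": {"a": 100, "b": 200},
    "S": {"b": 100, "c": 500},
    "T": {"c": 20, "d": 50},
    "U": {"d": 1000, "a": 50},
}

def join_size(rels):
    """Product of the sizes; per attribute, divide by all but the least V."""
    size = prod(SIZES[r] for r in rels)
    values = {}
    for r in rels:
        for attr, v in DISTINCT[r].items():
            values.setdefault(attr, []).append(v)
    for vs in values.values():
        size //= prod(sorted(vs)[1:])
    return size

def greedy_order(relations):
    """Return the relations in the order a left-deep greedy plan joins them."""
    remaining = set(relations)
    # BASIS: the pair with the smallest estimated join size.
    first = min(combinations(sorted(remaining), 2), key=join_size)
    order = list(first)
    remaining -= set(first)
    # INDUCTION: add the relation that keeps the next result smallest.
    while remaining:
        nxt = min(sorted(remaining), key=lambda r: join_size(order + [r]))
        order.append(nxt)
        remaining.remove(nxt)
    return order

print(greedy_order(["R", "S", "T", "U"]))
```

Here the greedy choice happens to coincide with the grouping of least cost, but in general the greedy algorithm gives no such guarantee.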
Completing the Physical-Query- Plan Selection 1.Selection of algorithms to implement the operations of the query plan. 2.Decision regarding when intermediate results will be materialized and when they will be pipelined 3.Notation for physical-query-plan operators.
Choosing a Selection Method
For a selection σ_C(R), consider the attributes that:
1. have an index, and
2. are compared to a constant in one of the terms of the selection.
An index-based plan then proceeds as follows:
1. Use one comparison of the form A θ c, where A has an index.
2. Retrieve all tuples that satisfy the comparison from (1).
3. Consider each tuple selected in (2) to decide whether it satisfies the rest of the selection conditions.
Costs for the Various Algorithms
1. The table-scan algorithm: (a) B(R) if R is clustered; (b) T(R) if R is not clustered.
2. The algorithm that picks an equality term: (a) B(R)/V(R,a) if the index is clustering; (b) T(R)/V(R,a) if the index is not clustering.
3. The algorithm that picks an inequality term: (a) B(R)/3 if the index is clustering; (b) T(R)/3 if the index is not clustering.
Example: for R(x,y,z), σ_{x=1 AND y=2 AND z<5}(R), with T(R) = 5000, B(R) = 200, V(R,x) = 100 and V(R,y) = 500. R is clustered, and only the index on z is clustering.
1. Table-scan: B(R) = 200;
2. For x = 1: T(R)/V(R,x) = 5000/100 = 50;
3. For y = 2: T(R)/V(R,y) = 5000/500 = 10;
4. For z < 5: B(R)/3 = 200/3 = 67.
The least-cost choice is the index on y, followed by a filter on the remaining conditions.
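The comparison of access methods for this selection can be written out as a small cost table. This is a sketch with illustrative names; it uses V(R,x) = 100 as given, so the index on x costs 5000/100 = 50 disk I/O's.

```python
def index_cost(t, b, clustering, v=None):
    """Disk I/O estimate for an index-based selection.

    Equality term (v given): B/V if the index is clustering, else T/V.
    Inequality term (v is None): B/3 if the index is clustering, else T/3.
    """
    base = b if clustering else t
    return base / v if v is not None else base / 3

T_R, B_R = 5000, 200
costs = {
    "table-scan": B_R,                                          # R is clustered
    "index on x": index_cost(T_R, B_R, clustering=False, v=100),
    "index on y": index_cost(T_R, B_R, clustering=False, v=500),
    "index on z": index_cost(T_R, B_R, clustering=True),        # z < 5
}
best = min(costs, key=costs.get)
print(costs, best)
```

The cheapest access path is the index on y, after which the remaining conditions x = 1 and z < 5 are applied as a filter.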
Choosing a Join Method
▪ One-pass join, if there are enough buffers for the join.
▪ Sort-join, when either (1) one or both arguments are already sorted on their join attributes, or (2) there are two or more joins on the same attributes.
▪ Index-join, if there is an index on the join attributes.
▪ Hash-join, if none of the above conditions is satisfied.
Pipelining Versus Materialization
▪ Pipelining: the tuples produced by one operation are passed directly to the operation that uses them, without ever storing the intermediate tuples on disk.
▪ Materialization: the result of each operation is stored on disk until it is needed by another operation.
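Pipelining Unary Operations
Implementation by iterators:
▪ Projection: to produce one tuple, call GetNext() on the input once.
▪ Selection σ_C: call GetNext() on the input several times, until one tuple that satisfies condition C is found.
[Figure: the consumer calls GetNext() on the selection; the selection calls GetNext() on its input, tests each tuple for C, and returns a tuple that satisfies C to the consumer.]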
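Python generators behave much like the iterators described here: each `next()` call on a generator plays the role of GetNext(). The sketch below is only an analogy, with made-up sample rows and illustrative operator names.

```python
def table_scan(rows):
    """Leaf operator: yields the stored tuples one at a time."""
    for row in rows:
        yield row

def select(condition, child):
    """Pulls tuples from its input until one satisfies the condition."""
    for row in child:
        if condition(row):
            yield row

def project(attrs, child):
    """Pulls exactly one input tuple for every output tuple."""
    for row in child:
        yield {a: row[a] for a in attrs}

# Made-up sample data for StarsIn(title, year, starName).
stars_in = [
    {"title": "Movie A", "year": 1995, "starName": "Star 1"},
    {"title": "Movie B", "year": 1996, "starName": "Star 2"},
    {"title": "Movie C", "year": 1996, "starName": "Star 3"},
]

plan = project(["starName"],
               select(lambda r: r["year"] == 1996, table_scan(stars_in)))

first = next(plan)   # the selection skips the 1995 tuple before returning
rest = list(plan)
print(first, rest)
```

No intermediate result is ever stored: each tuple flows from the scan through the selection and the projection before the next one is read.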
Pipelining Binary Operations
Use one buffer to pass the result to its consumer.
Example: (R(w,x) ∞ S(x,y)) ∞ U(y,z), with M = 101 buffers, B(R) = 5000, B(S) = 10,000, B(U) = 10,000 and B(R ∞ S) = k.
1. R ∞ S: use the two-pass hash-join, which needs 3(B(R) + B(S)) = 45,000 disk I/O's. To limit the buckets of R to 100 blocks each, we need at least 50 buckets.
2. If k ≤ 49, use a one-pass hash-join for the second join, which needs B(U) = 10,000 disk I/O's to read U. The total is 55,000 disk I/O's.
If 49 < k ≤ 5000, use a two-pass hash-join for the join with U:
1. Before starting R ∞ S, hash U into 50 buckets of 200 blocks each. This needs 10,000 (read U) + 10,000 (write the buckets back to disk) = 20,000 disk I/O's;
2. Perform the two-pass hash-join R ∞ S using 51 buffers as before, which needs 45,000 disk I/O's. Hash each result tuple into the corresponding one of 50 buckets, which needs k disk I/O's;
3. Join R ∞ S with U bucket by bucket, which needs k + 10,000 disk I/O's to read R ∞ S and U. (Because k ≤ 5000, the buckets of R ∞ S are of size at most 5000/50 = 100 = M - 1.)
The total cost is 75,000 + 2k disk I/O's.
If k > 5000, we cannot perform a two-pass join in the 50 remaining buffers. We use the following algorithm:
1. Use a two-pass hash-join for R ∞ S, which needs 45,000 disk I/O's. Store the result on disk, which needs k disk I/O's;
2. Use a two-pass hash-join with 100 buckets for (R ∞ S) ∞ U, which needs 30,000 + 3k disk I/O's.
The total cost is 75,000 + 4k disk I/O's.
Range of k       Pipeline or materialize   Algorithm for final join   Total disk I/O's
k ≤ 49           Pipeline                  One-pass                   55,000
49 < k ≤ 5000    Pipeline                  50-bucket, two-pass        75,000 + 2k
5000 < k         Materialize               100-bucket, two-pass       75,000 + 4k
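The three cases can be folded into one piecewise cost function, which makes it easy to see how the total grows with k. The function name `total_io` and its default arguments are illustrative; the thresholds 49 and 5000 are those of the example with M = 101.

```python
def total_io(k, b_r=5000, b_s=10_000, b_u=10_000):
    """Total disk I/O's for (R join S) join U, where k = B(R join S)."""
    first_join = 3 * (b_r + b_s)          # two-pass hash-join of R and S: 45,000
    if k <= 49:
        # Pipeline into a one-pass join; only U has to be read.
        return first_join + b_u
    if k <= 5000:
        # Pre-hash U (read + write), write the k result blocks into buckets,
        # then read both bucket sets back for the bucket-by-bucket join.
        return 2 * b_u + first_join + k + (k + b_u)
    # Materialize the result (k writes), then run a full two-pass hash-join.
    return first_join + k + 3 * (k + b_u)

for k in (49, 50, 5000, 5001, 20_000):
    print(k, total_io(k))
```

The jump between k = 49 and k = 50 comes from having to partition U, and the slope doubles past k = 5000 because the intermediate result is written and read twice.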
Notation for Physical Query Plans Each operator of the logical plan becomes one or more operators of the physical plan Leaves (stored relations) of the logical plan become one of the scan operators applied to that relation. Materialization would be indicated by a Store operator applied to the intermediate result.
Operators for leaves Each relation R that is a leaf operand of the logical- query-plan tree will be replaced by a scan operator : 1.TableScan(R) : All blocks holding tuples of R are read in arbitrary order. 2.SortScan(R,L): Tuples of R are read in order, sorted according to the attribute(s) on List L 3.IndexScan(R,C): C is a condition of the form Aθc , Tuples of R are accessed through an index on attribute A. 4.IndexScan(R,A) : A is an attribute of R. The entire relation R is retrieved via an index on R.A.
Physical Operators for Selection 1.Replace σc(R) with Filter (C) If R is intermediate relation , no other operator besides Filter is needed. If R is a stored or materialized relation , TableScan, SortScan(L) are used to access R. 2. If condition C can be expressed as Aθc AND D , there is an index on R.A , then a) Use the operator IndexScan(R, Aθc ) to access R ; b) Use Filter(D) in place of the selectionσc(R).
Physical Sort Operators : –Introduce SortScan(R,L) which reads a stored relation R, and produces it sorted according to the list of attributes L 。 Other Relational-Algebra Operations: Replaced by a suitable physical operator : –The operation being performed ; –Necessary parameters ; –A general strategy for the algorithm: sort-based, hash-based, or in some joins, index-based ; –A decision about the number of passes to be used ; –An anticipated number of buffers the operation will require
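A physical plan from Example 7.38
[Figure: two physical plans. For k ≤ 49: a two-pass hash-join with 101 buffers over TableScan(R) and TableScan(S), pipelined into a one-pass hash-join with 50 buffers against TableScan(U). For k > 5000: a two-pass hash-join with 101 buffers over TableScan(R) and TableScan(S), whose result is materialized and then joined with TableScan(U) by a two-pass hash-join with 101 buffers.]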
Annotating a selection to use the most appropriate index
Example: for R(x,y,z), σ_{x=1 AND y=2 AND z<5}(R) becomes Filter(x=1 AND z<5) applied to IndexScan(R, y=2).
Ordering of Physical Operations
1. Break the tree into subtrees at each edge that represents materialization.
2. Order the execution of the subtrees in a bottom-up, left-to-right manner.
3. Execute all nodes of each subtree using a network of iterators.
Exercises Ex 7.1.3, Ex (b), (c), (d) Ex (c), Ex Ex (c), (d), (e), Ex Ex 7.6.1, Ex (b), (c)