Published byEustacia McDaniel Modified over 8 years ago
1
QUERY PROCESSING AND OPTIMIZATION
2
Overview SQL is a declarative language: Specifies the outcome of the computation without defining any flow of control Will require DBMS to select an execution plan Will allow optimizations
3
Sample query SELECT C, E FROM R, S WHERE R.B = "z" AND S.F = 30 AND R.A = S.D
4
The two tables
R:
A  B  C
1  x  100
2  y  200
3  z  300
4  y  400
S:
D  E  F
1  u  10
3  v  30
5  w  30
5
First execution plan
Can use relational algebra to express an execution plan
Could be:
  Cartesian product: R×S
  Selection: σ R.B = "z" ∧ S.F = 30 ∧ R.A = S.D (R×S)
  Projection: π C, E (σ R.B = "z" ∧ S.F = 30 ∧ R.A = S.D (R×S))
6
Graphical representation (operator tree, evaluated bottom-up): R×S, then σ R.B = "z" ∧ S.F = 30 ∧ R.A = S.D, then π C, E
7
R×S
A  B  C    D  E  F
1  x  100  1  u  10
1  x  100  3  v  30
1  x  100  5  w  30
2  y  200  1  u  10
2  y  200  3  v  30
2  y  200  5  w  30
3  z  300  1  u  10
3  z  300  3  v  30
3  z  300  5  w  30
4  y  400  1  u  10
4  y  400  3  v  30
4  y  400  5  w  30
8
Second execution plan
Selection: σ B = "z" (R) and σ F = 30 (S)
Join: σ B = "z" (R) ⋈ R.A = S.D σ F = 30 (S)
Projection: π C, E (…)
10
After the selections
σ B = "z" (R):
A  B  C
3  z  300
σ F = 30 (S):
D  E  F
3  v  30
5  w  30
11
σ B = "z" (R) ⋈ R.A = S.D σ F = 30 (S)
A  B  C    D  E  F
3  z  300  3  v  30
12
Discussion
Second plan
  First extracts the relevant rows of tables R and S
  Uses a more efficient join:
    for each row r in σ B = "z" (R) :
        for each row s in σ F = 30 (S) :
            if r.A = s.D :
                include the pair of rows in the result
Note that the inner loop searches the small temporary table σ F = 30 (S) instead of the whole table S
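The two plans can be compared on the sample tables. The sketch below is only an illustration: the tables are plain Python lists, the last row of S is taken as (5, "w", 30) as it appears in the R×S slide, and the function names `plan_one` and `plan_two` are made up for this example.

```python
# Sample tables from the slides: R(A, B, C) and S(D, E, F).
R = [(1, "x", 100), (2, "y", 200), (3, "z", 300), (4, "y", 400)]
S = [(1, "u", 10), (3, "v", 30), (5, "w", 30)]

def plan_one(R, S):
    """Cartesian product first, then selection, then projection on (C, E)."""
    product = [r + s for r in R for s in S]
    selected = [t for t in product
                if t[1] == "z" and t[5] == 30 and t[0] == t[3]]
    return [(t[2], t[4]) for t in selected], len(product)

def plan_two(R, S):
    """Selections first, then a nested-loop join of the two small results."""
    r_sel = [r for r in R if r[1] == "z"]
    s_sel = [s for s in S if s[2] == 30]
    joined = [r + s for r in r_sel for s in s_sel if r[0] == s[0]]
    return [(t[2], t[4]) for t in joined], len(r_sel) * len(s_sel)

result_one, pairs_one = plan_one(R, S)   # examines 12 row pairs
result_two, pairs_two = plan_two(R, S)   # examines 2 row pairs
print(result_one, pairs_one)
print(result_two, pairs_two)
```

Both plans return the same answer; only the number of row pairs that have to be examined differs.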
13
More generally Exclude as quickly as possible: Irrelevant rows Irrelevant attributes Most important when the involved tables reside on different hosts (Distributed DBMS) Whenever possible, ensure that inner join loops search tables that can reside in main memory
14
Caching considerations
Cannot rely on LRU to achieve that
  Will keep in memory recently accessed pages of all tables
Must keep
  All pages of the table inside the inner loop
  No pages of the other table
Can either
  Let the DBMS manage the cache
  Use a scan-tolerant cache algorithm (ARC)
15
A third plan
Find the rows of R where B = "z"
Using an index on S.D, find the rows of S where S.D matches R.A for the rows where R.B = "z"
Include each pair of rows in the join
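A minimal sketch of the index-based plan, assuming the index on S.D can be imitated by an in-memory dictionary (a real DBMS would use a B-tree or hash index on disk). The name `index_on_d` is hypothetical, and the last row of S is again taken as (5, "w", 30).

```python
from collections import defaultdict

R = [(1, "x", 100), (2, "y", 200), (3, "z", 300), (4, "y", 400)]  # (A, B, C)
S = [(1, "u", 10), (3, "v", 30), (5, "w", 30)]                    # (D, E, F)

# Stand-in for an index on S.D: maps each value of D to the rows holding it.
index_on_d = defaultdict(list)
for s in S:
    index_on_d[s[0]].append(s)

pairs = []
for r in R:
    if r[1] != "z":          # keep only the rows of R where B = "z"
        continue
    # Index lookup replaces a scan of S.
    for s in index_on_d.get(r[0], []):
        pairs.append(r + s)

print(pairs)
```

The remaining condition S.F = 30 can then be checked on the few matched pairs.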
16
Processing a query (I) Parse the query Convert query parse tree into a logical query plan (LQP) Apply equivalence rules (laws) and try to improve upon extant LQP Estimate result sizes
17
Processing a query (II) Consider possible physical plans Estimate their cost Select the best Execute it Given the high cost of query processing, it makes sense to evaluate various alternatives
18
Example (from [GM])
SELECT title
FROM StarsIn
WHERE starName IN (
    SELECT name
    FROM MovieStar
    WHERE birthdate LIKE '%1960'
);
19
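Relational algebra plan (operator tree): π title at the root, above a two-argument σ whose first argument is StarsIn and whose second argument is the condition starName IN π name (σ birthdate LIKE '%1960' (MovieStar))
Fig. 7.15: An expression using a two-argument σ, midway between a parse tree and relational algebra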
22
Logical query plan
π title (σ starName = name (StarsIn × π name (σ birthdate LIKE '%1960' (MovieStar))))
Fig. 7.18: Applying the rule for IN conditions
The Cartesian product could indicate a brute-force solution
23
Estimating result sizes Need the expected size of each intermediate result built from StarsIn and MovieStar
24
Estimate the costs of each option
Logical query plan → candidate physical plans P1, P2, …, Pn with estimated costs C1, C2, …, Cn
Pick the best!
25
Query optimization At two levels Relational algebra level: Use equivalence rules Detailed query plan level: Takes into account result sizes Considers DB organization How it is stored Presence and types of indexes, …
26
Result sizes do matter Consider the Cartesian product Very costly when its two operands are large tables Less true when the tables are small
27
Equivalence rules for joins R⋈S = S⋈R (R⋈S)⋈T = R⋈(S⋈T) Column order does not matter because the columns have labels
28
Rules for product and union
Equivalence rules for Cartesian product:
  R × S = S × R
  (R × S) × T = R × (S × T)
Equivalence rules for union:
  R ∪ S = S ∪ R
  (R ∪ S) ∪ T = R ∪ (S ∪ T)
Column order does not matter because the columns have labels
29
Rules for selections and unions
Equivalence rules for selection:
  σ p1 ∧ p2 (R) = σ p1 (σ p2 (R))
  σ p1 ∨ p2 (R) = σ p1 (R) ∪ σ p2 (R)
Equivalence rules for union:
  R ∪ S = S ∪ R
  (R ∪ S) ∪ T = R ∪ (S ∪ T)
30
Combining selections and joins
If predicate p only involves attributes of R not used in the join
  σ p (R⋈S) = σ p (R)⋈S
If predicate q only involves attributes of S not used in the join
  σ q (R⋈S) = R⋈σ q (S)
Warning: π p1, p2 (R) is NOT the same as π p1 (π p2 (R))
31
Combining selection and joins
σ p ∧ q (R⋈S) = σ p (R)⋈σ q (S)
σ p ∧ q ∧ m (R⋈S) = σ m [σ p (R)⋈σ q (S)]
σ p ∨ q (R⋈S) = [σ p (R)⋈S] ∪ [R⋈σ q (S)]
32
Combining projections and selections Let x be a subset of R attributes z the set of attributes of R used in predicate p then π x [σ p (R)] = π x [σ p [π xz (R)]] We can only eliminate attributes that are not used in the selection predicate!
33
Combining projections and joins
Let x be a subset of R attributes, y a subset of S attributes, z the common attributes of R and S
then π xy (R⋈S) = π xy {[π xz (R)]⋈[π yz (S)]}
34
Combining projections, selections and joins
Let x, y, z be... and z' the union of z and the attributes used in predicate p
π xy {σ p (R⋈S)} = π xy {σ p {[π xz' (R)]⋈[π yz' (S)]}}
35
Combining selections, projections and Cartesian product Rules are similar Just replace join operator by Cartesian product operator Keep in mind that join is a restricted Cartesian product
36
Combining selections and unions
σ p (R ∪ S) = σ p (R) ∪ σ p (S)
σ p (R − S) = σ p (R) − S = σ p (R) − σ p (S)
37
Finding the most promising transformations
σ p1 ∧ p2 (R) → σ p1 [σ p2 (R)]   Use successive selections
σ p (R ⋈ S) → [σ p (R)] ⋈ S   Do selections before joins
R ⋈ S → S ⋈ R
π x [σ p (R)] → π x {σ p [π xz (R)]}   Do projections before selection
38
First heuristics
Do projections early
Example from [GM]: Given R(A,B,C,D,E) and the select predicate P: (A=3) ∧ (B="cat")
Seems a good idea to replace π x {σ p (R)} by π E {σ p {π ABE (R)}}
What if we have indexes?
39
Same example with indexes
Assume attribute A is indexed
  Use index to locate all tuples where A = 3
  Select tuples where B="cat"
  Do the projections
In other words π x {σ p (R)} is the best solution
40
Second heuristics Do selections early Especially if we can use indexes but no heuristic is always right
41
Estimating cost of query plans Requires Estimating sizes of the results Estimating the number of I/O operations We will generally assume that the cost of a query plan is dominated by the number of tuples being read or written
42
Estimating result sizes Relevant data are T(R) : number of tuples in R S(R) : size of each tuple of R (in bytes) B(R): number of blocks required to store R V(R, A) : number of distinct values for attribute A in R
43
Example
Relation R
Owner  Pet  Vax date
Alice  Cat  3/2/15
Alice  Cat  3/2/15
Bob    Dog  10/8/14
Bob    Dog  10/8/15
Carol  Dog  11/9/14
Carol  Cat  12/7/14
T(R) = 6
Assuming dates take 8 bytes and strings 20 bytes, S(R) = 48 bytes
B(R) = 1 block
V(R, Owner) = 3, V(R, Pet) = 2, V(R, Vax date) = 4
44
Estimating cost of W = R 1 x R 2 T(W) = T(R 1 )×T(R 2 ) S(W) = S(R 1 )+S(R 2 ) Obvious!
45
Estimating cost of W = σ A=a (R)
S(W) = S(R)
T(W) = T(R)/V(R, A)
but this assumes that the values of A are uniformly distributed over all the tuples
46
Example
W = σ owner = Bob (R)
As T(R) = 6 and V(R, Owner) = 3, T(W) = 6/3 = 2
Owner  Pet  Vax date
Alice  Cat  3/2/15
Alice  Cat  3/2/15
Bob    Dog  10/8/14
Bob    Dog  10/8/15
Carol  Dog  11/9/14
Carol  Cat  12/7/14
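The estimate T(W) = T(R)/V(R, A) can be checked against the pet table. This is a small sketch only; the helper names `T`, `V` and `estimate_equality_selection` are chosen here to mirror the slide notation.

```python
# The example relation R(Owner, Pet, Vax date) as a list of tuples.
pets = [
    ("Alice", "Cat", "3/2/15"),
    ("Alice", "Cat", "3/2/15"),
    ("Bob", "Dog", "10/8/14"),
    ("Bob", "Dog", "10/8/15"),
    ("Carol", "Dog", "11/9/14"),
    ("Carol", "Cat", "12/7/14"),
]

def T(rel):
    """Number of tuples in the relation."""
    return len(rel)

def V(rel, col):
    """Number of distinct values in column number col."""
    return len({row[col] for row in rel})

def estimate_equality_selection(rel, col):
    """Estimated size of an equality selection, assuming uniform values."""
    return T(rel) / V(rel, col)

estimate = estimate_equality_selection(pets, 0)          # 6 / 3
actual = sum(1 for row in pets if row[0] == "Bob")
print(estimate, actual)
```

Here the estimate and the actual count agree because each owner has exactly two rows.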
47
Making another assumption Assume now that values in select expression Z = val are uniformly distributed over all possible V(R, Z) values. If W = σ Z=val (R) T(W) = T(R)/V(R, Z)
48
Estimating sizes of range queries
Attribute Z of table R has 50 possible values ranging from 1 to 100
If W = σ Z > 80 (R), what is T(W)?
Assuming the values in Z are uniformly distributed over [1, 100]
T(W) = T(R)×(100 – 80)/(100 – 1 + 1) = 0.2×T(R)
49
Explanation T(W) = T(R)×(Query_Range/Value_Range) If query had been W = σ Z ≥ 80 T(W) would have been T(R)×(100 – 80 + 1)/(100 – 1 +1) = 0.21×T(R) 21 possible values
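The range rule can be written as a short function. The sketch assumes integer-valued attributes with inclusive bounds, and T(R) = 1,000 is a made-up figure used only to show the arithmetic.

```python
def estimate_range_selection(t_r, low, high, q_low, q_high):
    """T(R) x (values in the query range) / (values in the attribute range).

    All bounds are inclusive integers; the query range is clipped to
    the attribute range [low, high].
    """
    value_range = high - low + 1
    query_range = max(0, min(q_high, high) - max(q_low, low) + 1)
    return t_r * query_range / value_range

t_r = 1000  # hypothetical table size
size_gt_80 = estimate_range_selection(t_r, 1, 100, 81, 100)  # Z > 80
size_ge_80 = estimate_range_selection(t_r, 1, 100, 80, 100)  # Z >= 80
print(size_gt_80, size_ge_80)
```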
50
Estimating the size of R⋈S queries We consider R(X, Y)⋈S(Y, Z) Special cases: R and S have disjoint values for Y: T(R⋈S) = 0 Y is the key of S and a foreign key in R: T(R⋈S) = T(R) Almost all tuples of R and S have the same value for Y: T(R⋈S) = T(R)T(S)
51
Estimating the size of R⋈S queries General case: Will assume Containment of values: If V(R, Y) ≤ V(S, Y) then all values of Y in R are also in S Preservation of value sets: If A is an attribute of R that is not in S, then V(R⋈S, A) = V(R, A)
52
Estimating the size of R⋈S queries If V(R, Y) ≤ V(S, Y) Every value of R is present in S On average, a given tuple in R is likely to match T(S)/V(S, Y) R has T(R) tuples T(R⋈S) = T(R)×T(S)/V(S, Y)
53
Estimating the size of R⋈S queries If V(R, Y) ≥ V(S, Y) Every value of S is present in R On average, a given tuple in S is likely to match T(R)/V(R, Y) S has T(S) tuples T(R⋈S) = T(R)×T(S)/V(R, Y)
54
Estimating the size of R ⋈ S queries In general T(R⋈S) = T(R)×T(S)/max(V(R, Y), V(S, Y))
55
An example (I) Finding all employees who live in a city where the company has a plant: EMPLOYEE( EID, NAME, …., CITY) PLANT(PLANTID, …,CITY) SELECT E.NAME FROM EMPLOYEE E, PLANT P WHERE E.CITY = P.CITY SELECT EMPLOYEE.NAME FROM EMPLOYEE JOIN PLANT ON EMPLOYEE.CITY= PLANT.CITY
56
An example (II)
Assume
  T(E) = 5,000   V(E, CITY) = 100
  T(P) = 200     V(P, CITY) = 50
T(E⋈P) = T(E)×T(P)/max(V(E, CITY), V(P, CITY))
       = 5,000×200/max(100, 50)
       = 1,000,000/100
       = 10,000
57
Estimating the size of multiple joins
R ⋈ S ⋈ U with R(A, B), S(B, C), U(C, D)
T(R) = 1,000   T(S) = 2,000   T(U) = 5,000
V(R, B) = 20   V(S, B) = 50
V(S, C) = 100  V(U, C) = 500
Left to right:
  T(R⋈S) = 2,000,000/max(20, 50) = 40,000
  T(R⋈S⋈U) = 200,000,000/max(100, 500) = 400,000
Right to left:
  T(S⋈U) = 10,000,000/max(100, 500) = 20,000
  T(R⋈S⋈U) = 20,000,000/max(20, 50) = 400,000
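The join-size formula used in the last two examples fits in one function. In this sketch the name `estimate_join_size` is made up, and the three-way join relies on the preservation-of-value-sets assumption stated earlier (for instance V(R⋈S, C) = V(S, C)).

```python
def estimate_join_size(t_r, t_s, v_r_y, v_s_y):
    """T(R join S) = T(R) x T(S) / max(V(R, Y), V(S, Y))."""
    return t_r * t_s / max(v_r_y, v_s_y)

# EMPLOYEE join PLANT on CITY.
t_ep = estimate_join_size(5000, 200, 100, 50)

# R(A, B) join S(B, C) join U(C, D), left to right.
t_rs = estimate_join_size(1000, 2000, 20, 50)
left_to_right = estimate_join_size(t_rs, 5000, 100, 500)

# Same join, right to left.
t_su = estimate_join_size(2000, 5000, 100, 500)
right_to_left = estimate_join_size(1000, t_su, 20, 50)

print(t_ep, t_rs, t_su, left_to_right, right_to_left)
```

Both join orders give the same final estimate, but the intermediate result is twice as large when going left to right.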
58
Estimating the size of multicondition joins R(X,y 1, y 2,…)⋈S(y 1, y 2,…, Z) If V(R, y 1 )≤V(S, y 1 ) and V(R, y 2 )≤ V(S, y 2 ) … Every value of R is present in S On average, a given tuple in R is likely to match T(S)/(V(S, y 1 )×V(S, y 2 ) …) R has T(R) tuples T(R⋈S) = T(R)×T(S)/(V(S, y 1 )×V(S, y 2 ) …)
59
Multicondition join
R(X, y1, y2, …)⋈S(y1, y2, …, Z)
In general
T(R⋈S) = T(R)×T(S)/[max(V(R, y1), V(S, y1))×max(V(R, y2), V(S, y2))×…]
60
Multicondition join R(X,y 1, y 2,…)⋈S(y 1, y 2,…, Z) If V(R, y 1 )≤V(S, y 1 ) and V(R, y 2 )≤ V(S, y 2 ) … Every value of R is present in S On average, a given tuple in R is likely to match T(S)/(V(S, y 1 )×V(S, y 2 )) R has T(R) tuples T(R⋈S) = T(R)×T(S)/(V(S, y 1 )×V(S, y 2 ))
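A sketch of the general multicondition estimate. The statistics in the usage lines are invented purely to exercise the formula; they do not come from the slides.

```python
from math import prod

def estimate_multi_join_size(t_r, t_s, distinct_counts):
    """T(R) x T(S) divided by the product of max(V(R, yi), V(S, yi)).

    distinct_counts holds one (V(R, yi), V(S, yi)) pair per join attribute.
    """
    return t_r * t_s / prod(max(v_r, v_s) for v_r, v_s in distinct_counts)

# Hypothetical statistics: two join attributes y1 and y2.
size = estimate_multi_join_size(1000, 2000, [(20, 50), (100, 50)])
print(size)
```

With a single join attribute the function reduces to the ordinary join estimate.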
61
Estimating the size of unions T(R⋃S) for a bag union: T(R⋃S) = T(R)+T(S) exact for a regular union: If the relations are disjoint: T(R⋃S) = T(R)+T(S) If one relation contains the other: T(R⋃S) = max(T(R), T(S)) T(R⋃S)=(max(T(R), T(S))+T(R)+T(S))/2 We take the average!
62
Estimating the size of intersections T(R⋂S) If the relations are disjoint: T(R⋂S) = 0 If one relation contains the other T(R⋂S) = min(T(R), T(S)) T(R⋂S)=min(T(R), T(S))/2 We take the average!
63
Estimating the size of set differences T(R−S)
If the relations are disjoint: T(R−S) = T(R)
If relation R contains relation S: T(R−S) = T(R)−T(S)
T(R−S) = (2T(R)−T(S))/2
We take the average!
64
Estimating the cost of eliminating duplicates δ(R)
If all tuples are duplicates: T(δ(R)) = 1
If no tuples are duplicates: T(δ(R)) = T(R)
T(δ(R)) = T(R)/2
If R(a1, a2, …) and we know the V(R, ai), the result cannot have more than Π i V(R, ai) tuples:
T(δ(R)) = min(Π i V(R, ai), T(R)/2)
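The averaging rules for set union, intersection, difference and duplicate elimination can be collected in a few one-line functions. The function names are illustrative, the difference estimate assumes T(S) ≤ T(R), and the duplicate-elimination rule uses the min(Π V, T(R)/2) form that the later examples rely on.

```python
from math import prod

def union_size(t_r, t_s):
    """Average of max(T(R), T(S)) and T(R) + T(S)."""
    return (max(t_r, t_s) + t_r + t_s) / 2

def intersection_size(t_r, t_s):
    """Average of 0 and min(T(R), T(S))."""
    return min(t_r, t_s) / 2

def difference_size(t_r, t_s):
    """Average of T(R) and T(R) - T(S); assumes T(S) <= T(R)."""
    return (t_r + (t_r - t_s)) / 2

def distinct_size(t_r, distinct_counts):
    """min(product of the V(R, ai), T(R) / 2)."""
    return min(prod(distinct_counts), t_r / 2)

print(union_size(5000, 2000), intersection_size(5000, 2000),
      difference_size(5000, 2000), distinct_size(5000, [50, 100]))
```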
65
Collecting statistics Can explicitly request statistics Maintain them incrementally Can collect histograms Give an idea how data are distributed Not all patrons borrow equal number of books
66
The Zipf distribution (I) Empirical distribution Verified for many cases Ranks items by frequency/popularity If f is the probability of accessing/using the most popular item in the list (rank 1) The probability of accessing/using the second most popular item will be close to f/2 The probability … third most popular item will be close to f/3
67
The Zipf distribution (II) Can alter the shape Ranks items by frequency/popularity If f is the probability of accessing/using the most popular item in the list (rank 1) The probability of accessing/using the second most popular item will be close to f/2 The probability … third most popular item will be close to f/3
68
The Zipf distribution (II)
69
The Zipf distribution (III) Can adjust the slope of the curve by adding an exponent If f is the probability of accessing/using the most popular item in the list (rank 1) The probability of accessing/using the n-th ranked item in the list will be close to f/n^i i = ½ seems to be a good choice
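A minimal sketch of the Zipf probabilities over a finite list, with the exponent as a parameter. The normalisation (dividing by the sum of the weights so that the probabilities add up to 1) is an assumption made here; the slides only give the ratios f/n^i.

```python
def zipf_probabilities(n_items, exponent=1.0):
    """Probability of each rank 1..n_items, proportional to 1 / rank**exponent."""
    weights = [1 / rank ** exponent for rank in range(1, n_items + 1)]
    total = sum(weights)
    return [w / total for w in weights]

classic = zipf_probabilities(4)                  # ratios f, f/2, f/3, f/4
flatter = zipf_probabilities(4, exponent=0.5)    # ratios f, f/sqrt(2), ...
print(classic)
print(flatter)
```

A smaller exponent flattens the curve: with exponent ½ the fourth item is half as popular as the first, instead of a quarter.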
70
Example (I) A library uses two tables to keep track of its books Book(BID,Title,Authors,Loaned,Due) Patron(PID, Name, Phone, Address) The Loaned attribute of a book is equal to The PID of the patron who borrowed the book Zero if the book is on the shelves
71
Example (II)
We want to find the titles of all books currently loaned to "E. Chambas"
T(Book) = 5,000   V(Book, Loaned) = 200
T(Patron) = 500   V(Patron, Name) = 500
72
First plan X = Book ⋈ Loaned = PID Patron Y = σ Name = "E. Chambas" (X) Since PID is the key of Patron and assuming that Loaned were a foreign key in Books: T(X)= T(Books) (all books are borrowed!) = 5,000 T(Y)= T(X)/V(Patron, Name) =5,000/500 = 10
73
Second plan
X = σ Name = "E. Chambas" (Patron)
Y = Book ⋈ Loaned = PID X
T(X) = T(Patron)/V(Patron, Name) = 500/500 = 1
T(Y) = T(Book)×T(X)/V(Book, Loaned) = 5,000×1/200 = 25
74
Comparing the two plans (I)
Comparison based on the number of tuples created by the plan minus the number of tuples constituting the answer
  The latter should be the same for all correct plans
For the same reason, we do not consider the number of tuples being read
75
Comparing the two plans (II) Cost of first plan: 5,000 Cost of second plan: 1
76
An example (I)
Finding all employees who live in a city where the company has a plant:
  EMPLOYEE(EID, NAME, …., CITY)
  PLANT(PLANTID, …, CITY)
Assume
  T(E) = 5,000   V(E, CITY) = 100   V(E, NAME) = 5,000
  T(P) = 200     V(P, CITY) = 50
77
A first plan
X = E ⋈ E.CITY = P.CITY P
Y = π E.NAME (X)
T(E⋈P) = T(E)×T(P)/max(V(E, CITY), V(P, CITY))
       = 5,000×200/max(100, 50)
       = 1,000,000/100
       = 10,000
T(Y) = 10,000 (not possible: E only has 5,000 tuples, so the result must contain duplicates)
78
A second plan
X = π P.CITY (P)
Y = δ(X)
Z = E ⋈ E.CITY = Y.CITY Y
U = π E.NAME (Z)
T(X) = T(P) = 200
T(Y) = V(X, CITY) = V(P, CITY) = 50
T(Z) = T(E)×T(Y)/max(V(E, CITY), V(Y, CITY)) = 5,000×50/max(100, 50) = 2,500
T(U) = T(Z) = 2,500
79
Comparing the two plans Here it pays off to eliminate duplicates early
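The comparison can be reproduced by adding up the estimated sizes of the intermediate results of each plan, leaving out the final answer as explained earlier. This is a sketch under that costing convention; the variable names are illustrative.

```python
def estimate_join_size(t_r, t_s, v_r_y, v_s_y):
    """T(R join S) = T(R) x T(S) / max(V(R, Y), V(S, Y))."""
    return t_r * t_s / max(v_r_y, v_s_y)

t_e, v_e_city = 5000, 100   # EMPLOYEE
t_p, v_p_city = 200, 50     # PLANT

# First plan: X = E join P, then the final projection (not counted).
first_plan = [estimate_join_size(t_e, t_p, v_e_city, v_p_city)]

# Second plan: X = projection of P on CITY, Y = duplicate elimination,
# Z = E join Y, then the final projection (not counted).
t_x = t_p
t_y = v_p_city
t_z = estimate_join_size(t_e, t_y, v_e_city, v_p_city)
second_plan = [t_x, t_y, t_z]

print(sum(first_plan), sum(second_plan))
```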
80
Example [GM]
We have R(a,b) and S(b,c)
We want δ(σ a="a" (R⋈S))
We know
  T(R) = 5,000   T(S) = 2,000
  V(R, a) = 50
  V(R, b) = 100  V(S, b) = 200
  V(S, c) = 100
81
First plan
X1 = σ a="a" (R)
X2 = X1⋈S
X3 = δ(X2)
82
First plan
X1 = σ a="a" (R)
X2 = X1⋈S
X3 = δ(X2)
T(X1) = T(R)/V(R, a) = 5,000/50 = 100
T(X2) = T(X1)×T(S)/max(V(R, b), V(S, b)) = 100×2,000/max(100, 200) = 1,000
T(X3) = min(…, T(X2)/2) = 500 (final result: doesn't count)
83
Second plan
X1 = δ(R)
X2 = δ(S)
X3 = σ a="a" (X1)
X4 = X3⋈X2
84
Second plan
X1 = δ(R), X2 = δ(S), X3 = σ a="a" (X1), X4 = X3⋈X2
T(X1) = min(V(R, a)×V(R, b), T(R)/2) = min(50×100, 5,000/2) = 2,500
T(X2) = min(V(S, b)×V(S, c), T(S)/2) = min(200×100, 2,000/2) = 1,000
T(X3) = T(X1)/V(R, a) = 2,500/50 = 50
T(X4) = T(X3)×T(X2)/max(V(R, b), V(S, b)) = 50×1,000/max(100, 200) = 250
85
Comparing the two plans Here it did not pay off to eliminate duplicates early
86
A hybrid plan
X1 = σ a="a" (R)
X2 = δ(X1)
X3 = X2⋈S
X4 = δ(X3)
87
A hybrid plan
X1 = σ a="a" (R)
X2 = δ(X1)
X3 = X2⋈S
X4 = δ(X3)
T(X1) = T(R)/V(R, a) = 5,000/50 = 100
T(X2) = min(V(R, b), T(X1)/2) = min(100, 50) = 50
T(X3) = T(X2)×T(S)/max(V(R, b), V(S, b)) = 50×2,000/max(100, 200) = 500
T(X4) = min(…, T(X3)/2) = 250
88
Comparing the two best plans Reducing the sizes of the tables in a join is a good idea if we can do it on the cheap
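The three plans of the [GM] example can be costed with the same estimation rules. The sketch below sums the estimated sizes of the intermediate results and leaves out the last result of each plan, following the costing convention stated earlier; the helper names are illustrative.

```python
from math import prod

def select_eq(t, v):
    """Equality selection: T / V."""
    return t / v

def join(t1, t2, v1, v2):
    """Join: T1 x T2 / max(V1, V2)."""
    return t1 * t2 / max(v1, v2)

def distinct(t, distinct_counts):
    """Duplicate elimination: min(product of the V values, T / 2)."""
    return min(prod(distinct_counts), t / 2)

T_R, T_S = 5000, 2000
V_Ra, V_Rb, V_Sb, V_Sc = 50, 100, 200, 100

# First plan: select, join, then a final duplicate elimination (not counted).
x1 = select_eq(T_R, V_Ra)
x2 = join(x1, T_S, V_Rb, V_Sb)
cost_first = x1 + x2

# Second plan: eliminate duplicates in R and S, select, then a final join.
y1 = distinct(T_R, [V_Ra, V_Rb])
y2 = distinct(T_S, [V_Sb, V_Sc])
y3 = select_eq(y1, V_Ra)
cost_second = y1 + y2 + y3

# Hybrid plan: select, eliminate duplicates, join, final duplicate elimination.
z1 = select_eq(T_R, V_Ra)
z2 = distinct(z1, [V_Rb])
z3 = join(z2, T_S, V_Rb, V_Sb)
cost_hybrid = z1 + z2 + z3

print(cost_first, cost_second, cost_hybrid)
```

Under these estimates the hybrid plan creates the fewest intermediate tuples, because its duplicate elimination is applied to an already small table.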
89
Ordering joins
Join methods are often asymmetric, so cost(R⋈S) ≠ cost(S⋈R)
Useful to build a join tree
A simple greedy algorithm will work well:
  Start with the pair of relations whose estimated join size will be the smallest
  Find among the other relations the one that would produce the smallest estimated size when joined to the current tree
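One possible reading of the greedy algorithm, sketched with the R, S, U statistics from the multiple-join example. Several choices here are assumptions, not part of the slides: each relation is described by a pair (tuple count, {attribute: distinct count}), a join attribute keeps the smaller of the two distinct counts after the join, and relations sharing no attribute are combined by a Cartesian product.

```python
from itertools import combinations
from math import prod

def join_stats(left, right):
    """Estimated (tuple count, distinct counts) of left join right."""
    t_l, v_l = left
    t_r, v_r = right
    shared = v_l.keys() & v_r.keys()
    # prod() of an empty sequence is 1, which gives the Cartesian product size.
    size = t_l * t_r / prod(max(v_l[a], v_r[a]) for a in shared)
    merged = {**v_l, **v_r}
    for a in shared:
        merged[a] = min(v_l[a], v_r[a])
    return size, merged

def greedy_join_order(relations):
    """Return the greedy join order and the estimated final size."""
    names = sorted(relations)
    first, second = min(
        combinations(names, 2),
        key=lambda pair: join_stats(relations[pair[0]], relations[pair[1]])[0],
    )
    order = [first, second]
    current = join_stats(relations[first], relations[second])
    remaining = [n for n in names if n not in order]
    while remaining:
        best = min(remaining, key=lambda n: join_stats(current, relations[n])[0])
        current = join_stats(current, relations[best])
        order.append(best)
        remaining.remove(best)
    return order, current[0]

relations = {
    "R": (1000, {"B": 20}),
    "S": (2000, {"B": 50, "C": 100}),
    "U": (5000, {"C": 500}),
}
order, final_size = greedy_join_order(relations)
print(order, final_size)
```

On these statistics the algorithm starts with S⋈U (estimated 20,000 tuples) and adds R last.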
90
Implementing joins
91
A. Nested loops
W = [ ]
for row_r in R :
    for row_s in S :
        if match_found(row_r, row_s) :
            W.append(row_r + row_s)
Number of operations: T(R)×T(S)
92
The idea Table R Table S Try to match every tuple of R with all tuples of S
93
Optimization
Assume that the second relation can fit in main memory
  Each relation is then read only once
Number of reads is T(R) + T(S)
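A runnable version of the nested-loop join, as a sketch only. Tables are lists of tuples, `match` is a caller-supplied predicate, and copying S into a list stands in for loading the second relation into main memory once.

```python
def nested_loop_join(R, S, match):
    """Join R and S by comparing every row of R with every row of S."""
    W = []
    comparisons = 0
    inner = list(S)  # assumption: S fits in memory, so it is read only once
    for r in R:
        for s in inner:
            comparisons += 1
            if match(r, s):
                W.append(r + s)  # concatenate the two matching rows
    return W, comparisons

R = [(1, "x", 100), (2, "y", 200), (3, "z", 300), (4, "y", 400)]  # (A, B, C)
S = [(1, "u", 10), (3, "v", 30), (5, "w", 30)]                    # (D, E, F)

W, comparisons = nested_loop_join(R, S, lambda r, s: r[0] == s[0])
print(W, comparisons)
```

The number of comparisons stays T(R)×T(S); keeping S in memory only reduces the number of reads.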
94
B. Sort and merge
We sort the two tables using the matching attributes as sorting keys
Can now find the matches by doing a merge
  Single-pass process unless we have duplicate matches
Number of operations is O(T(R)log(T(R))) + O(T(S)log(T(S))) + T(R) + T(S), assuming one table does not have potential duplicate matches
Great if the tables are already sorted
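A sketch of a sort-merge join on in-memory lists. The key-extraction functions `r_key` and `s_key` are parameters introduced for this example; runs of equal keys on both sides are paired up, which is the case where the merge is no longer a single pass.

```python
def sort_merge_join(R, S, r_key, s_key):
    """Sort both tables on the join key, then merge them."""
    R = sorted(R, key=r_key)
    S = sorted(S, key=s_key)
    out = []
    i = j = 0
    while i < len(R) and j < len(S):
        kr, ks = r_key(R[i]), s_key(S[j])
        if kr < ks:
            i += 1
        elif kr > ks:
            j += 1
        else:
            # Find the run of rows sharing this key in each table.
            i_end = i
            while i_end < len(R) and r_key(R[i_end]) == kr:
                i_end += 1
            j_end = j
            while j_end < len(S) and s_key(S[j_end]) == kr:
                j_end += 1
            # Pair every row of one run with every row of the other.
            for r in R[i:i_end]:
                for s in S[j:j_end]:
                    out.append(r + s)
            i, j = i_end, j_end
    return out

R = [(3, "z", 300), (1, "x", 100), (4, "y", 400), (2, "y", 200)]  # (A, B, C)
S = [(5, "w", 30), (3, "v", 30), (1, "u", 10)]                    # (D, E, F)
joined = sort_merge_join(R, S, lambda r: r[0], lambda s: s[0])
print(joined)
```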
95
C. Hashing
Assume both tables maintain a hash with K entries for the matching attributes
for i in range(0, K) :
    join all R entries in bucket i with all S entries in the same bucket
We replace a big join by K smaller joins
Number of operations will be: K×(T(R)/K)×(T(S)/K) = T(R)×T(S)/K
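A sketch of the bucket-by-bucket join. Here the K buckets are built on the fly with Python's built-in `hash`, which is an assumption of this example; the slide assumes the tables already maintain such a hash.

```python
def hash_join(R, S, r_key, s_key, k=4):
    """Partition both tables into k buckets, then join bucket by bucket."""
    r_buckets = [[] for _ in range(k)]
    s_buckets = [[] for _ in range(k)]
    for r in R:
        r_buckets[hash(r_key(r)) % k].append(r)
    for s in S:
        s_buckets[hash(s_key(s)) % k].append(s)

    out = []
    comparisons = 0
    for i in range(k):
        # Rows with equal keys always land in the same bucket number.
        for r in r_buckets[i]:
            for s in s_buckets[i]:
                comparisons += 1
                if r_key(r) == s_key(s):
                    out.append(r + s)
    return out, comparisons

R = [(1, "x", 100), (2, "y", 200), (3, "z", 300), (4, "y", 400)]  # (A, B, C)
S = [(1, "u", 10), (3, "v", 30), (5, "w", 30)]                    # (D, E, F)
joined, comparisons = hash_join(R, S, lambda r: r[0], lambda s: s[0])
print(sorted(joined), comparisons)
```

The equality test inside the bucket is still needed, since different keys can share a bucket.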