QUERY PROCESSING AND OPTIMIZATION

Overview SQL is a declarative language:  Specifies the outcome of the computation without defining any flow of control  Requires the DBMS to select an execution plan  Allows optimizations

Sample query SELECT C, E FROM R, S WHERE R.B = "z" AND S.F = 30 AND R.A = S.D

The two tables

R:
A  B  C
1  x  100
2  y  200
3  z  300
4  y  400

S:
D  E  F
1  u  10
3  v  30
5  w  30

First execution plan Can use relational algebra to express an execution plan Could be:  Cartesian product: R×S  Selection: σ R.B="z" ∧ S.F=30 ∧ R.A=S.D (R×S)  Projection: π C,E ( σ R.B="z" ∧ S.F=30 ∧ R.A=S.D (R×S) )

Graphical representation (reading bottom-up):

π C,E
  |
σ R.B="z" ∧ S.F=30 ∧ R.A=S.D
  |
R×S

R×S

A  B  C    D  E  F
1  x  100  1  u  10
1  x  100  3  v  30
1  x  100  5  w  30
2  y  200  1  u  10
2  y  200  3  v  30
2  y  200  5  w  30
3  z  300  1  u  10
3  z  300  3  v  30
3  z  300  5  w  30
4  y  400  1  u  10
4  y  400  3  v  30
4  y  400  5  w  30
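This brute-force plan can be spot-checked with a short Python sketch (the list-of-dicts table representation is mine, not the slides'):

```python
# The two tables from the slides: R(A, B, C) and S(D, E, F).
R = [{"A": 1, "B": "x", "C": 100},
     {"A": 2, "B": "y", "C": 200},
     {"A": 3, "B": "z", "C": 300},
     {"A": 4, "B": "y", "C": 400}]
S = [{"D": 1, "E": "u", "F": 10},
     {"D": 3, "E": "v", "F": 30},
     {"D": 5, "E": "w", "F": 30}]

# Cartesian product R x S: 4 * 3 = 12 combined rows.
product = [{**r, **s} for r in R for s in S]
# Selection: R.B = "z" and S.F = 30 and R.A = S.D.
selected = [t for t in product
            if t["B"] == "z" and t["F"] == 30 and t["A"] == t["D"]]
# Projection on C and E.
result = [(t["C"], t["E"]) for t in selected]
print(result)  # [(300, 'v')]
```

Twelve intermediate rows are materialized to produce a single answer row, which is exactly what the later plans avoid.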

Second execution plan Selections:  σ B="z" (R)  σ F=30 (S) Join:  σ B="z" (R) ⋈ R.A=S.D σ F=30 (S) Projection:  π C,E (…)

The two tables

R:
A  B  C
1  x  100
2  y  200
3  z  300
4  y  400

S:
D  E  F
1  u  10
3  v  30
5  w  30

After the selections

σ B="z" (R):
A  B  C
3  z  300

σ F=30 (S):
D  E  F
3  v  30
5  w  30

σ B = "z" (R) R.A=S.D σ F = 30 (S) ABCDEF 3z3003v30

Discussion Second plan  First extracts the relevant rows of tables R and S  Uses a more efficient join:  for each row in σ B="z" (R) : for each row in σ F=30 (S) : if R.A = S.D : include the pair of rows in the result  Note that the inner loop scans the smaller temporary table (σ F=30 (S))
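The second plan maps directly onto Python (the same hypothetical list-of-dicts tables as before, redefined here so the sketch is self-contained):

```python
R = [{"A": 1, "B": "x", "C": 100}, {"A": 2, "B": "y", "C": 200},
     {"A": 3, "B": "z", "C": 300}, {"A": 4, "B": "y", "C": 400}]
S = [{"D": 1, "E": "u", "F": 10}, {"D": 3, "E": "v", "F": 30},
     {"D": 5, "E": "w", "F": 30}]

# Push the selections first ...
r_sel = [r for r in R if r["B"] == "z"]   # sigma B="z"(R): 1 row
s_sel = [s for s in S if s["F"] == 30]    # sigma F=30(S): 2 rows
# ... then join with a nested loop; the inner loop scans the smaller table.
result = [(r["C"], s["E"]) for r in r_sel for s in s_sel if r["A"] == s["D"]]
print(result)  # [(300, 'v')]
```

Only 1 × 2 = 2 candidate pairs are examined instead of the 12 rows of the Cartesian product.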

More generally Exclude as quickly as possible:  Irrelevant lines  Irrelevant attributes Most important when the involved tables reside on different hosts (Distributed DBMS) Whenever possible, ensure that inner join loops search tables that can reside in main memory

Caching considerations Cannot rely on LRU to achieve that  It will keep in memory the recently accessed pages of all tables Must keep  All pages of the table inside the inner loop  No pages of the other table Can either  Let the DBMS manage the cache  Use a scan-tolerant cache algorithm (e.g., ARC)

A third plan Find the rows of R where B = "z" Using an index on S.D, find the rows of S where S.D matches R.A for those rows Include each pair of rows in the join

Processing a query (I) Parse the query Convert the query parse tree into a logical query plan (LQP) Apply equivalence rules (laws) and try to improve upon the current LQP Estimate result sizes

Processing a query (II) Consider possible physical plans Estimate their cost Select the best Execute it Given the high cost of query processing, it makes sense to evaluate various alternatives

Example (from [GM]) SELECT title FROM StarsIn WHERE starName IN ( SELECT name FROM MovieStar WHERE birthdate LIKE ‘%1960’ );

Relational Algebra plan

π title ( σ [ starName IN ( π name ( σ birthdate LIKE ‘%1960’ (MovieStar) ) ) ] (StarsIn) )

Fig. 7.15: An expression using a two-argument σ, midway between a parse tree and relational algebra

Logical query plan

π title ( σ starName=name ( StarsIn × π name ( σ birthdate LIKE ‘%1960’ (MovieStar) ) ) )

Fig. 7.18: Applying the rule for IN conditions  The Cartesian product could indicate a brute-force solution

Estimating result sizes Need the expected size of every intermediate result in the plan (StarsIn, MovieStar, the selection, the product, …)

Estimate the cost of each option Logical query plan → physical plans P1, P2, …, Pn with estimated costs C1, C2, …, Cn Pick the best!

Query optimization At two levels  Relational algebra level: Use equivalence rules  Detailed query plan level: Takes into account result sizes Considers DB organization  How it is stored  Presence and types of indexes, …

Result sizes do matter Consider the Cartesian product  Very costly when its two operands are large tables  Less true when the tables are small

Equivalence rules for joins  R⋈S = S⋈R  (R⋈S)⋈T = R⋈(S⋈T) Column order does not matter because the columns have labels

Rules for product and union Equivalence rules for Cartesian product:  R x S = S x R  (R x S) x T = R x (S x T) Equivalence rules for union :  R  S = S  R  (R  S)  T = R  (S  T) Column order does not matter because the columns have labels

Rules for selections and unions Equivalence rules for selection:  σ p1∧p2 (R) = σ p1 ( σ p2 (R))  σ p1∨p2 (R) = σ p1 (R) ∪ σ p2 (R) Equivalence rules for union:  R ∪ S = S ∪ R  (R ∪ S) ∪ T = R ∪ (S ∪ T)

Pushing selections through joins  If predicate p only involves attributes of R: σ p (R⋈S) = σ p (R) ⋈ S  If predicate q only involves attributes of S: σ q (R⋈S) = R ⋈ σ q (S) Warning: π p1,p2 (R) is NOT the same as π p1 ( π p2 (R))

Combining selections and joins (p involves only attributes of R, q only attributes of S; m may involve both)  σ p∧q (R⋈S) = σ p (R) ⋈ σ q (S)  σ p∧q∧m (R⋈S) = σ m [ σ p (R) ⋈ σ q (S) ]  σ p∨q (R⋈S) = [ σ p (R) ⋈ S ] ∪ [ R ⋈ σ q (S) ]

Combining projections and selections Let  x be a subset of R's attributes  z the set of attributes of R used in predicate p Then  π x [ σ p (R) ] = π x [ σ p [ π xz (R) ] ] We can only eliminate attributes that are not used in the selection predicate!

Combining projections and joins Let  x be a subset of R's attributes  y a subset of S's attributes  z the common attributes of R and S Then  π xy (R⋈S) = π xy { [ π xz (R) ] ⋈ [ π yz (S) ] }
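The rule can be spot-checked on tiny hypothetical tables of my own (R carries an extra attribute G that neither the projection nor the join uses, so pushing the projection actually drops something):

```python
# Hypothetical tables: R(A, B, G) and S(B, C); they join on B, and G is never used.
R = [(1, "x", 99), (2, "y", 98), (3, "y", 97)]
S = [("x", 10), ("y", 20)]

# pi_{A,C}(R join S): full join first, then project.
lhs = sorted((a, c) for (a, b, g) in R for (b2, c) in S if b == b2)

# pi_{A,C}(pi_{A,B}(R) join S): drop G before joining (keep join attribute B).
R_small = [(a, b) for (a, b, g) in R]
rhs = sorted((a, c) for (a, b) in R_small for (b2, c) in S if b == b2)

assert lhs == rhs  # the early projection does not change the answer
print(lhs)  # [(1, 10), (2, 20), (3, 20)]
```

The equality holds precisely because the join attribute B was kept in the pushed-down projection, as the rule requires.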

Combining projections, selections and joins Let  x, y, z be as before  z′ the union of z and the attributes used in predicate p Then  π xy { σ p (R⋈S) } = π xy { σ p ( [ π xz′ (R) ] ⋈ [ π yz′ (S) ] ) }

Combining selections, projections and Cartesian product Rules are similar Just replace join operator by Cartesian product operator  Keep in mind that join is a restricted Cartesian product

Combining selections and unions  σ p (R ∪ S) = σ p (R) ∪ σ p (S)  σ p (R − S) = σ p (R) − S = σ p (R) − σ p (S)

 p1  p2 (R)   p1 [  p2 (R)]  Use successive selections  p (R ⋈ S)  [  p (R)] ⋈ S  Do selections before joins R ⋈ S  S ⋈ R  x [  p (R)]   x {  p [  xz (R)]}  Do projections before selection Finding the most promising transformations

First heuristic Do projections early  Example from [GM]: Given R(A,B,C,D,E) and the selection predicate p: (A=3) ∧ (B=“cat”) Seems a good idea to replace π E { σ p (R) } by π E { σ p { π ABE (R) } } What if we have indexes?

Same example with indexes Assume attribute A is indexed  Use the index to locate all tuples where A = 3  Select the tuples where B = “cat”  Do the projections In other words, π E { σ p (R) } is the best solution

Second heuristic Do selections early  Especially if we can use indexes But no heuristic is always valid

Estimating cost of query plans Requires  Estimating the sizes of the results  Estimating the number of I/O operations We will generally assume that the cost of a query plan is dominated by the number of tuples being read or written

Estimating result sizes Relevant data are  T(R) : number of tuples in R  S(R) : size of each tuple of R (in bytes)  B(R): number of blocks required to store R  V(R, A) : number of distinct values for attribute A in R

Example Relation R:

Owner  Pet  Vax date
Alice  Cat  3/2/15
Alice  Cat  3/2/15
Bob    Dog  10/8/14
Bob    Dog  10/8/15
Carol  Dog  11/9/14
Carol  Cat  12/7/14

T(R) = 6 Assuming dates take 8 bytes and strings 20 bytes, S(R) = 48 bytes B(R) = 1 block V(R, Owner) = 3, V(R, Pet) = 2, V(R, Vax date) = 5

Estimating cost of W = R1 × R2 T(W) = T(R1)×T(R2) S(W) = S(R1)+S(R2) Obvious!

Estimating cost of W =  A=a  (R) S(W) = S(R) T(W) = T(R)/V(R, A)  but this assumes that the values of A are uniformly distributed over all the tuples

Example W = σ Owner="Bob" (R) As T(R) = 6 and V(R, Owner) = 3, T(W) = 6/3 = 2

Owner  Pet  Vax date
Alice  Cat  3/2/15
Alice  Cat  3/2/15
Bob    Dog  10/8/14
Bob    Dog  10/8/15
Carol  Dog  11/9/14
Carol  Cat  12/7/14

Making another assumption Assume now that the value val in the selection Z = val is uniformly distributed over all possible V(R, Z) values If W = σ Z=val (R)  T(W) = T(R)/V(R, Z)

Estimating sizes of range queries Attribute Z of table R has 50 possible values ranging from 1 to 100 If W = σ Z > 80 (R), what is T(W)? Assuming the values of Z are uniformly distributed over [1, 100]: T(W) = T(R)×(100 − 80)/(100 − 1 + 1) = 0.2×T(R)

Explanation T(W) = T(R)×(Query_Range/Value_Range) If the query had been W = σ Z ≥ 80 (R), T(W) would have been T(R)×(100 − 80 + 1)/(100 − 1 + 1) = 0.21×T(R)  (21 possible values)
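The uniformity-based estimators of the last few slides fit in two small helper functions (names and signatures are mine):

```python
# Simple textbook selectivity estimators; both assume uniformly distributed values.

def eq_selection_size(T_R, V_RA):
    # Estimated size of sigma_{A = c}(R): T(R) / V(R, A).
    return T_R / V_RA

def range_selection_size(T_R, lo, hi, cutoff, inclusive=False):
    # Estimated size of sigma_{Z > cutoff}(R) (or Z >= cutoff when inclusive)
    # for Z uniform on the integer range [lo, hi].
    width = hi - lo + 1
    matching = hi - cutoff + (1 if inclusive else 0)
    return T_R * matching / width

# The slides' numbers: values 1..100, cutoff 80 (here with T(R) = 1,000).
print(range_selection_size(1000, 1, 100, 80))                  # 200.0 (0.2 * T(R))
print(range_selection_size(1000, 1, 100, 80, inclusive=True))  # 210.0 (0.21 * T(R))
```

The vaccination-table example gives eq_selection_size(6, 3) = 2, matching the earlier slide.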

Estimating the size of R⋈S queries We consider R(X, Y)⋈S(Y, Z) Special cases:  R and S have disjoint values for Y: T(R⋈S) = 0  Y is the key of S and a foreign key in R: T(R⋈S) = T(R)  Almost all tuples of R and S have the same value for Y: T(R⋈S) = T(R)T(S)

Estimating the size of R⋈S queries General case:  Will assume Containment of values:  If V(R, Y) ≤ V(S, Y) then all values of Y in R are also in S Preservation of value sets:  If A is an attribute of R that is not in S, then V(R⋈S, A) = V(R, A)

Estimating the size of R⋈S queries  If V(R, Y) ≤ V(S, Y) Every Y value of R is present in S On average, a given tuple in R is likely to match T(S)/V(S, Y) tuples of S R has T(R) tuples T(R⋈S) = T(R)×T(S)/V(S, Y)

Estimating the size of R⋈S queries  If V(R, Y) ≥ V(S, Y) Every Y value of S is present in R On average, a given tuple in S is likely to match T(R)/V(R, Y) tuples of R S has T(S) tuples T(R⋈S) = T(R)×T(S)/V(R, Y)

Estimating the size of R ⋈ S queries  In general T(R⋈S) = T(R)×T(S)/max(V(R, Y), V(S, Y))

An example (I) Finding all employees who live in a city where the company has a plant:  EMPLOYEE( EID, NAME, …., CITY)  PLANT(PLANTID, …,CITY) SELECT E.NAME FROM EMPLOYEE E, PLANT P WHERE E.CITY = P.CITY SELECT EMPLOYEE.NAME FROM EMPLOYEE JOIN PLANT ON EMPLOYEE.CITY= PLANT.CITY

An example (II) Assume  T(E)=5,000V(E, CITY) = 100  T(P)= 200V(P, CITY) = 50  T(E⋈P) = T(E)×T(P)/ MAX(V(E, CITY), V(P, CITY)) = 5,000×200/MAX(100, 50) = 1,000,000/100 = 10,000
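The general rule and this example fit in a one-line helper (the function name is mine):

```python
def join_size(T_R, T_S, V_RY, V_SY):
    # T(R join S) estimate under the containment-of-value-sets assumption:
    # T(R) * T(S) / max(V(R, Y), V(S, Y)).
    return T_R * T_S // max(V_RY, V_SY)

# The employee/plant numbers from the slides:
# T(E) = 5,000, T(P) = 200, V(E, CITY) = 100, V(P, CITY) = 50.
print(join_size(5000, 200, 100, 50))  # 10000
```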

Estimating the size of multiple joins R ⋈ S ⋈ U
 R(A, B), S(B, C), U(C, D)
 T(R) = 1,000, T(S) = 2,000, T(U) = 5,000
 V(R, B) = 20, V(S, B) = 50, V(S, C) = 100, V(U, C) = 500
Left to right:
 T(R⋈S) = 2,000,000/max(20, 50) = 40,000
 T(R⋈S⋈U) = 200,000,000/max(100, 500) = 400,000
Right to left:
 T(S⋈U) = 10,000,000/max(100, 500) = 20,000
 T(R⋈S⋈U) = 20,000,000/max(20, 50) = 400,000

Estimating the size of multicondition joins R(X, y1, y2, …) ⋈ S(y1, y2, …, Z)  If V(R, y1) ≤ V(S, y1) and V(R, y2) ≤ V(S, y2), … Every value of R is present in S On average, a given tuple in R is likely to match T(S)/(V(S, y1)×V(S, y2)×…) tuples of S R has T(R) tuples T(R⋈S) = T(R)×T(S)/(V(S, y1)×V(S, y2)×…)

Multicondition join R(X, y1, y2, …) ⋈ S(y1, y2, …, Z) In general  T(R⋈S) = T(R)×T(S)/[max(V(R, y1), V(S, y1)) × max(V(R, y2), V(S, y2)) × …]

Estimating the size of unions T(R∪S)  For a bag union: T(R∪S) = T(R)+T(S) (exact)  For a set union:  If the relations are disjoint: T(R∪S) = T(R)+T(S)  If one relation contains the other: T(R∪S) = max(T(R), T(S))  Estimate: T(R∪S) = (max(T(R), T(S))+T(R)+T(S))/2  We take the average!

Estimating the size of intersections T(R⋂S)  If the relations are disjoint: T(R⋂S) = 0  If one relation contains the other: T(R⋂S) = min(T(R), T(S))  Estimate: T(R⋂S) = min(T(R), T(S))/2  We take the average!

Estimating the size of set differences T(R−S)  If the relations are disjoint: T(R−S) = T(R)  If relation R contains relation S: T(R−S) = T(R)−T(S)  Estimate: T(R−S) = (2T(R)−T(S))/2  We take the average!

Estimating the cost of eliminating duplicates δ(R)  If all tuples are duplicates: T(δ(R)) = 1  If no tuples are duplicates: T(δ(R)) = T(R)  Estimate: T(δ(R)) = T(R)/2 If R(a1, a2, …) and we know the V(R, ai):  T(δ(R)) ≤ Π i V(R, ai), so we use the smaller of the two estimates
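The averaged estimates from the last few slides can be collected into small helpers (function names are mine):

```python
# Averaged textbook size estimates for set operations and duplicate elimination.

def union_size(T_R, T_S):
    # Average of max (one contains the other) and sum (disjoint).
    return (max(T_R, T_S) + T_R + T_S) / 2

def intersection_size(T_R, T_S):
    # Average of 0 (disjoint) and min (containment).
    return min(T_R, T_S) / 2

def difference_size(T_R, T_S):
    # Average of T(R) (disjoint) and T(R) - T(S) (R contains S).
    return (2 * T_R - T_S) / 2

def distinct_size(T_R, value_counts=None):
    # T(delta(R)): T(R)/2 by default, capped by the product of the V(R, a_i).
    if value_counts:
        prod = 1
        for v in value_counts:
            prod *= v
        return min(prod, T_R / 2)
    return T_R / 2

print(union_size(100, 60))       # 130.0
print(difference_size(100, 60))  # 70.0
```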

Collecting statistics Can explicitly request statistics Maintain them incrementally Can collect histograms  Give an idea of how the data are distributed Not all patrons borrow an equal number of books

The Zipf distribution (I) Empirical distribution Verified for many cases Ranks items by frequency/popularity If f is the probability of accessing/using the most popular item in the list (rank 1)  The probability of accessing/using the second most popular item will be close to f/2  The probability … third most popular item will be close to f/3

The Zipf distribution (II)  [slide contains only a plot of the distribution]

The Zipf distribution (III) Can adjust the slope of the curve by adding an exponent If f is the probability of accessing/using the most popular item in the list (rank 1)  The probability of accessing/using the n-th ranked item will be close to f/n^i  i = ½ seems to be a good choice
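A small sketch of Zipf-style access probabilities (the function name and the normalization step are mine; the exponent s generalizes the f/n rule, with s = ½ as the milder slope the slide suggests):

```python
# Zipf-like access probabilities: the item at rank n gets weight 1 / n**s,
# then weights are normalized so the probabilities sum to 1.
def zipf_probs(n_items, s=0.5):
    weights = [1 / (rank ** s) for rank in range(1, n_items + 1)]
    total = sum(weights)
    return [w / total for w in weights]

probs = zipf_probs(5, s=1.0)
# With s = 1, rank 2 is half as likely as rank 1, rank 3 a third, and so on.
print([round(p, 3) for p in probs])
```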

Example (I) A library uses two tables to keep track of its books  Book(BID,Title,Authors,Loaned,Due)  Patron(PID, Name, Phone, Address) The Loaned attribute of a book is equal to  The PID of the patron who borrowed the book  Zero if the book is on the shelves

Example (II) We want to find the titles of all books currently loaned to "E. Chambas"  T(Books) = 5,000  V(Books, Loaned) = 200  T(Patron) = 500  V(Patron, Name) = 500

First plan X = Book ⋈ Loaned=PID Patron Y = σ Name="E. Chambas" (X) Since PID is the key of Patron and assuming that Loaned is a foreign key in Books: T(X) = T(Books) (all books are borrowed!) = 5,000 T(Y) = T(X)/V(Patron, Name) = 5,000/500 = 10

Second plan X = σ Name="E. Chambas" (Patron) Y = Book ⋈ Loaned=PID X T(X) = T(Patron)/V(Patron, Name) = 500/500 = 1 T(Y) = T(Book)×T(X)/V(Book, Loaned) = 5,000×1/200 = 25

Comparing the two plans (I) Comparison based on the number of intermediate tuples created by the plan, excluding  The tuples constituting the answer, which should be the same for all correct plans For the same reason, we do not consider the number of tuples being read

Comparing the two plans (II) Cost of first plan: 5,000 Cost of second plan: 1

An example (I) Finding all employees who live in a city where the company has a plant:  EMPLOYEE(EID, NAME, …, CITY)  PLANT(PLANTID, …, CITY) Assume  T(E) = 5,000  V(E, CITY) = 100  V(E, NAME) = 5,000  T(P) = 200  V(P, CITY) = 50

A first plan X = E ⋈ E.CITY=P.CITY P Y = π E.NAME (X) T(X) = T(E)×T(P)/max(V(E, CITY), V(P, CITY)) = 5,000×200/max(100, 50) = 1,000,000/100 = 10,000 T(Y) = 10,000 (not possible: there are only 5,000 employees!)

A second plan X = π P.CITY (P) Y = δ(X) Z = E ⋈ E.CITY=Y.CITY Y U = π E.NAME (Z) T(X) = T(P) = 200 T(Y) = V(X, CITY) = V(P, CITY) = 50 T(Z) = T(E)×T(Y)/max(V(E, CITY), V(Y, CITY)) = 5,000×50/max(100, 50) = 2,500 T(U) = T(Z) = 2,500

Comparing the two plans Here it pays off to eliminate duplicates early

Example [GM] We have R(a, b) and S(b, c) We want δ(σ a="a" (R⋈S)) We know  T(R) = 5,000, T(S) = 2,000  V(R, a) = 50, V(R, b) = 100  V(S, b) = 200, V(S, c) = 100

First plan X1 = σ a="a" (R) X2 = X1 ⋈ S X3 = δ(X2)

First plan X1 = σ a="a" (R) X2 = X1 ⋈ S X3 = δ(X2) T(X1) = T(R)/V(R, a) = 5,000/50 = 100 T(X2) = T(X1)×T(S)/max(V(R, b), V(S, b)) = 100×2,000/max(100, 200) = 1,000 T(X3) = min(…, T(X2)/2) = 500 (the final result, so it doesn't count toward the cost)

Second plan X1 = δ(R) X2 = δ(S) X3 = σ a="a" (X1) X4 = X3 ⋈ X2

Second plan X1 = δ(R), X2 = δ(S), X3 = σ a="a" (X1), X4 = X3 ⋈ X2 T(X1) = min(V(R, a)×V(R, b), T(R)/2) = min(50×100, 5,000/2) = 2,500 T(X2) = min(V(S, b)×V(S, c), T(S)/2) = min(200×100, 2,000/2) = 1,000 T(X3) = T(X1)/V(R, a) = 2,500/50 = 50 T(X4) = T(X3)×T(X2)/max(V(R, b), V(S, b)) = 50×1,000/max(100, 200) = 250

Comparing the two plans Here it did not pay off to eliminate duplicates early

A hybrid plan X1 = σ a="a" (R) X2 = δ(X1) X3 = X2 ⋈ S X4 = δ(X3)

A hybrid plan X1 = σ a="a" (R) X2 = δ(X1) X3 = X2 ⋈ S X4 = δ(X3) T(X1) = T(R)/V(R, a) = 5,000/50 = 100 T(X2) = min(V(R, b), T(X1)/2) = min(100, 50) = 50 T(X3) = T(X2)×T(S)/max(V(R, b), V(S, b)) = 50×2,000/max(100, 200) = 500 T(X4) = min(…, T(X3)/2) = 250

Comparing the two best plans Reducing the sizes of the tables in a join is a good idea if we can do it on the cheap

Ordering joins Join methods are often asymmetric, so cost(R⋈S) ≠ cost(S⋈R) Useful to build a join tree A simple greedy algorithm will work well:  Start with the pair of relations whose estimated join size will be the smallest  Find among the other relations the one that would produce the smallest estimated size when joined to the current tree.
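The greedy first step can be sketched on the earlier R(A,B) ⋈ S(B,C) ⋈ U(C,D) statistics (the helper, the data layout, and the restriction to joinable pairs are mine):

```python
# Greedy join-ordering sketch using the T/V size estimates from the slides.
def join_size(T1, T2, V1, V2):
    return T1 * T2 // max(V1, V2)

T = {"R": 1000, "S": 2000, "U": 5000}
# V of the shared attribute for each joinable pair (R and U share no attribute).
V = {("R", "S"): (20, 50),     # attribute B
     ("S", "U"): (100, 500)}   # attribute C

# Step 1: start with the pair whose estimated join result is smallest.
sizes = {p: join_size(T[p[0]], T[p[1]], *v) for p, v in V.items()}
start = min(sizes, key=sizes.get)
print(start, sizes[start])  # ('S', 'U') 20000

# Step 2: join the remaining relation R to the current tree, reusing V(R, B)
# and V(S, B) (preservation of value sets keeps them valid for S join U).
print(join_size(T["R"], sizes[start], 20, 50))  # 400000
```

This reproduces the right-to-left ordering of the earlier multiple-join slide: start with S ⋈ U (20,000 tuples), then add R.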

Implementing joins

A. Nested loops

W = [ ]
for r in R :
    for s in S :
        if match_found(r, s) :
            append_concatenated_rows(W, r, s)

Number of operations:  T(R)×T(S)

The idea  [diagram: tables R and S]  Try to match every tuple of R with all tuples of S

Optimization Assume that the second (inner) relation can fit in main memory  Read it only once  Number of reads is T(R) + T(S)

B. Sort and merge We sort the two tables using the matching attributes as sorting keys Can now find the matches by doing a merge  Single-pass process unless we have duplicate matches  Number of operations is O(T(R) log T(R)) + O(T(S) log T(S)) + T(R) + T(S), assuming one table does not have potential duplicate matches  Great if the tables are already sorted
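A minimal sort-merge join sketch (tuple layout and function name are mine; the join attribute is the first field, and duplicate keys on the S side are handled by scanning the run):

```python
def sort_merge_join(R, S):
    # Sort both tables on the join attribute (first field), then merge.
    R = sorted(R)
    S = sorted(S)
    out, i, j = [], 0, 0
    while i < len(R) and j < len(S):
        if R[i][0] < S[j][0]:
            i += 1
        elif R[i][0] > S[j][0]:
            j += 1
        else:
            # Match: emit R[i] against the whole run of equal keys in S.
            k = j
            while k < len(S) and S[k][0] == R[i][0]:
                out.append(R[i] + S[k])
                k += 1
            i += 1
    return out

R = [(3, "z"), (1, "x"), (2, "y")]
S = [(3, 30), (5, 50)]
print(sort_merge_join(R, S))  # [(3, 'z', 3, 30)]
```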

C. Hashing Assume both tables are hashed on the matching attributes into K buckets

for i in range(0, K) :
    join all R entries in bucket i with all S entries in the same bucket

 We replace one big join by K smaller joins  Number of operations will be: K×(T(R)/K)×(T(S)/K) = T(R)×T(S)/K
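The bucket-by-bucket idea can be sketched as follows (tuple layout, function name, and the use of Python's built-in hash are mine):

```python
# Hash-join sketch: bucket both tables on the join attribute (first field),
# then join bucket by bucket; only same-bucket tuples can possibly match.
def hash_join(R, S, K=4):
    r_buckets = [[] for _ in range(K)]
    s_buckets = [[] for _ in range(K)]
    for r in R:
        r_buckets[hash(r[0]) % K].append(r)
    for s in S:
        s_buckets[hash(s[0]) % K].append(s)
    out = []
    for i in range(K):
        for r in r_buckets[i]:
            for s in s_buckets[i]:
                if r[0] == s[0]:   # guard against hash collisions
                    out.append(r + s)
    return out

R = [(1, "x"), (3, "z")]
S = [(3, 30), (5, 50)]
print(hash_join(R, S))  # [(3, 'z', 3, 30)]
```

Each of the K bucket joins is a small nested loop, which is how the T(R)×T(S)/K operation count arises.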