Presentation is loading. Please wait.

Presentation is loading. Please wait.

Tallahassee, Florida, 2016 COP5725 Advanced Database Systems Query Optimization Spring 2016.

Similar presentations


Presentation on theme: "Tallahassee, Florida, 2016 COP5725 Advanced Database Systems Query Optimization Spring 2016."— Presentation transcript:

1 Tallahassee, Florida, 2016 COP5725 Advanced Database Systems Query Optimization Spring 2016

2 Why Do We Learn This? Query Optimization – At the heart of the database engine Step 1: convert the SQL query to some logical plan – Query compiler Step 2: find a better logical plan, find an associated physical plan – We have multiple ways to address SQL queries, which one is better (the best)? 1

3 Converting from SQL to Logical Plans 2 Select a1, …, an From R1, …, Rk Where C Select a1, …, an From R1, …, Rk Where C  a1,…,an (  C (R1 x R2 x … x Rk))  a1,…,an (  b1, …, bm, aggs (  C (R1 x R2 x … x Rk))) Select a1, …, an From R1, …, Rk Where C Group by b1, …, bm Select a1, …, an From R1, …, Rk Where C Group by b1, …, bm

4 Optimization: Logical Query Plan Now we have one logical plan – Usually not optimal Two approaches to optimizations: – Rule-based (heuristics): apply laws that seem to result in cheaper plans – Cost-based: estimate size and cost of intermediate results, search systematically for best plan Three key components – Algebraic laws – An optimization algorithm – A cost estimator 3

5 Algebraic Laws Commutative and Associative Laws – R  S = S  R, R  (S  T) = (R  S)  T – R ∩ S = S ∩ R, R ∩ (S ∩ T) = (R ∩ S) ∩ T – R ⋈ S = S ⋈ R, R ⋈ (S ⋈ T) = (R ⋈ S) ⋈ T Distributive Laws – R ⋈ (S  T) = (R ⋈ S)  (R ⋈ T) 4 Q: How to prove these laws? Make sense?

6 Algebraic Laws Laws involving selection: –  C AND C’ (R) =  C (  C’ (R)) =  C (R) ∩  C’ (R) –  C OR C’ (R) =  C (R) U  C’ (R) –  C (R ⋈ S) =  C (R) ⋈ S When C involves only attributes of R –  C (R – S) =  C (R) – S –  C (R  S) =  C (R)   C (S) –  C (R ∩ S) =  C (R) ∩ S 5 Q: What do they mean? Make sense?

7 Example R(A, B), S(B, C)  (A=1 OR A=3) AND (B<C) (R ⋈ S)  (A=1 OR A=3) (  (B<C) (R ⋈ S))  (A=1 OR A=3) (R ⋈  (B<C) S)  (A=1 OR A=3) (R) ⋈  (B<C) S 6

8 Algebraic Laws Laws involving projections –  M (R ⋈ S) =  N (  P (R) ⋈  Q (S)) Where N, P, Q are appropriate subsets of attributes of M Does it make sense to reduce I/O? –  M (  N (R)) =  M ∩ N (R) Example R(A,B,C,D), S(E, F, G) –  A,B,G (R ⋈ D=E S) =  ? (  ? (R) ⋈  ? (S)) 7 Q: Again, what do they mean? Make sense?

9 Rule(Heuristic) Based Optimization Query rewriting based on algebraic laws Result in better queries most of the time Heuristics number 1: – Push selections down Heuristics number 2: – Sometimes push selections up, then down 8

10 Predicate Pushdown The earlier we process selections, less tuples we need to manipulate higher up in the tree (but may cause us to loose an important ordering of the tuples, if we use indexes) 9 Product Company maker=name  price>100 AND city=“Tally” pname Product Company maker=name price>100 pname city=“Tally”

11 Predicate Pushdown For each company, find the maximal price of its products – Advantage: the size of the join will be smaller Won’t work if we replace Max by Min 10 Select y.name, Max(x.price) From product x, company y Where x.maker = y.name GroupBy y.name Having Max(x.price) > 100 Select y.name, Max(x.price) From product x, company y Where x.maker = y.name GroupBy y.name Having Max(x.price) > 100 Select y.name, Max(x.price) From product x, company y Where x.maker=y.name and x.price > 100 GroupBy y.name Having Max(x.price) > 100 Select y.name, Max(x.price) From product x, company y Where x.maker=y.name and x.price > 100 GroupBy y.name Having Max(x.price) > 100

12 Behind the Scene: Oracle RBO and CBO Oracle 7 (1992) prior (since 1979): RBO Oracle 7-10: RBO + CBO Oracle 10g (2003): CBO 11

13 Behind the Scene: Oracle RBO and CBO 12

14 Cost Based Estimation 13

15 Cost-based Optimizations Main idea: apply algebraic laws, until estimated cost is minimal Practically: start from partial plans, introduce operators one by one Problem: there are too many ways to apply the laws, hence too many (partial) plans Approaches: – Top-down: the partial plan is a top fragment of the logical plan – Bottom up: the partial plan is a bottom fragment of the logical plan 14

16 Search Strategies Branch-and-bound: – Remember the cheapest complete plan P seen so far and its cost C – Stop generating partial plans whose cost is > C – If a cheaper complete plan is found, replace P, C Hill climbing: – Remember only the cheapest partial plan seen so far Dynamic programming: – Remember the all cheapest partial plans 15

17 Join Trees R1 ⋈ R2 ⋈ …. ⋈ Rn Join tree: A plan = a join tree A partial plan = a subtree of a join tree 16 R3R1R2R4

18 Join Trees Left deep: 17 R3R1 R5 R2 R4

19 Join Trees Bushy: 18 R3 R1 R2R4 R5

20 Problem Given a query R1 ⋈ R2 ⋈ … ⋈ Rn Assume we have a function cost() that gives us the cost of every join tree Objective: Find the best join tree for the query Dynamic programming – Idea: for each subset (subquery) of {R1, …, Rn}, compute the best plan for that subset – In increasing order of set cardinality: Step 1: for {R1}, {R2}, …, {Rn} Step 2: for {R1,R2}, {R1,R3}, …, {Rn-1, Rn} … Step n: for {R1, …, Rn} – It is a bottom-up strategy 19

21 Dynamic Programming - Algorithm For each subquery Q ⊆ {R1, …, Rn} compute the following: – Size(Q) – A best plan for Q: Plan(Q) – The cost of that plan: Cost(Q) Step 1: For each {Ri} do: – Size({Ri}) = B(Ri) – Plan({Ri}) = Ri – Cost({Ri}) = (cost of scanning Ri) 20

22 Dynamic Programming - Algorithm Step i: For each Q ⊆ {R1, …, Rn} of cardinality i do: – Compute Size(Q) (later…) – For every pair of subqueries Q’, Q’’ s.t. Q = Q’  Q’’ compute cost(Plan(Q’) ⋈ Plan(Q’’)) – Cost(Q) = the smallest such cost – Plan(Q) = the corresponding plan Finally, return Plan({R1, …, Rn}) 21

23 Dynamic Programming - Cost To illustrate, we will make the following simplifications: Cost(P1 ⋈ P2) = Cost(P1) + Cost(P2) + size(intermediate result) – Intermediate results: If P1 = a join, then the size of the intermediate result is size(P1), otherwise the size is 0 Similarly for P2 – Cost of a scan = 0 Example: – Cost(R1 ⋈ R2) = 0 (no intermediate results) – Cost((R1 ⋈ R2) ⋈ R3) = Cost(R1 ⋈ R2) + Cost(R3) + size(R1 ⋈ R2) = size(R1 ⋈ R2) 22

24 Dynamic Programming - Example Relations: R, S, T, U Number of tuples: 2000, 5000, 3000, 1000 Size estimation: T(A ⋈ B) = 0.01*T(A)*T(B) 23

25 Dynamic Programming - Example 24 SubquerySizeCostPlan RS RT RU ST SU TU RST RSU RTU STU RSTU

26 Dynamic Programming - Example 25 SubquerySizeCostPlan RS100k0RS RT60k0RT RU20k0RU ST150k0ST SU50k0SU TU30k0TU RST3M60k(RT)S RSU1M20k(RU)S RTU0.6M20k(RU)T STU1.5M30k(TU)S RSTU30M60k+50k=110k(RT)(SU)

27 Dynamic Programming - Summary Compute optimal plans for subqueries: – Step 1: {R1}, {R2}, …, {Rn} – Step 2: {R1, R2}, {R1, R3}, …, {Rn-1, Rn} – … – Step n: {R1, …, Rn} We used naïve size/cost estimations In practice: – more realistic size/cost estimations – heuristics for reducing the Search Space Restrict to left linear trees Restrict to trees “without cartesian product”: R(A,B), S(B,C), T(C,D) (R join T) join S has a cartesian product 26

28 Size Estimation Need size in order to estimate cost Example: – Cost of partitioned hash-join E1 E2 is 3B(E1) + 3B(E2) – B(E1) = T(E1)/ block size – B(E2) = T(E2)/ block size – So, we need to estimate T(E1), T(E2) Estimating the size of a selection – S =  A=c (R) T(S) can be anything from 0 to T(R) – V(R,A) + 1 Mean value: T(S) = T(R)/V(R,A) – S =  A<c (R) T(S) can be anything from 0 to T(R) Heuristics: T(S) = T(R)/3 27

29 Size Estimation Assume V(R,A) <= V(S,A) Then each tuple t in R joins some tuple(s) in S – How many ? – On average S/V(S,A) – t will contribute S/V(S,A) tuples in R S Hence T(R S) = T(R) T(S) / V(S,A) In general: T(R S) = T(R) T(S) / max(V(R,A),V(S,A)) Example – T(R) = 10000, T(S) = 20000 – V(R,A) = 100, V(S,A) = 200 – How large is R S ? – Answer: T(R S) = 10000 * 20000/200 = 1M 28 AAAAA

30 Size Estimation 29 Joins on more than one attribute: T(R S) = T(R) T(S)/max(V(R,A),V(S,A))max(V(R,B),V(S,B)) A,B

31 Histograms Employee(ssn, name, salary, phone) – Maintain a histogram on salary: – T(Employee) = 25000, but now we know the distribution Ranks(rankName, salary) 30

32 Histograms Assume: – V(Employee, Salary) = 200 – V(Ranks, Salary) = 250 Then T(Employee Ranks) = =  i=1,6 T i T i ’ / 250 = (200x8 + 800x20 + 5000x40 + 12000x80 + 6500x100 + 500x2)/250 = …. 31 Salary


Download ppt "Tallahassee, Florida, 2016 COP5725 Advanced Database Systems Query Optimization Spring 2016."

Similar presentations


Ads by Google