Tallahassee, Florida, 2016 COP5725 Advanced Database Systems Query Optimization Spring 2016.


Why Do We Learn This?
Query optimization
– At the heart of the database engine
Step 1: convert the SQL query to some logical plan
– Query compiler
Step 2: find a better logical plan, then find an associated physical plan
– We have multiple ways to evaluate an SQL query; which one is better (the best)?

Converting from SQL to Logical Plans
Select a1, …, an From R1, …, Rk Where C
  π_{a1,…,an}(σ_C(R1 × R2 × … × Rk))
Select a1, …, an From R1, …, Rk Where C Group by b1, …, bm
  π_{a1,…,an}(γ_{b1,…,bm, aggs}(σ_C(R1 × R2 × … × Rk)))

Optimization: Logical Query Plan
Now we have one logical plan
– Usually not optimal
Two approaches to optimization:
– Rule-based (heuristics): apply laws that seem to result in cheaper plans
– Cost-based: estimate size and cost of intermediate results, search systematically for best plan
Three key components
– Algebraic laws
– An optimization algorithm
– A cost estimator

Algebraic Laws
Commutative and associative laws
– R ∪ S = S ∪ R, R ∪ (S ∪ T) = (R ∪ S) ∪ T
– R ∩ S = S ∩ R, R ∩ (S ∩ T) = (R ∩ S) ∩ T
– R ⋈ S = S ⋈ R, R ⋈ (S ⋈ T) = (R ⋈ S) ⋈ T
Distributive laws
– R ⋈ (S ∪ T) = (R ⋈ S) ∪ (R ⋈ T)
Q: How to prove these laws? Do they make sense?

Algebraic Laws
Laws involving selection:
– σ_{C AND C’}(R) = σ_C(σ_{C’}(R)) = σ_C(R) ∩ σ_{C’}(R)
– σ_{C OR C’}(R) = σ_C(R) ∪ σ_{C’}(R)
– σ_C(R ⋈ S) = σ_C(R) ⋈ S, when C involves only attributes of R
– σ_C(R – S) = σ_C(R) – S
– σ_C(R ∪ S) = σ_C(R) ∪ σ_C(S)
– σ_C(R ∩ S) = σ_C(R) ∩ S
Q: What do they mean? Do they make sense?

Example
R(A, B), S(B, C)
  σ_{(A=1 OR A=3) AND (B<C)}(R ⋈ S)
= σ_{(A=1 OR A=3)}(σ_{(B<C)}(R ⋈ S))
= σ_{(A=1 OR A=3)}(R ⋈ σ_{(B<C)}(S))
= σ_{(A=1 OR A=3)}(R) ⋈ σ_{(B<C)}(S)
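The rewrite in this example can be sanity-checked on a small instance. The sketch below is only an illustration: the rows of R and S are hypothetical, relations are modeled as Python sets of tuples, and `join` is a helper name introduced here for the natural join on B.

```python
# Hypothetical instances: R has rows (A, B), S has rows (B, C).
R = {(1, 10), (2, 10), (3, 20), (1, 30)}
S = {(10, 5), (10, 15), (20, 25), (30, 30)}


def join(r, s):
    """Natural join of R(A, B) and S(B, C) on the shared attribute B."""
    return {(a, b, c) for (a, b) in r for (b2, c) in s if b == b2}


# Selection applied after the join (the starting expression).
original = {
    (a, b, c) for (a, b, c) in join(R, S) if (a == 1 or a == 3) and b < c
}

# Selections pushed below the join (the final expression).
pushed = join(
    {(a, b) for (a, b) in R if a == 1 or a == 3},
    {(b, c) for (b, c) in S if b < c},
)

if __name__ == "__main__":
    print(sorted(original))
    print(sorted(pushed))
```

Both expressions return the same rows on this instance. A check on one instance is not a proof of the law, only a quick way to catch a wrong rewrite.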

Algebraic Laws
Laws involving projections
– π_M(R ⋈ S) = π_N(π_P(R) ⋈ π_Q(S)), where N, P, Q are appropriate subsets of attributes of M
  Does it make sense to reduce I/O?
– π_M(π_N(R)) = π_{M ∩ N}(R)
Example R(A,B,C,D), S(E, F, G)
– π_{A,B,G}(R ⋈_{D=E} S) = π_?(π_?(R) ⋈_{D=E} π_?(S))
Q: Again, what do they mean? Do they make sense?

Rule (Heuristic) Based Optimization
Query rewriting based on algebraic laws
Results in better queries most of the time
Heuristic number 1:
– Push selections down
Heuristic number 2:
– Sometimes push selections up, then down

Predicate Pushdown
The earlier we process selections, the fewer tuples we need to manipulate higher up in the tree (but this may cause us to lose an important ordering of the tuples, if we use indexes)
Before: π_{pname}(σ_{price>100 AND city=“Tally”}(Product ⋈_{maker=name} Company))
After: π_{pname}(σ_{price>100}(Product) ⋈_{maker=name} σ_{city=“Tally”}(Company))

Predicate Pushdown
For each company, find the maximal price of its products
Original query:
  Select y.name, Max(x.price) From product x, company y Where x.maker = y.name Group By y.name Having Max(x.price) > 100
Rewritten query:
  Select y.name, Max(x.price) From product x, company y Where x.maker = y.name and x.price > 100 Group By y.name Having Max(x.price) > 100
– Advantage: the size of the join will be smaller
– Won’t work if we replace Max by Min
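A small simulation can make the Max versus Min remark concrete. This is a sketch under assumptions: the product and company rows are hypothetical, and `grouped` is a helper name introduced here that mimics the join, the grouping, and the Having filter, with an optional pushed-down `price > 100` predicate.

```python
# Hypothetical rows: product(pname, maker, price) and company(name).
PRODUCTS = [
    ("p1", "acme", 50), ("p2", "acme", 150),
    ("p3", "bolt", 80), ("p4", "bolt", 90),
    ("p5", "cog", 120), ("p6", "cog", 300),
]
COMPANIES = {"acme", "bolt", "cog"}


def grouped(agg, pushdown):
    """Join, group by company, keep groups where agg(price) > 100."""
    groups = {}
    for _, maker, price in PRODUCTS:
        if maker not in COMPANIES:
            continue  # join condition x.maker = y.name
        if pushdown and not price > 100:
            continue  # pushed-down predicate x.price > 100
        groups.setdefault(maker, []).append(price)
    return {
        name: agg(prices)
        for name, prices in groups.items()
        if agg(prices) > 100  # Having agg(x.price) > 100
    }


if __name__ == "__main__":
    print(grouped(max, False), grouped(max, True))
    print(grouped(min, False), grouped(min, True))
```

With Max, both variants agree on this data. With Min, the pushed-down variant wrongly reports acme, because discarding its cheap product changes the minimum of the group.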

Behind the Scenes: Oracle RBO (rule-based optimizer) and CBO (cost-based optimizer)
– Before Oracle 7 (1992), since 1979: RBO
– Oracle 7-10: RBO + CBO
– Oracle 10g (2003): CBO


Cost Based Estimation

Cost-based Optimizations
Main idea: apply algebraic laws until the estimated cost is minimal
Practically: start from partial plans, introduce operators one by one
Problem: there are too many ways to apply the laws, hence too many (partial) plans
Approaches:
– Top-down: the partial plan is a top fragment of the logical plan
– Bottom-up: the partial plan is a bottom fragment of the logical plan

Search Strategies
Branch-and-bound:
– Remember the cheapest complete plan P seen so far and its cost C
– Stop generating partial plans whose cost is > C
– If a cheaper complete plan is found, replace P, C
Hill climbing:
– Remember only the cheapest partial plan seen so far
Dynamic programming:
– Remember all the cheapest partial plans

Join Trees
R1 ⋈ R2 ⋈ … ⋈ Rn
Join tree:
– A plan = a join tree
– A partial plan = a subtree of a join tree
[Figure: a join tree with leaves R3, R1, R2, R4]

Join Trees
Left deep:
[Figure: a left-deep join tree with leaves R3, R1, R5, R2, R4]

Join Trees
Bushy:
[Figure: a bushy join tree with leaves R3, R1, R2, R4, R5]

Problem
Given a query R1 ⋈ R2 ⋈ … ⋈ Rn
Assume we have a function cost() that gives us the cost of every join tree
Objective: find the best join tree for the query
Dynamic programming
– Idea: for each subset (subquery) of {R1, …, Rn}, compute the best plan for that subset
– In increasing order of set cardinality:
  Step 1: for {R1}, {R2}, …, {Rn}
  Step 2: for {R1,R2}, {R1,R3}, …, {Rn-1, Rn}
  …
  Step n: for {R1, …, Rn}
– It is a bottom-up strategy

Dynamic Programming - Algorithm
For each subquery Q ⊆ {R1, …, Rn} compute the following:
– Size(Q)
– A best plan for Q: Plan(Q)
– The cost of that plan: Cost(Q)
Step 1: For each {Ri} do:
– Size({Ri}) = B(Ri)
– Plan({Ri}) = Ri
– Cost({Ri}) = (cost of scanning Ri)

Dynamic Programming - Algorithm
Step i: For each Q ⊆ {R1, …, Rn} of cardinality i do:
– Compute Size(Q) (later…)
– For every pair of subqueries Q’, Q’’ s.t. Q = Q’ ∪ Q’’, compute cost(Plan(Q’) ⋈ Plan(Q’’))
– Cost(Q) = the smallest such cost
– Plan(Q) = the corresponding plan
Finally, return Plan({R1, …, Rn})

Dynamic Programming - Cost
To illustrate, we will make the following simplifications:
– Cost(P1 ⋈ P2) = Cost(P1) + Cost(P2) + size(intermediate result)
– Intermediate results: if P1 is a join, then the size of the intermediate result is size(P1), otherwise the size is 0; similarly for P2
– Cost of a scan = 0
Example:
– Cost(R1 ⋈ R2) = 0 (no intermediate results)
– Cost((R1 ⋈ R2) ⋈ R3) = Cost(R1 ⋈ R2) + Cost(R3) + size(R1 ⋈ R2) = size(R1 ⋈ R2)

Dynamic Programming - Example
Relations: R, S, T, U
Number of tuples: 2000, 5000, 3000, 1000
Size estimation: T(A ⋈ B) = 0.01*T(A)*T(B)

Dynamic Programming - Example
Table to fill in:
Subquery  Size  Cost  Plan
RS
RT
RU
ST
SU
TU
RST
RSU
RTU
STU
RSTU

Dynamic Programming - Example
Subquery  Size   Cost           Plan
RS        100k   0              RS
RT        60k    0              RT
RU        20k    0              RU
ST        150k   0              ST
SU        50k    0              SU
TU        30k    0              TU
RST       3M     60k            (RT)S
RSU       1M     20k            (RU)S
RTU       0.6M   20k            (RU)T
STU       1.5M   30k            (TU)S
RSTU      30M    60k+50k=110k   (RT)(SU)
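The table can be recomputed with a short program. The following is a minimal sketch of the bottom-up dynamic programming under the simplified cost model stated earlier (scans cost 0, and a join pays the size of each input that is itself a join). The names `best_plans`, `wrap`, and `SELECTIVITY_DIVISOR` are introduced here for illustration, and integer division by 100 stands in for the 0.01 factor so the arithmetic stays exact.

```python
from itertools import combinations

# Tuple counts from the example; dividing by 100 encodes
# T(A join B) = 0.01 * T(A) * T(B) in exact integer arithmetic.
TUPLES = {"R": 2000, "S": 5000, "T": 3000, "U": 1000}
SELECTIVITY_DIVISOR = 100


def wrap(plan, members):
    """Parenthesize a sub-plan only if it is itself a join."""
    return f"({plan})" if len(members) > 1 else plan


def best_plans(tuples):
    """Map each subset of relations to (size, cost, plan)."""
    # Step 1: single relations; a scan costs 0 in this model.
    table = {frozenset([n]): (count, 0, n) for n, count in tuples.items()}
    names = sorted(tuples)
    for k in range(2, len(names) + 1):  # increasing cardinality
        for subset in combinations(names, k):
            q = frozenset(subset)
            best = None
            # Try every split of q into two non-empty parts.
            for i in range(1, k // 2 + 1):
                for left_names in combinations(subset, i):
                    left = frozenset(left_names)
                    right = q - left
                    l_size, l_cost, l_plan = table[left]
                    r_size, r_cost, r_plan = table[right]
                    # Only inputs that are joins are intermediate results.
                    intermediate = (l_size if len(left) > 1 else 0) + (
                        r_size if len(right) > 1 else 0
                    )
                    cost = l_cost + r_cost + intermediate
                    if best is None or cost < best[1]:
                        size = l_size * r_size // SELECTIVITY_DIVISOR
                        # Print the larger sub-plan first, e.g. "(RT)S".
                        parts = sorted(
                            [(left, l_plan), (right, r_plan)],
                            key=lambda part: -len(part[0]),
                        )
                        plan = "".join(wrap(p, m) for m, p in parts)
                        best = (size, cost, plan)
            table[q] = best
    return table


if __name__ == "__main__":
    for members, (size, cost, plan) in best_plans(TUPLES).items():
        if len(members) > 1:
            print("".join(sorted(members)), size, cost, plan)
```

Ties are broken by enumeration order, so on other inputs the reported plan may be one of several equally cheap choices.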

Dynamic Programming - Summary
Compute optimal plans for subqueries:
– Step 1: {R1}, {R2}, …, {Rn}
– Step 2: {R1, R2}, {R1, R3}, …, {Rn-1, Rn}
– …
– Step n: {R1, …, Rn}
We used naïve size/cost estimations
In practice:
– more realistic size/cost estimations
– heuristics for reducing the search space
  Restrict to left linear trees
  Restrict to trees “without cartesian product”: for R(A,B), S(B,C), T(C,D), the plan (R join T) join S has a cartesian product

Size Estimation
Need size in order to estimate cost
Example:
– Cost of partitioned hash-join E1 ⋈ E2 is 3B(E1) + 3B(E2)
– B(E1) = T(E1) / block size
– B(E2) = T(E2) / block size
– So, we need to estimate T(E1), T(E2)
Estimating the size of a selection
– S = σ_{A=c}(R)
  T(S) can be anything from 0 to T(R) – V(R,A) + 1
  Mean value: T(S) = T(R)/V(R,A)
– S = σ_{A<c}(R)
  T(S) can be anything from 0 to T(R)
  Heuristic: T(S) = T(R)/3
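The selection estimates above amount to a few one-line formulas. The sketch below writes them out; the function names and the sample statistics T(R) = 10000 and V(R,A) = 100 are assumptions made for illustration.

```python
def equality_selection_estimate(t_r, v_r_a):
    """Mean-value estimate for S = sigma_{A=c}(R): T(R) / V(R,A)."""
    return t_r / v_r_a


def equality_selection_upper_bound(t_r, v_r_a):
    """Largest possible T(S) for an equality selection: T(R) - V(R,A) + 1."""
    return t_r - v_r_a + 1


def range_selection_estimate(t_r):
    """Heuristic estimate for S = sigma_{A<c}(R): T(R) / 3."""
    return t_r / 3


if __name__ == "__main__":
    # Hypothetical statistics, not taken from the slides.
    print(equality_selection_estimate(10000, 100))
    print(equality_selection_upper_bound(10000, 100))
    print(range_selection_estimate(10000))
```

The upper bound corresponds to the most skewed case, where every value of A other than c appears exactly once.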

Size Estimation
Assume V(R,A) <= V(S,A)
Then each tuple t in R joins some tuple(s) in S
– How many?
– On average T(S)/V(S,A)
– t will contribute T(S)/V(S,A) tuples to R ⋈_A S
Hence T(R ⋈_A S) = T(R) T(S) / V(S,A)
In general: T(R ⋈_A S) = T(R) T(S) / max(V(R,A), V(S,A))
Example
– T(R) = 10000, T(S) = 20000
– V(R,A) = 100, V(S,A) = 200
– How large is R ⋈_A S?
– Answer: T(R ⋈_A S) = 10000 * 20000/200 = 1M

Size Estimation
Joins on more than one attribute:
T(R ⋈_{A,B} S) = T(R) T(S) / (max(V(R,A),V(S,A)) · max(V(R,B),V(S,B)))
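Both join-size formulas fit in one small function. This is a sketch: `join_size_estimate` is a name introduced here, it takes one (V(R,X), V(S,X)) pair per join attribute X, and the second call uses hypothetical distinct-value counts for attribute B.

```python
def join_size_estimate(t_r, t_s, distinct_counts):
    """Estimate T(R join S) for an equi-join.

    distinct_counts holds one (V(R,X), V(S,X)) pair per join attribute X;
    the product T(R) * T(S) is divided by max(V(R,X), V(S,X)) for each.
    """
    size = t_r * t_s
    for v_r, v_s in distinct_counts:
        size /= max(v_r, v_s)
    return size


if __name__ == "__main__":
    # Single attribute A: T(R)=10000, T(S)=20000, V(R,A)=100, V(S,A)=200.
    print(join_size_estimate(10000, 20000, [(100, 200)]))
    # Two attributes A and B; the counts for B are hypothetical.
    print(join_size_estimate(10000, 20000, [(100, 200), (50, 20)]))
```

Dividing once per attribute treats the join attributes as independent, which is the assumption behind the multi-attribute formula.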

Histograms
Employee(ssn, name, salary, phone)
– Maintain a histogram on salary
– T(Employee) = 25000, but now we know the distribution
Ranks(rankName, salary)

Histograms
Assume:
– V(Employee, Salary) = 200
– V(Ranks, Salary) = 250
Then T(Employee ⋈_{Salary} Ranks) = Σ_{i=1..6} T_i T_i’ / 250 = (200×… + … + …×2)/250 = …
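The per-bucket sum can be sketched as follows. The bucket counts are hypothetical values chosen for illustration: the Employee counts add up to T(Employee) = 25000, but neither list is taken from the slides, and `histogram_join_estimate` is a name introduced here.

```python
# Hypothetical per-bucket tuple counts for six salary ranges.
EMPLOYEE_BUCKETS = [200, 800, 5000, 12000, 6500, 500]  # sums to 25000
RANKS_BUCKETS = [8, 20, 40, 80, 100, 2]


def histogram_join_estimate(buckets_r, buckets_s, v_r, v_s):
    """Sum T_i * T_i' over matching buckets, divided by max(V(R,A), V(S,A))."""
    total = sum(t_r * t_s for t_r, t_s in zip(buckets_r, buckets_s))
    return total / max(v_r, v_s)


if __name__ == "__main__":
    estimate = histogram_join_estimate(
        EMPLOYEE_BUCKETS, RANKS_BUCKETS, v_r=200, v_s=250
    )
    print(estimate)
```

Compared with the single formula T(R) T(S) / max(V(R,A), V(S,A)), the histogram version only multiplies counts that fall in the same salary range, so skew in the distribution is reflected in the estimate.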