CS411 Database Systems Kazuhiro Minami 12: Query Optimization.

Slides:



Advertisements
Similar presentations
Query Optimization Reserves Sailors sid=sid bid=100 rating > 5 sname (Simple Nested Loops) Imperative query execution plan: SELECT S.sname FROM Reserves.
Advertisements

CS CS4432: Database Systems II Logical Plan Rewriting.
Query Optimization May 31st, Today A few last transformations Size estimation Join ordering Summary of optimization.
Query Optimization CS634 Lecture 12, Mar 12, 2014 Slides based on “Database Management Systems” 3 rd ed, Ramakrishnan and Gehrke.
Query Optimization Goal: Declarative SQL query
Query Optimization. Query Optimization Process (simplified a bit) Parse the SQL query into a logical tree: –identify distinct blocks (corresponding to.
Lecture 14: Query Optimization. This Lecture Query rewriting Cost estimation –We have learned how atomic operations are implemented and their cost –We’ll.
Cost-Based Transformations. Why estimate costs? Well, sometimes we don’t need cost estimations to decide applying some heuristic transformation. –E.g.
Query Execution Optimizing Performance. Resolving an SQL query Since our SQL queries are very high level, the query processor must do a lot of additional.
Query Rewrite: Predicate Pushdown (through grouping) Select bid, Max(age) From Reserves R, Sailors S Where R.sid=S.sid GroupBy bid Having Max(age) > 40.
Cs44321 CS4432: Database Systems II Query Optimizer – Cost Based Optimization.
CS263 Lecture 19 Query Optimisation.  Motivation for Query Optimisation  Phases of Query Processing  Query Trees  RA Transformation Rules  Heuristic.
1 Distributed Databases CS347 Lecture 14 May 30, 2001.
Query Optimization: Transformations May 29 th, 2002.
1 Lecture 22: Query Execution Wednesday, March 2, 2005.
1 Anna Östlin Pagh and Rasmus Pagh IT University of Copenhagen Advanced Database Technology March 25, 2004 QUERY COMPILATION II Lecture based on [GUW,
Cost-Based Transformations. Why estimate costs? Sometimes we don’t need cost estimations to decide applying some heuristic transformation. –E.g. Pushing.
CMSC724: Database Management Systems Instructor: Amol Deshpande
Database System Concepts 5 th Ed. ©Silberschatz, Korth and Sudarshan See for conditions on re-usewww.db-book.com Chapter 14: Query Optimization.
Cost-Based Plan Selection Choosing an Order for Joins Chapter 16.5 and16.6 by:- Vikas Vittal Rao ID: 124/227 Chiu Luk ID: 210.
Query Optimization. General Overview Relational model - SQL  Formal & commercial query languages Functional Dependencies Normalization Physical Design.
16.5 Introduction to Cost- based plan selection Amith KC Student Id: 109.
T HE Q UERY C OMPILER Prepared by : Ankit Patel (226)
CPS216: Advanced Database Systems Notes 03:Query Processing (Overview, contd.) Shivnath Babu.
Query Processing Presented by Aung S. Win.
Query Optimization. overview Histograms A histogram is a data structure maintained by a DBMS to approximate a data distribution Equiwidth vs equidepth.
CS411 Database Systems Kazuhiro Minami 12: Query Optimization.
CSCE Database Systems Chapter 15: Query Execution 1.
Cost based transformations Initial logical query plan Two candidates for the best logical query plan.
Database Management 9. course. Execution of queries.
Query Optimization Arash Izadpanah. Introduction: What is Query Optimization? Query optimization is the process of selecting the most efficient query-evaluation.
CPS216: Advanced Database Systems Notes 08:Query Optimization (Plan Space, Query Rewrites) Shivnath Babu.
Query Optimization March 6 th, Query Optimization Process (simplified a bit) Parse the SQL query into a logical tree: –identify distinct blocks.
Query Optimization March 10 th, Very Big Picture A query execution plan is a program. There are many of them. The optimizer is trying to chose a.
CPS216: Data-Intensive Computing Systems Introduction to Query Processing Shivnath Babu.
1 Lecture 25 Friday, November 30, Outline Query execution –Two pass algorithms based on indexes (6.7) Query optimization –From SQL to logical.
Chapters 15-16a1 (Slides by Hector Garcia-Molina, Chapters 15 and 16: Query Processing.
CS411 Database Systems Kazuhiro Minami 11: Query Execution.
CPS216: Advanced Database Systems Notes 09:Query Optimization (Cost-based optimization) Shivnath Babu.
CS4432: Database Systems II Query Processing- Part 2.
Query Processing – Query Trees. Evaluation of SQL Conceptual order of evaluation – Cartesian product of all tables in from clause – Rows not satisfying.
1 Lecture 25: Query Optimization Wednesday, November 26, 2003.
Lecture 17: Query Execution Tuesday, February 28, 2001.
CS 440 Database Management Systems Query Optimization 1.
1 Choosing an Order for Joins. 2 What is the best way to join n relations? SELECT … FROM A, B, C, D WHERE A.x = B.y AND C.z = D.z Hash-Join Sort-JoinIndex-Join.
Chapter 13 Query Optimization Yonsei University 1 st Semester, 2015 Sanghyun Park.
Chapter 13: Query Processing
CSE 544: Lecture 14 Wednesday, 5/15/2002 Optimization, Size Estimation.
CS4432: Database Systems II Query Processing- Part 1 1.
Tallahassee, Florida, 2016 COP5725 Advanced Database Systems Query Optimization Spring 2016.
CS 440 Database Management Systems
Prepared by : Ankit Patel (226)
CS222P: Principles of Data Management Lecture #15 Query Optimization (System-R) Instructor: Chen Li.
Lecture 24: Query Execution and Optimization
Introduction to Database Systems CSE 444 Lecture 22: Query Optimization November 26-30, 2007.
Lecture 26: Query Optimization
Query Optimization and Perspectives
Lecture 27: Optimizations
Lecture 24: Query Execution
Lecture 25: Query Optimization
Lecture 23: Query Execution
Monday, 5/13/2002 Hash table indexes, query optimization
CS222: Principles of Data Management Lecture #15 Query Optimization (System-R) Instructor: Chen Li.
Lecture 23: Monday, November 25, 2002.
CSE 544: Optimizations Wednesday, 5/10/2006.
Lecture 26 Monday, December 3, 2001.
Lecture 27 Wednesday, December 5, 2001.
Lecture 20: Query Execution
Lecture 24: Wednesday, November 27, 2002.
Presentation transcript:

CS411 Database Systems Kazuhiro Minami 12: Query Optimization

One-pass Algorithms Duplicate elimination  (R) Need to keep a dictionary in memory: –balanced search tree –hash table –etc Cost: B(R) Assumption: B(  (R)) <= M R Input buffer Scan before? M-1 buffers Output buffer

Optimization Step 1: convert the SQL query to some logical plan –Remove subqueries from conditions –Map the SFW statement into RA expression Step 2: find a better logical plan, find an associated physical plan –Algebraic laws: foundation for every optimization –Two approaches to optimizations: Heuristics: apply laws that seem to result in cheaper plans Cost based: estimate size and cost of intermediate results, search systematically for best plan

SQL –> Logical Query Plans

Converting from SQL to Logical Plans Select a 1, …, a n From R 1, …, R k Where C Select a 1, …, a n From R 1, …, R k Where C  a1,…,an (  C (R 1  R 2  …  R k ))  a1,…,an (  b1, …, bm, aggs (  C (R 1  R 2  …  R k ))) Select a 1, …, a n From R 1, …, R k Where C Group by b 1, …, b l Select a 1, …, a n From R 1, …, R k Where C Group by b 1, …, b l

Some nested queries can be flattened Select distinct product.name From product Where product.maker in (Select company.name From company where company.city=“Urbana”) Select distinct product.name From product Where product.maker in (Select company.name From company where company.city=“Urbana”) Select distinct product.name From product, company Where product.maker = company.name AND company.city=“Urbana” Select distinct product.name From product, company Where product.maker = company.name AND company.city=“Urbana”

Converting Nested Queries Select distinct x.name, x.maker From product x Where x.color= “blue” AND x.price >= ALL (Select y.price From product y Where x.maker = y.maker AND y.color=“blue”) Select distinct x.name, x.maker From product x Where x.color= “blue” AND x.price >= ALL (Select y.price From product y Where x.maker = y.maker AND y.color=“blue”) Q: How do we convert this one to logical plan ? Q: Give a list of product-manufacture pairs where the color of the product is blue and its prices is the highest among the products with blue color from that manufacture.

Converting Nested Queries Select distinct x.name, x.maker From product x Where x.color= “blue” AND x.price < SOME (Select y.price From product y Where x.maker = y.maker AND y.color=“blue”) Select distinct x.name, x.maker From product x Where x.color= “blue” AND x.price < SOME (Select y.price From product y Where x.maker = y.maker AND y.color=“blue”) Let’s compute the complement first:

Converting Nested Queries Select distinct x.name, x.maker From product x, product y Where x.color= “blue” AND x.maker = y.maker AND y.color=“blue” AND x.price < y.price Select distinct x.name, x.maker From product x, product y Where x.color= “blue” AND x.maker = y.maker AND y.color=“blue” AND x.price < y.price This one becomes a query without subqueries: This returns exactly the products we DON’T want, so…

A set difference operator finishes the job (Select x.name, x.maker From product x Where x.color = “blue”) EXCEPT (Select x.name, x.maker From product x, product y Where x.color= “blue” AND x.maker = y.maker AND y.color=“blue” AND x.price < y.price) (Select x.name, x.maker From product x Where x.color = “blue”) EXCEPT (Select x.name, x.maker From product x, product y Where x.color= “blue” AND x.maker = y.maker AND y.color=“blue” AND x.price < y.price)

Now rewrite the logical plan to an equivalent but better one Same query answer Optimizer uses algebraic laws Equivalen t Original Will probably run faster Cost-based: estimate size and cost of intermediate results, search systematically for best plan Heuristic: likely to result in cheaper plans

Algebraic Laws

Commutative and Associative Laws –R  S = S  R, R  (S  T) = (R  S)  T –R ∩ S = S ∩ R, R ∩ (S ∩ T) = (R ∩ S) ∩ T –R ⋈ S = S ⋈ R, R ⋈ (S ⋈ T) = (R ⋈ S) ⋈ T Distributive Laws –R ⋈ (S  T) = (R ⋈ S)  (R ⋈ T)

Algebraic laws about selections  C AND D (R) =  C (  D (R)) =  C (R) ∩  D (R)  C OR D (R) =  C (R)   D (R)  C (R  S) =  C (R)   C (S)  C (R ⋈ S) =  C (R) ⋈ S  C (R – S) =  C (R) – S  C (R ∩ S) =  C (R) ∩ S if C involves only attributes of R

R( A,B,C,D ) S( E,F,G )  F=3 (R ⋈ D=E S) =  A=5 AND G=9 (R ⋈ D=E S) = (R ⋈ D=E  F=3 (S))  A=5 (  G=9 (R ⋈ D=E S)) =  A=5 (R ⋈ D=E  G=9 (S)) =  A=5 (R) ⋈ D=E  G=9 (S)

Algebraic laws for projection  M (R ⋈ S) =  N (  P (R) ⋈  Q (S)) where N, P, Q are appropriate subsets of attributes of M  M (  N (R)) =  M,N (R) R(A,B,C,D) S(E,F,G)  A,B,G (R ⋈ D=E S) =  ? (  ? (R) ⋈ D=E  ? (S))

Algebraic laws for grouping and aggregation  (  A, agg(B) (R)) =  A, agg(B) (R)  A, agg(B) (  (R)) =  A, agg(B) (R), if agg is duplicate insensitive The book describes additional algebraic laws, but even the book doesn’t cover them all. SUM COUNT AVG MIN MAX

Heuristics-based Optimization - or – Do projections and selections as early as possible

Heuristic Based Optimizations Query rewriting based on algebraic laws Result in better queries most of the time Heuristics number 1: –Push selections down Heuristics number 2: –Sometimes push selections up, then down

Predicate Pushdown Product Company maker=name  price>100 AND city=“Urbana” pname Product Company maker=name price>100 pname city=“Urbana” (but may cause us to lose an important ordering of the tuples, if we use indexes).

For each company with a product costing more than $100, find the max price of its products Select y.name, y.address, Max(x.price) From product x, company y Where x.maker = y.name GroupBy y.name Having Max(x.price) > 100 Select y.name, y.address, Max(x.price) From product x, company y Where x.maker = y.name GroupBy y.name Having Max(x.price) > 100 Select y.name, y.address, Max(x.price) From product x, company y Where x.maker=y.name and x.price > 100 GroupBy y.name Having Max(x.price) > 100 Select y.name, y.address, Max(x.price) From product x, company y Where x.maker=y.name and x.price > 100 GroupBy y.name Having Max(x.price) > 100 Advantage: the size of the join will be smaller. Requires transformation rules specific to the grouping/aggregation operators. Won’t work if we replace Max by Min.

Pushing predicates up Create View V1 AS Select x.category, Min(x.price) AS p From product x Where x.price < 20 GroupBy x.category Create View V1 AS Select x.category, Min(x.price) AS p From product x Where x.price < 20 GroupBy x.category Create View V2 AS Select y.cname, x.category, x.price From product x, company y Where x.maker=y.cname Create View V2 AS Select y.cname, x.category, x.price From product x, company y Where x.maker=y.cname Select V2.category, V2.cname, V2.price From V1, V2 Where V1.category = V2.category and V1.p = V2.price Select V2.category, V2.cname, V2.price From V1, V2 Where V1.category = V2.category and V1.p = V2.price V1: categories and the cheapest price of the product in that category under $20 V2: Company name, product category, and the price of the product made by that company

Query Rewrite: Pushing predicates up Create View V1 AS Select x.category, Min(x.price) AS p From product x Where x.price < 20 GroupBy x.category Create View V1 AS Select x.category, Min(x.price) AS p From product x Where x.price < 20 GroupBy x.category Create View V2 AS Select y.cname, x.category, x.price From product x, company y Where x.maker=y.cname Create View V2 AS Select y.cname, x.category, x.price From product x, company y Where x.maker=y.cname Select V2.cname, V2.price From V1, V2 Where V1.category = V2.category and V1.p = V2.price AND V1.p < 20 Select V2.cname, V2.price From V1, V2 Where V1.category = V2.category and V1.p = V2.price AND V1.p < 20

24 Query Rewrite: Pushing predicates up Create View V1 AS Select x.category, Min(x.price) AS p From product x Where x.price < 20 GroupBy x.category Create View V1 AS Select x.category, Min(x.price) AS p From product x Where x.price < 20 GroupBy x.category Create View V2 AS Select y.cname, x.category, x.price From product x, company y Where x.maker=y.cname AND x.price < 20 Create View V2 AS Select y.cname, x.category, x.price From product x, company y Where x.maker=y.cname AND x.price < 20 Select V2.cname, V2.price From V1, V2 Where V1.category = V2.category and V1.p = V2.price AND V1.p < 20 Select V2.cname, V2.price From V1, V2 Where V1.category = V2.category and V1.p = V2.price AND V1.p < 20

Cost-based Optimization

Cost-based Optimizations Main idea: apply algebraic laws, until estimated cost is minimal Practically: start from partial plans, introduce operators one by one –Will see in a few slides Problem: there are too many ways to apply the laws, hence too many (partial) plans

Often: generate a partial plan, optimize it, then add another operator, … Top-down: the partial plan is a top fragment of the logical plan Bottom up: the partial plan is a bottom fragment of the logical plan

Search Strategies Branch-and-bound: –Remember the cheapest complete plan P seen by using heuristics so far and its cost C –Stop generating partial plans whose cost is > C –If a cheaper complete plan P is found, replace C with P Hill climbing: –Find nearby plans that have lower cost by making small changes to the plan Dynamic programming: –Compute cheapest partial plans of the smallest and compute cheapest partial plans of larger size next –Remember the all cheapest partial plans

Algebraic Laws for Joins Commutative and Associative Laws –R  S = S  R, R  (S  T) = (R  S)  T –R ∩ S = S ∩ R, R ∩ (S ∩ T) = (R ∩ S) ∩ T –R ⋈ S = S ⋈ R, R ⋈ (S ⋈ T) = (R ⋈ S) ⋈ T Distributive Laws –R ⋈ (S  T) = (R ⋈ S)  (R ⋈ T)

The order that relations are joined in has a huge impact on performance Given: query R1 ⋈ … ⋈ Rn, function cost( ), find the best join tree for the query R3R1R2R4 Plan = tree Partial plan = subtree

Types of Join Trees Left deep (all right children are leaves) R3R1 R5 R2 R4

Types of Join Trees Right deep (all left children are leaves) R3 R1 R5 R2R4

Types of Join Trees Bushy if neither left-deep nor right-deep R3 R1 R2R4 R5

Dynamic programming is a good (bottom-up) way to choose join ordering Find the best plan for each subquery Q of {R1, …, Rn}: 1.{R1}, …, {Rn} 2.{R1, R2}, {R1, R3}, …, {Rn-1, Rn} 3.{R1, R2, R3}, {R1, R2, R4}, … 4.… 5.{R1, …, Rn} Output: 1.Size(Q) 2.A best plan Plan(Q) 3.Cost(Q)

The ith step of the dynamic program For each Q ⊆ {R1, …, Rn} of size i do: 1.Compute Size(Q) (later…) 2.For every pair Q1, Q2 such that Q = Q1  Q2, compute cost(Plan(Q1) ⋈ Plan(Q2)) Cost(Q) = the smallest such cost Plan(Q) = the corresponding plan

Dynamic Programming Return Plan({R1, …, Rn})

Dynamic Programming To illustrate, we will make the following simplifications: Cost(P1 ⋈ P2) = Cost(P1) + Cost(P2) + size(intermediate results for P1 and P2) Intermediate results: –If P1 is a join, then the size of the intermediate result is size(P1), otherwise the size is 0 –Similarly for P2 Cost of a scan = 0

Dynamic Programming Example: Cost(R5 ⋈ R7) = Cost(R5) + Cost(R7) + intermediate results for R5 and R7 = 0 (no intermediate results) Cost((R2 ⋈ R1) ⋈ R7) = Cost(R2 ⋈ R1) + Cost(R7) + size(R2 ⋈ R1) = size(R2 ⋈ R1)

Dynamic Programming Relations: R, S, T, U Number of tuples: 2000, 5000, 3000, 1000 Size estimation: T(A ⋈ B) = 0.01*T(A)*T(B)

SubquerySizeCostPlan RS RT RU ST SU TU RST RSU RTU STU RSTU T(R) = 2000 T(S) = 5000 T(T) = 3000 T(U) = 1000 T(A ⋈ B) = 0.01*T(A)*T(B) 100k0 RS 60K 0 RT 20K 0 RU 150K 0 ST 50K 0 SU 30K 0 TU 0.01 * T(R) * T(S) = 0.01 * 2000 * 5000 = 100,000 = 100k

SubquerySizeCostPlan RS RT RU ST SU TU RST RSU RTU STU RSTU T(R) = 2000 T(S) = 5000 T(T) = 3000 T(U) = 1000 T(A ⋈ B) = 0.01*T(A)*T(B) 100K0 RS 60K 0 RT 20K 0 RU 150K 0 ST 50K 0 SU 30K 0 TU Cost((RS)T) = Cost((RS)T) = Cost(RS) + Cost(T) + size(RS) = 100k Cost((RT)S) = Cost((RT)S) = Cost(RT) + Cost(S) + size(RT) = 60k Cost((ST)R) = 150k 3M 60K (RT)S 1M 20K (RU)S 0.6M 20K (RU)T 1.5M 30K (TU)S

SubquerySizeCostPlan RS RT RU ST SU TU RST RSU RTU STU RSTU T(R) = 2000 T(S) = 5000 T(T) = 3000 T(U) = 1000 T(A ⋈ B) = 0.01*T(A)*T(B) 100K0 RS 60K 0 RT 20K 0 RU 150K 0 ST 50K 0 SU 30K 0 TU 3M 60K (RT)S 1M 20K (RU)S 0.6M 20K (RU)T 1.5M 30K (TU)S 1.Cost(R(STU)) = 30K+1.5M 2.Cost(S(RTU)) = 20K+0.6M 3.Cost(T(RSU)) = 20K+1M 4.Cost(U(RST)) = 60K+3M 5.Cost((RS)(TU)) = 130K 6.Cost((RT)(SU)) = 110K 7.Cost((RU)(ST)) = 170K 30M 60K+50K (RT)(SU)

SubquerySizeCostPlan RS100k0RS RT60k0RT RU20k0RU ST150k0ST SU50k0SU TU30k0TU RST3M60k(RT)S RSU1M20k(RU)S RTU0.6M20k(RU)T STU1.5M30k(TU)S RSTU30M60k+50k=110k(RT)(SU)

What if we don’t oversimplify? More realistic size/cost estimations!! (next lecture) Use heuristics to reduce the search space –Consider only left linear trees –No trees with cartesian products: R(A,B) S(B,C) T(C,D) (R ⋈ T) ⋈ S has a cartesian product

Summary of query optimization process so far 1.Parse your query into tree form 2.Move selections as far down the tree as you can 3.Project out unwanted attributes as early as you can, when you have their tuples in memory anyway 4.Pick a good join order, based on the expected size of intermediate results 5.Pick an implementation for each operation in the tree