Lecture 25: Query Execution

Slides:

Advertisements

Similar presentations

1 Lecture 23: Query Execution Friday, March 4, 2005.

Advertisements

Lecture 13: Query Execution. Where are we? File organizations: sorted, hashed, heaps. Indexes: hash index, B+-tree Indexes can be clustered or not. Data.

Query Optimization Goal: Declarative SQL query

Query Optimization. Query Optimization Process (simplified a bit) Parse the SQL query into a logical tree: –identify distinct blocks (corresponding to.

Lecture 14: Query Optimization. This Lecture Query rewriting Cost estimation –We have learned how atomic operations are implemented and their cost –We’ll.

Lecture 9 Query Optimization November 24, 2010 Dan Suciu -- CSEP544 Fall

Lecture 24: Query Execution Monday, November 20, 2000.

1 Lecture 22: Query Execution Wednesday, March 2, 2005.

Query Optimization 3 Cost Estimation R&G, Chapters 12, 13, 14 Lecture 15.

CS186 Final Review Query Optimization.

1 Lecture 7: Query Execution and Optimization Tuesday, February 20, 2007.

1 Relational Operators. 2 Outline Logical/physical operators Cost parameters and sorting One-pass algorithms Nested-loop joins Two-pass algorithms.

CSE 444: Lecture 24 Query Execution Monday, March 7, 2005.

Query Optimization March 6 th, Query Optimization Process (simplified a bit) Parse the SQL query into a logical tree: –identify distinct blocks.

Query Optimization Imperative query execution plan: Declarative SQL query Ideally: Want to find best plan. Practically: Avoid worst plans! Goal: Purchase.

1 Lecture 23: Query Execution Wednesday, March 8, 2006.

1 Lecture 25 Friday, November 30, Outline Query execution –Two pass algorithms based on indexes (6.7) Query optimization –From SQL to logical.

CSE 544: Relational Operators, Sorting Wednesday, 5/12/2004.

CSE544 Query Optimization Tuesday-Thursday, February 8 th -10 th, 2011 Dan Suciu , Winter

CS411 Database Systems Kazuhiro Minami 11: Query Execution.

Lecture 17: Query Execution Tuesday, February 28, 2001.

Query Execution. Where are we? File organizations: sorted, hashed, heaps. Indexes: hash index, B+-tree Indexes can be clustered or not. Data can be stored.

CS 440 Database Management Systems Lecture 5: Query Processing 1.

Hash Tables and Query Execution March 1st, Hash Tables Secondary storage hash tables are much like main memory ones Recall basics: –There are n.

Query Execution Query compiler Execution engine Index/record mgr. Buffer manager Storage manager storage User/ Application Query update Query execution.

Tallahassee, Florida, 2016 COP5725 Advanced Database Systems Query Processing Spring 2016.

1 Lecture 23: Query Execution Monday, November 26, 2001.

CSE 544: Lecture 14 Wednesday, 5/15/2002 Optimization, Size Estimation.

CS4432: Database Systems II Query Processing- Part 1 1.

1 Lecture 24: Query Execution Monday, November 27, 2006.

CS 540 Database Management Systems

CS 440 Database Management Systems

Lecture 24: Query Execution and Optimization

Lecture 16: Relational Operators

Indexing and Execution

Database Management Systems (CS 564)

Evaluation of Relational Operations: Other Operations

CS222: Principles of Data Management Notes #13 Set operations, Aggregation Instructor: Chen Li.

Introduction to Database Systems

Cse 344 April 25th – Disk i/o.

Introduction to Database Systems CSE 444 Lecture 22: Query Optimization November 26-30, 2007.

Relational Operations

April 27th – Cost Estimation

Database Management Systems (CS 564)

Lecture 27: Optimizations

Relational Algebra Friday, 11/14/2003.

Lecture 24: Query Execution

Lecture 25: Query Optimization

Lecture 13: Query Execution

Query Execution Index Based Algorithms (15.6)

CS222P: Principles of Data Management Notes #13 Set operations, Aggregation, Query Plans Instructor: Chen Li.

Lecture 23: Query Execution

Evaluation of Relational Operations: Other Techniques

Lecture 22: Query Execution

CSE 444: Lecture 25 Query Execution

Lecture 22: Query Execution

Monday, 5/13/2002 Hash table indexes, query optimization

Query Optimization March 7th, 2003.

External Sorting Sorting is used in implementing many relational operations Problem: Relations are typically large, do not fit in main memory So cannot.

Lecture 11: B+ Trees and Query Execution

Query Optimization May 16th, 2002

CSE 544: Query Execution Wednesday, 5/12/2004.

Lecture 23: Monday, November 25, 2002.

CSE 544: Optimizations Wednesday, 5/10/2006.

Lecture 26 Monday, December 3, 2001.

Lecture 22: Friday, November 22, 2002.

Evaluation of Relational Operations: Other Techniques

Lecture 24: Query Execution

Lecture 20: Query Execution

Lecture 24: Wednesday, November 27, 2002.

Presentation transcript:

Lecture 25: Query Execution Wednesday, December 1st, 2005

Outline Index-based algorithms Query optimization: algebraic laws 16.2

Indexed Based Algorithms Recall that in a clustered index all tuples with the same value of the key are clustered on as few blocks as possible Note: book uses another term: “clustering index”. Difference is minor… a a a a a a a a a a

Index Based Selection Selection on equality: sa=v(R) Clustered index on a: cost B(R)/V(R,a) Unclustered index on a: cost T(R)/V(R,a)

Index Based Selection B(R) = 2000 T(R) = 100,000 V(R, a) = 20 Example: Table scan (assuming R is clustered): B(R) = 2,000 I/Os Index based selection: If index is clustered: B(R)/V(R,a) = 100 I/Os If index is unclustered: T(R)/V(R,a) = 5,000 I/Os Lesson: don’t build unclustered indexes when V(R,a) is small ! cost of sa=v(R) = ?

Index Based Join R S Assume S has an index on the join attribute Iterate over R, for each tuple fetch corresponding tuple(s) from S Assume R is clustered. Cost: If index is clustered: B(R) + T(R)B(S)/V(S,a) If index is unclustered: B(R) + T(R)T(S)/V(S,a)

Index Based Join Assume both R and S have a sorted index (B+ tree) on the join attribute Then perform a merge join called zig-zag join Cost: B(R) + B(S)

Summary of External Join Algorithms Block Nested Loop Join: B(S) + B(R)*B(S)/M Partitioned Hash Join: 3B(R)+3B(S) Assuming min(B(R),B(S)) <= M2 Merge Join Assuming B(R)+B(S) <= M2 Index Join B(R) + T(R)B(S)/V(S,a) Assuming…

Example Select Product.pname From Product, Company Product(pname, maker), Company(cname, city) How do we execute this query ? Select Product.pname From Product, Company Where Product.maker=Company.cname and Company.city = “Seattle”

Example Product(pname, maker), Company(cname, city) Assume: Clustered index: Product.pname, Company.cname Unclustered index: Product.maker, Company.city

Logical Plan: scity=“Seattle” Product (pname,maker) maker=cname scity=“Seattle” Product (pname,maker) Company (cname,city)

Index-based selection Physical plan 1: Index-based join Index-based selection cname=maker scity=“Seattle” Company (cname,city) Product (pname,maker)

Scan and sort (2a) index scan (2b) Physical plans 2a and 2b: Merge-join Which one is better ?? maker=cname scity=“Seattle” Product (pname,maker) Company (cname,city) Index- scan Scan and sort (2a) index scan (2b)

Index-based selection Physical plan 1:  T(Product) / V(Product, maker) Index-based join Index-based selection Total cost: T(Company) / V(Company, city)  T(Product) / V(Product, maker) cname=maker scity=“Seattle” Company (cname,city) Product (pname,maker) T(Company) / V(Company, city)

Scan and sort (2a) index scan (2b) Total cost: (2a): 3B(Product) + B(Company) (2b): T(Product) + B(Company) Physical plans 2a and 2b: Merge-join No extra cost (why ?) maker=cname scity=“Seattle” 3B(Product) Product (pname,maker) Company (cname,city) T(Product) Table- scan Scan and sort (2a) index scan (2b) B(Company)

Which one is better ?? It depends on the data !! Plan 1: T(Company)/V(Company,city)  T(Product)/V(Product,maker) Plan 2a: B(Company) + 3B(Product) Plan 2b: B(Company) + T(Product) Which one is better ?? It depends on the data !!

Example Case 1: V(Company, city)  T(Company) T(Company) = 5,000 B(Company) = 500 M = 100 T(Product) = 100,000 B(Product) = 1,000 We may assume V(Product, maker)  T(Company) (why ?) Case 1: V(Company, city)  T(Company) Case 2: V(Company, city) << T(Company) V(Company,city) = 2,000 V(Company,city) = 20

Which Plan is Best ? Case 1: Case 2: Plan 1: T(Company)/V(Company,city)  T(Product)/V(Product,maker) Plan 2a: B(Company) + 3B(Product) Plan 2b: B(Company) + T(Product) Case 1: Case 2:

Lessons Need to consider several physical plan even for one, simple logical plan No magic “best” plan: depends on the data In order to make the right choice need to have statistics over the data the B’s, the T’s, the V’s

Query Optimzation Have a SQL query Q Create a plan P Find equivalent plans P = P’ = P’’ = … Choose the “cheapest”. HOW ??

Logical Query Plan Q.phone > ‘5430000’ P=  In class: SELECT P.buyer FROM Purchase P, Person Q WHERE P.buyer=Q.name AND Q.city=‘seattle’ AND Q.phone > ‘5430000’ P= buyer  City=‘seattle’ phone>’5430000’ Buyer=name In class: find a “better” plan P’ Purchase Person

Logical Query Plan Q= SELECT city, sum(quantity) FROM sales GROUP BY city HAVING sum(quantity) < 100 T2(city,p) P= s p < 100 T1(city,p) g city, sum(quantity)→p In class: find a “better” plan P’ sales(product, city, price)

The three components of an optimizer We need three things in an optimizer: Algebraic laws An optimization algorithm A cost estimator

Algebraic Laws Commutative and Associative Laws Distributive Laws R  S = S  R, R  (S  T) = (R  S)  T R || S = S || R, R || (S || T) = (R || S) || T Distributive Laws R || (S  T) = (R || S)  (R || T)

Algebraic Laws Laws involving selection: s C AND C’(R) = s C(s C’(R)) = s C(R) ∩ s C’(R) s C OR C’(R) = s C(R)  s C’(R) s C (R || S) = s C (R) || S When C involves only attributes of R s C (R – S) = s C (R) – S s C (R  S) = s C (R)  s C (S) s C (R || S) = s C (R) || S

Algebraic Laws Example: R(A, B, C, D), S(E, F, G) s F=3 (R || D=E S) = ? s A=5 AND G=9 (R || D=E S) = ?

Algebraic Laws Laws involving projections PM(R || S) = PM(PP(R) || PQ(S)) PM(PN(R)) = PM,N(R) Example R(A,B,C,D), S(E, F, G) PA,B,G(R || D=E S) = P ? (P?(R) || D=E P?(S))

Algebraic Laws Laws involving grouping and aggregation: (A, agg(B)(R)) = A, agg(B)(R) A, agg(B)((R)) = A, agg(B)(R) if agg is “duplicate insensitive” Which of the following are “duplicate insensitive” ? sum, count, avg, min, max A, agg(D)(R(A,B) || B=C S(C,D)) = A, agg(D)(R(A,B) || B=C (C, agg(D)S(C,D)))

Optimizations Based on Semijoins THIS IS ADVANCED STUFF; NOT ON THE FINAL R S = P A1,…,An (R S) Where the schemas are: Input: R(A1,…An), S(B1,…,Bm) Output: T(A1,…,An)

Optimizations Based on Semijoins Semijoins: a bit of theory (see [AHV]) Given a query: A full reducer for Q is a program: Such that no dangling tuples remain in any relation Q :-  ( (R1 |x| R2 |x| . . . |x| Rn )) Ri1 := Ri1 Rj1 Ri2 := Ri2 Rj2 . . . . . Rip := Rip Rjp

Optimizations Based on Semijoins Example: A full reducer is: Q(E) :- R1(A,B) |x| R2(B,C) |x| R3(C,D,E) R2(B,C) := R2(B,C) |x R1(A,B) R3(C,D,E) := R3(C,D,E) |x R2(B,C) R2(B,C) := R2(B,C) |x R3(C,D,E) R1(A,B) := R1(A,B) |x R2(B,C) The new tables have only the tuples necessary to compute Q(E)

Optimizations Based on Semijoins Example: Doesn’t have a full reducer (we can reduce forever) Theorem a query has a full reducer iff it is “acyclic” Q(E) :- R1(A,B) |x| R2(B,C) |x| R3(A,C, E)

Optimizations Based on Semijoins Semijoins in [Chaudhuri’98] CREATE VIEW DepAvgSal As ( SELECT E.did, Avg(E.Sal) AS avgsal FROM Emp E GROUP BY E.did) SELECT E.eid, E.sal FROM Emp E, Dept D, DepAvgSal V WHERE E.did = D.did AND E.did = V.did AND E.age < 30 AND D.budget > 100k AND E.sal > V.avgsal

Optimizations Based on Semijoins First idea: CREATE VIEW LimitedAvgSal As ( SELECT E.did, Avg(E.Sal) AS avgsal FROM Emp E, Dept D WHERE E.did = D.did AND D.buget > 100k GROUP BY E.did) SELECT E.eid, E.sal FROM Emp E, Dept D, LimitedAvgSal V WHERE E.did = D.did AND E.did = V.did AND E.age < 30 AND D.budget > 100k AND E.sal > V.avgsal

Optimizations Based on Semijoins Better: full reducer CREATE VIEW PartialResult AS (SELECT E.id, E.sal, E.did FROM Emp E, Dept D WHERE E.did=D.did AND E.age < 30 AND D.budget > 100k) CREATE VIEW Filter AS (SELECT DISTINCT P.did FROM PartialResult P) CREATE VIEW LimitedAvgSal AS (SELECT E.did, Avg(E.Sal) AS avgsal FROM Emp E, Filter F WHERE E.did = F.did GROUP BY E.did)

Optimizations Based on Semijoins SELECT P.eid, P.sal FROM PartialResult P, LimitedDepAvgSal V WHERE P.did = V.did AND P.sal > V.avgsal