Database Techniek – Query Optimization (chapter 14)

Lecture 3
- Query Rewriting
  - Equivalence Rules
- Query Optimization
  - Dynamic Programming (System R/DB2)
  - Heuristics (Ingres/Postgres)
- De-correlation of nested queries
- Result Size Estimation
  - Practicum Assignment 2

Transformation of Relational Expressions
- Two relational algebra expressions are said to be equivalent if, on every legal database instance, the two expressions generate the same set of tuples.
  - Note: the order of tuples is irrelevant.
- In SQL, inputs and outputs are bags (multisets) of tuples.
  - Two expressions in the bag version of the relational algebra are said to be equivalent if, on every legal database instance, the two expressions generate the same bag of tuples.
- An equivalence rule says that expressions of two forms are equivalent.
  - An expression of the first form can be replaced by one of the second form, or vice versa.

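Equivalence Rules
1. Conjunctive selection operations can be deconstructed into a sequence of individual selections: $\sigma_{\theta_1 \wedge \theta_2}(E) = \sigma_{\theta_1}(\sigma_{\theta_2}(E))$
2. Selection operations are commutative: $\sigma_{\theta_1}(\sigma_{\theta_2}(E)) = \sigma_{\theta_2}(\sigma_{\theta_1}(E))$
3. Only the last in a sequence of projection operations is needed; the others can be omitted: $\Pi_{L_1}(\Pi_{L_2}(\ldots(\Pi_{L_n}(E))\ldots)) = \Pi_{L_1}(E)$
4. Selections can be combined with Cartesian products and theta joins.
   a. $\sigma_{\theta}(E_1 \times E_2) = E_1 \bowtie_{\theta} E_2$
   b. $\sigma_{\theta_1}(E_1 \bowtie_{\theta_2} E_2) = E_1 \bowtie_{\theta_1 \wedge \theta_2} E_2$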

Algebraic Rewritings for Selection
- $\sigma_{cond_1 \text{ AND } cond_2}(R) = \sigma_{cond_2}(\sigma_{cond_1}(R))$
- $\sigma_{cond_1 \text{ OR } cond_2}(R) = \sigma_{cond_1}(R) \cup \sigma_{cond_2}(R)$

Equivalence Rules (Cont.)
5. Theta-join operations (and natural joins) are commutative: $E_1 \bowtie_{\theta} E_2 = E_2 \bowtie_{\theta} E_1$
6. Natural join operations are associative: $(E_1 \bowtie E_2) \bowtie E_3 = E_1 \bowtie (E_2 \bowtie E_3)$

Equivalence Rules for Joins
[Figure: query trees illustrating that joins are commutative and associative.]

Equivalence Rules (Cont.)
7. For pushing down selections into a (theta) join we have the following cases:
   - (push 1) When all the attributes in $\theta_0$ involve only the attributes of one of the expressions ($E_1$) being joined: $\sigma_{\theta_0}(E_1 \bowtie_{\theta} E_2) = (\sigma_{\theta_0}(E_1)) \bowtie_{\theta} E_2$
   - (split) When $\theta_1$ involves only the attributes of $E_1$ and $\theta_2$ involves only the attributes of $E_2$: $\sigma_{\theta_1 \wedge \theta_2}(E_1 \bowtie_{\theta} E_2) = (\sigma_{\theta_1}(E_1)) \bowtie_{\theta} (\sigma_{\theta_2}(E_2))$
   - (impossible) When $\theta$ involves attributes of both $E_1$ and $E_2$ (it is a join condition), the selection cannot be pushed below the join.

Pushing Selection through Cartesian Product and Join
- $\sigma_{cond}(R \times S) = R \times \sigma_{cond}(S)$
- $\sigma_{cond}(R \bowtie S) = R \bowtie \sigma_{cond}(S)$
- The left-to-right direction requires that cond refers to S attributes only.

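Projection Decomposition
[Figure: a projection $\pi_{XY}$ above a combination of two inputs is decomposed into projections $\pi_X$ and $\pi_Y$ pushed onto the respective inputs. Example from the slide: for total = X.tax * Y.price, the projection on {total} sits above projections on {tax} and {price}.]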

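More Equivalence Rules
8. The projection operation distributes over the theta join operation as follows:
   (a) If $\theta$ involves only attributes from $L_1 \cup L_2$: $\Pi_{L_1 \cup L_2}(E_1 \bowtie_{\theta} E_2) = (\Pi_{L_1}(E_1)) \bowtie_{\theta} (\Pi_{L_2}(E_2))$
   (b) Consider a join $E_1 \bowtie_{\theta} E_2$.
       - Let $L_1$ and $L_2$ be sets of attributes from $E_1$ and $E_2$, respectively.
       - Let $L_3$ be attributes of $E_1$ that are involved in join condition $\theta$, but are not in $L_1 \cup L_2$, and
       - let $L_4$ be attributes of $E_2$ that are involved in join condition $\theta$, but are not in $L_1 \cup L_2$.
       Then $\Pi_{L_1 \cup L_2}(E_1 \bowtie_{\theta} E_2) = \Pi_{L_1 \cup L_2}((\Pi_{L_1 \cup L_3}(E_1)) \bowtie_{\theta} (\Pi_{L_2 \cup L_4}(E_2)))$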

Join Ordering Example
- For all relations $r_1$, $r_2$, and $r_3$: $(r_1 \bowtie r_2) \bowtie r_3 = r_1 \bowtie (r_2 \bowtie r_3)$
- If $r_2 \bowtie r_3$ is quite large and $r_1 \bowtie r_2$ is small, we choose $(r_1 \bowtie r_2) \bowtie r_3$ so that we compute and store a smaller temporary relation.

Join Ordering Example (Cont.)
- Consider the expression $\Pi_{customer\text{-}name}((\sigma_{branch\text{-}city = \text{"Brooklyn"}}(branch)) \bowtie account \bowtie depositor)$
- We could compute $account \bowtie depositor$ first and join the result with $\sigma_{branch\text{-}city = \text{"Brooklyn"}}(branch)$, but $account \bowtie depositor$ is likely to be a large relation.
- Since it is more likely that only a small fraction of the bank's customers have accounts in branches located in Brooklyn, it is better to compute $\sigma_{branch\text{-}city = \text{"Brooklyn"}}(branch) \bowtie account$ first.

Lecture 3
- Query Rewriting
  - Equivalence Rules
- Query Optimization
  - Dynamic Programming (System R/DB2)
  - Heuristics (Ingres/Postgres)
- De-correlation of nested queries
- Result Size Estimation

The role of Query Optimization
[Figure: pipeline from SQL through logical algebra to physical algebra, with the stages parsing/normalization, logical query optimization, physical query optimization, and query execution.]
- Logical query optimization: compare different relational algebra plans on result size (Practicum 2A)
- Physical query optimization: compare different execution algorithms on true cost (I/O, CPU, cache)

Enumeration of Equivalent Expressions
- Query optimizers use equivalence rules to systematically generate expressions equivalent to the given expression.
- The following step is repeated until no more expressions can be found:
  - for each expression found so far, use all applicable equivalence rules, and add newly generated expressions to the set of expressions found so far.
- This approach is very expensive in space and time.
- Time and space requirements are reduced by not generating all expressions.

Finding A Good Join Order
- Consider finding the best join order for $r_1 \bowtie r_2 \bowtie \ldots \bowtie r_n$.
- There are $(2(n-1))!/(n-1)!$ different join orders for the above expression. With n = 7, the number is 665,280; with n = 10, the number is greater than 17.6 billion!
- There is no need to generate all the join orders. Using dynamic programming, the least-cost join order for any subset of $\{r_1, r_2, \ldots, r_n\}$ is computed only once and stored for future use.
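The join-order count can be checked by evaluating the formula directly. The sketch below is only an arithmetic illustration; the function name `count_join_orders` is made up for this example.

```python
from math import factorial


def count_join_orders(n: int) -> int:
    """Number of join orders for n relations: (2(n-1))! / (n-1)!."""
    return factorial(2 * (n - 1)) // factorial(n - 1)


print(count_join_orders(7))   # 665280
print(count_join_orders(10))  # 17643225600, i.e. about 17.6 billion
```

The growth is factorial, which is why enumerating every order quickly becomes impractical.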

Dynamic Programming in Optimization
- To find the best join tree for a set of n relations:
  - To find the best plan for a set S of n relations, consider all possible plans of the form $S_1 \bowtie (S - S_1)$, where $S_1$ is any non-empty subset of S.
  - Recursively compute costs for joining subsets of S to find the cost of each plan. Choose the cheapest of the $2^n - 1$ alternatives.
  - When the plan for any subset is computed, store it and reuse it when it is required again, instead of recomputing it (dynamic programming).

Join Order Optimization Algorithm
  procedure findbestplan(S)
    if (bestplan[S].cost ≠ ∞)
      return bestplan[S]
    // else bestplan[S] has not been computed earlier, compute it now
    if (S contains only one relation)
      set bestplan[S].plan and bestplan[S].cost based on the best way of accessing S
    else for each non-empty subset S1 of S such that S1 ≠ S
      P1 = findbestplan(S1)
      P2 = findbestplan(S - S1)
      A = best algorithm for joining results of P1 and P2
      cost = P1.cost + P2.cost + cost of A
      if cost < bestplan[S].cost
        bestplan[S].cost = cost
        bestplan[S].plan = "execute P1.plan; execute P2.plan; join results of P1 and P2 using A"
    return bestplan[S]
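The procedure above can be sketched in Python with memoization. This is a minimal sketch under stated assumptions, not the slides' implementation: the statistics in `CARD` and `SEL` are invented, the cost of a join is taken to be the estimated size of its result (instead of choosing a join algorithm A), and reading a base relation costs nothing.

```python
from functools import lru_cache
from itertools import combinations

# Hypothetical statistics (assumptions for this example only).
CARD = {"A": 1000, "B": 100, "C": 10}           # tuples per relation
SEL = {frozenset("AB"): 0.01, frozenset("BC"): 0.1}  # join selectivities


def result_size(rels: frozenset) -> float:
    """Estimated tuples after joining all relations in rels."""
    size = 1.0
    for r in rels:
        size *= CARD[r]
    for a, b in combinations(sorted(rels), 2):
        # A pair without a join predicate behaves like a cross product.
        size *= SEL.get(frozenset((a, b)), 1.0)
    return size


@lru_cache(maxsize=None)  # plays the role of the bestplan[] table
def find_best_plan(rels: frozenset):
    """Return (cost, plan) for the cheapest bushy join tree over rels."""
    if len(rels) == 1:
        (r,) = rels
        return 0.0, r  # base case: a single relation
    best = (float("inf"), None)
    members = sorted(rels)
    for k in range(1, len(members)):
        for left in combinations(members, k):
            s1 = frozenset(left)
            c1, p1 = find_best_plan(s1)
            c2, p2 = find_best_plan(rels - s1)
            cost = c1 + c2 + result_size(rels)
            if cost < best[0]:
                best = (cost, (p1, p2))
    return best


cost, plan = find_best_plan(frozenset(CARD))
print(cost, plan)
```

With these made-up numbers the cheapest plan joins B and C first, because that intermediate result (100 tuples) is the smallest available.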

Left Deep Join Trees
- In left-deep join trees, the right-hand-side input for each join is a relation, not the result of an intermediate join.

Cost of Optimization
- With dynamic programming, the time complexity of optimization with bushy trees is $O(3^n)$.
  - With n = 10, this number is about 59,000 ($3^{10} = 59{,}049$) instead of 17.6 billion!
- Space complexity is $O(2^n)$.
- To find the best left-deep join tree for a set of n relations:
  - Consider n alternatives with one relation as right-hand-side input and the other relations as left-hand-side input.
  - Using the (recursively computed and stored) least-cost join order for each alternative on the left-hand side, choose the cheapest of the n alternatives.
- If only left-deep trees are considered, the time complexity of finding the best join order is $O(n 2^n)$.
  - Space complexity remains at $O(2^n)$.
- Cost-based optimization is expensive, but worthwhile for queries on large datasets (typical queries have small n, generally < 10).

Physical Query Optimization
- Minimizes absolute cost
  - Minimize I/Os
  - Minimize CPU and cache-miss cost (main-memory DBMS)
- Must consider the interaction of evaluation techniques when choosing evaluation plans: choosing the cheapest algorithm for each operation independently may not yield the best overall plan. E.g.
  - merge-join may be costlier than hash-join, but may provide a sorted output which reduces the cost of an outer-level aggregation.
  - nested-loop join may provide an opportunity for pipelining.

Physical Optimization: Interesting Orders
- Consider the expression $(r_1 \bowtie r_2 \bowtie r_3) \bowtie r_4 \bowtie r_5$.
- An interesting sort order is a particular sort order of tuples that could be useful for a later operation.
  - Generating the result of $r_1 \bowtie r_2 \bowtie r_3$ sorted on the attributes common with $r_4$ or $r_5$ may be useful, but generating it sorted on the attributes common to only $r_1$ and $r_2$ is not useful.
  - Using merge-join to compute $r_1 \bowtie r_2 \bowtie r_3$ may be costlier, but may provide an output sorted in an interesting order.
- It is not sufficient to find the best join order for each subset of the set of n given relations; we must find the best join order for each subset, for each interesting sort order.
  - This is a simple extension of the earlier dynamic programming algorithms.
  - Usually, the number of interesting orders is quite small and doesn't affect time/space complexity significantly.

Heuristic Optimization
- Cost-based optimization is expensive, even with dynamic programming.
- Systems may use heuristics to reduce the number of choices that must be made in a cost-based fashion.
- Heuristic optimization transforms the query tree by using a set of rules that typically (but not in all cases) improve execution performance:
  - Perform selection early (reduces the number of tuples).
  - Perform projection early (reduces the number of attributes).
  - Perform the most restrictive selection and join operations before other similar operations.
- Some systems use only heuristics; others combine heuristics with partial cost-based optimization.

Steps in Typical Heuristic Optimization
1. Deconstruct conjunctive selections into a sequence of single selection operations (Equiv. rule 1).
2. Move selection operations down the query tree for the earliest possible execution (Equiv. rules 2, 7a, 7b, 11).
3. Execute first those selection and join operations that will produce the smallest relations (Equiv. rule 6).
4. Replace Cartesian product operations that are followed by a selection condition by join operations (Equiv. rule 4a).
5. Deconstruct lists of projection attributes and move them as far down the tree as possible, creating new projections where needed (Equiv. rules 3, 8a, 8b, 12).
6. Identify those subtrees whose operations can be pipelined, and execute them using pipelining.

Heuristic Join Order: the Wong-Youssefi algorithm (INGRES)
Sample TPC-H Schema
  Nation(NationKey, NName)
  Customer(CustKey, CName, NationKey)
  Order(OrderKey, CustKey, Status)
  LineItem(OrderKey, PartKey, Quantity)
  Product(SuppKey, PartKey, PName)
  Supplier(SuppKey, SName)
Query: find the names of suppliers that sell a product that appears in a line item of an order made by a customer who is in Canada.
  SELECT SName
  FROM Nation, Customer, Order, LineItem, Product, Supplier
  WHERE Nation.NationKey = Customer.NationKey
    AND Customer.CustKey = Order.CustKey
    AND Order.OrderKey = LineItem.OrderKey
    AND LineItem.PartKey = Product.PartKey
    AND Product.SuppKey = Supplier.SuppKey
    AND NName = 'Canada'

Challenges with Large Natural Join Expressions
For simplicity, assume that in the query
1. all joins are natural;
2. whenever two tables of the FROM clause have common attributes, we join on them.
Consider right-index (RI) joins only.
[Figure: one possible order. An index lookup for σ NName="Canada" on Nation, followed by right-index joins with Customer, Order, LineItem, Product and Supplier, and finally π SName.]

Wong-Youssefi algorithm: assumptions and objectives
- Assumption 1 (weak): indexes on all join attributes (keys and foreign keys)
- Assumption 2 (strong): at least one selection creates a small relation
  - A join with a small relation results in a small relation
- Objective: create a sequence of index-based joins such that all intermediate results are small

Hypergraphs
- Each node is an attribute.
- Relations are hyperedges.
- Two hyperedges for the same relation are possible.
- Can be extended to non-natural equality joins by merging nodes.
[Figure: hypergraph with attribute nodes CName, CustKey, NationKey, NName, Status, OrderKey, Quantity, PartKey, SuppKey, PName, SName and relation hyperedges Nation, Customer, Order, LineItem, Product, Supplier.]

Small Relations/Hypergraph Reduction
- "Nation" is small because it has the equality selection NName = "Canada".
- Pick a small relation (and its conditions) to start the plan.
[Figure: the hypergraph with the Nation hyperedge (NationKey, NName) marked as small; the plan so far is an index lookup for σ NName="Canada" on Nation.]

Small Relations/Hypergraph Reduction (Cont.)
(1) Remove the small relation (hypergraph reduction) and color as "small" any relation that joins with the removed "small" relation.
(2) Pick a small relation (and its conditions, if any) and join it with the small relation that has been reduced.
[Figure: Nation is removed from the hypergraph and Customer is now colored as small; the plan extends the index lookup on Nation with a right-index (RI) join to Customer.]

After a bunch of steps…
[Figure: the final plan. An index lookup for σ NName="Canada" on Nation, then right-index (RI) joins with Customer, Order, LineItem, Product and Supplier in that order, followed by π SName.]
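The reduction steps can be imitated with a small greedy loop. This is a simplified sketch, not the full Wong-Youssefi algorithm: it only tracks which relations share an attribute with the already-reduced part of the hypergraph, it ignores size estimates, and the function name `small_first_join_order` is invented for this example. The schema is the sample TPC-H schema from the slides.

```python
# Relation -> attribute set (the hyperedges of the hypergraph).
SCHEMA = {
    "Nation": {"NationKey", "NName"},
    "Customer": {"CustKey", "CName", "NationKey"},
    "Order": {"OrderKey", "CustKey", "Status"},
    "LineItem": {"OrderKey", "PartKey", "Quantity"},
    "Product": {"SuppKey", "PartKey", "PName"},
    "Supplier": {"SuppKey", "SName"},
}


def small_first_join_order(schema, start):
    """Greedy join order that always extends the 'small' part of the plan.

    start is the relation made small by a selection (an assumption of
    the algorithm). Every next relation must share an attribute with the
    relations joined so far, so it can be reached by an index join.
    """
    order = [start]
    reduced_attrs = set(schema[start])
    remaining = [r for r in schema if r != start]
    while remaining:
        nxt = next((r for r in remaining if schema[r] & reduced_attrs), None)
        if nxt is None:
            raise ValueError("remaining relations are not connected")
        order.append(nxt)
        reduced_attrs |= schema[nxt]
        remaining.remove(nxt)
    return order


print(small_first_join_order(SCHEMA, "Nation"))
```

Starting from Nation, the loop visits Customer, Order, LineItem, Product and Supplier, which matches the plan in the final figure.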

Some Query Optimizers
- System R/Starburst: dynamic programming on left-deep join orders. Also uses heuristics to push selections and projections down the query tree.
- DB2 and SQL Server are cost-based optimizers.
  - SQL Server is transformation-based and also uses dynamic programming.
- The MySQL optimizer is heuristics-based (rather weak).
- Heuristic optimization is used in some versions of Oracle:
  - Repeatedly pick the "best" relation to join next, starting from each of n starting points. Pick the best among these.

Lecture 3
- Query Rewriting
  - Equivalence Rules
- Query Optimization
  - Dynamic Programming (System R/DB2)
  - Heuristics (Ingres/Postgres)
- De-correlation of nested queries
- Result Size Estimation
  - Practicum Assignment 2

Optimizing Nested Subqueries
- SQL conceptually treats nested subqueries in the where clause as functions that take parameters and return a single value or a set of values.
  - Parameters are variables from the outer-level query that are used in the nested subquery; such variables are called correlation variables.
- E.g.
    select customer-name
    from borrower
    where exists (select *
                  from depositor
                  where depositor.customer-name = borrower.customer-name)
- Conceptually, the nested subquery is executed once for each tuple in the cross-product generated by the outer-level from clause.
  - Such evaluation is called correlated evaluation.
  - Note: other conditions in the where clause may be used to compute a join (instead of a cross-product) before executing the nested subquery.

Optimizing Nested Subqueries (Cont.)
- Correlated evaluation may be quite inefficient since
  - a large number of calls may be made to the nested query
  - there may be unnecessary random I/O as a result
- SQL optimizers attempt to transform nested subqueries into joins where possible, enabling the use of efficient join techniques.
- E.g. the earlier nested query can be rewritten as
    select customer-name
    from borrower, depositor
    where depositor.customer-name = borrower.customer-name
  - Note: this query does not deal correctly with duplicates; it can be modified to do so, as we will see.
- In general, it is not possible or straightforward to move the entire from clause of the nested subquery into the from clause of the outer-level query.
  - A temporary relation is created instead, and used in the body of the outer-level query.

Optimizing Nested Subqueries (Cont.)
In general, SQL queries of the form below can be rewritten as shown.
- Rewrite:
    select …
    from L1
    where P1 and exists (select * from L2 where P2)
- To:
    create table t1 as
      select distinct V
      from L2
      where P2^1
    select …
    from L1, t1
    where P1 and P2^2
  - P2^1 contains the predicates in P2 that do not involve any correlation variables.
  - P2^2 reintroduces the predicates involving correlation variables, with relations renamed appropriately.
  - V contains all attributes used in predicates with correlation variables.

Optimizing Nested Subqueries (Cont.)
- In our example, the original nested query would be transformed to
    create table t1 as
      select distinct customer-name
      from depositor
    select customer-name
    from borrower, t1
    where t1.customer-name = borrower.customer-name
- The process of replacing a nested query by a query with a join (possibly with a temporary relation) is called decorrelation.
- Decorrelation is more complicated when
  - the nested subquery uses aggregation, or
  - the result of the nested subquery is used to test for equality, or
  - the condition linking the nested subquery to the outer query is not exists,
  - and so on.
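The effect of decorrelation, including the duplicate problem of the naive join, can be observed with SQLite from the Python standard library. This is a sketch under assumptions: the sample rows are invented, and the column is named `customer_name` because hyphens are not valid in unquoted SQL identifiers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE borrower(customer_name TEXT);
CREATE TABLE depositor(customer_name TEXT);
INSERT INTO borrower VALUES ('Ann'), ('Bob'), ('Cid');
-- Ann has two deposit accounts (made-up data).
INSERT INTO depositor VALUES ('Ann'), ('Ann'), ('Cid');
""")

# Original correlated form.
correlated = conn.execute("""
    SELECT customer_name FROM borrower
    WHERE EXISTS (SELECT * FROM depositor
                  WHERE depositor.customer_name = borrower.customer_name)
    ORDER BY customer_name
""").fetchall()

# Naive join rewrite: produces duplicates.
naive_join = conn.execute("""
    SELECT borrower.customer_name FROM borrower, depositor
    WHERE depositor.customer_name = borrower.customer_name
    ORDER BY borrower.customer_name
""").fetchall()

# Decorrelated form with a duplicate-free temporary relation t1.
conn.execute(
    "CREATE TEMP TABLE t1 AS SELECT DISTINCT customer_name FROM depositor"
)
decorrelated = conn.execute("""
    SELECT borrower.customer_name FROM borrower, t1
    WHERE t1.customer_name = borrower.customer_name
    ORDER BY borrower.customer_name
""").fetchall()

print(correlated, naive_join, decorrelated)
```

On this data the naive join returns Ann twice, while the version with the temporary relation returns the same rows as the correlated query.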

Practicum Assignment 2A
- Get the XML metadata description for TPC-H
  - XSLT script for plotting histograms
- Take our solution for your second query (assignment 1)
- For each operator in the tree give:
  - Selectivity
  - Intermediate result size
  - Short description of how you computed this
  - Explanation of how to compute histograms on all result columns
- Sum all intermediate result sizes into the total query cost
- DEADLINE: March 31

The Big Picture
1. Parsing and translation
2. Optimization
3. Evaluation

Optimization
- Query optimization: amongst all equivalent evaluation plans, choose the one with the lowest cost.
  - Cost is estimated using statistical information from the database catalog, e.g. number of tuples in each relation, size of tuples, etc.
- In this lecture we study logical cost estimation:
  - introduction to histograms
  - estimating the number of tuples in the result with perfect and equi-height histograms
  - propagation of histograms into result columns
  - how to compute result size from width and #tuples

Cost Estimation
- Physical cost estimation
  - predict I/O blocks, seeks, cache misses, RAM consumption, …
  - depends on the execution algorithm
- In this lecture we study logical cost estimation
  - "the plan with the smallest intermediate result tends to be best"
  - needs estimates of intermediate result sizes
- Histogram-based estimation (practicum, assignment 2)
  - estimating the number of tuples in the result with perfect and equi-height histograms
  - propagation of histograms into result columns
  - compute result size as tuple-width * #tuples

Selectivities
- select: expr := X(col, const), X in { =, > } | expr && expr | expr || expr
  - s_op(R) = |R'| / |R|
- join: only 1-n / n-1 foreign key joins
  - s_join(R1, R2) = |R'| / (|R1| * |R2|)
- aggr
  - s(A(R; g1; a)) = |A(R; g1; a)| / |R| = distinct(R.g1) / |R|
  - s(A(R; g1, g2; a)) = (distinct(R.g1) * distinct(R.g2)) / |R|
- project, order
  - s_project(R) = s_order(R) = 1
- topn
  - s_topN(R) = min(N, |R|) / |R| = min(N / |R|, 1)

Result Size
- Result size = #tuples_max * selectivity * #columns
- We disregard differences in column width
- project:
  - #columns = |projectlist|
  - #tuples_max = |R|
- aggr:
  - #columns = |groupbys| + |aggrs|
  - #tuples_max = min(|R|, |g1| * .. * |gn|)
- join:
  - #columns = |child1| + |child2|
  - #tuples_max = |R1| * |R2|
- other:
  - #columns stays equal to that of the child
  - #tuples_max = |R|
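As a worked illustration of these formulas, the sketch below computes the result size of a foreign-key join and the tuple bound of an aggregation. The function names and all numbers (1,500 orders, 6,000 line items, 3 columns each) are assumptions made up for this example.

```python
from math import prod


def result_size(tuples_max: float, selectivity: float, columns: int) -> float:
    """Result size = #tuples_max * selectivity * #columns."""
    return tuples_max * selectivity * columns


def aggr_tuples_max(r_card: int, group_distincts) -> int:
    """#tuples_max for aggr: min(|R|, |g1| * .. * |gn|)."""
    return min(r_card, prod(group_distincts))


# Foreign-key join of orders (1500 x 3 columns) with lineitem (6000 x 3).
orders, lineitem = 1500, 6000
joined = 6000  # each line item matches exactly one order
s_join = joined / (orders * lineitem)
print(result_size(orders * lineitem, s_join, 3 + 3))  # about 36000 cells

# Aggregation over 6000 tuples with group-by columns of 3 and 7 values.
print(aggr_tuples_max(6000, [3, 7]))      # 21
print(aggr_tuples_max(6000, [100, 100]))  # 6000, capped by |R|
```

The join case shows that #tuples_max * selectivity is simply the expected tuple count, which is then scaled by the number of columns.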

Selectivity estimation
We can estimate the selectivities using:
- domain constraints
- min/max statistics
- histograms

Histograms
- Buckets: each bucket B_i carries the fields used in the selectivity formulas of this lecture (B_i.max, B_i.total, B_i.distinct)
- Leave min out (B_i.min = B_(i-1).max)

Different Kinds of Histograms
- Perfect
- Equi-width
- Equi-height
In the practicum we use
- perfect histograms, when distinct(R.a) < 25
- equi-height histograms of 10 buckets, otherwise
  - Not perfectly even-height: value ranges are disjoint between buckets
  - (i.e. a frequent value is not split over even-height buckets; it may create a bigger-than-height bucket)

Perfect Histograms: Equi-Selection
- s(R.a=C) = B_k.total * (1/|R|)
  - in case there is a k with B_k.max = C
- s(R.a=C) = 0
  - otherwise
[Figure: perfect histogram with buckets for the values a, c, d, f and their totals, highlighting s(R.a=d).]

Perfect Histograms: Range-Selection
- s(R.a<C) = sum(B_i.total) * (1/|R|),
  - for all 1 <= i < k with B_(k-1).max < C <= B_k.max
[Figure: perfect histogram with buckets for the values a, c, d, f and their totals, highlighting s(R.a<d).]
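Both perfect-histogram formulas can be tried on a toy histogram. In this sketch a perfect histogram is represented as a dict from value to frequency, so each bucket holds exactly one value; the frequencies are invented, since the numbers of the original figure are not available.

```python
# Perfect histogram: one bucket per distinct value (made-up frequencies).
HIST = {"a": 2, "c": 5, "d": 1, "f": 2}


def eq_selectivity(hist: dict, c) -> float:
    """s(R.a = C): frequency of the bucket for C, or 0 if there is none."""
    n = sum(hist.values())  # |R|
    return hist.get(c, 0) / n


def lt_selectivity(hist: dict, c) -> float:
    """s(R.a < C): total frequency of all buckets with a value below C."""
    n = sum(hist.values())
    return sum(total for value, total in hist.items() if value < c) / n


print(eq_selectivity(HIST, "d"))  # 1 of 10 tuples
print(eq_selectivity(HIST, "b"))  # no bucket for b, so 0
print(lt_selectivity(HIST, "d"))  # buckets a and c: 7 of 10 tuples
```

Because every value has its own bucket, both estimates are exact for the data the histogram was built from.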

Equi-Height Histograms: Equi-Selection
- s(R.a=C) = avg_freq(B_k) * (1/|R|)
  - in case there is a k with B_(k-1).max < C <= B_k.max
  - avg_freq(B_k) = B_k.total / B_k.distinct
- s(R.a=C) = 0
  - otherwise
[Figure: equi-height histogram with bucket boundaries a, d, e, f and bucket totals, highlighting s(R.a=c).]

Equi-Height Histograms: Range-Selection
- s(R.a<C) = ( sum(B_i.total) + freq_lt(B_k, C) ) * (1/|R|),
  - for all 1 <= i < k with B_(k-1).max < C <= B_k.max
[Figure: equi-height histogram with bucket boundaries a, d, e, f and bucket totals, highlighting s(R.a<c).]
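The equi-height formulas can be sketched for numeric columns as follows. Several things here are assumptions rather than part of the slides: the `Bucket` class, the explicit exclusive lower bound `lower` of the first bucket, the sample buckets, and above all the definition of freq_lt as linear interpolation within the bucket (the slides do not define it).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Bucket:
    max: float      # inclusive upper bound; min is the previous bucket's max
    total: int      # number of tuples in the bucket
    distinct: int   # number of distinct values in the bucket


def locate(buckets: List[Bucket], lower: float, c: float) -> Optional[int]:
    """Index k with B[k-1].max < c <= B[k].max, or None if c is outside."""
    prev = lower
    for k, b in enumerate(buckets):
        if prev < c <= b.max:
            return k
        prev = b.max
    return None


def eq_selectivity(buckets, lower, c) -> float:
    n = sum(b.total for b in buckets)  # |R|
    k = locate(buckets, lower, c)
    if k is None:
        return 0.0
    return (buckets[k].total / buckets[k].distinct) / n  # avg_freq / |R|


def lt_selectivity(buckets, lower, c) -> float:
    n = sum(b.total for b in buckets)
    if c <= lower:
        return 0.0
    k = locate(buckets, lower, c)
    if k is None:
        return 1.0  # c lies above the last bucket
    b_min = lower if k == 0 else buckets[k - 1].max
    before = sum(b.total for b in buckets[:k])
    # Assumed freq_lt: values spread uniformly over the bucket's range.
    freq_lt = buckets[k].total * (c - b_min) / (buckets[k].max - b_min)
    return (before + freq_lt) / n


B = [Bucket(10, 100, 5), Bucket(20, 100, 10), Bucket(30, 100, 2)]
print(eq_selectivity(B, 0, 15))  # avg_freq 10 out of 300 tuples
print(lt_selectivity(B, 0, 15))  # first bucket plus half of the second
```

Other definitions of freq_lt are possible, for example one based on the number of distinct values in the bucket.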

Select with And and Or
- Assume no correlation between attributes:
  - s(θ_a and θ_c) = s(θ_a) * s(θ_c)
  - s(θ_a or θ_c) = s(θ_a) + (1 - s(θ_a)) * s(θ_c)
- Note: θ_a and θ_c must be normalized into non-overlapping conditions
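The formula for "or" follows from inclusion-exclusion together with the independence assumption. The derivation below is a short sketch; the example values 0.5 and 0.2 are made up.

```latex
\begin{align*}
s(\theta_a \lor \theta_c)
  &= s(\theta_a) + s(\theta_c) - s(\theta_a \land \theta_c) \\
  &= s(\theta_a) + s(\theta_c) - s(\theta_a)\, s(\theta_c) \\
  &= s(\theta_a) + (1 - s(\theta_a))\, s(\theta_c)
\end{align*}
% Example: s(\theta_a) = 0.5 and s(\theta_c) = 0.2 give
% s(\theta_a \land \theta_c) = 0.5 \cdot 0.2 = 0.1 and
% s(\theta_a \lor \theta_c) = 0.5 + 0.5 \cdot 0.2 = 0.6.
```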

Foreign-key Join Selectivity/Hitrate Estimation
- Foreign-key constraint: R1 matches at most once with R2
  - "each order matches on average with 7 lineitems" → hitrate = 7
- But what if R'2 (e.g. order) is an intermediate result?
  - R'2 may have multiple key occurrences due to a previous join
  - R'2 may have fewer key occurrences (missing keys) due to a select (or join)
- Simple approach (practicum): hitrate *= |R'2| / |R2|

Aggr(R,[g1..gn],[..])
- Can only predict group-by columns and size:
  - Expected result size = min(|R|, distinct(g1) * … * distinct(gn))

Histogram Propagation
- order: histogram stays identical
- project: histogram stays identical
  - Expressions (e.g. l_tax*l_price) are not required for the practicum
  - It is possible to take the Cartesian product of the histograms, followed by expression evaluation and re-bucketizing
- topn: not required for the practicum
  - Use the last bucket (and work backwards) to take the highest N distinct values and their frequencies
- aggr: not required for the practicum
  - Group-bys: distinct is the product of the distincts, freq = 1
  - Aggregates: only possible for global aggregates (no group-bys)
- fk-join: multiply totals by the join hitrate
  - distinct = min(distinct, total) (this is a simplification!)
- select: multiply totals by the selectivity
  - distinct = min(distinct, total)
- select (selection attribute):
  - Get totals/distincts from the subset of buckets

Practicum Assignment 2
- Get the XML metadata description for TPC-H
  - ps/pdf histograms also available
- Take our solution for your second query (assignment 1)
- For each operator in the tree give:
  - Selectivity
  - Intermediate result size
  - Short description of how you computed this
  - Explanation of how to compute histograms on all result columns
- Sum all intermediate result sizes into the total query cost
- DEADLINE: March 31