1 Relational Query Optimization Chapter 15. 2 Query Blocks: Units of Optimization  An SQL query is parsed into a collection of query blocks :  An SQL.

1 Relational Query Optimization Chapter 15

2 Query Blocks: Units of Optimization  An SQL query is parsed into a collection of query blocks :  An SQL query with no nesting and exactly one SELECT, FROM, WHERE, GROUP BY, and HAVING clause.  WHERE is in conjunctive normal form.

3 Query Blocks: Units of Optimization  Nested blocks are usually treated as calls to a subroutine, made once per outer tuple.  Optimization is one block at a time. SELECT S.sname FROM Sailors S WHERE S.age IN ( SELECT MAX (S2.age) FROM Sailors S2 GROUP BY S2.rating ) Nested blockOuter block SELECT S.sname FROM Sailors S WHERE S.age IN Reference to nested block SELECT MAX (S2.age) FROM Sailors S2 GROUP BY S2.rating

4 Query Block  For each block, the plans considered are: – All available access methods – for each relation in FROM clause. – All left-deep join trees – all ways to join the relations one-at-a-time, with inner relation in FROM clause, considering all relation permutations and join methods.

5 Cost Estimation  For each plan considered, must estimate cost:  Must estimate cost of each operation in plan tree. Depends on input cardinalities. Depends on algorithm (sequential scan, index scan).  Must also estimate size of result for each operation. Use information about the input relations. Must make assumptions about effect of predicates.  Cost of plan = sum of cost of each operator in tree.

6 Size Estimation and Reduction Factors  Goal : Estimate result size ! SELECT attribute list FROM relation list WHERE term1 AND... AND termk

7 Size Estimation and Reduction Factors  Consider a query block:  Given maximum # tuples in result :  product of cardinalities of relations in FROM clause.  Reduction factor (RF) associated with each term :  reflects impact of term in reducing result size.  Result size =  product of cardinalities of involved relations (FROM) * product of reduction factors (WHERE). SELECT attribute list FROM relation list WHERE term1 AND... AND termk

8 Assumptions  Uniform distribution of values in domain  Independent distribution of values in different columns.  For selections and joins, assume independence of predicates.

9 Reduction Factors  Reduction factor (RF) associated with each term :  Term col=value has  RF 1/NKeys(I), given index I on col  Term col1=col2 has  RF 1/ MAX (NKeys(I1), NKeys(I2))  Term col>value has  RF (High(I)-value)/(High(I)-Low(I))

10 Reduction Factors  Column = value. - Given the index I on column, assume uniform distribution. 1/Nkeys(I). Otherwise, fixed reduction factor 1/10  Column1 = column2 - Given indexes I1 and I2 on column1 and column2, assuming each key value in I1 (smaller one) has a matching value in I2. 1/MAX(Nkeys(I1), Nkeys(I2)). One index I, 1/Nkeys(I) otherwise, 1/10  Column > values - Given an index I on column, arithmetic type. High(I) – value / High(I) – Low(I). Not arithmetic type, or no index. A fraction Less than half is arbitrarily chosen.  Column IN (list of values) - reduction factor for (column = value) * number of items in list.

11 More on Estimation  Uniform distribution is not accurate since real data is not uniformly distributed.  Histogram: a data structure maintained by a DBMS to approximate a data distribution.

12 Estimation  Equi-width: divide range of column values into subranges (buckets). Assuming the distribution within the histogram bucket is uniform.  Equi-depth: number of tuples within each bucket is equal (almost).  Compressed equi-depth: maintain separate counts for a small number of very frequent values, and maintain equi-depth histogram to cover the remaining values. 2 3 3 1 2 1 3 4 8 2 0 1 2 4 9 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 2.67 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 Equiwidth 1.33 5.0 1.0 5.0 Bucket 12345 Count8415315 2.25 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 Equidepth 2.5 5.0 1.75 9.0 Bucket 12345 Count9101079

13 Relational Algebra Equivalences  Allow us to choose different join orders  Allow us to `push’ selections and projections ahead of joins.

14 Relational Algebra Equivalences  Selections : ( Cascade ) ( Commute ) Combine several selections into one selection. Evaluate conjunctive condition to each of the tuple. Separate conjunctions into smaller selection operations, Able to combine as Join.

15 Relational Algebra Equivalences  Projections: (Cascade) Where a i  a i+1 for i =1…n-1.

16 Relational Algebra Equivalences  Joins:R (S T) (R S) T (Associative) (R S) (S R) (Commute) When join several relations, we are free to join the relations in any order we choose.

17 Selects, Projects and Joins  Case1: A projection commutes with a selection that only uses attributes retained by the projection.  a (  c (R))   c (  a (R))  Case2: Combine a selection with a cross-product to form a join.R c S   c (R  S)  Case3: A selection on just attributes of R commutes with Join. i.e., (R S) (R) S

18 Selects, Projects and Joins  Selection and Joins :  c1  c2  c3 (R S)   c1 (  c2 (R)  c3 (S))  Project and Joins  a (R  S)   a1 (R)   a2 (S)  a (R c S)   a1 (R) c  a2 (S) (Where c appear in a)

19 Enumeration of Alternative Plans  There are two main cases:  Single-relation plans  Multiple-relation plans

20 Single Relation Plans  Queries over a single relation: combination of selects, projects, and aggregate operations:  Main decision : Which access path for retrieving tuples. Most selective access path (file scan / index) if only single operator considered.  The different operations essentially carried out together. e.g., if an index is used for a selection, projection is done for each retrieved tuple, and the resulting tuples are pipelined into the aggregate computation.

21  s.rating, COUNT(*) ( HAVING COUNT DISTINCT(S.sname) > 2 ( GROUP Bys.rating(  s.rating, S.sname (  s.rating>5  S.age=20 ( Sailors))))) Single Relation Plans without Index SELECT S.rating, COUNT(*) FROM Sailors S WHERE S.rating > 5 AND S.age = 20 GROUP BY S.rating HAVING COUNT DISTINCT (S.sname) > 2 File scan to retrieve tuples and apply the selections and projections. Writing out tuples after the selections and projections. Sorting these tuples to implement the GROUP BY clause. * Then GROUP BY and HAVING are done on-the-fly. e.g., Cost = Cost1(scan) + cost2 (writing pairs) + cost3 (sorting as per the GROUP BY clause). cost1 = 500 pages cost2 = 500* ratio of tuple size * RFs = 20 pages ratio of tuple size = pair size / tuple size = 0.8 RF( s.rating>5) = 0.5 RF(S.age=20) = 0.1 cost3 = 3*Npages = 60 (assuming two passes) Sailors = 500 pages

22 Single-Relation Plans with Index  Single-index access path When several indexes match the selection conditions, choose the most selective access path.  Multiple-index access path Several indexes using Alternatives (2) or (3) for data entries match the selection condition. Retrieve rids using them individually. Intersect result set. Sort by page id.  Sorted-index access path Grouping attributes is a prefix of a tree index. Using index to retrieve tuples in order required by GROUP BY  Index-only access path All attributes mentioned in the query are included in the search key for some dense index on the relation in the FROM clause. Index-only scan to compute the answer.

23 Single-Relation Plans with Index  Index I on primary key matches selection:  Cost is Height(I)+1 for a B+ tree, about 1.2+1 for hash index.  Clustered index I matching one or more selects:  (NPages(I)+NPages(R)) * product of RF’s of matching selects.  Non-clustered index I matching one or more selects:  (NPages(I)+NTuples(R)) * product of RF’s of matching selects.  Sequential scan of file:  NPages(R).

24 Single-Relation Example  Sailors: 500 pages, 80 tuples/page  Case2: index on rating.  Clustered index: (1/NKeys(I)) * (NPages(I)+NPages(S)) = (1/10) * (50+500) pages Unclustered index: (1/NKeys(I)) * (NPages(I)+NTuples(S)) = (1/10) * (50+40000) pages  Case3: index on sid.  Would have to retrieve all tuples/pages. With a clustered index, the cost is 50+500, with unclustered index, 50+40000.  Case4: file scan:  We retrieve all file pages (500). SELECT S.sid FROM Sailors S WHERE S.rating=8

25 Queries Over Multiple Relations  As the number of joins increases, the number of alternative plans grows rapidly  We need to restrict search space !

26 Queries Over Multiple Relations  Fundamental decision in System R (IBM):  Only left-deep join trees are considered.  Left-deep trees can generate all fully pipelined plans. Intermediate results not written to temporary files. Not all left-deep trees are fully pipelined (e.g., SM join). B A C D B A C D C D B A

27  Left-deep plans differ in :  the order of relations,  the access method for each relation, and  the join method for each join. Enumeration of Left-Deep Plans

28  Enumerated using N passes (if N relations joined):  Pass 1: Find best 1-relation plan for each relation.  Pass 2: Find best way to join result of each 1-relation plan (as outer) to another relation. (All 2-relation plans.)  Pass N: Find best way to join result of a (N-1)-relation plan (as outer) to the N’th relation. (All N-relation plans.)  For each subset of relations, retain :  Cheapest plan overall, plus  Cheapest plan for each interesting order of the tuples. Enumeration of Left-Deep Plans BACD Pass 1 Pass 2 Pass 3

29 Enumeration of Left-Deep Plans : Pass 1  Identify selection terms in WHERE clause that mention only attributes of A. (perform access of A, before Join)  Identify attributes of A not mentioned in SELECT or WHERE (project out when first access of A, before Join)  Keep cheapest overall plan for fetching all tuples : a file scan  Keep cheapest plan with tuples in search key order : B+ tree index

30 Left-Deep Plans: All 2-relation Plans  Each of the single-relation plan from Pass 1 as the outer relation, and every other relation as the inner relation.  Examine WHERE clauses: Selections involving only attributes of inner relation (apply before Join). Selections defining the Join. Selections involving attributes of other relations (apply after Join).  Only selections that are really applied before the join are those that match the chosen access paths for A and B.  Depending on the Join algorithm chosen, the cost may include materializing the outer relation.

31 Enumeration of Plans (Contd.)  ORDER BY, GROUP BY, aggregates etc. handled as a final step, using either an `interestingly ordered’ plan or an additional sorting operator.  An N-1 way plan is not combined with another relation unless join condition between them :  avoid Cartesian products if possible.  In spite of pruning plan space, this approach is still exponential in the # of tables.

32 Example: Pass 1  Sailors :  Choice1: B+ tree matches rating>5, probably cheapest.  If selection is expected to retrieve a lot of tuples, and index is unclustered, file scan may be cheaper.  Decision: B+ tree plan kept (because tuples are in rating order).  Reserves :  B+ tree on bid matches bid=100 ; cheapest. Sailors: B+ tree on rating Hash on sid Reserves: B+ tree on bid Reserves Sailors sid=sid bid=100 rating > 5 sname

33 Example Pass 2  Pass1:  Sailors : Choice1: B+ tree matches rating>5,  Reserves : B+ tree on bid matches bid=100. Sailors: B+ tree on rating Hash on sid Reserves: B+ tree on bid  Pass 2: – each plan retained from Pass 1 as the outer. – how to join it with the (only) other relation. – e.g., Reserves as outer : Hash index can be used to get Sailors tuples that satisfy sid = outer tuple’s sid value. – e.g., Sailors as outer : Could possibly use sort-merge join, etc. Reserves Sailors sid=sid bid=100 rating > 5 sname

34 Example  Pass1:  Sailors : Choice1: B+ tree matches rating>5, probably cheapest. if selection is expected to retrieve a lot of tuples, and index is unclustered, file scan may be cheaper. Decision: B+ tree plan kept (because tuples are in rating order).  Reserves : B+ tree on bid matches bid=100 ; cheapest. Sailors: B+ tree on rating Hash on sid Reserves: B+ tree on bid  Pass 2: – each plan retained from Pass 1 as the outer. – how to join it with the (only) other relation. e.g., Reserves as outer : Hash index can be used to get Sailors tuples that satisfy sid = outer tuple’s sid value. Reserves Sailors sid=sid bid=100 rating > 5 sname

35 Nested Queries  Nested block is optimized independently, with the outer tuple considered as providing a selection condition.  Outer block is optimized with the cost of `calling’ nested block computation taken into account.  Implicit ordering of these blocks means that some good strategies are not considered. The non- nested version of the query is typically optimized better. SELECT S.sname FROM Sailors S WHERE EXISTS ( SELECT * FROM Reserves R WHERE R.bid=103 AND R.sid=S.sid) Nested block to optimize: SELECT * FROM Reserves R WHERE R.bid=103 AND S.sid= outer value Equivalent non-nested query: SELECT S.sname FROM Sailors S, Reserves R WHERE S.sid=R.sid AND R.bid=103

36 Highlights of System R Optimizer  Impact:  Most widely used currently; works well for < 10 joins.  Cost estimation:  Statistics, maintained in system catalogs, used to estimate cost of operations and result sizes.  Considers combination of CPU and I/O costs.  Plan Space:  Only the space of left-deep plans is considered. Left-deep plans allow output of each operator to be pipelined into the next operator without storing it in a temporary relation.  Focus on optimizing SQLs without nesting. No duplication in projections. (unless DISTINCT used)  Cartesian products avoided.

37 Highlights of System R Optimizer  Impact of R Optimizer:  Most widely used currently; works well for < 10 joins.  Cost estimation: Approximate art at best.  Statistics, maintained in system catalogs, used to estimate cost of operations and result sizes.  Considers combination of CPU and I/O costs.  Plan Space: Too large, must be pruned.  Only the space of left-deep plans is considered. Left-deep plans allow output of each operator to be pipelined into next operator without storing it in temporary relation.  Cartesian products avoided.

38 Other Approaches to Query Optimization  Exhaustive search is not suitable for large number of Joins.  Rule-based optimizers: a set of rules to guide the generation of candidate plans.  Randomized plan generation: probabilistic algorithms to explore a large space of plans.  Estimating the size of intermediate relations accurately. Parametric query optimization: find good plans for a given query for each of several different conditions that might be encountered at run-time. Multiple-query optimization: take concurrent execution of several queries into account.

39 Summary  There are several alternative evaluation algorithms for each relational operator.  A query is evaluated by converting it to a tree of operators and evaluating the operators in the tree.  Must understand query optimization in order to fully understand the performance impact of a given database design (relations, indexes) on a workload (set of queries).  Two parts to optimizing a query:  Consider a set of alternative plans. Must prune large search space.  Must estimate cost of each considered plan Must estimate size of result and cost for each plan node. Key issues : Statistics, indexes, operator implementations.

40 Summary  Two parts to optimizing a query:  Consider a set of alternative plans. Must prune search space; Typically, left-deep plans only.  Must estimate cost of each plan that is considered. Estimate size of result Estimate cost for each plan node. Key issues : Statistics, indexes, operator implementations.

41 Summary (Contd.)  Single-relation queries:  All access paths considered, cheapest is chosen.  Issues : Selections that match index, whether index key has all needed fields and/or provides tuples in a desired order.  Multiple-relation queries:  All single-relation plans are first enumerated. Selections/projections considered as early as possible.  Next, for each 1-relation plan, all ways of joining another relation (as inner) are considered.  Next, for each 2-relation plan that is `retained’, all ways of joining another relation (as inner) are considered, etc.  At each level, for each subset of relations, only best plan for each interesting order of tuples is `retained’.

1 Relational Query Optimization Chapter 15. 2 Query Blocks: Units of Optimization  An SQL query is parsed into a collection of query blocks :  An SQL.

Similar presentations

Presentation on theme: "1 Relational Query Optimization Chapter 15. 2 Query Blocks: Units of Optimization  An SQL query is parsed into a collection of query blocks :  An SQL."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

1 Relational Query Optimization Chapter 15. 2 Query Blocks: Units of Optimization  An SQL query is parsed into a collection of query blocks :  An SQL.

Similar presentations

Presentation on theme: "1 Relational Query Optimization Chapter 15. 2 Query Blocks: Units of Optimization  An SQL query is parsed into a collection of query blocks :  An SQL."— Presentation transcript:

Similar presentations

About project

Feedback