1 Query Compilation
Evaluating Logical Query Plan
Physical Query Plan
Source: our textbook, slides by Hector Garcia-Molina.

2 Outline
- Convert SQL query to a parse tree
  - Semantic checking: attributes, relation names, types
- Convert to a logical query plan (relational algebra expression)
  - Deal with subqueries
- Improve the logical query plan
  - Use algebraic transformations
  - Group together certain operators
  - Evaluate logical plan based on estimated size of relations
- Convert to a physical query plan
  - Search the space of physical plans
  - Choose order of operations
  - Complete the physical query plan

3 Estimating Sizes of Relations
- Used in two places:
  - to help decide between competing logical query plans
  - to help decide between competing physical query plans
- Notation review:
  - T(R): number of tuples in relation R
  - B(R): minimum number of blocks needed to store R
  - V(R,a): number of distinct values in R of attribute a

4 Desiderata for Estimation Rules
1. Give accurate estimates
2. Are easy (fast) to compute
3. Are logically consistent: estimated size should not depend on how the relation is computed
Here we describe some simple heuristics. All we really need is a scheme that properly ranks competing plans.

5 Estimating Size of Projection
- This can be computed exactly
- Every tuple changes size by a known amount, so T(R) is unchanged and only the number of blocks changes.

6 Estimating Size of Selection
- Suppose the selection condition is A = c, where A is an attribute and c is a constant.
- A reasonable estimate of the number of tuples in the result is:
  - T(R)/V(R,A), i.e., the original number of tuples divided by the number of different values of A
- Good approximation if the values of A are evenly distributed
- Also a good approximation in some other, common, situations (see textbook)

7 Estimating Size of Selection (cont'd)
- If the condition is A < c:
  - a good estimate is T(R)/3; the intuition is that queries usually ask about something that is true of less than half the tuples
- If the condition is A ≠ c:
  - a good estimate is T(R)
- If the condition is the AND of several equalities and inequalities, apply the estimates in series, one factor per condition.

8 Example
- Consider relation R(a,b,c) with 10,000 tuples and 50 different values for attribute a.
- Consider selecting all tuples from R with a = 10 and b < 20.
- Estimate of the number of resulting tuples is 10,000*(1/50)*(1/3) ≈ 67.
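
The series rule for AND conditions can be sketched in a few lines of Python. This is a minimal illustration, not part of the slides: the function name `estimate_selection` and the tuple encoding of conditions are hypothetical, and the factors used are exactly the heuristics stated above (1/V(R,A) for equality, 1/3 for an inequality, 1 for ≠).

```python
def estimate_selection(t_r, conditions):
    """Estimate the result size of a selection whose condition is an AND of terms.

    Each term is ("eq", v) for A = c with V(R,A) = v, ("lt",) for A < c,
    or ("ne",) for A != c. The encoding is an assumption made for this sketch.
    """
    size = float(t_r)
    for cond in conditions:
        kind = cond[0]
        if kind == "eq":
            size /= cond[1]   # T(R) / V(R,A)
        elif kind == "lt":
            size /= 3         # T(R) / 3
        elif kind == "ne":
            pass              # estimate stays at T(R)
        else:
            raise ValueError(f"unknown condition kind: {kind}")
    return size


# Slide example: T(R) = 10,000, V(R,a) = 50, condition a = 10 AND b < 20
estimate = estimate_selection(10_000, [("eq", 50), ("lt",)])
print(round(estimate))  # 67
```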

9 Estimating Size of Selection (cont'd)
If the condition has the form C1 OR C2, use:
1. the sum of the estimate for C1 and the estimate for C2, or
2. the minimum of T(R) and the previous, or
3. assuming C1 and C2 are independent, T(R)*(1 - (1 - f1)*(1 - f2)), where f1 is the fraction of R satisfying C1 and f2 is the fraction of R satisfying C2

10 Example
- Consider relation R(a,b) with 10,000 tuples and 50 different values for a.
- Consider selecting all tuples from R with a = 10 or b < 20.
- Estimate for a = 10 is 10,000/50 = 200
- Estimate for b < 20 is 10,000/3 ≈ 3333
- Estimate for the combined condition is
  - 200 + 3333 = 3533, or
  - 10,000*(1 - (1 - 1/50)*(1 - 1/3)) ≈ 3467
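
As a rough check of the OR rules, the following sketch computes both the capped sum and the independence-based estimate for this example. The function name `estimate_or` is hypothetical; the two formulas are the ones given on the previous slide.

```python
def estimate_or(t_r, est1, est2):
    """Return (capped sum, independence estimate) for a condition C1 OR C2.

    est1 and est2 are the estimated result sizes for C1 and C2 alone.
    """
    capped_sum = min(t_r, est1 + est2)
    f1, f2 = est1 / t_r, est2 / t_r          # fractions of R satisfying C1, C2
    independent = t_r * (1 - (1 - f1) * (1 - f2))
    return capped_sum, independent


# Slide example: a = 10 (estimate 10,000/50) OR b < 20 (estimate 10,000/3)
capped_sum, independent = estimate_or(10_000, 10_000 / 50, 10_000 / 3)
print(f"{capped_sum:.1f} {independent:.1f}")
```

The independence formula always gives the smaller of the two values here, because it subtracts the tuples that would otherwise be counted twice.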

11 Estimating Size of Natural Join
- Assume the join is on a single attribute Y.
- Some possibilities:
  1. R and S have disjoint sets of Y values, so the size of the join is 0
  2. Y is the key of S and a foreign key of R, so the size of the join is T(R)
  3. All the tuples of R and S have the same Y value, so the size of the join is T(R)*T(S)
- We need some assumptions…

12 Common Join Assumptions
- Containment of Value Sets: if R and S both have attribute Y and V(R,Y) ≤ V(S,Y), then every value of Y in R appears as a value of Y in S
  - true if Y is a key of S and a foreign key of R
- Preservation of Value Sets: after the join, a non-matching (non-join) attribute of R has the same number of values as it does in R
  - true if Y is a key of S and a foreign key of R

13 Join Estimation Rule
- Expected number of tuples in the result is
  - T(R)*T(S) / max(V(R,Y),V(S,Y))
- Why? Suppose V(R,Y) ≤ V(S,Y).
  - There are T(R) tuples in R.
  - Each of them has a 1/V(S,Y) chance of joining with a given tuple of S, creating T(S)/V(S,Y) new tuples
  - Summed over all tuples of R, that gives T(R)*T(S)/V(S,Y); the case V(S,Y) ≤ V(R,Y) is symmetric

14 Example
- Suppose we have
  - R(a,b) with T(R) = 1000 and V(R,b) = 20
  - S(b,c) with T(S) = 2000, V(S,b) = 50, and V(S,c) = 100
  - U(c,d) with T(U) = 5000 and V(U,c) = 500
- What is the estimated size of R ⋈ S ⋈ U?
  - First join R and S (on attribute b):
    - estimated size of the result, X, is T(R)*T(S)/max(V(R,b),V(S,b)) = 1000*2000/50 = 40,000
    - by preservation of value sets, the number of values of c in X is the same as in S, namely 100
  - Then join X with U (on attribute c):
    - estimated size of the result is T(X)*T(U)/max(V(X,c),V(U,c)) = 40,000*5000/500 = 400,000

15 Example (cont'd)
- If the joins are done in the opposite order, we still get the same estimated answer
- This is due to the preservation of value sets assumption.
- This is desirable: we don't want the estimate to depend on how the result is computed

16 More About Natural Join
- If there are multiple join attributes, the previous rule generalizes:
  - T(R)*T(S) divided by the larger of V(R,y) and V(S,y) for each join attribute y
- Consider the natural join of a series of relations:
  - the containment and preservation of value sets assumptions ensure that the same estimated size is achieved no matter what order the joins are done in
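
A small sketch can make the order-independence claim concrete. The helper `join_stats` and the dictionary layout are hypothetical. The sketch also assumes that the join result has min(V(R,y), V(S,y)) distinct values for each join attribute y (a consequence of containment of value sets that the slides do not state explicitly), and it tracks only the attributes whose V values the example gives.

```python
def join_stats(r, s):
    """Estimate T and V for the natural join of r and s.

    r and s are dicts {"T": tuple count, "V": {attribute: distinct values}}.
    Join attributes are the attributes present in both V maps.
    """
    shared = r["V"].keys() & s["V"].keys()
    size = r["T"] * s["T"]
    for y in shared:
        size /= max(r["V"][y], s["V"][y])     # divide once per join attribute
    v = {**r["V"], **s["V"]}                  # preservation of value sets
    for y in shared:
        v[y] = min(r["V"][y], s["V"][y])      # assumed: smaller value set survives
    return {"T": size, "V": v}


# Statistics from the R, S, U example
R = {"T": 1000, "V": {"b": 20}}
S = {"T": 2000, "V": {"b": 50, "c": 100}}
U = {"T": 5000, "V": {"c": 500}}

left_first = join_stats(join_stats(R, S), U)   # (R join S) join U
right_first = join_stats(R, join_stats(S, U))  # R join (S join U)
print(left_first["T"], right_first["T"])
```

Both orders give the same estimate of 400,000 tuples, matching the worked example.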

17 Summary of Estimation Rules
- Projection: exactly computable
- Product: exactly computable
- Selection: reasonable heuristics
- Join: reasonable heuristics
- The other operators are harder to estimate…

18 Additional Estimation Heuristics
- Union:
  - bag: exactly computable (sum)
  - set: estimate as the larger plus half the smaller
- Intersection: estimate as half the smaller
- Difference: estimate R - S as T(R) - T(S)/2
- Duplicate elimination: T(R)/2 or the product of all the V(R,a)'s, whichever is smaller
- Grouping: T(R)/2 or the product of V(R,a) for all grouping attributes a, whichever is smaller
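
These heuristics translate directly into one-line functions. The sketch below is illustrative only; the function names are hypothetical and each body is simply the rule from the slide.

```python
from math import prod


def bag_union(t_r, t_s):
    return t_r + t_s                              # exact for bags


def set_union(t_r, t_s):
    return max(t_r, t_s) + min(t_r, t_s) / 2      # larger plus half the smaller


def intersection(t_r, t_s):
    return min(t_r, t_s) / 2                      # half the smaller


def difference(t_r, t_s):
    return t_r - t_s / 2                          # T(R) - T(S)/2


def distinct(t_r, v_counts):
    """Duplicate elimination (or grouping on the attributes behind v_counts)."""
    return min(t_r / 2, prod(v_counts))


# Example numbers chosen for illustration: T(R) = 1000, T(S) = 400
print(set_union(1000, 400), intersection(1000, 400), difference(1000, 400))
# A relation with 2000 tuples and V values 200 and 100
print(distinct(2000, [200, 100]))
```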

19 Estimating Size Parameters
- Estimating the size of a relation depends on knowing T(R) and the V(R,a)'s
- Estimating the cost of a physical algorithm also depends on knowing B(R).
- How can the query compiler learn them?
  - Scan the relation to learn T and the V's, and then calculate B
  - Can also keep a histogram of the values of attributes, which makes estimating join results more accurate
  - Statistics are recomputed periodically: after some time or some number of updates, or if the DB administrator thinks the optimizer isn't choosing good plans
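
A single scan is enough to gather these statistics. The following sketch works over an in-memory list of tuples; `gather_stats`, `histogram`, and the `tuples_per_block` parameter are hypothetical names, and computing B as ceil(T / tuples_per_block) assumes fixed-size tuples packed densely into blocks.

```python
from collections import Counter
from math import ceil


def gather_stats(rows, attrs, tuples_per_block):
    """Scan rows once and return (T, V, B) for the relation."""
    t = len(rows)
    v = {a: len({row[i] for row in rows}) for i, a in enumerate(attrs)}
    b = ceil(t / tuples_per_block)    # assumes dense packing of fixed-size tuples
    return t, v, b


def histogram(rows, attrs, attr):
    """Count how often each value of attr occurs."""
    i = attrs.index(attr)
    return Counter(row[i] for row in rows)


rows = [(1, "x"), (1, "y"), (2, "x")]
attrs = ("a", "b")
t, v, b = gather_stats(rows, attrs, tuples_per_block=2)
print(t, v, b)
print(histogram(rows, attrs, "a"))
```

A histogram keeps per-value counts rather than the single number V(R,a), which is why it can improve join estimates when values are not evenly distributed.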

20 Heuristics to Reduce Cost of LQP
- For each transformation of the tree being considered, estimate the "cost" before and after doing the transformation
- At this point, "cost" only refers to sizes of intermediate relations (we don't yet know about the number of disk I/O's)
- The sum of the sizes of all intermediate relations is the heuristic: if this sum is smaller after the transformation, then incorporate it

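21 R(a,b), S(b,c)
- T(R) = 5000, T(S) = 2000
- V(R,a) = 50, V(R,b) = 100, V(S,b) = 200, V(S,c) = 100
- Initial logical query plan: δ(σ_{a=10}(R ⋈ S)), where δ is duplicate elimination
- Modified logical query plan: move the selection down, giving δ(σ_{a=10}(R) ⋈ S)
- Should δ be moved below the join? Compare δ(σ_{a=10}(R)) ⋈ δ(S) vs. δ(σ_{a=10}(R) ⋈ S)
  - δ below the join: σ_{a=10}(R) has 100 tuples, δ(σ_{a=10}(R)) has 50, δ(S) has 1000, and the join has 250; the intermediate sizes sum to 100 + 50 + 1000 = 1150
  - δ above the join: σ_{a=10}(R) has 100 tuples, the join has 1000, and δ leaves 500; the intermediate sizes sum to 100 + 1000 = 1100
  - 1150 vs. 1100
[Figure: the two plan trees annotated with these estimated sizes]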
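
The 1150 vs. 1100 comparison can be reproduced from the estimation rules. The sketch below assumes the operator in question is duplicate elimination (estimated as the smaller of T/2 and the product of the V's), that the selection a = 10 leaves one value of a, and that "cost" counts intermediate relations only, excluding the stored relations and the final result. Variable names are chosen for this illustration.

```python
t_r, t_s = 5000, 2000
v_r_a, v_r_b, v_s_b, v_s_c = 50, 100, 200, 100

sel_r = t_r / v_r_a                                  # sigma a=10 (R): 100 tuples

# Plan A: duplicate elimination stays above the join
join_a = sel_r * t_s / max(v_r_b, v_s_b)             # 100 * 2000 / 200 = 1000
result_a = join_a / 2                                # final result, not counted
cost_a = sel_r + join_a                              # intermediates only

# Plan B: duplicate elimination pushed below the join on both sides
delta_r = min(sel_r / 2, 1 * v_r_b)                  # 50 (one value of a remains)
delta_s = min(t_s / 2, v_s_b * v_s_c)                # 1000
join_b = delta_r * delta_s / max(v_r_b, v_s_b)       # final result, not counted
cost_b = sel_r + delta_r + delta_s                   # intermediates only

print(cost_a, cost_b)
```

Under these assumptions the plan that keeps duplicate elimination above the join has the slightly smaller sum, so the transformation would not be incorporated.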

23 Deriving a Physical Query Plan
- To convert a logical query plan into a physical query plan, choose:
  - an order and grouping for sets of joins, unions, and intersections
  - an algorithm for each operator (e.g., nested-loop join vs. hash join)
  - additional operators (scanning, sorting, etc.) that are needed for the physical plan but are not explicitly in the logical plan
  - how to pass arguments (store the intermediate result on disk vs. pipeline one tuple or buffer at a time)
- Physical query plans are evaluated by their estimated cost…

24 Cost of Evaluating an Expression
- Measure by number of disk I/O's
- Influenced by:
  - operators in the chosen logical query plan
  - sizes of intermediate results
  - physical operators used to implement the logical operators
  - ordering of groups of similar operators (e.g., joins)
  - argument passing method

25 Enumerating Physical Plans
- The baseline approach is exhaustive search, but it is not practical (too many options)
- Heuristic selection: make a sequence of choices based on heuristics
- Various other approaches, based on ideas from AI and algorithm analysis, search a space of possibilities
- Compare plans by counting the number of disk I/O's

26 Some Heuristics
- To implement selection on R with condition A = c: if R has an index on A, then use index-scan
- To implement join when one argument R has an index on the join attribute(s): use index-join with R in the inner loop
- To implement join when one argument R is sorted on the join attribute(s): choose sort-join over hash-join
- To implement union or intersection of > 2 relations: group the smallest relations first

28 Choosing Order for Joins
- Suppose we have > 2 relations to be joined (naturally)
- Pay attention to asymmetry:
  - one-pass algorithm: left argument is smaller and is stored in a main-memory data structure
  - nested-loop algorithm: left argument is used in the outer loop
  - index-join: right argument has the index
- Common point: these algorithms work better if the left argument is the smaller one

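29 Choosing Join Order (cont'd)
- The template for the tree is given below:
- The choices are which relations go where:
[Figure: two join trees built from the same template, with the leaves labeled R, S, U, W in one and U, S, R, V in the other]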

30 Choosing Join Order (cont'd)
- How do we decide on the leaves?
  - Try all possibilities. Not a good idea: there are n! choices, where n is the number of relations to be joined
  - Use dynamic programming, a technique from analysis of algorithms. Works well for relatively small values of n
  - Heuristic approach with a greedy algorithm: works faster but doesn't always find the best ordering
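
The greedy option can be sketched as follows. The slides do not spell out the greedy criterion; this sketch assumes one common choice: start with the pair whose estimated join is smallest, then repeatedly add the relation that keeps the next intermediate result smallest. The helper `join_stats`, the function `greedy_join_order`, and the dictionary layout are hypothetical, and the join attribute's V in a result is assumed to be the smaller of the two inputs' V values.

```python
from itertools import combinations


def join_stats(r, s):
    """Estimated T and V for the natural join of r and s (size rule from the slides)."""
    shared = r["V"].keys() & s["V"].keys()
    size = r["T"] * s["T"]
    for y in shared:
        size /= max(r["V"][y], s["V"][y])
    v = {**r["V"], **s["V"]}
    for y in shared:
        v[y] = min(r["V"][y], s["V"][y])   # assumed value count for join attributes
    return {"T": size, "V": v}


def greedy_join_order(relations):
    """Return (join order, stats of the final result) chosen greedily."""
    remaining = dict(relations)
    # Start with the pair whose estimated join result is smallest.
    first, second = min(
        combinations(remaining, 2),
        key=lambda p: join_stats(remaining[p[0]], remaining[p[1]])["T"],
    )
    order = [first, second]
    current = join_stats(remaining.pop(first), remaining.pop(second))
    # Repeatedly add the relation giving the smallest next intermediate result.
    while remaining:
        nxt = min(remaining, key=lambda n: join_stats(current, remaining[n])["T"])
        order.append(nxt)
        current = join_stats(current, remaining.pop(nxt))
    return order, current


relations = {
    "R": {"T": 1000, "V": {"b": 20}},
    "S": {"T": 2000, "V": {"b": 50, "c": 100}},
    "U": {"T": 5000, "V": {"c": 500}},
}
order, final = greedy_join_order(relations)
print(order, final["T"])
```

On these statistics the greedy choice joins S and U first (estimated 20,000 tuples) rather than R and S (40,000); the final size is the same either way, but the intermediate result is smaller.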

32 Remaining Steps
- Choose algorithms for the remaining operators
- Decide when intermediate results will be materialized (stored on disk in their entirety) or pipelined (created only in main memory, in pieces)

33 Choosing Selection Method
- Suppose the selection condition is the AND of several equalities and inequalities, each involving an attribute and a constant
  - Ex: a = 10 AND b < 20
- Decide between these algorithms:
  - do a table scan and "filter" each tuple to check for the condition
  - do an index scan on one attribute (which one?) and "filter" each retrieved tuple to check for the remaining parts of the condition
- Compare the number of disk I/O's

34 Disk I/O Costs
- Table scan:
  - B(R) if R is clustered
  - T(R) if not
- Index scan on an attribute that is part of an equality:
  - B(R)/V(R,a) if the index is clustering
  - T(R)/V(R,a) if not
- Index scan on an attribute that is part of an inequality:
  - B(R)/3 if the index is clustering
  - T(R)/3 if not

35 Example
- Assumptions about R(x,y,z):
  - 5000 tuples
  - 200 blocks
  - V(R,x) = 100
  - V(R,y) = 500
  - R is clustered
  - index on x is not clustering
  - index on y is not clustering
  - index on z is clustering
- Select tuples satisfying x=1 AND y=2 AND z<5
- Choices and their costs:
  1. table scan: B(R) = 200
  2. index scan on x: T(R)/V(R,x) = 50
  3. index scan on y: T(R)/V(R,y) = 10
  4. index scan on z: B(R)/3 ≈ 67
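
The comparison in this example is just a minimum over a few formulas. In the sketch below the function names are hypothetical. The clustered and clustering cases follow the disk I/O cost slide; the non-clustering costs are taken to be T(R)/V(R,a) for an equality (as the example itself uses) and, by analogy, T(R)/3 for an inequality and T(R) for scanning a non-clustered relation.

```python
def table_scan_cost(b, t, clustered):
    """Disk I/O's to read the whole relation."""
    return b if clustered else t


def index_scan_cost(b, t, clustering, v=None):
    """Disk I/O's for an index scan.

    v is V(R,a) for an equality condition; v=None means an inequality,
    which is assumed to select about one third of the relation.
    """
    base = b if clustering else t
    return base / (v if v is not None else 3)


B, T = 200, 5000   # blocks and tuples of R(x,y,z)
costs = {
    "table scan": table_scan_cost(B, T, clustered=True),
    "index scan on x": index_scan_cost(B, T, clustering=False, v=100),
    "index scan on y": index_scan_cost(B, T, clustering=False, v=500),
    "index scan on z": index_scan_cost(B, T, clustering=True),
}
best = min(costs, key=costs.get)
print(costs)
print(best)
```

Here the index scan on y wins even though its index is not clustering, because y is by far the most selective attribute.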

36 Choosing Join Method
- If we have good estimates of relation statistics (T(R), B(R), V(R,a)'s) and the number of main memory buffers available, use the formulas from Ch. 15 regarding sort-join, hash-join, and index-join.
- Otherwise, apply these principles:
  - try one-pass join
  - try nested-loop join
  - sort-join is good if one argument is already sorted on the join attribute(s), or if there are multiple joins on the same attribute, so the cost of sorting can be amortized over the additional join(s)
  - if joining R and S, R is small, and S has an index on the join attribute, then use index-join
  - if none of the above apply, use hash-join

37 Materialization vs. Pipelining
- Materialization: perform operations in series and write intermediate results to disk
- Pipelining: interleave execution of several operations. Tuples produced by one operation are passed directly to the operations that use them as input, bypassing the disk
  - saves on disk I/O's
  - requires more main memory
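
Python generators give a convenient toy model of the difference. This is only an analogy under stated assumptions: the operator functions are hypothetical, everything lives in memory, and a fully built list stands in for an intermediate result written to disk.

```python
def scan(rows):
    """Produce the stored tuples one at a time."""
    for row in rows:
        yield row


def select(rows, pred):
    """Pass along only the tuples that satisfy pred."""
    for row in rows:
        if pred(row):
            yield row


def project(rows, positions):
    """Keep only the listed column positions of each tuple."""
    for row in rows:
        yield tuple(row[i] for i in positions)


stored = [(1, 2, 3), (1, 5, 9), (2, 2, 4)]

# Pipelined: each tuple flows through select and project as soon as it is
# scanned; no intermediate relation is ever built.
pipelined = list(project(select(scan(stored), lambda r: r[0] == 1), [1, 2]))

# Materialized: the selection result is built in full first (the list stands
# in for a relation written to disk), then scanned again by the projection.
intermediate = list(select(scan(stored), lambda r: r[0] == 1))
materialized = list(project(scan(intermediate), [1, 2]))

print(pipelined)
print(materialized)
```

Both strategies compute the same answer; they differ in whether the intermediate result exists as a whole at any point.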

38 Notation for Physical Query Plan
When converting a logical query plan (tree) to a physical query plan (tree):
- leaves of the LQP (stored relations) become scan operators
- internal nodes of the LQP (operators) become one or more physical operations (algorithms)
- edges of the LQP are marked as "pipeline" or "materialize"
  - the "materialize" choice implies a scan of the intermediate relation

39 Operators for Leaves
- TableScan(R): all blocks holding tuples of R are read in arbitrary order
- SortScan(R,L): all tuples of R are read in order, sorted according to the attributes in L
- IndexScan(R,C): tuples of R satisfying C are retrieved through an index on attribute A; C is a comparison condition involving A
- IndexScan(R,A): all tuples of R are retrieved through an index on A

40 Physical Operators for Selection
- If there is no index on the attribute in the condition C, then use the Filter(C) operator
- If the relation is on disk, then we must precede the Filter with TableScan or SortScan
- If the condition has the form A op c AND D, then use the physical operators IndexScan(R,A op c) followed by Filter(D)
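
The IndexScan-then-Filter pattern can be imitated in memory. In this sketch a dictionary from attribute value to tuples stands in for an index; `build_index`, `index_scan`, and `filter_op` are hypothetical names, and the data is made up for illustration.

```python
from collections import defaultdict


def build_index(rows, pos):
    """Map each value at column position pos to the tuples holding it."""
    index = defaultdict(list)
    for row in rows:
        index[row[pos]].append(row)
    return index


def index_scan(index, value):
    """IndexScan(R, A = value): yield only the matching tuples."""
    yield from index.get(value, [])


def filter_op(rows, pred):
    """Filter(D): check the rest of the condition on each retrieved tuple."""
    return (row for row in rows if pred(row))


# R(x, y, z) with an index on y; condition x = 1 AND y = 2 AND z < 5
R = [(1, 2, 3), (1, 2, 7), (4, 2, 1), (1, 9, 0)]
y_index = build_index(R, 1)
result = list(filter_op(index_scan(y_index, 2), lambda r: r[0] == 1 and r[2] < 5))
print(result)  # [(1, 2, 3)]
```

The index narrows the input to the tuples with y = 2, and the filter then applies the remaining conjuncts x = 1 AND z < 5 to those tuples only.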

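41 Example Physical Query Plans
- Plan for σ_{x=1 AND y=2 AND z<5}(R): IndexScan(R, y=2) followed by Filter(x=1 AND z<5)
- Plan for R ⋈ S ⋈ U: TableScan(R) and TableScan(S) feed a two-pass hash-join (101 buffers); its result is materialized and then joined with TableScan(U) by a second two-pass hash-join (101 buffers)
[Figure: the two physical plan trees]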
