Completing the Physical-Query-Plan and Chapter 16 Summary (16.7-16.8). CS257, Spring 2009. Professor Tsau Lin. Students: Suntorn Sae-Eung, Donavon Norwood.

2 Outline
16.7 Completing the Physical-Query-Plan
  I. Choosing a Selection Method
  II. Choosing a Join Method
  III. Pipelining Versus Materialization
  IV. Pipelining Unary Operations
  V. Pipelining Binary Operations
  VI. Notation for Physical Query Plans
  VII. Ordering of Physical Operations
16.8 Summary of Chapter 16

3 Before Completing the Physical-Query-Plan
- By this point, a query has been:
  - Parsed and preprocessed (16.1)
  - Converted to logical query plans (16.3)
  - Given cost estimates for its operations (16.4)
  - Assigned a plan by cost-based plan selection (16.5)
  - Given a join order, chosen by weighing the costs of the join operations (16.6)

4 16.7 Completing the Physical-Query-Plan
- Three topics in turning a logical plan (LP) into a complete physical plan:
  1. Choosing physical implementations, such as selection and join methods
  2. Deciding whether intermediate results are materialized or pipelined
  3. Notation for physical-query-plan operators

5 I. Choosing a Selection Method (A)
- Algorithm for each selection operator:
  1. Can we use an existing index on an attribute? If yes, use index-scan; otherwise, use table-scan.
  2. After retrieving all tuples that satisfy the condition in (1), filter them with the remaining selection conditions.

6 Choosing a Selection Method (A) (cont.)
- Recall: cost of a query = number of disk I/Os
- How costs for various plans are estimated for a σ_C(R) operation:
  1. Cost of the table-scan algorithm
     a) B(R) if R is clustered
     b) T(R) if R is not clustered
  2. Cost of a plan that picks an equality term (e.g. a = 10) with index-scan
     a) B(R) / V(R, a) with a clustering index
     b) T(R) / V(R, a) with a nonclustering index
  3. Cost of a plan that picks an inequality term (e.g. b < 20) with index-scan
     a) B(R) / 3 with a clustering index
     b) T(R) / 3 with a nonclustering index

7 Example
Selection: σ_{x=1 AND y=2 AND z<5}(R)
- The parameters of R(x, y, z) are: T(R) = 5000, B(R) = 200, V(R, x) = 100, and V(R, y) = 500
- Relation R is clustered
- x and y have nonclustering indexes; only the index on z is clustering

8 Example (cont.)
Selection options:
  1. Table-scan, then filter on x, y, z. Cost is B(R) = 200, since R is clustered.
  2. Use the index on x = 1, then filter on y, z. Cost is 50, since T(R) / V(R, x) = 5000/100 = 50 tuples and the index is not clustering.
  3. Use the index on y = 2, then filter on x, z. Cost is 10, since T(R) / V(R, y) = 5000/500 = 10 tuples using a nonclustering index.
  4. Index-scan on the clustering index with z < 5, then filter on x, y. Cost is about B(R) / 3 = 67.

9 Example (cont.)
- Costs: option 1 = 200, option 2 = 50, option 3 = 10, option 4 = 67
- The lowest cost is option 3.
- Therefore, the preferred physical plan
  1. retrieves all tuples with y = 2,
  2. then filters on the two remaining conditions (x = 1 and z < 5).
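The arithmetic behind the four options can be reproduced with a short Python sketch. The helper names are hypothetical, and rounding up with `ceil` is an assumption made here (the slides only say "about 67"); the statistics and formulas are the ones given in the example.

```python
import math

# Statistics for R(x, y, z) taken from the example.
T_R = 5000                    # number of tuples
B_R = 200                     # number of blocks (R is clustered)
V = {"x": 100, "y": 500}      # distinct values per attribute

def equality_index_cost(attr, clustering):
    """Estimated disk I/Os for an index-scan on `attr = constant`."""
    base = B_R if clustering else T_R
    return math.ceil(base / V[attr])

def inequality_index_cost(clustering):
    """Estimated disk I/Os for an index-scan on an inequality such as z < 5."""
    base = B_R if clustering else T_R
    return math.ceil(base / 3)

costs = {
    "1: table-scan": B_R,
    "2: index on x": equality_index_cost("x", clustering=False),
    "3: index on y": equality_index_cost("y", clustering=False),
    "4: index on z": inequality_index_cost(clustering=True),
}
best = min(costs, key=costs.get)
print(costs, best)
```

Picking the minimum of the estimates selects option 3, matching the slide.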

10 II. Choosing a Join Method
- Determine the costs associated with each join algorithm:
  1. One-pass join or nested-loop join, when enough buffers can be devoted to the join
  2. Sort-join is preferred when the arguments are already sorted on the join attribute, or when two or more joins are on the same attribute, such as (R(a, b) ⋈ S(a, c)) ⋈ T(a, d), where sorting R and S on a produces the result of R ⋈ S sorted on a, ready to be used directly in the next join

11 Choosing a Join Method (cont.)
  3. Index-join, when there is a good chance of using an index on the join attribute, such as R(a, b) ⋈ S(b, c) with an index on S.b
  4. Hash-join is the best choice when the relations are unsorted and unindexed and a multipass join is needed

12 III. Pipelining Versus Materialization
- Materialization (the naïve way)
  - Store the (intermediate) result of each operation on disk
- Pipelining (the more efficient way)
  - Interleave the execution of several operations; the tuples produced by one operation are passed directly to the operation that uses them
  - Keep the (intermediate) result of each operation in buffers in main memory

13 IV. Pipelining Unary Operations
- Unary operations are either tuple-at-a-time or full-relation
- Selection and projection are the best candidates for pipelining
[Figure: R feeds two pipelined unary operations, each with an input buffer and an output buffer; label: M-1 buffers]

14 Pipelining Unary Operations (cont.)
- Pipelined unary operations are implemented by iterators
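One way to picture iterator-based pipelining is with Python generators, where each operator pulls tuples from its child on demand and nothing is written to disk in between. This is a minimal sketch only: the operator names and the sample relation are hypothetical, and a real engine would pass whole blocks through buffers rather than single Python dictionaries.

```python
def table_scan(relation):
    """Leaf operator: yield the tuples of a stored relation one at a time."""
    for t in relation:
        yield t

def select(pred, child):
    """Pipelined selection: pass through only the tuples that satisfy pred."""
    for t in child:
        if pred(t):
            yield t

def project(attrs, child):
    """Pipelined projection: keep only the listed attributes of each tuple."""
    for t in child:
        yield {a: t[a] for a in attrs}

# Hypothetical sample data for R(x, y, z).
R = [
    {"x": 1, "y": 2, "z": 3},
    {"x": 1, "y": 2, "z": 9},
    {"x": 4, "y": 2, "z": 1},
]

# No intermediate relation is stored: each tuple flows through
# scan -> select -> project before the next tuple is read.
plan = project(["x", "z"], select(lambda t: t["x"] == 1 and t["z"] < 5, table_scan(R)))
result = list(plan)
print(result)
```

Materializing instead would amount to calling `list(...)` on each operator's output before handing it to the next one.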

15 V. Pipelining Binary Operations
- Binary operations: ∪, ∩, −, ⋈, ×
- The results of binary operations can also be pipelined.
- Use one buffer to pass the result to its consumer, one block at a time.
- The extended example that follows shows the tradeoffs and opportunities.

16 Example
- Consider a physical query plan for the expression (R(w, x) ⋈ S(x, y)) ⋈ U(y, z)
- Assumptions:
  - R occupies 5,000 blocks; S and U each occupy 10,000 blocks.
  - The intermediate result R ⋈ S occupies k blocks for some k.
  - Both joins are implemented as hash-joins, either one-pass or two-pass depending on k.
  - There are 101 buffers available.

17 Example (cont.)
- First consider the join R ⋈ S: neither relation fits in the buffers.
- A two-pass hash-join is needed; partitioning R into 100 buckets (the maximum possible) gives buckets of 50 blocks each.
- The second pass of the hash-join uses 51 buffers, leaving the remaining 50 buffers for joining the result of R ⋈ S with U.

18 Example (cont.)
- Case 1: suppose k ≤ 49, so the result of R ⋈ S occupies at most 49 blocks.
- Steps:
  1. Pipeline the result of R ⋈ S into 49 buffers
  2. Organize them for lookup as a hash table
  3. Use the one remaining buffer to read each block of U in turn
  4. Execute the second join as a one-pass join

19 Example (cont.)
- The total number of disk I/Os is 55,000:
  - 45,000 for the two-pass hash-join of R and S
  - 10,000 to read U for the one-pass hash-join of (R ⋈ S) ⋈ U

20 Example (cont.)
- Case 2: suppose k > 49 but k ≤ 5,000. We can still pipeline, but we need another strategy, in which the intermediate result is joined with U in a 50-bucket, two-pass hash-join. Steps:
  1. Before starting on R ⋈ S, hash U into 50 buckets of 200 blocks each.
  2. Perform the two-pass hash-join of R and S using 51 buffers as in case 1, placing the results in the 50 remaining buffers to form 50 buckets for the join of R ⋈ S with U.
  3. Finally, join R ⋈ S with U bucket by bucket.

21 Example (cont.)
- The number of disk I/Os is:
  - 20,000 to read U and write its tuples into buckets
  - 45,000 for the two-pass hash-join R ⋈ S
  - k to write out the buckets of R ⋈ S
  - k + 10,000 to read the buckets of R ⋈ S and U in the final join
- The total cost is 75,000 + 2k.

22 Example (cont.)
- Compare the I/O counts of case 1 and case 2:
  - k ≤ 49 (case 1): 55,000 disk I/Os
  - 50 ≤ k ≤ 5,000 (case 2):
    - k = 50: 75,000 + (2*50) = 75,100
    - k = 51: 75,000 + (2*51) = 75,102
    - k = 52: 75,000 + (2*52) = 75,104
- Notice: the I/O count jumps abruptly as k increases from 49 to 50.

23 Example (cont.)
- Case 3: k > 5,000. We cannot perform a two-pass join in the 50 buffers available if the result of R ⋈ S is pipelined. Steps:
  1. Compute R ⋈ S using a two-pass join and store the result on disk.
  2. Join the result of (1) with U, using a two-pass join.

24 Example (cont.)
- The number of disk I/Os is:
  - 45,000 for the two-pass hash-join of R and S
  - k to store R ⋈ S on disk
  - 30,000 + 3k for the two-pass hash-join of U with R ⋈ S
- The total cost is 75,000 + 4k.

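25 Example (cont.)
- In summary, the cost of the physical plan as a function of the size k of R ⋈ S:

  Range of k        Intermediate result   Final join            Total disk I/Os
  k ≤ 49            Pipelined             One-pass              55,000
  50 ≤ k ≤ 5,000    Pipelined             50-bucket, two-pass   75,000 + 2k
  k > 5,000         Materialized          Two-pass              75,000 + 4k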
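The three cases can be collected into one cost function. The sketch below is illustrative only: the function name is hypothetical, and it assumes the usual estimate of 3(B(X) + B(Y)) disk I/Os for a two-pass hash-join, which is what yields the 45,000 figure for R and S.

```python
B_R, B_S, B_U = 5_000, 10_000, 10_000   # block counts from the example

def plan_cost(k):
    """Estimated disk I/Os for (R join S) join U when R join S occupies k blocks."""
    first_join = 3 * (B_R + B_S)        # two-pass hash-join of R and S: 45,000
    if k <= 49:
        # Case 1: pipeline R join S into memory, read U once (one-pass join).
        return first_join + B_U
    if k <= 5_000:
        # Case 2: hash U into buckets (read + write), write out the buckets
        # of R join S, then read both sets of buckets for the final join.
        return 2 * B_U + first_join + k + (k + B_U)
    # Case 3: materialize R join S (k writes), then two-pass hash-join with U.
    return first_join + k + 3 * (k + B_U)

for k in (49, 50, 5_000, 5_001):
    print(k, plan_cost(k))
```

Evaluating the function on either side of each boundary makes the jumps visible: the cost moves from 55,000 to 75,100 between k = 49 and k = 50, and from 85,000 to 95,004 between k = 5,000 and k = 5,001.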

26 VI. Notation for Physical Query Plans
- Several types of operators:
  1. Operators for leaves
  2. (Physical) operators for selection
  3. (Physical) sort operators
  4. Other relational-algebra operations
- In practice, each DBMS uses its own internal notation for physical query plans.

27 Notation for Physical Query Plans (cont.)
1. Operators for leaves
- Each leaf operand (a stored relation R) in the LQP tree is replaced by a scan operator:
  - TableScan(R): read all blocks of R
  - SortScan(R, L): read R in sorted order according to the attribute list L
  - IndexScan(R, C): use an index on attribute A to retrieve the tuples satisfying condition C, which has the form Aθc
  - IndexScan(R, A): retrieve all of R through an index on R.A. This behaves like TableScan but may be more efficient if R is not clustered.

28 Notation for Physical Query Plans (cont.)
2. (Physical) operators for selection
- The logical operator σ_C(R) is often combined with the access method for R.
- If there is no index on R, or none on an attribute in condition C:
  - Use TableScan(R) or SortScan(R, L) to access R
  - Replace σ_C(R) by Filter(C)
- If condition C has the form Aθc AND D for some condition D, and there is an index on R.A, then we may
  - use the operator IndexScan(R, Aθc) to access R, and
  - use Filter(D) in place of the selection σ_C(R)

29 Notation for Physical Query Plans (cont.)
3. (Physical) sort operators
- Sorting can occur at any point in the physical plan. For a stored relation, the notation is SortScan(R, L).
- An explicit operator Sort(L) is commonly used to sort an operand relation that is not stored.
- Sort(L) can also be applied at the top of the physical-query-plan tree if the result must be sorted because of an ORDER BY clause (τ).

30 Notation for Physical Query Plans (cont.)
4. Other relational-algebra operations
- Descriptive text and annotations specify:
  - The operation performed, e.g. join or grouping.
  - Necessary parameters, e.g. the condition of a theta-join or the list of elements in a grouping.
  - A general strategy for the algorithm, e.g. sort-based, hash-based, or index-based.
  - A decision about the number of passes to be used, e.g. one-pass, two-pass, or multipass.
  - The anticipated number of buffers the operation will require.

31 Notation for Physical Query Plans (cont.)
- Example of a physical query plan
- A physical query plan from Example 16.36 for the case k > 5,000:
  - TableScan
  - Two-pass hash-join
  - Materialize (double line)
  - Store operator

32 Notation for Physical Query Plans (cont.)
- Another example
- A physical query plan from Example 16.36 for the case k ≤ 49:
  - TableScan
  - (2) Two-pass hash-join
  - Pipelining
  - Different buffer needs
  - Store operator

33 Notation for Physical Query Plans (cont.)
- A physical query plan from Example 16.35:
  - Use the index on the condition y = 2 first
  - Then filter on the remaining conditions
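The plan Filter(x = 1 AND z < 5) over IndexScan(R, y = 2) can be mimicked in Python. This is a rough sketch under stated assumptions: an in-memory dictionary stands in for the index on y, the function names are hypothetical stand-ins for the physical operators, and the four sample tuples are made up for illustration.

```python
from collections import defaultdict

def build_index(relation, attr):
    """Hypothetical stand-in for an index: maps each value of attr to its tuples."""
    index = defaultdict(list)
    for t in relation:
        index[t[attr]].append(t)
    return index

def index_scan(index, value):
    """IndexScan(R, A = value): retrieve only the tuples found through the index."""
    yield from index.get(value, [])

def filter_op(pred, child):
    """Filter(D): apply the remaining conditions to the tuples from the child."""
    for t in child:
        if pred(t):
            yield t

# Made-up sample tuples of R(x, y, z).
R = [
    {"x": 1, "y": 2, "z": 3},
    {"x": 1, "y": 2, "z": 7},
    {"x": 5, "y": 2, "z": 1},
    {"x": 1, "y": 9, "z": 0},
]
y_index = build_index(R, "y")

# Filter(x = 1 AND z < 5) applied on top of IndexScan(R, y = 2).
plan = filter_op(lambda t: t["x"] == 1 and t["z"] < 5, index_scan(y_index, 2))
answer = list(plan)
print(answer)
```

Only the three tuples with y = 2 ever reach the filter; the fourth tuple is never touched, which is the point of choosing the index-scan first.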

34 VII. Ordering of Physical Operations
- The PQP is represented as a tree, whose structure implies an order of operations.
- Still, the order of evaluation of interior nodes may not always be clear.
  - Iterators are used in a pipelined manner
  - The overlapping execution times of the various nodes make a strict "ordering" meaningless

35 Ordering of Physical Operations (cont.)
- Three rules summarize the ordering of events in a PQP tree:
  1. Break the tree into subtrees at each edge that represents materialization.
     - Execute one subtree at a time.
  2. Order the execution of the subtrees
     - Bottom-up
     - Left-to-right
  3. Execute all nodes of each subtree simultaneously.
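The three rules can be sketched as a small tree walk. Everything in this example is an assumption made for illustration: the `Node` class, the `materialized` flag marking an edge to the parent, and the operator labels. It is not the notation of any particular DBMS.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    children: list = field(default_factory=list)
    materialized: bool = False   # True if the edge to the parent is a materialization

def execution_groups(root):
    """Split the plan at materialized edges; return the subtrees in execution order.

    Each inner list holds the operators of one subtree, which run together
    as a pipeline. Subtrees are emitted bottom-up, left-to-right.
    """
    groups = []

    def visit(node, current):
        current.append(node.name)
        for child in node.children:          # left-to-right
            if child.materialized:
                subtree = []
                visit(child, subtree)        # deeper subtrees are emitted first
                groups.append(subtree)
            else:
                visit(child, current)

    top = []
    visit(root, top)
    groups.append(top)                       # the root's subtree runs last
    return groups

# The k > 5,000 plan: R join S is materialized before being joined with U.
plan = Node("hash-join 2", [
    Node("hash-join 1", [Node("TableScan(R)"), Node("TableScan(S)")], materialized=True),
    Node("TableScan(U)"),
])
groups = execution_groups(plan)
print(groups)
```

For this plan the first group contains the join of R and S with its two scans, and the second group contains the final join together with the scan of U.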

36 Summary of Chapter 16
In this part of the presentation I will talk about the main topics of Chapter 16.

37 COMPILATION OF QUERIES
- Compilation turns a query into a physical query plan that the query engine can execute.
- Steps of query compilation:
  - Parsing
  - Semantic checking
  - Selection of the preferred logical query plan
  - Generation of the best physical plan

38 THE PARSER
- The first step of SQL query processing.
- Generates a parse tree.
- Nodes in the parse tree correspond to SQL constructs.
- Similar to the parser in a compiler for a programming language.

39 VIEW EXPANSION
- A very critical part of query compilation.
- Replaces each view reference in the query tree with the tree for the view's definition.
- Provides opportunities for query optimization.

40 SEMANTIC CHECKING
- Checks the semantics of a SQL query.
- Examines the parse tree.
- Checks:
  - Attributes
  - Relation names
  - Types
- Resolves attribute references.

41 CONVERSION TO A LOGICAL QUERY PLAN
- Converts the semantically checked parse tree to an algebraic expression.
- Conversion is usually straightforward, but subqueries need special handling.
- One approach is to use a two-argument selection.

42 ALGEBRAIC TRANSFORMATION
- There are many ways to transform a logical query plan into a better plan using algebraic transformations.
- The laws used for these transformations:
  - Commutative and associative laws
  - Laws involving selection
  - Pushing selections
  - Laws involving projection
  - Laws about joins and products
  - Laws involving duplicate elimination
  - Laws involving grouping and aggregation

43 ESTIMATING SIZES OF RELATIONS
- When selecting the best logical plan, the estimated sizes of intermediate relations stand in for the true running time.
- The two factors that most affect these size estimates:
  - The size of each relation (number of tuples)
  - The number of distinct values for each attribute of each relation
- Some systems use histograms.

44 COST-BASED OPTIMIZING
- The best physical query plan is the least costly plan.
- Factors that decide the cost of a query plan:
  - The order and grouping of operations such as joins, unions, and intersections.
  - Whether nested-loop or hash joins are used.
  - Scanning and sorting operations.
  - Storing intermediate results.

45 PLAN ENUMERATION STRATEGIES
- Common approaches for searching the space of physical plans for the best one:
  - Dynamic programming: tabulating the best plan for each subexpression
  - Selinger-style dynamic programming: including the sort order of the results as part of the table
  - Greedy approaches: making a series of locally optimal decisions
  - Branch-and-bound: enumerating only plans that are not already known to be worse than the best plan found so far

46 LEFT-DEEP JOIN TREES
- Left-deep join trees are binary trees with a single spine down the left edge and with leaves as right children.
- Restricting the search to them reduces the number of plans to be considered for the best physical plan.
- Restrict the search to left-deep join trees when picking a grouping and order for the join of several relations.
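To get a feel for how much the restriction shrinks the search space, the two counts can be compared directly. The sketch below assumes the standard counting argument, which is not spelled out on the slide: n relations can be placed on a left-deep spine in n! orders, while the number of binary tree shapes with n leaves is the Catalan number C(n-1), each of which can be labeled in n! ways. The function names are hypothetical.

```python
from math import comb, factorial

def left_deep_trees(n):
    """Number of left-deep join trees for n relations: one per ordering."""
    return factorial(n)

def all_join_trees(n):
    """Number of join trees of any shape: tree shapes times leaf orderings."""
    shapes = comb(2 * (n - 1), n - 1) // n   # Catalan number C(n-1)
    return shapes * factorial(n)

for n in (3, 4, 6):
    print(n, left_deep_trees(n), all_join_trees(n))
```

For six relations this gives 720 left-deep trees out of 30,240 trees overall, so the restricted search examines a small fraction of the full space.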

47 PHYSICAL PLANS FOR SELECTION
- A selection can be broken into an index-scan of the relation, followed by a filter operation.
- The filter then examines the tuples retrieved by the index-scan.
- It passes only those tuples that meet the portions of the selection condition not handled by the index-scan.

48 PIPELINING VERSUS MATERIALIZING
- Ideally, the result of each operator is consumed by another operator and passed between them through main memory ("pipelining").
- Sometimes it is better to store an intermediate result on disk, removing it from main memory to save space for other operators ("materialization").
- The physical-query-plan generator should consider both pipelining and materialization.

49 Questions & Answers

51 Reference
[1] H. Garcia-Molina, J. Ullman, and J. Widom, "Database Systems: The Complete Book," 2nd ed., pp. 897-913, Prentice Hall, New Jersey, 2008.
