Query Optimization
The execution cost is expressed as a weighted combination of I/O, CPU, and communication costs. Early distributed query optimizers ignored local processing costs (I/O and CPU), although these are also important. The important inputs to the optimizer for estimating execution costs are fragment statistics and formulas for estimating the cardinalities of the results of relational operations. Query optimization is a general term: it applies whether the environment is centralized or distributed. It refers to the process of producing a query execution plan (QEP), which represents an execution strategy for the query.
This strategy or plan is chosen to minimize a cost function. A query optimizer is the software module that performs query optimization. It normally has three main components: a search space, a cost model, and a search strategy.
Search Space: The search space is the set of alternative execution plans that represent the input query. These plans are equivalent, in that they yield the same result, but they differ in the execution order of operations and in the way these operations are implemented, and therefore in performance.
The search space is obtained by applying transformation rules, i.e. the equivalence rules of relational algebra. For example, take the query "Find the names of employees other than J. Doe who worked on the CAD/CAM project for either one or two years". The SQL for this query is:
SELECT ename FROM proj, asg, emp WHERE asg.eno = emp.eno AND asg.pno = proj.pno AND ename <> "J. Doe" AND proj.pname = "CAD/CAM" AND (dur = 12 OR dur = 24)
A tree transformation of the above query
We can make another tree for the same above query
Another format for the above query
Search space (cont.)
The three trees above illustrate the transformation rules. Query execution plans are typically abstracted by means of operator trees, which define the order in which the operations are executed. The trees are enriched with additional information, such as the best algorithm chosen for each operation. The search space can therefore be defined as the set of equivalent operator trees that can be produced using transformation rules. To characterize query optimizers, it is useful to concentrate on join trees, which are operator trees whose operators are joins or Cartesian products.
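To make the idea of transformation rules concrete, here is a minimal sketch that represents a join tree as nested tuples and applies two standard rules, join commutativity and join associativity. The tuple encoding and the function names `commute`, `associate`, and `relations` are assumptions made for illustration; they do not come from any particular optimizer.

```python
# A join tree is either a relation name (str) or a tuple ("JOIN", left, right).
# This encoding is an illustrative assumption, not a standard representation.

def commute(tree):
    """Join commutativity: R JOIN S is equivalent to S JOIN R."""
    op, left, right = tree
    return (op, right, left)

def associate(tree):
    """Join associativity: (R JOIN S) JOIN T is equivalent to R JOIN (S JOIN T)."""
    op, left, right = tree
    inner_op, r, s = left  # the left operand must itself be a join
    return (op, r, (inner_op, s, right))

def relations(tree):
    """Return the base relations of a tree, left to right."""
    if isinstance(tree, str):
        return [tree]
    _, left, right = tree
    return relations(left) + relations(right)

original = ("JOIN", ("JOIN", "emp", "asg"), "proj")
for candidate in (original, commute(original), associate(original)):
    print(candidate)
```

Each rule yields a different tree over the same base relations, which is why repeated application of such rules generates a set of equivalent plans.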
Search space example:
SELECT ename, resp FROM emp, asg, proj WHERE emp.eno = asg.eno AND asg.pno = proj.pno
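For this three-relation query, the alternative join orders can be enumerated directly. The sketch below lists every left-deep order and flags the ones that force a Cartesian product, because the next relation shares no join predicate with the relations already joined. The names `JOIN_EDGES`, `needs_cartesian_product`, and `left_deep_orders` are hypothetical helpers introduced for this example.

```python
from itertools import permutations

# Join predicates of the example query: emp.eno = asg.eno and asg.pno = proj.pno.
JOIN_EDGES = {frozenset({"emp", "asg"}), frozenset({"asg", "proj"})}

def needs_cartesian_product(order):
    """True if some relation has no join predicate with any relation before it."""
    seen = {order[0]}
    for rel in order[1:]:
        if not any(frozenset({rel, s}) in JOIN_EDGES for s in seen):
            return True
        seen.add(rel)
    return False

def left_deep_orders(relations):
    """Pair every left-deep join order with its Cartesian-product flag."""
    return [(order, needs_cartesian_product(order))
            for order in permutations(relations)]

for order, cartesian in left_deep_orders(["emp", "asg", "proj"]):
    label = "uses a Cartesian product" if cartesian else "joins only"
    print(" JOIN ".join(order), "->", label)
```

Of the six orders, the two that join emp and proj first need a Cartesian product, since the query has no predicate linking those two relations.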
Of the equivalent join trees for this query, the one labeled (c), which starts with a Cartesian product, may have a much higher cost than the other join trees. Query optimizers typically restrict the size of the search space they consider. The first restriction is to use heuristics. The most common heuristic is to perform selection and projection when accessing base relations. Another important restriction concerns the shape of the join tree. Two kinds of join trees are usually distinguished: linear trees and bushy trees.
Linear Tree: A tree in which at least one operand of each operator node is a base relation. Considering only linear trees reduces the size of the search space.
Bushy Tree: A more general tree, which may have operators with no base relations as operands (both operands are intermediate relations). Bushy trees are nevertheless useful in a distributed environment, because they can exhibit parallelism.
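A rough sense of how much the linear-tree restriction shrinks the search space comes from counting trees. The sketch below counts join trees over n relations, treating R JOIN S and S JOIN R as different trees and not excluding Cartesian products; both choices are assumptions of this example, and the function names are hypothetical. The formulas used are n! for left-deep trees, n! * 2^(n-2) for linear trees, and (2(n-1))! / (n-1)! for all tree shapes, bushy ones included.

```python
from math import factorial

def count_left_deep_trees(n):
    """Left-deep trees: one per ordering of the n relations."""
    return factorial(n)

def count_linear_trees(n):
    """Linear trees: every join above the first takes the new base
    relation as either its left or its right operand."""
    if n < 2:
        return 1
    return factorial(n) * 2 ** (n - 2)

def count_all_join_trees(n):
    """All ordered binary trees with n labeled leaves (linear and bushy)."""
    return factorial(2 * (n - 1)) // factorial(n - 1)

for n in range(2, 9):
    print(n, count_left_deep_trees(n), count_linear_trees(n), count_all_join_trees(n))
```

With three relations every tree is linear, so the two larger counts coincide; from four relations on, the full space grows noticeably faster than the linear one.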
Cost Model
The cost model predicts the cost of a given execution plan. The cost of a distributed execution plan can be expressed in terms of either the total time or the response time. The total time is the sum of all time components, while the response time is the elapsed time from the initiation to the completion of the query. A general formula for determining the total time is given by [Lohman et al., 1985]:
Total_time = T_CPU * #insts + T_I/O * #I/Os + T_MSG * #msgs + T_TR * #bytes
T_CPU is the time of a CPU instruction and T_I/O is the time of a disk I/O. The communication time is accounted for by the last two components: T_MSG is the fixed time of sending and receiving a message, and T_TR is the time it takes to transmit a data unit from one site to another. The data unit is given here in bytes (#bytes is the sum of the sizes of all messages).
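The total-time formula translates directly into code. The sketch below is a literal transcription of the four terms; the class name `CostParameters`, the field names, and all numeric values are assumptions chosen only to make the arithmetic easy to follow, not measurements of any real system.

```python
from dataclasses import dataclass

@dataclass
class CostParameters:
    t_cpu: float  # time of one CPU instruction
    t_io: float   # time of one disk I/O
    t_msg: float  # fixed time of sending and receiving one message
    t_tr: float   # time to transmit one byte between sites

def total_time(p, n_insts, n_ios, n_msgs, n_bytes):
    """Total_time = T_CPU*#insts + T_I/O*#I/Os + T_MSG*#msgs + T_TR*#bytes."""
    return (p.t_cpu * n_insts
            + p.t_io * n_ios
            + p.t_msg * n_msgs
            + p.t_tr * n_bytes)

# Made-up unit costs and counts, for illustration only.
params = CostParameters(t_cpu=1.0, t_io=10.0, t_msg=100.0, t_tr=2.0)
print(total_time(params, n_insts=5, n_ios=3, n_msgs=2, n_bytes=50))  # 335.0
```

Changing the relative size of the four unit costs is how the same formula can model either a network where communication dominates or one where local processing matters just as much.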
Search Strategy
The search strategy explores the search space and selects the best plan, using the cost model. It defines which plans are examined and in which order. The details of the environment (centralized or distributed) are captured by the search space and the cost model. The most popular search strategy used by query optimizers is dynamic programming, which is deterministic. Deterministic strategies build plans starting from base relations, joining one more relation at each step until complete plans are obtained. Dynamic programming builds all possible plans before it selects the "best" plan.
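The dynamic programming strategy described above can be sketched as follows. This is a simplified illustration, not the algorithm of any specific system: it considers only left-deep plans, keeps the cheapest plan found for each subset of relations, and uses a toy cost (the sum of estimated intermediate result sizes) in place of a full cost model. The cardinalities and join selectivities in `CARDS` and `SELECTIVITY` are invented for the example.

```python
from itertools import combinations

# Invented statistics, for illustration only.
CARDS = {"emp": 512, "asg": 2048, "proj": 128}
SELECTIVITY = {
    frozenset({"emp", "asg"}): 1 / 512,
    frozenset({"asg", "proj"}): 1 / 1024,
}

def dp_join_order(cards, selectivity):
    """Return (cost, cardinality, order) of the cheapest left-deep plan."""
    rels = sorted(cards)
    # best[subset] holds the cheapest plan found for joining that subset.
    best = {frozenset([r]): (0.0, float(cards[r]), (r,)) for r in rels}
    for size in range(2, len(rels) + 1):
        for subset in map(frozenset, combinations(rels, size)):
            for r in sorted(subset):
                rest = subset - {r}
                cost, card, order = best[rest]
                sel = 1.0  # stays 1.0 for a Cartesian product
                for s in rest:
                    sel *= selectivity.get(frozenset({r, s}), 1.0)
                new_card = card * cards[r] * sel
                new_cost = cost + new_card
                # Keep only the cheapest plan per subset; the rest are pruned.
                if subset not in best or new_cost < best[subset][0]:
                    best[subset] = (new_cost, new_card, order + (r,))
    return best[frozenset(rels)]

cost, card, order = dp_join_order(CARDS, SELECTIVITY)
print(order, cost, card)
```

With these invented numbers the cheapest order is proj, asg, emp with a cost of 512. The table `best` has one entry per subset of relations, which is why the approach becomes expensive as the number of relations grows.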
To reduce the optimization cost, partial plans that are not likely to lead to the optimal plan are pruned as soon as they are found; dynamic programming proceeds breadth-first. Another deterministic strategy, the greedy algorithm, builds only one plan, depth-first. Dynamic programming is almost exhaustive and ensures that the "best" of all plans is found. Its cost is acceptable when the number of relations in the query is small. Unlike deterministic strategies, randomized strategies allow the optimizer to trade optimization time for execution time. Randomized strategies, such as iterative improvement and simulated annealing, concentrate on searching for the optimal solution around some particular points. They do not guarantee that the best solution is obtained, but they avoid the high cost of optimization in terms of memory and time consumption.
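The contrast between a greedy strategy and a randomized one can be sketched on the same kind of toy problem. The code below shows a greedy builder that commits to one plan, and a basic iterative improvement loop that starts from random join orders and accepts any swap of two relations that lowers the cost. Simulated annealing is not shown. The statistics, the cost function (sum of estimated intermediate result sizes), and all function names are assumptions made for this example.

```python
import random

# Invented statistics, for illustration only.
CARDS = {"emp": 512, "asg": 2048, "proj": 128}
SELECTIVITY = {
    frozenset({"emp", "asg"}): 1 / 512,
    frozenset({"asg", "proj"}): 1 / 1024,
}

def plan_cost(order, cards, selectivity):
    """Toy cost: sum of the estimated sizes of all intermediate results."""
    card = float(cards[order[0]])
    cost = 0.0
    for i in range(1, len(order)):
        sel = 1.0
        for s in order[:i]:
            sel *= selectivity.get(frozenset({order[i], s}), 1.0)
        card = card * cards[order[i]] * sel
        cost += card
    return cost

def greedy_order(cards, selectivity):
    """Build one plan: start with the smallest relation, then always
    append the relation that gives the cheapest partial plan."""
    remaining = set(cards)
    order = [min(sorted(remaining), key=lambda r: cards[r])]
    remaining.remove(order[0])
    while remaining:
        nxt = min(sorted(remaining),
                  key=lambda r: plan_cost(order + [r], cards, selectivity))
        order.append(nxt)
        remaining.remove(nxt)
    return tuple(order)

def iterative_improvement(cards, selectivity, restarts=5, steps=50, seed=0):
    """Random restarts plus local search over swaps of two relations."""
    rng = random.Random(seed)
    best_order, best_cost = None, float("inf")
    for _ in range(restarts):
        order = sorted(cards)
        rng.shuffle(order)
        cost = plan_cost(order, cards, selectivity)
        for _ in range(steps):
            i, j = rng.sample(range(len(order)), 2)
            neighbor = order[:]
            neighbor[i], neighbor[j] = neighbor[j], neighbor[i]
            n_cost = plan_cost(neighbor, cards, selectivity)
            if n_cost < cost:  # accept only improving moves
                order, cost = neighbor, n_cost
        if cost < best_cost:
            best_order, best_cost = tuple(order), cost
    return best_order, best_cost

print(greedy_order(CARDS, SELECTIVITY))
print(iterative_improvement(CARDS, SELECTIVITY))
```

Neither function examines every plan. The greedy builder evaluates only the extensions of its single partial plan, and the randomized search is bounded by `restarts` and `steps`, so its optimization cost is fixed in advance whether or not it reaches the optimum.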
Thanks