L4.2.2. Distributed Query Optimization Algorithms: System R and R*, Hill Climbing and SDD-1

1 Distributed Query Optimization Algorithms
- System R and R*
- Hill Climbing and SDD-1

2 System R (Centralized) Algorithm
- Simple (one-relation) queries are executed according to the best access path.
- Execute joins:
  - Determine the possible orderings of joins
  - Determine the cost of each ordering
  - Choose the join ordering with the minimal cost
- For joins, two join methods are considered:
  - Nested loops
  - Merge join
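The three join steps above (enumerate the orderings, cost each one, keep the cheapest) can be sketched as a brute-force search. This is only an illustration: the cardinalities, selectivities, and the toy cost function (sum of intermediate result sizes) are hypothetical and are not taken from the slides, and the search ignores the choice between nested loops and merge join.

```python
from fractions import Fraction
from itertools import permutations

# Hypothetical statistics, chosen only for illustration (not from the slides).
CARD = {"EMP": 400, "ASG": 1000, "PROJ": 10}
SEL = {
    frozenset(("EMP", "ASG")): Fraction(1, 400),
    frozenset(("ASG", "PROJ")): Fraction(1, 100),
}

def join_card(joined, card, right):
    """Estimated cardinality after joining `right` with the relations joined so far."""
    sel = Fraction(1)
    for rel in joined:
        # A missing predicate means selectivity 1, i.e. a Cartesian product.
        sel *= SEL.get(frozenset((rel, right)), Fraction(1))
    return card * CARD[right] * sel

def order_cost(order):
    """Toy cost of a left-deep ordering: total size of all intermediate results."""
    joined, card, cost = [order[0]], CARD[order[0]], Fraction(0)
    for rel in order[1:]:
        card = join_card(joined, card, rel)
        joined.append(rel)
        cost += card
    return cost

def best_order(relations):
    """Exhaustively cost every ordering and return the cheapest one."""
    return min(permutations(relations), key=order_cost)
```

With these assumed statistics, orderings that start with a Cartesian product (EMP with PROJ) come out far more expensive than those that join PROJ and ASG first.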

3 System R Algorithm -- Example
Query: names of employees working on the CAD/CAM project.
- Assume:
  - EMP has an index on ENO
  - ASG has an index on PNO
  - PROJ has an index on PNO and an index on PNAME

4 System R Algorithm -- Example
- Choose the best access path to each relation:
  - EMP: sequential scan (no selection on EMP)
  - ASG: sequential scan (no selection on ASG)
  - PROJ: index on PNAME (there is a selection on PROJ based on PNAME)
- Determine the best join ordering:
  - EMP ⋈ ASG ⋈ PROJ
  - ASG ⋈ PROJ ⋈ EMP
  - PROJ ⋈ ASG ⋈ EMP
  - ASG ⋈ EMP ⋈ PROJ
  - EMP × PROJ ⋈ ASG
  - PROJ × EMP ⋈ ASG
- Select the best ordering based on the join costs evaluated according to the two join methods.

5 System R Example (cont'd)
[Figure: join-order search tree rooted at EMP, ASG and PROJ. Second-level nodes: EMP ⋈ ASG, EMP × PROJ, ASG ⋈ EMP, ASG ⋈ PROJ, PROJ ⋈ ASG, PROJ × EMP. Third-level nodes: (ASG ⋈ EMP) ⋈ PROJ and (PROJ ⋈ ASG) ⋈ EMP.]
- The best total join order is one of:
  - (ASG ⋈ EMP) ⋈ PROJ
  - (PROJ ⋈ ASG) ⋈ EMP

6 System R Algorithm
- (PROJ ⋈ ASG) ⋈ EMP has a useful index on the select attribute and direct access to the join attributes of ASG and EMP.
- Final plan:
  - select PROJ using the index on PNAME
  - then join with ASG using the index on PNO
  - then join with EMP using the index on ENO

7 System R* Distributed Query Optimization
- Total-cost minimization: the cost function includes local processing as well as transmission.
- Algorithm:
  - For each relation in the query tree, find the best access path
  - For the join of n relations, find the optimal join order strategy
  - Each local site optimizes the local query processing

8 Data Transfer Strategies
- Ship-whole: the entire relation is shipped and stored as a temporary relation. If the merge join algorithm is used, no temporary storage is needed and the join can be done in pipeline mode.
- Fetch-as-needed: tuples of the inner relation are fetched only when an outer tuple needs them. This is equivalent to a semijoin of the inner relation with each outer relation tuple.

9 Join Strategy 1
- Notation: R is the outer relation and S is the inner relation; LC is the local processing cost, CC is the data transfer cost, and s is the average number of tuples of S that match one tuple of R.
- Strategy 1: ship the entire outer relation to the site of the inner relation.
  TC = LC(get R) + CC(size(R)) + LC(get s tuples from S) * card(R)

10 Join Strategy 2
- Strategy 2: ship the entire inner relation to the site of the outer relation.
  TC = LC(get S) + CC(size(S)) + LC(store S) + LC(get R) + LC(get s tuples from S) * card(R)

11 Join Strategy 3
- Strategy 3: fetch tuples of the inner relation as needed for each tuple of the outer relation (A is the join attribute).
  TC = LC(get R) + CC(len(A)) * card(R) + LC(get s tuples from S) * card(R) + CC(s * len(S)) * card(R)

12 Join Strategy 4
- Strategy 4: move both relations to a third site and join there.
  TC = LC(get R) + LC(get S) + CC(size(S)) + LC(store S) + CC(size(R)) + LC(get s tuples from S) * card(R)
- Conceptually, the algorithm does an exhaustive search among all alternatives and selects the one that minimizes total cost.
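The four TC formulas can be compared side by side with a small calculator. The sketch below assumes a linear local cost (one unit per tuple) and a transfer cost of a fixed per-message charge plus one unit per byte. These cost functions, the `Rel` class, and the parameter values are assumptions made for illustration; the slides leave LC and CC abstract.

```python
from dataclasses import dataclass

@dataclass
class Rel:
    card: int       # number of tuples
    tuple_len: int  # bytes per tuple

    @property
    def size(self):
        return self.card * self.tuple_len

def lc(tuples, per_tuple=1):
    """Hypothetical local processing cost: linear in the tuples touched."""
    return tuples * per_tuple

def cc(nbytes, per_msg=10, per_byte=1):
    """Hypothetical transfer cost: fixed message overhead plus a per-byte charge."""
    return per_msg + per_byte * nbytes

def strategy_costs(R, S, s, len_a):
    """Total cost of each strategy; R is the outer relation, S the inner one."""
    probe = lc(s) * R.card  # LC(get s tuples from S) * card(R)
    return {
        "ship_outer": lc(R.card) + cc(R.size) + probe,
        "ship_inner": lc(S.card) + cc(S.size) + lc(S.card) + lc(R.card) + probe,
        "fetch_as_needed": lc(R.card) + cc(len_a) * R.card + probe
                           + cc(s * S.tuple_len) * R.card,
        "third_site": lc(R.card) + lc(S.card) + cc(S.size) + lc(S.card)
                      + cc(R.size) + probe,
    }

def cheapest(R, S, s, len_a):
    costs = strategy_costs(R, S, s, len_a)
    return min(costs, key=costs.get)
```

For example, `cheapest(Rel(100, 20), Rel(1000, 50), s=2, len_a=4)` picks `"ship_outer"` under these assumed parameters, because the outer relation is small and the per-message overhead penalizes fetch-as-needed.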

13 Hill Climbing Algorithm
- Inputs: query graph, locations of relations, and relation statistics.
- Initial solution: for each candidate result site, cost the strategy that sends all relations to that site. The least costly strategy is denoted ES0, and its site becomes the chosen site.
- Split ES0 into:
  - ES1: ship one relation of a join to the site of the other relation
  - ES2: join these two relations locally and transmit the result to the chosen site
- If cost(ES1) + cost(ES2) + LC > cost(ES0), select ES0; otherwise select ES1 and ES2.
- The process can be applied recursively to ES1 and ES2 until no further benefit occurs.

14 Hill Climbing Algorithm - Example
Query: Π_SAL(PAY ⋈_TITLE (EMP ⋈_ENO (ASG ⋈_PNO σ_PNAME="CAD/CAM"(PROJ))))
- Ignore the local processing cost.
- The tuple length is 1 for all relations.
- Relation placement (size): Site 1 EMP(8), Site 2 PAY(4), Site 3 PROJ(1), Site 4 ASG(10)
- ES0: ship EMP (8), PAY (4) and PROJ (1) to Site 4; cost = 8 + 4 + 1 = 13
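The initial solution for this example can be checked with a few lines of code. The sketch below uses the relation sizes and sites from the slide and, as the slide states, ignores local processing; the cost of a candidate site is then just the total size of the relations that must be shipped to it. The function and variable names are illustrative only.

```python
# relation -> (site, size), as given in the example
PLACEMENT = {"EMP": (1, 8), "PAY": (2, 4), "PROJ": (3, 1), "ASG": (4, 10)}

def initial_solution(placement):
    """Return (chosen site, cost per candidate site) for the ship-everything strategies."""
    sites = sorted({site for site, _ in placement.values()})
    costs = {
        candidate: sum(size for site, size in placement.values() if site != candidate)
        for candidate in sites
    }
    chosen = min(costs, key=costs.get)
    return chosen, costs

chosen_site, site_costs = initial_solution(PLACEMENT)
```

Site 4 comes out cheapest because it already holds ASG, the largest relation, which matches the ES0 cost of 13 on the slide.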

15 HCA - Example
[Figure: two candidate splits of ES0, each decomposed into ES1, ES2 and ES3 over Site 1 EMP(8), Site 2 PAY(4), Site 3 PROJ(1) and Site 4 ASG(10), with a join on TITLE. ES0 has cost 13 (transfers of 8, 4 and 1). The costs of Solution 1 and Solution 2 are blank in the transcript.]
- ES0 is the "BEST".

16 Hill Climbing Algorithm - Comments
- Greedy algorithm: determines an initial feasible solution and iteratively tries to improve it.
- If there are local minima, it may not find the global minimum.
- If the optimal solution has a high initial cost, it will not be found, since it will not be chosen as the initial feasible solution.
[Figure: Site 1 EMP(8), Site 2 PAY(4), Site 3 PROJ(1), Site 4 ASG(10); the cost value is blank in the transcript.]

17 SDD-1 Algorithm
- The SDD-1 algorithm generalizes the hill-climbing algorithm to determine an ordering of beneficial semijoins. It uses statistics on the database, called database profiles.
- Cost of a semijoin:
  Cost(R SJ_A S) = C_MSG + C_TR * size(Π_A(S))
- Benefit is the cost of transferring the irrelevant tuples that the semijoin eliminates:
  Benefit(R SJ_A S) = (1 - SF_SJ(S.A)) * size(R) * C_TR
- A semijoin is beneficial if cost < benefit.
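The cost and benefit formulas translate directly into code. The sketch below is a minimal illustration; the default values C_MSG = 0 and C_TR = 1 are assumptions (the slides do not state them), and the function names are hypothetical.

```python
def semijoin_cost(size_proj_a_s, c_msg=0.0, c_tr=1.0):
    """Cost(R SJ_A S): one message plus shipping the projection of S on A."""
    return c_msg + c_tr * size_proj_a_s

def semijoin_benefit(sf_sj, size_r, c_tr=1.0):
    """Benefit(R SJ_A S): transfer cost saved on the tuples of R that are filtered out."""
    return (1.0 - sf_sj) * size_r * c_tr

def is_beneficial(sf_sj, size_r, size_proj_a_s, c_msg=0.0, c_tr=1.0):
    """A semijoin is beneficial when its cost is strictly below its benefit."""
    return semijoin_cost(size_proj_a_s, c_msg, c_tr) < semijoin_benefit(sf_sj, size_r, c_tr)
```

With a selectivity factor of 0.3, a relation of size 3000 and a projection of size 120, the benefit is about 2100 against a cost of 120, so the semijoin is beneficial. With a selectivity factor of 0.8, size 1500 and a projection of size 400, the benefit is only about 300, so it is not.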

18 SDD-1: The Algorithm
- Initialization: generate all beneficial semijoins and an execution strategy that includes only local processing.
- Select the most beneficial semijoin; update the statistics and determine the new set of beneficial semijoins.
- Repeat the previous step until no beneficial semijoins are left.
- Assembly site selection: choose the site at which to perform the local operations.
- Postoptimization: remove unnecessary semijoins.

19 SDD-1 - Example
SELECT *
FROM EMP, ASG, PROJ
WHERE EMP.ENO = ASG.ENO
AND ASG.PNO = PROJ.PNO
Join graph: Site 1 EMP --ENO-- Site 2 ASG --PNO-- Site 3 PROJ

20 SDD-1 - First Iteration
- SJ1: ASG SJ EMP; benefit = (1 - 0.3) * 3000 = 2100; cost = 120
- SJ2: ASG SJ PROJ; benefit = (1 - 0.4) * 3000 = 1800; cost = 200
- SJ3: EMP SJ ASG; benefit = (1 - 0.8) * 1500 = 300; cost = 400
- SJ4: PROJ SJ ASG; benefit = 0; cost = 400
- SJ1 is selected.
- The size of ASG is reduced to 3000 * 0.3 = 900; ASG' = ASG SJ EMP.
- The semijoin selectivity factors of ASG' are reduced; they are approximated by:
  - SF_SJ(ASG'.ENO) = 0.8 * 0.3 = 0.24
  - SF_SJ(ASG'.PNO) = 1.0 * 0.3 = 0.3
  - size(ASG'.ENO) = 400 * 0.3 = 120
  - size(ASG'.PNO) = 120

21 SDD-1 - Second & Third Iterations
Second iteration
- SJ2: ASG' SJ PROJ; benefit = (1 - 0.4) * 900 = 540; cost = 200
- SJ3: EMP SJ ASG'; benefit = (1 - 0.24) * 1500 = 1140; cost = 120
- SJ4: PROJ SJ ASG'; benefit = (1 - 0.3) * 2000 = 1400; cost = 120
-> SJ4 is selected: PROJ' = PROJ SJ ASG'
  - size(PROJ') = 2000 * 0.3 = 600
  - SF_SJ(PROJ'.PNO) = 0.4 * 0.3 = 0.12
  - size(PROJ'.PNO) = 200 * 0.3 = 60
Third iteration
- SJ2: ASG' SJ PROJ'; benefit = (1 - 0.12) * 900 = 792; cost = 60
- SJ3: EMP SJ ASG'; benefit = (1 - 0.24) * 1500 = 1140; cost = 120
-> SJ3 is selected; it reduces the size of EMP to 1500 * 0.24 = 360
-> Finally, SJ2 is selected; it reduces the size of ASG to 900 * 0.12 = 108
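The three iterations can be replayed with a short greedy loop. The sketch below rests on several assumptions that the slides do not state outright: C_MSG = 0 and C_TR = 1, attribute sizes inferred from the costs quoted in the example, each candidate semijoin applied at most once, and "most beneficial" read as the largest benefit minus cost. Exact fractions are used so the arithmetic matches the slide figures.

```python
import copy
from fractions import Fraction as F

SIZE = {"EMP": F(1500), "ASG": F(3000), "PROJ": F(2000)}
# (relation, attribute) -> [semijoin selectivity factor, size of the projected attribute]
ATTR = {
    ("EMP", "ENO"): [F(3, 10), F(120)],
    ("ASG", "ENO"): [F(8, 10), F(400)],
    ("ASG", "PNO"): [F(1), F(400)],
    ("PROJ", "PNO"): [F(4, 10), F(200)],
}
# name -> (reduced relation R, reducing relation S, join attribute)
CANDIDATES = {
    "SJ1": ("ASG", "EMP", "ENO"),
    "SJ2": ("ASG", "PROJ", "PNO"),
    "SJ3": ("EMP", "ASG", "ENO"),
    "SJ4": ("PROJ", "ASG", "PNO"),
}

def run_sdd1(size=SIZE, attr=ATTR, candidates=CANDIDATES):
    """Return (selection order, final relation sizes) without mutating the inputs."""
    size, attr = dict(size), copy.deepcopy(attr)
    remaining, order = dict(candidates), []

    def benefit(r, s, a):
        return (1 - attr[(s, a)][0]) * size[r]

    def cost(r, s, a):
        return attr[(s, a)][1]

    while True:
        good = [n for n, c in remaining.items() if cost(*c) < benefit(*c)]
        if not good:
            return order, size
        best = max(good, key=lambda n: benefit(*remaining[n]) - cost(*remaining[n]))
        r, s, a = remaining.pop(best)
        sf = attr[(s, a)][0]
        size[r] *= sf                      # the reduced relation shrinks
        for (rel, _), stats in attr.items():
            if rel == r:                   # so do its attribute statistics
                stats[0] *= sf
                stats[1] *= sf
        order.append(best)
```

Under these assumptions, `run_sdd1()` selects SJ1, SJ4, SJ3 and SJ2 in that order and ends with sizes 360 for EMP, 108 for ASG and 600 for PROJ, in line with the figures above.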

22 Local Optimization
- Each site optimizes the plan to be executed at that site.
- This is a centralized query optimization problem.

23 SDD-1 - Assembly Site Selection
- After reduction:
  - EMP is at site 1 with size 360
  - ASG is at site 2 with size 108
  - PROJ is at site 3 with size 600
-> Site 3 is chosen as the assembly site.
- SJ4 is removed in postoptimization.
- Final strategy:
  - (ASG SJ EMP) SJ PROJ -> site 3
  - (EMP SJ ASG) -> site 3
  - join at site 3
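The last two steps can be sketched as follows. The sketch assumes that the assembly site is the one holding the most data after reduction, and that postoptimization drops any semijoin whose reduced relation already resides at the assembly site. Both rules are consistent with the outcome on the slide (site 3 chosen, SJ4 removed) but are stated here as assumptions, and the function names are hypothetical.

```python
SITE_SIZES = {1: 360, 2: 108, 3: 600}             # site -> data held after reduction
LOCATED_AT = {"EMP": 1, "ASG": 2, "PROJ": 3}      # relation -> site
# (name, reduced relation, reducing relation), in selection order
SEMIJOINS = [
    ("SJ1", "ASG", "EMP"),
    ("SJ4", "PROJ", "ASG"),
    ("SJ3", "EMP", "ASG"),
    ("SJ2", "ASG", "PROJ"),
]

def choose_assembly_site(site_sizes):
    """Pick the site with the most data; everything else is shipped to it."""
    site = max(site_sizes, key=site_sizes.get)
    transfer = sum(size for s, size in site_sizes.items() if s != site)
    return site, transfer

def postoptimize(semijoins, located_at, assembly_site):
    """Drop semijoins that only reduce a relation stored at the assembly site."""
    return [sj for sj in semijoins if located_at[sj[1]] != assembly_site]

assembly_site, transfer_cost = choose_assembly_site(SITE_SIZES)
kept = [name for name, _, _ in postoptimize(SEMIJOINS, LOCATED_AT, assembly_site)]
```

Reducing PROJ saves nothing once PROJ never leaves site 3, which is why SJ4 is the semijoin that disappears.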

