
1 Query Processor
- A query processor is the DBMS module that processes, optimizes, and generates an execution strategy for a high-level query.
- For a DDBMS, the query processor also performs data localization for the query based on the fragmentation scheme, and generates an execution strategy that incorporates the communication operations involved in processing the query.

2 Query Optimizer
- A query expressed in SQL can have multiple equivalent relational algebra expressions.
- The distributed query optimizer must select the ordering of the relational algebra operations, the sites that process the data, and possibly the way data is transferred. This makes distributed query processing significantly more difficult.

3 Complexity of Relational Algebra Operations
- Relational algebra is used to express the output of the query. The complexity of the relational algebra operations plays a role in defining some of the principles of query optimization. All complexity measures are based on the cardinality n of the relation.

  Operation                                          Complexity
  Select, Project (without duplicate elimination)    O(n)
  Project (with duplicate elimination), Group        O(n log n)
  Join, Semi-join, Division, Set Operators           O(n log n)
  Cartesian Product                                  O(n^2)

4 Characteristics of Query Processors
- Languages
  - The input language can be relational calculus or relational algebra; the output language is relational algebra annotated with communication primitives. The query processor must efficiently map the input language to the output language.
- Types of Optimization
  - The output language specification represents the execution strategy. There can be many such strategies; the best one can be selected through exhaustive search or by applying heuristics (e.g., minimize the size of intermediate relations). For distributed databases, semijoins can be applied to reduce data transfer.

5 When to Optimize
- Static: done before executing the query (at compile time); the cost of optimization is amortized over multiple executions; mostly based on exhaustive search. Since the sizes of intermediate relations must be estimated, it can result in sub-optimal strategies.
- Dynamic: done at run time, every time the query is executed; can use the exact sizes of intermediate relations; expensive; based on heuristics.
- Hybrid: mixes the static and dynamic approaches; the approach is mainly static, but dynamic re-optimization may take place when a large difference between predicted and actual sizes is detected.

6 Characteristics of Query Processors
- Statistics
  - fragment cardinality and size
  - size and number of distinct values of each attribute
  - detailed histograms of attribute values for better selectivity estimation
- Decision Sites
  - one site or several sites participate in the selection of the strategy
- Exploitation of the Network Topology
  - wide area network: minimize communication cost
  - local area network: exploit parallel execution

7 Characteristics of Query Processors
- Exploitation of Replicated Fragments
  - replication enlarges the number of possible strategies
- Use of Semijoins
  - reduces the size of data transfer
  - increases the number of messages and the local processing
  - good for fast or slow networks?

8 Layers of Query Processing
  Calculus Query on Distributed Relations
    ↓ QUERY DECOMPOSITION   (uses the GLOBAL SCHEMA)            — control site
  Algebra Query on Distributed Relations
    ↓ DATA LOCALIZATION     (uses the FRAGMENT SCHEMA)          — control site
  Fragment Query
    ↓ GLOBAL OPTIMIZATION   (uses STATISTICS ON FRAGMENTS)      — control site
  Optimized Fragment Query with Communication Operations
    ↓ LOCAL OPTIMIZATION    (uses the LOCAL SCHEMA)             — local sites
  Optimized Local Queries

9 Query Decomposition
- Normalization: the calculus query is rewritten in a normalized form (CNF or DNF) for subsequent manipulation.
- Analysis: the query is analyzed for semantic correctness.
- Simplification: redundant predicates are eliminated to obtain a simplified query.
- Restructuring: the calculus query is translated into an optimized algebraic query representation.

10 Query Decomposition: Normalization
- Lexical and syntactic analysis
  - check validity
  - check for attributes and relations
  - type checking on the qualification
- There are two possible normal forms for the predicates in the query qualification: Conjunctive Normal Form (CNF) or Disjunctive Normal Form (DNF)
  - CNF: (p11 ∨ p12 ∨ ... ∨ p1n) ∧ ... ∧ (pm1 ∨ pm2 ∨ ... ∨ pmn)
  - DNF: (p11 ∧ p12 ∧ ... ∧ p1n) ∨ ... ∨ (pm1 ∧ pm2 ∧ ... ∧ pmn)
- ORs are mapped into unions
- ANDs are mapped into joins or selections
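As an illustrative aside (not part of the original slides), the normalization step can be reproduced with sympy's boolean-logic helpers; p1, p2, p3 are hypothetical placeholders for atomic predicates:

```python
# Sketch: normalizing a query qualification into CNF and DNF with sympy.
# Assumes sympy is installed; p1, p2, p3 stand for atomic predicates.
from sympy import symbols
from sympy.logic.boolalg import to_cnf, to_dnf

p1, p2, p3 = symbols("p1 p2 p3")

# Qualification: (p1 OR p2) AND p3
qual = (p1 | p2) & p3

print(to_cnf(qual))  # (p1 | p2) & p3        -- already in CNF
print(to_dnf(qual))  # (p1 & p3) | (p2 & p3) -- ORs pushed to the top
```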

11 Query Decomposition: Analysis
- Queries are rejected when
  - attributes or relations are not defined in the global schema, or
  - operations used in the qualification are semantically incorrect
- Only for queries that do not use disjunction or negation can semantic correctness be determined using a query graph
- In the query graph, one node represents the result relation and the other nodes represent operand relations; an edge between two operand nodes represents a join, and an edge between an operand node and the result node represents a projection

12 Query Graph and Join Graph
SELECT Ename, Resp
FROM   EMP E, ASG G, PROJ J
WHERE  E.ENO = G.ENO AND G.JNO = J.JNO
AND    JNAME = "CAD" AND DUR >= 36 AND Title = "Prog"

Query graph: nodes EMP, ASG, PROJ, and Result; join edges EMP–ASG (E.ENO = G.ENO) and ASG–PROJ (G.JNO = J.JNO); selection predicates Title = "Prog" on EMP, DUR >= 36 on ASG, JNAME = "CAD" on PROJ; projection edges Ename (EMP → Result) and Resp (ASG → Result).
Join graph: the subgraph containing only the operand nodes EMP, ASG, PROJ and the join edges E.ENO = G.ENO and G.JNO = J.JNO.

13 Disconnected Query Graph
- A conjunctive multivariable query without negation is semantically incorrect if its query graph is not connected
SELECT Ename, Resp
FROM   EMP E, ASG G, PROJ J
WHERE  E.ENO = G.ENO
AND    JNAME = "CAD" AND DUR >= 36 AND Title = "Prog"

Here the join predicate G.JNO = J.JNO is missing, so the PROJ node (with its selection JNAME = "CAD") is disconnected from the rest of the query graph.

14 Simplification: Eliminating Redundancy
- Redundant predicates are eliminated using well-known idempotency rules:
  p ∧ p = p            p ∨ p = p
  p ∧ true = p         p ∨ true = true
  p ∧ false = false    p ∨ false = p
  p1 ∧ (p1 ∨ p2) = p1  p1 ∨ (p1 ∧ p2) = p1
- Such redundant predicates arise when the user query is enriched with additional predicates to incorporate the view-relation correspondence and to ensure semantic integrity and security

15 Eliminating Redundancy -- An Example
SELECT TITLE
FROM   E
WHERE  (NOT (TITLE = "Programmer")
        AND (TITLE = "Programmer" OR TITLE = "Elec.Engr")
        AND NOT (TITLE = "Elec.Engr"))
       OR ENAME = "J.Doe";

simplifies to

SELECT TITLE
FROM   E
WHERE  ENAME = "J.Doe";

16 Eliminating Redundancy -- An Example
Let p1 = (TITLE = "Programmer"), p2 = (TITLE = "Elec.Engr"), p3 = (ENAME = "J.Doe").
The query qualification is (¬p1 ∧ (p1 ∨ p2) ∧ ¬p2) ∨ p3. Its disjunctive normal form is
  (¬p1 ∧ p1 ∧ ¬p2) ∨ (¬p1 ∧ p2 ∧ ¬p2) ∨ p3
= (false ∧ ¬p2) ∨ (¬p1 ∧ false) ∨ p3
= false ∨ false ∨ p3
= p3
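The same simplification can be checked mechanically; the following sketch (assuming sympy is available) encodes the three predicates as boolean symbols and reproduces the result above:

```python
# Sketch: eliminating the redundant predicates from slide 16 with sympy.
# p1 = (TITLE = "Programmer"), p2 = (TITLE = "Elec.Engr"), p3 = (ENAME = "J.Doe").
from sympy import symbols
from sympy.logic.boolalg import simplify_logic

p1, p2, p3 = symbols("p1 p2 p3")

qualification = (~p1 & (p1 | p2) & ~p2) | p3
print(simplify_logic(qualification))  # prints: p3
```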

17 Query Decomposition: Rewriting
- Rewriting the calculus query in relational algebra:
  - straightforward transformation from relational calculus to relational algebra, and
  - restructuring of the relational algebra expression to improve performance

18 Rewriting -- Transformation Rules (I)
- Commutativity of binary operations:
  R × S ⇔ S × R
  R ⋈ S ⇔ S ⋈ R
  R ∪ S ⇔ S ∪ R
- Associativity of binary operations:
  (R × S) × T ⇔ R × (S × T)
  (R ⋈ S) ⋈ T ⇔ R ⋈ (S ⋈ T)
- Idempotence of unary operations (grouping of projections and selections):
  π_A'(π_A''(R)) ⇔ π_A'(R)                     where A' ⊆ A'' ⊆ A
  σ_p1(A1)(σ_p2(A2)(R)) ⇔ σ_p1(A1) ∧ p2(A2)(R)

19 Rewriting -- Transformation Rules (II)
- Commuting selection with projection:
  π_A1,...,An(σ_p(Ap)(R)) ⇔ π_A1,...,An(σ_p(Ap)(π_A1,...,An,Ap(R)))
- Commuting selection with binary operations (attribute Ai belongs to R):
  σ_p(Ai)(R × S) ⇔ σ_p(Ai)(R) × S
  σ_p(Ai)(R ⋈ S) ⇔ σ_p(Ai)(R) ⋈ S
  σ_p(Ai)(R ∪ S) ⇔ σ_p(Ai)(R) ∪ σ_p(Ai)(S)
- Commuting projection with binary operations:
  π_C(R × S) ⇔ π_A(R) × π_B(S)   where C = A ∪ B
  π_C(R ⋈ S) ⇔ π_C(R) ⋈ π_C(S)
  π_C(R ∪ S) ⇔ π_C(R) ∪ π_C(S)

20 An SQL Query and Its Query Tree
SELECT Ename
FROM   PROJ J, ASG G, EMP E
WHERE  G.ENO = E.ENO AND G.JNO = J.JNO
AND    ENAME <> "J.Doe" AND JNAME = "CAD/CAM"
AND    (DUR = 12 OR DUR = 24)

Query tree (canonical form):
  π_ENAME(
    σ_(ENAME<>"J.Doe") ∧ (JNAME="CAD/CAM") ∧ (DUR=12 ∨ DUR=24)(
      (EMP ⋈_ENO ASG) ⋈_JNO PROJ
    )
  )

21 Query Decomposition: Rewriting
The restructured query tree, with selections pushed down to the relations and projections inserted:
  π_ENAME(
    π_JNO( σ_JNAME="CAD/CAM"(PROJ) )
    ⋈_JNO
    π_JNO,ENAME(
      π_JNO,ENO( σ_DUR=12 ∨ DUR=24(ASG) )
      ⋈_ENO
      π_ENO,ENAME( σ_ENAME<>"J.Doe"(EMP) )
    )
  )

22 Data Localization
Input: algebraic query on distributed relations
- Determine which fragments are involved
- Localization program
  - substitute for each global relation its materialization (reconstruction) program
  - optimize

23 Data Localization -- An Example
EMP is horizontally fragmented into
  EMP1 = σ_ENO ≤ "E3"(EMP)
  EMP2 = σ_"E3" < ENO ≤ "E6"(EMP)
  EMP3 = σ_ENO > "E6"(EMP)
ASG is horizontally fragmented into
  ASG1 = σ_ENO ≤ "E3"(ASG)
  ASG2 = σ_ENO > "E3"(ASG)
The localized query is obtained by replacing EMP with (EMP1 ∪ EMP2 ∪ EMP3) and ASG with (ASG1 ∪ ASG2) in the query tree of the previous slide:
  π_ENAME( σ_(ENAME<>"J.Doe") ∧ (JNAME="CAD/CAM") ∧ (DUR=12 ∨ DUR=24)( ((EMP1 ∪ EMP2 ∪ EMP3) ⋈_ENO (ASG1 ∪ ASG2)) ⋈_JNO PROJ ) )

24 Reduction with Selection
Given relation R horizontally fragmented as F_R = {R1, R2, ..., Rn} where Rj = σ_pj(R):
  σ_pi(Rj) = ∅  if  ∀x ∈ R: ¬(pi(x) ∧ pj(x))
Example: EMP is fragmented into
  EMP1 = σ_ENO ≤ "E3"(EMP)
  EMP2 = σ_"E3" < ENO ≤ "E6"(EMP)
  EMP3 = σ_ENO > "E6"(EMP)
Query: SELECT * FROM EMP WHERE ENO = "E5"
The localized query σ_ENO="E5"(EMP1 ∪ EMP2 ∪ EMP3) reduces to σ_ENO="E5"(EMP2), since the selection predicate contradicts the fragment predicates of EMP1 and EMP3.
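A minimal sketch of this reduction, using hypothetical range bounds for the fragment predicates to detect the contradiction (illustrative only; real localization reasons over general predicates):

```python
# Sketch: eliminating horizontal fragments whose defining predicate
# contradicts the query's selection predicate ENO = "E5".
# Fragments are modeled as (name, low, high) ranges over ENO; None = unbounded.
fragments = [
    ("EMP1", None, "E3"),   # ENO <= "E3"
    ("EMP2", "E3", "E6"),   # "E3" < ENO <= "E6"
    ("EMP3", "E6", None),   # ENO > "E6"
]

def may_contain(value, low, high):
    """True if a fragment defined by low < ENO <= high can hold ENO = value."""
    if low is not None and value <= low:
        return False
    if high is not None and value > high:
        return False
    return True

relevant = [name for name, lo, hi in fragments if may_contain("E5", lo, hi)]
print(relevant)  # ['EMP2'] -- only EMP2 needs to be accessed
```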

25 Reduction with Join
EMP is fragmented into
  EMP1 = σ_ENO ≤ "E3"(EMP)
  EMP2 = σ_"E3" < ENO ≤ "E6"(EMP)
  EMP3 = σ_ENO > "E6"(EMP)
ASG is fragmented into
  ASG1 = σ_ENO ≤ "E3"(ASG)
  ASG2 = σ_ENO > "E3"(ASG)
Query: SELECT * FROM EMP, ASG WHERE EMP.ENO = ASG.ENO
The localized query is (EMP1 ∪ EMP2 ∪ EMP3) ⋈_ENO (ASG1 ∪ ASG2).

26 Reduction with Join (I)
Distribute the join over the union:
  (R1 ∪ R2) ⋈ S ⇔ (R1 ⋈ S) ∪ (R2 ⋈ S)
Applied to the example, (EMP1 ∪ EMP2 ∪ EMP3) ⋈_ENO (ASG1 ∪ ASG2) becomes the union of six joins:
  (EMP1 ⋈_ENO ASG1) ∪ (EMP1 ⋈_ENO ASG2) ∪ (EMP2 ⋈_ENO ASG1) ∪ (EMP2 ⋈_ENO ASG2) ∪ (EMP3 ⋈_ENO ASG1) ∪ (EMP3 ⋈_ENO ASG2)

27 Reduction with Join (II)
Given Ri = σ_pi(R) and Rj = σ_pj(R):
  Ri ⋈ Rj = ∅  if  ∀x ∈ Ri, ∀y ∈ Rj: ¬(pi(x) ∧ pj(y))
Reduction with join:
  1. Distribute the join over the union
  2. Eliminate the useless joins
In the example, only three of the six joins can produce tuples:
  (EMP1 ⋈_ENO ASG1) ∪ (EMP2 ⋈_ENO ASG2) ∪ (EMP3 ⋈_ENO ASG2)

28 Reduction for Vertical Fragmentation (VF)
- Find useless intermediate relations
  - Relation R defined over attributes A = {A1, A2, ..., An} is vertically fragmented as Ri = π_A'(R) where A' ⊆ A
  - π_K,D(Ri) is useless if the set of projection attributes D is not in A'
Example: EMP1 = π_ENO,ENAME(EMP), EMP2 = π_ENO,TITLE(EMP)
Query: SELECT ENAME FROM EMP
The localized query π_ENAME(EMP1 ⋈_ENO EMP2) reduces to π_ENAME(EMP1), since EMP2 contributes no attribute needed by the query.

29 Reduction for Derived Horizontal Fragmentation (DHF)
- Distribute joins over unions
- Apply the join reduction for horizontal fragmentation
Example:
  EMP1 = σ_TITLE="Programmer"(EMP)
  EMP2 = σ_TITLE≠"Programmer"(EMP)
  ASG1 = ASG ⋉_ENO EMP1
  ASG2 = ASG ⋉_ENO EMP2
Query: SELECT * FROM EMP, ASG WHERE ASG.ENO = EMP.ENO AND EMP.TITLE = "Mech. Eng."
The localized query is σ_TITLE="Mech. Eng."((EMP1 ∪ EMP2) ⋈_ENO (ASG1 ∪ ASG2)).

30 Reduction for DHF (II)
- Selection first: σ_TITLE="Mech. Eng."(EMP1 ∪ EMP2) reduces to σ_TITLE="Mech. Eng."(EMP2), since the predicate contradicts the fragment predicate of EMP1
- Joins over union: distributing the join gives
  (σ_TITLE="Mech. Eng."(EMP2) ⋈_ENO ASG1) ∪ (σ_TITLE="Mech. Eng."(EMP2) ⋈_ENO ASG2)
- The join with ASG1 is useless (ASG1 is derived from EMP1), so the reduced query is
  σ_TITLE="Mech. Eng."(EMP2) ⋈_ENO ASG2

31 Reduction for Hybrid Fragmentation (HF)
- Remove empty relations generated by contradicting selections on horizontal fragments
- Remove useless relations generated by projections on vertical fragments
- Distribute joins over unions in order to isolate and remove useless joins

32 Reduction for Hybrid Fragmentation -- An Example
EMP is hybrid-fragmented into
  EMP1 = σ_ENO ≤ "E4"(π_ENO,ENAME(EMP))
  EMP2 = σ_ENO > "E4"(π_ENO,ENAME(EMP))
  EMP3 = π_ENO,TITLE(EMP)
Query: SELECT ENAME FROM EMP WHERE ENO = "E5"
The localized query π_ENAME(σ_ENO="E5"((EMP1 ∪ EMP2) ⋈_ENO EMP3)) reduces to π_ENAME(σ_ENO="E5"(EMP2)): EMP1 is eliminated by the contradicting selection and EMP3 by the useless projection.

33 Why Optimization -- An Example
Database: EMP(ENO, ENAME, TITLE), ASG(ENO, JNO, RESP, DUR)
Query: find the names of the employees who are managing a project.
SQL query:
  SELECT ENAME
  FROM   EMP e, ASG g
  WHERE  e.ENO = g.ENO AND RESP = "manager"
Relational algebra tree:
  π_ENAME( σ_RESP="manager"(EMP ⋈_EMP.ENO=ASG.ENO ASG) )

34 Example -- Strategies
Fragment schema:
  EMP1 = σ_ENO ≤ 100(EMP)  at site 1
  EMP2 = σ_ENO > 100(EMP)  at site 2
  ASG1 = σ_ENO ≤ 100(ASG)  at site 3
  ASG2 = σ_ENO > 100(ASG)  at site 4
Query site: site 5
Plan A: compute ASG1' = σ_RESP="manager"(ASG1) and ASG2' = σ_RESP="manager"(ASG2) at sites 3 and 4; ship ASG1' and ASG2' to sites 1 and 2; compute EMP1' = EMP1 ⋈_ENO ASG1' and EMP2' = EMP2 ⋈_ENO ASG2' locally; ship the results to site 5 and take their union.
Plan B: ship EMP1, EMP2, ASG1, and ASG2 to site 5 and execute the whole query there.

35 Example -- DB Statistics & Costs
Database statistics:
- EMP has 400 tuples
- ASG has 1000 tuples
- there are 20 managers in ASG
- the data is uniformly distributed among the sites
- ASG and EMP are locally clustered on attributes RESP and ENO, respectively
Costs:
- tuple access: t_acc = 1 unit
- tuple transfer: t_trans = 10 units

36 Costs for the Example Plans
The cost of Plan A:
  Produce ASG'  = 20 * t_acc            = 20      (select managers locally)
  Transfer ASG' = 20 * t_trans          = 200     (transfer to the EMP sites)
  Produce EMP'  = (10 + 10) * t_acc * 2 = 40      (join at the EMP sites)
  Transfer EMP' = 20 * t_trans          = 200     (send to site 5)
  Total cost = 460
The cost of Plan B:
  Transfer EMP  = 400 * t_trans         = 4,000   (send EMP to site 5)
  Transfer ASG  = 1000 * t_trans        = 10,000  (send ASG to site 5)
  Produce ASG'  = 1000 * t_acc          = 1,000   (selection at site 5)
  Join EMP and ASG' = 400 * 20 * t_acc  = 8,000   (join at site 5)
  Total cost = 23,000
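A small sketch recomputing the two plan costs from the slide's statistics (the formulas mirror the lines above; nothing beyond the stated assumptions is used):

```python
# Sketch: recomputing the costs of Plan A and Plan B from the example statistics.
t_acc, t_trans = 1, 10        # tuple access / tuple transfer costs
card_emp, card_asg = 400, 1000
managers = 20                 # tuples of ASG with RESP = "manager"

# Plan A: select managers at the ASG sites, ship them to the EMP sites,
# join there, ship the result to site 5.
plan_a = (managers * t_acc          # produce ASG'
          + managers * t_trans      # transfer ASG' to the EMP sites
          + (10 + 10) * t_acc * 2   # join at the two EMP sites (10 matches each)
          + managers * t_trans)     # send EMP' to site 5
print("Plan A:", plan_a)            # 460

# Plan B: ship EMP and ASG to site 5, then select and join there.
plan_b = (card_emp * t_trans        # transfer EMP
          + card_asg * t_trans      # transfer ASG
          + card_asg * t_acc        # selection at site 5
          + card_emp * managers * t_acc)  # join at site 5
print("Plan B:", plan_b)            # 23000
```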

37 Query Optimization
- Problems in query optimization:
  1. Determining the physical copies of the fragments on which to execute the fragment query expressions (also known as materialization)
  2. Selecting the order of execution of the operations
  3. Selecting the method for executing each operation
- These problems are not independent; for instance, the choice of the best materialization for a query depends on the order in which the operations are executed. Nevertheless, they are treated as independent. Further:
  - we bypass (1) by taking the materialization for granted
  - we bypass (3) by clustering all operations for a site together and leaving method selection to the local database system

38 Query Optimization -- Objectives
- The selection among alternative query execution strategies is made based on predetermined objectives
- Two main objectives:
  - minimize the total processing time (total cost)
    - the network and the computers at the nodes are not overloaded
    - but the response time cannot be guaranteed
  - minimize the response time
    - the allocation must facilitate parallel execution of the query
    - but throughput may decrease, and the total cost can be higher than under total-cost minimization
- Total processing time (cost) is the sum of all time (cost) incurred in executing the query (CPU, I/O, data transfer)
- Response time is the elapsed time from the initiation to the completion of the query

39 Optimization Algorithms -- The Issues
- Cost model
  - cost components
  - weights for each component
  - costs of primitive operations
- Search space
  - the set of equivalent algebra expressions (query trees)
- Search strategy
  - how we move inside the search space
  - exhaustive search, heuristics, ...

40 Cost Models
- Cost measures: I/O and CPU for a centralized DBMS; I/O, CPU, and data transfer for a DDBMS
- Total cost = CPU cost + I/O cost + communication cost
  - CPU cost           = C_cpu * #insts
  - I/O cost           = C_i/o * #i/os
  - communication cost = C_msg * #msgs + C_tr * #bytes
  - C_cpu, C_i/o, C_msg, and C_tr are all assumed to be constants
- Response time = sum over the sequential operations:
  - C_cpu * seq_#insts + C_i/o * seq_#i/os + C_msg * seq_#msgs + C_tr * seq_#bytes
  - seq_x stands for the maximum number of x's that must be executed sequentially to process the query
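The distinction between total cost and response time can be made concrete with a small sketch; the coefficient values are arbitrary placeholders, not from the slides:

```python
# Sketch: total cost vs. response time under the cost model above.
# Coefficients are illustrative placeholders.
C_cpu, C_io, C_msg, C_tr = 1e-6, 1e-3, 0.1, 1e-5

def total_cost(insts, ios, msgs, bytes_):
    """Sum of all CPU, I/O, and communication cost incurred anywhere."""
    return C_cpu * insts + C_io * ios + C_msg * msgs + C_tr * bytes_

def response_time(seq_insts, seq_ios, seq_msgs, seq_bytes):
    """Same formula, but applied only to the operations that must run
    sequentially; work done in parallel at different sites is not added up."""
    return C_cpu * seq_insts + C_io * seq_ios + C_msg * seq_msgs + C_tr * seq_bytes

# Two sites each transfer 1 MB in parallel: total cost counts both transfers,
# response time counts only the longest sequential path (one transfer).
print(total_cost(0, 0, 2, 2_000_000))
print(response_time(0, 0, 1, 1_000_000))
```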

41 Intermediate Result Size
- The size of the intermediate relations produced during execution facilitates the selection of the execution strategy
- This is useful in selecting an execution strategy that reduces data transfer
- The sizes of intermediate relations are estimated from the cardinalities of the relations and the lengths of the attributes
- For R{A1, A2, ..., An} fragmented as R1, R2, ..., Rn, the statistics typically collected are:
  - len(Ai): length of attribute Ai in bytes
  - min(Ai) and max(Ai) for ordered domains
  - card(dom(Ai)): number of unique values in dom(Ai)
  - card(Rj): number of tuples in each fragment Rj

42 Intermediate Size Estimation
- Join selectivity factor:      SF_J(R, S) = card(R ⋈ S) / (card(R) * card(S))
- Selection selectivity factor: SF_S(F) = card(σ_F(R)) / card(R)
- size(R) = card(R) * len(R)
- Cardinality of intermediate relations (selection):
  - SF_S(A = value) = 1 / card(dom(A))
  - SF_S(A > value) = (max(A) - value) / (max(A) - min(A))
  - SF_S(A < value) = (value - min(A)) / (max(A) - min(A))
  - SF_S(p(Ai) ∧ p(Aj)) = SF_S(p(Ai)) * SF_S(p(Aj))
  - SF_S(p(Ai) ∨ p(Aj)) = SF_S(p(Ai)) + SF_S(p(Aj)) - SF_S(p(Ai)) * SF_S(p(Aj))
  - SF_S(A ∈ {values}) = SF_S(A = value) * card({values})
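A sketch of the selection selectivity formulas as functions (attribute statistics are passed in as plain numbers; the values used in the example call are made up):

```python
# Sketch: selection selectivity factors from per-attribute statistics.
def sf_eq(card_dom):                 # SF(A = value)
    return 1.0 / card_dom

def sf_gt(value, min_a, max_a):      # SF(A > value)
    return (max_a - value) / (max_a - min_a)

def sf_lt(value, min_a, max_a):      # SF(A < value)
    return (value - min_a) / (max_a - min_a)

def sf_and(sf1, sf2):                # SF(p1 AND p2), independence assumed
    return sf1 * sf2

def sf_or(sf1, sf2):                 # SF(p1 OR p2)
    return sf1 + sf2 - sf1 * sf2

def sf_in(card_dom, n_values):       # SF(A IN {values})
    return sf_eq(card_dom) * n_values

# Example: DUR > 18 with DUR ranging from 6 to 48, combined with RESP = "manager"
# out of 4 distinct RESP values (illustrative numbers).
print(sf_and(sf_gt(18, 6, 48), sf_eq(4)))
```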

43 Intermediate Size Estimation (II)
- Projection:        card(π_A(R)) = card(R)   (holds when A contains a key of R or duplicates are not eliminated)
- Cartesian product: card(R × S) = card(R) * card(S)
- Join:
  - card(R ⋈_A=B S) = card(S), if A is a key of R and B is a foreign key of S
  - otherwise card(R ⋈_A=B S) = SF_J(R, S) * card(R) * card(S)
- Union:
  - upper bound: card(R) + card(S)
  - lower bound: max{card(R), card(S)}
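A companion sketch for the cardinalities of the binary operations (the `key_fk` flag marks the key/foreign-key case; the relation sizes in the example call are invented):

```python
# Sketch: cardinality estimates for binary relational operators.
def card_product(card_r, card_s):
    return card_r * card_s

def card_join(card_r, card_s, sf_join=None, key_fk=False):
    """Key/foreign-key join yields card(S); otherwise use the join selectivity."""
    if key_fk:
        return card_s
    return sf_join * card_r * card_s

def card_union_bounds(card_r, card_s):
    return max(card_r, card_s), card_r + card_s   # (lower bound, upper bound)

# EMP(400) joined with ASG(1000) on the key/foreign key ENO -> about 1000 tuples.
print(card_product(400, 1000))            # 400000
print(card_join(400, 1000, key_fk=True))  # 1000
print(card_union_bounds(400, 1000))       # (1000, 1400)
```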

44 Cost of Processing Primitive Operations
- Selection
- Projection
- Union
- Join
  - nested loops
  - sort-merge
  - hash-based
- For distributed execution, the semijoin is proposed as a way to perform joins

45 Semijoin
To compute R ⋈_A S when R and S are stored at different sites:
  R' = π_A(R)         (project the join attribute at the site of R)
  ship R' to the site of S
  S' = R' ⋈_A S       (semijoin, computed at the site of S)
  ship S' to the site of R
  R ⋈_A S' = R ⋈_A S  (final join at the site of R)
1. The join is replaced by a projection, followed by a semijoin, and then a join
2. The projection and the final join are done at one site, and the semijoin at the other site
3. Amount of data transferred: |R'| + |S'|
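A sketch comparing the data shipped by the semijoin program with shipping a whole relation; sizes are in bytes and all of the numbers below are illustrative assumptions:

```python
# Sketch: data transfer of a semijoin-based join vs. shipping a relation whole.
# Widths are per-tuple sizes in bytes; all figures are illustrative.
card_R, width_R = 10_000, 100      # R stored at site 1
card_S, width_S = 50_000, 120      # S stored at site 2
width_A = 8                        # join attribute width
matching_S = 4_000                 # tuples of S that join with some tuple of R

# Semijoin program: ship R' = project(R, A), then ship S' = semijoin(S, R').
semijoin_transfer = card_R * width_A + matching_S * width_S

# Direct join: ship the smaller relation (here R) to the site of S.
ship_whole_transfer = card_R * width_R

print("semijoin  :", semijoin_transfer)    # 80_000 + 480_000 = 560_000 bytes
print("ship whole:", ship_whole_transfer)  # 1_000_000 bytes
```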

46 Semijoin versus Join
- Using a semijoin increases the local processing cost, because a relation must be accessed twice (once for the projection, once for the join)
- When joining the intermediate relations produced by the semijoin, one cannot exploit the indices on the base relations
- Semijoins may not be beneficial when communication costs are low

47 Search Space
- The search space is characterized by the alternative execution plans
- Most optimizers focus on join trees
- For N relations, there are O(N!) equivalent join trees
Example:
  SELECT ENAME, RESP
  FROM   EMP, ASG, PROJ
  WHERE  EMP.ENO = ASG.ENO AND ASG.PNO = PROJ.PNO
Three equivalent join trees:
  (EMP ⋈_ENO ASG) ⋈_PNO PROJ
  (PROJ ⋈_PNO ASG) ⋈_ENO EMP
  (EMP × PROJ) ⋈_ENO,PNO ASG

48 Restricting the Search Space
- O(N!) is large; considering join methods as well, the search space is even bigger
- Restrict by means of heuristics
  - ignore Cartesian products
  - ...
- Restrict the shape of the join tree
  - only consider (left-)deep trees rather than bushy trees
  - ...
A left-deep tree joins one new relation at each step, e.g., ((R1 ⋈ R2) ⋈ R3) ⋈ R4; a bushy tree may join intermediate results, e.g., (R1 ⋈ R2) ⋈ (R3 ⋈ R4).
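A sketch of left-deep enumeration with the "no Cartesian product" heuristic; the join graph is given as attribute-labelled edges taken from the slide-47 example:

```python
# Sketch: enumerate left-deep join orders, skipping orders that would need a
# Cartesian product (i.e., adding a relation not connected to the prefix).
from itertools import permutations

relations = ["EMP", "ASG", "PROJ"]
join_edges = {("EMP", "ASG"), ("ASG", "PROJ")}   # EMP.ENO=ASG.ENO, ASG.PNO=PROJ.PNO

def connected(r, joined):
    return any((r, s) in join_edges or (s, r) in join_edges for s in joined)

def left_deep_orders():
    for order in permutations(relations):
        joined = [order[0]]
        ok = True
        for r in order[1:]:
            if not connected(r, joined):   # would require a Cartesian product
                ok = False
                break
            joined.append(r)
        if ok:
            yield order

for order in left_deep_orders():
    print(" -> ".join(order))
# Prints 4 of the 6 permutations; EMP->PROJ->... and PROJ->EMP->... are pruned.
```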

49 Search Strategy
- How to move through the search space to find the best plan
- Deterministic
  - start from the base relations and build plans by adding one relation at each step
  - dynamic programming: breadth-first
  - greedy: depth-first
- Randomized
  - search for an optimal solution around a particular starting point
    - simulated annealing
    - iterative improvement

50 Search Strategies -- Example
- Deterministic: plans are built bottom-up from the base relations, e.g., R1 ⋈ R2, then (R1 ⋈ R2) ⋈ R3, then ((R1 ⋈ R2) ⋈ R3) ⋈ R4
- Randomized: a complete join tree over R1, R2, R3, R4 is transformed into a neighboring tree (e.g., by exchanging two relations) at each move in the search space

51 Distributed Query Optimization Algorithms
- System R and R*
- Hill Climbing and SDD-1

52 System R (Centralized) Algorithm
- Simple (single-relation) queries are executed according to the best access path
- Execute joins
  - determine the possible join orderings
  - determine the cost of each ordering
  - choose the join ordering with minimal cost
- For joins, two join methods are considered:
  - nested loops
  - merge join

53 System R Algorithm -- Example
Query: names of employees working on the CAD/CAM project
- Assume
  - EMP has an index on ENO
  - ASG has an index on PNO
  - PROJ has an index on PNO and an index on PNAME

54 System R Algorithm -- Example
- Choose the best access path for each relation
  - EMP: sequential scan (no selection on EMP)
  - ASG: sequential scan (no selection on ASG)
  - PROJ: index on PNAME (there is a selection on PROJ based on PNAME)
- Determine the best join ordering
  - EMP ⋈ ASG ⋈ PROJ
  - ASG ⋈ PROJ ⋈ EMP
  - PROJ ⋈ ASG ⋈ EMP
  - ASG ⋈ EMP ⋈ PROJ
  - EMP × PROJ ⋈ ASG
  - PROJ × EMP ⋈ ASG
- Select the best ordering based on the join costs evaluated according to the two join methods

55 System R Example (cont'd)
- Orderings that require a Cartesian product (EMP × PROJ, PROJ × EMP) are pruned
- The best total join order is one of
  - (ASG ⋈ EMP) ⋈ PROJ
  - (PROJ ⋈ ASG) ⋈ EMP

56 System R Algorithm
- (PROJ ⋈ ASG) ⋈ EMP has a useful index on the selection attribute and direct access to the join attributes of ASG and EMP
- Final plan:
  - select PROJ using the index on PNAME
  - then join with ASG using the index on PNO
  - then join with EMP using the index on ENO

57 System R* Distributed Query Optimization
- Total-cost minimization: the cost function includes local processing as well as transmission
- Algorithm
  - for each relation in the query tree, find the best access path
  - for the join of n relations, find the optimal join order
  - each local site optimizes its local query processing

58 Data Transfer Strategies
- Ship-whole: the entire relation is shipped to the join site and stored in a temporary relation; if the merge join algorithm is used, it can be processed in pipelined mode
- Fetch-as-needed: matching tuples of the inner relation are fetched for each tuple of the outer relation; this method is equivalent to a semijoin of the inner relation with each outer-relation tuple

59 Join Strategy 1
- Join of an outer (external) relation R with an inner (internal) relation S; let LC be the local processing cost, CC the data transfer cost, and s the average number of tuples of S that match one tuple of R
- Strategy 1: ship the entire outer relation to the site of the inner relation
  TC = LC(retrieve R) + CC(size(R)) + card(R) * LC(retrieve s tuples from S)

60 Join Strategy 2
- Strategy 2: ship the entire inner relation to the site of the outer relation
  TC = LC(retrieve S) + CC(size(S)) + LC(store S) + LC(retrieve R) + card(R) * LC(retrieve s tuples from S)

61 Join Strategy 3
- Strategy 3: fetch matching tuples of the inner relation for each tuple of the outer relation
  TC = LC(retrieve R) + card(R) * CC(len(A)) + card(R) * LC(retrieve s tuples from S) + card(R) * CC(s * len(S))

62 Join Strategy 4
- Strategy 4: move both relations to a third site and compute the join there
  TC = LC(retrieve S) + CC(size(S)) + LC(store S) + LC(retrieve R) + CC(size(R)) + card(R) * LC(retrieve s tuples from S)
- Conceptually, the algorithm performs an exhaustive search over all alternatives and selects the one that minimizes the total cost
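A sketch putting the four transfer-cost formulas side by side; LC and CC are modeled as simple linear functions of tuples and bytes, and every constant below is an assumed placeholder, not a value from the slides:

```python
# Sketch: total cost (TC) of the four distributed join strategies.
card_R, size_R = 1_000, 100_000      # outer relation R (tuples, bytes)
card_S, size_S = 5_000, 600_000      # inner relation S
s = 3                                # avg. matching S-tuples per R-tuple
len_A = 8                            # size of one join attribute value
len_S = 120                          # size of one S tuple

LC = lambda n_tuples: 1.0 * n_tuples        # local cost per tuple accessed
CC = lambda n_bytes: 0.1 * n_bytes          # transfer cost per byte

tc1 = LC(card_R) + CC(size_R) + card_R * LC(s)                        # ship R to S's site
tc2 = LC(card_S) + CC(size_S) + LC(card_S) + LC(card_R) + card_R * LC(s)   # ship S to R's site
tc3 = LC(card_R) + card_R * CC(len_A) + card_R * LC(s) + card_R * CC(s * len_S)  # fetch as needed
tc4 = (LC(card_S) + CC(size_S) + LC(card_S)                           # ship both to a third site
       + LC(card_R) + CC(size_R) + card_R * LC(s))

for name, tc in [("strategy 1", tc1), ("strategy 2", tc2),
                 ("strategy 3", tc3), ("strategy 4", tc4)]:
    print(name, tc)
```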

63 Hill Climbing Algorithm
- Inputs: query graph, locations of relations, and relation statistics
- Initial solution: the least costly strategy among those that send all relations to a candidate result site; denote it ES0 and call that site the chosen site
- Split ES0 into
  - ES1: ship one relation of a join to the site of the other relation
  - ES2: join the two relations locally and transmit the result to the chosen site
- If cost(ES1) + cost(ES2) + LC > cost(ES0), select ES0; otherwise select ES1 and ES2 (LC is the local join cost)
- The process is applied recursively to ES1 and ES2 until no further benefit is obtained

64 Hill Climbing Algorithm -- Example
Query: π_SAL( PAY ⋈_TITLE ( EMP ⋈_ENO ( ASG ⋈_PNO σ_PNAME="CAD/CAM"(PROJ) ) ) )
Assumptions: ignore the local processing cost; the length of a tuple is 1 for all relations
Relation sizes and locations: EMP(8) at site 1, PAY(4) at site 2, PROJ(1) at site 3, ASG(10) at site 4
ES0: ship EMP, PAY, and PROJ to site 4 (the chosen result site); cost = 8 + 4 + 1 = 13

65 Hill Climbing Algorithm -- Example (cont'd)
ES0 is then split around the join EMP ⋈_TITLE PAY:
- Solution 1: ship EMP (8) to site 2, join with PAY there, and ship the result to site 4
- Solution 2: ship PAY (4) to site 1, join with EMP there, and ship the result to site 4
Comparing the costs of the two solutions against ES0, ES0 remains the "best" strategy in this example.

66 Hill Climbing Algorithm -- Comments
- Greedy algorithm: determines an initial feasible solution and iteratively tries to improve it
- If there are local minima, it may not find the global minimum
- If the optimal solution has a high initial cost, it will not be found, since it will not be chosen as the initial feasible solution

67 SDD-1 Algorithm
- The SDD-1 algorithm generalizes hill climbing to determine an ordering of beneficial semijoins; it uses statistics on the database, called database profiles
- Cost of a semijoin (shipping the join attribute values of S to the site of R):
  Cost(R ⋉_A S) = C_MSG + C_TR * size(π_A(S))
- The benefit is the cost of transferring the irrelevant tuples of R that the semijoin removes:
  Benefit(R ⋉_A S) = (1 - SF_SJ(S.A)) * size(R) * C_TR
- A semijoin is beneficial if cost < benefit
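A sketch of the cost/benefit test for a single semijoin; C_MSG is set to 0 so that the numbers line up with the first-iteration example on slide 70, and that, like the other constants, is an assumption:

```python
# Sketch: the SDD-1 cost/benefit test for a single semijoin R <semijoin> S on A.
C_MSG = 0    # per-message cost, assumed negligible here
C_TR = 1     # cost of transferring one byte, assumed

def semijoin_cost(size_proj_A_of_S):
    return C_MSG + C_TR * size_proj_A_of_S

def semijoin_benefit(sf_sj, size_R):
    # Cost saved by no longer shipping the irrelevant tuples of R.
    return (1 - sf_sj) * size_R * C_TR

def beneficial(sf_sj, size_R, size_proj_A_of_S):
    return semijoin_cost(size_proj_A_of_S) < semijoin_benefit(sf_sj, size_R)

# SJ1 from the example: ASG semijoined by EMP, SF = 0.3, size(ASG) = 3000,
# size(pi_ENO(EMP)) assumed to be 120.
print(semijoin_benefit(0.3, 3000), semijoin_cost(120))  # 2100.0 120
print(beneficial(0.3, 3000, 120))                       # True
```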

68 SDD-1: The Algorithm
- The initialization phase generates all beneficial semijoins and an execution strategy that includes only local processing
- The most beneficial semijoin is selected; the statistics are modified and new beneficial semijoins are identified
- The previous step is repeated until no beneficial semijoins are left
- An assembly site is selected, at which the final local operations are performed
- Post-optimization removes unnecessary semijoins

69 SDD-1 -- Example
SELECT *
FROM   EMP, ASG, PROJ
WHERE  EMP.ENO = ASG.ENO AND ASG.PNO = PROJ.PNO
EMP is at site 1, ASG at site 2, PROJ at site 3; the join graph links EMP–ASG on ENO and ASG–PROJ on PNO.

70 SDD-1 -- First Iteration
- SJ1: ASG ⋉ EMP    benefit = (1 - 0.3) * 3000 = 2100;   cost = 120
- SJ2: ASG ⋉ PROJ   benefit = (1 - 0.4) * 3000 = 1800;   cost = 200
- SJ3: EMP ⋉ ASG    benefit = (1 - 0.8) * 1500 = 300;    cost = 400
- SJ4: PROJ ⋉ ASG   benefit = 0;                          cost = 400
- SJ1 is selected: ASG' = ASG ⋉ EMP, and the size of ASG is reduced to 3000 * 0.3 = 900
- The semijoin selectivity factor of ASG on ENO is also reduced; it is approximated by SF_SJ(ASG.ENO) = 0.8 * 0.3 = 0.24

71 SDD-1 -- Second & Third Iterations
Second iteration
- SJ2: ASG' ⋉ PROJ   benefit = (1 - 0.4) * 900 = 540;     cost = 200
- SJ3: EMP ⋉ ASG'    benefit = (1 - 0.24) * 1500 = 1140;  cost = 400
- SJ3 is selected: EMP' = EMP ⋉ ASG', size(EMP') = 1500 * 0.24 = 360
Third iteration
- SJ2: ASG' ⋉ PROJ   benefit = (1 - 0.4) * 900 = 540;     cost = 200
- SJ2 is selected; it reduces the size of ASG' further to 900 * 0.4 = 360

72 Local Optimization
- Each site optimizes the plan to be executed at that site
- This is a centralized query optimization problem

73 SDD-1 -- Assembly Site Selection
- After reduction:
  - EMP is at site 1 with size 360
  - ASG is at site 2 with size 360
  - PROJ is at site 3 with size 2000
- Site 3 is chosen as the assembly site (it holds the largest amount of data)
- No semijoins are removed in the post-optimization phase
- Final strategy: send (ASG ⋉ EMP) ⋉ PROJ and (EMP ⋉ ASG) to site 3, and perform the final joins at site 3

