Distributed Database Management Systems Lecture 35
In the previous lecture Query Optimization Centralized QO Best access path Join Processing QO in Distributed Environment.
In this lecture Query Optimization Fragmented Queries Joins replaced by Semijoins Three major QO algorithms.
Semijoin based Algorithms
Reduces cost of join queries Semijoin is ……. Join of two relations can be replaced SJ of one or both relations.
Which one? Need to estimate costs. So R ⋈A S can be replaced: (R ⋉A S) ⋈A S R ⋈A (S ⋉A R) (R ⋉A S) ⋈A (S ⋉A R) Which one? Need to estimate costs.
Same Assumptions: R at site 1, S at site 2 Size (R) < Size (S), so A (S) site 1 Site1 computes R’ = R ⋉A S’ R’ site 2 Site2 computes R’ ⋈A S
Ignoring Tmsg semijoin is better if Size(A(S)) + size(R ⋉A S) < size(R) Join is better if …..- Semijoin is better if…..-.
SJ with more than two tables Will be more complex Semijoin approach can be applied to each individual join, consider EMP ⋈ ASG ⋈ PROJ
EMP ⋈ ASG ⋈ PROJ = EMP’ ⋈ ASG’ ⋈ PROJ where EMP’ = EMP ⋉ ASG and ASG’ = ASG ⋉ PROJ rather EMP” = EMP ⋉ (ASG ⋉ PROJ)
Many SJ expressions possible for a relation “Full reducer” a SJ expression that reduces R the maximum Not exists for cyclic queries.
Select eName From EMP, ASG, PROJ Where EMP.eNo = ASG.eNo and ASG.eNo = PROJ.eNo and EMP.city = PROJ.city
Cyclic Query Tree Query city eNo pNo eNo, city pNo, city ASG EMP PROJ
Full Reducer may be hard to find. Easy for a chained query Most systems use single SJs to reduce relation size.
Distributed Query Processing Algorithms
Three main representative algos are Distributed INGRES Algorithm R* Algorithm SDD-1 Algorithm.
R* Algorithm Static, exhaustive Algorithm supports fragmentation, actual implementation doesn’t Master, execution and apprentice sites.
Optimizer of Master site makes inter-site decisions Apprentice sites make local decisions Optimizes local processing time & communication time.
Optimizer, based on stats of DB and size of iterm results, decides about Join Ordering Join Algo (nested/mergeJoin) Access path (indexed/seq.).
Inter-site transfers Ship-whole Fetch-as-needed Entire relation transferred Stored in a temp relation In case of merge-join approach, tuples can be processed as they arrive Fetch-as-needed nExternal relation is sequentially scanned Join attribute value is sent to other relation Relevant tuples scanned at other site and sent to first site.
Inter-site transfers: comparison Ship-whole larger data transfer smaller number of messages better if relations are small Fetch-as-needed number of messages = O(cardinality of external relation) data transfer per message is minimal better if relations are large and the join selectivity is good.
Example, join of an external relation R with an internal relation S, there are four strategies.
1-Move outer relation tuples to the site of the inner relation Can be joined as they arrive Total Cost = LT (retrieve card(R) tuples from R) + CT (size(R)) + LT (retrieve s tuples from S) * card (R)
2- Move inner relation to the site of outer relation cannot join as they arrive; they need to be stored Total Cost = LT (retrieve card(S) tuples from S) + CT (size (S)) + LT (store card(S) tuples as T) + LT (retrieve card(R) tuples from R) + LT (retrieve s tuples from T) * card (R).
3- Fetch inner tuples as needed For each tuple in R, send join attribute value to site of S Retrieve matching inner tuples at site S Send the matching S tuples to site of R Join as they arrive
Total Cost = LT (retrieve card(R) tuples from R)+ CT (length(A) * card (R)) + LT(retrieve s tuples from S) * card(R) + CT (s * length(S)) * card(R)
4- Move both inner and outer relations to another site Example: A query consisting join of PROJ (ext) and ASG (int) on pNo Four strategies
1- Ship PROJ to site of ASG 2- Ship ASG to site of PROJ 3- Fetch ASG tuples as needed for each tuple of PROJ 4- Move both to a third site Optimization involves costing for each possibility.
That is it regarding R* algorithm for distributed query optimization Lets review it.
SDD-1 Algorithm System for Distributed Databases A non-commercial database.
Based on the Hill Climbing Algorithm No semijoins, No rep/frag Cost of transferring the result to the user site from the final result site is not considered Can minimize either total time or response time
Input include Query Graph Locations of relations Relations’ statistics.
1- Do the initial local processing 2- Select the initial best plan (ES0) Calculate cost of moving all relations to a single site Plan with the least cost is ES0
3- Split ES0 into ES1 and ES2 ES1: Sending one of the relation to other site, relations joined there ES2:Sending the result back to site in ES0.
5- Recursively apply step 3 and 4 on ES1 and ES2, until no improvement 4- Replace ES0 with ES1 and ES2 when we should have cost(ES1) + cost(local join) + cost (ES2) < cost (ES0) 5- Recursively apply step 3 and 4 on ES1 and ES2, until no improvement
Example “Find the salaries of engineers working on CAD/CAM project” Involves EMP, PAY, PROJ and ASG sal(PAY ⋈title(EMP ⋈eNo(ASG ⋈pNo(pName = ‘CAD/CAM’ (PROJ)))))
Assume Tmsg = 0 and TTR = 1 Length of a tuple is 1 Relation Size Site EMP PAY PROJ ASG 8 4 1 10 2 3 Assume Tmsg = 0 and TTR = 1 Length of a tuple is 1 So size(R) = card(R)
Site 1 Considering only transfers costs PAY site 1 = 4 PROJ site 1 = 1 ASG site 1 = 10 Total = 15
Assume Tmsg = 0 and TTR = 1 Length of a tuple is 1 Relation Size Site EMP PAY PROJ ASG 8 4 1 10 2 3 Assume Tmsg = 0 and TTR = 1 Length of a tuple is 1 So size(R) = card(R)
Site 1 Considering only transfers costs PAY site 1 = 4 PROJ site 1 = 1 ASG site 1 = 10 Total = 15
Cost for site 2 = 19 Cost for site 3 = 22 Cost for site 4 = 13 So site 4 is our ES0 Move all relations to site 4.
Thanks