Distributed Database Management Systems Lecture 30
In the previous lecture: locking-based concurrency control (CC), timestamp-ordering-based CC, and the conclusion of transaction management (TM).
In this lecture: basic concepts of query optimization, and query processing (QP) in centralized and distributed DBs.
Introduction: SQL is one of the success factors of RDBMSs. The query processor transforms complex queries into concise and simple ones.
Query processing is a critical performance issue. QP is a complex problem, especially in a DDBS environment.
The main function of QP is to transform an SQL query into an equivalent relational algebra query (a lower-level language). The transformation must achieve both correctness and efficiency.
Correctness is straightforward because well-defined transformation rules exist. Efficiency is harder: an SQL query can have many equivalent relational algebra expressions, and they differ in cost.
Consider the tables:
EMP(eNo, eName, title)
ASG(eNo, pNo, resp, dur)
PROJ(pNo, pName, budget, loc)
Query: get the names of employees who are managing a project.
SELECT eName
FROM EMP, ASG
WHERE EMP.eNo = ASG.eNo AND resp = 'Manager'
Two equivalent relational algebra expressions for this query:
(1) π_eName(σ_{resp='Manager' ∧ EMP.eNo=ASG.eNo}(EMP × ASG))
(2) π_eName(EMP ⋈_eNo (σ_{resp='Manager'}(ASG)))
The second one clearly needs fewer computing resources, since it avoids the Cartesian product.
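The equivalence of the two expressions can be checked on a small example. The following is a minimal sketch, assuming a handful of made-up rows (the names and values are not from the lecture) and two hypothetical helper functions, `plan_cartesian` and `plan_select_then_join`. It also reports how many intermediate tuples each plan builds.

```python
from itertools import product

# Hypothetical sample rows (not from the lecture).
# EMP(eNo, eName, title), ASG(eNo, pNo, resp, dur)
EMP = [
    ("E1", "Ali", "Engineer"),
    ("E2", "Sara", "Analyst"),
    ("E3", "Omar", "Engineer"),
    ("E4", "Hina", "Programmer"),
]
ASG = [
    ("E1", "P1", "Manager", 12),
    ("E2", "P1", "Analyst", 24),
    ("E3", "P2", "Manager", 6),
    ("E4", "P2", "Programmer", 18),
]

def plan_cartesian(emp, asg):
    """Expression (1): Cartesian product, then selection, then projection."""
    pairs = list(product(emp, asg))  # EMP x ASG
    names = {e[1] for e, a in pairs
             if a[2] == "Manager" and e[0] == a[0]}
    return names, len(pairs)

def plan_select_then_join(emp, asg):
    """Expression (2): select managers first, then join on eNo, then project."""
    managers = [a for a in asg if a[2] == "Manager"]  # selection on ASG
    manager_ids = {a[0] for a in managers}
    names = {e[1] for e in emp if e[0] in manager_ids}  # join + projection
    return names, len(managers)

# A set is used for the result, so duplicates are eliminated,
# as in a relational algebra projection.
names1, size1 = plan_cartesian(EMP, ASG)
names2, size2 = plan_select_then_join(EMP, ASG)
print(names1 == names2, size1, size2)
```

On this toy data both plans return the same names, but the first materializes all 16 EMP-ASG pairs while the second only keeps the 2 selected ASG tuples before joining.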
Centralized QP chooses the best query execution plan. Distributed QP is more complex: it must also select the sites at which the query's operations execute.
The same query in a DDBS. Suppose EMP and ASG are horizontally fragmented (HF) as:
EMP1 = σ_{eNo ≤ 'E3'}(EMP)
EMP2 = σ_{eNo > 'E3'}(EMP)
ASG1 = σ_{eNo ≤ 'E3'}(ASG)
ASG2 = σ_{eNo > 'E3'}(ASG)
Further suppose the fragments ASG1, ASG2, EMP1 and EMP2 are stored at sites 1, 2, 3 and 4 respectively, and the result is expected at site 5.
Strategy 1 (figure):
Site 1: ASG1' = σ_{resp='Manager'}(ASG1), sent to site 3
Site 2: ASG2' = σ_{resp='Manager'}(ASG2), sent to site 4
Site 3: EMP1' = EMP1 ⋈_eNo ASG1', sent to site 5
Site 4: EMP2' = EMP2 ⋈_eNo ASG2', sent to site 5
Site 5: result = EMP1' ∪ EMP2'
Strategy 2 (figure):
Sites 1, 2, 3 and 4 send ASG1, ASG2, EMP1 and EMP2 to site 5.
Site 5: result = (EMP1 ∪ EMP2) ⋈_eNo σ_{resp='Manager'}(ASG1 ∪ ASG2)
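That the two distributed strategies produce the same answer can be checked on toy data. This is a minimal sketch, assuming made-up rows with single-digit employee numbers (so that plain string comparison against 'E3' behaves as intended) and two hypothetical helpers, `select_managers` and `join_names`. It runs everything in one process and only mimics which site would do which step.

```python
# Hypothetical sample rows (not from the lecture).
EMP = [("E1", "Ali"), ("E2", "Sara"), ("E3", "Omar"),
       ("E4", "Hina"), ("E5", "Bilal"), ("E6", "Zara")]
ASG = [("E1", "P1", "Manager"), ("E2", "P1", "Analyst"),
       ("E3", "P2", "Engineer"), ("E4", "P2", "Programmer"),
       ("E5", "P3", "Manager"), ("E6", "P3", "Analyst")]

# Horizontal fragmentation on eNo, as defined in the lecture.
EMP1 = [e for e in EMP if e[0] <= "E3"]
EMP2 = [e for e in EMP if e[0] > "E3"]
ASG1 = [a for a in ASG if a[0] <= "E3"]
ASG2 = [a for a in ASG if a[0] > "E3"]

def select_managers(asg):
    """Selection resp = 'Manager' on an ASG fragment."""
    return [a for a in asg if a[2] == "Manager"]

def join_names(emp, asg):
    """Join on eNo followed by projection on eName."""
    ids = {a[0] for a in asg}
    return {e[1] for e in emp if e[0] in ids}

# Strategy with local selections and joins, then a union at the result site.
emp1_prime = join_names(EMP1, select_managers(ASG1))
emp2_prime = join_names(EMP2, select_managers(ASG2))
result_local = emp1_prime | emp2_prime

# Strategy that ships all fragments to the result site first.
result_central = join_names(EMP1 + EMP2, select_managers(ASG1 + ASG2))

print(result_local == result_central, sorted(result_local))
```

The fragment-wise plan is valid here because EMP and ASG are fragmented on the same eNo ranges, so a manager tuple in ASG1 can only match employees in EMP1, and likewise for ASG2 and EMP2.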
Let us assume:
size(EMP) = 400 tuples
size(ASG) = 1000 tuples
tuple access cost = 1 unit
tuple transfer cost = 10 units
There are 20 managers.
Data is distributed evenly across the sites.
Strategy 1 cost:
Produce ASG': 20 * 1 = 20
Transfer ASG' to the sites of EMP: 20 * 10 = 200
Produce EMP': (10 + 10) * 1 * 2 = 40
Transfer EMP' to the result site: 20 * 10 = 200
Total: 460
Strategy 2 cost:
Transfer EMP to site 5: 400 * 10 = 4000
Transfer ASG to site 5: 1000 * 10 = 10000
Produce ASG' by selecting ASG: 1000
Join EMP and ASG': 8000
Total: 23000
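The cost arithmetic for both strategies can be reproduced with a few lines of code. This is a minimal sketch using the figures stated in the lecture. The function and constant names are hypothetical, and the join cost in Strategy 2 is assumed to be a nested-loop estimate of 400 * 20 tuple accesses, which matches the 8000 given on the slide.

```python
TUPLE_ACCESS = 1      # cost units per tuple access
TUPLE_TRANSFER = 10   # cost units per tuple transferred
SIZE_EMP = 400
SIZE_ASG = 1000
MANAGERS = 20         # ASG tuples with resp = 'Manager'

def strategy_1_cost():
    """Select and join locally, ship only the small intermediate results."""
    per_site = MANAGERS // 2                                 # 10 managers per ASG fragment
    produce_asg = MANAGERS * TUPLE_ACCESS                    # 20
    transfer_asg = MANAGERS * TUPLE_TRANSFER                 # 200
    produce_emp = (per_site + per_site) * TUPLE_ACCESS * 2   # 40
    transfer_emp = MANAGERS * TUPLE_TRANSFER                 # 200
    return produce_asg + transfer_asg + produce_emp + transfer_emp

def strategy_2_cost():
    """Ship every fragment to the result site and do all the work there."""
    transfer_emp = SIZE_EMP * TUPLE_TRANSFER                 # 4000
    transfer_asg = SIZE_ASG * TUPLE_TRANSFER                 # 10000
    produce_asg = SIZE_ASG * TUPLE_ACCESS                    # 1000
    # Assumption: nested-loop join of EMP with the 20 selected ASG tuples.
    join = SIZE_EMP * MANAGERS * TUPLE_ACCESS                # 8000
    return transfer_emp + transfer_asg + produce_asg + join

cost_1 = strategy_1_cost()
cost_2 = strategy_2_cost()
print(cost_1, cost_2, cost_2 / cost_1)
```

Under these assumptions Strategy 1 costs 460 units and Strategy 2 costs 23000 units, a factor of 50.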
Query Optimization: an important aspect of QP. The goal is to minimize resource consumption, where total cost = I/O cost + CPU cost + communication cost. Only the first two apply in a centralized DB.
Communication cost dominates in a WAN. It is less dominant in LANs, so the total cost should be considered there. QO can also aim to maximize throughput.
Operators' Complexity (n = relation cardinality):
Select, Project (without duplicate elimination): O(n)
Project (with duplicate elimination), Group: O(n log n)
Join, Semi-join, Division, Set operators: O(n log n)
Cartesian product: O(n^2)
Characterization of Query Processors
Types of Optimization. Exhaustive search: estimate the cost of every strategy and pick the cheapest one. This may be very costly when there are many options and many fragments. Heuristics: consider only a restricted set of strategies instead of costing all of them.
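Exhaustive search can be illustrated by enumerating every join order and costing each one. This is a minimal sketch under loudly stated assumptions: the cost model (sum of estimated intermediate result sizes), the cardinality of PROJ, and the join selectivities are all invented for illustration and are not from the lecture. `plan_cost` is a hypothetical helper.

```python
from fractions import Fraction
from itertools import permutations

# EMP and ASG sizes follow the running example; PROJ's size is an assumption.
CARD = {"EMP": 400, "ASG": 1000, "PROJ": 100}

# Assumed join selectivities. Pairs without an entry have no join
# predicate, so joining them degenerates to a Cartesian product.
SEL = {
    frozenset(("EMP", "ASG")): Fraction(1, 400),
    frozenset(("ASG", "PROJ")): Fraction(1, 100),
}

def plan_cost(order):
    """Toy cost: sum of the estimated sizes of all intermediate results."""
    size = Fraction(CARD[order[0]])
    joined = [order[0]]
    cost = Fraction(0)
    for rel in order[1:]:
        sel = Fraction(1)
        for prev in joined:
            sel *= SEL.get(frozenset((prev, rel)), Fraction(1))
        size = size * CARD[rel] * sel   # estimated size after this join
        cost += size
        joined.append(rel)
    return cost

# Exhaustive search: cost every permutation and keep the cheapest.
plans = list(permutations(CARD))
best = min(plans, key=plan_cost)
for plan in plans:
    print(plan, plan_cost(plan))
print("best:", best)
```

With three relations there are only 3! = 6 orders, but the number grows factorially with the number of relations and fragments, which is why the search becomes costly and heuristics are used to prune it.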
Optimization Timing.
Static: done during compilation. The sizes of intermediate tables are not always known. The optimization cost is justified by repeated execution.
Dynamic: done during execution. The sizes of intermediate tables are known. Re-optimization may be required.
Statistics.
Relation/fragment: cardinality, size of a tuple, fraction of tuples participating in a join with another relation.
Attribute: cardinality of the domain, actual number of distinct values.
Decision Sites.
Centralized: simple, but needs knowledge about the entire distributed database.
Distributed: sites cooperate to determine the schedule, and each needs only local information.
Hybrid: one site determines the global schedule, and each site optimizes its local subqueries.
Other factors include network topology, replicated fragments, and the use of semijoins.
Layers of distributed query processing (figure):
SQL query on distributed relations
-> Query decomposition (uses the global schema)
-> Algebraic query on distributed relations
-> Data localization (uses the fragment schema)
-> Fragment query
-> Global optimization (uses statistics on fragments)
-> Optimized fragment query with communication operations
-> Local optimization
-> Optimized local query