Presentation is loading. Please wait.

Presentation is loading. Please wait.

Distributed DBMSPage 7-9. 1© 1998 M. Tamer Özsu & Patrick Valduriez Outline Introduction Background Distributed DBMS Architecture Distributed Database.

Similar presentations


Presentation on theme: "Distributed DBMSPage 7-9. 1© 1998 M. Tamer Özsu & Patrick Valduriez Outline Introduction Background Distributed DBMS Architecture Distributed Database."— Presentation transcript:

1 Distributed DBMSPage 7-9. 1© 1998 M. Tamer Özsu & Patrick Valduriez Outline Introduction Background Distributed DBMS Architecture Distributed Database Design  Distributed Query Processing à Query Processing Methodology à Distributed Query Optimization Distributed Transaction Management (Extensive) n Building Distributed Database Systems (RAID) Mobile Database Systems Privacy, Trust, and Authentication Peer to Peer Systems

2 Distributed DBMSPage 7-9. 2© 1998 M. Tamer Özsu & Patrick Valduriez Query Processing high level user query query processor low level data manipulation commands

3 Distributed DBMSPage 7-9. 3© 1998 M. Tamer Özsu & Patrick Valduriez Query Processing Components Query language that is used à SQL: “intergalactic dataspeak” Query execution methodology à The steps that one goes through in executing high- level (declarative) user queries. Query optimization à How do we determine the “best” execution plan?

4 Distributed DBMSPage 7-9. 4© 1998 M. Tamer Özsu & Patrick Valduriez SELECTENAME  Project FROMEMP,ASG  Select WHEREEMP.ENO = ASG.ENO  Join ANDDUR > 37 Strategy 1  ENAME (  DUR>37  EMP.ENO=ASG.ENO  (EMP  ASG)) Strategy 2  ENAME (EMP ENO (  DUR>37 (ASG))) Strategy 2 avoids Cartesian product, so is “better” Selecting Alternatives

5 Distributed DBMSPage 7-9. 5© 1998 M. Tamer Özsu & Patrick Valduriez What is the Problem? Site 1Site 2Site 3Site 4Site 5 EMP 1 =  ENO≤“E3” (EMP)EMP 2 =  ENO>“E3” (EMP) ASG 2 =  ENO>“E3” (ASG) ASG 1 =  ENO≤“E3” (ASG) Result Site 5 Site 1Site 2Site 3Site 4 ASG 1 EMP 1 EMP 2 ASG 2 result 2 =(EMP 1   EMP 2 ) ENO  DUR>37 (ASG 1  ASG 1 ) Site 4 result = EMP 1 ’  EMP 2 ’ Site 3 Site 1Site 2 EMP 2 ’ =EMP 2 ENO ASG 2 ’ EMP 1 ’ =EMP 1 ENO ASG 1 ’ ASG 1 ’ =  DUR>37 (ASG 1 )ASG 2 ’ =  DUR>37 (ASG 2 ) Site 5 ASG 2 ’ ASG 1 ’ EMP 1 ’ EMP 2 ’

6 Distributed DBMSPage 7-9. 6© 1998 M. Tamer Özsu & Patrick Valduriez Assume: à size (EMP) = 400, size (ASG) = 1000 à tuple access cost = 1 unit; tuple transfer cost = 10 units Strategy 1  produce ASG': (10+10)  tuple access cost 20  transfer ASG' to the sites of EMP: (10+10)  tuple transfer cost 200  produce EMP': (10+10)  tuple access cost  2 40  transfer EMP' to result site: (10+10)  tuple transfer cost 200 Total cost 460 Strategy 2  transfer EMP to site 5:400  tuple transfer cost 4,000  transfer ASG to site 5 :1000  tuple transfer cost 10,000  produce ASG':1000  tuple access cost 1,000  join EMP and ASG':400  20  tuple access cost 8,000 Total cost23,000 Cost of Alternatives

7 Distributed DBMSPage 7-9. 7© 1998 M. Tamer Özsu & Patrick Valduriez Minimize a cost function I/O cost + CPU cost + communication cost These might have different weights in different distributed environments Wide area networks à communication cost will dominate (80 – 200 ms)  low bandwidth  low speed  high protocol overhead à most algorithms ignore all other cost components Local area networks à communication cost not that dominant (1 – 5 ms) à total cost function should be considered Can also maximize throughput Query Optimization Objectives

8 Distributed DBMSPage 7-9. 8© 1998 M. Tamer Özsu & Patrick Valduriez Assume à relations of cardinality n à sequential scan Complexity of Relational Operations OperationComplexity Select Project (without duplicate elimination) O( n ) Project (with duplicate elimination) Group O( n log n ) Join Semi-join Division Set Operators O( n log n ) Cartesian ProductO( n 2 )

9 Distributed DBMSPage 7-9. 9© 1998 M. Tamer Özsu & Patrick Valduriez Query Optimization Issues – Types of Optimizers Exhaustive search à cost-based à optimal à combinatorial complexity in the number of relations Heuristics à not optimal à regroup common sub-expressions à perform selection, projection first à replace a join by a series of semijoins à reorder operations to reduce intermediate relation size à optimize individual operations

10 Distributed DBMSPage 7-9. 10© 1998 M. Tamer Özsu & Patrick Valduriez Query Optimization Issues – Optimization Granularity Single query at a time à cannot use common intermediate results Multiple queries at a time à efficient if many similar queries à decision space is much larger

11 Distributed DBMSPage 7-9. 11© 1998 M. Tamer Özsu & Patrick Valduriez Query Optimization Issues – Optimization Timing Static  compilation  optimize prior to the execution  difficult to estimate the size of the intermediate results  error propagation à can amortize over many executions à R* Dynamic à run time optimization à exact information on the intermediate relation sizes à have to reoptimize for multiple executions à Distributed INGRES Hybrid à compile using a static algorithm à if the error in estimate sizes > threshold, reoptimize at run time à MERMAID

12 Distributed DBMSPage 7-9. 12© 1998 M. Tamer Özsu & Patrick Valduriez Query Optimization Issues – Statistics Relation à cardinality à size of a tuple à fraction of tuples participating in a join with another relation Attribute à cardinality of domain à actual number of distinct values Common assumptions à independence between different attribute values à uniform distribution of attribute values within their domain

13 Distributed DBMSPage 7-9. 13© 1998 M. Tamer Özsu & Patrick Valduriez Query Optimization Issues – Decision Sites Centralized à single site determines the “best” schedule à simple à need knowledge about the entire distributed database Distributed à cooperation among sites to determine the schedule à need only local information à cost of cooperation Hybrid à one site determines the global schedule à each site optimizes the local subqueries

14 Distributed DBMSPage 7-9. 14© 1998 M. Tamer Özsu & Patrick Valduriez Query Optimization Issues – Network Topology Wide area networks (WAN) – point-to-point à characteristics  low bandwidth  low speed  high protocol overhead à communication cost will dominate; ignore all other cost factors à global schedule to minimize communication cost à local schedules according to centralized query optimization Local area networks (LAN) à communication cost not that dominant à total cost function should be considered à broadcasting can be exploited (joins) à special algorithms exist for star networks


Download ppt "Distributed DBMSPage 7-9. 1© 1998 M. Tamer Özsu & Patrick Valduriez Outline Introduction Background Distributed DBMS Architecture Distributed Database."

Similar presentations


Ads by Google