Download presentation
Presentation is loading. Please wait.
Published byΖώσιμος Κυπραίος Modified over 6 years ago
1
Distributed Databases Going Beyond Simple Client Server Configuration
Still need more explanation of the CAP theorem and BASE Copyright © Curt Hill
2
Preface We have already seen that a DBMS might use:
Multiple CPUs Multiple disks in RAID configuration or not Networking How do we take it larger and faster? Multiple sites separated by significant geography Copyright © Curt Hill
3
Definition Distributed Database Management System (DDBMS) A software system that manages multiple logically related databases distributed over a several sites while making the distribution transparent It appears to users to be a single database The pieces are geographically separate Miles rather than feet apart Copyright © Curt Hill
4
Parallel and Distributed
A parallel architecture may loosely or tightly coupled Tightly coupled Multiple CPUs sharing a common memory Loosely coupled Multiple CPUs sharing disk but not memory Neither of these is considered distributed Copyright © Curt Hill
5
Other Differences Networks connect different sites (or nodes)
Not just a server farm with many machines within a small distance of each The connected databases must be related to one another A query could involve several sites The systems do not have to be identical in terms of data, hardware and software Copyright © Curt Hill
6
Advantages Easier to develop applications for a DDBMS
Increased reliability and availability Improved flexibility Increased performance Easier to scale to large implementations Copyright © Curt Hill
7
Terminology There is quite a bit of terminology Transparency types
Some is unique to DDBMS Transparency types Autonomy Homogeneity Copyright © Curt Hill
8
Transparency The users should not be able to discern that multiple sites contribute Several types of transparency Data organization transparency Replication transparency Fragmentation transparency Design transparency Execution transparency Copyright © Curt Hill
9
Data organization Also known as distribution or network transparency
Details of where data is are hidden from user Location transparency The command to access is independent of the location of own site or the data’s site Naming transparency Named objects (such as tables) may be accessed regardless of location where they may be Copyright © Curt Hill
10
Replication transparency
We may have copies of the data at multiple sites This increases performance, reliability, availability The user cannot tell that such copies exist nor if copies are accessed rather than the original Keeping the copies up to date is also handled transparently Copyright © Curt Hill
11
Fragmentation transparency
Tables may be fragmented in either a horizontal or vertical way across sites Horizontal: not all tuples are at the same site Vertical: not all fields are at the same site A global query is transparently transformed into local queries User is unaware of any fragmentation Copyright © Curt Hill
12
Design and Execution Transparency
Design: freedom from knowing how the database is organized across the different sites Execution: ignorance of where the work of the query is actually performed The source site The site where data resides Another site Copyright © Curt Hill
13
Autonomy How much independent operation is allowed
If the network is inoperable can the separated sites still in function to what degree? High degree of autonomy increases flexibility Design, Communication and Execution autonomies exist Copyright © Curt Hill
14
Autonomy Again Design: Each site may employ its own design:
How the tables and views are organized How many and what type of indices Communication: how much sharing of informaiton with other nodes Execution: independence of users to access information Copyright © Curt Hill
15
Reliability and Availability
Two terms are related but not identical Reliability: likelihood of the system running at any point in time Availability: likelihood that the system will be continuously running during some interval Copyright © Curt Hill
16
Homogeneity How similar are the systems within a DDBMS?
Homogeneous – all sites are the same Heterogeneous – sites have different software Not just different versions of the same software Copyright © Curt Hill
17
Local autonomy How well can a site function by itself?
Such as in the case of network failure No local autonomy makes the site a slave of the system It does not function by itself Copyright © Curt Hill
18
Bad News: CAP theorem A distributed database or web service cannot guarantee all of the following: Consistency That operations occur all at once Availability Every operation must terminate in the intended operation Partition tolerance Operations will complete even if individual components fail Copyright © Curt Hill
19
CAP Again The CAP theorem is just that – a proven theorem
The best that we can hope for is to get two out of three of consistency, availability and partition tolerance Many of the NOSQL databased specifically choose which two to go for Banks and other large scale relational databases also have to choose Copyright © Curt Hill
20
ACID and BASE The CAP theorem specifically mentions the consistency of ACID The alternative is BASE Basically Available Soft state Eventually consistent Copyright © Curt Hill
21
BASE Basically Available Soft state Eventual consistency
May not get the most current (correct) value System is often in a state of flux Soft state Even without transactions the system may be changing Eventual consistency Eventually the system will do the needed updates throughout to reach consistency Copyright © Curt Hill
22
Types of DBMS Traditional central database Pure distributed database
Federated database Multidatabase AKA peer to peer database Copyright © Curt Hill
23
Pure distributed database
Made to look like traditional database except that it is distributed over many sites Sites have no local autonomy Single conceptual schema for all sites All access is through a single site that is part of it Copyright © Curt Hill
24
Federated Database Each server site has local autonomy
Users and a DBA are associated with each site A global schema that is used across the sites Heterogeneous software may be used Copyright © Curt Hill
25
Multidatabase Each server site has full local autonomy
Users and a DBA are associated with each site Each site has its own schema Each site cooperates with other sites converting the schema of a request into the local schema as needed Heterogeneous software may be used Copyright © Curt Hill
26
Pure DDB Architecture A pure distributed database has a single conceptual schema The Global Conceptual Schema (GCS) Users interact with this conceptual scheme Each site then has: A Local Conceptual Schema (LCS) A Local Internal Schema (LIS) The local schemas only reflect the pieces of the overall database possessed at the local site Copyright © Curt Hill
27
Query Translation The query compiler takes the query
It then translates it into the sub-queries for each local site This is based on the fragmentation of the whole database It may generate several alternative queries and then evaluate their costs to find the best one Local sites will also have their own query optimizers Copyright © Curt Hill
28
Pure and Federated In a pure distributed database there is typically little or no heterogeneity or local autonomy Each site is very uniform Except for the differences demanded by the fragmentation types In a federated system the greater levels of heterogeneity and autonomy make for different handling of schemas Copyright © Curt Hill
29
Federated Architecture
At the top level there is a federated schema that supports the external schema that the users see Immediately below this is an export schema This is the schema that all the other sites are able to access Below this is the local schema Query translation is similar but more difficult in this approach Copyright © Curt Hill
30
Three-Tier Client Server
The three tier approach is generally required in DDBMS The presentation layer This is the client - usually a web browser or specialized client The application server Split the request among the individual sites Database server An individual site Copyright © Curt Hill
31
DDBMS Application Server
A Three Tier Approach Server Server Server Server DDBMS Application Server Client Client Client Copyright © Curt Hill
32
Application Layer This layer largely distinguishes a DDBMS from a DBMS
It must have substantial logic so that it divides each query into the components that are to be sent to the individual sites It then reconstitutes the results and relays them to the original client The application server may be multiple machines as well Copyright © Curt Hill
33
Load Balancer A load balancer may introduce another level
Requests are sent from the client to the load balancer It chooses the application server that is the least busy Once this is chosen it gets out of the way Unlike the application server Copyright © Curt Hill
34
A Four Tier Approach Server Server Server Server App Server App Server
Load balancer Client Client Client Copyright © Curt Hill
35
Fragmentation Without some form of fragmentation there is no distribution It is this fragmentation that increases performance, reliability and availability It is also this fragmentation that complicates the global query compiler and optimizer Copyright © Curt Hill
36
Horizontal Horizontal fragmentation stores one table in multiple sites
The tuples that are stored at a particular site are determined by a condition A condition that could be in a query Example: The employees stored in a fragment at a site may be those who work at that site The products that are available at this site’s warehouse Copyright © Curt Hill
37
Vertical Similar to a decomposition to bring a schema to a higher normal form Example: A site may have employee information suitable for matching to a task, but another site have payroll information Both would need the primary key Many sites might have in-stock information but only one reordering information Copyright © Curt Hill
38
More on Fragmentation If we have use both we get mixed or hybrid fragmentation A fragmentation schema defines what pieces are stored This includes all attributes of the fragmented tuple An allocation schema maps the fragments to the sites This must also account for replicated fragments Copyright © Curt Hill
39
Replication Maintaining copies of all or a portion of database at multiple sites makes it easier for queries to be handled without much network traffic A fully replicated distributed database maintains all data at all sites No replication is the other extreme Partial replication is the middle This also increases the overhead of any type of data change The change must be propagated to other sites Copyright © Curt Hill
40
Allocation The allocation schema records the assignment of fragments to sites A replication schema records where copies of the data are maintained Finding a good replication scheme is not easy Often use frequency of usage data to assign sites to minimize network traffic Copyright © Curt Hill
41
Query Processing Distributed query processing usually involves stages
Query mapping Not much different than that on a regular DBMS Localization Global query optimization Local query optimization Copyright © Curt Hill
42
Localization Fragmentation scatters the data over several sites
Replication increases the number of sites where the data may be found Localization uses the fragmentation and replication schemas to determine where the data is that is needed for this query Often several possibilities exist Copyright © Curt Hill
43
Global Query Optimization
Localization will provide several candidate queries These will be evaluated for costs: CPU costs I/O costs Communication costs Usually the communication cost is most significant The costs are generally expressed in terms of time Copyright © Curt Hill
44
Data Transfer Costs Generally:
CPUs are faster than memory Memory is faster than disk storage Disk storage is faster than network communication In a DBMS minimizing disk storage accesses is very important In a DDBMS it is still, but minimizing network traffic is even more important Consider a join across two sites Copyright © Curt Hill
45
Joins Generally executing joins is time consuming
If the two tables are in different sites then: Transfer one table to another site Join the two tables Substantial network work, especially if the table is large Many queries have multiple joins Copyright © Curt Hill
46
The Semijoin The idea is to reduce as much volume to be transmitted over the network as possible Reduce the number of tuples prior to transmission Reduce the number of columns as well Copyright © Curt Hill
47
The Join strategy Project out all the unnecessary columns from the table Those columns not needed by result Transmit result Usually attempt to transmit the smaller to the larger Join the two You may even send the result back to original site and rejoin to get the full data needed Copyright © Curt Hill
48
Example Tables Suppose an employee table:
E_ID E_Name E_Site E_Dept Etc Suppose also a department table: D_ID D_Name D_VP Copyright © Curt Hill
49
Example Locations Employee table is horizontally fragmented
Each site has only its own employees Department table is only at company HQ Every department may be represented at each site Want a query that connects employees with their VPs and salaries Copyright © Curt Hill
50
Example Query Query might look something like this: Select loc.E_Name, glob.E_Name, loc.E_salary From employees loc, employees glob, department Where E_site = 4 AND loc.E_Dept = D_ID AND glob.E_ID = D_VP Copyright © Curt Hill
51
Example Numbers Suppose that an employee record is 88 bytes
There are 250 records at site 4 Suppose that there are 700 employees at the central office A department record is 35 bytes and there are 100 departments Copyright © Curt Hill
52
Example Query Execution
Take the local employees fragment and project out everything but E_ID, E_Dept Send this to the central site Join this with department and then with employee fragment at central Project out everything but loc.E_ID, and glob.E_Name Send back Join back to local employee file Project and display Copyright © Curt Hill
53
Example Transmissions
Transmitting the site 4 employees is 22,000 bytes Transmitting the site 1 employees is 61,600 bytes Transmitting the department table is 3,500 bytes Transmitting the E_ID, E_Dept columns from 4 to 1 is 2000 bytes Transmitting the semi-joined file back to 4 is 5000 bytes Copyright © Curt Hill
54
Transaction Management
The ACID properties may be problematic Replicated pieces must be updated with originals Locking multiple sites may slow matters down What needs to happen if communication is lost during the process? Copyright © Curt Hill
55
Strict 2PL Rules A transaction requests a lock when it desires to access the object Shared lock for reads and exclusive for writes It holds all locks until complete Either a commit or rollback A transaction that cannot get the lock is suspended until the item is available If both reading and writing is desired get the exclusive lock A transaction holding a lock for too long may be deadlocked Often aborted Copyright © Curt Hill
56
Multiple Sites A multidatabase or a distributed database complicates this A multidatabase usually needs a two phase commit (2PC) A distributed database often needs a three phase commit (3PC) In both cases a Global Recovery Manager is required AKA Global Recovery Coordinator Copyright © Curt Hill
57
2PC - Phase 1 Each local signals that the transaction has concluded
Global recovery manager sends a Prepare for Commit message Each local will write its log records for potential recovery The locals will respond with ready to commit if everything is fine and unable to commit otherwise Copyright © Curt Hill
58
2PC – Phase 2 If all locals have responded with ready to commit, then global recovery manager sends a commit message Otherwise entire transaction is rolled back Since log information is up to date this can be recovered even in the event of a failure Copyright © Curt Hill
59
2PC problems This is a blocking protocol
If the manager fails, then all sites are left hanging Generally serious performance issues result This caused the development of the three phase commit (3PC) Both 2PC and 3PC rely on votes Any problems by any local can rollback the whole transaction Copyright © Curt Hill
60
Three phase commit What is added here is the vote result
An upper bound on the time allowed to perform the commit exists The global manager sends out the vote tallies Now every site knows every other site involved If the global manager fails, another site may see the transaction to its conclusion Copyright © Curt Hill
61
3PC – Phase 1 Manager sends out the can commit Locals respond yes
If any local votes no or fails to respond the transaction is aborted Copyright © Curt Hill
62
3PC – Phase 2 Manager sends out the pre-commit Locals acknowledge
If any local fails to respond the transaction is aborted Copyright © Curt Hill
63
3PC – Phase 3 Manager sends out the do commit
Locals respond have committed Timeout causes abort Copyright © Curt Hill
64
Finally DDBMS are more complicated than single site DBMS
dUH Allow for scales unattainable with regular DBMS The various transparencies complicate implementation, but simplify usage CAP is also a problem Copyright © Curt Hill
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.