CSIS 7102 Spring 2004 Lecture 6: Distributed databases Dr. King-Ip Lin
Table of contents Limitation of locking techniques Timestamp ordering View serializability Optimistic concurrency control Graph-based locking Multi-version schemes
Distributed databases So far, we have assumed a centralized database Data are stored in one location (e.g. a single hard disk) A centralized database management system handles transactions To handle multiple requests, a client-server system is used Clients send requests for data to the server The server handles queries, transaction management, etc.
Distributed databases This is not the only possibility In many cases, it may be advantageous for data to be distributed Branches of a bank Different parts of the government storing different kinds of data about a person Different organizations sharing part of their data Thus, distributed databases
Distributed databases Data spread over multiple machines (also referred to as sites or nodes). A network interconnects the machines Data shared by users on multiple machines
Distributed databases Homogeneous distributed databases Same software/schema on all sites, data may be partitioned among sites Goal: provide a view of a single database, hiding details of distribution Heterogeneous distributed databases Different software/schema on different sites Goal: integrate existing databases to provide useful functionality
Distributed databases Advantages of distributed databases Sharing data – users at one site able to access the data residing at some other sites. Autonomy – each site is able to retain a degree of control over data stored locally. Higher system availability through redundancy — data can be replicated at remote sites, and system can function even if a site fails.
Distributed databases Key features of distributed databases Typically geographically distributed, with (relatively) slow connections Typically autonomous, in terms of both administration and execution However, many systems allow a coordinator site for each transaction (different transactions may have different coordinators) Local vs. global transactions A local transaction accesses data only at the single site at which the transaction was initiated. A global transaction either accesses data at a site different from the one at which the transaction was initiated or accesses data at several different sites.
Distributed databases Global transactions raise new issues in transaction processing Commit coordination: a node cannot unilaterally decide to commit A transaction cannot be committed at one site and aborted at another Data replication: the same data may reside at different sites Different copies may be read, so locking has to be careful To ensure correctness, updates have to be careful as well
Distributed databases – rules of the game A transaction may access data at several sites. Each site has a local transaction manager responsible for: Maintaining a log for recovery purposes Participating in coordinating the concurrent execution of the transactions executing at that site. Each site has a transaction coordinator, which is responsible for: Starting the execution of transactions that originate at the site. Distributing subtransactions to appropriate sites for execution. Coordinating the termination of each transaction that originates at the site, which may result in the transaction being committed at all sites or aborted at all sites.
Atomicity in distributed databases Ensuring atomicity means guarding against failures. Many more kinds of failures in distributed databases: Failure of a site. Loss of messages: handled by network transmission control protocols such as TCP/IP. Failure of a communication link: handled by network protocols, by routing messages via alternative links. Network partition: a network is said to be partitioned when it has been split into two or more subsystems that lack any connection between them. Note: a subsystem may consist of a single node. Hard to distinguish between site failure and network partition
Atomicity in distributed databases Challenge with respect to atomicity Consistency over multiple sites Cannot allow one site to commit and the other site to abort Two basic protocols 2-phase commit (most common) 3-phase commit
Two-phase commit Goals Given a transaction that is running on multiple sites, ensure that either all the sites commit together or all abort together. Assume that when a site fails, it does not send wrong messages to confuse anyone; it simply stops working (fail-stop) Need to handle the case that some sites fail during the 2-phase commit process
Two-phase commit Simple idea Select one site as the coordinator (the other sites are called participants) Ask all the sites whether each of them wants to commit or abort (phase 1) Wait to collect all the answers and make the final decision; broadcast the decision to all the sites; sites act accordingly (phase 2) Issues If a site fails and then quickly recovers, how does it know what it has done? What if a site fails in the middle: does everybody have to wait for it? What if the coordinator fails?
Two-phase commit If a site failed and then quickly recovered, how do I know what I have done? Need to have a log to record what has been done Log in “stable storage” Should I log before I act? Write-ahead log
Two-phase commit What if a participant site failed? By our assumption, it will not respond The coordinator will wait for a time, and then decide that one site failed The decision should be: abort What if the coordinator failed? Trickier, will deal with it later
Two-phase commit: phase 1 Phase 1: coordinator asks for decision Coordinator (Ci) asks all participants to prepare to commit transaction T. Ci adds the record <prepare T> to the log and forces the log to stable storage, then sends prepare T messages to all sites at which T executed Why should the coordinator write the record before sending messages?
Two-phase commit: phase 1 Upon receiving message, transaction manager at site determines if it can commit the transaction if not, add a record <no T> to the log send abort T message to Ci if the transaction can be committed, then: add the record <ready T> to the log force all log records for T to stable storage send ready T message to Ci Why can’t the site commit right away?
Two-phase commit: phase 2 Phase 2: coordinator makes the decision and broadcasts the result T can be committed if Ci received a ready T message from all the participating sites; otherwise T must be aborted. Coordinator adds a decision record, <commit T> or <abort T>, to the log and forces the record onto stable storage. Once the record reaches stable storage it is irrevocable (even if failures occur) Notice that the transaction is deemed committed/aborted at this point in time
Two-phase commit: phase 2 Coordinator sends a message to each participant informing it of the decision (commit or abort) Each participant takes the appropriate action locally and records the outcome in its log: <commit T> or <abort T>
Two phase commit : participant failures Suppose a participating site S fails. What must it do when it comes back up? First, check what is in the log Case 1: S sees <commit T> Meaning: Coordinator has decided to commit T and the decision is final Thus: S should make sure the transaction commits at that site (redo T)
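The two phases can be sketched from the coordinator's side as follows. This is a minimal in-memory sketch, not a definitive implementation: the function name `two_phase_commit`, the use of callables to stand in for participants, and the use of `TimeoutError` to model a site that does not respond are all assumptions made for illustration. A real coordinator would exchange network messages and force each log record to stable storage.

```python
from typing import Callable, Dict, List


def two_phase_commit(
    txn: str,
    participants: Dict[str, Callable[[str], str]],
    coordinator_log: List[str],
) -> str:
    """Run both phases for `txn`; return the decision, "commit" or "abort".

    Each participant is a hypothetical callable that answers "ready" or
    "abort", or raises TimeoutError to model a site that never responds.
    """
    # Phase 1: log <prepare T> before sending any prepare message.
    coordinator_log.append(f"<prepare {txn}>")
    votes = {}
    for site, prepare in participants.items():
        try:
            votes[site] = prepare(txn)
        except TimeoutError:
            # A silent participant is assumed to have failed: vote abort.
            votes[site] = "abort"

    # Phase 2: commit only if every participant answered "ready".
    decision = "commit" if all(v == "ready" for v in votes.values()) else "abort"
    # The decision is final once this record is in the (stable) log;
    # only afterwards would it be broadcast to the participants.
    coordinator_log.append(f"<{decision} {txn}>")
    return decision
```

Note that a single missing or negative answer is enough to abort, which matches the rule that the coordinator treats a timed-out site as failed.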
Two phase commit : participant failures Case 2: S sees <abort T> Meaning: Coordinator has decided to abort T and the decision is final Thus: S should make sure the transaction aborts at that site (undo T)
Two phase commit : participant failures Case 3: S sees <ready T> Meaning: T can be committed from the point of view of S only Does S know the final decision yet? Thus: S must query the coordinator about the final decision, and act accordingly
Two phase commit : participant failures Case 4: S sees nothing Meaning: S has not even responded to the initial query from the coordinator Thus: S must send its decision to the coordinator But is it really necessary? If the coordinator does not hear from S for a long time, it will assume S has failed and abort the transaction Thus S can safely decide to abort without any problem and without sending its decision to the coordinator (Why?)
Two-phase commit: coordinator failure Suppose the coordinator fails Then participants must make a decision Case 1 : a site sees <commit T> Meaning: T has committed Thus: broadcast the result and ensure everyone commits Case 2 : a site sees <abort T> Meaning: T has aborted Thus: broadcast the result and ensure everyone aborts
Two-phase commit: coordinator failure Case 3 : a site sees nothing (no <ready T> record in its log) Meaning: the coordinator cannot have decided to commit, because this site never sent ready T; either no decision has been made or the decision was to abort Thus: it is safe to abort (instead of waiting for the coordinator)
Two-phase commit: coordinator failure Case 4 : none of the above Meaning: every participant that is alive has told the coordinator that it can commit Thus, it is possible that the coordinator has made a decision but has yet to send it out Note that the decision may still be to abort T All participants must wait for the coordinator to return with its decision Thus two-phase commit is blocking in this case
Two-phase commit: network partition If the coordinator and all its participants remain in one partition, the failure has no effect on the commit protocol. If the coordinator and its participants belong to several partitions: Sites that are not in the partition containing the coordinator think the coordinator has failed, and execute the protocol to deal with failure of the coordinator. No harm results, but sites may still have to wait for the decision from the coordinator. The coordinator and the sites that are in the same partition as the coordinator think that the sites in the other partitions have failed, and follow the usual commit protocol. Again, no harm results
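The four recovery cases can be written as one decision function over the recovering site's log. This is a sketch under stated assumptions: the log is modelled as a plain list of record strings, and the function name `recover_participant` and the returned action labels are hypothetical.

```python
from typing import List


def recover_participant(log: List[str], txn: str) -> str:
    """Return what a recovering site S should do about transaction `txn`."""
    if f"<commit {txn}>" in log:
        return "redo"             # Case 1: the decision was commit and is final
    if f"<abort {txn}>" in log:
        return "undo"             # Case 2: the decision was abort and is final
    if f"<ready {txn}>" in log:
        return "ask_coordinator"  # Case 3: S voted ready but does not know the outcome
    # Case 4: S never voted ready, so the coordinator cannot have committed.
    # (This also covers a log that only contains <no T>.)
    return "undo"
```

The order of the checks matters: a log containing both `<ready T>` and `<commit T>` must be treated as Case 1, not Case 3.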
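The four coordinator-failure cases can be sketched as a single check over the logs of the participants that are still alive. The function name `decide_without_coordinator`, the dictionary of per-site logs, and the returned labels are assumptions made for illustration only.

```python
from typing import Dict, List


def decide_without_coordinator(txn: str, active_logs: Dict[str, List[str]]) -> str:
    """Decide the fate of `txn` using only the logs of the active sites."""
    logs = list(active_logs.values())
    if any(f"<commit {txn}>" in log for log in logs):
        return "commit"  # Case 1: some site already knows T committed
    if any(f"<abort {txn}>" in log for log in logs):
        return "abort"   # Case 2: some site already knows T aborted
    if any(f"<ready {txn}>" not in log for log in logs):
        return "abort"   # Case 3: a site never voted ready, so no commit was possible
    # Case 4: every active site voted ready and none knows the decision.
    return "block"       # must wait for the coordinator to recover
```

The `"block"` result is exactly the situation in which two-phase commit cannot make progress on its own.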
Three-phase commit Limitation of two-phase commit Blocking when the coordinator dies To overcome it, create a new phase called pre-commit Coordinator tells at least k sites that it wants to commit Thus now, 3 phases Phase 1 : Coordinator checks if T can commit, participants send their choice to the coordinator Phase 2 : Coordinator makes the decision If commit, send a pre-commit message to k sites If abort, send a message to everyone to abort Phase 3 : If commit, the final commit decision is broadcast and everyone commits
Three-phase commit What does 3-phase buy: If the coordinator fails, the participants can recover the commit decision from a pre-commit message and go on to commit If no pre-commit message is found, one can safely abort No blocking Limitations: No more than k sites can fail Otherwise, the pre-commit messages may be lost Network partition can cause problems The pre-commit messages may all reside in one partition Thus, not widely used
Concurrency control in distributed databases Modify concurrency control schemes for use in distributed environment. Assumptions: Each site participates in the execution of a commit protocol to ensure global transaction atomicity. Data item may be replicated at multiple sites However, updates (writes) have to be done on ALL the copies of an item
Locking protocols in distributed databases Two-phase locking based protocols Key questions: Who manages the locks? Centralized vs. distributed How many items to lock, in the case where data have copies at multiple sites? Tradeoff between efficiency and concurrency Efficiency includes the messages sent between sites
Locking protocols in distributed databases – centralized vs. distributed Centralized lock manager All lock requests for all items go to one site, Si Even if the item does not reside at that site When a transaction needs to lock a data item, it sends a lock request to Si and the lock manager determines whether the lock can be granted immediately If yes, the lock manager sends a message to the site which initiated the request If no, the request is delayed until it can be granted, at which time a message is sent to the initiating site
Locking protocols in distributed databases – centralized vs. distributed Centralized lock manager After obtaining the lock A transaction can read from any one site that contains the item A transaction must write to ALL sites that contain the item Advantages Simple to implement Simple deadlock handling Disadvantages Bottleneck at the lock manager Vulnerability: when the lock-manager site goes down, everything is blocked
Locking protocols in distributed databases – centralized vs. distributed Distributed lock manager Each site has its own lock manager to handle requests for items Need special protocols to access replicated data Advantages Distributed workload Fault-tolerant Disadvantages Deadlock handling complicated Potentially more messages.
Locking protocols in distributed databases – Distributed protocols Primary copy Choose one replica of data item to be the primary copy. Site containing the replica is called the primary site for that data item Different data items can have different primary sites When a transaction needs to lock a data item Q, it requests a lock at the primary site of Q. Implicitly gets lock on all replicas of the data item
Locking protocols in distributed databases – Distributed protocols Primary copy Benefit Concurrency control for replicated data handled similarly to unreplicated data - simple implementation. Drawback If the primary site of Q fails, Q is inaccessible even though other sites containing a replica may be accessible
Locking protocols in distributed databases – Distributed protocols Majority protocol Local lock manager at each site administers lock and unlock requests for data items stored at that site. When a transaction wishes to lock an unreplicated data item Q residing at site Si, a message is sent to Si's lock manager. If Q is locked in an incompatible mode, then the request is delayed until it can be granted. When the lock request can be granted, the lock manager sends a message back to the initiator indicating that the lock request has been granted.
Locking protocols in distributed databases – Distributed protocols Majority protocol In case of replicated data If Q is replicated at n sites, then a lock request message must be sent to more than half of the n sites in which Q is stored. The transaction does not operate on Q until it has obtained a lock on a majority of the replicas of Q. When writing the data item, transaction performs writes on all replicas.
Locking protocols in distributed databases – Distributed protocols Majority protocol Benefit Can be used even when some sites are unavailable (details on how to handle writes in the presence of site failures later) Drawback Requires 2(n/2 + 1) messages for handling lock requests, and (n/2 + 1) messages for handling unlock requests. Potential for deadlock even with a single item, e.g., each of 3 transactions may hold locks on 1/3rd of the replicas of a data item. Can be overcome by requesting locks on the replicas in a predetermined order of sites
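The message counts above can be checked with a few lines of arithmetic. This sketch assumes that n/2 means integer division (so a majority of n replicas is n // 2 + 1) and that each lock request costs one request message plus one grant message; the function names are hypothetical.

```python
def majority(n: int) -> int:
    """Smallest number of replicas that is more than half of n."""
    return n // 2 + 1


def lock_messages(n: int) -> int:
    """One request and one grant message per site in the majority."""
    return 2 * majority(n)


def unlock_messages(n: int) -> int:
    """One unlock message per site in the majority."""
    return majority(n)


def may_operate(locks_held: int, n: int) -> bool:
    """A transaction may operate on Q only once it holds a majority of locks."""
    return locks_held >= majority(n)


if __name__ == "__main__":
    for n in (3, 4, 5):
        print(n, majority(n), lock_messages(n), unlock_messages(n))
```

With n = 3 replicas, three transactions that each hold one lock all fail `may_operate(1, 3)` and none can obtain a second lock, which is the single-item deadlock described on the slide.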
Locking protocols in distributed databases – Distributed protocols Biased protocol (read-once, write-all) Local lock manager at each site as in majority protocol, however, requests for shared locks are handled differently than requests for exclusive locks. Shared locks. When a transaction needs to lock data item Q, it simply requests a lock on Q from the lock manager at one site containing a replica of Q. Exclusive locks. When transaction needs to lock data item Q, it requests a lock on Q from the lock manager at all sites containing a replica of Q. Advantage - imposes less overhead on read operations. Disadvantage - additional overhead on writes
Locking protocols in distributed databases – Distributed protocols Quorum Consensus Protocol A generalization of both majority and biased protocols Each site is assigned a weight. Let S be the total of all site weights Choose two values read quorum Qr and write quorum Qw Such that Qr + Qw > S and 2 * Qw > S Quorums can be chosen (and S computed) separately for each item Each read must lock enough replicas that the sum of the site weights is >= Qr Each write must lock enough replicas that the sum of the site weights is >= Qw For now we assume all replicas are written Extensions to allow some sites to be unavailable described later
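The two quorum conditions can be checked mechanically. The sketch below assumes site weights are given as a dictionary; the names `valid_quorums` and `has_quorum` are hypothetical. It also illustrates how the majority and biased protocols arise as special cases when every site has weight 1.

```python
from typing import Dict, Iterable


def valid_quorums(weights: Dict[str, int], qr: int, qw: int) -> bool:
    """Check Qr + Qw > S and 2 * Qw > S, where S is the total site weight."""
    s = sum(weights.values())
    return qr + qw > s and 2 * qw > s


def has_quorum(locked_sites: Iterable[str], weights: Dict[str, int], quorum: int) -> bool:
    """True if the locked replicas carry at least `quorum` total weight."""
    return sum(weights[site] for site in locked_sites) >= quorum


# Five replicas, each with weight 1, so S = 5.
weights = {f"S{i}": 1 for i in range(1, 6)}

# Majority protocol as a special case: Qr = Qw = 3.
assert valid_quorums(weights, qr=3, qw=3)
# Biased protocol as a special case: read one copy, write all copies.
assert valid_quorums(weights, qr=1, qw=5)
# Invalid choice: a read of 2 and a write of 3 need not overlap.
assert not valid_quorums(weights, qr=2, qw=3)
```

The first condition forces every read quorum to overlap every write quorum; the second forces any two write quorums to overlap, so two writers cannot both hold a write quorum on the same item.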
Deadlocks in distributed databases Deadlock can occur in distributed databases Even worse, deadlocks can be distributed Consider the following two transactions and history, with item X and transaction T1 at site 1, and item Y and transaction T2 at site 2: T1: write (X) write (Y) T2: write (Y) write (X)
Deadlocks in distributed databases However, the following schedule can occur:
T1: X-lock(X); Write(X); X-lock(Y) -- wait
T2: X-lock(Y); Write(Y); X-lock(X) -- wait
Now there is a deadlock between T1 and T2 However, at site 1, the only thing happening is T1 waiting for T2 At site 2, the only thing happening is T2 waiting for T1 So no deadlock is detected at individual sites
Deadlocks in distributed databases Deadlock detection needs to be more careful Local wait-for graph constructed at each site Global wait-for graph combines the information from each site Deadlock is detected from the global wait-for graph Notice that no cycle in any local wait-for graph does not imply no cycle in the global wait-for graph
Deadlocks in distributed databases [Figure: local wait-for graphs and the corresponding global wait-for graph]
Deadlocks in distributed databases A global wait-for graph is constructed and maintained at a single site: the deadlock-detection coordinator Real graph: the real, but unknown, state of the system. Constructed graph: the approximation generated by the coordinator during the execution of its algorithm. The real graph can be unknown due to Network delays (changes are not propagated) Network partition
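The point that acyclic local graphs can still combine into a cyclic global graph can be demonstrated directly. This is a minimal sketch: edges are modelled as (waiter, holder) pairs, the global graph is taken to be the union of the local edge sets, and the function name `has_cycle` is hypothetical.

```python
from typing import Dict, Set, Tuple

Edge = Tuple[str, str]  # (waiting transaction, transaction it waits for)


def has_cycle(edges: Set[Edge]) -> bool:
    """Depth-first search for a cycle in a wait-for graph."""
    graph: Dict[str, Set[str]] = {}
    for waiter, holder in edges:
        graph.setdefault(waiter, set()).add(holder)
        graph.setdefault(holder, set())

    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    colour = {node: WHITE for node in graph}

    def visit(node: str) -> bool:
        colour[node] = GREY
        for nxt in graph[node]:
            if colour[nxt] == GREY:
                return True  # back edge: nxt is on the current path
            if colour[nxt] == WHITE and visit(nxt):
                return True
        colour[node] = BLACK
        return False

    return any(colour[node] == WHITE and visit(node) for node in graph)


# Local graphs for the T1/T2 schedule: each site sees a single wait edge.
site1 = {("T1", "T2")}
site2 = {("T2", "T1")}
global_graph = site1 | site2

print(has_cycle(site1), has_cycle(site2), has_cycle(global_graph))
```

Neither local graph contains a cycle, but their union does, so only a detector that sees the combined graph can find this deadlock.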
Deadlocks in distributed databases the global wait-for graph can be constructed when: a new edge is inserted in or removed from one of the local wait-for graphs. a number of changes have occurred in a local wait-for graph. the coordinator needs to invoke cycle-detection. If the coordinator finds a cycle, it selects a victim and notifies all sites. The sites roll back the victim transaction.
Deadlocks in distributed databases Limitations: false cycles Suppose the local wait-for graphs are as shown on the right-hand side: Now suppose T2 releases the resources at S1 The edge from T1 to T2 should be deleted Then T2 requests resources held by T3 at S2 The edge from T2 to T3 should be added at site 2 If the second message arrives before the first, a deadlock is detected when in fact there is none This can be avoided if (global) 2-phase locking is maintained
Timestamp ordering in distributed databases Timestamp-based techniques can be used in distributed databases Main issue: how to generate unique timestamps for transactions across multiple sites Solution: Each site generates a unique local timestamp using either a logical counter or the local clock. The globally unique timestamp is obtained by concatenating the unique local timestamp with the unique site identifier.
Timestamp ordering in distributed databases A site with a slow clock will assign smaller timestamps Still logically correct: serializability is not affected But: it “disadvantages” transactions To fix this problem Define within each site Si a logical clock (LCi), which generates the unique local timestamp Require that Si advance its logical clock whenever a request is received from a transaction Ti with timestamp <x,y> and x is greater than the current value of LCi, i.e. whenever a site sees a timestamp that is larger than its clock, it advances its clock accordingly In this case, site Si advances its logical clock to the value x + 1.
Issues with replication Replication is useful in distributed databases Data at multiple sites gives lower access times Data warehouses In some cases, no need to access the most recent version of the data However, it may have an adverse effect on consistency/isolation Reading different versions of the data can lead to non-serializability
Issues with replication E.g.: master-slave replication: updates are performed at a single “master” site, and propagated to “slave” sites. Propagation is not part of the update transaction: it is decoupled May be immediately after the transaction commits May be periodic Data may only be read at slave sites, not updated No need to obtain locks at any remote site Particularly useful for distributing information E.g. from central office to branch office Also useful for running read-only queries offline from the main database
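The timestamp scheme and the clock-advance rule can be sketched as follows. The sketch assumes a logical counter (not a wall clock) as the local timestamp source and represents the global timestamp <x, y> as the tuple (local timestamp, site id), with the site id in the less significant position so that tuples compare by local timestamp first. The class name `Site` and its method names are hypothetical.

```python
from typing import Tuple

Timestamp = Tuple[int, int]  # (local timestamp x, site identifier y)


class Site:
    def __init__(self, site_id: int) -> None:
        self.site_id = site_id
        self.clock = 0  # logical clock LCi

    def new_timestamp(self) -> Timestamp:
        """Generate a globally unique timestamp for a new transaction."""
        self.clock += 1
        return (self.clock, self.site_id)

    def observe(self, timestamp: Timestamp) -> None:
        """Advance LCi to x + 1 when a request carries a larger x."""
        x, _ = timestamp
        if x > self.clock:
            self.clock = x + 1


fast, slow = Site(1), Site(2)
for _ in range(5):
    t_fast = fast.new_timestamp()   # fast site has issued up to (5, 1)
slow.observe(t_fast)                # slow site catches up on first contact
t_slow = slow.new_timestamp()
print(t_fast, t_slow, t_slow > t_fast)
```

Two sites can never produce the same tuple because the site identifiers differ, and after `observe` the slow site no longer hands out timestamps that lag behind what it has already seen.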
Issues with replication Replicas should see a transaction-consistent snapshot of the database That is, a state of the database reflecting all effects of all transactions up to some point in the serialization order, and no effects of any later transactions. E.g. Oracle provides a create snapshot statement to create a snapshot of a relation or a set of relations at a remote site snapshot refresh either by recomputation or by incremental update Automatic refresh (continuous or periodic) or manual refresh
Issues with replication With multimaster replication (also called update-anywhere replication) updates are permitted at any replica, and are automatically propagated to all replicas Basic model in distributed databases, where transactions are unaware of the details of replication, and database system propagates updates as part of the same transaction Coupled with 2 phase commit Many systems support lazy propagation where updates are transmitted after transaction commits Allow updates to occur even if some sites are disconnected from the network, but at the cost of consistency
Issues with replication Two approaches to lazy propagation Updates at any replica are translated into an update at the primary site, and then propagated back to all replicas Updates to an item are ordered serially But transactions may read an old value of an item and use it to perform an update, resulting in non-serializability Updates are performed at any replica and propagated to all other replicas Causes even more serialization problems: the same data item may be updated concurrently at multiple sites!
Issues with replication Conflict detection is a problem Some conflicts due to lack of distributed concurrency control can be detected when updates are propagated to other sites (will see later, in Section 23.5.4) Conflict resolution is very messy Resolution may require committed transactions to be rolled back Durability violated Automatic resolution may not be possible, and human intervention may be required