Replication and Consistency
References
- Douglas B. Terry, Karin Petersen, Mike J. Spreitzer, and Marvin M. Theimer. "The Case for Non-transparent Replication: Examples from Bayou." IEEE Data Engineering, December 1998.
- Douglas B. Terry, Marvin M. Theimer, Karin Petersen, Alan J. Demers, Mike J. Spreitzer, and Carl H. Hauser. "Managing Update Conflicts in Bayou, a Weakly Connected Replicated Storage System." In ACM Symposium on Operating Systems Principles (SOSP '95).
ACID Transaction
- A transaction has to show up everywhere at the same time or not at all.
  - e.g., when you withdraw cash from an ATM, the balance should reflect the money actually left. If it does not, you could go to a store, use your ATM card, and withdraw cash that you do not have.
  - e.g., when you make an airline reservation and the system assigns you a seat, you expect that seat to be available to you (although airlines do overbook).
Replication and Availability
- Replication is a powerful tool that allows us to trade off availability against consistency.
- Applications need different levels of consistency.
- Applications know best how to deal with inconsistency.
Application 1: Meeting room scheduler
- Suppose we have two conference rooms of the same capacity. I want to schedule my meeting in one of the conference rooms. I don't care which exact room it is.
- If two people reserve the same room at the same time, there is a conflict, but if they reserve the same room at different times or reserve different rooms at the same time, there is no conflict.
[Figure: reservations for Rm1 and Rm2 along a time axis; no conflict]
Application 1: Meeting room scheduler
[Figure: reservations for Rm1 and Rm2 along a time axis; one case with no conflict, one case with a conflict]
Application 1: Meeting room scheduler
- We can lock the entire database.
  - This is not needed when there is no conflict.
  - In the case of conflicts, there is an application-specific way to deal with the conflict: we move one reservation to the other room.
  - If the other room is also reserved, we ask the user, who can easily move the reservation to another acceptable time.
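The conflict rule and the "move to the other room" policy above can be sketched in a few lines of Python. This is only an illustration: the `Reservation` fields, the integer hours, and the `reserve` helper are assumptions made for the example, not part of Bayou.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reservation:
    room: str
    start: int   # hypothetical: hour of day, inclusive
    end: int     # hypothetical: hour of day, exclusive
    owner: str


def conflicts(a, b):
    """Two reservations conflict only if they share a room and overlap in time."""
    return a.room == b.room and a.start < b.end and b.start < a.end


def reserve(schedule, request, rooms=("Rm1", "Rm2")):
    """Try the requested room first, then the other rooms.

    Returns the reservation that was placed, or None when every room is
    taken for that time and the user has to be asked for another time.
    """
    candidates = [request.room] + [r for r in rooms if r != request.room]
    for room in candidates:
        attempt = Reservation(room, request.start, request.end, request.owner)
        if not any(conflicts(attempt, existing) for existing in schedule):
            schedule.append(attempt)
            return attempt
    return None
```

No lock on the whole schedule is needed in the common case; the application-specific policy only runs when `conflicts` actually finds an overlap.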
Application 2: Shared mailbox
- Shared mailbox folders, shared between me and my 2 TAs.
- We all replicate the mailbox.
  - OP1: I see a mail from the class, respond to it, and delete it.
  - OP2: The TA sees the same mail and files it in CS402.
  - OP3: I see an email from a friend and file it as important.
[Figure: mailbox with folders CS402, Important, Recruiting, Chocolate]
Application 2: Shared Mailbox
- All of us operate on the same mailbox.
- You can lock the entire mailbox before someone operates on it.
  - This can't work when disconnected.
  - It is clearly not necessary for doing only one operation.
  - For operations OP1 and OP2, it is not clear who should win: should the mail be deleted, or should it be filed in CS402?
Two Approaches to Building Replicated Services
- Transparent replication system:
  - Allows systems that were developed assuming a central file system or database to run unchanged on top of a strongly consistent replicated storage system (as seen in OceanStore).
- Non-transparent replication system:
  - Relaxed consistency model: access-update-anywhere.
  - Applications are involved in conflict detection and resolution. Hence applications need to be modified (e.g., Bayou, the Coda file system).
Hypothesis
- Applications know best how to resolve conflicts.
- The challenge is providing the right interface to support cooperation between applications and their data managers.
  - Programmers do not want to deal with propagating updates or ensuring eventual consistency.
    - Anyone who has synchronized project files between school, work, and home can feel the pain.
  - Programmers want to set replication schedules and control how conflicts are detected and resolved.
    - Record-level conflict detection rather than file-level.
Bayou
- Update-anywhere replication model
  - Bayou manages databases that can be fully replicated at any number of sites.
  - Applications can read and write to any single replica of the database (lazy group update).
  - Once a replica accepts a Write operation, the Write is performed locally and propagated to all other replicas via a pair-wise reconciliation protocol.
Conflict Detection: Dependency Checks
- Each Write operation includes a dependency check consisting of an application-supplied query and its expected result.
- If the check fails, then the requested update is not performed and the server invokes a procedure to resolve the detected conflict.
Example of Bayou Write
- A Bayou Write is a 3-tuple: Update, Dependency check, Mergeproc.
- Sometimes users like conflicts.
  - A different merge procedure altogether could search for the next available time slot to schedule the meeting, which is an option a user might choose if any time would be satisfactory.
Conflict Resolution: Merge Procedure
- In practice, Bayou merge procedures are written by application programmers in the form of templates that are instantiated with the appropriate details filled in for each Write.
- In the case where automatic resolution is not possible, the merge procedure will still run to completion, but is expected to produce a revised update that logs the detected conflict in some fashion that will enable a person to resolve the conflict later.
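A minimal Python sketch of a Write as an update, a dependency check (query plus expected result), and a merge procedure, applied to the meeting room example. The `BayouWrite` class, the dictionary-based database, and the `reservation_write` template are hypothetical names chosen for illustration; real Bayou writes are not expressed in Python.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class BayouWrite:
    update: Callable[[dict], None]     # the requested update
    dep_query: Callable[[dict], Any]   # application-supplied query
    expected: Any                      # expected result of the query
    mergeproc: Callable[[dict], None]  # runs when the check fails


def apply_write(db, write):
    """Perform the update only if the dependency check passes."""
    if write.dep_query(db) == write.expected:
        write.update(db)
        return "applied"
    write.mergeproc(db)
    return "merged"


def overlapping(db, room, start, end):
    return [r for r in db["reservations"]
            if r["room"] == room and r["start"] < end and start < r["end"]]


def reservation_write(owner, room, start, end, alternates):
    """Template instantiated with the details of one reservation request."""
    def add(room_, start_, end_):
        return lambda db: db["reservations"].append(
            {"owner": owner, "room": room_, "start": start_, "end": end_})

    def mergeproc(db):
        # Try each alternate slot; if none is free, log the conflict so
        # that a person can resolve it later.
        for alt_room, alt_start, alt_end in alternates:
            if not overlapping(db, alt_room, alt_start, alt_end):
                add(alt_room, alt_start, alt_end)(db)
                return
        db["conflict_log"].append(
            {"owner": owner, "room": room, "start": start, "end": end})

    return BayouWrite(
        update=add(room, start, end),
        dep_query=lambda db: overlapping(db, room, start, end),
        expected=[],   # the check passes when nothing overlaps
        mergeproc=mergeproc,
    )
```

Both the check and the merge procedure depend only on the database contents and the Write itself, so every server that applies the same Writes in the same order reaches the same result.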
Replica Management
- Replicas held by two servers at any time may vary in their contents because they have received and processed different Writes. However, this fundamental property is satisfied:
  - The Bayou system guarantees that all servers eventually receive all Writes via the pair-wise anti-entropy process and that two servers holding the same set of Writes will have the same data contents. It cannot enforce strict bounds on Write propagation delays, since these depend on network connectivity factors that are outside of Bayou's control.
Replica Consistency
- Bayou has two features that allow servers to achieve eventual consistency.
  - Writes are performed in the same, well-defined order at all servers (global ordering).
  - Conflict detection and merge procedures are deterministic.
Replica Consistency
- When a Write is accepted by a Bayou server from a client, it is deemed tentative.
- Tentative Writes are ordered according to timestamps assigned to them by their accepting servers.
- Eventually, each Write is committed by the anti-entropy process that will be described shortly.
- Timestamps for tentative Writes must monotonically increase at each server.
- Servers do not have to have synchronized clocks.
Replica Consistency
- Consistency is potentially an issue because servers may receive Writes from clients and from other servers in an order that differs from the required execution order, and because servers immediately apply all known Writes to their replicas.
- This implies that there must be support for undoing Writes (using write logs) and reapplying them.
- Each server maintains a log of all Write operations that it has received, sorted by their committed or tentative timestamps, with committed Writes at the head of the log.
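The log ordering and the undo/reapply step can be sketched as follows. This is a simplified illustration, not Bayou's implementation: the `Write` and `Replica` classes are hypothetical, each update is assumed to be a plain "set key to value", and undo is modeled by rebuilding the database from an empty state rather than rolling back individual Writes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Write:
    server_id: str
    accept_stamp: int
    op: tuple                  # hypothetical update: (key, value)
    csn: Optional[int] = None  # None while the Write is tentative


def log_order(w):
    # Committed Writes sit at the head of the log, ordered by commit
    # sequence number; tentative Writes follow, ordered by accept
    # timestamp with the server id as a tie-breaker.
    if w.csn is not None:
        return (0, w.csn, "")
    return (1, w.accept_stamp, w.server_id)


class Replica:
    def __init__(self):
        self.log = []
        self.db = {}

    def receive(self, w):
        self.log.append(w)
        self._reorder_and_replay()

    def commit(self, server_id, accept_stamp, csn):
        for w in self.log:
            if (w.server_id, w.accept_stamp) == (server_id, accept_stamp):
                w.csn = csn
        self._reorder_and_replay()

    def _reorder_and_replay(self):
        self.log.sort(key=log_order)
        self.db = {}              # "undo" everything ...
        for w in self.log:        # ... then reapply in log order
            key, value = w.op
            self.db[key] = value
```

A Write that arrives late or gets committed can move earlier in the log, so the visible result of a tentative Write may change until it is committed.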
Anti-Entropy
- Entropy: a process of degradation or running down, or a trend to disorder.
- Anti-entropy brings two replicas up to date.
- Three major design decisions
  - Pairwise communication between replicas
  - Exchange of update operations
  - Ordered propagation of operations
[Figure: pair-wise reconciliation between replicas leads to eventual consistency; the global commit order is assigned by the primary server]
Example
- Suppose a user keeps the primary copy of his calendar with him on his laptop and allows others, such as a spouse or secretary, to keep secondary (mostly-read) copies.
- The user's updates to his own calendar are committed immediately.
- Updates by the spouse or secretary are tentative until anti-entropy takes place with the user. At that point, the user can commit them and propagate the commit order back to the spouse or secretary during anti-entropy.
Basic Anti-Entropy
- Protocol:
  - Runs between pairs of servers.
  - The propagation of writes is constrained by the accept order.
- Prefix property: a server R that holds a stamped write W_i that was initially accepted by another server X will also hold all writes accepted by X prior to W_i.
Basic Anti-Entropy
- Protocol
  - R.V denotes R's version vector; it is used to determine which writes are unknown to the receiving server R.

    anti-entropy(S, R) {
      Get R.V from receiving server R
      # now send all the writes unknown to R
      w = first write in S.write-log
      while (w) do
        if R.V(w.server-id) < w.accept-stamp then
          # w is new for R
          SendWrite(R, w)
        end
        w = next write in S.write-log
      end
    }
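A runnable Python sketch of the basic protocol, under stated assumptions: `Server` and `Write` are hypothetical classes, accept stamps start at 1 (so 0 means "nothing seen"), and the network is replaced by a direct call to `receiver.receive`.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Write:
    server_id: str      # server that first accepted the write
    accept_stamp: int   # timestamp assigned by that server
    payload: str


@dataclass
class Server:
    name: str
    write_log: list = field(default_factory=list)
    # Version vector: server_id -> highest accept stamp seen from it.
    version: dict = field(default_factory=dict)

    def accept(self, stamp, payload):
        """Accept a new write from a client."""
        self.receive(Write(self.name, stamp, payload))

    def receive(self, w):
        self.write_log.append(w)
        self.version[w.server_id] = max(
            self.version.get(w.server_id, 0), w.accept_stamp)


def anti_entropy(sender, receiver):
    """Send every write in the sender's log that the receiver lacks.

    Returns the number of writes sent.
    """
    rv = dict(receiver.version)   # "Get R.V from receiving server R"
    sent = 0
    for w in sender.write_log:    # log order preserves the prefix property
        if rv.get(w.server_id, 0) < w.accept_stamp:
            receiver.receive(w)   # stands in for SendWrite(R, w)
            sent += 1
    return sent
```

Running `anti_entropy(s, r)` and then `anti_entropy(r, s)` leaves both servers with the same version vector; repeating either call sends nothing further.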
Basic Anti-Entropy
- Anti-entropy is incremental.
- When a new write arrives at the receiver, it can be immediately included in the receiver's write-log because the sending replica ensures that the receiving server will hold all writes necessary to satisfy the prefix property.
- Reconciliation between two replicas can make progress independently of where the protocol may get interrupted due to network failures or voluntary disconnections.
- The protocol does not address the issue of the growing size of write logs.
Effective Write-Log Management
- Storage is a concern.
- We want to be able to prune the prefix of the write logs.
- A protocol is needed to stabilize writes (we look at a primary-commit protocol).
- The primary replica commits writes and assigns each one a monotonically increasing commit sequence number (CSN).
- Committed writes are totally ordered.
- Propagation:
  - First send the committed writes.
  - Second send the tentative writes.
Anti-Entropy with Support for Committed Writes

    anti-entropy(S, R) {
      Get R.V and R.CSN from receiving server R
      # first send all the committed writes that R does not know about
      if R.CSN < S.CSN then
        w = first committed write that R does not know about
        while (w) do
          if w.accept-stamp <= R.V(w.server-id) then
            # R has the write, but does not know it is committed
            SendCommitNotification(R, w.accept-stamp, w.server-id, w.CSN)
          else
            SendWrite(R, w)
          end
          w = next committed write in S.write-log
        end
      end
      # now send all the tentative writes
      w = first tentative write in S.write-log
      while (w) do
        if R.V(w.server-id) < w.accept-stamp then
          SendWrite(R, w)
        end
        w = next write in S.write-log
      end
    }
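The two-phase propagation can be sketched in Python as a function that returns the messages S would send to R. The `Server` and `Write` classes and the message tuples are assumptions made for this illustration; the sender's log is assumed to hold committed writes first, in CSN order, followed by tentative writes.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Write:
    server_id: str
    accept_stamp: int
    csn: Optional[int] = None   # None while tentative


@dataclass
class Server:
    version: dict = field(default_factory=dict)    # version vector (R.V)
    csn: int = 0                                   # highest CSN known
    write_log: list = field(default_factory=list)  # committed, then tentative


def anti_entropy(s, r):
    """Return the list of messages that sender s would send to receiver r."""
    messages = []
    rv, rcsn = r.version, r.csn
    # Phase 1: committed writes that r does not know to be committed.
    if rcsn < s.csn:
        for w in s.write_log:
            if w.csn is None or w.csn <= rcsn:
                continue
            if w.accept_stamp <= rv.get(w.server_id, 0):
                # r already holds the write as tentative; a short
                # notification is enough.
                messages.append(("commit", w.server_id, w.accept_stamp, w.csn))
            else:
                messages.append(("write", w.server_id, w.accept_stamp, w.csn))
    # Phase 2: tentative writes unknown to r.
    for w in s.write_log:
        if w.csn is None and rv.get(w.server_id, 0) < w.accept_stamp:
            messages.append(("write", w.server_id, w.accept_stamp, None))
    return messages
```

The commit notification saves bandwidth: a write that R already holds tentatively is not shipped again, only its CSN.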
Effective Write-Log Management
- It is necessary to allow replicas to truncate any prefix of the committed (stable) part of the write log when there is a need.
- Implication: a write-log may not hold enough writes to allow incremental reconciliation with another replica.
- A commit sequence number is maintained for the omitted part of the log.
- A vector characterizing the omitted prefix of the server's write-log is also maintained.
- If the commit sequence number of the receiver is less than the omitted sequence number of the sending server, then a (perhaps full) database transfer occurs.
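A small Python sketch of the bookkeeping described above. The `Log` class, its field names (`osn` for the omitted sequence number, `omitted` for the omitted vector), and the tuple layout of log entries are hypothetical choices for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Log:
    # Committed writes as (csn, server_id, accept_stamp), in CSN order.
    committed: list = field(default_factory=list)
    osn: int = 0                                  # CSN of the last omitted write
    omitted: dict = field(default_factory=dict)   # omitted vector

    def truncate(self, count):
        """Drop the first `count` committed writes, remembering what was dropped."""
        dropped = self.committed[:count]
        self.committed = self.committed[count:]
        for csn, server_id, stamp in dropped:
            self.osn = max(self.osn, csn)
            self.omitted[server_id] = max(
                self.omitted.get(server_id, 0), stamp)


def needs_full_transfer(sender_log, receiver_csn):
    """True when the receiver is missing writes the sender has already discarded."""
    return receiver_csn < sender_log.osn
```

A receiver whose CSN is at least the sender's omitted sequence number can still be brought up to date incrementally from the remaining log.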
Access Control
- Certificates: grant, delegate, and revoke.
- No assumptions about trust.
- Mutual authentication and access control are based on public-key cryptography.
- Every user possesses a public/private key pair and a set of digitally signed access control certificates granting the user access to various data collections.
Performance
- Size is acceptable.
- Write performance is acceptable.
Future
- Partial databases
  - Carry part of the database instead of the entire database (mobile clients do not have enough storage space).
  - The problem: if a client does not have a particular record, is it because it did not replicate that part, or because it did not know about it?
Technology Impact
- TrueSync: end-to-end synchronization software and infrastructure solutions for the wireless Internet.
  - SyncML is the common language for synchronizing all devices and applications over any network.
  - Ericsson, IBM, Lotus, Motorola, Nokia, Palm Inc., Psion, Starfish Software, etc. (614 companies)
Conclusions
- Differences from other replicated systems
  - Non-transparency
  - Application-specific conflict detection
  - Per-write conflict resolvers
  - Partial and multi-object updates
  - Tentative and stable resolutions
  - Security
- Future goals
  - Partial replication, policies for choosing servers for anti-entropy, building servers with conventional database managers, alternate data models, and finer-grained access control.