Slides for Chapter 14: Distributed transactions From Coulouris, Dollimore and Kindberg Distributed Systems: Concepts and Design Edition 4, © Addison-Wesley 2005
Commitment of dist. trans. - intro
A distributed transaction is a flat or nested transaction that accesses objects managed by multiple servers. Atomicity must still be preserved: a process on one of the servers acts as coordinator, and it must ensure the same outcome at all of the servers. The 'two-phase commit protocol' is the most commonly used protocol for achieving this. We will also discuss concurrency control for distributed transactions and recovery of distributed transactions.
Instructor's Guide for Coulouris, Dollimore and Kindberg Distributed Systems: Concepts and Design Edn. 3 © Addison-Wesley Publishers 2000
Fig 14.1 Distributed transactions. [Diagram: (a) a flat transaction, in which the client's transaction T invokes operations on servers X, Y and Z directly; (b) nested transactions, in which T spawns subtransactions T1 and T2, which in turn spawn sub-subtransactions T11, T12, T21 and T22 on servers such as M, N and P.]
Fig. 14.2 Nested banking transaction. Accounts A, B, C and D are held at servers X, Y and Z. The client's transaction:
T = openTransaction
  openSubTransaction a.withdraw(10);
  openSubTransaction b.withdraw(20);
  openSubTransaction c.deposit(10);
  openSubTransaction d.deposit(20);
closeTransaction
Fig. 14.3 A distributed banking transaction. The client's transaction:
T = openTransaction
  a.withdraw(4);
  c.deposit(4);
  b.withdraw(3);
  d.deposit(3);
closeTransaction
Accounts A, B, C and D are held at BranchX, BranchY and BranchZ; each branch server has a participant that joins the transaction when first asked to do work for it (e.g. on b.withdraw(T, 3)). Note: the coordinator is in one of the servers, e.g. BranchX. Distributed commitment needs new protocols with new failure modes; the standard one is the two-phase commit protocol, sometimes shortened to 2PC. The participants are like local coordinators that communicate with the overall coordinator.
Fig. 14.4 Ops for 2PC
canCommit?(trans) -> Yes / No: call from coordinator to participant to ask whether it can commit a transaction. Participant replies with its vote.
doCommit(trans): call from coordinator to participant to tell participant to commit its part of a transaction.
doAbort(trans): call from coordinator to participant to tell participant to abort its part of a transaction.
haveCommitted(trans, participant): call from participant to coordinator to confirm that it has committed the transaction.
getDecision(trans) -> Yes / No: call from participant to coordinator to ask for the decision on a transaction after it has voted Yes but has still had no reply after some delay. Used to recover from server crash or delayed messages.
Notes: the coordinator object can live on any one of the participating machines. The coordinator goes to each participant and asks "can you commit?"; if any answers No, it's an abort for the whole transaction. A Yes vote means the participant is not only prepared to commit but has made the prepared state persistent. If all say Yes, a command from the coordinator goes to all participants and says go ahead. Database recovery files get really big and can run out of disk space, so the transaction manager tries to clean up the log file whenever possible; checkpointing is a way to truncate a lot of the log file, since we don't care about the detailed history, only the states of the persistent objects. The haveCommitted call is for that purpose. BUT messages get lost and machines crash. If a participant says "Yes" and the coordinator machine crashes, the participant stays in the uncertain state. For how long? Could it just time out and abort? Before voting, aborting a transaction is OK: that's not a safety problem but a liveness problem, because you can come around and try again, and you are never left in an inconsistent state; that's what's important. After voting Yes, though, the participant is uncertain: if the coordinator machine goes down and the participant knows the other participants, it can ask whether they got a doCommit message. If one of them did, it can go ahead and commit even though it didn't get a doCommit message itself.
If the coordinator is up, a participant can ask it what the decision was on the transaction (getDecision).
Fig. 14.5 Two-phase commit protocol Phase 1 (voting phase): 1. The coordinator sends a canCommit? request to each of the participants in the transaction. 2. When a participant receives a canCommit? request it replies with its vote (Yes or No) to the coordinator. Before voting Yes, it prepares to commit by saving objects in permanent storage. If the vote is No the participant aborts immediately.
Fig. 14.5 Two-phase commit protocol Phase 2 (completion according to outcome of vote): 3. The coordinator collects the votes (including its own). (a) If there are no failures and all the votes are Yes the coordinator decides to commit the transaction and sends a doCommit request to each of the participants. (b) Otherwise the coordinator decides to abort the transaction and sends doAbort requests to all participants that voted Yes. 4. Participants that voted Yes are waiting for a doCommit or doAbort request from the coordinator. When a participant receives one of these messages it acts accordingly and in the case of commit, makes a haveCommitted call as confirmation to the coordinator.
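The decision logic of the two phases above can be sketched in a few lines. This is a minimal single-process simulation, not the book's code; the class and function names are illustrative, and all messaging is reduced to plain method calls:

```python
# Sketch of 2PC decision logic (illustrative names; messaging elided).

class Participant:
    def __init__(self, name, will_vote_yes):
        self.name = name
        self.will_vote_yes = will_vote_yes
        self.state = "active"

    def canCommit(self, trans):
        if self.will_vote_yes:
            self.state = "prepared"    # objects saved to permanent storage
            return "Yes"
        self.state = "aborted"         # a No voter aborts immediately
        return "No"

    def doCommit(self, trans):
        self.state = "committed"

    def doAbort(self, trans):
        self.state = "aborted"

def two_phase_commit(trans, participants):
    # Phase 1 (voting): collect a vote from every participant.
    votes = [p.canCommit(trans) for p in participants]
    # Phase 2 (completion): commit only if every vote was Yes.
    if all(v == "Yes" for v in votes):
        for p in participants:
            p.doCommit(trans)
        return "committed"
    for p in participants:
        if p.state == "prepared":      # only Yes voters need a doAbort
            p.doAbort(trans)
    return "aborted"
```

A single No vote forces the whole transaction to abort, exactly as in step 3(b) above.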
Fig. 14.6 Communication in 2PC protocol. [Diagram: step 1, the coordinator sends canCommit? and waits for the votes; step 2, the participant replies Yes (it is prepared to commit; its status is uncertain); step 3, the coordinator sends doCommit (its status becomes committed); step 4, the participant sends haveCommitted (its status becomes committed, and the coordinator is then done).]
Summary of 2PC
A distributed transaction involves several different servers. A nested transaction structure allows additional concurrency and independent committing by the servers in a distributed transaction. Atomicity requires that the servers participating in a distributed transaction either all commit it or all abort it. continued ...
Summary of 2PC (continued)
Atomic commit protocols are designed to achieve this effect, even if servers crash during their execution. The 2PC protocol allows a server to abort unilaterally. It includes timeout actions to deal with delays due to servers crashing. The 2PC protocol can take an unbounded amount of time to complete but is guaranteed to complete eventually.
14.5 Distributed deadlocks
Single-server transactions can experience deadlocks. We can either prevent deadlocks or detect and resolve them; the use of timeouts is clumsy, so detection is preferable. Detection uses wait-for graphs. Distributed transactions lead to distributed deadlocks. In theory we can construct the global wait-for graph from the local ones; a cycle in the global wait-for graph that is not in any local one is a distributed deadlock.
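The "merge local graphs, look for a cycle" idea can be sketched as follows. This is an illustrative centralized sketch, not any particular distributed detection algorithm from the chapter; graphs map each transaction to the set of transactions it waits for:

```python
# Sketch: detecting a distributed deadlock by merging local wait-for graphs
# and running ordinary DFS cycle detection on the global graph.

def merge(*local_graphs):
    """Union the edge sets {T: set of transactions T waits for}."""
    global_wfg = {}
    for g in local_graphs:
        for t, waits_for in g.items():
            global_wfg.setdefault(t, set()).update(waits_for)
    return global_wfg

def has_cycle(graph):
    """Standard DFS cycle detection over a wait-for graph."""
    visiting, done = set(), set()

    def dfs(node):
        visiting.add(node)
        for nxt in graph.get(node, ()):
            if nxt in visiting:
                return True            # back edge: a cycle, i.e. deadlock
            if nxt not in done and dfs(nxt):
                return True
        visiting.discard(node)
        done.add(node)
        return False

    return any(dfs(t) for t in graph if t not in done)
```

With U waiting for V on server X, V for W on Y, and W for U on Z, no local graph has a cycle, but the merged global graph does: a distributed deadlock.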
14.6 Transaction recovery
The atomicity property of transactions has two aspects: durability and failure atomicity. Durability requires that objects are saved in permanent storage and will be available indefinitely. Failure atomicity requires that the effects of transactions are atomic even when the server crashes. Database servers often just load objects into volatile memory when they are accessed.
14.6 Transaction recovery (continued)
Recovery is concerned with ensuring that a server's objects are durable and that the service provides failure atomicity. For simplicity we assume that when a server is running, all of its objects are in volatile memory and all of its committed objects are in a recovery file in permanent storage. Recovery consists of restoring the server with the latest committed versions of all of its objects from its recovery file. Why the latest committed versions? We need to recover at a point where the state was known.
Recovery manager
The task of the Recovery Manager (RM) is: to save objects in permanent storage (in a recovery file) for committed transactions; to restore the server's objects after a crash; to reorganize the recovery file to improve performance; and to reclaim storage space in the recovery file. Media failures (i.e. disk failures affecting the recovery file) require another copy of the recovery file on an independent disk. The RM deals with both durability and failure atomicity: it saves committed objects and can be used to restore the server state.
Fig 14.18 Types of entry in a recovery file
Object: a value of an object.
Transaction status: transaction identifier, transaction status (prepared, committed, aborted) and other status values used for the two-phase commit protocol.
Intentions list: transaction identifier and a sequence of intentions, each of which consists of <identifier of object>, <position in recovery file of value of object>.
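To see how these three entry types work together at recovery time, here is a toy replay. The entry layout (dicts with a "type" field) and all names are assumptions for illustration, not the book's on-disk format; the rule it demonstrates is the one above: a committed status entry makes the positions named in that transaction's intentions list take effect:

```python
# Toy recovery file: a list of entries. Object entries hold values; a
# committed transaction-status entry installs the values at the positions
# named in that transaction's intentions list. Prepared-only work is ignored.

def recover(log):
    intentions = {}   # trans id -> list of (object id, position in log)
    state = {}        # recovered committed object values
    for entry in log:
        if entry["type"] == "intentions":
            intentions[entry["trans"]] = entry["list"]
        elif entry["type"] == "status" and entry["status"] == "committed":
            for obj, where in intentions.get(entry["trans"], []):
                state[obj] = log[where]["value"]
    return state

log = [
    {"type": "object", "value": 100},                          # pos 0: A, initial
    {"type": "object", "value": 80},                           # pos 1: A after T's withdraw
    {"type": "intentions", "trans": "T", "list": [("A", 1)]},  # T intends A := log[1]
    {"type": "status", "trans": "T", "status": "prepared"},
    {"type": "status", "trans": "T", "status": "committed"},
    {"type": "intentions", "trans": "U", "list": [("A", 0)]},
    {"type": "status", "trans": "U", "status": "prepared"},    # U never committed
]
```

Replaying this log recovers A = 80: T's committed write takes effect, while U's prepared-but-uncommitted intention is discarded.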
Fig 14.19 Log for banking service. [Diagram: a recovery file with positions P0-P7. It holds object entries for A, B and C (initial values 100, 200 and 300), later values 80 and 220 written by transaction T and 278 and 242 by transaction U, transaction status entries (prepared, committed) with their intentions lists of <object, position> pairs, a checkpoint, and the end of the log.]
Logging - reorganizing the recovery file
The RM is responsible for reorganizing its recovery file so as to make the process of recovery faster and to reduce its use of space. Checkpointing is the process of writing to a new recovery file the current committed values of a server's objects, transaction status entries, and the intentions lists of transactions not yet fully resolved, including information related to 2PC (see later). Checkpointing makes recovery faster and saves disk space. It is done after recovery and from time to time.
Recovery of the 2PC (Figure 14.22)
Coordinator, prepared: no decision had been reached before the server failed. It sends abortTransaction to all the servers in the participant list and adds the transaction status aborted in its recovery file. Same action for state aborted. If there is no participant list, the participants will eventually time out and abort the transaction.
Coordinator, committed: a decision to commit had been reached before the server failed. It sends a doCommit to all the participants in its participant list (in case it had not done so before) and resumes the two-phase commit protocol at step 4 (Fig 14.5).
Participant, committed: the participant sends a haveCommitted message to the coordinator (in case this was not done before it failed). This will allow the coordinator to discard information about this transaction at the next checkpoint.
Participant, uncertain: the participant failed before it knew the outcome of the transaction. It cannot determine the status of the transaction until the coordinator informs it of the decision. It will send a getDecision to the coordinator to determine the status of the transaction; when it receives the reply it will commit or abort accordingly.
Participant, prepared: the participant has not yet voted and can abort the transaction.
Coordinator, done: no action is required.
Recovery of the 2PC (condensed)
Coordinator, prepared or aborted: no decision was reached before the server failed. Send abortTransaction to all participants on the list and write aborted to the recovery file. If there is no participant list, the participants will eventually time out and abort the transaction.
Coordinator, committed: a decision to commit had been reached before the server failed. Send a doCommit to all participants in the list (in case it had not been done before the failure) and resume 2PC at step 4.
Participant, committed: send a haveCommitted message to the coordinator (in case this was not done before the failure). This allows the coordinator to discard information about the transaction at the next checkpoint.
Participant, uncertain: the participant failed before it knew the outcome of the transaction. It cannot determine the status until the coordinator informs it of the decision, so it sends a getDecision to the coordinator and commits or aborts when the reply is received.
Participant, prepared: the participant has not yet voted and can abort.
Coordinator, done: no action is required.
ECEN5053 Software Engineering of Distributed Systems, University of Colorado, Boulder
Time and Global States: A Short Review of Time
Topics Clock synchronization Logical clocks Global State
How processes can synchronize Multiple processes must be able to cooperate in granting each other temporary exclusive access to a resource Also, multiple processes may need to agree on the ordering of events, such as whether message m1 from process P was sent before or after message m2 from process Q.
Centralized system Time is unambiguous If a process wants to know the time, it makes a system call and finds out If process A asks for the time and gets it and then process B asks for the time and gets it, the time that B was told will be later than the time that A was told. Simple, no?
Physical Clocks Physical computer clocks are not clocks; they are timers Quartz crystal that oscillates at a well-defined frequency that depends on physical properties Two registers: counter and a holding register Each oscillation decrements the counter by one When counter reaches zero, generates an interrupt and the counter is reloaded from the holding register Each interrupt is called a clock tick Interrupt service procedure adds 1 to time stored in memory so the software clock is kept up to date
The one and the many What if the clock is "off" by a little? All processes on single machine use the same clock so they will still be internally consistent What matters is relative time Impossible to guarantee that crystals in different computers run at exactly the same frequency Gradually software clocks get out of synch -- skew A program that expects time to be independent of the machine on which it is run ... fails
Hey buddy, can you spare me a second?
To provide UTC (Coordinated Universal Time) to those who need precise time, NIST operates a short-wave radio station, WWV, from Fort Collins, CO. WWV broadcasts a short pulse at the start of each second. There are stations in other countries, plus satellites. Using either short-wave or satellite services requires accurate knowledge of the relative positions of the sender and receiver. Why? To compensate for signal propagation delay. There are two phone numbers that allow you to listen to NIST time; to hear a simulcast of the WWV shortwave broadcast, call (303) 499-7111 (not a toll-free call, except in the local Boulder/Denver, Colorado area). http://tf.nist.gov/timefreq/stations/wwvb.htm
To WWV or not to WWV If one computer has a WWV receiver, the goal is keeping all the others synchronized to it. If no machines have WWV receivers, each machine keeps track of its own time Goal -- keep all machines together as well as possible There are many algorithms
Underlying model for synchronization algorithms
Each machine has a timer that interrupts H times a second. The interrupt handler adds 1 to a software clock that keeps track of the number of ticks since some agreed-upon time in the past. Call the value of the clock C. Notationally, when UTC time is t, the value of the clock on machine p is Cp(t). In a perfect world, Cp(t) = t for all p and all t.
Back to reality
Theoretically, a timer with H = 60 should generate 216,000 ticks per hour. In practice the relative error is about 10^-5, meaning a particular machine gets a value in the range 215,998 to 216,002 ticks per hour. A manufacturer specifies a constant called the maximum drift rate, rho: the timer is guaranteed to stay within rho of perfect. If two clocks drift in opposite directions, then at a time delta-t after they were synchronized they may be as much as twice the maximum drift rate apart. To guarantee that no two clocks ever differ by more than delta, clocks must be resynchronized at least every delta/(2*rho) seconds.
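The resynchronization interval is a one-line calculation. The numbers below are illustrative, chosen to match the 10^-5 relative error mentioned above:

```python
# How often must clocks be resynchronized to stay within delta of each other?
# Two clocks drifting in opposite directions diverge at up to 2*rho per second.

def resync_interval(delta, rho):
    return delta / (2 * rho)

rho = 1e-5     # maximum drift rate (roughly the 10^-5 relative error above)
delta = 0.01   # keep all clocks within 10 ms of each other
```

With these values, resync_interval(delta, rho) comes out to about 500 seconds, i.e. a resynchronization every eight minutes or so.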
Cristian's algorithm
Well suited to the case of one machine with a WWV receiver and the goal of having all other machines stay synchronized with it. Call the machine with the WWV receiver the time server. Periodically, each machine sends a message to the time server asking for the current time. The time server responds with CUTC as fast as it can. As a first approximation, the requester sets its clock to CUTC. What's wrong with that?
Big Trouble
Major problem: time really should never run backward -- why? If the requester's clock was fast, CUTC will be smaller than the requester's current value of C. The change must be introduced gradually: if the timer generates 100 interrupts/second, each interrupt normally adds 10 ms to the time; to slow down, the ISR adds only 9 ms per tick until the correction is absorbed; to speed up, it adds 11 ms at each interrupt. Time running backward causes problems with, for example, automatic build programs comparing date/time on source vs. object code.
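The gradual adjustment ("slewing") can be sketched like this, with the tick sizes from the example above. The function name and signature are illustrative, not from any real clock implementation:

```python
# Slewing a software clock: never jump backward; instead change the amount
# added per tick until the correction has been absorbed.

def slewed_ticks(correction_ms, normal_ms=10, step_ms=1):
    """Yield per-tick increments until correction_ms has been absorbed.

    correction_ms < 0: we are fast, slow down (add 9 ms per tick);
    correction_ms > 0: we are slow, speed up (add 11 ms per tick).
    """
    remaining = correction_ms
    while remaining != 0:
        if remaining < 0:
            yield normal_ms - step_ms   # 9 ms: clock slows relative to real time
            remaining += step_ms
        else:
            yield normal_ms + step_ms   # 11 ms: clock catches up
            remaining -= step_ms
```

A clock 5 ms fast absorbs the correction over five 9-ms ticks: it advances only 45 ms of clock time in 50 ms of real time, and never moves backward.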
Little Trouble
Minor problem: it takes a nonzero amount of time for the time server's reply to get back to the requester, and the delay may be large and vary with network load. Cristian's approach: measure the send and receive times, subtract, divide by 2, and add this to the received CUTC. Better: account for the length of the time server's interrupt-handling and message-processing time I, estimating the one-way delay as (T1 - T0 - I)/2. To improve accuracy, take several measurements and average them. We divide by 2 because we only want the time to add to the UTC value, which was correct just before the reply left the time server: T1 - T0 is the length of time from when the request was sent until the reply returned; subtracting the time spent servicing the request gives the total travel time, and dividing by 2 approximates the travel time of the reply.
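The adjustment above reduces to a tiny formula. This sketch uses illustrative names and assumes all times are in the same units:

```python
# Cristian's estimate: set the local clock to the server's reported time plus
# an estimate of the reply's one-way travel time.

def cristian_adjust(t0, t1, c_utc, interrupt_time=0.0):
    """t0: local send time, t1: local receive time, c_utc: server's reply.

    (t1 - t0 - interrupt_time) is the total travel time; half of it
    approximates the one-way delay of the reply.
    """
    travel = (t1 - t0 - interrupt_time) / 2
    return c_utc + travel
```

For example, a request sent at local time 100.0 whose reply arrives at 100.8, with the server spending 0.2 handling it and reporting 99.0, gives an estimated true time of 99.0 + (0.8 - 0.2)/2 = 99.3.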
If no WWV Receiver
Berkeley UNIX algorithm: the time server (actually a time daemon) is active, not passive. It polls every machine and asks what time it is; based on the answers, it computes an average time and tells all machines to adjust their clocks to the new time. The time daemon's time is set manually by the operator periodically. It is a centralized algorithm even though the time daemon does not have a WWV receiver.
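One round of the Berkeley scheme can be sketched as follows. The function name and the choice to return relative adjustments (rather than absolute times) are illustrative; sending adjustments is the usual refinement, since an absolute time would be stale by the time it arrived:

```python
# Berkeley-style averaging: the daemon polls everyone, averages the reported
# times (including its own), and tells each machine the adjustment to apply.

def berkeley_round(daemon_time, reported_times):
    times = [daemon_time] + list(reported_times)
    target = sum(times) / len(times)
    # Each machine gets a relative adjustment; machine 0 is the daemon itself.
    return {i: target - t for i, t in enumerate(times)}
```

For example, with the daemon at time 180 and two machines at 170 and 190, the average is 180, so one machine is told to advance by 10 and the other to slow down by 10.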
Decentralized synchronization
Cristian and Berkeley UNIX are centralized algorithms, with the usual downside. What? A single point of failure. There are several decentralized algorithms, for example: divide time into fixed-length resynchronization intervals; at the beginning of each interval, every machine broadcasts its current time; each machine starts a local timer to collect all broadcasts arriving during a certain interval; then an algorithm computes a new time based on some or all of the collected values.
Internet Synchronization New hardware and software technology in the past few years make it possible to keep millions of clocks synchronized to within a few ms of UTC New algorithms using these synchronized clocks are beginning to appear Synchronized clocks can be used to achieve cache consistency to use time-out tickets in distributed system authentication to handle commitment in atomic transactions
Logical Clocks (see also notes from 3 weeks ago)
For many purposes, it is sufficient that machines agree on the same time even if it is not the "right" time. Internal consistency of the clocks is what matters. Clock synchronization is possible but does not have to be absolute: if two processes do not interact, their clocks need not be synchronized, since the lack of synchronization would not be observed. What is important is that all processes agree on the order in which events occur.
Lamport timestamps
a happens-before b means that all processes agree that first event a occurs, then afterward event b occurs. We write a happens-before b as a --> b. If a occurs before b in the same process, then a --> b is true. If event a is the sending of a message and event b is the receipt of that message in another process, then a --> b is also true, because a message cannot be received until after it is sent. happens-before is transitive.
Ya cain’t say If x and y happen in different processes that do not exchange messages, then we cannot say x --> y we cannot say y --> x nothing can be said about when the events happened or which event happened first we call these events concurrent We don’t really know if they ARE simultaneous, we just call them concurrent because they might be
Invent time Need a way of measuring time so that for every event a we can assign a time C(a) on which all processes agree. Such that, if a --> b, then C(a) < C(b) If a and b are two events in the same process and a happens before b, then C(a) < C(b) If a is the sending of a msg by one process and b is the receiving of that msg by another, then C(a) and C(b) must be assigned so that everyone agrees on the values of C(a) and C(b) with C(a) < C(b) Corrections to C can only be made by addition, never subtraction so that the clock time always goes forward
If a message leaves at time N, it arrives at time >= N+1. Each message carries the time according to its sender's clock. When it arrives, if the receiver's clock shows a value prior to the time the message was sent, the receiver fast-forwards its clock to be 1 more than the sending time. Between every two events the clock must tick at least once: if a process sends or receives two messages in quick succession, it must advance its clock by (at least) 1 tick in between. To ensure that no two events ever occur at exactly the same time, each process appends its own process number (or similar), so two events at C = 40 become 40.1 and 40.2. This provides a TOTAL ORDERING of all events in the system.
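The rules above fit in a tiny class. This is a minimal sketch with illustrative names; ties are broken with the process id, as in the 40.1 / 40.2 trick, by representing each timestamp as a (counter, pid) pair:

```python
# Minimal Lamport clock. Timestamps are (counter, pid) tuples, so tuple
# comparison gives the total order described above.

class LamportClock:
    def __init__(self, pid):
        self.pid = pid
        self.c = 0

    def tick(self):                 # a local event
        self.c += 1
        return (self.c, self.pid)

    def send(self):                 # timestamp to attach to an outgoing message
        return self.tick()

    def receive(self, msg_ts):      # msg_ts = (sender counter, sender pid)
        # Fast-forward past the sender's clock, then tick for the receive event.
        self.c = max(self.c, msg_ts[0])
        return self.tick()
```

If process 1 sends at timestamp (1, 1) and process 2 (clock still at 0) receives it, the receive event gets (2, 2), so the send is ordered strictly before the receive.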
Totally-ordered Multicast
Consider a bank with replicated data in San Francisco and New York City. A customer in SF wants to add $100 to an account holding $1000. Meanwhile, a bank employee in NY initiates an update by which the customer's account will be increased by 1% interest. Due to communication delays, the instructions could arrive at the replicated sites in different orders, with differing final answers: (1000 + 100) x 1.01 = $1111 at one site versus 1000 x 1.01 + 100 = $1110 at the other. The updates should have been performed at both sites in the same order.
Using Lamport timestamps to get totally-ordered multicast
Consider a group of processes multicasting messages to each other. Each message is timestamped with the current (logical) time of its sender. Conceptually, when a message is multicast it is also sent to its sender. We assume messages from the same sender are received in the order they were sent and that no messages are lost.
totally ordered multicast (cont.)
When a process receives a message, the message goes into a local queue ordered according to its timestamp. The receiver multicasts an acknowledgement. Under Lamport's algorithm for adjusting local clocks, the timestamp of the received message is lower than the timestamp of the acknowledgement. All processes will eventually have the same copy of the local queue, because each message is multicast, plus the acks. (We assumed messages are delivered in the order sent by each sender.)
totally ordered multicast (cont. more)
Each process inserts a received message in its local queue according to the timestamp in that message. Lamport's clocks ensure no two messages have the same timestamp, and the timestamps reflect a consistent global ordering of events. A process delivers a queued message to the application it is running when that message is at the head of the queue and has been acknowledged by every other process. The message is then removed from the queue, and the associated acks are removed. Because each process has the same copy of the queue, all messages are delivered in the same order everywhere: we have established totally-ordered multicasting. (This is like the distributed semaphore we looked at 3 weeks ago.)
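The delivery rule (head of the timestamp-ordered queue, acknowledged by everyone) can be sketched as a hold-back queue. This is an illustrative local data structure, not a full protocol: the multicasting of messages and acks is assumed to happen elsewhere, and timestamps are the (counter, pid) tuples of the Lamport scheme:

```python
import heapq

# A message is delivered when it is at the head of the timestamp-ordered
# queue AND every process in the group has acknowledged it.

class HoldBackQueue:
    def __init__(self, n_processes):
        self.n = n_processes
        self.queue = []            # heap of (timestamp, msg)
        self.acks = {}             # timestamp -> set of acking process ids

    def on_message(self, ts, msg):
        heapq.heappush(self.queue, (ts, msg))
        self.acks.setdefault(ts, set())

    def on_ack(self, ts, from_pid):
        self.acks.setdefault(ts, set()).add(from_pid)

    def deliverable(self):
        """Pop and return messages that may be handed to the application."""
        out = []
        while self.queue:
            ts, msg = self.queue[0]
            if len(self.acks.get(ts, ())) < self.n:
                break              # head not yet acked by everyone: hold back
            heapq.heappop(self.queue)
            del self.acks[ts]
            out.append(msg)
        return out
```

Note the key property: even if a later message is fully acknowledged first, nothing is delivered until the head of the queue is, so every process delivers in the same order.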
Vector Timestamps
With Lamport timestamps, nothing can be said about the relationship between a and b simply by comparing their timestamps C(a) and C(b): just because C(a) < C(b) doesn't mean a happened before b (remember concurrent events). Consider network news, where processes post articles and react to posted articles. Postings are multicast to all members; we want reactions delivered after the associated postings.
Will totally-ordered multicasting work?
That scheme does not mean that if msg B is delivered after msg A, B is a reaction to msg A; they may be completely independent. What's missing? If causal relationships are maintained within a group of processes, then receipt of a reaction to an article should always follow receipt of the article, while if two items are independent, their order of delivery should not matter at all. What's missing in Lamport's algorithm is causality.
Vector Timestamps capture causality
VT(a) < VT(b) means event a causally precedes event b. Let each process Pi maintain a vector Vi such that Vi[i] is the number of events that have occurred so far at Pi. If Vi[j] = k, then Pi knows that k events have occurred at Pj. We increment Vi[i] at the occurrence of each new event that happens at process Pi. Vectors are piggybacked on the messages that are sent: when Pi sends msg m, it sends its current vector along as a timestamp vt.
The receiver thus knows the number of events that have occurred at Pi. The receiver is also told how many events at other processes took place before Pi sent message m: the timestamp vt of m tells the receiver how many events in other processes have preceded m and on which m may causally depend. When Pj receives m, it adjusts its own vector by setting each entry Vj[k] to max{Vj[k], vt[k]}. The vector now reflects the number of messages that Pj must receive to have seen at least the same messages that preceded the sending of m. Vj[i] is incremented by 1, representing the event of receiving msg m as the next message from Pi.
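The elementwise-max merge and the causal-precedence test on vectors can be sketched directly. These are illustrative helper functions operating on plain lists, not a full implementation of the message-counting scheme above:

```python
# Vector timestamp helpers for a fixed group of n processes.

def vt_merge(v_receiver, vt_message):
    """On receipt of m with timestamp vt, set each entry to the max.
    The result reflects everything the receiver and the sender had seen."""
    return [max(a, b) for a, b in zip(v_receiver, vt_message)]

def happened_before(vt_a, vt_b):
    """a causally precedes b iff vt_a <= vt_b elementwise and vt_a != vt_b.
    If neither precedes the other, the events are concurrent."""
    return all(a <= b for a, b in zip(vt_a, vt_b)) and vt_a != vt_b
```

Unlike Lamport timestamps, this comparison can answer "concurrent": when neither happened_before(vt_a, vt_b) nor happened_before(vt_b, vt_a) holds, nothing causal connects the two events.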
When are messages delivered?
Vector timestamps are used to deliver messages only when no causality constraints are violated. When process Pi posts an article, it multicasts that article as a msg a with timestamp vt(a) set equal to Vi. When another process Pj receives a, it will have adjusted its own vector such that Vj[i] > vt(a)[i]. Now suppose Pj posts a reaction by multicasting msg r with timestamp vt(r) equal to Vj, so vt(r)[i] > vt(a)[i]. Both msg a and msg r will arrive at a third process Pk in some order.
When receiving r, Pk inspects timestamp vt(r) and decides to postpone delivery until all msgs that causally precede r have been received as well. In particular, r is delivered only if the following conditions are met:
vt(r)[j] = Vk[j] + 1 -- says r is the next msg Pk was expecting from Pj;
vt(r)[i] <= Vk[i] for all i not equal to j -- says Pk has seen no msg not seen by Pj when it sent r. In particular, Pk has already seen message a.
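The two conditions translate directly into code. This sketch assumes vectors are plain lists indexed by process number and that Vk counts messages Pk has delivered, matching the conditions above; the function name is illustrative:

```python
# Causal delivery check at process Pk for a message r multicast by Pj.

def can_deliver(vt_r, v_k, j):
    # Condition 1: r is the next message Pk was expecting from Pj.
    next_from_j = vt_r[j] == v_k[j] + 1
    # Condition 2: Pk has already seen everything Pj had seen when it sent r.
    nothing_unseen = all(vt_r[i] <= v_k[i]
                         for i in range(len(v_k)) if i != j)
    return next_from_j and nothing_unseen
```

In the article/reaction scenario: Pi (index 0) posts a with vt(a) = [1,0,0]; Pj (index 1) reacts with vt(r) = [1,1,0]. At Pk with Vk = [0,0,0], r fails condition 2 and is postponed; once a has been delivered (Vk = [1,0,0]), r becomes deliverable.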
Controversy
There has been some debate about whether support for totally-ordered and causally-ordered multicasting should be provided as part of the message-communication layer, or whether applications should handle ordering themselves. The communication layer doesn't know what a message contains, only potential causality: two messages from the same sender will always be marked as causally related even if they are not. On the other hand, the application developer may not want to think about it.