Global Predicate Detection and Event Ordering
Our Problem: to compute predicates over the state of a distributed application.
Model
- Message passing
- No failures
- Two possible timing assumptions:
  1. Synchronous system
  2. Asynchronous system: no upper bound on message delivery time; no bound on relative process speeds; no centralized clock
Clock Synchronization
- External clock synchronization keeps each processor clock within some maximum deviation from an external time source:
  - can exchange information about the timing of events on different systems
  - can take actions at real-time deadlines
  - synchronization within 0.1 ms
- Internal clock synchronization keeps processor clocks within some maximum deviation from each other:
  - can measure the duration of distributed activities that start on one process and terminate on another
  - can totally order events that occur on a distributed system
The Model
- n processes, each with a hardware clock
- a bound on the drift of correct hardware clocks with respect to real time
- a bound on message delivery time
- f, a bound on the number of faulty processes (their clocks drift outside the envelope of correct clocks)
[Figure: clock time vs. real time, showing the envelope within which correct clocks stay]
Requirements of Synchronization
Processes adjust their clocks periodically to obtain logical clocks C that satisfy:
- Agreement: logical clocks are never too far apart
- Accuracy: logical clocks maintain some relation to real time
What accuracy can be achieved?
Provably, some loss of accuracy with respect to real time is unavoidable; traditionally, synchronization algorithms lost even more, because of variable message delays and failures. Is this extra loss of accuracy unavoidable?
Optimal accuracy: (1 + ρ)^(-1) · t ≤ C_i(t) ≤ (1 + ρ) · t, where ρ is the hardware drift bound.
How to synchronize?
Repeat forever:
1. agree on "when" to resynchronize
2. agree on the "updated clock value" at resynchronization
Traditionally:
- periodic resynchronization with fixed periods (e.g., every hour)
- updated clock value := "average" of all clocks
Averaging
Problems with Averaging: accumulation of error; fault tolerance.
Clock Synchronization: Take 1
Assume an upper bound max and a lower bound min on message delivery time. Guarantee that processes stay synchronized within max − min.
Problem: [Figure: histogram of message delivery times from a 5000-message run (IBM Almaden), % of messages vs. time (ms)]
Clock Synchronization: Take 2
- No upper bound on message delivery time... but a lower bound min on message delivery time
- Use a timeout max to detect process failures
- Slaves send messages to the master
- The master averages the slaves' values and computes a fault-tolerant average
- Precision: 4·max_p − min
Probabilistic Clock Synchronization (Cristian)
- Master-slave architecture
- The master is connected to an external time source
- Slaves read the master's clock and adjust their own
How accurately can a slave read the master's clock?
The Idea
Clock accuracy depends on the message roundtrip time: if the roundtrip is small, master and slave cannot have drifted apart by much! Since there is no upper bound on message delivery time, there is no certainty of obtaining an accurate enough reading... but a very accurate reading can be achieved by repeated attempts.
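As a rough illustration of this idea, here is a minimal Python sketch of Cristian-style probabilistic reading of the master's clock. MIN_DELAY (the assumed lower bound on one-way delay), the simulated rpc_read_master_clock stub, and all names are illustrative assumptions, not part of the slides' protocol.

```python
import random
import time

MIN_DELAY = 0.001   # assumed lower bound on one-way message delay, in seconds

def rpc_read_master_clock():
    """Hypothetical stand-in for the real RPC: simulates variable network delay
    around a master whose clock is 50 ms ahead of ours."""
    time.sleep(MIN_DELAY + random.expovariate(500))   # request delay
    master_time = time.time() + 0.05
    time.sleep(MIN_DELAY + random.expovariate(500))   # reply delay
    return master_time

def read_master(target_error=0.004, attempts=10):
    """Poll the master until the half-roundtrip error bound is small enough."""
    best = None   # (error_bound, estimated_master_time)
    for _ in range(attempts):
        start = time.monotonic()
        master_time = rpc_read_master_clock()
        roundtrip = time.monotonic() - start
        # The master read its clock somewhere inside the roundtrip, so the estimate
        # master_time + roundtrip/2 is off by at most roundtrip/2 - MIN_DELAY.
        error_bound = roundtrip / 2 - MIN_DELAY
        if best is None or error_bound < best[0]:
            best = (error_bound, master_time + roundtrip / 2)
        if error_bound <= target_error:
            break   # accurate enough; a real slave would now adjust its clock
    return best

print(read_master())
```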
Asynchronous Systems
- Weakest possible assumptions
- Weak assumptions ≡ fewer vulnerabilities
- Asynchronous subsumes slow: a system that is merely slow still satisfies the asynchronous model
- An "interesting" model with respect to failures
Client-Server
Processes exchange messages using Remote Procedure Call (RPC):
- a client requests a service by sending the server a message; the client blocks while waiting for the response
- the server computes the response (possibly asking other servers) and returns it to the client
Deadlock!
Goal: design a protocol by which a processor can determine whether a global predicate (say, deadlock) holds.
Wait-For Graphs
- Draw an arrow from p_i to p_j if p_j has received a request from p_i but has not responded yet
- Cycle in WFG ⇒ deadlock
- Deadlock ⇒ eventually a cycle in WFG
The Protocol
- p_0 sends a message to p_1 ... p_3
- On receipt of p_0's message, p_i replies with its state and wait-for info
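A small sketch, in Python, of what p_0 might do with the collected replies: assemble the wait-for graph and test it for a cycle. The {process_id: [ids it waits for]} encoding is an assumption, not the slides' message format; and, as the next slides show, a cycle found in states collected inconsistently may be a ghost deadlock.

```python
def has_cycle(wait_for):
    """Detect a cycle in a wait-for graph given as {pid: iterable of pids}."""
    WHITE, GREY, BLACK = 0, 1, 2          # unvisited / on current DFS path / done
    color = {p: WHITE for p in wait_for}

    def visit(p):
        color[p] = GREY
        for q in wait_for.get(p, ()):
            c = color.get(q, WHITE)
            if c == GREY:                 # back edge: cycle found
                return True
            if c == WHITE and visit(q):
                return True
        color[p] = BLACK
        return False

    return any(color[p] == WHITE and visit(p) for p in list(wait_for))

# p_1 waits for p_2, p_2 waits for p_3, p_3 waits for p_1: a cycle, hence (apparent) deadlock
print(has_cycle({1: [2], 2: [3], 3: [1]}))   # True
```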
An execution
Ghost Deadlock!
We have a problem...
- Asynchronous system: no centralized clock, etc.
- Synchrony is useful to coordinate actions and to order events
Events and Histories
- Processes execute sequences of events
- Events can be of 3 types: local, send, and receive
- e_p^i is the i-th event of process p
- The local history h_p of process p is the sequence of events executed by p
  - h_p^k: the prefix containing the first k events
  - h_p^0: the initial, empty sequence
- The history H is the set h_{p_0} ∪ h_{p_1} ∪ ... ∪ h_{p_{n-1}}
NOTE: in H, local histories are interpreted as sets, rather than sequences, of events
Ordering Events
- Observation 1: events in a local history are totally ordered
- Observation 2: for every message m, send(m) precedes receive(m)
Happened-Before (Lamport [1978])
A binary relation → defined over events:
1. if e_i^k, e_i^l ∈ h_i and k < l, then e_i^k → e_i^l
2. if e_i = send(m) and e_j = receive(m), then e_i → e_j
3. if e → e′ and e′ → e″, then e → e″
Space-Time Diagrams: a graphic representation of a distributed execution. H and → impose a partial order.
Runs and Consistent Runs
- A run is a total ordering of the events in H that is consistent with the local histories of the processors (e.g., h_1, h_2, ..., h_n is a run)
- A run is consistent if the total order it imposes is an extension of the partial order induced by →
- A single distributed computation may correspond to several consistent runs!
Cuts
- A cut C is a subset of the global history H, of the form C = h_1^{c_1} ∪ h_2^{c_2} ∪ ... ∪ h_n^{c_n}
- The frontier of C is the set of its most recent events, {e_i^{c_i} : 1 ≤ i ≤ n}
Global States and Cuts
- The global state of a distributed computation is an n-tuple of local states Σ = (σ_1, ..., σ_n)
- To each cut (c_1, ..., c_n) corresponds a global state Σ = (σ_1^{c_1}, ..., σ_n^{c_n})
Consistent Cuts and Consistent Global States
- A cut C is consistent if, for all events e and e′: (e ∈ C) ∧ (e′ → e) ⇒ e′ ∈ C
- A consistent global state is one corresponding to a consistent cut
What p_0 sees
Not a consistent global state: the cut contains the event corresponding to the receipt of the last message by p_3 but not the corresponding send event.
Our Task
- Develop a protocol by which a processor can build a consistent global state
- Informally, we want to be able to take a snapshot of the computation
- Not obvious in an asynchronous system...
Our Approach
- Develop a simple synchronous protocol
- Refine the protocol as we relax assumptions
- Record: processor states and channel states
- Assumptions: FIFO channels; each message m is timestamped with T(send(m))
Snapshot I
1. p_0 selects t_ss
2. p_0 sends "take a snapshot at t_ss" to all processes
3. when the clock of p_i reads t_ss, p_i:
   a. records its local state σ_i
   b. sends an empty message along its outgoing channels
   c. starts recording the messages received on each of its incoming channels
   d. stops recording a channel when it receives the first message with timestamp greater than or equal to t_ss
Correctness
Theorem: Snapshot I produces a consistent cut.
Proof: we need to prove that if the cut contains an event e, then it also contains every event e′ such that e′ → e.
Clock Condition
If e → e′, then the timestamp of e is smaller than the timestamp of e′.
Can the Clock Condition be implemented some other way, without synchronized real-time clocks?
Lamport Clocks
- Each process maintains a local variable LC
- LC(e) = value of LC for event e
Increment Rules
- if e_i is an internal or send event: LC := LC + 1
- if e_i = receive(m): LC := max(LC, TS(m)) + 1
- message m is timestamped with TS(m) = LC(send(m))
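A minimal sketch of a per-process Lamport clock following these increment rules (class and method names are illustrative, not from the slides):

```python
class LamportClock:
    """One Lamport clock, owned by a single process."""

    def __init__(self):
        self.lc = 0

    def local_event(self):
        """Internal or send event: LC := LC + 1."""
        self.lc += 1
        return self.lc

    def send(self):
        """Timestamp TS(m) to attach to an outgoing message."""
        return self.local_event()

    def receive(self, ts_m):
        """Receive event for a message with timestamp ts_m: LC := max(LC, TS(m)) + 1."""
        self.lc = max(self.lc, ts_m) + 1
        return self.lc
```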
Space-Time Diagrams and Logical Clocks
A subtle problem
"when LC = t do S" doesn't make sense for Lamport clocks!
- there is no guarantee that LC will ever be t
- even when LC reaches t, S is anyway executed after the event that set it
Fixes:
- if e is internal/send and LC = t − 2: execute e and then S
- if e = receive(m) ∧ (TS(m) ≥ t) ∧ (LC ≤ t − 1): put the message back in the channel, re-enable e, set LC := t − 1, and execute S
An obvious problem: there is no t_ss!
Choose t_ss large enough that it cannot be reached by applying the update rules of logical clocks. But doing so assumes:
- an upper bound on message delivery time
- an upper bound on relative process speeds
Better to relax these assumptions.
Snapshot II
1. p_0 selects t_ss (a logical-clock value too large to be reached by the update rules alone)
2. p_0 sends "take a snapshot at t_ss" to all processes; it waits for all of them to reply and then sets its logical clock to t_ss
3. when the clock of p_i reads t_ss, p_i:
   a. records its local state σ_i
   b. sends an empty message along its outgoing channels
   c. starts recording the messages received on each incoming channel
   d. stops recording a channel when it receives the first message with timestamp greater than or equal to t_ss
Relaxing synchrony
Between receiving "take a snapshot at t_ss" and its clock reaching t_ss, a process does nothing for the protocol! Only at t_ss does it record its local state, send the empty message, and start monitoring its channels. So: use the empty message itself to announce the snapshot!
Snapshot III
1. p_0 sends itself "take a snapshot"
2. when p_i receives "take a snapshot" for the first time, from p_j, it:
   - records its local state σ_i
   - sends "take a snapshot" along its outgoing channels
   - sets the channel from p_j to empty
   - starts recording messages received over each of its other incoming channels
3. when p_i receives "take a snapshot" beyond the first time, from p_k: p_i stops recording the channel from p_k
4. when p_i has received "take a snapshot" on all channels, it sends the collected state to p_0 and stops
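The per-process logic of Snapshot III can be sketched as follows (Python, assuming FIFO channels). The constructor arguments and the send(channel, message) callback are assumed plumbing; only the marker-handling rules come from the slide.

```python
MARKER = "take a snapshot"

class SnapshotParticipant:
    """Snapshot III logic for one process p_i (channels are assumed FIFO)."""

    def __init__(self, pid, in_channels, out_channels, send):
        self.pid = pid
        self.in_channels = set(in_channels)      # ids of incoming channels
        self.out_channels = list(out_channels)   # ids of outgoing channels
        self.send = send                         # callback: send(channel_id, message)
        self.local_state = None                  # recorded sigma_i
        self.channel_state = {}                  # channel id -> recorded in-flight messages
        self.recording = set()                   # channels still being recorded

    def on_message(self, channel, msg, current_state):
        if msg == MARKER:
            if self.local_state is None:
                # First marker: record state, forward the marker, and record every
                # other incoming channel; the channel the marker arrived on is empty.
                self.local_state = current_state
                for ch in self.out_channels:
                    self.send(ch, MARKER)
                self.channel_state = {ch: [] for ch in self.in_channels}
                self.recording = self.in_channels - {channel}
            else:
                self.recording.discard(channel)   # stop recording this channel
            if not self.recording:
                # Marker seen on all incoming channels: report collected state to p_0.
                return ("snapshot", self.pid, self.local_state, self.channel_state)
        elif channel in self.recording:
            # An application message received while recording is in-flight channel state.
            self.channel_state[channel].append(msg)
        return None
```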
Snapshots: a perspective
- The global state Σ^s saved by the snapshot protocol is a consistent global state
- But did it ever occur during the computation?
  - a distributed computation provides only a partial order of events
  - many total orders (runs) are compatible with that partial order
  - all we know is that Σ^s could have occurred
- We are evaluating predicates on states that may have never occurred!
An Execution and its Lattice
Reachability: Σ^{kl} is reachable from Σ^{ij} if there is a path from Σ^{ij} to Σ^{kl} in the lattice.
So, why do we care about Σ^s again?
- Deadlock is a stable property: once the system is deadlocked, it stays deadlocked
- If a run R of the snapshot protocol starts in Σ^i and terminates in Σ^f, then Σ^s is reachable from Σ^i and Σ^f is reachable from Σ^s
- Hence: deadlock in Σ^s implies deadlock in Σ^f; no deadlock in Σ^s implies no deadlock in Σ^i
Same problem, different approach
- The monitor process does not query explicitly
- Instead, it passively collects information and uses it to build an observation (reactive architectures, Harel and Pnueli [1985])
- An observation is an ordering of the events of the distributed computation based on the order in which the receiver is notified of the events
Observations: a few observations An observation puts no constraint on the order in which the monitor receives notifications
Causal Delivery
- FIFO delivery guarantees: send_i(m) → send_i(m′) ⇒ deliver_j(m) → deliver_j(m′)
- Causal delivery generalizes FIFO to different senders: send_i(m) → send_k(m′) ⇒ deliver_j(m) → deliver_j(m′)
- Note the distinction between the receive event (the message arrives at the process) and the deliver event (the message is handed to the application)
Causal Delivery in Synchronous Systems
We use the upper bound δ on message delivery time.
DR1: at time t, p_0 delivers all messages it received with timestamp up to t − δ, in increasing timestamp order.
Causal Delivery with Lamport Clocks
DR1.1: deliver all received messages in increasing (logical clock) timestamp order.
[Figure: p_0 has received messages with timestamps 1 and 4.] Should p_0 deliver?
Problem: Lamport clocks don't provide gap detection: given two events e and e′ with clock values LC(e) < LC(e′), determine whether some event e″ exists such that LC(e) < LC(e″) < LC(e′).
Stability
DR2: deliver all received stable messages in increasing (logical clock) timestamp order.
A message m received by p is stable at p if p will never receive a future message m′ such that TS(m′) < TS(m).
Implementing Stability
- Real-time clocks: wait for δ time units
- Lamport clocks: before delivering m, wait on each channel for a message m′ with TS(m′) > TS(m) (relies on FIFO channels)
- Or: design better clocks!
Clocks and STRONG Clocks
- Lamport clocks implement the clock condition: e → e′ ⇒ LC(e) < LC(e′)
- We want new clocks TC that implement the strong clock condition: e → e′ ⇔ TC(e) < TC(e′)
Causal Histories: the causal history of an event e in (H, →) is the set θ(e) = {e′ ∈ H : e′ → e} ∪ {e}.
How to build θ
Each process p_i:
- initializes θ to the empty set
- if e_i^k is an internal or send event, then θ(e_i^k) = θ(e_i^{k−1}) ∪ {e_i^k}
- if e_i^k is a receive event for message m, then θ(e_i^k) = θ(e_i^{k−1}) ∪ θ(send(m)) ∪ {e_i^k}
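A tiny sketch of this construction, representing events as (process, index) pairs and causal histories as Python sets; this illustrates the rules, not an efficient encoding.

```python
def initial_history():
    """theta before any event: the empty set."""
    return frozenset()

def internal_or_send(theta_prev, event):
    """theta(e_i^k) = theta(e_i^(k-1)) U {e_i^k}"""
    return theta_prev | {event}

def receive(theta_prev, event, theta_of_send):
    """theta(e_i^k) = theta(e_i^(k-1)) U theta(send(m)) U {e_i^k}"""
    return theta_prev | theta_of_send | {event}

def happened_before(e, theta_of_e2):
    """Test e in theta(e'); for distinct events e and e', this is exactly e -> e'."""
    return e in theta_of_e2

# Example: p1's send is in the causal history of p2's receive of that message
t_send = internal_or_send(initial_history(), ("p1", 1))
t_recv = receive(initial_history(), ("p2", 1), t_send)
print(happened_before(("p1", 1), t_recv))   # True
```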
Pruning causal histories
- Prune segments of history that are known to all processes (Peterson, Buchholz, and Schlichting)
- Use a cleverer way to encode θ(e)
Vector Clocks
- Consider θ_i(e), the projection of θ(e) on p_i
- θ_i(e) is a prefix of h_i: θ_i(e) = h_i^{k_i}, so it can be encoded by the single integer k_i
- θ(e) = θ_1(e) ∪ θ_2(e) ∪ ... ∪ θ_n(e) can therefore be encoded using n integers
- Represent θ(e) using an n-vector VC such that VC(e)[i] = k_i
Update rules
- if e_i is an internal or send event: VC(e_i)[i] := VC[i] + 1
- if e_i = receive(m): VC(e_i) := max(VC, TS(m)) componentwise, and VC(e_i)[i] := VC[i] + 1
- message m is timestamped with TS(m) = VC(send(m))
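A minimal sketch of a per-process vector clock implementing these update rules (processes are indexed 0..n−1 here; names are illustrative):

```python
class VectorClock:
    """Vector clock owned by process number `i` out of `n` processes."""

    def __init__(self, n, i):
        self.i = i
        self.vc = [0] * n

    def local_event(self):
        """Internal or send event: VC[i] := VC[i] + 1."""
        self.vc[self.i] += 1
        return list(self.vc)

    def send(self):
        """Timestamp TS(m) carried by an outgoing message."""
        return self.local_event()

    def receive(self, ts_m):
        """Receive event: componentwise max with TS(m), then count this event."""
        self.vc = [max(a, b) for a, b in zip(self.vc, ts_m)]
        self.vc[self.i] += 1
        return list(self.vc)
```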
Example
[Figure: space-time diagram of three processes, with events labeled by their vector clocks: [1,0,0], [0,1,0], [2,1,0], [1,0,1], [1,0,2], [1,0,3], [3,1,2], [1,2,3], [4,1,2], [5,1,2], [4,3,3], [5,1,4]]
Operational interpretation
- VC(e_i)[i] ≡ number of events executed by p_i up to and including e_i
- VC(e_i)[j], j ≠ i ≡ number of events executed by p_j that happen before e_i of p_i
VC properties: event ordering
1. Given two vectors V and V′, less-than is defined as: V < V′ ≡ (V ≠ V′) ∧ (∀ k, 1 ≤ k ≤ n : V[k] ≤ V′[k])
2. Strong Clock Condition: e → e′ ⇔ VC(e) < VC(e′)
3. Simple Strong Clock Condition: given e_i of p_i and e_j of p_j, where i ≠ j: e_i → e_j ⇔ VC(e_i)[i] ≤ VC(e_j)[i]
4. Concurrency: given e_i of p_i and e_j of p_j, where i ≠ j: e_i ∥ e_j ⇔ (VC(e_i)[i] > VC(e_j)[i]) ∧ (VC(e_j)[j] > VC(e_i)[j])
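The same properties written as small helper predicates over vector clocks represented as plain lists (a sketch with illustrative names, 0-based process indices):

```python
def vc_less(v, w):
    """V < V'  :=  V != V' and V[k] <= V'[k] for every k."""
    return v != w and all(a <= b for a, b in zip(v, w))

def happened_before(vc_ei, vc_ej, i):
    """Simple strong clock condition: e_i -> e_j iff VC(e_i)[i] <= VC(e_j)[i] (i != j)."""
    return vc_ei[i] <= vc_ej[i]

def concurrent(vc_ei, vc_ej, i, j):
    """e_i || e_j iff VC(e_i)[i] > VC(e_j)[i] and VC(e_j)[j] > VC(e_i)[j]."""
    return vc_ei[i] > vc_ej[i] and vc_ej[j] > vc_ei[j]

# Example with vectors from the diagram above: [2,1,0] on p_1 vs. [1,0,2] on p_3
print(happened_before([2, 1, 0], [1, 0, 2], 0))   # False
print(concurrent([2, 1, 0], [1, 0, 2], 0, 2))     # True: the two events are concurrent
```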
VC properties: consistency
- Pairwise inconsistency: events e_i of p_i and e_j of p_j (i ≠ j) are pairwise inconsistent (i.e., they cannot both be on the frontier of the same consistent cut) if and only if (VC(e_i)[i] < VC(e_j)[i]) ∨ (VC(e_j)[j] < VC(e_i)[j])
- Consistent cut: a cut defined by (c_1, ..., c_n) is consistent if and only if, for all i and j: VC(e_i^{c_i})[i] ≥ VC(e_j^{c_j})[i]
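And the consistency checks as executable predicates (again a sketch; frontier_vcs[i] is the vector clock of e_i^{c_i}, 0-based):

```python
def pairwise_inconsistent(vc_ei, vc_ej, i, j):
    """e_i and e_j cannot both be on the frontier of the same consistent cut."""
    return vc_ei[i] < vc_ej[i] or vc_ej[j] < vc_ei[j]

def consistent_cut(frontier_vcs):
    """A cut is consistent iff no two of its frontier events are pairwise inconsistent."""
    n = len(frontier_vcs)
    return all(not pairwise_inconsistent(frontier_vcs[i], frontier_vcs[j], i, j)
               for i in range(n) for j in range(n) if i != j)

# The frontier {[2,1,0], [0,1,0], [1,0,1]} from the diagram above is a consistent cut
print(consistent_cut([[2, 1, 0], [0, 1, 0], [1, 0, 1]]))   # True
```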
VC properties: gap detection
- Weak gap detection: given e_i of p_i and e_j of p_j, if VC(e_i)[k] < VC(e_j)[k] for some k ≠ j, then there exists an event e_k of p_k such that ¬(e_k → e_i) ∧ (e_k → e_j)
- Strong gap detection: given e_i of p_i and e_j of p_j, if VC(e_i)[i] < VC(e_j)[i], then there exists an event e_i′ of p_i such that (e_i → e_i′) ∧ (e_i′ → e_j)
VCs for Causal Delivery
- Each process increments the local component of its VC only for events that are notified to the monitor
- Each message notifying event e is timestamped with VC(e)
- The monitor keeps all notification messages in a set M
Stability
Suppose p_0 has received m_j from p_j. When is it safe for p_0 to deliver m_j? Only when p_0 is sure that:
- there is no earlier undelivered message in M
- there is no earlier message from p_j itself (checked using the number of p_j's messages already delivered by p_0)
- there is no earlier message m_k″ from some p_k (k ≠ j) still to arrive... how do we check this?
Checking for. Let m k ’ be the last message p 0 delivered from p k By strong gap detection, m k ’’ exists only if Hence, deliver m j as soon as
The protocol
p_0 maintains an array D[1..n] of counters, where D[i] = TS(m_i)[i] and m_i is the last message delivered from p_i.
DR3: deliver m from p_j as soon as both of the following conditions are satisfied:
1. D[j] = TS(m)[j] − 1
2. D[k] ≥ TS(m)[k] for all k ≠ j
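A sketch of the monitor side of DR3 in Python: buffer incoming notifications and deliver them once both conditions hold. The message format and names are assumptions; processes are indexed 0..n−1 here.

```python
class CausalDeliveryMonitor:
    """Monitor p_0: delivers notification messages in causal order using DR3."""

    def __init__(self, n):
        self.D = [0] * n   # D[k] = TS(m_k)[k] of the last message delivered from p_k
        self.M = []        # received but not yet delivered notifications

    def _deliverable(self, j, ts):
        next_from_j = (self.D[j] == ts[j] - 1)                              # condition 1
        no_gap = all(self.D[k] >= ts[k] for k in range(len(ts)) if k != j)  # condition 2
        return next_from_j and no_gap

    def receive(self, j, ts, payload):
        """Buffer (sender j, timestamp ts, payload), then deliver whatever has become safe."""
        self.M.append((j, ts, payload))
        delivered = []
        progress = True
        while progress:
            progress = False
            for entry in list(self.M):
                jj, tts, pl = entry
                if self._deliverable(jj, tts):
                    self.M.remove(entry)
                    self.D[jj] = tts[jj]
                    delivered.append(pl)
                    progress = True
        return delivered   # payloads in causal delivery order
```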
Multiple Monitors
- Create a group of monitor processes, for increased performance and increased reliability
- Notify events through a causal multicast to the group
- Each replica will construct a (possibly different) observation:
  - if the property is stable: if one monitor detects it, eventually all monitors do
  - otherwise: either use Possibly and Definitely, or use causal atomic multicast
- What about failures?