Distributed Snapshots & Termination Detection
Presented by Subashini Balachandran
What is a snapshot?
A snapshot of a distributed system is a global state in which the local states of all processes and of all communication channels are recorded simultaneously.
Such a causally consistent state is difficult to obtain in a distributed system without a common clock.
Where is a snapshot used?
Detecting deadlock in a distributed system.
Computing monotonic functions of the global state, such as lower bounds on the simulation time.
Checkpointing and recovery of distributed databases.
Monitoring and debugging of distributed systems.
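Consistent and Inconsistent Cuts
A cut is consistent if no message arrow starts in the future and ends in the past (e.g. cut AB).
Otherwise it is inconsistent (e.g. cut CD).
[Figure: time diagram with labels 1 to 4, showing the consistent cut AB and the inconsistent cut CD]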
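The consistency rule can be checked mechanically. The sketch below is only an illustration: the `Message` tuple layout, the `is_consistent` function, and the two example cuts are assumptions made here and are not taken from the slide's figure. A cut is represented as the number of events per process that lie in its past.

```python
from typing import Dict, List, Tuple

# Hypothetical encoding: (sender, send_index, receiver, receive_index), where
# the indices are positions of the send and receive events in the local
# event histories of the sender and the receiver.
Message = Tuple[str, int, str, int]


def is_consistent(cut: Dict[str, int], messages: List[Message]) -> bool:
    """cut[p] is the number of events of process p that lie in the past."""
    for sender, send_idx, receiver, recv_idx in messages:
        sent_in_future = send_idx >= cut[sender]
        received_in_past = recv_idx < cut[receiver]
        if sent_in_future and received_in_past:
            # A message arrow starts in the future and ends in the past.
            return False
    return True


# Two messages: P1 -> P2 and later P2 -> P1 (made-up example data).
messages = [("P1", 0, "P2", 1), ("P2", 2, "P1", 3)]

cut_ok = {"P1": 2, "P2": 2}   # second message is sent and received after the cut
cut_bad = {"P1": 4, "P2": 2}  # second message is sent after, received before

print(is_consistent(cut_ok, messages))
print(is_consistent(cut_bad, messages))
```

A message that is sent in the past and received in the future is allowed; it is simply in transit across the cut.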
Consistent Cut Algorithm
A consistent cut for non-FIFO systems is obtained by piggybacking a one-bit status onto basic messages.
Every process is initially white and turns red while taking a local snapshot.
Every message sent by a white (red) process is colored white (red).
Every process takes a local snapshot at its convenience, but before a red message is possibly received.
Example 1
The snapshot covers each process up to the point where its white phase ends.
[Figure: time diagram of processes P1 to P4]
Example 2
The local snapshot is taken before the red message is looked at or processed.
[Figure: time diagram of processes P1 to P4]
Consistent Cut Algorithm (cont.)
The cut defined by the white events is consistent: no red message (sent after the cut) is received by a white process (before the cut).
Requirement: a white process must be able to take a local snapshot at the moment it receives a red basic message.
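The coloring rules can be sketched for a single process as follows. This is a minimal illustration, not the presenter's implementation: the `Process` class, its integer `state`, and the `(color, payload)` message format are assumptions chosen to keep the example small.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

WHITE, RED = "white", "red"


@dataclass
class Process:
    name: str
    state: int = 0                  # toy local state (assumption)
    color: str = WHITE              # every process starts white
    snapshot: Optional[int] = None  # recorded local state, if any

    def take_snapshot(self) -> None:
        # Record the local state once and turn red.
        if self.color == WHITE:
            self.snapshot = self.state
            self.color = RED

    def send(self, payload: int) -> Tuple[str, int]:
        # Piggyback the sender's current color on the basic message.
        return (self.color, payload)

    def receive(self, message: Tuple[str, int]) -> None:
        color, payload = message
        if color == RED:
            # Take the local snapshot BEFORE processing a red message.
            self.take_snapshot()
        self.state += payload


p1 = Process("P1", state=5)
p2 = Process("P2", state=10)
p1.take_snapshot()        # P1 snapshots at its convenience
p2.receive(p1.send(3))    # the red message forces P2 to snapshot first
print(p2.snapshot, p2.state, p2.color)
```

Because the snapshot is taken before the payload is applied, the effect of a red message never appears in a recorded local state.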
Catching the Messages in Transit
Messages in transit are precisely the white messages that are received by red processes.
So whenever a red process gets a white message, it can send a copy of it to the snapshot initiator.
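A small sketch of this rule, under assumptions made only for illustration: the initiator is modeled as a plain Python list (`initiator_inbox`), and the `Process` class and its method names are hypothetical.

```python
from typing import List, Tuple

WHITE, RED = "white", "red"


class Process:
    def __init__(self, name: str, initiator_inbox: List[Tuple[str, str]]) -> None:
        self.name = name
        self.color = WHITE
        self.received: List[str] = []
        # Stand-in for a channel to the snapshot initiator (assumption).
        self.initiator_inbox = initiator_inbox

    def turn_red(self) -> None:
        self.color = RED  # the local snapshot itself is omitted here

    def receive(self, msg_color: str, payload: str) -> None:
        if msg_color == RED and self.color == WHITE:
            self.turn_red()
        elif msg_color == WHITE and self.color == RED:
            # White message crossing the cut: it was in transit, so a copy
            # goes to the initiator.
            self.initiator_inbox.append((self.name, payload))
        self.received.append(payload)


inbox: List[Tuple[str, str]] = []
p2 = Process("P2", inbox)
p2.receive(WHITE, "a")  # white to white: ordinary message, no copy
p2.turn_red()
p2.receive(WHITE, "b")  # white to red: in transit, copy is forwarded
p2.receive(RED, "c")    # red to red: belongs to the future, no copy
print(inbox)
```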
The Snapshot Principle
Once the snapshot initiator has received the local snapshots of all processes and the last copy of all in-transit messages, it knows the snapshot is complete.
[Figure: time diagram of processes P1 to P4 sending local snapshots and copies of in-transit messages to the initiator, up to the point marked End]
Termination Detection
A process is considered active if it is white and passive otherwise.
Only white messages are considered.
The white computation has then terminated if no process is white and no white messages are in transit.
Problem: there is no direct way to determine when the last white message has been received.
Deficiency Counting (Termination Detection)
Each process keeps a counter as part of its process state:
counter = (# of basic messages the process has sent) - (# of basic messages it has received from any other process)
Together, the local snapshots and their counters determine the total number of messages in transit.
Thus the end of the snapshot can be determined.
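As a rough sketch of how the counters could be used: summing the recorded counters over all processes gives sent minus received for the whole system, which is the number of messages still in transit. The function names and the sample counter values below are made up for illustration.

```python
from typing import Dict


def messages_in_transit(counters: Dict[str, int]) -> int:
    # Each counter is (sent - received) as recorded in the local snapshot;
    # the sum over all processes is the number of messages still in transit.
    return sum(counters.values())


def snapshot_complete(counters: Dict[str, int],
                      n_processes: int,
                      copies_received: int) -> bool:
    # The initiator is done once it holds every local snapshot and as many
    # message copies as there were messages in transit.
    have_all_snapshots = len(counters) == n_processes
    return have_all_snapshots and copies_received == messages_in_transit(counters)


# Made-up counters: P1 sent 2, P2 received 1, so 1 message is in transit.
counters = {"P1": 2, "P2": -1, "P3": 0, "P4": 0}
print(messages_in_transit(counters))
print(snapshot_complete(counters, n_processes=4, copies_received=0))
print(snapshot_complete(counters, n_processes=4, copies_received=1))
```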
Vector Counter Principle (Termination Detection)
Every process Pi counts the number of white messages it has sent to Pj (i ≠ j) in the j-th component of a local vector Vi of length n (n = number of processes).
When a white message is received, the process decrements its own component: Vi[i] = Vi[i] - 1
A control vector C circulates around the ring, accumulates the local vectors, and resets them to zero: C = C + Vi; Vi = 0
Vector Counter Principle (Termination Detection, cont.)
At the end of the first round, C[i] indicates the number of white messages that are in transit to Pi for the cut.
No new white messages are generated after the cut.
A second round is necessary if C[i] > 0 for some i.
In the second round, the control vector waits at each Pi until all the (white) in-transit messages have been received: Vi[i] + C[i] <= 0
All the in-transit messages are then collected.
Termination is guaranteed after 2 control rounds.
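The two control rounds can be traced with a small simulation. This is a sketch under assumptions: processes are indexed from 0, the ring is walked by a simple loop, and the names `Process`, `visit`, and `may_leave` are hypothetical.

```python
from typing import List


class Process:
    """Hypothetical process P_i holding the local vector V_i of length n."""

    def __init__(self, i: int, n: int) -> None:
        self.i = i
        self.V = [0] * n

    def send_white(self, j: int) -> None:
        if j == self.i:
            raise ValueError("a process does not count messages to itself")
        self.V[j] += 1       # one more white message sent to P_j

    def receive_white(self) -> None:
        self.V[self.i] -= 1  # own component is decremented on receipt


def visit(C: List[int], p: Process) -> None:
    # C = C + V_i ; V_i = 0
    for k in range(len(C)):
        C[k] += p.V[k]
        p.V[k] = 0


def may_leave(C: List[int], p: Process) -> bool:
    # Second-round condition: V_i[i] + C[i] <= 0
    return p.V[p.i] + C[p.i] <= 0


n = 3
procs = [Process(i, n) for i in range(n)]
procs[0].send_white(1)    # received before the control vector arrives
procs[0].send_white(2)    # still in transit during round 1
procs[1].receive_white()

C = [0] * n
for p in procs:           # round 1
    visit(C, p)
print(C)                  # C[2] counts the message in transit to process 2

print(may_leave(C, procs[2]))  # round 2 must wait at process 2
procs[2].receive_white()       # the in-transit message arrives
print(may_leave(C, procs[2]))
visit(C, procs[2])
print(C)
```

After the late message is received and the control vector visits process 2 again, every component of C is zero, which matches the claim that two control rounds suffice.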
Example
[Figure: time diagram of processes P1 to P4 with the accumulated control vector C; numeric labels in the figure: 1, 2, -1, 1, 1, 1]
Conclusion
Presented a new algorithm for computing snapshots.
The basic idea is to use 2 colors indicating the process states to identify the past and the future.
Termination detection uses the vector counter method.
Thank You :-)