Advanced Topics in Concurrency and Reactive Programming: Time and State Majeed Kassis
Time and State: Election Algorithms
• “Synchronization is … doing the right thing at the right time.”
• Synchronization in distributed systems is closely tied to communication.
• It is complicated by the lack of a global clock and of shared memory.
• Logical clocks support a global ordering of events.
Election Algorithms
• Many algorithms in distributed systems require a coordinator; see, for example, the centralized mutual exclusion algorithm.
• Most of the time it does not matter which process acts as coordinator: in general, all processes are equally suitable for the role.
• Election algorithms choose a unique coordinator that all other machines agree upon, and do so efficiently.
• Any process can serve as coordinator, and any process can “call an election” (initiate the algorithm to choose a new coordinator).
• There is no harm, other than extra message traffic, in having multiple concurrent elections.
• Elections may be needed when the system is initialized, or when the coordinator crashes or retires.
• Example: in the Berkeley clock synchronization algorithm, a new master node is chosen if the master fails.
• Algorithms: the Bully algorithm and the Ring algorithm.
The Bully Algorithm
• Setup:
  - Each process (site) has a unique ID number, e.g. its network address or a process number.
  - Processes know the IDs and addresses of all other processes in the system, although not which of them are up or down.
  - Communication is assumed to be reliable.
  - Process groups (as in the ISIS toolkit or MPI) satisfy these requirements.
• Goal: make the live process with the highest ID number the new coordinator.
• Initialization: when a process P notices that the coordinator is no longer responding, it initiates an election.
• High-numbered processes “bully” low-numbered processes out of the election until only one process remains.
• When a crashed process reboots, it holds an election. If it is now the highest-numbered live process, it will win.
The Bully Algorithm
• P sends an election message to all processes with higher IDs and expects an "I am alive" response from them.
• If P receives no response, it declares victory and broadcasts a victory message to all processes in the system.
• If P hears from a process with a higher ID:
  - P waits a certain amount of time for the victory message.
  - If no victory message arrives, P re-broadcasts the election message.
• If P gets an election message from a process with a lower ID:
  - P sends an "I am alive" message back and starts a new election.
  - In this way P “bullies” lower-ID processes, denying them the coordinator role.
• Several processes may detect the failure of the master node, so more than one process may initiate an election.
• If several elections run in parallel, the algorithm still ensures a single victor: the live process with the highest ID.
The Bully Algorithm: Example
• (a) Node 7, the previous coordinator, has crashed. Node 4 notices, starts an election, nominates itself, and waits for responses from the higher-ID nodes.
• (b) Nodes 5 and 6 receive 4’s message and return an “OK” message. Node 4 stops and waits.
• (c) Node 5 holds an election and sends to 6 and 7; node 6 holds an election and sends to 7.
The Bully Algorithm: Example
• (d) Node 5 receives “OK” from 6, so 5 halts and waits. Node 6 receives no response from 7.
• (e) Node 6 declares victory and sends a message to all nodes.
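
The election walked through in steps (a) to (e) can be replayed with a small simulation. This is a minimal sketch and not part of the original slides: the function name `bully_election`, the synchronous message delivery, and the counting convention (one message per ELECTION, per OK reply, and per COORDINATOR notice to a live process) are assumptions made for illustration.

```python
def bully_election(ids, alive, initiator):
    """Simulate one bully election with instant, reliable delivery.

    ids:       all process IDs in the system (known to everyone)
    alive:     set of IDs that are currently up
    initiator: live process that noticed the coordinator failure
    Returns (coordinator_id, messages_sent).
    """
    messages = 0
    queue = [initiator]   # processes that still have to hold an election
    started = set()       # processes that already held one
    coordinator = None
    while queue:
        p = queue.pop(0)
        if p in started:
            continue
        started.add(p)
        higher = [q for q in ids if q > p]
        messages += len(higher)                 # ELECTION to every higher ID
        responders = [q for q in higher if q in alive]
        messages += len(responders)             # each live higher process answers OK
        if responders:
            queue.extend(responders)            # they take over with their own elections
        else:
            coordinator = p                     # nobody higher answered: p wins
            messages += len([q for q in alive if q != p])  # COORDINATOR broadcast
    return coordinator, messages


# Nodes 0..7, node 7 has crashed, node 4 notices the failure.
print(bully_election(range(8), set(range(7)), initiator=4))  # (6, 15)
```

Node 6 wins, as in the example. A real implementation would replace the instant replies with timeouts.
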
Ring Algorithm
• Processes are arranged in a logical ring, and each process knows the order of the processes in the ring.
• A process initiates an election if it has just recovered from a failure, or if it notices that the coordinator has failed.
• The initiator sends an Election message to the closest downstream node that is alive.
• Processes can “skip” faulty nodes: instead of sending to process j, send to j + 1. A node is considered faulty if it does not respond within a fixed amount of time.
• The Election message is forwarded around the ring, and each process adds its own ID to it.
• When the Election message comes back to the initiator:
  - The initiator picks the node with the highest ID.
  - It sends a Coordinator message specifying the winner of the election.
• Multiple elections can be in progress at once. Eventually, all messages will carry the same values.
Ring Algorithm: Example
• P thinks the coordinator has crashed; it builds an ELECTION message containing its own ID number and sends it to its first live successor.
• Each process adds its own number and forwards the message to the next process.
• When the message returns to P, it sees its own process ID in the list and knows that the circuit is complete.
• P circulates a COORDINATOR message with the new highest number.
• Here, both 2 and 5 elect 6: their ELECTION lists are [2,3,4,5,6,0,1] and [5,6,0,1,2,3,4].
• It is OK to have two elections at once.
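
The two ELECTION lists in this example can be reproduced with a few lines of code. This is a minimal sketch under stated assumptions: the function name `ring_election` is made up for illustration, the ring is a plain list in ring order, and a crashed node is simply skipped rather than detected by a timeout.

```python
def ring_election(ring, alive, initiator):
    """Pass an ELECTION message once around the ring.

    ring:      process IDs in ring order
    alive:     set of IDs that are currently up
    initiator: live process that starts the election
    Returns (coordinator_id, id_list_collected_by_the_message).
    """
    n = len(ring)
    collected = [initiator]                 # the initiator puts its own ID first
    i = (ring.index(initiator) + 1) % n
    while ring[i] != initiator:             # stop when the message is back home
        if ring[i] in alive:                # crashed successors are skipped
            collected.append(ring[i])
        i = (i + 1) % n
    return max(collected), collected


ring = list(range(8))       # processes 0..7
alive = set(range(7))       # process 7, the old coordinator, has crashed
print(ring_election(ring, alive, 5))   # (6, [5, 6, 0, 1, 2, 3, 4])
print(ring_election(ring, alive, 2))   # (6, [2, 3, 4, 5, 6, 0, 1])
```

Both initiators collect the same set of IDs in a different order, so both announce 6 as coordinator.
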
Bully vs Ring runtime
• Assume n processes and one election in progress.
• Bully algorithm:
  - Worst case: the initiator is the node with the lowest ID. This triggers n-2 elections at higher-ranked nodes, for O(n^2) messages in total.
  - Best case: the election completes immediately, with n-2 messages.
• Ring algorithm: always 2(n-1) messages.
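
To see where the quadratic term comes from, the counts can be worked out for a concrete n. This is a rough sketch under assumptions that are not in the slides: processes are numbered 0..n-1, the old coordinator n-1 has crashed, and for the bully algorithm only ELECTION messages are counted (OK replies and the final coordinator broadcast are left out). The function names are made up for illustration.

```python
def bully_worst_case_election_messages(n):
    # Process 0 starts. Every live process i (0..n-2) eventually holds an
    # election and sends ELECTION to each of the n-1-i processes above it.
    return sum(n - 1 - i for i in range(n - 1))


def ring_messages(n):
    # Figure from the slide: n-1 ELECTION hops plus n-1 COORDINATOR hops.
    return 2 * (n - 1)


for n in (8, 16, 32):
    print(n, bully_worst_case_election_messages(n), ring_messages(n))
```

The bully sum equals n(n-1)/2, which grows quadratically, while the ring count grows linearly.
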
Elections in Wireless Networks
• Issues: communication is unreliable, processes may move, and the network topology changes constantly.
• Traditional algorithms aren’t appropriate, because a wireless network cannot be assumed to have reliable message passing or a stable configuration.
• Wireless algorithms try to find the best node to be coordinator; traditional algorithms are satisfied with any node.
• Algorithm:
  - Any node (the source) can initiate an election by sending an ELECTION message to its neighbors, i.e. the nodes within range.
  - When a node receives an ELECTION message for the first time, it designates the sender as its parent and forwards the message to its neighbors.
  - It then waits for responses from its neighbors. Responses may carry resource information.
  - When a node receives an ELECTION message for the second time, it just ignores it.
Elections: Wireless Network Example
• (a) Initial state.
• (b) Node a is the source and broadcasts ELECTION messages to its nearby neighbors.
• Messages carry a unique ID so that concurrent elections can be told apart.
Elections: Wireless Network Example
• (c) g receives the message from b first and sends it on to its neighbors; g receives the message from j second and ignores it. c receives the message from b and sends it on to its neighbors.
• (d) d receives the message from c; e receives the message from g.
• When a node R receives its first election message, it designates the sender Q as its parent and forwards the message to all neighbors except Q.
• When R receives an election message from a non-parent, it just acknowledges the message.
Elections: Wireless Network Example
• (e) f receives the message from e first; i receives the message from h first.
• (f) Now i, f, and d return responses carrying their own values; these nodes are the leaves of the tree.
• If all of R’s neighbors already have parents, R is a leaf; otherwise it waits for its children to forward the message to their neighbors.
• When R has collected acks from all its neighbors, it acknowledges the message from Q. Acknowledgements flow back up the tree to the original source.
• Each node that receives a response compares its own value with the received value and passes on the higher of the two, so at each stage the “most eligible” or “best” node is passed along from child to parent.
• Once the source node a has received all the replies, it chooses the node with the highest value as the new coordinator.
• When the selection is made, a broadcasts the new coordinator’s address to all nodes in the network.
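
The two phases of this election (flooding builds a tree, replies carry the best value back to the source) can be sketched as follows. This is an illustration only: the function name `wireless_election`, the small topology, and the capacity values are hypothetical and are not the ones from the slide figure. A breadth-first traversal stands in for "first ELECTION message received", which assumes all links have the same delay.

```python
from collections import deque


def wireless_election(graph, capacity, source):
    """Return (best_node, parent_map) for an election started at source."""
    parent = {source: None}
    order = []                        # order in which nodes first hear ELECTION
    queue = deque([source])
    while queue:
        node = queue.popleft()
        order.append(node)
        for neighbor in graph[node]:
            if neighbor not in parent:      # first ELECTION: sender becomes parent
                parent[neighbor] = node
                queue.append(neighbor)
            # later copies of the ELECTION message are ignored

    # Replies flow from the leaves back to the source; each node passes
    # the best (capacity, node) pair it knows of up to its parent.
    best = {node: (capacity[node], node) for node in order}
    for node in reversed(order):            # children are handled before parents
        if parent[node] is not None:
            best[parent[node]] = max(best[parent[node]], best[node])
    return best[source][1], parent


# Hypothetical topology and capacities, for illustration only.
graph = {
    "a": ["b", "j"], "b": ["a", "c", "g"], "c": ["b", "d"], "d": ["c"],
    "g": ["b", "j", "e"], "j": ["a", "g"], "e": ["g"],
}
capacity = {"a": 4, "b": 2, "c": 3, "d": 6, "g": 1, "j": 5, "e": 8}
winner, parent = wireless_election(graph, capacity, "a")
print(winner, parent["g"])   # e b
```

Node g hears from b before j, so b becomes its parent, and the source a ends up choosing e, the node with the highest capacity.
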
What about very large networks?
• More than one node is selected; these nodes are called “supernodes” (super-peers).
• Nodes are organized as peers and super-peers.
• Elections are held within each peer group.
• Super-peers coordinate among themselves and update their own ‘internal’ networks.
Example of a global snapshot!
But that was easy
• In our system of world leaders, we were able to capture their ‘state’ (i.e., likeness) easily, because they were synchronized in space and synchronized in time.
• How would we take a global snapshot if the leaders were all at home?
• What if Obama told Trudeau that he should really put on a shirt? This message is part of our system state!
Global snapshot is global state
• Each distributed application has a number of processes (leaders) running on a number of physical servers.
• These processes communicate with each other via channels (text messaging).
• A snapshot captures the local state of each process (e.g., program variables) along with the state of each communication channel.
Why do we need Global State?
• There are innumerable uses for this, for instance:
  - finding the total number of files in a distributed file system, where files may be moved from one file server to another
  - finding the total space occupied by files in such a distributed file system
  - in general, detecting global properties of the distributed system, such as garbage (for garbage collection), deadlock, and termination
Global State
• The states of the participating PROCESSES, together with the states of the CHANNELS through which data (i.e., the files) pass when being transferred between these processes.
Example: Distributed Garbage Collection
[Figure: processes 1 and 2, with labels for an object, an object reference, a message, and a garbage object]
• Garbage collector: frees memory that is no longer in use by checking whether a reference to that memory still exists.
• What about in a distributed system?
  - A distributed system consists of multiple processes, each located on a different computer.
  - There is no sharing of processor or memory!
  - Each process can only determine its own “state”.
• Garbage: an object is considered to be garbage if there are no longer any references to it anywhere in the distributed system.
How to record snapshots?
• Simple solution: create a new process that collects the state of every other process. Every process saves its state at a specific time and sends it to this collector.
• Problem? This relies on the assumption that all processes share a synchronized global clock. This does not work.
• Example: p sends message m to q, and both record their state when their local clock reads 1PM, i.e. T(p) = 1PM and T(q) = 1PM.
  - Process p has no record of sending m: by p's clock, m was sent after 1PM.
  - Process q HAS a record of receiving m: by q's clock, m was received before 1PM!
• Problem? The global state does not show p sending m, so it is unclear where m came from. This breaks consistency.
Example: Global Clock Issue
• Initially A holds $400 and B holds $300.
• Picture taken at A: $400.
• A sends $100 to B.
• Picture taken at B: $400.
• The pictures add up to $800, although the system only ever holds $700.
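
The arithmetic of this example is easy to replay. The sketch below is only an illustration; the function name `naive_snapshot` is hypothetical, and the transfer is modeled as arriving instantly.

```python
def naive_snapshot():
    """Take the two pictures at different real times, as in the example."""
    balance = {"A": 400, "B": 300}
    picture = {}
    picture["A"] = balance["A"]    # picture taken at A: $400
    balance["A"] -= 100            # A sends $100 ...
    balance["B"] += 100            # ... and B receives it
    picture["B"] = balance["B"]    # picture taken at B: $400
    return sum(picture.values()), sum(balance.values())


print(naive_snapshot())  # (800, 700)
```

The recorded total is $100 too high because the picture at B includes the effect of the transfer while the picture at A predates its cause.
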
Consistent Picture
• Consider the happened-before relation: if e1 ➝ e2, then e1 happened before e2 and could have caused it.
• A consistent picture of the global state is obtained if we include in our computation a set of possible events, H, such that:
  (ei ∈ H ∧ ej ➝ ei) ⇒ ej ∈ H
• If ei were in H but ej were not, the set of events would include the effect of an event (for instance, the receipt of a file) but not the event causing it (the sending of the file), and an inconsistent picture would arise.
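
The closure condition on H translates directly into a check. This is a minimal sketch: the function name `is_consistent` and the event names are hypothetical, and the happened-before relation is given as a set of direct (earlier, later) pairs. Checking direct pairs is enough, because closure under direct predecessors implies closure under the whole relation.

```python
def is_consistent(H, happened_before):
    """True if H contains the cause of every event it contains.

    H:               set of event names
    happened_before: set of (earlier, later) pairs, i.e. earlier -> later
    """
    return all(earlier in H
               for earlier, later in happened_before
               if later in H)


# p sends a file, q receives it.
hb = {("send_file", "recv_file")}
print(is_consistent({"send_file", "recv_file"}, hb))  # True
print(is_consistent({"send_file"}, hb))               # True: file is in transit
print(is_consistent({"recv_file"}, hb))               # False: effect without cause
```
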
Consistent Global State
• The consistent GLOBAL STATE is then defined by:
  GS(H) = the state of each process pi after pi’s last event in H, plus, for each channel, the sequence of messages sent in H but not received in H.
• Consistent cut: the boundary representing the last event that has been recorded for each process.
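
The channel part of GS(H) can be computed mechanically from H. The sketch below is an illustration under assumptions: the function name `channel_states`, the tuple layout of a message, and the event and channel names are all hypothetical.

```python
def channel_states(H, messages):
    """Messages sent in H but not received in H, grouped by channel.

    H:        set of event names included in the cut
    messages: list of (message, send_event, receive_event, channel),
              given in sending order
    """
    state = {}
    for message, send_event, receive_event, channel in messages:
        if send_event in H and receive_event not in H:
            state.setdefault(channel, []).append(message)
    return state


messages = [
    ("m1", "p_send_m1", "q_recv_m1", "p->q"),
    ("m2", "p_send_m2", "q_recv_m2", "p->q"),
]
H = {"p_send_m1", "q_recv_m1", "p_send_m2"}   # m2 was sent but not yet received
print(channel_states(H, messages))  # {'p->q': ['m2']}
```
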
A possible computation
Example: Consistent Cut
Example: Consistent Cut
Example: Inconsistent Cut
How to Construct H?
• Idea: the CUT and the associated (consistent) set of events, H, are constructed by including specific control messages (MARKERS) in the stream of ordinary messages.
• Remember that we assume that a transmitted marker will be received (and dealt with) within a FINITE TIME.
Chandy-Lamport algorithm
• Problem: record a global snapshot (the state of each process and each channel).
• Model:
  - There are N processes in the system, with no failures.
  - There are two unidirectional FIFO channels between every pair of processes, one in each direction.
  - All messages arrive intact and are not duplicated.
  - The only events that can change the state are communication events.
• Later work relaxes these assumptions.
System requirements
• Taking a snapshot shouldn’t interfere with normal application behavior: don’t stop sending messages, and don’t stop the application!
• Each process can record its own state.
• State is collected in a distributed manner.
• Any process can initiate a snapshot.
Initiating a snapshot
• Say process Pi initiates the snapshot.
• Pi records its own state and prepares a special marker message (distinct from application messages).
• Pi sends the marker message to all other processes (using its N-1 outbound channels).
• Pi starts recording all incoming messages on channels Cji, for every j not equal to i.
Propagating a snapshot
• For every process Pj (including the initiator), consider a marker message arriving on channel Ckj.
• If this is the first marker Pj has seen:
  - Pj records its own state and marks Ckj as empty.
  - Pj sends the marker message to all other processes (using its N-1 outbound channels).
  - Pj starts recording all incoming messages on channels Cij, for every i not equal to j or k.
• Otherwise, Pj has already recorded its own state: it stops recording on Ckj and takes as the state of Ckj all messages that arrived on it since recording began.
Terminating a snapshot
• The snapshot is complete once:
  - all processes have received a marker (and recorded their own state), and
  - all processes have received a marker on all of their N-1 incoming channels (and recorded the states of those channels).
• Later, a central server can gather the partial states to build a global snapshot.
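
The initiate, propagate, and terminate rules can be put together in a small simulation. This is a minimal sketch, not a reference implementation: the class names `Network` and `Process`, the use of account balances as local state, the two-process scenario, and the hand-scheduled delivery order are all assumptions made for illustration. Channels are FIFO and lossless, as the model requires.

```python
from collections import deque

MARKER = "MARKER"


class Network:
    """One FIFO queue per directed channel (src, dst)."""
    def __init__(self):
        self.channels = {}

    def send(self, src, dst, msg):
        self.channels.setdefault((src, dst), deque()).append(msg)

    def deliver(self, src, dst, processes):
        msg = self.channels[(src, dst)].popleft()
        processes[dst].receive(src, msg, self)


class Process:
    def __init__(self, pid, balance, peers):
        self.pid, self.balance, self.peers = pid, balance, peers
        self.recorded_state = None   # local state saved by the snapshot
        self.recording = {}          # open incoming channels: sender -> messages
        self.channel_state = {}      # closed incoming channels: sender -> messages

    def record_and_send_markers(self, net):
        self.recorded_state = self.balance
        self.recording = {q: [] for q in self.peers}
        for q in self.peers:
            net.send(self.pid, q, MARKER)

    def receive(self, src, msg, net):
        if msg == MARKER:
            if self.recorded_state is None:       # first marker seen
                self.record_and_send_markers(net)
            # A marker closes the channel it arrived on. After a first marker
            # the list is still empty, so that channel is recorded as empty.
            self.channel_state[src] = self.recording.pop(src)
        else:
            self.balance += msg                   # ordinary application message
            if src in self.recording:
                self.recording[src].append(msg)


def run_demo():
    net = Network()
    procs = {"A": Process("A", 400, ["B"]), "B": Process("B", 300, ["A"])}
    procs["A"].balance -= 100
    net.send("A", "B", 100)                   # $100 is now in flight
    procs["B"].record_and_send_markers(net)   # B initiates the snapshot
    net.deliver("A", "B", procs)              # B receives the $100
    net.deliver("B", "A", procs)              # A receives B's marker
    net.deliver("A", "B", procs)              # B receives A's marker
    return procs


procs = run_demo()
total = sum(p.recorded_state + sum(sum(m) for m in p.channel_state.values())
            for p in procs.values())
print(total)  # 700
```

Both processes record $300, and the $100 in flight is captured as the state of the channel from A to B, so the snapshot accounts for the full $700 without stopping the application.
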
Example 1: The Algorithm In Action
Example 2: The Algorithm In Action.
How is the Global Snapshot Then Created?
• In a practical implementation, the recorded local snapshots must be put together to create a global snapshot of the distributed system.
• How? Several policies are possible:
  - each process sends its local snapshot to the initiator of the algorithm
  - each process sends the information it records along all outgoing channels, and each process receiving such information for the first time propagates it along its outgoing channels
How is that possible?!
• The algorithm finds a global state based on a partial ordering ➝ of events.
• For instance, we know that e1 ➝ e3 and e2 ➝ e5, BUT we have no knowledge about the timing relationship between e3 and e5.
• With respect to ➝, e3 and e5 are incomparable: we cannot determine the true sequence of these events!
So Why Record Global State?
• Stable property: a property that, once it holds, persists, such as termination or deadlock.
• Idea: if a stable property holds in the system before the snapshot begins, it holds in the recorded global snapshot.
• A recorded global state is therefore useful for DETECTING STABLE PROPERTIES.