DC7: More Coordination Chapter 11 and 14.2 Consensus Group Communication
Topics Agreement and Consensus: no fault, fail-stop, Byzantine. Group Communication: order and delivery guarantees. Virtual Synchrony.
Consensus section 11.5 Distributed agreement or "distributed consensus" is the fundamental problem in DS. Distributed mutual exclusion and election are basically getting processes to agree on something. Agreeing on time or the update of replicated data are special cases of the distributed consensus problem. Agreement sometimes means one process proposes a value and the others agree on it while consensus means all processes propose values and all agree on some function of those values.
Consensus (Agreement) There are M processes, P1, P2, … Pm in a DS that are trying to reach agreement. A subset F of the processes are faulty. Each process Pi is initially undecided and proposes a value Vi. During agreement, the processes each calculate a value Ai. At the end of the algorithm: All non-faulty processes reach a decision. For every pair of non-faulty processes Pi and Pj, Ai = Aj. This is the agreement value. The agreement value is a function of the initial values {Vi} of the non-faulty processes. The function is often max (as in the case of election) or average or one of the Vi. If all non-faulty processes have the same Vi, then that must be the agreement value. Communications are reliable and synchronous.
Consensus: Easy Case: No Failures No failures, synchronous, M processes. If there can be no failures, reaching consensus is easy. Every process sends its value to every other process. All processes now have identical information. All processes do the same calculation and come up with the same value. Each process needs to maintain an array of M values. Example with M=4: P1 has {1,2,3,4}, P2 has {1,2,3,4}, P3 has {1,2,3,4}, P4 has {1,2,3,4}.
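The no-failure case above can be sketched in a few lines. This is only an illustration: the function name `consensus_no_failures` and the choice of `max` as the decision function are assumptions, not part of the slides.

```python
def consensus_no_failures(proposals, decide=max):
    """proposals maps each process id to its proposed value Vi."""
    # Exchange step: every process sends its value to every other process,
    # so each one ends up holding the same array of M values.
    views = {pid: dict(proposals) for pid in proposals}
    # Decision step: every process applies the same function to the same data.
    return {pid: decide(view.values()) for pid, view in views.items()}


decisions = consensus_no_failures({"P1": 1, "P2": 2, "P3": 3, "P4": 4})
```

Because every process computes on identical data, any deterministic function (max, average, first value) gives the same agreement value everywhere.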
Consensus: Fail-stop Fairly easy case: fail-stop, synchronous. If faulty processes are fail-stop, reaching consensus is reasonably easy: all non-faulty processes send their values to all others. However, up to f of them may fail at some time during the exchange. Example: P1 has {1,2,3,4} and P2 has {1,2,3,4}, but P3 has {x,2,3,4} and P4 has {x,2,3,4}, where x marks a value that was never received.
Consensus: Fail-stop The solution: after all processes send their values to all others, every process broadcasts all the values it received (and from whom). This continues for f+1 rounds, where f = |F|. Processes maintain a tree of values. Example: P3 and P4 have {x,2,3,4} after the 1st round; in the 2nd round, P2 reports {1,2,3,4} and P3 reports {x,2,3,4}.
Consensus: Fail-stop If M=4 and f=1, then we need f+1=2 rounds to get consensus (previous example). Do we really need f+1 rounds? Consider M=4, f=2: P1 crashes during the 1st round after sending to P2, and P2 crashes during the 2nd round after sending to P3. After the 1st round: P2 has {1,2,3,4}, P3 has {x,2,3,4}, P4 has {x,2,3,4}.
Consensus: Fail-stop What do P3 and P4 see? Round 1: P2 has {1,2,3,4}; P3 and P4 have {x,2,3,4}. Round 2: P2 sends {1,2,3,4} to P3 and dies; P3 now has {1,2,3,4}, P4 still has {x,2,3,4}. Round 3: P3 and P4 both have {1,2,3,4}. If processes are fail-stop, we can tolerate any number of faulty processes; however, we need f+1 rounds.
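The M=4, f=2 scenario can be replayed with a small round-based simulation. This is a sketch under stated assumptions: the name `flood_consensus`, the crash-schedule format, and the use of `min` as the decision function (chosen so that the lost value 1 actually matters) are all illustrative and not taken from the slides.

```python
def flood_consensus(proposals, crashes, rounds, decide=max):
    """
    proposals: pid -> proposed value
    crashes:   pid -> (round in which it crashes, recipients reached first)
    rounds:    number of synchronous rounds to run (f+1 tolerates f crashes)
    """
    known = {pid: {pid: value} for pid, value in proposals.items()}
    alive = set(proposals)
    for rnd in range(1, rounds + 1):
        inbox = {pid: {} for pid in proposals}
        crashed_now = set()
        for sender in sorted(alive):
            targets = set(proposals) - {sender}
            if sender in crashes and crashes[sender][0] == rnd:
                # Fail-stop: only some recipients are reached, then it dies.
                targets &= set(crashes[sender][1])
                crashed_now.add(sender)
            for target in targets:
                inbox[target].update(known[sender])
        alive -= crashed_now
        for pid in alive:
            known[pid].update(inbox[pid])
    return {pid: decide(known[pid].values()) for pid in sorted(alive)}


values = {"P1": 1, "P2": 2, "P3": 3, "P4": 4}
schedule = {"P1": (1, ["P2"]), "P2": (2, ["P3"])}
two_rounds = flood_consensus(values, schedule, rounds=2, decide=min)
three_rounds = flood_consensus(values, schedule, rounds=3, decide=min)
```

With only two rounds P3 has learned P1's value but P4 has not, so they decide differently; the third round lets P3 pass the value on to P4.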
Difficult Case: Agreement with Byzantine Failures Similar problems: agreement (single proposer) and consensus (all propose values). A faulty process may respond like a non-faulty process, so the non-faulty processes do not know who is faulty. A faulty process can send a fake value to throw off the calculation, and can send one value to some processes and a different value to others. The faulty process is modeled as an adversary that can see the global state, so it has more information than the non-faulty nodes. However, it can only control the faulty processes.
Variations on Byzantine Agreement Process always knows who sent the received message. Default value - some algorithms assume a default value (retreat) when there is no agreement. Oral messages - message content is controlled by latest sender (relayer) so receiver doesn’t know whether or not it was tampered with. Signed messages - messages can be authenticated with digital signatures. Assume faulty processes can send arbitrary messages but they cannot forge signatures.
BA with Oral Messages(1) Commanding general coordinates other generals. If all loyal generals attack, victory is certain. If none attack, the Empire survives. If some attack, the Empire is lost. A gong keeps time.
BA with Oral Messages(2) How it works. Disloyal generals have corrupt soldiers. Orders are distributed by exchange of messages; corrupt soldiers violate the protocol at will. But corrupt soldiers cannot intercept and modify messages between loyal generals. The gong sounds slowly: there is ample time for the exchange of messages. The commanding general sends his order. Then all other generals relay to all what they received.
BA with Oral Messages(3) Limitations Let t be the maximum number of faulty processes (disloyal generals). Byzantine agreement is not possible with fewer than 3t+1 processes. The same result holds for consensus in the Byzantine model. Requires t+1 rounds of messages.
Byzantine Consensus Oral Messages(1) The Byzantine generals problem for 3 loyal generals and 1 traitor. (a) The generals announce their troop strengths (in units of 1 kilosoldier) to all other generals. (b) The vectors that each general assembles based on (a). (c) Additional vectors that each general receives in the next round (all send what they received to all). Decide each other general's value by majority of the 3 received vectors. If there is no majority, use the default value.
Byzantine Consensus Oral Messages(2) The same as in the previous slide, except now with 2 loyal generals and one traitor. Majority decision does not guarantee consensus.
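The two-round vector exchange from the last two slides can be sketched as follows. The names `byzantine_vectors`, `majority`, and `sneaky` are hypothetical, and the traitor's behaviour (a different made-up value for every receiver and position) is just one possible adversary, assumed for illustration.

```python
from collections import Counter


def majority(values, default=None):
    """Return the value held by a strict majority of reports, else the default."""
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) / 2 else default


def byzantine_vectors(strengths, traitor, lie):
    """strengths: general id -> true value; lie(receiver, position) -> fake value."""
    ids = sorted(strengths)
    loyal = [g for g in ids if g != traitor]
    # Round 1: each loyal general assembles a vector of announced values.
    vectors = {
        g: {s: lie(g, s) if s == traitor else strengths[s] for s in ids}
        for g in loyal
    }
    # Round 2: everyone relays its vector; the traitor relays arbitrary values.
    results = {}
    for g in loyal:
        result = {g: strengths[g]}
        for pos in ids:
            if pos == g:
                continue
            reports = [
                lie(g, pos) if relay == traitor else vectors[relay][pos]
                for relay in ids
                if relay != g
            ]
            result[pos] = majority(reports)  # None plays the role of the default
        results[g] = result
    return results


def sneaky(receiver, position):
    return f"lie-{receiver}-{position}"


four = byzantine_vectors({1: 1, 2: 2, 3: 3, 4: 4}, traitor=3, lie=sneaky)
three = byzantine_vectors({1: 1, 2: 2, 3: 3}, traitor=3, lie=sneaky)
```

With four generals, every loyal general ends with the same vector (the traitor's entry falls back to the default). With three generals, each loyal general has only two reports per entry, so no strict majority exists and the loyal generals cannot even confirm each other's values.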
BA with Signed Messages (1) A faulty process can send arbitrary messages, but cannot forge signatures. All messages are digitally signed for authentication. Assume at most f faulty nodes. At the start, each node broadcasts its value in a signed message. At round i, each process endorses (authenticates) and forwards all messages received in round i-1. Signatures help locate the faulty process.
BA with Signed Messages (2) At round f+1: if there is one value per coordinate endorsed by at least f+1 nodes, decide by majority; otherwise, decide the default value. f+1 rounds are proven to be necessary and sufficient. Must have f+2 processes. Ex: In round 1 node p sent me value x. In round 2 node p sent a vector with its own component = y. I conclude node p is faulty.
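The example on this slide (node p signs x in one round and y in the next) amounts to detecting two validly signed but conflicting values from the same signer. The sketch below is illustrative only: it uses HMAC with per-process secrets from Python's standard library as a stand-in for real digital signatures, and the key table and function names are assumptions.

```python
import hashlib
import hmac

# Hypothetical per-process secrets; a real system would use public-key signatures.
KEYS = {"p": b"secret-p", "q": b"secret-q"}


def sign(pid, value):
    return hmac.new(KEYS[pid], repr(value).encode(), hashlib.sha256).hexdigest()


def verify(pid, value, signature):
    return hmac.compare_digest(sign(pid, value), signature)


def find_equivocators(messages):
    """messages: iterable of (signer, value, signature) seen across rounds."""
    first_value, faulty = {}, set()
    for signer, value, signature in messages:
        if not verify(signer, value, signature):
            continue  # forged or corrupted message: ignore it
        if signer in first_value and first_value[signer] != value:
            faulty.add(signer)  # two valid signatures on different values
        first_value.setdefault(signer, value)
    return faulty


seen = [
    ("p", "x", sign("p", "x")),  # round 1: p sent x
    ("p", "y", sign("p", "y")),  # round 2: p's own component is now y
    ("q", "x", sign("q", "x")),
    ("q", "z", "not-a-valid-signature"),
]
suspects = find_equivocators(seen)
```

Because signatures cannot be forged, a relayed conflicting value is proof of the original signer's fault rather than of the relayer's, which is what lets signed-message algorithms work with fewer processes than the oral-message case.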
Summary
Consensus            Required Number (N)   Required Rounds (R)
fail-stop            N = f+1               R = f+1
byzantine (oral)     N = 3f+1              R = f+1
byzantine (signed)   N = f+2               R = f+1
Consensus in Asynchronous Systems All of the preceding agreement and consensus algorithms are for synchronous systems; that is, the algorithm works by sending messages in rounds or phases. What about Byzantine consensus in an asynchronous system? Deterministic consensus is provably impossible if even one process may crash [FLP1985], and therefore also with Byzantine faults, but practical algorithms do exist using failure detectors.
Reliable Group Communication We would like a message sent to a group of processes to be delivered to every member of that group. Problems: Processes join and leave group. Processes crash (that's a leave). Sender crashes (after sending to some or doing part of the send operation). What about: Efficiency? Message delivery order? Timeliness?
Group Communication Multicast communication requires coordination and agreement. Members of a group receive copies of messages sent to the group. Many different delivery guarantees are possible, e.g. agreeing on the set of messages received or on the delivery ordering. A process can multicast with a single operation instead of a separate send to each member, for example aSocket.send(aMessage) in IP multicast. The single operation allows for: efficiency (send once on each link, using hardware multicast when available); delivery guarantees (no guarantee can be made if multicast is implemented as multiple sends and the sender fails partway through); and ordering.
Reliable Group Communication Revised definition: A message sent to a group of processes should be delivered to all non-faulty members of that group. How to implement reliability: message sequencing and ACKs.
Reliable Group Communication For efficiency, many algorithms form a tree structure to handle message multiplication. Should interior nodes store the message? If not, all ACKs must be sent to the originator.
RGC: Handling Ack/Nacks Problem: ACK implosion, which does not scale well. Solution attempt: don't ACK; instead, NACK missing messages. However, a receiver that is getting nothing at all does not know it missed a message, so it never sends a NACK. Thus the sender has to buffer outgoing messages forever. Also, a message dropped high in the multicast tree creates a NACK implosion.
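NACK-based recovery relies on the receiver noticing a gap in sequence numbers. A minimal sketch, assuming messages carry consecutive sequence numbers starting at 1; the class name `NackReceiver` and its interface are illustrative.

```python
class NackReceiver:
    """Delivers messages in sequence order and reports gaps to NACK."""

    def __init__(self):
        self.next_expected = 1
        self.buffer = {}      # seq -> payload, received but not yet delivered
        self.delivered = []

    def receive(self, seq, payload):
        """Handle one arrival; return the sequence numbers currently missing."""
        if seq >= self.next_expected:
            self.buffer[seq] = payload
        # Deliver everything that is now contiguous.
        while self.next_expected in self.buffer:
            self.delivered.append(self.buffer.pop(self.next_expected))
            self.next_expected += 1
        # A gap is only visible once a later message has arrived.
        highest = max(self.buffer, default=self.next_expected - 1)
        return [s for s in range(self.next_expected, highest) if s not in self.buffer]


receiver = NackReceiver()
nacks_after_1 = receiver.receive(1, "a")
nacks_after_3 = receiver.receive(3, "c")
nacks_after_4 = receiver.receive(4, "d")
nacks_after_2 = receiver.receive(2, "b")
```

The sketch also shows the weakness noted above: if message 3 had been the last one sent and it was lost, the receiver would never see a gap and would never NACK.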
RGC: Handling Nacks If processes see all messages from others, they can use Scalable Reliable Multicast (SRM) [Floyd 1997]. There are no ACKs in SRM; only missing messages are NACKed. When a client detects a missed message, it waits for a random delay, then multicasts its NACK to everyone in the group. This feedback allows other group members who missed the same message to suppress their NACKs. Assumption: the retransmission of the NACKed message will be a multicast. This is called Feedback Suppression. Problems: still lots of NACK traffic.
Nonhierarchical Feedback Control Several receivers have scheduled a request for retransmission, but the first retransmission request leads to the suppression of others.
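Feedback suppression can be illustrated with a toy timing model. This is a sketch, not SRM itself: the function name `srm_nacks`, the uniform random delay, and the single fixed `latency` between any two group members are simplifying assumptions.

```python
import random


def srm_nacks(receivers, max_delay=1.0, latency=0.05, seed=0):
    """Return the receivers that actually multicast a NACK for one lost message."""
    rng = random.Random(seed)
    # Each receiver that missed the message schedules its NACK after a random delay.
    timers = {r: rng.uniform(0, max_delay) for r in receivers}
    first = min(timers.values())
    # A receiver is suppressed if the first NACK reaches it before its own timer fires.
    return sorted(r for r, t in timers.items() if t <= first + latency)


instant = srm_nacks(range(50), latency=0.0)
typical = srm_nacks(range(50))
slow_network = srm_nacks(range(50), latency=2.0)
```

When the NACK propagates instantly only one receiver sends it; as the propagation latency grows relative to the random delay, more timers fire before the suppression arrives, which is the residual NACK traffic mentioned above.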
Hierarchical Feedback Control Hierarchies or trees are frequently formed for multicast, so why not use them for feedback control? Better scalability. Works if there is a single sender or a local group of senders and the group membership is fairly stable. A rooted tree is formed with the sender at the root. Each other node is a group of receivers. Each group of receivers has a coordinator that buffers the message, collects NACKs or ACKs from its group, and sends one on up the tree to the sender. Hard to handle group membership changes.
Hierarchical Feedback Control The essence of hierarchical reliable multicasting. Each local coordinator forwards the message to its children. A local coordinator handles retransmission requests.
Multicast Terminology A message is received by the OS and communication layer, but it is not delivered to the application until it has been verifiably received by all other processes in the group.
The Meaning of Delivered
Message Ordering Unordered - P1 is delivered the messages in arbitrary order which might be different from the order in which P2 gets them. FIFO - all messages from a single source will be delivered in the order in which they were sent. Causally ordered - recall Lamport definition of causality. Potential causality must be preserved. Causally related messages from multiple sources are delivered in causal order. Total order - all processes deliver the messages in the same order. Frequently causal also. "All messages multicast to a group are delivered to all members of the group in the same order"
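Causally ordered delivery is commonly implemented with a hold-back queue and vector timestamps. The sketch below assumes that scheme (each sender increments only its own entry when it multicasts); the function names and the message format are illustrative, not from the slides.

```python
def can_deliver(local, sender, stamp):
    """True if the message is the next one from sender and depends on nothing unseen."""
    return all(
        stamp[k] == local[k] + 1 if k == sender else stamp[k] <= local[k]
        for k in range(len(local))
    )


def causal_deliver(n, arrivals):
    """arrivals: (sender index, vector timestamp, payload) in arrival order."""
    local = [0] * n          # messages delivered so far, per sender
    held, delivered = [], []
    for message in arrivals:
        held.append(message)
        progress = True
        while progress:      # keep draining the hold-back queue
            progress = False
            for candidate in list(held):
                sender, stamp, payload = candidate
                if can_deliver(local, sender, stamp):
                    local[sender] += 1
                    delivered.append(payload)
                    held.remove(candidate)
                    progress = True
    return delivered


# P0 multicasts m1; P1 delivers m1 and then multicasts m2 (so m1 -> m2).
# At P2 the reply m2 happens to arrive first.
order_at_p2 = causal_deliver(3, [(1, [1, 1, 0], "m2"), (0, [1, 0, 0], "m1")])
```

m2 is held back until m1 has been delivered, so potential causality is preserved even though the network reordered the two messages.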
Unordered Messages Process P1 Process P2 Process P3 sends m1 receives m1 sends m2 receives m2 Three communicating processes in the same group. The ordering of events per process is shown along the vertical axis.
FIFO Ordering Process P1 Process P2 Process P3 Process P4 sends m1 delivers m1 delivers m3 sends m3 sends m2 sends m4 delivers m2 delivers m4 Four processes in the same group with two different senders, and a possible delivery order of messages under FIFO-ordered multicasting
Atomic Multicast Totally ordered group communication. Atomic = message is delivered to all or none. View (also group view) is group membership at any given time. That is, the set of processes belonging to the group. The concept of a view is needed to handle membership changes.
Total Order Process P1 Process P2 Process P3 Process P4 sends m1 delivers m1 sends m3 sends m2 delivers m4 sends m4 delivers m2 delivers m3 If P1 and P4 are in the multicast group, they also deliver the messages in this order. So, P4 may send m3 at t=1 but not deliver it until t=2
Virtual (View) Synchrony How do we define atomic multicast in the presence of failures? How can we guarantee delivery to all group members? Example: there are 50 members in the group and I multicast a message m1; P10 fails before getting the message, but the others get it and I assume P10 got it too. Control membership changes with a view change. Virtual synchrony says something about the order of message delivery with respect to a view change message, since messages must be ordered with respect to the view change message.
Properties of Virtual Synchrony Each process in the view has the same view. That is, they all agree on the group membership. When a process joins or leaves (including crash), this is announced to all (non-crashed) processes in the (old) group with a view change message VC. If one process, P1, in view v delivers message m, then all processes belonging to view v deliver message m in view v. (Recall difference between receive and deliver)
Virtual Synchrony The principle of virtual synchronous multicast.
Figure 14.3 Processes p, q, r; p crashes, and view (p, q, r) is followed by view (q, r). Cases a and b are allowed; cases c and d are disallowed.
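The allowed and disallowed cases can be expressed as a check over the delivery logs of the surviving processes. This is a sketch under assumptions: the log format, the function name `check_view_synchrony`, and the reading of the four cases (a: both survivors deliver m before the view change; b: neither delivers it; c: both deliver it after the change; d: one before, one after) are illustrative interpretations of the figure.

```python
def check_view_synchrony(logs):
    """
    logs: pid -> list of events for processes that did not crash.
    An event is ("view", members) or ("msg", message id, sender).
    """
    per_view = {}  # view -> pid -> set of message ids delivered in that view
    for pid, events in logs.items():
        view = None
        for event in events:
            if event[0] == "view":
                view = event[1]
                per_view.setdefault(view, {})[pid] = set()
            else:
                _, msg_id, sender = event
                # A message must be delivered in a view that contains its sender.
                if view is None or sender not in view:
                    return False
                per_view[view][pid].add(msg_id)
    # Everyone who installed a view must have delivered the same messages in it.
    return all(
        len({frozenset(msgs) for msgs in by_pid.values()}) <= 1
        for by_pid in per_view.values()
    )


V1, V2 = ("p", "q", "r"), ("q", "r")
m = ("msg", "m", "p")
case_a = {"q": [("view", V1), m, ("view", V2)], "r": [("view", V1), m, ("view", V2)]}
case_b = {"q": [("view", V1), ("view", V2)], "r": [("view", V1), ("view", V2)]}
case_c = {"q": [("view", V1), ("view", V2), m], "r": [("view", V1), ("view", V2), m]}
case_d = {"q": [("view", V1), ("view", V2), m], "r": [("view", V1), m, ("view", V2)]}
```

The check only looks at survivors, which matches the idea that a crashed process's partial deliveries do not matter as long as the remaining members agree.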
Multicast Summary
Multicast                 Basic Message Ordering    Total-ordered Delivery?
Reliable multicast        None                      No
FIFO multicast            FIFO-ordered delivery     No
Causal multicast          Causal-ordered delivery   No
Atomic multicast          None                      Yes
FIFO atomic multicast     FIFO-ordered delivery     Yes
Causal atomic multicast   Causal-ordered delivery   Yes
Six different kinds of reliable multicasting.
Atomic Multicast: Amoeba, etc One node is the coordinator. Either: everyone sends its messages to the coordinator, and the coordinator chooses the order and sends each message to everyone; or: everyone sends its messages to the coordinator and to all nodes, and the coordinator chooses the order and sends only the message number to everyone (e.g. msg 5 from p4: global order 33).
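The second variant (data goes to everyone, the coordinator only announces global numbers) can be sketched as below. This is an illustration, not Amoeba's actual protocol: the classes `Sequencer` and `Member` and their methods are hypothetical names.

```python
class Sequencer:
    """Coordinator: assigns the next global order number to each message."""

    def __init__(self):
        self.next_global = 1

    def order(self, sender, local_seq):
        number = self.next_global
        self.next_global += 1
        return sender, local_seq, number


class Member:
    """Group member: holds messages back until their global turn comes."""

    def __init__(self):
        self.payloads = {}    # (sender, local_seq) -> payload
        self.orders = {}      # global number -> (sender, local_seq)
        self.next_deliver = 1
        self.delivered = []

    def on_data(self, sender, local_seq, payload):
        self.payloads[(sender, local_seq)] = payload
        self._drain()

    def on_order(self, sender, local_seq, global_seq):
        self.orders[global_seq] = (sender, local_seq)
        self._drain()

    def _drain(self):
        while (self.next_deliver in self.orders
               and self.orders[self.next_deliver] in self.payloads):
            key = self.orders.pop(self.next_deliver)
            self.delivered.append(self.payloads.pop(key))
            self.next_deliver += 1


coordinator = Sequencer()
first = coordinator.order("p4", 5)
second = coordinator.order("p2", 1)

a, b = Member(), Member()
a.on_data("p4", 5, "x"); a.on_order(*first)
a.on_data("p2", 1, "y"); a.on_order(*second)
b.on_order(*second); b.on_data("p2", 1, "y")
b.on_data("p4", 5, "x"); b.on_order(*first)
```

Both members deliver in the coordinator's order even though data and order announcements reached them in different sequences.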
Atomic Multicast: Totem Developed at UCSB Processes are organized into a logical ring. A token is passed around the ring. The token has the message number of the next message to be multicast. Only the token holder can multicast a message. This easily establishes total order. Retransmits for missed messages are the responsibility of the original sender.
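The token-ring idea reduces to a sequence counter that travels with the token. The sketch below simulates one rotation only; the function name `token_ring_order` and the outbox format are assumptions, and retransmission and token loss are not modeled.

```python
def token_ring_order(ring, outbox):
    """
    ring:   process ids in ring order
    outbox: pid -> list of payloads waiting to be multicast
    Returns (sequence number, sender, payload) for one rotation of the token.
    """
    pending = {pid: list(outbox.get(pid, [])) for pid in ring}
    next_seq = 1  # carried by the token
    log = []
    for holder in ring:
        # Only the current token holder may multicast.
        for payload in pending[holder]:
            log.append((next_seq, holder, payload))
            next_seq += 1
    return log


rotation = token_ring_order(["A", "B", "C"], {"C": ["c1"], "A": ["a1", "a2"]})
```

Since there is exactly one token, sequence numbers are unique and gap-free, so every member that delivers in sequence-number order sees the same total order.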
End