IS 651: Distributed Systems Fault Tolerance Sisi Duan Assistant Professor Information Systems sduan@umbc.edu
What we have learned so far: distributed communication; time synchronization; mutual exclusion; consistency; the Web (de-coupling, separation of roles); distributed file systems (de-coupling, replication, fault tolerance)
Roadmap of the rest of the class: primary-backup replication; consensus (crash fault tolerance, Byzantine fault tolerance); distributed databases; Google Chubby lock service; blockchains; distributed computing systems
Announcement Project Progress Report due next week (Oct 31); Homework 4 due in two weeks (Nov 7)
Project Progress Report Please submit what you have achieved so far. You should have made some progress and discuss it in detail (e.g., a review of a few papers). List all the related work you plan to include in the final report
Today Fault tolerance via replication: the primary-backup approach and Viewstamped Replication
Fault Tolerance via Replication To tolerate one failure, one must replicate data in more than one place. Particularly important at scale: suppose a typical server crashes once a month; how often does some server crash in a 10,000-server cluster? 30*24*60/10,000 ≈ 4.3 minutes
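A quick back-of-the-envelope check of the number above (a hedged sketch in Python; it assumes independent crashes and a 30-day month):

# Expected time between crashes somewhere in a large cluster,
# assuming each server crashes independently about once per 30-day month.
minutes_per_month = 30 * 24 * 60              # 43,200 minutes
servers = 10_000
print(minutes_per_month / servers)            # 4.32 -> some server crashes roughly every 4.3 minutes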
Consistency: Correctness How to replicate data "correctly"? Replicas should be indistinguishable from a single object; linearizability is the ideal. One-copy semantics: copies of the same data should (eventually) be the same. Consistency: the replicated system should "behave" just like an un-replicated system
Challenges Concurrency, machine failures, and network failures (the network is unreliable). The tricky part: is an unresponsive server slow, or has it failed?
A Motivating Example Replication on two servers Multiple client requests (might be concurrent)
Failures under concurrency! The two servers see different results
What we need… To keep replicas in sync, writes must be done in the same order Discussion: Which consistency model works? The consistency models: linearizability, sequential consistency, causal consistency, eventual consistency
Replication Active Replication (in the following few weeks): State Machine Replication (SMR), sometimes called Replicated State Machine (RSM). Every replica maintains a state machine; given the same input, the next state is deterministic (a minimal sketch follows below). Passive Replication (today): Primary-Backup Replication. The Google File System has 3-way replicated chunks; what type of replication is that?
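A minimal sketch of the determinism requirement in SMR, using an illustrative key/value state machine (the class and operation names are mine, not from the lecture):

# Replicas that apply the same operations in the same order end up in the
# same state, because each operation is deterministic.
class KVStateMachine:
    def __init__(self):
        self.store = {}

    def apply(self, op):
        kind, key, *rest = op
        if kind == "put":
            self.store[key] = rest[0]
            return "ok"
        if kind == "get":
            return self.store.get(key)

log = [("put", "x", 1), ("put", "y", 2), ("get", "x")]   # agreed order of operations
replica_a, replica_b = KVStateMachine(), KVStateMachine()
for op in log:
    replica_a.apply(op)
    replica_b.apply(op)
assert replica_a.store == replica_b.store                # identical states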
Passive Replication Primary-backup replication: one replica is the primary. It decides to process a client request and assigns it an order, processes the request, and sends the order to the other replicas. The other replicas are the backups; they update their states after receiving the update from the primary. A sketch of this flow follows below
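A hedged sketch of this passive flow, with in-process method calls standing in for network messages (class and method names are illustrative):

# Passive (primary-backup) replication: the primary orders and processes the
# request, then ships the ordered update to the backups, which apply it
# without re-executing the request themselves.
class Backup:
    def __init__(self):
        self.state = {}
        self.last_order = 0

    def apply_update(self, order, key, value):
        assert order == self.last_order + 1   # apply updates in the primary's order
        self.state[key] = value
        self.last_order = order

class Primary:
    def __init__(self, backups):
        self.state = {}
        self.next_order = 0
        self.backups = backups

    def handle_request(self, key, value):
        self.next_order += 1                  # decide the order
        self.state[key] = value               # process the request
        for b in self.backups:                # send the ordered update to the backups
            b.apply_update(self.next_order, key, value)
        return "ok"

backups = [Backup(), Backup()]
primary = Primary(backups)
primary.handle_request("x", 42)
assert all(b.state == primary.state for b in backups)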
Primary-Backup Replication Server 1 as the primary All the clients know who is the current primary
Passive Replication Primary-Backup We consider benign failures for now Fail-Stop Model A replica follows its specification until it crashes (faulty) A faulty replica does not perform any action (does not recover) A crash is eventually detected by every correct processor No replica is suspected of having crashed until after it actually crashes
How to deal with failure? #1: Backup Failure Discussion: How can we guarantee the backup gets the order? The network is unreliable!
How to deal with failure? #1: Backup Failure Acknowledgment: the backup acknowledges each update. What if a backup times out in acknowledging?
How to deal with failure? #1: Backup Failure What if a backup times out in acknowledging? The primary retries (if the backup failure is transient). What if the failure is "permanent"? Is it safe to ignore the failure? How can the backup catch up later? See the retry sketch below
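A small sketch of the retry logic discussed above; send_update() and its Timeout error are assumed placeholders, not a real API:

import time

class Timeout(Exception):
    """Raised by send_update() when the backup does not acknowledge in time (assumed)."""

def replicate_with_retry(send_update, backup, update, retries=3, delay=0.5):
    # Retry a few times in case the backup failure is transient; if it keeps
    # timing out, treat the failure as "permanent" so the backup must later
    # catch up (e.g., via state transfer) before rejoining.
    for _ in range(retries):
        try:
            send_update(backup, update)       # blocks until the backup acks
            return True                       # acknowledged
        except Timeout:
            time.sleep(delay)                 # wait a bit, then retry
    return False                              # give up: backup needs catch-up later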
How to deal with failure? #2: Primary Failure Discussion: What if the primary fails?
How to deal with failure? #2: Primary Failure Discussion: What if the primary fails? Switch to another one! In GFS: How? What are the issues if we want to do it in a distributed way?
How to deal with failure? #2: Primary Failure Discussion: What if the primary fails? Switch to another one! What are the issues? Could there be accidentally two “valid” primaries? If an op is done before the switch, how to ensure it’s not ”lost” after the switch? What to re-integrate for a recovered server?
Remember the example in our 1st class…
The History of Failure Handling For a long time, people did it manually (with no guaranteed correctness): one primary, one backup. The primary ignores a temporary replication failure of a backup. If the primary crashes, a human operator re-configures the system to use the former backup as the new primary; some ops done by the old primary might be "lost" at the new primary. This is still true in a lot of systems: a consistency checker is run at the end of every day to fix discrepancies (according to some rules).
Viewstamped Replication The first work to handle these failures correctly. Original paper, 1988; "Viewstamped Replication Revisited", 2012. Barbara Liskov, Turing Award 2008. We will discuss her other famous algorithms in a few weeks!
Viewstamped Replication (VR) Overview Static configuration of servers, e.g., p0, p1, p2. To handle primary failure, VR moves through a sequence of "views" 0, 1, 2, 3, … In each view the primary id is deterministic: given a view number, we know which server is the primary. A simple solution: primary id = view number % total servers, so view 0 -> p0, view 1 -> p1, view 2 -> p2, …
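The primary-selection rule from this slide, as a one-line sketch:

# Deterministic primary selection: primary id = view number % total servers.
servers = ["p0", "p1", "p2"]

def primary_of(view: int) -> str:
    return servers[view % len(servers)]

assert primary_of(0) == "p0" and primary_of(1) == "p1" and primary_of(4) == "p1"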
VR conditions An operation is "committed" if it is replicated by a threshold number of servers. Once committed, an operation's order is fixed (each request is assigned a sequence number). Why is it correct? No two different operations are committed with the same order!
Correctness Safety: all the correct replicas maintain the same state. Liveness: the client eventually gets a reply
How to determine the threshold? Quorum: a number of replicas. The primary waits for a quorum = a majority of servers (including itself) before considering an operation committed. Why? Discussion: can two primaries commit different operations with the same order? (Remember the majority voting from mutual exclusion?) No, because any two quorums intersect! What's great about it? If a backup is slow (or temporarily crashed), the primary can still commit as usual
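A small check of the quorum rule: with n servers a majority quorum has floor(n/2)+1 members, so any two quorums overlap in at least one server, which is why two primaries cannot commit different operations at the same position:

from itertools import combinations

def quorum_size(n: int) -> int:
    return n // 2 + 1             # a majority

n = 5
servers = range(n)
q = quorum_size(n)                # 3 out of 5
# Exhaustively check that any two majority quorums intersect.
for a in combinations(servers, q):
    for b in combinations(servers, q):
        assert set(a) & set(b), "two majorities always overlap"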
Key to correctness: Quorum p0: primary of v0, committed operation op0 with sequence number 1 on p0, p2, and p3. p1: primary of v1, committed operation op1 with sequence number 2 on p1, p3, and p4. Overlapped replica: p3 can ensure that op0 and op1 are not assigned the same order
Key to correctness: Quorum intersection If the nodes decide to move to a new view v, all the committed operations in previous views must be known to the primary of view v. How? View v is only active after v's primary has learned the state of a majority of nodes in earlier views
VR Protocol (a sketch) Per-server state: vid, the current view number; lvid, the last normal view number; status (NORMAL, VIEW-CHANGE, or RECOVERING); seq, the sequence number; commit-number; log
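A minimal sketch of this per-server state as a Python dataclass (field names mirror the slide; the protocol in the paper keeps a few more fields, such as a client table):

from dataclasses import dataclass, field
from enum import Enum

class Status(Enum):
    NORMAL = 1
    VIEW_CHANGE = 2
    RECOVERING = 3

@dataclass
class Replica:
    vid: int = 0                  # current view number
    lvid: int = 0                 # last view in which this replica was NORMAL
    status: Status = Status.NORMAL
    seq: int = 0                  # sequence number of the latest request
    commit_number: int = 0        # sequence number of the latest committed request
    log: list = field(default_factory=list)   # requests in sequence-number order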
Normal Case Operations
Normal Case Operations What’s the latency of committing a command? From the primary’s perspective From the client’s perspective
Normal Case Operations How does a backup learn a command’s commit status? Primary piggybacks “commit-number” in its Prepare message
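A hedged sketch of the normal case using the Replica fields above: the primary assigns the next sequence number, sends PREPARE (with commit-number piggybacked) to the backups, and commits once a majority, counting itself, has replied. The send() callback and the ack-counting dict are assumptions of this sketch, not the paper's API:

def on_client_request(primary, backups, request, send):
    primary.seq += 1
    primary.log.append((primary.vid, primary.seq, request))
    for b in backups:
        # Piggyback commit_number so backups learn which earlier ops are committed.
        send(b, ("PREPARE", primary.vid, primary.seq, request, primary.commit_number))

def on_prepare_ok(primary, acks, seq, n_servers):
    # acks[seq] counts votes for seq; the primary itself counts as the first vote.
    acks[seq] = acks.get(seq, 1) + 1
    if acks[seq] >= n_servers // 2 + 1 and seq == primary.commit_number + 1:
        primary.commit_number = seq           # committed: the op's order is now fixed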
When the primary fails Backups monitor the correctness of the primary. In the crash failure model, backups can use a failure detector (we won't cover it in this class); other methods are available… If the primary fails, the other replicas can start a view change to change the primary. Msg type: VIEW-CHANGE
View Change
What to include for the new view before normal operations? General rule: everything that has been committed in previous views should be included. Brief procedure: select the largest sequence number from the logs of the other replicas; if a majority of nodes have included a request m with sequence number s, include m with s in the new log; broadcast the newLog to all the replicas; replicas adopt the order directly. A sketch of this rule follows below
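A simplified sketch of the new-log rule listed above (the full view-change protocol in the paper carries extra information, such as view numbers, that this sketch omits):

def build_new_log(collected_logs, n_servers):
    # collected_logs: one dict per replica, mapping sequence number -> request.
    quorum = n_servers // 2 + 1
    assert len(collected_logs) >= quorum, "need logs from a majority of replicas"
    max_seq = max((max(log) for log in collected_logs if log), default=0)
    new_log = {}
    for s in range(1, max_seq + 1):
        candidates = [log[s] for log in collected_logs if s in log]
        for m in set(candidates):
            if candidates.count(m) >= quorum:
                new_log[s] = m                # m was committed, so it must survive
    return new_log

# Example with 3 servers: logs collected from a majority (2 of 3).
logs = [{1: "op0", 2: "op1"}, {1: "op0"}]
print(build_new_log(logs, 3))                 # {1: 'op0'}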
Choose NewLog Which order should we choose?
Choose NewLog
Other components of the protocol After recovering from a failure, how can a server catch up with the other replicas? Transfer the primary's log. How to avoid transferring (and replaying) the entire log? Checkpointing. A catch-up sketch follows below
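A rough sketch of catch-up from a checkpoint plus a log suffix; the checkpoint arguments and the apply() helper are illustrative assumptions, not taken from the slides:

def apply(state: dict, request):
    key, value = request                      # placeholder deterministic update
    state[key] = value

def catch_up(checkpoint_state: dict, checkpoint_seq: int, log):
    # Start from the latest checkpoint instead of replaying the whole log ...
    state = dict(checkpoint_state)
    # ... then replay only the log entries after the checkpoint.
    for seq, request in log:
        if seq > checkpoint_seq:
            apply(state, request)
    return state

# Example: the checkpoint already covers seq <= 2, so only seq 3 is replayed.
log = [(1, ("x", 1)), (2, ("y", 2)), (3, ("x", 7))]
print(catch_up({"x": 1, "y": 2}, 2, log))     # {'x': 1, 'y': 7}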
Primary-Backup Replication (Proof) Safety (all the correct replicas maintain the same state). Consider two client requests op1 and op2, which return histories H1 and H2. If op1 and op2 are processed by the same replica, safety is trivial, since a benign replica processes operations sequentially (H1 is a prefix of H2 in this case). Suppose op1 and op2 are processed by different replicas p1 and p2. If op1 < op2, then H1 is a prefix of H2, because p1 only updates its state after it receives acknowledgements from the backups when p1 is the primary (the other replicas must have updated their states to include op1). When p2 processes op2, it must have already acknowledged op1; therefore, H1 is a prefix of H2. If op2 < op1, the proof is symmetric.
Primary-Backup Replication (Proof) Liveness (Client eventually gets a reply) The client eventually contacts a correct replica (Since faulty replicas are eventually detected) The correct replica will process the request and respond to the client
Primary-Backup Replication Observation Only the primary actively participates in the agreement. Every replica only needs to update its state (it doesn't necessarily have to execute the requests), so if a request is compute-intensive, backups don't need the computational resources. However, the primary has to send the entire updated state to the backups (which consumes large network bandwidth if the state is large). If the primary fails, we need to detect it immediately and recover (which may incur large latency), so the service may be unavailable for a period of time. The client could contact a backup instead to use the service
Discussion If the primary fails, we need to detect it immediately and recover (which may incur large latency). The client could contact a backup instead to use the service. What are the potential issues?
Reading List (Optional) Brian M. Oki and Barbara H. Liskov. Viewstamped Replication: A new primary copy method to support highly-available distributed systems. PODC 1988. James Cowling and Barbara H. Liskov. Viewstamped replication revisited. MIT, 2012.