IS 651: Distributed Systems Fault Tolerance

1 IS 651: Distributed Systems Fault Tolerance
Sisi Duan, Assistant Professor, Information Systems

2 What we have learnt so far
Distributed communication
Time synchronization
Mutual exclusion
Consistency
Web: de-coupling, separation of roles
Distributed file systems: de-coupling, replication, fault tolerance

3 Roadmap of the rest of the class
Primary-backup replication
Consensus
Crash fault tolerance
Byzantine fault tolerance
Distributed databases
Google Chubby lock service
Blockchains
Distributed computing systems

4 Announcement
Homework 4: due next week (Oct 31)
Project Progress Report: due in two weeks (Nov 7)

5 Project Progress Report
Please submit:
What you have achieved so far. You should have made some progress and discuss it in detail (e.g., a review of a few papers).
A list of all the related work you plan to include in the final report.

6 Today
Fault tolerance via replication
Primary-backup approach
Viewstamped Replication

7 Fault Tolerance via Replication
To tolerate one failure, data must be replicated in more than one place
Particularly important at scale
Suppose a typical server crashes once a month
How often does some server crash in a 10,000-server cluster?
30 * 24 * 60 / 10,000 ≈ every 4.3 minutes (see the check below)
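A quick back-of-the-envelope check of that number (Python; the once-a-month crash rate is the slide's assumption):

```python
# Back-of-the-envelope: with 10,000 servers and one crash per server per month,
# how often does *some* server in the cluster crash?
minutes_per_month = 30 * 24 * 60      # 43,200 minutes
servers = 10_000
print(minutes_per_month / servers)    # 4.32 -> roughly one crash every 4.3 minutes
```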

8 Consistency: Correctness
How to replicate data "correctly"?
Replicas are indistinguishable from a single object; linearizability is ideal
One-copy semantics: copies of the same data should (eventually) be the same
Consistency: the replicated system should behave just like an un-replicated system

9 Challenges
Concurrency
Machine failures
Network failures (the network is unreliable)
Tricky: is a server slow, or has it failed?

10 A Motivating Example Replication on two servers
Multiple client requests (might be concurrent)

11 Failures under concurrency!
The two servers see different results

12 What we need…
To keep replicas in sync, writes must be applied in the same order
Discussion: which consistency model works?
The consistency models: linearizability, sequential consistency, causal consistency, eventual consistency

13 Replication
Active Replication (in the following few weeks)
State Machine Replication (SMR), sometimes called Replicated State Machine (RSM)
Every replica maintains a state machine; given an input, the resulting state is deterministic (see the sketch below)
Passive Replication (today)
Primary-Backup Replication
Google File System has 3-way replicated chunks. What type of replication is that?
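A minimal sketch of the state-machine idea, assuming a toy key-value command set (hypothetical, not from the slides): replicas that start identical and apply the same commands in the same order end in the same state.

```python
# Sketch: a deterministic state machine; replicas applying the same ordered log agree.
class KVStateMachine:
    def __init__(self):
        self.state = {}

    def apply(self, op):
        # op is ("put", key, value) or ("get", key); the result depends only on
        # the current state and the input, which is what makes replication work.
        kind, *args = op
        if kind == "put":
            key, value = args
            self.state[key] = value
            return "ok"
        if kind == "get":
            (key,) = args
            return self.state.get(key)
        raise ValueError(f"unknown op {kind!r}")

log = [("put", "x", 1), ("put", "y", 2), ("put", "x", 3)]
replicas = [KVStateMachine() for _ in range(3)]
for r in replicas:
    for op in log:                    # same commands, same order, on every replica
        r.apply(op)
assert all(r.state == replicas[0].state for r in replicas)
```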

14 Passive Replication
Primary-backup replication (a minimal sketch follows)
One replica is the primary
It decides whether to process a client request and assigns it an order
It processes the request
It sends the ordered update to the other replicas
The other replicas are backups
They update their state after receiving the update from the primary
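A minimal sketch of that flow, assuming an in-memory "network" and hypothetical class names; real systems add acknowledgments and timeouts, discussed next.

```python
# Sketch: the primary assigns the order and pushes each update to the backups.
class Replica:
    def __init__(self):
        self.log = []                 # (seq, request) pairs, in primary order
        self.state = {}

    def apply(self, seq, request):
        self.log.append((seq, request))
        key, value = request          # toy requests: (key, value) writes
        self.state[key] = value

class Primary(Replica):
    def __init__(self, backups):
        super().__init__()
        self.backups = backups
        self.next_seq = 0

    def handle_client_request(self, request):
        seq = self.next_seq           # only the primary assigns sequence numbers
        self.next_seq += 1
        self.apply(seq, request)
        for b in self.backups:        # backups simply adopt the primary's order
            b.apply(seq, request)
        return "ok"

backups = [Replica(), Replica()]
primary = Primary(backups)
primary.handle_client_request(("x", 1))
assert all(b.state == primary.state for b in backups)
```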

15 Primary-Backup Replication
Server 1 is the primary; all the clients know who the current primary is

16 Passive Replication: Primary-Backup
We consider benign failures for now
Fail-stop model:
A replica follows its specification until it crashes (becomes faulty)
A faulty replica does not perform any action (does not recover)
A crash is eventually detected by every correct processor
No replica is suspected of having crashed until after it actually crashes

17 How to deal with failure? #1: Backup Failure
Discussion: How can we guarantee the backup gets the order? The network is unreliable!

18 How to deal with failure? #1: Backup Failure
Use acknowledgments. What if a backup times out without acknowledging?

19 How to deal with failure? #1: Backup Failure
What if a backup times out without acknowledging?
The primary retries (works if the backup failure is transient; a retry sketch follows)
What if the failure is "permanent"? Is it safe to ignore the failure?
How do we make a backup catch up?
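A minimal sketch of the retry loop, assuming a hypothetical send_to_backup callback; the "permanent failure" and catch-up questions above are exactly what the sketch leaves open.

```python
# Sketch: the primary retries a backup that misses its acknowledgment deadline.
def replicate_with_retry(send_to_backup, seq, request, retries=3, timeout_s=1.0):
    """send_to_backup(seq, request, timeout_s) -> True on ack, False on timeout."""
    for _ in range(retries):
        if send_to_backup(seq, request, timeout_s):
            return True               # acknowledged: a transient failure is absorbed by retrying
    # Repeated timeouts look like a "permanent" failure: the primary must decide
    # whether it is safe to ignore this backup and how the backup will later catch up.
    return False
```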

20 How to deal with failure? #2: Primary Failure
Discussion: What if the primary fails?

21 How to deal with failure? #2: Primary Failure
Discussion: what if the primary fails? Switch to another one!
In GFS: how is this done?
What are the issues if we want to do it in a distributed way?

22 How to deal with failure? #2: Primary Failure
Discussion: what if the primary fails? Switch to another one!
What are the issues?
Could there accidentally be two "valid" primaries?
If an operation completed before the switch, how do we ensure it is not "lost" after the switch?
What does a recovered server need to re-integrate?

23 Remember the example in our 1st class…

24 The History of Failure Handling
For a long time, people did it manually (with no guaranteed correctness)
One primary, one backup; the primary ignores temporary replication failures of a backup
If the primary crashes, a human operator re-configures the system to use the former backup as the new primary
Some operations done by the old primary might be "lost" at the new primary
This is still true in a lot of systems: a consistency checker is run at the end of every day and fixes inconsistencies (according to some rules)

25 Viewstamped Replication
The first work to handle these failures correctly
Original paper, 1988; Viewstamped Replication Revisited, 2012
Barbara Liskov, Turing Award 2008
We will discuss her other famous algorithms in a few weeks!

26 Viewstamped Replication (VR) Overview
Static configuration of servers, e.g., p0, p1, p2
To handle primary failure, VR moves through a sequence of "views" 0, 1, 2, 3, …
In each view, the primary id is deterministic (given a view number, we know which server is the primary)
Simple solution: primary id = view id % total servers, so 0 -> p0, 1 -> p1, 2 -> p2, …
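The deterministic view-to-primary mapping in a few lines (a sketch of the rule just stated):

```python
# Sketch: every replica computes the same primary from the view number alone.
servers = ["p0", "p1", "p2"]

def primary_of(view: int) -> str:
    return servers[view % len(servers)]

assert primary_of(0) == "p0" and primary_of(1) == "p1" and primary_of(4) == "p1"
```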

27 VR conditions
An operation is "committed" if it is replicated by a threshold number of servers
Once committed, an operation's order is fixed
Each request is assigned a sequence number
Why is this correct? No two different operations are committed with the same order!

28 Correctness
Safety: all the correct replicas maintain the same state
Liveness: the client eventually gets a reply

29 How to determine the threshold?
A quorum is a set of replicas
The primary waits for a quorum = a majority of the servers (including itself) before considering an operation committed
Why? Discussion: can two primaries commit different operations with the same order? (Remember majority voting from mutual exclusion?)
No, because any two majority quorums intersect! (A small sketch follows.)
What's great about it? If a backup is slow (or temporarily crashed), the primary can still commit as usual
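A small sketch of the majority-quorum rule and of why two quorums always overlap (helper names are hypothetical); the sets reuse the example on the next slide.

```python
# Sketch: majority quorums out of n replicas always share at least one replica.
def quorum_size(n: int) -> int:
    return n // 2 + 1                 # a strict majority, primary included

def quorums_intersect(q1: set, q2: set) -> bool:
    return len(q1 & q2) > 0

n = 5
assert quorum_size(n) == 3
# Two different primaries each need a majority, and any two majorities intersect,
# so they cannot commit different operations with the same sequence number.
assert quorums_intersect({"p0", "p2", "p3"}, {"p1", "p3", "p4"})
```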

30 Key to correctness: Quorum
p0: primary of v0, committed operation op0 with sequence number 1 on p0, p2, and p3
p1: primary of v1, committed operation op1 with sequence number 2 on p1, p3, and p4
Overlapping replica: p3, which can ensure that op0 and op1 are not replicated with the same order

31 Key to correctness: Quorum intersection
If the nodes decide to move to a new view v, all the operations committed in previous views must be known to the primary of view v
How? View v only becomes active after v's primary has learned the state of a majority of nodes from earlier views

32 VR Protocol (a sketch)
Each server maintains:
vid, the current view number
lvid, the last normal view number
status (NORMAL, VIEW-CHANGE, or RECOVERING)
seq, the sequence number
commit-number
log
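A minimal sketch of that per-server state (the field names follow the slide; everything else is an assumption):

```python
from dataclasses import dataclass, field
from enum import Enum

class Status(Enum):
    NORMAL = "normal"
    VIEW_CHANGE = "view-change"
    RECOVERING = "recovering"

# Sketch: the bookkeeping each VR server keeps, as listed above.
@dataclass
class ServerState:
    vid: int = 0                                 # current view number
    lvid: int = 0                                # last view in which status was NORMAL
    status: Status = Status.NORMAL
    seq: int = 0                                 # latest sequence number assigned/seen
    commit_number: int = 0                       # highest sequence number known committed
    log: list = field(default_factory=list)      # (seq, request) entries
```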

33 Normal Case Operations

34 Normal Case Operations
What's the latency of committing a command?
From the primary's perspective
From the client's perspective

35 Normal Case Operations
How does a backup learn a command's commit status?
The primary piggybacks the "commit-number" in its Prepare message (a sketch follows)
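A sketch of that piggybacking, assuming a dict-shaped message and the server state sketched earlier (both hypothetical):

```python
# Sketch: the Prepare for the next request also tells backups how far the log is committed.
def make_prepare(view, seq, request, commit_number):
    return {"type": "PREPARE", "view": view, "seq": seq,
            "request": request, "commit_number": commit_number}

def backup_on_prepare(server, msg):
    server.log.append((msg["seq"], msg["request"]))
    # Everything up to the piggybacked commit-number is now known to be committed.
    server.commit_number = max(server.commit_number, msg["commit_number"])
```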

36 When the primary fails
Backups monitor the correctness of the primary
In the crash failure model, backups can use a failure detector (not covered in this class); other methods are available…
If the primary fails, the other replicas can start a view change to replace the primary
Msg type: VIEW-CHANGE

37 View Change

38 What to include for the new view before normal operations?
General rule: everything that has been committed in previous views should be included
Brief procedure (a sketch follows):
Select the largest sequence number from the logs of the other replicas
If a majority of nodes have included a request m with sequence number s, include m with s in the new log
Broadcast the new log to all the replicas
Replicas adopt the order directly
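A sketch of the majority-inclusion rule stated above (helper names are hypothetical): the new primary builds the new log from the logs collected from a majority of replicas.

```python
from collections import Counter

# Sketch: include request m at sequence number s if a majority of the collected
# logs already contain (s, m); the result is broadcast as the new log.
def build_new_log(collected_logs, n_servers):
    majority = n_servers // 2 + 1
    counts = Counter()
    for log in collected_logs:                   # one log per responding replica
        for seq, request in log:
            counts[(seq, request)] += 1
    entries = [(seq, req) for (seq, req), c in counts.items() if c >= majority]
    return sorted(entries)                       # ordered by sequence number

logs = [[(1, "op0"), (2, "op1")], [(1, "op0")], [(1, "op0"), (2, "op1")]]
assert build_new_log(logs, 3) == [(1, "op0"), (2, "op1")]
```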

39 Choose NewLog Which order should we choose?

40 Choose NewLog

41 Other components of the protocol
After recovering from a failure, how can a server catch up with the other replicas?
Transfer the primary's log
How? Checkpointing (a sketch follows)
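A minimal sketch of catching up from a checkpoint plus the log suffix (the structure here is an assumption, reusing the toy key-value requests from the earlier sketches):

```python
# Sketch: a recovering server installs the latest checkpoint and then replays only
# the primary's log entries that come after it, instead of the whole history.
def catch_up(recovering, checkpoint, primary_log):
    recovering.state = dict(checkpoint["state"])      # install checkpointed state
    recovering.commit_number = checkpoint["seq"]
    for seq, request in primary_log:
        if seq > checkpoint["seq"]:                   # replay only the suffix
            key, value = request
            recovering.state[key] = value
            recovering.commit_number = seq
```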

42 Primary-Backup Replication (Proof)
Safety (all the correct replicas maintain the same state)
Consider two client requests op1 and op2, which produce histories H1 and H2
If op1 and op2 are processed by the same replica, safety is trivial, since a benign replica processes operations sequentially (H1 is a prefix of H2 in this case)
If op1 and op2 are processed by different replicas p1 and p2:
If op1 < op2, then H1 is a prefix of H2, because p1 only updates its state after it receives acknowledgements from the backups while p1 is the primary (the other replicas must therefore have updated their states to include op1). When p2 processes op2, it must already have acknowledged op1, so H1 is a prefix of H2.
If op2 < op1, the proof is the same.

43 Primary-Backup Replication (Proof)
Liveness (the client eventually gets a reply)
The client eventually contacts a correct replica (since faulty replicas are eventually detected)
The correct replica will process the request and respond to the client

44 Primary-Backup Replication
Observations:
Only the primary actively participates in the agreement
Each backup only needs to update its state (it does not necessarily have to execute the requests), so if requests are compute-intensive, backups do not need much computational power
The primary has to send the entire updated state to the backups (consuming a lot of network bandwidth if the state is large)
If the primary fails, we need to detect it and recover immediately (which may incur a large latency)
The service may be unavailable for a period of time
A client can contact a backup instead to use the service

45 Discussion
If the primary fails, we need to detect it and recover immediately (which may incur a large latency)
A client can contact a backup instead to use the service
What are the potential issues?

46 Reading List (Optional)
Brian M. Oki and Barbara H. Liskov. Viewstamped Replication: A New Primary Copy Method to Support Highly-Available Distributed Systems. PODC 1988.
James Cowling and Barbara H. Liskov. Viewstamped Replication Revisited. MIT Technical Report, 2012.

