IS 651: Distributed Systems Fault Tolerance


IS 651: Distributed Systems Fault Tolerance
Sisi Duan, Assistant Professor, Information Systems
sduan@umbc.edu

What we have learnt so far
- Distributed communication
- Time synchronization
- Mutual exclusion
- Consistency
- Web: de-coupling, separation of roles
- Distributed file systems: de-coupling, replication, fault tolerance

Roadmap of the rest of the class
- Primary-backup replication
- Consensus
- Crash fault tolerance
- Byzantine fault tolerance
- Distributed databases
- Google Chubby lock service
- Blockchains
- Distributed computing systems

Announcements
- Project progress report: due next week (Oct 31)
- Homework 4: due in two weeks (Nov 7)

Project Progress Report
Please submit:
- What you have achieved so far. You should have made some progress and discuss it in detail (e.g., a review of a few papers).
- A list of all the related work you plan to include in the final report.

Today
- Fault tolerance via replication
- The primary-backup approach
- Viewstamped Replication

Fault Tolerance via Replication
- To tolerate one failure, we must replicate data in more than one place
- This is particularly important at scale
- Suppose a typical server crashes once a month. How often does some server crash in a 10,000-server cluster?
- 30 * 24 * 60 / 10,000 = 4.3, i.e., on average some server crashes roughly every 4.3 minutes
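A quick back-of-the-envelope check of that arithmetic (a sketch only; the one-crash-per-month rate is the slide's assumption):

```python
# Expected time between crashes somewhere in the cluster, assuming each of the
# 10,000 servers independently crashes about once a month.
minutes_per_month = 30 * 24 * 60          # 43,200 minutes
servers = 10_000

print(minutes_per_month / servers)        # 4.32 -> roughly one crash every 4.3 minutes
```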

Consistency: Correctness
- How do we replicate data "correctly"?
- The replicas should be indistinguishable from a single object; linearizability is the ideal
- One-copy semantics: copies of the same data should (eventually) be the same
- Consistency: the replicated system should behave just like an un-replicated system

Challenges
- Concurrency
- Machine failures
- Network failures (the network is unreliable)
- The tricky part: is a server slow, or has it failed?

A Motivating Example
- Replication on two servers
- Multiple client requests (possibly concurrent)

Failures under concurrency! The two servers see different results

What we need…
- To keep replicas in sync, writes must be done in the same order
- Discussion: which consistency model works?
- The candidate consistency models: linearizability, sequential consistency, causal consistency, eventual consistency

Replication
- Active replication (in the following few weeks)
  - State Machine Replication (SMR): every replica maintains a state machine
  - Sometimes called Replicated State Machine (RSM)
  - Given an input, the output state is deterministic
- Passive replication (today)
  - Primary-backup replication
- The Google File System has 3-way replicated chunks. What type of replication is that?
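To make the determinism point concrete, here is a minimal sketch (my illustration, not code from the course; the KVStateMachine name is made up) of the state machine each replica could run: as long as every replica applies the same operations in the same order, all replicas reach the same state.

```python
# Minimal sketch of a deterministic state machine for SMR (hypothetical example).
class KVStateMachine:
    def __init__(self):
        self.store = {}

    def apply(self, op):
        # op is ("put", key, value) or ("get", key).
        # No randomness, clocks, or other non-determinism is allowed here.
        if op[0] == "put":
            _, key, value = op
            self.store[key] = value
            return "ok"
        if op[0] == "get":
            return self.store.get(op[1])
        raise ValueError(f"unknown op {op!r}")

# Two replicas applying the same log in the same order end up identical.
log = [("put", "x", 1), ("put", "y", 2), ("put", "x", 3)]
r1, r2 = KVStateMachine(), KVStateMachine()
for op in log:
    r1.apply(op)
    r2.apply(op)
assert r1.store == r2.store == {"x": 3, "y": 2}
```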

Passive Replication
- Primary-backup replication
- One replica is the primary
  - It decides whether to process a client request and assigns it an order
  - It processes the request
  - It sends the order to the other replicas
- The other replicas are the backups
  - They update their states after receiving the order from the primary
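The sketch below (my own illustration under simplified assumptions: synchronous in-memory "sends", no failures; the class names are made up) shows this flow: the primary assigns sequence numbers and pushes ordered updates, and the backups apply them in that order.

```python
# Hedged sketch of the primary-backup normal case; illustrative only.
class Backup:
    def __init__(self):
        self.state = {}
        self.next_seq = 1

    def receive(self, seq, op):
        # Apply updates strictly in the order chosen by the primary.
        assert seq == self.next_seq, "out-of-order update"
        key, value = op
        self.state[key] = value
        self.next_seq += 1
        return "ack"

class Primary:
    def __init__(self, backups):
        self.state = {}
        self.backups = backups
        self.seq = 0

    def handle_client_request(self, op):
        # 1. Assign an order, 2. process locally, 3. send the ordered update to backups.
        self.seq += 1
        key, value = op
        self.state[key] = value
        acks = [b.receive(self.seq, op) for b in self.backups]
        return "done" if all(a == "ack" for a in acks) else "retry"

backups = [Backup(), Backup()]
primary = Primary(backups)
primary.handle_client_request(("x", 42))
assert all(b.state == primary.state for b in backups)
```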

Primary-Backup Replication
- Server 1 is the primary
- All the clients know who the current primary is

Passive Replication: Primary-Backup
- We consider benign failures for now
- Fail-stop model
  - A replica follows its specification until it crashes (becomes faulty)
  - A faulty replica does not perform any action (it does not recover)
  - A crash is eventually detected by every correct processor
  - No replica is suspected of having crashed until after it actually crashes

How to deal with failure? #1: Backup Failure
- Discussion: how can we guarantee that a backup gets the order? The network is unreliable!

How to deal with failure? #1: Backup Failure
- Acknowledgments
- What if a backup times out in acknowledging?

How to deal with failure? #1: Backup Failure
- What if a backup times out in acknowledging?
  - The primary retries (if the backup failure is transient)
- What if the failure is "permanent"?
  - Is it safe to ignore the failure?
  - How can the backup catch up later?
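One way to picture the retry idea (a sketch under my own assumptions; the send callable and the backoff values are invented for illustration, not part of the course's protocol):

```python
import random
import time

def replicate_with_retry(send, seq, op, max_attempts=5):
    """Keep resending an update until the backup acks or we give up.

    `send` stands in for the real network send and returns True when the
    backup acknowledged (an assumption of this sketch).
    """
    for attempt in range(max_attempts):
        if send(seq, op):
            return True                      # backup acknowledged
        time.sleep(0.1 * (attempt + 1))      # back off between retries (transient failures)
    return False                             # looks permanent: backup must catch up later

# Simulated flaky backup that drops roughly half of all messages.
flaky_send = lambda seq, op: random.random() > 0.5
print(replicate_with_retry(flaky_send, 1, ("x", 42)))
```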

How to deal with failure? #2: Primary Failure
- Discussion: what if the primary fails?

How to deal with failure? #2: Primary Failure
- Discussion: what if the primary fails? Switch to another one!
- In GFS: how is this done?
- What are the issues if we want to do it in a distributed way?

How to deal with failure? #2: Primary Failure
- Discussion: what if the primary fails? Switch to another one!
- What are the issues?
  - Could there accidentally be two "valid" primaries?
  - If an operation is done before the switch, how do we ensure it is not "lost" after the switch?
  - What does a recovered server need to re-integrate?

Remember the example in our 1st class…

The History of Failure Handling
- For a long time, people did it manually (with no guaranteed correctness)
  - One primary, one backup
  - The primary ignores temporary replication failure of a backup
  - If the primary crashes, a human operator re-configures the system to use the former backup as the new primary
  - Some operations done by the old primary might be "lost" at the new primary
- This is still true in a lot of systems
  - A consistency checker is run at the end of every day to find and fix inconsistencies (according to some rules)

Viewstamped Replication
- The first work to handle these failures correctly
- Original paper: 1988
- "Viewstamped Replication Revisited": 2012
- Barbara Liskov, Turing Award 2008
- We will discuss her other famous algorithms in a few weeks!

Viewstamped Replication (VR) Overview
- Static configuration of servers, e.g., p0, p1, p2
- To handle primary failure, VR moves through a sequence of "views" 0, 1, 2, 3, …
- In each view, the primary is determined by the view number (given a view number, we know which server is the primary)
- Simple solution: primary id = view number % total number of servers
  - View 0 -> p0, view 1 -> p1, view 2 -> p2, …
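The primary-selection rule in code (a one-line sketch; the server names are just the slide's example):

```python
# Deterministic, round-robin primary selection from the view number.
servers = ["p0", "p1", "p2"]

def primary_of(view: int) -> str:
    return servers[view % len(servers)]

assert [primary_of(v) for v in range(5)] == ["p0", "p1", "p2", "p0", "p1"]
```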

VR conditions
- An operation is "committed" once it is replicated by a threshold number of servers
- Once committed, an operation's order is fixed
- Each request is assigned a sequence number
- Why is this correct? Because no two different operations are committed with the same order (sequence number)!

Correctness
- Safety: all the correct replicas maintain the same state
- Liveness: the client eventually gets a reply

How to determine the threshold? Quorum
- A quorum is a set of replicas
- The primary waits for a quorum = a majority of the servers (including itself) before considering an operation committed
- Why? Discussion: can two primaries commit different operations with the same order? (Remember majority voting from mutual exclusion?)
  - No, because any two majority quorums overlap!
- What's great about it? If a backup is slow (or temporarily crashed), the primary can still commit as usual
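A tiny sketch of the commit threshold (illustrative only; the real protocol's message handling is omitted): with n servers, the primary needs acknowledgments from a majority, counting itself.

```python
def majority(n: int) -> int:
    """Smallest quorum size that guarantees any two quorums intersect."""
    return n // 2 + 1

def is_committed(ack_count: int, n: int) -> bool:
    # ack_count includes the primary's own "ack" for the operation.
    return ack_count >= majority(n)

assert majority(3) == 2 and majority(5) == 3
assert is_committed(3, 5) and not is_committed(2, 5)
```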

Key to correctness: Quorum
- p0, the primary of view 0, committed operation op0 with sequence number 1 on p0, p2, and p3
- p1, the primary of view 1, committed operation op1 with sequence number 2 on p1, p3, and p4
- The overlapping replica, p3, can ensure that op0 and op1 are not committed with the same order
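A quick check of the intersection property behind this example (a standalone sketch, not part of the protocol): with five servers, any two majority quorums share at least one replica.

```python
from itertools import combinations

servers = {"p0", "p1", "p2", "p3", "p4"}
q = len(servers) // 2 + 1        # majority quorum size: 3 out of 5

# Every pair of majority quorums overlaps in at least one server.
assert all(set(a) & set(b)
           for a in combinations(servers, q)
           for b in combinations(servers, q))
```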

Key to correctness: Quorum intersection
- If the nodes decide to move to a new view v, all the operations committed in previous views must be known to the primary of view v
- How? View v becomes active only after v's primary has learned the state of a majority of the nodes from earlier views

VR Protocol (a sketch)
Per-server state:
- vid: current view number
- lvid: last normal view number
- status: NORMAL, VIEW-CHANGE, or RECOVERING
- seq: sequence number
- commit-number
- log
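Written out as code, that per-server state might look like this (a sketch; the field names follow the slide, while the types and defaults are my assumptions):

```python
from dataclasses import dataclass, field
from enum import Enum, auto

class Status(Enum):
    NORMAL = auto()
    VIEW_CHANGE = auto()
    RECOVERING = auto()

@dataclass
class VRServerState:
    vid: int = 0                              # current view number
    lvid: int = 0                             # last view in which this server was NORMAL
    status: Status = Status.NORMAL
    seq: int = 0                              # sequence number of the latest assigned op
    commit_number: int = 0                    # sequence number of the latest committed op
    log: list = field(default_factory=list)   # ordered (seq, op) entries
```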

Normal Case Operations

Normal Case Operations What’s the latency of committing a command? From the primary’s perspective From the client’s perspective

Normal Case Operations
- How does a backup learn a command's commit status?
- The primary piggybacks its commit-number on its Prepare messages
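Below is a compact sketch of the normal case as described in the lecture (a single view with no failures; the message names loosely follow VR, while the in-memory "network" and class names are my simplifications): the primary sends PREPARE carrying the new operation and its current commit-number, the backups append in order and reply PREPARE-OK, and the primary commits once it has a majority.

```python
# Hedged sketch of VR's normal case (one view, no failures); illustrative only.
class Replica:
    def __init__(self, rid, n):
        self.rid, self.n = rid, n
        self.log = []             # list of (seq, op)
        self.commit_number = 0    # highest committed sequence number this replica knows

    def on_prepare(self, view, seq, op, primary_commit):
        # Backup: append in order and learn commits piggybacked on PREPARE.
        assert seq == len(self.log) + 1, "gap in the log"
        self.log.append((seq, op))
        self.commit_number = max(self.commit_number, primary_commit)
        return ("PREPARE-OK", self.rid, seq)

class PrimaryReplica(Replica):
    def __init__(self, rid, n, backups):
        super().__init__(rid, n)
        self.backups = backups

    def on_request(self, op):
        seq = len(self.log) + 1
        self.log.append((seq, op))
        oks = 1                                   # the primary counts itself
        for b in self.backups:                    # send PREPARE with piggybacked commit-number
            _, _, acked = b.on_prepare(view=0, seq=seq, op=op,
                                       primary_commit=self.commit_number)
            oks += acked == seq
        if oks >= self.n // 2 + 1:                # majority quorum reached
            self.commit_number = seq              # committed; reply to the client
        return self.commit_number >= seq

backups = [Replica(rid, n=3) for rid in (1, 2)]
primary = PrimaryReplica(0, n=3, backups=backups)
assert primary.on_request(("put", "x", 1))
```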

When the primary fails
- The backups monitor the correctness of the primary
  - In the crash failure model, the backups can use a failure detector (we won't cover it in this class)
  - Other methods are available…
- If the primary fails, the other replicas can start a view change to replace the primary
- Message type: VIEW-CHANGE

View Change

What to include for the new view before normal operations?
- General rule: everything that was committed in previous views must be included
- Brief procedure (see the sketch after this slide)
  - Select the largest sequence number from the logs of the other replicas
  - If a majority of the nodes have included a request m with sequence number s, include m with s in the new log
  - Broadcast the new log to all the replicas
  - The replicas adopt the order directly
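Here is a rough rendering of that selection step (my own simplified sketch of the rule above, not the full VR view-change protocol; the data layout is invented): keep every (sequence number, request) pair that appears in a majority of the reported logs, ordered by sequence number.

```python
from collections import Counter

def build_new_log(reported_logs, n):
    """Simplified sketch: keep entries reported by a majority of the n replicas."""
    quorum = n // 2 + 1
    counts = Counter(entry for log in reported_logs for entry in set(log))
    return sorted(entry for entry, c in counts.items() if c >= quorum)

# Example with n = 3: the new primary gathered logs from p0 and p2.
logs = [
    [(1, "op0"), (2, "op1")],   # p0
    [(1, "op0"), (2, "op1")],   # p2
]
print(build_new_log(logs, n=3))   # [(1, 'op0'), (2, 'op1')]
```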

Choose NewLog Which order should we choose?

Choose NewLog

Other components of the protocol
- After recovering from a failure, how can a server catch up with the other replicas?
  - Transfer the primary's log
  - How? Checkpointing (a sketch follows)
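One simple way to realize the catch-up idea (a sketch under my own assumptions; real VR combines periodic checkpoints with log suffixes, and every name here is invented): the recovering server installs the latest checkpoint and then replays the log entries after it.

```python
class Node:
    def __init__(self):
        self.state = {}
        self.commit_number = 0

def catch_up(recovering, checkpoint_state, checkpoint_seq, primary_log):
    """Install a checkpoint, then replay the log suffix after it (illustrative sketch)."""
    recovering.state = dict(checkpoint_state)        # state as of checkpoint_seq
    recovering.commit_number = checkpoint_seq
    for seq, (key, value) in primary_log:
        if seq > checkpoint_seq:                     # only entries after the checkpoint
            recovering.state[key] = value
            recovering.commit_number = seq
    return recovering

node = catch_up(Node(), {"x": 1}, checkpoint_seq=5,
                primary_log=[(5, ("x", 1)), (6, ("y", 2)), (7, ("x", 3))])
print(node.state, node.commit_number)   # {'x': 3, 'y': 2} 7
```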

Primary-Backup Replication (Proof)
- Safety: all the correct replicas maintain the same state
- Consider two client requests op1 and op2, which produce histories H1 and H2
  - If op1 and op2 are processed by the same replica, safety is trivial, since a benign replica processes operations sequentially (H1 is a prefix of H2 in this case)
  - If op1 and op2 are processed by different replicas p1 and p2: if op1 < op2, then H1 is a prefix of H2, because p1 only updates its state after it receives acknowledgments from the backups while it is the primary (so the other replicas must have updated their states to include op1). When p2 processes op2, it must already have acknowledged op1; therefore H1 is a prefix of H2. If op2 < op1, the argument is symmetric.

Primary-Backup Replication (Proof)
- Liveness: the client eventually gets a reply
  - The client eventually contacts a correct replica (since faulty replicas are eventually detected)
  - The correct replica will process the request and respond to the client

Primary-Backup Replication: Observations
- Only the primary actively participates in the agreement
  - Every backup only needs to update its state (it does not necessarily have to execute the requests)
  - If a request is compute-intensive, the backups do not need computational resources
  - But the primary has to send the entire updated state to the backups (which consumes a lot of network bandwidth if the state is large)
- If the primary fails, we need to detect the failure immediately and recover from it (which may incur a large latency)
  - The service may be unavailable for a period of time
  - A client could contact a backup instead to keep using the service

Discussion
- If the primary fails, we need to detect the failure immediately and recover from it (which may incur a large latency)
- A client could contact a backup instead to keep using the service
- What are the potential issues?

Reading List (Optional)
- Brian M. Oki and Barbara H. Liskov. Viewstamped Replication: A New Primary Copy Method to Support Highly-Available Distributed Systems. PODC 1988.
- James Cowling and Barbara H. Liskov. Viewstamped Replication Revisited. MIT Technical Report, 2012.