IS 651: Distributed Systems Fault Tolerance Sisi Duan Assistant Professor Information Systems sduan@umbc.edu
What we have learned so far: distributed communication; time synchronization; mutual exclusion; consistency; the Web (de-coupling, separation of roles); distributed file systems (de-coupling, replication, fault tolerance)
Roadmap of the rest of the class: primary-backup replication; consensus (crash fault tolerance, Byzantine fault tolerance); distributed databases; Google Chubby lock service; blockchains; distributed computing systems
Announcement Project Progress Report due next week (Oct 31); Homework 4 due in two weeks (Nov 7)
Project Progress Report Please submit what you have achieved so far. You should have made some progress and discuss it in detail (e.g., a review of a few papers). List all the related work you plan to include in the final report
Today Fault tolerance via replication: the primary-backup approach and Viewstamped Replication
Fault Tolerance via Replication To tolerate one failure, one must replicate data in more than one place. Particularly important at scale: suppose a typical server crashes once a month; how often does some server crash in a 10,000-server cluster? 30*24*60/10,000 ≈ 4.3 minutes
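A quick back-of-the-envelope check of the number above (a hedged sketch in Python; it assumes independent crashes and a 30-day month):

# Expected time between crashes somewhere in a large cluster,
# assuming each server crashes independently about once per 30-day month.
minutes_per_month = 30 * 24 * 60              # 43,200 minutes
servers = 10_000
print(minutes_per_month / servers)            # 4.32 -> some server crashes roughly every 4.3 minutes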
Consistency: Correctness How to replicate data "correctly"? Replicas should be indistinguishable from a single object; linearizability is the ideal. One-copy semantics: copies of the same data should (eventually) be the same. Consistency: the replicated system should "behave" just like an un-replicated system
Challenges Concurrency, machine failures, and network failures (the network is unreliable). The tricky part: is an unresponsive server slow, or has it failed?
A Motivating Example Replication on two servers Multiple client requests (might be concurrent)
Failures under concurrency! The two servers see different results
What we need… To keep replicas in sync, writes must be done in the same order Discussion: Which consistency model works? The consistency models: linearizability, sequential consistency, causal consistency, eventual consistency
Replication Active Replication (in the following few weeks): State Machine Replication (SMR), sometimes called Replicated State Machine (RSM). Every replica maintains a state machine; given the same input, the next state is deterministic (a minimal sketch follows below). Passive Replication (today): Primary-Backup Replication. The Google File System has 3-way replicated chunks; what type of replication is that?
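A minimal sketch of the determinism requirement in SMR, using an illustrative key/value state machine (the class and operation names are mine, not from the lecture):

# Replicas that apply the same operations in the same order end up in the
# same state, because each operation is deterministic.
class KVStateMachine:
    def __init__(self):
        self.store = {}

    def apply(self, op):
        kind, key, *rest = op
        if kind == "put":
            self.store[key] = rest[0]
            return "ok"
        if kind == "get":
            return self.store.get(key)

log = [("put", "x", 1), ("put", "y", 2), ("get", "x")]   # agreed order of operations
replica_a, replica_b = KVStateMachine(), KVStateMachine()
for op in log:
    replica_a.apply(op)
    replica_b.apply(op)
assert replica_a.store == replica_b.store                # identical states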
Passive Replication Primary-backup replication: one replica is the primary. It decides to process a client request and assigns it an order, processes the request, and sends the order to the other replicas. The other replicas are the backups; they update their states after receiving the update from the primary. A sketch of this flow follows below
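A hedged sketch of this passive flow, with in-process method calls standing in for network messages (class and method names are illustrative):

# Passive (primary-backup) replication: the primary orders and processes the
# request, then ships the ordered update to the backups, which apply it
# without re-executing the request themselves.
class Backup:
    def __init__(self):
        self.state = {}
        self.last_order = 0

    def apply_update(self, order, key, value):
        assert order == self.last_order + 1   # apply updates in the primary's order
        self.state[key] = value
        self.last_order = order

class Primary:
    def __init__(self, backups):
        self.state = {}
        self.next_order = 0
        self.backups = backups

    def handle_request(self, key, value):
        self.next_order += 1                  # decide the order
        self.state[key] = value               # process the request
        for b in self.backups:                # send the ordered update to the backups
            b.apply_update(self.next_order, key, value)
        return "ok"

backups = [Backup(), Backup()]
primary = Primary(backups)
primary.handle_request("x", 42)
assert all(b.state == primary.state for b in backups)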
Primary-Backup Replication Server 1 as the primary All the clients know who is the current primary
Passive Replication Primary-Backup We consider benign failures for now Fail-Stop Model A replica follows its specification until it crashes (faulty) A faulty replica does not perform any action (does not recover) A crash is eventually detected by every correct processor No replica is suspected of having crashed until after it actually crashes
How to deal with failure? #1: Backup Failure Discussion: How can we guarantee the backup gets the order? The network is unreliable!
How to deal with failure? #1: Backup Failure Acknowledgment: the backup acknowledges each update. What if a backup times out in acknowledging?
How to deal with failure? #1: Backup Failure What if a backup times out in acknowledging? The primary retries (if the backup failure is transient). What if the failure is "permanent"? Is it safe to ignore the failure? How can the backup catch up later? See the retry sketch below
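A small sketch of the retry logic discussed above; send_update() and its Timeout error are assumed placeholders, not a real API:

import time

class Timeout(Exception):
    """Raised by send_update() when the backup does not acknowledge in time (assumed)."""

def replicate_with_retry(send_update, backup, update, retries=3, delay=0.5):
    # Retry a few times in case the backup failure is transient; if it keeps
    # timing out, treat the failure as "permanent" so the backup must later
    # catch up (e.g., via state transfer) before rejoining.
    for _ in range(retries):
        try:
            send_update(backup, update)       # blocks until the backup acks
            return True                       # acknowledged
        except Timeout:
            time.sleep(delay)                 # wait a bit, then retry
    return False                              # give up: backup needs catch-up later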
How to deal with failure? #2: Primary Failure Discussion: What if the primary fails?
How to deal with failure? #2: Primary Failure Discussion: What if the primary fails? Switch to another one! In GFS: How? What are the issues if we want to do it in a distributed way?
How to deal with failure? #2: Primary Failure Discussion: What if the primary fails? Switch to another one! What are the issues? Could there be accidentally two “valid” primaries? If an op is done before the switch, how to ensure it’s not ”lost” after the switch? What to re-integrate for a recovered server?
Remember the example in our 1st class…
The History of Failure Handling For a long time, people did it manually (with no guaranteed correctness): one primary, one backup. The primary ignores a temporary replication failure of a backup. If the primary crashes, a human operator re-configures the system to use the former backup as the new primary; some ops done by the old primary might be "lost" at the new primary. This is still true in a lot of systems: a consistency checker is run at the end of every day to fix discrepancies (according to some rules).
Viewstamped Replication The first work to handle these failures correctly. Original paper, 1988; "Viewstamped Replication Revisited", 2012. Barbara Liskov, Turing Award 2008. We will discuss her other famous algorithms in a few weeks!
Viewstamped Replication (VR) Overview Static configuration of servers, e.g., p0, p1, p2. To handle primary failure, VR moves through a sequence of "views" 0, 1, 2, 3, … In each view the primary id is deterministic: given a view number, we know which server is the primary. A simple solution: primary id = view number % total servers, so view 0 -> p0, view 1 -> p1, view 2 -> p2, …
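The primary-selection rule from this slide, as a one-line sketch:

# Deterministic primary selection: primary id = view number % total servers.
servers = ["p0", "p1", "p2"]

def primary_of(view: int) -> str:
    return servers[view % len(servers)]

assert primary_of(0) == "p0" and primary_of(1) == "p1" and primary_of(4) == "p1"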
VR conditions An operation is "committed" if it is replicated by a threshold number of servers. Once committed, an operation's order is fixed (each request is assigned a sequence number). Why is it correct? No two different operations are committed with the same order!
Correctness Safety: all the correct replicas maintain the same state. Liveness: the client eventually gets a reply
How to determine the threshold? Quorum: a number of replicas. The primary waits for a quorum = a majority of servers (including itself) before considering an operation committed. Why? Discussion: can two primaries commit different operations with the same order? (Remember the majority voting from mutual exclusion?) No, because any two quorums intersect! What's great about it? If a backup is slow (or temporarily crashed), the primary can still commit as usual
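A small check of the quorum rule: with n servers a majority quorum has floor(n/2)+1 members, so any two quorums overlap in at least one server, which is why two primaries cannot commit different operations at the same position:

from itertools import combinations

def quorum_size(n: int) -> int:
    return n // 2 + 1             # a majority

n = 5
servers = range(n)
q = quorum_size(n)                # 3 out of 5
# Exhaustively check that any two majority quorums intersect.
for a in combinations(servers, q):
    for b in combinations(servers, q):
        assert set(a) & set(b), "two majorities always overlap"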
Key to correctness: Quorum p0: primary of v0, committed operation op0 with sequence number 1 on p0, p2, and p3. p1: primary of v1, committed operation op1 with sequence number 2 on p1, p3, and p4. Overlapped replica: p3 can ensure that op0 and op1 are not assigned the same order
Key to correctness: Quorum intersection If the nodes decide to move to a new view v, all the committed operations in previous views must be known to the primary of view v. How? View v is only active after v's primary has learned the state of a majority of nodes in earlier views
VR Protocol (a sketch) Per-server state: vid, the current view number; lvid, the last normal view number; status (NORMAL, VIEW-CHANGE, or RECOVERING); seq, the sequence number; commit-number; log
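A minimal sketch of this per-server state as a Python dataclass (field names mirror the slide; the protocol in the paper keeps a few more fields, such as a client table):

from dataclasses import dataclass, field
from enum import Enum

class Status(Enum):
    NORMAL = 1
    VIEW_CHANGE = 2
    RECOVERING = 3

@dataclass
class Replica:
    vid: int = 0                  # current view number
    lvid: int = 0                 # last view in which this replica was NORMAL
    status: Status = Status.NORMAL
    seq: int = 0                  # sequence number of the latest request
    commit_number: int = 0        # sequence number of the latest committed request
    log: list = field(default_factory=list)   # requests in sequence-number order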
Normal Case Operations
Normal Case Operations What’s the latency of committing a command? From the primary’s perspective From the client’s perspective
Normal Case Operations How does a backup learn a command’s commit status? Primary piggybacks “commit-number” in its Prepare message
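A hedged sketch of the normal case using the Replica fields above: the primary assigns the next sequence number, sends PREPARE (with commit-number piggybacked) to the backups, and commits once a majority, counting itself, has replied. The send() callback and the ack-counting dict are assumptions of this sketch, not the paper's API:

def on_client_request(primary, backups, request, send):
    primary.seq += 1
    primary.log.append((primary.vid, primary.seq, request))
    for b in backups:
        # Piggyback commit_number so backups learn which earlier ops are committed.
        send(b, ("PREPARE", primary.vid, primary.seq, request, primary.commit_number))

def on_prepare_ok(primary, acks, seq, n_servers):
    # acks[seq] counts votes for seq; the primary itself counts as the first vote.
    acks[seq] = acks.get(seq, 1) + 1
    if acks[seq] >= n_servers // 2 + 1 and seq == primary.commit_number + 1:
        primary.commit_number = seq           # committed: the op's order is now fixed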
When the primary fails Backups monitor the correctness of the primary. In the crash failure model, backups can use a failure detector (we won't cover it in this class); other methods are available… If the primary fails, the other replicas can start a view change to change the primary. Msg type: VIEW-CHANGE
View Change
What to include for the new view before normal operations? General rule: everything that has been committed in previous views should be included. Brief procedure: select the largest sequence number from the logs of the other replicas; if a majority of nodes have included a request m with sequence number s, include m with s in the new log; broadcast the newLog to all the replicas; replicas adopt the order directly. A sketch of this rule follows below
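A simplified sketch of the new-log rule listed above (the full view-change protocol in the paper carries extra information, such as view numbers, that this sketch omits):

def build_new_log(collected_logs, n_servers):
    # collected_logs: one dict per replica, mapping sequence number -> request.
    quorum = n_servers // 2 + 1
    assert len(collected_logs) >= quorum, "need logs from a majority of replicas"
    max_seq = max((max(log) for log in collected_logs if log), default=0)
    new_log = {}
    for s in range(1, max_seq + 1):
        candidates = [log[s] for log in collected_logs if s in log]
        for m in set(candidates):
            if candidates.count(m) >= quorum:
                new_log[s] = m                # m was committed, so it must survive
    return new_log

# Example with 3 servers: logs collected from a majority (2 of 3).
logs = [{1: "op0", 2: "op1"}, {1: "op0"}]
print(build_new_log(logs, 3))                 # {1: 'op0'}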
Choose NewLog Which order should we choose?
Choose NewLog
Other components of the protocol After recovering from a failure, how can a server catch up with the other replicas? Transfer the primary's log. How to avoid transferring (and replaying) the entire log? Checkpointing. A catch-up sketch follows below
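A rough sketch of catch-up from a checkpoint plus a log suffix; the checkpoint arguments and the apply() helper are illustrative assumptions, not taken from the slides:

def apply(state: dict, request):
    key, value = request                      # placeholder deterministic update
    state[key] = value

def catch_up(checkpoint_state: dict, checkpoint_seq: int, log):
    # Start from the latest checkpoint instead of replaying the whole log ...
    state = dict(checkpoint_state)
    # ... then replay only the log entries after the checkpoint.
    for seq, request in log:
        if seq > checkpoint_seq:
            apply(state, request)
    return state

# Example: the checkpoint already covers seq <= 2, so only seq 3 is replayed.
log = [(1, ("x", 1)), (2, ("y", 2)), (3, ("x", 7))]
print(catch_up({"x": 1, "y": 2}, 2, log))     # {'x': 1, 'y': 7}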
Primary-Backup Replication (Proof) Safety (all the correct replicas maintain the same state). Consider two client requests op1 and op2, which return histories H1 and H2. If op1 and op2 are processed by the same replica, safety is trivial, since a benign replica processes operations sequentially (H1 is a prefix of H2 in this case). Suppose op1 and op2 are processed by different replicas p1 and p2. If op1 < op2, then H1 is a prefix of H2, because p1 only updates its state after it receives acknowledgements from the backups when p1 is the primary (the other replicas must have updated their states to include op1). When p2 processes op2, it must have already acknowledged op1; therefore, H1 is a prefix of H2. If op2 < op1, the proof is symmetric.
Primary-Backup Replication (Proof) Liveness (Client eventually gets a reply) The client eventually contacts a correct replica (Since faulty replicas are eventually detected) The correct replica will process the request and respond to the client
Primary-Backup Replication Observation Only the primary actively participates in the agreement. Every replica only needs to update its state (it doesn't necessarily have to execute the requests), so if a request is compute-intensive, backups don't need the computational resources. However, the primary has to send the entire updated state to the backups (which consumes large network bandwidth if the state is large). If the primary fails, we need to detect it immediately and recover (which may incur large latency), so the service may be unavailable for a period of time. The client could contact a backup instead to use the service
Discussion If the primary fails, we need to detect it immediately and recover (which may incur large latency). The client could contact a backup instead to use the service. What are the potential issues?
Reading List (Optional) Brian M. Oki and Barbara H. Liskov. Viewstamped Replication: A new primary copy method to support highly-available distributed systems. PODC 1988. James Cowling and Barbara H. Liskov. Viewstamped replication revisited. MIT, 2012.