Reliable Distributed Systems: Fault Tolerance (Recoverability → High Availability)

Reliability and transactions. Transactions are well matched to the database model and recoverability goals. Transactions don't work well for non-database applications (general-purpose O/S applications) or availability goals (systems that must keep running if applications fail). When building high availability systems, we encounter the replication issue.

Types of reliability. Recoverability: the server can restart without intervention in a sensible state; transactions do give us this. High availability: the system remains operational during failure; the challenge is to replicate critical data needed for continued operation.

Replicating a transactional server. Two broad approaches. Either just use distributed transactions to update multiple copies of each replicated data item; we already know how to do this with 2PC, and each server has "equal status". Or somehow treat replication as a special situation; this leads to a primary server approach with a "warm standby".

Replication with 2PC. Our goal will be "1-copy serializability", defined to mean that the multi-copy system behaves indistinguishably from a single-copy system. Considerable formal and theoretical work has been done on this. As a practical matter: replicate each data item; the transaction manager reads any single copy and updates all copies.

Observation. Notice that the transaction manager must know where the copies reside. In fact there are two models. Static replication set: basically, the set is fixed, although some members may be down. Dynamic: the set changes while the system runs, but only has operational members listed within it. Today we stick to the static case.

Replication and Availability. A series of potential issues: How can we update an object during periods when one of its replicas may be inaccessible? How can the 2PC protocol be made fault-tolerant? That is a topic we'll study in more depth, but the bottom line is: we can't!

Usual responses? Quorum methods: each replicated object has an update quorum and a read quorum, designed so that Qu + Qr > # replicas and Qu + Qu > # replicas. The idea is that any read or update will overlap with the last update.

Quorum example. X is replicated at {a,b,c,d,e}. Possible values? Qu = 1, Qr = 5 (violates Qu + Qu > 5); Qu = 2, Qr = 4 (same issue); Qu = 3, Qr = 3; Qu = 4, Qr = 2; Qu = 5, Qr = 1 (violates availability). Probably prefer Qu = 4, Qr = 2.
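As a rough illustration (not part of the original slides), the short Python sketch below enumerates the same quorum choices for N = 5 replicas and checks the two overlap rules; the variable names are made up for this example.

```python
# Hypothetical sketch: check the quorum overlap rules for N = 5 replicas.
# Qu + Qr > N ensures every read overlaps the last update;
# Qu + Qu > N ensures any two updates overlap at some replica.
N = 5

for q_u in range(1, N + 1):
    q_r = N + 1 - q_u              # smallest read quorum that still overlaps updates
    read_ok = q_u + q_r > N        # read/update overlap (always true here)
    write_ok = q_u + q_u > N       # update/update overlap
    print(f"Qu={q_u}, Qr={q_r}: {'ok' if read_ok and write_ok else 'violates overlap'}")

# Matches the slide: Qu=1 and Qu=2 violate the update overlap rule,
# Qu=3..5 satisfy both, and Qu=5 is ruled out in practice because an
# update would then need every replica to be reachable.
```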

Things to notice. Even reading a data item requires that multiple copies be accessed! This could be much slower than normal local access performance. Also, notice that we won't know if we succeeded in reaching the update quorum until we get responses. This implies that any quorum replication scheme needs a 2PC protocol to commit.

Next issue? Now we know that we can solve the availability problem for reads and updates if we have enough copies. What about for 2PC? We need to tolerate crashes before or during runs of the protocol, a well-known problem.

Availability of 2PC. It is easy to see that 2PC is not able to guarantee availability. Suppose that the manager talks to 3 processes, and suppose 1 process and the manager fail. The other 2 are "stuck" and can't terminate the protocol.

What can be done? We'll revisit this issue soon. Basically, we can extend to a 3PC protocol that will tolerate failures if we have a reliable way to detect them. But network problems can be indistinguishable from failures, hence there is no commit protocol that can tolerate failures without blocking. Anyhow, the cost of 3PC is very high.

A quandary? We set out to replicate data for increased availability, and concluded that the quorum scheme works for updates, but a commit is required, and that represents a vulnerability. Other options?

Other options. We mentioned primary-backup schemes. These are a second way to solve the problem, based on the log at the data manager.

Server replication. Suppose the primary sends the log to the backup server. It replays the log and applies committed transactions to its replicated state. If the primary crashes, the backup soon catches up and can take over.
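To make the log-shipping idea concrete, here is a minimal backup-side replay sketch. The in-memory log-record format (dicts with "write"/"commit"/"abort" types) is invented for this example and is not taken from the slides.

```python
# Minimal sketch of backup-side log replay (illustrative data model).
# The backup applies a transaction's writes only once it has seen the
# corresponding commit record; uncommitted tail updates are lost if the
# primary crashes before shipping the commit record.
from collections import defaultdict

def replay(log, state):
    pending = defaultdict(list)          # txn id -> buffered writes
    for record in log:                   # records arrive in log order
        if record["type"] == "write":
            pending[record["txn"]].append((record["key"], record["value"]))
        elif record["type"] == "commit":
            for key, value in pending.pop(record["txn"], []):
                state[key] = value       # apply committed writes
        elif record["type"] == "abort":
            pending.pop(record["txn"], None)
    return state

# Example: the commit record for txn 2 never reached the backup, so its
# update is missing after takeover -- the "slightly stale" case below.
log = [
    {"type": "write", "txn": 1, "key": "x", "value": 1},
    {"type": "commit", "txn": 1},
    {"type": "write", "txn": 2, "key": "x", "value": 2},
]
print(replay(log, {}))   # {'x': 1}
```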

Primary/backup (diagram). Clients are initially connected to the primary, which keeps the backup up to date; the backup tracks the log.

Primary/backup (diagram). The primary crashes. The backup sees the channel break and applies committed updates, but it may have missed the last few updates!

Primary/backup (diagram). Clients detect the failure and reconnect to the backup. But some clients may have "gone away", and the backup state could be slightly stale; new transactions might suffer from this.

Issues? Under what conditions should the backup take over? This revisits the consistency problem seen earlier with clients and servers; we could end up with a "split brain". Also notice that this still needs 2PC to ensure that primary and backup stay in the same states!

Split brain: reminder (diagram). Clients are initially connected to the primary, which keeps the backup up to date; the backup follows the log.

Split brain: reminder (diagram). A transient problem causes some links to break, but not all. The backup thinks it is now primary; the primary thinks the backup is down.

Split brain: reminder (diagram). Some clients are still connected to the primary, but one has switched to the backup and one is completely disconnected from both.

Implication? A strict interpretation of ACID leads to the conclusion that there are no ACID replication schemes that provide high availability. Most real systems solve this by weakening ACID.

Real systems. They use primary-backup with logging, but they simply omit the 2PC. The server might take over in the wrong state (it may lag the state of the primary). They can use hardware to reduce or eliminate the split-brain problem.

How does hardware help? The idea is that primary and backup share a disk. The hardware is configured so only one can write the disk. If the backup takes over, it grabs the "token". Token loss causes the primary to shut down (if it hasn't actually crashed).
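One way to picture the shared-disk "token" is as a lease that the primary must keep renewing. The sketch below is purely illustrative: the Token class and its lease timing are invented here, and real systems rely on disk reservations or similar hardware support rather than this logic.

```python
# Illustrative lease-style sketch of the shared "token": the holder must
# re-acquire before the lease expires; a backup that takes over grabs the
# token, and a primary that can no longer acquire it shuts itself down.
class Token:
    def __init__(self, lease=2.0):
        self.holder, self.expires, self.lease = None, 0.0, lease

    def acquire(self, who, now):
        # Only one server can hold the token at a time (the shared-disk
        # hardware enforces this exclusivity in the real scheme).
        if self.holder in (None, who) or now > self.expires:
            self.holder, self.expires = who, now + self.lease
            return True
        return False

token = Token()
assert token.acquire("primary", now=0.0)        # primary owns the disk
assert not token.acquire("backup", now=1.0)     # backup cannot grab it yet
assert token.acquire("backup", now=3.5)         # primary stopped renewing; backup takes over
assert not token.acquire("primary", now=3.6)    # old primary loses the token and shuts down
```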

Reconciliation. This is the problem of fixing the transactions impacted by the lack of 2PC. Usually it is just a handful of transactions: they committed, but the backup doesn't know, because it never saw the commit record. Later, the server recovers and we discover the problem. We need to apply the missing ones; this also causes cascaded rollback. The worst case may require human intervention.

Summary. Reliability can be understood in terms of availability (the system keeps running during a crash) and recoverability (the system can recover automatically). Transactions are best for the latter. Some systems need both sorts of mechanisms, but there are "deep" tradeoffs involved.

Replication and High Availability. All is not lost! Suppose we move away from the transactional model. Can we replicate data at lower cost and with high availability? This leads to the "virtual synchrony" model: it treats data as the "state" of a group of participating processes, and a replicated update is done with multicast.

Steps to a solution. First look more closely at 2PC, 3PC, and failure detection. 2PC and 3PC both "block" in real settings, but we can replace failure detection by consensus on membership; then these protocols become non-blocking (although solving a slightly different problem). The generalized approach leads to ordered atomic multicast in dynamic process groups.

Non-blocking Commit. Goal: a protocol that allows all operational processes to terminate the protocol even if some subset crashes. This is needed if we are to build high availability transactional systems (or systems that use quorum replication).

Definition of the problem. Given a set of processes, one of which wants to initiate an action. Participants may vote for or against the action. The originator will perform the action only if all vote in favor; if any votes against (or doesn't vote), we will "abort" the protocol and not take the action. The goal is an all-or-nothing outcome.

Non-triviality. We want to avoid solutions that do nothing (the trivial case of "all or none"). We would like to say that if all vote for commit, the protocol will commit... but in distributed systems we can't be sure votes will reach the coordinator! Any "live" protocol risks making a mistake and counting a live process that voted to commit as a failed process, leading to an abort. Hence, the non-triviality condition is hard to capture.

Typical protocol. The coordinator asks all processes if they can take the action. Processes decide if they can and send back "ok" or "abort". The coordinator collects all the answers (or times out), computes the outcome, and sends it back.

Commit protocol illustrated (diagram). The coordinator asks the participants "ok to commit?".

Commit protocol illustrated (diagram). The coordinator asks "ok to commit?"; the participants reply "ok with us".

Commit protocol illustrated (diagram). The coordinator asks "ok to commit?"; the participants reply "ok with us"; the coordinator sends "commit". Note: garbage collection protocol not shown here.
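A compact sketch of the coordinator side of the exchange just illustrated; the messaging is simulated with plain method calls, timeouts and logging are omitted, and the Participant class is invented for this example.

```python
# Illustrative two-phase commit coordinator (messaging simulated with
# function calls; a real implementation would also log its decision and
# handle timeouts, as discussed in the surrounding slides).
def two_phase_commit(participants):
    # Phase 1: ask every participant whether it can take the action.
    votes = [p.prepare() for p in participants]        # "ok" or "abort"
    decision = "commit" if all(v == "ok" for v in votes) else "abort"
    # Phase 2: announce the outcome to everyone.
    for p in participants:
        p.finish(decision)
    return decision

class Participant:
    def __init__(self, can_commit=True):
        self.can_commit, self.outcome = can_commit, None
    def prepare(self):
        return "ok" if self.can_commit else "abort"
    def finish(self, decision):
        self.outcome = decision

print(two_phase_commit([Participant(), Participant()]))        # commit
print(two_phase_commit([Participant(), Participant(False)]))   # abort
```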

Failure issues. So far, we have implicitly assumed that processes fail by halting (and hence not voting). In real systems a process could fail in arbitrary ways, even maliciously. This has led to work on the "Byzantine generals" problem, which is a variation on commit set in a "synchronous" model with malicious failures.

Failure model impacts costs! The Byzantine model is very costly: 3t+1 processes are needed to overcome t failures, and the protocol runs in t+1 rounds. This cost is unacceptable for most real systems, hence such protocols are rarely used. Main areas of application: hardware fault-tolerance, security systems. For these reasons, we won't study such protocols.

Commit with a simpler failure model. Assume processes fail by halting. The coordinator detects failures (unreliably) using timeouts; it can make mistakes! Now the challenge is to terminate the protocol if the coordinator fails instead of, or in addition to, a participant!

Commit protocol illustrated (diagram). The coordinator asks "ok to commit?"; the live participants reply "ok with us", but one participant has crashed. The coordinator times out and sends "abort!". Note: garbage collection protocol not shown here.

Example of a hard scenario. The coordinator starts the protocol. One participant votes to abort, all others to commit. The coordinator and one participant now fail... we now lack the information to correctly terminate the protocol!

Commit protocol illustrated (diagram). The coordinator asked "ok to commit?" and the surviving participants replied "ok", but the coordinator's decision is unknown and the failed participant's vote is unknown.

Example of a hard scenario. The problem is that if the coordinator told the failed participant to abort, all must abort; if it voted for commit and was told to commit, all must commit. The surviving participants can't deduce the outcome without knowing how the failed participant voted. Thus the protocol "blocks" until recovery occurs.

Skeen: Three-phase commit. It seeks to increase availability. It makes an unrealistic assumption that failures are accurately detectable. With this, the protocol can terminate even if a failure does occur.

Skeen: Three-phase commit. The coordinator starts the protocol by sending a request. Participants vote to commit or to abort. The coordinator collects the votes and decides on the outcome. The coordinator can abort immediately; to commit, it first sends a "prepare to commit" message. Participants acknowledge, and the commit occurs during a final round of "commit" messages.

Three-phase commit protocol illustrated (diagram). The coordinator asks "ok to commit?"; participants reply "ok"; the coordinator sends "prepare to commit"; participants reply "prepared"; the coordinator sends "commit". Note: garbage collection protocol not shown here.
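For comparison, a similarly stylized coordinator loop for the three-phase variant; again the messaging is simulated, failures are not modeled, and all names are illustrative rather than taken from Skeen's protocol description.

```python
# Stylized three-phase commit coordinator (failure handling omitted).
# The extra "prepare to commit" round means that if any process reaches
# the prepared state, every participant is known to have voted commit,
# so survivors can safely run the protocol forward to commit.
class Participant:
    def __init__(self, can_commit=True):
        self.can_commit, self.state = can_commit, "init"
    def vote(self):
        return "ok" if self.can_commit else "abort"
    def deliver(self, msg):          # "prepare to commit", "commit", or "abort"
        self.state = msg

def three_phase_commit(participants):
    if not all(p.vote() == "ok" for p in participants):   # phase 1: collect votes
        for p in participants:
            p.deliver("abort")
        return "abort"
    for p in participants:
        p.deliver("prepare to commit")                     # phase 2: prepare
    for p in participants:
        p.deliver("commit")                                # phase 3: commit
    return "commit"

group = [Participant(), Participant()]
print(three_phase_commit(group), [p.state for p in group])  # commit ['commit', 'commit']
```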

Observations about 3PC. If any process is in "prepare to commit", all voted for commit. The protocol commits only when all surviving processes have acknowledged prepare to commit. After the coordinator fails, it is easy to run the protocol forward to the commit state (or back to the abort state).

Assumptions about failures. If the coordinator suspects a failure, the failure is "real", and the faulty process, if it later recovers, will know it was faulty. Failures are detectable with bounded delay. On recovery, a process must go through a reconnection protocol to rejoin the system! (It must find out the status of pending protocols that terminated while it was not operational.)

Problems with 3PC. With realistic failure detectors (that can make mistakes), the protocol still blocks! The bad case arises during "network partitioning", when the network splits the participating processes into two or more sets of operational processes. One can prove that this problem is not avoidable: there are no non-blocking commit protocols for asynchronous networks.

Situation in practical systems? Most use protocols based on 2PC: 3PC is more costly and, ultimately, still subject to blocking! They need to be extended with a form of garbage collection mechanism to avoid accumulation of protocol state information (this can be handled in the background). Some systems simply accept the risk of blocking when a failure occurs; others reduce the consistency property to make progress, at the risk of inconsistency with failed processes.

Process groups. To overcome the cost of replication, we will introduce a dynamic process group model (processes that join and leave while the system is running). We will also relax our consistency goal: we seek only consistency within a set of processes that all remain operational and members of the system. In this model, 3PC is non-blocking! This yields an extremely cheap replication scheme!

Failure detection. Basic question: how to detect a failure? Wait until the process recovers: if it was dead, it tells you "I died, but I feel much better now"; this could be a long wait. Use some form of probe, but it might make mistakes. Instead, substitute agreement on membership. Now failure is a "soft" concept: rather than "up" or "down", we think about whether a process is behaving acceptably in the eyes of peer processes.

Architecture. Membership agreement handles "join/leave" and "P seems to be unresponsive" events. 3PC-like protocols use membership changes instead of failure notifications. Applications use replicated data for high availability.

Issues? How to "detect" failures: we can use timeouts, or other system monitoring tools and interfaces; sometimes we can exploit hardware. Tracking membership: basically, we need a new replicated service whose managed data is the system membership "lists". We'll say it takes join/leave requests as input and produces "views" as output.

Architecture (diagram). Application processes X, Y, Z and GMS processes A, B, C, D. Join and leave requests, plus a report that "A seems to have failed", drive a sequence of membership views: {A}, {A,B,D}, {A,D}, {A,D,C}, {D,C}.

Issues. The group membership service (GMS) has just a small number of members. This core set tracks membership for a large number of system processes. Internally it runs a group membership protocol (GMP). The full system membership list is just replicated data managed by the GMS members, updated using multicast.
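The view-producing behaviour described here can be mimicked with a tiny illustrative tracker; the View and GMS classes below are invented for this sketch, and the real GMS must itself run an agreement protocol among its members rather than this single-process loop.

```python
# Illustrative membership-view tracker: each join or leave event -- including
# "p seems faulty" reports -- produces a new numbered view, which a real GMS
# would replicate to all members via multicast.
from dataclasses import dataclass

@dataclass(frozen=True)
class View:
    number: int
    members: frozenset

class GMS:
    def __init__(self, initial):
        self.view = View(0, frozenset(initial))

    def join(self, p):
        return self._install(self.view.members | {p})

    def leave(self, p):                  # also used when p is suspected faulty
        return self._install(self.view.members - {p})

    def _install(self, members):
        self.view = View(self.view.number + 1, frozenset(members))
        return self.view

gms = GMS({"A"})
print(gms.join("B"))     # view 1: {A, B}
print(gms.join("D"))     # view 2: {A, B, D}
print(gms.leave("B"))    # view 3: {A, D} -- roughly mirrors the figure's sequence
```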

GMP design. What protocol should we use to track the membership of the GMS itself? We must avoid the split-brain problem and want continuous availability. We'll see that a version of 3PC can be used, but it can't "always" guarantee liveness.

Reading ahead? Read chapters 12 and 13. Thought problem: how important is external consistency (called dynamic uniformity in the text)? Homework: read about FLP. Identify other "impossibility results" for distributed systems. What is the simplest case of an impossibility result that you can identify?