COMP 655: Distributed/Operating Systems, Summer 2011
Dr. Chunbo Chu
Week 7: Fault Tolerance
11/13/2015, Distributed Systems - COMP 655


Fault Tolerance
Fault tolerance concepts
Implementation – distributed agreement
Distributed agreement meets transaction processing: 2- and 3-phase commit
Bonus material
  – Implementation – reliable point-to-point communication
  – Implementation – process groups
  – Implementation – reliable multicast
  – Recovery
  – Sparing

Fault tolerance concepts
Availability – can I use it now?
  – Usually quantified as a percentage
Reliability – can I use it for a certain period of time?
  – Usually quantified as MTBF (mean time between failures)
Safety – will anything really bad happen if it does fail?
Maintainability – how hard is it to fix when it fails?
  – Usually quantified as MTTR (mean time to repair)

Comparing nines
1 year = 8760 hr
Availability levels
  – 90% = 876 hr downtime/yr
  – 99% = 87.6 hr downtime/yr
  – 99.9% = 8.76 hr downtime/yr
  – 99.99% = 52.6 min downtime/yr
  – 99.999% = 5.26 min downtime/yr
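The downtime figures follow directly from the 8760-hour year. A minimal sketch of that arithmetic, where the function name `downtime_hours_per_year` is an illustrative choice and not something from the course material:

```python
HOURS_PER_YEAR = 8760  # 365 days, as stated on the slide


def downtime_hours_per_year(availability_percent: float) -> float:
    """Expected downtime in hours per year for a given availability percentage."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


for pct in (90, 99, 99.9, 99.99, 99.999):
    hours = downtime_hours_per_year(pct)
    print(f"{pct}% -> {hours:.4f} hr/yr = {hours * 60:.2f} min/yr")
```

Each additional nine divides the allowed downtime by ten, which is why five nines leaves only about five minutes per year.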

Exercise: how to get five nines
1. Brainstorm what you would have to deal with to build a single-machine system that could run for five years with 25 min downtime. Consider:
  – Hardware failures, especially disks
  – Power failures
  – Network outages
  – Software installation
  – What else?
2. Come up with some ideas about how to solve the problems you identify

Multiple machines at 99%
Assuming independent failures

Multiple machines at 95%
Assuming independent failures

Multiple machines at 80%
Assuming independent failures
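The charts for these three slides did not survive transcription. As a hedged sketch of the arithmetic behind "independent failures", the two standard cases are a system that needs every machine up and a system that needs only one. The function names below are illustrative, and which case the original charts plotted is an assumption.

```python
def all_must_work(a: float, n: int) -> float:
    """Availability when all n independent machines must be up (series)."""
    return a ** n


def any_one_suffices(a: float, n: int) -> float:
    """Availability when at least one of n independent machines must be up (parallel)."""
    return 1 - (1 - a) ** n


for a in (0.99, 0.95, 0.80):
    for n in (1, 2, 3, 5):
        print(f"a={a:.2f} n={n}: all={all_must_work(a, n):.4f} "
              f"any={any_one_suffices(a, n):.6f}")
```

Adding machines that all have to work lowers availability, while adding redundant machines raises it, which is the motivation for replication later in the deck.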

[…],000 components

Things to watch out for in availability requirements
What constitutes an outage …
  – A client PC going down?
  – A client applet going into an infinite loop?
  – A server crashing?
  – A network outage?
  – Reports unavailable?
  – If a transaction times out?
  – If 100 transactions time out in a 10 min period?
  – etc

More to watch out for
What constitutes being back up after an outage?
When does an outage start?
When does it end?
Are there outages that don’t count?
  – Natural disasters?
  – Outages due to operator errors?
What about MTBF?

Ways to get 99% availability
1. MTBF = 99 hr, MTTR = 1 hr
2. MTBF = 99 min, MTTR = 1 min
3. MTBF = 99 sec, MTTR = 1 sec
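All three options give the same number because steady-state availability depends only on the ratio of uptime to total time. A small sketch using the usual formula MTBF / (MTBF + MTTR), with an illustrative function name:

```python
def availability(mtbf: float, mttr: float) -> float:
    """Steady-state availability; mtbf and mttr must use the same time unit."""
    return mtbf / (mtbf + mttr)


# Hours, minutes, or seconds: the unit cancels out.
for unit in ("hr", "min", "sec"):
    print(f"MTBF = 99 {unit}, MTTR = 1 {unit}: {availability(99, 1):.2%}")
```

The three systems behave very differently for users (one long outage every four days versus a one-second hiccup every 100 seconds), which is why an availability percentage alone is an incomplete requirement.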

More definitions
A fault causes an error; an error may cause a failure.
Fault tolerance is continuing to work correctly in the presence of faults.
Types of faults:
  – transient
  – intermittent
  – permanent

Types of failures

If you remember one thing
Components fail in distributed systems on a regular basis.
Distributed systems have to be designed to deal with the failure of individual components so that the system as a whole
  – Is available and/or
  – Is reliable and/or
  – Is safe and/or
  – Is maintainable
depending on the problem it is trying to solve and the resources available …

Fault Tolerance
Fault tolerance concepts
Implementation – distributed agreement
Distributed agreement meets transaction processing: 2- and 3-phase commit

Two-army problem
Red army has 5,000 troops
Blue army and White army have 3,000 troops each
Attack together and win
Attack separately and lose in serial
Communication is by messenger, who might be captured
Blue and White generals have no way to know when a messenger is captured

Activity: outsmart the generals
Take your best shot at designing a protocol that can solve the two-army problem
Spend ten minutes
Did you think of anything promising?

Conclusion: go home
“agreement between even two processes is not possible in the face of unreliable communication”

Byzantine generals
Assume perfect communication
Assume n generals, m of whom should not be trusted
The problem is to reach agreement on troop strength among the non-faulty generals

Byzantine generals - example
n = 4, m = 1 (units are K-troops)
(a) Multicast troop-strength messages
(b) Construct troop-strength vectors
(c) Compare notes: majority rules in each component
Result: 1, 2, and 4 agree on (1, 2, unknown, 4)
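Steps (a) to (c) can be traced in a few lines. This is a simplified sketch, not the full recursive algorithm: general ids are 0-based (so the slide's general 3 is index 2), the faulty general's "arbitrary" messages are modeled by a counter that never repeats a value, and the name `byzantine_round` is hypothetical.

```python
from collections import Counter
from itertools import count


def byzantine_round(strengths, faulty):
    """Return the result vector computed by each loyal general.

    strengths: true troop strength per general (index = general id).
    faulty: set of general ids that send arbitrary values.
    """
    n = len(strengths)
    lies = count(100)  # assumption: every lie is a fresh, distinct value

    # Steps (a) and (b): vectors[i][j] is what general i heard from general j.
    vectors = [
        [strengths[j] if (j == i or j not in faulty) else next(lies)
         for j in range(n)]
        for i in range(n)
    ]

    # Step (c): each general receives the others' vectors and takes the
    # majority value per component; None stands for "unknown".
    results = {}
    for i in range(n):
        if i in faulty:
            continue
        received = [
            [next(lies) for _ in range(n)] if j in faulty else vectors[j]
            for j in range(n) if j != i
        ]
        result = []
        for k in range(n):
            value, votes = Counter(v[k] for v in received).most_common(1)[0]
            result.append(value if votes > len(received) / 2 else None)
        results[i] = tuple(result)
    return results


print(byzantine_round([1, 2, 3, 4], faulty={2}))
print(byzantine_round([1, 2, 3], faulty={2}))
```

With four generals the three loyal ones all compute (1, 2, None, 4). With three generals each loyal general hears only two vectors, no component reaches a majority, and every entry stays unknown.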

Doesn’t work with n = 3, m = 1

Distributed commit protocols
What is the problem they are trying to solve?
  – Ensure that a group of processes all do something, or none of them do
  – Example: in a distributed transaction that involves updates to data on three different servers, ensure that all three commit or none of them do
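The all-or-nothing goal can be illustrated with a minimal, failure-free sketch of a two-phase commit coordinator. The `Participant` class and `two_phase_commit` function are hypothetical in-memory stand-ins; a real implementation would exchange messages, use timeouts, and log every state change to stable storage.

```python
class Participant:
    """Hypothetical in-memory participant (no network, no stable storage)."""

    def __init__(self, name, will_commit=True):
        self.name = name
        self.will_commit = will_commit
        self.state = "INIT"

    def prepare(self):
        # Vote phase: READY on a yes vote, unilateral ABORT on a no vote.
        self.state = "READY" if self.will_commit else "ABORT"
        return self.will_commit

    def commit(self):
        self.state = "COMMIT"

    def abort(self):
        self.state = "ABORT"


def two_phase_commit(participants):
    """Phase 1 collects votes; phase 2 broadcasts the global decision."""
    votes = [p.prepare() for p in participants]
    decision = "COMMIT" if all(votes) else "ABORT"
    for p in participants:
        if decision == "COMMIT":
            p.commit()
        else:
            p.abort()
    return decision


servers = [Participant("s1"), Participant("s2"), Participant("s3", will_commit=False)]
print(two_phase_commit(servers), [p.state for p in servers])
```

A single no vote forces every server to abort, so the three servers in the slide's example never end up in mixed states.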

2-phase commit
(Figure panels: Coordinator; Participant)
What to do when P, in READY state, contacts Q

If coordinator crashes
Participants could wait until the coordinator recovers
Or, they could try to figure out what to do among themselves
  – Example: if P contacts Q, and Q is in the COMMIT state, P should COMMIT as well

2-phase commit
What to do when P, in READY state, contacts Q
If all surviving participants are in READY state,
1. Wait for coordinator to recover
2. Elect a new coordinator (?)
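The rule for a READY participant P that cannot reach the coordinator can be sketched as a lookup on the state of each peer Q. The COMMIT case and the all-READY case come from the slides; the ABORT and INIT cases follow the usual textbook treatment of two-phase commit and are not stated on the slides. Function names are illustrative.

```python
def decide_from_peer(q_state: str) -> str:
    """What READY participant P learns from one peer Q."""
    if q_state == "COMMIT":
        return "COMMIT"        # the coordinator must have decided to commit
    if q_state in ("ABORT", "INIT"):
        return "ABORT"         # Q aborted, or never voted, so no commit happened
    return "ASK_ANOTHER"       # Q is READY too and knows no more than P


def resolve(peer_states) -> str:
    """Ask peers in turn; block if every surviving peer is also READY."""
    for q_state in peer_states:
        decision = decide_from_peer(q_state)
        if decision != "ASK_ANOTHER":
            return decision
    return "BLOCK"             # wait for the coordinator (or elect a new one)


print(resolve(["READY", "COMMIT"]))
print(resolve(["READY", "READY"]))
```

The BLOCK outcome is the weakness of two-phase commit that three-phase commit is designed to remove.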

3-phase commit
Problem addressed:
  – Non-blocking distributed commit in the presence of failures
  – Interesting theoretically, but rarely used in practice

3-phase commit
(Figure panels: Coordinator; Participant)

Bonus material
Implementation – reliable point-to-point communication
Implementation – process groups
Implementation – reliable multicast
Recovery
Sparing

RPC, RMI crash & omission failures
Client can’t locate server
Request lost
Server crashes after receipt of request
Response lost
Client crashes after sending request

Can’t locate server
Raise an exception, or
Send a signal, or
Log an error and return an error code
Note: hard to mask distribution in this case

Request lost
Timeout and retry
Back off to “cannot locate server” if too many timeouts occur
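A minimal sketch of the timeout-and-retry policy, assuming the transport signals a lost request by raising Python's built-in `TimeoutError`. The names `call_with_retry`, `send_request`, and `CannotLocateServer` are illustrative.

```python
class CannotLocateServer(Exception):
    """Raised when the client gives up after repeated timeouts."""


def call_with_retry(send_request, max_attempts=3):
    """Call send_request(); retry on timeout, then fall back to 'cannot locate server'."""
    for _ in range(max_attempts):
        try:
            return send_request()
        except TimeoutError:
            continue  # request or reply was lost; try again
    raise CannotLocateServer(f"no reply after {max_attempts} attempts")


attempts = {"n": 0}


def flaky_send():
    # Simulated transport: the first two requests are lost.
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TimeoutError
    return "reply"


print(call_with_retry(flaky_send))
```

Retrying is only safe if repeating the request is harmless, which is the subject of the following slides.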

Server crashes after receipt of request
Possible semantic commitments
  – Exactly once
  – At least once
  – At most once
(Figure panels: Normal; Work done; Work not done)

Behavioral possibilities
Server events
  – Process (P)
  – Send completion message (M)
  – Crash (C)
Server order
  – P then M
  – M then P
Client strategies
  – Retry every message
  – Retry no messages
  – Retry if unacknowledged
  – Retry if acknowledged

Combining the options
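The table for this slide was not transcribed, but the combinations can be enumerated. The sketch below assumes a single crash, that the recovered server handles a retried request without crashing again, and that the client retries at most once. OK means the work ran once, DUP twice, ZERO never. All names are illustrative.

```python
SERVER_ORDERS = {"P then M": "PM", "M then P": "MP"}

CLIENT_STRATEGIES = {
    "retry every message": lambda acked: True,
    "retry no messages": lambda acked: False,
    "retry if unacknowledged": lambda acked: not acked,
    "retry if acknowledged": lambda acked: acked,
}


def outcome(order: str, events_before_crash: int, retries) -> str:
    """order is 'PM' or 'MP'; events_before_crash is 0, 1, or 2."""
    done = order[:events_before_crash]
    executions = int("P" in done) + int(retries("M" in done))
    return {0: "ZERO", 1: "OK", 2: "DUP"}[executions]


for strategy, retries in CLIENT_STRATEGIES.items():
    for label, order in SERVER_ORDERS.items():
        row = [outcome(order, k, retries) for k in (2, 1, 0)]
        print(f"{strategy:24s} {label}: {row}")
```

No client strategy yields OK in all six crash scenarios, which is why exactly-once semantics cannot be guaranteed by the client alone.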

Lost replies
Make server operations idempotent whenever possible
Structure requests so that server can distinguish retries from the original
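One common way to let a server distinguish retries from originals is a client-chosen request id plus a reply cache. The sketch below is an assumption about how that could look; `DedupServer` and its methods are hypothetical, and a production server would also bound and persist the cache.

```python
class DedupServer:
    """Makes a non-idempotent operation (deposit) safe to retry."""

    def __init__(self):
        self.balance = 0
        self.replies = {}  # request_id -> reply already sent

    def deposit(self, request_id: str, amount: int) -> int:
        if request_id in self.replies:
            # Retry of a request we already executed: resend the old reply.
            return self.replies[request_id]
        self.balance += amount
        self.replies[request_id] = self.balance
        return self.balance


server = DedupServer()
print(server.deposit("req-1", 50))  # original request
print(server.deposit("req-1", 50))  # retry after a lost reply
print(server.deposit("req-2", 25))  # a genuinely new request
```

The retry returns the cached reply and leaves the balance unchanged, so the client can safely use a retry-on-timeout policy.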

Client crashes
The server-side activity is called an orphan computation
Orphans can tie up resources, hold locks, etc
Four strategies (at least)
  – Extermination, based on client-side logs
      Client writes a log record before and after each call
      When client restarts after a crash, it checks the log and kills outstanding orphan computations
      Problems include:
        – Lots of disk activity
        – Grand-orphans

Client crashes, continued
More approaches for handling orphans
  – Re-incarnation, based on client-defined epochs
      When client restarts after a crash, it broadcasts a start-of-epoch message
      On receipt of a start-of-epoch message, each server kills any computation for that client
  – “Gentle” re-incarnation
      Similar, but server tries to verify that a computation is really an orphan before killing it

Yet more client-crash strategies
One more strategy
  – Expiration
      Each computation has a lease on life
      If not complete when the lease expires, a computation must obtain another lease from its owner
      Clients wait one lease period before restarting after a crash (so any orphans will be gone)
      Problem: what’s a reasonable lease period?
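The expiration strategy can be sketched with explicit timestamps instead of a real clock. The `Computation` class and the `owner_is_up` callback are hypothetical; in a real system the renewal would be a message to the client that may itself be lost.

```python
class Computation:
    """Server-side computation that lives only while its lease is valid."""

    def __init__(self, owner: str, lease_period: float, now: float):
        self.owner = owner
        self.lease_period = lease_period
        self.expires_at = now + lease_period
        self.alive = True

    def tick(self, now: float, owner_is_up) -> bool:
        """Check the lease at time `now`; renew it or die. Returns liveness."""
        if self.alive and now >= self.expires_at:
            if owner_is_up(self.owner):
                self.expires_at = now + self.lease_period  # lease renewed
            else:
                self.alive = False                         # orphan removed
        return self.alive


job = Computation("client-1", lease_period=10, now=0)
print(job.tick(10, owner_is_up=lambda owner: True))   # owner alive: renewed
print(job.tick(20, owner_is_up=lambda owner: False))  # owner crashed: killed
```

A short lease removes orphans quickly but increases renewal traffic and forces a restarting client to wait less; a long lease does the opposite, which is the tuning question the slide raises.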

Common problems with client-crash strategies
Crashes that involve network partition (communication between partitions will not work at all)
Killed orphans may leave persistent traces behind, for example
  – Locks
  – Requests in message queues

How to do it?
Redundancy applied
  – In the appropriate places
  – In the appropriate ways
Types of redundancy
  – Data (e.g. error correcting codes, replicated data)
  – Time (e.g. retry)
  – Physical (e.g. replicated hardware, backup systems)

Triple Modular Redundancy
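The figure for this slide was not transcribed. The core of triple modular redundancy is a majority voter over three replicated outputs, sketched below with an illustrative function name.

```python
from collections import Counter


def tmr_vote(a, b, c):
    """Return the value produced by at least two of the three replicas."""
    value, votes = Counter([a, b, c]).most_common(1)[0]
    if votes >= 2:
        return value
    raise ValueError("no majority: more than one replica is faulty")


print(tmr_vote(1, 1, 0))  # one faulty replica is masked
```

A single voter is itself a single point of failure, so hardware designs typically triplicate the voters at each stage as well.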

Tandem Computers
TMR on
  – CPUs
  – Memory
Duplicated
  – Buses
  – Disks
  – Power supplies
A big hit in operations systems for a while

Replicated processing
Based on process groups
A process group consists of one or more identical processes
Key events
  – Message sent to one member of a group
  – Process joins group
  – Process leaves group
  – Process crashes
Key requirements
  – Messages must be received by all members
  – All members must agree on group membership

Flat or non-flat?

Effective process groups require
Distributed agreement
  – On group membership
  – On coordinator elections
  – On whether or not to commit a transaction
Effective communication
  – Reliable enough
  – Scalable enough
  – Often, multicast
  – Typically looking for atomic multicast

Process groups also require
Ability to tolerate crash failures and omission failures
  – Need k+1 processes to deal with up to k silent failures
Ability to tolerate performance, response, and arbitrary failures
  – Need 3k+1 processes to reach agreement with up to k Byzantine failures
  – Need 2k+1 processes to ensure that a majority of the system produces the correct results with up to k Byzantine failures
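The three bounds can be collected in a small helper for sizing a group. The function name and the failure-model labels are illustrative, not terms from the slides.

```python
def min_group_size(k: int, failure_model: str) -> int:
    """Minimum number of processes needed to tolerate up to k failures."""
    sizes = {
        "silent": k + 1,                   # crash or omission: one survivor suffices
        "byzantine_majority": 2 * k + 1,   # correct replies must outvote k bad ones
        "byzantine_agreement": 3 * k + 1,  # processes must agree among themselves
    }
    return sizes[failure_model]


for model in ("silent", "byzantine_majority", "byzantine_agreement"):
    print(model, [min_group_size(k, model) for k in (1, 2, 3)])
```

For k = 1 the agreement bound gives 4, matching the Byzantine generals example where n = 4 works and n = 3 does not.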

Reliable multicasting

Scalability problem
Too many acknowledgements
  – One from each receiver
  – Can be a huge number in some systems
  – Also known as “feedback implosion”

Basic feedback suppression in scalable reliable multicast
If a receiver decides it has missed a message, it waits a random time, then multicasts a retransmission request
While waiting, if it sees a sufficient request from another receiver, it does not send its own request
The server multicasts all retransmissions
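A rough simulation shows why the random wait suppresses most requests. It assumes every receiver missed the same message, timers are drawn uniformly at random, and a multicast request reaches all receivers after a fixed propagation delay. The function and parameter names are illustrative.

```python
import random


def requests_sent(num_receivers, max_wait, propagation_delay, rng):
    """Count receivers whose timer fires before the first request reaches them."""
    timers = [rng.uniform(0, max_wait) for _ in range(num_receivers)]
    first = min(timers)
    # A receiver stays silent only if the earliest request arrives in time.
    return sum(1 for t in timers if t <= first + propagation_delay)


rng = random.Random(42)
for delay in (0.0, 0.001, 0.01, 0.1):
    sent = requests_sent(1000, max_wait=1.0, propagation_delay=delay, rng=rng)
    print(f"propagation delay {delay}: {sent} of 1000 receivers send a request")
```

The longer the propagation delay relative to the waiting window, the more duplicate requests slip through, so the window has to grow with the group's diameter.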

Hierarchical feedback suppression for scalable reliable multicast
Messages flow from root toward leaves
Acks and retransmit requests flow toward root from coordinators
Each group can use any reliable small-group multicast scheme

Atomic multicast
Often, in a distributed system, reliable multicast is a step toward atomic multicast
Atomic multicast is atomicity applied to communications:
  – Either all members of a process group receive a message, OR
  – No members receive it
Often requires some form of order agreement as well

How atomic multicast helps
1. Assume we have atomic multicast, among a group of processes, each of which owns a replica of a database
2. One replica goes down
3. Database activity continues
4. The process comes back up
5. Atomic multicast allows us to figure out exactly which transactions have to be re-played (see pp )

More concepts
Group view
View change
Virtually synchronous
  – Each message is received by all non-faulty processes, or
  – If sender crashes during multicast, message could be ignored by all processes

Virtual synchrony picture
Basic idea: in virtual synchrony, a multicast cannot cross a view-change

Receipt vs Delivery
Remember totally-ordered multicast …

What about multicast message order?
Two aspects:
  – Relationship between sending order and delivery order
  – Agreement on delivery order
Send/delivery ordering relationships
  – Unordered
  – FIFO-ordered
  – Causally-ordered
If receivers agree on delivery order, it’s called totally-ordered multicast

Unordered
Process P1   Process P2    Process P3
sends m1     delivers m1   delivers m2
sends m2     delivers m2   delivers m1

FIFO-ordered
Agreement on: m1 before m2; m3 before m4
Process P1   Process P2    Process P3    Process P4
sends m1     delivers m1   delivers m3   sends m3
sends m2     delivers m3   delivers m1   sends m4
             delivers m2   delivers m2
             delivers m4   delivers m4
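FIFO ordering is commonly implemented with per-sender sequence numbers and a hold-back queue, which also makes the receipt versus delivery distinction concrete. The sketch below is one such implementation under that assumption; `FifoReceiver` and its fields are hypothetical names.

```python
class FifoReceiver:
    """Delivers each sender's messages in send order, whatever the arrival order."""

    def __init__(self):
        self.next_seq = {}   # sender -> next sequence number to deliver
        self.held = {}       # sender -> {seq: payload} received but not delivered
        self.delivered = []  # payloads in delivery order

    def receive(self, sender: str, seq: int, payload: str) -> None:
        expected = self.next_seq.get(sender, 0)
        if seq < expected:
            return  # duplicate of something already delivered
        queue = self.held.setdefault(sender, {})
        queue[seq] = payload
        # Deliver as many consecutive messages from this sender as possible.
        while expected in queue:
            self.delivered.append(queue.pop(expected))
            expected += 1
        self.next_seq[sender] = expected


p3 = FifoReceiver()
p3.receive("P1", 1, "m2")  # arrives early, held back
p3.receive("P4", 0, "m3")
p3.receive("P1", 0, "m1")  # releases m1, then the held m2
p3.receive("P4", 1, "m4")
print(p3.delivered)
```

The receiver delivers m3, m1, m2, m4, the same order as P3 in the table, even though m2 was received before m1. Nothing here makes P2 and P3 agree with each other; that requires total ordering.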

Six types of virtually synchronous reliable multicast
(Table axes: relationship between sending order and delivery order; agreement on delivery order)

Implementing virtual synchrony
Don’t deliver a message until it’s been received everywhere, but “everywhere” can change
(a) 7’s crash is detected by 4, which sends a view-change message
(b) Processes forward unstable messages, followed by flush
(c) When a process has received a flush from all processes in the new view, it installs the new view

Recovery from error
Two main types:
  – Backward recovery to a checkpoint (assumed to be error-free)
  – Forward recovery (infer a correct state from available data)

More about checkpoints
They are expensive
Usually combined with a message log
Message logs are cleared at checkpoints
Recovering a crashed process:
  – Restart it
  – Restore its state to the most recent checkpoint
  – Replay the message log
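The restore-and-replay procedure can be sketched with an in-memory process whose state is a running total. This is a toy under stated assumptions: the `Process` class is hypothetical, message handling is deterministic, and the checkpoint and log live in ordinary memory where a real system would keep them in stable storage.

```python
import copy


class Process:
    """Toy process with checkpointing and a message log."""

    def __init__(self):
        self.state = {"total": 0}
        self.checkpoint = copy.deepcopy(self.state)
        self.log = []

    def handle(self, amount: int, record: bool = True) -> None:
        if record:
            self.log.append(amount)  # log the message before applying it
        self.state["total"] += amount

    def take_checkpoint(self) -> None:
        self.checkpoint = copy.deepcopy(self.state)
        self.log.clear()             # the log is cleared at each checkpoint

    def recover(self) -> None:
        self.state = copy.deepcopy(self.checkpoint)
        for amount in self.log:      # replay messages received since then
            self.handle(amount, record=False)


p = Process()
p.handle(5)
p.take_checkpoint()
p.handle(7)
p.handle(1)
p.state = None  # simulate a crash that loses volatile state
p.recover()
print(p.state)
```

Replay only reproduces the pre-crash state if handling a message is deterministic, which is one reason the log and checkpoint are used together.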

Recovery line == most recent distributed snapshot

Domino effect

Sparing
Not really fault tolerance
But it can be cheaper, and provide fast restoration time after a failure
Types of spares
  – Cold
  – Hot
  – Warm
The spare may or may not also have regular responsibilities in the system

Switchover
Repair is accomplished by switching processing away from a failed server to a spare

Questions on switchover
Has the failed system really failed?
Is the spare operational?
Can the spare handle the load?
  – May need a way to block medium to low priority work during switchovers
How will the spare get access to the failed server’s data?
What client session data will be preserved, and how?

More switchover questions
What about configuration files?
What about network addressing?
What about switching back after the failed server has been repaired?
  – Partial shutdown of the spare
  – Updating directories to redirect part of the load
  – Making up for lost medium-to-low priority work