Fault Tolerance in Distributed Systems

A system’s ability to tolerate failure-1 Reliability: the likelihood that a system will remain operational for the duration of a mission –a requirement might be stated as the probability of failure during a 10 hr mission being at most some specified value –Very high reliability is most important in critical applications - space shuttle, industrial control - in which failure could mean loss of life!

A system’s ability to tolerate failure-2 Availability: expresses the fraction of time a system is operational –an availability of 0.999999 means the system is not operational for at most one hour in a million hours –a system with high availability may still fail -> its recovery time and failure frequency must be small enough to achieve the desired availability –high availability is important in airline reservations, telephone switching, etc., in which every minute of downtime translates into revenue loss…

Importance of Design A good fault-tolerant system design requires a careful study of failures, causes of failures, and system responses to failures. Such a study should be carried out in detail before the design begins and must remain part of the design process.

Requirement Specification-1 Planning to avoid failures is most important. A designer must analyze the environment and determine the failures that must be tolerated to achieve the desired level of reliability. To optimize fault tolerance, it is important to estimate the actual failure rate for each possible failure.

Requirement Specification-2 Failure types: –Some are more probable than others –Some are transient, others permanent –Some occur in hardware, others in software

Design-1 Design of systems that tolerate faults that occur while the system is in use Basic Principle - Redundancy –Spatial - redundant hardware –Informational - redundant data structures –Temporal - redundant computation

Design-2 Redundancy costs money and time –One must optimize the design by trading off the amount of redundancy used against the desired level of fault tolerance –Temporal redundancy usually requires re-computation, which results in slower recovery from failure –Spatial redundancy gives faster recovery but increases hardware cost, space, and power requirements

Design-3 Commonly Used Techniques for Redundancy –Modular redundancy Uses multiple, identical replicas of hardware modules and a voter mechanism The outputs from the replicas are compared, and the correct output is determined by majority vote Can tolerate most hardware faults that affect only a minority of the hardware modules
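
As a minimal sketch (not from the slides), the voting step can be expressed in a few lines of Python; the function name tmr_vote and the replica outputs are illustrative only, with plain values standing in for the outputs of hardware modules:

```python
from collections import Counter

def tmr_vote(replica_outputs):
    """Majority voter for triple (or N-) modular redundancy.
    Returns the majority output, or raises if no majority exists."""
    value, count = Counter(replica_outputs).most_common(1)[0]
    if count > len(replica_outputs) // 2:
        return value
    raise RuntimeError("no majority: too many replicas disagree")

# Example: three replicas, one of which produces a faulty output.
outputs = [42, 42, 17]
print(tmr_vote(outputs))   # -> 42, the single fault is masked
```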

Design-4 N-Version Programming –Write multiple versions of a software module –Outputs from these versions are collected and the correct output is determined via a voting mechanism –Each version is written by a different team, in the hope that they will not contain the same bugs –Can tolerate software bugs that affect a minority of versions –Cannot tolerate correlated faults, where the reason for failure is common to two (or more) modules, e.g. two modules share a single power supply whose failure causes both to fail
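
A hedged sketch of the same voting idea applied to software: three hypothetical, independently written versions of one specification (returning the largest element of a list), combined by a majority vote. In real N-version programming each version would come from a separate team.

```python
from collections import Counter

def version_a(xs):                 # version 1: library routine
    return max(xs)

def version_b(xs):                 # version 2: explicit scan
    best = xs[0]
    for x in xs[1:]:
        if x > best:
            best = x
    return best

def version_c(xs):                 # version 3: sort and take the last element
    return sorted(xs)[-1]

def n_version(xs):
    """Run every version and return the majority answer."""
    results = [v(xs) for v in (version_a, version_b, version_c)]
    value, count = Counter(results).most_common(1)[0]
    if count > len(results) // 2:
        return value
    raise RuntimeError("versions disagree with no majority (correlated bug?)")

print(n_version([3, 9, 4]))        # -> 9, even if one version were buggy
```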

Error-Control Coding-1 Replication is expensive For certain applications - RAM, buses - error-correcting codes can be used –Hamming or other codes Checkpoints and rollbacks A checkpoint is a copy of an application’s state saved in some storage that is immune to the failures under consideration A rollback restarts the execution from a previously saved checkpoint
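
For illustration, a Hamming(7,4) single-error-correcting code fits in a few lines of Python. This is a textbook sketch, not production ECC, and the function names are illustrative:

```python
def hamming74_encode(d):
    """Encode 4 data bits d = [d1, d2, d3, d4] into a 7-bit codeword
    laid out as [p1, p2, d1, p3, d2, d3, d4] (positions 1..7)."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4          # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4          # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4          # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_decode(c):
    """Correct a single flipped bit (if any) and return the 4 data bits."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]     # parity over positions 1, 3, 5, 7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]     # parity over positions 2, 3, 6, 7
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]     # parity over positions 4, 5, 6, 7
    syndrome = s1 + 2 * s2 + 4 * s3    # 1-based position of the bad bit
    if syndrome:
        c[syndrome - 1] ^= 1           # correct the single-bit error
    return [c[2], c[4], c[5], c[6]]

word = hamming74_encode([1, 0, 1, 1])
word[5] ^= 1                           # simulate a single bit flip in memory
assert hamming74_decode(word) == [1, 0, 1, 1]
```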

Error-Control Coding-2 When a failure occurs, the application’s state is rolled back to the previous checkpoint and restarted from there Can be used to recover from transient as well as permanent hardware failures Can be used for uniprocessor and distributed applications
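
A minimal checkpoint/rollback sketch in Python, assuming a file (app.ckpt, a hypothetical path) stands in for storage that is immune to the failures under consideration:

```python
import os
import pickle
import tempfile

CHECKPOINT = "app.ckpt"   # hypothetical path on failure-independent storage

def save_checkpoint(state):
    """Write the application state atomically (write-then-rename), so a
    crash during the save cannot destroy the previous checkpoint."""
    fd, tmp = tempfile.mkstemp(dir=".")
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, CHECKPOINT)

def rollback():
    """Restart from the most recently saved checkpoint."""
    with open(CHECKPOINT, "rb") as f:
        return pickle.load(f)

# Usage: checkpoint periodically, roll back after a detected failure.
state = {"step": 0, "total": 0}
for step in range(1, 6):
    state["step"], state["total"] = step, state["total"] + step
    save_checkpoint(state)
# ... a failure is detected here ...
state = rollback()          # execution resumes from the last checkpoint
```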

Recovery Blocks Uses multiple alternates to perform the same function One module is the primary, the others are secondaries When the primary completes execution, its outcome is checked by an acceptance test If the output is not acceptable, a secondary module executes, and so on, until either an acceptable output is obtained or the alternates are exhausted This method can tolerate software failures, because the alternates are usually implemented with different approaches (software algorithms)
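
A sketch of the recovery-block structure in Python; the primary, the alternate, and the acceptance test here are hypothetical stand-ins:

```python
def recovery_block(inputs, acceptance_test, primary, *alternates):
    """Try the primary, then each alternate, until one result passes the
    acceptance test; raise if every alternate is exhausted."""
    for module in (primary, *alternates):
        try:
            result = module(inputs)
        except Exception:
            continue                       # an exception counts as a failed try
        if acceptance_test(inputs, result):
            return result
    raise RuntimeError("all alternates exhausted")

# Hypothetical example: square root with a (deliberately broken) primary
# and a slower but simpler alternate based on bisection.
def fast_sqrt(x):          # primary: imagine an optimized routine with a bug
    return x / 3.14

def slow_sqrt(x):          # alternate: plain bisection
    lo, hi = 0.0, max(1.0, x)
    for _ in range(60):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if mid * mid < x else (lo, mid)
    return lo

accept = lambda x, r: abs(r * r - x) < 1e-6
print(recovery_block(2.0, accept, fast_sqrt, slow_sqrt))   # ~1.41421356
```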

Dependability Evaluation-1 Once a system has been designed, it must be evaluated to determine if it meets reliability and dependability objectives Two dependability evaluation approaches: –Use an analytical model: can help developers to determine a system’s possible states and the probabilities of transitions among them; such models can be difficult to analyze accurately –Inject faults: various types of faults can be injected to determine various dependability metrics

Dependability Evaluation-2 –In distributed systems, a transaction-based service can accept occasional failures followed by a lengthy recovery procedure –A real-time service (process control): may have inputs that are readings taken from sensors; may have outputs to actuators that are used to control a process directly or to activate alarms so that humans can intervene in the process; due to strict timing requirements, recovery must be achieved within a very small time limit, e.g. air traffic control, monitoring patients, controlling reactors

Dependability Evaluation-3 A Fault-Tolerant Service –For a service to perform correctly, both the effect on the server’s resources and the response sent to the client must be correct –Correct behavior must be specified –Failure semantics - the ways in which the service can fail - must be specified –A fault-tolerant service can detect a fault and then: fail predictably; mask the fault from its users; or operate in the presence of faults in the services on which it depends

Fault models –Omission failure: a server omits to respond to a request, or fails to receive a request –Response failure: Value failure - returns a wrong value; State-transition failure - has a wrong effect on resources –Timing failure: any response that is not available to a client within a specified real-time interval –Server crash failure: a server repeatedly fails to respond to requests until it is restarted; Amnesia-crash - the server restarts in its initial state, having forgotten its state at the time of the crash, i.e. it loses the values of its data items; Pause-crash - the server restarts in the state it had before the crash; Halting-crash - the server never restarts

Fault Example –UDP service: has omission failures because it occasionally loses messages; does not have value failures because it does not transmit corrupt messages –UDP uses checksums to mask the value failures of the underlying IP by converting them into omission failures
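
A sketch of how a checksum converts a value failure into an omission failure: the corrupted packet is simply dropped rather than delivered. Note that real UDP uses the 16-bit Internet checksum; CRC32 is used here only to keep the illustration short.

```python
import zlib

def send(payload: bytes) -> bytes:
    """Append a CRC32 checksum to the payload before transmission."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")

def receive(packet: bytes):
    """Verify the checksum; deliver the payload, or drop the packet.
    A corrupted packet (value failure) becomes a dropped packet
    (omission failure), which higher layers already know how to handle."""
    payload, checksum = packet[:-4], packet[-4:]
    if zlib.crc32(payload).to_bytes(4, "big") != checksum:
        return None                      # drop: masked as an omission
    return payload

pkt = bytearray(send(b"hello"))
pkt[1] ^= 0xFF                           # simulate corruption on the wire
print(receive(bytes(pkt)))               # -> None (dropped, not delivered)
```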

Handling Failures

Process Resilience Protection against process failures can be achieved by replicating processes and organizing the replicas into groups. Flat group: all the members are equal Hierarchical group: there is a coordinator

Flat Groups versus Hierarchical Groups Figure 8-3. (a) Communication in a flat group. (b) Communication in a simple hierarchical group.

Failure masking In the simplest fault case, k+1 processes provide k-fault tolerance. If the faulty processes continue to run, giving faulty responses but not teaming up to give the same wrong response, then a k-fault-tolerant system needs 2k+1 processes, so that correct responses still form a majority. This assumes all messages arrive at all nodes at the same time.
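
To make the bound concrete, a tiny helper (illustrative only):

```python
def group_size(k, faulty_still_reply=False):
    """Minimum group size needed to mask k faulty processes:
    k+1 if faulty processes simply stop (any one survivor suffices),
    2k+1 if they keep replying with wrong, uncoordinated answers,
    so that the k+1 correct replies still outvote them."""
    return 2 * k + 1 if faulty_still_reply else k + 1

print(group_size(2))                           # -> 3
print(group_size(2, faulty_still_reply=True))  # -> 5
```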

Agreement Issues in Faulty Systems Possible assumptions about the underlying system: 1. Synchronous versus asynchronous systems. 2. Communication delay is bounded or not. 3. Message delivery is ordered or not. 4. Message transmission is done through unicasting or multicasting.

Circumstances under which distributed agreement can be reached

Byzantine Agreement Problem: Lamport et al. Assume a reliable, synchronous, ordered, unicast-based message system. There are N processes, k of which may be faulty or even malicious. A faulty process may send different values to different processes.

Byzantine failure In fault-tolerant distributed computing, a Byzantine failure is an arbitrary fault that occurs during the execution of an algorithm in a distributed system. When a Byzantine failure has occurred, the system may respond in any unpredictable way. These arbitrary failures may be loosely categorized as follows: –a failure to take another step in the algorithm, also known as a crash failure; –a failure to correctly execute a step of the algorithm; and –arbitrary execution of a step other than the one indicated by the algorithm.

Byzantine failure Byzantine refers to the Byzantine Generals' Problem, an agreement problem in which generals of the Byzantine Empire's army must decide unanimously whether or not to attack some enemy army. The problem is complicated by the geographic separation of the generals, who must communicate by sending messengers to each other, and by the presence of traitors amongst the generals. These traitors can act arbitrarily in order to force good generals into a wrong decision: trick some generals into attacking; force a decision that is not consistent with the generals' desires, e.g. forcing an attack when no general wished to attack; or confuse some generals so much that they never make up their minds. If the traitors succeed in any of these goals, any resulting attack is doomed, as only a concerted effort can result in victory. –Lamport et al. proved that a solution exists when fewer than one third of the generals are traitors.

Example: Byzantine Agreement problem for four processes Three non-faulty processes and one faulty process. (a) Each process sends its value to the others. Process 3 lies, giving different values x, y, z to different processes.

Byzantine Agreement problem-2 (b) The vectors that each process assembles based on (a). (c) The vectors that each process receives in step 3. Process 3 sends different vectors to different processes.

Byzantine Agreement problem-3 The three non-faulty processes can agree on the values received from processes 1, 2, and 4, so the value of malicious process 3 is irrelevant… If N=3 and k=1, i.e. only two non-faulty processes, this does not work! Lamport proved that the system survives k faulty processes only with at least 2k+1 non-faulty processes, which makes a total of 3k+1 processes…

BAP with two correct and one faulty process Figure 8-6. The same as Fig. 8-5, except now with two correct processes and one faulty process.
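
The exchange in the two figures can be simulated with a short Python sketch. This is not Lamport's full recursive algorithm; it is a toy model for k = 1 in which the faulty process reports arbitrary values and each correct process takes a per-slot majority over the vectors it receives in the second round. All names are illustrative.

```python
import random

def majority(candidates):
    """Return the strict-majority value of a list, or None if there is none."""
    best = max(set(candidates), key=candidates.count)
    return best if candidates.count(best) > len(candidates) // 2 else None

def byzantine_example(n, faulty, values):
    """Two-round exchange from the textbook example (k = 1 faulty process)."""
    procs = range(n)
    lie = lambda: random.randint(100, 999)   # arbitrary wrong values

    # Round 1: every process sends its value to every other process.
    # vector[i][j] = what process i believes process j's value is.
    vector = [[values[j] if j not in faulty else lie() for j in procs]
              for i in procs]
    for i in procs:
        vector[i][i] = values[i]             # a process knows its own value

    # Round 2: every process forwards the vector it assembled.
    # reports[i][s] = the vector process s claims to have, as seen by i.
    reports = [[([lie() for _ in procs] if s in faulty else list(vector[s]))
                for s in procs] for i in procs]

    # Decision: for each slot j, process i takes the majority over the
    # vectors it received from the *other* processes in round 2.
    for i in procs:
        if i in faulty:
            continue
        decided = [majority([reports[i][s][j] for s in procs if s != i])
                   for j in procs]
        print(f"n={n}: process {i} decides {decided}  (None = unknown)")

random.seed(1)
byzantine_example(4, faulty={2}, values=[1, 2, None, 4])  # agreement reached
byzantine_example(3, faulty={2}, values=[1, 2, None])     # agreement fails
```

With n = 4 every correct process decides [1, 2, None, 4], i.e. the correct values plus "unknown" for the traitor; with n = 3 the single remaining report per slot is never a majority, so nothing can be decided.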

Client-Server Communication TCP (Transmission Control Protocol), as a connection-oriented end-to-end communication protocol, is a reliable protocol. But it does not mask connection crash failures, which must be handled by setting up a new connection…

RPC Semantics in the Presence of Failures Five different classes of failures can occur in RPC systems: 1. The client is unable to locate the server. 2. The request message from the client to the server is lost. 3. The server crashes after receiving a request. This can be handled with semantics such as: –at-least-once –at-most-once –the preferred semantics, exactly-once, is virtually impossible to implement 4. The reply message from the server to the client is lost. 5. The client crashes after sending a request. Each case needs to be resolved properly to mask the failures.
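
A sketch of at-most-once handling for a retransmitted request (e.g. when the reply was lost and the client retries): the server caches replies by request id and never re-executes a duplicate. The class and function names are illustrative, and a real implementation would also have to bound the reply cache and deal with the server itself crashing.

```python
import uuid

class AtMostOnceServer:
    """At-most-once RPC semantics: remember replies by request id, so a
    retransmitted request is answered from the cache instead of being
    executed a second time."""
    def __init__(self):
        self.replies = {}                  # request_id -> cached reply

    def handle(self, request_id, operation, *args):
        if request_id in self.replies:     # duplicate: do NOT re-execute
            return self.replies[request_id]
        result = operation(*args)          # execute at most one time
        self.replies[request_id] = result
        return result

# Client side: retry after a lost reply, reusing the same request id.
server = AtMostOnceServer()
rid = str(uuid.uuid4())
print(server.handle(rid, lambda a, b: a + b, 2, 3))   # executed -> 5
print(server.handle(rid, lambda a, b: a + b, 2, 3))   # retry, cached -> 5
```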

Reliable Multicasting Use negative acknowledgements, as in Scalable Reliable Multicast (SRM) Non-hierarchical and hierarchical solutions are possible Atomic multicasting requires all the replicas to reach agreement on the success or failure of the multicast. This is known as distributed commit: two-phase or three-phase commit protocols can be used.
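
A minimal two-phase commit sketch (illustrative names; real participants would write their votes and the decision to stable storage before replying, and the coordinator's log is what crash recovery reads back):

```python
class Participant:
    """A resource manager that can vote in two-phase commit."""
    def __init__(self, name, can_commit=True):
        self.name, self.can_commit, self.state = name, can_commit, "init"
    def prepare(self):            # phase 1: make the work durable, then vote
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit
    def commit(self):             # phase 2, on a COMMIT decision
        self.state = "committed"
    def abort(self):              # phase 2, on an ABORT decision
        self.state = "aborted"

def two_phase_commit(coordinator_log, participants):
    """Phase 1: collect votes. Phase 2: log the decision, then tell everyone."""
    if all(p.prepare() for p in participants):
        coordinator_log.append("COMMIT")   # decision point
        for p in participants:
            p.commit()
        return True
    coordinator_log.append("ABORT")
    for p in participants:
        p.abort()
    return False

log = []
replicas = [Participant("A"), Participant("B"), Participant("C", can_commit=False)]
print(two_phase_commit(log, replicas), log)   # -> False ['ABORT']
```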

Recovery from a failure When and how should the state of a distributed system be recorded, and recovered to, by means of checkpointing and logging? To be able to recover to a stable state, it is important that the state is safely stored.

Stable Storage –Stable storage is an example of group masking at the disk block level –Designed to ensure permanent data is recoverable after a system failure during a disk write operation, or after a disk block has been damaged

Stable Storage –Provided by a careful storage service –The unit of storage is the stable block –Each stable block is represented by two careful blocks that hold the contents of the stable block in duplicate –The write operation writes one careful block, ensuring it is correct before writing the second block –Careful blocks are disk blocks stored with a checksum to mask value failures; the blocks are located on different disk drives with independent failure modes –Value failures are converted to omission failures –The read operation reads one of the pair of careful blocks; if an omission failure occurs, it reads the other, thus masking the omission failures of the careful storage service
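
A sketch of the careful/stable block scheme, with ordinary files standing in for disk blocks on independent drives and CRC32 standing in for the block checksum; class names and paths are illustrative:

```python
import os
import zlib

class CarefulBlock:
    """One careful block: a block stored together with a checksum, so a
    damaged block is detected (value failure -> omission failure)."""
    def __init__(self, path):
        self.path = path
    def write(self, data: bytes):
        with open(self.path, "wb") as f:
            f.write(zlib.crc32(data).to_bytes(4, "big") + data)
            f.flush()
            os.fsync(f.fileno())
    def read(self):
        try:
            with open(self.path, "rb") as f:
                blob = f.read()
        except OSError:
            return None
        if len(blob) < 4 or zlib.crc32(blob[4:]).to_bytes(4, "big") != blob[:4]:
            return None                    # bad block: report an omission
        return blob[4:]

class StableBlock:
    """A stable block is a pair of careful blocks (ideally on independent
    disks). Writes update and verify one copy before touching the second,
    so at least one good copy survives a crash in mid-write."""
    def __init__(self, path_a, path_b):
        self.a, self.b = CarefulBlock(path_a), CarefulBlock(path_b)
    def write(self, data: bytes):
        self.a.write(data)
        assert self.a.read() == data       # verify copy 1 before copy 2
        self.b.write(data)
    def read(self):
        return self.a.read() or self.b.read()   # fall back to the other copy

blk = StableBlock("block.a", "block.b")
blk.write(b"account balance = 100")
print(blk.read())
```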

Stable Storage Stable Storage - Crash Recovery –When a server is restarted after a crash, the pair of careful blocks (representing the stable block) will be in one of the following states: both good and the same; both good and different; one good, one bad –What does the recovery procedure do in each of the above cases?