PROCESS RESILIENCE By Ravalika Pola. outline: Process Resilience  Design Issues  Failure Masking and Replication  Agreement in Faulty Systems  Failure.

Presentation transcript:

PROCESS RESILIENCE By Ravalika Pola

Outline: Process Resilience
– Design Issues
– Failure Masking and Replication
– Agreement in Faulty Systems
– Failure Detection

Process Resilience
Problem:
– How is fault tolerance achieved in a distributed system, especially against process failures?
Solution:
– Replicate processes and organize them into groups.
– Treat a collection of processes as a single abstraction.
– All members of a group receive the same message; if one process fails, the others can take over for it.
– Process groups are dynamic, and a process can be a member of several groups.
– Hence we need mechanisms to manage the groups.

Design Issues: Flat Group vs. Hierarchical Group

Group Membership
Processes can join and leave groups, and groups can be created and destroyed, so we need to keep track of membership.
Group Server
– Straightforward and easy to implement
– Major disadvantage: single point of failure
Distributed Approach
– A process broadcasts a message to join or leave the group
– After a fault, how do we tell a member that is really dead from one that is merely very slow?
– Joining and leaving must be synchronized with message delivery: when a process joins, all previous messages are sent to the new member
– Another issue: how is a new group created?

Failure Masking & Replication
– Replicate processes and organize them into groups.
– Replace a single vulnerable process with a whole fault-tolerant group.
– A system is said to be K fault tolerant if it can survive faults in K components and still meet its specifications.
How much replication is needed to support K fault tolerance: K+1 or 2K+1?
1) If K processes simply stop, the answer from the one remaining process can be used, so K+1 processes suffice.
2) If processes exhibit Byzantine failures, 2K+1 processes are needed, so that the K+1 correct replies outvote the K faulty ones.
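The two cases above can be sketched in a few lines of Python. This is only an illustration, not part of the original slides: the names replicas_needed and vote are hypothetical, and the voter assumes that all correct replicas return the same reply.

```python
from collections import Counter


def replicas_needed(k: int, byzantine: bool = False) -> int:
    """Smallest group size that stays K fault tolerant.

    Crash (fail-stop) faults: K+1 members, one survivor is enough.
    Byzantine faults: 2K+1 members, so K+1 correct replies form a majority.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    return 2 * k + 1 if byzantine else k + 1


def vote(replies):
    """Return the reply given by a strict majority of replicas, or None."""
    if not replies:
        return None
    value, count = Counter(replies).most_common(1)[0]
    return value if count > len(replies) // 2 else None
```

With K = 1 and Byzantine faults, three replicas are needed: vote([7, 7, 9]) accepts 7 because two of the three replies agree, while vote([1, 2, 3]) returns None because no reply has a majority.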

Agreement in Faulty Systems
Why do we need agreement?
Goal of agreement:
– Have all the non-faulty processes reach consensus on some issue
– Establish that consensus within a finite number of steps
A process group typically needs to reach agreement when:
– Electing a coordinator
– Deciding whether or not to commit a transaction
– Dividing tasks among workers
– Synchronizing

When the communication and the processes
– are perfect, reaching an agreement is often straightforward
– are not perfect, reaching an agreement becomes a problem
Two problem cases:
– Good processes, but unreliable communication. Example: two-army problem
– Good communication, but faulty processes. Example: Byzantine generals problem

Two-army problem
Reaching agreement between two correct processes over an unreliable channel is classically stated as the two-army problem, and it is insoluble. The agreed-upon action will never take place, because the last sender can never be certain that the last confirmation went through (the communication is unreliable).

Byzantine generals problem
The Byzantine generals problem for 3 loyal generals and 1 traitor.
a) The generals announce their troop strengths (in units of 1 thousand soldiers).
b) The vectors that each general assembles based on (a).
c) The vectors that each general receives in step 3.

Cont.
Step 4: Each process examines the ith element of each of the newly received vectors.
– If any value has a majority, that value is put into the result vector.
– If no value has a majority, the corresponding element of the result vector is marked UNKNOWN.
Result vector: (1, 2, UNKNOWN, 4)
THE ALGORITHM REACHES AN AGREEMENT
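The four steps can be simulated with a short Python sketch. It is a sketch under stated assumptions rather than the algorithm as published: the function names are hypothetical, generals are numbered from 0, and the traitor is modelled as sending a different made-up value in every message, which is only one of many possible traitor behaviours.

```python
from collections import Counter

UNKNOWN = "UNKNOWN"


def majority(values):
    """Return the value held by a strict majority of `values`, else UNKNOWN."""
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) / 2 else UNKNOWN


def byzantine_agreement(strengths, traitors):
    """Simulate steps 1 to 4 and return {loyal general: result vector}."""
    ids = range(len(strengths))

    def lie(tag):
        # Assumption: a traitor never sends the same value twice.
        return f"lie:{tag}"

    # Steps 1 and 2: heard[j][i] is what general j heard from general i.
    heard = [
        [strengths[i] if i == j or i not in traitors else lie(f"{i}->{j}")
         for i in ids]
        for j in ids
    ]

    results = {}
    for j in ids:
        if j in traitors:
            continue
        # Step 3: general j receives one vector from every other general.
        received = [
            [lie(f"{s}->{j}[{i}]") for i in ids] if s in traitors else heard[s]
            for s in ids if s != j
        ]
        # Step 4: element-wise majority over the received vectors.
        results[j] = [majority([vec[i] for vec in received]) for i in ids]
    return results
```

Calling byzantine_agreement([1, 2, 3, 4], {2}) makes general 3 of the slide (index 2) the traitor; every loyal general then computes [1, 2, "UNKNOWN", 4]. With only three generals, byzantine_agreement([1, 2, 3], {2}), each loyal general receives just two vectors, so no element reaches a majority.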

Cont.
The same as on the previous slide, except now with 2 loyal generals and 1 traitor.

Cont.
Step 4: Each process examines the ith element of each of the newly received vectors.
– If any value has a majority, that value is put into the result vector.
– If no value has a majority, the corresponding element of the result vector is marked UNKNOWN.
Result vector: (UNKNOWN, UNKNOWN, UNKNOWN)
THE ALGORITHM FAILS TO REACH AN AGREEMENT

Concluding Remarks on the Byzantine Agreement Problem
– In their paper, Lamport et al. (1982) proved that in a system with k faulty processes, agreement can be achieved only if 2k+1 correctly functioning processes are present, for a total of 3k+1.
– That is, agreement is possible only if more than two-thirds of the processes are working properly.
– Fischer et al. (1985) proved that in a distributed system in which messages cannot be guaranteed to be delivered within a known, finite time, no agreement is possible even if only one process is faulty.
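The 3k+1 bound can be checked with a little arithmetic. The helper names below are hypothetical and the sketch only restates the bound; it says nothing about the asynchronous case covered by the Fischer et al. result.

```python
def agreement_possible(n: int, k: int) -> bool:
    """True if n processes can tolerate k Byzantine members (n >= 3k + 1)."""
    return k >= 0 and n >= 3 * k + 1


def max_tolerable_traitors(n: int) -> int:
    """Largest k such that n >= 3k + 1."""
    return max((n - 1) // 3, 0)
```

For the two examples in the slides: agreement_possible(4, 1) is True (3 loyal generals and 1 traitor), while agreement_possible(3, 1) is False (2 loyal generals and 1 traitor).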

Process Failure Detection
Before we can properly mask failures, we generally need to detect them.
For a group of processes, non-faulty members should be able to decide who is still a member and who is not.
Two policies:
– Processes actively send "are you alive?" messages to each other (i.e., they ping each other)
– Processes passively wait until messages come in from different processes
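Both policies usually come down to a timeout: a member is suspected once nothing has been heard from it for too long. The following Python sketch illustrates the idea; the class name HeartbeatDetector, its methods, and the timeout value are assumptions made for this example, and the injectable clock exists only to make the sketch testable.

```python
import time


class HeartbeatDetector:
    """Suspect any member that has been silent for longer than `timeout`."""

    def __init__(self, members, timeout, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock
        now = clock()
        # Every member starts out as "just heard from".
        self.last_seen = {member: now for member in members}

    def heard_from(self, member):
        """Record a ping reply or any other message from `member`."""
        self.last_seen[member] = self.clock()

    def suspected(self):
        """Return the set of members whose silence exceeds the timeout."""
        now = self.clock()
        return {m for m, seen in self.last_seen.items()
                if now - seen > self.timeout}
```

A suspected member is not necessarily dead; it may only be slow or unreachable, so a timeout-based detector can produce false suspicions.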

Failure Considerations
Several issues need to be taken into account when designing a failure detection subsystem:
– Failure detection can be done as a side effect of regularly exchanging information with neighbors (e.g., gossip-based information dissemination)
– A failure detection subsystem should ideally be able to distinguish network failures from node failures
– When a member failure is detected, how should the other non-faulty processes be informed?

THANK YOU