8.2. Process resilience Shreyas Karandikar.

Slides:

Advertisements

Similar presentations

Fault Tolerance. Basic System Concept Basic Definitions Failure: deviation of a system from behaviour described in its specification. Error: part of.

Advertisements

Impossibility of Distributed Consensus with One Faulty Process

Chapter 8 Fault Tolerance

CS425 /CSE424/ECE428 – Distributed Systems – Fall 2011 Material derived from slides by I. Gupta, M. Harandi, J. Hou, S. Mitra, K. Nahrstedt, N. Vaidya.

Byzantine Generals. Outline r Byzantine generals problem.

Synchronization Chapter clock synchronization * 5.2 logical clocks * 5.3 global state * 5.4 election algorithm * 5.5 mutual exclusion * 5.6 distributed.

The Byzantine Generals Problem Boon Thau Loo CS294-4.

Byzantine Generals Problem: Solution using signed messages.

Distributed Systems Fall 2010 Replication Fall 20105DV0203 Outline Group communication Fault-tolerant services –Passive and active replication Highly.

Group Communications Group communication: one source process sending a message to a group of processes: Destination is a group rather than a single process.

Computer Science Lecture 17, page 1 CS677: Distributed OS Last Class: Fault Tolerance Basic concepts and failure models Failure masking using redundancy.

Distributed Systems CS Fault Tolerance- Part I Lecture 13, Oct 17, 2011 Majd F. Sakr, Mohammad Hammoud andVinay Kolar 1.

Distributed Systems Fall 2009 Replication Fall 20095DV0203 Outline Group communication Fault-tolerant services –Passive and active replication Highly.

Last Class: Weak Consistency

Distributed Algorithms – 2g1513 Lecture 9 – by Ali Ghodsi Fault-Tolerance in Distributed Systems.

Reliable Communication in the Presence of Failures Based on the paper by: Kenneth Birman and Thomas A. Joseph Cesar Talledo COEN 317 Fall 05.

CS 425/ECE 428/CSE424 Distributed Systems (Fall 2009) Lecture 9 Consensus I Section Klara Nahrstedt.

CSE 60641: Operating Systems Implementing Fault-Tolerant Services Using the State Machine Approach: a tutorial Fred B. Schneider, ACM Computing Surveys.

V1.7Fault Tolerance1. V1.7Fault Tolerance2 A characteristic of Distributed Systems is that they are tolerant of partial failures within the distributed.

Chapter 11 Fault Tolerance. Topics Introduction Process Resilience Reliable Group Communication Recovery.

UNIVERSITY of WISCONSIN-MADISON Computer Sciences Department

Spring 2003CS 4611 Replication Outline Failure Models Mirroring Quorums.

Impossibility of Distributed Consensus with One Faulty Process By, Michael J.Fischer Nancy A. Lynch Michael S.Paterson.

1 Fault tolerance in distributed systems n Motivation n robust and stabilizing algorithms n failure models n robust algorithms u decision problems u impossibility.

Fault Tolerance Chapter 7. Goal An important goal in distributed systems design is to construct the system in such a way that it can automatically recover.

PROCESS RESILIENCE By Ravalika Pola. outline: Process Resilience  Design Issues  Failure Masking and Replication  Agreement in Faulty Systems  Failure.

Fault Tolerance (2). Topics r Reliable Group Communication.

Fundamentals of Fault-Tolerant Distributed Computing In Asynchronous Environments Paper by Felix C. Gartner Graeme Coakley COEN 317 November 23, 2003.

1 AGREEMENT PROTOCOLS. 2 Introduction Processes/Sites in distributed systems often compete as well as cooperate to achieve a common goal. Mutual Trust/agreement.

Chapter 8 Fault Tolerance. Outline Introductions –Concepts –Failure models –Redundancy Process resilience –Groups and failure masking –Distributed agreement.

CSE 486/586, Spring 2014 CSE 486/586 Distributed Systems Byzantine Fault Tolerance Steve Ko Computer Sciences and Engineering University at Buffalo.

CSE 486/586 Distributed Systems Byzantine Fault Tolerance

The Consensus Problem in Fault Tolerant Computing

reaching agreement in the presence of faults

Synchronizing Processes

COMP28112 – Lecture 15 Byzantine fault tolerance: dealing with arbitrary failures The Byzantine Generals’ problem (Byzantine Agreement) 26-Jan-18 COMP28112.

Fault Tolerance Chap 7.

CSE 486/586 Distributed Systems Leader Election

When Is Agreement Possible

DC7: More Coordination Chapter 11 and 14.2

COMP28112 – Lecture 14 Byzantine fault tolerance: dealing with arbitrary failures The Byzantine Generals’ problem (Byzantine Agreement) 13-Oct-18 COMP28112.

Dependability Dependability is the ability to avoid service failures that are more frequent or severe than desired. It is an important goal of distributed.

Byzantine Fault Tolerance

Outline Distributed Mutual Exclusion Distributed Deadlock Detection

CSE 486/586 Distributed Systems Byzantine Fault Tolerance

COMP28112 – Lecture 13 Byzantine fault tolerance: dealing with arbitrary failures The Byzantine Generals’ problem (Byzantine Agreement) 19-Nov-18 COMP28112.

Alternating Bit Protocol

Distributed Consensus

Agreement Protocols CS60002: Distributed Systems

Outline Announcements Fault Tolerance.

Distributed Systems, Consensus and Replicated State Machines

Distributed Consensus

Jacob Gardner & Chuan Guo

Distributed Systems CS

Active replication for fault tolerance

EEC 688/788 Secure and Dependable Computing

CSE 486/586 Distributed Systems Leader Election

Consensus in Synchronous Systems: Byzantine Generals Problem

The Byzantine Generals Problem

Distributed Systems CS

EEC 688/788 Secure and Dependable Computing

COMP28112 – Lecture 13 Byzantine fault tolerance: dealing with arbitrary failures The Byzantine Generals’ problem (Byzantine Agreement) 22-Feb-19 COMP28112.

EEC 688/788 Secure and Dependable Computing

EEC 688/788 Secure and Dependable Computing

COMP28112 – Lecture 13 Byzantine fault tolerance: dealing with arbitrary failures The Byzantine Generals’ problem (Byzantine Agreement) 24-Apr-19 COMP28112.

EEC 688/788 Secure and Dependable Computing

EEC 688/788 Secure and Dependable Computing

Implementing Consistency -- Paxos

CSE 486/586 Distributed Systems Byzantine Fault Tolerance

CSE 486/586 Distributed Systems Leader Election

Presentation transcript:

8.2. Process resilience Shreyas Karandikar

Overview of Process Resilience Design Issues Failure Masking and Replication Agreement in Faulty Systems Failure Detection

Process Resilience Answer: Question: How to achieve fault tolerance in distributed system and especially against Process Failures? Answer: Replicating processes into groups. Consider collections of process as a single abstraction All members of the group receive the same message, if one process fails, the others can take over for it. Process groups are dynamic and a Process can be member of several groups. Hence we need some mechanisms to manage the groups.

Design Issues Flat Group vs. Hierarchical Group

Flat groups: They are symmetric. In flat group all the process are equal and the decisions made in the flat group are collective. No single point-of-failure is seen but the decision making is complicated as consensus is required. Hierarchical Groups These groups are asymmetric. One of the process is elected to be the coordinator, which selects another process(a worker ) to perform the operation. Single point failure is observed in this case. Decisions are easily and quickly made by the coordinator without having to get consensus.

Group Members Hierarchical Groups Flat Groups A process can join or leave the group depending on the work load. A group can be broken and created. So it is necessary to keep the track of these groups. Hierarchical Groups Less complex, simple and easy to execute. Most important downfall of this group is Single point of failure. Flat Groups In order to join or leave a group each process has to broadcast a message. Fault is occurred because of a dead member or a very slow member. Synchronization is required while joining and leaving the group. When a new process joins the group all the messages which were exchanged previously are sent to the newly joint group. But how are the groups formed?

The system call setside() is used to create a new session containing a single new process group. The groups are identified by a positive integer, process group ID is used to identify the groups. The system call setpgid() is used to set the process group ID.

Replicating and masking failure Process are replicated and organized into groups. A single vulnerable process in the whole fault tolerant Group is replaced. A system is said to be K fault tolerant if it can survive faults in K components and still meet its specifications. A system is K fault tolerant if it can have K faulty components and still meet its specification. How much replication is needed to support K Fault Tolerance? K+1 or 2K+1 ? Case: If K processes stop, then the answer from the other one can be used. K+1 If it meets Byzantine failure, the number is  2K+1

Agreement in Faulty Systems Goal of Agreement Making all the non-faulty processes reach consensus on some issue Establishing that consensus within a finite number of steps. A process group typically requires reaching an agreement in: Dividing tasks among workers Electing a coordinator Deciding whether or not to commit a transaction Synchronization

When the communication and processes: are good, agreement is reached in a simple and a straightforward way. Whereas when it is bad, reaching to an agreement is troublesome. Problems of two cases Good process, but unreliable communication Example: Two-army problem Good communication, but crashed process Example: Byzantine generals problem

Two-army problem This problem is classically stated as the two-army problem, and is insoluble. The agreed upon action will never take place, because the last sender will never be certain that the last confirmation went through.(Due to unreliable communication)

Byzantine generals problem The Byzantine generals problem for 3 loyal generals and1 traitor. The generals announce their troop strengths (in units of 1 thousand soldiers). The vectors that each general assembles based on (a) The vectors that each general receives in step 3.

THE ALGORITHM REACHES AN AGREEMENT Cont.. Step 4: Each process examines the ith element of each of the newly received vectors If any value has a majority, that value is put into the result vector If no value has a majority, the corresponding element of the result vector is marked UNKNOWN THE ALGORITHM REACHES AN AGREEMENT Result Vector: (1, 2, UNKNOWN, 4)

Cont.. The same as in previous slide, except now with 2 loyal generals and one traitor.

Each process examines the ith element of each of the newly received Cont.. Step 4: Each process examines the ith element of each of the newly received vectors If any value has a majority, that value is put into the result vector If no value has a majority, the corresponding element of the result vector is marked UNKNOWN Result Vector: (UNKOWN, UNKNOWN, UNKNOWN)

Concluding Remarks on the Byzantine Agreement Problem In their paper, Lamport et al. (1982) proved that in a system with k faulty processes, an agreement can be achieved only if 2k+1 correctly functioning processes are present, for a total of 3k+1. i.e., An agreement is possible only if more than two-thirds of the processes are working properly. Fisher et al. (1985) proved that in a distributed system in which ordering of messages cannot be guaranteed to be delivered within a known, finite time, no agreement is possible even if only one process is faulty.

Process Failure Detection Before we properly mask failures, we generally need to detect them For a group of processes, non-faulty members should be able to decide who is still a member and who is not Two policies: Processes actively send “are you alive?” messages to each other (i.e., pinging each other) Processes passively wait until messages come in from different processes

Failure Considerations There are various issues that need to be taken into account when designing a failure detection subsystem: Failure detection can be done as a side-effect of regularly exchanging information with neighbors. Network failure and node failures must be distinguished by the fault detecting system.

Thank you!