On Scalable and Efficient Distributed Failure Detectors. Presented by: Sindhu Karthikeyan.




2 INTRODUCTION: Failure detectors are a central component of fault-tolerant distributed systems built on process groups running over unreliable, asynchronous networks, e.g., group membership protocols, supercomputers, and computer clusters. The ability of the failure detector to detect process failures completely and efficiently, in the presence of unreliable messaging as well as arbitrary process crashes and recoveries, can have a major impact on the performance of these systems. "Completeness" is the guarantee that the failure of a group member is eventually detected by every non-faulty group member. "Efficiency" means that failures are detected quickly, as well as accurately (i.e., without too many mistakes).

3 The first work to address these properties of failure detectors was by Chandra and Toueg, who showed that it is impossible for a failure detector algorithm to deterministically achieve both completeness and accuracy over an asynchronous unreliable network. This result has led to a flurry of theoretical research on other ways of classifying failure detectors but, more importantly, has served as a guide to designers of failure detector algorithms for real systems. For example, most distributed applications have opted to circumvent the impossibility result by relying on failure detector algorithms that guarantee completeness deterministically while achieving efficiency only probabilistically. This paper deals with complete failure detectors that satisfy two application-defined efficiency constraints:
1) (quickness) detection of any group member failure by some non-faulty member within a time bound, and
2) (accuracy) a probability, within this time bound, that no other non-faulty member detects a given non-faulty member as having failed.

4 The first requirement (quickness) merits more discussion. Consider a cluster that relies on a few central computers to aggregate failure detection information from across the system. In such systems, efficient detection of a failure depends on the time at which the failure is first detected by a non-faulty member. Even in the absence of a central server, notification of a failure is typically communicated to the entire group via a broadcast by the first member that detected it. Thus, although achieving completeness is important, efficient detection of a failure is more often related to the time to the first detection of the failure by another non-faulty member. Section 2 discusses why the traditional and popular heartbeating failure detection schemes do not achieve the optimal scalability limits.

5 Finally, Section 5 presents a randomized distributed failure detector that can be configured to meet the application-defined constraints of completeness, accuracy, and expected speed of detection. With reasonable assumptions on network unreliability (member and message failure rates of up to 15%), the worst-case network load imposed by this protocol has a sub-optimality factor that is much lower than that of traditional distributed heartbeat schemes. This sub-optimality factor does not depend on group size (in large groups), but only on the application-specified efficiency constraints and the network unreliability probabilities. Furthermore, the average load imposed per member is independent of the group size.

6 2. PREVIOUS WORK In most real-life distributed systems, the failure detection service is implemented via variants of the "heartbeat mechanism", which has been popular because it guarantees the completeness property. However, all existing heartbeat approaches have shortcomings. Centralized heartbeat schemes create hot-spots that prevent them from scaling. Distributed heartbeat schemes offer different levels of accuracy and scalability depending on the exact heartbeat dissemination mechanism used, but this paper shows that they are inherently not as efficient and scalable as claimed. This work differs from all of this prior work: it quantifies the performance of a failure detector protocol as the network load the protocol needs to impose in order to satisfy the application-defined constraints of completeness and quick, accurate detection, and it presents an efficient and scalable distributed failure detector. The new failure detector incurs a constant expected load per process, thus avoiding the hot-spot problem of centralized heartbeating schemes.

7 3. MODEL We consider a large group of n (>> 1) members. This set of potential group members is fixed a priori. Group members have unique identifiers. Each group member maintains a list, called a view, containing the identities of all other group members (faulty or otherwise). Members may suffer crash (non-Byzantine) failures and subsequently recover. Unlike other papers on failure detectors, which consider a member faulty if it is perturbed and sleeps for longer than some pre-specified duration, our notion of failure considers a member faulty if and only if it has really crashed. Perturbations at members that might lead to message losses are accounted for in the message loss rate pml. Whenever a member recovers from a failure, it does so into a new incarnation that is distinguishable from all its earlier incarnations. At each member, an integer in non-volatile storage that is incremented every time the member recovers suffices to serve as the member's incarnation number. The members in our group model thus have crash-recovery semantics, with incarnation numbers distinguishing different failures and recoveries.
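The incarnation-number mechanism described above can be sketched as follows. This is a minimal illustration, not the paper's code; the function name and storage path are hypothetical.

```python
import os

def next_incarnation(path):
    """Read, increment, and persist a member's incarnation number.

    Sketch of the model's suggestion: an integer in non-volatile
    storage, incremented on every recovery, distinguishes a member's
    incarnations. `path` is a hypothetical storage location.
    """
    try:
        with open(path) as f:
            inc = int(f.read())
    except FileNotFoundError:
        inc = 0          # first boot: no earlier incarnation recorded
    inc += 1
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write(str(inc))
    os.replace(tmp, path)  # atomic rename: survives a crash mid-write
    return inc
```

A member calls this once on every recovery and stamps the returned number on all its messages.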

8 We characterize the member failure probability by a parameter pf: the probability that a random group member is faulty at a random time. Member crashes are assumed to be independent across members. A message sent out on the network fails to be delivered at its recipient (due to network congestion, buffer overflow at the sender or receiver caused by member perturbations, etc.) with probability pml ∈ (0, 1). The worst-case message propagation delay (from sender to receiver through the network) for any delivered message is assumed to be so small compared to the application-specified detection time (typically O(several seconds)) that, for all practical purposes, we can assume each message is either delivered immediately at the recipient with probability (1 - pml), or never reaches the recipient. In the rest of the paper we use the shorthands qf = (1 - pf) and qml = (1 - pml).

9 4. SCALABLE AND EFFICIENT FAILURE DETECTORS The first formal characterization of failure detectors laid down the following properties for distributed failure detectors in process groups:
{Strong/Weak} Completeness: the crash-failure of any group member is detected by {all/some} non-faulty members;
Strong Accuracy: no non-faulty group member is declared as failed by any other non-faulty group member.
Subsequent work on designing efficient failure detectors has attempted to trade off the Completeness and Accuracy properties in several ways. However, the completeness properties required by most distributed applications have led to the popular use of failure detectors that always guarantee Strong Completeness, even if only eventually. This of course means that such failure detectors cannot always guarantee Strong Accuracy, but only with a probability less than 1.

10 For example, all-to-all (distributed) heartbeating schemes have been popular because they guarantee Strong Completeness (since a faulty member will stop sending heartbeats), while providing varying degrees of accuracy. The requirements imposed by an application (or its designer) on a failure detector protocol can be formally specified and parameterized as follows:
1. COMPLETENESS: satisfy eventual Strong Completeness for member failures.
2. EFFICIENCY:
(a) SPEED: every member failure is detected by some non-faulty group member within T time units after its occurrence (T >> worst-case message round-trip time).
(b) ACCURACY: at any time instant, for every non-faulty member Mi not yet detected as failed, the probability that no other non-faulty group member will (mistakenly) detect Mi as faulty within the next T time units is at least (1 - PM(T)).

11 To measure the scalability of a failure detector algorithm, we use the worst-case network load it imposes, denoted L. Since several messages may be transmitted simultaneously, even from one group member, we define:
Definition 1. The worst-case network load L of a failure detector protocol is the maximum number of messages transmitted by any run of the protocol within any time interval of length T, divided by T.
We also require that the failure detector impose a uniform expected send and receive load at each member due to this traffic. The goal of a near-optimal failure detector algorithm is thus to satisfy the above requirements (COMPLETENESS, EFFICIENCY) while guaranteeing:
Scale: the worst-case network load L imposed by the algorithm is close to the optimal possible, with equal expected load per member; i.e., L ≈ L*, where L* is the optimal worst-case network load.

12 THEOREM 1. Any distributed failure detector algorithm for a group of size n (>> 1) that deterministically satisfies the COMPLETENESS, SPEED (T), and ACCURACY (PM(T)) requirements (with PM(T) << pml) imposes a minimal worst-case network load (messages per time unit, as defined above) of:
L* = n . [log(PM(T)) / log(pml)] . (1/T)
L* is thus the optimal worst-case network load required to satisfy the COMPLETENESS, SPEED, and ACCURACY requirements.
PROOF. We prove the first part of the theorem by showing that a non-faulty group member may need to transmit at least log(PM(T)) / log(pml) messages in a time interval of length T. Consider a group member Mi at a random point in time t. Let Mi not yet be detected as failed by any other group member, and let it stay non-faulty until at least time t + T. Let m be the maximum number of messages sent by Mi in the time interval [t, t + T], in any possible run of the failure detector protocol starting from time t. Now, at time t, the event "all messages sent by Mi in the time interval [t, t + T] are lost" happens with probability at least pml^m. Occurrence of this event makes it indistinguishable to the rest of the non-faulty group members (i.e., members other than Mi) whether Mi is faulty or not. By the SPEED requirement, this event would then imply that Mi is detected as failed by some non-faulty group member between t and t + T.
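The bound in Theorem 1 is easy to evaluate numerically. The sketch below (function names are ours, not the paper's) computes the per-member message count m and the optimal load L* from the model parameters:

```python
import math

def min_messages_per_period(pm_t, pml):
    """Lower bound m on messages each member must send per T time units.

    From the Theorem 1 argument: m messages sent in [t, t+T] are all
    lost with probability pml^m, which forces a mistaken detection;
    ACCURACY demands pml^m <= PM(T), hence m >= log(PM(T)) / log(pml).
    """
    return math.log(pm_t) / math.log(pml)

def optimal_load(n, pm_t, pml, T):
    """Optimal worst-case network load L* = n * m * (1/T) (messages/time)."""
    return n * min_messages_per_period(pm_t, pml) / T
```

For instance, with pml = 10% and a mistake-probability target PM(T) = 0.001, each member needs at least 3 messages per detection-time window.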

13 Thus, the probability that at time t a given non-faulty member Mi, not yet detected as faulty, is detected as failed by some other non-faulty group member within the next T time units is at least pml^m. By the ACCURACY requirement, we have pml^m <= PM(T), which implies that
m >= log(PM(T)) / log(pml)
A failure detector that satisfies the COMPLETENESS, SPEED, and ACCURACY requirements and meets the L* bound works as follows. It uses a highly available, non-faulty server as a group leader. Every other group member sends log(PM(T)) / log(pml) "I am alive" messages to this server every T time units. The server declares a member as failed if it does not receive an "I am alive" message from that member for T time units.
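The server side of the centralized scheme described above reduces to a timeout check per member. A minimal sketch (names and data shapes are ours):

```python
def failed_members(last_heard, now, T):
    """Centralized leader check: declare failed every member from which
    no 'I am alive' message has arrived in the last T time units.

    last_heard maps member id -> time the last heartbeat was received.
    """
    return sorted(m for m, t in last_heard.items() if now - t > T)
```

This meets the L* bound but concentrates all receive traffic at one server, which is exactly the hot-spot problem noted in Section 2.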

14 Definition 2. The sub-optimality factor of a failure detector algorithm that imposes a worst-case network load L, while satisfying the COMPLETENESS and EFFICIENCY requirements, is defined as L / L*.
In the traditional distributed heartbeating algorithm, every group member periodically transmits a "heartbeat" message to all the other group members. A member Mj is declared as failed by a non-faulty member Mi when Mi does not receive heartbeats from Mj for some consecutive heartbeat periods. Distributed heartbeating guarantees COMPLETENESS, but it cannot guarantee ACCURACY and SCALABILITY, because these depend entirely on the mechanism used to disseminate heartbeats. The worst-case number of messages transmitted by each member per unit time is O(n), and the worst-case total network load L is O(n^2). The sub-optimality factor (i.e., L / L*) thus varies as O(n), for any values of pml, pf and PM(T).
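The O(n) sub-optimality of all-to-all heartbeating follows directly from dividing its load by L*; a sketch (function names are ours) makes the growth explicit:

```python
import math

def heartbeat_load(n, T):
    """All-to-all heartbeating: each of n members sends (n - 1)
    heartbeats per interval T, so L = n(n-1)/T, i.e. O(n^2)."""
    return n * (n - 1) / T

def heartbeat_suboptimality(n, pm_t, pml):
    """L/L* for all-to-all heartbeating. The interval T cancels,
    leaving (n - 1) divided by the optimal per-member message count
    m = log(PM(T))/log(pml) -- hence O(n) growth in group size."""
    m = math.log(pm_t) / math.log(pml)
    return (n - 1) / m
```

Doubling the group size roughly doubles L/L*, independently of pml, pf and PM(T).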

15 Distributed heartbeating schemes do not meet the optimality bound of Theorem 1 because they inherently attempt to communicate a failure notification to all group members. Other schemes, such as centralized heartbeating (as discussed in the proof of Theorem 1), can be configured to meet the optimal load L*, but have problems of their own, such as creating hot-spots.

16 5. A RANDOMIZED DISTRIBUTED FAILURE DETECTOR PROTOCOL In this section, we relax the SPEED condition to detect a failure within an expected (rather than exact, as before) time bound of T time units after the failure. We then present a randomized distributed failure detector algorithm that guarantees COMPLETENESS with probability 1, detection of any member failure within an expected time T from the failure, and an ACCURACY probability of (1 - PM(T)). The protocol imposes an equal expected load per group member, and a worst-case (and average-case) network load L that differs from the optimal L* of Theorem 1 by a sub-optimality factor (i.e., L/L*) that is independent of the group size n (>> 1). This sub-optimality factor is much lower than those of the traditional distributed heartbeating schemes discussed in the previous section.
5.1 New Failure Detector Algorithm The failure detector algorithm uses two parameters: the protocol period T' (in time units) and an integer k. The algorithm is formally described in Figure 1. At each non-faulty member Mi, steps (1-3) are executed once every T' time units (which we call a protocol period), while steps (4-6) are executed whenever necessary. The data contained in each message is shown in parentheses after the message.

17 Integer pr; /* Local period number */
Every T' time units at Mi:
0. pr := pr + 1
1. Select random member Mj from view
   Send a ping(Mi, Mj, pr) message to Mj
   Wait for the worst-case message round-trip time for an ack(Mi, Mj, pr) message
2. If have not received an ack(Mi, Mj, pr) message yet
   Select k members randomly from view
   Send each of them a ping-req(Mi, Mj, pr) message
   Wait for an ack(Mi, Mj, pr) message until the end of period pr
3. If have not received an ack(Mi, Mj, pr) message yet
   Declare Mj as failed
Anytime at Mi:
4. On receipt of a ping-req(Mm, Mj, pr) message (Mj != Mi)
   Send a ping(Mi, Mj, Mm, pr) message to Mj
   On receipt of an ack(Mi, Mj, Mm, pr) message from Mj
   Send an ack(Mm, Mj, pr) message to Mm
Anytime at Mi:
5. On receipt of a ping(Mm, Mi, Ml, pr) message from member Mm
   Reply with an ack(Mm, Mi, Ml, pr) message to Mm
Anytime at Mi:
6. On receipt of a ping(Mm, Mi, pr) message from member Mm
   Reply with an ack(Mm, Mi, pr) message to Mm
Figure 1: Protocol steps at a group member Mi. Data in each message is shown in parentheses after the message. Each message also contains the current incarnation number of the sender.

18 Figure 2: Example protocol period at Mi. This shows all the possible messages that a protocol period may initiate. Some message contents are excluded for simplicity.

19 Figure 2 illustrates the protocol steps initiated by a member Mi during one protocol period of length T' time units. At the start of the protocol period, Mi selects a random member, in this case Mj, and sends a ping message to it. If Mi does not receive a replying ack from Mj within some time-out (determined by the message round-trip time, which is << T), it selects k members at random and sends each a ping-req message. Each of the non-faulty members among these k that receives the ping-req message subsequently pings Mj and forwards the ack received from Mj, if any, back to Mi. In the example of Figure 2, one of the k members manages to complete this cycle of events since Mj is up, and Mi does not suspect Mj as faulty at the end of this protocol period.
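The period walk-through above can be turned into a small Monte Carlo sketch (our own construction, under the model of Section 3): each message leg is dropped independently with probability pml, and each ping-req helper is itself faulty with probability pf. The estimated false-suspicion rate for a live target should then match the analytic term (1 - qml^2)(1 - qf.qml^4)^k used later in Section 5.2.

```python
import random

def period_suspects(target_alive, k, pml, pf, rng):
    """One protocol period from Mi's side (steps 1-3 of Figure 1):
    returns True iff Mi ends the period suspecting the target."""
    if not target_alive:
        return True                       # a crashed target never acks

    def delivered():                      # does one message leg survive?
        return rng.random() >= pml

    if delivered() and delivered():       # direct ping + ack
        return False
    for _ in range(k):                    # ping-req, ping, ack, forwarded ack
        helper_up = rng.random() >= pf
        if helper_up and all(delivered() for _ in range(4)):
            return False
    return True

rng = random.Random(1)
trials = 200_000
false_rate = sum(period_suspects(True, 3, 0.1, 0.1, rng)
                 for _ in range(trials)) / trials
# analytic prediction for k=3, pml=pf=0.1:
# (1 - 0.9**2) * (1 - 0.9 * 0.9**4)**3, roughly 0.013
```

A crashed target is always suspected (completeness within the period), while a live one is falsely suspected only when the direct probe and all k indirect probes fail.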

20 The effect of using the randomly selected subgroup is to distribute the decision on failure detection across a subgroup of (k + 1) members. It can be shown that the new protocol's properties are preserved even in the presence of some degree of variation of message delivery and loss probabilities across group members; simply sending k repeat ping messages would not satisfy this property. Our analysis in Section 5.2 shows that the cost (in terms of the sub-optimality factor of network load) of using a (k + 1)-sized subgroup is not too significant.
5.2 Analysis In this section we calculate, for the above protocol, the expected detection time of a member failure, as well as the probability of an inaccurate detection of a non-faulty member by at least one other non-faulty member.

21 For any group member Mj, faulty or otherwise,
Pr[at least one non-faulty member chooses to ping Mj (directly) in a time interval T'] = 1 - (1 - qf/n)^n ≈ 1 - e^(-qf)  (since n >> 1)
Thus, the expected time between a failure of member Mj and its detection by some non-faulty member is
E[T] = T' . 1/(1 - e^(-qf)) = T' . e^(qf) / (e^(qf) - 1)     (1)
Now, denote C(pf) = e^(qf) / (e^(qf) - 1). Let PM(T) be the probability of inaccurate failure detection of a member within the time T. A random group member Ml is non-faulty with probability qf, and the probability that such a member chooses to ping Mj within a time interval T is (1/n) . C(pf).
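Equation (1) is straightforward to evaluate; this helper (name ours) gives the expected detection time for a protocol period T' and member failure probability pf:

```python
import math

def expected_detection_time(T_prime, pf):
    """E[T] = T' * e^qf / (e^qf - 1), with qf = 1 - pf (equation (1))."""
    qf = 1.0 - pf
    return T_prime * math.exp(qf) / (math.exp(qf) - 1.0)
```

With pf = 0, for example, a failure is detected after about e/(e-1) ≈ 1.58 protocol periods in expectation; higher pf (fewer live members to do the pinging) stretches this out.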

22 Now, the probability that Mi receives no acks, direct or indirect, for a non-faulty Mj under the protocol of Section 5.1 is:
(1 - qml^2) . (1 - qf . qml^4)^k
Therefore,
PM(T) = 1 - [1 - (qf/n) . C(pf) . (1 - qml^2) . (1 - qf . qml^4)^k]^(n-1)
      ≈ 1 - e^(-qf . (1 - qml^2) . (1 - qf . qml^4)^k . C(pf))   (since n >> 1)
      ≈ qf . (1 - qml^2) . (1 - qf . qml^4)^k . C(pf)            (since PM(T) << 1)
So,
k = log[ PM(T) / (qf . (1 - qml^2) . C(pf)) ] / log(1 - qf . qml^4)     (2)
Thus, the new randomized failure detector protocol can be configured using equations (1) and (2) to satisfy the SPEED and ACCURACY requirements with parameters E[T] and PM(T). Moreover, given a member Mj that has failed (and stays failed), every other non-faulty member Mi will eventually choose to ping Mj in some protocol period and discover Mj as having failed. Hence,
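Equation (2) lets an application derive the parameter k from its accuracy target. A sketch (function name ours; in practice the result would be rounded up to an integer):

```python
import math

def required_k(pm_t, pf, pml):
    """k from equation (2): the number of ping-req targets needed so
    that the false-detection probability stays at most PM(T)."""
    qf, qml = 1.0 - pf, 1.0 - pml
    c = math.exp(qf) / (math.exp(qf) - 1.0)          # C(pf)
    k = (math.log(pm_t / (qf * (1.0 - qml**2) * c))
         / math.log(1.0 - qf * qml**4))
    return max(0.0, k)                               # ceil() in practice
```

With pf = pml = 10% and PM(T) = 0.001, for example, a little over 6 ping-req targets suffice, so k = 7 would be used.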

23 THEOREM 2. This randomized failure detector protocol: (a) satisfies eventual Strong Completeness, i.e., the COMPLETENESS requirement; (b) can be configured via equations (1) and (2) to meet the requirements of (expected) SPEED and ACCURACY; and (c) imposes a uniform expected send/receive load at all group members.
Proof: From the above discussion and equations (1) and (2).

24 Calculating L/L*: Finally, we upper-bound the worst-case and expected network loads (L and E[L], respectively) imposed by this failure detector protocol. The worst case occurs when, every T' time units, each member initiates steps (1-6) of the algorithm of Figure 1. Steps (1, 6) involve at most 2 messages, while steps (2-5) involve at most 4 messages per ping-req target member. Therefore, the worst-case network load imposed by this protocol (in messages per time unit) is:
L = n . [2 + 4 . k] . (1/T')
From Theorem 1 and equations (1) and (2):
L/L* = [2 + 4 . k] . C(pf) . log(pml) / log(PM(T))     (3)
with k as given by equation (2). L thus differs from the optimal L* by a factor that is independent of the group size n. Equation (3) can be written as a linear function of 1/(-log(PM(T))).
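Equation (3) can be checked directly: both n and T' cancel in the ratio, so the sub-optimality factor depends only on pf, pml and PM(T). A sketch (names ours):

```python
import math

def load_ratio(pm_t, pf, pml):
    """L/L* from equation (3): (2 + 4k) * C(pf) * log(pml)/log(PM(T)).
    Substituting L = n(2+4k)/T', L* = n(log PM / log pml)/E[T] and
    E[T] = T' C(pf) shows that n and T' cancel out of the ratio."""
    qf, qml = 1.0 - pf, 1.0 - pml
    c = math.exp(qf) / (math.exp(qf) - 1.0)
    k = (math.log(pm_t / (qf * (1.0 - qml**2) * c))
         / math.log(1.0 - qf * qml**4))
    return (2.0 + 4.0 * k) * c * math.log(pml) / math.log(pm_t)
```

The ratio rises with pf and pml (lossier, flakier groups pay more), but a group of 100 and a group of 100,000 pay exactly the same factor.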

25 [Slide content not transcribed: equations (4a)-(4c), expressing L/L* as a linear function of 1/(-log(PM(T))).]

26 THEOREM 3. The sub-optimality factor L/L* of the protocol of Figure 1 is independent of the group size n (>> 1).
Proof: From equations (4a) through (4c).

27 Now we calculate the average network load E[L] imposed by the new failure detector algorithm. Every T' time units, each of the (n . qf) non-faulty members, on average, executes steps 1-3 of the algorithm of Figure 1. Steps (1, 6) involve at most 2 messages; this happens if Mj sends an ack back to Mi. Steps (2-5) are executed only if Mj's ack does not reach Mi, which happens with probability (1 - qf . qml^2), and involve at most 4 messages per ping-req target. Therefore the average network load is:
E[L] < n . qf . [2 + (1 - qf . qml^2) . 4 . k] . (1/T')
So even E[L]/L* is independent of the group size n.
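The average-load bound combines with L* the same way as the worst case: n and T' cancel. This sketch (names ours) evaluates the resulting bound on E[L]/L*:

```python
import math

def expected_load_ratio(pm_t, pf, pml):
    """Bound on E[L]/L*: E[L] < n qf [2 + (1 - qf qml^2) 4k] / T',
    divided by L* = n (log PM / log pml) / (T' C(pf)); n, T' cancel."""
    qf, qml = 1.0 - pf, 1.0 - pml
    c = math.exp(qf) / (math.exp(qf) - 1.0)          # C(pf)
    k = (math.log(pm_t / (qf * (1.0 - qml**2) * c))
         / math.log(1.0 - qf * qml**4))
    return (qf * (2.0 + (1.0 - qf * qml**2) * 4.0 * k)
            * c * math.log(pml) / math.log(pm_t))
```

Because most periods end with a successful direct ping-ack exchange, the average case avoids most of the 4k ping-req traffic and lands well below the worst-case ratio.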

28 Figure 3(a) shows the variation of L/L* as in equation (3). The plot shows that the sub-optimality factor L/L* of the network load imposed by the new failure detector rises as pml and pf increase, or as PM(T) decreases, but is bounded by a function g(pf, pml). (From the plot, L/L* < 26 for pf and pml below 15%.) Figure 3(c) shows that E[L]/L* stays very low, below 8, for values of pf and pml up to 15%. As PM(T) decreases, the bound on E[L]/L* also decreases. (This curve reveals the advantage of using randomization in failure detection: unlike the traditional distributed heartbeating algorithms, here E[L] < 8 . L*.)
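The plotted bound can be spot-checked numerically from equation (3). This sketch scans pf and pml up to 15% at PM(T) = 10^-3 (our parameter choice) and confirms the stated L/L* < 26 in that region:

```python
import math

def load_ratio(pm_t, pf, pml):
    """L/L* from equation (3): (2 + 4k) * C(pf) * log(pml)/log(PM(T))."""
    qf, qml = 1.0 - pf, 1.0 - pml
    c = math.exp(qf) / (math.exp(qf) - 1.0)
    k = (math.log(pm_t / (qf * (1.0 - qml**2) * c))
         / math.log(1.0 - qf * qml**4))
    return (2.0 + 4.0 * k) * c * math.log(pml) / math.log(pm_t)

# grid scan: pf, pml from 1% to 15% in 1% steps
worst = max(load_ratio(1e-3, pf / 100.0, pml / 100.0)
            for pf in range(1, 16) for pml in range(1, 16))
```

The worst point of the grid stays comfortably under the figure's bound of 26, consistent with the plot.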

29 Concluding Comments We have quantified the optimal worst-case network load L* required by a complete failure detector algorithm in a process group, over a simple, probabilistically lossy network model, as a function of the application-specified constraints of:
detection time of a group member failure by some non-faulty group member (E[T]), and
accuracy (1 - PM(T)).
The randomized failure detection algorithm:
imposes an equal load on all group members;
can be configured to satisfy the application-specified requirements of completeness, accuracy, and (expected) speed of failure detection;
for very stringent accuracy requirements and pml and pf in the network of up to 15% each, has a sub-optimality factor (L/L*) that is not as large as that of the traditional distributed heartbeating protocols; and
has a sub-optimality factor (L/L*) that does not vary with group size, when groups are large.