Distributed Error- Confinement Shay Kutten (Technion) with Boaz Patt-Shamir (Tel Aviv U.) Yossi Azar (Tel Aviv U.)

Slides:

Advertisements

Similar presentations

1 Routing Protocols I. 2 Routing Recall: There are two parts to routing IP packets: 1. How to pass a packet from an input interface to the output interface.

Advertisements

CS 542: Topics in Distributed Systems Diganta Goswami.

Chapter 6 - Convergence in the Presence of Faults1-1 Chapter 6 Self-Stabilization Self-Stabilization Shlomi Dolev MIT Press, 2000 Shlomi Dolev, All Rights.

Chapter 7 - Local Stabilization1 Chapter 7: roadmap 7.1 Super stabilization 7.2 Self-Stabilizing Fault-Containing Algorithms 7.3 Error-Detection Codes.

COS 461 Fall 1997 Routing COS 461 Fall 1997 Typical Structure.

Teaser - Introduction to Distributed Computing

PROTOCOL VERIFICATION & PROTOCOL VALIDATION. Protocol Verification Communication Protocols should be checked for correctness, robustness and performance,

UNIT-IV Computer Network Network Layer. Network Layer Prepared by - ROHIT KOSHTA In the seven-layer OSI model of computer networking, the network layer.

EE 4272Spring, 2003 Chapter 10 Packet Switching Packet Switching Principles  Switching Techniques  Packet Size  Comparison of Circuit Switching & Packet.

CPSC 668Set 9: Fault Tolerant Consensus1 CPSC 668 Distributed Algorithms and Systems Fall 2006 Prof. Jennifer Welch.

3 -1 Chapter 3 The Greedy Method 3 -2 The greedy method Suppose that a problem can be solved by a sequence of decisions. The greedy method has that each.

CPSC 668Set 9: Fault Tolerant Consensus1 CPSC 668 Distributed Algorithms and Systems Spring 2008 Prof. Jennifer Welch.

LSRP: Local Stabilization in Shortest Path Routing Hongwei Zhang and Anish Arora Presented by Aviv Zohar.

EEC 693/793 Special Topics in Electrical Engineering Secure and Dependable Computing Lecture 16 Wenbing Zhao Department of Electrical and Computer Engineering.

CMPE 150- Introduction to Computer Networks 1 CMPE 150 Fall 2005 Lecture 22 Introduction to Computer Networks.

1 Fault-Tolerant Consensus. 2 Failures in Distributed Systems Link failure: A link fails and remains inactive; the network may get partitioned Crash:

CS294, YelickSelf Stabilizing, p1 CS Self-Stabilizing Systems

A General approach to MPLS Path Protection using Segments Ashish Gupta Ashish Gupta.

Network Layer Design Isues Store-and-Forward Packet Switching Services Provided to the Transport Layer The service should be independent of the router.

Distributed systems Module 2 -Distributed algorithms Teaching unit 1 – Basic techniques Ernesto Damiani University of Bozen Lesson 2 – Distributed Systems.

CPSC 668Set 11: Asynchronous Consensus1 CPSC 668 Distributed Algorithms and Systems Fall 2009 Prof. Jennifer Welch.

Page 1 Copyright © Alexander Allister Shvartsman CSE 6510 (461) Fall 2010 Selected Notes on Fault-Tolerance (12) Alexander A. Shvartsman Computer.

A General approach to MPLS Path Protection using Segments Ashish Gupta Ashish Gupta.

1 Shortest Path Calculations in Graphs Prof. S. M. Lee Department of Computer Science.

On Probabilistic Snap-Stabilization Karine Altisen Stéphane Devismes University of Grenoble.

© The McGraw-Hill Companies, Inc., Chapter 3 The Greedy Method.

1 Pertemuan 20 Teknik Routing Matakuliah: H0174/Jaringan Komputer Tahun: 2006 Versi: 1/0.

Link-state routing  each node knows network topology and cost of each link  quasi-centralized: each router periodically broadcasts costs of attached.

Securing Every Bit: Authenticated Broadcast in Wireless Networks Dan Alistarh, Seth Gilbert, Rachid Guerraoui, Zarko Milosevic, and Calvin Newport.

Multicast Routing in Mobile Ad Hoc Networks (MANETs)

On Probabilistic Snap-Stabilization Karine Altisen Stéphane Devismes University of Grenoble.

Distributed Algorithms – 2g1513 Lecture 9 – by Ali Ghodsi Fault-Tolerance in Distributed Systems.

CSCE 668 DISTRIBUTED ALGORITHMS AND SYSTEMS Fall 2011 Prof. Jennifer Welch CSCE 668 Set 11: Asynchronous Consensus 1.

CS4231 Parallel and Distributed Algorithms AY 2006/2007 Semester 2 Lecture 10 Instructor: Haifeng YU.

Review for Exam 2. Topics included Deadlock detection Resource and communication deadlock Graph algorithms: Routing, spanning tree, MST, leader election.

Ch11 Distributed Agreement. Outline Distributed Agreement Adversaries Byzantine Agreement Impossibility of Consensus Randomized Distributed Agreement.

Practical Byzantine Fault Tolerance

Data Communications and Networking Chapter 11 Routing in Switched Networks References: Book Chapters 12.1, 12.3 Data and Computer Communications, 8th edition.

Fault Tolerance Computer Programs that Can Fix Themselves! Prof’s Paul Sivilotti and Tim Long Dept. of Computer & Info. Science The Ohio State University.

1 Chapter 10 Synchronization Algorithms and Concurrent Programming Gadi Taubenfeld © 2014 Synchronization Algorithms and Concurrent Programming Synchronization.

TELE202 Lecture 6 Routing in WAN 1 Lecturer Dr Z. Huang Overview ¥Last Lecture »Packet switching in Wide Area Networks »Source: chapter 10 ¥This Lecture.

CS294, Yelick Consensus revisited, p1 CS Consensus Revisited

Sliding window protocol The sender continues the send action without receiving the acknowledgements of at most w messages (w > 0), w is called the window.

UNIVERSITY of WISCONSIN-MADISON Computer Sciences Department

1 Fault tolerance in distributed systems n Motivation n robust and stabilizing algorithms n failure models n robust algorithms u decision problems u impossibility.

Spring 2000CS 4611 Routing Outline Algorithms Scalability.

Fault tolerance and related issues in distributed computing Shmuel Zaks GSSI - Feb

DISTRIBUTED ALGORITHMS Spring 2014 Prof. Jennifer Welch Set 9: Fault Tolerant Consensus 1.

Fault tolerance and related issues in distributed computing Shmuel Zaks GSSI - Feb

1 Fault-Tolerant Consensus. 2 Communication Model Complete graph Synchronous, network.

Spring Routing: Part I Section 4.2 Outline Algorithms Scalability.

PROCESS RESILIENCE By Ravalika Pola. outline: Process Resilience  Design Issues  Failure Masking and Replication  Agreement in Faulty Systems  Failure.

Distance Vector Routing

Distributed Error- Confinement Shay Kutten (Technion) with Yossi Azar (Tel Aviv U.) Boaz Patt-Shamir (Tel Aviv U.)

Distributed Systems Lecture 9 Leader election 1. Previous lecture Middleware RPC and RMI – Marshalling 2.

Day 13 Intro to MANs and WANs. MANs Cover a larger distance than LANs –Typically multiple buildings, office park Usually in the shape of a ring –Typically.

Fundamentals of Fault-Tolerant Distributed Computing In Asynchronous Environments Paper by Felix C. Gartner Graeme Coakley COEN 317 November 23, 2003.

The consensus problem in distributed systems

CSCE 668 DISTRIBUTED ALGORITHMS AND SYSTEMS

8.2. Process resilience Shreyas Karandikar.

Intra-Domain Routing Jacob Strauss September 14, 2006.

Outline Distributed Mutual Exclusion Distributed Deadlock Detection

Alternating Bit Protocol

Distributed Consensus

Agreement Protocols CS60002: Distributed Systems

Fault-Tolerant State Machine Replication

CSCE 668 DISTRIBUTED ALGORITHMS AND SYSTEMS

CS 6290 Many-core & Interconnect

Distributed Error- Confinement

Presentation transcript:

Distributed Error- Confinement Shay Kutten (Technion) with Boaz Patt-Shamir (Tel Aviv U.) Yossi Azar (Tel Aviv U.)

Talk Overview (1) What is error confinement? (2) The new “agility” measure for fault tolerance (3) The new “core- bootstrapping” idea for algorithm. (4) Optimization question and answer for “core” construction.

S A B C Motivation: “error propagation” (example) 7 Message from S to A: 4 My distance to C via S : 7+4=11 Internet routing: Node A compute shortest path to C based on messages from S. Traffic to C (1) Assume no fault: distance 7 to C C

Motivation: “error propagation” (example) 7 Message from S to A: 42 Traffic to C (2) with fault (at B): distance 7 to C State corrupting fault (adversary modifies data memory) distance 0 to C My distance to C via S : 7+4=11 S A B C C

State corrupting fault (self stabilization): Not malicious! Just a one time change of memory content State corrupting fault (adversary modifies data memory) distance 0 to C S A B C

Motivation: “error propagation” (example) 7 Message from S to A: 4 2 Traffic to C (2) With fault (at B): distance 7 to C distance 0 to C fault My distance to C via S : 7+4=11 S A B C C

Motivation: “error propagation” (example) 7 Message from S to A: 4 My distance to C via B : 2+0=2 2 Traffic to C distance 7 to C distance 0 to C fault (3) B’s fault propagated to A S A B C C

Motivation: “error propagation” (example) 7 Message from S to A: 4 My distance to C via B : 2+0=2 2 (4) Traffic to C is sent the wrong way as a result of the fault propagation distance 7 to C distance 0 to C fault B’s fault propagated to A S A B C C C

I have distance 0 to everybody fault C C This is, actually, how the Internet (than Called “ARPANET”) in 1980 D D D S crashed S A B C

A B distance 0 to C fault My distance to C via S : 7+4=11 C S “Error confinement”: non faulty node A outputs only correct output (or no output at all) Output (to routing:) I do not believe you! Sounds impossible?

Error Confinement (Formally )  : problem specification, P: protocol. P solves  with error confinement if for any execution of P with behavior  (possibly containing a state corrupting fault), there exists a behavior  ’  & for all non-faulty nodes v:  ’ v =  v –(“stabilization” deals also with faulty nodes) –(behavior- ignoring time)

Talk Overview (1) What is error confinement? (2) The new “agility” measure for fault tolerance. (3) The core- bootstrapping idea idea for algorithm (4) Optimization question and answer for “core” construction.

S t 0 A B C D time t t 1 2 Introducing a new measure of fault resilience: The resilience of a protocol is smaller at first Environment (e.g. user) 0 Input is given to S at time t

t 0 S A B C D time t t 1 2 The resilience of a protocol is smaller at first (cont.) Environment (e.g. user) gives Input is to S at time t 0 t f If adversary changes the state of S at time t f shortly after the input

t 0 S A B C D time t 1 t 2 The resilience of a protocol is smaller at first (cont.) Environment (e.g. user) gives Input to S at time t 0 t f If adversary changes the state of S at time t f shortly after the input then the input is lost forever

S A BD S A B C D time t f The resilience of a protocol grows with time t 0 t 1 t 2 However, a fault, even in S, can be tolerated if it waits until after S distributed the input value input C

S A B C D S A B C D time t f t f The resilience of a protocol grows with time (cont.) t 0 t 1 t 2 However, a fault, even in S, can be tolerated if it waits until after S distributed the input value distribution input

S A B C D S A B C D time tftf tftf The resilience of a protocol grows with time t 0 t 1 t 2 A fault even in S can be tolerated if it “waits” until after S distributed the input value distribution input

A B C D S A B C D time t f t f The resilience of a protocol grows with time t 0 t 1 t 2 A fault even in S can be tolerated if it “waits” until after S distributed the input value input distribution S

S A B C D S A B C D time t f t f The resilience of a protocol grows with time input To destroy the replicated value the adversary needs to hit more nodes at > > tftf t1t1 t0t0 t0t0 t1t1 t0t0

t 1 S S A B C D S A B C D S A B C D time t 2 t 3 The resilience continues to grows with time If no faults occurred by some later, then the input is t 3 replicated even further

S t 1 A B C D S A B C D S A B C D time t t 2 3 The resilience continues to grows with time The later the faults, the more faults can be tolerated tftf

t 1 S S A B C D BD S A B C D Time time t t 2 3 Space The later the faults, the more faults can be tolerated if the protocol is designed to be robust C S A Cone

t 1 S S A B C D BD S A B C D time t t 2 3 “Narrow” cone a LESS fault tolerant algorithm S A C Slower replication less nodes offer help

t 1 Replication to more nodes faster S S A B C D BD S A B C D time t t 2 3 A “Wider” cone a more fault tolerant algorithm C S A

So, a recovery of corrupted values is theoretically possible, for an adversary that is constrained according to a space-time-cone, but what is the algorithm that does the recovery? S time

Constraining faults: Agility c -constrained environment: environment generating faults t f time units after the input, ( c  1), only in: with agility c: Broadcast algorithm that guarantees error confinement against c -constrained environments. minority of · |Ball s (c·t f )| nodes. S Ball s c·tfc·tf algorithm V V

S S C D D time Algorithm’s “agility” measures the rate the constraint on the adversary can be lifted C S Agility: S

Talk Overview (1) What is error confinement? (2) The new “agility” measure for fault tolerance. (3) The new “core- bootstrapping” idea for algorithm. (4) Optimization question and answer for “core” construction.

The message resides at some nodes we term “core”

A node can join the core when it “made sure” it heard the votes of all core nodes

A node can join the core when it “made sure” it heard the votes of all core nodes

A node can join the core when it “made sure” it heard the votes of all core nodes

A node can join the core when it “made sure” it heard the votes of all core nodes

and even the fault can be corrected

Let us view again the join of one node

If core is such that adversary’s constraint allows hit of only a minority of the core… Then the message passes to the new node correctly

If core is such that adversary’s constraint allows hit of only a minority of the core… Then the message passes to the new node correctly Disclaimer: any connection to Actual historical rivalry is coincidental

If core is such that adversary’s constraint allows hit of only a minority of the core… Then the message passes to the new node correctly Disclaimer: any connection to Actual historical rivalry is coincidental

A B distance 0 to C fault My distance to C via S : 7+4=11 C S D “Error confinement”: non faulty node A outputs Only correct output (or no output at all) Output (to routing:) I do not believe you!

and even the fault can be corrected

When the core grow, the algorithm can withstand more faults.

Talk Overview (1) What is error confinement? (2) The new “agility” measure for fault tolerance. (3) The new “core- bootstrapping” idea for algorithm. (4) Optimization question and answer for “core” construction.

S A B C D E F G H Dilemma- should we add a node to core ASAP?

S A B C D E F G H Advantage- enlarges the core now.

S A B C D E F G H Dilemma- should we add a node to core ASAP? Disadvantage: slows future core growth

S A B C D E F G H Dilemma- should we add a node to core ASAP? Disadvantage: slows future core growth

S A B C D E F G H Dilemma- should we add a node to core ASAP? Disadvantage: slows future core growth

Stages of core growth (example: bad greedy policy) Greedy core growth  O(D 2 ) time to reach diameter D  Agility = D / D 2 = D

Core (T i ) Core(T i-1 ) S R i-1 RiRi U V feasibility: Core at time T i as a function of Core(T i-1 ) Optimize agility subject to

Agility at time T i : Radius of Core at time T i Divided by time (T i : i’th time Core grows) Core radius time 1 1 R1R1 R2R2 T1T1 T2T2 T3T3 Agility

Agility at time T i : Radius of Core at time T i Divided by time (T i : i’th time Core grows) Core radius We calculated optimal T’s, R’s. time 1 1 R1R1 R2R2 T1T1 T2T2 T3T3 Agility

Agility at time T i : Radius of Core at time T i Divided by time (T i : i-th time Core grows) Core radius We calculated optimal T’s, R’s. Constant Agility. time 1 1 R1R1 R2R2 T1T1 T2T2 T3T3 Agility

Agility at time T i : Radius of Core at time T i Divided by time (T i : i-th time Core grows) Core radius Optimal number of core increases (logarithmic!). We calculated optimal T’s, R’s. Constant Agility. time 1 1 R1R1 R2R2 T1T1 T2T2 T3T3 Agility

Additional results An error confined protocol to compute distances from a node S ( Bellman-Ford’s algorithm is self stabilizing but is not error-confined). An error confined protocol for broadcast with correct source. Lower bound (mandatory slow down): Consider an error-confined algorithm for BROADCAST, even with correct source. Then no correct node v outputs before time 2·dist(source,v), even in the absence of faults. Generalizations, practical considerations.

Some related notions Self stabilization [Dijkstra, Lamport] would bring the system to a consistent global state eventually. If the input is re-injected repeatedly (as in the routing information example) this corrects the states eventually. If the input is not reintroduced by the environment, then the input may be lost forever. Local checking [Afek-K-Yung90], or local detection [Awerbuch-P-Varghese91], together with self stabilizing reset [KatzPerry90,AKY90,AroraGouda90,APV91] would yield agility 1/|V|.

Some related notions fault local algorithms [K-Peleg95], or Time adaptive [KP97], or fault containment [GhoshGuptaHermanPemmaraju96] or local stabilization [AroraZhang03] would allow a “small” (relative to the number of faults) error propagation. Snap stabilization [ BuiDattaPetitVillain99 ] considers only the case that some nodes performed a special “purifying” “initiation” action after the faults. A fault may propagate to a non- initiator node until it communicate with initiators. Local stabilization of [AfekDolev97] assumes a node can detect its own faults (since a fault puts a node in a random state).

Open problems Many!