A. Haeberlen Fault Tolerance and the Five-Second Rule 1 HotOS XV (May 18, 2015) Ang Chen Hanjun Xiao Andreas Haeberlen Linh Thi Xuan Phan Department of.

Presentation transcript:

Slide 1: Fault Tolerance and the Five-Second Rule
HotOS XV (May 18, 2015)
Ang Chen, Hanjun Xiao, Andreas Haeberlen, Linh Thi Xuan Phan
Department of Computer and Information Science, University of Pennsylvania

Slide 2: Faults in Distributed Systems
- Nodes in a distributed system can fail
  - Example: Online banking
- The consequences can be serious
  - Example: Monetary loss
- Solution: Use fault-tolerance techniques
[Figure: bank.com page showing "Balance: $100", "Transfer $9 to Marcos", "New balance: $1"]

Slide 3: Faults in Real Life
- Transactions in real life can fail, too!
  - Example: Paying with cash at the checkout counter
- Failures can have bad consequences
  - Example: Getting shortchanged
- Solution: Use fault-tolerance techniques?
[Figure: checkout scene with the captions "$9, please!" and "Your change:"]

Slide 4: Online vs. offline
- How do we handle this in the real world?
  - No masking: The transaction is allowed to fail initially
  - Detection: Participants check the results
  - Recovery: Detected failures are fixed if possible
  - Timeliness: Checking happens quickly (to limit damage)
- Can we do the same in distributed systems?
- Our proposal: Bounded-time recovery (BTR)
  - Intuition: When a node fails, the system may make mistakes for a limited time (e.g., 100 ms), but then it recovers
  - Should be a provable property, not just best-effort!

Slide 5: When would BTR be sufficient?
- Not all systems can use BTR
  - Example: Systems where failures are immediately fatal
- But there are systems that could benefit!
  - Example: Cyber-physical systems
  - Physical part often has some inertia
  - Control algorithms can often tolerate some mistakes
- Time bound is key: Fixing problems 'eventually' is not enough!

Slide 6: What could we gain from BTR?
- Opportunity #1: Lower cost
  - Detection is cheaper than masking
  - Particularly important for CPS
- Opportunity #2: Timing guarantees
  - Even most BFT solutions cannot guarantee timely responses when the system is under attack
- Opportunity #3: Fine-grained responses
  - Typical fault-tolerance guarantee is "all or nothing"
  - BTR can recover from failures in many ways, e.g., by dropping less important tasks or by adjusting the service level

Slide 7: Outline
- Motivation
- Idea: Bounded-Time Recovery (BTR)
- Pros and Cons of BTR
- BTR defined
- Solution sketch
- Summary
[Marker on slide: NEXT]

Slide 8: A proposed definition
- Bounded-time recovery: A system offers BTR with a time bound R if its outputs are correct in any interval [t₁, t₂] such that no fault has manifested in [t₁ − R, t₂]
- Some special cases:
  - R = 0: Similar to BFT (but with timing guarantees!)
  - R = ∞: Similar to self-stabilization
  - Small values of R are the most interesting (and the hardest)
[Figure: timeline with the labels "Time", "Fault", "R", and "System correct"]
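The definition on this slide can be read as a simple predicate over fault times. The following is a minimal sketch of that reading, not code from the talk: the function name `must_be_correct`, the millisecond integer timestamps, and the representation of faults as a list of manifestation times are all assumptions made for illustration.

```python
import math
from typing import Sequence


def must_be_correct(t1: float, t2: float,
                    fault_times: Sequence[float], R: float) -> bool:
    """Return True if BTR with bound R requires correct outputs on [t1, t2].

    Per the slide's definition, outputs must be correct on [t1, t2]
    whenever no fault has manifested in [t1 - R, t2].
    (Hypothetical helper; times are in milliseconds by assumption.)
    """
    if t1 > t2:
        raise ValueError("interval must satisfy t1 <= t2")
    # A fault anywhere in the extended window releases the obligation.
    return not any(t1 - R <= f <= t2 for f in fault_times)


# One fault manifests at t = 1000 ms; the bound is R = 100 ms.
faults = [1000]
print(must_be_correct(1101, 2000, faults, R=100))      # window starts at 1001
print(must_be_correct(1050, 2000, faults, R=100))      # window starts at 950
print(must_be_correct(1001, 2000, faults, R=0))        # R = 0 special case
print(must_be_correct(5000, 6000, faults, R=math.inf)) # R = infinity special case
```

With R = 0 the obligation resumes immediately after the fault, and with R = ∞ any earlier fault releases it for good, which matches the two special cases listed on the slide.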

Slide 9: What assumptions do we need?
- BTR talks about time → Need synchrony!
  - Must have strong bounds on execution times
  - Must have strong bounds on message delays
- This is reasonable (in the CPS domain)
  - WCETs are often known or can be derived
  - Networks have FEC and support bandwidth reservations
- Can we assume Byzantine faults?
  - Real, growing concern for CPS!
  - Qualified yes: Some hardware features needed
  - Example: Protection against Babbling Idiots (e.g., bus guardians)

Slide 10: Solution sketch: Planning
- Ingredient #1: Planner
  - System can run in several modes, has a (static) plan for what to run where in each mode
  - Online vs. offline planning
- Several interesting challenges (see paper for details)
  - Example: Inter-mode dependencies; connections to game theory
  - Example: Distributed mixed-mode scheduling
- Interesting opportunities, e.g., fine-grained responses
[Figure: two plans over nodes A, B, C, D, labeled Plan for "no faults" mode and Plan for "B faulty" mode; legend: Tasks, Data flow]
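A static, per-mode plan of the kind described on this slide can be pictured as a lookup table from tasks to nodes. The sketch below is only an illustration under stated assumptions: the task names (`sense`, `control`, `log`, `actuate`), their placement, and the helper names `check_plans` and `tasks_for` are invented here, and only the node names A to D and the two mode names come from the slide.

```python
from typing import Dict, List, Optional, Set

# task -> node that runs it; None means the task is dropped in that mode.
Plan = Dict[str, Optional[str]]

# Hypothetical plans; only the mode names and node names appear on the slide.
PLANS: Dict[str, Plan] = {
    "no faults": {"sense": "A", "control": "B", "log": "C", "actuate": "D"},
    # Fine-grained response: move "control" off B and drop the
    # less important "log" task to free capacity on C.
    "B faulty": {"sense": "A", "control": "C", "log": None, "actuate": "D"},
}

# Nodes assumed faulty in each mode.
FAULTY_NODES: Dict[str, Set[str]] = {"no faults": set(), "B faulty": {"B"}}


def check_plans(plans: Dict[str, Plan], faulty: Dict[str, Set[str]]) -> None:
    """Raise ValueError if any mode places a task on a node it treats as faulty."""
    for mode, plan in plans.items():
        for task, node in plan.items():
            if node is not None and node in faulty[mode]:
                raise ValueError(f"mode {mode!r} runs {task!r} on faulty node {node}")


def tasks_for(node: str, mode: str, plans: Dict[str, Plan] = PLANS) -> List[str]:
    """List the tasks that the given node runs in the given mode."""
    return sorted(task for task, where in plans[mode].items() if where == node)


check_plans(PLANS, FAULTY_NODES)
print(tasks_for("C", "no faults"))  # ['log']
print(tasks_for("C", "B faulty"))   # ['control']
```

Because the table is computed ahead of time, a node that learns the current mode can find its new task set with a lookup. The scheduling questions the slide raises (inter-mode dependencies, mixed-mode scheduling) are not modeled here.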

Slide 11: Solution sketch: Detection
- Ingredient #2: Fault detector
  - Need to detect (at runtime) when a node misbehaves
  - Can we use PeerReview [SOSP'07] for this?
  - No: PeerReview is for asynchronous systems!
- Challenge: Detecting temporal faults
  - Example: Faulty node might send the right message at the wrong time
- Challenge: Bounding time to detection
  - Adversary can 'win' simply by delaying detection (and thus recovery) for too long!
[Figure: nodes A and B exchanging the messages "2+2?" and "4"]
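The temporal-fault example (the right message at the wrong time) suggests that an auditor has to check arrival times as well as values. The following is a minimal sketch of such a check, assuming the auditor already knows the expected value and an allowed arrival window. The names `Observation`, `Verdict`, and `audit` are hypothetical, and this is neither PeerReview nor the detection protocol from the paper.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    OK = "ok"
    VALUE_FAULT = "wrong value"
    TIMING_FAULT = "right value, wrong time"


@dataclass(frozen=True)
class Observation:
    value: int       # payload the audited node sent
    arrival_ms: int  # when the auditor received it


def audit(obs: Observation, expected_value: int,
          earliest_ms: int, latest_ms: int) -> Verdict:
    """Classify one observed message against its expected value and time window.

    The window [earliest_ms, latest_ms] is assumed to be derived from known
    bounds on execution time and message delay (the synchrony assumption).
    """
    if obs.value != expected_value:
        return Verdict.VALUE_FAULT
    if not (earliest_ms <= obs.arrival_ms <= latest_ms):
        return Verdict.TIMING_FAULT
    return Verdict.OK


# A asks B "2+2?" and expects the answer 4 between 10 ms and 20 ms.
print(audit(Observation(4, 15), 4, 10, 20))  # Verdict.OK
print(audit(Observation(4, 35), 4, 10, 20))  # Verdict.TIMING_FAULT
print(audit(Observation(5, 15), 4, 10, 20))  # Verdict.VALUE_FAULT
```

A purely value-based check would accept the second message. The sketch also leaves out messages that never arrive, and it does nothing to bound how long detection itself takes.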

Slide 12: Solution sketch: Recovery
- Ingredient #3: Evidence distributor
  - Need to convince other nodes that a fault really exists
  - Adversary might try to confuse the system by reporting non-existent faults
  - PeerReview-style protocols can provide evidence of faults
  - Challenge: Needs resources, new kinds of evidence
- Ingredient #4: Mode switcher
  - Each node needs to switch to the new plan
  - Involves transferring state, starting/terminating tasks
  - Some existing work on mode-change protocols
  - Surprisingly, global agreement may not be needed

Slide 13: Putting it all together
- Planning: Decide what to run where in each mode
- Detection: Nodes audit each other to look for faults
- Evidence: Nodes prove existence of detected faults
- Mode change: System reconfigures
[Figure: nodes A, B, C, D on a timeline labeled "Time"; B is marked faulty ("B is faulty!"), evidence is distributed, and the system recovers within R; per-node marks: ✔ ✖ ✔ ✔]
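One way to read this pipeline is as a time budget: the three runtime steps have to finish within R after a fault manifests. The sketch below encodes that reading. It is an assumption, not something the slides state: it treats the steps as strictly sequential, each with a known worst-case bound, and the numbers are made up apart from the 100 ms figure, which the talk gives as an example bound.

```python
def worst_case_recovery_ms(detect_ms: int, distribute_ms: int, switch_ms: int) -> int:
    """Worst-case time from fault to recovery, assuming the steps run back to back."""
    return detect_ms + distribute_ms + switch_ms


def meets_bound(R_ms: int, detect_ms: int, distribute_ms: int, switch_ms: int) -> bool:
    """Check whether the (hypothetical) per-step bounds fit within the BTR bound R."""
    return worst_case_recovery_ms(detect_ms, distribute_ms, switch_ms) <= R_ms


# Hypothetical per-step bounds against the example bound R = 100 ms.
print(worst_case_recovery_ms(40, 20, 30))  # 90
print(meets_bound(100, 40, 20, 30))        # True
print(meets_bound(100, 70, 20, 30))        # False: detection alone takes too long
```

The last case reflects the concern from the detection slide: if an adversary can stretch the time to detection, the overall bound is violated regardless of how fast the other steps are.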

Slide 14: Summary
- We propose Bounded-Time Recovery (BTR)
  - New approach to fault tolerance
  - System is allowed to produce wrong outputs after a fault, but only for a limited time
- Case study: Cyber-physical systems
  - Support the additional assumptions that BTR requires
  - BTR could offer lower cost, fine-grained responses to faults
- Interesting research challenges
  - Unusual scheduling problems, new detection protocols, ...
Questions?