1 DRAFTS Fault Tolerance Some background Claudio Pinello

2 DRAFTS Hammurabi code (~1750 BC) 229 If a builder build a house for some one, and does not construct it properly, and the house which he built fall in and kill its owner, then that builder shall be put to death If it kill the son of the owner the son of that builder shall be put to death If it kill a slave of the owner, then he shall pay slave for slave to the owner of the house If it ruin goods, he shall make compensation for all that has been ruined, and inasmuch as he did not construct properly this house which he built and it fell, he shall re-erect the house from his own means. Inspiration: Prof. Patterson; reproduced without permission from

3 DRAFTS Some Terminology A fault is the cause of an error; an error is the part of the system state which may cause a failure; a failure is the deviation of the system from the specification. Adapted from: J.C. Laprie, “Dependability: basic concepts and terminology in English, French, German, Italian, and Japanese”, Springer-Verlag 1992, Series title: Dependable computing and fault-tolerant systems.

4 DRAFTS Example Office Desk –lamp bulb fails (fault) –light level drops (error) –I can’t get work done (failure) –unless…

5 DRAFTS One Good Idea: Redundancy

6 DRAFTS One Bad Idea: Redundancy

7 DRAFTS Structure System-level fault tolerance –avoid single point of failure –avoid common-mode failure (e.g. same bug in replicated software, all power supplies fail above 50 °C, etc.) –fault isolation –cross fingers!

8 DRAFTS Fault Model Silent Faults –faults result in omission errors Crash Faults (fail-stop) –faults result in crashes: no more data, ever! Non-silent Faults –faults result in value errors Byzantine Faults –malicious attacks, non-silent faults, bounded delays, etc…

9 DRAFTS Fault Detection Typically check for errors –Silent Faults: no errors? “omission” errors! Easy for synchronous systems, otherwise use timeouts. Question: You are sick in bed. How do you know if your door bell is broken?
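The timeout idea for silent faults can be sketched as a heartbeat monitor. This is only an illustrative sketch: the class name `HeartbeatMonitor`, its methods, and the injectable `clock` parameter are assumptions made for the example, not something taken from the slides.

```python
import time


class HeartbeatMonitor:
    """Suspect a component of a silent (omission) fault when it stays quiet too long."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout    # seconds of silence we are willing to accept
        self.clock = clock        # injectable so the sketch can be exercised without sleeping
        self.last_seen = {}       # component name -> time of its last heartbeat

    def heartbeat(self, name):
        """Record that `name` has just shown a sign of life."""
        self.last_seen[name] = self.clock()

    def suspected(self, name):
        """True if `name` never reported, or has been silent longer than the timeout."""
        last = self.last_seen.get(name)
        if last is None:
            return True
        return self.clock() - last > self.timeout
```

A timeout only yields a suspicion: in a system without bounded delays, a slow component and a failed one look the same, which is the point of the door bell question.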

10 DRAFTS Fault Detection Typically check for errors –Non-silent faults: how do you know if result is wrong? –e.g. your calculator computes sin(), how do you know if it is faulty? –BTW: what time is it?
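One possible answer to the calculator question is a plausibility check against an independently computed value. The sketch below is an assumption-laden illustration: `taylor_sin`, `plausible_sin`, the number of series terms, and the tolerance are all hypothetical choices made for the example.

```python
import math


def taylor_sin(x, terms=12):
    """Independent sine estimate from the Taylor series, after reducing x to [-pi, pi]."""
    x = math.remainder(x, 2 * math.pi)
    term = x
    total = x
    for k in range(1, terms):
        # Each term is the previous one times -x^2 / ((2k)(2k+1)).
        term *= -x * x / ((2 * k) * (2 * k + 1))
        total += term
    return total


def plausible_sin(x, claimed, tol=1e-9):
    """Accept `claimed` only if it lies in [-1, 1] and matches the independent estimate."""
    return abs(claimed) <= 1.0 and abs(claimed - taylor_sin(x)) <= tol
```

The check is itself a second computation, so it can be faulty too; it only helps if its failures are independent of the failures of the unit being checked.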

11 DRAFTS Fault Detection Non-silent faults: try voting –with n replicas and majority voting you can tolerate up to ⌈n/2⌉ − 1 faults
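A minimal sketch of voting over replicated results, assuming the results are hashable values that can be compared for equality. The function names `majority_vote` and `max_tolerated_faults` are made up for this illustration.

```python
from collections import Counter


def majority_vote(results):
    """Return the value reported by a strict majority of replicas, or None if there is none."""
    if not results:
        return None
    value, count = Counter(results).most_common(1)[0]
    return value if count > len(results) // 2 else None


def max_tolerated_faults(n):
    """Largest number of faulty replicas that still leaves a strict majority of correct ones."""
    return (n - 1) // 2
```

With three replicas, `majority_vote([4, 4, 7])` masks the single wrong value; with `[1, 1, 2, 2]` there is no strict majority, so the voter reports that it cannot decide rather than guessing.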

12 DRAFTS Fault Detection Typically check for errors –Byzantine faults: oh my! you can’t trust people on chatlines… can you ask them the time? the account number of the red cross for a donation? would you ask them what medicine to take?

13 DRAFTS Byzantine Generals question: “attack or retreat?” message passing (oral/written) there are traitors goal: determine consensus among non-traitors

14 DRAFTS Byzantine Generals Basic algorithm (by Lamport et al.) –m+1 rounds of oral message passing to tolerate m traitors –use majority voting, decide Tolerates fewer than 1/3 traitors (n > 3m generals) If you can use signed messages, reduced number of rounds All methods require bounded asynchrony, i.e. bounded delays
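The oral-message algorithm of Lamport, Shostak and Pease, usually written OM(m), can be simulated in a few lines. This is a sketch under stated assumptions: generals are plain integers, the traitor strategy in `send` (attack to even-numbered receivers, retreat to odd ones) is an arbitrary choice for the demo, and a missing majority defaults to retreat.

```python
from collections import Counter

ATTACK, RETREAT = "attack", "retreat"


def majority(values):
    """Strict majority of the received values; default to RETREAT when there is none."""
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) / 2 else RETREAT


def send(sender, receiver, value, traitors):
    """Loyal senders relay faithfully. The traitor strategy here is an assumption."""
    if sender in traitors:
        return ATTACK if receiver % 2 == 0 else RETREAT
    return value


def om(m, commander, lieutenants, value, traitors):
    """Run OM(m) and return {lieutenant: decided value}."""
    # Step 1: the commander sends its value to every lieutenant.
    received = {l: send(commander, l, value, traitors) for l in lieutenants}
    if m == 0:
        return received
    # Step 2: each lieutenant acts as commander in OM(m-1) to relay what it received.
    relayed = {}
    for j in lieutenants:
        others = [l for l in lieutenants if l != j]
        relayed[j] = om(m - 1, j, others, received[j], traitors)
    # Step 3: each lieutenant takes the majority of its own value and the relayed ones.
    decisions = {}
    for i in lieutenants:
        votes = [received[i]] + [relayed[j][i] for j in lieutenants if j != i]
        decisions[i] = majority(votes)
    return decisions
```

With four generals and one traitor, `om(1, 0, [1, 2, 3], ATTACK, traitors={3})` lets the loyal lieutenants 1 and 2 follow the loyal commander. With only three generals, `om(1, 0, [1, 2], ATTACK, traitors={2})` leaves lieutenant 1 unable to follow the order, which illustrates why fewer than one third of the generals may be traitors. Entries for traitors in the returned dictionary carry no meaning.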

15 DRAFTS What model to use? Depends on your application –internet transactions? probably Byzantine –embedded systems? usually non-silent faults are sufficient, but… more networked applications…. –channel transmission? using CRC one “approximates” fail silence HW faults or SW faults?
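The remark about CRCs can be illustrated with a small framing sketch: a corrupted frame is dropped, so a value error on the channel shows up to the receiver as an omission. The helper names `frame` and `unframe` and the 4-byte big-endian trailer are assumptions for the example; `zlib.crc32` is the standard-library CRC-32.

```python
import zlib


def frame(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 of the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def unframe(data: bytes):
    """Return the payload if the CRC matches, else None (the frame is treated as lost)."""
    if len(data) < 4:
        return None
    payload, received_crc = data[:-4], int.from_bytes(data[-4:], "big")
    return payload if zlib.crc32(payload) == received_crc else None
```

The word "approximates" matters: a CRC catches all single-bit errors and most others, but some corruption patterns can still slip through, so the channel is only fail-silent with high probability.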

16 DRAFTS More on redundancy Space-redundancy –hw (e.g. 4 brakes, RAID disks, batteries,…) –data structures (e.g. RAID) –software (e.g. Domain name servers) Time-redundancy –same person, compute twice –“reload” in web browsers –effective against transient faults
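Time redundancy ("compute twice") might look like the sketch below, assuming the computation is deterministic and safe to repeat. The name `compute_twice` and the `max_attempts` parameter are hypothetical.

```python
def compute_twice(fn, *args, max_attempts=3):
    """Run fn twice on the same unit and accept the result only when both runs agree."""
    for _ in range(max_attempts):
        first = fn(*args)
        second = fn(*args)
        if first == second:
            return first
    # Repeated disagreement suggests the fault is not transient.
    raise RuntimeError("results never agreed; the fault may be permanent")
```

Because both runs use the same hardware and software, a permanent fault or a bug produces the same wrong answer twice and goes unnoticed; only transient faults are caught this way.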

17 DRAFTS Recovery You detected a fault, now what? Isolate fault to avoid further errors Recover from fault –backtrack to known good checkpoint –start another agent to compute result –use another already available result –reduce functionality (e.g. slow down) –bring system to safe state (e.g. turn off engine)
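Backtracking to a known good checkpoint can be sketched as follows. The class `CheckpointedState`, the helper `run_step`, and the caller-supplied `is_valid` acceptance check are assumptions for illustration; a real system would also have to decide where checkpoints are stored and how often they are taken.

```python
import copy


class CheckpointedState:
    """Hold a mutable state together with a deep copy of its last known good version."""

    def __init__(self, state):
        self.state = state
        self._checkpoint = copy.deepcopy(state)

    def checkpoint(self):
        """Declare the current state good and remember it."""
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        """Discard the current state and restore the last checkpoint."""
        self.state = copy.deepcopy(self._checkpoint)


def run_step(cs, step, is_valid):
    """Apply `step` to the state; commit it if the result passes `is_valid`, else roll back."""
    try:
        step(cs.state)
        ok = is_valid(cs.state)
    except Exception:
        ok = False
    if ok:
        cs.checkpoint()
    else:
        cs.rollback()
    return ok
```

Rollback only restores the state held in memory here; effects that already left the system (a message sent, an actuator moved) cannot be undone this way, which is one reason the safe-state and reduced-functionality options are listed separately.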

18 DRAFTS Conclusions Faults do occur, do you care? Model them Use redundancy right! System-level fault tolerance Techniques exist, some are complex to get right