COMP5348 Enterprise Scale Software Architecture Semester 1, 2010 Lecture 10. Fault Tolerance Based on material by Alan Fekete.


Copyright warning

Outline ›Types of Faults and Failures ›Estimating reliability and availability ›A case study: storage failures ›Techniques for dealing with failure

Terminology: Fault and Failure ›Any system or component has specified (desired) behavior, and observed behavior ›Failure: the observed behavior is different from the specified behavior ›Fault (defect): something gone/done wrongly ›Once the fault occurs, there is a “latent failure”; some time later the failure becomes effective (observed) ›Once the failure is effective, it may be reported and eventually repaired; from failure to the end of repair is a “service interruption”

What fails? ›Anything can fail! -Dealing with a failure may cause more faults! ›Hardware ›Software -OS, DBMS, individual process, language RTE, network stack, etc ›Operations ›Maintenance ›Process ›Environment

How does it fail? ›Failstop -System ceases to do any work ›Hard fault -Leads to: System continues but keeps doing wrong things ›Intermittent (“soft”) fault -Leads to: Temporary wrong thing then correct behavior resumes ›Timing failure -Correct values, but system responds too late

Repeatability? ›If system is run the same way again, will the same failure behavior be observed? -If several systems are run on the same data, will they all fail the same way? ›Heisenbugs: system does not fail the same way in different executions -May be due to race conditions -Very common in practice -Eg “just reboot/powercycle the system” ›“Bohrbugs”: reproducible
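
As a concrete illustration (not from the lecture), the sketch below shows one common source of Heisenbugs, a race condition: two Python threads update a shared counter without synchronisation, so whether the lost-update failure appears depends on how the threads happen to be scheduled on a given run.

```python
import threading

# Illustrative Heisenbug: an unsynchronised read-modify-write on a shared
# counter. Repeated runs of the same program may or may not show the failure.
counter = 0

def unsafe_increment(times):
    global counter
    for _ in range(times):
        value = counter        # read
        value += 1             # modify
        counter = value        # write (may overwrite another thread's update)

threads = [threading.Thread(target=unsafe_increment, args=(100_000,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(f"expected 200000, observed {counter}")  # often, but not always, wrong
```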

Analyzing failure ›Root cause analysis ›Find the fault that led to failure ›Then find why the fault was made -And why it wasn’t detected/fixed earlier ›Example (the “5 Whys” technique): My car will not start. (the problem) 1. Why? - The battery is dead. (first why) 2. Why? - The alternator is not functioning. (second why) 3. Why? - The alternator belt has broken. (third why) 4. Why? - The alternator belt was well beyond its useful service life and has never been replaced. (fourth why) 5. Why? - I have not been maintaining my car according to the recommended service schedule. (fifth why, root cause)

Attacks or accident? ›Some failures are due to malice: deliberate attempt to cause damage ›Most are due to bad luck, incompetence, or choice to run risks -often several of these together!

Outline ›Types of Faults and Failures ›Estimating reliability and availability ›A case study: storage failures ›Techniques for dealing with failure

Memoryless fault model › Failure in each short period is independent of failure in another short period -There is no effect from the lack of failure in the past -Contrast with gambler’s fallacy
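
A minimal Python sketch of the memoryless model, using an assumed per-period failure probability p: every period is an independent coin flip, so having survived a long time does not change the chance of failing in the next period.

```python
import random

# Memoryless fault model: in every short period the component fails with the
# same small probability p, independently of how long it has already survived.
# (p is an illustrative value, not real data.)
p = 0.001
random.seed(1)

def periods_until_failure():
    periods = 0
    while random.random() >= p:   # survive this period, independently of the past
        periods += 1
    return periods

samples = [periods_until_failure() for _ in range(10_000)]
print("mean periods to failure, roughly 1/p:", sum(samples) / len(samples))
```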

Failure rate ›Express the possibility of failure as AFR (annualized failure rate) -Expected ratio of failures to size of population, over a year of operation ›Discuss: how is it possible that AFR>1? ›Can then compute expected number of failures during other periods -Recall: approx 30,000,000 sec in 1 yr -Almost 10,000 hrs in 1 yr
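
A small illustrative calculation (the AFR, population size and per-year figures are assumed, following the slide's rough numbers): converting an AFR into expected failure counts over other periods. Note that an AFR above 1 simply means a unit is expected to fail, and be repaired or replaced, more than once per year.

```python
# Rough conversions using the slide's approximations for one year.
SECONDS_PER_YEAR = 30_000_000
HOURS_PER_YEAR = 10_000

afr = 0.01            # assumed: AFR = 1% per component per year
population = 100_000  # assumed population size

expected_failures_per_year = afr * population
expected_failures_per_hour = expected_failures_per_year / HOURS_PER_YEAR

print("expected failures per year:", expected_failures_per_year)   # 1000.0
print("expected failures per hour:", expected_failures_per_hour)   # 0.1
```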

Warning ›AFR = 1% means: in 100,000 cases, 1000 is the expected number that fail during a year ›It does not say: 1000 will definitely fail during the year ›It is not necessarily saying: there is 50% chance that more than 1000 fail, and 50% chance that fewer fail ›It does not say: my system will fail if I wait for 100 years
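
To see why an expected value is not a guarantee, here is a hedged simulation sketch; it additionally assumes failures are independent, which the bare AFR figure does not promise. The yearly failure count scatters around 1000 rather than hitting it exactly, and the chance of exceeding 1000 is close to, but not exactly, one half.

```python
import random

# Simulate many "years" for a population of 100,000 components with AFR = 1%
# (illustration only; assumes independent failures).
random.seed(42)
population, afr, trials = 100_000, 0.01, 100

counts = []
for _ in range(trials):
    failures = sum(1 for _ in range(population) if random.random() < afr)
    counts.append(failures)

print("min / mean / max yearly failures:",
      min(counts), sum(counts) / trials, max(counts))
print("fraction of years with more than 1000 failures:",
      sum(1 for c in counts if c > 1000) / trials)
```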

Bathtub fault model ›Failure is more likely very early in system lifecycle, and very late in system lifecycle -Initially, many construction faults are discovered and fixed -Eventually parts wear out -Software also has “bit rot” due to shifting uses, other changes etc ›Failure is rare in the extended middle of the system lifecycle

Burstiness and environmental effects ›Many failure types do have occurrences that correlate -Often one failure may be fixed, but leave degraded data structures or collateral damage -Environmental effects like heat

MTTF ›Express current failure rate through mean-time-to-failure -MTTF= 1/AFR years -This is often too long to be known through testing ›Quote this at some point in time -If failure rate changes through time, this is not the same as true expected time till failure happens ›Many vendors give “best case” MTTF -“system is guaranteed to perform like this or worse”

Repair ›After a failure, system or component can be repaired/replaced ›This takes time ›MTTR is average time from failure till repair is completed (and system resumes working correctly)

Availability ›What fraction of a time-period is system functioning correctly? ›Availability = MTTF/(MTTF+MTTR) ›Often quoted as “number of nines” ›“3 nines” means 99.9% available -Downtime of approx 500 mins each year
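
A small sketch of these relationships, using assumed MTTF/MTTR figures; it converts availability into a "number of nines" and the corresponding expected downtime per year.

```python
import math

MINUTES_PER_YEAR = 365.25 * 24 * 60

def availability(mttf_hours, mttr_hours):
    return mttf_hours / (mttf_hours + mttr_hours)

def nines(avail):
    return -math.log10(1 - avail)          # 0.999 -> about 3 nines

def yearly_downtime_minutes(avail):
    return (1 - avail) * MINUTES_PER_YEAR

a = availability(mttf_hours=999, mttr_hours=1)   # assumed figures -> 0.999
print(round(a, 4), round(nines(a), 1), round(yearly_downtime_minutes(a)))
# 0.999, ~3 nines, ~526 minutes of downtime per year
```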

Parts and the whole ›When a system is made up of several components -How does failure of a component impact the system as a whole? ›Often failure of one component means overall failure ›But some designs use redundancy -allow system to keep working until several components all fail

Calculations for systems of components ›For a system whose functioning requires every component to work -Assume that component failures are independent! ›AFR(system) = Sum(AFR of component i) ›Availability(system) = Product(Availability of component i)
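
These two rules are easy to turn into code. The sketch below uses made-up component figures and keeps the slide's assumptions: every component is needed, and component failures are independent.

```python
from math import prod

component_afrs = [0.02, 0.05, 0.01]        # assumed example values
component_avail = [0.999, 0.995, 0.9999]   # assumed example values

system_afr = sum(component_afrs)           # AFR(system) = sum of component AFRs
system_avail = prod(component_avail)       # Availability(system) = product

print("system AFR:", system_afr)           # 0.08
print("system availability:", system_avail)  # ~0.9939
```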

Outline ›Types of Faults and Failures ›Estimating reliability and availability ›A case study: storage failures ›Techniques for dealing with failure

Outline ›Types of Faults and Failures ›Estimating reliability and availability ›A case study: storage failures ›Techniques for dealing with failure

Logging ›Record what happens in a stable place -Whose failures are independent of the system -Especially record all user inputs and responses ›There are frameworks that make this somewhat easy ›Essential to deal with mess left by failure ›Also powerful aid in forensics
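
A minimal sketch of the idea, with an illustrative file name and record format (not from the lecture): every user request and the response sent back is appended to a log and forced to stable storage before work continues. In a real deployment the log should live on storage whose failures are independent of the system itself.

```python
import json, os, time

LOG_PATH = "service.log"   # assumed location; should be on independent stable storage

def log_record(kind, payload):
    record = {"ts": time.time(), "kind": kind, "payload": payload}
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps(record) + "\n")
        f.flush()
        os.fsync(f.fileno())       # force the record to stable storage

def handle_request(request):
    log_record("request", request)                  # record the user input first
    response = {"status": "ok", "echo": request}    # stand-in for the real work
    log_record("response", response)                # record what we told the user
    return response

handle_request({"user": "alice", "action": "deposit", "amount": 10})
```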

Styles of recovery ›Backward: Restore system to a sensible previous state, then perform operations again ›Forward: Make adjustments to weird/corrupted state, to bring it to what it ought to have been
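
Continuing the hypothetical log format above, a sketch of backward recovery: restore the last checkpointed state, then perform again (replay) the logged requests that arrived after the checkpoint. The checkpoint_ts field and the apply_request callback are assumptions for illustration.

```python
import json

def recover(checkpoint_path, log_path, apply_request):
    with open(checkpoint_path) as f:
        state = json.load(f)                    # a sensible previous state
    with open(log_path) as f:
        for line in f:
            record = json.loads(line)
            if record["kind"] == "request" and record["ts"] > state["checkpoint_ts"]:
                apply_request(state, record["payload"])   # perform the operation again
    return state
```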

Principles ›Idempotence -Make sure that if you perform an operation again, it doesn’t have different effects -Eg write(5) is idempotent, add(2) is not -Create idempotence by knowing what has already been done (store timestamps with the data) ›Causality -Make sure that if you perform operation again, the system is in the same state it was in when op was done originally -Or at least, in a state with whatever context is necessary
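
A sketch of the idempotence idea (the names and the assumption of in-order replay are mine): a non-idempotent add becomes safe to repeat by remembering, alongside the data, the id of the last update already applied.

```python
balances = {"alice": 100}
applied = {}          # account -> id of the last update already applied

def add(account, amount, update_id):
    # Assumes updates are replayed in their original order.
    if applied.get(account) == update_id:
        return                      # already done: re-applying has no effect
    balances[account] += amount
    applied[account] = update_id

add("alice", 2, update_id="txn-17")
add("alice", 2, update_id="txn-17")   # replay after recovery: ignored
print(balances["alice"])              # 102, not 104
```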

Failfast components ›Designing a larger system from redundant parts is easier if the parts are failstop -And availability is best if they are quick to repair ›So, don’t try to keep going when a problem is observed -Stop completely -Wait for the repair!
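
A tiny sketch of the failfast style, with an assumed invariant: the component checks its own state after each operation and stops outright when the check fails, rather than continuing and doing further damage.

```python
class FailfastQueue:
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = []

    def _check(self):
        if len(self.items) > self.capacity:      # internal invariant violated
            raise SystemExit("failfast: queue invariant broken, stopping")

    def push(self, item):
        self.items.append(item)
        self._check()                            # stop immediately if anything is wrong
```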

N-plex Redundancy ›Have several (n) copies of the same component ›When one fails, use another -And start repairing the first ›So system works as long as some component is working ›The hard thing: making the voter work reliably!
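
A hedged sketch of one n-plex arrangement, a majority voter over n replicas (the replica functions are stand-ins): each request is sent to every copy and the most common answer wins. Note how correctness of the whole system now hinges on this small voter behaving reliably.

```python
from collections import Counter

def vote(replicas, request):
    answers = []
    for replica in replicas:
        try:
            answers.append(replica(request))
        except Exception:
            pass                        # a failed replica simply casts no vote
    if not answers:
        raise RuntimeError("all replicas have failed")
    answer, _count = Counter(answers).most_common(1)[0]
    return answer

# Example: three copies of a computation, one of which misbehaves.
replicas = [lambda x: x * 2, lambda x: x * 2, lambda x: x * 2 + 1]
print(vote(replicas, 21))   # 42
```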

Calculating failure rates ›Assume that failures are independent -Not very realistic: eg environment can affect all in the same way ›Availability(n-plex) = 1 - (1 - Avail(component))^n ›MTTF(n-plex) = MTTF^n / (n × MTTR^(n-1))
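
The slide's formulas written out in code, with assumed example numbers; both inherit the optimistic independence assumption above.

```python
def nplex_availability(component_avail, n):
    return 1 - (1 - component_avail) ** n

def nplex_mttf(mttf, mttr, n):
    return mttf ** n / (n * mttr ** (n - 1))

print(nplex_availability(0.99, n=3))          # ~0.999999
print(nplex_mttf(mttf=1000, mttr=10, n=2))    # 1000**2 / (2*10) = 50000 hours
```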

Summary ›Many varieties of failure -Anything can fail! ›Measures of reliability and availability -Estimating these -Know the assumptions you make! ›Techniques to deal with failures

Further reading ›J. Gray and A. Reuter, “Transaction Processing: Concepts and Techniques” (Morgan Kaufmann, 1993) ›B. Schroeder and G. Gibson, “Understanding Disk Failure Rates: What Does an MTTF of 1,000,000 Hours Mean to You?”, ACM Transactions on Storage 3(3), 2007 ›E. Pinheiro, W.-D. Weber, L. A. Barroso, “Failure Trends in a Large Disk Drive Population”, USENIX FAST ’07 ›W. Jiang, C. Hu, Y. Zhou, A. Kanevsky, “Are Disks the Dominant Contributor for Storage Failures?”, USENIX FAST ’08