1 CSSE 477 – More on Availability & Reliability Steve Chenoweth Thursday, 9/22/11 Week 3, Day 3 Right – High availability with VMWare – the major goal.

Presentation transcript:

1 CSSE 477 – More on Availability & Reliability Steve Chenoweth Thursday, 9/22/11 Week 3, Day 3 Right – High availability with VMWare – the major goal is to eliminate all single points of failure. When an ESX host goes down (the red X), the virtual machines running on the failed host migrate over to the other, healthy ESX hosts that have spare capacity. After migration, the virtual machines are restarted automatically, without intervention from the administrator. From hosting/esx-high-availability.html.

2 Today 5 Minute talks on availability project Tactics for software availability engineering… –A bit more, mostly from Musa’s book Tonight: –Project 2, final part –HW 3 (individual) John Musa ( ), the inventor of “software reliability engineering.”

3 Software failures Important to discuss with customers –Need to know what variations in system behavior are tolerable Requirements provide a positive specification –Defining failures gives you a “negative specification” What the system must not do –Adds another dimension to communication with the user Below – A windmill goes south.

4 How to classify severity of software failures Based on cost: (these are 1999 $ from Musa’s book, so you need to set your own scale)
Severity class: Definition ($)
1: > 100,000
2: 10,000 – 100,000
3: 1,000 – 10,000
4: < 1,000
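A cost-based severity scale like this one is easy to encode so that every logged failure gets a class automatically. The sketch below is only an illustration: the function name `severity_class_by_cost` is hypothetical, the thresholds are the 1999-dollar bands from the slide, and putting a cost that sits exactly on a boundary into the more severe class is an assumption made here, not something the slide specifies.

```python
def severity_class_by_cost(cost_dollars: float) -> int:
    """Return severity class 1 (most severe) to 4 (least severe).

    Thresholds follow the slide's 1999-dollar bands; replace them with
    a scale that fits your own product. Boundary handling (a cost of
    exactly 10,000 counts as class 2) is an assumption of this sketch.
    """
    if cost_dollars < 0:
        raise ValueError("cost must be non-negative")
    if cost_dollars > 100_000:
        return 1
    if cost_dollars >= 10_000:
        return 2
    if cost_dollars >= 1_000:
        return 3
    return 4


if __name__ == "__main__":
    for cost in (250_000, 50_000, 2_500, 300):
        print(cost, "-> class", severity_class_by_cost(cost))
```

Keeping the thresholds in one function makes it simple to swap in a product-specific scale later.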

5 How to classify severity of software failures, cntd Based on operational impact:
Severity class: Definition
1: Unavailability to users of one or more key operations
2: Unavailability to users of one or more important operations
3: Unavailability to users of one or more operations but workarounds available
4: Minor deficiencies in one or more operations

6 How to classify severity of software failures, cntd Need to make such a table product-specific. Suppose the product is “Fone Follower”:
Failure severity class: Failure definition
1: Failure that prevents calls from being forwarded
2: Failure that prevents entry of phone numbers to which calls will be forwarded
3: Failure that makes system administration more difficult, but possible through alternate means. Like GUI doesn’t work for some feature, but can use text I/F.
4: Failure that causes minor inconvenience – like screens don’t show current date.

7 Need to set “failure intensity objectives” for each release Need to define a global way of measuring the “intensity” – like failures per hour of operation that will be tolerated. Next goal is to convert these to global measures related to code, like failures per million lines of code run, for the critical subsystems. Heuristic: For a given release, the product of these factors tends to be roughly constant, related to the amount of new functionality added: 1.Failure intensity 2.Development time 3.Development cost

8 Failure intensity vs reliability Generally, $\lambda = -\frac{\ln R}{t}$, where $R$ = reliability, $\lambda$ = failure intensity, and $t$ = number of natural time units. E.g., if reliability is 0.992 for 8 hours, the failure intensity is one failure per 1000 hours. Note that Musa’s definition of “reliability” is slightly more sophisticated than usual: It’s the probability of execution without failure for a specified time interval (like hours). In contrast, usually we talk of the average time to failure, and so skip over the probability part (it’s 50%).
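The conversion between reliability and failure intensity can be checked numerically. The sketch below assumes a constant failure intensity over the interval, so that R = exp(−λt) and therefore λ = −ln R / t. The function names are hypothetical and chosen only for this illustration.

```python
import math


def failure_intensity(reliability: float, hours: float) -> float:
    """Failure intensity (failures per hour) implied by a reliability
    figure over a given interval, assuming constant failure intensity."""
    if not 0.0 < reliability <= 1.0:
        raise ValueError("reliability must be in (0, 1]")
    if hours <= 0:
        raise ValueError("hours must be positive")
    return -math.log(reliability) / hours


def reliability_over(intensity_per_hour: float, hours: float) -> float:
    """Probability of failure-free operation over the interval,
    under the same constant-intensity assumption."""
    if intensity_per_hour < 0 or hours < 0:
        raise ValueError("inputs must be non-negative")
    return math.exp(-intensity_per_hour * hours)


if __name__ == "__main__":
    # A reliability of 0.992 over 8 hours is roughly 1 failure per 1000 hours.
    print(failure_intensity(0.992, 8.0))
    # Going the other way recovers roughly 0.992.
    print(reliability_over(0.001, 8.0))
```

Running both directions is a quick sanity check that a stated reliability target and a stated failure intensity objective actually agree with each other.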

9 For a new system… We don’t already have a track record of failure intensities –So it’s tougher to judge how much time to spend trying to get it right, for the “next release” – that’s release 1.0 ! What to do? –Find operational data for similar systems Or, how reliable are the underlying systems you’ll use? –Consider vendor warranties – what will we promise? (Or what do others promise?) –Get experts to estimate, based on prior work

10 Availability Tactics Once you know what to prevent, you develop to counteract those problems: Try the 3 Strategies from Bass Ch 5: –Fault detection –Fault recovery –Fault prevention Musa’s variation on this list is: –Fault prevention –Fault removal –Fault tolerance Opening rounds Later

11 Musa’s development strategies Engineer the right balance among these reliability strategies –Determine where to focus them, to maximize the likelihood of meeting the objectives in an economical way –Components you buy and integrate are a problem – you have less control over those! The best you can do may be to test with them as thoroughly as possible, give feedback to their vendors

12 Musa’s strategies, cntd Testers should be in on setting the objectives and deciding the system architecture Fault prevention is done by: –Good development processes, especially: Having sound underlying methodologies Doing reviews – like to requirements & design Enforcing standards Using design tools that keep faults from being introduced

13 Musa’s strategies, cntd How’s fault removal done? –Primarily by code reviews and testing You can measure the effectiveness of the code reviews by how many were caught vs how many remained to be caught in testing Measure the effectiveness of testing by the number of faults found by that testing, vs those found after (like by the customer!)
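Both effectiveness measures on this slide are the same ratio: faults caught by an activity, divided by all faults that were there to be caught (those caught plus those that escaped to a later stage). A minimal sketch, with a hypothetical function name and made-up counts used purely for illustration:

```python
def removal_effectiveness(found_by_activity: int, found_later: int) -> float:
    """Fraction of known faults that an activity (review or test) caught.

    found_by_activity: faults caught by the activity being measured.
    found_later: faults that escaped it and were found afterwards
                 (in testing, or by the customer).
    """
    if found_by_activity < 0 or found_later < 0:
        raise ValueError("fault counts must be non-negative")
    total = found_by_activity + found_later
    if total == 0:
        raise ValueError("no faults recorded, effectiveness is undefined")
    return found_by_activity / total


if __name__ == "__main__":
    # Illustrative numbers only: reviews caught 40 faults, testing found 10 more.
    print("code review effectiveness:", removal_effectiveness(40, 10))
    # Testing caught those 10, and customers later reported 2.
    print("testing effectiveness:", removal_effectiveness(10, 2))
```

Note that the ratio can only count faults that were eventually found, so it tends to look better early on and drift down as later stages report more escapes.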

14 Musa’s strategies, cntd How to achieve fault-tolerance? –Needs design: Anticipate what deviations are likely to occur and will lead to failures Implement “robust” software to counteract them –Like handling unexpected input from users or from other systems –How to minimize performance degradation, data corruption, or undesirable outputs Use hardware to help (see slide 1!) –Can measure effectiveness by the reduction in failure intensity that results
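As one small illustration of “robust” software that handles unexpected input, the sketch below validates a forwarding number for a Fone Follower-style product and keeps the last good setting when the new input is unusable, rather than failing or storing corrupt data. Everything here is hypothetical: the function names, the accepted separators, and the 7 to 15 digit rule are assumptions for the example, not part of Musa’s material.

```python
import re
from typing import Optional

# Assumed rule for this sketch: optional leading '+', then 7 to 15 digits.
_PHONE_PATTERN = re.compile(r"\+?\d{7,15}")


def parse_forwarding_number(raw: object) -> Optional[str]:
    """Return a normalized phone number, or None if the input is unusable."""
    if not isinstance(raw, str):
        return None
    # Strip separators that users commonly type: spaces, dashes, parens, dots.
    cleaned = re.sub(r"[\s\-().]", "", raw)
    if _PHONE_PATTERN.fullmatch(cleaned):
        return cleaned
    return None


def set_forwarding(current: Optional[str], raw: object) -> Optional[str]:
    """Accept a new forwarding number only if it is valid.

    On bad input the previous setting is kept, so unexpected input
    degrades gracefully instead of corrupting stored data.
    """
    parsed = parse_forwarding_number(raw)
    return parsed if parsed is not None else current


if __name__ == "__main__":
    setting = set_forwarding(None, "(812) 555-0100")
    print(setting)
    # Unusable input leaves the earlier setting in place.
    print(set_forwarding(setting, "call me maybe"))
```

The design choice is to separate validation from the state change, so the same check can guard input from users and from other systems.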

15 Musa’s strategies, cntd In large organizations: –Important to measure the effectiveness of each of these availability tactics –Leads to sound decisions about what to spend money on – “best strategies”

16 Software Safety A topic related to reliability Means freedom from mishaps –Mishap = loss of human life, injury, or property damage –Most software failures just cause user dissatisfaction Software safety is dependent on realistic testing –Need operational profiles to focus testing

17 Software vs hardware Hardware reliability is affected by aging and wear. Software – not so much. –Execution time affects reliability What is its “duty cycle” (in hours)? And What are its failures per execution hour? –To get ultrareliable functions, we need a test duration of several times the reciprocal of the failure intensity objective E.g., to get a failure intensity of $10^{-9}$ failures per hour, you’d need to test for several times $10^{9}$ hours.
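The test-duration rule of thumb is simple arithmetic, and working it through shows why ultrareliable objectives are so hard to demonstrate by testing alone. In this sketch the function names are hypothetical, the example objective is taken as 10^-9 failures per hour, and the multiplier of 3 is an assumed stand-in for “several times”.

```python
HOURS_PER_YEAR = 24 * 365  # ignoring leap years, for a rough estimate


def required_test_hours(failure_intensity_objective: float,
                        multiple: float = 3.0) -> float:
    """Test duration as a multiple of the reciprocal of the objective.

    multiple=3.0 is an assumed value for "several times".
    """
    if failure_intensity_objective <= 0:
        raise ValueError("objective must be positive")
    if multiple <= 0:
        raise ValueError("multiple must be positive")
    return multiple / failure_intensity_objective


def hours_to_years(hours: float) -> float:
    """Convert execution hours to years of continuous running."""
    return hours / HOURS_PER_YEAR


if __name__ == "__main__":
    hours = required_test_hours(1e-9)
    print(f"{hours:.3g} test hours, about {hours_to_years(hours):,.0f} years")
```

Even spread across many machines running in parallel, a figure of this size is usually out of reach, which is one reason such objectives also lean on fault prevention and fault tolerance.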

18 Musa on fault detection A less precise idea than software failure Absolute concept – an entity you can define without reference to software failures –Like a bad disk read Operational concept – an entity that exists only with reference to the fact that failures will occur if that state is executed –The fault is the defect causing the failure

19 Absolute faults Makes it necessary to postulate a “perfect” program to which an actual program can be compared. Then a fault is incorrect code, in comparison –It’s defective, missing, or an extra instruction –The defective program can be compared with the perfect program. Of course, in reality, there could be many “perfect” programs. –You don’t know about the bad code till you are trying to fix it.

20 Operational faults The “fault” has some reality of its own, as an implementation defect. But really, a piece of bad code could be responsible for no failures, or lots of failures, etc. And “software failures” may not be in the code, per se. So this definition of “fault” has issues, too.

21 A realistic definition Musa ends up saying, like Bass, that a software fault is incorrect code that leads to actual or potential failures. Which leaves open the question of “What code should be included in the fault?” –Requires judgment –Would changing a few related instructions prevent additional failures from occurring?

22 Availability in summary Strategic thing is to work to improve it –Definitely don’t ignore it on any real product –“Hope is a city on de Nile”. Lots more to explore –Plenty of systems out there, where people will pay to have them improve on this –Like all the quality attributes, this could be a career activity A good additional course in this area – Dr. Radu’s ECE 497 – Design of fault tolerant digital systems Right - A software cruise down “de Nile”