Lui Sha, Summer 20001 Overview – A recipe for successful research: –how to think independently, differently and boldly (lectures 1 - 2) –how to analyze.

Presentation transcript:

Lui Sha, Summer
Overview
- A recipe for successful research:
  - how to think independently, differently and boldly (lectures 1 - 2)
  - how to analyze ideas analytically, quantitatively and carefully (lectures 3 - 4)
- Outline
  - Lecture 3: learning the key to reliability modeling in 1 hour
  - Lecture 4: reasoning about the relationship between diversity, complexity and reliability
- References
  - Fault Tolerance in Distributed Systems, Pankaj Jalote, Prentice Hall
  - A First Course in Stochastic Processes, Samuel Karlin and Howard M. Taylor, Academic Press
  - Performance and Reliability Analysis of Computer Systems, R. A. Sahner et al.
  - Tools: Mathematica or Matlab

Lecture 3
- Questions that we want to answer
  - Which hardware system has a longer MTTF: a TMR system, or a singleton computer with no replication and voting?
  - When you design your research web-site architecture, how do you know which alternative will give you higher availability and/or reliability?

Concept of Reliability
- Reliability for a given mission duration t, R(t), is the probability that the system works as specified for a duration at least as long as t.
- The most commonly used reliability function is the exponential reliability function, $R(t) = e^{-\lambda t}$, where $\lambda$ is the failure rate.
- The failure rate used here is the long-term average rate, e.g. 10 failures/year.

MTTF and Availability
- Mean time to failure (MTTF): on average, how long it takes before a failure occurs. For the exponential model, MTTF = $1/\lambda$.
- Availability = the percentage of time a system is functioning = MTTF / (MTTF + MTTR) as $t \to \infty$, where MTTR is the mean time to repair, $1/\mu$.
- We also use an exponential model for repair time, where $\mu$ is the repair rate.
- Availability is meaningful to use only when __________________
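The availability ratio can be checked numerically. The following is a minimal sketch, assuming exponential failure and repair times so that MTTF = 1/λ and MTTR = 1/μ; the function names and the sample rates (10 failures/year, repairs taking one day on average) are illustrative assumptions, not values from the lecture.

```python
def mttf(failure_rate):
    """Mean time to failure under the exponential model: 1 / lambda."""
    return 1.0 / failure_rate


def availability(failure_rate, repair_rate):
    """Long-run availability: MTTF / (MTTF + MTTR), with MTTR = 1 / mu."""
    time_to_failure = mttf(failure_rate)
    time_to_repair = 1.0 / repair_rate
    return time_to_failure / (time_to_failure + time_to_repair)


# Hypothetical rates, both expressed per year.
lam = 10.0    # 10 failures per year
mu = 365.0    # one repair per day on average
print(f"MTTF = {mttf(lam):.3f} years, availability = {availability(lam, mu):.4f}")
```

Both rates must use the same time unit. The ratio simplifies to μ / (λ + μ), so availability depends only on how fast repair is relative to failure.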

Simple Reliability Modeling
[Diagram: two components with reliabilities $r_1(t)$ and $r_2(t)$]

Parallel System
[Diagram: two components in parallel, with reliabilities $r_1(t)$ and $r_2(t)$]
- When all the components have the same failure rate, we have a simple expression (Mathematica can do it for you symbolically).
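The slide formulas did not survive in this transcript, so the sketch below uses the standard textbook composition rules rather than the author's exact expressions. It assumes component failures are independent: a series system works only if every component works, and a parallel system works if at least one does. All function names are illustrative.

```python
import math


def exp_reliability(failure_rate, t):
    """Exponential reliability function R(t) = exp(-lambda * t)."""
    return math.exp(-failure_rate * t)


def series(reliabilities):
    """Series system: every component must work (independence assumed)."""
    return math.prod(reliabilities)


def parallel(reliabilities):
    """Parallel system: at least one component must work (independence assumed)."""
    return 1.0 - math.prod(1.0 - r for r in reliabilities)


# Two identical components with a hypothetical failure rate of 0.5 per unit time.
r = exp_reliability(0.5, 1.0)
print(f"single: {r:.4f}  series: {series([r, r]):.4f}  parallel: {parallel([r, r]):.4f}")
```

For n identical components with failure rate λ, the parallel case reduces to 1 - (1 - e^{-λt})^n, which is plausibly the kind of closed form a symbolic package would return.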

Triple Modular Redundancy
[Diagram: replicated components, each with reliability r(t), feeding a voter V]

Singleton vs TMR with no Repair
[Figure: two reliability curves, labeled Curve 1 and Curve 2]
- Which one is TMR? Why?
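One way to explore the question is to compute both curves. This is a sketch under stated assumptions: a perfect voter, no repair, independent failures, and each replica following the exponential model with rate λ. The 2-out-of-3 expression 3p^2 - 2p^3 is the one given later in the Mathematica example; the MTTF values come from integrating each R(t) from 0 to infinity. Function names are illustrative.

```python
import math


def singleton_reliability(lam, t):
    """Single computer: R(t) = exp(-lam * t)."""
    return math.exp(-lam * t)


def tmr_reliability(lam, t):
    """2-out-of-3 majority with a perfect voter: 3p^2 - 2p^3, p = exp(-lam * t)."""
    p = math.exp(-lam * t)
    return 3 * p**2 - 2 * p**3


def singleton_mttf(lam):
    """Integral of exp(-lam * t) over [0, inf)."""
    return 1.0 / lam


def tmr_mttf(lam):
    """Integral of 3exp(-2 lam t) - 2exp(-3 lam t) over [0, inf)."""
    return 3.0 / (2.0 * lam) - 2.0 / (3.0 * lam)


lam = 1.0
for t in (0.1, 0.5, math.log(2) / lam, 1.0, 2.0):
    print(f"t={t:.3f}  singleton={singleton_reliability(lam, t):.4f}  "
          f"TMR={tmr_reliability(lam, t):.4f}")
print(f"MTTF singleton={singleton_mttf(lam):.4f}  TMR={tmr_mttf(lam):.4f}")
```

Under these assumptions the two curves cross where p = 1/2, that is at t = ln 2 / λ. TMR is more reliable for shorter missions and less reliable for longer ones, and its MTTF without repair is 5/(6λ), shorter than the singleton's 1/λ.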

Majority Voting with Repair
[State diagram: 3 working -> 2 working at rate $3\lambda$; 2 working -> Failure at rate $2\lambda$; 2 working -> 3 working at repair rate $\mu$]
- We did not model the reliability of the voter in this model. What does this imply?
- What should we do if the reliability of the voter needs to be modeled?

Key Ideas
- The key ideas in a Markov model are
  - the notion of a state
  - the transition probability depends only on the current state
  - the state transition probability $P_{ij}(t)$ is the probability that a system which starts at state i is at state j after t units of time.
- For example, $P_{13}(t)$ is the probability that the system starts at state 1 and ends at the failure state, state 3, after t units of time. $1 - P_{13}(t)$ is the reliability. Why?
[State diagram: 3 working (state 1) -> 2 working (state 2) at rate $3\lambda$; 2 working -> Failure (state 3) at rate $2\lambda$; 2 working -> 3 working at repair rate $\mu$]

Solution to the Markov Model
- P(t) is the matrix of state transition probabilities $P_{ij}(t)$.
- A is the matrix of failure rates and repair rates between states.
- The solution is the matrix exponential, $P(t) = e^{At}$.

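The State Transition Rate Matrix, A
[State diagram: 3 working (State 1) -> 2 working (State 2) at rate $3\lambda$; 2 working -> Failure (State 3) at rate $2\lambda$; 2 working -> 3 working at repair rate $\mu$]
- Off-diagonal elements come from the diagram.
- Each row sums to zero.
- With rows and columns ordered State 1, State 2, State 3:
$$A = \begin{pmatrix} -3\lambda & 3\lambda & 0 \\ \mu & -(2\lambda + \mu) & 2\lambda \\ 0 & 0 & 0 \end{pmatrix}$$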

A Mathematica Example
- Each version's reliability is $e^{-\lambda t}$, with $\lambda = 1$ and no repair:
    A = {{-3, 3, 0}, {0, -2, 2}, {0, 0, 0}}
    Simplify[MatrixExp[A t]]
- $R(t) = 1 - P_{13}(t) = 3e^{-2t} - 2e^{-3t}$
- Using basic probability theory, we know that the system works when all 3 or any 2 out of 3 are working. Hence, with $P = e^{-t}$, $R(t) = P^3 + 3P^2(1 - P) = 3P^2 - 2P^3 = 3e^{-2t} - 2e^{-3t}$.
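For readers without Mathematica, the same check can be done numerically. The sketch below is a stand-in for the symbolic `MatrixExp` call, not a reproduction of it. It approximates e^{At} in plain Python by scaling and squaring a truncated Taylor series, then reads the reliability off as 1 - P_13(t). The helper names and the optional repair rate `mu` (placed in row 2 so that each row still sums to zero) are assumptions made for illustration.

```python
import math


def mat_mul(X, Y):
    """Multiply two square matrices stored as lists of rows."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(M, terms=30, squarings=10):
    """Approximate exp(M): Taylor series of M / 2^s, then square s times."""
    n = len(M)
    scale = 2.0 ** squarings
    S = [[M[i][j] / scale for j in range(n)] for i in range(n)]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]          # current term S^k / k!
    for k in range(1, terms):
        term = [[v / k for v in row] for row in mat_mul(term, S)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


def tmr_rate_matrix(lam, mu=0.0):
    """Rate matrix for states (3 working, 2 working, failure); rows sum to zero."""
    return [[-3 * lam, 3 * lam, 0.0],
            [mu, -(2 * lam + mu), 2 * lam],
            [0.0, 0.0, 0.0]]


def tmr_reliability(lam, mu, t):
    """R(t) = 1 - P_13(t), where P(t) = exp(A t)."""
    A = tmr_rate_matrix(lam, mu)
    P = mat_exp([[a * t for a in row] for row in A])
    return 1.0 - P[0][2]


t = 1.0
closed_form = 3 * math.exp(-2 * t) - 2 * math.exp(-3 * t)
print(f"no repair: numeric={tmr_reliability(1.0, 0.0, t):.6f}  closed form={closed_form:.6f}")
print(f"with repair (mu=10, hypothetical): {tmr_reliability(1.0, 10.0, t):.6f}")
```

With `mu = 0` and `lam = 1` the rate matrix is exactly the A of the Mathematica example, so the numeric value should agree with 3e^{-2t} - 2e^{-3t} up to rounding. Setting `mu > 0` models repair from the 2-working state.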

Summary of Modeling
- We started with a simple reliability model.
- We reasoned about the reliability using elementary probability theory.
- We examined the key ideas of the Markov model and the solution methods.
- We showed that once we draw the state transition diagram, powerful tools will automatically produce the solutions.
- The power of the Markov model and math packages reduces reliability analysis to:
  - draw the state transition diagram
  - write down the transition rate matrix
  - call the matrix exponential function

Lecture 4
- Questions that we want to answer
  - What are the key ideas and intuitions that would allow us to attack the problem of software reliability?
  - How can we turn intuitive ideas into a logical system that we can reason about analytically?
  - How can we use our sharpened understanding to guide our software architecture designs?

From Ill-formed to Well-formed
- One of the most important skills in research is problem formulation: transforming an interesting idea or question into something that can be analyzed.
- To shed light on the questions that we raised in the last class, we need to postulate a logical relation between the effort we spend on software engineering and the resulting reliability.
- This logical relationship should be grounded in factual observations, but idealization is fine.

Assumptions Grounded in Observations
- 1) The more complex the software project is, the harder it is to make it reliable. For a given degree of complexity, the more effort we can devote to software engineering, the higher the reliability.
- 2) The obvious errors are spotted and corrected early during development. As time passes, the remaining errors are subtler and more difficult to detect and correct.
- 3) There is only a finite amount of effort (budget) that we can spend on any project.

From Assumptions to Model
- These observations suggest that, for a normalized mission duration t = 1, the reliability of a software system can be expressed as an exponential function of the software complexity, C, and the available development effort, E, in the form $R(E, C) = e^{-C/E}$.
- As we can see, R(E, C) rises as effort E increases and decreases as complexity C increases. There is, however, another way to model it.
[Figure 1: Reliability and Complexity; curves for C = 1 and C = 2]
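The model is simple enough to tabulate directly. A minimal sketch, assuming the form R(E, C) = e^{-C/E} stated above; the function name and the sampled effort values are illustrative, while the two complexity levels are the ones labeled in Figure 1.

```python
import math


def reliability(effort, complexity):
    """Software reliability model R(E, C) = exp(-C / E) for mission duration t = 1."""
    return math.exp(-complexity / effort)


# Tabulate the two complexity levels labeled in Figure 1 over a few effort values.
for effort in (0.5, 1.0, 2.0, 5.0, 10.0):
    print(f"E={effort:5.1f}  R(C=1)={reliability(effort, 1.0):.4f}  "
          f"R(C=2)={reliability(effort, 2.0):.4f}")
```

Reading down a column shows reliability rising with effort. Reading across a row shows it falling with complexity.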

Analysis
- For 3-version programming:
  - The reliability of each version, when effort is allocated equally, is $R = e^{-C/(E/3)}$.
  - The system works if all 3 versions work or any 2 out of 3 work. Thus, the reliability of the system is $R_s = R^3 + 3R^2(1 - R)$. Note that this analysis assumes faults in different versions are independent, a favorable assumption.
- For recovery block with a perfect acceptance test (a favorable assumption), if any version works, the system works.
  - For the case of 3 alternatives without complexity reduction and with equal effort allocation, we have $R = e^{-C/(E/3)}$ and $R_s = 1 - (1 - R)^3$.
  - If you divide the effort equally among 2 alternatives but one has only 0.5C complexity, then $R_1 = e^{-C/(E/2)}$, $R_2 = e^{-0.5C/(E/2)}$, and the system reliability is $R_s = 1 - (1 - R_1)(1 - R_2)$.
- You can try out different effort allocation methods and different complexity reductions and see the results (plot them and try to find a qualitative pattern).
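The cases above can be evaluated side by side as a starting point for that exploration. This is a sketch assuming the model R = e^{-C/E}, independent faults across versions, and a perfect acceptance test; the function names and the sample budget E = 3, C = 1 are illustrative choices.

```python
import math


def reliability(effort, complexity):
    """Single-version reliability R = exp(-C / E)."""
    return math.exp(-complexity / effort)


def three_version(effort, complexity):
    """3-version programming, equal effort split: Rs = R^3 + 3 R^2 (1 - R)."""
    r = reliability(effort / 3, complexity)
    return r**3 + 3 * r**2 * (1 - r)


def recovery_block(efforts, complexities):
    """Recovery block with a perfect acceptance test: fails only if every alternative fails."""
    all_fail = 1.0
    for e, c in zip(efforts, complexities):
        all_fail *= 1.0 - reliability(e, c)
    return 1.0 - all_fail


E, C = 3.0, 1.0   # hypothetical total effort and complexity
print(f"single version      : {reliability(E, C):.4f}")
print(f"3-version           : {three_version(E, C):.4f}")
print(f"RB, 3 alternatives  : {recovery_block([E / 3] * 3, [C] * 3):.4f}")
print(f"RB, 2 alt, one 0.5C : {recovery_block([E / 2, E / 2], [C, 0.5 * C]):.4f}")
```

Passing explicit lists of efforts and complexities to `recovery_block` makes it easy to try unequal allocations as well.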

From Assumption to Model - 3
- Single version vs 3-version (equal allocation)
[Figure 2: Effect of Divided Efforts in 3-version Programming; curves: 3-version programming, single version programming]

Single version vs Recovery Block
- Single version vs Recovery Block (3 alternatives, equal allocation, no complexity reduction)
[Figure 3: Effect of Dividing Effort in Recovery Block; curves: single version programming, RB (Recovery Block)]

Degree of Diversity
- RBn, where n is the number of alternatives (n-way equal allocation, no complexity reduction).
[Figure 4: Degree of Diversity; curves: RB2, RB3, RB10]
- Adding diversity to a system is like adding salt to a bowl of soup. A little improves the taste. Too much is counter-productive.
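The salt analogy can be made concrete by sweeping n. A minimal sketch, assuming RBn means n alternatives of equal complexity C, each given effort E/n, behind a perfect acceptance test; `rb_n`, `best_n`, and the sample budgets are hypothetical names and values.

```python
import math


def rb_n(effort, complexity, n):
    """Recovery block with n equal alternatives: 1 - (1 - exp(-C / (E / n)))^n."""
    r = math.exp(-complexity / (effort / n))
    return 1.0 - (1.0 - r) ** n


def best_n(effort, complexity, max_n=10):
    """Number of alternatives (1..max_n) that maximizes rb_n for this budget."""
    return max(range(1, max_n + 1), key=lambda n: rb_n(effort, complexity, n))


# Hypothetical budgets for a project of complexity C = 1.
for effort in (1.0, 10.0):
    row = "  ".join(f"n={n}:{rb_n(effort, 1.0, n):.4f}" for n in (1, 2, 3, 5, 10))
    print(f"E={effort:4.1f}  {row}  best n={best_n(effort, 1.0)}")
```

In this model the best degree of diversity depends on the budget: with scarce effort a single version wins, while with ample effort a moderate n is best and n = 10 falls behind it.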

Keep it Simple, Stupid!
- RB2Ln, where n is the complexity reduction in the alternative to the primary with full functionality (2-way equal effort allocation, n times complexity reduction in the simple alternative).
[Figure 5: Effect of Complexity Reduction; curves: single version programming, RB2, RB2L2, RB2L10]
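The effect of complexity reduction can be sketched the same way. This assumes RB2Ln splits the effort E equally between a primary of complexity C and a simple alternative of complexity C/n, with a perfect acceptance test; the function name `rb2l` and the sample values E = 2, C = 1 are illustrative.

```python
import math


def rb2l(effort, complexity, n):
    """Two-alternative recovery block; the simple alternative has complexity C / n."""
    half = effort / 2
    r_primary = math.exp(-complexity / half)
    r_simple = math.exp(-(complexity / n) / half)
    return 1.0 - (1.0 - r_primary) * (1.0 - r_simple)


E, C = 2.0, 1.0   # hypothetical budget and complexity
print(f"single version: {math.exp(-C / E):.4f}")
for n in (1, 2, 10):
    print(f"RB2L{n}: {rb2l(E, C, n):.4f}")
```

With n = 1 this is plain RB2, which for this budget does slightly worse than a single version. The gain appears only once the second alternative is genuinely simpler.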

Using Simplicity to Control Complexity
- OK, simplicity leads to reliability. But we want fancy features that require complex software. Worse, most applications do not have high-coverage acceptance tests.
- The solution is to use simplicity to control complexity.
  - Use a simple and reliable core that provides the essential service.
  - Ensure that the reliable core will not be compromised by faults in the bells-and-whistles.
  - Leverage the reliable core to ensure overall system integrity in spite of faults in the complex features, even WHEN THERE ARE NO EFFECTIVE ACCEPTANCE TESTS.

Using Simplicity to Control Complexity: A Real World Example
- Facts of life
  - The root cause of software faults is complexity, but companies can't sell new software without new capabilities and features.
  - Fault masking by checking output is mostly impractical.
- This leaves us with forward recovery architectures that
  - limit the potential damage of complex components
  - use simpler and reliable components to guarantee system integrity
- A real world example that you may bet your life on
  - Used successfully in engineering artifacts (e.g. Boeing 777)

Analytic Redundancy
[Figure labels: Proven Reliability, Control Performance, 747 controller, 777 controller]
- Boeing 777 has two digital controllers. The normal controller is an optimized one. The secondary controller is based on the much simpler 747 control technology.
- To design a simple controller with a maximal recoverability region:
  - Dynamic system: $X' = A X$, where $A = A^* + B K$, $A^*$ is the system matrix, and K is the reliable control.
  - Stability condition: $A^T Q + Q A < 0$, where Q defines the Lyapunov function $X^T Q X$. The set $X^T Q X = 1$ is an ellipsoid.
  - The operational state constraints are represented by a polytope described by a set of linear inequalities in the system state space.
  - The largest ellipsoid in a polytope can be found by minimizing (log det Q) [Boyd 94], subject to the stability condition and the state constraints.
- The recovery problem will be much easier if the system is open-loop stable. Many industrial process control systems are open-loop stable.
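The stability condition and the ellipsoidal region can be checked numerically for a small case. The sketch below does not solve the log det Q optimization, which needs a semidefinite programming solver. It only verifies that a given Q satisfies the condition for a given closed-loop A and tests whether a state lies inside the ellipsoid. The 2x2 matrices are hypothetical values picked for illustration (Q was chosen so that A^T Q + Q A equals the negative identity), and all function names are assumptions.

```python
def mat_mul(X, Y):
    """Multiply two square matrices stored as lists of rows."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def transpose(X):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*X)]


def lyapunov_lhs(A, Q):
    """Compute A^T Q + Q A."""
    left, right = mat_mul(transpose(A), Q), mat_mul(Q, A)
    return [[left[i][j] + right[i][j] for j in range(len(A))] for i in range(len(A))]


def is_positive_definite_2x2(M):
    """Sylvester's criterion for a symmetric 2x2 matrix."""
    return M[0][0] > 0 and M[0][0] * M[1][1] - M[0][1] * M[1][0] > 0


def is_negative_definite_2x2(M):
    """A symmetric matrix M is negative definite exactly when -M is positive definite."""
    return is_positive_definite_2x2([[-v for v in row] for row in M])


def in_recovery_region(x, Q):
    """True if the state x lies inside the ellipsoid x^T Q x <= 1."""
    return sum(x[i] * Q[i][j] * x[j] for i in range(len(x)) for j in range(len(x))) <= 1.0


# Hypothetical closed-loop matrix A = A* + B K (eigenvalues -1 and -2) and candidate Q.
A = [[0.0, 1.0], [-2.0, -3.0]]
Q = [[1.25, 0.25], [0.25, 0.25]]

print("Q positive definite:", is_positive_definite_2x2(Q))
print("A^T Q + Q A negative definite:", is_negative_definite_2x2(lyapunov_lhs(A, Q)))
print("state (0.5, 0.5) recoverable:", in_recovery_region([0.5, 0.5], Q))
print("state (1.0, 1.0) recoverable:", in_recovery_region([1.0, 1.0], Q))
```

A run-time monitor built on this idea could let the complex controller act while the state stays inside the ellipsoid and hand control to the simple controller otherwise. That switching rule is one reading of the slide, not something it spells out.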

What is Analytic Redundancy?
- The use of analytic redundancy originated in sensor system designs, to improve system reliability in spite of NON-random errors. For example, in navigation, position can be determined by
  - way points (landmarks)
  - inertial navigation system
  - GPS
- Analytic redundancy is characterized by:
  - It is partially redundant (the sources don't give identical answers).
  - There is an analytical relation between the similar answers that allows us to use them to check each other.

Compared with Recovery Block
- Both use a simple and reliable component and a full-featured component.
- Backward vs forward recovery
  - Recovery block is a backward recovery approach that tries to prevent faults from being visible from outside.
  - Analytic redundancy is a forward recovery approach that allows for visible faults but tries to make them tolerable and recoverable.
- Fault detection
  - In recovery block, the output of EACH alternative MUST PASS the acceptance test BEFORE it is used.
  - In analytic redundancy, the simple and reliable alternative's computation and/or the expected system behavior is used to judge the complex alternative.
- Terminology. Error: bugs in the code. Fault: bugs activated during runtime. Failure: faults causing the system to behave in an unacceptable way.

Quiz 2: Joe's Dilemma
- Students' sorting programs will be graded as follows:
  - "A" if a program is correct with computational complexity $O(n \log n)$.
  - "B" if a program is correct with computational complexity $O(n^2)$.
  - "F" if a program sorts items incorrectly.
- If Joe uses bubble-sort, he will get a "B". If Joe uses heap-sort, he will get either an "A" or an "F".
- What should Joe do?

Solution 2: Analytic Redundancy
[Diagram: input -> heap-sort, $O(n \log n)$ -> bubble-sort, $O(n)$ if its input is sorted, $O(n^2)$ otherwise -> output]
- Joe will get at least a "B".
- The critical property of this system, correctness of sorting, is "controlled" by the logically simpler component.
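The pipeline in the diagram can be sketched in a few lines. This is an illustrative version, not Joe's actual program: the heap sort uses Python's `heapq`, the bubble sort stops after the first pass with no swaps (which is what gives it linear time on already sorted input), and `buggy_heap_sort` is a hypothetical faulty implementation used to show the effect. The sketch assumes a faulty complex stage still returns a permutation of its input. A stage that loses or invents items would not be repaired by the bubble sort.

```python
import heapq


def heap_sort(items):
    """Complex stage: O(n log n) heap sort."""
    heap = list(items)
    heapq.heapify(heap)
    return [heapq.heappop(heap) for _ in range(len(heap))]


def bubble_sort(items):
    """Simple stage: one pass if the input is already sorted, O(n^2) otherwise."""
    data = list(items)
    n = len(data)
    while True:
        swapped = False
        for i in range(n - 1):
            if data[i] > data[i + 1]:
                data[i], data[i + 1] = data[i + 1], data[i]
                swapped = True
        if not swapped:
            return data
        n -= 1   # the largest remaining item is now in its final place


def sort_with_analytic_redundancy(items, complex_sort=heap_sort):
    """Run the complex sort, then let the simple sort guarantee the ordering."""
    return bubble_sort(complex_sort(items))


def buggy_heap_sort(items):
    """Hypothetical faulty heap sort: leaves the last two items swapped."""
    out = heap_sort(items)
    if len(out) >= 2:
        out[-1], out[-2] = out[-2], out[-1]
    return out


data = [5, 3, 8, 1, 9, 2]
print(sort_with_analytic_redundancy(data))
print(sort_with_analytic_redundancy(data, complex_sort=buggy_heap_sort))
```

When the heap sort is correct, the bubble sort adds only one linear pass, so the overall cost stays O(n log n). When it is faulty, the bubble sort still delivers a correct ordering at O(n^2) cost in the worst case.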

Comparison of Two Solutions
- Both approaches work in this example.
- Under recovery block, if bubble sort does not work but heap sort works, the system's answer is still correct.
- Under analytic redundancy, if bubble sort is incorrect but heap sort is correct, the system's answer can be incorrect.
- If we want to use simplicity to control complexity, the simple one must work. This is a disadvantage, but not a serious one. If you can't do the simple one right, your chance of getting the complex one right is not very good.
- More importantly, analytic redundancy is still applicable when there are no effective acceptance tests. (Recall the Boeing 777 example.)
- Bottom line: use recovery block if you can find an effective acceptance test. Otherwise, think analytic redundancy.
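For contrast, the recovery block alternative for the same sorting problem might look like the sketch below. Sorting happens to have an effective acceptance test (the output is ordered and contains the same items as the input), which is what makes the approach workable here. The names `acceptance_test`, `run_recovery_block`, and `broken_sort` are hypothetical, and Python's built-in `sorted` stands in for a correct alternative.

```python
from collections import Counter


def acceptance_test(original, candidate):
    """Accept only an ordered output that holds the same items as the input."""
    same_items = Counter(original) == Counter(candidate)
    ordered = all(candidate[i] <= candidate[i + 1] for i in range(len(candidate) - 1))
    return same_items and ordered


def run_recovery_block(items, alternatives):
    """Try each alternative on a fresh copy of the input; return the first accepted result."""
    for alternative in alternatives:
        checkpoint = list(items)          # backward recovery: restart from the original state
        try:
            result = list(alternative(checkpoint))
        except Exception:
            continue                      # a crash counts as a failed alternative
        if acceptance_test(items, result):
            return result
    raise RuntimeError("all alternatives failed the acceptance test")


def broken_sort(items):
    """Hypothetical faulty alternative: returns the input unchanged."""
    return list(items)


print(run_recovery_block([3, 1, 2], [broken_sort, sorted]))
```

Each output is checked before it is used, so a faulty alternative stays invisible from outside as long as a later alternative passes. Without a test like `acceptance_test`, this structure has nothing to decide with, which is where analytic redundancy becomes the fallback.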