Fault-Tolerant Computing Systems #3 Fault-Tolerant Software


Pattara Leelaprute
Computer Engineering Department, Kasetsart University
pattara.l@ku.ac.th

Fault-Tolerant Software
- Software design faults
  - Software has design faults only (no operational faults; why?)
  - Commonly called bugs
- Approaches
  - Fault avoidance: review, testing, V&V (verification and validation)
    - Only removes deterministic design faults
    - It is difficult, if not impossible, to guarantee that the software contains no design fault
  - Fault tolerance
    - System level
    - Software design level

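Fault-Tolerant Software
- Single-version: a single piece of software
- Multi-version: the design diversity concept
  - Two or more different but functionally identical versions (variants) of a piece of software
  - Executed in sequence or in parallel
  - The versions are used as alternatives, in pairs, or in groups
  - Examples: Swedish State Railway; Airbus A310, Airbus A320/A330/A340

Code example: count is incremented between the two reads, so the two displayed digits come from different values.

    while (1) {                         // count = 25
        writeline(x[0] = count / 10);   // 25 / 10 = 2, shows "2"
                                        // <-- count-up happens here, count = 26
        writeline(x[1] = count % mod);  // 26 % 10 = 6, shows "6" (assuming mod = 10)
    }

Reading count once into a temporary makes both digits come from the same value.

    temp = count;                       // count = 25
    writeline(x[0] = temp / 10);        // 25 / 10 = 2, shows "2"
    writeline(x[1] = temp % mod);       // 25 % 10 = 5, shows "5" (assuming mod = 10)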

Single-Version Software Fault Tolerance
- Targets state-dependent faults (the most frequent kind)
  - Unanticipated (they cannot be foreseen)
  - Similar to transient hardware faults (they appear and go away)
  - Activated by a particular input sequence
- Uses only a single piece of software
- Based on the data diversity concept
- Techniques
  - N-Copy Programming
  - Retry Block
  - Checkpoint and Restart (Rollback Recovery)

Single-Version Software FT (Data Diversity)
- The same software is executed repeatedly with different but logically equivalent input data
- Faults in software usually depend on the input sequence
- Example: calculating sin(x)
  - Identities used: sin(a + b) = sin(a) cos(b) + cos(a) sin(b) and cos(a) = sin(π/2 - a)
  - Substituting b = x - a:
    sin(x) = sin(a) sin(π/2 - (x - a)) + sin(π/2 - a) sin(x - a)
           = sin(a) sin(π/2 - x + a) + sin(π/2 - a) sin(x - a)
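A minimal sketch of this re-expression, assuming Python's `math.sin` stands in for the routine being protected. The function name `sin_reexpressed` and its parameters are hypothetical and chosen only for illustration. It evaluates the right-hand side of the identity for several offsets `a` and shows that each one reproduces sin(x) while calling the underlying routine with arguments derived from `x` and `a` rather than with `x` directly.

```python
import math

def sin_reexpressed(x, a, sin=math.sin):
    """Compute sin(x) as sin(a)sin(pi/2 - x + a) + sin(pi/2 - a)sin(x - a).

    `sin` is the routine under protection (an assumption for this sketch).
    """
    half_pi = math.pi / 2
    return sin(a) * sin(half_pi - x + a) + sin(half_pi - a) * sin(x - a)

x = 5.0
for a in (2.0, 3.0, 4.0):
    # Each offset gives a different input set but the same logical result.
    print(a, sin_reexpressed(x, a), math.sin(x))
```

Any value of `a` works mathematically; the point of data diversity is that different offsets exercise the software with different concrete inputs.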

N-Copy Programming
- Re-expression used by the copies: sin(x) = sin(a) sin(π/2 - x + a) + sin(π/2 - a) sin(x - a)
- Structure: Input x -> Re-express data -> Copy 1, Copy 2, ..., Copy N -> Voter -> Output
- Copy N computes sin(aN) sin(π/2 - x + aN) + sin(π/2 - aN) sin(x - aN)
- Example: x = a + b, so b = x - a. Let x = 5; then a1 = 2, b1 = 3; a2 = 3, b2 = 2; a3 = 4, b3 = 1; and so on
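The structure above can be sketched as follows. This is an illustration under stated assumptions, not the lecture's implementation: `faulty_sin` is a hypothetical routine with an injected fault for the inputs 4.0 and 5.0, and `vote` is a simple majority voter with a numeric tolerance. With x = 5 and offsets 2, 3, 4, the copy using a = 4 hits the faulty input and is outvoted by the other two.

```python
import math

def faulty_sin(t):
    """Hypothetical buggy routine: wrong only for two particular inputs."""
    if t in (4.0, 5.0):
        return 0.0              # injected fault, for illustration only
    return math.sin(t)

def reexpress(x, a, sin):
    """One copy: evaluate sin(x) through the re-expressed form with offset a."""
    half_pi = math.pi / 2
    return sin(a) * sin(half_pi - x + a) + sin(half_pi - a) * sin(x - a)

def vote(results, tol=1e-9):
    """Return a result that a strict majority of copies agree on (within tol)."""
    for candidate in results:
        agreeing = [r for r in results if abs(r - candidate) <= tol]
        if len(agreeing) * 2 > len(results):
            return candidate
    raise RuntimeError("no majority among the copies")

def n_copy_sin(x, offsets=(2.0, 3.0, 4.0), sin=faulty_sin):
    """Run the same software on N re-expressed inputs and vote on the outputs."""
    return vote([reexpress(x, a, sin) for a in offsets])

print(faulty_sin(5.0))   # direct call hits the injected fault
print(n_copy_sin(5.0))   # voted result from the re-expressed copies
```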

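Retry Block
- Structure: Input -> Software -> Acceptance Test
  - OK: deliver the output
  - NG (not good): check whether the deadline has expired
    - No: re-express the data and run the software again
    - Yes: report a fault
- Attempt i computes sin(ai) sin(π/2 - x + ai) + sin(π/2 - ai) sin(x - ai)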
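A hedged sketch of the retry-block loop, assuming a hypothetical routine `buggy_sin` that returns an impossible value for the input 5.0, and a plausibility check as the acceptance test (a sine value must lie in [-1, 1]). The names `retry_block`, `via_offset`, and `deadline_s` are illustrative and not from the slides.

```python
import math
import time

HALF_PI = math.pi / 2

def buggy_sin(t):
    """Hypothetical routine under protection; wrong for the input 5.0 only."""
    return 2.0 if t == 5.0 else math.sin(t)

def via_offset(x, a):
    """Re-expressed computation of sin(x) using offset a."""
    return (buggy_sin(a) * buggy_sin(HALF_PI - x + a)
            + buggy_sin(HALF_PI - a) * buggy_sin(x - a))

def acceptance_test(result):
    """Plausibility check: a sine value must lie in [-1, 1]."""
    return -1.0 <= result <= 1.0

def retry_block(x, offsets, deadline_s=1.0):
    """Try the original input first, then re-expressed inputs, until the
    acceptance test passes or the deadline expires."""
    deadline = time.monotonic() + deadline_s
    attempts = [lambda: buggy_sin(x)]
    attempts += [lambda a=a: via_offset(x, a) for a in offsets]
    for attempt in attempts:
        result = attempt()
        if acceptance_test(result):
            return result                    # OK: deliver the output
        if time.monotonic() >= deadline:
            break                            # deadline expired: stop retrying
    raise RuntimeError("fault: no acceptable result before the deadline")

print(retry_block(5.0, offsets=(2.0, 3.0)))
```

Unlike N-copy programming, the re-expressed inputs are tried one after another and only as many as needed, so the cost in the fault-free case is a single execution plus the acceptance test.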

Checkpoint and Restart (Rollback Recovery)
- A checkpoint marks a state of the system
- If an error occurs, the operation is restarted from the normal state saved before the error occurred
- Good for transient faults (faults that occur only under specific conditions)
- Timeline: Checkpoint -> Error -> Rollback to the checkpoint
- Two kinds of restart
  1. Static: like a restart button. Goes back to the initial reset state; the state is selected based on the operational situation
  2. Dynamic: checkpoints (snapshots) are created dynamically, so not all of the work has to be thrown away
     - Created at fixed intervals or at particular points chosen by some optimization rule
     - Caution: unrecoverable actions (external events) may exist
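As an illustration of dynamic checkpointing, the sketch below takes a snapshot before each step and rolls back to it when a step raises a transient error. Everything here is an assumption made for the example: the state is a plain dictionary, `TransientError` and `run_with_checkpoints` are hypothetical names, and the fault is simulated by a step that fails on its first call after partially updating the state.

```python
import copy

class TransientError(Exception):
    """Stands in for a fault that disappears when the step is repeated."""

def run_with_checkpoints(state, steps, max_restarts=3):
    """Apply each step to `state`; on a transient error, roll back to the
    checkpoint taken before that step and restart the step."""
    for step in steps:
        checkpoint = copy.deepcopy(state)          # dynamic checkpoint
        for _ in range(max_restarts + 1):
            try:
                step(state)
                break                              # step succeeded
            except TransientError:
                state.clear()                      # rollback: discard the
                state.update(copy.deepcopy(checkpoint))  # partial update
        else:
            raise RuntimeError("step kept failing after rollback")
    return state

calls = {"n": 0}

def flaky_add(state):
    state["total"] += 10           # partial update happens before the failure
    calls["n"] += 1
    if calls["n"] == 1:
        raise TransientError("fails on the first attempt only")

def double(state):
    state["total"] *= 2

final = run_with_checkpoints({"total": 1}, [flaky_add, double])
print(final)  # {'total': 22}
```

Without the rollback, the partial update from the failed attempt would be applied twice and the result would be 42 instead of 22. Actions with external effects cannot be undone this way, which is the caution noted on the slide.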

Multi-Version Software Fault Tolerance
- Design diversity: uses two or more different but functionally identical (same specification) versions (variants) of a piece of software
- Components built differently should fail differently
- Techniques
  - Recovery Blocks
  - N-Version Programming
  - Acceptance Voting & N Self-Checking System
  - Certification Trails
- Vocabulary: validity = correctness, legitimacy; assertion = a declared claim (used as a check)

Recovery Blocks
- Combine checkpoint and restart with multi-version software
- Use an acceptance test (AT) to detect errors; if an error is detected, use another variant
- Acceptance test: tests the validity of an output (an assertion)

Recovery Block
- Structure: Input -> Checkpoint -> Version 1 -> AT
  - OK: deliver the output
  - NG: restore the checkpoint -> Version 2 -> AT, and so on through Version N
  - If the AT also fails for Version N: ERROR
- The checkpoint is needed to recover the state after a version fails, so that the next version has a valid operational starting point when an error is detected
- Versions 2, 3, 4, ... can have degraded performance
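The flow above can be sketched as follows, using sorting as a stand-in task. The variants and the acceptance test are hypothetical: `fast_sort` is a primary with an injected design fault (it drops duplicates), `simple_sort` is a slower alternate, and `sort_ok` checks only that the output is ordered and has the same length as the input.

```python
import copy

def recovery_block(state, variants, acceptance_test):
    """Try each variant from the same checkpointed state; return the first
    output that passes the acceptance test."""
    checkpoint = copy.deepcopy(state)
    for variant in variants:
        candidate = copy.deepcopy(checkpoint)   # restore state for this variant
        try:
            output = variant(candidate)
        except Exception:
            continue                            # treat a crash like a failed AT
        if acceptance_test(checkpoint, output):
            return output
    raise RuntimeError("ERROR: all variants failed the acceptance test")

def fast_sort(xs):
    """Hypothetical primary with a design fault: duplicates are lost."""
    return sorted(set(xs))

def simple_sort(xs):
    """Alternate: plain insertion sort, slower but independently written."""
    out = []
    for v in xs:
        i = 0
        while i < len(out) and out[i] <= v:
            i += 1
        out.insert(i, v)
    return out

def sort_ok(original, output):
    """Acceptance test: output is ordered and has as many items as the input."""
    ordered = all(a <= b for a, b in zip(output, output[1:]))
    return ordered and len(output) == len(original)

print(recovery_block([3, 1, 3, 2], [fast_sort, simple_sort], sort_ok))
# [1, 2, 3, 3]
```

The acceptance test here is deliberately cheap and incomplete; how strong to make it is a design trade-off between coverage and run-time cost.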

N-Version Programming
- Run multiple variants at the same time and let the voter take the majority result
- Structure: Input -> Version 1, Version 2, Version 3 -> Voter -> Output
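A minimal sketch of three-version programming with a majority voter. The three versions are hypothetical implementations of "sum of 1..n", one of which contains an injected off-by-one fault. For simplicity the versions run one after another here, whereas the slide describes running them at the same time.

```python
from collections import Counter

def majority_vote(outputs):
    """Return the output produced by a strict majority of versions."""
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 > len(outputs):
        return value
    raise RuntimeError("no majority among versions")

def n_version(x, versions):
    """Feed the same input to every version and vote on the outputs."""
    return majority_vote([version(x) for version in versions])

def sum_formula(n):        # version 1: closed form
    return n * (n + 1) // 2

def sum_loop(n):           # version 2: explicit loop
    total = 0
    for i in range(1, n + 1):
        total += i
    return total

def sum_off_by_one(n):     # version 3: injected fault, misses the last term
    return sum(range(n))

print(n_version(10, [sum_formula, sum_loop, sum_off_by_one]))  # 55
```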

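Acceptance Voting & N Self-Checking System
- Acceptance Voting: use a separate AT for each version
  - Structure: Version 1, Version 2, Version 3 -> AT (one per version) -> Voter
- N Self-Checking System: compare the versions in pairs
  - If a pair does not agree, the responses of that pair are discarded
  - If a pair agrees, its result is compared again
  - Structure: (Version 1, Version 2) -> Compare; (Version 3, Version 4) -> Compare; the pair results -> Compare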
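Both schemes can be sketched under some assumptions about details the slide leaves open. In `acceptance_voting`, outputs that fail their own acceptance test never reach the voter. In `n_self_checking`, disagreeing pairs are discarded and the results of the agreeing pairs are compared with each other. The integer square root versions, including the faulty `halve`, are hypothetical examples.

```python
import math
from collections import Counter

def acceptance_voting(x, versions_with_tests):
    """Each (version, test) pair filters its own output; the voter only sees
    outputs that passed their acceptance test."""
    accepted = []
    for version, test in versions_with_tests:
        out = version(x)
        if test(x, out):
            accepted.append(out)
    if not accepted:
        raise RuntimeError("no output passed its acceptance test")
    value, count = Counter(accepted).most_common(1)[0]
    if count * 2 > len(accepted):
        return value
    raise RuntimeError("no majority among accepted outputs")

def n_self_checking(x, pairs):
    """Compare versions in pairs, discard disagreeing pairs, then compare the
    results of the agreeing pairs with each other."""
    survivors = []
    for first, second in pairs:
        a, b = first(x), second(x)
        if a == b:
            survivors.append(a)
    if not survivors:
        raise RuntimeError("every pair disagreed")
    if any(s != survivors[0] for s in survivors):
        raise RuntimeError("agreeing pairs produced different results")
    return survivors[0]

def float_root(n):         # version based on floating-point power
    return int(n ** 0.5)

def halve(n):              # injected fault: not a square root at all
    return n // 2

def root_ok(n, r):
    """Acceptance test for an integer square root."""
    return r * r <= n < (r + 1) * (r + 1)

versions = [(math.isqrt, root_ok), (float_root, root_ok), (halve, root_ok)]
print(acceptance_voting(16, versions))                                     # 4
print(n_self_checking(16, [(math.isqrt, float_root), (halve, math.isqrt)]))  # 4
```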

Certification Trails
- An improvement on 2-Version Programming
- The primary module leaves a certification trail (a trail of data from intermediate points in the computation)
- The secondary module uses the certification trail, so it can execute more quickly or have a simpler structure
- The outputs of the primary and the secondary are compared: if they agree, the result is accepted; otherwise ERROR
- Structure: Input -> Version 1 -> Output and certification trail; Input and certification trail -> Version 2 -> Output; both outputs -> Compare
- Vocabulary: certification trail = information from the middle of the computation
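A small sketch of the idea, using sorting as the task. This is an assumed example rather than one taken from the slides: the primary leaves the sorting permutation as its trail, and the secondary rebuilds the output from that trail in linear time while checking that the trail is a valid permutation and really orders the input. All function names are hypothetical.

```python
def primary_sort(xs):
    """Primary: sort and leave a certification trail (the index permutation)."""
    trail = sorted(range(len(xs)), key=lambda i: xs[i])
    return [xs[i] for i in trail], trail

def secondary_sort(xs, trail):
    """Secondary: rebuild the output from the trail, validating the trail so
    that a corrupted trail cannot lead to an accepted wrong result."""
    if len(trail) != len(xs):
        raise ValueError("trail has the wrong length")
    seen = [False] * len(xs)
    for i in trail:
        if not 0 <= i < len(xs) or seen[i]:
            raise ValueError("trail is not a permutation of the indices")
        seen[i] = True
    out = [xs[i] for i in trail]
    if any(a > b for a, b in zip(out, out[1:])):
        raise ValueError("trail does not order the input")
    return out

def certified_sort(xs):
    """Run both modules and accept the result only if their outputs agree."""
    out1, trail = primary_sort(xs)
    out2 = secondary_sort(xs, trail)
    if out1 != out2:
        raise RuntimeError("ERROR: primary and secondary disagree")
    return out1

print(certified_sort([3, 1, 2]))  # [1, 2, 3]
```

The secondary does no comparison-based sorting of its own, which is what makes it simpler and faster than a second full version.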