Slide 1: Software Fault Tolerance (SWFT) - Software Testing
Dependable Embedded Systems & SW Group
www.deeds.informatik.tu-darmstadt.de
Prof. Neeraj Suri, Constantin Sârbu
Dept. of Computer Science, TU Darmstadt, Germany
Slide 2: Fault Removal: Software Testing
So far: checkpointing, recovery blocks, NVP, NCP, microreboots, ...
- Verification & Validation
- Testing techniques: static vs. dynamic, black-box vs. white-box
Today: testing of dependable systems
- Modeling
- Fault injection (FI / SWIFI)
- Some existing tools for fault injection
Next 2 lectures: testing of operating systems
- Fault injection aspects in OSs (WHEN / WHAT to inject)
- Profiling the OS extensions (state change at runtime)
Slide 3: Why is PERFECT Testing Impossible?
For HW/OS/SW/protocols:
- Our fault/error models are speculative.
- Failure modes and the associated failure distributions are probabilistic.
- Sequences (# of data cascades, # of temporal links) do not follow any meaningful distribution.
- State space: fault classes only condense equivalent behavior states - nothing more.
- Lack of detail available: at which level do we model - processor, gate, device, transistor, VHDL?
- Fixing bugs often causes more bugs (bug re-injection).
- The cause of a bug matters more than the bug itself: a complex spec? a complex dependency?
How good are our system models?
Slide 4: Dependability Modeling
Simplex: R(t) = e^(-λt)
Series: R_sys = R1 · R2 · ... · Rn = e^(-(λ1 + λ2 + ... + λn)t), so MTTF = 1/λ_sys
[Figure: series chain of blocks R1, R2, R3, ..., Rn; parallel pair R1 || R2]
Parallel: R_sys = 1 - (1 - R1)(1 - R2)
Example (parallel): R1 = R2 = .98, so U1 = U2 = 1 - .98 = .02 (unreliability);
  U_sys = U1 · U2 = .0004, R_sys = 1 - U_sys = .9996
Example 1 (series): n = 5, R1 = ... = R5 = .98, R_sys ≈ .90
Example 2 (series): n = 10, R1 = ... = R10 = .98, R_sys ≈ .82
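A quick numerical check of the slide's examples - a minimal C sketch (mine, not the deck's; compile with -lm):

    #include <math.h>
    #include <stdio.h>

    /* Series system: R_sys is the product of the component reliabilities. */
    static double series(double r, int n)   { return pow(r, n); }

    /* Parallel system: it fails only if every component fails. */
    static double parallel(double r, int n) { return 1.0 - pow(1.0 - r, n); }

    int main(void) {
        printf("parallel, n=2,  R=.98: %.4f\n", parallel(0.98, 2));  /* .9996 */
        printf("series,   n=5,  R=.98: %.4f\n", series(0.98, 5));    /* ~.90  */
        printf("series,   n=10, R=.98: %.4f\n", series(0.98, 10));   /* ~.82  */
        return 0;
    }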
Slide 5: Dependability Modeling - TMR
TMR: is this a parallel system?
- Works as long as two units are fault-free
- Assumes independent faults
- Assumes a perfect voter
- No repair!
[Figure: TMR - three modules P1, P2, P3 feeding a majority voter that produces o/p]
Reliability: R_TMR = 3R² - 2R³. Where did this come from? (see the derivation below)
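Where the TMR formula comes from - standard combinatorics, assuming independent module failures, a perfect voter, and identical module reliability R (the derivation is implied but not spelled out on the slide):

    % TMR works if all three modules work, or exactly two of three work
    \begin{align*}
    R_{TMR} &= R^3 + \binom{3}{2} R^2 (1 - R) \\
            &= R^3 + 3R^2 - 3R^3 \\
            &= 3R^2 - 2R^3, \qquad R = R(t) = e^{-\lambda t}
    \end{align*}

Note that for R < 0.5 this is *below* the simplex reliability R: TMR helps for short mission times and hurts for long ones.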
Slide 6: Modeling (Markov)
[Figure: the TMR system (P1, P2, P3 with voter, output o/p) and its Markov chain: state 3 (all units working) -> state 2 (one unit failed) -> state F (system failed), with failure rates 3λ and 2λ.]
The first rate, 3λ, is roughly "the probability of one out of three failing".
... How about repair?
Slide 7: Modeling (Markov, with repair)
[Figure: the same Markov chain, states 3 -> 2 -> F with rates 3λ and 2λ, plus a repair transition at rate µ from state 2 back to state 3.]
Solving this system gives the mean time to failure. For λ = 0.001/h and µ = 0.1/h:
- Simplex: MTTF = 1/λ = 1000 h
- TMR without repair: MTTF = 5/(6λ) ≈ 833 h
- TMR with repair: MTTF = (5λ + µ)/(6λ²) = 17500 h
But: do we always have perfect detection? Can the system go directly from state 3 to F?
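A sketch of where those numbers come from, using first-step analysis on the chain (T_i, the expected time to failure starting from state i, is my notation, not the slide's):

    \begin{align*}
    T_3 &= \frac{1}{3\lambda} + T_2, \qquad
    T_2  = \frac{1}{2\lambda + \mu} + \frac{\mu}{2\lambda + \mu} T_3 \\
    \Rightarrow T_3 &= \frac{5\lambda + \mu}{6\lambda^2}
       = 17500\ \mathrm{h} \quad (\lambda = 0.001/\mathrm{h},\ \mu = 0.1/\mathrm{h}) \\
    \mu = 0 &: \quad T_3 = \frac{5}{6\lambda} \approx 833\ \mathrm{h}
       \qquad \text{(simplex: } 1/\lambda = 1000\ \mathrm{h})
    \end{align*}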
Slide 8: Coverage in Models
New structure: two-out-of-four.
[Figure: four modules P1, P2, P3, P4 feeding a voter (o/p), and the corresponding Markov chain for a 2-out-of-4 system: states 4 -> 3 -> 2 -> F.]
Slide 9: Coverage in Models (cont.)
New structure: two-out-of-four. We add the coverage factor C.
[Figure: the same 2-out-of-4 Markov chain, now with imperfect detection: each unit failure is handled with probability C, while with probability (1 - C) the system fails directly.]
A sketch of how C enters the chain follows below.
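One common way the coverage factor enters such a chain (a generic sketch; the slide's exact equations are not preserved in this transcript): a failure of one of the N working units is detected and handled with probability C, and crashes the system outright with probability 1 - C:

    % transitions out of a state with N working units (failure rate N*lambda)
    \begin{align*}
    N \xrightarrow{\,C \cdot N\lambda\,} N - 1,
    \qquad
    N \xrightarrow{\,(1 - C) \cdot N\lambda\,} F
    \end{align*}

With C < 1 the chain gains a direct edge into F from every operational state - exactly the "can the system go directly from 3 to F?" question raised on slide 7.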
Slide 10: Fault Injection in One Sentence
Experimental evaluation using fault injection is the process of analyzing a system's response to exceptional conditions by intentionally (and artificially) inserting abnormal states during normal operation and monitoring the reaction(s).
The brute-force approach for evaluating and validating the provisioning of dependability.
Slide 11: Faults -> Errors -> Failures
[Figure: state diagram of the fault -> error -> failure chain, running from "good" to "bad" states, with detection & recovery. Labeled transitions include: no faults, fault appears, fault activated, error activated, recovery failed, fault disappears, error overwritten, recovery incomplete, error detected, recovery successful. The slide marks where fault injection and error injection enter this chain.]
Slide 12: Basics of Fault Injection
- Where to apply the change (location; abstraction/system level)
- What to inject (what should be injected/corrupted?)
- Which trigger to use (event, instruction, timeout, exception, code mutation?)
- When to inject (corresponding to the type of fault)
- How often to inject (corresponding to the type of fault)
- ... and what to record and interpret - to what purpose?
- How is the system loaded at the time of the injection?
  - The applications running and their load (workload)
  - System resources
  - Real, realistic, or synthetic workload?
A minimal campaign descriptor capturing these choices is sketched below.
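These dimensions are exactly what a fault-injection campaign configuration has to pin down. An illustrative C sketch (all type and field names are hypothetical, not taken from any of the tools discussed later):

    #include <stdint.h>

    /* What kind of corruption to apply. */
    enum fi_what { FI_BITFLIP, FI_STUCK_AT_0, FI_STUCK_AT_1, FI_CODE_MUTATION };

    /* What fires the injection. */
    enum fi_trigger { FI_ON_EVENT, FI_ON_INSTRUCTION, FI_ON_TIMEOUT, FI_ON_EXCEPTION };

    /* One experiment in a campaign: where / what / which trigger / when / how often. */
    struct fi_experiment {
        uintptr_t       where;        /* target address (register, memory, code)   */
        enum fi_what    what;         /* corruption type                            */
        enum fi_trigger trigger;      /* activation condition                       */
        uint64_t        when_ns;      /* injection time relative to workload start */
        unsigned        repetitions;  /* how often to inject                        */
        const char     *workload;     /* real / realistic / synthetic load          */
    };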
Slide 13: Various FI Approaches
- Physical fault injection: EMI, radiation, ...
- Simulated fault injection: injections into the VHDL model
- Hardware fault injection: pin-level injection, scan chains
- Software-implemented fault injection (SWIFI): bit-flips, mutations; code and data segments; APIs, ...
Slide 14: Coverage and Latency
The aim is to find the characteristics of some Event X (X may be detection, recovery, etc.).
Coverage of Event X:
- The conditional probability of Event X occurring
- E.g., the probability of error detection given that an error exists in the system
Latency of Event X:
- The time from the earliest (theoretically) possible occurrence of Event X to the actual monitored occurrence
- E.g., the time from error occurrence to error detection
Slide 15: Estimating Metrics in FI
- Detection coverage = #detections / #injections
- Detection latency = mean(detection times)
- Recovery coverage = #recoveries / #detections
- Recovery latency = mean(recovery times)
A short sketch of computing these from an experiment log follows.
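A minimal C sketch of these estimators over a campaign log (the record layout is illustrative, not from any specific tool; note that recovery coverage is normalized by detections, as on the slide):

    #include <stdio.h>

    /* One injection experiment: flags plus observed latencies in microseconds
     * (a latency is meaningful only if the matching flag is set). */
    struct fi_record { int detected, recovered; double t_detect_us, t_recover_us; };

    static void estimate(const struct fi_record *log, int n) {
        int det = 0, rec = 0;
        double det_lat = 0.0, rec_lat = 0.0;
        for (int i = 0; i < n; i++) {
            if (log[i].detected)  { det++; det_lat += log[i].t_detect_us; }
            if (log[i].recovered) { rec++; rec_lat += log[i].t_recover_us; }
        }
        printf("detection coverage: %g\n", (double)det / n);
        printf("detection latency:  %g us\n", det ? det_lat / det : 0.0);
        printf("recovery coverage:  %g\n", det ? (double)rec / det : 0.0);
        printf("recovery latency:   %g us\n", rec ? rec_lat / rec : 0.0);
    }

    int main(void) {
        struct fi_record log[] = {
            {1, 1, 12.0, 40.0}, {1, 0, 20.0, 0.0}, {0, 0, 0.0, 0.0},
        };
        estimate(log, 3);
        return 0;
    }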
Slide 16: Physical Fault Injection
Reproduce extreme environmental conditions: EMI, radiation, heat, shock, voltage drops/spikes, etc.
Advantages:
- "Real" faults
- Tangible, simple "test cases"
Disadvantages:
- Difficult to control/repeat
- Needs at least a prototype
Slide 17: Simulation-based Fault Injection
Uses a model of the system: VHDL, MATLAB, SystemC, Spice.
Advantages:
- Usable during design
- Controllable
Disadvantages:
- Requires a model
- Model accuracy?
- Slow
Slide 18: Simulated Fault Injection
Fault injection at three abstraction levels:
- Electrical level: change current, change voltage
  (targets electrical circuits, modeling the physical process)
- Logical level: stuck-at 0 or 1, inverted fault
  (targets logic gates, modeling logic operations)
- Functional level: change a CPU register, flip memory bits, etc.
  (targets functional units)
Slide 19: Hardware-based Fault Injection
Inject faults using hardware (similar to physical injection): pin-level injection, scan chains.
Advantages:
- Controllable
- Close to "real" faults
Disadvantages:
- Requires special equipment
- Reachability?
Slide 20: SoftWare Implemented Fault Injection (SWIFI)
Manipulate bits in memory locations and registers:
- Emulation of HW faults
Change the text segment of processes:
- Emulation of SW faults (bugs, defects)
- Dynamic: e.g., op-code switch during operation
- Static: change the source code and recompile (a.k.a. mutation)
A minimal bit-flip sketch is shown below.
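To make "manipulate bits" concrete, a deliberately minimal sketch of a SWIFI-style bit-flip (my toy example: a real tool would corrupt another process via debug/ptrace interfaces, while this self-contained program only corrupts its own state):

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Flip one bit at the given address - the core SWIFI primitive. */
    static void inject_bitflip(void *target, unsigned bit) {
        uint8_t *bytes = (uint8_t *)target;
        bytes[bit / 8] ^= (uint8_t)(1u << (bit % 8));
    }

    int main(void) {
        uint32_t sensor_value = 1000;       /* stand-in for application state */
        srand((unsigned)time(NULL));
        unsigned bit = (unsigned)(rand() % 32);  /* random fault location */

        printf("before: %u\n", sensor_value);
        inject_bitflip(&sensor_value, bit);      /* inject the transient fault */
        printf("after flipping bit %u: %u\n", bit, sensor_value);
        return 0;
    }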
Slide 21: SWIFI - Pros and Cons
PROS:
- No special hardware instrumentation
- Inexpensive and easy to control
- High observability (down to individual variables)
CONS:
- Can only inject into locations accessible to software
- Instrumentation may disturb the workload
- Difficult to observe short-latency faults
Open questions:
- Is the injected fault representative of a "real" fault?
- Is the emulated/simulated environment (operations, load, tests) representative of the real system?
Slide 22: A Generic View of SWIFI Tools
[Figure: generic SWIFI tool architecture. A controller drives a stimuli generator and an injector, both acting on the target; a monitor/data collector gathers readouts, which feed a data analyzer; the setup configures the experiment.]
Slide 23: Many Tools Available
- DEPEND, MEFISTO: evaluating HW/SW architectures using simulation
- FERRARI, DOCTOR, RIFLE, Xception: evaluate tolerance against HW faults
- DEFINE, FIAT, FTAPE: evaluate tolerance against HW and SW faults
- MAFALDA, NFTAPE, PROPANE: evaluate effects of HW & SW faults and analyze error propagation
- Ballista: OS robustness testing
Slide 24: DEPEND and MEFISTO
Evaluation of system architectures, e.g., validating TMR recovery protocols, synchronization protocols, etc. Both simulate the system and its components in SW.
- DEPEND uses object-oriented design for flexibility; it models a system, its interactions, and its FTMs (fault-tolerance mechanisms)
- MEFISTO uses VHDL; testing of FTMs; support for HW-based FI (validating fault models)
Slide 25: FERRARI, DOCTOR and Xception
Evaluate the system-level effects of HW faults using SWIFI, e.g., bit errors in registers, address bus errors, etc.
FERRARI (Fault and ERRor Automatic Real-time Injector):
- Injects errors while applications are running; compares against a golden run
- Targets: registers, PC, instruction type, branches and condition codes (CC)
DOCTOR:
- Injects CPU, memory and network faults
- Uses timeouts, traps and code mutations
- Used on distributed real-time systems
Xception (example on the next slides):
- Uses the debugging facilities of CPUs
Slide 26: Xception
Goal: SWIFI using HW debugging support.
- Minimizes intrusion by using the debugging interfaces
- Many fault triggers
- Detailed performance monitoring can be used
- Can affect any SW process (including the kernel)
- No source code needed
[Figure: Xception architecture. In user space, an experiment manager module drives the fault setup and collects outputs, logs and results into a fault archive; the injector sits in kernel space and applies faults to the target application.]
Slide 27: Xception's Fault Model
- Duration: transient
- Location: components inside the processor - integer unit, FPU, MMU, buses, registers, branch processing
- Trigger: temporal; opcode fetch; operand load/store
- Types: bit-flips, with masks based on register/bus/memory sizes (e.g., 32 bits)
Slide 28: Xception - Data to Collect
- Fault information
- System state information (instruction pointer, etc.)
- Kernel and application deviations: kernel error codes, output of the applications (workload)
- Error detection status
- Performance monitoring information
Slide 29: Xception - Results
[Figure: results for a 4-node parallel computer running a Linda π-calculation benchmark. © J. Carreira et al., IEEE TSE 24(2), 1998]
[Figure: results for a 4-node parallel computer running a Linda matrix-multiplication benchmark (with an FT algorithm). © J. Carreira et al., IEEE TSE 24(2), 1998]
Slide 30: DEFINE, FIAT and FTAPE
Evaluate the system-level effects of HW and SW faults, e.g., bit errors in data, and code defects.
DEFINE:
- HW and SW faults for distributed systems: memory, CPU, buses and communication channels
- Synthetic workload
- Studied the impact of missing/corrupted messages and of client failures
FIAT (Fault Injection Automated Testing):
- Measures the impact on workload applications
- Bit-level errors in the target workload
- Limited fault manifestations
Slide 31: MAFALDA, NFTAPE and PROPANE
Evaluate the effects of HW and SW faults and analyze error propagation, from the system level down to the variable level. These need instrumentation, but no HW support.
- MAFALDA focuses on micro-kernels: bit-flips in memory/data and at the APIs
- NFTAPE tries to do everything in one tool!
- PROPANE is purely software
Slide 32: Instrumentation Example (PROPANE)

Original code:

    double spherical_volume( double radius )
    {
      double volume;
      volume = 4.0 * (PI * pow(radius, 3.0)) / 3.0;
      return volume;
    }

Instrumented code:

    double spherical_volume( double radius )
    {
      double volume;

      /* Injection location for radius */
      propane_inject( IL_SPHERE_VOL, &radius, PROPANE_DOUBLE );

      /* Probe the value of radius */
      propane_log_var( P_RADIUS, &radius );

      volume = 4.0 * (PI * pow(radius, 3.0)) / 3.0;

      /* Probe the value of volume */
      propane_log_var( P_VOLUME, &volume );

      return volume;
    }
Slide 33: PROPANE
PROPANE = PROPagation ANalysis Environment
[Figure: error-propagation graph of a target system, shaded from highest to lowest error rate, over signals/variables such as ms_slot_nbr, i, mscnt, pulscnt, slow_speed, stopped, IsValue, OutValue, SetValue, TOC2, ADC, TCNT, TIC1, PACNT, CLOCK and modules PRES_S, V_REG, PRES_A, CALC, DIST_S.]
Slide 34: Code Mutations
Idea: try to simulate real faults in binary code.
1. Search real SW for faults
2. Identify the fault patterns in the binaries
3. Inject those patterns into your SW
A toy illustration of such a mutation follows.
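As a toy illustration (my example, not from the deck): a classic real-world fault pattern is the off-by-one comparison. A mutation-based injector plants exactly this kind of defect; it is shown here at source level for readability, although steps 2 and 3 above operate on the binary:

    #include <stdio.h>

    #define N 4

    /* Correct version: sums buf[0] .. buf[N-1]. */
    int sum_ok(const int *buf)      { int s = 0; for (int i = 0; i <  N; i++) s += buf[i]; return s; }

    /* Mutated version: "<" replaced by "<=" - the classic off-by-one
     * fault pattern, reading one element past the intended range. */
    int sum_mutated(const int *buf) { int s = 0; for (int i = 0; i <= N; i++) s += buf[i]; return s; }

    int main(void) {
        int buf[N + 1] = {1, 2, 3, 4, 99};  /* extra slot makes the overread visible */
        printf("correct: %d, mutated: %d\n", sum_ok(buf), sum_mutated(buf));
        return 0;
    }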
Slide 35: When Do I Use Approach X?

  Study                                    | Main tools
  -----------------------------------------|--------------------------------------------------
  Architecture & high-level FI mechanisms  | DEPEND, Loki
  Low-level FI mechanisms                  | All (except perhaps DEPEND, Loki)
  OS robustness                            | FERRARI, DEFINE (both for UNIX), MAFALDA (for kernels), Ballista
  Propagation analysis                     | NFTAPE, PROPANE
Slide 36: Fault Injection
Fault injection is experimental: it provides a statistical basis for establishing a desired level of confidence in the system. Keep in mind that:
a) the statistical basis does not always apply to real systems, especially SW;
b) a statistically significant number of injections has little meaning if (a) applies;
c) the injected fault is NOT the real fault.
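For the "desired level of confidence" part, the standard binomial estimate (textbook statistics, not from the slide): with k detections in n injections,

    \hat{c} = \frac{k}{n},
    \qquad
    \hat{c} \pm z_{\alpha/2} \sqrt{\frac{\hat{c}\,(1 - \hat{c})}{n}}

gives the coverage estimate and its approximate (1 - α) confidence interval. Caveat (a) above warns that the independent-sampling assumption behind this formula often does not hold for SW faults.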
Slide 37: More Information
- Iyer R., Tang D., "Experimental Analysis of Computer System Dependability", Chapter 5 in D. Pradhan (ed.), Fault-Tolerant Computer System Design, 1996
- www.deeds.informatik.tu-darmstadt.de (check the papers on EPIC and PROPANE, and M. Hiller's PhD thesis)