Slide 1: Software Fault Tolerance (SWFT) - Software Testing
Dependable Embedded Systems & SW Group
www.deeds.informatik.tu-darmstadt.de
Prof. Neeraj Suri, Constantin Sârbu
Dept. of Computer Science, TU Darmstadt, Germany
Slide 2: Fault Removal: Software Testing
So far: checkpointing, recovery blocks, NVP, NCP, microreboots, ...
- Verification & Validation
- Testing techniques: static vs. dynamic, black-box vs. white-box
Today: testing of dependable systems
- Modeling
- Fault injection (FI / SWIFI)
- Some existing tools for fault injection
Next 2 lectures: testing of operating systems
- Fault injection aspects in OSs (WHEN / WHAT to inject)
- Profiling the OS extensions (state change at runtime)
Slide 3: Why is PERFECT Testing Impossible?
For HW/OS/SW/protocols:
- Our fault/error models are speculative.
- Failure modes and the associated failure distributions are probabilistic.
- Sequences (# of data cascades, # of temporal links) do not follow any meaningful distribution.
- State space: fault classes only condense equivalent behavior states - nothing more.
- Lack of detail available: at which level do we model - processor, gate, device, transistor, VHDL?
- Fixing bugs often causes more bugs (bug re-injection).
- The cause of a bug matters more than the bug itself: a complex spec? a complex dependency?
How good are our system models?
Slide 4: Dependability Modeling
Simplex: R(t) = e^(-λt)
Series: R_sys = R1 · R2 · ... · Rn = e^(-(λ1 + λ2 + ... + λn)t), so MTTF = 1/λ_sys
[Figure: series chain of blocks R1, R2, R3, ..., Rn; parallel pair R1 || R2]
Parallel: R_sys = 1 - (1 - R1)(1 - R2)
Example (parallel): R1 = R2 = .98, so U1 = U2 = 1 - .98 = .02 (unreliability);
  U_sys = U1 · U2 = .0004, R_sys = 1 - U_sys = .9996
Example 1 (series): n = 5, R1 = ... = R5 = .98, R_sys ≈ .90
Example 2 (series): n = 10, R1 = ... = R10 = .98, R_sys ≈ .82
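A quick numerical check of the slide's examples - a minimal C sketch (mine, not the deck's; compile with -lm):

    #include <math.h>
    #include <stdio.h>

    /* Series system: R_sys is the product of the component reliabilities. */
    static double series(double r, int n)   { return pow(r, n); }

    /* Parallel system: it fails only if every component fails. */
    static double parallel(double r, int n) { return 1.0 - pow(1.0 - r, n); }

    int main(void) {
        printf("parallel, n=2,  R=.98: %.4f\n", parallel(0.98, 2));  /* .9996 */
        printf("series,   n=5,  R=.98: %.4f\n", series(0.98, 5));    /* ~.90  */
        printf("series,   n=10, R=.98: %.4f\n", series(0.98, 10));   /* ~.82  */
        return 0;
    }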
Slide 5: Dependability Modeling - TMR
TMR: is this a parallel system?
- Works as long as two units are fault-free
- Assumes independent faults
- Assumes a perfect voter
- No repair!
[Figure: TMR - three modules P1, P2, P3 feeding a majority voter that produces o/p]
Reliability: R_TMR = 3R² - 2R³. Where did this come from? (see the derivation below)
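Where the TMR formula comes from - standard combinatorics, assuming independent module failures, a perfect voter, and identical module reliability R (the derivation is implied but not spelled out on the slide):

    % TMR works if all three modules work, or exactly two of three work
    \begin{align*}
    R_{TMR} &= R^3 + \binom{3}{2} R^2 (1 - R) \\
            &= R^3 + 3R^2 - 3R^3 \\
            &= 3R^2 - 2R^3, \qquad R = R(t) = e^{-\lambda t}
    \end{align*}

Note that for R < 0.5 this is *below* the simplex reliability R: TMR helps for short mission times and hurts for long ones.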
Slide 6: Modeling (Markov)
[Figure: the TMR system (P1, P2, P3 with voter, output o/p) and its Markov chain: state 3 (all units working) -> state 2 (one unit failed) -> state F (system failed), with failure rates 3λ and 2λ.]
The first rate, 3λ, is roughly "the probability of one out of three failing".
... How about repair?
Slide 7: Modeling (Markov, with repair)
[Figure: the same Markov chain, states 3 -> 2 -> F with rates 3λ and 2λ, plus a repair transition at rate µ from state 2 back to state 3.]
Solving this system gives the mean time to failure. For λ = 0.001/h and µ = 0.1/h:
- Simplex: MTTF = 1/λ = 1000 h
- TMR without repair: MTTF = 5/(6λ) ≈ 833 h
- TMR with repair: MTTF = (5λ + µ)/(6λ²) = 17500 h
But: do we always have perfect detection? Can the system go directly from state 3 to F?
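A sketch of where those numbers come from, using first-step analysis on the chain (T_i, the expected time to failure starting from state i, is my notation, not the slide's):

    \begin{align*}
    T_3 &= \frac{1}{3\lambda} + T_2, \qquad
    T_2  = \frac{1}{2\lambda + \mu} + \frac{\mu}{2\lambda + \mu} T_3 \\
    \Rightarrow T_3 &= \frac{5\lambda + \mu}{6\lambda^2}
       = 17500\ \mathrm{h} \quad (\lambda = 0.001/\mathrm{h},\ \mu = 0.1/\mathrm{h}) \\
    \mu = 0 &: \quad T_3 = \frac{5}{6\lambda} \approx 833\ \mathrm{h}
       \qquad \text{(simplex: } 1/\lambda = 1000\ \mathrm{h})
    \end{align*}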
Slide 8: Coverage in Models
New structure: two-out-of-four.
[Figure: four modules P1, P2, P3, P4 feeding a voter (o/p), and the corresponding Markov chain for a 2-out-of-4 system: states 4 -> 3 -> 2 -> F.]
Slide 9: Coverage in Models (cont.)
New structure: two-out-of-four. We add the coverage factor C.
[Figure: the same 2-out-of-4 Markov chain, now with imperfect detection: each unit failure is handled with probability C, while with probability (1 - C) the system fails directly.]
A sketch of how C enters the chain follows below.
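One common way the coverage factor enters such a chain (a generic sketch; the slide's exact equations are not preserved in this transcript): a failure of one of the N working units is detected and handled with probability C, and crashes the system outright with probability 1 - C:

    % transitions out of a state with N working units (failure rate N*lambda)
    \begin{align*}
    N \xrightarrow{\,C \cdot N\lambda\,} N - 1,
    \qquad
    N \xrightarrow{\,(1 - C) \cdot N\lambda\,} F
    \end{align*}

With C < 1 the chain gains a direct edge into F from every operational state - exactly the "can the system go directly from 3 to F?" question raised on slide 7.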
Slide 10: Fault Injection in One Sentence
Experimental evaluation using fault injection is the process of analyzing a system's response to exceptional conditions by intentionally (and artificially) inserting abnormal states during normal operation and monitoring the reaction(s).
The brute-force approach for evaluating and validating the provisioning of dependability.
Slide 11: Faults -> Errors -> Failures
[Figure: state diagram of the fault -> error -> failure chain, running from "good" to "bad" states, with detection & recovery. Labeled transitions include: no faults, fault appears, fault activated, error activated, recovery failed, fault disappears, error overwritten, recovery incomplete, error detected, recovery successful. The slide marks where fault injection and error injection enter this chain.]
Slide 12: Basics of Fault Injection
- Where to apply the change (location; abstraction/system level)
- What to inject (what should be injected/corrupted?)
- Which trigger to use (event, instruction, timeout, exception, code mutation?)
- When to inject (corresponding to the type of fault)
- How often to inject (corresponding to the type of fault)
- ... and what to record and interpret - to what purpose?
- How is the system loaded at the time of the injection?
  - The applications running and their load (workload)
  - System resources
  - Real, realistic, or synthetic workload?
A minimal campaign descriptor capturing these choices is sketched below.
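These dimensions are exactly what a fault-injection campaign configuration has to pin down. An illustrative C sketch (all type and field names are hypothetical, not taken from any of the tools discussed later):

    #include <stdint.h>

    /* What kind of corruption to apply. */
    enum fi_what { FI_BITFLIP, FI_STUCK_AT_0, FI_STUCK_AT_1, FI_CODE_MUTATION };

    /* What fires the injection. */
    enum fi_trigger { FI_ON_EVENT, FI_ON_INSTRUCTION, FI_ON_TIMEOUT, FI_ON_EXCEPTION };

    /* One experiment in a campaign: where / what / which trigger / when / how often. */
    struct fi_experiment {
        uintptr_t       where;        /* target address (register, memory, code)   */
        enum fi_what    what;         /* corruption type                            */
        enum fi_trigger trigger;      /* activation condition                       */
        uint64_t        when_ns;      /* injection time relative to workload start */
        unsigned        repetitions;  /* how often to inject                        */
        const char     *workload;     /* real / realistic / synthetic load          */
    };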
Slide 13: Various FI Approaches
- Physical fault injection: EMI, radiation, ...
- Simulated fault injection: injections into the VHDL model
- Hardware fault injection: pin-level injection, scan chains
- Software-implemented fault injection (SWIFI): bit-flips, mutations; code and data segments; APIs, ...
Slide 14: Coverage and Latency
The aim is to find the characteristics of some Event X (X may be detection, recovery, etc.).
Coverage of Event X:
- The conditional probability of Event X occurring
- E.g., the probability of error detection given that an error exists in the system
Latency of Event X:
- The time from the earliest (theoretically) possible occurrence of Event X to the actual monitored occurrence
- E.g., the time from error occurrence to error detection
Slide 15: Estimating Metrics in FI
- Detection coverage = #detections / #injections
- Detection latency = mean(detection times)
- Recovery coverage = #recoveries / #detections
- Recovery latency = mean(recovery times)
A short sketch of computing these from an experiment log follows.
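A minimal C sketch of these estimators over a campaign log (the record layout is illustrative, not from any specific tool; note that recovery coverage is normalized by detections, as on the slide):

    #include <stdio.h>

    /* One injection experiment: flags plus observed latencies in microseconds
     * (a latency is meaningful only if the matching flag is set). */
    struct fi_record { int detected, recovered; double t_detect_us, t_recover_us; };

    static void estimate(const struct fi_record *log, int n) {
        int det = 0, rec = 0;
        double det_lat = 0.0, rec_lat = 0.0;
        for (int i = 0; i < n; i++) {
            if (log[i].detected)  { det++; det_lat += log[i].t_detect_us; }
            if (log[i].recovered) { rec++; rec_lat += log[i].t_recover_us; }
        }
        printf("detection coverage: %g\n", (double)det / n);
        printf("detection latency:  %g us\n", det ? det_lat / det : 0.0);
        printf("recovery coverage:  %g\n", det ? (double)rec / det : 0.0);
        printf("recovery latency:   %g us\n", rec ? rec_lat / rec : 0.0);
    }

    int main(void) {
        struct fi_record log[] = {
            {1, 1, 12.0, 40.0}, {1, 0, 20.0, 0.0}, {0, 0, 0.0, 0.0},
        };
        estimate(log, 3);
        return 0;
    }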
Slide 16: Physical Fault Injection
Reproduce extreme environmental conditions: EMI, radiation, heat, shock, voltage drops/spikes, etc.
Advantages:
- "Real" faults
- Tangible, simple "test cases"
Disadvantages:
- Difficult to control/repeat
- Needs at least a prototype
Slide 17: Simulation-based Fault Injection
Uses a model of the system: VHDL, MATLAB, SystemC, Spice.
Advantages:
- Usable during design
- Controllable
Disadvantages:
- Requires a model
- Model accuracy?
- Slow
Slide 18: Simulated Fault Injection
Fault injection at three abstraction levels:
- Electrical level: change current, change voltage
  (targets electrical circuits, modeling the physical process)
- Logical level: stuck-at 0 or 1, inverted fault
  (targets logic gates, modeling logic operations)
- Functional level: change a CPU register, flip memory bits, etc.
  (targets functional units)
Slide 19: Hardware-based Fault Injection
Inject faults using hardware (similar to physical injection): pin-level injection, scan chains.
Advantages:
- Controllable
- Close to "real" faults
Disadvantages:
- Requires special equipment
- Reachability?
Slide 20: SoftWare Implemented Fault Injection (SWIFI)
Manipulate bits in memory locations and registers:
- Emulation of HW faults
Change the text segment of processes:
- Emulation of SW faults (bugs, defects)
- Dynamic: e.g., op-code switch during operation
- Static: change the source code and recompile (a.k.a. mutation)
A minimal bit-flip sketch is shown below.
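To make "manipulate bits" concrete, a deliberately minimal sketch of a SWIFI-style bit-flip (my toy example: a real tool would corrupt another process via debug/ptrace interfaces, while this self-contained program only corrupts its own state):

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Flip one bit at the given address - the core SWIFI primitive. */
    static void inject_bitflip(void *target, unsigned bit) {
        uint8_t *bytes = (uint8_t *)target;
        bytes[bit / 8] ^= (uint8_t)(1u << (bit % 8));
    }

    int main(void) {
        uint32_t sensor_value = 1000;       /* stand-in for application state */
        srand((unsigned)time(NULL));
        unsigned bit = (unsigned)(rand() % 32);  /* random fault location */

        printf("before: %u\n", sensor_value);
        inject_bitflip(&sensor_value, bit);      /* inject the transient fault */
        printf("after flipping bit %u: %u\n", bit, sensor_value);
        return 0;
    }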
Slide 21: SWIFI - Pros and Cons
PROS:
- No special hardware instrumentation
- Inexpensive and easy to control
- High observability (down to individual variables)
CONS:
- Can only inject into locations accessible to software
- Instrumentation may disturb the workload
- Difficult to observe short-latency faults
Open questions:
- Is the injected fault representative of a "real" fault?
- Is the emulated/simulated environment (operations, load, tests) representative of the real system?
Slide 22: A Generic View of SWIFI Tools
[Figure: generic SWIFI tool architecture. A controller drives a stimuli generator and an injector, both acting on the target; a monitor/data collector gathers readouts, which feed a data analyzer; the setup configures the experiment.]
Slide 23: Many Tools Available
- DEPEND, MEFISTO: evaluating HW/SW architectures using simulation
- FERRARI, DOCTOR, RIFLE, Xception: evaluate tolerance against HW faults
- DEFINE, FIAT, FTAPE: evaluate tolerance against HW and SW faults
- MAFALDA, NFTAPE, PROPANE: evaluate effects of HW & SW faults and analyze error propagation
- Ballista: OS robustness testing
Slide 24: DEPEND and MEFISTO
Evaluation of system architectures, e.g., validating TMR recovery protocols, synchronization protocols, etc. Both simulate the system and its components in SW.
- DEPEND uses object-oriented design for flexibility; it models a system, its interactions, and its FTMs (fault-tolerance mechanisms)
- MEFISTO uses VHDL; testing of FTMs; support for HW-based FI (validating fault models)
Slide 25: FERRARI, DOCTOR and Xception
Evaluate the system-level effects of HW faults using SWIFI, e.g., bit errors in registers, address bus errors, etc.
FERRARI (Fault and ERRor Automatic Real-time Injector):
- Injects errors while applications are running; compares against a golden run
- Targets: registers, PC, instruction type, branches and condition codes (CC)
DOCTOR:
- Injects CPU, memory and network faults
- Uses timeouts, traps and code mutations
- Used on distributed real-time systems
Xception (example on the next slides):
- Uses the debugging facilities of CPUs
Slide 26: Xception
Goal: SWIFI using HW debugging support.
- Minimizes intrusion by using the debugging interfaces
- Many fault triggers
- Detailed performance monitoring can be used
- Can affect any SW process (including the kernel)
- No source code needed
[Figure: Xception architecture. In user space, an experiment manager module drives the fault setup and collects outputs, logs and results into a fault archive; the injector sits in kernel space and applies faults to the target application.]
Slide 27: Xception's Fault Model
- Duration: transient
- Location: components inside the processor - integer unit, FPU, MMU, buses, registers, branch processing
- Trigger: temporal; opcode fetch; operand load/store
- Types: bit-flips, with masks based on register/bus/memory sizes (e.g., 32 bits)
Slide 28: Xception - Data to Collect
- Fault information
- System state information (instruction pointer, etc.)
- Kernel and application deviations: kernel error codes, output of the applications (workload)
- Error detection status
- Performance monitoring information
Slide 29: Xception - Results
[Figure: results for a 4-node parallel computer running a Linda π-calculation benchmark. © J. Carreira et al., IEEE TSE 24(2), 1998]
[Figure: results for a 4-node parallel computer running a Linda matrix-multiplication benchmark (with an FT algorithm). © J. Carreira et al., IEEE TSE 24(2), 1998]
Slide 30: DEFINE, FIAT and FTAPE
Evaluate the system-level effects of HW and SW faults, e.g., bit errors in data, and code defects.
DEFINE:
- HW and SW faults for distributed systems: memory, CPU, buses and communication channels
- Synthetic workload
- Studied the impact of missing/corrupted messages and of client failures
FIAT (Fault Injection Automated Testing):
- Measures the impact on workload applications
- Bit-level errors in the target workload
- Limited fault manifestations
Slide 31: MAFALDA, NFTAPE and PROPANE
Evaluate the effects of HW and SW faults and analyze error propagation, from the system level down to the variable level. These need instrumentation, but no HW support.
- MAFALDA focuses on micro-kernels: bit-flips in memory/data and at the APIs
- NFTAPE tries to do everything in one tool!
- PROPANE is purely software
Slide 32: Instrumentation Example (PROPANE)

Original code:

    double spherical_volume( double radius )
    {
      double volume;
      volume = 4.0 * (PI * pow(radius, 3.0)) / 3.0;
      return volume;
    }

Instrumented code:

    double spherical_volume( double radius )
    {
      double volume;

      /* Injection location for radius */
      propane_inject( IL_SPHERE_VOL, &radius, PROPANE_DOUBLE );

      /* Probe the value of radius */
      propane_log_var( P_RADIUS, &radius );

      volume = 4.0 * (PI * pow(radius, 3.0)) / 3.0;

      /* Probe the value of volume */
      propane_log_var( P_VOLUME, &volume );

      return volume;
    }
Slide 33: PROPANE
PROPANE = PROPagation ANalysis Environment
[Figure: error-propagation graph of a target system, shaded from highest to lowest error rate, over signals/variables such as ms_slot_nbr, i, mscnt, pulscnt, slow_speed, stopped, IsValue, OutValue, SetValue, TOC2, ADC, TCNT, TIC1, PACNT, CLOCK and modules PRES_S, V_REG, PRES_A, CALC, DIST_S.]
Slide 34: Code Mutations
Idea: try to simulate real faults in binary code.
1. Search real SW for faults
2. Identify the fault patterns in the binaries
3. Inject those patterns into your SW
A toy illustration of such a mutation follows.
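As a toy illustration (my example, not from the deck): a classic real-world fault pattern is the off-by-one comparison. A mutation-based injector plants exactly this kind of defect; it is shown here at source level for readability, although steps 2 and 3 above operate on the binary:

    #include <stdio.h>

    #define N 4

    /* Correct version: sums buf[0] .. buf[N-1]. */
    int sum_ok(const int *buf)      { int s = 0; for (int i = 0; i <  N; i++) s += buf[i]; return s; }

    /* Mutated version: "<" replaced by "<=" - the classic off-by-one
     * fault pattern, reading one element past the intended range. */
    int sum_mutated(const int *buf) { int s = 0; for (int i = 0; i <= N; i++) s += buf[i]; return s; }

    int main(void) {
        int buf[N + 1] = {1, 2, 3, 4, 99};  /* extra slot makes the overread visible */
        printf("correct: %d, mutated: %d\n", sum_ok(buf), sum_mutated(buf));
        return 0;
    }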
Slide 35: When Do I Use Approach X?

  Study                                    | Main tools
  -----------------------------------------|--------------------------------------------------
  Architecture & high-level FI mechanisms  | DEPEND, Loki
  Low-level FI mechanisms                  | All (except perhaps DEPEND, Loki)
  OS robustness                            | FERRARI, DEFINE (both for UNIX), MAFALDA (for kernels), Ballista
  Propagation analysis                     | NFTAPE, PROPANE
Slide 36: Fault Injection
Fault injection is experimental: it provides a statistical basis for establishing a desired level of confidence in the system. Keep in mind that:
a) the statistical basis does not always apply to real systems, especially SW;
b) a statistically significant number of injections has little meaning if (a) applies;
c) the injected fault is NOT the real fault.
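For the "desired level of confidence" part, the standard binomial estimate (textbook statistics, not from the slide): with k detections in n injections,

    \hat{c} = \frac{k}{n},
    \qquad
    \hat{c} \pm z_{\alpha/2} \sqrt{\frac{\hat{c}\,(1 - \hat{c})}{n}}

gives the coverage estimate and its approximate (1 - α) confidence interval. Caveat (a) above warns that the independent-sampling assumption behind this formula often does not hold for SW faults.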
Slide 37: More Information
- Iyer R., Tang D., "Experimental Analysis of Computer System Dependability", Chapter 5 in D. Pradhan (ed.), Fault-Tolerant Computer System Design, 1996
- www.deeds.informatik.tu-darmstadt.de (check the papers on EPIC and PROPANE, and M. Hiller's PhD thesis)