Developing Dependable Systems by Maximizing Component Diversity and Fault Tolerance Jeff Tian, Suku Nair, LiGuo Huang, Nasser Alaeddine and Michael Siok.

1 Developing Dependable Systems by Maximizing Component Diversity and Fault Tolerance
Jeff Tian, Suku Nair, LiGuo Huang, Nasser Alaeddine and Michael Siok
Southern Methodist University
US/UK Workshop on Network-Centric Operation and Network Enabled Capability, Washington, D.C., July 24-25, 2008

2 Outline
- Overall Framework
- External Environment Profiling
- Component Dependability:
  - Direct Measurement and Assessment
  - Indirect Assessment via Internal Contributor Mapping
  - Value Perspective
- Experimental Evaluation
  - Fault Injection for Reliability and Fault Tolerance
  - Security Threat Simulation
- Summary and Future Work
7/24/2008 US/UK NCO/NEC Workshop

3 Overall Framework
- Systems are made up of different components
- Many factors contribute to system dependability
- Our focus: Diversity of individual components
- Component strength/weakness/diversity:
  - Target: Different dependability attributes and sub-attributes
  - External reference: Operational profile (OP)
  - Internal assessment: Contributors to dependability
  - Value perspective: Relative importance and trade-off
- Maximize diversity => Maximize dependability
  - Combine strengths
  - Avoid/complement/tolerate flaws/weaknesses

4 Overall Framework (2)
- Diversity: Four Perspectives
  - Environmental perspective: Operational profile (OP)
  - Target perspective: Goal, requirement
  - Internal contributor perspective: Internal characteristics
  - Value perspective: Customer
- Achieving diversity and fault tolerance:
  - Component evaluation matrix per target per OP
  - Multidimensional evaluation/composition via DEA (Data Envelopment Analysis)
  - Internal contributor to dependability mapping
  - Value-based evaluation using a single objective function

5 Terminology
- Quality and dependability are typically defined in terms of conformance to customer’s expectations and requirements
- Key concepts: defect, failure, fault, and error
- Dependability: the focus in this presentation
  - Key attributes: reliability, security, etc.
- Defect = some problem with the software, either with its external behavior or with its internal characteristics

6 Failure, Fault, Error
- IEEE STD 610.12 terms related to defect:
  - Failure: The inability of a system or component to perform its required functions within specified requirements
  - Fault: An incorrect step, process, or data definition in a computer program
  - Error: A human action that produces an incorrect result
- Errors may cause faults to be injected into the software
- Faults may cause failures when the software is executed

7 Reliability and Other Dependability Attributes
- Software reliability = the probability of failure-free operation of a program for a specified time under a specified set of operating conditions (Lyu, 1995; Musa et al., 1987)
- Estimated according to various models based on defect and time/input measurements
- Standard definitions for other dependability attributes, such as security, fault tolerance, availability, etc.
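
As a rough illustration of "various models based on defect and time/input measurements", the sketch below shows one input-domain estimate (the Nelson model) and one time-domain estimate (constant failure rate). The slide does not name a specific model, so both choices, the function names, and the sample numbers are assumptions made for illustration.

```python
import math


def nelson_reliability(failures: int, runs: int) -> float:
    """Input-domain estimate: fraction of sampled runs that did not fail."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    return 1.0 - failures / runs


def exponential_reliability(failure_rate: float, duration: float) -> float:
    """Time-domain estimate, assuming a constant failure rate over the period."""
    return math.exp(-failure_rate * duration)


# Hypothetical measurements, not taken from the presentation.
print(nelson_reliability(failures=3, runs=1000))
print(exponential_reliability(failure_rate=0.001, duration=100.0))
```

In both cases the runs or the time period are assumed to be sampled according to the operational profile, which is why the definition above is tied to a specified set of operating conditions.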

9 Diversity: Environmental Perspective
- Dependability defined for a specific environment
- Stationary vs dynamic usage environments
  - Static, uniform, or stationary (reached an equilibrium)
  - Dynamic, changing, evolving, with possible unanticipated changes or disturbances
- Single/overall OP for former category
  - Musa or Markov variation
  - Single evaluation result possible per component per dependability attribute: e.g., component reliability R(i)
- Environment Profiling for Individual Components
  - Environmental snapshots captured in Musa or Markov OPs
  - Evaluation matrix (later)

10 Operational Profile (OP)
- An operational profile (OP) is a list of disjoint operations and their associated probabilities of occurrence (Musa 1998)
- An OP describes how users use an application:
  - Helps guide the allocation of test cases in accordance with use
  - Ensures that the most frequent operations receive more testing
  - Serves as the context for realistic reliability evaluation
  - Other usages, including diversity and internal-external mapping in this presentation
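
A minimal sketch of OP-guided test allocation follows. The operation probabilities are the ones given for product "A" later in this deck; the function name `allocate_tests` and the budget of 200 test cases are hypothetical.

```python
def allocate_tests(op: dict[str, float], total_tests: int) -> dict[str, int]:
    """Split a test budget across operations in proportion to their OP probability."""
    if abs(sum(op.values()) - 1.0) > 1e-9:
        raise ValueError("operation probabilities must sum to 1")
    return {name: round(p * total_tests) for name, p in op.items()}


# Flat (Musa-style) OP: disjoint operations with occurrence probabilities.
op = {
    "New order": 0.10,
    "Change order": 0.35,
    "Move order": 0.10,
    "Order status": 0.45,
}

print(allocate_tests(op, total_tests=200))
```

Rounding per operation can make the allocated total differ slightly from the budget for other inputs; a real allocation scheme would need to reconcile that.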

11 Markov Chain Usage Model
- A Markov chain usage model is a set of states, transitions, and the transition probabilities
  - As an alternative to Musa (flat) OP
  - Each link has an associated probability of occurrence
  - Models complex and/or interactive systems better
- Unified Markov Models (Kallepalli and Tian, 2001; Tian et al., 2003):
  - Collection of Markov OPs in a hierarchy
  - Flexible application in testing and reliability improvement
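
The sketch below shows one way a Markov chain usage model could be represented and sampled to produce usage sequences for testing. The states, transition probabilities, and the function name `generate_session` are all hypothetical; they are not taken from the Unified Markov Models work cited above.

```python
import random

# Each state maps to its outgoing links: (next state, probability of occurrence).
USAGE_MODEL = {
    "Entry": [("Browse", 0.7), ("Search", 0.3)],
    "Browse": [("Order", 0.4), ("Search", 0.2), ("Exit", 0.4)],
    "Search": [("Browse", 0.5), ("Exit", 0.5)],
    "Order": [("Exit", 1.0)],
}


def generate_session(model, rng, start="Entry", end="Exit", max_steps=100):
    """Random walk through the usage model from start until end is reached."""
    state, path = start, [start]
    while state != end and len(path) < max_steps:
        targets, weights = zip(*model[state])
        state = rng.choices(targets, weights=weights, k=1)[0]
        path.append(state)
    return path


rng = random.Random(2008)  # fixed seed so the sampled sessions are repeatable
for _ in range(3):
    print(" -> ".join(generate_session(USAGE_MODEL, rng)))
```

Each generated path is one candidate test scenario; frequently used paths are sampled more often, which is the same usage-based emphasis a flat OP gives to frequent operations.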

12 Operational Profile Development: Standard Procedure
- Musa’s steps (1998) for OP construction:
  1. Identify the initiators of operations
  2. Choose a representation (tabular or graphical)
  3. Create an operations “list”
  4. Establish the occurrence rates of the individual operations
  5. Establish the occurrence probabilities
- Other variations
  - Original Musa (1993): 5 top-down refinement steps
  - Markov OP (Tian et al): FSM, then probabilities based on log files
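
The last two steps (occurrence rates, then occurrence probabilities) can be sketched as a count-and-normalize pass over logged operations. The log contents and the function name `op_from_log` are hypothetical.

```python
from collections import Counter


def op_from_log(operations: list[str]) -> dict[str, float]:
    """Turn a sequence of logged operations into occurrence probabilities."""
    counts = Counter(operations)      # occurrence rates per operation
    total = sum(counts.values())
    if total == 0:
        raise ValueError("log contains no operations")
    return {name: n / total for name, n in counts.items()}


# Hypothetical log extract: one entry per completed operation.
log = ["Order status", "Change order", "Order status", "New order"]
print(op_from_log(log))
```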

13 OPs for Composite Systems
- Using standard procedure whenever possible
  - For overall stationary environment
  - For individual component usage => component OP
- For dynamic environment:
  - Snapshot identification
  - Sets of OPs for each snapshot
  - System OP from individual component OPs
- Special considerations:
  - Existing test data or operational logs can be used to develop component OPs
  - Union of component OPs => system OP
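
One possible reading of "union of component OPs => system OP" is a mixture of the component OPs weighted by how much of the overall usage each component receives. The slide does not specify the combination rule, so the weighting scheme, the component names, and the function name `system_op` below are assumptions.

```python
def system_op(component_ops, usage_share):
    """Combine component OPs into a system OP, weighting each by its usage share."""
    combined = {}
    for component, op in component_ops.items():
        for operation, p in op.items():
            combined[operation] = combined.get(operation, 0.0) + usage_share[component] * p
    return combined


# Hypothetical components and usage shares.
component_ops = {
    "ordering": {"New order": 0.5, "Change order": 0.5},
    "tracking": {"Order status": 1.0},
}
usage_share = {"ordering": 0.5, "tracking": 0.5}

print(system_op(component_ops, usage_share))
```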

14 OP and Dependability Evaluation
- Some dependability attributes are defined with respect to a specific OP: e.g., reliability
  - For overall stationary environment: direct measurement and assessment possible
  - For dynamic environment: OP-reliability pairs
  - Consequence of improper reuse due to different OPs (Weyuker 1998)
- From component to system dependability:
  - Customization/selection of best-fit OP for estimation
  - Compositional approach (Hamlet et al, 2001)
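
To illustrate why reliability comes in OP-reliability pairs, the sketch below evaluates the same component under two different OPs. It assumes a simple model in which each operation has a fixed failure probability and reliability is one minus the OP-weighted failure probability; the operations and numbers are hypothetical.

```python
def reliability_under_op(op, failure_prob):
    """Per-run reliability: 1 minus the OP-weighted failure probability."""
    return 1.0 - sum(p * failure_prob[name] for name, p in op.items())


# Hypothetical component: reads never fail, writes fail 10% of the time.
failure_prob = {"read": 0.0, "write": 0.1}

op_read_heavy = {"read": 0.75, "write": 0.25}
op_write_heavy = {"read": 0.25, "write": 0.75}

for label, op in [("read-heavy", op_read_heavy), ("write-heavy", op_write_heavy)]:
    print(label, reliability_under_op(op, failure_prob))
```

The component is unchanged, yet its reliability is lower under the write-heavy profile, which is the reuse risk the slide attributes to differing OPs.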

16 Diversity: Target Perspective
- Component Dependability:
  - Component reliability, security, etc. to be scored/evaluated
  - Direct Measurement and Assessment
  - Indirect Assessment (later)
- Under stationary environment:
  - Dependability vector for each component
  - Diversity maximization via DEA (data envelopment analysis)
- Under dynamic environment:
  - Dependability matrix for each component
  - Diversity maximization via extended DEA by flattening out the matrix
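
A small sketch of "flattening out the matrix": a component's dependability matrix (one row per OP snapshot, one column per attribute) becomes a single labeled vector that a vector-based analysis can consume. The snapshot names, attribute scores, and the function name `flatten` are hypothetical.

```python
def flatten(matrix):
    """Flatten {snapshot: {attribute: score}} into {"attribute@snapshot": score}."""
    return {
        f"{attribute}@{snapshot}": score
        for snapshot, row in matrix.items()
        for attribute, score in row.items()
    }


# Hypothetical dependability matrix for one component under two OP snapshots.
matrix = {
    "peak": {"reliability": 0.95, "security": 0.80},
    "off-peak": {"reliability": 0.99, "security": 0.85},
}

print(flatten(matrix))
```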

17 Diversity Maximization via DEA
- DEA (data envelopment analysis):
  - Non-parametric analysis
  - Establishes a multivariate frontier in a dataset
  - Basis: linear programming
- Applying DEA
  - Dependability attribute frontier
  - Illustrative example (right)
  - N-dimensional: hyperplane
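
The idea of a dependability attribute frontier can be illustrated with a simplified stand-in: keeping the components that no other component beats on every attribute. This is a plain non-dominated (Pareto) filter, not the linear-programming-based DEA named on the slide; it is closer to a free-disposal-hull view and may keep points that a convex DEA frontier would rate as inefficient. Component names and scores are hypothetical.

```python
def dominates(a, b):
    """True if score vector a is at least as good as b everywhere and better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def frontier(components):
    """Names of components whose score vectors are not dominated by any other."""
    return sorted(
        name
        for name, score in components.items()
        if not any(
            dominates(other, score)
            for other_name, other in components.items()
            if other_name != name
        )
    )


# Hypothetical (reliability, security) scores, higher is better.
components = {
    "A": (0.99, 0.70),
    "B": (0.90, 0.95),
    "C": (0.85, 0.60),
    "D": (0.95, 0.90),
}

print(frontier(components))
```

Components on the frontier have complementary strengths, which is what makes them candidates for combination when the goal is diversity.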

18 DEA Example
- Lockheed-Martin software project performance with regard to selected metrics and production efficiency model
  - Measures efficiencies of decision making units (DMU) using weighted sums of inputs and weighted sums of outputs
  - Compares DMUs to each other
  - Sensitivity analysis affords study of non-efficient DMUs in comparison
  - BCC VRS Model used in initial study
- [Model diagram: Inputs and Outputs; items shown: Labor hours, Software Change Size, Software Reliability At Release, Defect Density after test, Software Productivity; Efficiency = Output/Input]
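
For reference, the "weighted sums of inputs and weighted sums of outputs" idea is usually written as the ratio-form DEA problem below, solved once per DMU. This is the textbook constant-returns (CCR) form, given as a sketch; the symbols (n DMUs, m inputs x, s outputs y, weights u and v, evaluated DMU o) are notation introduced here, not taken from the slides. The BCC VRS model named above is commonly written with an additional unrestricted scalar term in the numerator, which is omitted here.

```latex
\begin{aligned}
\max_{u,\,v}\quad & h_o = \frac{\sum_{r=1}^{s} u_r\, y_{ro}}{\sum_{i=1}^{m} v_i\, x_{io}} \\
\text{subject to}\quad & \frac{\sum_{r=1}^{s} u_r\, y_{rj}}{\sum_{i=1}^{m} v_i\, x_{ij}} \le 1, \qquad j = 1, \dots, n, \\
& u_r \ge 0,\; v_i \ge 0 .
\end{aligned}
```

A DMU with h_o = 1 lies on the efficiency frontier; values below 1 indicate how far it falls short relative to the best-performing DMUs.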

19 DEA Example (2)
- Using production efficiency model for Compute-Intensive dataset group
- Ranked set of projects
- Data showing distance and direction from efficiency frontier

20 Diversity: Internal Perspective
- Component Dependability:
  - Direct Measurement and Assessment: might not be available, feasible, or cost-effective
  - Indirect Assessment via Internal Contributor Mapping
- Internal Contributors:
  - System design, architecture
  - Component internal characteristics: size, complexity, etc.
  - Process/people/other characteristics
  - Usually more readily available data/measurements
- Internal=>External mapping
  - Procedure with OP as input too (e.g., fault=>reliability)

21 Example: Fault-Failure Mapping for Dynamic Web Applications

22 Web Example: Fault-Failure Mapping
- Input to analysis (and fault-failure conversion):
  - Anomalies recorded in web server logs (failure view)
  - Faults recorded during development and maintenance
  - Defect impact scheme (weights)
  - Operational profile
- Product “A” is an ordering web application for telecom services
  - Consists of hundreds of thousands of lines of code
  - Runs on IIS 6.0 (Microsoft Internet Information Server)
  - Processes a couple of million requests per day

23 Web Example: Fault-Failure Mapping (Step 1)
- Pareto chart for the defect classification of product “A”
- The top three categories represent 66.26% of the total defect data

24 Web Example: Fault-Failure Mapping (Steps 4 & 5)
Number of hits with response code 200 and 300:   235142
Average number of hits per transaction:          40
Number of transactions:                          5880
  Operation      Probability   Number of Transactions
  New order      0.1           588
  Change order   0.35          2058
  Move order     0.1           588
  Order status   0.45          2646
OP for product “A” and the corresponding numbers of transactions.
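
The arithmetic behind this slide can be sketched as follows. The inputs are the figures on the slide; rounding the transaction count to the nearest ten is an assumption made here to match the slide's 5880 (the unrounded quotient is about 5878.6).

```python
hits = 235142                 # hits with response code 200 and 300
hits_per_transaction = 40     # average number of hits per transaction

# Assumption: the slide's 5880 is the quotient rounded to the nearest ten.
transactions = int(round(hits / hits_per_transaction, -1))

op = {
    "New order": 0.10,
    "Change order": 0.35,
    "Move order": 0.10,
    "Order status": 0.45,
}

# Expected number of transactions per operation under the OP.
per_operation = {name: round(p * transactions) for name, p in op.items()}

print(transactions)
print(per_operation)
```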

25 Web Example: Fault-Failure Mapping (Step 6)
  Application Aspect   Impact        Weight   Number of transactions   Failure Frequency
  Order status         Showstopper   100%     2646                     2646
  Order status         High          70%      2646                     1852
  Order status         Medium        50%      2646                     1323
  Order status         Low           20%      2646                     529
  Order status         Exception     5%       2646                     132
Using the number of transactions calculated from the OP and the defined fault impact schema, we calculated the fault exposure, i.e., the corresponding potential failure frequencies.
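
The failure frequencies in this table are consistent with multiplying the number of transactions by the impact weight and rounding to the nearest integer. The sketch below reproduces that calculation; the rounding rule and the function name `failure_frequency` are inferred from the numbers, not stated on the slide.

```python
# Impact weights from the fault impact schema, in percent.
IMPACT_WEIGHT_PCT = {
    "Showstopper": 100,
    "High": 70,
    "Medium": 50,
    "Low": 20,
    "Exception": 5,
}


def failure_frequency(transactions: int, impact: str) -> int:
    """Potential failure frequency: transactions scaled by the impact weight."""
    return round(transactions * IMPACT_WEIGHT_PCT[impact] / 100)


order_status_transactions = 2646  # from the OP for product "A"
for impact in IMPACT_WEIGHT_PCT:
    print(impact, failure_frequency(order_status_transactions, impact))
```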

26 Web Example: Fault-Failure Mapping (Step 7)
  Rank   Response Code   Fault                                  Failure Frequency
  1      404             /images/dottedsep.gif                  5805
  2      404             /images/gnav_redbar_s_r.gif            3687
  3      404             /images/gnav_redbar_s_l.gif            3537
  4      200/300         Order status - showstopper             2646
  5      404             /includes/css/images/background.gif    2593
  6      200/300         Change order - showstopper             2058
  7      200/300         Order status - high                    1852
  8      200/300         Change order - high                    1441
  9      200/300         Order status - medium                  1323
  10     200/300         Change order - medium                  1029
  11     404             /includes/css/nc2004style.css          721
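
The ranking combines failures observed in the web server log (404 entries) with fault exposure derived from the defect repository (200/300 entries). A minimal sketch of that merge-and-sort step, using the rows shown on this slide, might look like this; the variable names are hypothetical.

```python
# (response code, fault, failure frequency) observed in web server logs.
log_failures = [
    ("404", "/images/dottedsep.gif", 5805),
    ("404", "/images/gnav_redbar_s_r.gif", 3687),
    ("404", "/images/gnav_redbar_s_l.gif", 3537),
    ("404", "/includes/css/images/background.gif", 2593),
    ("404", "/includes/css/nc2004style.css", 721),
]

# Fault exposure computed from the OP and the impact weights.
fault_exposure = [
    ("200/300", "Order status - showstopper", 2646),
    ("200/300", "Order status - high", 1852),
    ("200/300", "Order status - medium", 1323),
    ("200/300", "Change order - showstopper", 2058),
    ("200/300", "Change order - high", 1441),
    ("200/300", "Change order - medium", 1029),
]

# Merge both views and rank by failure frequency, highest first.
ranking = sorted(log_failures + fault_exposure, key=lambda row: row[2], reverse=True)

for rank, (code, fault, frequency) in enumerate(ranking, start=1):
    print(rank, code, fault, frequency)
```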

27 Web Example: Fault-Failure Mapping (Result Analysis)
- A large number of failures were caused by a small number of errors with high usage frequencies
- Fixing faults with a high usage frequency and a high impact could achieve better efficiency in reliability improvement
  - By fixing the top 6.8% of faults, the total failures were reduced by about 57%
  - Similarly, fixing the top 10%, 15%, and 20% of faults reduced failures by about 66%, 71%, and 75%, respectively
- The defect data repository and the failures recorded in web server logs have insignificant overlap => both are needed for effective reliability improvement
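
The "top x% of faults => y% failure reduction" figures are cumulative shares over a ranked fault list. The sketch below shows the calculation on made-up frequencies; the data and the function name `failure_reduction` are hypothetical and do not reproduce the percentages reported above.

```python
def failure_reduction(frequencies, top_fraction):
    """Share of all failures removed by fixing the top fraction of ranked faults."""
    ordered = sorted(frequencies, reverse=True)
    k = max(1, round(len(ordered) * top_fraction))  # number of faults fixed
    return sum(ordered[:k]) / sum(ordered)


# Hypothetical failure frequencies for ten faults.
frequencies = [500, 300, 100, 40, 20, 15, 10, 8, 5, 2]

for fraction in (0.1, 0.2, 0.5):
    print(fraction, failure_reduction(frequencies, fraction))
```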

28 Diversity: Value Perspective
- Component Dependability Attribute:
  - Direct Measurement and Assessment: might not capture what customers truly care about
  - Different value attached to different dependability attributes
- Value-based software quality analysis:
  - Quantitative model for software dependability ROI analysis
  - Avoid one-size-fits-all
- Value-based process: experience at NASA/USC (Huang and Boehm), to be extended to dependability
- Mapping to value-based perspective more meaningful to target customers

29 Value Maximization
- Single objective function:
  - Relative importance
  - Trade-off possible
  - Quantification scheme
  - Gradient scale to select component(s)
- Compare to DEA
  - General cases
  - Combination with DEA
  - Diversity as a separate dimension possible
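
A single objective function can be sketched as a weighted sum of dependability attribute scores, where the weights encode the customer's relative importance. The slides do not give the functional form, so the weighted sum, the component scores, the weights, and the function names below are assumptions.

```python
def value(scores, weights):
    """Weighted-sum value of one component's dependability attribute scores."""
    return sum(weights[attribute] * scores[attribute] for attribute in weights)


def best_component(components, weights):
    """Name of the component with the highest value under the given weights."""
    return max(components, key=lambda name: value(components[name], weights))


# Hypothetical components and two hypothetical customer weightings.
components = {
    "A": {"reliability": 0.99, "security": 0.70},
    "B": {"reliability": 0.90, "security": 0.95},
}
reliability_first = {"reliability": 0.8, "security": 0.2}
security_first = {"reliability": 0.3, "security": 0.7}

print(best_component(components, reliability_first))
print(best_component(components, security_first))
```

Changing the weights changes the selected component, which is the trade-off the single objective function makes explicit; DEA, by contrast, does not require weights to be fixed in advance.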

31 Experimental Evaluation
- Testbed
  - Basis: OPs
  - Focus on problems and system behavior under injected or simulated problems
- Fault Injection for Reliability and Fault Tolerance
  - Reliability mapping for injected faults
  - Use of fault seeding models
  - Direct fault tolerance evaluation
- Security Threat Simulation
  - Focus 1: likely scenarios
  - Focus 2: coverage via diversity
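
As one example of a fault seeding model, the classic seeding estimate (often attributed to Mills) assumes seeded and indigenous faults are equally likely to be found, so the fraction of seeded faults detected estimates the fraction of indigenous faults detected. The slides do not say which seeding model is intended; the function name and the numbers below are hypothetical.

```python
def estimate_indigenous_faults(seeded: int, seeded_found: int, indigenous_found: int) -> float:
    """Estimate total indigenous faults from the detection ratio of seeded faults."""
    if seeded_found == 0:
        raise ValueError("no seeded faults found; the estimate is undefined")
    return indigenous_found * seeded / seeded_found


# Hypothetical experiment: 20 faults seeded, 15 of them found,
# along with 30 indigenous (non-seeded) faults.
total = estimate_indigenous_faults(seeded=20, seeded_found=15, indigenous_found=30)
print(total)        # estimated total indigenous faults
print(total - 30)   # estimated indigenous faults still remaining
```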

32 Summary and Future Work
- Overall Framework
- External Environment Profiling
- Component Dependability:
  - Direct Measurement and Assessment
  - Indirect Assessment via Internal Contributor Mapping
  - Value Perspective
- Experimental Evaluation
  - Fault Injection for Reliability and Fault Tolerance
  - Security Threat Simulation
- Summary and Future Work

