TOK Meeting
Prof. Domenico Cotroneo, University of Napoli Federico II
cotroneo@unina.it
Coimbra, 10th March 2010
Research Topics
Focus on dependability assessment and improvement of complex computer systems:
- Model-based dependability assessment
- Proactive fault tolerance
- Software fault injection
- Reliability-oriented testing
- Field Failure Data Analysis (FFDA)
- Benchmarking
- Software fault diagnosis
Case studies: operating systems, Large-Scale Complex Critical Infrastructures, wireless sensor networks, supercomputers, ...
Outline
- Model-based dependability assessment
- Proactive fault tolerance
- Software fault injection
- Reliability-oriented testing
- Field Failure Data Analysis (FFDA)
- Benchmarking
- Software fault diagnosis
Model-Based Dependability Assessment
Dependability assessment of distributed systems by means of hybrid modeling and automated failure-model generation strategies, in the context of:
- Wireless Sensor Networks (WSNs)
- Large-scale Complex Critical Infrastructures (LCCIs)
Assessment Issues
The assessment is dramatically complicated by the variety of changes that may take place. For example, in Wireless Sensor Networks:
- The workload affects the packets sent (rate and size)
- The packet flow affects the lifetime of forwarding nodes
- Failure rates are influenced by the above factors (e.g., battery depletion)
- The network is dynamic: it evolves, adapts and reacts to new working conditions
How can such issues be addressed while assessing dependability?
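To make the dependence of failure parameters on the workload concrete, the following toy sketch (not the actual hybrid model of this work) shows how packet rate and size could be turned into a battery-depletion-driven failure rate for a forwarding node; all constants are invented for illustration.

```python
# Illustrative toy model: it only shows how workload parameters could be
# propagated into a node failure rate fed to a dependability model.

def battery_lifetime_hours(packet_rate_pps, packet_size_bytes,
                           battery_mj=10_000.0, tx_cost_nj_per_byte=500.0):
    """Hours until battery depletion, assuming energy is spent only on forwarding."""
    energy_per_packet_mj = packet_size_bytes * tx_cost_nj_per_byte * 1e-6
    drain_mj_per_hour = packet_rate_pps * 3600 * energy_per_packet_mj
    return battery_mj / drain_mj_per_hour

def node_failure_rate(packet_rate_pps, packet_size_bytes):
    """Failure rate (1/hour): the inverse of the expected lifetime under the workload."""
    return 1.0 / battery_lifetime_hours(packet_rate_pps, packet_size_bytes)

# A workload change (heavier traffic) directly changes the model parameter.
print(node_failure_rate(packet_rate_pps=2, packet_size_bytes=64))
print(node_failure_rate(packet_rate_pps=10, packet_size_bytes=128))
```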
Automated Failure Model Generation
Knowledge from the domain expert (FMEA, test outcomes, literature, ...) feeds the inputs of the generation process: model templates, fixed parameters, an XML description of the node failure models, and the experiments of interest. From these, the model generator automatically produces the models, the reward metrics and the experiments.
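A minimal sketch of the generation step, assuming a hypothetical XML layout for the node failure description and a plain string template; the actual model formalism and tool chain used in this work are not shown here.

```python
# Sketch: instantiate a model fragment per failure mode found in the XML description.
import xml.etree.ElementTree as ET
from string import Template

NODE_XML = """
<node id="sensor-3">
  <failure mode="battery_depletion" rate="0.023"/>
  <failure mode="crash" rate="0.001"/>
</node>
"""

MODEL_TEMPLATE = Template("failure_activity $name { node = $node; rate = $rate; }")

def generate_models(xml_text, fixed_params):
    node = ET.fromstring(xml_text)
    models = []
    for failure in node.findall("failure"):
        models.append(MODEL_TEMPLATE.substitute(
            name=failure.get("mode"),
            node=node.get("id"),
            rate=failure.get("rate", fixed_params.get("default_rate")),
        ))
    return models

for model in generate_models(NODE_XML, {"default_rate": "1e-4"}):
    print(model)
```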
Proactive Fault Tolerance in LCCIs
Techniques to tolerate message losses in Internet-scale data dissemination infrastructures without suffering severe performance fluctuations (timeliness). The focus is on Publish/Subscribe middleware, used in the context of Air Traffic Control systems.
Timely Loss-Tolerant Pub/Sub
Large-scale Complex Critical Infrastructures (LCCIs) are Internet-scale federations of several autonomous and heterogeneous systems that work collaboratively to provide critical facilities. Requirements on event dissemination: reliability, timeliness, scalability. Publish/Subscribe services are suitable to federate heterogeneous systems thanks to their intrinsic decoupling properties, which promote scalability.
Issues and Solutions
- Issue: tolerate coordinator failures without cluster isolation. Solution: active + passive coordinator replication.
- Issue: treat link crashes in the coordinator cluster. Solution: multiple trees, which allow circumventing the crashed link.
- Issue: guarantee event delivery despite message losses (coordinator cluster). Solution: FEC approach + gossiping.
- Issue: guarantee event delivery despite message losses (node clusters). Solution: layered opportunistic multicast.
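As a toy illustration of the FEC idea in the list above: one XOR parity packet per group of data packets lets a receiver rebuild any single lost packet locally, without a retransmission round-trip, which matters for timeliness. The actual coding scheme and gossip protocol of this work are not reproduced here.

```python
# Toy forward-error-correction sketch: one XOR parity packet per group of data
# packets recovers any single loss without retransmission.
from functools import reduce

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def make_parity(group):
    return reduce(xor_bytes, group)

def recover(received, parity):
    """received: list with exactly one entry set to None (the lost packet)."""
    present = [p for p in received if p is not None]
    rebuilt = reduce(xor_bytes, present, parity)
    return [p if p is not None else rebuilt for p in received]

group = [b"evt1", b"evt2", b"evt3", b"evt4"]
parity = make_parity(group)
damaged = [group[0], None, group[2], group[3]]   # packet 2 lost in transit
print(recover(damaged, parity))                  # packet 2 rebuilt locally
```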
Software Fault Injection
MandelBugs are the major cause of failures in critical, well-tested systems. Frequent fault triggers reported in the literature: concurrency, timing of external events, wrong memory state.
- BohrBugs: "solid", simple to reproduce, detected and fixed during testing.
- MandelBugs: "transient", complex activation conditions (e.g., the environment), difficult to reproduce, critical in well-tested software.
Our Contribution
We studied whether an existing software fault injection technique (G-SWFIT) accurately emulates MandelBugs. Some limitations were highlighted on a fault-tolerant ATC system:
- When a fault is activated, it affects both replicas
- Failures mostly occur at the beginning of the execution
- Most faults are not activated
We therefore designed a technique for emulating MandelBugs in which the fault trigger is taken into account.
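The following sketch illustrates the general idea of trigger-guarded injection, not G-SWFIT or the actual tool developed in this work: the injected fault corrupts a value only when a (hypothetical) timing condition holds, mimicking the elusive activation pattern of MandelBugs.

```python
# Minimal sketch of trigger-guarded fault injection (illustrative only).
import random
import time

def fault_trigger():
    """Hypothetical trigger: activates only in a narrow timing window."""
    return int(time.time() * 1000) % 100 < 5   # ~5% of calls hit the window

def read_sensor():
    value = random.gauss(20.0, 0.5)            # original, correct behaviour
    if fault_trigger():
        value = value * 1000                   # injected fault: wrong scaling
    return value

activations = sum(1 for _ in range(10_000) if read_sensor() > 100)
print(f"fault activated in {activations} of 10000 calls")
```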
Results
Fault tolerance mechanism testing benefited in terms of representativeness and state coverage. In future work, the fault model will be extended (e.g., to deadlock concurrency faults) and a more extensive experimental analysis will be performed.
Reliability-Oriented Testing
The reliability/cost trade-off is hard to achieve for critical systems:
- system complexity, size and heterogeneity (multilayer applications composed of heterogeneous components and subsystems)
- standard high-reliability requirements vs. tight time/cost constraints
- inadequacy of current strategies
How can engineers plan an effective, reliability-oriented verification for critical software systems?
Reliability-Oriented Testing
Reliability should be adopted as the pilot criterion to drive verification. However, it is difficult to quantitatively evaluate the impact of verification activities on the final reliability; thus, crucial choices are often left to the engineers' intuition. In particular, the most crucial activities are:
- identification of the most critical parts of the system
- effort allocation
- selection of proper verification techniques
Current verification strategies are not well suited to handle these challenges.
Current Investigation
We are investigating solutions to improve the effectiveness of verification activities aimed at delivering highly reliable software systems, in which the most critical activities are supported by quantitative criteria. So far, we have proposed the following solutions for the two outlined issues:
1. A model-based approach for reliability-oriented effort allocation
2. Empirical approaches for verification technique selection, i.e., evaluation of the effectiveness of existing verification techniques as a function of the fault classes affecting the system
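As an illustration only, a proportional rule for the effort-allocation idea: a fixed verification budget is split across components according to an estimated criticality score. The component names, scores and formula are hypothetical; the actual model-based approach is richer than this.

```python
# Sketch of reliability-driven effort allocation under hypothetical scores.
def allocate_effort(components, total_hours):
    weights = {name: fp * usage for name, (fp, usage) in components.items()}
    total = sum(weights.values())
    return {name: total_hours * w / total for name, w in weights.items()}

components = {
    # name: (estimated fault proneness, operational usage weight)
    "flight_plan_manager": (0.30, 0.50),
    "radar_tracker":       (0.20, 0.35),
    "hmi_frontend":        (0.10, 0.15),
}
print(allocate_effort(components, total_hours=400))
```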
Verification Technique Selection
Two empirical studies address the issue of how to select and apply the proper verification techniques.
1. Techniques vs. fault types: analysis of the effectiveness of various verification techniques with respect to the fault types (according to ODC) the system is potentially affected by. The observed relationships allow techniques to be selected based on the fault content and the ODC types expected in the software under test.
2. Fault types vs. software metrics: how can the tester know what kinds of faults characterize the software? This depends on software features such as code complexity and size, the adopted programming techniques, and so on. A second study characterizes a system by investigating potential relationships between the types and number of faults and relevant software features, expressed by means of common software metrics.
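A toy illustration of the second study's ingredient: relating one code metric to the count of one ODC fault type per module via a Pearson correlation. The data are invented, and the real studies consider many metrics with a proper statistical treatment.

```python
# Correlate a per-module metric with per-module counts of one fault type (toy data).
from math import sqrt

cyclomatic = [4, 12, 7, 25, 18, 9]        # metric per module (hypothetical)
algo_faults = [0, 3, 1, 6, 4, 2]          # ODC "Algorithm" faults per module

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

print(f"metric vs. fault-type correlation: {pearson(cyclomatic, algo_faults):.2f}")
```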
Analysis of Software Aging
For further reliability improvement, we are investigating the software aging phenomenon, caused by elusive aging-related bugs. We are currently analyzing aging:
- From a static point of view: investigating the relationships between a wide set of software metrics (52, both common metrics and metrics potentially "tailored" to aging bugs) and aging manifestations, in order to obtain a set of static indicators that forecast the aging trend before the operational phase.
- From a dynamic point of view: how is aging related to the workload? How can we compare systems from the aging perspective? We are attempting to relate high-level (comparable) workload features to aging dynamics.
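On the dynamic side, a crude sketch of aging-trend detection: a positive least-squares slope over a monitored resource series (hypothetical memory samples) is taken as an aging indicator. Real aging analyses typically use more robust trend tests.

```python
# Trend detection on a monitored resource series (hypothetical hourly samples, MB).
def slope(samples):
    n = len(samples)
    xs = range(n)
    mx, my = (n - 1) / 2, sum(samples) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, samples))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

memory_mb = [512, 518, 515, 530, 541, 539, 560, 572, 570, 590]
print(f"estimated aging trend: {slope(memory_mb):.1f} MB/hour")
```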
FFDA
Field Failure Data Analysis (FFDA): dependability evaluation of computer systems based on event logs (operating systems, Internet, security, embedded, mobile, ...).
"Logs are the first place where system administrators go when alerted to a problem, since they are one of the few mechanisms for gaining visibility of the behavior of a system" [1]
[1] A. J. Oliner and J. Stearley. What supercomputers say: A study of five system logs. In DSN '07, pages 575-584. IEEE Computer Society, 2007.
Known Open Issues
Logs are designed and implemented by programmers, so crucial decisions are left to the last steps of the development cycle, and the meaning of entries depends on developer experience and attitude. This subjectivity and unstructured nature leads to:
- heterogeneity, in both format and content
- inaccuracy: duplicated events, useless events, missing events, ...
- inability to discriminate actual failures from presumed ones
Furthermore, failure propagation phenomena result in multiple, apparently uncorrelated events in the logs.
The Proposed Solution (1)
Re-think the way logs are produced: a little effort during the system design phase can significantly reduce the analysis effort and lead to effective FFDA results, by providing developers with logging rules and support tools
- to unambiguously detect failure occurrence and location
- to trace failure propagation phenomena induced by interactions
Preliminary experimental results have shown that the use of precise rules makes it possible to automatically gain information that improves FFDA results.
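A sketch of the kind of rule such an approach advocates (field names are hypothetical): every remote interaction logs a structured start/end pair tagged with the component and a correlation identifier, so an unmatched start can be flagged automatically and propagation can be traced across components.

```python
# Rule-based, structured logging sketch (hypothetical schema).
import json, time, uuid

def log_event(kind, component, corr_id):
    print(json.dumps({"ts": time.time(), "kind": kind,
                      "component": component, "corr": corr_id}))

def call_remote(component, corr_id):
    log_event("SERVICE_START", component, corr_id)
    # ... the actual interaction would happen here ...
    log_event("SERVICE_END", component, corr_id)

corr = str(uuid.uuid4())          # one id follows the request across components
call_remote("resource_manager", corr)
call_remote("flight_data_server", corr)
```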
Known Open Issues
Time coalescence is commonly used to group potentially correlated failure events, but we argue that a one-size-fits-all coalescence window leads to unrealistic results, since different subsystems manifest failures with different inter-arrival times and durations. For example, the duration of a single failure spans from 2 s (memory) up to 15 minutes (I/O), with inter-arrival times of days (memory) and tens of minutes (I/O). The resulting wrong-grouping phenomenon (truncation and collisions) is a well-known and still open problem.
The Proposed Solution (2)
A multiple-time-window coalescence algorithm:
- Logs are filtered per entity in the system (e.g., supercomputer nodes) and classified subsystem by subsystem
- For each node subsystem, a suitable coalescence time window is selected empirically
- Finally, the failure data of the single node subsystems are coalesced by selecting a time window that minimizes the wrong-grouping phenomenon on the involved subsystem data
The approach is implemented in a tool that automates log collection and analysis, and it has been applied to the dependability assessment of three supercomputers at the National Center for Supercomputing Applications (NCSA) and at the University of Naples.
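A minimal sketch of the per-subsystem coalescence step; the window values are hypothetical, and the real algorithm selects them empirically and then minimizes wrong grouping across subsystems.

```python
# Per-subsystem time coalescence sketch: events of the same subsystem are grouped
# into one failure tuple when they arrive closer than that subsystem's window.
WINDOWS_S = {"memory": 60, "io": 1200, "network": 300}   # hypothetical windows

def coalesce(events):
    """events: iterable of (timestamp_s, subsystem, message), sorted by time."""
    open_tuples = {}   # subsystem -> (last_timestamp, list_of_messages)
    result = []
    for ts, subsystem, msg in events:
        window = WINDOWS_S.get(subsystem, 300)
        last = open_tuples.get(subsystem)
        if last and ts - last[0] <= window:
            last[1].append(msg)                 # same failure manifestation
            open_tuples[subsystem] = (ts, last[1])
        else:
            group = [msg]
            result.append((subsystem, group))   # new failure tuple
            open_tuples[subsystem] = (ts, group)
    return result

log = [(0, "memory", "ECC error"), (30, "memory", "ECC error"),
       (500, "io", "SCSI timeout"), (1500, "io", "SCSI timeout"),
       (5000, "memory", "ECC error")]
for subsystem, msgs in coalesce(log):
    print(subsystem, msgs)
```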
Scope
Objective: realization of benchmarking tools for assessing the performance of Publish/Subscribe architectures. These research activities are carried out in the context of Distributed Event-Based Systems (DEBS), which adopt the Publish/Subscribe interaction model.
The Proposed Approach
We propose an architecture based on:
- Components, which contain the business logic of the benchmark, e.g.: App { requires X; provides Y; subscribes Z; publishes W; }
- Connectors, which encapsulate the communication logic needed to glue components together
This architecture enforces a separation of concerns that facilitates the quick development of benchmarking applications: middleware vendors can provide their products as connectors, so benchmarkers do not need to know how to use the middleware under benchmark.
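A minimal sketch of the component/connector separation (class and method names are hypothetical): the benchmark component contains only business logic, while a Connector hides the concrete Publish/Subscribe middleware behind a small uniform interface; an in-memory connector stands in for a real product here.

```python
# Component/connector separation sketch for a pub/sub benchmark.
class Connector:
    """Interface a middleware vendor would implement for its product."""
    def publish(self, topic, event): raise NotImplementedError
    def subscribe(self, topic, callback): raise NotImplementedError

class InMemoryConnector(Connector):
    """Trivial stand-in so the sketch runs without any real middleware."""
    def __init__(self): self._subs = {}
    def publish(self, topic, event):
        for cb in self._subs.get(topic, []): cb(event)
    def subscribe(self, topic, callback):
        self._subs.setdefault(topic, []).append(callback)

class LatencyProbe:
    """Benchmark component: business logic only, no middleware-specific code."""
    publishes, subscribes = "probe/out", "probe/out"
    def __init__(self, connector):
        connector.subscribe(self.subscribes, lambda e: print("received", e))
        self.connector = connector
    def run(self):
        self.connector.publish(self.publishes, {"seq": 1})

LatencyProbe(InMemoryConnector()).run()
```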
A Model-Driven Solution
Much application and platform code on top of the EDA platform and component framework is still (unnecessarily) written by hand. To reduce the effort of writing such code, we are studying the adoption of Model-Driven Development (MDD) technologies:
- Domain-Specific Modeling Languages (DSMLs)
- transformation engines and generators
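A sketch of the generation step (model format and emitted code are hypothetical): a declarative description of the application, of the kind a DSML would capture, is turned into subscription/publication boilerplate by a simple template-based generator.

```python
# Template-based code generation sketch from a declarative application model.
APP_MODEL = {
    "name": "LatencyProbe",
    "subscribes": ["probe/out"],
    "publishes": ["probe/out"],
}

def generate_glue(model):
    lines = [f"class {model['name']}Glue:",
             "    def __init__(self, connector):"]
    for topic in model["subscribes"]:
        lines.append(f"        connector.subscribe({topic!r}, self.on_event)")
    lines.append("    def emit(self, connector, event):")
    for topic in model["publishes"]:
        lines.append(f"        connector.publish({topic!r}, event)")
    return "\n".join(lines)

print(generate_glue(APP_MODEL))   # boilerplate that would otherwise be hand-written
```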
Fault Diagnosis
Diagnosis is the process of determining the cause of errors, both in location and in nature:
- off-line, for post-mortem analysis
- on-line, for just-in-time system reconfiguration
It is the only viable strategy for safety- and mission-critical systems. The system has failed... Why? When? Where is the cause?
The Proposed Approach at a Glance
- A tunable diagnosis engine, able to learn over time
- A priori partial knowledge about the system fault model (e.g., testing activity outcomes and human experience)
- Detection and location treated as coupled processes
- Data-driven!
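A sketch of the detect-and-locate loop with on-line learning; the symptom names, data and scoring rule are invented for illustration and do not reproduce the actual diagnosis engine.

```python
# Data-driven detect-and-locate sketch: detection flags an anomalous symptom
# vector, location picks the component whose past fault episodes best match it,
# and every confirmed verdict updates the knowledge base (learning over time).
from collections import defaultdict

class DiagnosisEngine:
    def __init__(self, initial_knowledge):
        # symptom -> component -> how often that pairing was confirmed
        self.counts = defaultdict(lambda: defaultdict(int))
        for symptom, component in initial_knowledge:   # e.g. from testing outcomes
            self.counts[symptom][component] += 1

    def detect(self, symptoms):
        return len(symptoms) > 0                       # trivial detector stand-in

    def locate(self, symptoms):
        scores = defaultdict(int)
        for s in symptoms:
            for component, n in self.counts[s].items():
                scores[component] += n
        return max(scores, key=scores.get) if scores else "unknown"

    def learn(self, symptoms, confirmed_component):    # feedback after repair
        for s in symptoms:
            self.counts[s][confirmed_component] += 1

engine = DiagnosisEngine([("heartbeat_missed", "replica_B"),
                          ("queue_overflow", "dispatcher")])
obs = ["heartbeat_missed"]
if engine.detect(obs):
    print("suspected component:", engine.locate(obs))
    engine.learn(obs, "replica_B")                     # verdict confirmed by operator
```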