EDCC-8 28 April 2010 Valencia, Spain MobiLab Roberto Natella, Domenico Cotroneo {roberto.natella,

Slides:



Advertisements
Similar presentations
Remus: High Availability via Asynchronous Virtual Machine Replication
Advertisements

5.1 Silberschatz, Galvin and Gagne ©2009 Operating System Concepts with Java – 8 th Edition Chapter 5: CPU Scheduling.
Programming Types of Testing.
©Ian Sommerville 2004Software Engineering, 7th edition. Chapter 20 Slide 1 Critical systems development.
F ORMAL D IAGNOSIS OF H ARDWARE T RANSIENT E RRORS IN P ROGRAMS Layali Rashid, Karthik Pattabiraman and Sathish Gopalakrishnan T HE E LECTRICAL AND C OMPUTER.
Making Services Fault Tolerant
EEC 688/788 Secure and Dependable Computing Lecture 12 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University
SE 450 Software Processes & Product Metrics Reliability: An Introduction.
CS 582 / CMPE 481 Distributed Systems Fault Tolerance.
Reliability on Web Services Pat Chan 31 Oct 2006.
EEC 693/793 Special Topics in Electrical Engineering Secure and Dependable Computing Lecture 12 Wenbing Zhao Department of Electrical and Computer Engineering.
Page 1 Copyright © Alexander Allister Shvartsman CSE 6510 (461) Fall 2010 Selected Notes on Fault-Tolerance (12) Alexander A. Shvartsman Computer.
Swami NatarajanJuly 14, 2015 RIT Software Engineering Reliability: Introduction.
A Progressive Fault Tolerant Mechanism in Mobile Agent Systems Michael R. Lyu and Tsz Yeung Wong July 27, 2003 SCI Conference Computer Science Department.
1 Software Testing and Quality Assurance Lecture 5 - Software Testing Techniques.
©Ian Sommerville 2006Critical Systems Slide 1 Critical Systems Engineering l Processes and techniques for developing critical systems.
Learning From Mistakes—A Comprehensive Study on Real World Concurrency Bug Characteristics Shan Lu, Soyeon Park, Eunsoo Seo and Yuanyuan Zhou Appeared.
©Ian Sommerville 2004Software Engineering, 7th edition. Chapter 10 Slide 1 Critical Systems Specification 3 Formal Specification.
1 Joe Meehean. 2 Testing is the process of executing a program with the intent of finding errors. -Glenford Myers.
Software Dependability CIS 376 Bruce R. Maxim UM-Dearborn.
Exceptions and Mistakes CSE788 John Eisenlohr. Big Question How can we improve the quality of concurrent software?
Chapter 13 Processing Controls. Operating System Integrity Operating system -- the set of programs implemented in software/hardware that permits sharing.
Reliability Andy Jensen Sandy Cabadas.  Understanding Reliability and its issues can help one solve them in relatable areas of computing Thesis.
Parallel Programming Models Jihad El-Sana These slides are based on the book: Introduction to Parallel Computing, Blaise Barney, Lawrence Livermore National.
Software Metrics - Data Collection What is good data? Are they correct? Are they accurate? Are they appropriately precise? Are they consist? Are they associated.
BFTCloud: A Byzantine Fault Tolerance Framework for Voluntary-Resource Cloud Computing Yilei Zhang, Zibin Zheng, and Michael R. Lyu
EEL Software development for real-time engineering systems.
Presenter: Jyun-Yan Li Effective Software-Based Self-Test Strategies for On-Line Periodic Testing of Embedded Processors Antonis Paschalis Department of.
Architectural Design lecture 10. Topics covered Architectural design decisions System organisation Control styles Reference architectures.
Consistent and Efficient Database Replication based on Group Communication Bettina Kemme School of Computer Science McGill University, Montreal.
Jon Perez, Mikel Azkarate-askasua, Antonio Perez
DEBUGGING. BUG A software bug is an error, flaw, failure, or fault in a computer program or system that causes it to produce an incorrect or unexpected.
Practical Byzantine Fault Tolerance
Building Dependable Distributed Systems Chapter 1 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University
Secure Systems Research Group - FAU 1 Active Replication Pattern Ingrid Buckley Dept. of Computer Science and Engineering Florida Atlantic University Boca.
©Ian Sommerville 2004Software Engineering, 7th edition. Chapter 20 Slide 1 Critical systems development 3.
Prepare by : Ihab shahtout.  Overview  To give an overview of fixed priority schedule  Scheduling and Fixed Priority Scheduling.
14.1/21 Part 5: protection and security Protection mechanisms control access to a system by limiting the types of file access permitted to users. In addition,
CprE 458/558: Real-Time Systems
Implementing Simple Replication Protocols using CORBA Portable Interceptors and Java Serialization T. Bennani, L. Blain, L. Courtes, J.-C. Fabre, M.-O.
DynaRIA: a Tool for Ajax Web Application Comprehension Dipartimento di Informatica e Sistemistica University of Naples “Federico II”, Italy Domenico Amalfitano.
Redesigning Air Traffic Control: An Exercise in Software Design Daniel Jackson and John Chapin, MIT Lab for Computer Science Presented by: Jingming Zhang.
Software Quality Assurance and Testing Fazal Rehman Shamil.
Outsourcing, subcontracting and COTS Tor Stålhane.
1 Developing Aerospace Applications with a Reliable Web Services Paradigm Pat. P. W. Chan and Michael R. Lyu Department of Computer Science and Engineering.
1 Software Testing and Quality Assurance Lecture 17 - Test Analysis & Design Models (Chapter 4, A Practical Guide to Testing Object-Oriented Software)
Evaluating the Fault Tolerance Capabilities of Embedded Systems via BDM M. Rebaudengo, M. Sonza Reorda Politecnico di Torino Dipartimento di Automatica.
C++ for Engineers and Scientists, Second Edition 1 Problem Solution and Software Development Software development procedure: method for solving problems.
Reachability Testing of Concurrent Programs1 Reachability Testing of Concurrent Programs Richard Carver, GMU Yu Lei, UTA.
Testing Overview Software Reliability Techniques Testing Concepts CEN 4010 Class 24 – 11/17.
Objective ICT : Internet of Services, Software & Virtualisation FLOSSEvo some preliminary ideas.
TOK Meeting I Prof. Domenico Cotroneo University of Napoli Federico II COIMBRA, 10th March 2010.
CS223: Software Engineering Lecture 25: Software Testing.
A statistical anomaly-based algorithm for on-line fault detection in complex software critical systems A. Bovenzi – F. Brancati Università degli Studi.
Testing and Debugging PPT By :Dr. R. Mall.
The Development Process of Web Applications
Soft-Error Detection through Software Fault-Tolerance Techniques
Chapter 24 Testing Object-Oriented Applications
Fault Tolerance Distributed Web-based Systems
Chapter 19 Testing Object-Oriented Applications
EEC 688/788 Secure and Dependable Computing
EEC 688/788 Secure and Dependable Computing
Software Verification and Validation
Software Verification and Validation
Operating Systems : Overview
Chapter 19 Testing Object-Oriented Applications
Operating Systems : Overview
EEC 688/788 Secure and Dependable Computing
Software Verification and Validation
Presentation transcript:

EDCC-8 28 April 2010 Valencia, Spain MobiLab Roberto Natella, Domenico Cotroneo {roberto.natella, The MobiLab Group Dipartimento di Informatica e Sistemistica, Università degli Studi di Napoli Federico II Via Claudio 21, Napoli, Italy 28 April 2010, Valencia, Spain Emulation of Transient Software Faults for Dependability Assessment: A Case Study MobiLab The 8th European Dependable Computing Conference

EDCC-8 28 April 2010 Valencia, Spain MobiLab 2 / 22  Context and problem statement  Software Fault Injection  Bohrbugs and Mandelbugs  Case study from the ATC domain  Evaluation of state-of-the-art fault injection  An experiment involving concurrency faults  Conclusions ::. Outline

EDCC-8 28 April 2010 Valencia, Spain MobiLab ::. Rationale (1/2)  Software faults represent an important cause of system failures  Despite of efforts on Verification activities, fault avoidance, and fault removal, software systems are often delivered with residual software faults  Critical systems adopt Fault Tolerance Mechanisms (FTMs) to avoid failures at run-time 3 / 22

EDCC-8 28 April 2010 Valencia, Spain MobiLab ::. Rationale (2/2)  FTMs: A few examples  Spatial redundancy CORBA FT, TANDEM90 Process Pairs  Temporal redundancy Checkpointing and rollback  Software Fault Injection (SFI) is a valuable approach for the verification and the improvement of FTMs  To correctly emulate software faults, we need to understand their features 4 / 22

EDCC-8 28 April 2010 Valencia, Spain MobiLab ::. Software Faults  BohrBugs  Faults whose activation is reproducible, i.e., it is straightforward to identify its activation pattern  Typically detected and then fixed during testing phase 5 / 22  MandelBugs  Faults whose activation is transient and not systematically reproducible  Their activation conditions depend on complex combinations of user inputs, the internal state and the external environment

EDCC-8 28 April 2010 Valencia, Spain MobiLab ::. Problem statement  Mandelbugs represent the major cause of failure in mission-critical system ..up to 82 % in well-tested software [2] [5] [6]  Mandelbugs are typically tolerated by the adoption of several redundancy schemes 6 / 22  Are existing SFI techniques able to emulate Mandelbugs adequately?  How should Mandelbugs be emulated?

EDCC-8 28 April 2010 Valencia, Spain MobiLab ::. Software Fault Injection (SFI)  To date, representativeness of injected faults has still not been investigated with respect to:  Fault manifestation;  Their effectiveness in testing FTMs (i.e., to emulate faults that most often occur and that they should tolerate) 7 / 22

EDCC-8 28 April 2010 Valencia, Spain MobiLab ::. Contributions  We aim to investigate this issue with a simple experimental campaign but… ….in a complex and real-world software sytems  We evaluated G-SWFIT, with respect to Mandelbugs  We compared the results with an experiment, specifically designed to emulate Mandelbugs  Case study: a fault-tolerant system from the Air Traffic Control (ATC) domain  It is a Flight Data Processor (FDPS) based on a CORBA-compliant middleware 8 / 22

EDCC-8 28 April 2010 Valencia, Spain MobiLab ::. Case study (1/2) 9 / 22

EDCC-8 28 April 2010 Valencia, Spain MobiLab ::. Case study (2/2)  We modeled the FDPS as a FSM to support the analysis of faults 10 / 22  A state consists of the following internal variables: 1)The number of FDP requests queued by the Façade 2)The number of requests under processing 3)The number of requests queued by Processing Servers (PSs)  CR, FR, PSC, …, are the messages exchanged in the FDPS

EDCC-8 28 April 2010 Valencia, Spain MobiLab ::. Experimental campaign using G-SWFIT (1/4)  We implemented G-SWFIT fault operators in an open-source fault injection tool  The tool analyzes a C/C++ source code file, to produce a set of faulty source files  Freely available at: 11 / 22 C pre- processor C/C++ Source Files C/C++ frontend Fault Injector + ÷ 2 63 Abstract Syntax Tree int main() { if( a && b ) { c++; } } Patch Files (with faults)

EDCC-8 28 April 2010 Valencia, Spain MobiLab ::. Experimental campaign using G-SWFIT (2/4)  533 faults have been injected in the Façade source code  1599 experiments (3 different workloads)  For each experiment, we collected: Information about a failure (e.g., Façade crash, switch to the backup, missed FDP requests) The state in which a failure occurred The state in which the fault was activated 12 / 22

EDCC-8 28 April 2010 Valencia, Spain MobiLab ::. Experimental campaign using G-SWFIT (3/4)  G-SWFIT is useful to test important system states (e.g., the checkpointing mechanism)  However, faults did not emulate well Mandelbugs because:  A great amount of faults (56%) manifest themselves during Façade initialization or during the first request (state 0:0:0); but Mandelbugs usually manifest themselves during the operational phase of a system 13 / 22 faults activatedfailures

EDCC-8 28 April 2010 Valencia, Spain MobiLab ::. Experimental campaign using G-SWFIT (4/4) 14 / 22  However, faults did not emulate well Mandelbugs because (CONTINUED):  In most of cases (93%) in which the backup Façade is activated, the backup also fails (i.e., fault activation is simple to reproduce, like Bohrbugs)  Some important states (potentially affected by Mandelbugs) are untested (e.g., when one or more requests are queued by the PSs)  State coverage: 65%

EDCC-8 28 April 2010 Valencia, Spain MobiLab ::. Concurrency fault emulation (1/3)  To emulate Mandelbugs, we analyzed the scientific literature on software faults  We identified the following fault triggers:  Concurrency  Timing of external events  Wrong memory state  Faulty error handlers  Complex input sequences  Software aging 15 / 22

EDCC-8 28 April 2010 Valencia, Spain MobiLab ::. Concurrency fault emulation (2/3)  Features of most frequent concurrency faults (from a field data study [29]):  They are atomicity-violation faults (49%)  Only 1 shared variable is involved (66%)  At most 2 threads are needed to trigger the fault (90%)  Our fault model:  2 threads access to a shared variable without acquiring a lock (race condition) 16 / 22

EDCC-8 28 April 2010 Valencia, Spain MobiLab ::. Concurrency fault emulation (3/3)  We propose a fault emulation technique in two phases:  Fault injection:  Collects information about critical regions and their memory accesses  Removes lock operations before and after a pair of conflicting critical regions  Trigger injection :  Submits an input sequence to drive the system to a target state  Schedules 2 threads such that memory accesses interfere with each other 17 / 22

EDCC-8 28 April 2010 Valencia, Spain MobiLab ::. Preliminary system characterization  Focusing on fault triggering we have to profile the system to recognize (and then to drive) the operating state  An input is associated to:  A sequence of messages sent and received by the Façade  A sequence of lock and memory accesses 18 / 22 sent by the tester

EDCC-8 28 April 2010 Valencia, Spain MobiLab ::. How to trigger a fault?  An algorithm identifies (i) which inputs to send and (ii) in which state to send inputs to trigger a fault  The algorithm exploits preliminary information to match messages with shared memory accesses 19 / 22

EDCC-8 28 April 2010 Valencia, Spain MobiLab ::. An example of concurrency fault (1/2) The CRQ input is sent 20 / 22 Lock operation omittedThread 1 is blockedThe CR input is sentThread 2 writes an inconsistent value Thread 1 reads the (faulty) value An algorithm processes the FSM to find when to send inputs (next slide)

EDCC-8 28 April 2010 Valencia, Spain MobiLab ::. Experimental campaign using concurrency faults  4 injected concurrency faults lead to a failure of primary Façade and not the backup one  We covered 13 out of 14 states (93%) in which faults were injected  Cumulative state coverage: 95%  In particular, states *:3:1 were tested (i.e., one or more requests queued by PSs) 21 / 22

EDCC-8 28 April 2010 Valencia, Spain MobiLab ::. Lessons Learned  Are existing SFI techniques able to emulate Mandelbugs adequately?  No, G-SWFIT should be complemented by taking into account Mandelbugs  How should Mandelbugs be emulated?  Our solution is to identify most common fault triggers, and to try to emulate them in addition to modifying the source code 22 / 22

EDCC-8 28 April 2010 Valencia, Spain MobiLab ::. Thank you! Any questions?

EDCC-8 28 April 2010 Valencia, Spain MobiLab ::. Backup slides

EDCC-8 28 April 2010 Valencia, Spain MobiLab ::. G-SWFIT fault operators  G-SWFIT fault operators were derived from a field data study [14]  Fault activation and manifestation were neglected due to lack of data :-D :-D :-D