Software Reliability Engineering: A Roadmap

Presentation transcript:

Software Reliability Engineering: A Roadmap. Future of Software Engineering, ICSE’2007, Minneapolis, Minnesota, May 24, 2007. Michael R. Lyu, Dept. of Computer Science & Engineering, The Chinese University of Hong Kong.

Introduction. Software reliability is the probability of failure-free operation with respect to execution time and environment. Software reliability engineering (SRE) is the quantitative study of the operational behavior of software-based systems with respect to user requirements concerning reliability. SRE has been adopted by more than 50 companies as standards or best current practices. Creditable software reliability techniques are still urgently needed.

Historical SRE Techniques: Fault Lifecycle Fault prevention: to avoid, by construction, fault occurrences. Fault removal: to detect, by verification and validation, the existence of faults and eliminate them. Fault tolerance: to provide, by redundancy and diversity, service complying with the specification in spite of manifested faults. Fault/failure forecasting: to estimate, by statistical modeling, the presence of faults and occurrence of failures.

Fault Lifecycle Technique (figure). Panels: Fault Manifestation and Modeling Process. Attributes: Reliability, Availability, Safety, Security. Techniques: Fault Prevention, Fault Removal, Fault Tolerance, Fault/Failure Forecasting.

Software Reliability Modeling: $R = e^{-\lambda t}$ (figure axis: Testing Time).
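The formula on this slide appears to be the constant-failure-rate exponential model, R = e^(-λt). The sketch below evaluates it under that assumption; the function names and the example failure rate are hypothetical and are not taken from the presentation.

```python
import math


def reliability(failure_rate: float, t: float) -> float:
    """Probability of failure-free operation up to time t,
    assuming a constant failure rate: R(t) = exp(-failure_rate * t)."""
    if failure_rate < 0 or t < 0:
        raise ValueError("failure_rate and t must be non-negative")
    return math.exp(-failure_rate * t)


def mean_time_to_failure(failure_rate: float) -> float:
    """Under the same constant-rate assumption, MTTF = 1 / failure_rate."""
    if failure_rate <= 0:
        raise ValueError("failure_rate must be positive")
    return 1.0 / failure_rate


if __name__ == "__main__":
    lam = 0.01  # hypothetical value: 0.01 failures per hour
    for hours in (0, 10, 100):
        print(hours, round(reliability(lam, hours), 4))
```

With a constant rate, reliability only decays with time; reliability growth during testing corresponds to the estimated rate itself decreasing as faults are removed.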

Current SRE Process Overview

Current Trends and Problems The theoretical foundation of software reliability comes from hardware reliability techniques. Software failures do not happen independently. Software failures seldom repeat in exactly the same or predictable pattern. Failure mode and effect analysis (FMEA) for software is still controversial and incomplete. There is currently a need for a creditable end-to-end software reliability paradigm that can be directly linked to reliability prediction from the very beginning.

Future Direction 1: Reliability-Centric Software Architectures. The product view: achieve failure-resilient software architecture (fault prevention, fault tolerance). The process view: explore component-based software engineering (component identification, construction, protection, integration and interaction). Reliability modeling based on software structure.

Future Direction 2: Design for Reliability Achievement. Fault confinement, fault detection, diagnosis, reconfiguration, recovery, restart, repair, reintegration.

Figure labels: Fault Confinement, Fault Detection, Failover, Diagnosis, Online, Offline, Reconfiguration, Recovery, Restart, Repair, Reintegration.

1. Fault confinement. This stage limits the spread of fault effects to one area of the Web service, thus preventing contamination of other areas. Fault confinement can be achieved through fault detection within the Web services, consistency checks, and multiple requests/confirmations.

2. Fault detection. This stage recognizes that something unexpected has occurred in the Web services. Fault latency is the period of time between the occurrence of a fault and its detection. Techniques fall into two classes: off-line and on-line. With off-line techniques, such as diagnostic programs, the service is not able to perform useful work while under test. On-line techniques, such as duplication, provide a real-time detection capability that is performed concurrently with useful work.

3. Diagnosis. This stage is necessary if the fault detection technique does not provide information about the failure location and/or properties.

4. Reconfiguration. This stage occurs when a fault is detected and a permanent failure is located. Web services can be composed of different components, and individual components may fail while the service is being provided. The system may reconfigure its components either to replace the failed component or to isolate it from the rest of the system.

5. Recovery. This stage utilizes techniques to eliminate the effects of faults. Three basic recovery approaches are fault masking, retry, and rollback. Fault-masking techniques hide the effects of failures by allowing redundant information to outweigh the incorrect information. Web services can be replicated or implemented with different versions (NVP). Retry makes a second attempt at an operation and is based on the premise that many faults are transient in nature. Because Web services are provided over a network, retry is practical: requests and replies may be affected by the state of the network. Rollback makes use of the fact that the Web service operation is backed up (checkpointed) at some point in its processing prior to fault detection, and operation recommences from this point. Fault latency is important here because the rollback must go back far enough to avoid the effects of undetected errors that occurred before the detected error.

6. Restart. This stage occurs after the recovery of undamaged information. The Web services can be restarted by rebooting the server.
- Hot restart: resumption of all operations from the point of fault detection; possible only if no damage has occurred.
- Warm restart: only some of the processes can be resumed without loss.
- Cold restart: complete reload of the system with no processes surviving.

7. Repair. At this stage, a failed component is replaced. Repair can be off-line or on-line. Web services can be component-based and consist of other Web services. In off-line repair, either the Web service continues if the failed component/sub-Web service is not necessary for operation, or the Web service must be brought down to perform the repair. In on-line repair, the component/sub-Web service may be replaced immediately with a backup spare, or operation may continue without the component. With on-line repair, Web service operation is not interrupted.

8. Reintegration. In this stage the repaired module must be reintegrated into the Web service. For on-line repair, reintegration must be performed without interrupting Web service operation.
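The retry approach described under recovery, combined with switching to a replicated service when retries are exhausted, can be sketched as follows. This is only an illustration under stated assumptions: the helper names (`call_with_retry_and_failover`, `make_flaky`, `AllReplicasFailed`) and the retry counts are hypothetical, and each replica is modeled as a plain Python callable rather than a real Web service client.

```python
import time
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


class AllReplicasFailed(Exception):
    """Raised when every replica has exhausted its retries."""


def call_with_retry_and_failover(
    replicas: Sequence[Callable[[], T]],
    retries_per_replica: int = 2,
    delay_seconds: float = 0.0,
) -> T:
    """Try each replica in order; retry transient failures before moving on."""
    errors: List[Tuple[int, int, Exception]] = []
    for index, replica in enumerate(replicas):
        for attempt in range(1 + retries_per_replica):
            try:
                return replica()
            except Exception as exc:  # broad on purpose: any failure triggers recovery
                errors.append((index, attempt, exc))
                if delay_seconds:
                    time.sleep(delay_seconds)
    raise AllReplicasFailed(errors)


def make_flaky(failures_before_success: int, result: str) -> Callable[[], str]:
    """Hypothetical stand-in for a service that fails a fixed number of times."""
    state = {"remaining": failures_before_success}

    def call() -> str:
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise ConnectionError("transient network fault")
        return result

    return call


if __name__ == "__main__":
    primary = make_flaky(1, "primary")  # one transient fault, then succeeds
    backup = make_flaky(0, "backup")
    print(call_with_retry_and_failover([primary, backup]))
```

Retry handles the transient case cheaply; moving to the next replica only after retries are exhausted keeps the primary in use whenever it can still answer.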

Future Direction 3: Testing for Reliability Assessment. Establish the link between software testing and reliability. Study the effect of code coverage on fault coverage. Evaluate the impact of various testing metrics on reliability. Assess competing testing schemes quantitatively.

Positive vs. negative evidence for coverage-based software testing (resources: findings)

Positive
Frankl (1988), Horgan (1994), Weyuker (1988): High code coverage brings high software reliability and low failure rate.
Chen (1992): A correlation between code coverage and software reliability is observed.
Wong (1994): The correlation between test effectiveness and block coverage is higher than that between test effectiveness and the size of the test set.
Frate (1995): An increase in reliability comes with an increase in at least one code coverage measure.
Cai (2005): Code coverage contributes to a noticeable amount of fault coverage.

Negative
Briand (2000): The testing result on published data did not support a causal dependency between code coverage and defect coverage.

RSDIMU test case description. This is a description of the test set, giving the detailed testing purpose of each test case. The test cases can be classified as functional testing (1-800) and random testing (801-1200). They can also be classified into six regions (labeled with Roman numerals in the figure) according to their different patterns.

The correlation: various test regions. Linear regression relationship between block coverage and fault coverage in the whole test set. Linear modeling fitness in various test case regions: moderate overall, highest in region IV, lowest in region VI.

The correlation: normal operational testing vs. exceptional testing

Testing profile (size): R-square
Whole test case (1200): 0.781
Normal testing (827): 0.045
Exceptional testing (373): 0.944

Normal operational testing: very weak correlation. Exceptional testing: strong correlation.
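The R-square values above measure how well a straight line relates code coverage to fault coverage. A minimal sketch of that computation follows, assuming a simple one-variable least-squares fit with an intercept (in which case R-square equals the squared Pearson correlation). The function name and the sample numbers are hypothetical and are not the RSDIMU data.

```python
from typing import Sequence


def r_squared(x: Sequence[float], y: Sequence[float]) -> float:
    """R-square of the least-squares line y = a + b*x.

    For a one-variable fit with intercept this equals the squared
    Pearson correlation between x and y.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each vary")
    return (sxy * sxy) / (sxx * syy)


if __name__ == "__main__":
    # Hypothetical per-test-case measurements (fractions, not real data).
    block_coverage = [0.30, 0.35, 0.48, 0.50, 0.52, 0.70]
    fault_coverage = [0.10, 0.15, 0.40, 0.38, 0.45, 0.80]
    print(round(r_squared(block_coverage, fault_coverage), 3))
```

Computing the value separately for each testing profile, as the slide does, shows whether an overall correlation is driven by only one subset of the test cases.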

The correlation: normal operational testing vs. exceptional testing (plots of fault coverage for each profile). Normal testing: code coverage falls in a small range (48%-52%), covering the main control flow/data flow. Exceptional testing: two main clusters. The reason is that in some cases part of the large-scale computational functions is executed while the rest is skipped, whereas in other cases all of this computational code is skipped.

The Spectrum in Software Testing and Reliability

Time-based models (software reliability growth models): user oriented; more physical meaning; abundant models; easy data collection; less relevance to testing.
Coverage testing (coverage-based analysis): tester oriented; less physical meaning; lack of models; hard data collection; more relevance to testing.

A new model is needed to combine execution time and testing coverage.

A New Coverage-Based Reliability Model. λ(t,c): joint failure intensity function. λ1(t): failure intensity function with respect to time. λ2(c): failure intensity function with respect to coverage. α1, γ1, α2, γ2: parameters with the constraint α1 + α2 = 1. Equation annotations on the slide: joint failure intensity function, failure intensity function with time, failure intensity function with coverage, dependency factors, integral, derivative. Since λ1(t) is the failure intensity function with respect to time, any existing distribution from well-known reliability models can be used, e.g., NHPP, Weibull, S-shaped, or logarithmic Poisson models.
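As an illustration of a time-based failure intensity function that could play the role of λ1(t), the sketch below uses the Goel-Okumoto NHPP model. This choice is an assumption made here for illustration; the slide only says that well-known models such as NHPP may be used, and the joint model itself is not reproduced. The parameter values `a` and `b` are hypothetical. The code also checks numerically that the intensity is the derivative of the cumulative expected number of failures, which is the usual integral/derivative relationship between the two.

```python
import math
from typing import Callable


def go_mean_failures(t: float, a: float, b: float) -> float:
    """Goel-Okumoto mean value function: expected failures by time t,
    m(t) = a * (1 - exp(-b * t))."""
    return a * (1.0 - math.exp(-b * t))


def go_intensity(t: float, a: float, b: float) -> float:
    """Goel-Okumoto failure intensity: lambda(t) = dm/dt = a * b * exp(-b * t)."""
    return a * b * math.exp(-b * t)


def numerical_derivative(f: Callable[[float], float], t: float, h: float = 1e-6) -> float:
    """Central-difference approximation of f'(t)."""
    return (f(t + h) - f(t - h)) / (2.0 * h)


if __name__ == "__main__":
    a, b = 100.0, 0.05  # hypothetical: 100 expected total failures, rate 0.05
    for t in (0.0, 10.0, 50.0):
        approx = numerical_derivative(lambda s: go_mean_failures(s, a, b), t)
        print(t, round(go_intensity(t, a, b), 4), round(approx, 4))
```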

Estimation Accuracy NHPP model

Future Direction 4: Metrics for Reliability Prediction. New models (e.g., BBN) to explore rich software metrics. Data mining approaches. Machine learning techniques. Bridging the gap of the one-way function: feedback to building reliable software. Continuous industrial data collection efforts: demonstration of cost-effectiveness.

Future Direction 5: Reliability for Emerging Software Applications. “The Internet changes everything.” On-demand customizable software. Service-oriented architecture, composition, integration. Customization by middleware: from metadata to metacode. A common infrastructure delivers reliability to all customers.

A Paradigm for Reliable Web Service. Components (figure): Replication Manager (Web service selection algorithm, WatchDog), UDDI Registry, WSDL, Web Service, IIS, Application, Database, Client, Port.

1. Create Web services
2. Select primary Web service (PWS)
3. Register
4. Look up
5. Get WSDL
6. Invoke Web service
7. Keep checking the availability of the PWS
8. If the PWS fails, reselect the PWS
9. Update the WSDL
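The watchdog and reselection steps of this paradigm (checking the availability of the PWS, reselecting it on failure, and updating the published description) can be sketched in a few lines. Everything here is a hypothetical simplification: the class and method names are invented for illustration, each replica is represented by a health-probe callable, the "selection algorithm" is simply the first healthy replica in registration order, and updating the WSDL is modeled as changing a `published_endpoint` attribute. No real UDDI or WSDL API is used.

```python
from typing import Callable, Dict, Optional


class ReplicationManager:
    """Toy replication manager: tracks a primary Web service (PWS)
    and reselects it when the watchdog finds it unavailable."""

    def __init__(self, probes: Dict[str, Callable[[], bool]]) -> None:
        # Maps replica name -> health probe returning True when available.
        self._probes = dict(probes)
        self.primary: Optional[str] = None
        self.published_endpoint: Optional[str] = None
        self._select_primary()

    def _select_primary(self) -> None:
        """Assumed selection rule: first healthy replica in registration order."""
        for name, probe in self._probes.items():
            if probe():
                self.primary = name
                self.published_endpoint = name  # stands in for updating the WSDL
                return
        self.primary = None
        self.published_endpoint = None

    def watchdog_tick(self) -> Optional[str]:
        """One availability check; reselects the PWS if it has failed."""
        if self.primary is None or not self._probes[self.primary]():
            self._select_primary()
        return self.primary


if __name__ == "__main__":
    health = {"ws-a": True, "ws-b": True}
    manager = ReplicationManager({n: (lambda n=n: health[n]) for n in health})
    print(manager.primary)
    health["ws-a"] = False  # simulate a failure of the primary
    print(manager.watchdog_tick())
```

In a real deployment the tick would run periodically, and clients would look up the current endpoint before invoking the service.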

Conclusions. Software reliability is receiving more attention as it becomes an important economic consideration for businesses. New SRE paradigms need to consider software architectures, testing techniques, data analyses, and creditable reliability modeling procedures. Domain-specific approaches to emerging software applications are worthy of investigation. There is still a long way to go, but the directions are clear.