An Experimental Evaluation of the Reliability of Adaptive Random Testing Methods Hong Zhu Department of Computing and Electronics, Oxford Brookes University,

Slides:

Advertisements

Similar presentations

Answering Approximate Queries over Autonomous Web Databases Xiangfu Meng, Z. M. Ma, and Li Yan College of Information Science and Engineering, Northeastern.

Advertisements

Hong Zhu Department of Computing and Communication Technologies Oxford Brookes University, Oxford OX33 1HX, UK COMPSAC 2012 PANEL.

Random Testing Tor Stålhane Jonas G. Brustad. What is random testing The principle of random testing is simple and can be described as follows: 1.For.

Statistical Issues in Research Planning and Evaluation

Principal Component Analysis (PCA) for Clustering Gene Expression Data K. Y. Yeung and W. L. Ruzzo.

Effect Size and Meta-Analysis

The Ways and Means of Psychology STUFF YOU SHOULD ALREADY KNOW BY NOW IF YOU PLAN TO GRADUATE.

1 Rare Event Simulation Estimation of rare event probabilities with the naive Monte Carlo techniques requires a prohibitively large number of trials in.

1 Software Reliability Growth Models Incorporating Fault Dependency with Various Debugging Time Lags Chin-Yu Huang, Chu-Ti Lin, Sy-Yen Kuo, Michael R.

Date:2011/06/08 吳昕澧 BOA: The Bayesian Optimization Algorithm.

Evaluating Hypotheses

An Experimental Evaluation on Reliability Features of N-Version Programming Xia Cai, Michael R. Lyu and Mladen A. Vouk ISSRE’2005.

An Overview of Today’s Class

Efficient Estimation of Emission Probabilities in profile HMM By Virpi Ahola et al Reviewed By Alok Datar.

Meta-analysis & psychotherapy outcome research

Parameterizing Random Test Data According to Equivalence Classes Chris Murphy, Gail Kaiser, Marta Arias Columbia University.

QMS 6351 Statistics and Research Methods Chapter 7 Sampling and Sampling Distributions Prof. Vera Adamchik.

Experimental Evaluation

Personality, 9e Jerry M. Burger

Lehrstuhl für Informatik 2 Gabriella Kókai: Maschine Learning 1 Evaluating Hypotheses.

Today Concepts underlying inferential statistics

Relationships Among Variables

Elec471 Embedded Computer Systems Chapter 4, Probability and Statistics By Prof. Tim Johnson, PE Wentworth Institute of Technology Boston, MA Theory and.

Fig Theory construction. A good theory will generate a host of testable hypotheses. In a typical study, only one or a few of these hypotheses can.

©2003/04 Alessandro Bogliolo Background Information theory Probability theory Algorithms.

Software Testing Sudipto Ghosh CS 406 Fall 99 November 9, 1999.

Copyright © Cengage Learning. All rights reserved.

Chapter 2 The Research Enterprise in Psychology. n Basic assumption: events are governed by some lawful order  Goals: Measurement and description Understanding.

Chapter 1: Introduction to Statistics

Analyzing Reliability and Validity in Outcomes Assessment (Part 1) Robert W. Lingard and Deborah K. van Alphen California State University, Northridge.

Some Background Assumptions Markowitz Portfolio Theory

Program Evaluation. Program evaluation Methodological techniques of the social sciences social policy public welfare administration.

Understanding Statistics

1 ECE 453 – CS 447 – SE 465 Software Testing & Quality Assurance Instructor Kostas Kontogiannis.

Chapter 8 Introduction to Hypothesis Testing

The Research Enterprise in Psychology. The Scientific Method: Terminology Operational definitions are used to clarify precisely what is meant by each.

CS433: Modeling and Simulation Dr. Anis Koubâa Al-Imam Mohammad bin Saud University 15 October 2010 Lecture 05: Statistical Analysis Tools.

Yaomin Jin Design of Experiments Morris Method.

 2003, G.Tecuci, Learning Agents Laboratory 1 Learning Agents Laboratory Computer Science Department George Mason University Prof. Gheorghe Tecuci 5.

Experimentation in Computer Science (Part 1). Outline  Empirical Strategies  Measurement  Experiment Process.

Statistical Applications Binominal and Poisson’s Probability distributions E ( x ) =  =  xf ( x )

1 CS 391L: Machine Learning: Experimental Evaluation Raymond J. Mooney University of Texas at Austin.

Test Drivers and Stubs More Unit Testing Test Drivers and Stubs CEN 5076 Class 11 – 11/14.

Bioinformatics Expression profiling and functional genomics Part II: Differential expression Ad 27/11/2006.

MGS3100_04.ppt/Sep 29, 2015/Page 1 Georgia State University - Confidential MGS 3100 Business Analysis Regression Sep 29 and 30, 2015.

Chin-Yu Huang Department of Computer Science National Tsing Hua University Hsinchu, Taiwan Optimal Allocation of Testing-Resource Considering Cost, Reliability,

Statistical Power The power of a test is the probability of detecting a difference or relationship if such a difference or relationship really exists.

Stats Probability Theory Summary. The sample Space, S The sample space, S, for a random phenomena is the set of all possible outcomes.

Question paper 1997.

Rescaling Reliability Bounds for a New Operational Profile Peter G Bishop Adelard, Drysdale Building, Northampton Square,

Experimentation in Computer Science (Part 2). Experimentation in Software Engineering --- Outline  Empirical Strategies  Measurement  Experiment Process.

Classification Ensemble Methods 1

Computer Science 1 Systematic Structural Testing of Firewall Policies JeeHyun Hwang 1, Tao Xie 1, Fei Chen 2, and Alex Liu 2 North Carolina State University.

Analysis of Experimental Data; Introduction

8/23/00ISSTA Comparison of Delivered Reliability of Branch, Data Flow, and Operational Testing: A Case Study Phyllis G. Frankl Yuetang Deng Polytechnic.

1 © 2011 Professor W. Eric Wong, The University of Texas at Dallas Requirements-based Test Generation for Functional Testing W. Eric Wong Department of.

1 Collecting and Interpreting Quantitative Data Deborah K. van Alphen and Robert W. Lingard California State University, Northridge.

Shadow Detection in Remotely Sensed Images Based on Self-Adaptive Feature Selection Jiahang Liu, Tao Fang, and Deren Li IEEE TRANSACTIONS ON GEOSCIENCE.

5. Evaluation of measuring tools: reliability Psychometrics. 2011/12. Group A (English)

Methods of multivariate analysis Ing. Jozef Palkovič, PhD.

Overview Theory of Program Testing Goodenough and Gerhart’s Theory

Random Testing: Theoretical Results and Practical Implications IEEE TRANSACTIONS ON SOFTWARE ENGINEERING 2012 Andrea Arcuri, Member, IEEE, Muhammad.

Hypothesis Testing and Confidence Intervals (Part 1): Using the Standard Normal Lecture 8 Justin Kern October 10 and 12, 2017.

12 Inferential Analysis.

Random Testing.

Analyzing Reliability and Validity in Outcomes Assessment Part 1

12 Inferential Analysis.

Analyzing Reliability and Validity in Outcomes Assessment

Collecting and Interpreting Quantitative Data

MGS 3100 Business Analysis Regression Feb 18, 2016

Presentation transcript:

An Experimental Evaluation of the Reliability of Adaptive Random Testing Methods Hong Zhu Department of Computing and Electronics, Oxford Brookes University, Oxford OX33 1HX, UK

Aug Exterimental Study of ART Reliability Motivation  General problem:  How to evaluate and compare software testing methods?  Existing works  Comparison criteria  Fault detecting ability  Cost  Comparison methods  Formal analysis Subsumption relation and other relations Axiomatic assessment  Experimental and empirical studies A great number of software testing methods have been proposed, but few have been widely used in practice.

Aug Exterimental Study of ART Reliability What is a good software testing method?  Goodenough and Gerhart (1975):  Validity:  for all software under test (SUT) there is a test set generated by the method that can detect faults if any.  Reliability:  if one such test set detects a fault, then ideally any test set of the method will also detect the fault. J. B. Goodenough and S. L. Gerhart, Toward a theory of test data selection, IEEE TSE, Vol.SE_3, June Fault detecting abilityReliable and stable

Aug Exterimental Study of ART Reliability Proposed idea  Research hypothesis  It is inadequate if testing methods are only compared on fault detecting ability. They should also be compared on how reliable they are in the sense of stably producing high fault detecting rates.  Research questions:  Are testing methods differ from each other in reliability?  If yes, what are the factors that affect their reliabilities?  How to compare testing methods’ reliability?  How to measure?  How to do experiments?  What are the practical implications?

Aug Exterimental Study of ART Reliability The approach  Investigation of some specific software testing methods to test the hypothesis:  random software testing (RT)  adaptive random software testing methods (ART) Adaptive random testing have been proposed to further improve the effectiveness and efficiency of random testing by evenly distribute the test cases in the input space. (T.Y. Chen, et al., 2004, …) Random testing have been regarded as effective as systematic testing more efficient than systematic methods

Aug Exterimental Study of ART Reliability Background: (1) Random testing  Random testing (RT) is a software testing method that selects or generates test cases through random sampling over the input domain of the SUT according to a given probability distribution.  Representative random testing  Non-representative random testing

Aug Exterimental Study of ART Reliability Definitions Let D be the input domain, i.e. the set of valid input data to the SUT. A test set T=, n  0, on domain D is a finite sequence of elements in D, i.e. a i  D, for all i=1, 2, …, n. The number n of elements in a test set is called the size of the test set. A SUT is tested on a test set T= means that the software is executed on inputs a 1, a 2, …, a n one by one with necessary initialization before each execution. Strictly speaking, it should not be called test set as elements are ordered. But, it is a convention in the literature of ART to call them test set. When n=0, the test set is called empty test, which means the software is not executed at all. Without loss of generality, we assume that test sets considered in this paper are non-empty.

Aug Exterimental Study of ART Reliability Definitions A test set T= is a random test set if the elements are selected or generated at random independently according to a given probability distribution on the input domain D. If the same input is allowed to occur more than once in a test set, we say that the random test sets are with replacement; otherwise, the test sets are called without replacement. In the experiment reported here, the random test sets are with replacement.

Aug Exterimental Study of ART Reliability Background: (2) Adaptive random testing  Basic idea:  To evenly spread the test cases in the input space so that it is more likely to find faults in the software.  First to generate random test cases, then to manipulate the random test cases to achieve the goal of even spread.  It is proposed and investigated by Prof. T.Y. Chen and his research group. A variety of such manipulations have been proposed and investigated.

Aug Exterimental Study of ART Reliability Adaptive Random Testing 1: Mirror Let D 0, D 1, D 2, …, D k (k>0) be a disjoint partition of the input domain D. The sub-domains D 0, …, D k are of equal size and there is an one-to-one mapping m i from D 0 to D i, i=1, …, k. The sub- domain D 0 is called the original sub-domain. The other sub- domains are called the mirror sub-domains. The mappings m i are called the mirror functions. Let T 0 = be a random test set on D 0. The mirror of the original test sets T 0 in a mirror sub-domain D i through m i is a test set T i, written m i (T 0 ), defined as follows. m i ( )=, where mi(a 0 j ) = a i j, i=1, …, k, j=1, …, n. The mirror test set of T 0, written Mirror(T 0 ), is a test set on D that Mirror(T 0 ) =.

Aug Exterimental Study of ART Reliability Adaptive Random Testing 2: Distance Let || x, y || be a distance measure defined on input domain D. It is extended to the distance between an element x and a set Y of elements as || x, Y|| = min(||x, y||, y  Y). A test set T= is maximally distanced if for all i=1, 2, …, n, || a i, {a 0, …, a i-1 }||  || a j, {a 0, …, a i-1 } ||, for all j > i. Let T be any given test set. The distance manipulation of T is a re-ordering of T’s elements in the sequence so that it is maximally distanced, written Distance(T). A distance measure on a domain D is a function || || from D 2 to real numbers in [0,  ) such that for all x, y, z  D, || x, x || = 0; || x, y|| = ||y, x || and ||x, y ||  ||x, z || + || z, y||.

Aug Exterimental Study of ART Reliability Adaptive Random Testing 3: Filter Let || x, y || be a distance measure defined on the input domain D. Let T be any given non-empty test set. Test set T’ = is said to be ordered according to the distance to T, if for all i=1, 2,. …, n-1, ||a i, T||  ||a i+1, T||. Let S and C be two given natural numbers such that S > C > 0, and T 0, T 1, …, T k be a sequence of test sets of size S>1. Let U 0 =T 0. Assume that U i = and T i+1 =. We define T’ i+1 = be a test set obtained from T i+1 by permutation its elements so that T’ i+1 is ordered according to the distance to U i. Then, we inductively define U i+1 =, for i=0, 1, …, k-1. The test set U k is called the test set obtained by filtering C out of S test cases.

Aug Exterimental Study of ART Reliability Notes on the definitions of ART techniques  Strictly speaking, the results of the above manipulations are not random test sets even if the test sets before manipulation are random.  These manipulations can be combined together to obtain more sophisticated adaptive random testing.  The set of manipulations defined above may not be complete. There are also other manipulations on random test set;  In the literature, ART methods are usually described in the form of test case generation algorithms rather than manipulation operators.  The filter manipulation is not used in the experiments reported in this paper. For example, in mirror adaptive random testing, the input domain is partitioned. A random test set is first generated on the original sub- domain. It is then manipulated by the Distance operation and then Mirrored into the mirror sub-domain. It is defined here to demonstrate that testing methods defined in the form of an algorithm can also be defined formally as manipulations. This simplifies the descriptions of ART testing methods concisely and formally, This enables us to recognise easily the various different ways in which they can be combined.

Aug Exterimental Study of ART Reliability The general framework of experiments with software testing method  Select a set of software under test (SUT): S 1, S 2, …, S k, k>0;  Select a measurement f of fault detecting ability on single test set;  Applying the test method to generate test sets T 1, T 2, …, T k  Test the SUT on test sets and measure fault detecting ability:  Assume that m 1, m 2, …, m k are the fault detecting abilities of T 1, T 2, …, T k as measured by f in testing SUTs S 1, S 2, …, S k,  Calculate the performance of the test method using a fault detecting ability measure S i can be the same as or different from S j.

Aug Exterimental Study of ART Reliability Background: (4) Measurements of fault detecting ability P-measure: Probability of detecting at least one fault The P-measure of the k tests T 1, T 2, …, T k is defined as follows. P(T 1, T 2, …, T k )=d/k, where k>0, When k , P(T 1, T 2, …, T k ) approaches the probability that a test set T detects at least one fault.

Aug Exterimental Study of ART Reliability Understanding P-measure When the test cases in each test set T are generated at random independently using the same distribution of the operation profile, the probability that a test set of size w detects at least one fault is 1-(1-  ) w, where  is the failure probability of the software. Formally, for random testing, we have that P(T)  1-(1-  ) w, where w = | T |. In other words, the P-measure of random testing is an estimation of the value 1-(1-  ) w, where w is the test size.

Aug Exterimental Study of ART Reliability Measurements of fault detecting ability (2) E-measure: Average number of failures The E-measure of k tests T 1, T 2, …, T k is defined as follows. M(T 1, T 2, …, T k )=, where k>0,, i=1,…, k.

Aug Exterimental Study of ART Reliability Understanding the E-measure When the test sets T i are of the same size, say w, and the test cases are selected at random independently using the same distribution as the operation profile, the E-measure approaches  w, when k . Here,  is the failure probability of the SUT. Formally, for random testing we have that M(T)   w, where w = | T |. In other words, the E-measure of random testing is the estimation of  w, where w is the test size. When test set size is 1, for random testing, P-measure equals to E-measure.

Aug Exterimental Study of ART Reliability Measurements of fault detecting ability (3) F-measure: The First failure The F-measure is defined as follows. F(T 1, T 2, …, T k )=, where k>0, T i =, v i is a natural number that S fails on the test case a vi and for all 0<n<v i, S is correct on test case a n. It is assumed that the sizes of the test sets T i, i=1,…, k, are large enough so that v i exists.

Aug Exterimental Study of ART Reliability Understanding the F-measure  The F-measure is a random variable of geometric distribution, i.e. the distribution of first success Bernoulli trial, under the conditions:  the test cases a i in each test set T are selected at random independently  using the same distribution of the operation profile  In other words, when k , F(T 1, T 2, …, T k )  1/ , where  is the probability of failure. F(T)  1/  Summary. For random testing: P(T)  1  (1  ) w M(T)   wF(T)  1/  where w= |T|

Aug Exterimental Study of ART Reliability Measuring the reliability of test methods S-measure: The variation of fault detecting ability Let:  T 1, T 2, …, T k be a number of test sets generated according to a testing method  f be a given measurement of fault detecting ability, The S-measure S(T 1, T 2, …, T k ) of tests T 1, T 2, …, T k is formally defined as follows. S(T 1, T 2, …, T k )=, where k>1, m i =f(T i ), i=1,…, k,. The S-measure is the sample standard deviation of the values m 1, m 2, …, m k of the f-measures on the test sets. The smaller the S-measure, the more reliable the testing method is.

Aug Exterimental Study of ART Reliability The experiment: the goals  to evaluate ART methods on their reliabilities of fault-detecting abilities with respect to realistic faults.  to identify the factors that influence the reliability of fault detecting ability.  two main candidate factors:  the reliability (or equivalently the failure probability) of the SUT  the regularity of failure domains. This is achieved through the design of experiment with repetition of testing on a number of test sets. Andrews, Briand and Labiche (2005): mutants systematically generated by using mutation testing tools can represent very well the real faults. Thus, we use mutants in our experiments rather than simulation.

Aug Exterimental Study of ART Reliability Experiment process A. Selection of subject programs  GCD: the greatest common divisor  LCM: the least common multiplier B. Generation of mutants  Systematically by using the MuJava testing tool C. Generation of test sets  For each testing method, five test sets were generated  Each test set consists of 1000 test cases. D. Test the mutants  Both F-measures and E-measures are used E. Analysis of the test results Two dimensional input space on natural numbers.

Aug Exterimental Study of ART Reliability Distribution of mutants Mutant TypeGCDLCMTotal AOIS AOIU9615 AORB448 AORS011 LOI14 28 ROR15 30 Total Equivalent Mutants Trivial Mutants Used in experiment The trivial mutants fail on every test case. Types of mutants according to the mutation operators.

Aug Exterimental Study of ART Reliability Test Set Generation Algorithm Step 1. The input domain is divided into two sub-domains. The original sub- domain D 0 ={ }  { }, The mirror sub-domain D 1 ={ }  { }. Step 2. Use pseudo-random function of the Java language with different seeds to generate 500 random test cases in the domain D 0. Let T 1 be the set of these test cases, where the elements are ordered in the order that they are generated. Step 3. The mirror test set T 2 is generated by applying the following mirror mapping m: D 0  D 1. m( )=. (*) The test set RT is obtained by merging T 1 with T 2, i.e. adding the elements of T 2 at the end of T 1. Step 4. Let T=T 1  T 2. The test set DMART is obtained by applying the Distance manipulation on T, i.e. DMART=Distance(T). Step 5. Let T 3 be the result of applying the distance manipulation on T 1, i.e. T 3 =Distance(T 1 ). The test set MDART is obtained by applying the Mirror manipulation on T 3 with the mirror mapping m as defined in (*) above, i.e. MDART=Mirror(T 3 )=Mirror(Distance(T 1 ))

Aug Exterimental Study of ART Reliability The characteristics of the mutants Distribution of average E-measures over 5 test sets Trivial mutants

Aug Exterimental Study of ART Reliability What’s new in the mutants? In the existing research on ART techniques, the subject samples (i.e. the SUT) are always selected with very low failure rates.  Mayer, Chen, Huang (2006): the comparison of RT and ART techniques  mutants with a failure rate higher than 0.05 were dismissed.  all the mutants of failure rates at most 0.05 were treated equally in statistical analysis.  Chen et al. (2004): the performances of ART testing techniques were evaluated by simulating the failure rates of 0.01, 0.005, and  Chen et al. (2007): most of the simulations are for failure rates less then 25%. There are only three sets of simulation data for failure rates higher than 25%, i.e., 1/2 and. Open question: How does ART perform on SUT that have higher failure rates. Simulation experiments on ART

Aug Exterimental Study of ART Reliability Distribution of StDev of E-measures over 5 test sets

Aug Exterimental Study of ART Reliability The relationship between Avg E-measures and their StDev

Aug Exterimental Study of ART Reliability Main findings : (1) Fault detecting abilities Average F-measures over all non-trivial mutants

Aug Exterimental Study of ART Reliability DMART improves RT by 17.64%, MDART improves RT by 3.75%; How much ART improves RT?

Aug Exterimental Study of ART Reliability Comparison with existing works  The results we obtained seem inconsistent with existing works obtain by simulations.  MART and DART were of similar fault detecting abilities in terms of average F-measures. They only differ from each other in the time needed to generate test cases.  The scale of improvement is much smaller than that demonstrated in existing work, which could be up to 40% increase in F-measures while we have shown that the average improvement is less than 20%.

Aug Exterimental Study of ART Reliability Main findings: (2) Factors affect ability What is the main factors that affect ART’s fault detecting ability? three levels of F-measures (>2.5, 1.5 ~ 2.5, <1.5) the M-measures can be divided into three areas (240~340, 340 ~ 680, >680).

Aug Exterimental Study of ART Reliability This explains why our results are different from existing works  Our mutants have higher failure rates.  Others’ have lower faiulre rates ART’s fault detecting abilities increase as SUT’s failure rate decreases.

Aug Exterimental Study of ART Reliability Main findings: (3) ART’s reliability DMART is the most reliable testing method among the three.

Aug Exterimental Study of ART Reliability How much ART improves RT on reliability? DMART made a significant improvement in reliability over RT. The improvement that MDART made is much smaller

Aug Exterimental Study of ART Reliability Main findings: (4) Factors affect reliability Is SUT’s failure rate a factor that affects ART reliability? The relationship between the standard deviation of the F-measure on DMART tests and the average failure rates of the mutants.

Aug Exterimental Study of ART Reliability The correlation coefficients between the average E-measure and the average standard deviations of F-measure of RT, MDART and DMART on all mutants are , and , respectively. The reliabilities of these testing methods depend on SUT’s failure rate. The variation of DMART tests’ F- measure decreases as the average failure rates increases.

Aug Exterimental Study of ART Reliability Is the ‘regularity’ of failure domain a factor that affects reliability? As the standard deviations of E-measures increases, the standard deviations of F-measures tend to increase.

Aug Exterimental Study of ART Reliability The correlation coefficients between the standard deviations of F- measures of RT, MDART and DMART and the standard deviations of E-measures for all mutants are 0.574, and 0.430, respectively. The average standard deviation is calculated on mutants in various ranges of standard deviations of E-measures The reliabilities of testing methods are related to the standard deviation of E-measures, i.e. the regularity of failure domains.

Aug Exterimental Study of ART Reliability Conclusion  We confirmed that ART not only improves the fault detecting ability even on higher failure rate SUT  We discovered that ART significantly improves the reliability of fault detecting ability  the variations on fault detecting abilities are significantly smaller than random testing, where  fault detecting ability is measured by F-measures,  the variation is measured by the standard deviation of F-measures.  We discovered two factors that affect ART’s reliability.  the failure rate of the software under test: when the SUT’s failure rate increases, the test method’s reliability increases.  the regularity of the failure domain: when the regularity increases, the reliability also increases, where  regularity is measured by the standard deviation of E-measures.

Aug Exterimental Study of ART Reliability References  Yu Liu and Hong Zhu, An Experimental Evaluation of the Reliability of Adaptive Random Testing Methods, Proc. of The Second IEEE International Conference on Secure System Integration and Reliability Improvement (SSIRI 2008), Yokohama, Japan, July 14-17, 2008, pp24-31.