The Impact of Concurrent Coverage Metrics on Testing Effectiveness


1 The Impact of Concurrent Coverage Metrics on Testing Effectiveness
Rita Lingmei Chi Chunkun

2 Study: Concurrent Coverage → Fault Detection Effectiveness

3 Problems Testing multi-threaded programs is challenging
Concurrent fault detection techniques have limited accuracy. Solution: concurrent coverage metrics

4 Concurrent Coverage Metrics
Define a set of test requirements over: interleavings of synchronization operations; shared-variable accesses
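As an illustration of how such interleaving requirements might be enumerated, here is a minimal Python sketch in the spirit of a sync-pair-style metric. The event format and the pairing rule (consecutive synchronization operations on the same lock by different threads) are assumptions, not the paper's exact definition:

```python
# Hypothetical sketch: extracting sync-pair-style requirements from one
# execution trace. Event format (thread, stmt, lock) is an assumption.

def sync_pair_requirements(trace):
    """Collect ordered pairs (stmt1, stmt2) of synchronization statements
    executed consecutively on the same lock by different threads."""
    last_on_lock = {}   # lock -> (thread, stmt) of most recent sync op
    covered = set()
    for thread, stmt, lock in trace:
        prev = last_on_lock.get(lock)
        if prev is not None and prev[0] != thread:
            covered.add((prev[1], stmt))
        last_on_lock[lock] = (thread, stmt)
    return covered

trace = [("T1", "lock@10", "L"), ("T2", "lock@20", "L"), ("T1", "lock@10", "L")]
print(sync_pair_requirements(trace))
```

A metric then counts how many such pairs a test suite's executions cover, relative to all pairs observed across all executions.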

5 Problems The intuition behind concurrent coverage metrics is unexplored
No existing empirical study

6 Background Structural coverage metrics are used to derive a set of test requirements -> they enumerate thread-interleaving cases: execute specific code and satisfy constraints on thread interaction

7 Related Work Analytical comparisons exist between coverage definitions and bug patterns, and high levels of coverage correlate with testing effectiveness. A similar study evaluates the location-pair metric and compares it with two other metrics; this study is more comprehensive (HaPSet). Given a set {ρ1, …, ρn} of interleavings and a shared-memory-accessing or synchronization statement st ∈ Stmt, the History-aware Predecessor Set, HaPSet[st], is a set {st1, …, stk} of statements such that, for all i : 1 ≤ i ≤ k, an event e produced by st is immediately dependent upon an event ei produced by sti in some interleaving ρj, where 1 ≤ j ≤ n.
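The HaPSet definition above can be sketched in code, under one simplifying assumption: an event is treated as "immediately dependent" on the most recent earlier event on the same shared object issued by a different thread. That dependence rule and the event format are assumptions for illustration:

```python
# Sketch of HaPSet computation over a set of interleavings, assuming the
# simplified dependence rule described in the lead-in. Each interleaving
# is a list of (thread, stmt, obj) events.
from collections import defaultdict

def compute_hapsets(interleavings):
    hapset = defaultdict(set)          # stmt -> set of predecessor stmts
    for rho in interleavings:          # rho: one interleaving (event list)
        last_on_obj = {}               # obj -> (thread, stmt) of latest event
        for thread, stmt, obj in rho:
            prev = last_on_obj.get(obj)
            if prev is not None and prev[0] != thread:
                hapset[stmt].add(prev[1])   # st depends on prev's statement
            last_on_obj[obj] = (thread, stmt)
    return dict(hapset)
```

For example, an interleaving in which T1 writes x at one statement and T2 then reads x at another puts the write statement into the read statement's HaPSet.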

8 Goal Investigate existing concurrent coverage metrics
Provide evidence that a metric is useful, or demonstrate that it is of little use; quantify the relationship between coverage, test suite size, and fault detection effectiveness

9 Study Design

10 Study Design Variables and Measures
Independent: Concurrent Coverage Metrics; Test Suite Construction. Dependent: Achieved Concurrent Coverage; Test Suite Size; Fault Detection Effectiveness


13 Contributions & Methodology

14 Contributions Reveal variability
Correlation between concurrent coverage and fault detection; usefulness of coverage metrics (versus a random test suite of equal size); consideration of other factors (parameters used in random testing). The first contribution is that the paper re-evaluates prior work and reveals variability. The second is the discovery that concurrent testing effectiveness can be improved by considering other factors. These contributions arise from the study of two research questions.

15 Research Question Research Question 1: Coverage, effectiveness, size
The first research question asks what causes the increase in testing effectiveness. Attributing it to metric coverage alone may be wrong, because increased coverage is accompanied by increased test suite size.

16 Research Question Research Question 2: Fault detection effectiveness?
Achieving maximum coverage vs. a random test suite of equal size. The second question asks how the two kinds of test suites compare: one achieves maximum coverage; the other is a random test suite of equal size. The methodology is designed to address both research questions by generating appropriate test suites.

17 Methodology Eight coverage metrics Nine concurrent programs
Five variables Design of the test suite generation process These are the eight coverage metrics, nine concurrent programs, and five variables mentioned earlier.

18 Methodology Generate mutants Conduct random test executions
Record requirements covered Record whether a fault is detected Resample to construct test suites Measure

19 Mutant Generation Goal: diverse fault types
Applied mutation operators: change synchronization operations; modify synchronized blocks. Discard mutants that never fail, are malformed, or are always killed. The specific operators are shown in Table 3 of the paper.

20 Test Execution Goal: Test execution
Estimate the number of test executions needed to achieve maximum coverage. Record the test requirements covered and whether a fault was detected. Test execution uses randomized test case generation (arbitrary input) with two parameters: probability and delay, i.e. the probability that a delay will be inserted at each shared-resource access or synchronization operation. The fault detection criterion is defined in the paper.
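A minimal sketch of this two-parameter noise-injection scheme, assuming a simple wrapper that is called before every shared access or synchronization operation (the parameter names and wrapper API are illustrative, not the paper's tool):

```python
# Sketch of probability/delay noise injection: before each shared access,
# insert a delay with some probability to perturb thread scheduling.
# PROBABILITY and DELAY_MS are assumed, illustrative parameters.
import random
import time

PROBABILITY = 0.2   # chance of inserting a delay at each access
DELAY_MS = 5        # length of each inserted delay, in milliseconds

def maybe_delay():
    """Call at every shared access / sync op to perturb the schedule."""
    if random.random() < PROBABILITY:
        time.sleep(DELAY_MS / 1000.0)

def shared_access(counter):
    maybe_delay()                 # perturb scheduling before the access
    counter["value"] += 1         # the shared access itself
```

Varying the probability and delay changes which interleavings an execution is likely to explore, which is why they are treated as experimental parameters.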

21 Resampling Goal: Test suite construction
Random selection: construct test suites of varying sizes and levels of coverage (RQ1). Greedy test suite reduction: construct test suites of maximum coverage (RQ2).
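The greedy reduction step can be sketched as the classic set-cover heuristic: repeatedly select the execution that covers the most not-yet-covered requirements until coverage stops growing. The input format below is an assumption:

```python
# Sketch of greedy test suite reduction toward maximum achievable coverage.
def greedy_reduce(executions):
    """executions: list of (test_id, set_of_requirements_covered)."""
    covered, suite = set(), []
    remaining = list(executions)
    while remaining:
        # pick the execution adding the most new requirements
        best = max(remaining, key=lambda e: len(e[1] - covered))
        gain = best[1] - covered
        if not gain:                  # no execution adds new coverage
            break
        suite.append(best[0])
        covered |= gain
        remaining.remove(best)
    return suite, covered
```

For RQ2, each suite built this way is later paired with a random suite of equal size.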

22 Measure Relationships among the percentage of coverage requirements satisfied,
the number of test executions, and fault detection ability, via regression

23 Experiments goal, methodology, results

24 Evaluation RQ1: Relationship between test suite size, coverage and fault detection; how both coverage and size contribute to fault detection effectiveness. RQ2: Quantify the ability of test suites generated to quickly achieve high levels of concurrent coverage.

25 Evaluation Vector An initial rapid increase in fault detection as coverage increases, followed by a continued, but subdued increase for higher coverages.

26 Evaluation Vector A rapid increase in both fault detection and coverage for small test suite sizes, and smaller increases as test suite size grows.

27 Evaluation Stringbuffer

28 Evaluation Stringbuffer

29 Evaluation For single-fault programs, there are fewer requirements and only a single fault, and thus the increase observed is less consistent. The overall relationship is still positive though.

30 Evaluation Observation 1
Local & Remote Define (LR-Def) is an extreme case.

31 Evaluation Vector LR-Def achieves maximum coverage almost immediately for Vector program.

32 Evaluation Stringbuffer
LR-Def achieves maximum coverage almost immediately for Stringbuffer program.

33 Evaluation Reason behind observation 1
Metrics that are easier to satisfy (e.g. LR-Def) have high coverage even for very small test suites.

34 Evaluation Observation 2 The tendency for metrics to cluster.

35 Evaluation Vector

36 Evaluation Stringbuffer

37 Evaluation Reason behind observation 2
If two metrics are both pairwise and synchronization-based, then they tend to exhibit similar levels of coverage for all test suite sizes (e.g. sync-pair & follows).

38 Evaluation Conclusion 1 Both test suite size and coverage appear to be
positively correlated with fault detection effectiveness, and size is positively correlated with coverage as well.

39 Evaluation However, knowing that they are correlated with each other is not enough; we need to quantify the relationships. Pearson's r: the Pearson product-moment correlation coefficient, which measures the linear correlation between two variables. −1 ≤ r ≤ +1
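Pearson's r can be computed directly from its definition, r = cov(X, Y) / (σ_X · σ_Y). The coverage and fault-detection values below are made up for illustration:

```python
# Pearson product-moment correlation, computed from its definition.
import math

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

coverage = [0.2, 0.4, 0.6, 0.8]       # made-up coverage levels
faults   = [0.1, 0.3, 0.5, 0.7]       # made-up fault-detection rates
print(round(pearson_r(coverage, faults), 3))   # perfectly linear -> 1.0
```

Values near +1 indicate a strong positive linear relationship; values near 0 indicate no linear relationship.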

40 Evaluation

41 Evaluation Observation 1
Across all programs using mutation faults, the correlation between coverage and fault detection is above 0.52 for all metrics.

42 Evaluation

43 Evaluation Observation 2
For every single-fault program, several metrics have low correlation with fault detection.

44 Evaluation

45 Evaluation Reason behind observation 2
This occurs because the metric’s intuition does not capture the single fault present.

46 Evaluation Conclusion 2
Coverage is often more strongly correlated with fault detection than size; Each metric is a useful predictor of concurrency testing effectiveness; The occasional low and often moderate correlation between coverage and fault detection hints that there exist factors other than coverage and size that may relate to fault detection effectiveness.

47 Evaluation However, since test suite size and coverage are themselves correlated, does coverage predict fault detection effectiveness as expected, or does it merely reflect test suite size?

48 Evaluation To address this question, linear regression is used to determine whether coverage has explanatory ability independent of test suite size with respect to fault detection.

49 Evaluation First step Model the data as a linear equation
Dependent variable: fault detection Explanatory variables: TS, log(TS), CV, log(CV)

50 Evaluation Second step
Compute the adjusted R² to determine the goodness of fit for each data model. R²: Coefficient of determination, which is the square of Pearson’s r between two variables.
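These two steps can be sketched together: fit an ordinary-least-squares model, then compute the adjusted R², which penalizes additional explanatory variables. The data values below are made up, and the particular log(SZ) + CV model is just one of the combinations considered:

```python
# Sketch: OLS fit and adjusted R^2 on illustrative (made-up) data.
import numpy as np

def adjusted_r2(X, y):
    """X: (n, p) matrix of explanatory variables; y: (n,) response."""
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])          # add intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)  # OLS fit
    resid = y - A @ beta
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    r2 = 1.0 - ss_res / ss_tot                    # coefficient of determination
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

size = np.array([1, 2, 4, 8, 16, 32], dtype=float)  # made-up suite sizes
cov  = np.array([0.2, 0.35, 0.5, 0.7, 0.8, 0.9])    # made-up coverage
fd   = np.array([0.1, 0.2, 0.4, 0.6, 0.75, 0.85])   # made-up fault detection
X = np.column_stack([np.log(size), cov])            # log(SZ) + CV model
print(adjusted_r2(X, fd))
```

Comparing adjusted R² across models with and without the coverage term is what reveals whether coverage adds explanatory power beyond size.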

51 Evaluation Plot the associated adjusted R² for each coverage metric, across all objects, indicating which set of explanatory variables has the highest fit.

52 Evaluation Conclusion 3
While no single set of explanatory variables is best, in all instances but one (FF = log(SZ)), models based on both coverage and size are preferable to models using only one explanatory variable. This provides evidence that coverage has a predictive ability separate from test suite size. Furthermore, the adjusted R² is generally less than 0.8, indicating that there exist factors other than coverage and size that may relate to fault detection effectiveness.

53 Evaluation So far, we have addressed RQ1: test suite size, coverage, and fault detection are correlated with each other, and coverage has predictive ability for fault detection apart from test suite size. As for RQ2, we quantify the ability of generated test suites to quickly achieve high levels of concurrent coverage.

54 Evaluation General steps
a. Compare test suites of maximum achievable coverage, generated using a greedy algorithm, against random test suites of equal size; b. If a metric is a reasonable target for test case generation, then, holding the test case generation algorithm constant, reducing with respect to coverage should yield smaller, higher-coverage test suites than pure random test case generation.

55 Evaluation Specific steps
a. Each test suite generated to achieve maximum achievable coverage is paired with a randomly selected test suite of equal size; b. A permutation test is applied, using 250,000 permutations for each p-value; c. Compute the average maximum coverage, the average relative improvement in coverage over the random test suites, and the average fault detection for the random test suites.
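The paired permutation test in step (b) can be sketched by randomly flipping the sign of each paired difference and counting how often the permuted mean is at least as extreme as the observed one. The paper uses 250,000 permutations per p-value; a smaller count and made-up fault-detection values are used here:

```python
# Sketch of a paired (sign-flip) permutation test for the difference in
# fault detection between greedy max-coverage suites and random suites.
import random

def paired_permutation_test(greedy_fd, random_fd, n_perm=10000, seed=0):
    rng = random.Random(seed)
    diffs = [g - r for g, r in zip(greedy_fd, random_fd)]
    observed = sum(diffs) / len(diffs)
    extreme = 0
    for _ in range(n_perm):
        permuted = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(permuted / len(diffs)) >= abs(observed):
            extreme += 1
    return extreme / n_perm    # two-sided p-value estimate

greedy_fd = [0.9, 0.8, 0.95, 0.85, 0.9, 0.88]   # made-up fault detection
random_fd = [0.5, 0.4, 0.6, 0.45, 0.55, 0.5]
print(paired_permutation_test(greedy_fd, random_fd))
```

A small p-value indicates the observed improvement over random suites is unlikely to arise by chance.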

56 Evaluation

57 Evaluation Observation 1 Extreme case: Stringbuffer

58 Evaluation

59 Evaluation The metrics have very low correlations with fault detection here, so this result is reasonable.

60 Evaluation Observation 2 ArrayList

61 Evaluation Increases in average fault detection of 1.7 to 9.5 times at maximum coverage.

62 Evaluation Conclusion 4
Achieving high coverage generally yields not only statistically significant, but also practically significant increases in fault detection.

63 Pros?

64 Pros The key idea is important, providing insights for follow-up work.
Evaluates eight different metrics. Compares the metrics from different perspectives. The analysis is comprehensive and convincing. Well organized, focusing on answering the two research questions.

65 Cons?

66 Cons The sample set is too small: only 9 programs.
Coverage metrics producing large numbers of test requirements are not studied. The impact of input is not studied. Only Java programs are tested. The data representation (Fig. 1 & Fig. 2) is not clearly described.

67 Next step?

68 Next step? Improve existing metrics. Extend the sample set.
Study coverage metrics producing large numbers of test requirements. Cover the impact of input. Test programs in more programming languages.

