Mitigating the Effects of Flaky Tests on Mutation Testing

Presentation transcript:

Mitigating the Effects of Flaky Tests on Mutation Testing
August Shi, Jonathan Bell, Darko Marinov
ISSTA 2019, Beijing, China, 7/18/2019
CCF-1421503, CNS-1646305, CNS-1740916, CCF-1763788, CCF-1763822, OAC-1839010

Hello, my name is August Shi, and I am here to present our work, “Mitigating the Effects of Flaky Tests on Mutation Testing”. This is joint work with Jonathan Bell and Darko Marinov, and we are from the University of Illinois at Urbana-Champaign and George Mason University.

Mutation Testing

[Diagram: Code Under Test run against test1, test2, test3; Mutant 1 and Mutant 2 run against the same tests; mutant-test matrix of Mut 1 and Mut 2 against test1, test2, test3 with Killed and Survived entries; overlay label: UNRELIABLE]

- Compare test suites by mutation score
- Guide testing based on mutant-test matrix

As you can see from our title, our work addresses mutation testing, so I would like to start with some background on mutation testing. The goal of mutation testing is to check the quality of the test suite. Let’s say we have some code under test and the corresponding test suite. The tests all pass when run on the code, but are they able to detect faults that get introduced into the code as code evolves? Mutation testing tries to evaluate the fault-detection capability…

Mutation Testing with Flaky Tests

[Diagram: Code Under Test run against test1, test2, test3 in Run 1 and Run 2; Mutant 1 and Mutant 2 run against the same tests; mutant-test matrix with entries Killed? and Survived?; overlay label: STILL FLAKY]

- Get test suite with deterministic outcomes
- Debug/fix flaky tests [1]
- Remove/ignore flaky tests

That was traditional mutation testing, but what happens when we consider the possibility of flaky tests? First, what are flaky tests? Well, let’s say we run the tests once on the code and observe all tests passing, as we saw earlier. But let’s say we run it again on the same version of code with no changes, and now we see a test, test3, failing…

[1] August Shi et al. “iFixFlakies: A Framework for Automatically Fixing Order-Dependent Tests”. ESEC/FSE 2019

Flaky Coverage Example

Other reasons for flakiness:
- Concurrency
- Randomness
- I/O
- Order dependency

    public class WatchDog {
        ...
        public void run() {
            ...
            synchronized (this) {
                // Time remaining before the watchdog's timeout expires
                long timeLeft = timeout - (System.currentTimeMillis() - startTime);
                isWaiting = timeLeft > 0;
                while (isWaiting) {
                    ...
                    wait(timeLeft);
                    ...
                }
            }
            ...
        }
    }

    public void test() {
        new WatchDog().run();
        ...
    }

Variable/Call          Value (Run 1)   Value (Run 2)
timeout                5000            5000
startTime              300000          500000
currentTimeMillis()    300300          510000
timeLeft               4700            -5000
isWaiting              true            false
Test outcome           PASS            PASS

Okay, so what if we don’t have flaky tests to the degree of their outcomes changing between runs? Can we still have problems with mutation testing due to flakiness? Let’s consider this example (adapted from code and tests we observed in Apache commons-exec)…
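The table values can be checked by hand with the expression from the WatchDog code. The sketch below is only an illustration of that arithmetic; the class and method names are hypothetical and are not part of commons-exec.

```java
// Hypothetical helper that mirrors the timeLeft expression in WatchDog.run().
final class TimeLeftExample {
    // timeLeft = timeout - (now - startTime)
    static long timeLeft(long timeout, long now, long startTime) {
        return timeout - (now - startTime);
    }

    // The while loop, and the wait(timeLeft) call inside it, runs only if this is true.
    static boolean entersWaitLoop(long timeout, long now, long startTime) {
        return timeLeft(timeout, now, startTime) > 0;
    }

    public static void main(String[] args) {
        // Run 1 values from the table
        System.out.println(timeLeft(5000, 300300, 300000));        // 4700
        System.out.println(entersWaitLoop(5000, 300300, 300000));  // true
        // Run 2 values from the table
        System.out.println(timeLeft(5000, 510000, 500000));        // -5000
        System.out.println(entersWaitLoop(5000, 510000, 500000));  // false
    }
}
```

In Run 2 the loop body is skipped, so the statements inside it are not covered even though the test still passes.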

Motivating Study

- Measure flakiness of coverage
- 30 open-source GitHub projects from prior work
- No flaky test outcomes! (all 35,850 tests pass in 17 runs)
- Rerun tests and measure differences in coverage
- 113,356 (22%) statements are covered by different tests across runs
- 5,736 (16%) tests cover different statements across runs
- Lots of flakiness in coverage, even without flaky outcomes!

We performed a motivating study to measure this flakiness in coverage.
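One way to spot tests with flaky coverage is to compare each test's set of covered statements across reruns. The sketch below is a minimal illustration under that assumption, not the tooling used in the study; the class name, method name, and statement-ID strings are hypothetical.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical checker: flags tests whose covered statements differ between runs.
final class FlakyCoverageCheck {
    // Each element of runs maps a test name to the statement IDs it covered in that run.
    static Set<String> testsWithFlakyCoverage(List<Map<String, Set<String>>> runs) {
        Set<String> flaky = new TreeSet<>();
        if (runs.isEmpty()) {
            return flaky;
        }
        Map<String, Set<String>> first = runs.get(0);
        // Comparing every later run against the first run is enough:
        // if all runs equal the first, they all equal each other.
        for (Map<String, Set<String>> run : runs.subList(1, runs.size())) {
            Set<String> allTests = new HashSet<>(first.keySet());
            allTests.addAll(run.keySet());
            for (String test : allTests) {
                Set<String> a = first.getOrDefault(test, Collections.emptySet());
                Set<String> b = run.getOrDefault(test, Collections.emptySet());
                if (!a.equals(b)) {
                    flaky.add(test);
                }
            }
        }
        return flaky;
    }
}
```

A test is reported here even when it passes in every run, which is the situation the study describes.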

Mutation Testing with Flaky Coverage

    public class WatchDog {
        ...
        public void run() {
            ...
            synchronized (this) {
                // Time remaining before the watchdog's timeout expires
                long timeLeft = timeout - (System.currentTimeMillis() - startTime);
                isWaiting = timeLeft > 0;
                while (isWaiting) {
                    ...
                    wait(timeLeft);
                    ...
                }
            }
            ...
        }
    }

    public void test() {
        new WatchDog().run();
        ...
    }

Variable/Call          Value (Run 1)   Value (Mut Run)
timeout                5000            5000
startTime              300000          500000
currentTimeMillis()    300300          510000
timeLeft               4700            -5000
isWaiting              true            false

Mutation: delete call
Mutation not covered!

So how does flakiness in coverage affect mutation testing?

Mutation Testing Results are Unreliable

- Flakiness can shift mutation testing results
- Mutation scores may be inflated/deflated
- Mutant-test matrix unreliable
- Need to mitigate the effects of flakiness on mutation testing!
- Mitigation strategies based on reruns and isolation [2]
- Implemented on PIT, a popular mutation testing tool for Java

https://doi.org/10.6084/m9.figshare.8226332
https://github.com/hcoles/pitest/pull/534
https://github.com/hcoles/pitest/pull/545

[2] Jonathan Bell et al. “DeFlaker: Automatically Detecting Flaky Tests”. ICSE 2018

Mitigating Flakiness in Mutation Testing

Traditional mutation testing:
- Full test-suite coverage collection
- Mutants to test
- Test-mutant prioritization
- Sorted tests per mutant
- Mutant execution

Improvements to cope with flakiness:
- Rerun and isolate tests
- Run tests with least flaky coverage first
- Track mutations covered
- Rerun/isolate tests

See paper

Coverage Collection

                          Once        Rerun Multiple Times
All tests in same JVM     Default     Default-Reruns
Each test in own JVM      Isolation   Isolation-Reruns

- When running multiple times, union coverage
- More lines covered means more mutants generated
- Run tests in isolation to remove test-order dependencies
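Taking the union of coverage across reruns can be sketched as a merge of per-run coverage maps. This is an illustrative sketch only, not PIT's implementation; the class name, method name, and statement-ID format are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical merge step: a statement counts as covered by a test
// if that test covered it in at least one run.
final class CoverageUnion {
    // Each run maps a statement ID to the tests that covered it in that run.
    static Map<String, Set<String>> union(List<Map<String, Set<String>>> runs) {
        Map<String, Set<String>> merged = new TreeMap<>();
        for (Map<String, Set<String>> run : runs) {
            for (Map.Entry<String, Set<String>> entry : run.entrySet()) {
                merged.computeIfAbsent(entry.getKey(), k -> new TreeSet<>())
                      .addAll(entry.getValue());
            }
        }
        return merged;
    }
}
```

Because the merged map can only grow with each run, more reruns can only add covered lines, which is why reruns can increase the number of mutants generated.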

Executing Tests on Mutants

- Monitor if tests actually execute mutated bytecode
- Traditionally, mutant-test pair has status Killed or Survived
- Only applicable if test executes the mutated bytecode
- Mutant-test pair with test that does not execute mutated bytecode has new status Unknown
- Test can potentially cover mutation, based on prior coverage

[Mutant-test matrix: Mut 1 and Mut 2 against test1, test2, test3, with entries Survived and Unknown]
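The status rule described on this slide can be written as a small function. The sketch below assumes two boolean observations per execution; the type and method names are hypothetical and do not come from PIT.

```java
// Hypothetical encoding of the per-pair status rule from the slide.
final class PairStatusRule {
    enum PairStatus { KILLED, SURVIVED, UNKNOWN }

    // executedMutatedCode: whether the test ran the mutated bytecode in this execution.
    // testPassed: the test's outcome when run on the mutant.
    static PairStatus classify(boolean executedMutatedCode, boolean testPassed) {
        if (!executedMutatedCode) {
            // Killed/Survived only apply when the mutation was actually executed.
            return PairStatus.UNKNOWN;
        }
        return testPassed ? PairStatus.SURVIVED : PairStatus.KILLED;
    }
}
```

Note that a failing test that never reached the mutated bytecode is still classified as Unknown here, since its failure cannot be attributed to the mutation.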

New Status for Mutants

- Overall mutant status depends on status of all mutant-test pairs run for the mutant
- Need to reduce number of Unknown mutants and pairs

[Diagram labels: Killed + Covered; Survived + Covered; Unknown (not covered)]

Rerunning Mutant-test Pairs

- While status of mutant-test pair is Unknown, rerun
- Change isolation level during reruns

Isolation levels (increasing cost):
Default           Mutant-test pairs for mutants in same class in same JVM
More Isolation    Mutant-test pairs for same mutant in same JVM
Most Isolation    Mutant-test pairs in own JVM

Why does isolation help? Why is it expensive? Reduce flakiness at cost of performance. Rerun a number of times at each level; the aim is to reduce the number of unknowns, but it may not reach 0.
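The rerun strategy above can be sketched as a loop that retries a pair a fixed number of times per isolation level before escalating. This is a simplified illustration rather than the PIT modification itself; the `PairRunner` interface, the enum names, and the `rerunsPerLevel` parameter are hypothetical.

```java
// Hypothetical escalation loop: retry at each isolation level, cheapest first.
final class RerunEscalation {
    enum Isolation { DEFAULT, MORE, MOST }
    enum Status { KILLED, SURVIVED, UNKNOWN }

    // Assumed callback that executes one mutant-test pair at the given isolation level.
    interface PairRunner {
        Status run(Isolation level);
    }

    static Status resolve(PairRunner runner, int rerunsPerLevel) {
        for (Isolation level : Isolation.values()) {
            for (int attempt = 0; attempt < rerunsPerLevel; attempt++) {
                Status status = runner.run(level);
                if (status != Status.UNKNOWN) {
                    return status; // stop as soon as the pair is resolved
                }
            }
        }
        // Reruns reduce Unknown pairs but are not guaranteed to eliminate them.
        return Status.UNKNOWN;
    }
}
```

Trying the cheaper levels first keeps the expensive one-JVM-per-pair executions for the pairs that remain Unknown.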

Experimental Setup

- Evaluate on same 30 projects in motivating study
- All modifications on top of PIT mutation testing tool

RQ1: Flakiness in traditional mutation testing?
RQ2: Effect of coverage on mutants generated? (See paper)
RQ3: Effect of re-executing tests on mutant status?
RQ4: Prioritize tests for mutant-test executions? (See paper)

RQ1: Flakiness in Traditional Mutation Testing

Mutants by Status
           Killed    Survived   Unknown   Total     Mut. Score
Overall    51,687    11,965     2,866     66,518    77.7%-82.0%

- Max difference up to 23pp!
- Must improve mutation scores more than this variance!

Mutant-Test Pairs by Status
           Killed       Survived     Unknown    Total
Overall    1,569,658    1,097,506    255,194    2,922,358

- 9% of mutant-test pairs are unknown (max up to 55%)!
- Matrix results can be unreliable

<Call out the findings, that mutation scores can vary!!!>
<Also call out that the tests did not appear flaky from initial outcomes!>
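The reported overall range of 77.7%-82.0% is consistent with counting every Unknown mutant as survived for the lower bound and as killed for the upper bound. That reading is an assumption based on the numbers in the table, not a definition stated on the slide; the class and method names below are hypothetical.

```java
import java.util.Locale;

// Hypothetical calculation of mutation-score bounds from the RQ1 mutant counts.
final class MutationScoreBounds {
    // Lower bound: assume no Unknown mutant would be killed.
    static double lowerBound(long killed, long survived, long unknown) {
        return 100.0 * killed / (killed + survived + unknown);
    }

    // Upper bound: assume every Unknown mutant would be killed.
    static double upperBound(long killed, long survived, long unknown) {
        return 100.0 * (killed + unknown) / (killed + survived + unknown);
    }

    public static void main(String[] args) {
        double low = lowerBound(51_687, 11_965, 2_866);
        double high = upperBound(51_687, 11_965, 2_866);
        System.out.printf(Locale.ROOT, "%.1f%%-%.1f%%%n", low, high); // 77.7%-82.0%
    }
}
```

The same counts give 255,194 / 2,922,358, or about 9%, for the share of Unknown mutant-test pairs.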

RQ3: Mutant Re-execution Results

Overall                      Before     After     Reduction
Unknown Mutants              2,866      591       2,275 (79.4%)
Unknown Mutant-Test Pairs    255,194    30,321    224,873 (88.1%)

Add. Covered Pairs by rerun (Overall)
Reruns                    1         2         3         4        5
Default Reruns            61,437    41,302    14,787    6,590    18,762
More Isolation Reruns     46,819    14,072    1,000     629      3,872

Add. Covered Pairs, Most Isolation Reruns (1 2 3 4 5), Overall: 15,594

- Increasing isolation greatly increases covered pairs
- Unnecessary to rerun too often with the most isolation

Discussion

- Flakiness can have negative effects beyond mutation testing
- Tools/studies that rely on coverage must consider flakiness
- Fault localization, program repair, test prioritization, test-suite reduction, test selection, test generation, runtime verification, …
- Mitigation strategies applicable beyond mutation testing
- Different isolation strategies for different tasks

Flakiness in coverage happens, and it can have effects on anything. We observe it in mutation testing, but others can suffer too. Our mitigation strategies can be applicable to applications beyond mutation testing.

Conclusions

- Even seemingly non-flaky tests have flaky coverage
- 22% of statements not covered consistently!
- We present problems in mutation testing due to flakiness
- We propose techniques to mitigate effects
- Different combinations of reruns and isolation
- We reduce Unknown mutants/pairs by 79.4%/88.1%
- Flakiness can have negative effects beyond mutation testing

Link is in the ACM digital library
awshi2@illinois.edu
https://doi.org/10.6084/m9.figshare.8226332

BACKUP

Prioritizing Tests for Mutants

- Run mutant-test pairs in the order that gets the overall mutant status faster, more reliably
- Once mutant status known, no need to run more
- Prioritize tests per mutant based on coverage
- Tests with more “stable” coverage on mutant prioritized earlier
- Later prioritize based on time
- When to rerun?
- Immediately rerun pair?
- Run all pairs first before rerunning?
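The ordering described above can be sketched with a two-key comparator. The sketch assumes that "stability" means the fraction of coverage runs in which a test covered the mutated statement, which is an interpretation rather than the paper's stated definition; the `Candidate` class and its fields are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Comparator;
import java.util.List;

// Hypothetical prioritizer: most stable coverage first, then shortest running time.
final class TestPrioritizer {
    static final class Candidate {
        final String name;
        final int runsCovering; // coverage runs in which the test covered the mutated statement
        final int totalRuns;    // total coverage runs
        final double seconds;   // test running time

        Candidate(String name, int runsCovering, int totalRuns, double seconds) {
            this.name = name;
            this.runsCovering = runsCovering;
            this.totalRuns = totalRuns;
            this.seconds = seconds;
        }

        // Assumed stability measure: share of runs that covered the mutation.
        double stability() {
            return totalRuns == 0 ? 0.0 : (double) runsCovering / totalRuns;
        }
    }

    static List<Candidate> prioritize(Collection<Candidate> tests) {
        List<Candidate> sorted = new ArrayList<>(tests);
        sorted.sort(Comparator.comparingDouble(Candidate::stability).reversed()
                .thenComparingDouble(c -> c.seconds));
        return sorted;
    }
}
```

With this order, a test that covered the mutation in every run is tried before one that covered it only sometimes, and faster tests break ties.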

RQ2: Coverage and Mutant Generation

Number of Mutants (columns: Default, Isolated, Reruns)
Overall    70,773    70,993    70,877    71,112

Number of Mutant-Test Pairs (columns: Default, Isolated, Reruns)
Overall    3,089,051    3,162,138    3,101,314    3,165,527

- Not much difference in numbers of mutants and pairs
- Can potentially use Default for mutant generation

RQ4: Prioritizing Tests

Running Time for Immediately Rerun (s)
           Random      Coverage    PIT         Best        Worst
Overall    84,013.0    51,821.8    51,804.9    42,333.4    299,203.5

Running Time for Not Immediately Rerun (s)
           Random      Coverage    PIT         Best        Worst
Overall    90,479.0    60,810.3    60,793.3    52,014.6    284,820.7