Analyzing the Validity of Selective Mutation with Dominator Mutants FSE 2016, Seattle, Washington, USA Bob Kurtz, Paul Ammann, Jeff Offutt, Marcio E. Delamaro, Mariet Kurtz, and Nida Gökçe George Mason University, USA Universidade de São Paulo, Brazil The MITRE Corporation Muğla Sıtkı Koçman University, Turkey
Mutation testing in 2016 MASON GMU FSE 2016 ACME Co.
GMU Mutation testing in 2016 int max(int i, int j) { if (i > j) { return i; } else { return j; int max(int i, int j) { if (i <= j) { return i; } else { return j; int max(int i, int j) { if (i > j) { return j; } else { int max(int i, int j) { if (i > j) { return i; } else { return j++; Mutation Operators ROR LVR AOI MASON GMU FSE 2016 ACME Co.
≠ GMU Mutation testing in 2016 Original Mutants assertEquals(2, int max(int i, int j) { if (i > j) { return i; } else { return j; ≠ int max(int i, int j) { if (i <= j) { return i; } else { return j; int max(int i, int j) { if (i > j) { return j; } else { int max(int i, int j) { if (i > j) { return i; } else { return j++; int max(int i, int j) { if (i > j) { return i; } else { int max(int i, int j) { if (i <= j) { return i; } else { return j; int max(int i, int j) { if (i > j) { return j; } else { int max(int i, int j) { if (i > j) { return i; } else { return j++; int max(int i, int j) { if (i > j) { return i; } else { int max(int i, int j) { if (i >= j) { return i; } else { return j; int max(int i, int j) { if (i > i) { return i; } else { return j; int max(int i, int j) { if (i) { return i; } else { return j; int max(int i, int j) { if (i > j) { return --i; } else { return j; assertEquals(2, max(2,1)); int max(int i, int j) { if (i >= j) { return i; } else { return j; int max(int i, int j) { if (i > i) { return i; } else { return j; int max(int i, int j) { if (i) { return i; } else { return j; int max(int i, int j) { if (i > j) { return --i; } else { return j; MASON GMU FSE 2016 ACME Co.
Mutation testing in 2016 20,000 Mutants 2,000 SLOC ACME Co. FSE 2016 int max(int i, int j) { if (i > j) { return i; } else { return j; 2,000 SLOC FSE 2016 ACME Co.
Mutation testing in 2016 FSE 2016 ACME Co.
= Mutation testing in 2016 int max(int i, int j) { if (i > j) { return i; } else { return j; if (i >= j) { = FSE 2016 ACME Co.
Mutation testing in 2016 FSE 2016 ACME Co.
Path too long, no end in sight! Mutation testing in 2016 ??? Equivalent mutants! You can’t kill ‘em! How many? 10-20% Path too long, no end in sight! MASON GMU FSE 2016 ACME Co.
Mutation testing in 2016 FSE 2016 ACME Co.
= What went wrong? Equivalent mutants Syntactically different but semantically identical to the original program Cannot be killed by tests, must be manually evaluated one-by-one Requires unrealistic amounts of work! int max(int i, int j) { if (i > j) { return i; } else { return j; int max(int i, int j) { if (i >= j) { return i; } else { return j; int max(int i, int j) { if (i > j) { return i; } else { return j++; = FSE 2016
What went wrong? Redundant mutants A mutant is redundant if it is always killed when some other mutant is killed ≈98% of non-equivalent mutants How far along is testing? FSE 2016
Mutant reduction strategies int max(int i, int j) { if (i > j) { return i; } else { return j; Selective mutation Use the “best” operators to produce fewer mutants CCDL ORAN VLSR … OAAN SRSR FSE 2016
Mutant reduction strategies int max(int i, int j) { if (i > j) { return i; } else { return j; Random mutant selection Typically select ≈5% of all mutants CCDL OAAN ORAN SRSR VLSR … FSE 2016
Research questions Selective mutation has generally been thought of as the best approach since 1993 Random mutant selection has been used since 1980 How effective are these techniques? Can we improve on them? FSE 2016
mi → mj Mutant subsumption Given a set of mutants M, mutant mi subsumes mutant mj (mi → mj) iff: Some test kills mi All tests that kill mi also kill mj All Tests + Tests that kill mj Tests that kill mk Tests that kill mi mi → mj FSE 2016 [Ammann, et al., ICST 2014]
Subsumption graph m1 m1 m2 m3 m2 m3 m4 m4 m5 m5 m6 m7 m6 m7 m8 m8 Shows the subsumption relationship between mutants Leaf nodes not subsumed by any other mutant are “dominator mutants” Tests that kill all these dominators will also kill all other mutants All the other mutants are redundant! m1 m1 m2 m3 m2 m3 m4 m4 m5 m5 m6 m7 m6 m7 m8 m8 FSE 2016 [Kurtz, et al., Mutation 2014]
Using selective mutation, random mutants, or some other technique Testing model Begin Using selective mutation, random mutants, or some other technique Mutants All Mutants Select some mutant set M Mutants All Tests Select minimal test set T that kills M Find all mutants killed by T Mutation Scores Dominator Scores End FSE 2016
Nondeterministic process Testing model Begin Mutants All Mutants Select some mutant set M Nondeterministic process Mutants All Tests Select minimal test set T that kills M Find all mutants killed by T Mutation Scores Dominator Scores End FSE 2016
Testing model Begin Mutants All Mutants Select some mutant set M #𝑚𝑢𝑡𝑎𝑛𝑡 𝑠 𝑘𝑖𝑙𝑙𝑒𝑑 #𝑚𝑢𝑡𝑎𝑛𝑡 𝑠 𝑡𝑜𝑡𝑎𝑙 −#𝑚𝑢𝑡𝑎𝑛𝑡 𝑠 𝑒𝑞𝑢𝑖𝑣𝑎𝑙𝑒𝑛𝑡 Mutants All Tests Select minimal test set T that kills M #𝑑𝑜𝑚𝑖𝑛𝑎𝑡𝑜 𝑟𝑠 𝑘𝑖𝑙𝑙𝑒𝑑 #𝑑𝑜𝑚𝑖𝑛𝑎𝑡𝑜 𝑟𝑠 𝑡𝑜𝑡𝑎𝑙 Find all mutants killed by T Mutation Scores Dominator Scores End FSE 2016
Siemens suite scores Mutation and dominator score using the 5 E-Selective mutation operators from Mothra [Offutt, et al., 1993] FSE 2016 [Ammann, et al., ICST 2014]
The Siemens suite? Really?? REAL SOFTWARE SIEMENS SUITE FSE 2016
The Siemens suite? Really! The Siemens suite is useful here because: It has a thorough test suite that elicits rich subsumption relationships between mutants The Proteum mutation tool for C has a very large set of mutation operators If we can show that selective mutation is not a good solution for any particular program, then it is not a good solution for programs in general FSE 2016
Mutation vs. Dominator Score Mutation score for the Siemens replace program (1,000 iterations) FSE 2016 [Kurtz, et al., Mutation 2016]
Mutation vs. Dominator Score Mutation and dominator score for the Siemens replace program FSE 2016 [Kurtz, et al., Mutation 2016]
1-4 Operators for print_tokens Our Goal We define work as the number of mutants the engineer must evaluate (write a killing test or to show it’s equivalent) FSE 2016
1-4 Operators for print_tokens 231,525 operator combinations We define work as the number of mutants the engineer must evaluate (to write a killing test or to show it’s equivalent) FSE 2016
1-4 Operators for print_tokens Optimal selective operator combinations for this program 231,525 operator combinations FSE 2016
1-4 Operators for Siemens suite Operator combination are far less optimal for all programs There is no set of up to 4 operators that is good for every program 489,504 operator combinations FSE 2016
Selective and random Traditional mutant reduction approaches are not optimal, consistent with prior research that selective and random are similar FSE 2016
More than 4 operators m1 m1 m2 m3 m2 m3 m4 m4 m5 m6 m5 m5 m6 m7 m6 m7 We’d like to expand to sets of mutation operators larger than 4 Brute force approach is infeasible 1-4 operators = 489,504 sets ≈ 24 hours 1-59 operators ≈ 6x1017 sets ≈ 3x109 years Approximate test results based on subsumption Spearman’s rank ρ=0.966 Recalculate best operator combinations in the usual way m1 m1 m2 m3 m2 m3 m4 m4 m5 m6 m5 m5 m6 m7 m6 m7 m8 m8 FSE 2016
Optimal selective solutions Best 1-program solutions Best 1-59 operator combinations Even the optimal operator combinations are not very good FSE 2016
Is mutation testing dead? I don’t think so, but we need to do something better! FSE 2016
Mutation testing in 2020? Use machine learning to generate optimized mutants based on program features int max(int i, int j) { if (i > j) { return j; } else { int max(int i, int j) { if (i > j) { return i; } else { return j++; int max(int i, int j) { if (i > j) { return i; } else { return j; int max(int i, int j) { if (i > j) { return i; } else { int max(int i, int j) { if (i > i) { return i; } else { return j; int max(int i, int j) { if (i > j) { return i; } else { return j; int max(int i, int j) { if (i) { return i; } else { return j; int max(int i, int j) { if (i > j) { return --i; } else { return j; FSE 2016 [Future work]
Mutation testing in 2020? Use machine learning to generate optimized mutants based on program features Use static analysis to determine partial subsumption & tests int max(int i, int j) { if (i > j) { return i; } else { return j; int max(int i, int j) { if (i > j) { return j; } else { int max(int i, int j) { if (i > j) { return i; } else { return j++; int max(int i, int j) { if (i > j) { return i; } else { int max(int i, int j) { if (i > i) { return i; } else { return j; int max(int i, int j) { if (i) { return i; } else { return j; int max(int i, int j) { if (i > j) { return --i; } else { return j; FSE 2016 [Kurtz, et al., Mutation 2015]
Mutation testing in 2020? Use machine learning to generate optimized mutants based on program features Use static analysis to determine partial subsumption & tests Execute tests to refine subsumption and kill mutants int max(int i, int j) { if (i > j) { return i; } else { return j; FSE 2016 [Kurtz, et al., Mutation 2015]
Mutation testing in 2020? Use machine learning to generate optimized mutants based on program features Use static analysis to determine partial subsumption & tests Execute tests to refine subsumption and kill mutants Remove subsumed mutants and redundant tests int max(int i, int j) { if (i > j) { return i; } else { return j; FSE 2016 [Kurtz, et al., Mutation 2014]
Mutation testing in 2020? Use machine learning to generate optimized mutants based on program features Use static analysis to determine partial subsumption & tests Execute tests to refine subsumption and kill more mutants Remove subsumed mutants and redundant tests Output a set of tests and a FEW probable-high-value mutants for the engineer to kill int max(int i, int j) { if (i > j) { return i; } else { return j; int max(int i, int j) { if (i > j) { return i; } else { return --i; return j; FSE 2016
Conclusions Mutation score is an imprecise metric inflated by redundant mutants Researchers should use dominator score instead Current mutant selection techniques are not optimal There are no mutation operator combinations that are effective for a range of programs To optimize dominator score per unit of work, we need to customize mutants to the program under test! FSE 2016
Backup Slides FSE 2016
Future work Develop a mutant selection approach that is customized for a program Can’t tell if a mutation is good or bad without program context Use machine learning Correlate program features and mutation operators with mutant “goodness” Select more dominator and near-dominator mutants Select fewer highly- subsumed and equivalent mutants FSE 2016
Mutant Subsumption Graphs Given the following score function: m1 m3,m6 m5 m2 t1 t2 t3 t4 m1 m2 m3 m4 m5 m6 m7 m8 m9 m10 m7,m9 m4 m8,m10 When we construct the mutant subsumption graph from the score function, we see three root nodes that are not subsumed by any other mutants. One mutant from each of these nodes forms a dominator set: { m1, m3, m5 }, { m1, m6, m5 } All other mutants are redundant! FSE 2016
Dominator mutation score Assume we execute test t1 m1 m3,m6 m5 m2 t1 t2 t3 t4 m1 m2 m3 m4 m5 m6 m7 m8 m9 m10 m7,m9 m4 m8,m10 Killed mutants are shown in gray Mutation score: 7 of 9 killable mutants = 0.78 Dominator score: 1 of 3 mutants in a dominator set = 0.33 FSE 2016
Redundancy and equivalency We want to investigate how the accuracy changes as the number of redundant and equivalent mutants change, we need a way to measure redundancy and equivalency, preferably in a decoupled manner FSE 2016
Work and Normalized Work How much effort is required? We use a simple definition: the number of mutants that a tester must examine To effectively compare work between different programs with different numbers of mutants, we define equivalent work: FSE 2016
Redundancy and Work FSE 2016
Equivalency and Work FSE 2016
Average / worst-case correlation Spearman’s rank correlation ρ=0.966 FSE 2016
What’s an optimal solution? We score non-optimal points using the Hausdorff distance (dH), the distance from the nearest optimal point Hausdorff Distance dh FSE 2016
Today’s common techniques Mutant Selection Technique Hausdorff Distance (smaller is better) SSDL 0.104 E-Selective 0.134 5% Random 0.150 10% Random 0.274 15% Random 0.147 20% Random 0.229 25% Random 0.111 FSE 2016
Incrementally better Operator set “A” Operator set “B” 9 mutation operators Same dominator score as E- Selective Only 42% of the work Operator set “B” 8 mutation operators Same work as E-Selective 29% higher dominator score Operator set “C” 14 mutation operators A knee in the curve None are as good as the best solutions for a single program! FSE 2016
Determining local context FSE 2016
Threats to validity The Siemens suite is a small number of small programs Are they really representative of real-world programs? Existence of unkilled (but killable) mutants injects errors into determining dominator scores Specific operator sets identified as optimal may not be optimal, but broader points are not affected FSE 2016