When Tests Collide: Evaluating and Coping with the Impact of Test Dependence
Wing Lam, Sai Zhang, Michael D. Ernst
University of Washington
Dependent test (order dependent)
- Two tests: createFile(“foo”)... and readFile(“foo”)...
- Executing them in default order: the intended test results
- Executing them in a different order: a different result
Why should we care about test dependence?
- Makes test behaviors inconsistent
- Affects downstream testing techniques: test prioritization, test selection, test parallelization
Test independence is assumed by:
- Test selection
- Test prioritization
- Test parallel execution
- Test factoring
- Test generation
- …
Conventional wisdom: test dependence is not a significant issue
31 papers in ICSE, FSE, ISSTA, ASE, ICST, TSE, and TOSEM (2000 – 2013), grouped by how they treat test dependence:
- Assume test independence without justification
- Mention test dependence as a threat to validity
- Consider test dependence
Recent work
- Illinois MIR work on flaky tests and test dependences [Luo FSE’14, Gyori ISSTA’15, Gligoric ISSTA’15]
  - Flaky tests reveal inconsistent results
  - A dependent test is a special type of flaky test
- UW PLSE work on empirically revisiting the test independence assumption [Zhang et al. ISSTA’14]
  - Test dependence should not be ignored
Is the test independence assumption valid? No!
- Does test dependence arise in practice?
  - Yes, in both human-written and automatically-generated suites
- What repercussions does test dependence have?
  - It affects downstream testing techniques
- How can we nullify the impact of test dependence?
  - A general algorithm adds/reorders tests for techniques such as prioritization, etc.
Methodology
- Reported dependent tests: 5 issue tracking systems
- New dependent tests: 5 real-world projects
Methodology: reported dependent tests (5 issue tracking systems)
- Search for 4 key phrases: “dependent test”, “test dependence”, “test execution order”, “different test outcome”
- Manually inspect 450 matched bug reports
- Identify 96 distinct dependent tests
- Characteristics studied: root cause, developers’ action
Root cause of dependent tests
- Categories: static variable, file system, database, unknown
- At least 61% are due to side-effecting access to static variables
Developers’ action
- 98% of the reported tests are marked as major or minor issues
- 91% of the dependences have been fixed, by:
  - Improving documentation
  - Fixing test code or source code
Methodology: new dependent tests (5 real-world projects)
- Human-written test suites
  - 6413 tests
  - 37 (0.6%) dependent tests
- Automatically-generated test suites
  - Generated with Randoop [Pacheco’07]
  - 608 (5.5%) dependent tests
- Subjects were selected from a previous project [Zhang et al. ISSTA’14] that identified dependent tests in them
Test prioritization
- Turns a test execution order into a new test execution order
- Goals: achieve coverage faster, improve fault detection rate
- Each test should yield the same result
Four test prioritization techniques [Elbaum et al. ISSTA 2000]
- Prioritize on coverage of statements
- Prioritize on coverage of statements not yet covered
- Prioritize on coverage of methods
- Prioritize on coverage of methods not yet covered
Record the number of dependent tests yielding different results
Total: 37 human-written and 608 automatically-generated dependent tests (5 real-world projects)
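The two coverage strategies above (total coverage, and coverage of elements not yet covered) can be sketched as follows. This is an illustrative approximation, not the implementation evaluated in the talk. The function names are hypothetical, the covered elements may be statements or methods, and tie-breaking by original order is an assumption.

```python
def prioritize_total(coverage):
    """Order tests by how many elements each covers, most first."""
    # sorted() is stable, so ties keep their original relative order.
    return sorted(coverage, key=lambda t: len(coverage[t]), reverse=True)

def prioritize_additional(coverage):
    """Greedily pick the test that adds the most not-yet-covered elements."""
    remaining = list(coverage)
    covered = set()
    order = []
    while remaining:
        # max() returns the first best candidate, so ties keep original order.
        best = max(remaining, key=lambda t: len(coverage[t] - covered))
        order.append(best)
        covered |= coverage[best]
        remaining.remove(best)
    return order

# Hypothetical coverage data: test name -> set of covered element ids.
coverage = {"t1": {1, 2}, "t2": {1, 2, 3}, "t3": {4}, "t4": {2, 3}}
total_order = prioritize_total(coverage)
additional_order = prioritize_additional(coverage)
```

Both strategies are free to move a test ahead of the test it depends on, which is how prioritization exposes dependent tests.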
Evaluating test prioritization techniques
Number of tests that yield different results, out of 37 human-written dependent tests:
- Prioritize on coverage of statements: 5 (13.5%)
- Prioritize on coverage of statements not yet covered: 9 (24.3%)
- Prioritize on coverage of methods: 7 (18.9%)
- Prioritize on coverage of methods not yet covered: 6 (16.2%)
Implication: on average, there is an 18% chance that test dependence would affect test prioritization on human-written tests
Evaluating test prioritization techniques
Number of tests that yield different results, out of 608 automatically-generated dependent tests:
- Prioritize on coverage of statements: 372 (61.2%)
- Prioritize on coverage of statements not yet covered: 331 (54.4%)
- Prioritize on coverage of methods: 381 (62.3%)
- Prioritize on coverage of methods not yet covered: 357 (58.7%)
Implication: on average, there is a 59% chance that test dependence would affect test prioritization on automatically-generated tests
Test selection
- Reduces a test execution order to a subset of the test execution order
- Goal: runs faster
- Each test should yield the same result
Six test selection techniques [Harrold et al. OOPSLA 2001]
Selection granularity and ordering of the selected tests:
- Statement, ordered by test id (no re-ordering)
- Statement, ordered by number of elements tests cover
- Statement, ordered by number of uncovered elements tests cover
- Function, ordered by test id (no re-ordering)
- Function, ordered by number of elements tests cover
- Function, ordered by number of uncovered elements tests cover
Record the number of dependent tests yielding different results
Total: 37 human-written and 608 automatically-generated dependent tests (5 real-world projects)
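A rough sketch of why selection can break a dependent test, assuming selection keeps only the tests whose coverage touches a changed element. The data, the element names, and `select_tests` are hypothetical. The actual technique from Harrold et al. is more involved than this filter.

```python
def select_tests(coverage, changed):
    """Keep tests covering at least one changed element, in test-id order."""
    # Dicts preserve insertion order, which stands in for test-id order here.
    return [t for t in coverage if coverage[t] & changed]

# Hypothetical coverage at function granularity.
coverage = {
    "t1_create": {"File.create"},
    "t2_read": {"File.read"},
    "t3_size": {"File.size", "File.read"},
}
changed = {"File.read"}

by_id = select_tests(coverage, changed)
# Alternative ordering: tests covering more elements run first.
by_elements = sorted(by_id, key=lambda t: len(coverage[t]), reverse=True)
```

Here `t1_create` is dropped because it does not cover the change. If `t2_read` relies on state left behind by `t1_create`, it now yields a different result even with no re-ordering.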
Evaluating test selection techniques
Number of tests that yield different results, out of 37 human-written dependent tests:
- Statement, ordered by test id (no re-ordering): 1 (2.7%)
- Statement, ordered by number of elements tests cover: 1 (2.7%)
- Statement, ordered by number of uncovered elements tests cover: 1 (2.7%)
- Function, ordered by test id (no re-ordering): 1 (2.7%)
- Function, ordered by number of elements tests cover: 1 (2.7%)
- Function, ordered by number of uncovered elements tests cover: 2 (5.4%)
Implication: on average, there is a 3.2% chance that test dependence would affect test selection on human-written tests
Evaluating test selection techniques
Number of tests that yield different results, out of 608 automatically-generated dependent tests:
- Statement, ordered by test id (no re-ordering): 95 (15.6%)
- Statement, ordered by number of elements tests cover: 109 (17.9%)
- Statement, ordered by number of uncovered elements tests cover: 109 (17.9%)
- Function, ordered by test id (no re-ordering): 266 (44.0%)
- Function, ordered by number of elements tests cover: 294 (48.4%)
- Function, ordered by number of uncovered elements tests cover: 297 (48.8%)
Implication: on average, there is a 32% chance that test dependence would affect test selection on automatically-generated tests
Test parallelization
- Schedules the test execution order across multiple CPUs (e.g., CPU 1 and CPU 2)
- Goal: reduce test latency
- Each test should yield the same result
Two test parallelization techniques [1]
- Parallelize on test id
- Parallelize on test execution time
Record the number of dependent tests yielding different results
Total: 37 human-written and 608 automatically-generated dependent tests (5 real-world projects)
[1] Executing unit tests in parallel on a multi-CPU/core machine in Visual Studio. 1/executing-unit-tests-in-parallel-on-a-multi-cpu-core-machine.aspx.
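The slides do not spell out how the two schedulers assign tests, so the sketch below rests on two assumptions: "on test id" deals tests out round-robin in id order, and "on test execution time" places the longest test first onto the least-loaded CPU. The function names and timings are hypothetical.

```python
import heapq

def parallelize_by_id(tests, cpus):
    """Deal tests out round-robin in test-id order."""
    return [tests[i::cpus] for i in range(cpus)]

def parallelize_by_time(times, cpus):
    """Greedy: longest test first, onto the currently least-loaded CPU."""
    heap = [(0.0, i) for i in range(cpus)]   # (accumulated load, cpu index)
    schedule = [[] for _ in range(cpus)]
    for test in sorted(times, key=times.get, reverse=True):
        load, i = heapq.heappop(heap)
        schedule[i].append(test)
        heapq.heappush(heap, (load + times[test], i))
    return schedule

# Hypothetical per-test execution times in seconds.
times = {"t1": 5.0, "t2": 1.0, "t3": 3.0, "t4": 2.0}
by_id = parallelize_by_id(list(times), 2)
by_time = parallelize_by_time(times, 2)
```

Either way, a dependent test and the test it relies on can land on different CPUs, so the state it expects is never set up in its own process.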
Evaluating test parallelization techniques
Number of tests that yield different results, out of 37 human-written dependent tests
Columns: parallelize on test id (#CPUs = 2, #CPUs = 16), then parallelize on test execution time (#CPUs = 2, #CPUs = 16)
- 2 (5.4%), 13 (35.1%), 14 (37.8%)
Implication: on average, when #CPUs = 2, there is a 27% chance that test dependence would affect test parallelization for human-written tests. When #CPUs = 16, the chance is 36%.
#CPUs = 4 and 8 were evaluated but omitted for space reasons
Evaluating test parallelization techniques
Number of tests that yield different results, out of 608 automatically-generated dependent tests:
- Parallelize on test id, #CPUs = 2: (31.9%)
- Parallelize on test id, #CPUs = 16: 349 (57.4%)
- Parallelize on test execution time, #CPUs = 2: 360 (59.2%)
- Parallelize on test execution time, #CPUs = 16: 433 (71.2%)
Implication: on average, when #CPUs = 2, there is a 46% chance that test dependence would affect test parallelization for automatically-generated tests. When #CPUs = 16, the chance is 64%.
#CPUs = 4 and 8 were evaluated but omitted for space reasons
Impact of test dependence
Chance of impact by test dependence, by technique and test suite type:
- Prioritization, human-written: Low
- Prioritization, automatically-generated: High
- Selection, human-written: Low
- Selection, automatically-generated: Moderate
- Parallelization, human-written: Moderate
- Parallelization, automatically-generated: High
Chance is defined by the average % of dependent tests exposed: Low 0-25%, Moderate 25-50%, High 50%+
Dependent tests do affect downstream testing techniques, especially for automatically-generated test suites!
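As a worked check, the Low/Moderate/High labels for prioritization and selection can be re-derived from the per-technique percentages reported on the evaluation slides. The `chance` helper is hypothetical. The slide does not say which bucket the 25% and 50% boundaries fall into, so assigning them to the higher bucket is an assumption.

```python
from statistics import mean

def chance(avg_percent):
    # Assumption: boundary values 25 and 50 fall into the higher bucket.
    if avg_percent < 25:
        return "Low"
    if avg_percent < 50:
        return "Moderate"
    return "High"

# Percentages of dependent tests exposed, copied from the evaluation slides.
exposed = {
    ("Prioritization", "Human-written"): [13.5, 24.3, 18.9, 16.2],
    ("Prioritization", "Automatically-generated"): [61.2, 54.4, 62.3, 58.7],
    ("Selection", "Human-written"): [2.7, 2.7, 2.7, 2.7, 2.7, 5.4],
    ("Selection", "Automatically-generated"): [15.6, 17.9, 17.9, 44.0, 48.4, 48.8],
}

averages = {key: mean(values) for key, values in exposed.items()}
summary = {key: chance(avg) for key, avg in averages.items()}
```

The averages come out at roughly 18%, 59%, 3.2%, and 32%, matching the implications stated on the evaluation slides and the labels in the summary.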
General algorithm to nullify test dependence
Inputs:
- A test suite: the product of test prioritization, selection, or parallelization
- Known test dependences
  - Can be generated through approximate algorithms [Zhang et al. ISSTA’14], or can be empty
  - Reusable for different testing techniques and when developers change their code
Output: a reordered/amended test suite
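One way to picture the reorder/amend step is the sketch below. It is not the authors' algorithm. It assumes each known dependence has the form "test X needs tests Y... to run before it", and the name `enforce_dependences` and the `depends_on` mapping are hypothetical.

```python
def enforce_dependences(order, depends_on):
    """Return `order` with every prerequisite placed before its dependent test.

    Prerequisites missing from `order` (e.g. dropped by selection) are added.
    """
    result, placed = [], set()

    def place(test, visiting=()):
        if test in placed:
            return
        if test in visiting:
            raise ValueError(f"cyclic dependence involving {test}")
        for prerequisite in depends_on.get(test, []):
            place(prerequisite, visiting + (test,))
        placed.add(test)
        result.append(test)

    for test in order:
        place(test)
    return result

# Hypothetical known dependence: t_read needs t_create to run first.
depends_on = {"t_read": ["t_create"]}
reordered = enforce_dependences(["t_read", "t_size", "t_create"], depends_on)
amended = enforce_dependences(["t_read"], depends_on)
```

The first call reorders a prioritized suite. The second amends a selected suite by adding back the dropped prerequisite. For parallelization, the same function could be applied to each CPU's subsequence separately. An empty `depends_on` leaves the suite unchanged.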
Prioritization algorithm to nullify test dependence
- Pipeline: a test suite is prioritized, then the general algorithm uses the known test dependences to turn the prioritized test suite into a reordered test suite
- Measured the average percentage of faults detected (APFD): the area under the curve of percentage of faults detected over the life of the test suite
  - APFD of the original prioritization algorithms was 89.1%
  - APFD of this dependence-aware algorithm was 88.1%, a negligible difference
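For reference, APFD is commonly computed as 1 - (TF1 + ... + TFm) / (n * m) + 1 / (2n), where n is the number of tests, m the number of faults, and TFi the 1-based position of the first test that detects fault i. The sketch below applies that standard formula to made-up data. The fault-to-test mapping is hypothetical, and every fault is assumed to be detected by at least one test in the order.

```python
def apfd(order, detects):
    """Average percentage of faults detected for a test execution order.

    `detects` maps each fault to the set of tests that reveal it.
    """
    n, m = len(order), len(detects)
    position = {test: i for i, test in enumerate(order, start=1)}
    first_detection = [min(position[t] for t in tests)
                       for tests in detects.values()]
    return 1 - sum(first_detection) / (n * m) + 1 / (2 * n)

# Hypothetical faults and the tests that reveal them.
detects = {"fault1": {"t3"}, "fault2": {"t1", "t4"}}
original = apfd(["t1", "t2", "t3", "t4"], detects)
prioritized = apfd(["t3", "t1", "t2", "t4"], detects)
```

Moving the fault-revealing tests earlier raises the score. A dependence-aware reordering that pushes a test slightly later lowers it a little, which is the kind of gap the slide quantifies as negligible.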
Selection algorithm to nullify test dependence
- Pipeline: a test suite is reduced by selection, then the general algorithm uses the known test dependences to turn the selected test suite into a reordered/amended test suite
- Measured the number of tests selected
  - The original selection algorithms selected 41.6% of the tests on average
  - This dependence-aware algorithm selected 42.2%, a negligible difference
Parallelization algorithm to nullify test dependence
- Pipeline: a test suite is split by parallelization into subsequences, then the general algorithm uses the known test dependences to turn each subsequence into a reordered/amended test suite
- Measured the time taken by the slowest machine and its average speedup compared to unparallelized suites
  - Average speedup of the original parallelization algorithms was 41%
  - This dependence-aware algorithm’s speedup was 55%
Future work
- For test selection, measure the time it takes for our dependence-aware test suites to run compared to the dependence-unaware test suites
- Evaluate our effectiveness at incrementally recomputing test dependences when developers make code changes
Contributions
- Evaluating and coping with the impact of test dependence
  - Test dependence arises in practice
  - Test dependence does affect downstream testing techniques
  - Our general algorithm is effective in practice at nullifying the impact of test dependence
- Our tools, experiments, etc.
[Backup slides]
Why more dependent tests in automatically-generated test suites?
- Manual test suites:
  - Developers’ understanding of the code and their testing goals helps build well-structured tests
  - Developers often try to initialize and destroy the shared objects each unit test may use
- Auto test suites:
  - Most tools are not “state-aware”
  - The generated tests often “misuse” APIs, e.g., setting up the environment incorrectly
  - Most tools cannot generate environment setup / destroy code
Dependent tests vs. nondeterministic tests
- Nondeterminism does not imply dependence
  - A program may execute non-deterministically, but its tests may deterministically succeed
- Test dependence does not imply nondeterminism
  - A program may have no sources of nondeterminism, but its tests can still be dependent on each other