When Tests Collide: Evaluating and Coping with the Impact of Test Dependence
Wing Lam, Sai Zhang, Michael D. Ernst (University of Washington)

Dependent test
Two tests:
  createFile("foo") ...
  readFile("foo") ...
Executing them in the default order, createFile before readFile, yields the intended test results.
Executing them in a different order, readFile before createFile, does not: the tests are order dependent.
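To make this concrete, here is a minimal JUnit 4 sketch of such an order-dependent pair; the test class and file name are our own illustration, not code from the deck:

```java
import static org.junit.Assert.*;

import java.io.File;
import java.io.FileWriter;
import java.io.IOException;

import org.junit.Test;

public class FileTests {
    // Self-contained: creates the file it then checks.
    @Test
    public void createFile() throws IOException {
        try (FileWriter w = new FileWriter("foo")) {
            w.write("data");
        }
        assertTrue(new File("foo").exists());
    }

    // Order dependent: passes only if some earlier test created "foo".
    @Test
    public void readFile() {
        assertTrue("expected 'foo' to exist", new File("foo").exists());
    }
}
```

Under an order that runs createFile first, both tests pass; any order that runs readFile first makes it fail, even though neither test changed.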

Why should we care about test dependence?
- It makes test behaviors inconsistent.
- It affects downstream testing techniques: test prioritization, test selection, and test parallelization (scheduling tests across CPU 1, CPU 2, ...).

Test independence is assumed by:
- Test selection
- Test prioritization
- Test parallel execution
- Test factoring
- Test generation
- ...
Conventional wisdom: test dependence is not a significant issue.
Of 31 papers in ICSE, FSE, ISSTA, ASE, ICST, TSE, and TOSEM (2000-2013), each either assumes test independence without justification, mentions test dependence only as a threat to validity, or actually considers test dependence.

Recent work
- Illinois MIR work on flaky tests and test dependences [Luo FSE'14, Gyori ISSTA'15, Gligoric ISSTA'15]: tests that reveal inconsistent results. A dependent test is a special type of flaky test.
- UW PLSE work empirically revisiting the test independence assumption [Zhang et al. ISSTA'14]: test dependence should not be ignored.

Is the test independence assumption valid? No!
- Does test dependence arise in practice? Yes, in both human-written and automatically-generated suites.
- What repercussions does test dependence have? It affects downstream testing techniques.
- How can we nullify the impact of test dependence? A general algorithm adds/reorders tests for techniques such as prioritization.

Next question: Does test dependence arise in practice?

Methodology
- Reported dependent tests: from 5 issue tracking systems.
- New dependent tests: from 5 real-world projects.

Methodology: reported dependent tests from 5 issue tracking systems
- Searched for 4 key phrases: "dependent test", "test dependence", "test execution order", "different test outcome".
- Manually inspected 450 matched bug reports.
- Identified 96 distinct dependent tests.
- Characterized their root cause and the developers' action.

Root cause of the 96 reported dependent tests: static variable, file system, database, or unknown. At least 61% are due to side-effecting access to static variables.
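As an illustration of that dominant root cause, here is a hypothetical JUnit 4 pair coupled through a static variable (the Counter class is invented for this sketch):

```java
import static org.junit.Assert.*;

import org.junit.Test;

// Invented class under test, with mutable static state.
class Counter {
    static int count = 0;
    static void increment() { count++; }
}

public class CounterTests {
    // Assumes the pristine initial static state.
    @Test
    public void testInitial() {
        assertEquals(0, Counter.count);
    }

    // Side effect: mutates the static variable shared by all tests.
    @Test
    public void testIncrement() {
        Counter.increment();
        assertEquals(1, Counter.count);
    }
}
```

Running testIncrement before testInitial makes testInitial fail, because the mutated static count survives across test boundaries.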

Developers' action
- 98% of the reported tests are marked as major or minor issues.
- 91% of the dependences have been fixed, by improving documentation or by fixing test code or source code.

Methodology: new dependent tests from 5 real-world projects
The subjects were selected from a previous project [Zhang et al. ISSTA'14] that identified dependent tests in them.
- Human-written test suites: 6413 tests, of which 37 (0.6%) are dependent tests.
- Automatically-generated test suites (using Randoop [Pacheco'07]): 608 (5.5%) dependent tests.

Next question: What repercussions does test dependence have?

Test prioritization
Takes a test execution order and produces a new test execution order, in order to achieve coverage faster and improve the fault detection rate. It assumes each test yields the same result under the new order.

Four test prioritization techniques [Elbaum et al. ISSTA 2000]
- Prioritize on coverage of statements
- Prioritize on coverage of statements not yet covered
- Prioritize on coverage of methods
- Prioritize on coverage of methods not yet covered
For each technique, on 5 real-world projects, record the number of dependent tests yielding different results (out of 37 human-written and 608 automatically-generated dependent tests in total). A sketch of the first two techniques appears below.
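The "total" and "additional" statement-coverage strategies can be sketched as follows, under our own data-structure assumptions (a map from test name to the set of statement ids it covers); this is an illustration, not the authors' implementation:

```java
import java.util.*;

public class Prioritizer {
    // "Total" strategy: order tests by how many statements each covers.
    static List<String> prioritizeTotal(Map<String, Set<Integer>> cov) {
        List<String> order = new ArrayList<>(cov.keySet());
        order.sort((a, b) -> cov.get(b).size() - cov.get(a).size());
        return order;
    }

    // "Additional" strategy: repeatedly pick the test that covers the
    // most statements not yet covered by the tests chosen so far.
    static List<String> prioritizeAdditional(Map<String, Set<Integer>> cov) {
        List<String> order = new ArrayList<>();
        Set<String> remaining = new HashSet<>(cov.keySet());
        Set<Integer> covered = new HashSet<>();
        while (!remaining.isEmpty()) {
            String best = null;
            int bestGain = -1;
            for (String t : remaining) {
                Set<Integer> gain = new HashSet<>(cov.get(t));
                gain.removeAll(covered);
                if (gain.size() > bestGain) { bestGain = gain.size(); best = t; }
            }
            order.add(best);
            covered.addAll(cov.get(best));
            remaining.remove(best);
        }
        return order;
    }
}
```

The method-coverage variants are identical except that the sets hold method ids rather than statement ids. Either way, the new order is only trustworthy if each test yields the same result in any position, which is exactly the assumption dependent tests violate.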

Evaluating test prioritization techniques
Out of 37 human-written dependent tests, the number yielding different results:
- Prioritize on coverage of statements: 5 (13.5%)
- Prioritize on coverage of statements not yet covered: 9 (24.3%)
- Prioritize on coverage of methods: 7 (18.9%)
- Prioritize on coverage of methods not yet covered: 6 (16.2%)
Implication: on average, an 18% chance that test dependence would affect test prioritization of human-written tests.

Evaluating test prioritization techniques
Out of 608 automatically-generated dependent tests, the number yielding different results:
- Prioritize on coverage of statements: 372 (61.2%)
- Prioritize on coverage of statements not yet covered: 331 (54.4%)
- Prioritize on coverage of methods: 381 (62.3%)
- Prioritize on coverage of methods not yet covered: 357 (58.7%)
Implication: on average, a 59% chance that test dependence would affect test prioritization of automatically-generated tests.

Test selection
Takes a test execution order and produces a subset of the test execution order that runs faster. It assumes each selected test yields the same result.

Six test selection techniques [Harrold et al. OOPSLA 2001]
- Statement granularity, ordered by test id (no re-ordering)
- Statement granularity, ordered by number of elements tests cover
- Statement granularity, ordered by number of uncovered elements tests cover
- Function granularity, ordered by test id (no re-ordering)
- Function granularity, ordered by number of elements tests cover
- Function granularity, ordered by number of uncovered elements tests cover
For each technique, on 5 real-world projects, record the number of dependent tests yielding different results (out of 37 human-written and 608 automatically-generated dependent tests in total); a simplified selection sketch follows this list.
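A rough sketch of the coverage-based selection step (our own simplification, not the paper's implementation; an element id stands for a statement or a function, depending on granularity):

```java
import java.util.*;

public class Selector {
    // Select, in the original order, every test whose coverage
    // intersects the set of changed elements.
    static List<String> select(List<String> order,
                               Map<String, Set<Integer>> cov,
                               Set<Integer> changed) {
        List<String> selected = new ArrayList<>();
        for (String t : order) {
            Set<Integer> touched = new HashSet<>(cov.get(t));
            touched.retainAll(changed);  // coverage intersected with changes
            if (!touched.isEmpty()) selected.add(t);
        }
        return selected;
    }
}
```

The re-ordering variants then sort the selected tests by how many elements (or uncovered elements) they cover, as in the prioritization sketch above. Dropping a test that others silently depend on is what makes the surviving tests change their results.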

Evaluating test selection techniques
Out of 37 human-written dependent tests, the number yielding different results:
- Statement granularity, test id (no re-ordering): 1 (2.7%)
- Statement granularity, number of elements tests cover: 1 (2.7%)
- Statement granularity, number of uncovered elements tests cover: 1 (2.7%)
- Function granularity, test id (no re-ordering): 1 (2.7%)
- Function granularity, number of elements tests cover: 1 (2.7%)
- Function granularity, number of uncovered elements tests cover: 2 (5.4%)
Implication: on average, a 3.2% chance that test dependence would affect test selection of human-written tests.

Evaluating test selection techniques
Out of 608 automatically-generated dependent tests, the number yielding different results:
- Statement granularity, test id (no re-ordering): 95 (15.6%)
- Statement granularity, number of elements tests cover: 109 (17.9%)
- Statement granularity, number of uncovered elements tests cover: 109 (17.9%)
- Function granularity, test id (no re-ordering): 266 (44.0%)
- Function granularity, number of elements tests cover: 294 (48.4%)
- Function granularity, number of uncovered elements tests cover: 297 (48.8%)
Implication: on average, a 32% chance that test dependence would affect test selection of automatically-generated tests.

Test parallelization
Takes a test execution order and schedules it across multiple CPUs (CPU 1, CPU 2, ...) to reduce test latency. It assumes each test yields the same result.

Two test parallelization techniques [1]
- Parallelize on test id
- Parallelize on test execution time
For each technique, on 5 real-world projects, record the number of dependent tests yielding different results (out of 37 human-written and 608 automatically-generated dependent tests in total); a scheduling sketch follows.
[1] Executing unit tests in parallel on a multi-CPU/core machine in Visual Studio. .../executing-unit-tests-in-parallel-on-a-multi-cpu-core-machine.aspx
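The two schedules might look like the following minimal sketch (our own illustration, assuming k CPUs and per-test running times in milliseconds): round-robin assignment by test position versus greedy longest-first assignment to the least-loaded CPU:

```java
import java.util.*;

public class ParallelScheduler {
    // Parallelize on test id: test i goes to CPU i mod k.
    static List<List<String>> byTestId(List<String> tests, int k) {
        List<List<String>> cpus = new ArrayList<>();
        for (int c = 0; c < k; c++) cpus.add(new ArrayList<>());
        for (int i = 0; i < tests.size(); i++)
            cpus.get(i % k).add(tests.get(i));
        return cpus;
    }

    // Parallelize on execution time: sort tests longest-first, then
    // always give the next test to the currently least-loaded CPU.
    static List<List<String>> byExecTime(Map<String, Long> timeMs, int k) {
        List<String> tests = new ArrayList<>(timeMs.keySet());
        tests.sort((a, b) -> Long.compare(timeMs.get(b), timeMs.get(a)));
        List<List<String>> cpus = new ArrayList<>();
        long[] load = new long[k];
        for (int c = 0; c < k; c++) cpus.add(new ArrayList<>());
        for (String t : tests) {
            int least = 0;
            for (int c = 1; c < k; c++) if (load[c] < load[least]) least = c;
            cpus.get(least).add(t);
            load[least] += timeMs.get(t);
        }
        return cpus;
    }
}
```

Either schedule splits the original order into per-CPU subsequences, so any cross-CPU dependence between two tests is broken.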

Evaluating test parallelization techniques
Out of 37 human-written dependent tests, the number yielding different results:
- Parallelize on test id: 2 (5.4%) with 2 CPUs; 13 (35.1%) with 16 CPUs
- Parallelize on test execution time: 18 (48.6%) with 2 CPUs; 14 (37.8%) with 16 CPUs
(4 and 8 CPUs were evaluated but omitted for space reasons.)
Implication: on average, with 2 CPUs there is a 27% chance that test dependence would affect test parallelization of human-written tests; with 16 CPUs, a 36% chance.

Evaluating test parallelization techniques
Out of 608 automatically-generated dependent tests, the number yielding different results:
- Parallelize on test id: 194 (31.9%) with 2 CPUs; 349 (57.4%) with 16 CPUs
- Parallelize on test execution time: 360 (59.2%) with 2 CPUs; 433 (71.2%) with 16 CPUs
(4 and 8 CPUs were evaluated but omitted for space reasons.)
Implication: on average, with 2 CPUs there is a 46% chance that test dependence would affect test parallelization of automatically-generated tests; with 16 CPUs, a 64% chance.

Impact of test dependence: summary
- Prioritization, human-written: low chance of impact
- Prioritization, automatically-generated: high
- Selection, human-written: low
- Selection, automatically-generated: moderate
- Parallelization, human-written: moderate
- Parallelization, automatically-generated: high
(Low: 0-25% of dependent tests exposed on average; Moderate: 25-50%; High: over 50%.)
Dependent tests do affect downstream testing techniques, especially for automatically-generated test suites!

Next question: How can we nullify the impact of test dependence?

General algorithm to nullify test dependence
Inputs: a test suite (the product of test prioritization, selection, or parallelization) and known test dependences. Output: a reordered/amended test suite.
- Known test dependences can be generated through approximate algorithms [Zhang et al. ISSTA'14], or can start empty.
- They are reusable for different testing techniques and when developers change their code.
A sketch of the reordering step follows.
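A minimal sketch of the reordering/amending idea, in our own formulation (dependences given as acyclic "must run before" edges; this is not the paper's algorithm verbatim): walk the proposed order and, before emitting each test, first emit any not-yet-emitted tests it depends on.

```java
import java.util.*;

public class DependenceAwareReorder {
    // deps maps a dependent test to the tests that must run before it.
    // Returns proposedOrder amended so every known dependence is
    // satisfied, inserting missing prerequisite tests when necessary.
    // Assumes the dependence relation is acyclic.
    static List<String> amend(List<String> proposedOrder,
                              Map<String, List<String>> deps) {
        List<String> result = new ArrayList<>();
        Set<String> emitted = new HashSet<>();
        for (String t : proposedOrder) emit(t, deps, result, emitted);
        return result;
    }

    private static void emit(String t, Map<String, List<String>> deps,
                             List<String> result, Set<String> emitted) {
        if (!emitted.add(t)) return;  // already scheduled
        for (String pre : deps.getOrDefault(t, Collections.emptyList()))
            emit(pre, deps, result, emitted);  // prerequisites come first
        result.add(t);
    }
}
```

For example, if prioritization proposes [readFile, createFile] and readFile is known to depend on createFile, the amended order is [createFile, readFile]. An empty dependence map leaves the proposed order unchanged.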

Prioritization algorithm to nullify test dependence
Pipeline: a test suite → prioritization → prioritized test suite → general algorithm (with known test dependences) → reordered test suite.
Measured the average area under the curve (APFD) for the percentage of faults detected over the life of the test suite:
- APFD of the original prioritization algorithms: 89.1%.
- APFD of this dependence-aware algorithm: 88.1%, a negligible difference.
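For reference, APFD has a standard definition in the prioritization literature (stated here for the reader; it does not appear on the slide): for a suite of n tests detecting m faults, where TF_i is the position in the order of the first test that reveals fault i,

$$\mathrm{APFD} = 1 - \frac{TF_1 + TF_2 + \cdots + TF_m}{n \cdot m} + \frac{1}{2n}$$

Higher is better: a suite whose early tests expose faults sooner earns a larger area under the faults-detected curve.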

Selection algorithm to nullify test dependence
Pipeline: a test suite → selection → selected test suite → general algorithm (with known test dependences) → reordered/amended test suite.
Measured the number of tests selected:
- The original selection algorithms selected 41.6% of the tests on average.
- This dependence-aware algorithm selected 42.2%, a negligible difference.

Parallelization algorithm to nullify test dependence
Pipeline: a test suite → parallelization → subsequences of the test suite → general algorithm (with known test dependences) → reordered/amended test suite.
Measured the time taken by the slowest machine, and the average speedup compared to unparallelized suites:
- Average speedup of the original parallelization algorithms: 41%.
- This dependence-aware algorithm's speedup: 55%.
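One plausible reading of this metric (our assumption; the slide does not spell it out): the parallel running time is the makespan, i.e., the finishing time of the slowest machine, and the speedup is the relative reduction against the unparallelized suite:

$$T_{\text{parallel}} = \max_{c \in \text{CPUs}} T_c, \qquad \text{speedup} = \frac{T_{\text{unparallelized}} - T_{\text{parallel}}}{T_{\text{unparallelized}}}$$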

Future work
- For test selection, measure the time our dependence-aware test suites take to run, compared to the dependence-unaware test suites.
- Evaluate our effectiveness at incrementally recomputing test dependences when developers make code changes.

Contributions
Evaluating and coping with the impact of test dependence:
- Test dependence arises in practice.
- Test dependence does affect downstream testing techniques.
- Our general algorithm is effective in practice at nullifying the impact of test dependence.
Our tools, experiments, etc.

[Backup slides]

Why are there more dependent tests in automatically-generated test suites?
Manual test suites:
- Developers' understanding of the code and their testing goals helps them build well-structured tests.
- Developers often try to initialize and destroy the shared objects each unit test may use.
Automatically-generated test suites:
- Most tools are not "state-aware".
- The generated tests often "misuse" APIs, e.g., setting up the environment incorrectly.
- Most tools cannot generate environment setup/destroy code.

Dependent tests vs. nondeterministic tests
- Nondeterminism does not imply dependence: a program may execute nondeterministically, yet its tests may deterministically succeed.
- Test dependence does not imply nondeterminism: a program may have no sources of nondeterminism, yet its tests can still be dependent on each other.