Owolabi Legunsen, Farah Hariri, August Shi,

Presentation transcript:

An Extensive Study of Static Regression Test Selection in Modern Software Evolution Owolabi Legunsen, Farah Hariri, August Shi, Yafeng Lu, Lingming Zhang, and Darko Marinov CS 498ST Lecture 09/21/2017 University of Illinois at Urbana Champaign Hello everyone, my name is Owolabi and today, I will be presenting “An extensive study of static regression test selection in modern software evolution”. This is work that I did together with Farah, August, Yafeng, Lingming and Darko. We are affiliated with the University of Illinois and the University of Texas at Dallas. CCF-1409423, CCF-1421503, CCF-1438982, CCF-1439957, CCF-1566589

Regression Testing [Figure: dependency graph over code entities A through F and tests T1 through T4] In regression testing, all tests are run after every change to ensure that code changes did not break existing functionality. To illustrate, let’s consider the following example that I will use in the rest of my talk. Suppose that a codebase consists of entities A through F, which have the dependencies shown by these arrows. For example, the arrow between A and B means that A depends on B. Further, there are tests T1 through T4 for this code. Now, let’s say that a change is made to F. In regression testing, all the tests are rerun. The problem with regression testing is that it can be slow, especially when there are many tests.

Recall: Regression Testing Pros & Cons. Pros: automated testing after every change; early detection of faults; ensures faults that are fixed stay fixed. Cons: long time to run (N is very large!); expensive to maintain; potentially unreliable (flaky tests).

Regression Testing Rerun tests to ensure that code changes did not break existing functionality. [Figure: dependency graph over code entities A through F and tests T1 through T4] Problem: Regression testing can be very slow! (many tests)

Regression Testing can be SLOW! August Shi told you about regression testing at Microsoft. Tests for a single product at CompanyA take 7 days! Let’s look at some examples from open source.

Regression Testing can be SLOW! (2) In Parallel In Sequence

Regression Testing can be SLOW! (3) In Parallel In Sequence

Discussion What is your own personal experience with slow regression testing? What would be your response if your tests took 5 hours to run after every change?

Today’s Lecture Speeding up Regression Testing through Regression Test Selection (RTS) Discuss FSE 2016 paper in which we evaluated various approaches to RTS Demonstrate Open-source RTS tool, STARTS (https://github.com/TestingResearchIllinois/starts)

Regression Test Selection (RTS) Speed up regression testing by rerunning only tests that are affected by code changes. Quiz: Which classes can change behavior due to a change in C? Quiz: Which test classes need to be rerun? [Figure: dependency graph over code entities A through F and tests T1 through T4] The goal of regression test selection, RTS, is to speed up regression testing by rerunning only tests that are affected by code changes. Given our previous example and the change to F, an RTS technique will first find that both A and B depend on F and then select to rerun only T1 and T2, which are the only tests that are affected by the changes. Finding dependencies among classes can be done statically or dynamically. In this paper, we studied static RTS techniques and compared them with a state-of-the-art dynamic RTS technique.

Regression Test Selection (RTS) Speed up regression testing by rerunning only tests that are affected by code changes. [Figure: dependency graph over code entities A through F and tests T1 through T4] Finding dependencies can be done statically or dynamically. This paper: we studied static RTS approaches and compared them with state-of-the-art dynamic RTS.

Motivation for our Study. Dynamic RTS has been getting adopted recently: Ekstazi (http://ekstazi.org/) is used in several projects; Clover (http://openclover.org/) is an open-source code coverage tool that also does RTS. Dynamic RTS may not always be applicable: instrumentation costs can be high; dependencies may be incomplete, e.g., due to non-determinism. Static RTS was proposed previously but not evaluated at scale on modern software. We conducted this study for the following reasons: Although dynamic RTS has been gaining adoption recently, it has some limitations that mean it is not always applicable. For example, dynamic RTS relies on instrumentation, which can be costly, and the dependencies it captures may be incomplete in cases where the code under test exhibits nondeterminism. In addition, although static RTS was proposed over two decades ago, it was not evaluated at scale on modern software systems.

How RTS works [Diagram: Code + Tests feed "Find Dependencies", which produces Dependencies; Dependencies and Changes feed "Analyze Dependencies", which produces Affected Tests] Before I go into details about our study, let me first say more about RTS in general. There are two inputs to an RTS technique: the code plus tests, and the changes between two versions of the codebase. The output is the set of tests that are affected by the changes. First, an RTS technique finds the dependencies among the various entities in the codebase. Then, using the changes, these dependencies are analyzed in order to find the affected tests. Note that an affected test is one that can behave differently after the code changes, and a test is affected if any of its dependencies changed. An affected test can behave differently due to code changes. A test is affected if any of its dependencies changed.

Finding and Analyzing Dependencies Dependencies: entities that can affect test behavior. Finding Dependencies: T1 depends on A, B, C, D; T2 depends on B, C; T3 depends on E; T4 depends on D, E, F. [Figure: dependency graph over code entities A through F and tests T1 through T4] Quiz: Did we capture all the dependencies? A test’s dependencies are any entities that can affect test behavior. To illustrate the finding and analyzing of dependencies in an RTS technique, let us consider again our example, where F is the only change. An RTS technique first finds the dependencies for each test. In this case, it finds that T1 depends on A, B, C and F, that T2 depends on B and F, and so on. Note that each test depends on itself. Now, given the information that only F changed, these dependencies are analyzed to find which tests depend on F, resulting in T1 and T2 being returned as the affected tests.

Finding and Analyzing Dependencies Dependencies: entities that can affect test behavior. Finding Dependencies: T1 depends on A, B, C, D, T1; T2 depends on B, C, T2; T3 depends on E, T3; T4 depends on D, E, F, T4. Analyzing Dependencies: T1 & T2 are affected. [Figure: dependency graph over code entities A through F and tests T1 through T4]
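The "analyze dependencies" step can be sketched in a few lines of Java. This is only an illustration, not code from any RTS tool: the class and method names are made up, the dependency sets are the ones listed on the slide (each test includes itself), and treating C as the changed class is an assumption taken from the earlier quiz slide.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: names are illustrative, not part of Ekstazi or STARTS.
class AffectedTestsSketch {
    /** Dependency sets as listed on the slide; each test also depends on itself. */
    static Map<String, Set<String>> slideDependencies() {
        Map<String, Set<String>> deps = new LinkedHashMap<>();
        deps.put("T1", Set.of("A", "B", "C", "D", "T1"));
        deps.put("T2", Set.of("B", "C", "T2"));
        deps.put("T3", Set.of("E", "T3"));
        deps.put("T4", Set.of("D", "E", "F", "T4"));
        return deps;
    }

    /** A test is affected if at least one of its dependencies changed. */
    static Set<String> affectedTests(Map<String, Set<String>> deps, Set<String> changed) {
        Set<String> affected = new TreeSet<>();
        for (Map.Entry<String, Set<String>> entry : deps.entrySet()) {
            for (String dependency : entry.getValue()) {
                if (changed.contains(dependency)) {
                    affected.add(entry.getKey());
                    break; // one changed dependency is enough
                }
            }
        }
        return affected;
    }

    public static void main(String[] args) {
        // Assumption: C is the only changed class.
        System.out.println(affectedTests(slideDependencies(), Set.of("C"))); // prints [T1, T2]
    }
}
```

With these sets, a change to C selects T1 and T2, matching the "T1 & T2 are affected" line on the slide; a change to F would select only T4.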

Important RTS Considerations End-to-end time of RTS must be less than the time to run all tests. [Figure: one bar for "Run All Tests"; below it, a shorter bar made of "Find Dependencies", "Analyze" and "Run Affected Tests", labeled "End-to-End Time for RTS"; the difference is labeled "Time Savings"] There are certain considerations that are very important for any RTS technique. First, the end-to-end time for any RTS technique must be less than the time to simply rerun all the tests. Suppose that this big bar represents the time to run all tests. The time for RTS is shown below as the sum of the time to find and analyze dependencies as well as to run the affected tests. This is the end-to-end time for RTS, and the goal is to keep this end-to-end time as small as possible to maximize the time savings. There are two other important considerations: RTS is “safe” if it selects to rerun all affected tests, and “precise” if it selects to rerun only affected tests. RTS is safe if it selects to rerun all affected tests. RTS is precise if it selects to rerun only affected tests.
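The three considerations above can be written down as simple checks. The sketch below is hypothetical (class and method names are invented, and the timings used in the usage note are made-up numbers); it only restates the definitions from the slide.

```java
import java.util.Set;

// Hypothetical sketch of the definitions on this slide; not taken from any tool.
class RtsPropertiesSketch {
    /** End-to-end RTS time: find dependencies + analyze them + run the selected tests. */
    static long endToEndSeconds(long findSeconds, long analyzeSeconds, long runSelectedSeconds) {
        return findSeconds + analyzeSeconds + runSelectedSeconds;
    }

    /** RTS only pays off when its end-to-end time is below the time to run all tests. */
    static boolean savesTime(long runAllSeconds, long endToEndSeconds) {
        return endToEndSeconds < runAllSeconds;
    }

    /** Safe: every affected test is selected. */
    static boolean isSafe(Set<String> selected, Set<String> affected) {
        return selected.containsAll(affected);
    }

    /** Precise: every selected test is affected. */
    static boolean isPrecise(Set<String> selected, Set<String> affected) {
        return affected.containsAll(selected);
    }
}
```

For example, with assumed timings of 40 s to find dependencies, 5 s to analyze them and 150 s to run the selected tests, the end-to-end time is 195 s, which saves time against a 600 s full run but not against a 100 s one. A selection of {T1, T2, T3} when only {T1, T2} are affected is safe but not precise.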

RTS Techniques Evaluated Finding dependencies can be done dynamically or statically Dependencies can be at different levels of granularity, e.g., methods, classes, jar files, etc. In this paper, we compare these approaches: Class-Level Dynamic Class-Level Static Method-Level Static End-to-End Time Safety Precision ? ? ? Like I mentioned before, finding dependencies can be done statically or dynamically. In addition, dependencies can be at different levels of granularity such as methods, classes, jar files and so on. In this paper, we evaluated and compared three RTS techniques in terms of End-to-end time, safety and precision: a class-level dynamic technique, a class-level static technique and a method-level static technique. We found the method-level RTS technique performed rather poorly, compared with the class-level techniques and I will not be discussing it in the rest of my talk. Please see our paper for the details. See details on method-level RTS in paper

Class-Level Dynamic RTS (Ekstazi [1]) Find Dependencies: dynamically track classes used while running each test class. Changes: classes whose .class files differ. Analyze Dependencies: select test classes for which any dependency changed. The class-level dynamic RTS technique that we evaluated is Ekstazi, which was proposed by Gligoric et al. in their ISSTA 2015 paper. To find dependencies, Ekstazi dynamically tracks what classes are used while running each test class. Ekstazi treats as changed any class whose bytecode changed since the previous version, and it selects to rerun the test classes for which at least one dependency changed. [1] M. Gligoric, L. Eloussi, and D. Marinov. Practical Regression Test Selection with Dynamic File Dependencies. ISSTA 2015
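One plausible way to detect "classes whose .class files differ" is to store a checksum per class file and compare it on the next run. The sketch below is an assumption for illustration only: it is not Ekstazi's implementation, the class and method names are invented, and it uses a plain CRC32 over the raw bytes, whereas a real tool may normalize the bytecode before hashing.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.zip.CRC32;

// Hypothetical sketch of checksum-based change detection; not Ekstazi's actual code.
class ChangedClassesSketch {
    /** Checksum of the raw bytes of one .class file. */
    static long checksum(byte[] classFileBytes) {
        CRC32 crc = new CRC32();
        crc.update(classFileBytes);
        return crc.getValue();
    }

    /**
     * Returns the names of classes whose current bytes do not match the stored checksum.
     * Classes with no stored checksum (new classes) are treated as changed.
     * Deleted classes are ignored in this sketch.
     */
    static Set<String> changedClasses(Map<String, Long> oldChecksums,
                                      Map<String, byte[]> currentClassFiles) {
        Set<String> changed = new TreeSet<>();
        for (Map.Entry<String, byte[]> entry : currentClassFiles.entrySet()) {
            Long old = oldChecksums.get(entry.getKey());
            if (old == null || old != checksum(entry.getValue())) {
                changed.add(entry.getKey());
            }
        }
        return changed;
    }
}
```

The resulting set of changed classes is what the "analyze dependencies" step consumes: any test class with a dependency in that set is selected.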

Class-Level STAtic RTS (STARTS [2]) First, statically build a class dependency graph in which each class has an edge to its direct parents and referenced classes. Find Dependencies: classes reachable from test classes in the graph. Changes: computed in the same way as Ekstazi. Analyze Dependencies: select test classes that reach a changed class in the graph. The static class-level RTS technique that we evaluated works in the following way: First, it builds a class dependency graph in which each class has an edge to its direct parents and any classes that it references. Specifically, the dependency graph is an IRG, as described by Orso et al. in their FSE 2004 paper. Once the IRG is constructed, dependencies are computed as all classes that are reachable from each test class in the IRG. Changes are computed the same way as in Ekstazi, and the tests that are selected to be rerun are those that can reach a changed class in the IRG. [2] https://github.com/TestingResearchIllinois/starts
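The reachability-based selection described here can be sketched as a graph traversal. This is a minimal illustration, not the STARTS implementation: the class name, method names and the small example graph in the usage note are all hypothetical, and building the graph from bytecode is out of scope.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of class-level static RTS over a class dependency graph.
class StaticRtsSketch {
    /** All classes reachable from 'start' (including 'start' itself) by following edges. */
    static Set<String> reachable(Map<String, List<String>> graph, String start) {
        Set<String> seen = new HashSet<>();
        Deque<String> worklist = new ArrayDeque<>();
        worklist.push(start);
        while (!worklist.isEmpty()) {
            String cls = worklist.pop();
            if (seen.add(cls)) { // visit each class once, so cycles terminate
                for (String next : graph.getOrDefault(cls, List.of())) {
                    worklist.push(next);
                }
            }
        }
        return seen;
    }

    /** Select every test class that can reach at least one changed class. */
    static Set<String> selectTests(Map<String, List<String>> graph,
                                   Set<String> testClasses,
                                   Set<String> changedClasses) {
        Set<String> selected = new TreeSet<>();
        for (String test : testClasses) {
            if (!Collections.disjoint(reachable(graph, test), changedClasses)) {
                selected.add(test);
            }
        }
        return selected;
    }
}
```

As a usage example with an assumed graph T1 → A → B → C, T2 → B and T3 → E, a change to C selects T1 and T2, while T3 is skipped because it cannot reach C. Because each test class is reachable from itself, a change to a test class always selects that test.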

Variants of RTS Techniques Studied We studied 12 RTS techniques in total: 2 variants each of the static and dynamic class-level RTS (Offline: pre-compute dependencies before changes are known; Online: compute dependencies after changes are known) and 8 variants of the static method-level RTS technique. Going into more details about our study, we studied 12 RTS techniques in total. We studied 2 variants each of the class-level RTS techniques: an offline variant and an online variant. In the offline variant, dependencies are precomputed from the old version, before changes are known. That way, once the changes are known, it is much faster to analyze the dependencies and find the affected tests. However, in the online variant, dependencies are computed on the fly, after the changes are known. We also studied 8 variants of the method-level RTS technique and you can find the details in our paper. The main reason for distinguishing online from offline is performance. See details on method-level RTS in paper.

Research Questions RQ1: How do RTS techniques compare w.r.t. number of tests selected? RQ2: How do RTS techniques compare w.r.t. end-to-end time? RQ3: How do static RTS techniques compare with class-level dynamic RTS in terms of precision and safety? RQ4: How do variants of MethSRTS influence the cost/safety trade-offs? We asked four research questions in our evaluation. First, we wanted to know how the different techniques compared in terms of the number of tests selected. Second, we wanted to know how the techniques compared in terms of their end-to-end time. Third, we wanted to know how the static RTS techniques compared with dynamic RTS in terms of safety, and finally, we wanted to know how the variants of the method-level static RTS influenced the cost/safety trade-off. In this talk, I will only discuss the first three RQs; please see our paper for the results concerning RQ4. See answer to RQ4 in paper.

Experimental Setup 22 open-source projects from ASF and GitHub: single-module Maven projects with JUnit4 tests; project sizes from 2 kLOC to 185 kLOC. 985 revisions of these 22 projects. Selection criteria (subset of latest 100 commits): compile successfully; all tests pass; Ekstazi runs successfully. We evaluated all these RTS techniques on 22 open-source projects selected from the Apache Software Foundation and from GitHub. These were all single-module Maven projects with JUnit4 tests, and the project sizes ranged from two thousand lines of code to 185 thousand lines of code. We used a combined total of 985 revisions of these 22 projects, and these revisions were selected in the following way. We started from the latest 100 commits in each project and selected the subset of revisions that satisfied three criteria: compilation was successful, all tests passed, and Ekstazi ran successfully.

RQ1: Tests Selected. Ekstazi selects fewer tests. This plot shows the percentage of tests selected by static class-level RTS and Ekstazi, relative to rerunning all tests in all versions of the 22 projects in our study. The red boxes represent Ekstazi and the blue boxes represent the static class-level RTS technique. As expected, for most of these projects, Ekstazi selects fewer tests than class-level SRTS, when comparing the means, as shown by the direction of these arrows. On average across all these projects, Ekstazi selected 20.6% of tests while the static class-level RTS selected 29.4% of tests. Ekstazi selects fewer tests than STARTS: 20.6% vs. 29.4%.

RQ2: End-to-End Time For RQ2, this table shows the min, max, and average end-to-end times as percentages relative to the time to run all tests in all 22 projects. We’ve sorted the projects in decreasing order of relative end-to-end time. We can see that in these cases, both static and dynamic RTS were slower than simply rerunning all the tests. Running all the tests in these projects took very little time, so RTS overhead slowed down the testing process. If we compare averages, Ekstazi and ClassSRTS have very similar end-to-end times, which is notable considering that ClassSRTS selected more tests, on average, than Ekstazi.

RQ3: Safety and Precision Safety and precision were measured against Ekstazi. Let $E$ be the set of tests selected by Ekstazi and $S$ the set of tests selected by STARTS. Safety violation (STARTS misses Ekstazi-selected tests): $\mathit{SafetyViolation} = \frac{|E \setminus S|}{|E \cup S|}$. Precision violation (STARTS selects tests that Ekstazi does not): $\mathit{PrecisionViolation} = \frac{|S \setminus E|}{|E \cup S|}$. For RQ3, we measured the safety and precision of ClassSRTS relative to Ekstazi. Specifically, we measured “safety violation”, which penalized ClassSRTS for missing tests that Ekstazi selects. We also measured “precision violation”, which penalized ClassSRTS for selecting tests that Ekstazi did not select.
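The two metrics are plain set arithmetic, as the following sketch shows. The class and method names are hypothetical, and returning 0 when both sets are empty is an assumption made here to avoid dividing by zero; the slides do not say how that case is handled.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the two violation metrics; e = Ekstazi's selection, s = STARTS's selection.
class ViolationMetricsSketch {
    /** |E \ S| / |E ∪ S|: fraction of tests that Ekstazi selects but STARTS misses. */
    static double safetyViolation(Set<String> e, Set<String> s) {
        return ratio(minus(e, s), union(e, s));
    }

    /** |S \ E| / |E ∪ S|: fraction of tests that STARTS selects but Ekstazi does not. */
    static double precisionViolation(Set<String> e, Set<String> s) {
        return ratio(minus(s, e), union(e, s));
    }

    private static Set<String> minus(Set<String> a, Set<String> b) {
        Set<String> result = new HashSet<>(a);
        result.removeAll(b);
        return result;
    }

    private static Set<String> union(Set<String> a, Set<String> b) {
        Set<String> result = new HashSet<>(a);
        result.addAll(b);
        return result;
    }

    private static double ratio(Set<String> numerator, Set<String> denominator) {
        // Assumption: no tests selected by either tool means no violation.
        return denominator.isEmpty() ? 0.0 : (double) numerator.size() / denominator.size();
    }
}
```

As a made-up example, take E = {T1, T2, T3} and S = {T2, T3, T4, T5}. The union has five tests, E \ S = {T1} and S \ E = {T4, T5}, so the safety violation is 1/5 = 20% and the precision violation is 2/5 = 40%.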

RQ3: Safety and Precision $\mathit{SafetyViolation} = \frac{|E \setminus S|}{|E \cup S|}$, $\mathit{PrecisionViolation} = \frac{|S \setminus E|}{|E \cup S|}$. Here are the results. In this table, the revs column shows the percentage of revisions where there is a violation (compared to Ekstazi), while the min, max and avg percentages refer to the degree of such violations when they occur. We can see that there are very few revisions with safety violations, and we also see the average percentage of safety violations in those revisions. Overall, ClassSRTS was only unsafe in 0.2% of the revisions and the average safety violation was 6.8%. Here’s what the numbers looked like for precision violations. ClassSRTS was imprecise in 33% of the revisions and the overall average precision violation was 42.9%.

Reflection caused all Safety Violations Example simplified from Apache commons-math. STARTS misses this edge because it is not aware of reflection. [Figure: class dependency graph with nodes InterpolatorTest, AbstractInterpolatorTest, Integrator and AbstractIntegrator; the edge created by the code below is missing]
name = interpolatorName.replaceAll("Interpolator", "Integrator");
Class clz = (Class) Class.forName(name);
i = clz.getConstructor(…).newInstance(field, field.getOne());
Now that we’ve seen that ClassSRTS incurred safety and precision violations, let’s look into *why* violations happened. The reason for safety violations in our experiments was reflection. Consider the following safety violation from Apache commons-math. We show only the relevant portion of the class dependency graph here, and we see that InterpolatorTest depends on AbstractInterpolatorTest and that Integrator depends on AbstractIntegrator, which is a changed class. Given this change, ClassSRTS fails to select to rerun InterpolatorTest because it misses this edge. What is happening is that AbstractInterpolatorTest uses reflection to dynamically create instances of Integrator, so there is a dynamic dependency. However, as ClassSRTS is not aware of reflection, it misses this edge.
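The pattern behind this safety violation can be reproduced in a small self-contained program. Everything here is hypothetical (the class names FooInterpolator and FooIntegrator are stand-ins, not the commons-math classes); the point is only that the class to instantiate is named by a string computed at run time, so the body of the method that creates it contains no compile-time reference to it.

```java
// Hypothetical sketch of a reflective dependency; class names are stand-ins.
class ReflectionDependencySketch {
    static class FooInterpolator {}

    static class FooIntegrator {}

    /**
     * Derives the integrator's class name from the interpolator's name and instantiates it.
     * The method body never mentions FooIntegrator, so an analysis that only follows
     * compile-time references from this method finds no edge to it.
     */
    static Object createIntegratorFor(Class<?> interpolatorClass) {
        String name = interpolatorClass.getName().replace("Interpolator", "Integrator");
        try {
            Class<?> clz = Class.forName(name);
            return clz.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException ex) {
            throw new IllegalStateException("could not instantiate " + name, ex);
        }
    }

    public static void main(String[] args) {
        Object integrator = createIntegratorFor(FooInterpolator.class);
        System.out.println(integrator.getClass().getSimpleName());
    }
}
```

A dynamic technique that records which classes are loaded while the test runs would observe FooIntegrator being loaded, which is why the two approaches can disagree on such code.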

Since this paper was accepted … We open-sourced STARTS https://github.com/TestingResearchIllinois/starts STARTS now handles multi-module Maven projects STARTS now finds dependencies from bytecode much faster We are making STARTS safer with respect to reflection We are evaluating STARTS on larger software systems Email me (legunse2@Illinois.edu) if interested in tool paper describing STARTS implementation

Quick STARTS Demo Video Demo: https://youtu.be/PCNtk8jphrM README shows how to add STARTS to a project: https://github.com/TestingResearchIllinois/starts/blob/master/README.md

Conclusions We performed the first, large-scale empirical study of static RTS and its comparison with dynamic RTS At the class level, we found static RTS (STARTS) comparable with state-of-the-art dynamic RTS (Ekstazi) Similar end-to-end times STARTS had very few safety violations Method-level static RTS requires more work to be usable To conclude, the work that I just presented is the first study of static RTS and its comparison with dynamic RTS. We found class-level static RTS to be comparable with class-level dynamic RTS in terms of end-to-end time, while incurring very few safety violations. However, at the method level, static RTS performed poorly and requires more work to be usable. We are now working on making static RTS safe with respect to reflection. If you have any questions, feel free to email me at this address. I’ll be happy to take questions now. legunse2@illinois.edu

STARTS Projects that you can do. (1) Estimate test-running time for “mvn starts:select” (https://github.com/TestingResearchIllinois/starts/issues/11). (2) Fix integration tests on Windows and add more (>40) integration tests (https://github.com/TestingResearchIllinois/starts/issues/12). (3) Make STARTS analysis incremental and faster. Currently, on each revision, STARTS parses the entire application code to find dependencies, builds the dependency graph from scratch, and finds dependencies for all tests.

STARTS Projects that you can do (2) Implement Maven goals in STARTS for online mode and offline mode and evaluate on many projects Evaluate STARTS vs. Ekstazi on at least 10 projects where tests take at least 1 hour to run Make STARTS work on Travis and shadow many projects Add STARTS support for other Build Systems: Gradle, Bazel