Prioritizing Test Cases for Regression Testing
Article by: Rothermel et al.
Presentation by: Martin, Otto, and Prashanth
Test case prioritization techniques schedule test cases for execution in an order that attempts to increase their effectiveness at meeting some performance goal.
One goal is the rate of fault detection, a measure of how quickly faults are detected within the testing process.
An improved rate of fault detection during testing can provide faster feedback on the system under test and let software engineers begin correcting faults earlier than might otherwise be possible.
One application of prioritization techniques involves regression testing.
This paper describes several techniques for using test execution information to prioritize test cases for regression testing, including:
1) techniques that order test cases based on their total coverage of code components,
2) techniques that order test cases based on their coverage of code components not previously covered, and
3) techniques that order test cases based on their estimated ability to reveal faults in the code components that they cover.
When the time required to re-execute an entire test suite is short, test case prioritization may not be cost-effective; it may be sufficient simply to schedule test cases in any order.
When the time required to execute an entire test suite is sufficiently long, however, test case prioritization may be beneficial because, in this case, meeting testing goals earlier can yield meaningful benefits.
In general test case prioritization, given program P and test suite T, we prioritize the test cases in T with the intent of finding an ordering of test cases that will be useful over a succession of subsequent modified versions of P.
In the case of regression testing, prioritization techniques can use information gathered in previous runs of existing test cases to help prioritize the test cases for subsequent runs.
This paper considers 9 different test case prioritization techniques.
The first three techniques serve as experimental controls.
The last six techniques represent heuristics that could be implemented using software tools.
A source of motivation for these approaches is the conjecture that the availability of test execution data can be an asset.
This assumes that past test execution data can be used to predict, with sufficient accuracy, subsequent execution behavior.
Definition 1. The Test Case Prioritization Problem:
Given: T, a test suite, PT, the set of permutations of T, and f, a function from PT to the real numbers.
PT represents the set of all possible prioritizations (orderings) of T.
f is a function that, applied to any such ordering, yields an award value for that ordering.
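The slide lists the inputs of the definition but not the problem statement itself. Under the usual reading of such a definition, and assuming that a higher award value is better, the goal can be written as:

```latex
\text{Find } T' \in PT \text{ such that } \forall\, T'' \in PT,\; T'' \neq T' :\; f(T') \geq f(T'').
```

Here T' and T'' are symbols introduced for illustration; they denote candidate orderings of T.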
A challenge: care must be taken to keep the cost of performing the prioritization from excessively delaying the very regression testing activities it is intended to facilitate.
M3: Optimal prioritization.
Given program P and a set of known faults for P, if we can determine, for test suite T, which test cases in T expose which faults in P, then we can determine an optimal ordering of the test cases in T for maximizing T's rate of fault detection for that set of faults.
This is not a practical technique, as it requires a priori knowledge of the existence of faults and of which test cases expose which faults.
However, by using this technique in the empirical studies, we can gain insight into the success of other practical heuristics by comparing their solutions to optimal solutions.
M4: Total statement coverage prioritization.
By instrumenting a program, we can determine, for any test case, which statements in that program were exercised (covered) by that test case.
We can then prioritize test cases in terms of the total number of statements they cover by counting the number of statements covered by each test case and then sorting the test cases in descending order of that number.
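The count-and-sort step can be sketched in a few lines of Python. This is only an illustration: the function name `total_coverage_order`, the dictionary layout, and the sample coverage data are assumptions, not taken from the paper or its tools.

```python
def total_coverage_order(coverage):
    """Order test ids by how many statements each one covers, most first.

    `coverage` maps a test id to the set of statement ids it executed
    (hypothetical data; in practice it would come from instrumentation).
    """
    # sorted() is stable, so tests with equal counts keep their input order.
    return sorted(coverage, key=lambda t: len(coverage[t]), reverse=True)


# Made-up coverage data for four test cases.
coverage = {
    "t1": {1, 2, 3, 4},
    "t2": {1, 2, 3, 4, 5},
    "t3": {6, 7},
    "t4": {5, 6},
}
order = total_coverage_order(coverage)
print(order)
```

Ties are broken by input order here; the slides do not say how ties should be resolved, so that choice is an assumption.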
M5: Additional statement coverage prioritization.
Total statement coverage prioritization schedules test cases in the order of total coverage achieved; however, having executed a test case and covered certain statements, more may be gained in subsequent testing by executing statements that have not yet been covered.
Additional statement coverage prioritization iteratively selects a test case that yields the greatest statement coverage, then adjusts the coverage information on all remaining test cases to indicate their coverage of statements not yet covered, and repeats this process until all statements covered by at least one test case have been covered.
We may reach a point where each statement has been covered by at least one test case, and the remaining unprioritized test cases cannot add additional statement coverage. We could order these remaining test cases using any prioritization technique.
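A minimal greedy sketch of this loop follows. The names and sample data are hypothetical, and the leftover test cases are ordered by total coverage, which is just one of the "any prioritization technique" options mentioned above.

```python
def additional_coverage_order(coverage):
    """Greedy ordering by statements not yet covered by earlier picks.

    `coverage` maps a test id to the set of statement ids it executed
    (hypothetical data layout).
    """
    remaining = dict(coverage)
    covered = set()
    order = []
    while remaining:
        # Pick the test that adds the most not-yet-covered statements.
        best = max(remaining, key=lambda t: len(remaining[t] - covered))
        if not remaining[best] - covered:
            break  # no remaining test adds new coverage
        order.append(best)
        covered |= remaining.pop(best)
    # Assumption: order the leftovers by total coverage, most first.
    order += sorted(remaining, key=lambda t: len(remaining[t]), reverse=True)
    return order


coverage = {
    "t1": {1, 2, 3, 4},
    "t2": {1, 2, 3, 4, 5},
    "t3": {6, 7},
    "t4": {5, 6},
}
order = additional_coverage_order(coverage)
print(order)
```

With this data, t3 moves ahead of t1 because, once t2 has run, t1 adds nothing new while t3 still covers two fresh statements.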
M6: Total branch coverage prioritization.
Total branch coverage prioritization is the same as total statement coverage prioritization, except that it uses test coverage measured in terms of program branches rather than statements.
In this context, we define branch coverage as coverage of each possible overall outcome of a (possibly compound) condition in a predicate. Thus, for example, each if or while statement must be exercised such that it evaluates at least once to true and at least once to false.
M7: Additional branch coverage prioritization.
Additional branch coverage prioritization is the same as additional statement coverage prioritization, except that it uses test coverage measured in terms of program branches rather than statements.
After complete coverage has been achieved, the remaining test cases are prioritized by resetting coverage vectors to their initial values and reapplying additional branch coverage prioritization to the remaining test cases.
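The reset-and-reapply step might look like the sketch below. Branch outcomes are represented as strings such as "b1T" (branch 1 evaluated to true); this encoding, the function name, and the data are assumptions made for illustration.

```python
def additional_with_reset(coverage):
    """Additional-coverage ordering that resets once no test adds coverage.

    `coverage` maps a test id to the set of branch outcomes it exercised
    (hypothetical encoding, e.g. "b1T" for branch 1 taken as true).
    """
    remaining = dict(coverage)
    covered = set()
    order = []
    while remaining:
        best = max(remaining, key=lambda t: len(remaining[t] - covered))
        if not remaining[best] - covered:
            if not covered:
                # Remaining tests exercise no branches at all: keep input order.
                order.extend(remaining)
                break
            covered = set()  # reset coverage and reapply to the remaining tests
            continue
        order.append(best)
        covered |= remaining.pop(best)
    return order


branches = {
    "t1": {"b1T", "b1F"},
    "t2": {"b1T", "b2T", "b2F"},
    "t3": {"b1F"},
    "t4": {"b2T"},
}
order = additional_with_reset(branches)
print(order)
```

After t2 and t1, every branch outcome is covered, so the coverage set is cleared and t3 and t4 are ranked against each other from scratch.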
M8: Total fault-exposing-potential (FEP) prioritization.
Some faults are more easily exposed than other faults, and some test cases are more adept at revealing particular faults than other test cases.
The ability of a test case to expose a fault (that test case's fault-exposing potential, or FEP) depends not only on whether the test case covers (executes) a faulty statement, but also on the probability that a fault in that statement will cause a failure for that test case.
Three probabilities that could be used in determining FEP:
1) the probability that a statement s is executed (execution probability),
2) the probability that a change in s can cause a change in program state (infection probability), and
3) the probability that a change in state propagates to output (propagation probability).
This paper adopts an approach that uses mutation analysis to produce a combined estimate of propagation-and-infection that does not incorporate independent execution probabilities.
Mutation analysis creates a large number of faulty versions (mutants) of a program by altering program statements, and uses these to assess the quality of test suites by measuring whether those test suites can detect those faults ("kill" those mutants).
Given program P and test suite T, we first create a set of mutants N = {n_1, n_2, ..., n_m} for P, noting which statement s_j in P contains each mutant. Next, for each test case t_i in T, we execute each mutant version n_k of P on t_i, noting whether t_i kills that mutant.
Having collected this information for every test case and mutant, we consider each test case t_i and each statement s_j in P, and calculate the fault-exposing potential FEP(s_j, t_i) of t_i on s_j as the ratio of mutants of s_j killed by t_i to the total number of mutants of s_j.
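The ratio described above can be computed directly from a record of which test kills which mutant. The sketch below assumes two hypothetical inputs, a mutant-to-statement map and a per-test set of killed mutants; it does not reflect the actual output format of any mutation tool.

```python
from collections import defaultdict


def fep_table(mutant_stmt, kills):
    """Return {(statement, test): killed mutants of statement / all its mutants}.

    mutant_stmt: mutant id -> id of the statement that contains the mutant
    kills:       test id   -> set of mutant ids that the test kills
    Both layouts are assumptions made for this example.
    """
    per_stmt = defaultdict(set)
    for mutant, stmt in mutant_stmt.items():
        per_stmt[stmt].add(mutant)
    fep = {}
    for test, killed in kills.items():
        for stmt, mutants in per_stmt.items():
            fep[(stmt, test)] = len(killed & mutants) / len(mutants)
    return fep


mutant_stmt = {"n1": "s1", "n2": "s1", "n3": "s2", "n4": "s2", "n5": "s2", "n6": "s2"}
kills = {"t1": {"n1", "n3"}, "t2": {"n1", "n2", "n4", "n5", "n6"}}
fep = fep_table(mutant_stmt, kills)
print(fep[("s1", "t1")], fep[("s2", "t2")])
```

Statements with no mutants never appear in the table, which avoids dividing by zero.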
To perform total FEP prioritization, given these FEP(s_j, t_i) values, we next calculate, for each test case t_i in T, an award value, by summing the FEP(s_j, t_i) values for all statements s_j in P.
Given these award values, we then prioritize test cases by sorting them in order of descending award value.
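As a small illustration of the sum-then-sort step, the sketch below takes a table keyed by (statement, test) pairs. The function name and the FEP numbers are made up for the example.

```python
def total_fep_order(fep):
    """Sum FEP values per test and sort tests by descending award value.

    `fep` maps (statement id, test id) to a value in [0, 1]
    (hypothetical layout and numbers).
    """
    award = {}
    for (_stmt, test), value in fep.items():
        award[test] = award.get(test, 0.0) + value
    order = sorted(award, key=award.get, reverse=True)
    return order, award


fep = {
    ("s1", "t1"): 0.5, ("s2", "t1"): 0.25,
    ("s1", "t2"): 1.0, ("s2", "t2"): 0.75,
    ("s1", "t3"): 0.0, ("s2", "t3"): 1.0,
}
order, award = total_fep_order(fep)
print(order, award)
```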
M9: Additional fault-exposing-potential (FEP) prioritization.
This lets us account for the fact that additional executions of a statement may be less valuable than initial executions.
We require a mechanism for measuring the value of an execution of a statement that can be related to FEP values.
For this, we use the term confidence. We say that the confidence in statement s, C(s), is an estimate of the probability that s is correct.
If we execute a test case t that exercises s and does not reveal a fault in s, C(s) should increase.
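The slides do not give the confidence update rule, so the sketch below assumes one plausible choice: start every C(s) at 0, raise it by (1 - C(s)) * FEP(s, t) after running t, and award each candidate test the total confidence gain it would produce. The rule, the starting value, the names, and the data are all assumptions made for illustration.

```python
def additional_fep_order(fep, statements, tests):
    """Greedy ordering by expected gain in confidence (assumed update rule).

    fep: (statement id, test id) -> value in [0, 1]; missing pairs count as 0.
    """
    confidence = {s: 0.0 for s in statements}  # assumed initial confidence
    remaining = list(tests)
    order = []

    def gain(test):
        # Total confidence increase if `test` were run next.
        return sum((1 - confidence[s]) * fep.get((s, test), 0.0) for s in statements)

    while remaining:
        best = max(remaining, key=gain)
        order.append(best)
        remaining.remove(best)
        for s in statements:
            confidence[s] += (1 - confidence[s]) * fep.get((s, best), 0.0)
    return order


fep = {
    ("s1", "t1"): 1.0, ("s2", "t1"): 0.5,
    ("s1", "t2"): 1.0, ("s2", "t2"): 0.25,
    ("s1", "t3"): 0.0, ("s2", "t3"): 1.0,
}
order = additional_fep_order(fep, ["s1", "s2"], ["t1", "t2", "t3"])
print(order)
```

Under this rule, t3 overtakes t2 after t1 has run: confidence in s1 is already at its maximum, so t2's strength on s1 no longer counts, while t3 still raises confidence in s2.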
Research Questions
Can test case prioritization improve the rate of fault detection in test suites?
How do the various test case prioritization techniques discussed earlier compare to one another in terms of effects on rate of fault detection?
Effectiveness Measures
Use a weighted Average of the Percentage of Faults Detected (APFD)
Ranges from
Higher numbers mean faster detection
Problems with APFD
Doesn't measure cost of prioritization
Cost is normally amortized because test suites are created after the release of a version of the software
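APFD is commonly computed with the closed form 1 - (TF_1 + ... + TF_m) / (n * m) + 1 / (2n), where n is the number of test cases, m the number of faults, and TF_i the position of the first test case that detects fault i. The slides do not state this formula, so treat the sketch below as an assumption about how the measure would be computed; it returns a fraction (multiply by 100 for a percentage) and requires every fault to be detected by at least one test case.

```python
def apfd(order, detects, faults):
    """Closed-form APFD for one ordering, as a fraction.

    order:   list of test ids in execution order
    detects: test id -> set of fault ids the test exposes (hypothetical data)
    faults:  list of fault ids, each detected by at least one test in `order`
    """
    n, m = len(order), len(faults)
    position = {test: i + 1 for i, test in enumerate(order)}  # 1-based positions
    first_hits = 0
    for fault in faults:
        first_hits += min(position[t] for t in order if fault in detects.get(t, set()))
    return 1 - first_hits / (n * m) + 1 / (2 * n)


detects = {"t2": {"f1"}, "t3": {"f1", "f2"}}
faults = ["f1", "f2"]
untreated = apfd(["t1", "t2", "t3", "t4"], detects, faults)
reordered = apfd(["t3", "t2", "t1", "t4"], detects, faults)
print(untreated, reordered)
```

Moving t3 to the front exposes both faults with the first test case, which raises the value from 0.5 to 0.875 in this made-up example.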
Effectiveness Example
Programs used
Aristotle program analysis system for test coverage and control graph information
Proteum mutation system to obtain mutation scores
Used 8 C programs as subjects
The first 7 were created at Siemens; the eighth is a European Space Agency program
Siemens Programs - Description
7 programs used by Siemens in a study that observed the “fault detecting effectiveness of coverage criteria”
Created faulty versions of these programs by manually seeding them with single errors, creating the “number of versions” column
Using single-line faults only allows researchers to determine whether a test case discovers the error or not
For each of the seven programs, a test case suite was created by Siemens. It was first built via a black box method and then completed using white box testing, so that each “executable statement, edge, and definition use pair … was exercised by at least 30 test cases.”
Kept faulty programs whose errors were detectable by between 3 and 350 test cases
Test suites were created by the researchers by random selection until a branch coverage adequate test suite was created
Proteum was used to create mutants of the seven programs
Space Program - Description
33 versions of space with only one fault in each were created by the ESA; 2 more were created by the research team
Initial pool of test cases was obtained from Vokolos and Frankl
Used these as a base and added cases until each statement and edge was exercised by at least 30 test cases
Created a branch coverage adequate test suite in the same way as for the Siemens programs
Also created mutants via Proteum
Empirical Studies and Results
4 different studies using the 8 programs
Siemens programs with APFD measured relative to Siemens faults
Siemens programs with APFD measured relative to mutants
Space with APFD measured relative to actual faults
Space with APFD measured relative to mutants
Siemens programs with APFD measured relative to Siemens faults - Study Format
M2 to M9 were applied to each of the 1000 test suites, resulting in 8000 prioritized test suites
The original 1000 were used as M1
Calculated the APFD relative to the faults provided with the program
Example boxplot
Study 1 - Overall observations
M3 is markedly better than all of the others (as expected)
The test case prioritization techniques appear to offer some improvement, but further statistical analysis was needed to confirm this
That analysis revealed more results:
Branch-based coverage did as well as or better than statement coverage
In all programs except one, total branch coverage did as well as or better than additional branch coverage
In all programs, total statement coverage did as well as or better than additional statement coverage
In 5 of 7 programs, even randomly prioritized test suites did better than untreated test suites
Example Groupings
Siemens programs with APFD measured relative to mutants - Study Format
Same format as the first study: 9000 test suites used, 1000 for each prioritization technique
Rather than running those test cases against the small set of known errors, they were applied to mutated programs, which form a larger bed of faulty versions to test against
Results
Additional and total FEP prioritization outperformed all others (except optimal)
Branch almost always outperformed statement
Total statement coverage outperformed additional statement coverage
But additional branch coverage outperformed total branch coverage
However, in this study random did not outperform the control
Space with APFD measured relative to Actual Faults
M2 to M9 were applied to each of the 50 test suites, resulting in 400 test suites; with the original 50, this gives 450 test suites in total
Additional FEP outperformed all others, but there was no significant difference among the rest
Also, random is no better than the control
Study 3 Groupings
Space with APFD measured relative to mutants
Same technique as the other space study, but using 132,163 mutant versions of the software
Additional FEP outperformed all others
Branch and statement are indistinguishable
But additional coverage always outperforms its total counterpart
Study 4 Groupings
Threats to Validity
Construct Validity - You are measuring what you say you are measuring (and not something else)
Internal Validity - Ability to say that the causal relationship is true
External Validity - Ability to generalize results across the field
Construct Validity
APFD is highly accurate, but it is not the only way to measure fault detection; one could also measure the percentage of the test suite that must be run before all errors are found
APFD assigns no value to later tests that detect the same error
FEP-based calculations: other estimates may more accurately capture the probability of a test case finding a fault
Effectiveness is measured without regard to cost
Internal Validity
Instrumentation effects can bias results, especially in the APFD and prioritization measurement tools
Performed code revision to limit this
Also limited problems by running the prioritization algorithms on each test suite and each subject program
External Validity
The Siemens programs are non-trivial but not representative of real-world programs. The space program is, but it is only one program
Faults in the Siemens programs were seeded (not like those in the real world)
Faults in space were found during development, but these may differ from those found later in the development process. They are also only one set of faults found by one set of programmers
Single-fault program versions are also not representative of the real world
The test suites were created with only a single method; other real-world methods exist
These threats can only be addressed by more studies with different test suites, programs, and errors
Additional Discussion And Practical Implications
Test case prioritization can substantially improve the rate of fault detection of test suites.
Additional FEP prioritization techniques do not always justify the additional expense they incur: in some cases specific coverage-based techniques outperformed them, and in other cases where additional FEP did perform best, the total gain in APFD was not large enough.
Branch-coverage-based techniques almost always performed as well as, if not better than, statement-coverage-based techniques. Thus, if the two techniques incur similar costs, branch-coverage-based techniques are advocated.
Total statement and branch coverage techniques perform almost on par with the additional statement and branch coverage techniques, which justifies their use given their lower complexity.
However, this does not apply to the space program (Study 4), where the additional branch and statement coverage techniques outperformed the total statement and branch coverage techniques by a huge margin.
Randomly prioritized test suites typically outperform untreated test suites.
Conclusion
Any one of the prioritization techniques offers some amount of improved fault detection capability.
These studies are of interest only to research groups, due to the high expense that they incur. However, code-coverage-based techniques have immediate practical implications.
Future Work
Additional studies to be performed using a wider range of programs, faults, and test suites.
The gap between optimal prioritization and FEP prioritization techniques is yet to be bridged.
Determining which prioritization technique is warranted by particular types of programs and test suites.
Other prioritization objectives have to be investigated.
Version-specific techniques.
Techniques may be applied not only to regression testing but also during the initial testing of the software.