1
Metamorphic Testing Techniques to Detect Defects in Applications without Test Oracles Christian Murphy Thesis Defense April 12, 2010
2
2 Overview Software testing is important! Certain types of applications are particularly hard to test because there is no “test oracle” Machine Learning, Discrete Event Simulation, Optimization, Scientific Computing, etc. Even when there is no oracle, it is possible to detect defects if properties of the software are violated My research introduces and evaluates new techniques for testing such “non-testable programs” [Weyuker, Computer Journal’82]
3
3 Motivating Example: Machine Learning
4
4 Motivating Example: Simulation
5
5 Problem Statement Partial oracles may exist for a limited subset of the input domain in applications such as Machine Learning, Discrete Event Simulation, Scientific Computing, Optimization, etc. Obvious errors (e.g., crashes) can be detected with certain inputs or testing techniques However, it is difficult to detect subtle computational defects in applications without test oracles in the general case
6
6 What do I mean by “defect”? Deviation of the implementation from the specification Violation of a sound property of the software “Discrete localized” calculation errors Off-by-one Incorrect sentinel values for loops Wrong comparison or mathematical operator Misinterpretation of specification Parts of input domain not handled Incorrect assumptions made about input
7
7 Observation Many programs without oracles have properties such that certain changes to the input yield predictable changes to the output We can detect defects in these programs by looking for any violations of these “metamorphic properties” This is known as “metamorphic testing” [T.Y. Chen et al., Info. & Soft. Tech. vol.44, 2002]
8
8 Research Goals Facilitate the way that metamorphic testing is used in practice Develop new testing techniques based on metamorphic testing Demonstrate the effectiveness of metamorphic testing techniques
9
9 Hypotheses For programs that do not have a test oracle, an automated approach to metamorphic testing is more effective at detecting defects than other approaches An approach that conducts function-level metamorphic testing in the context of a running application will further increase the effectiveness It is feasible to continue this type of testing in the deployment environment, with minimal impact on the end user
10
10 Contributions 1. A set of guidelines to help identify metamorphic properties 2. New empirical studies comparing the effectiveness of metamorphic testing to other approaches 3. An approach for detecting defects in non-deterministic applications called Heuristic Metamorphic Testing 4. A new testing technique called Metamorphic Runtime Checking based on function-level metamorphic properties 5. A generalized technique for testing in the deployment environment called In Vivo Testing
11
11 Outline Background Related Work Metamorphic Testing Metamorphic Testing Empirical Studies Metamorphic Runtime Checking Future Work & Conclusion
12
12 Other Approaches [Baresi & Young, 2001] Formal specifications A complete specification is essentially a test oracle Embedded assertions Can check that the software behaves as expected Algebraic properties Used to generate test cases for abstract datatypes Trace checking & Log file analysis Analyze intermediate results and sequence of executions
13
13 Metamorphic Testing [Chen et al., 2002] Run the initial test case x through f to get f(x); apply a transformation t, based on a metamorphic property of f, to obtain a new test case t(x); then run f on t(x) to get f(t(x)). If the new test case output f(t(x)) is as expected, it is not necessarily correct. However, if f(t(x)) is not as expected, either f(x) or f(t(x)) – or both! – is wrong. The outputs f(x) and f(t(x)) act as “pseudo-oracles” for each other.
14
14 Metamorphic Testing Example Consider a function to determine the standard deviation of a set of numbers. The initial input (a, b, c, d, e, f) produces output s. New test case #1: permuting the input to (c, e, b, a, f, d) should still produce s. New test case #2: adding 2 to each element, giving (a+2, b+2, c+2, d+2, e+2, f+2), should still produce s. New test case #3: multiplying each element by 2, giving (2a, 2b, 2c, 2d, 2e, 2f), should produce 2s.
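To make the example concrete, here is a minimal, self-contained Java sketch (my own illustration, not from the thesis) that checks the three properties above against a straightforward standard-deviation implementation; the implementation, the input values, and the tolerance are assumptions.

import java.util.Arrays;

public class StdDevMetamorphicTest {

    // Plain two-pass (population) standard deviation, used here as the function under test.
    static double stdDev(double[] a) {
        double mean = Arrays.stream(a).average().orElse(0.0);
        double sumSq = Arrays.stream(a).map(x -> (x - mean) * (x - mean)).sum();
        return Math.sqrt(sumSq / a.length);
    }

    static void check(String property, double expected, double actual) {
        if (Math.abs(expected - actual) > 1e-9) {
            System.out.println(property + " VIOLATED: expected " + expected + " but got " + actual);
        } else {
            System.out.println(property + " held");
        }
    }

    public static void main(String[] args) {
        double[] x = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0};
        double s = stdDev(x);

        // New test case #1 (permutative): shuffling the elements should not change the result.
        double[] permuted = {3.0, 5.0, 2.0, 1.0, 6.0, 4.0};
        check("permutation property", s, stdDev(permuted));

        // New test case #2 (additive): adding 2 to every element should not change the result.
        double[] shifted = Arrays.stream(x).map(v -> v + 2.0).toArray();
        check("additive property", s, stdDev(shifted));

        // New test case #3 (multiplicative): doubling every element should double the result.
        double[] doubled = Arrays.stream(x).map(v -> v * 2.0).toArray();
        check("multiplicative property", 2.0 * s, stdDev(doubled));
    }
}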
15
15 Outline
16
16 Empirical Study Is metamorphic testing more effective than other approaches in detecting defects in applications without test oracles? Approaches investigated Metamorphic Testing Using metamorphic properties of the entire application Runtime Assertion Checking Using Daikon-detected program invariants Partial Oracle Simple inputs for which correct output can easily be determined
17
17 Applications Investigated Machine Learning C4.5: decision tree classifier MartiRank: ranking Support Vector Machines (SVM): vector-based classifier PAYL: anomaly-based intrusion detection system Discrete Event Simulation JSim: used in simulating hospital ER Information Retrieval Lucene: Apache framework’s text search engine Optimization gaffitter: genetic algorithm approach to bin-packing problem
18
18 Methodology Mutation testing was used to seed defects into each application Comparison operators were reversed Math operators were changed Off-by-one errors were introduced For each program, we created multiple versions, each with exactly one mutation We ignored mutants that yielded outputs that were obviously wrong, caused crashes, etc. Effectiveness is determined by measuring what percentage of the mutants were “killed”
19
19 Experimental Results
20
20 Analysis of Results Assertions are good for checking bounds and relationships, but not for detecting changes to values Metamorphic testing is particularly good for detecting errors in loop conditions Metamorphic testing was not very effective for PAYL (5%) and gaffitter (33%): fewer properties were identified, and the defects had little impact on the output
21
21 Outline
22
22 Metamorphic Runtime Checking Results of previous study revealed limitations of scope and robustness in metamorphic testing What if we consider the metamorphic properties of individual functions and check those properties as the entire program is running? A combination of metamorphic testing and runtime assertion checking
23
23 Metamorphic Runtime Checking Tester specifies the metamorphic properties of individual functions using a special notation in the code (based on JML) Pre-processor instruments code with corresponding metamorphic tests Tester runs entire program as normal (e.g., to perform system tests) Violation of any property reveals a defect
24
24 MRC Model of Execution Function f is about to be executed with input x in state S. The framework creates a sandbox for the test, executes f(x) to get the result, sends the result to the test, and the program continues. Inside the sandbox, the metamorphic test transforms the input to get t(x), executes f(t(x)), compares the outputs, and reports any violations. The metamorphic test is conducted at the same point in the program execution as the original function call, and it runs in parallel with the rest of the application.
25
25 Empirical Study Can Metamorphic Runtime Checking detect defects not found by system-level metamorphic testing? Same mutants used in previous study 29% were not found by metamorphic testing Metamorphic properties identified at function level using suggested guidelines
26
26 Experimental Results
27
27 Analysis of Results Scope: Function-level testing allowed us to: identify additional metamorphic properties execute more tests Robustness: Metamorphic testing “inside” the application detected subtle defects that did not have much effect on the overall program output
28
28 Combined Results
29
29 Outline
30
30 Results Demonstrated that metamorphic testing advances the state of the art in detecting defects in applications without test oracles Proved that Metamorphic Runtime Checking will reveal defects not found by using system-level properties Showed that it is feasible to continue this type of testing in the deployment environment, with minimal impact on the end user
31
31 Short-Term Opportunities Automatic detection of metamorphic properties Using dynamic and/or static techniques Fault localization Once a defect has been detected, figure out where it occurred and how to fix it Implementation issues Reducing overhead Handling external databases, network traffic, etc.
32
32 Long-Term Directions Testing of multi-process or distributed applications in these domains Collaborative defect detection and notification Investigate the impact on the software development processes used in the domains of non-testable programs
33
33 Contributions & Accomplishments 1. A set of metamorphic testing guidelines [Murphy, Kaiser, Hu, Wu; SEKE’08] 2. New empirical studies [Xie, Ho, Murphy, Kaiser, Xu, Chen; QSIC’09] 3. Heuristic Metamorphic Testing [Murphy, Shen, Kaiser; ISSTA’09] 4. Metamorphic Runtime Checking [Murphy, Shen, Kaiser; ICST’09] 5. In Vivo Testing [Murphy, Kaiser, Vo, Chu; ICST’09] [Murphy, Vaughan, Ilahi, Kaiser; AST’10]
34
34 Thank you!
35
35 Motivation Backup Slides!
36
36 Assessment of Quality 1994: Hatton et al. pointed out a “disturbing” number of defects due to calculation errors in scientific computing software [TSE vol.20] 2007: Hatton reports that “many scientific results are corrupted, perhaps fatally so, by undiscovered mistakes in the software used to calculate and present those results” [Computer vol.40]
37
37 Complexity vs. Effectiveness A chart positions the approaches by complexity and effectiveness: Embedded Assertions, Algebraic Specifications, Formal Specifications, Trace Checking & Log Analysis, System-level Metamorphic Testing, and Metamorphic Runtime Checking.
38
38 Motivation Metamorphic Properties
39
39 Categories of Metamorphic Properties Additive: Increase (or decrease) numerical values by a constant Multiplicative: Multiply numerical values by a constant Permutative: Randomly permute the order of elements in a set Invertive: Negate the elements in a set Inclusive: Add a new element to a set Exclusive: Remove an element from a set Compositional: Compose a set
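A minimal sketch (my illustration, not the thesis framework) of what several of these categories look like as input transformations over a numeric data set; the class and method names are assumptions.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class Transforms {
    // Additive: increase every value by a constant k.
    static List<Double> add(List<Double> in, double k) {
        List<Double> out = new ArrayList<>();
        for (double v : in) out.add(v + k);
        return out;
    }

    // Multiplicative: multiply every value by a constant k.
    static List<Double> multiply(List<Double> in, double k) {
        List<Double> out = new ArrayList<>();
        for (double v : in) out.add(v * k);
        return out;
    }

    // Permutative: randomly permute the order of the elements.
    static List<Double> permute(List<Double> in) {
        List<Double> out = new ArrayList<>(in);
        Collections.shuffle(out);
        return out;
    }

    // Invertive: negate every element.
    static List<Double> negate(List<Double> in) {
        return multiply(in, -1.0);
    }

    // Inclusive: add a new element to the set.
    static List<Double> include(List<Double> in, double v) {
        List<Double> out = new ArrayList<>(in);
        out.add(v);
        return out;
    }

    // Exclusive: remove an element from the set.
    static List<Double> exclude(List<Double> in, int index) {
        List<Double> out = new ArrayList<>(in);
        out.remove(index);
        return out;
    }
}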
40
40 Sample Metamorphic Properties 1. Permuting the order of the examples in the training data should not affect the model 2. If all attribute values in the training data are multiplied by a positive constant, the model should stay the same 3. If all attribute values in the training data are increased by a positive constant, the model should stay the same 4. Updating a model with a new example should yield the same model created with training data originally containing that example 5. If all attribute values in the training data are multiplied by -1, and an example to be classified is also multiplied by -1, the classification should be the same 6. Permuting the order of the examples in the testing data should not affect their classification 7. If all attribute values in the training data are multiplied by a positive constant, and an example to be classified is also multiplied by the same positive constant, the classification should be the same 8. If all attribute values in the training data are increased by a positive constant, and an example to be classified is also increased by the same positive constant, the classification should be the same
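For instance, property #1 could be checked against a real learner. The sketch below (my own, not part of the thesis) uses WEKA's J48 decision tree; the data file name, the use of toString() to compare models, and the exact-equality comparison are assumptions, and in practice a more tolerant model comparison may be needed.

import java.util.Random;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class PermutationPropertyCheck {
    public static void main(String[] args) throws Exception {
        // Load a training set (file name chosen for illustration).
        Instances data = DataSource.read("iris.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Train on the original order.
        J48 original = new J48();
        original.buildClassifier(new Instances(data));

        // Metamorphic property #1: permuting the training examples should not affect the model.
        Instances permuted = new Instances(data);
        permuted.randomize(new Random(42));
        J48 reordered = new J48();
        reordered.buildClassifier(permuted);

        // The textual form of the tree is used here as a crude model comparison.
        if (original.toString().equals(reordered.toString())) {
            System.out.println("Property held: same model after permutation.");
        } else {
            System.out.println("Property violated: models differ after permutation.");
        }
    }
}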
41
41 Other Classes of Properties (1) Statistical Same mean, variance, etc. as the original Heuristic Approximately equal to the original Semantically Equivalent Domain specific
42
42 Other Classes of Properties (2) Noise Based Add/Change data that should not affect result Partial Change to part of input only affects part of output Compositional New input relies on original output, e.g. ShortestPath(a, b) = ShortestPath(a, c) + ShortestPath(c, b) for a vertex c taken from the originally computed shortest path
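A minimal sketch (my own illustration) of checking the compositional property against a simple breadth-first shortest-path routine on an unweighted graph; the graph, the endpoints, and the choice of c from the original output are assumptions.

import java.util.*;

public class ShortestPathPropertyCheck {

    // Breadth-first search on an unweighted graph; returns the shortest path from s to t
    // as a list of vertices, or null if t is unreachable.
    static List<Integer> shortestPath(List<List<Integer>> adj, int s, int t) {
        int[] parent = new int[adj.size()];
        Arrays.fill(parent, -2);          // -2 means "not visited yet"
        parent[s] = -1;
        Deque<Integer> queue = new ArrayDeque<>();
        queue.add(s);
        while (!queue.isEmpty()) {
            int u = queue.poll();
            if (u == t) break;
            for (int v : adj.get(u)) {
                if (parent[v] == -2) { parent[v] = u; queue.add(v); }
            }
        }
        if (parent[t] == -2) return null;
        LinkedList<Integer> path = new LinkedList<>();
        for (int v = t; v != -1; v = parent[v]) path.addFirst(v);
        return path;
    }

    public static void main(String[] args) {
        // Small example graph: 0-1, 0-2, 1-2, 2-3, 3-4.
        List<List<Integer>> adj = Arrays.asList(
            Arrays.asList(1, 2), Arrays.asList(0, 2), Arrays.asList(0, 1, 3),
            Arrays.asList(2, 4), Arrays.asList(3));

        List<Integer> path = shortestPath(adj, 0, 4);
        int c = path.get(path.size() / 2);   // a vertex taken from the original output

        int lenAB = path.size() - 1;
        int lenAC = shortestPath(adj, 0, c).size() - 1;
        int lenCB = shortestPath(adj, c, 4).size() - 1;

        // Compositional property: the two follow-up executions should compose to the original.
        if (lenAB == lenAC + lenCB) {
            System.out.println("Compositional property held.");
        } else {
            System.out.println("Compositional property violated.");
        }
    }
}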
43
43 Automatic Detection of Properties Static Use machine learning to model what code looks like that exhibits certain properties, then determine whether other code matches that model Use symbolic execution to check “algebraically” Dynamic Observe multiple executions and infer properties
44
44 Motivation Automated Metamorphic Testing
45
45 Automated Metamorphic Testing Tester specifies the application’s metamorphic properties Test framework does the rest: Transform inputs Execute program with each input Compare outputs according to specification
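A minimal sketch of what such a framework does for one property, assuming a hypothetical command-line program that reads an input file and writes its result to standard output; the program name, the file names, the permutative property, and the exact-equality comparison are all assumptions.

import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class SystemLevelMetamorphicTest {

    // Run the external program on the given input file and capture its standard output.
    static String run(String inputFile) throws IOException, InterruptedException {
        Process p = new ProcessBuilder("./rank_app", inputFile)   // program name is an assumption
                .redirectErrorStream(true)
                .start();
        String output = new String(p.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        p.waitFor();
        return output;
    }

    public static void main(String[] args) throws Exception {
        // 1. Execute the program with the original input.
        String original = run("input.csv");

        // 2. Execute it with an input transformed according to a metamorphic property
        //    (here: a pre-built permuted copy of the same data set).
        String followUp = run("input_permuted.csv");

        // 3. Compare the outputs according to the property's specification.
        //    For a permutative property on this hypothetical program, outputs should match exactly.
        if (original.equals(followUp)) {
            System.out.println("Metamorphic property held.");
        } else {
            System.out.println("Metamorphic property violated: possible defect.");
        }
    }
}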
46
46 AMST Model
47
47 Specifying Metamorphic Properties
48
48 Motivation Heuristic Metamorphic Testing
49
49 Statistical Metamorphic Testing Introduced by Guderlei & Mayer in 2007 The application is run multiple times with the same input to get a mean value μ₀ and variance σ₀ Metamorphic properties are applied The application is run multiple times with the new input to get a mean value μ₁ and variance σ₁ If the means are not statistically similar, then the property is considered violated
50
50 Heuristic Metamorphic Testing When we expect that a change to the input will produce “similar” results, but cannot determine the expected similarity in advance Use input X to generate outputs M₁ through Mₖ Use some metric to create a profile of the outputs Use input X’ (created according to a metamorphic property) to generate outputs N₁ through Nₖ Create a profile of those outputs Use statistical techniques (e.g. Student t-test) to check that the profile of outputs N is similar to that of outputs M
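A minimal sketch of the statistical comparison step, assuming the non-deterministic program has already been run k times on the original input and k times on the transformed input, and that each run has been reduced to a single numeric metric; the metric values, the use of Apache Commons Math's two-sample t-test, and the 0.05 threshold are illustrative choices, not the thesis implementation.

import org.apache.commons.math3.stat.inference.TTest;

public class HeuristicComparison {
    public static void main(String[] args) {
        // Profiles of k runs on the original input X and k runs on the transformed input X'
        // (the values here are made up for illustration).
        double[] profileOriginal = {0.81, 0.79, 0.83, 0.80, 0.82, 0.78, 0.81, 0.80};
        double[] profileFollowUp = {0.80, 0.82, 0.79, 0.81, 0.83, 0.80, 0.78, 0.82};

        // Two-sample t-test: a small p-value means the two profiles are unlikely
        // to come from the same distribution, i.e. the property appears violated.
        double pValue = new TTest().tTest(profileOriginal, profileFollowUp);
        if (pValue < 0.05) {
            System.out.println("Profiles differ significantly: property violated (p = " + pValue + ")");
        } else {
            System.out.println("Profiles are statistically similar: property held (p = " + pValue + ")");
        }
    }
}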
51
51 Heuristic Metamorphic Testing The non-deterministic function nd_f is run n times on input x to produce outputs y₁…yₙ, and n times on the transformed input t(x) to produce outputs y′₁…y′ₙ. A profile is built from y₁…yₙ and another from y′₁…y′ₙ, and the question is: do the two profiles demonstrate the expected relationship?
52
52 HMT Example Sorting a data set that contains unknown values (“?”), e.g. the input (2, ?, 1, ?, 4, 3), is non-deterministic: repeated runs may yield (1, ?, 2, 3, 4, ?), (1, 2, 3, ?, 4, ?), (?, 1, ?, 2, 3, 4), and so on. Build a profile P of these outputs based on normalized equivalence. Then permute the input, e.g. to (4, 1, ?, 3, 2, ?), sort it repeatedly to get outputs such as (1, ?, ?, 2, 3, 4) and (1, 2, ?, 3, 4, ?), build a profile P′ in the same way, and compare it statistically to the first profile.
53
53 HMT Empirical Study Is Heuristic Metamorphic Testing more effective than other approaches in detecting defects in non-deterministic applications without test oracles? Approaches investigated Heuristic Metamorphic Testing Embedded Assertions Partial Oracle Applications investigated MartiRank: sorting sparse data sets JSim: non-deterministic event timing
54
54 HMT Study Results & Analysis Heuristic Metamorphic Testing killed 59 of the 78 mutants Partial oracle and assertion checking ineffective for JSim because no single execution was outside the specified range
55
55 Motivation Metamorphic Runtime Checking
56
56 Extensions to JML
57
57 Creating Test Functions

/*@
  @meta std_dev(\multiply(A, 2)) == \result * 2
 */
public double __std_dev(double[] A) { ... }

// Generated by the pre-processor from the @meta annotation above
protected boolean __MRCtest0_std_dev(double[] A, double result) {
    return Columbus.approximatelyEqualTo(
        __std_dev(Columbus.multiply(A, 2)), result * 2);
}
58
58 Instrumentation

public double std_dev(double[] A) {
    // call original function and save result
    double result = __std_dev(A);
    // create sandbox
    int pid = Columbus.createSandbox();
    // program continues as normal in the parent process
    if (pid != 0)
        return result;
    else {
        // run test in child process
        if (!__MRCtest0_std_dev(A, result))
            Columbus.fail();  // handle failure
        Columbus.exit();      // clean up
    }
    return result;  // not reached: the child process exits above
}
59
59 MRC: Case Studies We investigated the WEKA and RapidMiner toolkits for Machine Learning in Java For WEKA, we tested four apps: Naïve Bayes, Support Vector Machines (SVM), C4.5 Decision Tree, and k-Nearest Neighbors For RapidMiner, we tested one app: Naïve Bayes
60
60 MRC: Case Study Setup For each of the five apps, we specified 4-6 metamorphic properties of selected methods (based on our knowledge of the expected behavior of the overall application) Testing was conducted using data sets from UCI Machine Learning Repository Goal was to determine whether the properties held as expected
61
61 MRC: Case Study Findings Discovered defects in WEKA k-NN and WEKA Naïve Bayes related to modifying the machine learning “model” This was the result of a variable not being updated appropriately Discovered a defect in RapidMiner Naïve Bayes related to determining confidence There was an error in the calculation
62
62 Motivation Metamorphic Testing Experimental Study
63
63 Approaches Not Investigated Formal specification Issues related to completeness Prev. work converted specifications to invariants Algebraic properties Not appropriate at system-level Automatic detection only supported in Java Log/trace file analysis Need more detailed knowledge of implementation Pseudo-oracles None appropriate for applications investigated
64
64 Methodology: Metamorphic Testing Each variant (containing one mutation) acted as a pseudo-oracle for itself: Program was run to produce an output with the original input dataset Metamorphic properties applied to create new input datasets Program run on new inputs to create new outputs If outputs not as expected, the mutant had been killed (i.e. the defect had been detected)
65
65 Methodology: Partial Oracle Data sets were chosen so that the correct output could be calculated by hand These data sets were typically smaller than the ones used for other approaches To ensure fairness, the data sets were selected so that the line coverage was approximately the same for each approach
66
66 Methodology: Runtime Assertion Checking Daikon was used to detect program invariants in the “gold standard” implementation Because Daikon can generate spurious invariants, programs were run with a variety of inputs, and obvious spurious invariants were discarded Invariants then checked at runtime
67
67 Defects Detected in Study #1
68
68 Study #1: SVM Results Permuting the input was very effective at killing off-by-one mutants Many functions in SVM analyze a set of numbers (mean, standard dev, etc.) Off-by-one mutants caused some element of the set to be omitted By permuting, a different number would be omitted This revealed the defect
69
69 Study #1: SVM Example Permuting the input reveals this defect because both m_I1 and m_I4 will be different The partial oracle does not reveal it, because only one element is omitted, so one of the two values remains the same; for small data sets, this did not affect the overall result
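To illustrate why permutation catches this class of defect, here is a small self-contained sketch (my own, not from the study) of a mean computation with an off-by-one bug that skips the last element; a partial-oracle check on a hand-picked input can miss it, while the permutative property exposes it. The input values and tolerance are assumptions.

public class OffByOneDemo {

    // Buggy mean: the loop stops one element early (i < a.length - 1).
    static double buggyMean(double[] a) {
        double sum = 0.0;
        for (int i = 0; i < a.length - 1; i++) {   // off-by-one defect
            sum += a[i];
        }
        return sum / (a.length - 1);
    }

    public static void main(String[] args) {
        // Hand-picked "partial oracle" input where the omitted element happens not to matter:
        // the last value equals the mean of the rest, so the buggy result is still 2.0.
        double[] oracleInput = {1.0, 3.0, 2.0};
        System.out.println("partial oracle check: " + buggyMean(oracleInput) + " (expected 2.0)");

        // Permutative metamorphic property: the mean should be unchanged by reordering.
        double[] x = {1.0, 3.0, 2.0};
        double[] permuted = {2.0, 1.0, 3.0};   // a different element now gets omitted
        double m1 = buggyMean(x);
        double m2 = buggyMean(permuted);
        if (Math.abs(m1 - m2) > 1e-9) {
            System.out.println("Permutation property violated: " + m1 + " vs " + m2);
        }
    }
}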
70
70 Study #1: C4.5 Results Negating the input was very effective C4.5 creates a decision tree in which nodes contain clauses like “if attr_n > α then class = C” If the data set is negated, those nodes should change to “if attr_n ≤ -α then class = C”, i.e. both the operator and the sign of α should change In most cases, only one of the two changes occurred
71
71 Study #1: C4.5 Example Mutant causes ClassFreq to have negative values, violating assertion Permuting the order of elements does not affect the output in this case
72
72 Study #1: MartiRank Results Permuting and negating were effective at killing comparison operator mutants MartiRank depends heavily on sorting Permuting and negating change which numbers get sorted and what the result should be, thus inducing the differences in the final sorted list
73
73 Study #1: Effectiveness of Properties
74
74 Study #1: Lucene Results Most mutants gave a non-zero score to the term “foo”, thus L3 detected the defect
75
75 Study #1: gaffitter Results G1: increasing the number of generations should increase the overall quality G2: multiplying item and bin sizes by a constant should not affect the solution Most of the defects killed by G1 were related to incorrectly selecting candidate solutions
76
76 Empirical Studies: Threats to Validity Representativeness of selected programs Types of defects Data sets Daikon-generated program invariants Selection of metamorphic properties
77
77 Motivation Metamorphic Runtime Checking Experimental Study
78
78 Study #2 Results If we only consider functions for which metamorphic properties were identified, there were 189 total mutants MRC detected 96.3%, compared to 67.7% for system-level metamorphic testing
79
79 Study #2 PAYL Results Both functions call numerous other functions, but we can circumvent restrictions on the input domain Permuting input tends to kill off-by-one mutants
80
80 Study #2 gaffitter Results
81
81 Study #2: gaffitter Example The genetic algorithm takes two sets, e.g. (1, 2, 3, 4, 5) and (6, 7, 8, 9), and “crosses over” at a particular element, producing (1, 2, 3, 9) and (4, 5, 6, 7, 8). Metamorphic property: if we switch the order of the two sets, the new output should be predictable – simply the elements not included in the original cross-over.
82
82 Study #2: gaffitter Example Now consider a defect in which the cross-over happens at the wrong point: crossing over (1, 2, 3, 4, 5) and (6, 7, 8, 9) produces (1, 2, 3, 8, 9) and (6, 7, 8, 3, 4, 5). The metamorphic property is violated: elements 3 and 8 should not appear in both sets.
83
83 Study #2: gaffitter Example For the parents (1, 2, 3, 4, 5) and (6, 7, 8, 9), the correct implementation produces the offspring (1, 2, 3, 9), while the erroneous implementation produces (1, 2, 3, 8). This defect is only detected by system-level metamorphic testing if element 8 has any impact on the “quality” of the final solution. However, a single element is unlikely to do so.
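A small self-contained sketch (my own illustration, not gaffitter's actual code) of a one-point cross-over and a function-level check that the offspring preserve exactly the parents' elements; the defective variant's off-by-one cut is an assumed fault, and the "nothing duplicated, nothing lost" property is my distillation of the relationship described above.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class CrossoverPropertyCheck {

    // Correct one-point cross-over: each parent is split once, and the two
    // offspring together contain every parent element exactly once.
    static List<List<Integer>> crossover(List<Integer> a, List<Integer> b, int cutA, int cutB) {
        List<Integer> child1 = new ArrayList<>(a.subList(0, cutA));
        child1.addAll(b.subList(cutB, b.size()));
        List<Integer> child2 = new ArrayList<>(b.subList(0, cutB));
        child2.addAll(a.subList(cutA, a.size()));
        return List.of(child1, child2);
    }

    // Hypothetical defect: the second offspring cuts parent b at the wrong point,
    // so some elements of b end up in both offspring.
    static List<List<Integer>> buggyCrossover(List<Integer> a, List<Integer> b, int cutA, int cutB) {
        List<Integer> child1 = new ArrayList<>(a.subList(0, cutA));
        child1.addAll(b.subList(cutB, b.size()));
        List<Integer> child2 = new ArrayList<>(b.subList(0, cutB + 1));   // off-by-one on the cut
        child2.addAll(a.subList(cutA, a.size()));
        return List.of(child1, child2);
    }

    // Function-level property: the offspring, taken together, must contain exactly
    // the elements of the two parents -- nothing duplicated, nothing lost.
    static boolean preservesElements(List<Integer> a, List<Integer> b, List<List<Integer>> kids) {
        List<Integer> parents = new ArrayList<>(a);
        parents.addAll(b);
        List<Integer> children = new ArrayList<>(kids.get(0));
        children.addAll(kids.get(1));
        Collections.sort(parents);
        Collections.sort(children);
        return parents.equals(children);
    }

    public static void main(String[] args) {
        List<Integer> a = List.of(1, 2, 3, 4, 5);
        List<Integer> b = List.of(6, 7, 8, 9);
        System.out.println("correct implementation: " + preservesElements(a, b, crossover(a, b, 3, 3)));
        System.out.println("buggy implementation:   " + preservesElements(a, b, buggyCrossover(a, b, 3, 3)));
    }
}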
84
84 Study #2 Lucene Results MRC killed three mutants not killed by MT All three were in the idf function
85
85 Study #2: Lucene Example Search query results are ordered according to a score. The query “ROMEO or JULIET” returns Act 3 Scene 5, Act 2 Scene 4, and Act 5 Scene 1, with scores 5.837, 4.681, and 3.377. Consider a defect in which the scores are off by one: the scores become 6.837, 5.681, and 4.377, but the results stay the same because only the order is important. The partial oracle does not reveal this defect because the scores cannot be calculated in advance.
86
86 Study #2: Lucene Example System-level metamorphic property: changing the query order shouldn’t affect the result. Both “ROMEO or JULIET” and “JULIET or ROMEO” return Act 3 Scene 5, Act 2 Scene 4, and Act 5 Scene 1, with the same (erroneous) scores 6.837, 5.681, and 4.377. Even though the defect exists, the property still holds and the defect is not detected.
87
87 Study #2: Lucene Example The score itself is computed as the result of many subcalculations. Score(q) = ∑ Similarity(f) · Weight(qᵢ) + … + idf(q) + … Metamorphic Runtime Checking can detect that there is an error in this function by checking its individual (mathematical) properties.
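As an illustration of a function-level mathematical property for idf (my own sketch; the formula below resembles Lucene's classic similarity but is an assumption, not Lucene's actual code): idf should strictly decrease as the document frequency of a term increases, with the collection size held fixed.

public class IdfPropertyCheck {

    // Assumed idf formula, similar in shape to Lucene's classic similarity.
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    public static void main(String[] args) {
        int numDocs = 1000;
        // Function-level property: increasing docFreq must never increase idf.
        for (int docFreq = 1; docFreq < numDocs; docFreq++) {
            if (idf(docFreq + 1, numDocs) >= idf(docFreq, numDocs)) {
                System.out.println("Property violated at docFreq = " + docFreq);
                return;
            }
        }
        System.out.println("idf is monotonically decreasing in docFreq, as expected.");
    }
}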
88
88 Motivation In Vivo Testing
89
89 Generalization of MRC In Metamorphic Runtime Checking, the software tests itself Why only run metamorphic tests? Why limit ourselves only to applications without test oracles? Why not allow the software to continue testing itself as it runs in the production environment?
90
90 In Vivo Testing An approach whereby software tests itself in the production environment by running any type of test (unit, integration, “parameterized unit”, etc.) at specified program points Tests are run in a sandbox so as not to affect the original process The Invite implementation adds less than half a millisecond of overhead per test
91
91 Example of Defect: Cache

private int numItems = 0, currSize = 0;  // number of items in the cache, and their size (in bytes)
private int maxCapacity = 1024;          // maximum capacity, in bytes

public int getNumItems() { return numItems; }

public boolean addItem(CacheItem i) throws ... {
    numItems++;   // defect: should only be incremented within the "if" block
    if (currSize + i.size < maxCapacity) {
        add(i);
        currSize += i.size;
        return true;
    } else {
        return false;
    }
}
92
92 Insufficient Unit Test

public void testAddItem() {
    Cache c = new Cache();
    assert(c.addItem(new CacheItem()));
    assert(c.getNumItems() == 1);
    assert(c.addItem(new CacheItem()));
    assert(c.getNumItems() == 2);
}

1. Assumes an empty/new cache
2. Doesn’t take into account various states that the cache can be in
93
93 Defects Targeted 1. Unit tests that make incomplete assumptions about the state of objects in the application 2. Possible field configurations that were not tested in the lab 3. A legal user action that puts the system in an unexpected state 4. A sequence of unanticipated user actions that breaks the system 5. Defects that only appear intermittently
94
94 In Vivo: Model of Execution When a function is about to be executed, the framework decides whether to run a test. If not, the function simply executes. If so, the process forks to create a sandbox: the test runs in the forked process and then stops, while the rest of the program continues and the function executes as normal.
95
95 Writing In Vivo Tests

/* Method to be tested */
public boolean addItem(CacheItem i) { ... }

/* JUnit style test */
public void testAddItem() {
    Cache c = new Cache();
    if (c.addItem(new CacheItem()))
        assert (c.getNumItems() == 1);
}

/* The same test rewritten in In Vivo style: it takes the actual argument,
   runs against the current object in its current state, and returns a boolean verdict */
public boolean testAddItem(CacheItem i) {
    int oldNumItems = getNumItems();
    if (this.addItem(i))
        return (getNumItems() == oldNumItems + 1);
    else
        return true;
}
96
96 Instrumentation

/* Method to be tested */
public boolean __addItem(CacheItem i) { ... }

/* In Vivo style test */
public boolean testAddItem(CacheItem i) { ... }

public boolean addItem(CacheItem i) {
    if (Invite.runTest("Cache.addItem")) {
        Invite.createSandboxAndFork();
        if (Invite.isTestProcess()) {
            if (testAddItem(i) == false)
                Invite.fail();
            else
                Invite.succeed();
            Invite.destroySandboxAndExit();
        }
    }
    return __addItem(i);
}
97
97 In Vivo Testing: Case Studies Applied testing approach to two caching systems OSCache 2.1.1 Apache JCS 1.3 Both had known defects that were found by users (no corresponding unit tests for these defects) Goal: demonstrate that “traditional” unit tests would miss these but In Vivo testing would detect them
98
98 In Vivo Testing: Experimental Setup An undergraduate student created unit tests for the methods that contained the defects These tests passed in “development” Student was then asked to convert the unit tests to In Vivo tests Driver created to simulate real usage in a “deployment environment”
99
99 In Vivo Testing: Discussion In Vivo testing revealed all defects, even though unit testing did not Some defects only appeared in certain states, e.g. when the cache was at full capacity These are the very types of defects that In Vivo testing is targeted at However, the approach depends heavily on the quality of the tests themselves
100
100 In Vivo Testing: Performance
101
101 More Robust Sandboxes “Safe” test case selection [Willmor and Embury, ICSE’06] Copy-on-write database snapshots MS SQL Server v8
102
102 In Vivo Testing: Related Work Self-checking Software Gamma [A.Orso et al, ISSTA’02] Skoll: [A.Memon et al., ICSE’03] Cooperative Bug Isolation [B.Liblit et al., PLDI’03] COTS components [S.Beydeda, COMPSAC’06] Property-based Software Testing D.Rosenblum: runtime assertion checking I.Nunes: checking algebraic properties [ICFEM’06]
103
103 Motivation Related Work
104
104 Limitations of Other Approaches Formal specification languages Issues related to completeness Balance between expressiveness and implementability Algebraic properties Useful for data structures, but not for arbitrary functions or entire programs Limitations of previous work in runtime checking Log/trace file analysis Requires careful planning in advance
105
105 Previous Work in MT T.Y.Chen et al.: applying metamorphic testing to applications without oracles [Info. & Soft. Tech. vol.44, 2002] Domain-specific testing Graphics [J.Mayer and R.Guderlei, QSIC’07] Bioinformatics [T.Y.Chen et al., BMC Bioinf. 10(24), 2009] Middleware [W.K.Chan et al., QSIC’05] Others…
106
106 Previous Studies [Hu et al., SOQUA’06] Invariants hand-generated Smaller programs Only deterministic applications Didn’t consider partial oracle
107
107 Developer Effort [Hu et al., SOQUA’06] Students were given three-hour training sessions on MT and on assertion checking Given three hours to identify metamorphic properties and program invariants Averaged about the same number of metamorphic properties as invariants The metamorphic properties were more effective at killing mutants
108
108 Fault Localization Delta debugging [Zeller, FSE’02] Compare trace of failed execution vs. successful ones Cooperative Bug Isolation [Liblit et al., PLDI’03] Numerous instances report results and failed execution is compared to those Statistical approach [Baah, Gray, Harrold; SoQUA’06] Combines model of normal behavior with runtime monitoring