1
Metamorphic Testing Techniques to Detect Defects in Applications without Test Oracles Christian Murphy Thesis Defense April 12, 2010
2
2 Overview Software testing is important! Certain types of applications are particularly hard to test because there is no “test oracle” Machine Learning, Discrete Event Simulation, Optimization, Scientific Computing, etc. Even when there is no oracle, it is possible to detect defects if properties of the software are violated My research introduces and evaluates new techniques for testing such “non-testable programs” [Weyuker, Computer Journal’82]
3
3 Motivating Example: Machine Learning
4
4 Motivating Example: Simulation
5
5 Problem Statement Partial oracles may exist for a limited subset of the input domain in applications such as Machine Learning, Discrete Event Simulation, Scientific Computing, Optimization, etc. Obvious errors (e.g., crashes) can be detected with certain inputs or testing techniques However, it is difficult to detect subtle computational defects in applications without test oracles in the general case
6
6 What do I mean by “defect”? Deviation of the implementation from the specification Violation of a sound property of the software “Discrete localized” calculation errors Off-by-one Incorrect sentinel values for loops Wrong comparison or mathematical operator Misinterpretation of specification Parts of input domain not handled Incorrect assumptions made about input
7
7 Observation Many programs without oracles have properties such that certain changes to the input yield predictable changes to the output We can detect defects in these programs by looking for any violations of these “metamorphic properties” This is known as “metamorphic testing” [T.Y. Chen et al., Info. & Soft. Tech. vol.44, 2002]
8
8 Research Goals Facilitate the way that metamorphic testing is used in practice Develop new testing techniques based on metamorphic testing Demonstrate the effectiveness of metamorphic testing techniques
9
9 Hypotheses For programs that do not have a test oracle, an automated approach to metamorphic testing is more effective at detecting defects than other approaches An approach that conducts function-level metamorphic testing in the context of a running application will further increase the effectiveness It is feasible to continue this type of testing in the deployment environment, with minimal impact on the end user
10
10 Contributions 1. A set of guidelines to help identify metamorphic properties 2. New empirical studies comparing the effectiveness of metamorphic testing to other approaches 3. An approach for detecting defects in non-deterministic applications called Heuristic Metamorphic Testing 4. A new testing technique called Metamorphic Runtime Checking based on function-level metamorphic properties 5. A generalized technique for testing in the deployment environment called In Vivo Testing
11
11 Outline Background Related Work Metamorphic Testing Metamorphic Testing Empirical Studies Metamorphic Runtime Checking Future Work & Conclusion
12
12 Other Approaches [Baresi & Young, 2001] Formal specifications A complete specification is essentially a test oracle Embedded assertions Can check that the software behaves as expected Algebraic properties Used to generate test cases for abstract datatypes Trace checking & Log file analysis Analyze intermediate results and sequence of executions
13
13 Metamorphic Testing [Chen et al., 2002] Run the initial test case x through f to get f(x); apply a transformation t, based on a metamorphic property of f, to obtain a new test case t(x); then run f on t(x) to get f(t(x)). If the new test case output f(t(x)) is as expected, it is not necessarily correct. However, if f(t(x)) is not as expected, either f(x) or f(t(x)) – or both! – is wrong. The outputs f(x) and f(t(x)) act as “pseudo-oracles” for each other.
14
14 Metamorphic Testing Example Consider a function to determine the standard deviation of a set of numbers. The initial input (a, b, c, d, e, f) produces output s. New test case #1: permuting the input to (c, e, b, a, f, d) should still produce s. New test case #2: adding 2 to each element, giving (a+2, b+2, c+2, d+2, e+2, f+2), should still produce s. New test case #3: multiplying each element by 2, giving (2a, 2b, 2c, 2d, 2e, 2f), should produce 2s.
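To make the example concrete, here is a minimal, self-contained Java sketch (my own illustration, not from the thesis) that checks the three properties above against a straightforward standard-deviation implementation; the implementation, the input values, and the tolerance are assumptions.

import java.util.Arrays;

public class StdDevMetamorphicTest {

    // Plain two-pass (population) standard deviation, used here as the function under test.
    static double stdDev(double[] a) {
        double mean = Arrays.stream(a).average().orElse(0.0);
        double sumSq = Arrays.stream(a).map(x -> (x - mean) * (x - mean)).sum();
        return Math.sqrt(sumSq / a.length);
    }

    static void check(String property, double expected, double actual) {
        if (Math.abs(expected - actual) > 1e-9) {
            System.out.println(property + " VIOLATED: expected " + expected + " but got " + actual);
        } else {
            System.out.println(property + " held");
        }
    }

    public static void main(String[] args) {
        double[] x = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0};
        double s = stdDev(x);

        // New test case #1 (permutative): shuffling the elements should not change the result.
        double[] permuted = {3.0, 5.0, 2.0, 1.0, 6.0, 4.0};
        check("permutation property", s, stdDev(permuted));

        // New test case #2 (additive): adding 2 to every element should not change the result.
        double[] shifted = Arrays.stream(x).map(v -> v + 2.0).toArray();
        check("additive property", s, stdDev(shifted));

        // New test case #3 (multiplicative): doubling every element should double the result.
        double[] doubled = Arrays.stream(x).map(v -> v * 2.0).toArray();
        check("multiplicative property", 2.0 * s, stdDev(doubled));
    }
}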
15
15 Outline
16
16 Empirical Study Is metamorphic testing more effective than other approaches in detecting defects in applications without test oracles? Approaches investigated Metamorphic Testing Using metamorphic properties of the entire application Runtime Assertion Checking Using Daikon-detected program invariants Partial Oracle Simple inputs for which correct output can easily be determined
17
17 Applications Investigated Machine Learning C4.5: decision tree classifier MartiRank: ranking Support Vector Machines (SVM): vector-based classifier PAYL: anomaly-based intrusion detection system Discrete Event Simulation JSim: used in simulating hospital ER Information Retrieval Lucene: Apache framework’s text search engine Optimization gaffitter: genetic algorithm approach to bin-packing problem
18
18 Methodology Mutation testing was used to seed defects into each application Comparison operators were reversed Math operators were changed Off-by-one errors were introduced For each program, we created multiple versions, each with exactly one mutation We ignored mutants that yielded outputs that were obviously wrong, caused crashes, etc. Effectiveness is determined by measuring what percentage of the mutants were “killed”
19
19 Experimental Results
20
20 Analysis of Results Assertions are good for checking bounds and relationships, but not for detecting changes to values Metamorphic testing is particularly good for detecting errors in loop conditions Metamorphic testing was not very effective for PAYL (5%) and gaffitter (33%): fewer properties were identified, and the defects had little impact on the output
21
21 Outline
22
22 Metamorphic Runtime Checking Results of previous study revealed limitations of scope and robustness in metamorphic testing What if we consider the metamorphic properties of individual functions and check those properties as the entire program is running? A combination of metamorphic testing and runtime assertion checking
23
23 Metamorphic Runtime Checking Tester specifies the metamorphic properties of individual functions using a special notation in the code (based on JML) Pre-processor instruments code with corresponding metamorphic tests Tester runs entire program as normal (e.g., to perform system tests) Violation of any property reveals a defect
24
24 MRC Model of Execution Function f is about to be executed with input x in state S. The framework creates a sandbox for the test, executes f(x) to get the result, sends the result to the test, and the program continues. Inside the sandbox, the metamorphic test transforms the input to get t(x), executes f(t(x)), compares the outputs, and reports any violations. The metamorphic test is conducted at the same point in the program execution as the original function call, and it runs in parallel with the rest of the application.
25
25 Empirical Study Can Metamorphic Runtime Checking detect defects not found by system-level metamorphic testing? Same mutants used in previous study 29% were not found by metamorphic testing Metamorphic properties identified at function level using suggested guidelines
26
26 Experimental Results
27
27 Analysis of Results Scope: Function-level testing allowed us to: identify additional metamorphic properties execute more tests Robustness: Metamorphic testing “inside” the application detected subtle defects that did not have much effect on the overall program output
28
28 Combined Results
29
29 Outline
30
30 Results Demonstrated that metamorphic testing advances the state of the art in detecting defects in applications without test oracles Proved that Metamorphic Runtime Checking will reveal defects not found by using system-level properties Showed that it is feasible to continue this type of testing in the deployment environment, with minimal impact on the end user
31
31 Short-Term Opportunities Automatic detection of metamorphic properties Using dynamic and/or static techniques Fault localization Once a defect has been detected, figure out where it occurred and how to fix it Implementation issues Reducing overhead Handling external databases, network traffic, etc.
32
32 Long-Term Directions Testing of multi-process or distributed applications in these domains Collaborative defect detection and notification Investigate the impact on the software development processes used in the domains of non-testable programs
33
33 Contributions & Accomplishments 1. A set of metamorphic testing guidelines [Murphy, Kaiser, Hu, Wu; SEKE’08] 2. New empirical studies [Xie, Ho, Murphy, Kaiser, Xu, Chen; QSIC’09] 3. Heuristic Metamorphic Testing [Murphy, Shen, Kaiser; ISSTA’09] 4. Metamorphic Runtime Checking [Murphy, Shen, Kaiser; ICST’09] 5. In Vivo Testing [Murphy, Kaiser, Vo, Chu; ICST’09] [Murphy, Vaughan, Ilahi, Kaiser; AST’10]
34
34 Thank you!
35
35 Motivation Backup Slides!
36
36 Assessment of Quality 1994: Hatton et al. pointed out a “disturbing” number of defects due to calculation errors in scientific computing software [TSE vol.20] 2007: Hatton reports that “many scientific results are corrupted, perhaps fatally so, by undiscovered mistakes in the software used to calculate and present those results” [Computer vol.40]
37
37 Complexity vs. Effectiveness A chart positions the approaches by complexity and effectiveness: Embedded Assertions, Algebraic Specifications, Formal Specifications, Trace Checking & Log Analysis, System-level Metamorphic Testing, and Metamorphic Runtime Checking.
38
38 Motivation Metamorphic Properties
39
39 Categories of Metamorphic Properties Additive: Increase (or decrease) numerical values by a constant Multiplicative: Multiply numerical values by a constant Permutative: Randomly permute the order of elements in a set Invertive: Negate the elements in a set Inclusive: Add a new element to a set Exclusive: Remove an element from a set Compositional: Compose a set
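A minimal sketch (my illustration, not the thesis framework) of what several of these categories look like as input transformations over a numeric data set; the class and method names are assumptions.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class Transforms {
    // Additive: increase every value by a constant k.
    static List<Double> add(List<Double> in, double k) {
        List<Double> out = new ArrayList<>();
        for (double v : in) out.add(v + k);
        return out;
    }

    // Multiplicative: multiply every value by a constant k.
    static List<Double> multiply(List<Double> in, double k) {
        List<Double> out = new ArrayList<>();
        for (double v : in) out.add(v * k);
        return out;
    }

    // Permutative: randomly permute the order of the elements.
    static List<Double> permute(List<Double> in) {
        List<Double> out = new ArrayList<>(in);
        Collections.shuffle(out);
        return out;
    }

    // Invertive: negate every element.
    static List<Double> negate(List<Double> in) {
        return multiply(in, -1.0);
    }

    // Inclusive: add a new element to the set.
    static List<Double> include(List<Double> in, double v) {
        List<Double> out = new ArrayList<>(in);
        out.add(v);
        return out;
    }

    // Exclusive: remove an element from the set.
    static List<Double> exclude(List<Double> in, int index) {
        List<Double> out = new ArrayList<>(in);
        out.remove(index);
        return out;
    }
}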
40
40 Sample Metamorphic Properties 1. Permuting the order of the examples in the training data should not affect the model 2. If all attribute values in the training data are multiplied by a positive constant, the model should stay the same 3. If all attribute values in the training data are increased by a positive constant, the model should stay the same 4. Updating a model with a new example should yield the same model created with training data originally containing that example 5. If all attribute values in the training data are multiplied by -1, and an example to be classified is also multiplied by -1, the classification should be the same 6. Permuting the order of the examples in the testing data should not affect their classification 7. If all attribute values in the training data are multiplied by a positive constant, and an example to be classified is also multiplied by the same positive constant, the classification should be the same 8. If all attribute values in the training data are increased by a positive constant, and an example to be classified is also increased by the same positive constant, the classification should be the same
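For instance, property #1 could be checked against a real learner. The sketch below (my own, not part of the thesis) uses WEKA's J48 decision tree; the data file name, the use of toString() to compare models, and the exact-equality comparison are assumptions, and in practice a more tolerant model comparison may be needed.

import java.util.Random;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class PermutationPropertyCheck {
    public static void main(String[] args) throws Exception {
        // Load a training set (file name chosen for illustration).
        Instances data = DataSource.read("iris.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Train on the original order.
        J48 original = new J48();
        original.buildClassifier(new Instances(data));

        // Metamorphic property #1: permuting the training examples should not affect the model.
        Instances permuted = new Instances(data);
        permuted.randomize(new Random(42));
        J48 reordered = new J48();
        reordered.buildClassifier(permuted);

        // The textual form of the tree is used here as a crude model comparison.
        if (original.toString().equals(reordered.toString())) {
            System.out.println("Property held: same model after permutation.");
        } else {
            System.out.println("Property violated: models differ after permutation.");
        }
    }
}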
41
41 Other Classes of Properties (1) Statistical Same mean, variance, etc. as the original Heuristic Approximately equal to the original Semantically Equivalent Domain specific
42
42 Other Classes of Properties (2) Noise Based Add/Change data that should not affect result Partial Change to part of input only affects part of output Compositional New input relies on original output, e.g. ShortestPath(a, b) = ShortestPath(a, c) + ShortestPath(c, b) for a vertex c taken from the originally computed shortest path
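A minimal sketch (my own illustration) of checking the compositional property against a simple breadth-first shortest-path routine on an unweighted graph; the graph, the endpoints, and the choice of c from the original output are assumptions.

import java.util.*;

public class ShortestPathPropertyCheck {

    // Breadth-first search on an unweighted graph; returns the shortest path from s to t
    // as a list of vertices, or null if t is unreachable.
    static List<Integer> shortestPath(List<List<Integer>> adj, int s, int t) {
        int[] parent = new int[adj.size()];
        Arrays.fill(parent, -2);          // -2 means "not visited yet"
        parent[s] = -1;
        Deque<Integer> queue = new ArrayDeque<>();
        queue.add(s);
        while (!queue.isEmpty()) {
            int u = queue.poll();
            if (u == t) break;
            for (int v : adj.get(u)) {
                if (parent[v] == -2) { parent[v] = u; queue.add(v); }
            }
        }
        if (parent[t] == -2) return null;
        LinkedList<Integer> path = new LinkedList<>();
        for (int v = t; v != -1; v = parent[v]) path.addFirst(v);
        return path;
    }

    public static void main(String[] args) {
        // Small example graph: 0-1, 0-2, 1-2, 2-3, 3-4.
        List<List<Integer>> adj = Arrays.asList(
            Arrays.asList(1, 2), Arrays.asList(0, 2), Arrays.asList(0, 1, 3),
            Arrays.asList(2, 4), Arrays.asList(3));

        List<Integer> path = shortestPath(adj, 0, 4);
        int c = path.get(path.size() / 2);   // a vertex taken from the original output

        int lenAB = path.size() - 1;
        int lenAC = shortestPath(adj, 0, c).size() - 1;
        int lenCB = shortestPath(adj, c, 4).size() - 1;

        // Compositional property: the two follow-up executions should compose to the original.
        if (lenAB == lenAC + lenCB) {
            System.out.println("Compositional property held.");
        } else {
            System.out.println("Compositional property violated.");
        }
    }
}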
43
43 Automatic Detection of Properties Static Use machine learning to model what code looks like that exhibits certain properties, then determine whether other code matches that model Use symbolic execution to check “algebraically” Dynamic Observe multiple executions and infer properties
44
44 Motivation Automated Metamorphic Testing
45
45 Automated Metamorphic Testing Tester specifies the application’s metamorphic properties Test framework does the rest: Transform inputs Execute program with each input Compare outputs according to specification
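A minimal sketch of what such a framework does for one property, assuming a hypothetical command-line program that reads an input file and writes its result to standard output; the program name, the file names, the permutative property, and the exact-equality comparison are all assumptions.

import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class SystemLevelMetamorphicTest {

    // Run the external program on the given input file and capture its standard output.
    static String run(String inputFile) throws IOException, InterruptedException {
        Process p = new ProcessBuilder("./rank_app", inputFile)   // program name is an assumption
                .redirectErrorStream(true)
                .start();
        String output = new String(p.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        p.waitFor();
        return output;
    }

    public static void main(String[] args) throws Exception {
        // 1. Execute the program with the original input.
        String original = run("input.csv");

        // 2. Execute it with an input transformed according to a metamorphic property
        //    (here: a pre-built permuted copy of the same data set).
        String followUp = run("input_permuted.csv");

        // 3. Compare the outputs according to the property's specification.
        //    For a permutative property on this hypothetical program, outputs should match exactly.
        if (original.equals(followUp)) {
            System.out.println("Metamorphic property held.");
        } else {
            System.out.println("Metamorphic property violated: possible defect.");
        }
    }
}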
46
46 AMST Model
47
47 Specifying Metamorphic Properties
48
48 Motivation Heuristic Metamorphic Testing
49
49 Statistical Metamorphic Testing Introduced by Guderlei & Mayer in 2007 The application is run multiple times with the same input to get a mean value μ₀ and variance σ₀ Metamorphic properties are applied The application is run multiple times with the new input to get a mean value μ₁ and variance σ₁ If the means are not statistically similar, then the property is considered violated
50
50 Heuristic Metamorphic Testing When we expect that a change to the input will produce “similar” results, but cannot determine the expected similarity in advance Use input X to generate outputs M₁ through Mₖ Use some metric to create a profile of the outputs Use input X’ (created according to a metamorphic property) to generate outputs N₁ through Nₖ Create a profile of those outputs Use statistical techniques (e.g. Student t-test) to check that the profile of outputs N is similar to that of outputs M
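A minimal sketch of the statistical comparison step, assuming the non-deterministic program has already been run k times on the original input and k times on the transformed input, and that each run has been reduced to a single numeric metric; the metric values, the use of Apache Commons Math's two-sample t-test, and the 0.05 threshold are illustrative choices, not the thesis implementation.

import org.apache.commons.math3.stat.inference.TTest;

public class HeuristicComparison {
    public static void main(String[] args) {
        // Profiles of k runs on the original input X and k runs on the transformed input X'
        // (the values here are made up for illustration).
        double[] profileOriginal = {0.81, 0.79, 0.83, 0.80, 0.82, 0.78, 0.81, 0.80};
        double[] profileFollowUp = {0.80, 0.82, 0.79, 0.81, 0.83, 0.80, 0.78, 0.82};

        // Two-sample t-test: a small p-value means the two profiles are unlikely
        // to come from the same distribution, i.e. the property appears violated.
        double pValue = new TTest().tTest(profileOriginal, profileFollowUp);
        if (pValue < 0.05) {
            System.out.println("Profiles differ significantly: property violated (p = " + pValue + ")");
        } else {
            System.out.println("Profiles are statistically similar: property held (p = " + pValue + ")");
        }
    }
}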
51
51 Heuristic Metamorphic Testing The non-deterministic function nd_f is run n times on input x to produce outputs y₁…yₙ, and n times on the transformed input t(x) to produce outputs y′₁…y′ₙ. A profile is built from y₁…yₙ and another from y′₁…y′ₙ, and the question is: do the two profiles demonstrate the expected relationship?
52
52 HMT Example Sorting a data set that contains unknown values (“?”), e.g. the input (2, ?, 1, ?, 4, 3), is non-deterministic: repeated runs may yield (1, ?, 2, 3, 4, ?), (1, 2, 3, ?, 4, ?), (?, 1, ?, 2, 3, 4), and so on. Build a profile P of these outputs based on normalized equivalence. Then permute the input, e.g. to (4, 1, ?, 3, 2, ?), sort it repeatedly to get outputs such as (1, ?, ?, 2, 3, 4) and (1, 2, ?, 3, 4, ?), build a profile P′ in the same way, and compare it statistically to the first profile.
53
53 HMT Empirical Study Is Heuristic Metamorphic Testing more effective than other approaches in detecting defects in non-deterministic applications without test oracles? Approaches investigated Heuristic Metamorphic Testing Embedded Assertions Partial Oracle Applications investigated MartiRank: sorting sparse data sets JSim: non-deterministic event timing
54
54 HMT Study Results & Analysis Heuristic Metamorphic Testing killed 59 of the 78 mutants Partial oracle and assertion checking ineffective for JSim because no single execution was outside the specified range
55
55 Motivation Metamorphic Runtime Checking
56
56 Extensions to JML
57
57 Creating Test Functions

/*@
  @meta std_dev(\multiply(A, 2)) == \result * 2
 */
public double __std_dev(double[] A) { ... }

// Generated by the pre-processor from the @meta annotation above
protected boolean __MRCtest0_std_dev(double[] A, double result) {
    return Columbus.approximatelyEqualTo(
        __std_dev(Columbus.multiply(A, 2)), result * 2);
}
58
58 Instrumentation

public double std_dev(double[] A) {
    // call original function and save result
    double result = __std_dev(A);
    // create sandbox
    int pid = Columbus.createSandbox();
    // program continues as normal in the parent process
    if (pid != 0)
        return result;
    else {
        // run test in child process
        if (!__MRCtest0_std_dev(A, result))
            Columbus.fail();  // handle failure
        Columbus.exit();      // clean up
    }
    return result;  // not reached: the child process exits above
}
59
59 MRC: Case Studies We investigated the WEKA and RapidMiner toolkits for Machine Learning in Java For WEKA, we tested four apps: Naïve Bayes, Support Vector Machines (SVM), C4.5 Decision Tree, and k-Nearest Neighbors For RapidMiner, we tested one app: Naïve Bayes
60
60 MRC: Case Study Setup For each of the five apps, we specified 4-6 metamorphic properties of selected methods (based on our knowledge of the expected behavior of the overall application) Testing was conducted using data sets from UCI Machine Learning Repository Goal was to determine whether the properties held as expected
61
61 MRC: Case Study Findings Discovered defects in WEKA k-NN and WEKA Naïve Bayes related to modifying the machine learning “model” This was the result of a variable not being updated appropriately Discovered a defect in RapidMiner Naïve Bayes related to determining confidence There was an error in the calculation
62
62 Motivation Metamorphic Testing Experimental Study
63
63 Approaches Not Investigated Formal specification Issues related to completeness Prev. work converted specifications to invariants Algebraic properties Not appropriate at system-level Automatic detection only supported in Java Log/trace file analysis Need more detailed knowledge of implementation Pseudo-oracles None appropriate for applications investigated
64
64 Methodology: Metamorphic Testing Each variant (containing one mutation) acted as a pseudo-oracle for itself: Program was run to produce an output with the original input dataset Metamorphic properties applied to create new input datasets Program run on new inputs to create new outputs If outputs not as expected, the mutant had been killed (i.e. the defect had been detected)
65
65 Methodology: Partial Oracle Data sets were chosen so that the correct output could be calculated by hand These data sets were typically smaller than the ones used for other approaches To ensure fairness, the data sets were selected so that the line coverage was approximately the same for each approach
66
66 Methodology: Runtime Assertion Checking Daikon was used to detect program invariants in the “gold standard” implementation Because Daikon can generate spurious invariants, programs were run with a variety of inputs, and obvious spurious invariants were discarded Invariants then checked at runtime
67
67 Defects Detected in Study #1
68
68 Study #1: SVM Results Permuting the input was very effective at killing off-by-one mutants Many functions in SVM analyze a set of numbers (mean, standard dev, etc.) Off-by-one mutants caused some element of the set to be omitted By permuting, a different number would be omitted This revealed the defect
69
69 Study #1: SVM Example Permuting the input reveals this defect because both m_I1 and m_I4 will be different The partial oracle does not reveal it, because only one element is omitted, so one of the two values remains the same; for small data sets, this did not affect the overall result
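To illustrate why permutation catches this class of defect, here is a small self-contained sketch (my own, not from the study) of a mean computation with an off-by-one bug that skips the last element; a partial-oracle check on a hand-picked input can miss it, while the permutative property exposes it. The input values and tolerance are assumptions.

public class OffByOneDemo {

    // Buggy mean: the loop stops one element early (i < a.length - 1).
    static double buggyMean(double[] a) {
        double sum = 0.0;
        for (int i = 0; i < a.length - 1; i++) {   // off-by-one defect
            sum += a[i];
        }
        return sum / (a.length - 1);
    }

    public static void main(String[] args) {
        // Hand-picked "partial oracle" input where the omitted element happens not to matter:
        // the last value equals the mean of the rest, so the buggy result is still 2.0.
        double[] oracleInput = {1.0, 3.0, 2.0};
        System.out.println("partial oracle check: " + buggyMean(oracleInput) + " (expected 2.0)");

        // Permutative metamorphic property: the mean should be unchanged by reordering.
        double[] x = {1.0, 3.0, 2.0};
        double[] permuted = {2.0, 1.0, 3.0};   // a different element now gets omitted
        double m1 = buggyMean(x);
        double m2 = buggyMean(permuted);
        if (Math.abs(m1 - m2) > 1e-9) {
            System.out.println("Permutation property violated: " + m1 + " vs " + m2);
        }
    }
}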
70
70 Study #1: C4.5 Results Negating the input was very effective C4.5 creates a decision tree in which nodes contain clauses like “if attr_n > α then class = C” If the data set is negated, those nodes should change to “if attr_n ≤ -α then class = C”, i.e. both the operator and the sign of α should change In most cases, only one of the two changes occurred
71
71 Study #1: C4.5 Example Mutant causes ClassFreq to have negative values, violating assertion Permuting the order of elements does not affect the output in this case
72
72 Study #1: MartiRank Results Permuting and negating were effective at killing comparison operator mutants MartiRank depends heavily on sorting Permuting and negating change which numbers get sorted and what the result should be, thus inducing the differences in the final sorted list
73
73 Study #1: Effectiveness of Properties
74
74 Study #1: Lucene Results Most mutants gave a non-zero score to the term “foo”, thus L3 detected the defect
75
75 Study #1: gaffitter Results G1: increasing the number of generations should increase the overall quality G2: multiplying item and bin sizes by a constant should not affect the solution Most of the defects killed by G1 were related to incorrectly selecting candidate solutions
76
76 Empirical Studies: Threats to Validity Representativeness of selected programs Types of defects Data sets Daikon-generated program invariants Selection of metamorphic properties
77
77 Motivation Metamorphic Runtime Checking Experimental Study
78
78 Study #2 Results If we only consider functions for which metamorphic properties were identified, there were 189 total mutants MRC detected 96.3%, compared to 67.7% for system-level metamorphic testing
79
79 Study #2 PAYL Results Both functions call numerous other functions, but we can circumvent restrictions on the input domain Permuting input tends to kill off-by-one mutants
80
80 Study #2 gaffitter Results
81
81 Study #2: gaffitter Example The genetic algorithm takes two sets, e.g. (1, 2, 3, 4, 5) and (6, 7, 8, 9), and “crosses over” at a particular element, producing (1, 2, 3, 9) and (4, 5, 6, 7, 8). Metamorphic property: if we switch the order of the two sets, the new output should be predictable – simply the elements not included in the original cross-over.
82
82 Study #2: gaffitter Example Now consider a defect in which the cross-over happens at the wrong point: crossing over (1, 2, 3, 4, 5) and (6, 7, 8, 9) produces (1, 2, 3, 8, 9) and (6, 7, 8, 3, 4, 5). The metamorphic property is violated: elements 3 and 8 should not appear in both sets.
83
83 Study #2: gaffitter Example For the parents (1, 2, 3, 4, 5) and (6, 7, 8, 9), the correct implementation produces the offspring (1, 2, 3, 9), while the erroneous implementation produces (1, 2, 3, 8). This defect is only detected by system-level metamorphic testing if element 8 has any impact on the “quality” of the final solution. However, a single element is unlikely to do so.
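A small self-contained sketch (my own illustration, not gaffitter's actual code) of a one-point cross-over and a function-level check that the offspring preserve exactly the parents' elements; the defective variant's off-by-one cut is an assumed fault, and the "nothing duplicated, nothing lost" property is my distillation of the relationship described above.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class CrossoverPropertyCheck {

    // Correct one-point cross-over: each parent is split once, and the two
    // offspring together contain every parent element exactly once.
    static List<List<Integer>> crossover(List<Integer> a, List<Integer> b, int cutA, int cutB) {
        List<Integer> child1 = new ArrayList<>(a.subList(0, cutA));
        child1.addAll(b.subList(cutB, b.size()));
        List<Integer> child2 = new ArrayList<>(b.subList(0, cutB));
        child2.addAll(a.subList(cutA, a.size()));
        return List.of(child1, child2);
    }

    // Hypothetical defect: the second offspring cuts parent b at the wrong point,
    // so some elements of b end up in both offspring.
    static List<List<Integer>> buggyCrossover(List<Integer> a, List<Integer> b, int cutA, int cutB) {
        List<Integer> child1 = new ArrayList<>(a.subList(0, cutA));
        child1.addAll(b.subList(cutB, b.size()));
        List<Integer> child2 = new ArrayList<>(b.subList(0, cutB + 1));   // off-by-one on the cut
        child2.addAll(a.subList(cutA, a.size()));
        return List.of(child1, child2);
    }

    // Function-level property: the offspring, taken together, must contain exactly
    // the elements of the two parents -- nothing duplicated, nothing lost.
    static boolean preservesElements(List<Integer> a, List<Integer> b, List<List<Integer>> kids) {
        List<Integer> parents = new ArrayList<>(a);
        parents.addAll(b);
        List<Integer> children = new ArrayList<>(kids.get(0));
        children.addAll(kids.get(1));
        Collections.sort(parents);
        Collections.sort(children);
        return parents.equals(children);
    }

    public static void main(String[] args) {
        List<Integer> a = List.of(1, 2, 3, 4, 5);
        List<Integer> b = List.of(6, 7, 8, 9);
        System.out.println("correct implementation: " + preservesElements(a, b, crossover(a, b, 3, 3)));
        System.out.println("buggy implementation:   " + preservesElements(a, b, buggyCrossover(a, b, 3, 3)));
    }
}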
84
84 Study #2 Lucene Results MRC killed three mutants not killed by MT All three were in the idf function
85
85 Study #2: Lucene Example Search query results are ordered according to a score. The query “ROMEO or JULIET” returns Act 3 Scene 5, Act 2 Scene 4, and Act 5 Scene 1, with scores 5.837, 4.681, and 3.377. Consider a defect in which the scores are off by one: the scores become 6.837, 5.681, and 4.377, but the results stay the same because only the order is important. The partial oracle does not reveal this defect because the scores cannot be calculated in advance.
86
86 Study #2: Lucene Example System-level metamorphic property: changing the query order shouldn’t affect the result. Both “ROMEO or JULIET” and “JULIET or ROMEO” return Act 3 Scene 5, Act 2 Scene 4, and Act 5 Scene 1, with the same (erroneous) scores 6.837, 5.681, and 4.377. Even though the defect exists, the property still holds and the defect is not detected.
87
87 Study #2: Lucene Example The score itself is computed as the result of many subcalculations. Score(q) = ∑ Similarity(f) · Weight(qᵢ) + … + idf(q) + … Metamorphic Runtime Checking can detect that there is an error in this function by checking its individual (mathematical) properties.
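As an illustration of a function-level mathematical property for idf (my own sketch; the formula below resembles Lucene's classic similarity but is an assumption, not Lucene's actual code): idf should strictly decrease as the document frequency of a term increases, with the collection size held fixed.

public class IdfPropertyCheck {

    // Assumed idf formula, similar in shape to Lucene's classic similarity.
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    public static void main(String[] args) {
        int numDocs = 1000;
        // Function-level property: increasing docFreq must never increase idf.
        for (int docFreq = 1; docFreq < numDocs; docFreq++) {
            if (idf(docFreq + 1, numDocs) >= idf(docFreq, numDocs)) {
                System.out.println("Property violated at docFreq = " + docFreq);
                return;
            }
        }
        System.out.println("idf is monotonically decreasing in docFreq, as expected.");
    }
}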
88
88 Motivation In Vivo Testing
89
89 Generalization of MRC In Metamorphic Runtime Checking, the software tests itself Why only run metamorphic tests? Why limit ourselves only to applications without test oracles? Why not allow the software to continue testing itself as it runs in the production environment?
90
90 In Vivo Testing An approach whereby software tests itself in the production environment by running any type of test (unit, integration, “parameterized unit”, etc.) at specified program points Tests are run in a sandbox so as not to affect the original process The Invite implementation adds less than half a millisecond of overhead per test
91
91 Example of Defect: Cache

private int numItems = 0, currSize = 0;  // number of items in the cache, and their size (in bytes)
private int maxCapacity = 1024;          // maximum capacity, in bytes

public int getNumItems() { return numItems; }

public boolean addItem(CacheItem i) throws ... {
    numItems++;   // defect: should only be incremented within the "if" block
    if (currSize + i.size < maxCapacity) {
        add(i);
        currSize += i.size;
        return true;
    } else {
        return false;
    }
}
92
92 Insufficient Unit Test

public void testAddItem() {
    Cache c = new Cache();
    assert(c.addItem(new CacheItem()));
    assert(c.getNumItems() == 1);
    assert(c.addItem(new CacheItem()));
    assert(c.getNumItems() == 2);
}

1. Assumes an empty/new cache
2. Doesn’t take into account various states that the cache can be in
93
93 Defects Targeted 1. Unit tests that make incomplete assumptions about the state of objects in the application 2. Possible field configurations that were not tested in the lab 3. A legal user action that puts the system in an unexpected state 4. A sequence of unanticipated user actions that breaks the system 5. Defects that only appear intermittently
94
94 In Vivo: Model of Execution When a function is about to be executed, the framework decides whether to run a test. If not, the function simply executes. If so, the process forks to create a sandbox: the test runs in the forked process and then stops, while the rest of the program continues and the function executes as normal.
95
95 Writing In Vivo Tests

/* Method to be tested */
public boolean addItem(CacheItem i) { ... }

/* JUnit style test */
public void testAddItem() {
    Cache c = new Cache();
    if (c.addItem(new CacheItem()))
        assert (c.getNumItems() == 1);
}

/* The same test rewritten in In Vivo style: it takes the actual argument,
   runs against the current object in its current state, and returns a boolean verdict */
public boolean testAddItem(CacheItem i) {
    int oldNumItems = getNumItems();
    if (this.addItem(i))
        return (getNumItems() == oldNumItems + 1);
    else
        return true;
}
96
96 Instrumentation

/* Method to be tested */
public boolean __addItem(CacheItem i) { ... }

/* In Vivo style test */
public boolean testAddItem(CacheItem i) { ... }

public boolean addItem(CacheItem i) {
    if (Invite.runTest("Cache.addItem")) {
        Invite.createSandboxAndFork();
        if (Invite.isTestProcess()) {
            if (testAddItem(i) == false)
                Invite.fail();
            else
                Invite.succeed();
            Invite.destroySandboxAndExit();
        }
    }
    return __addItem(i);
}
97
97 In Vivo Testing: Case Studies Applied testing approach to two caching systems OSCache 2.1.1 Apache JCS 1.3 Both had known defects that were found by users (no corresponding unit tests for these defects) Goal: demonstrate that “traditional” unit tests would miss these but In Vivo testing would detect them
98
98 In Vivo Testing: Experimental Setup An undergraduate student created unit tests for the methods that contained the defects These tests passed in “development” Student was then asked to convert the unit tests to In Vivo tests Driver created to simulate real usage in a “deployment environment”
99
99 In Vivo Testing: Discussion In Vivo testing revealed all defects, even though unit testing did not Some defects only appeared in certain states, e.g. when the cache was at full capacity These are the very types of defects that In Vivo testing is targeted at However, the approach depends heavily on the quality of the tests themselves
100
100 In Vivo Testing: Performance
101
101 More Robust Sandboxes “Safe” test case selection [Willmor and Embury, ICSE’06] Copy-on-write database snapshots MS SQL Server v8
102
102 In Vivo Testing: Related Work Self-checking Software Gamma [A.Orso et al, ISSTA’02] Skoll: [A.Memon et al., ICSE’03] Cooperative Bug Isolation [B.Liblit et al., PLDI’03] COTS components [S.Beydeda, COMPSAC’06] Property-based Software Testing D.Rosenblum: runtime assertion checking I.Nunes: checking algebraic properties [ICFEM’06]
103
103 Motivation Related Work
104
104 Limitations of Other Approaches Formal specification languages Issues related to completeness Balance between expressiveness and implementability Algebraic properties Useful for data structures, but not for arbitrary functions or entire programs Limitations of previous work in runtime checking Log/trace file analysis Requires careful planning in advance
105
105 Previous Work in MT T.Y.Chen et al.: applying metamorphic testing to applications without oracles [Info. & Soft. Tech. vol.44, 2002] Domain-specific testing Graphics [J.Mayer and R.Guderlei, QSIC’07] Bioinformatics [T.Y.Chen et al., BMC Bioinf. 10(24), 2009] Middleware [W.K.Chan et al., QSIC’05] Others…
106
106 Previous Studies [Hu et al., SOQUA’06] Invariants hand-generated Smaller programs Only deterministic applications Didn’t consider partial oracle
107
107 Developer Effort [Hu et al., SOQUA’06] Students were given three-hour training sessions on MT and on assertion checking Given three hours to identify metamorphic properties and program invariants Averaged about the same number of metamorphic properties as invariants The metamorphic properties were more effective at killing mutants
108
108 Fault Localization Delta debugging [Zeller, FSE’02] Compare trace of failed execution vs. successful ones Cooperative Bug Isolation [Liblit et al., PLDI’03] Numerous instances report results and failed execution is compared to those Statistical approach [Baah, Gray, Harrold; SoQUA’06] Combines model of normal behavior with runtime monitoring