Ask the Mutants: Mutating Faulty Programs for Fault Localization


Ask the Mutants: Mutating Faulty Programs for Fault Localization
Seokhyeon Moon, Yunho Kim, Moonzoo Kim (CS Dept., KAIST, South Korea, http://swtv.kaist.ac.kr)
Shin Yoo (CS Dept., University College London, UK, http://crest.cs.ucl.ac.uk/)

Talk Summary
- MUSE: MUtation-baSEd fault localization technique
  - Utilizes mechanical program changes (mutations) to obtain hints for localizing a fault precisely
  - 5.6x more precise than the state-of-the-art SBFL technique Op2
  - Ranks the faulty statement within the top 10 for 38 out of the 51 SIR benchmark programs/faults (~10KLOC each)
- We also introduce a new evaluation metric, Locality Information Loss (LIL), to measure the aptitude of a technique for automated fault repair
  - MUSE also shows better LIL than Jaccard, Ochiai, and Op2
[Figure: ranking of the faulty statement among all executed statements (%) by year, on the 10KLOC SIR benchmark programs: Tarantula [ICSE 2002] around 25%, followed by Ochiai [PRDC 06], Wong [JSS 10], and Op2 [TOSEM 11] around 9%, down to MUSE (2014) around 1%]

Contents
- Motivation: limitations of current FL techniques
  - Precision too low for practical application
  - Inadequacy of the Expense metric
- Key idea: different mutants have different impacts on test results
- Experiments: 51 versions of 5 SIR benchmark programs (~10KLOC each); comparison with Jaccard, Ochiai, and Op2
- Results
  - MUSE is 5.6x more precise than the SBFL techniques
  - MUSE is also more precise than the SBFL techniques in terms of LIL
- Conclusion and future work

Motivation: Finding the Cause of a SW Error is Difficult
- Debugging is an expensive step in SW development
- Locating the cause of errors (i.e., fault localization) is one of the most expensive tasks in the whole debugging activity [Vessey, 1985]
- Example of a faulty program (with x=10, y=1 the assertion fails because max==1):

      int max;
      void setMax(int x, int y) {
          if (x >= y)
              max = y;                   /* fault: should be max = x; */
          else
              max = y;
          printf("%d\n", max);
          assert(!(x >= y) || max == x); /* x >= y implies max == x */
      }

- MUSE can automatically localize a fault precisely: it can find a fault by examining 10 statements in 10,000 LOC programs

Limitation of the Expense Metric
- The Expense metric conflates the accuracy of localization with the mode in which the localization result is used
  - It is only meaningful when rankings are formed and inspected linearly
- Qi et al. [ISSTA'13] proposed evaluating SBFL techniques by using them for automated bug fixing (GenProg)
  - Jaccard was shown to be worse than Op2 w.r.t. ranking, yet GenProg found patches more quickly when faults were localized with Jaccard
- We need a new evaluation metric for fault localization techniques that measures the accuracy of localization precisely
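
For reference, the Expense metric discussed above is the rank of the faulty statement among all executed statements, expressed as a percentage. The Python sketch below is illustrative only: the function name, the data layout, and the pessimistic tie-breaking convention are assumptions, not the paper's exact definition.

```python
def expense(susp, faulty_stmt):
    """Expense: rank of the faulty statement among all executed statements, as a percentage.

    susp: dict mapping statement id -> suspiciousness score (executed statements only)
    faulty_stmt: id of the actual faulty statement
    Ties are broken pessimistically (the faulty statement gets the worst rank among ties).
    """
    fault_score = susp[faulty_stmt]
    # Statements ranked at or above the faulty one must be examined first.
    rank = sum(1 for score in susp.values() if score >= fault_score)
    return 100.0 * rank / len(susp)

# Example: 5 executed statements, the faulty one (s4) ranked second -> Expense = 40.0
scores = {"s1": 0.2, "s2": 0.9, "s3": 0.1, "s4": 0.7, "s5": 0.3}
print(expense(scores, "s4"))
```

Under this metric a developer inspecting statements strictly in rank order would examine 40% of the executed statements before reaching the fault, which is exactly the linear-inspection assumption the slide criticizes.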

Key Idea of MUSE
- Utilize the differences in how test results change when mutating faulty statements versus correct statements
  - Conjecture 1: mutating the faulty statement (s_k -> s_k') tends to turn failing tests into passing ones
  - Conjecture 2: mutating a correct statement (s_1 -> s_1') tends to turn passing tests into failing ones
[Figure: pass/fail results of Tests 1-6 before mutation and after mutating the correct statement s_1 and the faulty statement s_k]
- What is a mutation? A single syntactic code change, e.g., if(a) -> if(!a), or a+b -> a-b
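
To make the two conjectures concrete, here is a small Python sketch (illustrative only, not the MUSE tool) that replays the setMax example from the motivation slide: the original program, one mutant of the faulty assignment, and one mutant of a correct assignment are run against the same tests. The test inputs and function names are invented.

```python
# Buggy original from the motivation slide: returns y in both branches.
def set_max_original(x, y):
    if x >= y:
        m = y          # faulty statement: should be m = x
    else:
        m = y
    return m

# Mutant of the faulty statement (this particular mutation happens to be the fix).
def set_max_mutant_of_faulty(x, y):
    if x >= y:
        m = x
    else:
        m = y
    return m

# Mutant of a correct statement (the assignment in the else branch).
def set_max_mutant_of_correct(x, y):
    if x >= y:
        m = y          # faulty statement left untouched
    else:
        m = x          # mutated: was m = y
    return m

tests = [(10, 1), (3, 7), (8, 2)]      # (x, y) inputs

def passes(fn):
    """One boolean per test: does fn compute the expected maximum?"""
    return [fn(x, y) == max(x, y) for x, y in tests]

print(passes(set_max_original))          # [False, True, False]
print(passes(set_max_mutant_of_faulty))  # [True, True, True]    failing tests now pass (Conjecture 1)
print(passes(set_max_mutant_of_correct)) # [False, False, False] a passing test now fails (Conjecture 2)
```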

Suspiciousness Metric of MUSE
- The suspiciousness metric: Susp_μ(s) = α(s) − β(s)
  - α(s): the probability that s is the fault (Conjecture 1): the average number of failing tests that become passing ones, over all mutants of s
  - β(s): the probability that s is not the fault (Conjecture 2): the average number of passing tests that become failing ones, over all mutants of s
- Detail of the MUSE metric:

      Susp_μ(s) = ( Σ_{m ∈ mut(s)} [ |f_P(s) ∩ p_m| / (f2p + 1) − |p_P(s) ∩ f_m| / (p2f + 1) ] ) / ( |mut(s)| + 1 )

  where
  - mut(s): the set of all mutants of P that mutate s and show observed changes in test results
  - f_P(s), p_P(s): the sets of failing and passing tests that execute s on P
  - p_m, f_m: the sets of passing and failing tests on mutant m
  - f2p, p2f: the numbers of test results that change from fail to pass and from pass to fail, respectively, between P and its mutants over mut(P)
- Susp_MUSE(s) = Norm_Susp(μ, s) + Norm_Susp(SBFL, s)
  - Norm_Susp(flt, s): the suspiciousness of statement s under fault localization technique flt, normalized into [0, 1]
  - We use Jaccard as the SBFL component
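
Below is a minimal Python sketch written directly from the formula above. The data-structure choices (sets of test ids, a dict of per-mutant results) and the min-max reading of "normalized into [0, 1]" are illustrative assumptions, not the authors' implementation.

```python
def susp_mu(s, mutants_of_s, f_P, p_P, results_after, f2p, p2f):
    """Susp_mu(s) per the slide:
       sum over m in mut(s) of |f_P(s) ∩ p_m|/(f2p+1) - |p_P(s) ∩ f_m|/(p2f+1),
       divided by |mut(s)| + 1.

    mutants_of_s: ids of mutants of statement s with observed test-result changes
    f_P, p_P: sets of failing/passing tests that execute s on the original program P
    results_after[m]: (p_m, f_m) = sets of passing/failing tests on mutant m
    f2p, p2f: total fail->pass and pass->fail flips over all mutants of P
    """
    total = 0.0
    for m in mutants_of_s:
        p_m, f_m = results_after[m]
        total += len(f_P & p_m) / (f2p + 1) - len(p_P & f_m) / (p2f + 1)
    return total / (len(mutants_of_s) + 1)

def normalize(scores):
    """Min-max normalize a dict of suspiciousness scores into [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {s: (v - lo) / span for s, v in scores.items()}

def susp_muse(mu_scores, sbfl_scores):
    """Susp_MUSE(s) = Norm_Susp(mu, s) + Norm_Susp(SBFL, s)."""
    mu_n, sbfl_n = normalize(mu_scores), normalize(sbfl_scores)
    return {s: mu_n[s] + sbfl_n[s] for s in mu_scores}
```

As the slide states, the final MUSE score is simply the sum of the two normalized components, with Jaccard used as the SBFL part.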

MUSE Framework
[Figure: pipeline from program P and test suite T, through coverage analysis, mutation (mutants m1 ... mn), and test execution, to suspiciousness calculation and ranking]
- Step 1: Select the target statements to mutate (the statements covered by failing tests, obtained by coverage analysis)
- Step 2: Generate the mutants m1 ... mn and run the test suite on each, collecting the test results
- Step 3: Calculate suspiciousness scores and ranks using the MUSE metric
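
A compact Python sketch of how the three steps fit together follows. The callables passed in (coverage, mutate, run_tests, susp_mu, sbfl_susp) are hypothetical placeholders for real tooling, e.g., a coverage tracer and a Proteum-style mutant generator; this is a skeleton of the flow, not the authors' code.

```python
def muse_pipeline(program, tests, coverage, mutate, run_tests, susp_mu, sbfl_susp):
    # Step 1: select target statements -- those covered by at least one failing test.
    baseline = run_tests(program, tests)                 # {test_id: True (pass) / False (fail)}
    failing = {t for t, ok in baseline.items() if not ok}
    targets = {s for t in failing for s in coverage(program, t)}

    # Step 2: generate mutants of each target statement and run the whole suite on them.
    mutant_results = {s: [run_tests(m, tests) for m in mutate(program, s)]
                      for s in targets}

    # Step 3: score each statement with the MUSE metric, add the normalized SBFL
    # score (the paper uses Jaccard), and rank statements by the combined score.
    def norm(d):
        lo, hi = min(d.values()), max(d.values())
        return {k: (v - lo) / ((hi - lo) or 1.0) for k, v in d.items()}

    mu = norm({s: susp_mu(s, baseline, mutant_results[s]) for s in targets})
    sb = norm({s: sbfl_susp(s, baseline, coverage) for s in targets})
    return sorted(targets, key=lambda s: mu[s] + sb[s], reverse=True)
```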

Locality Information Loss (LIL)
- Borrows from information theory
- First, take the normalized suspiciousness scores as a probability distribution for the likelihood of fault locality
- Second, compare that distribution with the ideal distribution: P = 1.0 for the faulty statement and 0.0 for correct statements
- We use the Kullback-Leibler divergence, an information-theoretic measure of how much one probability distribution diverges from another (the lower the LIL, the more precise the localization)
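
The following is a minimal Python sketch of LIL along those lines: normalize the suspiciousness scores into a distribution and take the KL divergence from a near-ideal distribution concentrated on the faulty statement. The smoothing constant eps and the assumption of non-negative scores are simplifications for illustration; the paper's exact construction may differ in detail.

```python
import math

def lil(susp, faulty_stmt, eps=1e-6):
    """Locality Information Loss sketch: KL divergence between the 'ideal'
    distribution (all probability mass on the faulty statement) and the
    distribution obtained by normalizing the (non-negative) suspiciousness scores.
    """
    # Normalize suspiciousness scores into a probability distribution Q.
    total = sum(susp.values()) or 1.0
    q = {s: max(v / total, eps) for s, v in susp.items()}   # eps avoids log(0)

    # Ideal distribution P: (almost) all mass on the faulty statement.
    p = {s: (1.0 - eps * (len(susp) - 1)) if s == faulty_stmt else eps for s in susp}

    # KL(P || Q): lower means suspicion is concentrated on the actual fault.
    return sum(p[s] * math.log(p[s] / q[s]) for s in susp)

scores = {"s1": 0.05, "s2": 0.9, "s3": 0.05}   # a localizer that strongly suspects s2
print(lil(scores, "s2"))   # small value: the distribution is close to ideal
print(lil(scores, "s1"))   # larger value: little suspicion was assigned to the fault
```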

3 Research Questions
- RQ1 (Foundation of MUSE): How many test results change from fail to pass, and vice versa, on a mutant generated by mutating a faulty statement, compared with a mutant generated by mutating a correct one?
- RQ2 (Precision, Expense metric): How precise is MUSE compared with Jaccard, Ochiai, and Op2 in terms of the Expense metric?
- RQ3 (Precision, LIL metric): How precise is MUSE compared with Jaccard, Ochiai, and Op2 in terms of the LIL metric?

Experiment Design
- Target subjects: 51 versions of 5 SIR benchmark programs: flex, grep, gzip, sed, and space (~10KLOC each)
- Test suites: 5.9~91.0 failing TCs and 24.4~235.0 passing TCs per subject
- All possible mutants generated using Proteum [Maldonado et al.]: 66.4 mutants per statement (30.0 non-equivalent ones)
- 25 machines with 3.6 GHz quad-core CPUs; 22 min/version

RQ1: Foundation of MUSE
- The number of failing test cases on P that pass on m_f (a mutant of the faulty statement) is, on average, 115.2 times greater than the number that pass on m_c (a mutant of a correct statement), i.e., Conjecture 1 is valid
- The number of passing test cases on P that fail on m_c is, on average, 6.6 times greater than the number that fail on m_f, i.e., Conjecture 2 is valid

RQ2: Precision in terms of the Expense Metric
- MUSE is 5.6 times more precise than Op2
  - Average rank of faulty statements: MUSE 1.65% vs. Op2 9.25%
  - Faulty statements ranked in the top 10: MUSE 38/51 vs. Op2 9/51
- Note that MUSE is precise on the subject with real faults (space) as well as on the subjects with seeded faults
[Figure: per-subject Expense results; lower is more precise]

RQ3: Precision in terms of the LIL Metric
- The LIL measurements confirm the empirical results of Qi et al.: Jaccard shows lower LIL values than Ochiai or Op2, even though the latter two were shown to produce better rankings
- However, on average, MUSE shows the lowest LIL value of all
[Figure: LIL values per technique, illustrated on space v21]

Remark 1: Why is MUSE so precise?
- MUSE directly finds where a (partial) fix can, and cannot, potentially exist by analyzing numerous mutants
- In a few cases, MUSE actually finds a fix: it performs a program mutation that makes all test cases pass (this, in turn, increases the first term of the metric)
- In most other cases, MUSE finds a partial fix, i.e., a mutation that makes only some of the previously failing test cases pass
- A partial fix captures the chain of control and data dependencies relevant to the failure and provides guidance towards the location of the fault

Remark 2: MUSE is precise in detecting multiple faults in a target program
- The multi-fault versions contain 2~15 faults each
- Percentage of executed statements examined by MUSE:

  Program  | # versions (single fault) | # versions (multi faults) | # versions (total) | Avg % (single) | Avg % (multi) | Avg % (total)
  flex     | 11 | 2  | 13 | 5.07 | 0.59 | 4.38
  grep     | 1  |    |    | 1.31 | 0.50 | 0.91
  gzip     | 6  |    | 7  | 0.96 | 0.07 | 0.84
  sed      | 3  |    | 5  | 0.23 | 0.60 | 0.45
  space    | 10 | 14 | 24 | 0.78 | 2.30 | 1.67
  SUM/AVG  | 30 | 21 | 51 |      | 0.81 | 1.65

Conclusion and Future Work
- MUSE utilizes mutants to identify fault locations precisely
  - Ranks the faulty statement among the top 10 for 38 out of the 51 studied faults
  - 5.6 times more precise than Op2 (average rank of 1.65%)
- We propose a new metric, LIL
  - An information-theoretic metric that measures the disagreement between the actual and the reported fault locality
- Future work
  - Upgrade MUSE to provide an "explanation" of the detected fault based on the generated partial fixes
  - Apply MUSE to larger programs such as PHP
  - Demonstrate through a case study that MUSE actually helps developers localize faults effectively (in contact with Samsung Electronics)

Ask the Mutants: Mutating Faulty Programs for Fault Localization (ICST 2014)