Quality and applicability of automated repair


Quality and applicability of automated repair
Yuriy Brun, UMass Amherst
with Ted Smith, Yalin Ke, Manish Motwani, Mauricio Soto, Earl Barr, Prem Devanbu, Claire Le Goues, René Just, and Katie Stolee

Two themes:
- Quality of the patches produced by automated repair
- Applicability of automated repair to important and hard defects

What does it mean for automated repair to be successful? One common answer: how many bugs can it fix, that is, for how many defects can it make all known test cases pass? But is making test cases pass the ultimate measure of success? No! It conflates training and evaluation data. (Notable exceptions that evaluate beyond the guiding tests: Fry et al. ISSTA’12, Kim et al. ICSE’13, Qi et al. ISSTA’15, Martinez et al. EMSE’16.) Don’t get me wrong: tweaking code to pass all tests is really cool! It’s just not the end-all of automated repair.
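
To make the conflation concrete, here is a minimal, hypothetical Python sketch (not from the talk): a degenerate “patch” to a median routine, in the spirit of IntroClass’s median program, that hardcodes the outputs the guiding test suite expects. It passes every test it was trained against yet is wrong on unseen inputs.

    def median_patched(a, b, c):
        # Degenerate "patch": memorize the outputs the guiding tests expect.
        expected = {(0, 1, 2): 1, (5, 5, 5): 5, (9, 3, 7): 7}
        return expected.get((a, b, c), a)

    # The test suite that guided the repair (the "training data") all passes:
    assert median_patched(0, 1, 2) == 1
    assert median_patched(5, 5, 5) == 5
    assert median_patched(9, 3, 7) == 7

    # A held-out test reveals the patch is wrong:
    print(median_patched(2, 8, 4))  # prints 2; the true median is 4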

Lingering questions: Are the produced repairs any good? How can we measure their quality objectively?

How can we know whether APR’s repairs are good?
1. Look at the produced patches by hand [Qi, Long, Achour, Rinard ISSTA’15] [Martinez, Durieux, Sommerard, Xuan, Monperrus ESEM’15]
2. Have others look at the produced patches by hand [Fry, Landau, Weimer ISSTA’12] [Kim, Nam, Song, Kim ICSE’13]
3. Produce patches with test suite T, then evaluate them on an independent test suite T' [Brun, Barr, Xiao, Le Goues, Devanbu 2013] [Smith, Barr, Le Goues, Brun ESEC/FSE 2015]
Only the third approach is objective and repeatable.
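
The third approach can also be automated. Below is a minimal sketch of the idea, assuming each held-out test is a callable that takes the patched program and reports pass or fail; all names here are hypothetical.

    def patch_quality(patched_program, held_out_suite):
        """Objective quality score: the fraction of independent, held-out
        tests (T') that a patch passes. The patch itself was produced
        using only the training suite T."""
        passed = sum(1 for test in held_out_suite if test(patched_program))
        return passed / len(held_out_suite)

    # Scoring the overfit median patch from the earlier sketch:
    held_out = [lambda f: f(2, 8, 4) == 4,  # fails: the patch returns 2
                lambda f: f(4, 8, 2) == 4]  # passes, but only by luck
    print(patch_quality(median_patched, held_out))  # 0.5

A genuinely correct repair scores near 1.0 on T'; a patch that merely memorizes T scores much lower.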

IntroClass benchmark: Evaluating quality this way requires a large set of bugs in programs that have two independent test suites, and the test suites need to be good. IntroClass provides 998 bugs (450 unique) in very small, student-written C programs, each with both a KLEE-generated test suite and a human-written test suite. http://repairbenchmarks.cs.umass.edu [Le Goues, Holtschulte, Smith, Brun, Devanbu, Forrest, Weimer TSE’15]

The cobra effect: The British government was concerned about the number of venomous cobras in Delhi, so it put a bounty on cobra heads. Business-minded individuals began breeding cobras to harvest their heads. When the government cancelled the program, the breeders released their now-worthless snakes, greatly increasing the cobra population. Optimizing a proxy metric, such as passing tests, can similarly make the real problem worse.

Do GenProg and TrpAutoRepair patches pass kept-out tests? Overfitting is common! Repair quality is not to be taken for granted.

[Figure: distribution of patch quality on kept-out tests for GenProg and TrpAutoRepair]

More GenProg and TrpAutoRepair findings:
- The better the test suite’s coverage, the better the patch.
- APR causes harm to high-quality programs, but is helpful for low-quality programs.
- Human-written tests lead to better patches.
More answers and details in “Is the Cure Worse Than the Disease? Overfitting in Automated Program Repair” by Smith, Barr, Le Goues, Brun, ESEC/FSE 2015.
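
The coverage finding is straightforward to check on your own data. A minimal sketch, assuming you have a coverage number for each training suite and a held-out quality score for each resulting patch; the values below are purely illustrative, not measurements from the study.

    from statistics import correlation  # Python 3.10+

    # coverage[i]: statement coverage of the suite that produced patch i
    # quality[i]:  fraction of held-out tests that patch i passes
    coverage = [0.45, 0.62, 0.71, 0.83, 0.90]  # illustrative values only
    quality = [0.50, 0.58, 0.66, 0.81, 0.88]   # illustrative values only

    print(correlation(coverage, quality))  # a positive r supports the finding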

Why study patch quality? To build repair techniques that produce high-quality repairs!

Can we improve patch quality? Recent work:
- SearchRepair [Ke, Stolee, Le Goues, Brun ASE’15]
- SPR [Long and Rinard ESEC/FSE’15]
- Prophet [Long and Rinard POPL’16, ICSE’16]
SearchRepair, SPR, and Prophet produce higher-quality patches than GenProg, TrpAutoRepair, and AE.

Applicability: What kinds of defects can automated repair fix? Can it repair defects that are hard for humans to fix? Can it repair defects humans consider important? Can it synthesize feature requests?

Defect sets: ManyBugs [Le Goues, Holtschulte, Smith, Brun, Devanbu, Forrest, Weimer TSE’15] and Defects4J [Just, Jalali, Ernst ISSTA’14].

How do you measure importance and related properties?
- Importance: priority, time from report to fix, number of project versions affected, etc.
- Complexity: size of the developer-written patch.
- Test effectiveness: relevant tests, triggering tests, coverage, etc.
- Independence: does fixing the defect depend on other defects?
We developed a methodology for measuring a defect’s importance, complexity, test-suite effectiveness, and independence using bug trackers, version-control history, and human-written patches, as sketched below.
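
A minimal sketch of the flavor of these metrics, with hypothetical field names; the actual methodology mines real bug trackers, version-control histories, and developer-written patches.

    def defect_metrics(report, human_patch_diff):
        """Hypothetical inputs: `report` is a dict of bug-tracker fields
        (dates are datetime.date objects); `human_patch_diff` is the
        developer-written fix as a unified diff."""
        days_to_fix = (report["fixed_on"] - report["reported_on"]).days
        patch_size = sum(
            1 for line in human_patch_diff.splitlines()
            if line.startswith(("+", "-"))
            and not line.startswith(("+++", "---")))
        return {
            "priority": report["priority"],                # importance
            "days_to_fix": days_to_fix,                    # importance
            "versions_affected": len(report["versions"]),  # importance
            "patch_size": patch_size,                      # complexity
        }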

Seven repair techniques (GenProgJ and KaliJ are Java reimplementations of GenProg and Kali):
C: GenProg, TrpAutoRepair, AE, Kali, SPR, Prophet
Java: GenProgJ, KaliJ, Nopol
Our focus was measuring defect properties, so we used prior results for which defects each technique can repair.

Importance, complexity, tests:
- APR is more likely to patch defects of higher priority. (surprising!)
- Human time-to-fix did not correlate with APR’s ability to patch. (encouraging!)
- APR is less likely to repair defects whose human-written patches were large, but it did fix some hard defects.
- More tests can reduce the ability to produce patches. (concerning!)

Quality and patch attributes:
- APR* is less likely to produce correct patches for more complex defects, but the effect is smaller for high-quality patches than for low-quality ones. (surprising and encouraging!)
- Higher-coverage test suites resulted in higher-quality patches.
- APR struggled with inserting loops, inserting new variables, inserting new function calls, and changing function signatures.
*Only Prophet and SPR produce a sufficient number of high-quality patches to evaluate.
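
One rough, hypothetical way to flag such hard edit kinds is to scan the lines a patch adds in its unified diff; the regular expressions below are a crude illustration for C code, not the classifiers used in the study, and real tooling would inspect the AST instead.

    import re

    # Crude, illustrative patterns for C code.
    HARD_EDITS = {
        "inserts loop": re.compile(r"\b(for|while)\s*\("),
        "inserts variable": re.compile(r"\b(int|long|float|double|char)\b[^;=]*[;=]"),
        "inserts call": re.compile(r"\b\w+\s*\([^)]*\)\s*;"),
    }

    def hard_edit_kinds(unified_diff):
        """Report which hard edit kinds appear on lines the patch adds."""
        added = [line[1:] for line in unified_diff.splitlines()
                 if line.startswith("+") and not line.startswith("+++")]
        return {kind for kind, pattern in HARD_EDITS.items()
                if any(pattern.search(line) for line in added)}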

Contributions
Repair quality:
- Research is needed that focuses on repair quality.
- A repeatable, automated, objective methodology for evaluating automated repair quality, including the IntroClass dataset.
Repair applicability:
- The defects humans find difficult and the defects automated repair finds difficult are different.
- A methodology and dataset for measuring repair applicability to hard and important defects.