EMPIRICAL EVALUATION OF INNOVATIONS IN AUTOMATIC REPAIR CLAIRE LE GOUES DARPA SITE VISIT FEBRUARY 7, 2013
“Benchmarks set standards for innovation, and can encourage or stifle it.” -Blackburn et al. 2
2009: 15 papers on automatic program repair*
2011: Dagstuhl seminar on self-repairing programs
2012: 30 papers on automatic program repair*
2013: dedicated program repair track at ICSE
*Manually reviewed the results of an ACM Digital Library search for "automatic program repair"
3 AUTOMATIC PROGRAM REPAIR OVER TIME
Manually sift through bugtraq data. Indicative example: the Axis project for automatically repairing concurrency bugs took 9 weeks of sifting to find 8 bugs to study. Direct quote from Charles Zhang, senior author, on the process: "it's very painful." As a result, it is very difficult to compare against previous or related work or to generate sufficiently large datasets. 4 CURRENT APPROACH
GOAL: HIGH-QUALITY EMPIRICAL EVALUATION 5
SUBGOAL: HIGH-QUALITY BENCHMARK SUITE 6
Indicative of important real-world bugs, found systematically in open-source programs.
Support a variety of research objectives:
- "Latitudinal" studies: many different types of bugs and programs.
- "Longitudinal" studies: many iterative bugs in one program.
Scientifically meaningful: passing the test cases does not, by itself, mean the bug is repaired.
Admit push-button, simple integration with tools like GenProg.
7 BENCHMARK REQUIREMENTS
Goal: a large set of important, reproducible bugs in non-trivial programs. Approach: use historical data to approximate discovery and repair of bugs in the wild. SYSTEMATIC BENCHMARK SELECTION 9
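As a rough illustration of that approach, the following Python sketch walks consecutive pairs of revisions in a version-control history, rebuilds each, and keeps pairs where at least one test flips from failing to passing. The helper names (build, run_test, the test list) are hypothetical stand-ins, not the project's actual benchmark scripts.

import subprocess

def git(repo, *args):
    """Run a git command in the repository and return its stdout."""
    out = subprocess.run(["git", "-C", repo, *args],
                         capture_output=True, text=True, check=True)
    return out.stdout

def test_results(repo, revision, build, run_test, tests):
    """Check out one revision, build it, and record pass/fail for every test."""
    git(repo, "checkout", "--force", revision)
    if not build(repo):
        return None                          # skip revisions that do not build
    return {t: run_test(repo, t) for t in tests}

def candidate_bugs(repo, revisions, build, run_test, tests):
    """Keep (buggy, fixed) revision pairs where some test flips fail -> pass."""
    pairs = []
    for parent, child in zip(revisions, revisions[1:]):
        before = test_results(repo, parent, build, run_test, tests)
        after = test_results(repo, child, build, run_test, tests)
        if before is None or after is None:
            continue
        fixed = [t for t in tests if not before[t] and after[t]]
        broken = [t for t in tests if before[t] and not after[t]]
        if fixed and not broken:             # child plausibly repairs a defect
            pairs.append((parent, child, fixed))
    return pairs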
Indicative of important real-world bugs, found systematically in open-source programs: add new programs to the set, with as wide a variety of types as possible (supporting "latitudinal" studies).
Support a variety of research objectives: allow studies of iterative bugs, development, and repair by generating a very large set of bugs (100) in one program, php (supporting "longitudinal" studies).
10 NEW BUGS, NEW PROGRAMS
Program    LOC        Tests  Bugs  Description
fbc        97,000            3     Language (legacy)
gmp        145,000           2     Multiple precision math
gzip       491,000    12     5     Data compression
libtiff    77,000            24    Image manipulation
lighttpd   62,000            9     Web server
php        1,046,000         100   Language (web)
python     407,000           11    Language (general)
wireshark  2,814,000         7     Network packet analyzer
valgrind   711,000           2     Simulator and debugger
vlc        522,000    17     ??    Media player
svn        629,000    1,748  ??    Source control
Total      7,001,000
They must exist. Sometimes, but not always, true (see: Jonathan Dorn) 13 TEST CASE CHALLENGES
Program    LOC        Tests  Bugs  Description
fbc        97,000            3     Language (legacy)
gmp        145,000           2     Multiple precision math
gzip       491,000    12     5     Data compression
libtiff    77,000            24    Image manipulation
lighttpd   62,000            9     Web server
php        1,046,000         100   Language (web)
python     407,000           11    Language (general)
wireshark  2,814,000         7     Network packet analyzer
valgrind   711,000           2     Simulator and debugger
Total      5,850,000         163
14 BENCHMARKS
They must exist. Sometimes, but not always, true (see: Jonathan Dorn).
They should be of high quality. This has been a challenge from day 0: nullhttpd. Lincoln Labs noticed it too: sort. In both cases, adding test cases led to better repairs.
15 TEST CASE CHALLENGES
They must exist. Sometimes, but not always, true (see: Jonathan Dorn).
They should be of high quality. This has been a challenge from day 0: nullhttpd. Lincoln Labs noticed it too: sort. In both cases, adding test cases led to better repairs.
They must be automated, able to run one at a time, programmatically, from within another framework.
16 TEST CASE CHALLENGES
Need to be able to compile and run new variants programmatically. Need to be able to run test cases one at a time. This is not simple, and it becomes increasingly tricky as we scale up to real-world systems. Much of the challenge is unrelated to the program in question, instead requiring highly technical knowledge of OS-level details. A sketch of such a harness follows. 17 PUSH-BUTTON INTEGRATION
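A minimal sketch of what "push-button" means in practice, assuming a per-program make target and a hypothetical run-test.sh wrapper; the real harnesses differ per benchmark program.

import subprocess

def compile_variant(variant_dir, timeout=300):
    """Build the candidate patch in place; True only if compilation succeeds."""
    result = subprocess.run(["make", "-C", variant_dir],
                            capture_output=True, timeout=timeout)
    return result.returncode == 0

def run_single_test(variant_dir, test_id, timeout=60):
    """Run exactly one test case; report pass/fail without ever raising."""
    try:
        result = subprocess.run(["./run-test.sh", str(test_id)],
                                cwd=variant_dir, capture_output=True,
                                timeout=timeout)
    except subprocess.TimeoutExpired:
        return False                          # hung variants count as failures
    return result.returncode == 0

def evaluate(variant_dir, tests):
    """Fitness signal for the repair loop: number of tests the variant passes."""
    if not compile_variant(variant_dir):
        return 0
    return sum(run_single_test(variant_dir, t) for t in tests)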
Calling a process from within another process: system("run test 1"); ...; wait(). wait() returns the process exit status, and interpreting it is more complex than it looks. Example: a system() call can fail because the OS ran out of memory while creating the process, or because the process itself ran out of memory. How do we tell the difference? Answer: bit masking on the status word, as sketched below. 18 DIGRESSION ON WAIT()
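A small sketch of that distinction, assuming the standard POSIX encoding of the wait() status word (exit code in the second byte, terminating signal in the low bits), shown in Python via the os module with the equivalent raw bit masks in comments; the test command is hypothetical.

import os

def describe_status(status):
    """Distinguish 'child exited with code N' from 'child was killed by signal M'."""
    if os.WIFEXITED(status):
        return ("exited", os.WEXITSTATUS(status))        # raw form: (status >> 8) & 0xff
    if os.WIFSIGNALED(status):
        return ("killed by signal", os.WTERMSIG(status))  # raw form: status & 0x7f
    return ("stopped or unknown", status)

# On Unix, os.system() returns the encoded wait() status, so the harness can
# tell a test that failed apart from a test runner that never ran at all.
status = os.system("./run-test.sh 1")   # hypothetical test command
print(describe_status(status))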
Moral: integration is tricky and lends itself to human mistakes. Possibility 1: the original programmers make mistakes in developing the test suite; test cases can have bugs, too. Possibility 2: we (GenProg devs/users) make mistakes in integration. A few old php test cases were not up to our standards: faulty bitshift math for extracting the return-value components. 19 REAL-WORLD COMPLEXITY
Interested in more, and better, benchmark design with easy integration (without gnarly OS details); virtual machines provide one approach. Need a better definition of "high-quality test case" vs. "low-quality test case" (see the sanity checks sketched below):
- Can the empty program pass it?
- Can every program pass it?
- Can the "always crashes" program pass it?
20 INTEGRATION CONCERNS
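One possible way to operationalize those questions, as a rough sketch: run every test against straw-man programs (an empty program and an always-crashing one) and flag any test they can pass. The command strings and the PROGRAM_UNDER_TEST environment variable are illustrative, not part of the actual benchmark tooling.

import os
import subprocess

STRAW_MEN = {
    "empty": "true",                             # does nothing and exits 0
    "always_crash": "sh -c 'kill -SEGV $$'",     # dies immediately with SIGSEGV
}

def passes(test_cmd, program_cmd):
    """Run one test with the program under test replaced by a straw man."""
    env = dict(os.environ, PROGRAM_UNDER_TEST=program_cmd)
    result = subprocess.run(test_cmd, shell=True, env=env, capture_output=True)
    return result.returncode == 0

def low_quality_tests(test_cmds):
    """A test that a straw-man program can pass does not constrain repairs."""
    flagged = {}
    for test in test_cmds:
        offenders = [name for name, prog in STRAW_MEN.items()
                     if passes(test, prog)]
        if offenders:
            flagged[test] = offenders
    return flagged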
Over the past year, we have conducted studies of representation and operators for automatic program repair:
- One-point crossover on the patch representation (sketched below).
- Non-uniform mutation operator selection (sketched below).
- An alternative fault localization framework.
Results on the next slide incorporate "all the bells and whistles": improvements based on those large-scale studies, with the quality of the testing framework confirmed manually.
21 CURRENT REPAIR SUCCESS
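For concreteness, an illustrative Python sketch of the first two ideas, one-point crossover over the patch representation and non-uniform operator selection; GenProg's actual implementation is in OCaml, and the weights shown are made up.

import random

def one_point_crossover(parent_a, parent_b):
    """Patches are edit lists; children swap tails at a random cut point."""
    cut_a = random.randint(0, len(parent_a))
    cut_b = random.randint(0, len(parent_b))
    return (parent_a[:cut_a] + parent_b[cut_b:],
            parent_b[:cut_b] + parent_a[cut_a:])

# Made-up weights: e.g. favor cheap deletions over insertions and replacements.
OPERATOR_WEIGHTS = {"delete": 0.5, "insert": 0.3, "replace": 0.2}

def pick_operator():
    """Non-uniform mutation operator selection."""
    ops, weights = zip(*OPERATOR_WEIGHTS.items())
    return random.choices(ops, weights=weights, k=1)[0]

def mutate(patch, fault_locations):
    """Append one new edit at a statement suggested by fault localization."""
    return patch + [(pick_operator(), random.choice(fault_locations))]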
Program    Previous Results  Current Results
fbc        1/3               1/3
gmp        1/2               1/2
gzip       1/5               1/5
libtiff    17/24             17/24
lighttpd   5/9               5/9
php        28/44             55/100
python     1/11              2/11
wireshark  1/7               4/7
valgrind   ---               1/2
Total      55/105            87/163
22
TRANSITION 23
REPAIR TEMPLATES 24 CLAIRE LE GOUES SHIRLEY PARK DARPA SITE VISIT FEBRUARY 7, 2013
BIO + CS INTERACTION 25
Immune response is equally fast for large and small animals: the human lung is 100x larger than a mouse lung, yet it still finds influenza infections in ~8 hours. The immune system successfully balances local search and global response, and it balances generic against specialized T-cells: rapid response to new pathogens vs. long-term memory of previous infections (cf. vaccines). IMMUNOLOGY: T-CELLS 26
[Diagram: the GenProg repair loop: INPUT -> MUTATE -> EVALUATE FITNESS -> ACCEPT or DISCARD -> OUTPUT] 27
Tradeoff between generic mutation actions and more specific action templates. Generic: INSERT, DELETE, REPLACE. Specific: if (<expr> != NULL) { <stmt> }. A sketch of the difference follows. AUTOMATIC SOFTWARE REPAIR 28
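A toy sketch of that tradeoff: a template carries syntactic structure with named holes, which get filled from expressions in scope at the fault location, whereas generic operators only move statements that already exist. The strings and helper below are illustrative; GenProg manipulates C ASTs rather than text.

import random

GENERIC_OPS = ["INSERT", "DELETE", "REPLACE"]        # act on existing statements

# A specific template carries structure with named holes.
NULL_CHECK_TEMPLATE = "if ({expr} != NULL) {{ {stmt} }}"

def instantiate_null_check(pointer_exprs, guarded_stmt):
    """Fill the holes with a pointer in scope and the statement being guarded."""
    expr = random.choice(pointer_exprs)
    return NULL_CHECK_TEMPLATE.format(expr=expr, stmt=guarded_stmt)

# Example: guard a dereference implicated by the failing test case.
print(instantiate_null_check(["buf", "node->next"], "len = strlen(buf);"))
# possible output: if (buf != NULL) { len = strlen(buf); }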
HYPOTHESIS: GENPROG CAN REPAIR MORE BUGS, AND REPAIR BUGS MORE QUICKLY, IF WE AUGMENT MUTATION ACTIONS WITH “REPAIR TEMPLATES.” 29
Insight: just as T-cells "remember" previous infections, abstract previous fixes to generate new mutations. Approach: model previous changes using structured documentation; cluster a large set of changes by similarity; abstract the center of each cluster into a template (sketched below). Example: if (<expr> < 0) return 0; else ... 30 OPTION 1: PREVIOUS CHANGES
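A rough sketch of the cluster-then-abstract idea under simplifying assumptions (token bags instead of real AST diffs, a hand-rolled similarity measure, a tiny keyword list); it only illustrates the pipeline, not the actual tooling.

from collections import Counter

C_KEYWORDS = {"if", "else", "return", "while", "for"}   # tiny illustrative list

def features(change_tokens):
    """Represent a historical change as a bag of its tokens."""
    return Counter(change_tokens)

def similarity(a, b):
    """Overlap coefficient between two token bags."""
    shared = sum((a & b).values())
    return shared / max(1, min(sum(a.values()), sum(b.values())))

def medoid(cluster):
    """The change most similar to the rest stands in for the whole cluster."""
    return max(cluster, key=lambda c: sum(similarity(features(c), features(o))
                                          for o in cluster))

def abstract(change_tokens):
    """Replace concrete identifiers with holes to obtain a reusable template."""
    return ["<expr>" if t.isidentifier() and t not in C_KEYWORDS else t
            for t in change_tokens]

# abstract(["if", "(", "len", "<", "0", ")", "return", "0", ";"])
# -> ['if', '(', '<expr>', '<', '0', ')', 'return', '0', ';']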
Insight: looking things up in a library gives people the best example of what they want to reproduce; existing code plays the same role for API usage. Approach: generate static paths through C programs; mine API usage patterns from those paths (sketched below); abstract the patterns into mutation templates. Example: while (it.hasNext()) { ... } 31 OPTION 2: EXISTING BEHAVIOR
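A small illustrative sketch of the mining step, assuming the static paths have already been extracted as lists of called function names; the support threshold is arbitrary.

from collections import Counter
from itertools import combinations

def frequent_pairs(paths, min_support=3):
    """Count ordered call pairs that co-occur on a path; keep the common ones."""
    counts = Counter()
    for path in paths:
        seen = set()
        for first, second in combinations(path, 2):   # pairs in path order
            if (first, second) not in seen:
                counts[(first, second)] += 1
                seen.add((first, second))
    return {pair: n for pair, n in counts.items() if n >= min_support}

# A frequent pair such as ("hasNext", "next") or ("fopen", "fclose") becomes a
# candidate template: if the first call appears near the fault without the
# second, propose inserting the second.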
THIS WORK IS ONGOING. 32
We are generating a benchmark suite to support GenProg research, integration and tech transfer, and the automatic repair community at large. Current GenProg results for the 12-hour repair scenario: 87/163 (53%) of real-world bugs in the dataset. Repair templates will augment GenProg's mutation operators to help repair more bugs, and repair them more quickly. 33 CONCLUSIONS