50.530: Software Engineering


50.530: Software Engineering Sun Jun SUTD

Week 3: Delta Debugging

Debugging
Question 1: Bug Localization/Identification. How do we know where the bugs are?
Question 2: Bug Fixing. How do we fix the bugs automatically?
[Figure: regions A, B, C relating "the behaviors we wanted" to "the behaviors we have"]

“Yesterday, my program worked. Today, it does not. Why?” Andreas Zeller, ESEC/FSE 1999

Motivation “The GDB people have done it again. The new release 4.17 of the GNU debugger [6] brings several new features, languages, and platforms, but for some reason, it no longer integrates properly with my graphical front-end DDD [10]: the arguments specified within DDD are not passed to the debugged program. Something has changed within GDB such that it no longer works for me. Something? Between the 4.16 and 4.17 releases, no less than 178,000 lines have changed. How can I isolate the change that caused the failure and make GDB work again?” Andreas Zeller, 1999

Research Question Assuming we know that the bug is due to a finite set of changes, how do we determine which change(s) are responsible?

Regression Containment
let O be the original working program;
let N be the new buggy program;
let changes be the sequence of changes from O to N;
while (true) {
    apply the first half of the changes between O and N to O and get ON;
    if (ON is working) {
        let O := ON;
    } else {
        let N := ON;
    }
    if (the number of changes from O to N is 1) return;
}
What assumptions are needed for this to work?
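The loop above is a binary search over an ordered change sequence. A minimal Python sketch follows, assuming a hypothetical oracle `works(prefix)` (not part of the slides) that builds the original program plus a prefix of the changes and reports whether the test passes. It also assumes that the changes are totally ordered, that every prefix yields a testable program, and that a single change is responsible.

```python
from typing import Callable, Sequence


def regression_containment(changes: Sequence[str],
                           works: Callable[[Sequence[str]], bool]) -> str:
    """Binary search for the first change that breaks the program.

    `works(prefix)` is a hypothetical oracle: True if the original program
    plus this prefix of changes passes the test. Assumes `changes` is
    non-empty, works([]) is True and works(changes) is False.
    """
    lo, hi = 0, len(changes)       # prefix of length lo works, length hi fails
    while hi - lo > 1:
        mid = (lo + hi) // 2       # apply the first half of the remaining changes
        if works(changes[:mid]):
            lo = mid               # O := ON
        else:
            hi = mid               # N := ON
    return changes[hi - 1]         # the one change separating O from N


# Made-up example: eight ordered changes, where "c6" introduces the failure.
example_changes = [f"c{i}" for i in range(1, 9)]
culprit = regression_containment(example_changes,
                                 lambda prefix: "c6" not in prefix)
print(culprit)
```

With n changes the sketch calls the oracle about log2(n) times, which is why the assumptions matter: if a prefix cannot be built, or if two changes only fail together, the search can stop on the wrong change.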

Complications
Interference: There may not be one single change responsible for a failure, but a combination of several changes.
Inconsistency: In parallel development, combinations of changes may not result in a testable program.
Granularity: A single logical change may affect several hundred or even thousand lines of code, but only a few lines may be responsible for the failure.

Delta Debugging Find a minimal set of changes that cause a program to fail a test case, by automatically generating test cases. Changes could be anything.

Delta Debugging
Let {a, b, c, …, n} be the set of changes.
ON: any subset X of the changes applied
O: the empty set of changes applied
N: all changes applied

Testing
We assume a function test which produces one of three outputs:
PASS: the test succeeds
FAIL: the test produced the failure it was intended to capture
?: the test produced indeterminate results
Let X be a set of changes and let test(X) denote the testing result with the changes in X applied.

Objective
Find a set of changes X such that
test(X) != PASS, i.e., X is failure-inducing, and
test(Y) != FAIL for every proper subset Y of X, i.e., X is a minimum set.
ON: minimum failure-inducing changes
The complexity is exponential. Why?
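One way to see the exponential cost: a set of n changes has 2^n subsets, and without further assumptions any of them may be the minimum failing set. The sketch below is a brute-force search that enumerates subsets by increasing size. It treats "failure-inducing" as test(X) = FAIL, and `example_test` is a made-up test function in which changes 3 and 6 fail only together.

```python
from itertools import combinations

PASS, FAIL = "PASS", "FAIL"


def brute_force_minimum(changes, test):
    """Return (smallest failing subset or None, number of tests run)."""
    changes = list(changes)
    tested = 0
    for size in range(len(changes) + 1):          # smallest subsets first
        for subset in combinations(changes, size):
            tested += 1
            if test(frozenset(subset)) == FAIL:
                return frozenset(subset), tested
    return None, tested                           # all 2**n subsets passed


def example_test(x):
    # Hypothetical failure: changes 3 and 6 only fail in combination.
    return FAIL if {3, 6} <= x else PASS


result, tested = brute_force_minimum(range(1, 9), example_test)
print(sorted(result), tested)
```

In the worst case (the failure needs all n changes, or never occurs) the loop runs all 2^n tests, which is what the simplifying assumptions on the next slide are meant to avoid.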

Assumptions for Simplification
Monotonicity: If test(X) = FAIL, then test(Y) != PASS for all Y which is a superset of X.
Unambiguity: If test(X) = FAIL and test(Y) = FAIL, then test(X intersect Y) != PASS, i.e., a failure is caused by one change set (and not independently by two disjoint sets).
Consistency: test(X) != ? for all X.
Justified?
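For a small set of changes, the three assumptions can be checked exhaustively. The sketch below does so directly from the definitions. It is only an illustration, since it runs the test on every subset, and the two example test functions are made up.

```python
from itertools import combinations


def all_subsets(universe):
    """Yield every subset of `universe` as a frozenset."""
    items = sorted(universe)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def check_assumptions(universe, test):
    """Check monotonicity, unambiguity and consistency by brute force."""
    subsets = list(all_subsets(universe))
    outcome = {s: test(s) for s in subsets}
    failing = [s for s in subsets if outcome[s] == "FAIL"]
    return {
        # every superset of a failing set must not pass
        "monotonicity": all(outcome[y] != "PASS"
                            for x in failing for y in subsets if x <= y),
        # the intersection of two failing sets must not pass
        "unambiguity": all(outcome[x & y] != "PASS"
                           for x in failing for y in failing),
        # no subset may be unresolved
        "consistency": all(o != "?" for o in outcome.values()),
    }


# Hypothetical failure caused by changes 2 and 3 together.
single_cause = check_assumptions({1, 2, 3, 4},
                                 lambda x: "FAIL" if {2, 3} <= x else "PASS")
# Hypothetical failure caused independently by change 1 or by change 4.
two_causes = check_assumptions({1, 2, 3, 4},
                               lambda x: "FAIL" if (1 in x or 4 in x) else "PASS")
print(single_cause)
print(two_causes)
```

The second example is monotonic and consistent but ambiguous: {1} and {4} both fail while their intersection, the empty set, passes.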

DD Algorithm: Example

DD Algorithm
algorithm DD(U, R) {
    if (U has one element only) {
        return U;
    }
    partition U into X and Y equally;
    if (test(X union R) = FAIL) {
        return DD(X, R);
    }
    if (test(Y union R) = FAIL) {
        return DD(Y, R);
    }
    return DD(X, Y union R) union DD(Y, X union R);
}
U: a set of changes; R: changes that remain to be applied
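The pseudocode translates almost line by line into Python. The sketch below assumes that changes are sortable values such as integers, that `test` takes a set and returns "PASS" or "FAIL", and that the three simplifying assumptions hold. The two example test functions are made up for illustration.

```python
PASS, FAIL = "PASS", "FAIL"


def dd(u, r, test):
    """Return the failure-inducing changes in u, with r always applied."""
    u = sorted(u)
    if len(u) == 1:
        return set(u)
    mid = len(u) // 2
    x, y = set(u[:mid]), set(u[mid:])       # partition U into two halves
    if test(x | r) == FAIL:                 # failure found in X
        return dd(x, r, test)
    if test(y | r) == FAIL:                 # failure found in Y
        return dd(y, r, test)
    # Interference: each half needs changes from the other half to fail.
    return dd(x, y | r, test) | dd(y, x | r, test)


def single_change(x):
    # Hypothetical: change 7 alone causes the failure.
    return FAIL if 7 in x else PASS


def combined_changes(x):
    # Hypothetical: changes 3 and 6 only fail in combination.
    return FAIL if {3, 6} <= x else PASS


found_single = dd(set(range(1, 9)), set(), single_change)
found_pair = dd(set(range(1, 9)), set(), combined_changes)
print(found_single, found_pair)
```

In the second example neither half fails on its own, so the last line of `dd` searches each half while keeping the other half applied through `r`.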

Exercise 0: Complete the Call Graph DD({1..8}, {}) DD({1..4}, {5..8}) DD({5..8}, {1..4})

Theory Theorem: Assuming monotonicity, unambiguity and consistency, algorithm DD always returns the minimum failure-inducing set of changes.

Informal Proof
algorithm DD(U, R) {
    if (U has one element only) {
        return U;
    }
    partition U into X and Y equally;
    if (test(X union R) = FAIL) {
        return DD(X, R);
    }
    if (test(Y union R) = FAIL) {
        return DD(Y, R);
    }
    return DD(X, Y union R) union DD(Y, X union R);
}
Slide annotations on the steps of the algorithm: three steps are marked "Justified by monotonicity and unambiguity" and one step is marked "Justified by consistency".

Ambiguity
Unambiguity: If test(X) = FAIL and test(Y) = FAIL, then test(X intersect Y) != PASS, i.e., a failure is caused by one change set (and not independently by two disjoint sets).
What if it is ambiguous? DD will find one failure-inducing set of changes. Remove this set and apply DD again to find another.

Not Monotonic Monotonicity: If test(X) = FAIL, test(Y) != PASS for all Y which is a superset of X. What if it is not monotonic? For instance, what if test({1,2}) = FAIL but test({1..4}) = PASS? DD will find some failure-inducing changes.

Inconsistency
Consistency: test(X) != ? for all X
Integration failure: A change may require earlier changes that are not included in the configuration.
Construction failure: Although all changes can be applied, the resulting program has syntactical or semantic errors, such that construction fails.
Execution failure: The program does not execute correctly; the test outcome is unresolved.
Shall we enforce consistency for each commit of changes?

Research Question
How do we modify the DD algorithm so as to handle inconsistency?
Found: If test(X) = FAIL, X contains a failure-inducing subset.
Interference: If test(X) = PASS = test(Y), X and Y form an interference.
What if test(X) = ? or test(Y) = ? or both?

Example At step 3, can we omit {5..8}?

Example Preference: If test(X) =? and test(U-X) = PASS, X contains a failure-inducing subset and is preferred.

Example Scenario: there are 8 changes, 1..8. Change 8 is failure-inducing, and changes 2, 3 and 7 imply each other; that is, they can only be applied as a whole.

Example

Example Is it necessary to have {5,6}?

DD+ Algorithm
algorithm DD+(U, R, N) {
    if (U has one element only) { // found
        return U;
    }
    partition U into N sets X1, X2, …, XN equally;
    if (test(Xi union R) = FAIL for some Xi) { // found in Xi
        return DD+(Xi, R, 2);
    }
    if (test(Xi union R) = PASS and test((U – Xi) union R) = PASS for some Xi) { // interference
        return DD+(Xi, (U – Xi) union R, 2) union DD+(U – Xi – R, Xi union R, 2);
    }
    if (test(Xi union R) = ? and test((U – Xi) union R) = PASS for some Xi) { // preference
        return DD+(Xi, (U – Xi) union R, 2);
    }
    let U’ = U intersect {U – Xi – R | test((U – Xi) union R) = FAIL};
    if (N < |U|) { // try again
        return DD+(U’, R union {Xi | test(Xi union R) = PASS}, min(|U’|, 2N));
    }
    return U’; // nothing left
}

Exercise 1 Convince yourself by applying the algorithm (i.e., DD+({1..8}, {}, 2)) to the following scenario. There are 8 changes, 1..8. Change 8 is failure-inducing, and changes 2, 3 and 7 imply each other; that is, they can only be applied as a whole.

Reducing Inconsistency What if we know that 2 and 3 and 7 imply each other? Partition the set into two sets {1,2,3,7} and {4,5,6,8}. How do we know what changes imply each other?

Grouping Related Changes
To determine whether changes are related, one can use
process criteria: common change dates or sources,
location criteria: the affected file or directory,
lexical criteria: common referencing of identifiers,
syntactic criteria: common syntactic entities (functions, modules) affected by the change,
semantic criteria: common program statements affected by the changed control or data flow.
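The process and location criteria are the simplest to automate. The sketch below groups changes by an arbitrary key. The `Change` record, its fields, and the sample paths and dates are all hypothetical and serve only to illustrate the idea. Each resulting group could then be treated as a single unit when partitioning the change set.

```python
import posixpath
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    ident: int   # change number
    path: str    # file the change touches (hypothetical field)
    date: str    # commit date in ISO format (hypothetical field)


def group_changes(changes, key):
    """Group change identifiers by the value of key(change)."""
    groups = defaultdict(list)
    for change in changes:
        groups[key(change)].append(change.ident)
    return dict(groups)


# Made-up sample data.
example = [
    Change(1, "src/a.c", "1998-11-02"),
    Change(2, "src/b.c", "1998-11-02"),
    Change(3, "src/a.c", "1998-11-20"),
    Change(4, "lib/c.c", "1998-12-01"),
]
by_date = group_changes(example, key=lambda c: c.date)                     # process
by_file = group_changes(example, key=lambda c: c.path)                     # location
by_dir = group_changes(example, key=lambda c: posixpath.dirname(c.path))   # location
print(by_date, by_file, by_dir, sep="\n")
```

The lexical, syntactic and semantic criteria need program analysis and are not covered by this sketch.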

Example Assume that each change requires all earlier changes to be consistent.

Case Study: DDD 3.1.2 Dumps Core DDD 3.1.2, released in December 1998, exhibited a nasty behavioral change: when invoked with the name of a nonexistent file, DDD 3.1.2 dumped core, while its predecessor DDD 3.1.1 simply gave an error message. The DDD configuration management archive lists 116 logical changes between the 3.1.1 and 3.1.2 releases. These changes were split into 344 textual changes to the DDD source.

Random Clustering: start with 344 changes; reduced to 172 changes; 8 subsets of changes; reduced to 16 changes; reduced to 1 change; 31 tests in total.

Grouping Changes Changes were grouped according to the date they were applied. Each change implied all earlier changes. 12 test runs and 58 minutes

Case Study: GDB 178,000 changed GDB lines, grouped into 8721 textual changes in the GDB source, with any two textual changes separated by at least two unchanged lines. No configuration management archive to obtain change dates, etc.

Random Clustering
Most of the first 457 tests resulted in ?.
At test 458, one set containing 36 changes resulted in FAIL.
470 tests in total; 48 hours.

Grouping Changes
At top-level, changes were grouped according to directories. This was motivated by the observation that several GDB directories contain a separate library whose interface remains more or less consistent across changes.
Within one directory, changes were grouped according to common files. The idea was to identify compilation units whose interface was consistent with both “yesterday’s” and “today’s” version.
Within a file, changes were grouped according to common usage of identifiers. This way, we could keep changes together that operated on common variables or functions.
Finally, a failure resolution loop: after a failing construction, it scans the error messages for identifiers, adds all changes that reference these identifiers, and tries again. This is repeated until construction is possible, or until there are no more changes to add.

Grouping Changes
After 9 tests, 2547 changes left.
After 280 tests, 18 changes left (of two files only).
289 tests in total, 20 hours.

Question Is Delta Debugging overkill?

“Simplifying and isolating failure-inducing input” Andreas Zeller, IEEE Transactions on Software Engineering, 2002

Motivation In July 1999, Bugzilla, the Mozilla bug database, listed more than 370 open bug reports—bug reports that were not even simplified. With this queue growing further, the Mozilla engineers “faced imminent doom”. Overwhelmed with work, the Netscape product manager sent out the Mozilla BugAThon call for volunteers: people who would help simplify bug reports. “Simplifying” meant: turning these bug reports into minimal test cases, where every part of the input would be significant in reproducing the failure.

Motivation: Example <td align=left valign=top> <SELECT NAME="op sys" MULTIPLE SIZE=7> <OPTION VALUE="All">All<OPTION VALUE="Windows 3.1">Windows 3.1<OPTION VALUE="Windows 95">Windows 95<OPTION VALUE="Windows 98">Windows 98<OPTION VALUE="Windows ME">Windows ME<OPTION VALUE="Windows 2000">Windows 2000<OPTION VALUE="Windows NT">Windows NT<OPTION VALUE="Mac System 7">Mac System 7<OPTION VALUE="Mac System 7.5">Mac System 7.5<OPTION VALUE="Mac System 7.6.1">Mac System 7.6.1<OPTION VALUE="Mac System 8.0">Mac System 8.0<OPTION VALUE="Mac System 8.5">Mac System 8.5<OPTION VALUE="Mac System 8.6">Mac System 8.6<OPTION VALUE="Mac System 9.x">Mac System 9.x<OPTION VALUE="MacOS X">MacOS X<OPTION VALUE="Linux">Linux<OPTION VALUE="BSDI">BSDI<OPTION VALUE="FreeBSD">FreeBSD<OPTION VALUE="NetBSD">NetBSD<OPTION VALUE="OpenBSD">OpenBSD<OPTION VALUE="AIX">AIX<OPTION VALUE="BeOS">BeOS<OPTION VALUE="HP-UX">HP-UX<OPTION VALUE="IRIX">IRIX<OPTION VALUE="Neutrino">Neutrino<OPTION VALUE="OpenVMS">OpenVMS<OPTION VALUE="OS/2">OS/2<OPTION VALUE="OSF/1">OSF/1<OPTION VALUE="Solaris">Solaris<OPTION VALUE="SunOS">SunOS<OPTION VALUE="other">other</SELECT> </td> <td align=left valign=top> <SELECT NAME="priority" MULTIPLE SIZE=7> <OPTION VALUE="--">--<OPTION VALUE="P1">P1<OPTION VALUE="P2">P2<OPTION VALUE="P3">P3<OPTION VALUE="P4">P4<OPTION VALUE="P5">P5</SELECT></td><td align=left valign=top> <SELECT NAME="bug severity" MULTIPLE SIZE=7> <OPTION VALUE="blocker">blocker<OPTION VALUE="critical">critical<OPTION VALUE="major">major<OPTION VALUE="normal">normal<OPTION VALUE="minor">minor<OPTION VALUE="trivial">trivial<OPTION VALUE="enhancement">enhancement</SELECT> </tr> </table> Loading this HTML page into Mozilla and printing it causes a segmentation fault.

Research Question How do we automatically minimize a test case?

Research Question
How do we automatically minimize a test case?
Solution: using Delta Debugging
Yesterday: test(“”) = PASS
Today: test(that HTML page) = FAIL
The set of changes = {insert one character}
Is this the best definition?

Example

Generalize DD
A change can be anything which changes the circumstances of a program run, e.g.,
program code changes (as we have seen)
program input changes (as we’re about to see)
system configurations?

Generalize DD
DD requires
a set of primitive changes (which can be composed)
a test function (with different outputs like PASS, FAIL, ?)
What should be the primitive changes?

Minimal Test Case Let U be the set of changes containing the primitive change of inserting one character at a position in the input. The problem is to find the minimum set X (which corresponds to the minimum input) such that test(X) = FAIL.

Minimality
Local minimum: a test case (represented as a set of changes X) is a local minimum if test(X) = FAIL and test(Y) != FAIL for every proper subset Y of X.
1-minimal test case: a test case X is 1-minimal if test(X) = FAIL and test(Y) != FAIL for every set Y obtained by removing exactly one element from X.
Why? Because a global minimum is hard to achieve.
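The 1-minimality condition needs only |X| + 1 tests to verify, which is what makes it a practical target. The sketch below checks it directly from the definition. Both test functions are made-up examples, and the second shows a set that is 1-minimal without being a local minimum.

```python
def is_one_minimal(x, test):
    """True if x fails and removing any single element makes it stop failing."""
    x = frozenset(x)
    if test(x) != "FAIL":
        return False
    return all(test(x - {c}) != "FAIL" for c in x)


def pair_test(x):
    # Hypothetical: changes 3 and 6 only fail in combination.
    return "FAIL" if {3, 6} <= x else "PASS"


def odd_test(x):
    # Hypothetical non-monotonic test: only {1, 2, 3} and {1} fail.
    return "FAIL" if x in (frozenset({1, 2, 3}), frozenset({1})) else "PASS"


print(is_one_minimal({3, 6}, pair_test))      # no element can be dropped
print(is_one_minimal({3, 6, 7}, pair_test))   # 7 can be dropped
print(is_one_minimal({1, 2, 3}, odd_test))    # 1-minimal, yet {1} also fails
```

In the last case every one-element removal passes, so {1, 2, 3} is 1-minimal, but the smaller failing set {1} shows it is not a local minimum.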

DDmin Algorithm
algorithm DDmin(U, N) {
    if (|U| = 1) { // minimum
        return U;
    }
    partition U into X1, X2, …, XN equally;
    if (test(Xi) = FAIL for some Xi) { // reduce to subset
        return DDmin(Xi, 2);
    }
    if (test(U - Xi) = FAIL for some Xi) { // reduce to complement
        return DDmin(U - Xi, max(N-1, 2));
    }
    if (N < |U|) { // increase granularity
        return DDmin(U, min(|U|, 2N));
    }
    return U; // cannot be reduced further
}
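A minimal Python sketch of DDmin follows. It assumes that changes are sortable values, that `test` takes a set and returns "PASS", "FAIL" or "?", and that test(U) = FAIL for the initial U. When no subset or complement fails and the granularity cannot be increased, the sketch returns U unchanged. The `partition` helper and both example test functions are illustrative assumptions.

```python
PASS, FAIL = "PASS", "FAIL"


def partition(items, n):
    """Split a list into n contiguous chunks whose sizes differ by at most one."""
    size, extra = divmod(len(items), n)
    chunks, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def ddmin(u, test, n=2):
    """Reduce the failing change set u to a 1-minimal failing subset."""
    u = sorted(u)
    if len(u) == 1:
        return set(u)
    chunks = partition(u, n)
    for chunk in chunks:                          # reduce to subset
        if test(set(chunk)) == FAIL:
            return ddmin(chunk, test, 2)
    for chunk in chunks:                          # reduce to complement
        complement = set(u) - set(chunk)
        if test(complement) == FAIL:
            return ddmin(complement, test, max(n - 1, 2))
    if n < len(u):                                # increase granularity
        return ddmin(u, test, min(len(u), 2 * n))
    return set(u)                                 # cannot be reduced further


def last_two(x):
    # Hypothetical: changes 7 and 8 together cause the failure.
    return FAIL if {7, 8} <= x else PASS


def scattered(x):
    # Hypothetical: changes 1, 7 and 8 together cause the failure.
    return FAIL if {1, 7, 8} <= x else PASS


min_last_two = ddmin(set(range(1, 9)), last_two)
min_scattered = ddmin(set(range(1, 9)), scattered)
print(min_last_two, min_scattered)
```

The first example only ever takes the "reduce to subset" branch, narrowing {1..8} to {5..8} and then to {7, 8}. The second needs the complement and granularity branches because the responsible changes are spread across both halves.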

Example DDmin({1..8}, 2) DDmin({5..8}, 2) DDmin({7..8}, 2)

Exercise Show how DDmin works for this example.

Theory
DDmin(U, 2) is guaranteed to return a 1-minimal test case.
algorithm DDmin(U, N) {
    if (|U| = 1) { // minimum
        return U;
    }
    partition U into X1, X2, …, XN equally;
    if (test(Xi) = FAIL for some Xi) { // reduce to subset
        return DDmin(Xi, 2);
    }
    if (test(U - Xi) = FAIL for some Xi) { // reduce to complement
        return DDmin(U - Xi, max(N-1, 2));
    }
    if (N < |U|) { // increase granularity
        return DDmin(U, min(|U|, 2N));
    }
    return U; // cannot be reduced further
}
U is reduced in the "reduce to subset" and "reduce to complement" cases.
If U can’t be reduced further, eventually we would test every subset of U with one less element.

Research Question Why not simply apply the DD+ algorithm?

Case Study: GCC This program causes GNU C compiler (GCC version 2.95.2 on Intel-Linux with optimization enabled) to crash. Each change inserts the i-th character into the program.

Case Study: GCC After 731 tests generated by DDmin (and 34 seconds) Does it make sense to reduce the method name?

Case Study: GCC GCC has 31 options. Can we find out which option is relevant? Yes, simply run DDmin again.

Case Study: GCC DDmin: after 731 test cases, the result is

Case Study: GCC This suggests a problem with inlining the expression i+j+1 in the array accesses z[i] on the following line.

Exercise 2 Take this program and this input as example. Apply delta debugging to locate the bug.

Research Question How would you improve Delta Debugging?

Research Question How would you improve Delta Debugging? Are all generated test cases meaningful? How do we partition the input?

Research Question How would you improve Delta Debugging? Which test case to run first?

Research Question How would you improve Delta Debugging? What if there are multiple failure-inducing inputs?

Discussion Assume we are given the following program and Delta Debugging run. What do we make of the result?

Research Question
How would you improve Delta Debugging?
How can Delta Debugging be used to identify which statement in a program is buggy? (For simplicity, assume there is only one buggy statement.)