1
Automated Fault Prediction: The Ins, The Outs, The Ups, The Downs
Elaine Weyuker
June 11, 2015
2
To determine which files of a large software system with multiple releases are likely to contain the largest numbers of bugs in the next release.
3
● Help testers prioritize testing efforts.
● Help developers decide when to do design and code reviews and what to reimplement.
● Help managers allocate resources.
4
Verified that bugs were non-uniformly distributed among files. Identified properties that were likely to affect fault-proneness, and then built a statistical model and ultimately a tool to make predictions.
5
● Size of file (KLOCs)
● Number of changes to the file in the previous 2 releases.
● Number of bugs in the file in the last release.
● Age of file (number of releases in the system)
● Language the file is written in.
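As a rough illustration, these predictors could be gathered into a per-file record like the following Python sketch; the field names are illustrative, not the tool's actual schema.

```python
# Illustrative per-file predictor record (field names are hypothetical).
from dataclasses import dataclass

@dataclass
class FileFeatures:
    kloc: float          # size of the file, in thousands of lines
    changes_prev2: int   # changes during the previous two releases
    faults_prev: int     # bugs found in the file in the last release
    age: int             # number of releases the file has existed
    language: str        # e.g. "c", "java", "sql"
```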
6
● All of the systems we’ve studied to date use a configuration management system which integrates version control and change management functionality, including bug history.
● Data is automatically extracted from the associated data repository and passed to the prediction engine.
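The studied systems live in AT&T's internal configuration management repository, so no public API applies; purely as a stand-in, here is how the same kind of change data might be pulled from a git repository (the function and release tags are hypothetical):

```python
# Hypothetical back-end sketch, using git as a stand-in data source.
import subprocess
from collections import Counter

def changes_per_file(repo, prev_tag, this_tag):
    """Count how many commits touched each file between two release tags."""
    out = subprocess.run(
        ["git", "-C", repo, "log", "--name-only", "--pretty=format:",
         f"{prev_tag}..{this_tag}"],
        capture_output=True, text=True, check=True).stdout
    return Counter(line for line in out.splitlines() if line.strip())
```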
7
Used Negative Binomial Regression. Also considered machine learning algorithms, including:
◦ Recursive Partitioning
◦ Random Forests
◦ BART (Bayesian Additive Regression Trees)
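As a concrete illustration, here is a minimal sketch of fitting a negative binomial regression in Python with statsmodels. The slides name the model family but not an implementation, so the library choice, toy data, and two-predictor formula are all assumptions:

```python
# Minimal sketch: negative binomial regression on toy per-file data.
# statsmodels' GLM NegativeBinomial family fixes the dispersion alpha
# (default 1.0); smf.negativebinomial() would estimate alpha as well.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "faults":        [0, 2, 1, 7, 0, 3],   # bugs in the next release
    "kloc":          [0.4, 2.1, 1.0, 5.3, 0.2, 3.8],
    "prior_changes": [1, 4, 2, 9, 0, 5],
})
model = smf.glm("faults ~ np.log(kloc) + prior_changes", data=df,
                family=sm.families.NegativeBinomial()).fit()
print(model.predict(df))  # expected fault counts per file
```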
8
● Consists of two parts.
● The back end extracts data needed to make the predictions.
● The front end makes the predictions and displays them.
9
Extracts necessary data from the repository. Predicts how many bugs will be in each file in the next release of the system. Sorts the files in decreasing order of the number of predicted bugs. Displays results to the user.
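In outline, the predict-sort-display step might look like this sketch, where the extracted features and the fitted model are stood in by caller-supplied arguments (both names are hypothetical):

```python
# Sketch of the predict-sort-display step. `files` is an iterable of
# (path, features) pairs and `predict_fn` maps features to an expected
# fault count -- both stand in for the tool's back end and fitted model.
def show_predictions(files, predict_fn):
    ranked = sorted(((predict_fn(feat), path) for path, feat in files),
                    reverse=True)  # most predicted bugs first
    for n, path in ranked:
        print(f"{n:6.1f}  {path}")

# toy usage: the "features" here are just prior-change counts
show_predictions([("a.c", 9), ("b.c", 1), ("c.c", 4)],
                 predict_fn=lambda changes: 0.5 + 0.3 * changes)
```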
10
Percentage of actual bugs that occurred in the N% of files predicted to have the largest numbers of bugs (N=20). We also considered other measures that are less sensitive to the specific value of N.
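A sketch of that evaluation measure, assuming per-file dicts of predicted and actual fault counts (the data shapes are my assumption):

```python
# Sketch of the measure: share of actual faults landing in the top N%
# of files, ranked by predicted fault count.
def faults_in_top_n(predicted, actual, n_percent=20):
    """predicted, actual: dicts mapping file -> fault count."""
    ranked = sorted(predicted, key=predicted.get, reverse=True)
    top = ranked[:max(1, len(ranked) * n_percent // 100)]
    return 100.0 * sum(actual.get(f, 0) for f in top) / sum(actual.values())

pred = {"a.c": 5.2, "b.c": 0.3, "c.c": 1.1, "d.c": 0.9, "e.c": 0.1}
real = {"a.c": 4, "b.c": 0, "c.c": 2, "d.c": 1, "e.c": 0}
print(faults_in_top_n(pred, real))  # top 20% of 5 files = 1 file -> 57.1
```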
11
System  Years Followed  Releases  LOC     % Faults in Top 20%
NP      4               17        538K    83%
WN      2               9         438K    83%
VT      2.25            9         329K    75%
TS      9+              35        442K    81%
TW      9+              35        384K    93%
TE      7               27        327K    76%
IC      4               18        1520K   91%
AR      4               18        281K    87%
IN      4               18        2116K   93%
14
The Tool
15
[Architecture diagram: a version management / fault database (previous releases) and user-supplied parameters for the release to be predicted feed the statistical-analysis prediction engine, which produces fault-proneness predictions.]
16
User enters system name. User asks for fault predictions for release “Bluestone2008.1”. Available releases are found in the version management database. User chooses the releases to analyze. User selects 4 file types. User specifies that all problems reported in the System Test phase are faults.
17
User confirms the configuration. User enters a filename to save the configuration. User clicks the Save & Run button to start the prediction process.
18
Initial prediction view for Bluestone2008.1: all files are listed in decreasing order of predicted faults.
19
Listing is restricted to eC files
20
Listing is restricted to 10% of eC files
21
Prediction tool is fully operational
◦ 750 lines of Python for the interface
◦ 2150 lines of C (75K bytes compiled) for the prediction engine
The current version’s back end (written in C) is specific to the internal AT&T configuration management system but can be adapted to other configuration management systems. All that is needed is a source of the data required by the prediction model.
22
Variations of the Fault Prediction Model
23
● Developers
◦ Counts
◦ Individuals
● Amount of Code Change
● Calling Structure
24
Overview
1. Standard model
2. Developer counts
3. Individual developers
4. Line-level change metrics
5. Calling structure
25
The Standard Model
● Underlying statistical model
◦ Negative binomial regression
● Output (dependent) variable
◦ Predicted fault count in each file of release n
● Predictor (independent) variables
◦ KLOC (n)
◦ Previous faults (n-1)
◦ Previous changes (n-1, n-2)
◦ File age (number of releases)
◦ File type (C, C++, Java, SQL, make, sh, Perl, ...)
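Spelled out, the mean function of the standard model would take roughly this form. This is a sketch consistent with negative binomial regression (log link); the slides list the predictors but not the exact specification, so the coefficient names and the file-type effect are mine:

```latex
\log E[\mathrm{faults}_{i,n}] = \beta_0
  + \beta_1 \log(\mathrm{KLOC}_{i,n})
  + \beta_2\,\mathrm{faults}_{i,n-1}
  + \beta_3\,\mathrm{changes}_{i,n-1}
  + \beta_4\,\mathrm{changes}_{i,n-2}
  + \beta_5\,\mathrm{age}_{i,n}
  + \gamma_{\mathrm{type}(i)}
```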
26
Developer Counts
● How many different people have worked on the file in the most recent previous release?
● How many different people have worked on the file in all previous releases? This is a cumulative count.
● How many people who changed the file were working on it for the first time?
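A sketch of how these three counts could be derived from per-release change logs; the history representation (release number mapped to a set of developer/file pairs) is my assumption:

```python
# Sketch of the three developer-count attributes. The history format
# (release number -> set of (developer, file) pairs) is hypothetical.
def developer_counts(history, file, release):
    prev = {d for d, f in history.get(release - 1, set()) if f == file}
    cumulative = {d for r in history if r < release
                  for d, f in history[r] if f == file}
    earlier = {d for r in history if r < release - 1
               for d, f in history[r] if f == file}
    new = prev - earlier  # first touched the file in the last release
    return len(prev), len(cumulative), len(new)

history = {1: {("ann", "x.c"), ("bob", "x.c")},
           2: {("bob", "x.c"), ("cat", "x.c")}}
print(developer_counts(history, "x.c", 3))  # (2, 3, 1)
```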
27
Faults per file in releases of System BTS
28
Standard Model
29
Developers Changing File in Previous Release
30
New Developers Changing File in Previous Release
31
Total Developers Changing File in All Previous Releases
32
Total developers touching file in all previous releases
33
Summary
● None of the developer-count attributes uniformly increases prediction accuracy.
● For every developer-count attribute, adding it to the standard model sometimes leads to less accurate predictions than the standard model alone.
● The benefit is never major.
34
Code Change
The standard model includes a count of the number of changes made in the previous two releases, but it does not take into account how much code was changed. We now look at the impact on predictive accuracy of adding fine-grained information about change size to the model.
35
Measures of Code Change
● Number of changes made to a file during a previous release
● Number of lines added
● Number of lines deleted
● Number of lines modified
● Relative size of change (line changes/LOC)
● Changed/not changed
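A sketch of these measures for one file in one release, assuming each change is summarized as an (added, deleted, modified) line-count triple:

```python
# Sketch of the change measures for one file in one release, assuming
# each change is summarized as an (added, deleted, modified) triple.
def change_measures(changes, loc):
    added = sum(a for a, d, m in changes)
    deleted = sum(d for a, d, m in changes)
    modified = sum(m for a, d, m in changes)
    return {"n_changes": len(changes),
            "added": added, "deleted": deleted, "modified": modified,
            "relative_churn": (added + deleted + modified) / loc,
            "changed": bool(changes)}

print(change_measures([(10, 2, 5), (0, 1, 3)], loc=400))
```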
36
Two Subject Systems
IC: large provisioning system
◦ 18 releases over a 5-year lifespan
◦ 6 languages: Java (60%), C, C++, SQL, SQL-C, SQL-C++
◦ 3000+ files, 1.5M LOC
◦ Average of 395 faults/release
AR: utility, data aggregation system
◦ >10 languages: Java (77%), Perl, xml, sh, ...
◦ 800 files, 280K LOC
◦ Average of 90 faults/release
37
Distribution of files, averages over all releases.
38
System IC Faults per File, by Release
39
System AR Faults per File, by Release
40
Prediction Models with Line-Level Change Counts
● Univariate models
● Base model: log(KLOC), file age, file type
● Augmented models:
◦ Previous changes
◦ Previous {adds / deletes / mods}
◦ Previous {adds / deletes / mods} / LOC (relative churn)
◦ Previous developers
41
Fault-percentile averages for univariate predictor models: System IC
42
Base Model and Added Variables: System IC
Base model: KLOC, file age (number of releases), file type (C, C++, Java, SQL, make, sh, Perl, ...)
43
Base Model and Added Variables: System AR
44
Summary
● Change information provides important information for fault prediction.
● {Adds + deletes + mods} improves the accuracy of a model that doesn’t include any change information, BUT a simple count of prior changes slightly outperforms {adds + deletes + mods}.
● A simple binary changed/not-changed flag is nearly as good as either, when added to a model without change information.
● Lines added is the most effective single change predictor; lines deleted is the least effective.
● Relative churn is no better than absolute change counts for predicting total fault count.
45
Individual Developers
How can we measure the effect that a single developer has on the faultiness of a file? If developer d modifies k files in release N:
● How many of those files have bugs in release N+1?
● How many bugs are in those files in release N+1?
46
The Buggyfile Ratio
● If d modifies k files in release N, and b of them have bugs in release N+1, the buggyfile ratio for d is b/k.
● System IC has 107 programmers. Over 15 releases, their buggyfile ratios vary between 0 and 1; the average is about 0.4.
47
Average buggyfile ratio, all programmers
48
Buggyfile ratio for two programmers
49
Buggyfile ratio, more typical cases
50
The Bug Ratio
● If d modifies k files in release N, and there are B bugs in those files in release N+1, the bug ratio for d is B/k.
● The bug ratio can vary between 0 and B. Over 15 releases, we’ve seen a maximum bug ratio of about 8; the average is about 1.5.
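Both per-developer measures in sketch form; the input shapes (the set of files d touched in release N, and a map from file to bug count in release N+1) are my assumption:

```python
# Both per-developer ratios in sketch form. `touched` is the set of
# files d changed in release N; `faults` maps file -> bug count in N+1.
def buggyfile_ratio(touched, faults):
    return sum(1 for f in touched if faults.get(f, 0) > 0) / len(touched)

def bug_ratio(touched, faults):
    return sum(faults.get(f, 0) for f in touched) / len(touched)

touched = {"a.c", "b.c", "c.c", "d.c"}
faults = {"a.c": 3, "c.c": 1}
print(buggyfile_ratio(touched, faults))  # 2 of 4 files buggy -> 0.5
print(bug_ratio(touched, faults))        # 4 bugs / 4 files   -> 1.0
```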
51
[Charts: bug ratio and buggyfile ratio, by release]
52
Problems with These Definitions
● A file can be changed by more than one developer.
● A file may be changed in release N and a fault detected in N+1, but that change may not have caused that fault.
● A programmer might change many files in identical trivial ways (interface, variable name, ...).
● The “best” programmers might be assigned to work on the most difficult files.
● For most programmers, the bug ratios vary widely from release to release.
53
Some Final Thoughts
● Is individual programmer bug-proneness helpful for prediction?
● Is this information useful for helping a project succeed?
● Are there better ways to measure it?
● Is it ethical to measure it?
● Does attempting to measure it lead to poor performance and unhappy programmers?
54
Calling Structure
Are files that have a high rate of interaction with other files more fault-prone?
[Diagram: callers of File Q (Files A and B) invoke its methods; File Q in turn calls Files X, Y, and Z, its callees.]
55
Calling Structure Attributes Investigated
For each file:
● number of callers & callees
● number of new callers & callees
● number of prior new callers & callees
● number of prior changed callers & callees
● number of prior faulty callers & callees
● ratio of internal calls to total calls
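A sketch of the basic caller/callee counts, assuming the call graph has been flattened to file-level edges (the representation is mine; the slides do not say how the graph is extracted):

```python
# Sketch of basic caller/callee attributes from a file-level call
# graph, represented as (caller_file, callee_file) edges (my choice).
def call_counts(edges, file):
    callers = {a for a, b in edges if b == file and a != file}
    callees = {b for a, b in edges if a == file and b != file}
    internal = sum(1 for a, b in edges if a == b == file)
    total = sum(1 for a, b in edges if file in (a, b))
    return {"callers": len(callers), "callees": len(callees),
            "internal_ratio": internal / total if total else 0.0}

edges = {("a.c", "q.c"), ("b.c", "q.c"), ("q.c", "x.c"), ("q.c", "q.c")}
print(call_counts(edges, "q.c"))  # 2 callers, 1 callee, ratio 0.25
```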
56
Fault Prediction by Multi-variable Models
● Code and history attributes, no calling structure
● Code and history attributes, including calling structure
● Code attributes only, including calling structure
57
Fault Prediction by Multi-variable Models
● Models applied to C, C++, and C-SQL files of one of the systems studied.
● First model built from the single best attribute.
● Each succeeding model built by adding the attribute that most improves the prediction.
● Stop when no attribute improves.
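The greedy forward-selection procedure the slide describes, in sketch form; `score` stands in for fitting a model on an attribute subset and evaluating its prediction accuracy:

```python
# Greedy forward selection as the slide describes it. `score(attrs)`
# stands in for fitting a model on that attribute subset and measuring
# its prediction accuracy (e.g. fault-percentile average).
def forward_select(candidates, score):
    chosen, best = [], float("-inf")
    while True:
        gains = [(score(chosen + [a]), a)
                 for a in candidates if a not in chosen]
        if not gains:
            return chosen
        top_score, top_attr = max(gains)
        if top_score <= best:      # no attribute improves: stop
            return chosen
        chosen.append(top_attr)
        best = top_score

# toy score that rewards two specific attributes and penalizes size
toy = lambda attrs: len(set(attrs) & {"prior_faults", "kloc"}) - 0.1 * len(attrs)
print(forward_select(["kloc", "age", "prior_faults"], toy))
```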
58
Code and history attributes, no calling structure
59
Code, history, and calling structure attributes
60
Code and calling structure attributes but not numbers of faults or changes in previous releases.
61
Summary
● Calling structure attributes do not increase the accuracy of predictions.
● History attributes (prior changes, prior faults) increase accuracy, either with or without calling structure.
● We studied these issues for only two of the systems.
62
Overall Summary
● The Standard Model performs very well on all nine industrial systems we have examined.
● The augmented models add very little or no additional accuracy.
● Cumulative developers is the most effective addition to the Standard Model, but it still doesn’t guarantee improved prediction or yield significant improvement.
63
What’s Ahead?
◦ Will our standard model make accurate predictions for open-source systems?
◦ Will our standard model make accurate predictions for agile systems?
◦ Can we predict which files will contain the faults with the highest severities?
◦ Can predictions be made for units smaller than files?
◦ Can run-time attributes be used to make fault predictions (execution time, execution frequency, memory use, ...)?
◦ What is the most meaningful way to assess the effectiveness and accuracy of the predictions?