
1 CS4723 Software Validation and Quality Assurance
Lecture 14 Defect Prediction

2 Defect Prediction
We have studied code review and design review. One important question is how to measure the progress of a review: how many defects remain to be found? Answering this is the task of defect prediction.

3 Defect Prediction
Also useful for making decisions such as:
Should I release the software / the new feature now, or later, after extra testing and fixing?
How large a maintenance team should I keep for the software project?
How should I assign team members to different groups for different features / modules?

4 Defect Prediction
A 2-class classification problem:
Non-defective if errors = 0
Defective if errors > 0
More advanced: a grading (ranking) problem:
Give each item a score (e.g., between 0 and 1)
Rank items according to their scores
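
To make the two formulations concrete, here is a minimal Python sketch; the module names, error counts, and scores are all hypothetical:

errors = {"parser.c": 3, "ui.c": 0, "net.c": 1}  # known bug counts per module

# 2-class labels: an item is defective iff its error count is > 0
labels = {m: 1 if e > 0 else 0 for m, e in errors.items()}

# grading: each item gets a score in [0, 1] (in practice, a model's
# predicted probability); items are then ranked by score
scores = {"parser.c": 0.9, "ui.c": 0.1, "net.c": 0.6}
ranking = sorted(scores, key=scores.get, reverse=True)

print(labels)   # {'parser.c': 1, 'ui.c': 0, 'net.c': 1}
print(ranking)  # ['parser.c', 'net.c', 'ui.c']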

5 Defect Prediction
Three major types of models:
Process models
Product models
Multivariate models

6 Process Models
Phase containment models rely on history to identify:
How many defects were produced in each phase
How many defects from that phase were discovered and corrected there (phase containment)
Predict defects for each phase and track discovery and removal; assume that defects predicted but not found were passed to the next phase.
Simple, and easy to implement with common tools.
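
A minimal Python sketch of this bookkeeping, with hypothetical per-phase numbers:

phases = ["design", "coding", "testing"]
predicted = {"design": 20, "coding": 50, "testing": 10}  # defects produced per phase
found     = {"design": 15, "coding": 40, "testing": 25}  # discovered and corrected

escaped = 0
for p in phases:
    at_large = predicted[p] + escaped   # produced here + inherited from before
    containment = found[p] / at_large   # fraction caught in this phase
    escaped = at_large - found[p]       # assumed passed to the next phase
    print(f"{p}: containment {containment:.0%}, escaped {escaped}")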

7 Process Models
Common features (factors):
Quality of the software process (agile, well-documented, design review, …)
Number of revealed bugs in history
Revealed bugs in testing
Test coverage
Other bug-removal techniques used
Fixed bugs

8 Projected Software Defects
In general, defect arrivals follow a Rayleigh distribution curve. Based on project size and past defect densities, one can predict the curve along with the upper and lower control bounds:
F(t) = 1 - e^(-(t/c)^2)
f(t) = (2t/c^2) * e^(-(t/c)^2)
[Figure: projected defect arrivals over time, between the Upper Limit and Lower Limit curves]
Recall that F(t) is the cumulative distribution function, f(t) is the probability density function, t is time, and c is a constant.
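
A minimal Python sketch of the projection; the scale constant c and the projected defect total are hypothetical, and in practice would be fitted from project size and past defect densities:

import math

c = 10.0             # scale constant (hypothetical)
total_defects = 200  # projected total defects (hypothetical)

def F(t):
    # cumulative fraction of defects that have arrived by time t
    return 1 - math.exp(-(t / c) ** 2)

def f(t):
    # defect arrival density at time t
    return (2 * t / c ** 2) * math.exp(-(t / c) ** 2)

for week in range(0, 25, 4):
    print(f"week {week:2d}: {total_defects * F(week):6.1f} cumulative, "
          f"{total_defects * f(week):5.2f} per week")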

9 Process Models
Pros:
Easy to implement
Perform prediction very quickly
Take into account the process of bug removal
Cons:
Require collecting lots of data during the software process

10 Product Models
Use only information in the current shape of the product to do prediction (a static approach):
Static/dynamic structure of the source code
Design documents
Specifications

11 Features (Factors)
Code size:
Lines
Methods, classes
Function calls
Complexity:
Nested loops
Control flow graphs, call graphs
Comments
Warnings from tools like FindBugs

12 Static Code Attributes

#include <stdio.h>

int sum(int a, int b);  // declare sum() before its use in main()

int main(void) {
    // This is a sample code
    // Declare variables
    int a, b, c;
    // Initialize variables
    a = 2;
    b = 5;
    // Find the sum and display c if greater than zero
    c = sum(a, b);
    if (c > 0)
        printf("%d\n", c);
    return 0;
}

int sum(int a, int b) {
    // Returns the sum of two numbers
    return a + b;
}

Module    LOC    LOCC    V    CC    Error
main()    16     4       5    2     -
sum()     -      1       3    -     -

LOC: lines of code
LOCC: lines of commented code
V: number of unique operands and operators
CC: cyclomatic complexity
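
As an illustration, a minimal Python sketch that extracts two of these attributes (LOC and LOCC) from a source file; the comment heuristic is deliberately crude, real tools also parse the code to compute V and CC, and the input file name is hypothetical:

def count_metrics(path):
    loc = locc = 0
    with open(path) as src:
        for line in src:
            stripped = line.strip()
            if not stripped:
                continue          # skip blank lines
            loc += 1
            if "//" in stripped:  # crude: any line carrying a // comment
                locc += 1
    return loc, locc

print(count_metrics("sample.c"))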

13 Features (Factors)
Design documents:
Class diagram:
Complexity
Coupling, cohesion
Hierarchy depth
Sequence diagram:
Number of objects involved
Number of object interactions

14 Features (Factors)
Specification:
Number of features / sub-features
Complexity of feature contracts
Special cases to handle (out of power, network errors)
Non-functional requirements (performance, security, usability)

15 Product Models
Pros:
Only static data is required, so applicable to any software at any phase
Cons:
Still require some bug history for training (this can be avoided by cross-learning)
Features are harder to extract

16 Code Metrics Tools
To facilitate product models
To guide code review and design review
Demo: Eclipse Metrics plugin

17 Code Metrics Tools
[Screenshots of the Eclipse Metrics plugin: update site, Generate Metrics, Metrics View, Dependency Graph, Rebuild Dependency Graph, Preferences]

18 Multivariate Models
Use any of many variables, analyzing the relationships between the values of those variables and the results observed in historic projects. Effective if the historic projects from which the model was created are a good match for yours. A small regression sketch follows.
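
A minimal multivariate example with scikit-learn: relate several variables from historic projects to their observed defect counts, then apply the fit to a new project. The variable choices and all numbers are hypothetical.

from sklearn.linear_model import LinearRegression

# columns: team size, KLOC, review hours (one row per historic project)
X_hist = [[5, 20, 40], [12, 80, 10], [8, 45, 60], [3, 10, 25]]
defects = [30, 210, 70, 15]  # observed defects per project

model = LinearRegression().fit(X_hist, defects)
print(model.predict([[7, 50, 30]]))  # estimate for a new, similar project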

19 Classification with Models
A basic machine learning problem:
Decide the element granularity: method, class, module, feature, or whole software
Training data:
Collect known bugs of elements as labels
Extract features of these elements based on process, product, or multivariate models
Learn how to classify based on the training data
Apply the learned knowledge to new data

20 Classification Tools
Use mature machine learning techniques:
Naive Bayes
Bayes networks
Support vector machines
Decision trees
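
A minimal sketch with scikit-learn versions of three of these learners (a Bayes network would need a separate library); the per-module feature rows and labels are hypothetical stand-ins for real metrics and bug history:

from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# rows: e.g. LOC, LOCC, V, CC per module; labels: 1 = defective
X = [[16, 4, 5, 2], [1, 1, 3, 1], [40, 0, 30, 8], [35, 2, 25, 7]]
y = [0, 0, 1, 1]

for clf in (GaussianNB(), SVC(), DecisionTreeClassifier()):
    clf.fit(X, y)
    print(type(clf).__name__, clf.predict([[25, 2, 18, 6]]))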

21 Comparison of the Models
Two open aspects of defect prediction:
The relationship between software defects and code metrics
The impact of the software process on the defectiveness of software
There is no agreed answer yet, and prediction should not be cost-insensitive: different misclassification errors carry different costs.

22 Questions to Answer
Which metrics are good defect predictors?
Which models should be used?
How accurate are those models?
How much does it cost? What are the benefits?

23 A Well-Known Study by Moser et al. (2008)
Experimental set-up
Assessing classification accuracy
Classification results
Cost-sensitive classification
Cost-sensitive defect prediction

24 Data & Experimental Set-Up
Public data set from the Eclipse CVS repository (releases 2.0, 2.1, 3.0) by Zimmermann et al.
18 change metrics concerning the change history of files
31 static code attribute metrics that Zimmermann used at the file level

25 Change Metrics
Examples of the proposed change metrics:
REFACTORINGS: renaming or moving software elements (built-in refactorings perform most of these tedious and error-prone tasks)
MAX_CHANGESET: the number of files that have been committed together with file x
AGE: age of the file in weeks, from the release date back to its first appearance
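
A minimal Python sketch of mining such metrics from commit records; the commit data and the bug-fix keyword heuristic are hypothetical:

from collections import defaultdict

commits = [  # (files committed together, commit message)
    (["A.java", "B.java"], "fix bug #12"),
    (["A.java"], "refactor: rename field"),
    (["A.java", "B.java", "C.java"], "new feature"),
]

revisions, bugfixes, max_changeset = (defaultdict(int) for _ in range(3))

for files, msg in commits:
    for f in files:
        revisions[f] += 1
        max_changeset[f] = max(max_changeset[f], len(files))
        if "fix" in msg and "bug" in msg:
            bugfixes[f] += 1

print(dict(revisions))      # commits touching each file
print(dict(bugfixes))       # commits whose message looks like a bug fix
print(dict(max_changeset))  # largest commit each file took part in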

26 Change Metrics Build Three Models for Predicting the Presence or Absence of Defects in Files Change Model uses proposed change metrics Code Model uses static code metrics Combined Model uses both types of metrics 26

27 Evaluation Metrics
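
The figures quoted in the following slides (recall, FP rate, fraction of correctly classified files) are the standard confusion-matrix metrics, where TP counts defective files predicted defective, FP defect-free files predicted defective, FN defective files predicted defect-free, and TN defect-free files predicted defect-free:

recall (defect detection rate) = TP / (TP + FN)
FP rate = FP / (FP + TN)
accuracy = (TP + TN) / (TP + TN + FP + FN)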

29 Analysis of the Decision Tree
Defect-free:
Large MAX_CHANGESET or low REVISIONS
Smaller MAX_CHANGESET together with low REVISIONS and REFACTORINGS
Defect-prone:
High number of BUGFIXES
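
A minimal scikit-learn sketch of fitting and inspecting such a tree; the metric values and labels are hypothetical, so the printed rules only illustrate the form of the analysis:

from sklearn.tree import DecisionTreeClassifier, export_text

features = ["REVISIONS", "BUGFIXES", "MAX_CHANGESET", "REFACTORINGS"]
X = [[2, 0, 12, 3], [30, 9, 2, 0], [4, 1, 8, 2], [25, 7, 3, 0]]
y = [0, 1, 0, 1]  # 1 = defect prone

tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
print(export_text(tree, feature_names=features))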

30 Cost-Sensitive Classification
Cost-sensitive classification associates costs with the different errors made by a model.
A false negative (FN) implies higher cost than a false positive (FP), since it is more costly to fix an undetected defect in the post-release cycle than to inspect a defect-free file; the FN cost factor is therefore greater than 1.

31 Cost-Sensitive Classification
Use heuristics to decide when to stop trading false positives for recall: keep the FP rate below 30%, with a cost factor of 5, as sketched below.
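
The study used a cost matrix with the J48 learner; as a rough analogue, this minimal scikit-learn sketch weights a missed defect five times as heavily as a false alarm during training, then reports the two quantities the heuristic watches. All data is hypothetical.

from sklearn.metrics import confusion_matrix
from sklearn.tree import DecisionTreeClassifier

X = [[2, 0], [30, 9], [4, 1], [25, 7], [3, 0], [28, 8]]  # e.g. REVISIONS, BUGFIXES
y = [0, 1, 0, 1, 0, 1]                                   # 1 = defective

# cost factor 5: misclassifying a defective file costs 5x a false alarm
clf = DecisionTreeClassifier(class_weight={0: 1, 1: 5}).fit(X, y)

tn, fp, fn, tp = confusion_matrix(y, clf.predict(X)).ravel()
print("recall:", tp / (tp + fn))   # push this up...
print("FP rate:", fp / (fp + tn))  # ...while keeping this below 30%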

32 Cost-Sensitive Classification
Defect predictors based on change data outperform those based on static code attributes.

33 Conclusions
The 18 change metrics with the J48 learner and a cost factor of 5 give accurate results for 3 releases of the Eclipse project:
>75% of files correctly classified
>80% recall
<30% FP rate
Hence, the change metrics contain more discriminatory and meaningful information about the defect distribution than the source code itself.

34 Conclusions
Defect-prone files have high revision numbers and heavy bug-fixing activity.
Defect-free files take part in large CVS commits and are refactored several times.

35 Current Research on Defect Prediction
Better features:
More sophisticated code features: dependencies, type information, …
More sophisticated change features: type of fix, impact of fix, fix comments, …
More precise labeling:
Remove patches
Remove new-feature additions

36 Review of Defect Prediction
Models:
Process model
Product model
Multivariate model
Approach: machine learning
Comparison: the process model is better, but requires more accumulated data

