Introduction to Defect Prediction, Cmpe 589, Spring 2008

Problem 1 How to tell if the project is on schedule and within budget? Earned-value charts.

Problem 2 How hard will it be for another organization to maintain this software? McCabe Complexity

Problem 3 How to tell when the subsystems are ready to be integrated? Defect density metrics.

Problem Definition
Software development lifecycle: Requirements, Design, Development, Test (takes ~50% of overall time).
Detect and correct defects before delivering software.
Test strategies:
- Expert judgment
- Manual code reviews
- Oracles / predictors as secondary tools

Problem Definition

Testing

Defect Prediction
A 2-class classification problem:
- Non-defective if error count = 0
- Defective if error count > 0
Two things are needed:
- Raw data: source code
- Software metrics -> static code attributes

Static Code Attributes

    void main() {
        // This is a sample code
        // Declare variables
        int a, b, c;
        // Initialize variables
        a = 2;
        b = 5;
        // Find the sum and display c if greater than zero
        c = sum(a, b);
        if (c < 0)
            printf("%d\n", a);
        return;
    }

    int sum(int a, int b) {
        // Returns the sum of two numbers
        return a + b;
    }

[Figure: control flow graph of the sample code, with the decision node c > 0]

    Module    LOC    LOCC    V    CC    Error
    main()
    sum()       5       1    3     1        0

LOC: lines of code
LOCC: lines of commented code
V: number of unique operands & operators
CC: cyclomatic complexity
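
As a rough illustration of how such attributes can be collected (not the metric extractor used for these slides), the sketch below counts non-blank lines, commented lines, and a decision-keyword proxy for cyclomatic complexity; the regular expression and keyword list are illustrative assumptions.

    import re

    # Rough proxy for cyclomatic complexity: CC ~= number of decision keywords + 1.
    DECISION_RE = re.compile(r"\b(if|for|while|case)\b")

    def static_attributes(source: str) -> dict:
        """Compute illustrative static code attributes for one module's source text."""
        lines = [ln.strip() for ln in source.splitlines() if ln.strip()]
        loc = len(lines)                                   # non-blank lines of code
        locc = sum(1 for ln in lines if "//" in ln or ln.startswith("/*"))
        decisions = sum(len(DECISION_RE.findall(ln)) for ln in lines)
        return {"LOC": loc, "LOCC": locc, "CC": decisions + 1}

    sample = """int sum(int a, int b) {
        // Returns the sum of two numbers
        return a + b;
    }"""
    print(static_attributes(sample))   # {'LOC': 4, 'LOCC': 1, 'CC': 1}

Real metric suites also extract Halstead-style counts such as V (unique operands and operators), which need a proper tokenizer rather than line counting.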

Defect Prediction
Machine learning based models:
- Defect density estimation
- Regression models: error proneness
- First classification, then regression (sketched below)
- Defect prediction between versions
- Defect prediction for embedded systems
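
One way to read "first classification, then regression" is a two-stage model: a classifier separates defective from non-defective modules, and a regressor estimates defect counts only for the modules flagged defective. The sketch below is an assumed illustration of that idea, not the exact models behind these slides; the GaussianNB/LinearRegression choices are placeholders.

    import numpy as np
    from sklearn.naive_bayes import GaussianNB
    from sklearn.linear_model import LinearRegression

    def fit_two_stage(X, defect_counts):
        """Stage 1: classify defective vs. non-defective; stage 2: regress counts on defective modules."""
        y_class = (defect_counts > 0).astype(int)
        clf = GaussianNB().fit(X, y_class)
        defective = y_class == 1
        reg = LinearRegression().fit(X[defective], defect_counts[defective])
        return clf, reg

    def predict_two_stage(clf, reg, X):
        """Predict 0 defects for modules classified as clean, the regression estimate otherwise."""
        flagged = clf.predict(X) == 1
        estimate = np.zeros(len(X))
        estimate[flagged] = np.maximum(reg.predict(X[flagged]), 0.0)   # clip negative estimates
        return estimate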

Constructing Predictors
Baseline: Naive Bayes. Why? Best reported results so far (Menzies et al., 2007).
Remove its assumptions and construct different models:
- Independent attributes -> multivariate distributions
- Attributes of equal importance -> attribute weighting (next slide)
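
A minimal baseline along these lines, assuming log-transformed attributes and 10-fold stratified cross-validation (common practice in this literature, but not necessarily the exact protocol of the study):

    import numpy as np
    from sklearn.naive_bayes import GaussianNB
    from sklearn.model_selection import StratifiedKFold, cross_val_predict

    def naive_bayes_baseline(X, y):
        """Naive Bayes defect predictor over static code attributes.
        X: (n_modules, n_attributes) metric matrix; y: 1 = defective, 0 = non-defective."""
        X_log = np.log(X + 1e-6)   # log filter to spread out the skewed metric distributions
        cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
        return cross_val_predict(GaussianNB(), X_log, y, cv=cv)

The predicted labels can then be scored with the pd/pf measures defined a few slides below.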

Weighted Naive Bayes
Naive Bayes vs. Weighted Naive Bayes; see the formulas below.
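
The slide's formulas did not survive the transcript; one standard formulation, which the weighted variant extends by attaching a weight w_i to each attribute's contribution, is (the Gaussian class-conditional model is an assumption here):

\[
g_c(x) = \log P(c) + \sum_{i=1}^{d} \log p(x_i \mid c) \qquad \text{(Naive Bayes)}
\]
\[
g_c(x) = \log P(c) + \sum_{i=1}^{d} w_i \, \log p(x_i \mid c) \qquad \text{(Weighted Naive Bayes)}
\]

with each p(x_i | c) modeled as a univariate Gaussian per class and attribute, and a module assigned to the class c that maximizes g_c(x); setting all w_i = 1 recovers plain Naive Bayes.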

Datasets

    Name    # Features    # Modules    Defect Rate (%)
    CM
    PC
    PC
    PC
    PC
    KC
    KC
    MW

Performance Measures

    Defects             Actual: no    Actual: yes
    Predicted: no            A              B
    Predicted: yes           C              D

Accuracy: (A + D) / (A + B + C + D)
Pd (Hit Rate): D / (B + D)
Pf (False Alarm Rate): C / (A + C)
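
These measures, plus the balance value reported in the results table below, follow directly from the confusion matrix. The balance formula used here (normalized distance from the ideal point pd = 1, pf = 0) is the one commonly paired with pd/pf in this literature, assumed rather than copied from the slide:

    import math

    def performance_measures(y_true, y_pred):
        """Accuracy, pd (hit rate), pf (false alarm rate) and balance for binary defect labels."""
        A = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
        B = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
        C = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
        D = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
        acc = (A + D) / (A + B + C + D)
        pd = D / (B + D) if (B + D) else 0.0
        pf = C / (A + C) if (A + C) else 0.0
        bal = 1 - math.sqrt((1 - pd) ** 2 + pf ** 2) / math.sqrt(2)
        return acc, pd, pf, bal

    print(performance_measures([0, 1, 1, 0], [0, 1, 0, 0]))   # (0.75, 0.5, 0.0, 0.646...)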

Results: InfoGain & GainRatio

    Data    WNB+IG (%)         WNB+GR (%)         IG+NB (%)
            pd    pf    bal    pd    pf    bal    pd    pf    bal
    CM
    PC
    PC
    PC
    PC
    KC
    KC
    MW
    Avg:
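
For reference, information gain (IG) and gain ratio (GR), used above to weight or select attributes, can be computed per discretized attribute from the standard definitions; the sketch below is generic, not the exact weighting procedure of the study.

    import math
    from collections import Counter

    def entropy(values):
        n = len(values)
        return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

    def info_gain(attr_values, labels):
        """IG(A) = H(class) - sum over values v of p(v) * H(class | A = v)."""
        n = len(labels)
        conditional = 0.0
        for v in set(attr_values):
            subset = [lab for a, lab in zip(attr_values, labels) if a == v]
            conditional += len(subset) / n * entropy(subset)
        return entropy(labels) - conditional

    def gain_ratio(attr_values, labels):
        """GR(A) = IG(A) / H(A); the split entropy penalizes many-valued attributes."""
        split = entropy(attr_values)
        return info_gain(attr_values, labels) / split if split > 0 else 0.0

Attribute weights for weighted Naive Bayes can then be taken from these scores, e.g. normalized into [0, 1].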

Results: Weight Assignments

Benefiting from defect data in practice
Within-company (WC) vs. cross-company (CC) data:
- Investigated in the cost estimation literature; no studies in defect prediction!
- No conclusions in cost estimation… but a straightforward interpretation of results in defect prediction. Possible reason: well-defined features.
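
The WC vs. CC question amounts to two evaluation setups over the same pool of project datasets: train and test within one project, or train on all other projects and test on the held-out one. A minimal sketch, assuming a dict of per-project (X, y) arrays and a Naive Bayes learner (the study's exact protocol is not shown on the slides):

    import numpy as np
    from sklearn.naive_bayes import GaussianNB
    from sklearn.model_selection import cross_val_predict

    def wc_predictions(datasets, name):
        """Within-company: cross-validation inside a single project's data."""
        X, y = datasets[name]
        return cross_val_predict(GaussianNB(), X, y, cv=10)

    def cc_predictions(datasets, name):
        """Cross-company: train on every other project, test on the target project."""
        X_test, _ = datasets[name]
        X_train = np.vstack([X for n, (X, y) in datasets.items() if n != name])
        y_train = np.concatenate([y for n, (X, y) in datasets.items() if n != name])
        return GaussianNB().fit(X_train, y_train).predict(X_test)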

How much data do we need? Consider: Dataset size: 1000. Defect rate: 8%. Training instances: 90% of the data, so 1000 * 8% * 90% = 72 defective instances and 900 - 72 = 828 non-defective instances.
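
The same back-of-the-envelope count for other dataset sizes and defect rates (a hypothetical helper, not from the slides):

    def training_defect_count(n_modules, defect_rate, train_fraction=0.9):
        """Expected defective and non-defective instances in the training split."""
        n_train = int(n_modules * train_fraction)
        n_defective = int(n_modules * defect_rate * train_fraction)
        return n_defective, n_train - n_defective

    print(training_defect_count(1000, 0.08))   # (72, 828), the slide's example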

Intelligent data sampling
With random sampling of 100 instances we can learn as well as with thousands. Can we increase performance with wiser sampling strategies? Which data? Practical aspects: industrial case study.
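
A simple way to probe the "100 instances are enough" claim is to train repeatedly on small random subsamples and compare pd/pf against a model trained on all available data; the learner and evaluation details below are assumptions for illustration.

    import numpy as np
    from sklearn.naive_bayes import GaussianNB
    from sklearn.metrics import confusion_matrix

    def pd_pf(y_true, y_pred):
        (tn, fp), (fn, tp) = confusion_matrix(y_true, y_pred, labels=[0, 1])
        return tp / (tp + fn), fp / (fp + tn)

    def subsample_experiment(X_train, y_train, X_test, y_test, n=100, repeats=20, seed=0):
        """Train on n randomly drawn instances, repeated several times; report mean pd and pf."""
        rng = np.random.default_rng(seed)
        scores = []
        for _ in range(repeats):
            idx = rng.choice(len(X_train), size=n, replace=False)
            model = GaussianNB().fit(X_train[idx], y_train[idx])
            scores.append(pd_pf(y_test, model.predict(X_test)))
        return np.mean(scores, axis=0)   # (mean pd, mean pf)

Wiser strategies would replace the uniform rng.choice with, for example, stratified or uncertainty-driven selection.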

ICSOFT’07 WC vs CC Data? When to use WC or CC? How much data do we need to construct a model?

ICSOFT’07

Module Structure vs. Defect Rate
- Fan-in, fan-out
- PageRank algorithm
- Call graph information on the code
- "small is beautiful"
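
A module's structural importance can be scored by running PageRank over the call graph; fan-in and fan-out fall out of the same representation (number of callers and callees). The power-iteration sketch below is a generic implementation, not the tool used in the study; the example call graph is hypothetical.

    def pagerank(call_graph, damping=0.85, iterations=50):
        """call_graph: dict mapping each module to the list of modules it calls."""
        nodes = list(call_graph)
        rank = {n: 1.0 / len(nodes) for n in nodes}
        for _ in range(iterations):
            new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
            for caller, callees in call_graph.items():
                if callees:
                    share = damping * rank[caller] / len(callees)
                    for callee in callees:
                        new_rank[callee] += share
                else:   # module with no outgoing calls: spread its rank uniformly
                    for n in nodes:
                        new_rank[n] += damping * rank[caller] / len(nodes)
            rank = new_rank
        return rank

    # Hypothetical call graph: main calls sum and log; fan-out(main) = 2, fan-in(sum) = 1.
    print(pagerank({"main": ["sum", "log"], "sum": [], "log": []}))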

Performance vs. Granularity