Towards Logistic Regression Models for Predicting Fault-prone Code across Software Projects. Erika Camargo and Ochimizu Koichiro, Japan Advanced Institute of Science and Technology.

Presentation transcript:

Towards Logistic Regression Models for Predicting Fault-prone Code across Software Projects
Erika Camargo and Ochimizu Koichiro
Japan Advanced Institute of Science and Technology
ESEM

Contents
1. Abstract
2. Background
3. Problem Analysis
4. Case Study
5. Results
6. Conclusion and Future Work

Abstract
Challenge: to make logistic regression (LR) models that use design-complexity metrics able to predict fault-prone object-oriented classes across software projects.
First attempt at a solution: simple log data transformations.
[Figure: P(y=1), the probability of a fault-prone class, plotted against X, a design-complexity metric]
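As a minimal sketch of what such a model computes, the standard univariate logistic function maps a metric value x to P(y=1). The coefficients `B0` and `B1` below are hypothetical values chosen for illustration; they are not taken from the slides.

```python
import math


def fault_prone_probability(x, b0, b1):
    """P(y=1 | x) for a univariate logistic regression model."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))


# Hypothetical coefficients, for illustration only (not from the slides).
B0, B1 = -3.0, 0.2

# Probability of a class being fault-prone at three metric values.
for metric_value in (0, 15, 30):
    p = fault_prone_probability(metric_value, B0, B1)
    print(f"x={metric_value:>2}  P(y=1)={p:.3f}")
```

With a positive slope the probability rises with the metric, and it crosses 0.5 where b0 + b1 * x = 0.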

Background
Some design-complexity metrics have been shown to be good predictors of fault-prone classes in LR models.
Among these metrics are the Chidamber & Kemerer (CK) metrics:
– The 80th and 20th percentiles of their distributions can be used to determine high and low values.
– Their thresholds cannot be determined before their use and should be derived and used locally.
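The percentile rule can be sketched as follows. This is an illustration under assumptions: "locally" is read here as "per project", the two projects and their metric values are made up, and treating values at or beyond a cut point as low or high is a choice of this sketch rather than something the slides state.

```python
from statistics import quantiles


def local_thresholds(values):
    """Return the (20th, 80th) percentile cut points of one project's metric values."""
    cuts = quantiles(values, n=5, method="inclusive")  # 20th, 40th, 60th, 80th
    return cuts[0], cuts[-1]


def label(value, low, high):
    """Label a metric value relative to locally derived thresholds."""
    if value <= low:
        return "low"
    if value >= high:
        return "high"
    return "medium"


# Hypothetical metric values for two projects of different scale.
project_a = list(range(1, 12))            # 1 .. 11
project_b = [10 * v for v in project_a]   # 10 .. 110

low_a, high_a = local_thresholds(project_a)
low_b, high_b = local_thresholds(project_b)

# The same raw value lands in different bands depending on the project.
print(label(9, low_a, high_a), label(9, low_b, high_b))
```

The point of the sketch is that a threshold derived in one project does not carry over to a project whose metric distribution sits on a different scale.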

Problem Analysis
Can an LR model built with this kind of metric work efficiently with different software projects?
[Figure: P(y=1) plotted against X = number of methods, for a small-size and a large-size software project, with LEAST FAULTY and MOST FAULTY regions marked]

Case Study
1. Data analysis of 7 different projects and application of simple log data transformations.
2. Construction of 3 univariate LR models using a large open-source project (1st release of the Mylyn system, with 638 Java classes).
– Independent variables: CK-CBO, CK-RFC, CK-WMC
– Dependent variable: defects (from Bugzilla & CVS)
3. Testing of these models with 2 other, smaller projects (with 11 and 13 Java classes).
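Steps 2 and 3 can be sketched in plain Python: fit a univariate LR model on one project, then count correct classifications on another. The slides do not say how the models were fitted or which cutoff was used, so the Newton-Raphson fit, the 0.5 cutoff, and all data values below are assumptions made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_univariate_lr(xs, ys, iterations=25):
    """Fit P(y=1|x) = sigmoid(b0 + b1*x) by Newton-Raphson on the log-likelihood."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)
            g0 += y - p            # gradient with respect to b0
            g1 += (y - p) * x      # gradient with respect to b1
            h00 += w               # entries of X^T W X
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        if det == 0.0:
            break
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def count_correct(xs, ys, b0, b1, cutoff=0.5):
    """Number of classes whose predicted label matches the observed one."""
    return sum(int(sigmoid(b0 + b1 * x) >= cutoff) == y for x, y in zip(xs, ys))


# Hypothetical training project: metric value and fault-prone flag per class.
train_x = [1, 2, 3, 4, 5, 6, 7, 8]
train_y = [0, 0, 1, 0, 1, 0, 1, 1]
b0, b1 = fit_univariate_lr(train_x, train_y)

# Hypothetical smaller project used only for testing.
test_x = [2, 3, 9, 10]
test_y = [0, 0, 1, 1]
print(count_correct(test_x, test_y, b0, b1), "of", len(test_x), "classified correctly")
```

One model per metric (CBO, RFC, WMC) would be fitted this way, each with a single predictor.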

Challenge
[…] produced biased regression estimates and reduce the predictive power of regression models
BNS: Banking system (2006) *
CRS: Cruise control system (2005) *
ECS: E-commerce system (2006) *
ELCS: Elevator control system (2003) *
FACS: Factory automation system (2005) *
GMF: Graphic Modeling Framework **
MYL: Mylyn system **
(*) Systems developed by students of JAIST, described in: Hassan Gomaa, Designing Concurrent, Distributed, and Real-Time Applications with UML, Addison-Wesley Object Technology Series, July
(**) Eclipse project

The RFC data of BNS is more spread out than the data of MYL.


Case Study
Solution: simple data transformation using Log10.
– Fewer outliers
– Data spread is more uniform
Example:
LCBO = Log10(CBO + 1)
LTCBO = Log10(CBO + 1) + dm
where dm is the difference of the CBO medias of the Mylyn system and the system whose data is being transformed.
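The two transformations can be sketched as below. Several details are assumptions, because the slide leaves them open: "medias" could mean means or medians, so the centre statistic is a parameter that defaults to the median; dm is computed here on the log-transformed values, as reference minus target; and the sample CBO values are made up.

```python
import math
from statistics import median


def log_transform(values):
    """LCBO = Log10(CBO + 1), applied to every value."""
    return [math.log10(v + 1) for v in values]


def shifted_log_transform(values, reference_values, centre=median):
    """LTCBO = Log10(CBO + 1) + dm.

    Assumption: dm is the centre of the log-transformed reference project
    minus the centre of the log-transformed target project.
    """
    logged = log_transform(values)
    dm = centre(log_transform(reference_values)) - centre(logged)
    return [v + dm for v in logged]


# Hypothetical CBO values: a reference project (standing in for Mylyn)
# and a smaller project whose data is being transformed.
reference_cbo = [9, 99, 999]
target_cbo = [0, 9, 99]

print(log_transform(target_cbo))
print(shifted_log_transform(target_cbo, reference_cbo))
```

Adding 1 before taking the logarithm keeps classes with a metric value of 0 defined, and the shift by dm lines up the centre of the transformed project with that of the project used to build the model.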

Results
Effects of the log data transformations:
– Elimination of a great number of outliers
– Overall goodness of fit of the 3 models is better
– Discrimination (most faulty / least faulty): all models discriminate well between the most faulty and least faulty classes of the Mylyn system
– What about using different projects?

Results: Banking System
MF: Most Faulty; LF: Least Faulty

Group              Model  Correct classification  Correct classification  Effect
                          (raw data)              (log-transformed data)
MF (6 classes)     CBO    2                       5                       ↑
                   RFC    5                       5                       =
                   WMC    6                       6                       =
LF (5 classes)     CBO    5                       5                       =
                   RFC    3                       3                       =
                   WMC    4                       4                       =
BOTH (11 classes)  CBO    7                       10                      ↑
                   RFC    8                       8                       =
                   WMC    10                      10                      =

Results: E-commerce System
MF: Most Faulty; LF: Least Faulty

Group              Model  Correct classification  Correct classification  Effect
                          (raw data)              (log-transformed data)
MF (9 classes)     CBO    3                       7                       ↑
                   RFC    9                       8                       ↓
                   WMC    7                       6                       ↓
LF (4 classes)     CBO    4                       4                       =
                   RFC    0                       3                       ↑
                   WMC    0                       4                       ↑
BOTH (13 classes)  CBO    7                       11                      ↑
                   RFC    9                       11                      ↑
                   WMC    7                       10                      ↑
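To compare the two systems on a common scale, the BOTH rows of the two result tables can be turned into percentages. The counts below are copied from the slides (the banking WMC row is read as 10 and 10, which matches the sum of its MF and LF rows); the helper name `accuracy` is introduced here for illustration.

```python
def accuracy(correct, total):
    """Percentage of classes classified correctly."""
    return 100.0 * correct / total


# (raw data, log-transformed data) correct counts from the BOTH rows.
results = {
    ("Banking system", 11): {"CBO": (7, 10), "RFC": (8, 8), "WMC": (10, 10)},
    ("E-commerce system", 13): {"CBO": (7, 11), "RFC": (9, 11), "WMC": (7, 10)},
}

for (system, total), models in results.items():
    for model, (raw, logged) in models.items():
        print(f"{system:<18} {model}: "
              f"{accuracy(raw, total):5.1f}% -> {accuracy(logged, total):5.1f}%")
```

Expressed this way, no model does worse overall after the log transformation, although the e-commerce MF rows show that RFC and WMC each lose one correct classification among the most faulty classes.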

Conclusions and Future Work
CK-CBO, CK-RFC and CK-WMC can have different distributions in different projects.
Simple log transformations seem to improve the prediction ability of LR models, especially when the project measures are not as spread out as those used in the construction of the model.
Future work: further data exploration and study of data transformations.

Thank you! Questions, comments … contact:
