
Defect Prediction Techniques He Qing 2015.9.24

What is Defect Prediction? Use historical data to predict defects, and to plan the allocation of defect detection resources (inspection and testing). Defect prediction techniques vary in the types of data they require: some require little data, others require more; some use product characteristics, others consider human factors. Techniques have strengths and weaknesses depending on the quality of the inputs used for prediction.

Why Analyze and Predict Defects?
Project Management – assess project progress; plan defect detection activities.
Product Assessment – decide product quality.
Process Management – assess process performance; improve capability.

Outline: Review of Defect Prediction; Change Burst as Prediction Metric; Defect Prediction with Semi-supervised Learning; Experiment Design and Future Work.

Outline: Review of Defect Prediction; Change Burst as Prediction Metric; Defect Prediction with Semi-supervised Learning; Experiment Design and Future Work.

Technique Review
A variety of features and models: reviewing module or component defect histories; various parametric models (e.g., Naïve Bayes, Logistic Regression, Decision Tree).
Product characteristics: size, complexity, cohesion, coupling.
Product history: code churn information, number of modifications, amount of V&V.
Both kinds of inputs feed predictions of fault proneness.

Recommendations The data available will help determine the technique: the availability of historical data drives model selection. Changes in the product, personnel, platform, or organizational structure can all be measured and accounted for in predictions.

Outline: Review of Defect Prediction; Change Burst as Prediction Metric; Defect Prediction with Semi-supervised Learning; Experiment Design and Future Work.

Detecting Change Bursts
Definition – a change burst is a sequence of consecutive changes.
Terms: build day – a day on which some change is committed; component – a .dll, .exe, etc. for Windows, a file for Eclipse.
For each component, we determine its change bursts as sequences of consecutive changes. These change bursts are governed by two parameters: gap size and burst size.

Gap size – the minimum distance between two changes for them to count as separate bursts; changes closer together than the gap size belong to the same burst. Burst size – the minimum number of changes a burst must contain to be counted.
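As a concrete illustration, here is a minimal sketch of burst detection under these definitions (the function name and the representation of a component's changes as build-day numbers are assumptions; changes at most gap_size build days apart are grouped into the same burst):

    def detect_change_bursts(change_days, gap_size, burst_size):
        """Group a component's changes (build-day numbers) into bursts.

        Consecutive changes at most gap_size build days apart fall into
        the same burst; only bursts with at least burst_size changes count.
        """
        bursts, current = [], []
        for day in sorted(change_days):
            if current and day - current[-1] > gap_size:
                if len(current) >= burst_size:
                    bursts.append(current)
                current = []
            current.append(day)
        if len(current) >= burst_size:
            bursts.append(current)
        return bursts

    # Example: two bursts survive; the isolated change on day 30 does not.
    print(detect_change_bursts([1, 2, 3, 10, 11, 12, 13, 30],
                               gap_size=2, burst_size=3))
    # -> [[1, 2, 3], [10, 11, 12, 13]]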

Change Burst Metrics For each component C, we determine four types of metrics: change metrics, temporal metrics, people metrics, and churn metrics.

1. Change Metrics
NumberOfChanges – assumes that the more a component is changed, the more likely it is to have defects.
NumberOfConsecutiveChanges – |bursts(C)| with burst size = 0; takes all sequences of consecutive changes into account, not just those exceeding a given size.
NumberOfChangeBursts – |bursts(C)| for a given gap size and burst size.
TotalBurstSize – the total number of changes across all change bursts.

2. Temporal Metrics
TimeFirstBurst – when the first burst occurred, normalized to the total number of builds; assumes that early change bursts may have the longest-lasting impact on the project.
TimeLastBurst – when the last burst occurred; the last activities before release determine the final shape of the component and thus may be particularly predictive.
TimeMaxBurst – when the largest burst occurred.

3. People Metrics
PeopleTotal – the number of people who ever committed a change to the component; assumes that the more people touch a component, the more defect-prone it becomes.
TotalPeopleInBurst – the total number of people involved across all bursts; the number of people involved in bursts may be particularly predictive.
MaxPeopleInBurst – the maximum number of people involved in any single burst.

4. Churn Metrics
ChurnTotal – the total number of lines added, deleted, or modified during changes to the component.
TotalChurnInBurst – the total churn across all change bursts.
MaxChurnInBurst – the maximum churn in any single burst.
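Taken together, these metrics are simple aggregations over the detected bursts. A sketch computing a representative subset for one component (the change-record fields day, author, and churn and the dict representation are assumptions for illustration; detect_change_bursts is the earlier sketch):

    def change_burst_metrics(changes, gap_size, burst_size, total_builds):
        """Compute representative change-burst metrics for one component.

        changes: list of dicts with keys 'day', 'author', 'churn'.
        """
        days = [c['day'] for c in changes]
        bursts = detect_change_bursts(days, gap_size, burst_size)

        def in_burst(burst):
            day_set = set(burst)
            return [c for c in changes if c['day'] in day_set]

        return {
            'NumberOfChanges': len(changes),
            'NumberOfChangeBursts': len(bursts),
            'TotalBurstSize': sum(len(b) for b in bursts),
            'TimeFirstBurst': bursts[0][0] / total_builds if bursts else None,
            'TimeLastBurst': bursts[-1][0] / total_builds if bursts else None,
            'PeopleTotal': len({c['author'] for c in changes}),
            'MaxPeopleInBurst': max((len({c['author'] for c in in_burst(b)})
                                     for b in bursts), default=0),
            'ChurnTotal': sum(c['churn'] for c in changes),
            'MaxChurnInBurst': max((sum(c['churn'] for c in in_burst(b))
                                    for b in bursts), default=0),
        }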

Outline: Review of Defect Prediction; Change Burst as Prediction Metric; Defect Prediction with Semi-supervised Learning; Experiment Design and Future Work.

Problems Current defect prediction techniques mainly rely on a sufficient amount of labeled historical project data, and assume that new projects share the same product/process characteristics, so that their defects can be predicted from previous data. However, historical data is often unavailable for new projects and for many organizations. Even when some previous project data is available, whether it can be used to construct effective prediction models for completely new projects remains to be verified.

Defect Prediction with Semi-supervised Learning To address these problems, a semi-supervised learning model is adopted: sample a small percentage of modules and examine their quality; construct a classification model based on the sample; then use it to predict the unsampled modules in the project.

Step 1 Construct a random forest with n random trees H = {h1, h2, …, hn} on the labeled data samples L. (Diagram: the training dataset is split into labeled samples L and unlabeled samples U; the classifiers H are built from L.)
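A sketch of Step 1 using scikit-learn's RandomForestClassifier as the ensemble H (the library choice, the synthetic data, and the value of n are assumptions; the slides do not name an implementation):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(0)
    X_labeled = rng.normal(size=(40, 10))     # L: the small labeled sample
    y_labeled = rng.integers(0, 2, size=40)   # defect labels (0 or 1)
    X_unlabeled = rng.normal(size=(200, 10))  # U: the unlabeled modules

    n_trees = 50  # n, the ensemble size (chosen for illustration)
    H = RandomForestClassifier(n_estimators=n_trees).fit(X_labeled, y_labeled)
    # The individual random trees h1..hn are available as H.estimators_.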

Step 2 Use H to predict unlabeled data samples U. Different random trees have different prediction results. We count the degree of disagreement for each data sample.
(Table: per-tree votes on the unlabeled samples)
            h1  h2  …  hi  …  hn
    file1    0   1      0      1
    file2    1   1      0      1
    file3    0   0      0      1
    file4    0   0      0      0
    …
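Continuing the Step 1 sketch, disagreement can be measured as the size of the minority vote across the trees (one plausible reading of the slides; other measures such as vote entropy would work similarly):

    # Per-tree predictions on U: shape (n_trees, |U|), as in the table above.
    votes = np.array([h.predict(X_unlabeled) for h in H.estimators_])
    ones = votes.sum(axis=0)                         # trees voting "defective"
    disagreement = np.minimum(ones, n_trees - ones)  # size of the minority vote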

Step 3 Select the M data samples with the highest disagreement, test their quality, and add them along with their labels into L. (Diagram: the selected samples are moved from U and added to the labeled set L.)
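A sketch of Step 3, continuing from above (M and the oracle_label stub, standing in for the human quality examination, are hypothetical):

    M = 10                                     # number of samples to label
    query_idx = np.argsort(disagreement)[-M:]  # M most-disagreed samples

    def oracle_label(X_rows):
        # Hypothetical stand-in for inspecting the sampled modules' quality.
        return rng.integers(0, 2, size=len(X_rows))

    new_labels = oracle_label(X_unlabeled[query_idx])

    # Move the newly labeled samples from U into L.
    X_labeled = np.vstack([X_labeled, X_unlabeled[query_idx]])
    y_labeled = np.concatenate([y_labeled, new_labels])
    X_unlabeled = np.delete(X_unlabeled, query_idx, axis=0)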

Step 4 – Iteration For each hi in H, construct the complement set H−i = H \ {hi}, and use H−i to label all the unlabeled data samples.
(Table: votes of H−i on the unlabeled samples; the hi column is excluded)
            h1  h2  …  hn
    file1    0   1      1
    file2    1   1      1
    file3    0   0      1
    file4    0   0      0
    …

Step 4 – Iteration (continued) Majority voting: if the number of votes for a specific class (0 or 1) exceeds a given threshold θ, add the data sample with that label into the set Li′. Then update hi by learning a new random tree on L ∪ Li′. (Diagram: confident samples from U are selected into Li′; hi is then updated using L ∪ Li′.)
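Continuing the sketch, one round of the Step 4 iteration might look like this (the value of θ and the use of max_features='sqrt' to approximate a "random tree" are assumptions):

    from sklearn.tree import DecisionTreeClassifier

    theta = 0.8                  # voting threshold θ (value assumed)
    trees = list(H.estimators_)

    for i in range(len(trees)):
        # Complement ensemble H_-i: every tree except h_i.
        others = [h for j, h in enumerate(trees) if j != i]
        votes = np.array([h.predict(X_unlabeled) for h in others])
        frac_ones = votes.mean(axis=0)

        # Keep samples where one class's vote share exceeds θ -> set L_i'.
        confident = (frac_ones >= theta) | (frac_ones <= 1 - theta)
        pseudo_y = (frac_ones >= 0.5).astype(int)

        # Update h_i by learning a random tree on L ∪ L_i'.
        X_i = np.vstack([X_labeled, X_unlabeled[confident]])
        y_i = np.concatenate([y_labeled, pseudo_y[confident]])
        trees[i] = DecisionTreeClassifier(max_features='sqrt').fit(X_i, y_i)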

Outline: Review of Defect Prediction; Change Burst as Prediction Metric; Defect Prediction with Semi-supervised Learning; Experiment Design and Future Work.

Experiment Randomly split the dataset into 10 folds: select 9 folds as the training dataset and the remaining fold as the testing dataset. Randomly select 20% of the training dataset as labeled data samples. Build models on the training dataset, then predict defects on the testing dataset. For the N outputs of the random forest, adopt majority voting to decide the final label.
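A sketch of this protocol with scikit-learn utilities (the synthetic dataset and the plain supervised forest are assumptions; the semi-supervised learner above would slot in where the forest is trained):

    import numpy as np
    from sklearn.model_selection import KFold
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 20))    # module features
    y = rng.integers(0, 2, size=500)  # defect labels

    accuracies = []
    for train_idx, test_idx in KFold(n_splits=10, shuffle=True,
                                     random_state=0).split(X):
        # Randomly take 20% of the training fold as the labeled sample L.
        labeled = rng.choice(train_idx, size=int(0.2 * len(train_idx)),
                             replace=False)
        # predict() already takes a majority vote over the N trees.
        model = RandomForestClassifier(n_estimators=50).fit(X[labeled],
                                                            y[labeled])
        accuracies.append((model.predict(X[test_idx]) == y[test_idx]).mean())
    print(sum(accuracies) / len(accuracies))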

Future Work
Weight each hi – during each iteration, some members of H−i misclassify samples; decreasing the weight of these classifiers may effectively improve prediction performance.
Dimensionality reduction – there are 500+ features in total; how to select effective features for model building remains to be discussed.
Comparative analysis of change metrics vs. static code attributes, and of supervised vs. semi-supervised learning models.
Vary the percentage of labeled data samples in the training dataset (e.g., 20%, 40%, 60%, 80%) and compare prediction accuracy.

Thank you for your attention!