How to Predict More with Less: Defect Prediction Using Machine Learners in an Implicitly Data Starved Domain
Kim Kaminsky, Gary D. Boetticher
Department of Computer Science, University of Houston - Clear Lake, Houston, Texas, USA
Preamble
The maturing of Software Engineering as a discipline requires a better understanding of the complexity of the software process. Empirical modeling is one mechanism for improving that understanding, and thus the management of the software process.
Data Starvation Issues in Software Engineering
- Heavily context dependent: measure A comes from Project X, measure B from Project Y
- Unreliable data due to poor processes
- Organizations do not share data
- Projects are large, so project estimation data occurs infrequently
Implicitly Data Starved Domains
- Lots of this: number of modules
- Little of that: defect counts (most modules have few or no recorded defects)
Equalized Learning
Balance data by replicating sparse instances [Mizuno99]. Example: a dataset with 300 instances of 0 defects, 20 instances of 5 defects, and 10 instances of 9 defects is equalized by replicating the two sparse classes until each defect class contributes 300 instances.
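The replication scheme on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation; the function name and the choice to pad every class up to the size of the largest class are assumptions consistent with the slide's 300/20/10 example.

```python
from collections import defaultdict

def equalize(instances, defect_counts):
    """Replicate sparse instances so that every defect-count class
    appears as often as the most common class (a sketch of the
    slide's equalization idea; details are assumed)."""
    by_class = defaultdict(list)
    for inst, d in zip(instances, defect_counts):
        by_class[d].append(inst)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for d, group in by_class.items():
        # Repeat the group until it reaches the target size.
        reps = -(-target // len(group))       # ceiling division
        padded = (group * reps)[:target]
        balanced.extend((inst, d) for inst in padded)
    return balanced

# The slide's example: 300 zero-defect, 20 five-defect,
# and 10 nine-defect modules.
data = [("mod", 0)] * 300 + [("mod", 5)] * 20 + [("mod", 9)] * 10
eq = equalize([x for x, _ in data], [d for _, d in data])
# Each class now contributes 300 samples, 900 in total.
```

Replication (rather than synthetic interpolation) keeps every training sample a real observation, at the cost of the much larger dataset noted in the conclusions.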
Genetic Programming Process - 1
- Each chromosome is an expression tree over the metrics and constants, e.g. A + B
- Fitness value = model performance on the data
- Example: two (of many) chromosomes score 888 out of 1000 and 913 out of 1000
Genetic Programming Process - 2
- Crossover: two parent chromosomes exchange randomly chosen subtrees, e.g. trees such as B + (3 - D) and A * (A + B)
- Mutation: a node within a chromosome is altered, e.g. the constant 3 in A * (3 - D) becomes 3.1
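The two operators can be sketched on the same nested-tuple trees. This is a simplified illustration (root-level subtree swap and constant perturbation only), not the authors' operators, which likely pick crossover and mutation points anywhere in the tree.

```python
import random

# Chromosomes are nested tuples (op, left, right) or terminals.
def crossover(p1, p2):
    """Swap one randomly chosen child subtree between two parent
    trees (root-level crossover; a simplification of full subtree swap)."""
    i = random.choice([1, 2])
    child1 = p1[:i] + (p2[i],) + p1[i + 1:]
    child2 = p2[:i] + (p1[i],) + p2[i + 1:]
    return child1, child2

def mutate(chrom, delta=0.1):
    """Perturb a constant terminal (e.g. 3 -> 3.1, as on the slide);
    otherwise recurse into one randomly chosen child."""
    if isinstance(chrom, (int, float)):
        return chrom + delta
    if isinstance(chrom, tuple):
        i = random.choice([1, 2])
        return chrom[:i] + (mutate(chrom[i], delta),) + chrom[i + 1:]
    return chrom  # variable terminals are left unchanged

# Crossover of A + B with 3 - D may yield A + (3 - D) or 3 + B, etc.
c1, c2 = crossover(("+", "A", "B"), ("-", 3, "D"))
```

Because crossover only recombines existing material, mutation is what introduces genuinely new constants such as the slide's 3.1.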
NASA KC2 Defect Dataset
- Input: product metrics (size, complexity, vocabulary)
- Output: defect count
- 379 unique tuples; equalization produces 3013 samples
Original versus Equalized Data: Experiment Configuration
- 2000 characters (chromosome size limit)
- 1000 chromosomes (population size)
- 50 generations maximum
- 20 trials
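The configuration above maps onto a standard generational GP loop. The sketch below fixes only the numbers the slide gives; the selection scheme (truncation to the top half) and the breeding interface are assumptions, since the slides do not specify them.

```python
import random

POP_SIZE = 1000          # 1000 chromosomes
MAX_GENERATIONS = 50     # 50 generations max.
TRIALS = 20              # 20 trials
MAX_CHROM_CHARS = 2000   # 2000-character chromosome limit

def run_trial(random_chromosome, fitness, breed):
    """One GP trial: evolve POP_SIZE chromosomes for up to
    MAX_GENERATIONS generations. The operators are passed in as
    functions; truncation selection here is an assumption."""
    pop = [random_chromosome() for _ in range(POP_SIZE)]
    for _ in range(MAX_GENERATIONS):
        ranked = sorted(pop, key=fitness, reverse=True)
        survivors = ranked[:POP_SIZE // 2]
        children = [breed(random.choice(survivors), random.choice(survivors))
                    for _ in range(POP_SIZE - len(survivors))]
        # Enforce the 2000-character size cap on offspring.
        pop = survivors + [c for c in children
                           if len(str(c)) <= MAX_CHROM_CHARS]
    return max(pop, key=fitness)

# Toy usage: chromosomes are plain numbers, fitness is identity,
# breeding averages two parents. Over TRIALS runs the best value
# converges toward the maximum the random initializer can produce.
best = run_trial(lambda: random.random(), lambda x: x,
                 lambda a, b: (a + b) / 2)
```

Running 20 such independent trials per dataset supports the t-test comparison on the next slide.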
Original versus Equalized Data: t-test Results
Conclusions
- Equalized learning spawns large datasets
- Equalized learning produces better models
Future Directions
- Apply to other NASA datasets
- Improve performance: distributed GP