Filtering, Robust Filtering, Polishing: Techniques for Addressing Quality in Software Data. Gernot Liebchen, Bheki Twala, Martin Shepperd, Michelle Cartwright, Mark Stephens.


Filtering, Robust Filtering, Polishing: Techniques for Addressing Quality in Software Data Gernot Liebchen Bheki Twala Martin Shepperd Michelle Cartwright Mark Stephens

What Is It All About?
–Data Quality
–What is noise?
–Dataset (very brief!)
–The Experiment
–Future Work

Data Quality Data quality is an issue for people working with the data. If ignored, it can result in false assumptions about the data. Garbage In = Garbage Out.

What Is Noise? Well, what is quality data?
–Data without problematic data
What is problematic data?
–Data can be inaccurate (so it is contaminated)
–Data can be atypical and stick out from the rest of the data (outliers)
So problematic data can be caused by noise, but it doesn’t have to be: we just might not have understood all the mechanisms which produced the data.

What Is Noise? II We focussed on inaccurate data. Outliers can pose a problem to the analyst, but since they are ‘real’ instances they can be of value. Now, inaccurate data can be plausible or implausible. Since it is difficult to identify ‘unreal’ instances, we deduce how much noise is left in a dataset by counting the implausible instances.

The Data Set We were given a large dataset provided by EDS (maybe a little about EDS?). The original dataset contains more than cases, with 22 attributes. It contains information about software projects carried out since the beginning of the 1990s. Some attributes are more administrative (e.g. Project Name, Project ID) and might not have any impact on software productivity.

Suspicions The data provider also mentioned that the data might contain noise. This was confirmed by our preliminary analysis, which also indicated the existence of outliers.

How Could It Occur? (in the case of the dataset)
–Input errors (some teams might be more meticulous than others), and the person approving the data might not be meticulous either
–Misunderstood standards
–The input tool might not provide range checking (or only limited checking)
–Management pressure: extreme projects were noted and acted upon
–Client pressure: benchmark restrictions

What Did We Actually Do? We applied three different noise-handling methods:
–Filtering: find noisy instances and chuck them out.
–Robust Filtering: build a model (a tree) and then prune it, which also eliminates instances from the analysis.
–Filter and Polish: take the instances that were chucked out by Filtering and correct (polish) them instead of discarding them.
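Filtering and Filter-and-Polish can be sketched in Python. This is a minimal illustration, not the study's exact setup: scikit-learn's DecisionTreeClassifier stands in for the tree learners used in the study, the synthetic data, the 10% noise rate, and the 5-fold cross-validated flagging are all assumptions for demonstration. (Robust Filtering is omitted, since it works inside the tree learner's own pruning step.)

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_predict

def filter_noise(X, y):
    """Filtering: flag instances whose cross-validated prediction
    disagrees with their recorded label, and chuck them out."""
    pred = cross_val_predict(DecisionTreeClassifier(random_state=0), X, y, cv=5)
    flagged = pred != y
    return X[~flagged], y[~flagged], flagged

def filter_and_polish(X, y):
    """Filter and Polish: keep the flagged instances, but replace
    their label with the model's prediction instead of dropping them."""
    pred = cross_val_predict(DecisionTreeClassifier(random_state=0), X, y, cv=5)
    flagged = pred != y
    y_polished = np.where(flagged, pred, y)
    return X, y_polished, flagged

# Synthetic demo: two well-separated classes with ~10% of labels flipped
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 1, (100, 2))])
y = np.repeat([0, 1], 100)
flip = rng.random(200) < 0.10
y_noisy = np.where(flip, 1 - y, y)

X_f, y_f, flagged = filter_noise(X, y_noisy)
X_p, y_p, flagged_p = filter_and_polish(X, y_noisy)
```

Filtering shrinks the dataset; polishing keeps all 200 instances but repairs the labels the model disagrees with.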

The Experiment We knew we were interested in effort, therefore we needed effort. We then categorised effort in order to establish whether an instance had the correct effort value.
–How? Build a model using 80% of the set and then test the instances in the remaining 20% (but that happens later).
Then we cleaned the data set using the three noise-handling methods.
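The categorisation and the 80/20 split can be sketched as follows; the lognormal effort distribution, the dataset size of 436, and the quartile binning scheme are all illustrative assumptions, not the study's actual choices:

```python
import numpy as np

# Hypothetical continuous effort values for 436 projects
rng = np.random.default_rng(0)
effort = rng.lognormal(mean=6.0, sigma=1.0, size=436)

# Categorise continuous effort into classes so a classifier can judge
# whether an instance's recorded effort value looks correct
# (quartile bins are an assumption made here for illustration)
bins = np.quantile(effort, [0.25, 0.5, 0.75])
effort_class = np.digitize(effort, bins)  # classes 0..3

# Build the model on the first 80% of the set, test on the last 20%
split = int(0.8 * len(effort))
train_classes, test_classes = effort_class[:split], effort_class[split:]
```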

Pilot Study Compare the classification error (clean the data, then train a tree and test it) over different noise levels.
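The pilot-study measurement loop can be sketched like this: inject class noise into the training labels at increasing rates and record the resulting test-set error of a tree. The synthetic dataset, the noise levels, and the tree parameters are assumptions for illustration, not the pilot study's actual configuration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def tree_error_with_noise(noise_rate, seed=0):
    """Flip training labels at the given rate, train a tree, and
    return the classification error on the untouched test set."""
    X, y = make_classification(n_samples=500, n_features=8, random_state=seed)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed)
    rng = np.random.default_rng(seed)
    flip = rng.random(len(y_tr)) < noise_rate  # inject class noise
    y_tr = np.where(flip, 1 - y_tr, y_tr)
    clf = DecisionTreeClassifier(random_state=seed).fit(X_tr, y_tr)
    return 1.0 - clf.score(X_te, y_te)

errors = {level: tree_error_with_noise(level) for level in (0.0, 0.1, 0.2, 0.3)}
```

In the study, the same error comparison is run after each cleaning method to see which one best limits the damage noise does.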

Main Study Compare the number of implausible productivity values.
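Counting implausible productivity values can be sketched as a simple range check on productivity derived from size and effort. The distributions and the plausibility bounds below are assumptions for illustration; the paper does not specify them here:

```python
import numpy as np

# Hypothetical size (e.g. function points) and effort (hours) figures
rng = np.random.default_rng(1)
size = rng.lognormal(mean=5.0, sigma=1.0, size=436)
effort = rng.lognormal(mean=6.0, sigma=1.5, size=436)

productivity = size / effort
lo, hi = 0.01, 5.0  # assumed plausible productivity range
implausible = (productivity < lo) | (productivity > hi)
n_implausible = int(implausible.sum())
```

Since truly ‘unreal’ instances cannot be identified directly, the count of implausible values serves as a proxy for how much noise each cleaning method left behind.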

Results: Pilot Study

Results: Main Study Filtering produced a list of 283 cases out of 436. Robust Filtering produced a list of 190 out of 436. Both lists were inspected, and both contain a large number of possibly true cases.

Where to go from here? Simulation, to investigate the true noise level and to investigate the bias introduced by noise handling.

What was it all about?
–Data Quality
–What is noise?
–Dataset (very brief!)
–The Experiment
–Results
–Future Work

Any Questions?