Filtering, Robust Filtering, Polishing: Techniques for Addressing Quality in Software Data Gernot Liebchen Bheki Twala Martin Shepperd Michelle Cartwright Mark Stephens
What Is It All About? Data Quality What is noise? Dataset (very brief!) The Experiment Future Work
Data Quality Data quality is an issue for people working with the data. If ignored, it can result in false assumptions about the data. Garbage In = Garbage Out
What Is Noise? Well, what is quality data? –Data without problematic data Problematic data? –Data can be inaccurate (so it's contaminated) –It can be atypical and stick out from the rest of the data (outliers). An outlier can be caused by noise, but it doesn't have to be; we just might not have understood all the mechanisms which produced the data.
What Is Noise? II We focussed on inaccurate data. Outliers can pose a problem to the analyst, but since they are 'real' instances they can be of value. Inaccurate data can be plausible or implausible. Since it is difficult to identify 'unreal' instances that look plausible, we deduce how much noise is left in a dataset by counting the implausible instances.
The Data Set A large dataset provided by EDS (maybe a little about EDS?). The original dataset contains more than cases with 22 attributes. It contains information about software projects carried out since the beginning of the 1990s. Some attributes are more administrative (e.g. Project Name, Project ID) and might not have any impact on software productivity.
Suspicions The data provider mentioned that the data might contain noise. Preliminary analysis of the data confirmed this and also indicated the existence of outliers.
How Could It Occur? (in the case of the dataset) Input errors (some teams might be more meticulous than others) / the person approving the data might not be as meticulous Misunderstood standards The input tool might not provide range checking (or maybe limited) Management pressure – extreme projects were noted and acted upon. Client pressure – benchmark restrictions.
What Did We Actually Do? Applied three different noise handling methods. Filtering: find suspect instances and chuck them out. Robust Filtering: build a model (tree) and then prune the tree, which also eliminates instances from the analysis. Filter and Polish: take the instances which were chucked out by Filtering and alter them.
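The difference between Filtering and Filter and Polish can be sketched in a few lines. This is a minimal illustration, not the method used in the study: a small nearest-neighbour vote stands in for the decision tree, the leave-one-out check is an assumption, and the toy data and the names `knn_predict` and `filter_and_polish` are hypothetical.

```python
from collections import Counter

def knn_predict(train, x, k=3):
    """Predict the class of x by majority vote among its k nearest training rows."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def filter_and_polish(data, k=3):
    """Leave-one-out check of every instance against the rest of the data.

    Returns (kept, removed, polished):
      kept     - instances whose label agrees with the prediction
      removed  - instances a plain filter would chuck out
      polished - the removed instances with the label altered to the prediction
    """
    kept, removed, polished = [], [], []
    for i, (x, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        predicted = knn_predict(rest, x, k)
        if predicted == label:
            kept.append((x, label))
        else:
            removed.append((x, label))
            polished.append((x, predicted))
    return kept, removed, polished

# Hypothetical toy data: (project size, effort category)
data = [(10, "low"), (12, "low"), (11, "high"), (13, "low"),
        (50, "high"), (52, "high"), (55, "high"), (51, "low")]
kept, removed, polished = filter_and_polish(data)
```

Filtering continues the analysis with `kept` only, whereas Filter and Polish continues with `kept + polished`, so no instance is lost but some labels have been changed.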
The Experiment We knew we were interested in effort, therefore we needed effort. We then categorised effort in order to establish whether an instance had the correct effort value. –How? Build a model using 80% of the set and then test the instances in the remaining 20%. But that happens later. We then cleaned the data set using the 3 noise handling methods.
Pilot Study Compare the classification error (clean, then train a tree and test it) over different noise levels.
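The categorisation and the 80/20 split can be sketched as follows. The band boundaries, the effort values and the function names are hypothetical; the slides do not say how many effort categories were used or how the split was drawn, so a seeded random split is assumed here.

```python
import random

def categorise(effort, boundaries):
    """Return the index of the effort band that `effort` falls into.

    boundaries must be sorted ascending; band 0 is at or below the first edge.
    """
    return sum(effort > b for b in boundaries)

def split_80_20(rows, seed=0):
    """Shuffle a copy of rows and return (first 80%, remaining 20%)."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical effort values (person-hours) and band edges
efforts = [120, 450, 3000, 80, 960, 15000, 700, 2200, 310, 5400]
boundaries = [500, 2500]

labelled = [(e, categorise(e, boundaries)) for e in efforts]
train, test = split_80_20(labelled)
```

A model built on `train` then predicts the effort category of each instance in `test`; an instance whose recorded category disagrees with the prediction is a candidate for noise handling.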
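One way to set up a comparison over noise levels is to corrupt a known fraction of class labels and measure the error against the clean labels. The sketch below assumes label noise injected uniformly at random; the slides do not describe how the noise levels were produced, and the function names and rates are hypothetical.

```python
import random

def inject_label_noise(labels, rate, classes, seed=0):
    """Return a copy of labels with round(len(labels) * rate) entries changed
    to a different class chosen at random."""
    rng = random.Random(seed)
    n_flip = round(len(labels) * rate)
    noisy = list(labels)
    for i in rng.sample(range(len(labels)), n_flip):
        noisy[i] = rng.choice([c for c in classes if c != labels[i]])
    return noisy

def classification_error(predicted, actual):
    """Fraction of positions where predicted and actual disagree."""
    wrong = sum(p != a for p, a in zip(predicted, actual))
    return wrong / len(actual)

classes = ["low", "medium", "high"]
labels = classes * 10  # 30 hypothetical clean labels

# Error of the noisy labels against the clean ones at each assumed noise level
errors = {rate: classification_error(inject_label_noise(labels, rate, classes), labels)
          for rate in (0.0, 0.1, 0.2)}
```

In a pilot of this kind, each noisy copy would be cleaned by one of the noise handling methods before a tree is trained and tested, and the resulting classification errors compared across methods and noise levels.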
Main Study Compare the number of implausible productivity values
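Counting implausible productivity values can be sketched as a simple range check. The definition of productivity as size divided by effort, the plausibility bounds and the project tuples are all assumptions made for illustration; the slides do not state which definition or thresholds were used.

```python
def count_implausible(projects, low, high):
    """Count projects whose productivity (size / effort) is outside [low, high].

    Non-positive size or effort is treated as implausible as well.
    """
    count = 0
    for size, effort in projects:
        if size <= 0 or effort <= 0:
            count += 1
            continue
        productivity = size / effort
        if not (low <= productivity <= high):
            count += 1
    return count

# Hypothetical projects: (size in function points, effort in person-hours)
projects = [(100, 800), (250, 2000), (90, 3), (400, 0), (60, 90000)]
n_implausible = count_implausible(projects, low=0.01, high=1.0)
```

Running the same count on the dataset left by each noise handling method gives a comparable figure for how much noise remains.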
Results Pilot
Results Main Study Filtering produced a list of 283 cases from 436 (about 65%). Robust Filtering produced a list of 190 from 436 (about 44%). Both lists were inspected and both contain a large number of possibly true cases.
Where to go from here? Simulation to investigate the true noise level, and to investigate the bias introduced by noise handling.
What was it all about? Data Quality What is noise? Dataset (very brief!) The Experiment Results Future Work
Any Questions?