Report on Data Cleaning Framework
Shahbaz Hassan Wasti
Benchmarks/metrics for Data Cleaning Techniques
It is very difficult to measure the accuracy of different data cleaning techniques, and no standard benchmarks or metrics exist to compare and evaluate them. Research is ongoing to define such benchmarks and metrics. Dasu et al. have proposed a Statistical Distortion metric to evaluate the effect of data cleaning on the data: it measures how far the cleaned data has been distorted from the original.
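To illustrate the idea (this is not the authors' exact formulation, which is based on distributional distance measures; the variant below is a simplified stand-in using total variation distance), a minimal sketch of a distortion score comparing a column's value distribution before and after cleaning:

```python
from collections import Counter

def statistical_distortion(original, cleaned):
    """Simplified distortion score: total variation distance between the
    value distributions of a column before and after cleaning.
    0.0 means cleaning did not shift the distribution; 1.0 is maximal shift."""
    p, q = Counter(original), Counter(cleaned)
    n_p, n_q = len(original), len(cleaned)
    support = set(p) | set(q)
    return 0.5 * sum(abs(p[v] / n_p - q[v] / n_q) for v in support)

# Hypothetical example: cleaning collapses misspelled variants into
# canonical values, which shifts the column's distribution.
original = ["civic", "civik", "civic", "accord", "acord"]
cleaned  = ["civic", "civic", "civic", "accord", "accord"]
print(statistical_distortion(original, cleaned))
```

A cleaner that aggressively rewrites values scores high on this measure even if each individual change looks plausible, which is the point of the metric.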
Common Practices to evaluate Data Cleaning Techniques
I have studied the experiments presented in several research papers that compare and evaluate data cleaning techniques. The following common practices are used by researchers to compare one technique against baselines:
Counting the number of errors cleaned in the dirty data against a "ground truth"; the ground truth is prepared with the help of experts who manually clean a sample of the data
Measuring scalability as the time taken to clean the data with respect to the noise percentage in the data
Using precision, recall, and F-measure to estimate how well errors are corrected:
Precision = (# correctly changed values) / (# all changes)
Recall = (# correctly changed values) / (# all errors)
F-measure = 2 × (Precision × Recall) / (Precision + Recall)
Validating a sample of the cleaned data with the help of the crowd
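The formulas above can be sketched as follows, assuming the dirty, cleaned, and ground-truth versions of one column are available as parallel lists (the column values here are hypothetical):

```python
def cleaning_scores(dirty, cleaned, truth):
    """Precision, recall, and F-measure for one cleaning run, comparing
    each cell of the dirty and cleaned data against the ground truth."""
    changes = sum(d != c for d, c in zip(dirty, cleaned))              # all changes made
    correct_changes = sum(d != c and c == t
                          for d, c, t in zip(dirty, cleaned, truth))   # changes matching GT
    errors = sum(d != t for d, t in zip(dirty, truth))                 # all errors present
    precision = correct_changes / changes if changes else 0.0
    recall = correct_changes / errors if errors else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure

# 3 changes made, 2 of them correct; 3 errors present, 1 missed ("bmv");
# one originally correct value ("ford") was wrongly changed.
dirty   = ["hond", "toyta", "ford", "bmv", "audi"]
cleaned = ["honda", "toyota", "fort", "bmv", "audi"]
truth   = ["honda", "toyota", "ford", "bmw", "audi"]
print(cleaning_scores(dirty, cleaned, truth))
```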
Experiment Using the BayesWipe Technique
To evaluate the Bayesian cleaning method presented in the last meeting, I downloaded the tool from the authors' website. BayesWipe is designed to clean typographic, missing-value, and substitution errors. To explore BayesWipe I selected two datasets:
Car sales data downloaded from the authors' website, with ground truth and a dirty sample
University of Education admission data extracted from the University of Education database
Statistics of Cars Dataset
The sample dirty dataset contains 9,124 tuples. All attributes are categorical and contain string data.

Columns     Dirty Records   Distinct (Dirty)   Distinct (GT)
Model       42              225                198
Make        47              53                 18
Type        40              34                 12
Year        41              27
Condition   24              2
Wheelbase   3               5
Doors       4
Engine      71
Results of the Cars Dataset after Applying BayesWipe
The dataset was processed several times with the same errors in the data. Surprisingly, variations were observed in the results after every run. To compare the results, I placed the ground-truth, dirty, and cleaned datasets in the same spreadsheet. Some records were cleaned correctly while others were ignored, and the algorithm also wrongly changed correct data. Most of the wrong corrections occurred in attributes containing alphanumeric or numeric data. I have prepared a results summary of two runs.
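The per-attribute tallies in the run summaries can be produced by aligning the three spreadsheets row by row. A minimal sketch (the function name and sample values are my own, not from the tool):

```python
def run_summary(dirty, cleaned, truth):
    """Tally one attribute of a cleaning run by aligning the dirty,
    cleaned, and ground-truth versions of the column row by row."""
    stats = {"Dirty": 0, "Clean": 0, "Not Cleaned": 0, "Wrong Clean": 0}
    for d, c, t in zip(dirty, cleaned, truth):
        if d != t:                        # the cell was dirty
            stats["Dirty"] += 1
            if c == t:
                stats["Clean"] += 1       # error fixed correctly
            else:
                stats["Not Cleaned"] += 1 # error ignored or mis-fixed
        elif c != d:                      # originally correct cell was changed
            stats["Wrong Clean"] += 1
    return stats

dirty   = ["civik", "accord", "civic"]
cleaned = ["civic", "accord", "honda"]
truth   = ["civic", "accord", "civic"]
print(run_summary(dirty, cleaned, truth))
```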
First Run Results

Attribute   Dirty   Clean   Not Cleaned   Wrong Clean   Distinct (GT)   Distinct (Dirty)
Model       42      18      24            22            198             225
Make        47      26      21                                          53
Type        40      16                    3             12              34
Year        41      28      13            146                           27
Condition   19                                                          2
Wheelbase   15      7                                                   5
Doors       14      4
Engine      23                                                          71
Second Run Results

Attribute   Dirty   Clean   Not Cleaned   Wrong Clean   Distinct (GT)   Distinct (Dirty)
Model       42      25      17            28            198             225
Make        47      32      15            1             18              53
Type        40      26      14            8             12              34
Year        41      19      22            122                           27
Condition   24      5                                                   2
Wheelbase   21      13                                                  3
Doors       9       4
Engine                                                                  71
First & Second Run Result Variation
Statistics of University of Education Dataset (UE Dataset)
A total of 1,000 clean records were randomly extracted from the database; these clean records are treated as ground truth. The cardinality and degree of the dataset can be increased for further experiments. I chose the UE dataset because of the availability of ground-truth data. All columns are of string type and contain categorical data. The dataset has the following columns: Shift, Category, Campus, Program, Requirement Level, City, Admission Year, Last Examination Year.
UE Dataset

Typo errors were manually introduced into the following columns:

Columns    Dirty Records   Distinct (Dirty)   Distinct (GT)
Shift      110             9                  2
Category   44              19                 7
Campus     125             78                 13
Program    127             57                 23
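The typo injection described above can be sketched as follows. This is my own illustrative helper, not the exact procedure used to prepare the dataset; it drops one character from roughly a given fraction of the values and keeps the original list as ground truth:

```python
import random

def inject_typos(values, rate=0.1, seed=42):
    """Return a copy of a clean column with simple typos (one dropped
    character) injected into roughly `rate` of the multi-character values."""
    rng = random.Random(seed)
    dirty = list(values)
    for i, v in enumerate(dirty):
        if len(v) > 1 and rng.random() < rate:
            pos = rng.randrange(len(v))
            dirty[i] = v[:pos] + v[pos + 1:]   # drop one character
    return dirty

# Example on hypothetical Shift values.
print(inject_typos(["Morning", "Evening", "Morning"], rate=0.5))
```

Fixing the seed keeps the injected errors identical across runs, which matters here because the cleaning results are being compared between runs on the same dirty data.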
First Run on UE Dataset

Despite running BayesWipe on the UE dataset, no records were cleaned.

Columns    Dirty   Clean   Not Cleaned   Wrong Clean   Distinct (GT)   Distinct (Dirty)
Shift      110     0       110           0             2               9
Category   44      0       44            0             7               19
Campus     125     0       125           0             13              78
Program    127     0       127           0             23              57
My Focus

I will continue my experiments with BayesWipe on open datasets available in the UCI Machine Learning Repository. The experiments above only checked typographic errors; next I will prepare a dataset with missing values. The main problem with using open datasets is the availability of ground truth. I am also studying the source and error models of the technique to find the reasons for the inconsistency in the results.
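For the planned missing-value dataset, a minimal sketch of the preparation step (again my own illustrative helper, under the assumption that missing cells are represented as empty strings, with the original list kept as ground truth):

```python
import random

def inject_missing(values, rate=0.1, token="", seed=0):
    """Blank out roughly `rate` of the values to simulate missing data,
    keeping the original list as ground truth."""
    rng = random.Random(seed)
    return [token if rng.random() < rate else v for v in values]

# Example on hypothetical Campus values.
print(inject_missing(["Lahore", "Multan", "Lahore", "Okara"], rate=0.5))
```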
Wrong Correction/Overcorrection Example