Report on Data Cleaning Framework


Report on Data Cleaning Framework
Shahbaz Hassan Wasti

Benchmarks/Metrics for Data Cleaning Techniques

Measuring the accuracy of data cleaning techniques is difficult: no standard benchmarks or metrics exist for comparing and evaluating them, and research is ongoing to define such standards. Dasu et al. have proposed a Statistical Distortion metric that evaluates the effect of cleaning by quantifying how far the cleaned dataset has been distorted from the original.
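The exact distance measure in Dasu et al.'s formulation may differ; as a minimal sketch of the idea, the snippet below uses total variation distance between the empirical value distributions of a column before and after cleaning (0 means no distortion):

```python
from collections import Counter

def value_distribution(column):
    """Empirical frequency distribution of a column's values."""
    counts = Counter(column)
    total = len(column)
    return {v: c / total for v, c in counts.items()}

def statistical_distortion(original, cleaned):
    """Total variation distance between the value distributions of the
    original and cleaned column: 0 means no distortion, 1 is maximal."""
    p = value_distribution(original)
    q = value_distribution(cleaned)
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(v, 0) - q.get(v, 0)) for v in support)
```

A cleaner that aggressively rewrites values toward the most frequent ones will score high on this measure even if it fixes many errors, which is exactly the trade-off the metric is meant to expose.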

Common Practices to Evaluate Data Cleaning Techniques

I have studied the experiments presented in several research papers that compare and evaluate data cleaning techniques. The following common practices are used to compare one technique with other baselines:

- Counting the number of errors cleaned in the dirty data against a "ground truth"; the ground truth is prepared with the help of an expert who manually cleans a sample of the data
- Measuring scalability as the time taken to clean with respect to the noise percentage in the data
- Using precision and recall to estimate the detected errors:
  Precision = (# correctly changed values) / (all changes)
  Recall = (# correctly changed values) / (all errors)
  F-measure = 2 × (Precision × Recall) / (Precision + Recall)
- Validating a sample of the cleaned data with the help of a crowd
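The precision/recall practice above can be sketched as a small function that compares the dirty, cleaned, and ground-truth versions of a dataset cell by cell (the parallel-row representation is an assumption about how the data is held in memory):

```python
def cleaning_metrics(dirty, cleaned, truth):
    """Precision, recall, and F-measure for one cleaning run, computed
    cell-wise against ground truth.  All three inputs are parallel
    lists of equal-length row tuples."""
    changes = correct_changes = errors = 0
    for d_row, c_row, t_row in zip(dirty, cleaned, truth):
        for d, c, t in zip(d_row, c_row, t_row):
            if d != t:
                errors += 1              # a cell that needed cleaning
            if c != d:
                changes += 1             # the cleaner changed this cell
                if c == t:
                    correct_changes += 1 # ...and got it right
    precision = correct_changes / changes if changes else 0.0
    recall = correct_changes / errors if errors else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure
```

Note that a cleaner that changes nothing gets precision 0 by this convention, and one that changes correct values is penalized on precision only, which is why the wrong-correction counts reported later matter.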

Experiment Using the BayesWipe Technique

To evaluate the Bayesian cleaning method presented in the last meeting, I downloaded the tool from the authors' website. BayesWipe is designed to clean typographic, missing-value, and substitution errors. To explore BayesWipe I selected two datasets:

- Car sales data downloaded from the authors' website, with a ground truth and a dirty sample
- University of Education admission data extracted from the University of Education database

Statistics of the Cars Dataset

The sample dirty dataset contains 9124 tuples. All attributes are categorical and contain string data.

Column    | Dirty Records | Distinct Values (Dirty Data) | Distinct Values (Ground Truth)
Model     | 42            | 225                          | 198
Make      | 47            | 53                           | 18
Type      | 40            | 34                           | 12
Year      | 41            | 27                           |
Condition | 24            | 2                            |
Wheelbase | 3             | 5                            |
Doors     | 4             |                              |
Engine    | 71            |                              |
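The per-column counts in a table like this can be derived directly from the dirty sample and the ground truth; a minimal sketch, assuming both are held as parallel lists of row tuples:

```python
def column_statistics(dirty_rows, truth_rows, columns):
    """Per-column counts: how many cells differ from the ground truth,
    and the number of distinct values in each version of the data."""
    stats = {}
    for i, name in enumerate(columns):
        dirty_vals = [row[i] for row in dirty_rows]
        truth_vals = [row[i] for row in truth_rows]
        stats[name] = {
            "dirty_records": sum(d != t for d, t in zip(dirty_vals, truth_vals)),
            "distinct_dirty": len(set(dirty_vals)),
            "distinct_truth": len(set(truth_vals)),
        }
    return stats
```

Comparing distinct-value counts between the dirty and clean versions is a quick sanity check: typographic errors inflate the distinct count (e.g., Model has 225 distinct dirty values against 198 in the ground truth).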

Results on the Cars Dataset After Applying BayesWipe

The dataset was processed several times with the same errors present in the data. Surprisingly, the results varied on every run. To compare the results, I placed the ground truth dataset, the dirty dataset, and the cleaned dataset in the same spreadsheet. Some records were cleaned correctly while others were ignored, and the algorithm also wrongly changed data that was already correct. Most of the wrong corrections occurred in attributes containing alphanumeric or numeric data. I have prepared a summary of the results of two runs.
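The spreadsheet comparison described above amounts to bucketing every cell the way the result tables below do; a sketch under the same parallel-rows assumption:

```python
def classify_corrections(dirty, cleaned, truth):
    """Bucket each cell of a cleaning run:
    'clean'       - a dirty cell the cleaner fixed,
    'not_cleaned' - a dirty cell the cleaner left wrong,
    'wrong_clean' - a correct cell the cleaner wrongly changed."""
    counts = {"clean": 0, "not_cleaned": 0, "wrong_clean": 0}
    for d_row, c_row, t_row in zip(dirty, cleaned, truth):
        for d, c, t in zip(d_row, c_row, t_row):
            if d != t:                   # the cell was dirty
                if c == t:
                    counts["clean"] += 1
                else:
                    counts["not_cleaned"] += 1
            elif c != d:                 # correct cell, but changed
                counts["wrong_clean"] += 1
    return counts
```

Running this per attribute rather than over the whole table reproduces the Clean / Not Cleaned / Wrong Clean columns of the run summaries.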

First Run Results

Attribute | Dirty | Clean | Not Cleaned | Wrong Clean | Distinct Values in GT | Distinct Values in Dirty Data
Model     | 42    | 18    | 24          | 22          | 198                   | 225
Make      | 47    | 26    | 21          |             |                       | 53
Type      | 40    | 16    | 3           |             | 12                    | 34
Year      | 41    | 28    | 13          | 146         |                       | 27
Condition | 19    |       |             |             |                       | 2
Wheelbase | 15    | 7     |             |             |                       | 5
Doors     | 14    | 4     |             |             |                       |
Engine    | 23    |       |             |             |                       | 71

First Run Results

Second Run Results

Attribute | Dirty | Clean | Not Cleaned | Wrong Clean | Distinct Values in GT | Distinct Values in Dirty Data
Model     | 42    | 25    | 17          | 28          | 198                   | 225
Make      | 47    | 32    | 15          | 1           | 18                    | 53
Type      | 40    | 26    | 14          | 8           | 12                    | 34
Year      | 41    | 19    | 22          | 122         |                       | 27
Condition | 24    | 5     |             |             |                       | 2
Wheelbase | 21    | 13    | 3           |             |                       |
Doors     | 9     | 4     |             |             |                       |
Engine    | 71    |       |             |             |                       |

Second Run Results

First & Second Run Result Variation

Statistics of the University of Education Dataset (UE Dataset)

A total of 1000 clean records were randomly extracted from the database; these clean records are treated as the ground truth. The cardinality and degree of the dataset can be increased for further experiments. I chose the UE dataset because of the availability of ground truth data. All columns are of string type and contain categorical data. The dataset has the following columns: Shift, Category, Campus, Program, Requirement Level, City, Admission Year, Last Examination Year.

UE Dataset

Typographic errors were manually introduced into the following columns:

Column   | Dirty Records | Distinct Values (Dirty Data) | Distinct Values (Ground Truth)
Shift    | 110           | 9                            | 2
Category | 44            | 19                           | 7
Campus   | 125           | 78                           | 13
Program  | 127           | 57                           | 23
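The manual error injection above can be automated so the untouched copy serves as ground truth. A minimal sketch, assuming a simple deletion-based typo model (one dropped character per corrupted cell); the real experiment may have used other edit types as well:

```python
import random

def introduce_typos(rows, column_index, n_errors, seed=0):
    """Inject simple typographic errors (one dropped character) into
    randomly chosen cells of one column and return a dirty copy.
    The unmodified input remains the ground truth."""
    rng = random.Random(seed)          # seeded for reproducible dirt
    dirty = [list(row) for row in rows]
    targets = rng.sample(range(len(dirty)), n_errors)
    for i in targets:
        value = dirty[i][column_index]
        if len(value) > 1:
            pos = rng.randrange(len(value))
            dirty[i][column_index] = value[:pos] + value[pos + 1:]
    return [tuple(row) for row in dirty]
```

Seeding the generator matters here: BayesWipe's run-to-run variation is easier to study when the injected errors themselves are held fixed.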

First Run on the UE Dataset

BayesWipe did not clean any records in the UE Dataset.

Column   | Dirty | Clean | Not Cleaned | Wrong Clean | Distinct Values in GT | Distinct Values in Dirty Data
Shift    | 110   | 0     |             |             | 2                     | 9
Category | 44    | 0     |             |             | 7                     | 19
Campus   | 125   | 0     |             |             | 13                    | 78
Program  | 127   | 0     |             |             | 23                    | 57

My Focus

- I will continue my experiments with BayesWipe on open datasets available in the UCI Machine Learning Repository; the main problem with open datasets is the availability of ground truth.
- The experiments above only checked typographic errors; next I will prepare a dataset with missing values.
- I am working through the source and error models of the technique to find the reasons for the inconsistency in its results.

Wrong Correction/Overcorrection Example