A Methodology for Finding Bad Data

Jaime Miranda (1), Richard Weber (1), Derek Partridge (2)
(1) Departamento de Ingeniería Industrial, Universidad de Chile
(2) Department of Computer Science, University of Exeter, UK

Outline
Data Mining – Introduction
KDD Process: Knowledge Discovery in Databases
Methodology for finding and correcting bad data
Application of proposed methodology
Conclusions and future work

Process of knowledge discovery in databases (KDD)
Selection → selected data → Pre-processing → pre-processed data → Transformation → transformed data → Data Mining → patterns → Interpretation / Evaluation

Preprocessing
Example
Missing value (no value)
Value out of range (value impossible): Age = 250
“Bad data” (could be, but strange): Age = 112, or Age = 81 and student

Identification of data problems
Missing value (no value): identified with certainty
Value out of range (value impossible): identified with certainty
“Bad data” (could be, but strange): identification is uncertain

Missing values, example:

Treatment of missing values
Do not use the incomplete data
Flag the field as missing
Fill in the missing value (mean value, imputation algorithms)
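The "flag" and "fill in with the mean value" options can be combined, as in the minimal sketch below. It is an illustration only: the function name `impute_mean`, the use of `None` to mark a missing value, and the sample ages are assumptions, not part of the presentation.

```python
from statistics import mean

def impute_mean(values):
    """Replace None entries with the mean of the observed values.

    Returns (filled, was_missing); was_missing flags every position
    that originally held no value, so the fill-in stays traceable.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    fill = mean(observed)
    was_missing = [v is None for v in values]
    filled = [fill if v is None else v for v in values]
    return filled, was_missing

# Hypothetical Age column with two missing entries.
ages = [23, None, 41, 35, None]
filled, was_missing = impute_mean(ages)
print(filled)       # missing ages replaced by the mean of 23, 41, 35
print(was_missing)  # True where a value was filled in
```

Keeping the flag next to the filled column lets later steps treat imputed values differently from observed ones.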

Value out of range, example:

“Bad data”, example:

Proposed generic methodology to find and correct “bad data” 1 of 2 (“replace all”)
Identify candidates for “bad data” → Develop regression model with “good data” → Replace all “bad data” → STOP

Proposed generic methodology to find and correct “bad data” 2 of 2 (“replace iteratively”)
Identify candidates for “bad data” → “bad data” remaining?
No: STOP
Yes: Develop regression model with “good data” → Replace only “worst data” of remaining set of “bad data” → check again
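One possible reading of the “replace iteratively” variant is sketched below. Several points are assumptions rather than statements from the presentation: a one-predictor least-squares line stands in for the regression model, the “worst” candidate is taken to be the one farthest from the mean of the good values, and a repaired record is treated as “good data” in the next round. The names `fit_line` and `replace_iteratively` and the sample rows are hypothetical.

```python
from statistics import mean

def fit_line(pairs):
    """Least-squares line y = a + b*x through (x, y) pairs; returns a predictor."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in pairs) / sxx
    a = my - b * mx
    return lambda x: a + b * x

def replace_iteratively(rows, candidates):
    """rows: [x, y] records; candidates: indices whose y value is suspect."""
    rows = [list(r) for r in rows]
    remaining = set(candidates)
    while remaining:                                   # "bad data" remaining?
        good = [(x, y) for i, (x, y) in enumerate(rows) if i not in remaining]
        model = fit_line(good)                         # model built on "good data" only
        center = mean(y for _, y in good)
        # Assumption: "worst" = largest distance from the mean of the good values.
        worst = max(remaining, key=lambda i: abs(rows[i][1] - center))
        rows[worst][1] = model(rows[worst][0])         # replace only the worst value
        remaining.remove(worst)                        # repaired record now counts as good
    return rows

# Hypothetical data: y = 100 * x for the good records, two suspect y values.
rows = [[1, 100], [2, 200], [3, 300], [4, 400], [5, -900], [6, 9000]]
repaired = replace_iteratively(rows, candidates=[4, 5])
print(repaired[4], repaired[5])
```

The “replace all” variant corresponds to fitting the model once and replacing every candidate in a single pass, without the loop.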

Identify candidates for “bad data”
Analysis per column, independently: identify “deviation” from “norm”, e.g.
Deviation from mean value
Expert opinion
Combination of the two (filtering for expert judgement)
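A deviation-from-mean check on a single column might look like the sketch below. The threshold of 2 standard deviations, the function name `deviation_candidates`, and the sample ages are assumptions chosen for illustration; the presentation does not state which cut-off was used.

```python
from statistics import mean, pstdev

def deviation_candidates(values, threshold=2.0):
    """Indices of values lying more than `threshold` standard deviations
    from the column mean (the column is analysed on its own)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # constant column: nothing deviates
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]

# Hypothetical Age column containing one impossible value.
ages = [25, 31, 28, 40, 35, 29, 33, 250, 27, 30]
candidates = deviation_candidates(ages)
print(candidates)  # positions flagged as candidates for "bad data"
```

Extreme values inflate both the mean and the standard deviation, so a fixed threshold can miss some candidates; passing the flagged records to an expert, as the slide suggests, is one way to compensate.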

Develop regression model with “good data”
A_m = F(A_1, …, A_{m-1}), i.e. predict the “bad” attribute value from all the other (good) attribute values
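The idea of predicting the last attribute from the others can be sketched with any regression model F. The example below uses a k-nearest-neighbour average purely as a simple stand-in; it is not the model used in the presentation. The function names, k = 3, and the sample records (age, years employed, income) are assumptions.

```python
from math import dist
from statistics import mean

def knn_predict(good_rows, inputs, k=3):
    """Estimate the last attribute as the mean over the k good rows whose
    other attributes are closest to `inputs` (Euclidean distance)."""
    nearest = sorted(good_rows, key=lambda row: dist(row[:-1], inputs))[:k]
    return mean(row[-1] for row in nearest)

def replace_all(rows, bad_indices, k=3):
    """Build F from the good rows only, then replace the last attribute
    of every row flagged as bad."""
    bad = set(bad_indices)
    good_rows = [row for i, row in enumerate(rows) if i not in bad]
    repaired = [list(row) for row in rows]
    for i in bad:
        repaired[i][-1] = knn_predict(good_rows, rows[i][:-1], k)
    return repaired

# Hypothetical records: (age, years employed, income); the last income is bad.
rows = [
    (25, 2, 1000), (27, 3, 1200), (30, 5, 1400),
    (45, 20, 3000), (48, 22, 3200), (50, 25, 3400),
    (28, 4, -500),
]
repaired = replace_all(rows, bad_indices=[6])
print(repaired[6])
```

With attributes on very different scales, the distance would be dominated by the largest one, so scaling the inputs first is a sensible precaution.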

Example for proposed methodology: Customer segmentation

Clustering
[Figure: clusters]

Customer segmentation with clustering

Centers of 6 segments
Total database: 200,000 customers; take a subset of 320 customers for experimentation

Experiment
Take a subset of 320 customers
Change the value of attribute “Income” for 20 customers (10 values below the minimum (0) and 10 values above the maximum (5,000))
Apply the proposed methodology
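The experimental setup can be imitated on synthetic data, as in the sketch below. The uniformly distributed incomes, the random seeds, and how far the altered values fall outside the range 0 to 5,000 are all assumptions; the presentation only states that 10 values were set below the minimum and 10 above the maximum.

```python
import random

def corrupt_income(incomes, n_low=10, n_high=10, low=0, high=5000, seed=0):
    """Return a copy of `incomes` with n_low values pushed below `low` and
    n_high values pushed above `high`, plus the sorted changed indices."""
    rng = random.Random(seed)
    corrupted = list(incomes)
    chosen = rng.sample(range(len(incomes)), n_low + n_high)
    for i in chosen[:n_low]:
        corrupted[i] = low - rng.uniform(1, 1000)    # below the minimum
    for i in chosen[n_low:]:
        corrupted[i] = high + rng.uniform(1, 1000)   # above the maximum
    return corrupted, sorted(chosen)

# Synthetic stand-in for the Income attribute of the 320 customers.
data_rng = random.Random(1)
incomes = [data_rng.uniform(0, 5000) for _ in range(320)]
corrupted, changed = corrupt_income(incomes)
print(len(changed), "values altered")
```

Because the altered positions are known, the candidates found in the next step can be compared against `changed` to count how many altered values were recovered.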

Step 1: Identify candidates for “bad data”
Identify “deviation” for attribute Income (here: deviation from mean value)
Identified 18 of the 20 “strange values”

Step 2: Regression model used: neural network (MLP)
A_m = F(A_1, …, A_{m-1})

Neural networks
Natural neuron and artificial neuron: connections with weights, summation (Σ)

Neural networks (Multilayer Perceptron)
[Figure: multilayer perceptron with inputs A_1, …, A_{m-1} and output A_m]

Results (“replace all”)

Evaluation of Results

Results (“replace iteratively” vs. “replace all”)

Characteristics of proposed methodology
Identifies candidates for “bad data” per attribute (column) without looking at other attributes
Uses no background knowledge regarding attributes (e.g. that income cannot be negative)
Each step offers opportunities for different methods (here: deviation detection using distance to mean, regression model by neural network)

Future work
Apply to larger data sets
Try different techniques for identifying “candidates for bad data”, e.g. by looking at other attributes
Implementation in Matlab
