1
A Methodology for Finding Bad Data
Jaime Miranda (1), Richard Weber (1), Derek Partridge (2)
(1) Departamento de Ingeniería Industrial, Universidad de Chile
(2) Department of Computer Science, University of Exeter, UK
2
Outline
Data Mining – Introduction
KDD Process: Knowledge Discovery in Databases
Methodology for finding and correcting bad data
Application of proposed methodology
Conclusions and future work
3
Process of knowledge discovery in databases (KDD)
Selection -> selected data -> Pre-processing -> pre-processed data -> Transformation -> transformed data -> Data Mining -> Patterns -> Interpretation / Evaluation
4
Preprocessing
Types of data problems, with examples:
Missing value (no value)
Value out of range (impossible value), e.g. Age = 250
“Bad data” (possible, but strange), e.g. Age = 112, or Age = 81 and student
5
Identification of data problems
Missing value (no value): identification is certain
Value out of range (impossible value): identification is certain
“Bad data” (possible, but strange): identification is uncertain
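The three categories can be illustrated with a minimal sketch for the Age attribute. The function name and both ranges are assumptions made for illustration; the slides do not state concrete limits.

```python
from typing import Optional


def classify_age(age: Optional[float],
                 valid_range=(0, 130),    # assumed hard limits for a human age
                 usual_range=(0, 100)):   # assumed band of unremarkable values
    """Label a single Age value with one of the three problem types."""
    if age is None:
        return "missing"                  # certain: there is no value
    lo, hi = valid_range
    if not (lo <= age <= hi):
        return "out of range"             # certain: the value is impossible
    u_lo, u_hi = usual_range
    if not (u_lo <= age <= u_hi):
        return "bad data candidate"       # uncertain: possible, but strange
    return "ok"


labels = [classify_age(a) for a in (None, 250, 112, 35)]
print(labels)
```

Only the first two labels are decisions; the third merely marks a value for closer inspection, which is why the methodology treats it separately.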
6
Missing values, example:
7
Treatment of missing values
Do not use the field; flag it as missing data
Fill in the missing value (mean value, imputation algorithms)
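A minimal sketch of the two treatments combined: each missing entry is flagged and then filled with the mean of the observed values. The function name and the income figures are made up for illustration; more elaborate imputation algorithms would replace the mean here.

```python
from statistics import mean


def flag_and_fill(values):
    """Return (filled_values, missing_flags) using mean imputation."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to impute from")
    fill = mean(observed)
    flags = [v is None for v in values]                  # True where a value was missing
    filled = [fill if v is None else v for v in values]  # mean replaces each gap
    return filled, flags


incomes = [1200, None, 1800, 1500, None]  # made-up example column
filled, flags = flag_and_fill(incomes)
print(filled, flags)
```

Keeping the flags alongside the filled column lets later steps distinguish imputed values from observed ones.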
8
Value out of range, example:
9
“Bad data”, example:
10
Proposed generic methodology to find and correct “bad data” 1 of 2 (“replace all”)
1. Identify candidates for “bad data”
2. Develop regression model with “good data”
3. Replace all “bad data”
4. STOP
11
Proposed generic methodology to find and correct “bad data” 2 of 2 (“replace iteratively”)
1. Identify candidates for “bad data”
2. Develop regression model with “good data”
3. Replace only the “worst data” of the remaining set of “bad data”
4. “Bad data” remaining? If yes, repeat; if no, STOP
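The difference between “replace all” and “replace iteratively” can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: a one-predictor least-squares line stands in for the regression model, “worst” is taken to mean the largest gap between observed and predicted value, and the model is assumed to be refit after each single replacement. The names `fit_line` and `repair` are hypothetical.

```python
from statistics import mean


def fit_line(xs, ys):
    """Least-squares line y = a + b*x (stand-in for the regression model)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b


def repair(xs, ys, bad, iterative):
    """Replace ys[i] for every index i in `bad` with a model prediction.

    iterative=False: fit once on the good rows and replace all bad values.
    iterative=True: replace only the worst remaining value, then refit with
    the repaired row counted as good (assumed reading of the loop).
    """
    ys, bad = list(ys), set(bad)
    while bad:
        good = [i for i in range(len(ys)) if i not in bad]
        a, b = fit_line([xs[i] for i in good], [ys[i] for i in good])
        if iterative:
            # "worst" = largest gap between observed and predicted value
            todo = [max(bad, key=lambda i: abs(ys[i] - (a + b * xs[i])))]
        else:
            todo = list(bad)
        for i in todo:
            ys[i] = a + b * xs[i]
            bad.discard(i)
    return ys


xs = [1, 2, 3, 4, 5, 6]
ys = [2, 4, 6, 8, 100, -50]          # made-up data; rows 4 and 5 are "bad"
all_at_once = repair(xs, ys, {4, 5}, iterative=False)
one_by_one = repair(xs, ys, {4, 5}, iterative=True)
print(all_at_once, one_by_one)
```

On this toy data both variants agree because the good rows lie exactly on a line; on noisy data the iterative variant can differ, since each repaired value influences the next fit.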
12
Identify candidates for “bad data”
Analyze each column independently and identify “deviation” from the “norm”, e.g.:
Deviation from mean value
Expert opinion
Combination of the two (filtering for expert judgement)
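A minimal sketch of the “deviation from mean value” option for a single column. The threshold of k standard deviations, the function name, and the sample ages are assumptions; the slides do not specify how large a deviation must be.

```python
from statistics import mean, pstdev


def deviation_candidates(column, k=2.0):
    """Indices whose distance to the column mean exceeds k standard deviations."""
    mu, sigma = mean(column), pstdev(column)
    if sigma == 0:
        return []                      # constant column: nothing deviates
    return [i for i, v in enumerate(column) if abs(v - mu) > k * sigma]


ages = [23, 31, 27, 45, 38, 29, 112, 34, 41, 26]   # made-up column
candidates = deviation_candidates(ages)
print(candidates)
```

In the combined variant, a list like `candidates` would be handed to an expert, who judges only the filtered values rather than the whole column.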
13
Develop regression model with “good data”
$A_m = F(A_1, \ldots, A_{m-1})$
i.e. predict the “bad” attribute value from all the other (good) attribute values
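As a sketch of what F could look like, the following fits a linear model on “good” rows and predicts the last attribute for a row whose value is suspect. A linear least-squares fit is swapped in here purely for brevity; the function names and the data are hypothetical.

```python
def fit_linear(rows, targets):
    """Least-squares weights for A_m ~ w0 + w1*A_1 + ... via the normal equations."""
    X = [[1.0, *r] for r in rows]                 # prepend an intercept column
    n = len(X[0])
    # Augmented system (X^T X | X^T y)
    M = [[sum(x[i] * x[j] for x in X) for j in range(n)]
         + [sum(x[i] * t for x, t in zip(X, targets))] for i in range(n)]
    for c in range(n):                            # Gauss-Jordan, partial pivoting
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]             # fails if X^T X is singular
                M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]


def predict(w, row):
    """Evaluate F for one row of the other attributes."""
    return w[0] + sum(wi * a for wi, a in zip(w[1:], row))


# Made-up "good" rows (A_1, A_2) whose A_3 follows 1 + 2*A_1 + 3*A_2
good_rows = [(1, 1), (2, 1), (1, 2), (3, 2), (2, 3)]
good_targets = [6, 8, 9, 13, 14]
w = fit_linear(good_rows, good_targets)
replacement = predict(w, (4, 1))                  # value for a "bad" A_3
print(replacement)
```

Any regressor with a fit and predict step could take the place of `fit_linear`, which is the flexibility the methodology relies on.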
14
Example for proposed methodology: Customer segmentation
15
Clustering
[Figure: clusters]
16
Customer segmentation with clustering
17
Centers of 6 segments
Total database: 200,000 customers; a subset of 320 customers is taken for experimentation
18
Experiment
Take a subset of 320 customers
Change the value of attribute “Income” for 20 customers (10 values below the minimum (0) and 10 values above the maximum (5,000))
Apply the proposed methodology
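The setup of such an experiment can be sketched as below. The incomes are synthetic, the maximum is read as 5000, and how far the altered values lie outside the range is an assumption, since the slides give only the counts. The function name `corrupt_income` is hypothetical.

```python
import random


def corrupt_income(incomes, n_low=10, n_high=10, lo=0.0, hi=5000.0, seed=0):
    """Return a copy with n_low values pushed below lo and n_high above hi."""
    rng = random.Random(seed)
    corrupted = list(incomes)
    picked = rng.sample(range(len(incomes)), n_low + n_high)
    for i in picked[:n_low]:
        corrupted[i] = lo - rng.uniform(1, 1000)   # below minimum (offset assumed)
    for i in picked[n_low:]:
        corrupted[i] = hi + rng.uniform(1, 1000)   # above maximum (offset assumed)
    return corrupted, sorted(picked)


data_rng = random.Random(1)
incomes = [data_rng.uniform(0, 5000) for _ in range(320)]   # synthetic subset
corrupted, bad_idx = corrupt_income(incomes)
print(len(bad_idx))
```

Because `bad_idx` records which customers were altered, the candidates found later can be compared against the known “strange values”.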
19
Step 1: Identify candidates for “bad data”
Identify “deviation” for attribute Income (here: deviation from mean value)
Identified 18 of the 20 “strange values”
20
Step 2: Regression model used: neural network (MLP)
$A_m = F(A_1, \ldots, A_{m-1})$
21
Neural networks
[Figure: natural neuron and artificial neuron (Σ, connections with weights)]
22
Neural networks (Multilayer Perceptron)
[Figure: multilayer perceptron with inputs $A_1, \ldots, A_{m-1}$ and output $A_m$]
23
Results (“replace all”)
24
Evaluation of Results
25
Results (“replace iteratively vs. replace all”)
26
Characteristics of proposed methodology
Identifies candidates for “bad data” per attribute (column) without looking at other attributes
Uses no background knowledge regarding attributes (e.g. that income cannot be negative)
Each step offers opportunities for different methods (here: deviation detection using distance to mean, regression model by neural network)
27
Future work
Apply to larger data sets
Try different techniques for identifying “candidates for bad data”, e.g. by looking at other attributes
Implementation in Matlab