A Methodology for Finding Bad Data


A Methodology for Finding Bad Data. Jaime Miranda (1), Richard Weber (1), Derek Partridge (2). (1) Departamento de Ingeniería Industrial, Universidad de Chile; (2) Department of Computer Science, University of Exeter, UK

Outline: Data Mining, introduction; KDD process (Knowledge Discovery in Databases); Methodology for finding and correcting bad data; Application of proposed methodology; Conclusions and future work

Process of knowledge discovery in databases (KDD): Selection → selected data → Pre-processing → pre-processed data → Transformation → transformed data → Data Mining → patterns → Interpretation / Evaluation

Preprocessing: types of data problems, with examples. Missing value (no value). Value out of range (impossible value), e.g. Age = 250. “Bad data” (possible, but strange), e.g. Age = 112, or Age = 81 and student.

Identification of data problems: Missing value (no value): sure. Value out of range (impossible value): sure. “Bad data” (possible, but strange): unsure.

Missing values, example:

Treatment of missing values: (a) do not use the value and flag the field as missing; (b) fill in the missing value (mean value, imputation algorithms).
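
The two treatments can be combined in a few lines. This is a minimal sketch, not the authors' implementation; it assumes missing values are represented as `None`, and the function name `flag_and_fill` and the sample ages are hypothetical.

```python
from statistics import mean

def flag_and_fill(values):
    """Return (filled, was_missing) for a column where None marks a missing value."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("column has no observed values to compute a mean from")
    fill = mean(observed)                       # mean of the observed values only
    was_missing = [v is None for v in values]   # flag field, kept alongside the data
    filled = [fill if v is None else v for v in values]
    return filled, was_missing

# Hypothetical example column
ages = [23, None, 31, 45, None, 21]
filled, was_missing = flag_and_fill(ages)
print(filled)
print(was_missing)
```

Keeping the flag next to the filled column lets later steps tell imputed values from observed ones.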

Value out of range, example:

“Bad data”, example:

Proposed generic methodology to find and correct “bad data”, 1 of 2 (“replace all”): Identify candidates for “bad data” → Develop regression model with “good data” → Replace all “bad data” → STOP

Proposed generic methodology to find and correct “bad data”, 2 of 2 (“replace iteratively”): Identify candidates for “bad data” → Develop regression model with “good data” → “Bad data” remaining? Yes: replace only the “worst data” of the remaining set of “bad data” and repeat. No: STOP.
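
The difference between the two variants can be sketched as follows. Several assumptions are made for illustration only: a single-predictor least-squares line stands in for the regression model, the candidate rows are passed in as a set of indices, the model is refitted after each replacement, and the “worst” value is taken to be the candidate farthest from the mean of the good values. The slides do not specify these details, and all function names are hypothetical.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x with a single predictor."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b

def replace_all(xs, ys, candidates):
    """Fit once on the good rows, then replace every candidate value."""
    ys = list(ys)  # work on a copy
    good = [i for i in range(len(ys)) if i not in candidates]
    a, b = fit_line([xs[i] for i in good], [ys[i] for i in good])
    for i in candidates:
        ys[i] = a + b * xs[i]
    return ys

def replace_iteratively(xs, ys, candidates):
    """Replace only the worst remaining candidate, refit, and repeat."""
    ys = list(ys)  # work on a copy
    remaining = set(candidates)
    while remaining:  # "bad data" remaining?
        good = [i for i in range(len(ys)) if i not in remaining]
        a, b = fit_line([xs[i] for i in good], [ys[i] for i in good])
        centre = mean([ys[i] for i in good])
        # Assumption: "worst" = farthest from the mean of the good values
        worst = max(remaining, key=lambda i: abs(ys[i] - centre))
        ys[worst] = a + b * xs[worst]
        remaining.remove(worst)  # the corrected row now counts as good
    return ys

# Hypothetical data: y = 2x, with rows 2 and 4 corrupted
xs = [1, 2, 3, 4, 5, 6]
ys = [2, 4, 600, 8, -50, 12]
print(replace_all(xs, ys, {2, 4}))
print(replace_iteratively(xs, ys, {2, 4}))
```

In the iterative variant each corrected value joins the training data for the next fit, so later replacements are based on more rows than in “replace all”.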

Identify candidates for “bad data”: analyze each column independently and identify “deviation” from the “norm”, e.g. deviation from the mean value; expert opinion; a combination of the two (filtering for expert judgement); ...

Develop regression model with “good data”: A_m = F(A_1, …, A_{m-1}), i.e. predict the “bad” attribute value based on all the other (good) attribute values.
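
One way to write this step down, as a sketch under assumptions the slides do not state: G denotes the set of rows whose value of A_m is considered good, a_{i,j} is the value of attribute A_j in row i, and squared error is used as the fitting criterion.

```latex
\hat{F} \;=\; \arg\min_{F} \sum_{i \in G}
  \bigl( a_{i,m} - F(a_{i,1}, \dots, a_{i,m-1}) \bigr)^{2},
\qquad
a_{j,m} \;\leftarrow\; \hat{F}(a_{j,1}, \dots, a_{j,m-1})
\quad \text{for each candidate row } j \notin G .
```

The model is fitted only on rows in G, so the suspect values themselves never influence the predictions that replace them.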

Example for proposed methodology: Customer segmentation

Clustering [Figure: illustration of clusters; labels not recoverable from the transcript]

Customer segmentation with clustering

Centers of 6 segments. Total database: 200,000 customers; a subset of 320 customers is taken for experimentation.

Experiment: take a subset of 320 customers and change the value of the attribute “Income” for 20 customers (10 values below the minimum (0) and 10 values above the maximum (5,000)). Apply the proposed methodology.

Step 1: Identify candidates for “bad data”. Identify “deviation” for the attribute Income (here: deviation from the mean value). The method identified 18 of the 20 “strange values”.
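
A deviation-from-mean check of this kind might look like the sketch below. It does not reproduce the experiment: the income values are synthetic, the income range is assumed to be 0 to 5000, and the rule “more than k standard deviations from the mean” with k = 2 is an assumption, since the slides do not state the threshold used.

```python
from statistics import mean, pstdev

def deviation_candidates(values, k=2.0):
    """Indices whose distance from the column mean exceeds k standard deviations."""
    m, s = mean(values), pstdev(values)
    return [i for i, v in enumerate(values) if s > 0 and abs(v - m) > k * s]

# Synthetic stand-in for the 320-customer subset (not the original data):
# 300 plausible incomes spread over [0, 5000], plus 20 corrupted values.
clean = [i * 5000 / 299 for i in range(300)]
too_low = [-20000.0 - 1000 * j for j in range(10)]   # 10 values below the minimum
too_high = [25000.0 + 1000 * j for j in range(10)]   # 10 values above the maximum
income = clean + too_low + too_high
corrupted = set(range(300, 320))

found = deviation_candidates(income)
hits = corrupted.intersection(found)
print(f"flagged {len(found)} rows, {len(hits)} of {len(corrupted)} corrupted rows among them")
```

The corrupted values here lie far outside the valid range, so they are easy to flag. Values only slightly outside the range may stay within k standard deviations of the mean and be missed, because the outliers themselves inflate both the mean and the standard deviation.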

Step 2: Regression model used: neural network (MLP), A_m = F(A_1, …, A_{m-1})

Neural networks: natural neuron vs. artificial neuron. [Figure: an artificial neuron sums (Σ) its inputs over connections with weights]

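Neural networks (Multilayer Perceptron). [Figure: network diagram with inputs A_1, …, A_{m-1}, summation nodes (Σ), and output A_m; layer labels not recoverable from the transcript]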
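
The forward pass of such a network can be sketched as below. The slides do not describe the architecture, so the single hidden layer, the sigmoid activation, and the linear output unit are assumptions, and the weights are hypothetical, untrained numbers chosen only to make the example run.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias):
    """Artificial neuron: weighted sum of the inputs passed through an activation."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)

def mlp_predict(inputs, hidden_layer, output_weights, output_bias):
    """One hidden layer of sigmoid neurons, followed by a linear output for regression."""
    hidden = [neuron(inputs, w, b) for w, b in hidden_layer]
    return sum(w * h for w, h in zip(output_weights, hidden)) + output_bias

# Hypothetical weights: 3 inputs (A_1..A_3), 2 hidden neurons, 1 output (A_4)
hidden_layer = [([0.5, -0.2, 0.1], 0.0), ([-0.3, 0.8, 0.4], 0.1)]
prediction = mlp_predict([0.2, 0.7, 0.5], hidden_layer, [1.5, -0.7], 0.3)
print(prediction)
```

Training would adjust the weights on the “good” rows so that the output approximates A_m; that step is omitted here.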

Results (“replace all”)

Evaluation of Results

Results (“replace iteratively” vs. “replace all”)

Characteristics of proposed methodology: Identifies candidates for “bad data” per attribute (column) without looking at other attributes. Requires no background knowledge regarding attributes (e.g. that income cannot be negative). Each step offers opportunities for different methods (here: deviation detection using distance to mean, regression model by neural network).

Future work: Apply to larger data sets. Try different techniques for identifying “candidates for bad data”, e.g. by looking at other attributes. Implementation in Matlab.