Data Preprocessing Compiled By: Umair Yaqub Lecturer Govt. Murray College Sialkot.

Slides:



Advertisements
Similar presentations
UNIT – 1 Data Preprocessing
Advertisements

UNIT-2 Data Preprocessing LectureTopic ********************************************** Lecture-13Why preprocess the data? Lecture-14Data cleaning Lecture-15Data.
Noise & Data Reduction. Paired Sample t Test Data Transformation - Overview From Covariance Matrix to PCA and Dimension Reduction Fourier Analysis - Spectrum.
DATA PREPROCESSING Why preprocess the data?
1 Copyright by Jiawei Han, modified by Charles Ling for cs411a/538a Data Mining and Data Warehousing v Introduction v Data warehousing and OLAP for data.
Ch2 Data Preprocessing part3 Dr. Bernard Chen Ph.D. University of Central Arkansas Fall 2009.
Lecture-19 ETL Detail: Data Cleansing

Data Mining: Concepts and Techniques
Data Preprocessing.
Data Mining: Concepts and Techniques (3rd ed.) — Chapter 3 —
6/10/2015Data Mining: Concepts and Techniques1 Chapter 2: Data Preprocessing Why preprocess the data? Descriptive data summarization Data cleaning Data.
Chapter 3 Pre-Mining. Content Introduction Proposed New Framework for a Conceptual Data Warehouse Selecting Missing Value Point Estimation Jackknife estimate.
Pre-processing for Data Mining CSE5610 Intelligent Software Systems Semester 1.
Chapter 4 Data Preprocessing
Data Preprocessing.
Peter Brezany and Christian Kloner Institut für Scientific Computing
Major Tasks in Data Preprocessing(Ref Chap 3) By Prof. Muhammad Amir Alam.
Chapter 1 Data Preprocessing
CS2032 DATA WAREHOUSING AND DATA MINING
Ch2 Data Preprocessing part2 Dr. Bernard Chen Ph.D. University of Central Arkansas Fall 2009.
September 5, 2015Data Mining: Concepts and Techniques1 Chapter 2: Data Preprocessing Why preprocess the data? Descriptive data summarization Data cleaning.
Data Preprocessing Pertemuan 03 Matakuliah: M0614 / Data Mining & OLAP Tahun : Feb
D ATA P REPROCESSING 1. C HAPTER 3: D ATA P REPROCESSING Why preprocess the data? Data cleaning Data integration and transformation Data reduction Discretization.
Preprocessing of Lifelog September 18, 2008 Sung-Bae Cho 0.
Outline Introduction Descriptive Data Summarization Data Cleaning Missing value Noise data Data Integration Redundancy Data Transformation.
Descriptive Exploratory Data Analysis III Jagdish S. Gangolly State University of New York at Albany.
Preprocessing for Data Mining Vikram Pudi IIIT Hyderabad.
Data Preprocessing Dr. Bernard Chen Ph.D. University of Central Arkansas Fall 2010.
2015年11月6日星期五 2015年11月6日星期五 2015年11月6日星期五 Data Mining: Concepts and Techniques1 Data Preprocessing — Chapter 2 —
Data Preprocessing Dr. Bernard Chen Ph.D. University of Central Arkansas.
Data Integration Data Integration Data integration: Combines data from multiple sources into a coherent store Challenges: Entity.

12/23/2015Slide 1 The chi-square test of independence is one of the most frequently used hypothesis tests in the social sciences because it can be used.
Data Cleaning Data Cleaning Importance “Data cleaning is one of the three biggest problems in data warehousing”—Ralph Kimball “Data.
2016年1月17日星期日 2016年1月17日星期日 2016年1月17日星期日 Data Mining: Concepts and Techniques1 Data Mining: Concepts and Techniques — Chapter 2 —
Data Mining: Concepts and Techniques — Chapter 2 —
Managing Data for DSS II. Managing Data for DS Data Warehouse Common characteristics : –Database designed to meet analytical tasks comprising of data.
Data Mining and Decision Support
 Data integration: ◦ Combines data from multiple sources into a coherent store  Schema integration: e.g., A.cust-id  B.cust-# ◦ Integrate metadata from.
February 18, 2016Data Mining: Babu Ram Dawadi1 Chapter 3: Data Preprocessing Preprocess Steps Data cleaning Data integration and transformation Data reduction.
Data Preprocessing Compiled By: Umair Yaqub Lecturer Govt. Murray College Sialkot.
Data Preprocessing: Data Reduction Techniques Compiled By: Umair Yaqub Lecturer Govt. Murray College Sialkot.
Waqas Haider Bangyal. Classification Vs Clustering In general, in classification you have a set of predefined classes and want to know which class a new.
Data Mining What is to be done before we get to Data Mining?
Knowledge-based systems Sanaullah Manzoor CS&IT, Lahore Leads University
Data Mining: Data Prepossessing What is to be done before we get to Data Mining?
Pattern Recognition Lecture 20: Data Mining 2 Dr. Richard Spillman Pacific Lutheran University.
Course Outline 1. Pengantar Data Mining 2. Proses Data Mining
Chapter 7. Classification and Prediction
Data Mining: Concepts and Techniques
Data Mining: Concepts and Techniques
Data Mining: Data Preparation
Data Preprocessing CENG 514 June 17, 2018.
Noisy Data Noise: random error or variance in a measured variable.
UNIT-2 Data Preprocessing
Data Mining Waqas Haider Bangyal.
Classification & Prediction
Data Mining: Concepts and Techniques (3rd ed.) — Chapter 3 —
Data Preprocessing Copyright, 1996 © Dale Carnegie & Associates, Inc.
Classification and Prediction
Data Preprocessing Modified from
Chapter 1 Data Preprocessing
Lecture 1: Descriptive Statistics and Exploratory
Data Mining Data Preprocessing
By Sandeep Patil, Department of Computer Engineering, I²IT
Data Preprocessing Copyright, 1996 © Dale Carnegie & Associates, Inc.
Data Preprocessing Copyright, 1996 © Dale Carnegie & Associates, Inc.
Tel Hope Foundation’s International Institute of Information Technology, (I²IT). Tel
Presentation transcript:

Data Preprocessing Compiled By: Umair Yaqub Lecturer Govt. Murray College Sialkot

2 Major Tasks in Data Preprocessing  Data cleaning Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies  Data integration Integration of multiple databases, data cubes, or files  Data transformation Normalization and aggregation  Data reduction Obtains reduced representation in volume but produces the same or similar analytical results

3 Data Cleaning  Importance “Data cleaning is one of the three biggest problems in data warehousing”—Ralph Kimball “Data cleaning is the number one problem in data warehousing”—DCI survey  Data cleaning tasks Fill in missing values Identify outliers and smooth out noisy data Correct inconsistent data Resolve redundancy caused by data integration

4 Data Cleaning – How to Handle Missing Data?  Ignore the tuple: usually done when class label is missing (assuming the task is classification)  Fill in the missing value manually: tedious + infeasible?  Use a global constant to fill in the missing value: e.g., “unknown”, a new class?!  Use the attribute mean to fill in the missing value  Use the attribute mean for all samples belonging to the same class to fill in the missing value: smarter  Use the most probable value to fill in the missing value: inference-based such as regression, Bayesian formula or decision tree

5 Data Cleaning – How to Handle Missing Data?  Use the most common value of an attribute to fill in the missing value  Use the most common value of an attribute for all samples belonging to the same class to fill in the missing value: smarter

6 Data Cleaning – How to Handle Noisy Data?  Binning method smooth a sorted value by consulting its neighborhood (local smoothing) first sort data and partition into bins then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc.  Regression smooth by fitting the data into regression functions  Clustering detect and remove outliers  Combined computer and human inspection detect suspicious values and check by human  Many methods for data smoothing are also methods for data reduction

Data Cleaning Tool Categories  Data scrubbing tools use simple domain knowledge (e.g., knowledge of postal addresses, and spell-checking) to detect errors and make corrections in the data. These tools rely on parsing and fuzzy matching techniques when cleaning data from multiple sources.  Data auditing tools find discrepancies by analyzing the data to discover rules and relationships, and detecting data that violate such conditions. they may employ statistical analysis to find correlations, or clustering to identify outliers.  Data migration tools allow simple transformations to be specified, such as to replace the string “gender” by “sex”. 7

Data Cleaning Tools  Potter’sWheel, for example, is a publicly available data cleaning tool (see that integrates discrepancy detection and transformation. 8

Data Integration  Combines data from multiple sources into a coherent store  Issues: Schema integration  Entity identification problem: identify real world entities from multiple data sources, e.g., A.cust-id  B.cust-#  Metadata can be used to avoid errors in schema integration Redundancy  The same attribute may have different names in different databases  An attribute may be redundant if it can be derived from another table  Redundancies may be detected by correlation analysis 9

Schema Integration…  Examples of metadata for each attribute include the name, meaning, data type, and range of values permitted for the attribute, and null rules for handling blank, zero, or null values.  The metadata may also be used to help transform the data (e.g., where data codes for pay type in one database may be “H” and “S”, and 1 and 2 in another). 10

Redundancies  Some redundancies can be detected by correlation analysis. Given two attributes, such analysis can measure how strongly one attribute implies the other, based on the available data.  For numerical attributes, we can evaluate the correlation between two attributes, A and B, by computing the correlation coefficient 11

12 Correlation Analysis (Numerical Data)  Correlation coefficient (also called Pearson’s product moment coefficient) where n is the number of tuples, and are the respective means of A and B, σ A and σ B are the respective standard deviation of A and B, and Σ(AB) is the sum of the AB cross-product.  If r A,B > 0, A and B are positively correlated (A ’ s values increase as B ’ s). The higher, the stronger correlation.  r A,B = 0: independent; r A,B < 0: negatively correlated

13 Correlation Analysis (Categorical Data)  Χ 2 (chi-square) test  The larger the Χ 2 value, the more likely the variables are related  The cells that contribute the most to the Χ 2 value are those whose actual count is very different from the expected count  Correlation does not imply causality # of hospitals and # of car-theft in a city are correlated Both are causally linked to the third variable: population

14 Data Integration (contd…) Redundancy (contd…)  Duplication should also be detected at the tuple level  Reasons Denormalized tables  Inconsistencies may arise between duplicates Detecting and resolving data value conflicts  for the same real world entity, attribute values from different sources are different  possible reasons: different representations, different scales, e.g., metric vs. British units