Data Preprocessing Compiled By: Umair Yaqub Lecturer Govt. Murray College Sialkot
2 Major Tasks in Data Preprocessing Data cleaning Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies Data integration Integration of multiple databases, data cubes, or files Data transformation Normalization and aggregation Data reduction Obtains reduced representation in volume but produces the same or similar analytical results
3 Data Cleaning Importance “Data cleaning is one of the three biggest problems in data warehousing”—Ralph Kimball “Data cleaning is the number one problem in data warehousing”—DCI survey Data cleaning tasks Fill in missing values Identify outliers and smooth out noisy data Correct inconsistent data Resolve redundancy caused by data integration
4 Data Cleaning – How to Handle Missing Data? Ignore the tuple: usually done when class label is missing (assuming the task is classification) Fill in the missing value manually: tedious + infeasible? Use a global constant to fill in the missing value: e.g., “unknown”, a new class?! Use the attribute mean to fill in the missing value Use the attribute mean for all samples belonging to the same class to fill in the missing value: smarter Use the most probable value to fill in the missing value: inference-based such as regression, Bayesian formula or decision tree
5 Data Cleaning – How to Handle Missing Data? Use the most common value of an attribute to fill in the missing value Use the most common value of an attribute for all samples belonging to the same class to fill in the missing value: smarter
6 Data Cleaning – How to Handle Noisy Data? Binning method smooth a sorted value by consulting its neighborhood (local smoothing) first sort data and partition into bins then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc. Regression smooth by fitting the data into regression functions Clustering detect and remove outliers Combined computer and human inspection detect suspicious values and check by human Many methods for data smoothing are also methods for data reduction
Data Cleaning Tool Categories Data scrubbing tools use simple domain knowledge (e.g., knowledge of postal addresses, and spell-checking) to detect errors and make corrections in the data. These tools rely on parsing and fuzzy matching techniques when cleaning data from multiple sources. Data auditing tools find discrepancies by analyzing the data to discover rules and relationships, and detecting data that violate such conditions. they may employ statistical analysis to find correlations, or clustering to identify outliers. Data migration tools allow simple transformations to be specified, such as to replace the string “gender” by “sex”. 7
Data Cleaning Tools Potter’sWheel, for example, is a publicly available data cleaning tool (see that integrates discrepancy detection and transformation. 8
Data Integration Combines data from multiple sources into a coherent store Issues: Schema integration Entity identification problem: identify real world entities from multiple data sources, e.g., A.cust-id B.cust-# Metadata can be used to avoid errors in schema integration Redundancy The same attribute may have different names in different databases An attribute may be redundant if it can be derived from another table Redundancies may be detected by correlation analysis 9
Schema Integration… Examples of metadata for each attribute include the name, meaning, data type, and range of values permitted for the attribute, and null rules for handling blank, zero, or null values. The metadata may also be used to help transform the data (e.g., where data codes for pay type in one database may be “H” and “S”, and 1 and 2 in another). 10
Redundancies Some redundancies can be detected by correlation analysis. Given two attributes, such analysis can measure how strongly one attribute implies the other, based on the available data. For numerical attributes, we can evaluate the correlation between two attributes, A and B, by computing the correlation coefficient 11
12 Correlation Analysis (Numerical Data) Correlation coefficient (also called Pearson’s product moment coefficient) where n is the number of tuples, and are the respective means of A and B, σ A and σ B are the respective standard deviation of A and B, and Σ(AB) is the sum of the AB cross-product. If r A,B > 0, A and B are positively correlated (A ’ s values increase as B ’ s). The higher, the stronger correlation. r A,B = 0: independent; r A,B < 0: negatively correlated
13 Correlation Analysis (Categorical Data) Χ 2 (chi-square) test The larger the Χ 2 value, the more likely the variables are related The cells that contribute the most to the Χ 2 value are those whose actual count is very different from the expected count Correlation does not imply causality # of hospitals and # of car-theft in a city are correlated Both are causally linked to the third variable: population
14 Data Integration (contd…) Redundancy (contd…) Duplication should also be detected at the tuple level Reasons Denormalized tables Inconsistencies may arise between duplicates Detecting and resolving data value conflicts for the same real world entity, attribute values from different sources are different possible reasons: different representations, different scales, e.g., metric vs. British units