Data Preprocessing
Compiled by: Umair Yaqub, Lecturer, Govt. Murray College Sialkot
2 Why Data Preprocessing?
- Data in the real world is dirty:
  - incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  - noisy: containing errors or outliers
  - inconsistent: containing discrepancies in codes or names
- Data may not be normalized
- Data may be huge
- No quality data, no quality mining results!
- Quality decisions must be based on quality data
3 Why Is Data Dirty?
- Incomplete data may come from:
  - attributes of interest not being available, e.g. customer information for sales transaction data
  - data not considered important at the time of entry
  - equipment malfunction
  - data not entered due to misunderstanding
  - data that was inconsistent with other recorded data and was therefore deleted
  - failure to record the history of the data or changes to it
4 Why Is Data Dirty? (contd…)
- Noisy data (incorrect values) may come from:
  - faulty data collection instruments
  - data entry problems
  - data transmission problems
  - technology limitations
  - inconsistent naming conventions
- Inconsistent data may come from:
  - different data sources
- Duplicate records also need data cleaning
5 Major Tasks in Data Preprocessing
- Data cleaning
  - Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies.
- Data integration
  - Integration of multiple databases, data cubes, or files.
  - Attributes representing a given concept may have different names in different databases, causing inconsistencies and redundancies.
  - A large amount of redundant data may slow down or confuse the knowledge discovery process.
6 Major Tasks in Data Preprocessing…
- Data transformation
  - Normalization: distance-based mining algorithms provide better results if the data to be analyzed have been normalized, that is, scaled to a specific range such as [0.0, 1.0].
  - Aggregation: an analysis may need aggregate information, such as the sales per customer region, that is not part of any precomputed data cube in the data warehouse.
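As an illustration of scaling to a range such as [0.0, 1.0], the following is a minimal sketch of min-max normalization, which is one common way to do it. The function name `min_max_normalize` and the income values are assumptions made for this example, not part of the slides.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values into [new_min, new_max] (hypothetical helper)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical: map them to new_min to avoid dividing by zero.
        return [new_min for _ in values]
    return [new_min + (v - lo) / (hi - lo) * (new_max - new_min) for v in values]


# Hypothetical attribute values, e.g. annual incomes.
incomes = [12000, 35000, 73600, 98000]
normalized = min_max_normalize(incomes)
print(normalized)
```

After rescaling, the smallest value maps to 0.0 and the largest to 1.0, so attributes measured on very different scales contribute comparably to a distance computation.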
7 Major Tasks in Data Preprocessing…
- Data reduction
  - Obtains a representation that is reduced in volume but produces the same or similar analytical results.
  - Strategies include:
    - data aggregation (e.g., building a data cube)
    - attribute subset selection (e.g., removing irrelevant attributes through correlation analysis)
    - dimensionality reduction (e.g., using encoding schemes such as minimum length encoding or wavelets)
    - numerosity reduction (e.g., "replacing" the data by alternative, smaller representations such as clusters or parametric models)
  - Data can also be "reduced" by generalization with the use of concept hierarchies, where low-level concepts, such as city for customer location, are replaced with higher-level concepts, such as region or province or state.
- Data discretization
  - Part of data reduction but with particular importance, especially for numerical data.
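Generalization with a concept hierarchy, as described above, can be sketched as a simple lookup from a low-level concept (city) to a higher-level one (province). The mapping `CITY_TO_PROVINCE`, the function `generalize`, and the customer list are hypothetical and only serve the example.

```python
from collections import Counter

# Hypothetical concept hierarchy: city -> province.
CITY_TO_PROVINCE = {
    "Sialkot": "Punjab",
    "Lahore": "Punjab",
    "Karachi": "Sindh",
    "Hyderabad": "Sindh",
}


def generalize(cities, hierarchy):
    """Replace each city by its higher-level concept; unmapped cities become 'Unknown'."""
    return [hierarchy.get(city, "Unknown") for city in cities]


# Hypothetical customer locations.
customer_cities = ["Sialkot", "Karachi", "Lahore", "Sialkot", "Hyderabad"]
province_counts = Counter(generalize(customer_cities, CITY_TO_PROVINCE))
print(province_counts)
```

Five distinct-looking records collapse into two province-level groups, which is the sense in which generalization reduces the data.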
8 Forms of Data Preprocessing
9 Descriptive Data Summarization
- Motivation
  - To better understand the data, get an overall picture, and identify typical properties
10 Measuring the Central Tendency
- Mean
  - Algebraic measure: can be computed by applying an algebraic function to one or more distributive measures (sum()/count())
  - Weighted arithmetic mean: each value is weighted by its significance or frequency
  - Trimmed mean: the mean is sensitive to extreme (outlier) values, so the trimmed mean is computed after chopping off the extreme values
- Median
  - Better measure of center for skewed data
  - Holistic measure: can only be computed on the entire data set
  - Middle value if there is an odd number of values, or the average of the middle two values otherwise
  - Estimated by interpolation (for grouped data)
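The measures on this slide can be compared on a small data set. The sketch below uses Python's `statistics` module for the mean and median. The helpers `weighted_mean` and `trimmed_mean`, the trimming proportion, and the salary values are assumptions chosen for illustration.

```python
import statistics


def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w * x) / sum(w)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the values from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)
    kept = ordered[k:len(ordered) - k]
    if not kept:
        raise ValueError("proportion is too large: no values left")
    return sum(kept) / len(kept)


# Hypothetical salaries in thousands; 110 is an extreme value.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

mean_salary = statistics.mean(salaries)
median_salary = statistics.median(salaries)
trimmed_salary = trimmed_mean(salaries, 0.1)  # drops 30 and 110

print(mean_salary, median_salary, trimmed_salary)
```

The single extreme value pulls the mean above both the median and the trimmed mean, which is why the latter two are preferred for skewed data.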
11 Measuring the Central Tendency (contd…)
- Mode
  - Value that occurs most frequently in the data
  - A data set may be unimodal, bimodal, or trimodal (one, two, or three modes)
  - Empirical formula for unimodal frequency curves that are moderately skewed: mean − mode ≈ 3 × (mean − median)
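A short sketch of finding the mode(s), together with the commonly cited empirical relation mean − mode ≈ 3 × (mean − median) rearranged to estimate the mode. `statistics.multimode` requires Python 3.8 or later. The helper `estimate_mode` and the data are hypothetical, and the relation is only a rough guide for unimodal, moderately skewed data.

```python
import statistics

# Hypothetical salaries in thousands.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# All values that occur most frequently, in order of first appearance.
modes = statistics.multimode(salaries)


def estimate_mode(mean, median):
    """Rearranged empirical relation: mode ~ 3 * median - 2 * mean."""
    return 3 * median - 2 * mean


estimated = estimate_mode(statistics.mean(salaries), statistics.median(salaries))
print(modes, estimated)
```

This data set has two modes (it is bimodal), so the empirical estimate, which assumes a single mode, should not be expected to match either of them.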
12 Measuring the Dispersion of Data
- The degree to which numerical data tend to spread is called the dispersion, or variance, of the data.
- The kth percentile of a set of data in numerical order is the value x_i having the property that k percent of the data entries lie at or below x_i.
- The median (discussed earlier) is the 50th percentile.
13 Measuring the Dispersion of Data
- Degree to which numerical data tend to spread
- Range, quartiles, outliers, and boxplots
  - Range: difference between the largest and smallest values
  - Quartiles: Q1 (25th percentile), Q3 (75th percentile)
  - Inter-quartile range: IQR = Q3 − Q1
  - Outlier: usually, a value falling more than 1.5 × IQR above Q3 or below Q1
  - Five-number summary: min, Q1, M (median), Q3, max
  - Boxplot: the ends of the box are the quartiles, the median is marked within the box, whiskers extend outward from the box, and outliers are plotted individually
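The five-number summary and the 1.5 × IQR rule can be sketched as follows. Quartile conventions differ between tools; this example assumes the "inclusive" method of `statistics.quantiles` (Python 3.8+), and the function names and salary data are hypothetical.

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max)."""
    ordered = sorted(values)
    # Assumption: "inclusive" interpolation; other conventions give slightly
    # different quartiles.
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median, q3, ordered[-1]


def iqr_outliers(values):
    """Values more than 1.5 * IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical salaries in thousands.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
summary = five_number_summary(salaries)
outliers = iqr_outliers(salaries)
print(summary, outliers)
```

These are the same quantities a boxplot draws: the box spans Q1 to Q3, the median is marked inside it, and the flagged values would be plotted as individual points.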
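14 Measuring the Dispersion of Data (contd…)
- Variance and standard deviation
  - Variance (algebraic, scalable computation): $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2 = \frac{1}{N}\left[\sum_{i=1}^{N} x_i^2 - \frac{1}{N}\left(\sum_{i=1}^{N} x_i\right)^2\right]$
  - Standard deviation σ is the square root of the variance σ²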
15 Measuring the Dispersion of Data (contd…)
- The basic properties of the standard deviation, σ, as a measure of spread are:
  - σ measures spread about the mean and should be used only when the mean is chosen as the measure of center.
  - σ = 0 only when there is no spread, that is, when all observations have the same value. Otherwise σ > 0.
- The variance and standard deviation are algebraic measures because they can be computed from distributive measures. That is, N (which is count() in SQL), ∑x_i (which is the sum() of x_i), and ∑x_i² (which is the sum() of x_i²) can be computed in any partition and then merged to feed into the algebraic equation for the variance.
- Thus the computation of the variance and standard deviation is scalable in large databases.
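The scalability argument above can be made concrete: each partition contributes only its count, sum, and sum of squares, and the merged totals are enough to compute the variance. The following is a minimal sketch; the function names and the three partitions are hypothetical.

```python
import math


def partial_stats(partition):
    """Distributive measures of one partition: (count, sum, sum of squares)."""
    return (len(partition), sum(partition), sum(x * x for x in partition))


def merge(a, b):
    """Combine the measures of two partitions by component-wise addition."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def variance_from_stats(stats):
    """Population variance: (1/N) * [sum(x^2) - (1/N) * (sum(x))^2]."""
    n, total, total_sq = stats
    return (total_sq - total * total / n) / n


# Hypothetical data split across three partitions (e.g. three database shards).
partitions = [[30, 36, 47, 50], [52, 52, 56, 60], [63, 70, 70, 110]]

merged = (0, 0, 0)
for part in partitions:
    merged = merge(merged, partial_stats(part))

variance = variance_from_stats(merged)
sigma = math.sqrt(variance)
print(variance, sigma)
```

No partition ever needs to see another partition's rows. One caveat worth keeping in mind is that this sum-of-squares form can lose precision through cancellation when the values are large relative to their spread.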
16 Histogram Analysis
- Graph displays of basic statistical class descriptions
- Frequency histograms
  - A univariate graphical method
  - Consists of a set of rectangles that reflect the counts or frequencies of the classes present in the given data
  - A histogram for an attribute A partitions the data distribution of A into disjoint subsets, or buckets. Typically, the width of each bucket is uniform.
  - Each bucket is represented by a rectangle whose height is equal to the count or relative frequency of the values in the bucket.
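A minimal sketch of building the buckets of an equal-width histogram, assuming numeric values and a chosen number of buckets. The function `equal_width_histogram` and the price list are hypothetical.

```python
def equal_width_histogram(values, num_buckets):
    """Partition the value range into equal-width buckets and count values in each."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_buckets
    counts = [0] * num_buckets
    for v in values:
        if width == 0:
            idx = 0  # all values identical: everything goes in the first bucket
        else:
            # The maximum value is placed in the last bucket.
            idx = min(int((v - lo) / width), num_buckets - 1)
        counts[idx] += 1
    buckets = [(lo + i * width, lo + (i + 1) * width) for i in range(num_buckets)]
    return buckets, counts


# Hypothetical unit prices.
prices = [1, 3, 5, 8, 10, 12, 14, 15, 18, 20, 21, 25, 28, 30]
buckets, counts = equal_width_histogram(prices, 3)
for (start, end), count in zip(buckets, counts):
    print(f"{start:6.2f} to {end:6.2f}: {'#' * count}")
```

Each printed row corresponds to one rectangle of the histogram, with the number of `#` marks standing in for its height.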
17 Histogram Analysis…
18 Quantile Plot
- Displays all of the data (allowing the user to assess both the overall behavior and unusual occurrences)
- Plots quantile information
  - For data x_i sorted in increasing order, f_i indicates that approximately 100 f_i % of the data are below or equal to the value x_i
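The points of a quantile plot can be computed as below. The slide does not give a formula for f_i, so this sketch assumes one common convention, f_i = (i − 0.5) / N for the i-th smallest of N values. The function name and data are hypothetical.

```python
def quantile_plot_points(values):
    """Return (f_i, x_i) pairs with f_i = (i - 0.5) / N (assumed convention)."""
    ordered = sorted(values)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]


# Hypothetical observations.
points = quantile_plot_points([7, 3, 9, 5])
for f, x in points:
    print(f"f = {f:.3f}  x = {x}")
```

Plotting x_i against f_i gives the quantile plot; every observation appears as its own point.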
19 Quantile-Quantile (Q-Q) Plot
- Graphs the quantiles of one univariate distribution against the corresponding quantiles of another
- Allows the user to view whether there is a shift in going from one distribution to another
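A sketch of computing the points of a Q-Q plot by pairing the same quantiles of two samples. It assumes the "inclusive" method of `statistics.quantiles` (Python 3.8+). The function `qq_points` and the two branch price lists are hypothetical.

```python
import statistics


def qq_points(sample_a, sample_b, num_quantiles=4):
    """Pair corresponding quantiles of two samples (assumed 'inclusive' method)."""
    qa = statistics.quantiles(sample_a, n=num_quantiles, method="inclusive")
    qb = statistics.quantiles(sample_b, n=num_quantiles, method="inclusive")
    return list(zip(qa, qb))


# Hypothetical unit prices at two branches; branch 2 is shifted up by 10.
branch_1 = [40, 50, 60, 70, 80]
branch_2 = [50, 60, 70, 80, 90]
points = qq_points(branch_1, branch_2)
print(points)
```

If the two distributions were identical the points would lie on the line y = x. Here every point sits the same distance above that line, which is how a shift between distributions shows up.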
20 Scatter plot
- Provides a first look at bivariate data to see clusters of points, outliers, etc.
- Each pair of values is treated as a pair of coordinates and plotted as a point in the plane
21 Positively and Negatively Correlated Data
22 Not Correlated Data
23 Loess Curve
- Adds a smooth curve to a scatter plot in order to provide better perception of the pattern of dependence
- A loess curve is fitted by setting two parameters: a smoothing parameter, and the degree of the polynomials that are fitted by the regression
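To show how the two parameters enter, here is a simplified sketch of a loess-style fit. It fixes the polynomial degree at 1 (local straight lines), uses `span` as the smoothing parameter (the fraction of points used in each local fit), and assumes tricube weights, which are a common choice. It is an illustration under these assumptions, not a full loess implementation, and the function name and data are hypothetical.

```python
import math


def loess_linear(xs, ys, span=0.5):
    """Locally weighted linear regression, evaluated at every x in xs."""
    n = len(xs)
    k = max(2, math.ceil(span * n))  # number of neighbours in each local fit
    fitted = []
    for x0 in xs:
        # Bandwidth: distance from x0 to its k-th nearest point (x0 itself included).
        h = sorted(abs(x - x0) for x in xs)[k - 1] or 1.0
        # Tricube weights: near points count more; points at distance >= h get 0.
        w = [(1 - min(abs(x - x0) / h, 1.0) ** 3) ** 3 for x in xs]
        sw = sum(w)
        mx = sum(wi * x for wi, x in zip(w, xs)) / sw
        my = sum(wi * y for wi, y in zip(w, ys)) / sw
        sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
        sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
        slope = sxy / sxx if sxx > 0 else 0.0
        # Value of the local weighted least-squares line at x0.
        fitted.append(my + slope * (x0 - mx))
    return fitted


# Hypothetical bivariate data that lie exactly on y = 2x + 1.
xs = list(range(10))
ys = [2 * x + 1 for x in xs]
smooth = loess_linear(xs, ys, span=0.5)
print(smooth)
```

A larger `span` averages over more neighbours and gives a smoother curve, while a smaller one follows local wiggles more closely. On data that already lie on a straight line, the local linear fits reproduce that line.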