Data Mining: Data Quality
Missing values imputation using mean, median, and the k-Nearest Neighbor approach; distance measures

Data Quality
Data quality is a major concern in data mining and knowledge discovery tasks. Why? Almost all data mining algorithms induce knowledge strictly from data, so the quality of the extracted knowledge depends heavily on the quality of the underlying data. There are two main data quality problems:
Missing data: the data is not present.
Noisy data: the data is present but not correct.
Sources of missing/noisy data:
Hardware failure.
Data transmission errors.
Data entry problems.
Refusal of respondents to answer certain questions.

Effect of Noisy Data on Result Accuracy
Suppose we discover only those rules whose support (frequency) is >= 2. From the training data, data mining yields rules such as:
If age <= 30 and income = 'high' then buys_computer = 'yes'
If age > 40 and income = 'medium' then buys_computer = 'no'
Due to missing values in the training dataset, the accuracy of prediction on the testing (actual) data decreases, here to 66.7%.

Imputation of Missing Data (Basic)
Imputation denotes a procedure that replaces the missing values in a dataset with plausible values, i.e., values derived from the relationships among correlated attributes of the dataset.
If we consider only {attribute#2}, the value "cool" appears in 4 records:
Probability of imputing value 20 = 75%
Probability of imputing value 30 = 25%

Imputation of Missing Data (Basic)
For {attribute#4}, the value "true" appears in 3 records:
Probability of imputing value 20 = 50%
Probability of imputing value 10 = 50%
For {attribute#2, attribute#3}, the combination {"cool", "high"} appears in only 2 records:
Probability of imputing value 20 = 100%
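
To make this frequency-based imputation concrete, here is a minimal Python sketch. The records below are hypothetical, invented only to mirror the slide's example; candidate imputation values are weighted by how often they co-occur with the matching attribute value.

from collections import Counter

# Hypothetical records: (attribute#2, attribute#4, value to impute);
# chosen so that "cool" appears in 4 records, 3 of them with value 20
records = [
    ("cool", "true", 20),
    ("cool", "false", 20),
    ("cool", "true", 20),
    ("cool", "false", 30),
    ("mild", "true", 10),
]

# Restrict to records that match on attribute#2 == "cool"
matches = Counter(v for a2, a4, v in records if a2 == "cool")
total = sum(matches.values())
for value, count in matches.most_common():
    print(f"P(impute {value}) = {count / total:.0%}")
# P(impute 20) = 75%, P(impute 30) = 25%, matching the slide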

Measuring the Central Tendency
Mean (algebraic measure): $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
Trimmed mean: the mean computed after chopping off extreme values at both ends.
Median: a holistic measure. It is the middle value if there is an odd number of values, or the average of the middle two values otherwise. For grouped data it can be estimated by interpolation: $\text{median} = L_1 + \left(\frac{n/2 - (\sum f)_l}{f_{\text{median}}}\right) c$, where $L_1$ is the lower boundary of the median interval, $(\sum f)_l$ the sum of frequencies below it, $f_{\text{median}}$ its frequency, and $c$ its width.
Mode: the value that occurs most frequently in the data; distributions can be unimodal, bimodal, or trimodal. Empirical formula for moderately skewed data: $\text{mean} - \text{mode} \approx 3 \times (\text{mean} - \text{median})$.
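
A minimal Python sketch of these measures using the standard library's statistics module; the trimmed-mean helper, the sample values, and the weights are illustrative, not from the slides.

from statistics import mean, median, mode

values = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # illustrative data
weights = [1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1, 1]          # illustrative weights

# Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)
weighted_mean = sum(w * x for w, x in zip(weights, values)) / sum(weights)

def trimmed_mean(xs, k):
    # Chop off the k smallest and k largest values, then average the rest
    s = sorted(xs)
    return mean(s[k:len(s) - k])

print(mean(values))            # 244 / 12, about 20.33
print(weighted_mean)
print(trimmed_mean(values, 2)) # robust against the extreme values
print(median(values))          # average of the two middle values: 22.5
print(mode(values))            # most frequent value: 21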

Symmetric vs. Skewed Data
Median, mean, and mode of symmetric, positively skewed, and negatively skewed data. In a symmetric distribution the three coincide; with positive skew, mode < median < mean; with negative skew, mean < median < mode.

Randomness of Missing Data
Missing-data randomness is divided into three classes:
Missing completely at random (MCAR): the probability of an instance (case) having a missing value for an attribute depends neither on the known attribute values nor on the missing attribute itself.
Missing at random (MAR): the probability of an instance having a missing value for an attribute depends on the known attribute values, but not on the value of the missing attribute itself.
Not missing at random (NMAR): the probability of an instance having a missing value for an attribute may depend on the value of that attribute.

Methods of Treating Missing Data
Ignoring and discarding data. There are two main ways to discard data with missing values:
Discard all records that have missing data (also called discard-case analysis).
Discard only those attributes that have a high level of missing data.
Imputation using mean, median, or mode. One of the most frequently used (statistical) methods: replace missing values of numeric, continuous attributes with the mean or median (the median is robust against noise), and replace missing values of discrete attributes with the mode, as in the sketch below.
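
A minimal pandas sketch of this idea; the DataFrame contents and column names are hypothetical.

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "age":  [25, 30, np.nan, 40, 35],        # numeric, continuous
    "city": ["A", "B", "A", None, "A"],      # discrete / categorical
})

# Median is robust against noise and outliers, so prefer it for numeric columns
df["age"] = df["age"].fillna(df["age"].median())

# For discrete attributes, fill with the mode (most frequent value)
df["city"] = df["city"].fillna(df["city"].mode()[0])

print(df)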

Methods of Treating Missing Data
Replace missing values using a prediction/classification model.
Advantage: it exploits the relationship between the known attribute values and the missing values, so the imputation accuracy can be very high.
Disadvantage: if no correlation exists between the missing attribute values and the known attribute values, the imputation cannot be performed.
Alternative approach: use a hybrid combination of a prediction/classification model and mean/mode imputation. First try to impute the missing value using the prediction/classification model; if that fails, fall back to the mean/median or mode. We will study more about this topic in association rule mining.

Methods of Treating Missing Data
K-Nearest Neighbor (k-NN) approach (best approach): k-NN imputes the missing attribute values on the basis of the K nearest neighbors. Neighbors are determined by a distance measure. Once the K neighbors are found, the missing value is imputed by taking the mean, median, or mode of the neighbors' known values for the missing attribute. Pseudo-code and analysis follow after we study distance measures. (Figure: a record with a missing value alongside the other dataset records.)

Similarity and Dissimilarity
Similarity: a numerical measure of how alike two data objects are. It is higher when objects are more alike, and often falls in the range [0,1].
Dissimilarity: a numerical measure of how different two data objects are. It is lower when objects are more alike; the minimum dissimilarity is often 0, while the upper limit varies.
Proximity refers to either a similarity or a dissimilarity.

Distance Measures
Remember: the K nearest neighbors are determined on the basis of some kind of "distance" between points. There are two major classes of distance measure:
Euclidean: based on the position of points in some k-dimensional space.
Non-Euclidean: not related to position or space.

Scales of Measurement
Applying a distance measure largely depends on the type of input data. Major scales of measurement:
Nominal data (aka nominal scale variables): typically classification data, e.g. m/f. There is no ordering, e.g. it makes no sense to state that M > F. Binary variables are a special case of nominal scale variables.
Ordinal data (aka ordinal scale): ordered, but the differences between values are not important. E.g., political parties on a left-to-right spectrum given labels 0, 1, 2; Likert scales (rank your degree of satisfaction on a scale of 1..5); restaurant ratings.

Scales of Measurement
Applying a distance function largely depends on the type of input data. Major scales of measurement:
Numeric data (aka interval-scaled): ordered, with equal intervals, measured on a linear scale; differences make sense. E.g., temperature (C, F), height, weight, age, date.

Scales of Measurement
Only certain operations can be performed on certain scales of measurement:
Nominal scale: 1. equality, 2. count.
Ordinal scale: additionally, 3. rank (cannot quantify the difference).
Interval scale: additionally, 4. quantify the difference.

Some Euclidean Distances
L2 norm (also common or Euclidean distance), the most common notion of "distance": $d(x,y) = \sqrt{\sum_{i=1}^{k}(x_i - y_i)^2}$
L1 norm (also Manhattan distance), the distance if you had to travel along coordinates only: $d(x,y) = \sum_{i=1}^{k}|x_i - y_i|$

Examples: L1 and L2 norms
x = (5,5); y = (9,8)
L2-norm: dist(x,y) = $\sqrt{4^2 + 3^2}$ = 5
L1-norm: dist(x,y) = 4 + 3 = 7

Another Euclidean Distance
L∞ norm: the maximum of the differences between x and y in any dimension: $d(x,y) = \max_i |x_i - y_i|$
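
A minimal Python sketch of the three norms; the helper names (l1, l2, linf) are our own, and the example points reuse x = (5,5), y = (9,8) from the earlier slide.

import math

def l2(x, y):    # Euclidean distance
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def l1(x, y):    # Manhattan distance
    return sum(abs(a - b) for a, b in zip(x, y))

def linf(x, y):  # L-infinity: maximum per-dimension difference
    return max(abs(a - b) for a, b in zip(x, y))

x, y = (5, 5), (9, 8)
print(l2(x, y), l1(x, y), linf(x, y))  # 5.0, 7, 4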

Non-Euclidean Distances
Jaccard measure: for binary vectors.
Cosine measure: the angle between vectors from the origin to the points in question.
Edit distance: the number of inserts and deletes needed to change one string into another.

Jaccard Measure: a note about binary variables first
Symmetric binary variable: both states are equally valuable and carry the same weight; there is no preference as to which outcome is coded 0 or 1. Example: "gender", with states male and female.
Asymmetric binary variable: the outcomes of the states are not equally important, such as the positive and negative outcomes of a disease test. We code the rarer one as 1 (e.g., HIV positive) and the other as 0 (HIV negative). Given two asymmetric binary variables, the agreement of two 1s (a positive match) is considered more significant than that of two 0s (a negative match).

Jaccard Measure
A contingency table for binary data, where a counts the attributes on which both objects are 1, b those where object i is 1 and object j is 0, c those where i is 0 and j is 1, and d those where both are 0:

            Object j
              1    0
Object i  1   a    b
          0   c    d

Simple matching coefficient (invariant if the binary variable is symmetric): $d(i,j) = \frac{b+c}{a+b+c+d}$
Jaccard coefficient (noninvariant if the binary variable is asymmetric): $d(i,j) = \frac{b+c}{a+b+c}$

Jaccard Measure Example
All attributes are asymmetric binary. Let the values Y and P be coded as 1 and the value N be coded as 0.
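
A minimal sketch of the two coefficients computed from the contingency-table counts; the two example objects are hypothetical binary vectors (Y/P already coded as 1, N as 0).

def binary_dissimilarity(i, j):
    a = sum(1 for u, v in zip(i, j) if u == 1 and v == 1)
    b = sum(1 for u, v in zip(i, j) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(i, j) if u == 0 and v == 1)
    d = sum(1 for u, v in zip(i, j) if u == 0 and v == 0)
    smc = (b + c) / (a + b + c + d)   # symmetric binary variables
    jaccard = (b + c) / (a + b + c)   # asymmetric: ignore 0-0 matches
    return smc, jaccard

obj_i = [1, 0, 1, 0, 0, 0]  # hypothetical object i
obj_j = [1, 0, 1, 0, 1, 0]  # hypothetical object j
print(binary_dissimilarity(obj_i, obj_j))  # (0.1666..., 0.3333...)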

Cosine Measure
Think of a point as a vector from the origin (0,0,…,0) to its location. Two points' vectors make an angle θ, whose cosine is the normalized dot product of the vectors:
dist(p1, p2) = θ = arccos( (p1 · p2) / (|p1| |p2|) )
Example: p1 · p2 = 2 and |p1| = |p2| = √3, so cos(θ) = 2/3; θ is about 48 degrees.
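
A minimal sketch of the cosine angle measure; the example vectors are hypothetical, chosen so that p1 · p2 = 2 and |p1| = |p2| = √3, reproducing the slide's numbers.

import math

def cosine_angle(p1, p2):
    # Angle between the two vectors, derived from the normalized dot product
    dot = sum(a * b for a, b in zip(p1, p2))
    norm1 = math.sqrt(sum(a * a for a in p1))
    norm2 = math.sqrt(sum(b * b for b in p2))
    return math.degrees(math.acos(dot / (norm1 * norm2)))

p1 = (0, 0, 1, 1, 1)  # |p1| = sqrt(3)
p2 = (1, 0, 0, 1, 1)  # |p2| = sqrt(3), p1 . p2 = 2
print(cosine_angle(p1, p2))  # about 48.19 degrees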

Distance for Ordinal Variables
The value of the ordinal variable f for the i-th object is $r_{if}$, where variable f has $M_f$ ordered states: $r_{if} \in \{1, \dots, M_f\}$.
Since each ordinal variable can have a different number of states, map the range of each variable onto [0,1] so that every variable has equal weight. For each value $r_{if}$ of ordinal variable f, replace it by $z_{if} = \frac{r_{if} - 1}{M_f - 1}$.
After calculating $z_{if}$, compute the distance using the Euclidean distance formulas.
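
A minimal sketch of the rank normalization followed by Euclidean distance; the ranks and state counts below are invented for illustration.

import math

def normalize_ordinal(rank, num_states):
    # z_if = (r_if - 1) / (M_f - 1): maps ranks 1..M_f onto [0, 1]
    return (rank - 1) / (num_states - 1)

# Two objects described by two ordinal variables with 3 and 5 states
x_ranks, y_ranks = [1, 4], [3, 2]
states = [3, 5]

zx = [normalize_ordinal(r, m) for r, m in zip(x_ranks, states)]
zy = [normalize_ordinal(r, m) for r, m in zip(y_ranks, states)]
dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(zx, zy)))
print(zx, zy, dist)  # [0.0, 0.75] [1.0, 0.25] ~1.118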

Edit Distance
The edit distance of two strings is the number of inserts and deletes of characters needed to turn one into the other. Equivalently, d(x,y) = |x| + |y| - 2|LCS(x,y)|, where LCS(x,y) is the longest common subsequence: the longest string obtainable by deleting characters from both x and y.

Example
x = abcde; y = bcduve. LCS(x,y) = bcde. d(x,y) = |x| + |y| - 2|LCS(x,y)| = 5 + 6 - 2*4 = 3.
What is left? Normalizing the distance into the range [0,1]; we will study normalization formulas in the next lecture.
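
A minimal sketch of this insert/delete edit distance, computing |LCS| with the classic dynamic program; the function names are our own.

def lcs_length(x, y):
    # Classic dynamic-programming longest common subsequence
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, cx in enumerate(x, 1):
        for j, cy in enumerate(y, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if cx == cy else max(dp[i-1][j], dp[i][j-1])
    return dp[len(x)][len(y)]

def edit_distance(x, y):
    # Inserts/deletes only: d(x, y) = |x| + |y| - 2 * |LCS(x, y)|
    return len(x) + len(y) - 2 * lcs_length(x, y)

print(edit_distance("abcde", "bcduve"))  # 3, as in the example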

Back to k-Nearest Neighbor (Pseudo-code)
Missing values imputation using k-NN.
Input: dataset D, neighborhood size K
for each record x with at least one missing value in D:
    for each data object y in D (y ≠ x):
        compute Distance(x, y)
        save the distance and y in a similarity array S
    sort S in ascending order of distance (nearest, i.e. most similar, first)
    pick the top K data objects from S
    impute the missing attribute value(s) of x on the basis of the known values of those K neighbors (use mean/median or mode)
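
A minimal Python rendering of this pseudo-code, assuming numeric attributes stored as lists with None for missing values, Euclidean distance computed over the dimensions known in both records, and mean imputation; a real implementation would also handle discrete attributes with the mode.

import math
from statistics import mean

def distance(x, y):
    # Euclidean distance over the dimensions that are known in both records
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    return math.sqrt(sum((a - b) ** 2 for a, b in pairs))

def knn_impute(dataset, k):
    for x in dataset:
        missing = [i for i, v in enumerate(x) if v is None]
        if not missing:
            continue
        # Rank candidate neighbors by distance, ascending (nearest first)
        neighbors = sorted((y for y in dataset if y is not x),
                           key=lambda y: distance(x, y))[:k]
        for i in missing:
            known = [y[i] for y in neighbors if y[i] is not None]
            if known:
                x[i] = mean(known)  # mean for numeric data; use mode for discrete
    return dataset

data = [[1.0, 2.0], [2.0, None], [1.5, 2.5], [8.0, 9.0]]
print(knn_impute(data, k=2))  # the None becomes mean(2.5, 2.0) = 2.25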

K-Nearest Neighbor Drawbacks
The major drawbacks of this approach are:
The choice of an appropriate distance function.
Considering all attributes when attempting to retrieve similar examples.
Searching through the entire dataset to find the nearest instances.
Algorithm cost: for each record with missing values, a brute-force search computes the distance to every other record, i.e. O(|D| · d) per record for |D| records with d attributes.

Noisy Data
Noise: random error; the data is present but not correct. Sources:
Data transmission errors.
Data entry problems.
Removing noise:
Data smoothing (rounding, averaging within a window).
Clustering/merging and detecting outliers.
Data smoothing: first sort the data and partition it into (equi-depth) bins; then smooth the values in each bin by bin means, bin medians, bin boundaries, etc.

Noisy Data (Binning Methods)
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34
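
A minimal sketch reproducing these numbers, assuming equi-depth partitioning by position in the sorted list and rounding of bin means; the helper names are our own.

def equi_depth_bins(values, n_bins):
    # Each bin gets the same number of (sorted) values
    s = sorted(values)
    size = len(s) // n_bins
    return [s[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    # Replace every value in a bin by the (rounded) bin mean
    return [[round(sum(b) / len(b)) for _ in b] for b in bins]

def smooth_by_boundaries(bins):
    # Replace each value by the closer of the bin's min and max
    return [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
            for b in bins]

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equi_depth_bins(prices, 3)
print(smooth_by_means(bins))       # [[9,9,9,9], [23,23,23,23], [29,29,29,29]]
print(smooth_by_boundaries(bins))  # [[4,4,4,15], [21,21,25,25], [26,26,26,34]]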

Noisy Data (Clustering)
Outliers may be detected by clustering, where similar values are organized into groups, or "clusters". Values that fall outside the set of clusters may be considered outliers.

References
G. Batista and M. Monard, "A Study of K-Nearest Neighbour as an Imputation Method", 2002. (I will place it in the course folder.)
Jeff D. Ullman, "CS345 Lecture Notes", Stanford University. http://www-db.stanford.edu/~ullman/cs345-notes.html
Vipin Kumar's data mining course offered at the University of Minnesota.
Official textbook slides of Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, August 2000.