Presentation transcript:

Data Preprocessing: Contents of this Chapter
- Introduction
- Data cleaning
- Data integration
- Data transformation
- Data reduction
Reference: [Han and Kamber 2006, Chapter 2]
SFU, CMPT 741, Fall 2009, Martin Ester

Introduction: Motivation
- Data mining is based on existing databases, different from the typical approach in statistics of collecting data for a specific question.
- Data in the real world is dirty:
  - incomplete: lacking attribute values, or lacking certain attributes of interest
  - noisy: containing errors or outliers
  - inconsistent: containing discrepancies or contradictions
- The quality of data mining results crucially depends on the quality of the input data: garbage in, garbage out!

Introduction: Types of Data Preprocessing
- Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, resolve inconsistencies
- Data integration: integration of multiple databases, data cubes, or files
- Data transformation: normalization, aggregation
- Data reduction: reduce the number of records, attributes, or attribute values

Data Cleaning: Missing Data
- Data is not always available, e.g., many tuples have no recorded value for several attributes, such as customer income in sales data.
- Missing data may be due to:
  - equipment malfunction
  - data inconsistent with other recorded data, and thus deleted
  - data not entered due to misunderstanding
  - certain data not considered important at the time of collection
  - changes of the data format / database contents over time, e.g., with the corresponding enterprise organization

Data Cleaning: Handling Missing Data
- Ignore the record: usually done when the class label is missing
- Fill in the missing value manually: tedious and often infeasible
- Use a default to fill in the missing value: e.g., "unknown", a new class, . . .
- Use the attribute mean to fill in the missing value; for classification: the mean over all records of the same class
- Use the most probable value to fill in the missing value: inference-based, e.g., Bayesian formula or decision tree
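The class-conditional mean strategy above can be sketched in a few lines of Python. This is a minimal illustration, not code from the slides; the records, attribute names, and the convention that `None` marks a missing value are all made up for the example.

```python
def impute_class_mean(records, attr, class_attr):
    """Fill missing values of `attr` (marked as None) with the mean
    over all records of the same class, as described on the slide."""
    sums = {}
    for r in records:
        if r[attr] is not None:
            sums.setdefault(r[class_attr], []).append(r[attr])
    means = {c: sum(v) / len(v) for c, v in sums.items()}
    for r in records:
        if r[attr] is None:
            r[attr] = means[r[class_attr]]
    return records

# Hypothetical sales data with missing customer incomes.
customers = [
    {"income": 30000, "class": "low"},
    {"income": None,  "class": "low"},
    {"income": 34000, "class": "low"},
    {"income": 90000, "class": "high"},
    {"income": None,  "class": "high"},
]
impute_class_mean(customers, "income", "class")
# the missing incomes become 32000.0 (low class) and 90000.0 (high class)
```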

Data Cleaning: Noisy Data
- Noise: random error or variance in a measured attribute
- Noisy attribute values may be due to:
  - faulty data collection instruments
  - data entry problems
  - data transmission problems
  - technology limitations
  - inconsistency in naming conventions

Data Cleaning: Handling Noisy Data
- Binning: sort the data and partition it into (equi-depth) bins; smooth by bin means, bin medians, bin boundaries, etc.
- Regression: smooth by fitting a regression function
- Clustering: detect and remove outliers
- Combined computer and human inspection: detect suspicious values automatically and have a human check them

Data Cleaning: Binning
- Equal-width binning
  - divides the range into N intervals of equal size
  - width of intervals: (max - min) / N
  - simple, but outliers may dominate the result
- Equal-depth binning
  - divides the range into N intervals, each containing approximately the same number of records
  - handles skewed data well

Data Cleaning: Binning for Data Smoothing
Example: sorted price values 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into three (equi-depth) bins
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34
* Smoothing by bin means
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34
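The equi-depth partitioning and smoothing-by-means steps can be sketched as follows; this reproduces the price example from the slide (the function names are ours, and the means are rounded to integers as on the slide).

```python
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]

def equi_depth_bins(values, n_bins):
    """Partition the sorted values into n_bins bins of equal depth."""
    values = sorted(values)
    depth = len(values) // n_bins
    return [values[i:i + depth] for i in range(0, len(values), depth)]

def smooth_by_means(bins):
    """Replace each value by the (rounded) mean of its bin."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

bins = equi_depth_bins(prices, 3)
# bins     -> [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
smoothed = smooth_by_means(bins)
# smoothed -> [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
```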

Data Cleaning: Regression
- Replace noisy or missing values by predicted values
- Requires a model of attribute dependencies (which may be wrong!)
- Can be used for data smoothing or for handling missing data
[Figure: noisy point Y1 smoothed to Y1' on the regression line y = x + 1]

Data Integration: Overview
- Purpose: combine data from multiple sources into a coherent database
- Schema integration: integrate metadata from different sources
  - attribute identification problem: "same" attributes from multiple data sources may have different names
- Instance integration: integrate instances from different sources
  - for the same real-world entity, attribute values from different sources may be different
  - possible reasons: different representations, different styles, different scales, errors

Data Integration: Approach
- Identification
  - detect corresponding tables from different sources: manual
  - detect corresponding attributes from different sources: may use correlation analysis, e.g., A.cust-id ≡ B.cust-#
  - detect duplicate records from different sources: involves approximate matching of attribute values, e.g., 3.14283 ≈ 3.1, Schwartz ≈ Schwarz
- Treatment
  - merge corresponding tables
  - use attribute values as synonyms
  - remove duplicate records
- Note: data warehouses are already integrated

Data Transformation: Overview
- Normalization: to make different records comparable
- Discretization: to allow the application of data mining methods for discrete attribute values
- Attribute/feature construction: new attributes constructed from the given ones (derived attributes)
  - a pattern may only exist for derived attributes, e.g., the change of profit for consecutive years
- Mapping into vector space: to allow the application of standard data mining methods

Data Transformation: Normalization
- min-max normalization: v' = (v - min) / (max - min) * (new_max - new_min) + new_min
- z-score normalization: v' = (v - mean) / standard_deviation
- normalization by decimal scaling: v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
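The three normalization formulas can be sketched directly on a toy list of values; the sample prices below are made up for illustration.

```python
def min_max(values, new_min=0.0, new_max=1.0):
    """min-max normalization to the interval [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
            for v in values]

def z_score(values):
    """z-score normalization: subtract the mean, divide by the std."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

def decimal_scaling(values):
    """Divide by 10^j for the smallest j with max(|v'|) < 1."""
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

prices = [200, 300, 400, 600, 1000]
min_max(prices)          # [0.0, 0.125, 0.25, 0.5, 1.0]
decimal_scaling(prices)  # [0.02, 0.03, 0.04, 0.06, 0.1]
```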

Data Transformation: Discretization
- Three types of attributes:
  - nominal (categorical): values from an unordered set
  - ordinal: values from an ordered set
  - continuous (numerical): real numbers
- Motivation for discretization:
  - some data mining algorithms only accept categorical attributes
  - may improve the understandability of patterns

Data Transformation: Discretization
- Task: reduce the number of values for a given continuous attribute by partitioning the range of the attribute into intervals; interval labels replace the actual attribute values
- Methods: binning, cluster analysis, entropy-based discretization

Data Transformation: Entropy-Based Discretization
- For classification tasks, given a training data set S
- If S is partitioned into two intervals S1 and S2 using boundary T, the entropy after partitioning is
  E(S, T) = (|S1| / |S|) * Ent(S1) + (|S2| / |S|) * Ent(S2)
- Binary discretization: choose the boundary T that minimizes the entropy function over all possible boundaries
- Recursive partitioning of the obtained partitions until some stopping criterion is met, e.g., Ent(S) - E(S, T) < delta
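The binary discretization step can be sketched as follows: try every candidate boundary between consecutive sorted values and keep the one minimizing the weighted entropy E(S, T). The function names and toy data are ours, not from the slides.

```python
import math

def entropy(labels):
    """Class entropy Ent(S) of a list of class labels."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def best_split(pairs):
    """pairs: list of (value, class_label) tuples. Returns the
    boundary T minimizing |S1|/|S| * Ent(S1) + |S2|/|S| * Ent(S2)."""
    pairs = sorted(pairs)
    n = len(pairs)
    best_e, best_t = float("inf"), None
    for i in range(1, n):
        t = (pairs[i - 1][0] + pairs[i][0]) / 2  # candidate boundary
        left = [y for v, y in pairs if v <= t]
        right = [y for v, y in pairs if v > t]
        e = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        if e < best_e:
            best_e, best_t = e, t
    return best_t

data = [(1, "a"), (2, "a"), (3, "a"), (10, "b"), (11, "b")]
best_split(data)  # 6.5 -- the split separating the classes perfectly
```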

Data Transformation: Mapping into Vector Space
- Text documents are not represented as records / tuples
- Choose attributes (relevant terms / dimensions of the vector space)
- Calculate attribute values (term frequencies)
- Map each object to a vector in this space
[Figure: a sample document ("Clustering is one of the generic data mining tasks. One of the most important algorithms . . .") mapped to a vector over the term axes "data", "algorithm", "mining"]

Data Reduction: Motivation
- Improved efficiency: the runtime of data mining algorithms is (super-)linear w.r.t. the number of records and the number of attributes
- Improved quality: removal of noisy attributes improves the quality of the discovered patterns
=> Reduce the number of records and / or the number of attributes; the reduced representation should produce similar results

Data Reduction: Feature Selection
- Goal: select a relevant subset of all attributes (features)
- For classification: select a minimum set of features such that the probability distribution of the different classes given the values of those features is as close as possible to the class distribution given the values of all features
- Problem: 2^d possible subsets of a set of d features
  => need heuristic feature selection methods

Data Reduction: Feature Selection
Feature selection methods:
- Feature independence assumption: choose features independently by their significance
- Greedy bottom-up feature selection: pick the best single feature first, then the next best feature conditioned on the first, . . .
- Greedy top-down feature elimination: repeatedly eliminate the worst feature
- Branch and bound: returns the optimal set of features, but requires a monotone structure of the feature space
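Greedy bottom-up (forward) selection can be sketched generically, parameterized by a scoring function for feature subsets. Everything here is illustrative: in practice `score` would be, e.g., cross-validated classification accuracy, whereas the toy weights below are invented so the example is self-contained.

```python
def forward_select(features, score, k):
    """Greedy bottom-up selection: repeatedly add the feature that
    most improves the subset score, until k features are chosen."""
    selected = []
    while len(selected) < k:
        best = max((f for f in features if f not in selected),
                   key=lambda f: score(selected + [f]))
        selected.append(best)
    return selected

# Toy stand-in score: made-up per-feature usefulness weights.
weights = {"income": 0.5, "age": 0.3, "zip": 0.1}
score = lambda subset: sum(weights[f] for f in subset)
forward_select(list(weights), score, 2)  # ['income', 'age']
```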

Data Reduction: Principal Component Analysis (PCA)
- Task
  - given N data vectors from d-dimensional space, find c orthogonal vectors that can best be used to represent the data
  - data representation by projection onto the c resulting vectors
  - best fit: minimal squared error (difference between original and transformed vectors)
- Properties
  - the resulting c vectors are the directions of maximum variance of the original data
  - these vectors are linear combinations of the original attributes and may be hard to interpret!
  - works for numeric data only

Data Reduction: Example of Principal Component Analysis
[Figure: 2-D data with original axes X1, X2 and principal components Y1, Y2]

Data Reduction: Principal Component Analysis
- Let X represent the training data and w a vector of projection weights (defining the resulting vectors); the squared projection error to be minimized corresponds to maximizing the projected variance w'Vw, where V is the d x d covariance matrix of the training data
- First principal component: eigenvector of the largest eigenvalue of V
- Second principal component: eigenvector of the second largest eigenvalue of V
- . . .
- Choose the first k principal components, or enough principal components so that the resulting error is bounded by some threshold
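The eigen-decomposition view of PCA can be sketched with NumPy. The toy data set below (2-D points stretched along the direction (1, 1)) is made up for illustration; the sketch assumes NumPy is available.

```python
import numpy as np

# Toy 2-D data with most variance along the (1, 1) direction.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1)) * np.array([3.0, 3.0]) \
    + rng.normal(scale=0.3, size=(200, 2))

Xc = X - X.mean(axis=0)               # center the data
V = np.cov(Xc, rowvar=False)          # d x d covariance matrix
eigvals, eigvecs = np.linalg.eigh(V)  # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]     # sort descending
components = eigvecs[:, order]        # columns = principal components

pc1 = components[:, 0]                # direction of maximum variance,
                                      # here roughly (1, 1) / sqrt(2)
projected = Xc @ components[:, :1]    # keep k = 1 dimension
```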

Data Reduction: Sampling
- Task: choose a representative subset of the data records
- Problem: random sampling may overlook small (but important) groups
- Advanced sampling methods:
  - stratified sampling: draw random samples independently from each given stratum (e.g., age group)
  - cluster sampling: draw random samples independently from each given cluster (e.g., customer segment)
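Stratified sampling can be sketched as follows: partition the records by stratum, then draw a simple random sample without replacement from each stratum. The record layout, strata, and sampling fraction are made up for illustration.

```python
import random

def stratified_sample(records, stratum_of, fraction, seed=0):
    """Draw a fraction of each stratum independently (SRSWOR within
    each stratum), preserving the strata proportions overall."""
    rnd = random.Random(seed)
    strata = {}
    for r in records:
        strata.setdefault(stratum_of(r), []).append(r)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rnd.sample(members, k))
    return sample

# Hypothetical customers: 80 young, 20 senior.
people = [{"id": i, "age_group": "young" if i < 80 else "senior"}
          for i in range(100)]
s = stratified_sample(people, lambda r: r["age_group"], 0.1)
# a 10% sample: 8 young and 2 senior records, matching the proportions
```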

Data Reduction: Sampling Examples
[Figure: raw data sampled by SRSWOR (simple random sample without replacement) and SRSWR (simple random sample with replacement)]

Data Reduction: Sampling Examples
[Figure: original data versus a cluster/stratified sample]