Data Preprocessing Copyright, 1996 © Dale Carnegie & Associates, Inc.


1 Data Preprocessing Copyright, 1996 © Dale Carnegie & Associates, Inc.

2 Data preprocessing
CSE 572: Data Mining by H. Liu, 7/19/2019

- A necessary step for serious, effective, real-world data mining
- It’s often omitted in “academic” DM, but can’t be over-stressed in practical DM
- The need for pre-processing in DM
  - Data reduction: too much data
  - Data cleaning: noise
  - Data integration and transformation

3 Data reduction
- Data cube aggregation
- Feature selection and dimensionality reduction
- Sampling
  - random sampling and others
- Instance selection (search based)
- Data compression
  - PCA, wavelet transformation
- Data discretization
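
As a rough illustration of the sampling item, the sketch below draws a simple random sample without replacement and, as one possible reading of "and others", a stratified sample that keeps class proportions. The function names and the toy data are hypothetical, not taken from the slides.

```python
import random

def simple_random_sample(rows, k, seed=0):
    """Draw k rows without replacement (simple random sampling)."""
    rng = random.Random(seed)
    return rng.sample(rows, k)

def stratified_sample(rows, label_of, fraction, seed=0):
    """Sample the same fraction from every class so class proportions are roughly kept."""
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(label_of(row), []).append(row)
    sample = []
    for group in by_label.values():
        k = max(1, round(len(group) * fraction))  # keep at least one row per class
        sample.extend(rng.sample(group, k))
    return sample

# Hypothetical toy data: (id, class) pairs, 90 of class "a" and 10 of class "b".
data = [(i, "a") for i in range(90)] + [(i, "b") for i in range(10)]
srs = simple_random_sample(data, 20)
strat = stratified_sample(data, lambda row: row[1], 0.2)
```

With a rare class, the simple random sample may contain few or no rows of it, while the stratified sample always keeps its share.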

4 Feature selection
- The basic problem
  - Finding a subset of the original features with which the domain can be learned equally well or better
- What are the advantages of doing so?
- Curse of dimensionality
  - From 1-d, 2-d, to 3-d: an illustration
- Another example: the wonders of reducing the number of features, since the # of instances available to learning is dependent on the # of features
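
One common way to read the 1-d, 2-d, 3-d illustration is to count how thinly a fixed data set is spread as features are added. The sketch below assumes each feature is split into 10 bins and that 1000 instances are available; both numbers are hypothetical.

```python
def cells(bins_per_feature, n_features):
    """Number of cells in a grid with the same number of bins on every feature."""
    return bins_per_feature ** n_features

def avg_instances_per_cell(n_instances, bins_per_feature, n_features):
    """Average number of instances that land in each cell."""
    return n_instances / cells(bins_per_feature, n_features)

# Assumed setting: 1000 instances, 10 bins per feature.
for d in (1, 2, 3):
    print(d, cells(10, d), avg_instances_per_cell(1000, 10, d))
# prints:
# 1 10 100.0
# 2 100 10.0
# 3 1000 1.0
```

Each extra feature multiplies the number of cells by 10, so the same data covers the space ten times more sparsely.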

5 Feature selection (continued)
- The illustration of the difficulty of the problem
  - Search space (an example with 4 features)
- Overfitting: are the features selected really good?
  - How do we know?
- A standard procedure of feature selection
  - Search
    - SFS, SBS, Beam Search, Branch&Bound
  - Optimality of a selected set of features
  - Evaluation measures on goodness of selected features
    - Accuracy, distance, inconsistency
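
A minimal sketch of the search-plus-evaluation procedure, assuming sequential forward selection (SFS) as the search and an inconsistency rate as the evaluation measure. The toy data set with 4 binary features and all function names are hypothetical. With 4 features the full search space has 2^4 = 16 subsets, which SFS explores only partially.

```python
from collections import Counter, defaultdict
from itertools import product

def inconsistency_rate(rows, labels, subset):
    """Fraction of instances that disagree with the majority class among
    instances sharing the same values on the chosen features."""
    groups = defaultdict(Counter)
    for row, label in zip(rows, labels):
        groups[tuple(row[i] for i in subset)][label] += 1
    inconsistent = sum(sum(c.values()) - max(c.values()) for c in groups.values())
    return inconsistent / len(rows)

def sequential_forward_selection(rows, labels, n_features):
    """Greedy SFS: start empty, add the feature that lowers the inconsistency
    rate the most, and stop when no remaining feature improves it."""
    selected = []
    best = inconsistency_rate(rows, labels, selected)
    while len(selected) < n_features:
        candidates = [f for f in range(n_features) if f not in selected]
        score, feature = min(
            (inconsistency_rate(rows, labels, selected + [f]), f) for f in candidates
        )
        if score >= best:
            break
        selected.append(feature)
        best = score
    return selected, best

# Hypothetical data: every combination of 4 binary features; the class depends
# only on features 0 and 2, so features 1 and 3 are irrelevant.
rows = list(product([0, 1], repeat=4))
labels = [2 * r[0] + r[2] for r in rows]
search_space_size = 2 ** 4          # all subsets of 4 features
selected, rate = sequential_forward_selection(rows, labels, 4)
print(search_space_size, selected, rate)
```

Because SFS is greedy, it can miss feature sets whose members are only useful together; exhaustive or branch-and-bound search trades time for that guarantee.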

6 Feature extraction
- The basic problem
  - Creating new features that are combinations of original features
- A common approach: PCA
  - Dimensionality reduction via transformation
  - D' = DA, where D is mean-centered (N × n) and A is (n × m), so D' is (N × m)
- Its variants are used widely in text mining and web mining
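
The transformation D' = DA can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the course's implementation: the columns of A are taken to be the leading eigenvectors of the sample covariance matrix, found here by power iteration with deflation, and all function names and the toy data are hypothetical. In practice a linear algebra library would normally do the eigendecomposition.

```python
import math

def mean_center(X):
    """Subtract each column's mean so every feature of D has zero mean."""
    n_rows, n_cols = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n_rows for j in range(n_cols)]
    return [[row[j] - means[j] for j in range(n_cols)] for row in X]

def covariance(D):
    """Sample covariance matrix (n x n) of mean-centered data D (N x n)."""
    n_rows, n_cols = len(D), len(D[0])
    return [[sum(r[i] * r[j] for r in D) / (n_rows - 1) for j in range(n_cols)]
            for i in range(n_cols)]

def top_eigenvector(C, iterations=500):
    """Power iteration for the dominant eigenpair of a symmetric matrix.
    Assumes the start vector is not orthogonal to the dominant eigenvector."""
    n = len(C)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        w = [sum(C[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(C[i][j] * v[j] for j in range(n)) for i in range(n))
    return v, eigenvalue

def pca(X, m):
    """Return (D', A) with D' = D A, D mean-centered (N x n), A (n x m)."""
    D = mean_center(X)
    C = covariance(D)
    n = len(C)
    components = []
    for _ in range(m):
        v, lam = top_eigenvector(C)
        components.append(v)
        # Deflation: remove the direction just found before looking for the next.
        C = [[C[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    A = [[components[k][i] for k in range(m)] for i in range(n)]
    D_prime = [[sum(row[i] * A[i][k] for i in range(n)) for k in range(m)] for row in D]
    return D_prime, A

# Hypothetical data: 4 instances, 2 features lying on the line y = 2x.
X = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
D_prime, A = pca(X, 1)   # reduce from n = 2 features to m = 1
```

Here A comes out proportional to (1, 2), the direction along which the toy data varies, and D' holds one coordinate per instance along that direction.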

7 Discretization
- Motivation from decision tree induction
- The concept of discretization
  - Sort the values of a feature
  - Group continuous values together
  - Reassign values to each group
- The methods
  - Equal-width
  - Equal-frequency
  - Entropy-based
- A possible problem: still too many intervals
  - So, when to stop?
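
The three methods can be sketched as below. The function names and the toy `ages` data are hypothetical. The entropy-based function finds only a single best cut point; a full method would apply it recursively and needs a stopping rule, which is exactly the open question on the slide.

```python
import math
from collections import Counter

def equal_width_edges(values, k):
    """Inner cut points for k intervals of equal width between min and max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(1, k)]

def equal_frequency_edges(values, k):
    """Inner cut points for k intervals holding roughly equal numbers of values."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // k] for i in range(1, k)]

def assign_bin(value, edges):
    """Index of the interval a value falls in (a value equal to an edge goes right)."""
    return sum(value >= e for e in edges)

def entropy(labels):
    """Class entropy in bits."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def best_entropy_cut(values, labels):
    """Single cut point minimizing the weighted class entropy of the two sides."""
    pairs = sorted(zip(values, labels))
    best_cut, best_score = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between identical values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if score < best_score:
            best_cut = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_score = score
    return best_cut, best_score

# Hypothetical feature values and class labels.
ages = [22, 25, 27, 31, 38, 45, 52, 60]
buys = ["no", "no", "no", "no", "yes", "yes", "yes", "yes"]

width_bins = [assign_bin(a, equal_width_edges(ages, 4)) for a in ages]
freq_bins = [assign_bin(a, equal_frequency_edges(ages, 4)) for a in ages]
cut, score = best_entropy_cut(ages, buys)
```

On this data equal-width puts four values in the first interval and one in the second, equal-frequency puts two in each, and the entropy-based cut lands between 31 and 38 where the class changes.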

8 Data cleaning
- Missing values
  - ignore it
  - fill in manually
  - use a global value, the mean, or the most frequent value
- Noise
  - smoothing (binning)
  - outlier removal
- Inconsistency
  - domain knowledge, domain constraints
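
A small sketch of two of the options listed: filling missing values with the mean or the most frequent value, and smoothing noise by bin means. Missing values are represented as `None` here, and the function names and sample values are hypothetical.

```python
from collections import Counter
from statistics import mean

def fill_with_mean(values):
    """Replace missing numeric values (None) with the mean of the known ones."""
    known = [v for v in values if v is not None]
    m = mean(known)
    return [m if v is None else v for v in values]

def fill_with_most_frequent(values):
    """Replace missing values (None) with the most frequent known value."""
    known = [v for v in values if v is not None]
    mode = Counter(known).most_common(1)[0][0]
    return [mode if v is None else v for v in values]

def smooth_by_bin_means(values, bin_size):
    """Sort the values, cut them into bins of bin_size, and replace every value
    by the mean of its bin. The result is in sorted order, not input order."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        chunk = ordered[start:start + bin_size]
        smoothed.extend([mean(chunk)] * len(chunk))
    return smoothed

# Hypothetical sample values.
incomes = fill_with_mean([10, None, 20, 30])
colors = fill_with_most_frequent(["red", None, "blue", "red"])
prices = smooth_by_bin_means([4, 8, 15, 21, 21, 24, 25, 28, 34], 3)
```

Filling with a single global statistic is cheap but shrinks the variance of the feature, which is worth keeping in mind when many values are missing.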

9 Data integration
- Data integration combines data from multiple sources into a coherent data store
- Schema integration
  - entity identification problem
- Redundancy
  - an attribute may be derived from another table
  - correlation analysis
- Data value conflicts
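
Correlation analysis for redundancy can be sketched with the Pearson correlation coefficient, assuming numeric attributes. The 0.95 threshold, the attribute names, and the values are hypothetical choices for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric columns."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a constant column carries no correlation information
    return cov / (sx * sy)

def redundant_pairs(columns, threshold=0.95):
    """List attribute pairs whose absolute correlation reaches the threshold."""
    names = list(columns)
    found = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            r = pearson(columns[names[i]], columns[names[j]])
            if abs(r) >= threshold:
                found.append((names[i], names[j], r))
    return found

# Hypothetical integrated table: annual_salary is derived from monthly_salary.
monthly = [1000, 1500, 2000, 2500]
table = {
    "monthly_salary": monthly,
    "annual_salary": [12 * m for m in monthly],
    "age": [30, 22, 41, 25],
}
found = redundant_pairs(table)
```

Only the two salary attributes are flagged; a high correlation suggests redundancy but does not by itself say which attribute to drop.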

10 Data transformation
- Data is transformed or consolidated into forms appropriate for mining
- Methods include
  - smoothing
  - aggregation
  - generalization
  - normalization (min-max)
  - feature construction using neural networks
- Traditional transformation methods
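
Min-max normalization maps a value v from [min, max] to a new range by v' = (v - min) / (max - min) * (new_max - new_min) + new_min. A minimal sketch follows, with a hypothetical function name and sample values, and an assumed default target range of [0, 1].

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values from [min(values), max(values)] to [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature has no spread; map everything to new_min.
        return [new_min for _ in values]
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

# Hypothetical feature values.
scaled = min_max_normalize([10, 20, 30, 50])
print(scaled)  # prints [0.0, 0.25, 0.5, 1.0]
```

The minimum and maximum should come from the training data; values outside that range seen later will fall outside the target range.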

11 Summary
- Data preprocessing cannot be overstressed in real-world applications
- It is an important, difficult, and low-profile task
- There are different types of approaches for different preprocessing problems
- It should be considered with the mining algorithms to improve data mining effectiveness

