Data Preprocessing Copyright, 1996 © Dale Carnegie & Associates, Inc.


1 Data Preprocessing Copyright, 1996 © Dale Carnegie & Associates, Inc.

2 CSE 591: Data Mining by H. Liu
Data preprocessing
- A necessary step for serious, effective, real-world data mining
- It’s often omitted in “academic” DM, but can’t be over-stressed in practical DM
- The need for pre-processing in DM
  - Data reduction: too much data
  - Data cleaning: noise
  - Data integration and transformation
7/14/2019 CSE 591: Data Mining by H. Liu

3 Data reduction
- Data cube aggregation
- Feature selection (dimensionality reduction)
- Sampling
  - random sampling and others
- Instance selection (search based)
- Data compression
  - PCA, wavelet transformation
- Data discretization
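As a small illustration of the sampling item above, here is a minimal sketch of simple random sampling with and without replacement. The function names and the `seed` parameter are assumptions made for this example; they do not come from the slides.

```python
import random

def sample_without_replacement(rows, n, seed=None):
    """Draw n distinct rows uniformly at random (simple random sampling)."""
    rng = random.Random(seed)  # a local generator keeps the draw reproducible
    return rng.sample(rows, n)

def sample_with_replacement(rows, n, seed=None):
    """Draw n rows uniformly at random; the same row may appear more than once."""
    rng = random.Random(seed)
    return [rng.choice(rows) for _ in range(n)]

# Hypothetical data set: 1000 row identifiers reduced to a 50-row sample.
data = list(range(1000))
reduced = sample_without_replacement(data, 50, seed=0)
resampled = sample_with_replacement(data, 50, seed=0)
print(len(reduced), len(resampled))
```

The "and others" on the slide presumably covers schemes such as stratified sampling, which this sketch does not attempt.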

4 Feature selection
- The basic problem
  - Finding a subset of original features
- The illustration of the difficulty of the problem
- A standard procedure of feature selection
  - Search
  - Evaluation measures on goodness of selected features
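The difficulty is combinatorial: n features give 2^n - 1 non-empty subsets, so exhaustive search quickly becomes infeasible. The search-plus-evaluation procedure can be sketched as greedy forward selection. This is only one possible search strategy, and the consistency score used as the evaluation measure is an assumption chosen for this example; the slides do not say which search or measure the course uses.

```python
from collections import Counter, defaultdict

def consistency(rows, labels, subset):
    """Evaluation measure (assumed for this sketch): fraction of rows whose
    label equals the majority label among rows that agree on `subset`."""
    groups = defaultdict(Counter)
    for row, label in zip(rows, labels):
        key = tuple(row[i] for i in subset)
        groups[key][label] += 1
    agree = sum(max(counts.values()) for counts in groups.values())
    return agree / len(rows)

def forward_select(rows, labels, n_features):
    """Greedy forward search: add the best-scoring feature until no gain."""
    selected = []
    best = consistency(rows, labels, selected)
    while len(selected) < n_features:
        candidates = [f for f in range(n_features) if f not in selected]
        scored = [(consistency(rows, labels, selected + [f]), f) for f in candidates]
        # Ties are broken in favour of the lower feature index.
        score, feature = max(scored, key=lambda t: (t[0], -t[1]))
        if score <= best:
            break  # no candidate improves the measure
        selected.append(feature)
        best = score
    return selected, best

# Toy data: the label simply copies feature 1; features 0 and 2 are noise.
rows = [(0, 0, 1), (1, 0, 0), (0, 1, 1), (1, 1, 0), (0, 0, 0), (1, 1, 1)]
labels = [0, 0, 1, 1, 0, 1]
print(forward_select(rows, labels, 3))  # → ([1], 1.0)
```

Greedy search evaluates on the order of n^2 subsets instead of 2^n, at the price of possibly missing features that are only useful in combination.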

5 Feature extraction
- The basic problem
  - creating new features that are combinations of original features
- A common approach: PCA
- Its variants are used widely in text mining and web mining
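To make "new features that are combinations of original features" concrete, the sketch below finds the first principal component with power iteration on the sample covariance matrix and projects each row onto it. Power iteration stands in here for the full eigendecomposition a library would normally perform; the function name and iteration count are assumptions for this example.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return (direction, projections) for the first principal component."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration: repeatedly multiply by cov and renormalise.
    # Assumes the start vector is not orthogonal to the dominant direction
    # and that the data are not all identical (non-zero covariance).
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # The extracted feature is a linear combination of the original ones.
    projections = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, projections

# Toy data lying close to the line y = x.
points = [(1.0, 1.0), (2.0, 2.1), (3.0, 2.9), (4.0, 4.2), (5.0, 4.8)]
direction, new_feature = first_principal_component(points)
print(direction)    # both weights come out close to 0.7
print(new_feature)  # one value per row replaces the two original features
```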

6 Discretization
- The concept
- The methods
  - Equal-width
  - Equal-frequency
  - Entropy-based
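The first two methods can be sketched in a few lines. Equal-width binning splits the value range into k intervals of the same length; equal-frequency binning puts roughly the same number of values in each bin. The function names are assumptions for this example, and the entropy-based method, which needs class labels, is not covered here.

```python
def equal_width_bins(values, k):
    """Assign each value a bin index 0..k-1 using intervals of equal length."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:
        return [0] * len(values)  # all values identical
    # The maximum value would land in bin k, so clamp it into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_frequency_bins(values, k):
    """Assign bin indices so that each bin holds about len(values)/k values.
    Note: equal values may be split across adjacent bins in this sketch."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins

values = [1, 2, 3, 4, 5, 6, 7, 8, 100]
print(equal_width_bins(values, 3))      # → [0, 0, 0, 0, 0, 0, 0, 0, 2]
print(equal_frequency_bins(values, 3))  # → [0, 0, 0, 1, 1, 1, 2, 2, 2]
```

The example shows why the choice matters: a single outlier (100) pushes almost every value into one equal-width bin, while equal-frequency bins stay balanced.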

7 Data cleaning
- Missing values
  - ignore it
  - fill in manually
  - use a global value/mean/most frequent
- Noise
  - smoothing (binning)
  - outlier removal
- Inconsistency
  - domain knowledge, domain constraints
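A minimal sketch of two of the options above: filling missing values with the mean or the most frequent value, and smoothing noise by bin means. Representing a missing value as `None`, and the function names, are assumptions made for this example.

```python
from collections import Counter
from statistics import mean

def fill_with_mean(values):
    """Replace None with the mean of the known numeric values."""
    known = [v for v in values if v is not None]
    m = mean(known)
    return [m if v is None else v for v in values]

def fill_with_most_frequent(values):
    """Replace None with the most frequent known value (useful for categories)."""
    known = [v for v in values if v is not None]
    top = Counter(known).most_common(1)[0][0]
    return [top if v is None else v for v in values]

def smooth_by_bin_means(values, bin_size):
    """Sort the values, cut them into bins of bin_size, and replace each
    value by the mean of its bin. The result is returned in sorted order."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        chunk = ordered[start:start + bin_size]
        smoothed.extend([mean(chunk)] * len(chunk))
    return smoothed

print(fill_with_mean([4.0, None, 8.0]))                      # → [4.0, 6.0, 8.0]
print(fill_with_most_frequent(["red", None, "blue", "red"]))
print(smooth_by_bin_means([4, 8, 15, 21, 21, 24, 25, 28, 34], 3))
```

In the last call the bins are (4, 8, 15), (21, 21, 24) and (25, 28, 34), so the smoothed values are 9, 22 and 29, each repeated three times.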

8 Data integration
- Data integration combines data from multiple sources into a coherent data store
- Schema integration
  - entity identification problem
- Redundancy
  - an attribute may be derived from another table
  - correlation analysis
- Data value conflicts
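For numeric attributes, correlation analysis is commonly done with the Pearson coefficient: a value near +1 or -1 suggests that one attribute can be derived from the other and is therefore redundant. The sketch below assumes that reading of the slide; the attribute names and data are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists.
    Assumes neither list is constant (otherwise the denominator is zero)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical attributes coming from two different source tables.
price = [10.0, 20.0, 30.0, 40.0]
price_with_tax = [p * 1.2 for p in price]  # derived from price, so redundant
rating = [2.0, 1.0, 2.0, 1.0]              # not a linear function of price

print(pearson(price, price_with_tax))  # very close to 1
print(pearson(price, rating))          # weak negative correlation
```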

9 Data transformation
- Data is transformed or consolidated into forms appropriate for mining
- Methods include
  - smoothing
  - aggregation
  - generalization
  - normalization (min-max)
  - feature construction using neural networks
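Min-max normalization maps a value v linearly from the range [min, max] to a new range [new_min, new_max]: v' = (v - min) / (max - min) * (new_max - new_min) + new_min. A minimal sketch follows; the function name, the default target range [0, 1], and the handling of a constant attribute are assumptions for this example.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so the smallest maps to new_min and the
    largest maps to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant attribute: no spread to rescale, so map everything to new_min.
        return [new_min for _ in values]
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

incomes = [12000, 73600, 98000]  # hypothetical attribute values
print(min_max_normalize(incomes))             # smallest → 0.0, largest → 1.0
print(min_max_normalize(incomes, -1.0, 1.0))  # same data rescaled to [-1, 1]
```

Because the minimum and maximum are taken from the data at hand, a later value outside that range would fall outside the target range as well.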
