Data Preprocessing Copyright, 1996 © Dale Carnegie & Associates, Inc.

CSE 572: Data Mining by H. Liu
Data preprocessing
- A necessary step for serious, effective, real-world data mining
- It's often omitted in "academic" DM, but can't be overstressed in practical DM
- The need for preprocessing in DM
  - Data reduction - too much data
  - Data cleaning - noise
  - Data integration and transformation
7/19/2019 CSE 572: Data Mining by H. Liu

Data reduction
- Data cube aggregation
- Feature selection and dimensionality reduction
- Sampling
  - random sampling and others
- Instance selection (search based)
- Data compression
  - PCA, Wavelet transformation
- Data discretization
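As a rough illustration of the sampling item, here is a minimal Python sketch. The slide only says "random sampling and others"; the stratified variant below is one assumed example of the "others", and the function names and the `label_of` parameter are hypothetical.

```python
import random

def simple_random_sample(rows, k, seed=None):
    """Draw k rows uniformly at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(rows, k)

def stratified_sample(rows, label_of, fraction, seed=None):
    """Sample the same fraction from every class so class proportions survive.

    label_of is a hypothetical callback that returns the class of a row.
    """
    rng = random.Random(seed)
    strata = {}
    for row in rows:
        strata.setdefault(label_of(row), []).append(row)
    sample = []
    for members in strata.values():
        k = max(1, round(fraction * len(members)))  # keep at least one per class
        sample.extend(rng.sample(members, k))
    return sample

# Toy data: 80 rows of class 'a' and 20 rows of class 'b'.
rows = [(i, "a" if i < 80 else "b") for i in range(100)]
subset = simple_random_sample(rows, 10, seed=0)
balanced = stratified_sample(rows, lambda r: r[1], 0.1, seed=0)
```

Simple random sampling can under-represent a rare class by chance; the stratified version avoids that at the cost of needing the class label up front.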

Feature selection
- The basic problem
  - Finding a subset of the original features with which the domain can be learned equally well or better
- What are the advantages of doing so?
- Curse of dimensionality
  - From 1-d, 2-d, to 3-d: an illustration
  - Another example: the wonders of reducing the number of features, since the number of instances needed for learning depends on the number of features
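The slide's own 1-d to 3-d illustration is not preserved in this transcript. One common way to make the point, sketched here under the assumption that each feature is split into 10 equal intervals, is to count how thinly a fixed number of instances is spread as features are added. The function name and the numbers are illustrative only.

```python
def average_instances_per_cell(n_instances, n_features, bins_per_feature=10):
    """Average number of instances per grid cell when every feature is
    cut into bins_per_feature intervals (the grid has bins ** d cells)."""
    cells = bins_per_feature ** n_features
    return n_instances / cells

# The same 1000 instances become sparser with every added dimension.
for d in (1, 2, 3):
    print(d, average_instances_per_cell(1000, d))
```

Under these assumptions the density drops by a factor of 10 per added feature, which is why removing features can help as much as collecting more data.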

- An illustration of the difficulty of the problem
  - Search space (an example with 4 features)
  - Overfitting: are the selected features really good? How do we know?
- A standard procedure of feature selection
  - Search
    - SFS, SBS, Beam Search, Branch & Bound
  - Optimality of a selected set of features
  - Evaluation measures on goodness of selected features
    - Accuracy, distance, inconsistency
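With 4 features there are 2^4 = 16 candidate subsets, and the count doubles with each added feature, so exhaustive search quickly becomes impractical. The sketch below shows one possible pairing of a search strategy with an evaluation measure from the slide: sequential forward selection (SFS) scored by inconsistency rate. The stopping rule (stop when no single feature lowers the rate) and the toy data are assumptions, not the course's reference procedure.

```python
from collections import Counter, defaultdict
from itertools import product

def inconsistency_rate(X, y, features):
    """Fraction of instances that disagree with the majority class among
    instances sharing the same values on the chosen features."""
    groups = defaultdict(Counter)
    for row, label in zip(X, y):
        key = tuple(row[f] for f in features)
        groups[key][label] += 1
    clashes = sum(sum(c.values()) - max(c.values()) for c in groups.values())
    return clashes / len(X)

def sequential_forward_selection(X, y):
    """Greedy SFS: repeatedly add the single feature that lowers the
    inconsistency rate the most; stop when nothing improves."""
    n_features = len(X[0])
    selected = []
    best = inconsistency_rate(X, y, selected)
    while len(selected) < n_features:
        candidates = [f for f in range(n_features) if f not in selected]
        scores = {f: inconsistency_rate(X, y, selected + [f]) for f in candidates}
        f_best = min(scores, key=scores.get)
        if scores[f_best] >= best:
            break
        selected.append(f_best)
        best = scores[f_best]
    return selected, best

# Toy data: all 16 combinations of 4 binary features; the class depends
# only on features 0 and 2, so features 1 and 3 are irrelevant.
X = [list(bits) for bits in product([0, 1], repeat=4)]
y = [2 * row[0] + row[2] for row in X]
print(sequential_forward_selection(X, y))
```

Being greedy, SFS evaluates far fewer subsets than exhaustive search but can miss features that are only useful in combination (an XOR-like class would defeat the stopping rule used here), which is one reason the slide lists several search strategies.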

Feature extraction
- The basic problem
  - Creating new features that are combinations of the original features
- A common approach: PCA
  - Dimensionality reduction via transformation
  - D' = DA, where D is the mean-centered data (N × n) and A is the transformation matrix (n × m), so D' is N × m
  - Its variants are used widely in text mining and web mining
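The projection D' = DA can be sketched with NumPy as follows. Building A from the top-m eigenvectors of the covariance matrix is the standard PCA construction, but the function name and the random example data are assumptions made for illustration.

```python
import numpy as np

def pca_project(data, m):
    """Project N x n data onto its m leading principal components.

    Returns (D_new, A), where D_new = D @ A is N x m and A is n x m.
    """
    D = data - data.mean(axis=0)            # mean-center each feature
    cov = np.cov(D, rowvar=False)           # n x n covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:m]   # indices of the m largest
    A = eigvecs[:, order]                   # columns are principal directions
    return D @ A, A

rng = np.random.default_rng(0)
data = rng.normal(size=(50, 3))             # N = 50 instances, n = 3 features
D_new, A = pca_project(data, m=2)
print(D_new.shape, A.shape)
```

`np.linalg.eigh` is used because the covariance matrix is symmetric; the columns of A are orthonormal, so the new features are uncorrelated linear combinations of the original ones.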

Discretization
- Motivation from Decision Tree Induction
- The concept of discretization
  - Sort the values of a feature
  - Group continuous values together
  - Reassign values to each group
- The methods
  - Equal-width
  - Equal-frequency
  - Entropy-based
- A possible problem: still too many intervals
  - So, when to stop?
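The two unsupervised methods can be sketched in a few lines of Python. The function names and the example ages are illustrative assumptions; the entropy-based method is not covered here because it also needs class labels and a stopping criterion.

```python
def equal_width_bins(values, k):
    """Assign each value to one of k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:
        return [0] * len(values)  # all values identical: a single bin
    # The maximum would fall just past the last interval, so clamp it.
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_frequency_bins(values, k):
    """Assign each value to one of k bins holding roughly equal counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins

ages = [23, 25, 31, 35, 42, 47, 58, 64, 90]
print(equal_width_bins(ages, 3))      # the single large value stretches the range
print(equal_frequency_bins(ages, 3))  # three values per bin
```

Equal-width binning is sensitive to outliers (one large value leaves most instances in the first interval), while this simple equal-frequency version may place tied values in different bins.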

Data cleaning
- Missing values
  - ignore them
  - fill in manually
  - use a global value, the mean, or the most frequent value
- Noise
  - smoothing (binning)
  - outlier removal
- Inconsistency
  - domain knowledge, domain constraints
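Two of the listed techniques, mean imputation and smoothing by binning, can be sketched as below. Representing a missing value as `None` and smoothing with bin means over equal-size bins are assumptions chosen for the example; other conventions are equally valid.

```python
from statistics import mean

def fill_missing_with_mean(values):
    """Replace None entries with the mean of the observed values."""
    known = [v for v in values if v is not None]
    m = mean(known)
    return [m if v is None else v for v in values]

def smooth_by_bin_means(values, bin_size):
    """Sort the values, cut them into bins of bin_size consecutive values,
    and replace each value with the mean of its bin (original order kept)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    smoothed = list(values)
    for start in range(0, len(order), bin_size):
        idx = order[start:start + bin_size]
        m = mean(values[i] for i in idx)
        for i in idx:
            smoothed[i] = m
    return smoothed

print(fill_missing_with_mean([1.0, None, 3.0]))
print(smooth_by_bin_means([4, 8, 15, 21, 21, 24, 25, 28, 34], 3))
```

Mean imputation keeps every instance but shrinks the feature's variance, so it is a trade-off rather than a free fix.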

Data integration
- Data integration combines data from multiple sources into a coherent data store
- Schema integration
  - entity identification problem
- Redundancy
  - an attribute may be derived from another table
  - correlation analysis
- Data value conflicts
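Correlation analysis for spotting redundant numeric attributes might look like the following sketch. It uses the Pearson coefficient; the 0.95 threshold, the function names, and the column names are assumptions made for the example.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists.
    Assumes neither list is constant (otherwise the denominator is zero)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def redundant_pairs(columns, threshold=0.95):
    """Return attribute pairs whose absolute correlation reaches threshold."""
    names = list(columns)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if abs(pearson(columns[a], columns[b])) >= threshold
    ]

# height_in is height_cm in other units (rounded), so it carries no new information.
columns = {
    "height_cm": [150, 160, 170, 180],
    "height_in": [59.1, 63.0, 66.9, 70.9],
    "weight_kg": [70, 52, 80, 61],
}
print(redundant_pairs(columns))
```

A high correlation only flags a candidate; whether one attribute is truly derivable from the other still needs a check against the schema or domain knowledge.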

Data transformation
- Data is transformed or consolidated into forms appropriate for mining
- Methods include
  - smoothing
  - aggregation
  - generalization
  - normalization (min-max)
  - feature construction
    - using neural networks
    - traditional transformation methods
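Min-max normalization rescales a feature linearly so that its minimum maps to a new lower bound and its maximum to a new upper bound. A minimal sketch, assuming a default target range of [0, 1] and hypothetical example values:

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so min(values) -> new_min and max(values) -> new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot normalize a constant feature")
    return [new_min + (v - lo) / (hi - lo) * (new_max - new_min) for v in values]

incomes = [12000, 73600, 98000]
print(min_max_normalize(incomes))
```

Because the bounds come from the observed data, a later value outside the original range will fall outside the target range, and a single outlier compresses all other values toward one end.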

Summary
- Data preprocessing cannot be overstressed in real-world applications
- It is an important, difficult, and low-profile task
- There are different types of approaches for different preprocessing problems
- It should be considered together with the mining algorithms to improve data mining effectiveness
