Data Preprocessing
CSE 591: Data Mining by H. Liu (7/14/2019)

Data preprocessing
  A necessary step for serious, effective, real-world data mining
  Often omitted in "academic" DM, but it cannot be overstressed in practical DM
  The need for preprocessing in DM
    Data reduction: too much data
    Data cleaning: noise
    Data integration and transformation

Data reduction
  Data cube aggregation
  Feature selection (dimensionality reduction)
  Sampling
    random sampling and others
  Instance selection (search based)
  Data compression
    PCA, wavelet transformation
  Data discretization
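
As a concrete instance of the sampling item above, the following is a minimal sketch of simple random sampling without replacement. The function name `simple_random_sample`, the `seed` parameter, and the integer stand-in data are assumptions made for illustration; they do not come from the slides.

```python
import random


def simple_random_sample(records, n, seed=None):
    """Draw n records uniformly at random, without replacement.

    `seed` is an illustrative parameter that makes the draw repeatable.
    """
    rng = random.Random(seed)
    return rng.sample(records, n)


# Stand-in for a large data set: 1000 record identifiers.
records = list(range(1000))
sample = simple_random_sample(records, 50, seed=0)
print(len(sample), "of", len(records), "records kept")
```

Any mining step that is too slow on the full data can then be run on `sample` instead, at the cost of some sampling error.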

Feature selection
  The basic problem
    Finding a subset of the original features
  An illustration of the difficulty of the problem
  A standard procedure of feature selection
    Search
    Evaluation measures on the goodness of selected features
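
The search-plus-evaluation procedure can be sketched as below. With n features there are 2^n - 1 non-empty subsets, so exhaustive search quickly becomes infeasible; this sketch uses greedy forward search instead. Both the search strategy and the evaluation measure (an inconsistency rate over discrete features) are assumptions picked for illustration, not necessarily the ones used in the lecture, and the toy data is made up.

```python
from collections import Counter, defaultdict


def inconsistency_rate(rows, labels, features):
    """Evaluation measure (assumed for this sketch): fraction of instances
    whose label differs from the majority label among instances that share
    the same values on the selected features."""
    groups = defaultdict(Counter)
    for row, label in zip(rows, labels):
        key = tuple(row[f] for f in features)
        groups[key][label] += 1
    inconsistent = sum(sum(c.values()) - max(c.values()) for c in groups.values())
    return inconsistent / len(rows)


def forward_select(rows, labels, n_features, tolerance=0.0):
    """Search: greedily add the feature that lowers the measure the most."""
    selected = []
    remaining = list(range(n_features))
    best = inconsistency_rate(rows, labels, selected)
    while remaining and best > tolerance:
        score, feat = min(
            (inconsistency_rate(rows, labels, selected + [f]), f) for f in remaining
        )
        if score >= best:  # no candidate improves the measure; stop
            break
        selected.append(feat)
        remaining.remove(feat)
        best = score
    return selected, best


# Toy data: three binary features; the class label copies feature 1.
rows = [(0, 0, 1), (0, 1, 1), (1, 0, 0), (1, 1, 0), (0, 1, 0), (1, 0, 1)]
labels = [row[1] for row in rows]
selected, score = forward_select(rows, labels, n_features=3)
print(selected, score)
```

Greedy search evaluates on the order of n^2 subsets rather than 2^n - 1, but it can miss features that are only useful in combination.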

Feature extraction
  The basic problem
    Creating new features that are combinations of the original features
  A common approach: PCA
    Its variants are widely used in text mining and web mining
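
To make "new features as combinations of original features" concrete, here is a minimal sketch of PCA that extracts only the first principal component. It uses power iteration on the sample covariance matrix, which is one simple way to obtain the dominant eigenvector and is an assumption of this sketch; a practical implementation would normally call a linear algebra library. The function names and the data are illustrative.

```python
import math


def covariance_matrix(data):
    """Return the sample covariance matrix and the mean-centered data."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in data]
    cov = [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    return cov, centered


def first_principal_component(cov, iterations=200):
    """Power iteration: repeatedly multiply by cov and renormalize.

    Assumes the start vector is not orthogonal to the dominant eigenvector.
    """
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


# Two strongly related original features (made-up values).
data = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.1)]
cov, centered = covariance_matrix(data)
component = first_principal_component(cov)

# The new feature: each instance projected onto the component.
scores = [sum(c * v for c, v in zip(row, component)) for row in centered]
print(component, scores)
```

Each entry of `scores` is a weighted sum of the two original features, so one extracted feature replaces two correlated ones.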

Discretization
  The concept
  The methods
    Equal-width
    Equal-frequency
    Entropy-based
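
The three methods listed above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the entropy-based variant finds only a single best binary cut point (full methods apply such a split recursively with a stopping rule), and the data is made up.

```python
import math
from collections import Counter


def equal_width(values, k):
    """Split the range [min, max] into k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:
        return [0] * len(values)
    # The maximum value would land in bin k, so clamp it into bin k - 1.
    return [min(int((v - lo) / width), k - 1) for v in values]


def equal_frequency(values, k):
    """Give each bin roughly the same number of values (ties may be split)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins


def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_entropy_cut(values, labels):
    """Return the cut point that minimizes the weighted class entropy."""
    pairs = sorted(zip(values, labels))
    best_score, best_cut = None, None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between equal values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if best_score is None or score < best_score:
            best_score = score
            best_cut = (pairs[i - 1][0] + pairs[i][0]) / 2
    return best_cut


values = [1, 2, 3, 4, 10, 20, 30, 100]
labels = ["a", "a", "a", "a", "b", "b", "b", "b"]
width_bins = equal_width(values, 4)      # [0, 0, 0, 0, 0, 0, 1, 3]
freq_bins = equal_frequency(values, 4)   # [0, 0, 1, 1, 2, 2, 3, 3]
cut = best_entropy_cut(values, labels)   # 7.0
```

On skewed data such as this, equal-width binning crowds most values into the first bin, while equal-frequency binning keeps the bins balanced. The entropy-based cut is the only one of the three that uses the class labels.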

Data cleaning
  Missing values
    ignore it
    fill in manually
    use a global value/mean/most frequent
  Noise
    smoothing (binning)
    outlier removal
  Inconsistency
    domain knowledge, domain constraints
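
Two of the cleaning steps above lend themselves to a short sketch: filling missing values with the mean or the most frequent value, and smoothing noise by bin means. The function names, the use of `None` to mark a missing value, and the sample data are assumptions made for illustration.

```python
from collections import Counter


def fill_with_mean(values):
    """Replace missing numeric values (None) with the mean of the known ones."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]


def fill_with_mode(values):
    """Replace missing values (None) with the most frequent known value."""
    known = [v for v in values if v is not None]
    mode = Counter(known).most_common(1)[0][0]
    return [mode if v is None else v for v in values]


def smooth_by_bin_means(values, bin_size):
    """Sort the values, cut them into bins of bin_size, and replace each
    value with the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        mean = sum(bin_values) / len(bin_values)
        smoothed.extend([mean] * len(bin_values))
    return smoothed


ages = fill_with_mean([2.0, None, 4.0, None, 6.0])
# ages == [2.0, 4.0, 4.0, 4.0, 6.0]

colors = fill_with_mode(["red", None, "blue", "red"])
# colors == ["red", "red", "blue", "red"]

prices = smooth_by_bin_means([4, 8, 15, 21, 21, 24, 25, 28, 34], bin_size=3)
# prices == [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```

Note that `smooth_by_bin_means` returns the values in sorted order, not in their original positions.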

Data integration
  Data integration combines data from multiple sources into a coherent data store
  Schema integration
    entity identification problem
  Redundancy
    an attribute may be derived from another table
    correlation analysis
  Data value conflicts
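
Correlation analysis for detecting redundant attributes can be sketched with the Pearson correlation coefficient. The choice of Pearson's r, the function name, and the temperature example are assumptions for illustration; the slides do not say which correlation measure is meant.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two numeric attributes."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


# Two sources store the same temperature in different units.
celsius = [0.0, 10.0, 20.0, 30.0, 40.0]
fahrenheit = [c * 9 / 5 + 32 for c in celsius]
unrelated = [3.0, 1.0, 4.0, 1.0, 5.0]

r_redundant = pearson(celsius, fahrenheit)  # very close to 1
r_unrelated = pearson(celsius, unrelated)   # much weaker
print(r_redundant, r_unrelated)
```

An absolute correlation near 1 suggests that one attribute can be derived from the other, so one of them is a candidate for removal after integration.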

Data transformation
  Data is transformed or consolidated into forms appropriate for mining
  Methods include
    smoothing
    aggregation
    generalization
    normalization (min-max)
    feature construction using neural networks
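
Min-max normalization maps a value v from the range [min, max] to a new range [new_min, new_max] by v' = (v - min) / (max - min) * (new_max - new_min) + new_min. A minimal sketch follows; the function name, the default target range of [0, 1], and the handling of constant attributes are assumptions made for illustration.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values from [min, max] to [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant attribute has no spread; map everything to new_min.
        return [new_min for _ in values]
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


incomes = [10, 20, 30]
scaled = min_max_normalize(incomes)                  # [0.0, 0.5, 1.0]
shifted = min_max_normalize(incomes, -1.0, 1.0)      # [-1.0, 0.0, 1.0]
```

Because the mapping depends on the observed minimum and maximum, a single extreme outlier compresses all other values into a narrow part of the target range.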