Slide 1: Data Transformation and Feature Selection/Extraction
Qiang Yang
Acknowledgements: J. Han, Isabelle Guyon, Martin Bachler
Slide 2: Continuous Attribute "Temperature"
(Figure: example values of the continuous temperature attribute from the weather data.)
Slide 3: Discretization
Three types of attributes:
- Nominal — values from an unordered set. Example: attribute "outlook" from the weather data, with values "sunny", "overcast", and "rainy".
- Ordinal — values from an ordered set. Example: attribute "temperature" in the weather data, with values "hot" > "mild" > "cool".
- Continuous — real numbers.
Discretization: divide the range of a continuous attribute into intervals.
- Some classification algorithms only accept categorical attributes.
- Discretization also reduces data size.
- Supervised (entropy-based) vs. unsupervised (binning).
Slide 4: Simple Discretization Methods: Binning
Equal-width (distance) partitioning:
- Divides the range into N intervals of equal size: a uniform grid.
- If A and B are the lowest and highest values of the attribute, the interval width is W = (B - A)/N.
- The most straightforward method, but outliers may dominate the result; skewed data is not handled well.
Equal-depth (frequency) partitioning:
- Divides the range into N intervals, each containing approximately the same number of samples.
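A minimal sketch of the two binning schemes, assuming NumPy; the temperature values and the number of intervals N are illustrative, not taken from the slides:

# Equal-width vs. equal-depth binning of one continuous attribute.
import numpy as np

values = np.array([64, 65, 68, 69, 70, 71, 72, 75, 80, 81, 83, 85])  # e.g. temperatures
N = 4  # number of intervals

# Equal-width: N intervals of width W = (B - A) / N over [A, B].
A, B = values.min(), values.max()
width_edges = np.linspace(A, B, N + 1)
width_bins = np.digitize(values, width_edges[1:-1])  # bin index per value

# Equal-depth: N intervals containing roughly the same number of samples,
# using quantiles of the data as the cut points.
depth_edges = np.quantile(values, np.linspace(0, 1, N + 1))
depth_bins = np.digitize(values, depth_edges[1:-1])

print("equal-width bins:", width_bins)
print("equal-depth bins:", depth_bins)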
Slide 5: Histograms
- A popular data reduction technique.
- Divide the data into buckets and store the average (or sum) for each bucket.
- Can be constructed optimally in one dimension using dynamic programming.
- Related to quantization problems.
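A minimal sketch of histogram-based reduction, assuming NumPy; the synthetic data and the bucket count are illustrative. Each bucket keeps only its count and mean instead of the raw values:

# Reduce 1,000 raw values to 10 buckets, each summarized by count and mean.
import numpy as np

values = np.random.default_rng(0).normal(loc=50, scale=10, size=1000)
counts, edges = np.histogram(values, bins=10)

bucket_idx = np.digitize(values, edges[1:-1])
bucket_means = np.array([values[bucket_idx == i].mean() if np.any(bucket_idx == i) else np.nan
                         for i in range(10)])
print(counts)
print(bucket_means)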
Slide 6: Supervised Method: Entropy-Based Discretization
- Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the entropy after partitioning is
  E(S, T) = |S1|/|S| * ent(S1) + |S2|/|S| * ent(S2)
- The boundary T that minimizes the entropy function over all possible boundaries is selected as a binary discretization.
- Greedy method: candidate boundaries T are scanned from the smallest to the largest value of attribute A, and the split is applied recursively to the resulting partitions until some stopping criterion is met, e.g., the entropy reduction falls below a user-given threshold.
Slide 7: How to Calculate ent(S)?
Given two classes, Yes and No, in a set S:
- Let p1 be the proportion of Yes examples.
- Let p2 be the proportion of No examples, with p1 + p2 = 1.
- Entropy: ent(S) = -p1*log2(p1) - p2*log2(p2)
- When p1 = 1 and p2 = 0 (or vice versa), ent(S) = 0: the set is pure.
- When p1 = p2 = 0.5, ent(S) reaches its maximum: the set is maximally mixed.
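A minimal sketch of the entropy-based split from slides 6 and 7, assuming NumPy; the temperature/play data are illustrative (loosely modeled on the weather example), and best_boundary is a hypothetical helper name:

# Find the single boundary T that minimizes the weighted entropy E(S, T).
import numpy as np

def ent(labels):
    """ent(S) = -sum_i p_i * log2(p_i) over the class proportions in S."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def best_boundary(values, labels):
    """Return the cut point T minimizing E(S, T) = |S1|/|S|*ent(S1) + |S2|/|S|*ent(S2)."""
    order = np.argsort(values)
    values, labels = np.asarray(values)[order], np.asarray(labels)[order]
    best_T, best_E = None, np.inf
    # Candidate boundaries lie between successive distinct values of the attribute.
    for i in range(1, len(values)):
        if values[i] == values[i - 1]:
            continue
        T = (values[i] + values[i - 1]) / 2
        left, right = labels[:i], labels[i:]
        E = (len(left) * ent(left) + len(right) * ent(right)) / len(labels)
        if E < best_E:
            best_T, best_E = T, E
    return best_T, best_E

temps = [64, 65, 68, 69, 70, 71, 72, 75, 80, 81, 83, 85]
play  = ["Y", "N", "Y", "Y", "Y", "N", "Y", "N", "Y", "Y", "Y", "N"]
print(best_boundary(temps, play))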
Slide 8: Transformation: Normalization
- Min-max normalization: v' = (v - min_A) / (max_A - min_A) * (new_max_A - new_min_A) + new_min_A
- Z-score normalization: v' = (v - mean_A) / std_A
- Normalization by decimal scaling: v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
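A minimal sketch of the three normalizations, assuming NumPy; the value vector and the target range [0, 1] are illustrative:

# Min-max, z-score, and decimal-scaling normalization of one attribute.
import numpy as np

v = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# Min-max normalization to a new range [new_min, new_max].
new_min, new_max = 0.0, 1.0
minmax = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

# Z-score normalization.
zscore = (v - v.mean()) / v.std()

# Decimal scaling: divide by 10^j, j the smallest integer with max(|v'|) < 1.
m = np.abs(v).max()
j = int(np.floor(np.log10(m))) + 1
decimal = v / 10 ** j

print(minmax, zscore, decimal, sep="\n")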
Slide 9: Transforming Ordinal to Boolean
- A simple transformation codes an ordinal attribute with n values using n-1 boolean attributes.
- Example: attribute "temperature"

  Original data    Transformed data
  Temperature      Temperature > cold    Temperature > medium
  Cold             False                 False
  Medium           True                  False
  Hot              True                  True

- How many binary attributes should we introduce for nominal values such as "Red" vs. "Blue" vs. "Green"?
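A minimal sketch of this coding in plain Python; the attribute ordering and the to_boolean helper name are illustrative:

# Code an ordinal attribute with n values as n-1 boolean "greater than" attributes.
order = ["Cold", "Medium", "Hot"]          # ordinal scale, lowest to highest
rank = {v: i for i, v in enumerate(order)}

def to_boolean(value):
    """Return the n-1 indicators (value > Cold, value > Medium)."""
    return tuple(rank[value] > i for i in range(len(order) - 1))

for v in order:
    print(v, to_boolean(v))
# Cold (False, False), Medium (True, False), Hot (True, True)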
Slide 10: Data Sampling
Slide 11: Sampling
- Allows a mining algorithm to run with complexity that is potentially sub-linear in the size of the data.
- Choose a representative subset of the data: simple random sampling may perform very poorly in the presence of skewed (uneven) classes.
- Develop adaptive sampling methods.
- Stratified sampling: approximate the percentage of each class (or subpopulation of interest) in the overall database; used in conjunction with skewed data.
Slide 12: Sampling
- SRSWOR: simple random sample without replacement
- SRSWR: simple random sample with replacement
(Figure: both kinds of sample drawn from the raw data.)
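A minimal sketch of both simple random sampling schemes, assuming NumPy's Generator.choice; the data and sample size are illustrative:

# SRSWOR vs. SRSWR from the same raw data.
import numpy as np

rng = np.random.default_rng(42)
raw = np.arange(100)          # stand-in for the raw data records
n = 10                        # sample size

srswor = rng.choice(raw, size=n, replace=False)  # without replacement
srswr  = rng.choice(raw, size=n, replace=True)   # with replacement
print(srswor)
print(srswr)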
Slide 13: Sampling Example
(Figure: raw data alongside a cluster/stratified sample drawn from it.)
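A minimal sketch of stratified sampling for skewed classes (slides 11 and 13), assuming NumPy; the class labels and sampling fraction are illustrative:

# Sample each class (stratum) separately so class proportions are preserved.
import numpy as np

rng = np.random.default_rng(0)
labels = np.array(["fraud"] * 20 + ["normal"] * 980)   # skewed classes
frac = 0.1                                             # sampling fraction

sample_idx = []
for cls in np.unique(labels):
    idx = np.flatnonzero(labels == cls)
    k = max(1, int(round(frac * len(idx))))            # at least one record per stratum
    sample_idx.extend(rng.choice(idx, size=k, replace=False))

sample_idx = np.array(sample_idx)
print({cls: int((labels[sample_idx] == cls).sum()) for cls in np.unique(labels)})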
Slide 14: Summary
- Data preparation is a big issue for data mining.
- Data preparation includes transformations such as:
  - Data sampling and feature selection
  - Discretization
  - Missing value handling
  - Incorrect value handling
  - Feature selection and feature extraction