Published by Horace Reeves. Modified over 9 years ago.

1 Data preparation: Selection, Preprocessing, and Transformation
Literature: I.H. Witten and E. Frank, Data Mining, chapter 2 and chapter 7

2 Fayyad’s KDD Methodology
Data → [Selection] → Target data → [Preprocessing & cleaning] → Processed data → [Transformation & feature selection] → Transformed data → [Data Mining] → Patterns → [Interpretation / Evaluation] → Knowledge

3 Contents
Data Selection
Data Preprocessing
Data Transformation

4 Data Selection
Goal: understanding the data
Explore the data:
    possible attributes
    their values
    distribution, outliers

5 Getting to know the data
Simple visualization tools are very useful for identifying problems
    Nominal attributes: histograms (Distribution consistent with background knowledge?)
    Numeric attributes: graphs (Any obvious outliers?)
2-D and 3-D visualizations show dependencies
Domain experts need to be consulted
Too much data to inspect? Take a sample!

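A first look at the data can be sketched in a few lines. The helper names `nominal_histogram` and `sample_rows` are illustrative assumptions, not part of the slides or of any particular tool; the sketch only counts nominal values and draws a reproducible random sample for manual inspection.

```python
import random
from collections import Counter

def nominal_histogram(values):
    """Count how often each nominal value occurs."""
    return Counter(values)

def sample_rows(rows, n, seed=0):
    """Draw a reproducible random sample of at most n rows."""
    rng = random.Random(seed)
    return rng.sample(rows, min(n, len(rows)))

outlook = ["sunny", "sunny", "overcast", "rainy", "rainy", "rainy"]
counts = nominal_histogram(outlook)
for value, count in counts.most_common():
    # Text histogram: one '#' per occurrence.
    print(f"{value:10s} {'#' * count}")

subset = sample_rows(list(range(1000)), 10)
```

A fixed seed keeps the sample stable between runs, which makes it easier to discuss specific rows with a domain expert.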
6 Data preprocessing
Problem: different data sources (e.g. sales department, customer billing department, …)
Differences: styles of record keeping, conventions, time periods, data aggregation, primary keys, errors
Data must be assembled, integrated, cleaned up
“Data warehouse”: consistent point of access
External data may be required (“overlay data”)
Critical: type and level of data aggregation

7 Data Preprocessing
Choose data structure (table, tree or set of tables)
Choose attributes with enough information
Decide on a first representation of the attributes (numeric or nominal)
Decide on missing values
Decide on inaccurate data (cleansing)

8 Attribute types used in practice
Most schemes accommodate just two levels of measurement: nominal and ordinal
Nominal attributes are also called “categorical”, “enumerated”, or “discrete”
    But: “enumerated” and “discrete” imply order
Special case: dichotomy (“boolean” attribute)
Ordinal attributes are called “numeric”, or “continuous”
    But: “continuous” implies mathematical continuity

9 The ARFF format

% ARFF file for weather data with some numeric features
%
@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {true, false}
@attribute play? {yes, no}

@data
sunny, 85, 85, false, no
sunny, 80, 90, true, no
overcast, 83, 86, false, yes
...

10 Attribute types
ARFF supports numeric and nominal attributes
Interpretation depends on learning scheme
    Numeric attributes are interpreted as
        ordinal scales if less-than and greater-than are used
        ratio scales if distance calculations are performed (normalization/standardization may be required)
    Instance-based schemes define distance between nominal values (0 if values are equal, 1 otherwise)
Integers: nominal, ordinal, or ratio scale?

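The distance treatment described above can be sketched as follows. The function names and the choice of min-max normalization with a Euclidean sum are assumptions made for illustration; the slides only say that normalization may be required and that nominal values contribute 0 or 1.

```python
import math

def min_max_normalize(column):
    """Rescale a list of numbers to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]

def mixed_distance(a, b, numeric_idx):
    """Euclidean-style distance over mixed attributes.

    Numeric attributes (positions in numeric_idx) contribute their squared
    difference; nominal attributes contribute 0 if equal and 1 otherwise.
    """
    total = 0.0
    for i, (x, y) in enumerate(zip(a, b)):
        if i in numeric_idx:
            total += (x - y) ** 2
        else:
            total += 0.0 if x == y else 1.0
    return math.sqrt(total)

# Temperatures of the first three weather instances, rescaled to [0, 1].
temps = min_max_normalize([85, 80, 83])
first = ("sunny", temps[0], False)
third = ("overcast", temps[2], False)
d = mixed_distance(first, third, numeric_idx={1})
```

Without the rescaling step, a numeric attribute with a large range would dominate the 0/1 contributions of the nominal attributes.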
11 Nominal vs. ordinal
Attribute “age” nominal
    If age = young and astigmatic = no and tear production rate = normal then recommendation = soft
    If age = pre-presbyopic and astigmatic = no and tear production rate = normal then recommendation = soft
Attribute “age” ordinal (e.g. “young” < “pre-presbyopic” < “presbyopic”)
    If age ≤ pre-presbyopic and astigmatic = no and tear production rate = normal then recommendation = soft

12 Missing values
Frequently indicated by out-of-range entries
    Types: unknown, unrecorded, irrelevant
    Reasons: malfunctioning equipment, changes in experimental design, collation of different datasets, measurement not possible
Missing value may have significance in itself (e.g. missing test in a medical examination)
    Most schemes assume that is not the case, so “missing” may need to be coded as an additional value

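Coding “missing” as a value of its own might look like the sketch below. The marker set, the label `"missing"`, and both function names are assumptions for illustration; real datasets need their own list of placeholder entries.

```python
def code_missing(values, markers=("?", "", None), label="missing"):
    """Replace placeholder entries in a nominal column with an explicit value."""
    return [label if v in markers else v for v in values]

def mask_out_of_range(values, low, high):
    """Replace numeric entries outside [low, high] with None (treated as missing)."""
    return [v if low <= v <= high else None for v in values]

outlook = code_missing(["sunny", "?", None, "rainy"])
ages = mask_out_of_range([34, -1, 51, 999], low=0, high=120)
```

After `code_missing`, a learner sees “missing” as one more nominal value and can pick up on cases where the absence itself carries information.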
13 Inaccurate values
Reason: data was not collected with mining in mind
Result: errors and omissions that don’t affect original purpose of data (e.g. age of customer)
Typographical errors in nominal attributes: values need to be checked for consistency
Typographical and measurement errors in numeric attributes: outliers need to be identified
Errors may be deliberate (e.g. wrong zip codes)
Other problems: duplicates, stale data

14 Transformation: attribute selection
Adding a random (i.e. irrelevant) attribute can significantly degrade C4.5’s performance
    Problem: attribute selection based on smaller and smaller amounts of data
IBL is also very susceptible to irrelevant attributes
    Number of training instances required increases exponentially with number of irrelevant attributes
Naïve Bayes doesn’t have this problem.
Relevant attributes can also be harmful

15 Scheme-independent selection
Filter approach: assessment based on general characteristics of the data
One method: find subset of attributes that is enough to separate all the instances
Another method: use different learning scheme (e.g. C4.5, 1R) to select attributes
IBL-based attribute weighting techniques can also be used (but can’t find redundant attributes)
CFS: uses correlation-based evaluation of subsets

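The first filter method above can be sketched as an exhaustive search. The reading used here is an assumption: a subset “separates” the instances if no two instances agree on all attributes in the subset while having different classes. The function names are illustrative, and the search is exponential in the number of attributes, so this is only practical for small attribute sets.

```python
from itertools import combinations

def separates(instances, labels, subset):
    """True if no two instances agree on `subset` but differ in class."""
    seen = {}
    for inst, label in zip(instances, labels):
        key = tuple(inst[i] for i in subset)
        # setdefault stores the first class seen for this key.
        if seen.setdefault(key, label) != label:
            return False
    return True

def smallest_separating_subset(instances, labels):
    """Return the first smallest attribute-index subset that separates the data."""
    n_attrs = len(instances[0])
    for size in range(n_attrs + 1):
        for subset in combinations(range(n_attrs), size):
            if separates(instances, labels, subset):
                return subset
    return None  # contradictory data: identical instances with different classes

instances = [("sunny", "hot"), ("sunny", "cool"), ("rainy", "hot"), ("rainy", "cool")]
labels = ["no", "yes", "no", "yes"]
subset = smallest_separating_subset(instances, labels)
```

In this toy data the class depends only on the second attribute, so the search returns the index tuple `(1,)`.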
16 Attribute subsets for weather data (figure)

17 Searching the attribute space
Number of possible attribute subsets is exponential in the number of attributes
Common greedy approaches: forward selection and backward elimination
More sophisticated strategies:
    Bidirectional search
    Best-first search: can find the optimum solution
    Beam search: approximation to best-first search
    Genetic algorithms

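Greedy forward selection can be sketched as below. The `evaluate` callback is a stand-in for whatever subset score is used (for instance a cross-validation estimate); the toy scoring function and its numbers are invented purely to make the example runnable.

```python
def forward_selection(attributes, evaluate):
    """Greedy forward selection.

    Starts from the empty subset and repeatedly adds the single attribute
    that improves `evaluate(subset)` most; stops when nothing improves.
    """
    selected = []
    best_score = evaluate(selected)
    remaining = list(attributes)
    while remaining:
        scored = [(evaluate(selected + [a]), a) for a in remaining]
        score, attr = max(scored, key=lambda t: t[0])
        if score <= best_score:
            break  # no single addition helps any more
        selected.append(attr)
        remaining.remove(attr)
        best_score = score
    return selected, best_score

# Hypothetical score: useful attributes add value, every attribute costs 0.05.
USEFUL = {"outlook": 0.3, "humidity": 0.2}

def toy_score(subset):
    return sum(USEFUL.get(a, 0.0) for a in subset) - 0.05 * len(subset)

chosen, score = forward_selection(
    ["outlook", "temperature", "humidity", "windy"], toy_score
)
```

Backward elimination follows the same pattern, starting from the full set and removing one attribute per step.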
18 Scheme-specific selection
Wrapper approach: attribute selection implemented as wrapper around learning scheme
Evaluation criterion: cross-validation performance
Time consuming: adds factor k² even for greedy approaches with k attributes
    Linearity in k requires prior ranking of attributes
Scheme-specific attribute selection essential for learning decision tables
Can be done efficiently for decision tables and Naïve Bayes

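One way to see where the k² factor comes from, assuming plain forward selection that is allowed to run until all k attributes have been added: in step i (counting from 0) there are k − i candidate attributes, and each candidate costs one full cross-validation run.

```latex
\sum_{i=0}^{k-1} (k - i) \;=\; k + (k-1) + \dots + 1 \;=\; \frac{k(k+1)}{2} \;=\; O(k^2)
```

With a prior ranking of the attributes, only one candidate has to be evaluated per step, which gives the linear behaviour mentioned above.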
19 Discretizing numeric attributes
Can be used to avoid making normality assumption in Naïve Bayes and Clustering
Simple discretization scheme is used in 1R
C4.5 performs local discretization
Global discretization can be advantageous because it’s based on more data
    Learner can be applied to the discretized attribute, or
    it can be applied to binary attributes coding the cut points in the discretized attribute

20 Unsupervised discretization
Unsupervised discretization generates intervals without looking at class labels
    The only option when clustering, since no class labels are available
Two main strategies:
    Equal-interval binning
    Equal-frequency binning (also called histogram equalization)
Inferior to supervised schemes in classification tasks

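The two binning strategies can be sketched as follows. Function names are illustrative, and the equal-frequency variant assigns bins purely by rank, which is a simplifying assumption: tied values may end up in different bins.

```python
def equal_interval_bins(values, n_bins):
    """Assign each value to one of n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0] * len(values)
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def equal_frequency_bins(values, n_bins):
    """Assign bins so that each holds (roughly) the same number of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins

values = [1, 2, 3, 4, 5, 100]
by_width = equal_interval_bins(values, 3)
by_count = equal_frequency_bins(values, 3)
```

The single outlier at 100 shows the difference: equal-interval binning puts the five small values into one bin and leaves the middle bin empty, while equal-frequency binning spreads the values evenly across the three bins.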
21 Entropy-based discretization
Supervised method that builds a decision tree with pre-pruning on the attribute being discretized
Entropy used as splitting criterion
MDLP used as stopping criterion
State-of-the-art discretization method
Application of MDLP:
    “Theory” is the splitting point (log₂(N−1) bits) plus class distribution in each subset
    DL before/after adding splitting point is compared

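The splitting step can be sketched as a search for the cut point with the highest information gain. This is a minimal illustration with invented names and toy data; it covers a single split only and leaves out both the recursion into the two subsets and the MDLP stopping test.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy in bits of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(values, labels):
    """Return (cut_point, information_gain) of the best binary split.

    Candidate cut points lie halfway between adjacent distinct values.
    Returns (None, 0.0) if no split has positive gain.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    base = entropy([label for _, label in pairs])
    best_cut, best_gain = None, 0.0
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot cut between equal values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        gain = base - (len(left) / n) * entropy(left) - (len(right) / n) * entropy(right)
        if gain > best_gain:
            best_cut = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_gain = gain
    return best_cut, best_gain

cut, gain = best_split([1, 2, 3, 4], ["a", "a", "b", "b"])
```

In the toy data the cut at 2.5 separates the two classes perfectly, so the gain equals the full entropy of the original set (1 bit).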
22 Example: temperature attribute (figure)

23 Formula for MDLP
N instances
k classes and entropy E in original set
k₁ classes and entropy E₁ in first subset
k₂ classes and entropy E₂ in second subset
Doesn’t result in any discretization intervals for the temperature attribute

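The formula itself did not survive in this transcript. For reference, the MDLP criterion of Fayyad and Irani, written with the symbols defined above, accepts a split only if its information gain satisfies:

```latex
\text{gain} \;>\; \frac{\log_2(N-1)}{N} \;+\; \frac{\log_2\!\left(3^{k}-2\right) - kE + k_1E_1 + k_2E_2}{N}
```

The first term is the cost of encoding the splitting point; the second covers the class distributions in the two subsets. If no candidate split exceeds the threshold, the attribute is left as a single interval.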
24 Other discretization methods
Top-down procedure can be replaced by bottom-up method
MDLP can be replaced by chi-squared test
Dynamic programming can be used to find optimum k-way split for given additive criterion
    Requires time quadratic in number of instances if entropy is used as criterion
    Can be done in linear time if error rate is used as evaluation criterion

25 Transformation
WEKA provides a lot of filters that can help you transform and select your attributes!
Use them to build a promising model for the caravan data!