Data preparation: Selection, Preprocessing, and Transformation
Literature: I.H. Witten and E. Frank, Data Mining, chapters 2 and 7.


1 Data preparation: Selection, Preprocessing, and Transformation
Literature: I.H. Witten and E. Frank, Data Mining, chapters 2 and 7

2 Fayyad’s KDD Methodology
data → target data → processed data → transformed data → patterns → knowledge
- Selection: data → target data
- Preprocessing & cleaning: target data → processed data
- Transformation & feature selection: processed data → transformed data
- Data Mining: transformed data → patterns
- Interpretation / Evaluation: patterns → knowledge

3 Contents
- Data Selection
- Data Preprocessing
- Data Transformation

4 Data Selection
- Goal: understanding the data
- Explore the data: possible attributes, their values, distribution, outliers

5 Getting to know the data
- Simple visualization tools are very useful for identifying problems
- Nominal attributes: histograms (distribution consistent with background knowledge?)
- Numeric attributes: graphs (any obvious outliers?)
- 2-D and 3-D visualizations show dependencies
- Domain experts need to be consulted
- Too much data to inspect? Take a sample!
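As a concrete illustration of this kind of exploration (not part of the original slides), the following Python/pandas sketch prints value counts for nominal attributes, summary statistics for numeric ones, and draws a random sample; the file name weather.csv is a placeholder.

```python
# Illustrative sketch: quick data exploration with pandas.
import pandas as pd

df = pd.read_csv("weather.csv")          # hypothetical file name

# Nominal attributes: value counts (a textual histogram)
for col in df.select_dtypes(include="object"):
    print(df[col].value_counts(), "\n")

# Numeric attributes: summary statistics reveal obvious outliers
print(df.select_dtypes(include="number").describe())

# Too much data to inspect? Take a random sample.
sample = df.sample(n=min(1000, len(df)), random_state=0)
```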

6 Data preprocessing
- Problem: different data sources (e.g. sales department, customer billing department, …)
- Differences: styles of record keeping, conventions, time periods, data aggregation, primary keys, errors
- Data must be assembled, integrated, cleaned up
- “Data warehouse”: consistent point of access
- External data may be required (“overlay data”)
- Critical: type and level of data aggregation

7 Data Preprocessing
- Choose a data structure (table, tree, or set of tables)
- Choose attributes with enough information
- Decide on a first representation of the attributes (numeric or nominal)
- Decide on missing values
- Decide on inaccurate data (cleansing)

8 Attribute types used in practice
- Most schemes accommodate just two levels of measurement: nominal and ordinal
- Nominal attributes are also called “categorical”, “enumerated”, or “discrete”
- But: “enumerated” and “discrete” imply order
- Special case: dichotomy (“boolean” attribute)
- Ordinal attributes are called “numeric” or “continuous”
- But: “continuous” implies mathematical continuity

9 The ARFF format
% ARFF file for weather data with some numeric features
@relation weather
@attribute outlook { sunny, overcast, rainy }
@attribute temperature numeric
@attribute humidity numeric
@attribute windy { true, false }
@attribute play? { yes, no }
@data
sunny, 85, 85, false, no
sunny, 80, 90, true, no
overcast, 83, 86, false, yes
...
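For readers working outside WEKA, one possible way to read such a file in Python is scipy's ARFF loader; this sketch assumes the example above has been saved as weather.arff (the file name is an assumption).

```python
# Minimal sketch of reading an ARFF file in Python (WEKA itself reads ARFF natively).
from scipy.io import arff
import pandas as pd

data, meta = arff.loadarff("weather.arff")   # structured array + attribute metadata
df = pd.DataFrame(data)

# Nominal values come back as bytes; decode them for readability.
for col in df.select_dtypes(include=object):
    df[col] = df[col].str.decode("utf-8")

print(meta)        # attribute names and types
print(df.head())
```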

10 Attribute types
- ARFF supports numeric and nominal attributes
- Interpretation depends on the learning scheme
- Numeric attributes are interpreted as:
  - ordinal scales if less-than and greater-than are used
  - ratio scales if distance calculations are performed (normalization/standardization may be required)
- Instance-based schemes define distance between nominal values (0 if values are equal, 1 otherwise)
- Integers: nominal, ordinal, or ratio scale?
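A minimal sketch of the distance convention just mentioned, assuming instances are passed as Python dicts and numeric ranges are known in advance; the names are illustrative, not WEKA's API.

```python
# Mixed distance used by instance-based schemes: nominal attributes contribute
# 0 if equal and 1 otherwise; numeric attributes are scaled to [0, 1] first.
def mixed_distance(x, y, numeric_ranges):
    """x, y: dicts attribute -> value; numeric_ranges: attribute -> (min, max)."""
    d = 0.0
    for a in x:
        if a in numeric_ranges:                       # numeric: normalized difference
            lo, hi = numeric_ranges[a]
            d += abs(x[a] - y[a]) / (hi - lo) if hi > lo else 0.0
        else:                                         # nominal: 0/1 distance
            d += 0.0 if x[a] == y[a] else 1.0
    return d
```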

11 Nominal vs. ordinal
- Attribute “age” nominal:
  If age = young and astigmatic = no and tear production rate = normal then recommendation = soft
  If age = pre-presbyopic and astigmatic = no and tear production rate = normal then recommendation = soft
- Attribute “age” ordinal (e.g. “young” < “pre-presbyopic” < “presbyopic”):
  If age ≤ pre-presbyopic and astigmatic = no and tear production rate = normal then recommendation = soft

12 Missing values
- Frequently indicated by out-of-range entries
- Types: unknown, unrecorded, irrelevant
- Reasons: malfunctioning equipment, changes in experimental design, collation of different datasets, measurement not possible
- A missing value may have significance in itself (e.g. a missing test in a medical examination)
- Most schemes assume that is not the case → “missing” may need to be coded as an additional value
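A small pandas sketch of the two options above: coding “missing” as an additional nominal value versus imputing when missingness is assumed to carry no meaning. The column name and values are made up for illustration.

```python
# Two ways of handling missing nominal values.
import pandas as pd
import numpy as np

df = pd.DataFrame({"test_result": ["pos", np.nan, "neg", np.nan]})

# Treat "missing" as its own nominal value (missingness may be informative)...
df["test_result_coded"] = df["test_result"].fillna("missing")

# ...or impute the most frequent value if missingness is assumed to carry no meaning.
df["test_result_imputed"] = df["test_result"].fillna(df["test_result"].mode()[0])
print(df)
```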

13 Inaccurate values
- Reason: the data has not been collected for the purpose of mining it
- Result: errors and omissions that don’t affect the original purpose of the data (e.g. age of customer)
- Typographical errors in nominal attributes → values need to be checked for consistency
- Typographical and measurement errors in numeric attributes → outliers need to be identified
- Errors may be deliberate (e.g. wrong zip codes)
- Other problems: duplicates, stale data
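The checks suggested here can be sketched in a few lines; the helper names, the allowed-value set, and the z-score threshold below are illustrative choices, not a prescribed method.

```python
# Simple cleansing checks: consistency of nominal values, outlier flags for numeric values.
import pandas as pd

def check_nominal(series, allowed):
    """Report nominal values outside the allowed set (likely typos)."""
    return sorted(set(series.dropna()) - set(allowed))

def flag_numeric_outliers(series, z=2.0):
    """Flag values more than z standard deviations from the mean."""
    return series[(series - series.mean()).abs() > z * series.std()]

outlook = pd.Series(["sunny", "overcast", "riany", "sunny"])       # "riany" is a typo
temperature = pd.Series([30, 32, 28, 31, 29, 27, 33, 250])         # 250 is a measurement error

print(check_nominal(outlook, allowed={"sunny", "overcast", "rainy"}))   # ['riany']
print(flag_numeric_outliers(temperature))                               # flags 250
```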

14 Transformation: Attribute selection
- Adding a random (i.e. irrelevant) attribute can significantly degrade C4.5’s performance
- Problem: attribute selection decisions deeper in the tree are based on smaller and smaller amounts of data
- IBL (instance-based learning) is also very susceptible to irrelevant attributes
- The number of training instances required increases exponentially with the number of irrelevant attributes
- Naïve Bayes doesn’t have this problem
- Relevant attributes can also be harmful
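A hedged illustration of this effect: C4.5 and IBL themselves are not in scikit-learn, so the sketch below uses a CART decision tree and k-NN as stand-ins and compares cross-validated accuracy before and after appending random attributes. The dataset and the number of noise attributes are arbitrary choices.

```python
# Appending purely random attributes and measuring the drop in accuracy.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
X_noisy = np.hstack([X, rng.normal(size=(X.shape[0], 20))])   # 20 irrelevant attributes

for name, clf in [("tree", DecisionTreeClassifier(random_state=0)),
                  ("k-NN", KNeighborsClassifier(n_neighbors=3))]:
    for label, data in [("original", X), ("with noise", X_noisy)]:
        score = cross_val_score(clf, data, y, cv=10).mean()
        print(f"{name:5s} {label:11s} accuracy = {score:.3f}")
```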

15 Scheme-independent selection
- Filter approach: assessment based on general characteristics of the data
- One method: find a subset of attributes that is enough to separate all the instances
- Another method: use a different learning scheme (e.g. C4.5, 1R) to select attributes
- IBL-based attribute weighting techniques can also be used (but they can’t find redundant attributes)
- CFS: uses correlation-based evaluation of attribute subsets
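A sketch of the idea behind CFS: a subset scores well if its attributes correlate with the class but not with each other. Pearson correlation is used here as a stand-in for the symmetric-uncertainty measure CFS actually uses; the merit formula follows Hall's CFS, everything else (data, names) is illustrative.

```python
# Correlation-based merit of an attribute subset (higher is better).
import numpy as np

def cfs_merit(X, y, subset):
    """X: (n, d) numeric array, y: numerically encoded class, subset: column indices."""
    k = len(subset)
    r_cf = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in subset])
    if k == 1:
        return r_cf
    r_ff = np.mean([abs(np.corrcoef(X[:, i], X[:, j])[0, 1])
                    for a, i in enumerate(subset) for j in subset[a + 1:]])
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)

# Tiny synthetic demo: attribute 0 predicts the class, attribute 1 is pure noise,
# so adding attribute 1 lowers the merit.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
X = np.column_stack([y + rng.normal(scale=0.3, size=200), rng.normal(size=200)])
print(cfs_merit(X, y, [0]), cfs_merit(X, y, [0, 1]))
```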

16 Attribute subsets for weather data

17 Searching the attribute space
- The number of possible attribute subsets is exponential in the number of attributes
- Common greedy approaches: forward selection and backward elimination
- More sophisticated strategies:
  - Bidirectional search
  - Best-first search: can find the optimum solution
  - Beam search: approximation to best-first search
  - Genetic algorithms
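Greedy forward selection can be written down in a few lines, independent of the evaluation function; in this generic sketch, evaluate stands in for any filter criterion (such as the CFS merit above) or wrapper criterion (next slide).

```python
# Greedy forward selection over attribute subsets.
def forward_selection(n_attributes, evaluate):
    selected, best_score = [], float("-inf")
    while True:
        candidates = [a for a in range(n_attributes) if a not in selected]
        if not candidates:
            break
        # Try adding each remaining attribute; keep the best extension.
        score, attr = max((evaluate(selected + [a]), a) for a in candidates)
        if score <= best_score:          # stop when no extension improves the score
            break
        selected.append(attr)
        best_score = score
    return selected, best_score
```

Backward elimination is the mirror image: start from the full attribute set and greedily remove the attribute whose removal improves the score most.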

18 Scheme-specific selection
- Wrapper approach: attribute selection is implemented as a wrapper around the learning scheme
- Evaluation criterion: cross-validation performance
- Time consuming: adds a factor of k² even for greedy approaches with k attributes
- Linearity in k requires prior ranking of the attributes
- Scheme-specific attribute selection is essential for learning decision tables
- Can be done efficiently for DTs and Naïve Bayes
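A sketch of the wrapper criterion itself: cross-validated accuracy of the target learner restricted to a candidate subset. Naïve Bayes and the iris data are stand-ins; plugging such a function into a greedy search like the one on the previous slide gives scheme-specific selection, at the cost of roughly k² subset evaluations for k attributes.

```python
# Wrapper evaluation: cross-validated accuracy of a learner on an attribute subset.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

def wrapper_score(subset, learner=GaussianNB(), cv=10):
    """Cross-validated accuracy of `learner` restricted to the attributes in `subset`."""
    return cross_val_score(learner, X[:, list(subset)], y, cv=cv).mean()

print(wrapper_score([0, 1]))        # a two-attribute subset
print(wrapper_score([0, 1, 2, 3]))  # all four attributes
```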

19 Discretizing numeric attributes
- Can be used to avoid making the normality assumption in Naïve Bayes and clustering
- A simple discretization scheme is used in 1R
- C4.5 performs local discretization
- Global discretization can be advantageous because it is based on more data
- The learner can be applied to the discretized attribute, or
- it can be applied to binary attributes coding the cut points in the discretized attribute

20 Unsupervised discretization
- Unsupervised discretization generates intervals without looking at class labels
- It is the only possible approach when clustering (no class labels are available)
- Two main strategies:
  - Equal-interval binning
  - Equal-frequency binning (also called histogram equalization)
- Inferior to supervised schemes in classification tasks
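Both strategies are one-liners in pandas: pd.cut gives equal-interval bins and pd.qcut gives equal-frequency bins. The values below are the temperature column of the weather data used throughout the book; the choice of four bins is arbitrary.

```python
# Equal-interval vs. equal-frequency binning of a numeric attribute.
import pandas as pd

temps = pd.Series([64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85])

equal_interval = pd.cut(temps, bins=4)     # four bins of equal width
equal_frequency = pd.qcut(temps, q=4)      # four bins with (roughly) equal counts

print(equal_interval.value_counts().sort_index())
print(equal_frequency.value_counts().sort_index())
```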

21 Entropy-based discretization
- Supervised method that builds a decision tree with pre-pruning on the attribute being discretized
- Entropy is used as the splitting criterion
- MDLP is used as the stopping criterion
- State-of-the-art discretization method
- Application of MDLP:
  - The “theory” is the splitting point (log₂(N−1) bits) plus the class distribution in each subset
  - The description length (DL) before and after adding the splitting point is compared

22 Example: temperature attribute (weather data, sorted by temperature, with the play class)
64  65  68  69  70  71  72  72  75  75  80  81  83  85
yes no  yes yes yes no  no  yes yes yes no  yes yes no

23 Formula for MDLP
- N instances
- k classes and entropy E in the original set
- k₁ classes and entropy E₁ in the first subset
- k₂ classes and entropy E₂ in the second subset
- The split is accepted only if
  gain > [ log₂(N−1) + log₂(3^k − 2) − k·E + k₁·E₁ + k₂·E₂ ] / N
- This criterion doesn’t result in any discretization intervals for the temperature attribute
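A pure-Python sketch of the recursive entropy/MDLP procedure described on the last two slides, written for clarity rather than speed; it follows the criterion stated above, and all names are illustrative.

```python
# Entropy-based discretization with the MDL stopping criterion.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values()) if n else 0.0

def mdlp_cut_points(values, labels):
    """Return cut points for one numeric attribute, found recursively."""
    pairs = sorted(zip(values, labels))
    xs, ys = [p[0] for p in pairs], [p[1] for p in pairs]
    n, base_entropy, k = len(ys), entropy(ys), len(set(ys))

    best = None
    for i in range(1, n):
        if xs[i] == xs[i - 1]:
            continue                                   # only cut between distinct values
        e1, e2 = entropy(ys[:i]), entropy(ys[i:])
        info = (i / n) * e1 + ((n - i) / n) * e2       # weighted entropy after the split
        if best is None or info < best[0]:
            best = (info, i, e1, e2)
    if best is None:
        return []

    info, i, e1, e2 = best
    gain = base_entropy - info
    k1, k2 = len(set(ys[:i])), len(set(ys[i:]))
    # MDL test: accept only if the gain pays for encoding the split point
    # and the class distributions of the two subsets.
    delta = log2(3 ** k - 2) - (k * base_entropy - k1 * e1 - k2 * e2)
    if gain <= (log2(n - 1) + delta) / n:
        return []

    cut = (xs[i - 1] + xs[i]) / 2
    return mdlp_cut_points(xs[:i], ys[:i]) + [cut] + mdlp_cut_points(xs[i:], ys[i:])

# Weather data's temperature attribute (slide 22): no split passes the MDL test.
temps = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
play  = ["yes", "no", "yes", "yes", "yes", "no", "no",
         "yes", "yes", "yes", "no", "yes", "yes", "no"]
print(mdlp_cut_points(temps, play))   # expected: []
```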

24 Other discretization methods
- The top-down procedure can be replaced by a bottom-up method
- MDLP can be replaced by a chi-squared test
- Dynamic programming can be used to find the optimum k-way split for a given additive criterion
  - Requires time quadratic in the number of instances if entropy is used as the criterion
  - Can be done in linear time if error rate is used as the evaluation criterion

25 Transformation
- WEKA provides many filters that can help you transform and select your attributes!
- Use them to build a promising model for the caravan data!