Data Engineering: Data preprocessing and transformation


Just apply a learner? No!
- Algorithms are biased
  - No free lunch theorem: considering all possible data distributions, no algorithm is better than another
- Algorithms make assumptions about the data:
  - Conditionally independent features (naive Bayes)
  - All features relevant (e.g., kNN, C4.5)
  - All features discrete (e.g., 1R)
  - Little/no noise (many regression algorithms)
- Given data:
  - Choose/adapt the algorithm to the data (selection / parameter tuning)
  - Adapt the data to the algorithm (data engineering)

Data Engineering  Attribute selection (feature selection)  Remove features with little/no predictive information  Attribute discretization  Convert numerical attributes to nominal ones  Data transformations (feature generation)  Transform data to another representation

Irrelevant features can 'confuse' algorithms
- kNN: curse of dimensionality
  - The number of training instances required increases exponentially with the number of (irrelevant) attributes
  - The distance between neighbors increases with every new dimension
- C4.5: data fragmentation problem
  - Attributes are selected on less and less data after every split
  - Even random attributes can look good on small samples
  - Partially corrected by pruning
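
As a quick illustration of the distance effect (not from the original slides), the following Python sketch draws uniformly random points in increasing dimensionality and reports how the average nearest-neighbour distance grows; the point count and dimensions are arbitrary choices.

```python
import random
import math

def nn_distance_stats(n_points=200, dims=(1, 2, 10, 50, 100), seed=0):
    """For each dimensionality d, report the average distance from a point
    to its nearest neighbour among uniformly random points in [0, 1]^d."""
    rng = random.Random(seed)
    for d in dims:
        points = [[rng.random() for _ in range(d)] for _ in range(n_points)]
        nn_dists = []
        for i, p in enumerate(points):
            best = min(math.dist(p, q) for j, q in enumerate(points) if j != i)
            nn_dists.append(best)
        print(f"d={d:4d}  mean nearest-neighbour distance = "
              f"{sum(nn_dists) / len(nn_dists):.3f}")

if __name__ == "__main__":
    nn_distance_stats()
```

As the dimension grows, all points drift away from each other, so distances become less informative, which is exactly why irrelevant attributes hurt kNN.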

Attribute selection
- Other benefits:
  - Speed: irrelevant attributes often slow down algorithms
  - Interpretability: e.g. avoids huge decision trees
- Two types of attribute selection:
  - Feature ranking: rank features by a relevancy metric, then apply a cut-off
  - Feature (subset) selection: search for an optimal subset

Attribute selection
- Two approaches (besides manual removal):
  - Filter approach: learner-independent, based on data properties or on simple models built by other learners
  - Wrapper approach: learner-dependent; rerun the learner with different attribute sets and select based on performance
(Diagram: a filter runs before the learner; a wrapper runs the learner inside the selection loop.)

Filters
- Basic approach: find the smallest feature set that still separates the data
  - Expensive, often causes overfitting
- Better: use another learner as a filter
  - Many models expose the importance of features, e.g. C4.5, 1R, ...
  - Recursive: select one attribute, remove it, repeat
  - Produces a ranking: the cut-off is defined by the user

Filters using C4.5
- Select the feature(s) tested in the top-level node(s)
- A 'decision stump' (a single node) is sufficient
- Select feature 'outlook', remove it, repeat

Filters using 1R
- Select the feature chosen by 1R, remove it, repeat
- Example rule: If (outlook == sunny) play = no, else play = yes
- Select feature 'outlook', remove it, repeat
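
To make the recursive 1R filter concrete, here is a hedged Python sketch (not part of the original slides) that scores each nominal attribute by the error of its one-rule and repeatedly removes the best one; the attribute names and toy dataset are invented for illustration.

```python
from collections import Counter

def one_r_error(rows, attr, target):
    """Error of the 1R rule for `attr`: predict the majority class per value."""
    by_value = {}
    for row in rows:
        by_value.setdefault(row[attr], Counter())[row[target]] += 1
    wrong = sum(sum(c.values()) - max(c.values()) for c in by_value.values())
    return wrong / len(rows)

def one_r_ranking(rows, attributes, target):
    """Recursive 1R filter: pick the best attribute, drop it, repeat."""
    remaining, ranking = list(attributes), []
    while remaining:
        best = min(remaining, key=lambda a: one_r_error(rows, a, target))
        ranking.append(best)
        remaining.remove(best)
    return ranking

# Toy weather-style data (illustrative values only).
data = [
    {"outlook": "sunny", "windy": "false", "play": "no"},
    {"outlook": "sunny", "windy": "true", "play": "no"},
    {"outlook": "overcast", "windy": "false", "play": "yes"},
    {"outlook": "rainy", "windy": "false", "play": "yes"},
    {"outlook": "rainy", "windy": "true", "play": "no"},
]
print(one_r_ranking(data, ["outlook", "windy"], "play"))
```

Since 1R scores each attribute independently of the others, the loop is equivalent to simply sorting attributes by rule error; it is written as a remove-and-repeat loop to mirror the recipe above.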

Filters using kNN
- Weight features by their capability to separate classes:
  - Nearest neighbour of the same class: reduce the weight of features with a different value (irrelevant)
  - Nearest neighbour of another class: increase the weight of features with a different value (relevant)
- For a neighbour of a different class: increase the weight of attribute a_1 in proportion to d_1, and of a_2 in proportion to d_2
(Figure: distances d_1, d_2 along attributes a_1, a_2 to a neighbour of a different class.)
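
This update rule resembles the Relief family of algorithms (WEKA's ReliefFAttributeEval, listed later in the deck). Below is a hedged Python sketch of a basic two-class Relief update under that assumption; the toy data and the number of iterations are illustrative choices.

```python
import random
import math

def relief_weights(X, y, n_iters=100, seed=0):
    """Basic two-class Relief: for a sampled instance, find the nearest hit
    (same class) and nearest miss (other class), then decrease weights for
    features that differ from the hit and increase them for features that
    differ from the miss."""
    rng = random.Random(seed)
    n_features = len(X[0])
    # Per-feature ranges, used to normalise feature differences to [0, 1].
    ranges = [max(col) - min(col) or 1.0 for col in zip(*X)]
    w = [0.0] * n_features

    def diff(f, a, b):
        return abs(a[f] - b[f]) / ranges[f]

    def nearest(i, same_class):
        candidates = [j for j in range(len(X))
                      if j != i and (y[j] == y[i]) == same_class]
        return min(candidates, key=lambda j: math.dist(X[i], X[j]))

    for _ in range(n_iters):
        i = rng.randrange(len(X))
        hit, miss = nearest(i, True), nearest(i, False)
        for f in range(n_features):
            w[f] += (diff(f, X[i], X[miss]) - diff(f, X[i], X[hit])) / n_iters
    return w

# Toy data: feature 0 separates the classes, feature 1 is random noise.
rng = random.Random(1)
X = [[float(c), rng.random()] for c in (0, 0, 0, 1, 1, 1) for _ in range(5)]
y = [c for c in (0, 0, 0, 1, 1, 1) for _ in range(5)]
print(relief_weights(X, y))
```

The informative feature ends up with a clearly larger weight than the noise feature, which is the ranking signal the filter uses.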

Filters
- Direct filtering: use data properties
- Correlation-based Feature Selection (CFS):
  - Select attributes with high class correlation and little intercorrelation
  - Measured by the symmetric uncertainty between an attribute A and the class attribute B:
    U(A, B) = 2 * (H(A) + H(B) - H(A, B)) / (H(A) + H(B)),  where H() is entropy
  - Select a subset by aggregating this measure over the attributes A_j for the class C
  - Ties are broken in favor of smaller subsets
  - Fast; the default in WEKA
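
As a minimal sketch (assuming the standard entropy-based definition written out above), the following Python computes the symmetric uncertainty between two nominal attributes; the attribute values in the example are invented.

```python
from collections import Counter
from math import log2

def entropy(values):
    """H(X) in bits for a list of nominal values."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def symmetric_uncertainty(a, b):
    """U(A, B) = 2 * (H(A) + H(B) - H(A, B)) / (H(A) + H(B))."""
    h_a, h_b = entropy(a), entropy(b)
    h_ab = entropy(list(zip(a, b)))  # joint entropy of the value pairs
    if h_a + h_b == 0:
        return 0.0
    return 2 * (h_a + h_b - h_ab) / (h_a + h_b)

# Toy example: 'outlook' versus the class 'play' (values are illustrative).
outlook = ["sunny", "sunny", "overcast", "rainy", "rainy", "overcast"]
play = ["no", "no", "yes", "yes", "no", "yes"]
print(symmetric_uncertainty(outlook, play))
```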

Wrappers
- Learner-dependent (selection for a specific learner)
- Wrapper around the learner: select features, evaluate the learner (e.g. by cross-validation)
- Expensive:
  - Greedy search: O(k^2) for k attributes
  - When using a prior ranking (only the cut-off must be found): O(k)

Wrappers: search
- Search the space of attribute subsets, e.g. for the weather data

Wrappers: search
- Greedy search:
  - Forward selection (add one attribute, keep the best)
  - Backward elimination (remove one attribute, keep the best)
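
A minimal sketch of a forward-selection wrapper (not from the slides): the wrapped learner, its evaluation, and the data are placeholders, here a 1-nearest-neighbour classifier scored by leave-one-out accuracy.

```python
import math

def loo_accuracy_1nn(X, y, features):
    """Leave-one-out accuracy of 1-NN using only the given feature indices."""
    if not features:
        return 0.0
    correct = 0
    for i in range(len(X)):
        j = min((j for j in range(len(X)) if j != i),
                key=lambda j: math.dist([X[i][f] for f in features],
                                        [X[j][f] for f in features]))
        correct += y[j] == y[i]
    return correct / len(X)

def forward_selection(X, y, evaluate=loo_accuracy_1nn):
    """Greedy forward selection: add the feature that improves the score most."""
    selected, remaining = [], list(range(len(X[0])))
    best_score = evaluate(X, y, selected)
    while remaining:
        scores = {f: evaluate(X, y, selected + [f]) for f in remaining}
        f, score = max(scores.items(), key=lambda kv: kv[1])
        if score <= best_score:
            break  # no candidate improves the wrapped learner
        selected.append(f)
        remaining.remove(f)
        best_score = score
    return selected, best_score

# Toy data: feature 0 is informative, feature 1 is noise.
X = [[0.1, 0.9], [0.2, 0.1], [0.15, 0.5], [0.9, 0.4], [0.8, 0.8], [0.85, 0.2]]
y = [0, 0, 0, 1, 1, 1]
print(forward_selection(X, y))
```

Each round retrains and re-evaluates the learner for every remaining attribute, which is where the O(k^2) cost of greedy wrapper search comes from.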

Wrappers: search
- Other search techniques (besides greedy search):
  - Bidirectional search
  - Best-first search: keep a sorted list of subsets, backtrack until the optimal solution is found
  - Beam search: best-first search keeping only the k best nodes
  - Genetic algorithms: 'evolve' a good subset by random perturbations of a list of candidate subsets
- Still expensive...

Preprocessing with WEKA
- Attribute subset selection:
  - ClassifierSubsetEval: use another learner as a filter
  - CfsSubsetEval: Correlation-based Feature Selection
  - WrapperSubsetEval: choose the learner to be wrapped (with a search strategy)
- Attribute ranking approaches (with Ranker):
  - GainRatioAttributeEval, InfoGainAttributeEval: C4.5-based, rank attributes by gain ratio / information gain
  - ReliefFAttributeEval: kNN-based attribute weighting
  - OneRAttributeEval, SVMAttributeEval: use 1R or an SVM as a filter for attributes, with recursive feature elimination

The 'Select attributes' tab
- Select the attribute selection approach, the search strategy, and the class attribute
- The output shows the selected attributes or a ranked list
(Annotated screenshot of the WEKA 'Select attributes' tab.)

The 'Preprocess' tab
- Use the attribute selection feedback to remove unnecessary attributes (manually)
- Or: select 'AttributeSelection' as a filter and apply it (this removes irrelevant attributes and ranks the rest)

Data Engineering  Attribute selection (feature selection)  Remove features with little/no predictive information  Attribute discretization  Convert numerical attributes to nominal ones  Data transformations (feature generation)  Transform data to another representation

Attribute discretization
- Some learners cannot handle numeric data: 'discretize' the values into small intervals
  - Discretization always loses information: try to preserve as much as possible
- Some learners can handle numeric values, but are:
  - Naive (naive Bayes assumes a normal distribution)
  - Local (C4.5 discretizes in each node, on less and less data)
- Discretization:
  - Transform into one k-valued nominal attribute, or
  - Replace with k-1 new binary attributes, e.g. for values a, b, c: a → {0,0}, b → {1,0}, c → {1,1}
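
A minimal sketch of that k-1 binary ('thermometer') coding for an ordered set of values; the ordering and the value names are assumptions for illustration.

```python
def thermometer_code(value, ordered_values):
    """Encode an ordinal value with k-1 binary attributes: the i-th bit is 1
    if the value lies strictly above the i-th of the k ordered values."""
    return [1 if ordered_values.index(value) > i else 0
            for i in range(len(ordered_values) - 1)]

# Values a < b < c, as in the slide: a -> {0,0}, b -> {1,0}, c -> {1,1}.
for v in ["a", "b", "c"]:
    print(v, thermometer_code(v, ["a", "b", "c"]))
```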

Unsupervised discretization
- Determine intervals without knowing the class labels
  - In unsupervised settings this is the only possible option
- Strategies:
  - Equal-interval binning: create intervals of fixed width
    - Often creates bins with very many or very few examples

Unsupervised discretization
- Strategies (continued):
  - Equal-frequency binning: create bins of equal size (also called histogram equalization)
  - Proportional k-interval discretization: equal-frequency binning with the number of bins set to sqrt(dataset size)
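
A small sketch of both unsupervised strategies (the numeric values below are illustrative):

```python
def equal_width_bins(values, k):
    """Split the value range into k intervals of fixed width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1.0
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_frequency_bins(values, k):
    """Assign values to k bins holding (roughly) the same number of examples."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = min(rank * k // len(values), k - 1)
    return bins

temps = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
print(equal_width_bins(temps, 3))
print(equal_frequency_bins(temps, 3))
```

Equal-width bins can end up very unbalanced when values cluster, which is exactly the drawback noted above; equal-frequency binning avoids that at the cost of uneven interval widths.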

Supervised discretization
- The supervised approach usually works better:
  - Better if all/most examples in a bin have the same class
  - Correlates better with the class attribute (less predictive information is lost)
- Different approaches:
  - Entropy-based
  - Bottom-up merging
  - ...

Entropy-based discretization
- Split the data in the same way C4.5 would: each leaf becomes a bin
- Use entropy as the splitting criterion:
  H(p) = -p log(p) - (1-p) log(1-p)

Example: the temperature attribute (weather data)

  Temperature | 64  65  68  69  70  71  72  72  75  75  80  81  83  85
  Play        | Yes No  Yes Yes Yes No  No  Yes Yes Yes No  Yes Yes No

- info([1,0], [8,5]) = 0.9 bits
- Choose the cut-off with the highest gain
- Define the threshold halfway between two values: (83+85)/2 = 84

Example: the temperature attribute (continued, same Temperature/Play table as above)
- Repeat by further subdividing the intervals
- Optimization: only consider splits where the class changes
  - This is always optimal (proven)
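
A hedged sketch of one step of entropy-based splitting: it evaluates every midpoint between adjacent values where the class changes and picks the one with the lowest weighted entropy; the data are the temperature/Play values shown above.

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_entropy_split(values, labels):
    """Return (threshold, weighted entropy) of the best binary split,
    considering only boundaries where the class label changes."""
    pairs = sorted(zip(values, labels))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][1] == pairs[i][1]:
            continue  # splitting inside a pure run of one class is never needed
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for v, lab in pairs if v <= threshold]
        right = [lab for v, lab in pairs if v > threshold]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if best is None or score < best[1]:
            best = (threshold, score)
    return best

temps = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
play = ["yes", "no", "yes", "yes", "yes", "no", "no",
        "yes", "yes", "yes", "no", "yes", "yes", "no"]
print(best_entropy_split(temps, play))
```

Recursing on the resulting intervals (with a stopping criterion such as MDL) yields the full entropy-based discretization.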

Make data 'numeric'
- The inverse problem: some algorithms assume numeric features, e.g. kNN classification
- You could simply number the nominal values 1..k (a=0, b=1, c=2, ...)
  - However, there isn't always a logical order
- Better: replace an attribute with k nominal values by k binary 'indicator attributes'
  - Value '1' if the example has the nominal value corresponding to that indicator attribute, '0' otherwise: a → {1,0}, b → {0,1}

  A | Aa Ab
  a |  1  0
  b |  0  1
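
A minimal sketch of such indicator ('one-hot') attributes, with invented attribute values:

```python
def indicator_attributes(values):
    """Replace a nominal attribute by one binary indicator per distinct value."""
    distinct = sorted(set(values))
    header = [f"A={v}" for v in distinct]
    rows = [[1 if v == d else 0 for d in distinct] for v in values]
    return header, rows

header, rows = indicator_attributes(["a", "b", "a", "b"])
print(header)   # ['A=a', 'A=b']
for r in rows:
    print(r)    # [1, 0], [0, 1], [1, 0], [0, 1]
```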

Discretization with WEKA
- Unsupervised:
  - Discretize: equal-width or equal-frequency binning
  - PKIDiscretize: equal-frequency binning with #bins = sqrt(#values)
- Supervised:
  - Discretize: entropy-based discretization

WEKA: Discretization filter
- Select (un)supervised > attribute > Discretize

Data Engineering  Attribute selection (feature selection)  Remove features with little/no predictive information  Attribute discretization  Convert numerical attributes to nominal ones  Data transformations (feature generation)  Transform data to another representation

Data transformations  Often, a data transformation can lead to new insights in the data and better performance  Simple transformations:  Subtract two ‘date’ attributes to get ‘age’ attribute  If linear relationship is suspected between numeric attributes A and B: add attribute A/B  Clustering the data  add one attribute recording the cluster of each instance  add k attributes with membership of each cluster

Data transformations  Other transformations:  Add noise to data (to test robustness of algorithm)  Obfuscate the data (to preserve privacy)

Data transformation filters
- Select unsupervised > attribute > …

Some WEKA implementations
- Simple transformations:
  - AddCluster: clusters the data and adds an attribute with the resulting cluster of each data point
  - ClusterMembership: clusters the data and adds k attributes with the membership of each data point in each of the k clusters
  - AddNoise: changes a percentage of an attribute's values
  - Obfuscate: renames attributes and nominal/string values to random names