Classification II

2 Numeric Attributes

Numeric attributes can take many values
– Creating a branch for each value is not ideal
The value range is usually split into two parts
The splitting position is determined using the idea of information gain
Consider the sorted temperature values of the weather data and their class labels:

  64  65  68  69  70  71  72  72  75  75  80  81  83  85
  yes no  yes yes yes no  no  yes no  yes yes no  yes yes

– We could create a split at any of the 11 positions between distinct adjacent values
– For each split, compute the information gain
– Select the split that gives the highest information gain
– E.g. temp < 71.5 produces 4 yes and 2 no, and temp > 71.5 produces 5 yes and 3 no
– Entropy([4,2],[5,3]) = 6/14 * Entropy(4/6,2/6) + 8/14 * Entropy(5/8,3/8) = 0.939 bits
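The split search described above can be written out directly. The following is a minimal Python sketch (not from the slides) that scores every candidate split of the temperature attribute by the weighted entropy of its two partitions; the split with the lowest weighted entropy has the highest information gain.

```python
import math

def entropy(counts):
    """Entropy (in bits) of a class distribution given as a list of counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# Sorted temperature values paired with their class labels (weather data)
temps  = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
labels = ["yes", "no", "yes", "yes", "yes", "no", "no",
          "yes", "no", "yes", "yes", "no", "yes", "yes"]

def split_entropy(threshold):
    """Weighted entropy of the two partitions produced by temp < threshold."""
    left  = [lab for t, lab in zip(temps, labels) if t < threshold]
    right = [lab for t, lab in zip(temps, labels) if t >= threshold]
    dist = lambda part: [part.count("yes"), part.count("no")]
    n = len(labels)
    return (len(left) / n) * entropy(dist(left)) + (len(right) / n) * entropy(dist(right))

# Candidate split points: midpoints between distinct adjacent values (11 of them)
candidates = sorted({(a + b) / 2 for a, b in zip(temps, temps[1:]) if a != b})
for thr in candidates:
    print(f"temp < {thr:5.1f}: weighted entropy = {split_entropy(thr):.3f} bits")
# e.g. temp < 71.5 gives 0.939 bits, matching the calculation above
```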

3 Missing Values

An easy solution is to treat a missing value as a new possible value for the attribute
– E.g. if Outlook has a missing value, we have rainy, overcast, sunny and missing as its possible values
– This makes sense if the fact that the attribute is missing is itself significant (e.g. missing medical test results)
– Also, the learning method needs no modification
A more complex solution is to let the missing value receive a proportion of each of the known values of the attribute
– The proportions are estimated from the proportions of instances with known values at a node
– E.g. if Outlook has a missing value in one instance, and there are 4 instances with rainy, 2 with overcast and 3 with sunny, then the missing value becomes {4/9 rainy, 2/9 overcast, 3/9 sunny}
– All the computations (such as information gain) are then performed using these weighted values
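A minimal sketch of the proportional-weight idea. The representation (each instance as a pair of an attribute dictionary and a weight) and the function name distribute_missing are illustrative choices, not from the slides.

```python
from collections import Counter

def distribute_missing(instances, attribute):
    """
    Replace a missing value (None) for `attribute` with weighted copies of the
    instance, one per known value, weighted by how often each value occurs.
    Each instance is an (attribute_dict, weight) pair.
    """
    known = Counter(attrs[attribute] for attrs, _ in instances
                    if attrs[attribute] is not None)
    total = sum(known.values())
    result = []
    for attrs, weight in instances:
        if attrs[attribute] is None:
            for value, count in known.items():
                copy = dict(attrs, **{attribute: value})
                result.append((copy, weight * count / total))
        else:
            result.append((attrs, weight))
    return result

# Example: one missing Outlook among 4 rainy, 2 overcast and 3 sunny instances
data = ([({"outlook": "rainy"}, 1.0)] * 4 + [({"outlook": "overcast"}, 1.0)] * 2 +
        [({"outlook": "sunny"}, 1.0)] * 3 + [({"outlook": None}, 1.0)])
for attrs, w in distribute_missing(data, "outlook")[-3:]:
    print(attrs["outlook"], round(w, 2))   # rainy 0.44, overcast 0.22, sunny 0.33
```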

4 Overfitting

When decision trees are grown until each leaf node is pure, they may learn unnecessary details from the training data
– This is called overfitting
– The unnecessary details may be noise, or a consequence of training data that is not representative
Overfitting makes the classifier perform poorly on independent test data
Two solutions
– Stop growing the tree early, before it overfits the training data – in practice it is hard to estimate when to stop
– Prune overfitted parts of the decision tree, evaluating the utility of pruning nodes using part of the training data as validation data
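As a rough illustration (not part of the slides, and assuming scikit-learn is available), the sketch below grows decision trees of increasing depth on noisy synthetic data. Training accuracy keeps rising as the tree gets deeper, while held-out accuracy levels off or drops; limiting the depth is one simple form of "stopping early".

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with some label noise (flip_y) so that a fully grown tree overfits
X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

for depth in (1, 2, 4, 8, None):          # None = grow until leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    print(f"max_depth={depth}: train={tree.score(X_train, y_train):.2f} "
          f"test={tree.score(X_test, y_test):.2f}")
```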

5 Naïve Bayes Classifier

A simple classifier based on observed probabilities
Assumes that all the attributes contribute towards classification
– Equally importantly
– Independently
For some data sets this classifier achieves better results than decision trees
Makes use of Bayes' rule of conditional probability
Bayes' rule: if H is a hypothesis and E is its evidence, then P(H|E) = P(E|H)P(H) / P(E)
– P(H|E) is a conditional probability: the probability of H given E

6 Example Data

Outlook   Temperature  Humidity  Windy  Play (class attribute)
sunny     hot          high      false  no
sunny     hot          high      true   no
overcast  hot          high      false  yes
rainy     mild         high      false  yes
rainy     cool         normal    false  yes
rainy     cool         normal    true   no
overcast  cool         normal    true   yes
sunny     mild         high      false  no
sunny     cool         normal    false  yes
rainy     mild         normal    false  yes
sunny     mild         normal    true   yes
overcast  mild         high      true   yes
overcast  hot          normal    false  yes
rainy     mild         high      true   no

7 Weather Data Counts and Probabilities

                 Counts           Probabilities
                 yes    no        yes     no
Outlook
  sunny           2      3        2/9     3/5
  overcast        4      0        4/9     0/5
  rainy           3      2        3/9     2/5
Temperature
  hot             2      2        2/9     2/5
  mild            4      2        4/9     2/5
  cool            3      1        3/9     1/5
Humidity
  high            3      4        3/9     4/5
  normal          6      1        6/9     1/5
Windy
  false           6      2        6/9     2/5
  true            3      3        3/9     3/5
Play              9      5        9/14    5/14
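The counts and probabilities above can be recomputed from the slide-6 data. The sketch below is one way to do it in Python; the data representation and names are illustrative, not from the slides.

```python
from collections import Counter, defaultdict

# The weather data from slide 6: (outlook, temperature, humidity, windy, play)
weather = [
    ("sunny", "hot", "high", "false", "no"),    ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"), ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"), ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"), ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"), ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),  ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"), ("rainy", "mild", "high", "true", "no"),
]
attributes = ("outlook", "temperature", "humidity", "windy")

class_counts = Counter(row[-1] for row in weather)     # {'yes': 9, 'no': 5}
cond_counts = defaultdict(Counter)                     # (attribute, value) -> class counts
for row in weather:
    for name, value in zip(attributes, row):
        cond_counts[(name, value)][row[-1]] += 1

# Conditional probabilities P(value | class), e.g. P(outlook=sunny | yes) = 2/9
for (name, value), counts in sorted(cond_counts.items()):
    probs = {c: counts[c] / class_counts[c] for c in class_counts}
    print(f"{name}={value}: counts {dict(counts)} -> {probs}")
print("P(yes) =", class_counts["yes"] / len(weather), " P(no) =", class_counts["no"] / len(weather))
```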

8 Naïve Bayes Example

Given the training weather data, the algorithm computes the probabilities shown on the previous slide
For a test instance {sunny, cool, high, true}, the algorithm uses Bayes' rule to compute the probabilities of Play = yes and Play = no
First let H be Play = yes; then from Bayes' rule (with the naïve independence assumption):
– P(yes|E) = (P(sunny|yes)*P(cool|yes)*P(high|yes)*P(true|yes)*P(yes)) / P(E)
– All the probabilities on the right-hand side except P(E) are known from the previous slide
– P(yes|E) = (2/9 * 3/9 * 3/9 * 3/9 * 9/14) / P(E) = 0.0053 / P(E)
– P(no|E) = (P(sunny|no)*P(cool|no)*P(high|no)*P(true|no)*P(no)) / P(E)
– P(no|E) = (3/5 * 1/5 * 4/5 * 3/5 * 5/14) / P(E) = 0.0206 / P(E)
– Because P(yes|E) + P(no|E) = 1, 0.0053/P(E) + 0.0206/P(E) = 1
– P(E) = 0.0053 + 0.0206 = 0.0259
– P(yes|E) = 0.0053 / 0.0259 = 20.5%
– P(no|E) = 0.0206 / 0.0259 = 79.5%
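The same calculation in Python, hard-coding the probabilities from slide 7 (the variable names are illustrative):

```python
# Prior probabilities and conditional probabilities from slide 7
p_yes, p_no = 9/14, 5/14
likelihood_yes = {"sunny": 2/9, "cool": 3/9, "high": 3/9, "true": 3/9}
likelihood_no  = {"sunny": 3/5, "cool": 1/5, "high": 4/5, "true": 3/5}

instance = ["sunny", "cool", "high", "true"]

score_yes, score_no = p_yes, p_no
for value in instance:
    score_yes *= likelihood_yes[value]
    score_no *= likelihood_no[value]

# Normalise so the two posteriors sum to 1 (this implicitly divides by P(E))
evidence = score_yes + score_no
print(f"P(yes | E) = {score_yes / evidence:.3f}")   # ~0.205
print(f"P(no  | E) = {score_no / evidence:.3f}")    # ~0.795
```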

9 K-nearest neighbour

No model is built explicitly – instances are stored verbatim
For an unknown test instance, a distance function is used to find the k nearest training instances, and their most common class is assigned
For numeric attributes, Euclidean distance is a natural distance function
Consider two instances: I1 with attribute values a1, a2, …, an and I2 with attribute values a1', a2', …, an'
– The Euclidean distance between the two instances is sqrt((a1 - a1')^2 + (a2 - a2')^2 + … + (an - an')^2)
Because different attributes have different scales, attribute values are normalized to lie between 0 and 1
– ai = (vi - min(vi)) / (max(vi) - min(vi))
Finding nearest neighbours is computationally expensive, because the rudimentary approach is to compute distances to all the instances
– Speed-up techniques exist but are not studied here
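A minimal k-nearest-neighbour sketch with min-max normalization and Euclidean distance. The toy data at the end is hypothetical, and for simplicity the query is normalized together with the training data.

```python
import math
from collections import Counter

def normalize(instances):
    """Scale each numeric attribute to [0, 1] using min-max normalization."""
    columns = list(zip(*instances))
    lo = [min(col) for col in columns]
    hi = [max(col) for col in columns]
    return [[(v - l) / (h - l) if h > l else 0.0
             for v, l, h in zip(row, lo, hi)] for row in instances]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_x, train_y, query, k=3):
    """Majority class among the k training instances closest to `query`."""
    scaled = normalize(train_x + [query])
    train_scaled, query_scaled = scaled[:-1], scaled[-1]
    neighbours = sorted(zip(train_scaled, train_y),
                        key=lambda pair: euclidean(pair[0], query_scaled))[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

# Toy example with two numeric attributes on very different scales
train_x = [[1.0, 200.0], [1.2, 220.0], [3.0, 900.0], [3.2, 850.0]]
train_y = ["a", "a", "b", "b"]
print(knn_predict(train_x, train_y, [2.9, 880.0], k=3))   # 'b'
```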

10 Classifier's Performance Metrics

The error rate of a classifier measures its overall performance
– Error rate = proportion of errors = number of misclassifications / total number of test instances
The error rate does not discriminate between different types of errors
A binary classifier (yes and no) makes two kinds of errors
– Calling an instance of 'no' an instance of 'yes': false positives
– Calling an instance of 'yes' an instance of 'no': false negatives
In practice, false positives and false negatives have different associated costs
– The cost of lending to a defaulter is larger than the lost-business cost of refusing a loan to a non-defaulter
– The cost of failing to detect a fire is larger than the cost of a false alarm

11 Confusion Matrix

The four possible outcomes of a binary classifier are usually shown in a confusion matrix:

                      Predicted Class
                      Yes    No     Total
Actual    Yes         TP     FN     P
Class     No          FP     TN     N
          Total       P'     N'

– TP – True Positives
– TN – True Negatives
– FP – False Positives
– FN – False Negatives
– P – Total actual positives (TP+FN)
– N – Total actual negatives (FP+TN)
– P' – Total predicted positives (TP+FP)
– N' – Total predicted negatives (TN+FN)
A number of performance metrics are defined using these counts

12 Performance Metrics Derived from Confusion Matrix

True Positive Rate, TPR = TP/P = TP/(TP+FN)
– Also known as sensitivity and recall
False Positive Rate, FPR = FP/N = FP/(FP+TN)
Accuracy = (TP+TN)/(TP+TN+FP+FN)
Error Rate = 1 – Accuracy
Specificity = 1 – FPR
Positive Predictive Value = TP/P' = TP/(TP+FP)
– Also known as precision
Negative Predictive Value = TN/N' = TN/(TN+FN)
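These formulas translate directly into code. The sketch below uses hypothetical counts purely for illustration.

```python
def metrics(tp, fn, fp, tn):
    """Performance metrics derived from the four confusion-matrix counts."""
    p, n = tp + fn, fp + tn            # actual positives / negatives (P, N)
    p_pred, n_pred = tp + fp, tn + fn  # predicted positives / negatives (P', N')
    return {
        "TPR (sensitivity, recall)": tp / p,
        "FPR": fp / n,
        "accuracy": (tp + tn) / (p + n),
        "error rate": 1 - (tp + tn) / (p + n),
        "specificity": 1 - fp / n,
        "precision (PPV)": tp / p_pred,
        "NPV": tn / n_pred,
    }

# Hypothetical counts for illustration
for name, value in metrics(tp=40, fn=10, fp=5, tn=45).items():
    print(f"{name}: {value:.3f}")
```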

13 ROC – Receiver Operating Characteristic

Metrics derived from the confusion matrix are useful for comparing classifiers
In particular, a plot of TPR on the y-axis against FPR on the x-axis is known as an ROC plot
A, B, C, D and E are five classifiers with different TPR and FPR values
– A is the ideal classifier, because it has TPR = 1.0 and FPR = 0
– E lies on the diagonal, which corresponds to random guessing
– C performs worse than random guessing, but its inverse, B, is better than D
Classifiers should aim to be in the north-west corner of the plot

[ROC plot: TPR (y-axis, 0 to 1.0) against FPR (x-axis, 0 to 1.0) showing classifiers A, B, C, D and E; points towards the top-left are better, towards the bottom-right worse]

14 Testing Options 1

Testing the classifier on the training data is not useful
– Performance figures from such testing will be optimistic
– Because the classifier was trained on that very data
Ideally, a separate data set called the 'test set' is used for testing
– If the test set is large, the performance figures will be more realistic
– Creating a test set needs experts' time, so creating large test sets is expensive
– After testing, the test set can be combined with the training data to produce a new classifier
– Sometimes a third data set, called 'validation data', is used for fine-tuning a classifier or for selecting one classifier among many
In practice, several strategies are used to make up for the lack of test data
– Holdout procedure – a certain proportion of the training data is held out as test data and the rest is used for training
– Cross-validation
– Leave-one-out cross-validation
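A minimal holdout sketch. The one-third test fraction and the use of Python's random module are illustrative choices, not from the slides.

```python
import random

def holdout_split(instances, test_fraction=1/3, seed=42):
    """Shuffle the data and hold out a fraction of it as a test set."""
    data = list(instances)
    random.Random(seed).shuffle(data)
    cut = int(len(data) * (1 - test_fraction))
    return data[:cut], data[cut:]     # (training set, test set)

data = list(range(14))                # stand-in for 14 labelled instances
train, test = holdout_split(data)
print(len(train), "training instances,", len(test), "held out for testing")
```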

15 Testing Options 2

Cross-validation
– Partition the data into a fixed number of folds
– Use each partition in turn for testing while using the remaining folds for training
– Every instance is used for testing exactly once
– 10-fold cross-validation is the standard, often repeated 10 times
Leave-one-out
– Is n-fold cross-validation, where n is the size of the data set
– One instance is held out for testing while the remaining instances are used for training
– The results of the single-instance tests are averaged to obtain the final test result
– Maximum use of the data for training
– No sampling of the data for testing; each instance is systematically used for testing
– High cost, because the classifier is trained n times
– Hard to ensure representative training data in each run
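A sketch of generating the fold indices for k-fold cross-validation; leave-one-out is simply the special case k = n. Shuffling and stratification are omitted for brevity.

```python
def cross_validation_folds(n_instances, k):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_instances))
    fold_size, remainder = divmod(n_instances, k)
    start = 0
    for fold in range(k):
        size = fold_size + (1 if fold < remainder else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        start += size
        yield train, test

# 10-fold cross-validation over 14 instances
for train, test in cross_validation_folds(14, k=10):
    pass  # train a classifier on `train`, evaluate on `test`, average the results

# Leave-one-out is n-fold cross-validation with k = n
loo = list(cross_validation_folds(14, k=14))
print(len(loo), "folds, each holding out exactly", len(loo[0][1]), "instance")
```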