Evaluation of Learning Models Evgueni Smirnov. Overview Motivation Metrics for Classifier’s Evaluation Methods for Classifier’s Evaluation Comparing Data.

Presentation transcript:

Evaluation of Learning Models Evgueni Smirnov

Overview Motivation Metrics for Classifier’s Evaluation Methods for Classifier’s Evaluation Comparing Data Mining Schemes Costs in Data Mining –Cost-Sensitive Classification and Learning –Lift Charts –ROC Curves

Motivation It is important to evaluate a classifier’s generalization performance in order to: –Determine whether to employ the classifier (for example: when learning the effectiveness of medical treatments from limited-size data, it is important to estimate the accuracy of the classifiers); –Optimize the classifier (for example: when post-pruning decision trees, we must evaluate the accuracy of the decision trees at each pruning step).

Model’s Evaluation in the KDD Process [Figure: the KDD process. Data → Selection → Target data → Preprocessing & cleaning → Processed data → Transformation & feature selection → Transformed data → Data Mining → Patterns → Interpretation/Evaluation → Knowledge.]

How to evaluate the Classifier’s Generalization Performance? Assume that we test a classifier on some test set and derive at the end the following confusion matrix (P and N are the numbers of actual positive and actual negative instances):

              Predicted Pos   Predicted Neg   Total
Actual Pos    TP              FN              P
Actual Neg    FP              TN              N

Metrics for Classifier’s Evaluation

              Predicted Pos   Predicted Neg   Total
Actual Pos    TP              FN              P
Actual Neg    FP              TN              N

–Accuracy = (TP+TN)/(P+N)
–Error = (FP+FN)/(P+N)
–Precision = TP/(TP+FP)
–Recall/TP rate = TP/P
–FP Rate = FP/N
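The formulas on this slide can be sketched in a few lines of Python. This is only an illustration: the function name and the counts are hypothetical and do not come from the lecture.

```python
def classification_metrics(tp, fn, fp, tn):
    """Compute the slide's metrics from the four confusion-matrix counts."""
    p = tp + fn  # number of actual positives (P)
    n = fp + tn  # number of actual negatives (N)
    return {
        "accuracy": (tp + tn) / (p + n),
        "error": (fp + fn) / (p + n),
        "precision": tp / (tp + fp),
        "recall": tp / p,   # also called the TP rate
        "fp_rate": fp / n,
    }


# Hypothetical counts, chosen only for illustration.
metrics = classification_metrics(tp=60, fn=40, fp=20, tn=80)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Note that accuracy and error always sum to 1, since every test instance is either classified correctly or not.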

How to Estimate the Metrics? We can use: –Training data; –Independent test data; –Hold-out method; –k-fold cross-validation method; –Leave-one-out method; –Bootstrap method; –And many more…

Estimation with Training Data The accuracy/error estimates on the training data are not good indicators of performance on future data. –Q: Why? –A: Because new data will probably not be exactly the same as the training data! The accuracy/error estimates on the training data measure the degree of the classifier’s overfitting. [Figure: Training set → Classifier → Training set]

Estimation with Independent Test Data Estimation with independent test data is used when we have plenty of data and there is a natural way of forming training and test data. For example: Quinlan in 1987 reported experiments in a medical domain for which the classifiers were trained on data from 1985 and tested on data from [year missing]. [Figure: Training set → Classifier → Test set]

Hold-out Method The hold-out method splits the data into training data and test data (usually 2/3 for training, 1/3 for testing). Then we build a classifier using the training data and test it using the test data. The hold-out method is usually used when we have thousands of instances, including several hundred instances from each class. [Figure: Data is split into a training set and a test set; Training set → Classifier → Test set]
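A minimal sketch of such a split, assuming the instances fit in a Python list and that a plain random shuffle is acceptable. The function name, the seed and the toy data are illustrative assumptions.

```python
import random


def holdout_split(instances, test_fraction=1 / 3, seed=0):
    """Shuffle the instances and reserve `test_fraction` of them for testing."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)  # fixed seed only for repeatability
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (training set, test set)


# Toy data: 30 instance identifiers.
train, test = holdout_split(range(30))
print(len(train), "training instances,", len(test), "test instances")
```

The classifier would then be built on `train` only and its accuracy measured on `test`.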

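Classification: Train, Validation, Test Split [Figure: data with known results is divided into a training set, a validation set and a final test set. The model builder (classifier builder) learns from the training set, the classifier is evaluated on the validation set, and the final evaluation is done on the final test set.] The test data can’t be used for parameter tuning!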

Making the Most of the Data Once evaluation is complete, all the data can be used to build the final classifier. Generally, the larger the training data the better the classifier (but returns diminish). The larger the test data the more accurate the error estimate.

Stratification The holdout method reserves a certain amount of the data for testing and uses the remainder for training. –Usually: one third for testing, the rest for training. For “unbalanced” datasets, samples might not be representative. –Few or no instances of some classes. Stratified sample: advanced version of balancing the data. –Make sure that each class is represented with approximately equal proportions in both subsets.
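One way to stratify a holdout split is to split each class separately and then merge the parts. The sketch below assumes class labels in a list. The helper name and the toy labels are hypothetical.

```python
import random
from collections import defaultdict


def stratified_holdout(labels, test_fraction=1 / 3, seed=0):
    """Return (train_idx, test_idx) so that every class is split in the same proportion."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


# Toy unbalanced dataset: 6 positives and 24 negatives.
labels = ["pos"] * 6 + ["neg"] * 24
train_idx, test_idx = stratified_holdout(labels)
print(sum(labels[i] == "pos" for i in test_idx), "positives in the test set")
```

With these toy labels, one third of each class ends up in the test set, so the rare class cannot disappear from either subset.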

Repeated Holdout Method Holdout estimate can be made more reliable by repeating the process with different subsamples. –In each iteration, a certain proportion is randomly selected for training (possibly with stratification). –The error rates on the different iterations are averaged to yield an overall error rate. This is called the repeated holdout method.

Repeated Holdout Method, 2 Still not optimal: the different test sets overlap, but we would like every instance in the data to be tested at least once. Can we prevent overlapping? (Witten & Eibe)

k-Fold Cross-Validation k-fold cross-validation avoids overlapping test sets: –First step: data is split into k subsets of equal size; –Second step: each subset in turn is used for testing and the remainder for training. The subsets are stratified before the cross-validation. The estimates are averaged to yield an overall estimate. [Figure: the data is divided into folds; in each round a different fold is the test set and the remaining folds form the training set for the classifier.]
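The two steps can be sketched as an index generator. For brevity this sketch neither shuffles nor stratifies, which a real experiment would do before splitting. The function name is an assumption made for illustration.

```python
def k_fold_indices(n_instances, k):
    """Yield (train_idx, test_idx) pairs; every instance is tested exactly once."""
    indices = list(range(n_instances))
    folds = [indices[i::k] for i in range(k)]  # k folds of (nearly) equal size
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


splits = list(k_fold_indices(10, 5))
for train_idx, test_idx in splits:
    print("test:", test_idx, "train:", train_idx)
```

The overall estimate is then the average of the k per-fold estimates obtained by training on `train_idx` and testing on `test_idx`.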

More on Cross-Validation Standard method for evaluation: stratified 10-fold cross-validation. Why 10? Extensive experiments have shown that this is the best choice to get an accurate estimate. Stratification reduces the estimate’s variance. Even better: repeated stratified cross-validation: –E.g. ten-fold cross-validation is repeated ten times and results are averaged (reduces the variance).

Leave-One-Out Cross-Validation Leave-One-Out is a particular form of cross-validation: –Set number of folds to number of training instances; –I.e., for n training instances, build classifier n times. Makes best use of the data. Involves no random sub-sampling. Very computationally expensive.

Leave-One-Out Cross-Validation and Stratification A disadvantage of Leave-One-Out-CV is that stratification is not possible: –It guarantees a non-stratified sample because there is only one instance in the test set! Extreme example: a random dataset split equally into two classes: –Best inducer predicts majority class; –50% accuracy on fresh data; –Leave-One-Out-CV estimate is 100% error, because removing the test instance makes its class the minority in the training data, so the majority prediction is always wrong.
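The extreme example can be reproduced directly. The sketch below assumes a majority-class predictor and a toy dataset of 50 instances per class. Both are illustrative choices, not part of the lecture.

```python
from collections import Counter


def majority_class(labels):
    """Return the most frequent label in the training data."""
    return Counter(labels).most_common(1)[0][0]


def loo_error_of_majority_predictor(labels):
    """Leave-one-out error of a classifier that always predicts the majority class."""
    errors = 0
    for i, held_out in enumerate(labels):
        training = labels[:i] + labels[i + 1:]  # everything except instance i
        if majority_class(training) != held_out:
            errors += 1
    return errors / len(labels)


labels = ["a"] * 50 + ["b"] * 50
loo_error = loo_error_of_majority_predictor(labels)
print("leave-one-out error:", loo_error)
```

Each held-out instance leaves 49 instances of its own class against 50 of the other, so the majority prediction misses every time.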

Bootstrap Method Cross validation uses sampling without replacement: –The same instance, once selected, can not be selected again for a particular training/test set The bootstrap uses sampling with replacement to form the training set: –Sample a dataset of n instances n times with replacement to form a new dataset of n instances; –Use this data as the training set; –Use the instances from the original dataset that don’t occur in the new training set for testing.

Bootstrap Method The bootstrap method is also called the 0.632 bootstrap: –In each of the n draws, a particular instance has a probability of 1–1/n of not being picked; –Thus its probability of ending up in the test data (never being picked in n draws) is: (1 – 1/n)^n ≈ e^(–1) ≈ 0.368; –This means the training data will contain approximately 63.2% of the instances and the test data will contain approximately 36.8% of the instances.
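The 36.8% figure can be checked numerically: the probability that an instance is never picked in n draws with replacement is (1 – 1/n)^n, which approaches 1/e as n grows. A small sketch follows. The function name is an assumption.

```python
import math


def prob_never_picked(n):
    """Probability that one given instance is missed in n draws with replacement."""
    return (1 - 1 / n) ** n


for n in (10, 100, 1000):
    print(n, round(prob_never_picked(n), 4))
print("limit 1/e:", round(math.exp(-1), 4))
```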

Estimating Error with the Bootstrap Method The error estimate on the test data will be very pessimistic because the classifier is trained on just ~63% of the instances. Therefore, combine it with the training error: err = 0.632 * e(test instances) + 0.368 * e(training instances). The training error gets less weight than the error on the test data. Repeat the process several times with different replacement samples; average the results.
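A sketch of one bootstrap round, assuming the usual 0.632/0.368 weighting of test and training error. The two error values passed in at the end are hypothetical numbers, not results from the lecture.

```python
import random


def bootstrap_split(n, rng):
    """Draw n indices with replacement for training; unpicked indices form the test set."""
    train_idx = [rng.randrange(n) for _ in range(n)]
    picked = set(train_idx)
    test_idx = [i for i in range(n) if i not in picked]
    return train_idx, test_idx


def bootstrap_632(test_error, training_error):
    """Weighted combination of test error and training error."""
    return 0.632 * test_error + 0.368 * training_error


rng = random.Random(0)
train_idx, test_idx = bootstrap_split(1000, rng)
print("fraction of instances left for testing:", len(test_idx) / 1000)

# Hypothetical error rates, for illustration only.
estimate = bootstrap_632(test_error=0.30, training_error=0.10)
print("combined estimate:", round(estimate, 4))
```

In practice the whole round is repeated with different samples and the combined estimates are averaged.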

Confidence Intervals for Estimates of the Metrics for the Classification Performance Assume that the error error_S(h) of the classifier h estimated by 10-fold cross-validation is 25%. How close is the estimated error error_S(h) to the true error error_D(h)?

Confidence intervals … If –the test data contain n examples, drawn independently of each other, –n ≥ 30, then with approximately N% probability, error_D(h) lies in the interval error_S(h) ± z_N * sqrt(error_S(h) * (1 – error_S(h)) / n), where

N%:    50%    68%    80%    90%    95%    98%    99%
z_N:   0.67   1.00   1.28   1.64   1.96   2.33   2.58
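The interval can be computed as below, assuming the normal approximation to the binomial (hence n ≥ 30). The z value is taken from the standard normal distribution via the standard library. The sample size n = 100 is a hypothetical choice, since the slide does not state one.

```python
import math
from statistics import NormalDist


def error_confidence_interval(sample_error, n, confidence=0.95):
    """Two-sided interval for the true error under the normal approximation."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # e.g. about 1.96 for 95%
    half_width = z * math.sqrt(sample_error * (1 - sample_error) / n)
    return sample_error - half_width, sample_error + half_width


# Estimated error of 25% on a hypothetical test set of 100 instances.
low, high = error_confidence_interval(0.25, n=100)
print(f"95% interval: [{low:.3f}, {high:.3f}]")
```

The interval narrows in proportion to 1/sqrt(n), which is why larger test sets give more accurate error estimates.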

Metric Evaluation Summary: –Use test sets and the hold-out method for “large” data; –Use the cross-validation method for “middle-sized” data; –Use the leave-one-out and bootstrap methods for small data; –Don’t use test data for parameter tuning: use separate validation data.

Comparing Data-Mining Classifiers Intuitive approach: train and test your classifiers using cross validation or bootstrap, and then rank the classifiers according to their performance. A more complicated approach is employed in Weka based on the t-test.

Counting the Costs In practice, different types of classification errors often incur different costs. Examples: –Terrorist profiling (“Not a terrorist” is correct 99.99% of the time); –Loan decisions; –Fault diagnosis; –Promotional mailing.

Cost Matrices

                    True Pos   True Neg
Hypothesized Pos    TP Cost    FP Cost
Hypothesized Neg    FN Cost    TN Cost

Usually, TP Cost and TN Cost are set equal to 0!

Cost-Sensitive Classification If the classifier outputs a probability for each class, its predictions can be adjusted to minimize the expected cost. The expected cost of predicting a class is computed as the dot product of the vector of class probabilities and the cost-matrix entries for that hypothesized class: the TP and FP costs for a positive prediction, the FN and TN costs for a negative prediction.

Cost Sensitive Classification Assume that the classifier returns for an instance the probabilities p_pos = 0.6 and p_neg = 0.4, and that the costs are as follows (each column lists the costs of one hypothesized class):

             Hypothesized Pos   Hypothesized Neg
True Pos     0                  5
True Neg     10                 0

Then the expected cost if the instance is classified as positive is 0.6 * 0 + 0.4 * 10 = 4. The expected cost if the instance is classified as negative is 0.6 * 5 + 0.4 * 0 = 3. To minimize the cost, the instance is classified as negative.
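The calculation on this slide can be written as a small function. The sketch assumes the costs implied by the slide's arithmetic: a false positive costs 10, a false negative costs 5, and correct predictions cost 0. The dictionary layout is an illustrative choice.

```python
def expected_cost(probs, cost, hypothesized):
    """Sum over true classes of P(true class) * cost of predicting `hypothesized`."""
    return sum(p * cost[(true, hypothesized)] for true, p in probs.items())


# Keys are (true class, hypothesized class); values assumed from the slide's arithmetic.
cost = {
    ("pos", "pos"): 0, ("pos", "neg"): 5,
    ("neg", "pos"): 10, ("neg", "neg"): 0,
}
probs = {"pos": 0.6, "neg": 0.4}

costs = {h: expected_cost(probs, cost, h) for h in ("pos", "neg")}
decision = min(costs, key=costs.get)  # class with the lowest expected cost
print(costs, "->", decision)
```

Although the positive class is the more probable one, the higher cost of a false positive makes the negative prediction cheaper in expectation.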

Cost Sensitive Learning Simple methods for cost-sensitive learning: –Resampling of instances according to costs; –Weighting of instances according to costs. In Weka, cost-sensitive classification and learning can be applied to any classifier using the meta scheme CostSensitiveClassifier.

Lift Charts In practice, decisions are usually made by comparing possible scenarios, taking into account different costs. Example: promotional mailout to 1,000,000 households. If we mail to all households, 0.1% respond (1,000). A data mining tool identifies (a) a subset of 100,000 households of which 0.4% respond (400); or (b) a subset of 400,000 households of which 0.2% respond (800). Depending on the costs, we can make the final decision using lift charts! A lift chart allows a visual comparison.

Generating a Lift Chart Instances are sorted according to their predicted probability of being positive:

Rank   Predicted probability   Actual class
1      0.95                    Yes
2      0.93                    Yes
3      0.93                    No
4      0.88                    Yes
…      …                       …

In a lift chart, the x axis is the sample size and the y axis is the number of true positives.
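The points of a lift chart can be generated by sorting and counting, as in this sketch. It uses the four ranked instances shown on the slide. The function name and the input format are assumptions.

```python
def lift_points(scored):
    """scored: list of (predicted_probability, is_positive) pairs.

    Returns (sample_size, number_of_true_positives) points for the lift chart.
    """
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    points, positives = [], 0
    for size, (_, is_positive) in enumerate(ranked, start=1):
        positives += int(is_positive)
        points.append((size, positives))
    return points


# The four instances listed on the slide, given in arbitrary order.
scored = [(0.88, True), (0.93, True), (0.95, True), (0.93, False)]
points = lift_points(scored)
print(points)
```

Plotting these points against the diagonal obtained from random selection shows how much the ranking helps.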

Hypothetical Lift Chart

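ROC Curves and Analysis Confusion matrices of three classifiers (rows: true class; columns: predicted class):

Classifier 1: TPr = 0.4, FPr = 0.3
            Predicted pos   Predicted neg
True pos    40              60
True neg    30              70

Classifier 2: TPr = 0.7, FPr = 0.5
            Predicted pos   Predicted neg
True pos    70              30
True neg    50              50

Classifier 3: TPr = 0.6, FPr = 0.2
            Predicted pos   Predicted neg
True pos    60              40
True neg    20              80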

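ROC Space [Figure: ROC space, with the false positive rate on the x axis and the true positive rate on the y axis. Marked: the ideal classifier, the chance diagonal, the always-negative classifier and the always-positive classifier.]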

Dominance in the ROC Space Classifier A dominates classifier B if and only if TPr_A > TPr_B and FPr_A < FPr_B.
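The dominance test is a two-line comparison of ROC points. The sketch below applies it to the three classifiers from the ROC Curves and Analysis slide. The counts are read from that slide's confusion matrices, with the missing cell of the second matrix assumed to be 50 so that FPr = 0.5.

```python
def rates(tp, fn, fp, tn):
    """Return the ROC point (TPr, FPr) of a confusion matrix."""
    return tp / (tp + fn), fp / (fp + tn)


def dominates(a, b):
    """True if ROC point a = (TPr, FPr) dominates ROC point b."""
    return a[0] > b[0] and a[1] < b[1]


c1 = rates(tp=40, fn=60, fp=30, tn=70)
c2 = rates(tp=70, fn=30, fp=50, tn=50)  # tn=50 is an assumption (see above)
c3 = rates(tp=60, fn=40, fp=20, tn=80)

print("c3 dominates c1:", dominates(c3, c1))
print("c3 dominates c2:", dominates(c3, c2))
print("c2 dominates c3:", dominates(c2, c3))
```

Classifiers 2 and 3 do not dominate each other: classifier 2 finds more positives, but at the price of more false alarms.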

ROC Convex Hull (ROCCH) The ROCCH is determined by the dominant classifiers; classifiers on the ROCCH achieve the best accuracy for some class distribution; classifiers below the ROCCH are always sub-optimal.

Convex Hull Any performance on a line segment connecting two ROC points can be achieved by randomly choosing between them; The classifiers on ROCCH can be combined to form a hybrid.

Iso-Accuracy Lines An iso-accuracy line connects ROC points with the same accuracy A. With P and N denoting the proportions of positive and negative instances (P + N = 1): P*TPr + N*(1–FPr) = A, so TPr = (A – N)/P + (N/P)*FPr. Iso-accuracy lines have slope N/P. Higher iso-accuracy lines are better.
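The effect of the class distribution on which ROC point is best can be sketched as follows. The three ROC points and the positive fractions are hypothetical. The accuracy formula is the one from this slide, with P and N read as class proportions.

```python
def accuracy_at(tpr, fpr, pos_fraction):
    """Accuracy of an ROC point: P*TPr + N*(1 - FPr), with P + N = 1."""
    neg_fraction = 1 - pos_fraction
    return pos_fraction * tpr + neg_fraction * (1 - fpr)


def iso_accuracy_slope(pos_fraction):
    """Slope N/P of the iso-accuracy lines in ROC space."""
    return (1 - pos_fraction) / pos_fraction


# Hypothetical ROC points given as (TPr, FPr).
roc_points = {"always_neg": (0.0, 0.0), "c": (0.6, 0.2), "always_pos": (1.0, 1.0)}


def best_classifier(pos_fraction):
    """Name of the ROC point with the highest accuracy for this class distribution."""
    return max(roc_points, key=lambda name: accuracy_at(*roc_points[name], pos_fraction))


for fraction in (0.05, 0.5, 0.95):
    print(fraction, iso_accuracy_slope(fraction), best_classifier(fraction))
```

With very few positives the lines are steep and the always-negative point wins; with very few negatives the lines are flat and the always-positive point wins.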

Iso-Accuracy Lines For uniform class distribution, C4.5 is optimal and achieves about 82% accuracy.

Iso-Accuracy Lines With four times as many positives as negatives, SVM is optimal and achieves about 84% accuracy.

Iso-Accuracy Lines With four times as many negatives as positives, CN2 is optimal and achieves about 86% accuracy.

Iso-Accuracy Lines With less than 9% positives, AlwaysNeg is optimal. With less than 11% negatives, AlwaysPos is optimal.

How to Construct ROC Curve for one Classifier –Sort the instances according to their P_pos (the predicted probability of the positive class). –Move a threshold over the sorted instances. –For each threshold define a classifier with its confusion matrix. –Plot the TPr and FPr rates of the classifiers.

P_pos   True class
0.99    pos
0.98    pos
0.7     neg
0.6     pos
0.43    neg

For example, a threshold between 0.7 and 0.6 gives the confusion matrix:

            Predicted pos   Predicted neg
True pos    2               1
True neg    1               1
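The threshold sweep can be sketched as a single pass over the sorted instances. The example uses the five instances from this slide and assumes all scores are distinct (ties would need to be grouped). The function name is an assumption.

```python
def roc_points(scored):
    """scored: list of (p_pos, is_positive). Returns one (FPr, TPr) point per threshold."""
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    n_pos = sum(1 for _, is_pos in ranked if is_pos)
    n_neg = len(ranked) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]  # threshold above every score: nothing is predicted positive
    for _, is_pos in ranked:
        # Lowering the threshold past this instance predicts it as positive.
        if is_pos:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


scored = [(0.99, True), (0.98, True), (0.7, False), (0.6, True), (0.43, False)]
points = roc_points(scored)
for fpr, tpr in points:
    print(f"FPr = {fpr:.2f}, TPr = {tpr:.2f}")
```

The fourth point corresponds to a threshold between 0.7 and 0.6, where two of the three positives and one of the two negatives are predicted positive.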

ROC for one Classifier Good separation between the classes, convex curve.

ROC for one Classifier Reasonable separation between the classes, mostly convex.

ROC for one Classifier Fairly poor separation between the classes, mostly convex.

ROC for one Classifier Poor separation between the classes, large and small concavities.

ROC for one Classifier Random performance.

The AUC Metric The area under the ROC curve (AUC) assesses the ranking in terms of separation of the classes. AUC estimates the probability that a randomly chosen positive instance will be ranked before a randomly chosen negative instance.
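The ranking interpretation of AUC can be computed directly by comparing every positive with every negative. This is a quadratic-time sketch for small examples, with ties counted as one half. The data are five hypothetical scored instances.

```python
def auc_by_ranking(scored):
    """Fraction of (positive, negative) pairs in which the positive is scored higher."""
    pos = [score for score, is_pos in scored if is_pos]
    neg = [score for score, is_pos in scored if not is_pos]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical (score, is_positive) pairs.
scored = [(0.99, True), (0.98, True), (0.7, False), (0.6, True), (0.43, False)]
auc = auc_by_ranking(scored)
print("AUC:", round(auc, 4))
```

Here five of the six positive-negative pairs are ordered correctly; the only misordered pair is the positive scored 0.6 against the negative scored 0.7.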

Note To generate ROC curves or Lift charts we need to use some evaluation methods considered in this lecture. ROC curves and Lift charts can be used for internal optimization of classifiers.

Summary In this lecture we have considered: –Metrics for Classifier’s Evaluation –Methods for Classifier’s Evaluation –Comparing Data Mining Schemes –Costs in Data Mining Cost-Sensitive Classification and Learning Lift Charts ROC Curves