Model Evaluation Instructor: Qiang Yang


Model Evaluation
Instructor: Qiang Yang, Hong Kong University of Science and Technology (Qyang@cs.ust.hk)
Thanks: Eibe Frank and Jiawei Han
Lecture 1

Constructing a Classifier Model
The goal is to maximize accuracy on new cases that follow a similar class distribution. Since new cases are not available at construction time, the given examples are divided into a training set and a testing set: the classifier is built using the training set and evaluated using the testing set, and the goal is to be accurate on the testing set.
It is essential to capture the “structure” shared by both sets, and to prune rules that overfit, i.e., rules that work well on the training set but poorly on the testing set.

Example: Classification Algorithms
Training data is fed to a classification algorithm, which produces a classifier (model), e.g., the rule: IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’.

Example (cont'd)
The classifier is then applied to the testing data and to unseen data. E.g., for the unseen case (Jeff, Professor, 4), the model predicts whether Tenured = ‘yes’.

Evaluation Criteria
Accuracy on test set: the rate of correct classification on the testing set. E.g., if 90 out of 100 testing cases are classified correctly, accuracy is 90%.
Error rate on test set: the percentage of wrong predictions on the test set.
Confusion matrix: for binary class values, “yes” and “no”, a matrix showing the true positives, true negatives, false positives and false negatives:

                  Predicted yes     Predicted no
Actual yes        true positive     false negative
Actual no         false positive    true negative

Speed and scalability: the time to build the classifier and to classify new cases, and the scalability with respect to the data size.
Robustness: handling noise and missing values.
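
A minimal sketch (not from the slides) showing how these quantities can be computed for a binary problem; the example label lists are made up for illustration:

```python
# Minimal sketch: accuracy, error rate and confusion counts for a binary problem.
# The example labels below are made up for illustration.
actual    = ["yes", "yes", "no", "no", "yes", "no", "yes", "no", "no", "yes"]
predicted = ["yes", "no",  "no", "no", "yes", "yes", "yes", "no", "no", "no"]

tp = sum(1 for a, p in zip(actual, predicted) if a == "yes" and p == "yes")
fn = sum(1 for a, p in zip(actual, predicted) if a == "yes" and p == "no")
fp = sum(1 for a, p in zip(actual, predicted) if a == "no" and p == "yes")
tn = sum(1 for a, p in zip(actual, predicted) if a == "no" and p == "no")

accuracy = (tp + tn) / len(actual)
error_rate = 1 - accuracy
print(f"TP={tp} FN={fn} FP={fp} TN={tn}  accuracy={accuracy:.0%}  error rate={error_rate:.0%}")
```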

Evaluation Techniques
Holdout: split the data once into a training set and a testing set. Good for a large amount of data.
k-fold cross-validation: divide the data set into k sub-samples. In each run, use one distinct sub-sample as the testing set and the remaining k−1 sub-samples as the training set, then evaluate the method using the average over the k runs. This reduces the randomness introduced by a single training/testing split.
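
A minimal sketch of holdout and k-fold cross-validation under simplifying assumptions: the build_and_evaluate helper and the toy dataset are hypothetical stand-ins for a real learner and real data.

```python
import random

def build_and_evaluate(train, test):
    """Train a trivial majority-class model on `train` and return its accuracy on `test`.
    Stands in for any real learner; used only to make the sketch runnable."""
    labels = [y for _, y in train]
    majority = max(set(labels), key=labels.count)
    return sum(1 for _, y in test if y == majority) / len(test)

def k_fold_cv(data, k=10, seed=0):
    """Split `data` into k folds; each fold serves as the test set once."""
    data = data[:]
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        scores.append(build_and_evaluate(train, test))
    return sum(scores) / k   # average accuracy over the k runs

# Toy dataset of (feature, label) pairs, made up for illustration.
data = [((x,), "yes" if x % 3 else "no") for x in range(60)]
print("holdout (first 2/3 train, last 1/3 test):", build_and_evaluate(data[:40], data[40:]))
print("10-fold CV estimate:", k_fold_cv(data, k=10))
```

Setting k equal to the number of instances turns this procedure into leave-one-out cross-validation, discussed further below.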

Cross Validation: Holdout Method
Break up the data into groups (folds) of the same size.
Hold aside one group for testing and use the rest to build the model.
Repeat, holding out each group in turn.

Cross validation
Natural performance measure for classification problems: the error rate.
Success: the instance’s class is predicted correctly.
Error: the instance’s class is predicted incorrectly.
Error rate: the proportion of errors made over the whole set of instances.
Confidence: which do you trust more, a 2% error rate measured on 100 tests or a 2% error rate measured on 10,000 tests?

Confidence intervals
Assume the estimated error rate (f) is 25%. How close is this to the true error rate p? It depends on the amount of test data.
We can say: p lies within a certain specified interval with a certain specified confidence.
Example: S = 750 successes in N = 1000 trials. Estimated success rate: f = 75%. How close is this to the true success rate p? Answer: with 80% confidence, p ∈ [73.2%, 76.7%].
Another example: S = 75 and N = 100. Estimated success rate: 75%. With 80% confidence, p ∈ [69.1%, 80.1%].

Mean and variance
For large enough N, the observed success rate f follows an approximately normal distribution with mean p and variance p(1 − p)/N.
The c% confidence interval [−z ≤ X ≤ z] for a random variable X with mean 0 is determined by: Pr[−z ≤ X ≤ z] = c.
For a symmetric distribution, Pr[−z ≤ X ≤ z] = 1 − 2·Pr[X ≥ z].

Confidence limits
Confidence limits z for the standard normal distribution (mean 0, variance 1), where Pr[X ≥ z] is the tail probability:

Pr[X ≥ z]    z
0.1%         3.09
0.5%         2.58
1%           2.33
5%           1.65
10%          1.28
20%          0.84
40%          0.25

Thus, for example, Pr[−1.65 ≤ X ≤ 1.65] = 90%.
To use this we have to transform our random variable f so that it has mean 0 and unit variance.

Transforming f
Transformed value for f: (f − p) / √(p(1 − p)/N) (i.e., subtract the mean and divide by the standard deviation).
Resulting equation: Pr[−z ≤ (f − p)/√(p(1 − p)/N) ≤ z] = c.
Solving for p gives the confidence interval:
p = ( f + z²/(2N) ± z·√( f/N − f²/N + z²/(4N²) ) ) / ( 1 + z²/N ).

Examples
f = 75%, N = 1000, c = 80% (so that z = 1.28): p ∈ [0.732, 0.767].
f = 75%, N = 10, c = 80% (so that z = 1.28): p ∈ [0.549, 0.881].
Note that the normal-distribution assumption is only valid for large N (i.e., N > 100), so the second interval should be treated with caution.
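
A small sketch that evaluates the interval formula from the previous slide; the function name wilson_interval is just an illustrative label, and z = 1.28 comes from the 10% row of the confidence-limits table (10% in each tail gives 80% two-sided confidence).

```python
from math import sqrt

def wilson_interval(f, n, z):
    """Confidence interval for the true success rate p, given observed rate f on n tests."""
    centre = f + z * z / (2 * n)
    spread = z * sqrt(f / n - f * f / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (centre - spread) / denom, (centre + spread) / denom

print(wilson_interval(0.75, 1000, 1.28))  # roughly (0.732, 0.767)
print(wilson_interval(0.75, 10, 1.28))    # roughly (0.549, 0.881)
```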

More on cross-validation
Standard method for evaluation: stratified ten-fold cross-validation.
If M is the learning method and D is the data, each of the N runs builds a model from only (N−1)/N of D; the question is whether the resulting error estimate approximates the error of the model built from all of D.
Why set N = 10? Extensive experiments have shown that this is the best choice to get an accurate estimate, and there is also some theoretical evidence for this.
Stratification (making each fold close to the real class distribution) reduces the estimate’s variance.

Leave-one-out cross-validation
Leave-one-out cross-validation is a particular form of cross-validation: the number of folds is set to the number of training instances, i.e., a classifier has to be built n times, where n is the number of training instances.
It makes maximum use of the data and involves no random sampling, but it is very computationally expensive.

LOO-CV and stratification
Another disadvantage of LOO-CV: stratification is not possible. It guarantees a non-stratified sample because there is only one instance in the test set!
Extreme example: a completely random dataset with two classes in equal proportions. The best a classifier can do is predict the majority class, which yields 50% accuracy on fresh data from this domain, yet the LOO-CV estimate of its error rate will be 100%. A small demonstration follows below.
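
A small demonstration sketch of this extreme case (the dataset and seed are made up): a majority-class predictor on a 50/50 random dataset would score about 50% on fresh data, yet its LOO-CV error estimate comes out as 100%, because removing the held-out instance always leaves its class in the minority of the training set.

```python
import random

random.seed(0)
# Completely random dataset: 50 instances of each class.
labels = ["yes"] * 50 + ["no"] * 50
random.shuffle(labels)

errors = 0
for i, true_label in enumerate(labels):
    train = labels[:i] + labels[i + 1:]          # leave one instance out
    majority = max(set(train), key=train.count)  # classifier: predict the training majority
    if majority != true_label:
        errors += 1

print(f"LOO-CV error rate: {errors / len(labels):.0%}")  # prints 100%
```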

The bootstrap
CV uses sampling without replacement: the same instance, once selected, cannot be selected again for a particular training/test set.
The bootstrap is an estimation method that uses sampling with replacement to form the training set: a dataset of n instances is sampled n times with replacement to form a new dataset of n instances, which is used as the training set.
The instances from the original dataset that do not occur in the new training set are used for testing.

The 0.632 bootstrap
This method is also called the 0.632 bootstrap.
In any single draw, a particular instance has a probability of 1 − 1/n of not being picked.
Thus its probability of ending up in the test data (never selected in n draws) is (1 − 1/n)^n ≈ e^−1 ≈ 0.368.
This means the training data will contain approximately 63.2% of the distinct instances.
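
A small simulation sketch (the dataset size and random seed are arbitrary choices) confirming the 63.2% figure:

```python
import random

random.seed(1)
n = 1000
data_indices = list(range(n))

# Sample n instances with replacement to form the bootstrap training set.
train = [random.choice(data_indices) for _ in range(n)]
train_set = set(train)
test = [i for i in data_indices if i not in train_set]  # instances never picked

print(f"distinct instances in training set: {len(train_set) / n:.1%}")  # roughly 63.2%
print(f"instances left for testing:         {len(test) / n:.1%}")       # roughly 36.8%
```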

Comparing data mining methods
Frequent situation: we want to know which of two learning schemes performs better.
Obvious way: compare their 10-fold CV estimates.
Problem: variance in the estimate. The variance can be reduced using repeated CV, but we still don’t know whether the results are reliable.
Solution: include a confidence interval.

Taking costs into account
The confusion matrix:

                  Predicted yes     Predicted no
Actual yes        true positive     false negative
Actual no         false positive    true negative

There are many other types of costs as well, e.g., the cost of collecting training data and test data.

Lift charts
In practice, ranking may be important: decisions are usually made by comparing possible scenarios.
Sort the instances by the likelihood of belonging to the positive class, from high to low.
Question: how do we know if one ranking is better than another?

Example: promotional mailout
Situation 1: classifier A predicts that 0.1% of all one million households will respond.
Situation 2: classifier B predicts that 0.4% of the 10,000 most promising households will respond.
Which one is better? Suppose mailing out a package costs 1 dollar, but each response brings in 1000 dollars (see the worked calculation below).
A lift chart allows for a visual comparison.
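
A worked calculation under the slide's assumptions (1 dollar per mailout, 1000 dollars per response); the response counts follow directly from the stated percentages:

```python
# Scenario 1: mail all 1,000,000 households; 0.1% respond.
cost_1 = 1_000_000 * 1
revenue_1 = 0.001 * 1_000_000 * 1000   # 1,000 responses x $1000

# Scenario 2: mail only the 10,000 most promising households; 0.4% respond.
cost_2 = 10_000 * 1
revenue_2 = 0.004 * 10_000 * 1000      # 40 responses x $1000

print("Scenario 1 net:", revenue_1 - cost_1)  # $0
print("Scenario 2 net:", revenue_2 - cost_2)  # $30,000
```

Under these particular numbers classifier B is the more profitable choice; a lift chart makes this kind of comparison visually, over all possible subset sizes.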

Generating a lift chart
Instances are sorted according to their predicted probability of being a true positive. In a lift chart, the x axis is the sample size and the y axis is the number of true positives.

Rank    Predicted probability    Actual class
1       0.95                     Yes
2       0.93
3                                No
4       0.88
…

Steps in Building a Lift Chart
1. First, produce a ranking of the data using your learned model (classifier, etc.): rank 1 means most likely in the + class, rank n means least likely in the + class.
2. Label each ranked data instance with its ground-truth label. This gives a list like: rank 1, +; rank 2, -; rank 3, +; etc.
3. Count the number of true positives (TP) from rank 1 onwards: rank 1, +, TP = 1; rank 2, -, TP = 1; rank 3, +, TP = 2; …
4. Plot the number of TP against the percentage of data in ranked order (if you have 10 data instances, then each instance is 10% of the data): 10%, TP = 1; 20%, TP = 1; 30%, TP = 2; …
This gives a lift chart (a code sketch follows below).
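
A minimal sketch of steps 1–4; the predicted probabilities and class labels are made-up illustration data rather than output from any particular classifier:

```python
# Made-up (probability of + class, actual label) pairs standing in for a real model's output.
scored = [(0.95, "+"), (0.93, "-"), (0.88, "+"), (0.80, "+"),
          (0.70, "-"), (0.65, "+"), (0.40, "-"), (0.30, "-"),
          (0.20, "+"), (0.10, "-")]

# Steps 1-2: rank the data by predicted probability, highest first.
ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)

# Steps 3-4: cumulative count of true positives vs. percentage of data used.
tp = 0
points = []
for i, (prob, label) in enumerate(ranked, start=1):
    if label == "+":
        tp += 1
    points.append((100 * i // len(ranked), tp))   # (% of data, # true positives)

for pct, count in points:
    print(f"{pct:3d}% of data -> TP = {count}")
```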

A hypothetical lift chart (y axis: number of true positives; x axis: sample size).

ROC curves
ROC curves are similar to lift charts. “ROC” stands for “receiver operating characteristic”, a term used in signal detection to describe the tradeoff between hit rate and false alarm rate over a noisy channel.
Differences from a lift chart: the y axis shows the percentage of true positives in the sample (rather than the absolute number), and the x axis shows the percentage of false positives in the sample (rather than the sample size).
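
A small sketch that turns the same kind of made-up scored data into ROC points by sweeping a decision threshold from the highest score downwards; pos and neg are the total numbers of positive and negative instances:

```python
# Made-up (score, actual label) pairs; a higher score means "more likely positive".
scored = [(0.95, "+"), (0.93, "-"), (0.88, "+"), (0.80, "+"),
          (0.70, "-"), (0.65, "+"), (0.40, "-"), (0.30, "-"),
          (0.20, "+"), (0.10, "-")]

pos = sum(1 for _, y in scored if y == "+")
neg = len(scored) - pos

tp = fp = 0
roc_points = [(0.0, 0.0)]
for score, label in sorted(scored, reverse=True):   # lower the threshold step by step
    if label == "+":
        tp += 1
    else:
        fp += 1
    roc_points.append((fp / neg, tp / pos))          # (FP rate, TP rate)

for fpr, tpr in roc_points:
    print(f"FP rate = {fpr:.2f}, TP rate = {tpr:.2f}")
```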

A sample ROC curve

Cost-sensitive learning
Most learning schemes do not perform cost-sensitive learning: they generate the same classifier no matter what costs are assigned to the different classes. Example: the standard decision tree learner.
Simple methods for cost-sensitive learning: resampling of instances according to costs, or weighting of instances according to costs (see the sketch below).
Some schemes are inherently cost-sensitive, e.g., naïve Bayes.
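
A minimal sketch of the resampling idea (the cost values and the toy data are assumptions for illustration): instances are resampled with probability proportional to the cost of misclassifying their class, so that an ordinary cost-insensitive learner trained on the resampled data implicitly takes the costs into account.

```python
import random

random.seed(0)
# Made-up training data: (features, label); misclassifying "yes" is 5x costlier than "no".
data = [((i,), "yes" if i % 4 == 0 else "no") for i in range(40)]
cost = {"yes": 5.0, "no": 1.0}   # cost of misclassifying an instance of this class

# Resample with replacement, with probability proportional to each instance's cost.
weights = [cost[label] for _, label in data]
resampled = random.choices(data, weights=weights, k=len(data))

print("original class counts: ",
      {c: sum(1 for _, y in data if y == c) for c in cost})
print("resampled class counts:",
      {c: sum(1 for _, y in resampled if y == c) for c in cost})
# An ordinary learner trained on `resampled` now leans towards the costlier class.
```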

Measures in information retrieval
Percentage of retrieved documents that are relevant: precision = TP/(TP + FP).
Percentage of relevant documents that are returned: recall = TP/(TP + FN).
Precision/recall curves have a hyperbolic shape.
Summary measures: average precision at 20%, 50% and 80% recall (three-point average recall), and the F-measure = (2 × recall × precision)/(recall + precision).
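
A small sketch computing these measures from the confusion-matrix counts; the counts themselves are made up:

```python
# Made-up confusion-matrix counts for a retrieval task.
tp, fp, fn = 40, 10, 20

precision = tp / (tp + fp)                                   # relevant among retrieved
recall = tp / (tp + fn)                                      # retrieved among relevant
f_measure = 2 * recall * precision / (recall + precision)    # harmonic mean of the two

print(f"precision={precision:.2f} recall={recall:.2f} F-measure={f_measure:.2f}")
```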

Summary of measures

Plot                      Domain                   Axes                     Explanation
Lift chart                Marketing                TP vs. subset size       subset size = (TP+FP)/(TP+FP+TN+FN)
ROC curve                 Communications           TP rate vs. FP rate      TP rate = TP/(TP+FN); FP rate = FP/(FP+TN)
Recall-precision curve    Information retrieval    Recall vs. precision     recall = TP/(TP+FN); precision = TP/(TP+FP)

Evaluating numeric prediction
Same strategies apply: independent test set, cross-validation, significance tests, etc. The difference lies in the error measures.
Actual target values: a1, a2, …, an. Predicted target values: p1, p2, …, pn.
Most popular measure: the mean-squared error, ((p1 − a1)² + … + (pn − an)²) / n, which is easy to manipulate mathematically.

Other measures
The root mean-squared error: √( ((p1 − a1)² + … + (pn − an)²) / n ).
The mean absolute error, (|p1 − a1| + … + |pn − an|) / n, is less sensitive to outliers than the mean-squared error.
Sometimes relative error values are more appropriate (e.g., 10% for an error of 50 when predicting 500).
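
A small sketch of these error measures; the actual and predicted values are made up (the first relative error reproduces the "error of 50 on a value of 500" example as 10%):

```python
from math import sqrt

actual    = [500.0, 200.0, 100.0, 400.0]
predicted = [550.0, 190.0,  80.0, 420.0]
n = len(actual)

mse  = sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n
rmse = sqrt(mse)
mae  = sum(abs(p - a) for p, a in zip(predicted, actual)) / n
# Relative error: the error as a fraction of the actual value (e.g. 50/500 = 10%).
rel  = [abs(p - a) / a for p, a in zip(predicted, actual)]

print(f"MSE={mse:.1f} RMSE={rmse:.1f} MAE={mae:.1f}")
print("relative errors:", [f"{r:.0%}" for r in rel])
```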

The MDL principle
MDL stands for minimum description length.
The description length is defined as: the space required to describe a theory, plus the space required to describe the theory’s mistakes.
In our case the theory is the classifier and the mistakes are the errors on the training data.
Aim: we want a classifier with minimal description length. The MDL principle is a model selection criterion.

Model selection criteria Model selection criteria attempt to find a good compromise between: The complexity of a model Its prediction accuracy on the training data Reasoning: a good model is a simple model that achieves high accuracy on the given data Also known as Occam’s Razor: the best theory is the smallest one that describes all the facts

Elegance vs. errors
Theory 1: a very simple, elegant theory that explains the data almost perfectly.
Theory 2: a significantly more complex theory that reproduces the data without mistakes.
Theory 1 is probably preferable.
Classical example: Kepler’s three laws of planetary motion were less accurate than Copernicus’s latest refinement of the Ptolemaic theory of epicycles, yet far simpler.

MDL and compression
The MDL principle is closely related to data compression: it postulates that the best theory is the one that compresses the data the most. I.e., to compress a dataset we generate a model and then store the model and its mistakes.
We need to compute (a) the size of the model and (b) the space needed for encoding the errors.
(b) is easy: we can use the informational loss function. For (a) we need a method to encode the model.
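
A small sketch of the idea under simplifying assumptions: the model sizes in bits (part (a)) are hypothetical numbers, and the cost of encoding the errors (part (b)) uses the informational loss function, i.e. −log2 of the probability the model assigns to each training instance's true class.

```python
from math import log2

# Hypothetical: bits needed to encode each candidate model (assumption for illustration).
model_bits = {"simple model": 50.0, "complex model": 400.0}

# Probability each model assigns to the TRUE class of each training instance (made up).
true_class_probs = {
    "simple model":  [0.9, 0.8, 0.95, 0.7, 0.85],
    "complex model": [0.99, 0.99, 0.99, 0.98, 0.99],
}

for name, probs in true_class_probs.items():
    error_bits = sum(-log2(p) for p in probs)      # informational loss for the mistakes
    dl = model_bits[name] + error_bits             # total description length
    print(f"{name}: model={model_bits[name]:.0f} bits, errors={error_bits:.1f} bits, DL={dl:.1f}")
# MDL prefers whichever candidate has the smallest total description length.
```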

Discussion of the MDL principle
Advantage: it makes full use of the training data when selecting a model.
Disadvantage 1: an appropriate coding scheme / prior probabilities for theories are crucial.
Disadvantage 2: there is no guarantee that the MDL theory is the one that minimizes the expected error.
Note: Occam’s razor is an axiom! Occam’s razor: the principle that you should not make more assumptions than are necessary when you are explaining something.