
Model Evaluation

CRISP-DM

CRISP-DM Phases
Business Understanding – Initial phase. Focuses on:
– Understanding the project objectives and requirements from a business perspective
– Converting this knowledge into a data mining problem definition and a preliminary plan designed to achieve the objectives
Data Understanding – Starts with an initial data collection, then proceeds with activities aimed at:
– Getting familiar with the data
– Identifying data quality problems
– Discovering first insights into the data
– Detecting interesting subsets to form hypotheses about hidden information

CRISP-DM Phases
Data Preparation – Covers all activities needed to construct the final dataset (the data that will be fed into the modeling tool(s)) from the initial raw data:
– Data preparation tasks are likely to be performed multiple times, and not in any prescribed order
– Tasks include table, record, and attribute selection, as well as transformation and cleaning of data for the modeling tools
Modeling – Various modeling techniques are selected and applied, and their parameters are calibrated to optimal values:
– Typically, several techniques exist for the same data mining problem type
– Some techniques have specific requirements on the form of the data, so stepping back to the data preparation phase is often needed

CRISP-DM Phases
Evaluation
– At this stage, a model (or models) that appears to have high quality from a data analysis perspective has been built
– Before proceeding to final deployment, it is important to evaluate the model more thoroughly and review the steps executed to construct it, to be certain it properly achieves the business objectives
– A key objective is to determine whether some important business issue has not been sufficiently considered
– At the end of this phase, a decision on the use of the data mining results should be reached

CRISP-DM Phases
Deployment
– Creating the model is generally not the end of the project
– Even if the purpose of the model is only to increase knowledge of the data, the knowledge gained needs to be organized and presented in a way the customer can use
– Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data mining process
– In many cases it will be the customer, not the data analyst, who carries out the deployment steps
– Even so, it is important for the customer to understand up front what actions will be needed to actually make use of the created models

Evaluating Classification Systems
Two issues:
– What evaluation measure should we use?
– How do we ensure the reliability of our model?

EVALUATION How do we ensure reliability of our model?

How do we ensure reliability? Model performance depends heavily on the data used for training – error measured on the training data itself says little about performance on new data.

Data Partitioning
Randomly partition the data into a training set and a test set.
Training set – the data used to train/build the model: estimate parameters (e.g., for a linear regression), build a decision tree, build an artificial neural network, etc.
Test set – a set of examples not used for model induction; the model's performance is evaluated on this unseen data. Also known as out-of-sample data.
Generalization error: the model's error on the test data.
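The random partition above can be sketched in a few lines of plain Python (the function name `train_test_split` and the one-third default are our illustration choices, not from the slides):

```python
import random

def train_test_split(data, test_fraction=1/3, seed=0):
    """Randomly partition data into a training set and a test set."""
    rng = random.Random(seed)
    shuffled = list(data)          # copy, leave the original untouched
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]    # (train, test)

train, test = train_test_split(range(30))
print(len(train), len(test))            # 20 10
print(set(train).isdisjoint(test))      # True: no example is in both sets
```

Fixing the seed makes the split reproducible; any example ends up in exactly one of the two sets.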

Complexity and Generalization
[Plot: a score function (e.g., squared error) on the training set S_train(θ) and on the test set S_test(θ) versus model complexity; training error keeps decreasing with complexity, while test error is lowest at an intermediate, optimal model complexity.]
Complexity = degrees of freedom in the model (e.g., number of variables)

Holding out data
The holdout method reserves a certain amount of the data for testing and uses the remainder for training.
– Usually: one third for testing, the rest for training
For "unbalanced" datasets, random samples might not be representative:
– Few or no instances of some classes
Stratified sample:
– Make sure that each class is represented with approximately equal proportions in both subsets
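A stratified holdout can be implemented by splitting each class separately and then pooling the pieces. A minimal sketch (names and the 90/9 example data are ours):

```python
import random
from collections import defaultdict

def stratified_holdout(examples, labels, test_fraction=1/3, seed=0):
    """Hold out roughly test_fraction of *each class*, so class
    proportions are approximately preserved in both subsets."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(examples, labels):
        by_class[y].append((x, y))
    train, test = [], []
    for members in by_class.values():
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test

# 90 "ok" vs. 9 "fraud" examples: a plain random third could easily
# contain almost no fraud cases; the stratified split cannot.
labels = ["ok"] * 90 + ["fraud"] * 9
train, test = stratified_holdout(range(99), labels)
print(sum(1 for _, y in test if y == "fraud"))   # 3
print(len(test))                                  # 33
```

The test set keeps the same 10:1 class ratio as the full dataset.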

Repeated holdout method
The holdout estimate can be made more reliable by repeating the process with different subsamples:
– In each iteration, a certain proportion is randomly selected for training (possibly with stratification)
– The error rates from the different iterations are averaged to yield an overall error rate
This is called the repeated holdout method.

Cross-validation
The most popular and effective type of repeated holdout is cross-validation, which avoids overlapping test sets:
– First step: the data is split into k subsets of equal size
– Second step: each subset in turn is used for testing and the remainder for training
This is called k-fold cross-validation. Often the subsets are stratified before the cross-validation is performed.
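The two steps above can be sketched as a small generator of (train, test) index pairs (the function name is ours; in practice the indices would be shuffled and stratified first):

```python
def k_fold_splits(n, k):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation:
    each of k (near-)equal folds serves exactly once as the test set."""
    # distribute n indices over k folds, larger folds first
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    for i, test_idx in enumerate(folds):
        train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train_idx, test_idx

splits = list(k_fold_splits(10, 3))
print([len(te) for _, te in splits])   # [4, 3, 3]
```

Every instance appears in exactly one test fold, so the k error estimates cover the whole dataset without overlapping test sets.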

Cross-validation example:
[Diagram: the data divided into k folds; in each of the k iterations a different fold is held out for testing while the remaining folds are used for training.]

More on cross-validation
The standard data-mining method for evaluation is stratified ten-fold cross-validation.
Why ten? Extensive experiments have shown that this is the best choice for getting an accurate estimate.
Stratification reduces the estimate's variance.
Even better: repeated stratified cross-validation
– E.g., ten-fold cross-validation is repeated ten times and the results are averaged (this reduces the sampling variance)
The error estimate is the mean across all repetitions.

Leave-One-Out cross-validation
Leave-One-Out is a particular form of cross-validation:
– Set the number of folds to the number of training instances
– I.e., for n training instances, build the classifier n times
Makes the best use of the data and involves no random subsampling.
Computationally expensive, but gives a good estimate.

Leave-One-Out-CV and stratification
A disadvantage of Leave-One-Out-CV is that stratification is not possible:
– It guarantees a non-stratified sample, because there is only one instance in the test set!
Extreme example: a random dataset split equally into two classes
– The best a learner can do is predict the majority class
– True accuracy on fresh data: 50%
– Leave-One-Out-CV estimate: 100% error!
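The extreme example is easy to verify in code. The sketch below (function name ours) runs a majority-class predictor through Leave-One-Out CV on a perfectly balanced two-class dataset:

```python
def loocv_error_of_majority_predictor(labels):
    """Leave-One-Out CV error of a classifier that always predicts the
    majority class of the n-1 training instances."""
    n = len(labels)
    errors = 0
    for i in range(n):
        rest = labels[:i] + labels[i + 1:]
        majority = max(set(rest), key=rest.count)
        errors += (majority != labels[i])
    return errors / n

# Balanced dataset: removing the held-out instance always leaves its own
# class in the minority, so the majority predictor is wrong on every fold.
print(loocv_error_of_majority_predictor([0, 1] * 10))   # 1.0
```

The true error of this predictor on fresh data would be 50%, yet the LOO estimate is 100% – exactly the pathology described on the slide.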

Three-way data splits
One problem with CV is that the data are used jointly to fit the model and to estimate its error, so the error estimate can be biased downward.
If the goal is a realistic estimate of error (as opposed to deciding which model is best), you may want a three-way split:
– Training set: examples used for learning
– Validation set: used to tune parameters
– Test set: never used in the model-fitting process; used at the end for an unbiased estimate of the holdout error

The Bootstrap
The statistician Bradley Efron proposed a very simple and clever idea for mechanically estimating confidence intervals: the bootstrap.
The idea is to take multiple resamples of your original dataset.
By computing the statistic of interest on each resample, you estimate the distribution of that statistic.

Sampling with Replacement
Draw a data point at random from the data set, then throw it back in.
Draw a second data point, then throw it back in…
Keep going until we've got 1000 data points. You might call this a "pseudo" data set.
This is not merely re-sorting the data: some of the original data points will appear more than once; others won't appear at all.

Sampling with Replacement
In fact, there is a chance of (1 − 1/1000)^1000 ≈ 1/e ≈ 0.368 that any one of the original data points won't appear at all if we sample with replacement 1000 times.
⇒ Any given data point is included with probability ≈ 0.632.
Intuitively, we treat the original sample as the "true population in the sky". Each resample simulates the process of taking a sample from this "true" distribution.
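Both numbers are easy to check: compute (1 − 1/n)^n directly, then simulate one bootstrap resample and count how many distinct original points it contains (a sketch; the seed is an arbitrary choice of ours):

```python
import math
import random

n = 1000
p_left_out = (1 - 1 / n) ** n          # chance a given point is never drawn
print(round(p_left_out, 4))            # 0.3677
print(round(math.exp(-1), 4))          # 0.3679 -- essentially the same

# One bootstrap resample: n draws with replacement from indices 0..n-1
rng = random.Random(0)
resample = [rng.randrange(n) for _ in range(n)]
print(len(set(resample)) / n)          # fraction of distinct points, ~0.63
```

So roughly 63.2% of the original points land in any one resample, which is exactly the out-of-resample ≈ 36.8% used for validation on the next slide.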

Bootstrapping & Validation
This is interesting in its own right, but bootstrapping also relates back to model validation, along the lines of cross-validation.
You can fit models on bootstrap resamples of your data.
For each resample, test the model on the ≈ 0.368 fraction of the data not included in that resample.
The resulting estimate will be biased, but corrections are available.
You get a spectrum of ROC curves.

Closing Thoughts
The "cross-validation" approach has several nice features:
– Relies on the data, not on likelihood theory, etc.
– Comports nicely with the lift curve concept
– Allows model validation that has both business and statistical meaning
– Is generic: it can be used to compare models generated by competing techniques…
– …or even pre-existing models
– Can be performed on different sub-segments of the data
– Is very intuitive and easily grasped

Closing Thoughts
Bootstrapping has a family resemblance to cross-validation:
– Both use the data to estimate features of a statistic or a model that we previously relied on statistical theory to give us
– Both are classic examples of the "data mining" (in the non-pejorative sense of the term!) mindset: leverage modern computers to "do it yourself" rather than look up a formula in a book; generic tools that can be used creatively
– Bootstrapping can be used to estimate model bias and variance
– It can be used to estimate (simulate) distributional characteristics of very difficult statistics
– It is ideal for many actuarial applications

METRICS What evaluation measure should we use?

Evaluation of Classification
Accuracy = (a + d) / (a + b + c + d)
– Not always the best choice. Assume 1% fraud and a model that predicts "no fraud" for every case – what is the accuracy?
Generic confusion matrix (rows = predicted outcome, columns = actual outcome):
                       Actual 1   Actual 0
  Predicted 1             a          b
  Predicted 0             c          d
The fraud example (1000 cases):
                       Actual Fraud   Actual No Fraud
  Predicted Fraud            0               0
  Predicted No Fraud        10             990

Evaluation of Classification
Other options (with a, b, c, d as in the confusion matrix above):
– Recall, or sensitivity (how many of those that are really positive did you predict as positive?): a / (a + c)
– Precision (how many of those predicted positive really are positive?): a / (a + b)
Precision and recall are always in tension: increasing one tends to decrease the other.

Evaluation of Classification
Yet another option:
– Recall, or sensitivity (how many of the positives did you get right?): a / (a + c)
– Specificity (how many of the negatives did you get right?): d / (b + d)
Sensitivity and specificity exhibit the same tension. Different fields prefer different metrics.
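Running the fraud example through these formulas shows why accuracy alone misleads (the helper function and its name are our illustration):

```python
def classification_metrics(a, b, c, d):
    """Metrics from the confusion matrix used above:
    a = true positives, b = false positives,
    c = false negatives, d = true negatives."""
    return {
        "accuracy":    (a + d) / (a + b + c + d),
        "recall":      a / (a + c) if a + c else 0.0,   # aka sensitivity
        "precision":   a / (a + b) if a + b else 0.0,
        "specificity": d / (b + d) if b + d else 0.0,
    }

# 1% fraud, model always predicts "no fraud": a=0, b=0, c=10, d=990
m = classification_metrics(a=0, b=0, c=10, d=990)
print(m["accuracy"])      # 0.99 -- looks great...
print(m["recall"])        # 0.0  -- ...but not a single fraud case is caught
print(m["specificity"])   # 1.0
```

99% accuracy with zero recall is exactly the trap the slide warns about: on imbalanced data, always check recall/precision or sensitivity/specificity, not accuracy alone.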

Evaluation for a Thresholded Response
Many classification models output probabilities. These probabilities are thresholded to make a prediction.
Classification accuracy depends on the threshold – good models give low probabilities to Y = 0 and high probabilities to Y = 1.

Test Data
[Table: predicted probabilities for the test instances, together with each instance's actual and predicted outcome]
Suppose we use a cutoff of 0.5…

[Table: actual vs. predicted outcomes at a cutoff of 0.5]
With a cutoff of 0.5: sensitivity = 100%, specificity = 75%.
We want both of these to be high.

[Table: actual vs. predicted outcomes at a cutoff of 0.8]
With a cutoff of 0.8: sensitivity = 75%, specificity = 83%.

Note there are 20 possible thresholds in this example.
Plotting sensitivity vs. specificity across all of them gives a sense of model performance by showing the trade-off at different thresholds:
– If the threshold is at the minimum, c = d = 0, so sensitivity = 1 and specificity = 0
– If the threshold is at the maximum, a = b = 0, so sensitivity = 0 and specificity = 1
– If the model is perfect, sensitivity = 1 and specificity = 1

ROC curve
Plots sensitivity vs. (1 − specificity), also known as the false positive rate.
– Always goes from (0, 0) to (1, 1)
– The more area in the upper left, the better
– A random model lies on the diagonal
– The "area under the curve" (AUC) is a common measure of predictive performance

ROC CURVES

Receiver Operating Characteristic curve
ROC curves were developed in the 1950s as a by-product of research into making sense of radio signals contaminated by noise. More recently it has become clear that they are remarkably useful in decision-making.
They are a performance graphing method: the true positive and false positive fractions are plotted as we move the dividing threshold.

ROC Space
ROC graphs are two-dimensional graphs in which the TP rate is plotted on the Y axis and the FP rate on the X axis. An ROC graph depicts the relative trade-offs between benefits (true positives) and costs (false positives).
The figure shows an ROC graph with five classifiers labeled A through E.
A discrete classifier is one that outputs only a class label. Each discrete classifier produces an (FP rate, TP rate) pair corresponding to a single point in ROC space. The classifiers in the figure are all discrete classifiers.

Several Points in ROC Space
The lower-left point (0, 0) represents the strategy of never issuing a positive classification:
– such a classifier commits no false positive errors but also gains no true positives.
The upper-right corner (1, 1) represents the opposite strategy, of unconditionally issuing positive classifications.
The point (0, 1) represents perfect classification:
– D's performance is perfect, as shown.
Informally, one point in ROC space is better than another if it is to the northwest of the first: the TP rate is higher, the FP rate is lower, or both.

Specific Example
[Figure: overlapping distributions of a test result for patients with and without the disease]

[Figure: a threshold on the test result divides patients into those called "negative" (below it) and "positive" (above it)]

Some definitions, given such a threshold:
– True positives: patients with the disease whose test result is above the threshold
– False positives: patients without the disease whose result is above the threshold
– True negatives: patients without the disease whose result is below the threshold
– False negatives: patients with the disease whose result is below the threshold

Moving the threshold to the right yields fewer false positives but more false negatives; moving it to the left does the opposite.

ROC curve
[Plot: True Positive Rate (sensitivity), 0%–100%, against False Positive Rate (1 − specificity), 0%–100%]

ROC curve comparison
[Plots: a good test, whose curve bows toward the upper-left corner, and a poor test, whose curve stays close to the diagonal]

ROC curve extremes
Best test: the two distributions don't overlap at all – the curve passes through the upper-left corner.
Worst test: the distributions overlap completely – the curve follows the diagonal.

How to Construct an ROC Curve for one Classifier
Sort the instances according to their P_pos (the predicted probability of the positive class).
Move a threshold across the sorted instances. Each threshold defines a classifier with its own confusion matrix.
Plot the TP and FP rates of these classifiers.
  P_pos   True Class
  0.99    pos
  0.98    pos
  0.7     neg
  0.6     pos
  0.43    neg
For example, the threshold that predicts pos for P_pos ≥ 0.7 gives:
                  Actual pos   Actual neg
  Predicted pos       2            1
  Predicted neg       1            1
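The sweep over thresholds can be sketched directly (the function name is ours; labels use 1 = pos, 0 = neg):

```python
def roc_points(scores, labels):
    """(FP rate, TP rate) pairs obtained by sweeping a threshold over
    every distinct score, from the all-negative classifier at (0, 0)
    to the all-positive classifier at (1, 1)."""
    P = sum(labels)            # number of positives
    N = len(labels) - P        # number of negatives
    points = [(0.0, 0.0)]      # threshold above every score
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / N, tp / P))
    return points

# The five instances from the table above
scores = [0.99, 0.98, 0.7, 0.6, 0.43]
labels = [1, 1, 0, 1, 0]
# the curve passes through (0,0), (0,1/3), (0,2/3), (1/2,2/3), (1/2,1), (1,1)
for fpr, tpr in roc_points(scores, labels):
    print(fpr, tpr)
```

The threshold at 0.7 reproduces the confusion matrix on the slide: TP rate 2/3 and FP rate 1/2.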

Creating an ROC Curve
A classifier produces a single ROC point.
If the classifier has a "sensitivity" parameter, varying it produces a series of ROC points (confusion matrices).
Alternatively, if the classifier is produced by a learning algorithm, a series of ROC points can be generated by varying the class ratio in the training set.

ROC for one Classifier Good separation between the classes, convex curve.

ROC for one Classifier Reasonable separation between the classes, mostly convex.

ROC for one Classifier Fairly poor separation between the classes, mostly convex.

ROC for one Classifier Poor separation between the classes, large and small concavities.

ROC for one Classifier Random performance.

The AUC Metric
The area under the ROC curve (AUC) assesses the ranking in terms of the separation of the classes.
The AUC estimates the probability that a randomly chosen positive instance will be ranked before a randomly chosen negative instance.
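That probabilistic reading of the AUC can be computed directly by comparing every positive/negative pair (a sketch; ties are counted as half, and the function name is ours):

```python
def auc_by_ranking(scores, labels):
    """AUC as the probability that a randomly chosen positive instance
    is scored higher than a randomly chosen negative one."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Same five instances as in the ROC-construction example:
# 5 of the 6 pos/neg pairs are ranked correctly, so AUC = 5/6
print(auc_by_ranking([0.99, 0.98, 0.7, 0.6, 0.43], [1, 1, 0, 1, 0]))
```

This matches the area under the step curve traced out by the threshold sweep, which is why AUC is a ranking metric rather than a threshold-specific one.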

Comparing Models
The highest AUC wins.
But pay attention to Occam's Razor:
– "the best theory is the smallest one that describes all the facts"
– Also known as the parsimony principle
– If two models perform similarly, pick the simpler one