Classification Evaluation
Estimating Future Accuracy Given available data, how can we reliably predict accuracy on future, unseen data? Three basic approaches – Training set – Hold-out set (2 variants) – Cross-validation
Estimating with Training Set Simplest approach – Build model from training set – Compute accuracy on training set Pros and cons – Easy – Likely to overestimate (think overfitting)
Estimating with Hold-out Set (1) Method 1 – Two distinct data sets are made available a priori – One is used to build the model – The other is used to test the model Pros and cons – No “bias” – Not always feasible
Estimating with Hold-out Set (2) Method 2: – Randomly partition data into training and test set – Training set used to train/build the model – Test set used to evaluate the model Pros and cons – Easy – Less likely to overestimate accuracy than the training-set estimate – Reduces amount of training data
Holding out data The holdout method reserves a certain amount of the data for testing and uses the remainder for training – Usually: one third for testing, the rest for training For “unbalanced” datasets, random samples might not be representative – Few or no instances of some classes Stratified sample: – Make sure that each class is represented in approximately the same proportion in both subsets
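A stratified holdout split can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name `stratified_holdout`, the fixed seed, and the toy labels are assumptions made for the example; the one-third test fraction follows the slide.

```python
import random
from collections import defaultdict

def stratified_holdout(labels, test_fraction=1/3, seed=0):
    """Return (train_idx, test_idx) with class proportions roughly preserved."""
    rng = random.Random(seed)
    # Group instance indices by class label.
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train_idx, test_idx = [], []
    # Sample the test portion separately within each class.
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy unbalanced data set: 30 positives, 60 negatives (hypothetical).
labels = ["pos"] * 30 + ["neg"] * 60
train_idx, test_idx = stratified_holdout(labels)
```

Sampling within each class is what keeps a rare class from vanishing from one of the two subsets.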
Repeated holdout method Holdout estimate can be made more reliable by repeating the process with different subsamples – In each iteration, a certain proportion is randomly selected for training (possibly with stratification) – The error rates on the different iterations are averaged to yield an overall error rate This is called the repeated holdout method
Cross-validation Most popular and effective type of repeated holdout is cross-validation Cross-validation avoids overlapping test sets – First step: data is split into k subsets of equal size – Second step: each subset in turn is used for testing and the remainder for training This is called k-fold cross-validation Often the subsets are stratified before the cross-validation is performed
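The two steps above can be sketched as an index generator. This is an unstratified sketch under stated assumptions: the name `k_fold_indices` and the seed are hypothetical, and fold sizes differ by at most one when k does not divide n.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Step 1: split the shuffled indices into k near-equal folds.
    folds = [indices[i::k] for i in range(k)]
    # Step 2: each fold is the test set once; the rest form the training set.
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test

for train, test in k_fold_indices(n=10, k=5):
    print(len(train), len(test))
```

Every instance lands in exactly one test set, which is the "no overlapping test sets" property.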
Cross-validation example: [figure not recovered]
More on cross-validation Standard data-mining method for evaluation: stratified ten-fold cross-validation Why ten? – Found empirically to be a good choice for getting an accurate estimate Stratification reduces the estimate’s variance Even better: repeated stratified cross-validation – E.g., ten-fold cross-validation is repeated ten times and results are averaged (reduces the sampling variance) Error estimate is the mean across all repetitions
Leave-One-Out cross-validation Leave-One-Out: a particular form of cross-validation: – Set number of folds to number of training instances – I.e., for n training instances, build classifier n times Makes best use of the data Involves no random subsampling Computationally expensive, but good performance
Leave-One-Out-CV and stratification Disadvantage of Leave-One-Out-CV: stratification is not possible – It guarantees a non-stratified sample because there is only one instance in the test set! Extreme example: random dataset split equally into two classes – Best model predicts majority class – 50% accuracy on fresh data – But removing the test instance leaves its class in the minority of the training set, so the majority-class prediction is always wrong – Leave-One-Out-CV estimate is 100% error!
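The extreme example can be checked directly. The sketch below assumes a majority-class "learner" and a toy data set of 50 instances per class; the function names are made up for the illustration.

```python
from collections import Counter

def majority_class(labels):
    """The 'model': always predict the most frequent training label."""
    return Counter(labels).most_common(1)[0][0]

def loo_accuracy(labels):
    """Leave-one-out accuracy of the majority-class predictor."""
    correct = 0
    for i, held_out in enumerate(labels):
        training = labels[:i] + labels[i + 1:]
        if majority_class(training) == held_out:
            correct += 1
    return correct / len(labels)

# Two classes of equal size: 50 "a" and 50 "b".
labels = ["a"] * 50 + ["b"] * 50
print(loo_accuracy(labels))  # prints 0.0, i.e. 100% estimated error
```

Holding out one instance leaves 49 of its class against 50 of the other, so the prediction is wrong every time.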
Three-way Data Splits One problem with CV: because the same data is used both to fit the model and to estimate its error, the error estimate can be biased downward. If the goal is a realistic estimate of error (as opposed to deciding which model is best), you may want a three-way split: – Training set: examples used for learning – Validation set: used to tune parameters – Test set: never used in the model-fitting process; used only at the end for an unbiased estimate of hold-out error Alternative: nested cross-validation
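A three-way split can be sketched as a shuffle followed by slicing. The 60/20/20 proportions, the seed, and the name `three_way_split` are assumptions for the example; the slide does not prescribe them.

```python
import random

def three_way_split(n, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Return disjoint (train, validation, test) index lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = int(n * test_fraction)
    n_val = int(n * val_fraction)
    test = indices[:n_test]                  # touched only once, at the very end
    val = indices[n_test:n_test + n_val]     # used to tune parameters
    train = indices[n_test + n_val:]         # used for learning
    return train, val, test

train, val, test = three_way_split(100)
print(len(train), len(val), len(test))  # prints 60 20 20
```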
Issues with Accuracy Measuring accuracy – Is 99% accuracy good? Is 20% accuracy bad? – Can be excellent, good, mediocre, poor, terrible Why? – Depends on problem complexity – Depends on base accuracy (i.e., majority learner) – Depends on cost of error (e.g., ICU, etc.) Problem: assumes equal cost for all errors
Confusion Matrix
                          Predicted positive                   Predicted negative
True (target) positive    True Positive (TP): hits             False Negative (FN): misses
True (target) negative    False Positive (FP): false alarms    True Negative (TN): correct rejections
Accuracy = (TP+TN)/(TP+TN+FP+FN)
Single number: loses information
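The four cells and the accuracy formula can be sketched as below. The function names and the toy data (1 positive among 100 instances, classifier always says negative) are assumptions chosen to show how accuracy alone hides information.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn) for a binary problem."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive:
            if t == positive:
                tp += 1   # hit
            else:
                fp += 1   # false alarm
        else:
            if t == positive:
                fn += 1   # miss
            else:
                tn += 1   # correct rejection
    return tp, fp, tn, fn

def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + tn + fp + fn)

# A classifier that always predicts negative on a 1-in-100 problem.
y_true = [1] + [0] * 99
y_pred = [0] * 100
tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print(accuracy(tp, fp, tn, fn))  # prints 0.99, yet tp == 0
```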
Precision Precision = TP/(TP+FP) The percentage of predicted positives that are true positives (of those I predict true, how many are actually true)
Recall Recall = TP/(TP+FN) The percentage of target positives that were predicted as positive (of those that are true, how many do I predict are true)
P/R Trade-off (I) ICU monitoring: – Is precision the goal? Not so much, rather not miss any – Recall is the goal: Don’t want to miss any, and would rather err towards accepting some false positives (check patient when unnecessary) and minimize false negatives (not check on a needy patient) Google search: – Is recall the goal? Not really, because we never get to the millionth page, rather get a few very good results early – Precision is the goal: Don’t want to see irrelevant documents (false positives), can tolerate missing some (false negatives), there are plenty of sites anyways and we don’t need to get all Trade-off: – Easy to maximize precision – only classify the one or few most confident candidates as true – Easy to maximize recall – classify everything as true – Neither is particularly useful!
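P/R Trade-off (II) The complete P/R curve plots precision against recall over all thresholds – Breakeven point: defined by P = R – Alternatively, F-measure: F = 2PR / (P + R), the harmonic mean of precision and recall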
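The trade-off can be seen by sweeping a threshold over scored instances. This sketch assumes the common balanced F-measure, F = 2PR / (P + R); the scores, labels, and the convention of reporting precision 1.0 when nothing is predicted positive are all illustrative assumptions.

```python
def precision_recall_f(scores, labels, threshold):
    """Predict positive when score >= threshold; labels are 1 or 0."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= threshold and y == 1)
    fp = sum(1 for s, y in pairs if s >= threshold and y == 0)
    fn = sum(1 for s, y in pairs if s < threshold and y == 1)
    # Assumed convention: precision is 1.0 when no positives are predicted.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f = 2 * precision * recall / total if total else 0.0
    return precision, recall, f

# Hypothetical classifier scores and true labels.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]

for threshold in (0.85, 0.65, 0.0):
    print(threshold, precision_recall_f(scores, labels, threshold))
```

On this toy data the strict threshold (0.85) gives perfect precision with low recall, the threshold 0.0 gives perfect recall with low precision, and 0.65 is the breakeven point where P = R.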
Other Measures Sensitivity (Recall): – TP / (TP + FN) Specificity: – TN / (TN + FP) Positive Predictive Value (Precision): – TP / (TP + FP) Negative Predictive Value: – TN / (TN + FN)
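The four measures can be computed from one set of confusion-matrix counts. The function name `rates` and the counts below are hypothetical; the sketch does not guard against zero denominators.

```python
def rates(tp, fn, tn, fp):
    """Sensitivity, specificity, PPV and NPV from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # recall
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),          # precision
        "npv": tn / (tn + fn),
    }

# Hypothetical screening result: 100 diseased and 900 healthy patients.
for name, value in rates(tp=90, fn=10, tn=850, fp=50).items():
    print(f"{name}: {value:.3f}")
```

Sensitivity and specificity each condition on the true class, while PPV and NPV condition on the prediction, so the latter two shift with class prevalence.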
ROC Curves Receiver Operating Characteristic Curve – Developed in WWII to statistically model false positive and false negative detections of radar operators Standard measure in medicine and biology Graphs true positive rate (sensitivity) vs. false positive rate (1 - specificity) Goal: Maximize TPR and minimize FPR – Max TPR: classify everything positive – Min FPR: classify everything negative – Neither is acceptable, of course!
Several Points in ROC Space Lower left point (0, 0) represents the strategy of never issuing a positive classification – No FP but also no TP Upper right corner (1, 1) represents the opposite strategy, of unconditionally issuing positive classifications Point (0, 1) represents perfect classification – D's performance is perfect as shown Informally, one point in ROC space is better than another if it lies to the northwest of the other – TP rate is higher, FP rate is lower, or both
ROC Curves and AUC (II) Each point on the ROC curve represents a different tradeoff (cost ratio) between TPR and FPR AUC is area under the curve: represents performance averaged over all possible cost ratios Single summary number Perfect model has AUC = 1.0 Random model has AUC = 0.5
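One way to compute AUC without drawing the curve uses an equivalent reading that the slide does not state: AUC is the probability that a randomly chosen positive is scored above a randomly chosen negative, with ties counted as one half. The sketch below is a quadratic-time illustration with hypothetical scores.

```python
def auc_pairwise(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))

print(auc_pairwise([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))  # prints 1.0 (perfect ranking)
print(auc_pairwise([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0]))  # prints 0.5 (uninformative scores)
```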
Specific Example [Figure not recovered: distributions of a test result for patients with the disease and patients without the disease. A threshold on the test-result axis splits the patients: those on one side are called “negative”, those on the other side “positive”.] Some definitions – True positives: patients with the disease who are called positive – False positives: patients without the disease who are called positive – True negatives: patients without the disease who are called negative – False negatives: patients with the disease who are called negative
How to Construct ROC Curve for one Classifier
Sort the instances according to their P_pos
Move a threshold on the sorted instances
For each threshold define a classifier with confusion matrix
Plot the TPR and FPR of the classifier

P_pos   True Class
0.99    pos
0.98    pos
0.7     neg
0.6     pos
0.43    neg

Confusion matrix (threshold between 0.7 and 0.6)
            Predicted pos   Predicted neg
True pos    2               1
True neg    1               1
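The construction can be sketched with the five instances from the slide. The function name `roc_points` is hypothetical, and the sketch assumes no tied scores (ties would need to be processed as a group).

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points, from the strictest threshold to the loosest."""
    n_pos = sum(1 for y in labels if y == "pos")
    n_neg = len(labels) - n_pos
    # Sort instances by decreasing score.
    ranked = sorted(zip(scores, labels), reverse=True)
    points = [(0.0, 0.0)]   # threshold above every score: nothing called positive
    tp = fp = 0
    # Lower the threshold past one instance at a time.
    for _, y in ranked:
        if y == "pos":
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points

scores = [0.99, 0.98, 0.7, 0.6, 0.43]
labels = ["pos", "pos", "neg", "pos", "neg"]
for fpr, tpr in roc_points(scores, labels):
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

The point reached after the third instance (FPR = 0.50, TPR = 0.67) corresponds to the confusion matrix with TP = 2, FN = 1, FP = 1, TN = 1.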
ROC Properties AUC properties – 1.0 - Perfect prediction – .9 - Excellent – .7 - Mediocre – .5 - Random ROC Curve properties – If two ROC curves do not intersect then one method dominates the other – If they do intersect then one method is better for some cost ratios, and is worse for others Blue alg better for precision, yellow alg for recall, red neither Can choose method and balance based on goals
Lift (I) In some situations, we are not interested in the accuracy over the entire data set – Accurate predictions for 5%, 10%, or 20% of data – Don’t care about the rest Prototypical application: direct marketing – Baseline: random targeting of population – Can we do better? Want to know how much better a targeted offer is on a fraction of the population
Lift (II) Lift = [TP / (TP+FN)] / [(TP+FP) / (TP+TN+FP+FN)] The fraction of true positives captured divided by the fraction of the population predicted positive: how much better a model is over random predictions
Lift (III) Lift(t) = CR(t) / t E.g., Lift (25%) = CR(25) / 25 = 62 / 25 = 2.48 ≈ 2.5 If we select 25% of prospects using our model, they are about 2.5 times more likely to respond than if we selected them randomly Can vary t to make decisions (e.g., cost/benefit analysis)
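Lift(t) can be sketched as below, assuming CR(t) denotes the share of all responders found among the top fraction t of prospects when ranked by model score. That reading of CR, the function name `lift_at`, and the toy data are assumptions for the illustration.

```python
def lift_at(scores, responded, fraction):
    """Lift(t) = CR(t) / t for the top `fraction` of prospects by score."""
    n = len(scores)
    k = max(1, round(n * fraction))   # number of prospects targeted
    ranked = sorted(zip(scores, responded), key=lambda pair: pair[0], reverse=True)
    captured = sum(r for _, r in ranked[:k])
    cumulative_response = captured / sum(responded)   # CR(t) as a fraction
    return cumulative_response / (k / n)

# Hypothetical scores and outcomes (1 = responded) for eight prospects.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
responded = [1, 1, 0, 0, 1, 0, 0, 1]

print(lift_at(scores, responded, 0.25))  # prints 2.0
print(lift_at(scores, responded, 1.0))   # prints 1.0 (targeting everyone)
```

Targeting the whole population always gives a lift of 1, which is the random-targeting baseline.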
Summary Several measures – Single value vs. range of thresholds Restricted to binary classification – Could always cast problem as a set of two class problems but that can be inconvenient Accuracy handles multi-class outputs Key point: – The measure you optimize makes a difference – The measure you report makes a difference – Measure what you want to optimize/report (i.e., use measure appropriate to task/domain)