
1 CSI5388: A Critique of our Evaluation Practices in Machine Learning

2 Observations
The way evaluation is conducted in Machine Learning/Data Mining has not been a primary concern of the community. This is very different from the way evaluation is approached in other applied fields such as Economics, Psychology, and Sociology. Researchers in those fields have been more concerned with the meaning and validity of their results than we have been in ours.

3 The Problem
The objective value of our advances in Machine Learning may be different from what we believe it to be. Our conclusions may be flawed or meaningless. ML methods may get undue credit, or may not get the recognition they deserve. The field may start stagnating. Practitioners in other fields, or potential business partners, may dismiss our approaches and results. We hope that better evaluation practices can help the field of machine learning focus on more effective research and encourage more cross-discipline and cross-purpose exchanges.

4 Organization of the Lecture
A review of the shortcomings of current evaluation methods:
- Problems with Performance Evaluation
- Problems with Confidence Estimation
- Problems with Data Sets

5 Recommended Steps for Proper Evaluation
1. Identify the “interesting” properties of the classifier.
2. Choose an evaluation metric accordingly.
3. Choose a confidence estimation method.
4. Check that all the assumptions made by the evaluation metric and the confidence estimator are verified.
5. Run the evaluation method with the chosen metric and confidence estimator, and analyze the results.
6. Interpret the results with respect to the domain.

6 Commonly Followed Steps of Evaluation
1. Identify the “interesting” properties of the classifier.
2. Choose an evaluation metric accordingly.
3. Choose a confidence estimation method.
4. Check that all the assumptions made by the evaluation metric and the confidence estimator are verified.
5. Run the evaluation method with the chosen metric and confidence estimator, and analyze the results.
6. Interpret the results with respect to the domain.
These steps are typically considered, but only very lightly.

7 Overview
What happens when bad choices of performance evaluation metrics are made? (Steps 1 and 2 are considered too lightly.) E.g., Accuracy, Precision/Recall, ROC Analysis. Note: each metric solves the problem of the previous one, but introduces new shortcomings (usually caught by the previous metrics).
What happens when bad choices of confidence estimators are made and the assumptions underlying these confidence estimators are not respected? (Step 3 is considered lightly and Step 4 is disregarded.) E.g., the t-test.

8 A Short Review I: Confusion Matrix / Common Performance Evaluation Metrics
Accuracy = (TP+TN)/(P+N)
Precision = TP/(TP+FP)
Recall / TP rate = TP/P
FP rate = FP/N

A confusion matrix:

                     True class
Hypothesized class   Pos         Neg
Yes                  TP          FP
No                   FN          TN
                     P = TP+FN   N = FP+TN

ROC analysis moves the threshold between the positive and the negative class from a small FP rate to a large one. It plots the value of the Recall against that of the FP rate at each FP rate considered.
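As a quick illustration (not part of the original slides), these four metrics can be computed directly from the confusion-matrix counts; the sketch below is in Python and the function name is ours.

```python
# Minimal illustrative sketch: the four metrics above, computed from the
# confusion-matrix counts TP, FP, FN, TN.

def confusion_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall (TP rate) and FP rate."""
    p, n = tp + fn, fp + tn                      # actual positives / negatives
    return {
        "accuracy":  (tp + tn) / (p + n),
        "precision": tp / (tp + fp) if (tp + fp) else float("nan"),
        "recall":    tp / p,                     # also the TP rate
        "fp_rate":   fp / n,
    }

if __name__ == "__main__":
    # Hypothetical counts; any confusion matrix can be plugged in.
    print(confusion_metrics(tp=200, fp=100, fn=300, tn=400))
```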

9 A Short Review II: Confidence Estimation / The t-Test
The most commonly used approach to confidence estimation in Machine Learning is:
1. Run the algorithm using 10-fold cross-validation and record the accuracy at each fold.
2. Compute a confidence interval around the average difference between these reported accuracies and a given gold standard, using the t-test, i.e., the formula:

   δ ± t_{N,9} * s_δ

where δ is the average difference between the reported accuracy and the given gold standard, t_{N,9} is a constant chosen according to the desired degree of confidence N and the 9 degrees of freedom that come with 10 folds, and

   s_δ = sqrt( (1/90) * Σ_{i=1..10} (δ_i − δ)² )

where δ_i is the difference between the reported accuracy and the given gold standard at fold i (1/90 = 1/(10 × 9)).
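As an illustration (not from the original slides), here is a minimal Python sketch of this interval computation; it assumes SciPy is available, and the accuracy values in the example are hypothetical.

```python
# Illustrative sketch: t-based confidence interval around the mean difference
# between per-fold accuracies and a gold standard (9 degrees of freedom for
# 10 folds, hence the 1/90 = 1/(10*9) factor in the standard error).
import math
from scipy import stats

def t_confidence_interval(fold_accuracies, gold_standard, confidence=0.95):
    diffs = [a - gold_standard for a in fold_accuracies]
    k = len(diffs)                                   # number of folds
    mean = sum(diffs) / k
    s = math.sqrt(sum((d - mean) ** 2 for d in diffs) / (k * (k - 1)))
    t_val = stats.t.ppf((1 + confidence) / 2, df=k - 1)
    return mean - t_val * s, mean + t_val * s

# Hypothetical accuracies over 10 folds, compared to a 0.80 gold standard:
print(t_confidence_interval([0.82, 0.79, 0.81, 0.80, 0.83,
                             0.78, 0.84, 0.80, 0.82, 0.81], 0.80))
```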

10 What’s wrong with Accuracy?
Left classifier:
            True class
            Pos      Neg
Yes         200      100
No          300      400
            P=500    N=500

Right classifier:
            True class
            Pos      Neg
Yes         400      300
No          100      200
            P=500    N=500

Both classifiers obtain 60% accuracy, yet they exhibit very different behaviours:
- On the left: weak positive recognition rate / strong negative recognition rate
- On the right: strong positive recognition rate / weak negative recognition rate
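The arithmetic behind this example can be checked with a short, purely illustrative Python snippet (the labels and variable names are ours):

```python
# Illustrative check: equal accuracy, very different class-wise behaviour.
# Each matrix is given as (TP, FP, FN, TN), matching the tables above.
left  = (200, 100, 300, 400)   # weak positive / strong negative recognition
right = (400, 300, 100, 200)   # strong positive / weak negative recognition

for name, (tp, fp, fn, tn) in [("left", left), ("right", right)]:
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    tp_rate  = tp / (tp + fn)   # positive recognition rate
    tn_rate  = tn / (fp + tn)   # negative recognition rate
    print(f"{name}: accuracy={accuracy:.0%}  TP rate={tp_rate:.0%}  TN rate={tn_rate:.0%}")
# Both matrices give 60% accuracy, with 40%/80% vs. 80%/40% class-wise rates.
```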

11 What’s wrong with Precision/Recall?
Left classifier:
            True class
            Pos      Neg
Yes         200      100
No          300      400
            P=500    N=500

Right classifier:
            True class
            Pos      Neg
Yes         200      100
No          300      0
            P=500    N=100

Both classifiers obtain the same precision and recall values of 66.7% and 40%, yet they exhibit very different behaviours:
- Same positive recognition rate
- Extremely different negative recognition rates: strong on the left / nil on the right
Note: Accuracy has no problem catching this!
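Again as a purely illustrative check (the variable names are ours), the snippet below reproduces these numbers and shows how accuracy exposes the difference:

```python
# Illustrative check: identical precision and recall, very different TN rate.
left  = (200, 100, 300, 400)   # TP, FP, FN, TN
right = (200, 100, 300, 0)     # note TN = 0: no negatives are recognized

for name, (tp, fp, fn, tn) in [("left", left), ("right", right)]:
    precision = tp / (tp + fp)
    recall    = tp / (tp + fn)
    tn_rate   = tn / (fp + tn)
    accuracy  = (tp + tn) / (tp + fp + fn + tn)
    print(f"{name}: P={precision:.1%}  R={recall:.1%}  "
          f"TN rate={tn_rate:.1%}  accuracy={accuracy:.1%}")
# Precision (66.7%) and recall (40%) match, while the TN rate drops from 80%
# to 0% and accuracy from 60% to 33.3%.
```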

12 What’s wrong with ROC Analysis?
(We consider single points in ROC space, not the entire ROC curve.)

Left classifier:
            True class
            Pos      Neg
Yes         200      10
No          300      4,000
            P=500    N=4,010

Right classifier:
            True class
            Pos        Neg
Yes         500        1,000
No          300        400,000
            P=800      N=401,000

ROC analysis and Precision yield contradictory results. In terms of ROC analysis, the classifier on the right is a significantly better choice than the one on the left: the point representing the right classifier lies on the same vertical line as the point representing the left classifier, but 22.5 percentage points higher. Yet the classifier on the right has ridiculously low precision (33.3%), while the classifier on the left has excellent precision (95.24%).
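The ROC coordinates and precisions quoted above can be recomputed with the following illustrative snippet (variable names are ours):

```python
# Illustrative check: same FP rate (same vertical line in ROC space),
# higher TP rate on the right, but drastically lower precision.
left  = (200, 10, 300, 4_000)          # TP, FP, FN, TN
right = (500, 1_000, 300, 400_000)

for name, (tp, fp, fn, tn) in [("left", left), ("right", right)]:
    tp_rate   = tp / (tp + fn)         # y-axis of the ROC plot
    fp_rate   = fp / (fp + tn)         # x-axis of the ROC plot
    precision = tp / (tp + fp)
    print(f"{name}: TP rate={tp_rate:.1%}  FP rate={fp_rate:.2%}  "
          f"precision={precision:.2%}")
# Both FP rates are about 0.25%; the right TP rate is 22.5 points higher
# (62.5% vs. 40%), yet its precision collapses from 95.24% to 33.33%.
```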

13 What’s wrong with the t-test?
Per-fold differences in accuracy from the gold standard, over the 10 folds:
Classifier 1: values of +5% and -5% across the folds
Classifier 2: values of +10% and 0% across the folds

Classifiers 1 and 2 yield the same mean and the same confidence interval. Yet Classifier 1 is relatively stable, while Classifier 2 is not. Problem: the t-test assumes a normal distribution, and the difference in accuracy between Classifier 2 and the gold standard is not normally distributed.
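A minimal sketch of how this assumption could be checked before trusting the interval, assuming SciPy is available; the fold values below are hypothetical, mimicking a classifier whose differences are only ever +10% or 0%:

```python
# Illustrative sketch: test the normality assumption of the t-test on the
# per-fold differences (hypothetical values shown here).
from scipy import stats

fold_diffs = [0.10, 0.0, 0.10, 0.0, 0.10, 0.0, 0.10, 0.0, 0.10, 0.0]

stat, p_value = stats.shapiro(fold_diffs)
print(f"Shapiro-Wilk statistic={stat:.3f}, p-value={p_value:.4f}")
# A very small p-value means the differences are unlikely to come from a
# normal distribution, so the t-test confidence interval should not be trusted.
```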

14 Discussion
There is nothing intrinsically wrong with any of the performance evaluation measures or confidence tests discussed. It is all a matter of thinking about which one to use when, and about what the results mean (both in terms of added value and of limitations). A simple conceptualization of the problem with current evaluation practices: evaluation metrics and confidence measures summarize the results, so ML practitioners must understand the terms of these summarizations and check that their assumptions hold. In certain cases, however, it is necessary to look further and, eventually, borrow practices from other disciplines. In yet other cases, it pays to devise our own methods. Both situations are discussed in what follows.

