Experimental Evaluation of Learning Algorithms Part 1.


1 Experimental Evaluation of Learning Algorithms Part 1

2 Motivation
Evaluating the performance of learning systems is important because:
- Learning systems are usually designed to predict the class of “future” unlabeled data points.
- In some cases, evaluating hypotheses is an integral part of the learning process (for example, when pruning a decision tree).

3 Recommended Steps for Proper Evaluation
1. Identify the “interesting” properties of the classifier.
2. Choose an evaluation metric accordingly.
3. Choose the learning algorithms to include in the study, along with the domain(s) on which the various systems will be compared.
4. Choose a confidence estimation method.
5. Check that all the assumptions made by the evaluation metric and confidence estimator are verified.
6. Run the evaluation method with the chosen metric and confidence estimator, and analyze the results.
7. Interpret the results with respect to the domain(s).

4 The Classifier Evaluation Procedure
[Flowchart] Choose learning algorithm(s) to evaluate → Select performance measure of interest → Select datasets for comparison → Select error-estimation/sampling method → Select statistical test → Perform evaluation.
In the diagram, one arrow type means that knowledge of step 1 is necessary for step 2; the other means that feedback from step 1 should be used to adjust step 2.

5 Typical (but not necessarily optimal) Choices
- Identify the “interesting” properties of the classifier.
- Choose an evaluation metric accordingly.
- Choose a confidence estimation method.
- Check that all the assumptions made by the evaluation metric and confidence estimator are verified.
- Run the evaluation method with the chosen metric and confidence estimator, and analyze the results.
- Interpret the results with respect to the domain.

6 Typical (but not necessarily optimal) Choices I
1. Identify the “interesting” properties of the classifier.
2. Choose an evaluation metric accordingly.
3. Choose a confidence estimation method.
4. Check that all the assumptions made by the evaluation metric and confidence estimator are verified.
5. Run the evaluation method with the chosen metric and confidence estimator, and analyze the results.
6. Interpret the results with respect to the domain.
(On the slide, part of this list is highlighted: in practice, almost only the highlighted steps are performed; the other steps are typically considered, but only very lightly.)

7 Typical (but not necessarily optimal) Choices II
Typical choices for performance evaluation:
- Accuracy
- Precision/Recall
Typical choices for sampling methods:
- Train/test sets (why is this necessary?)
- K-fold cross-validation
Typical choices for significance estimation:
- t-test (often a very bad choice, in fact!)
A minimal sketch of these typical choices appears below.
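The sketch below puts the typical choices together: a held-out test set for accuracy, 10-fold cross-validation for two learners, and a paired t-test over the per-fold accuracies (which later slides warn is often a poor choice). The dataset (breast cancer) and the two learners (a decision tree and naive Bayes) are illustrative assumptions, not part of the slides.

```python
# Hedged sketch of the "typical choices" above, using scikit-learn and SciPy.
from scipy import stats
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)

# Train/test split: measure accuracy on data the learner has never seen,
# since training-set accuracy is an optimistically biased estimate.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
print("held-out accuracy:", accuracy_score(y_te, model.predict(X_te)))

# 10-fold cross-validation for two learners, then a paired t-test on the
# per-fold accuracies (often a bad choice, as discussed later).
acc_A = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
acc_B = cross_val_score(GaussianNB(), X, y, cv=10)
t_stat, p_value = stats.ttest_rel(acc_A, acc_B)
print(f"paired t-test over folds: t = {t_stat:.3f}, p = {p_value:.3f}")
```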

8 Confusion Matrix / Common Performance Evaluation Metrics
A confusion matrix (rows: hypothesized class, columns: true class):

                        True class
  Hypothesized         Pos        Neg
  Yes                  TP         FP
  No                   FN         TN
                     P = TP+FN    N = FP+TN

- Accuracy = (TP+TN)/(P+N)
- Precision = TP/(TP+FP)
- Recall / TP rate = TP/P
- FP rate = FP/N
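A minimal sketch of these metric definitions, computed directly from confusion-matrix counts. The counts are illustrative (they match the left matrix used later on the “What’s wrong with Accuracy?” slide).

```python
def confusion_metrics(tp, fp, fn, tn):
    """Compute the four metrics above from confusion-matrix counts."""
    p, n = tp + fn, fp + tn              # column totals: positives, negatives
    return {
        "accuracy": (tp + tn) / (p + n),
        "precision": tp / (tp + fp),
        "recall (TP rate)": tp / p,
        "FP rate": fp / n,
    }

# Illustrative counts.
print(confusion_metrics(tp=200, fp=100, fn=300, tn=400))
```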

9 Sampling and Significance Estimation: Questions Considered
- Given the observed accuracy of a hypothesis over a limited sample of data, how well does this estimate its accuracy over additional examples?
- Given that one hypothesis outperforms another over some sample of data, how probable is it that this hypothesis is more accurate in general?
- When data is limited, what is the best way to use this data to both learn a hypothesis and estimate its accuracy?

10 k-Fold Cross-Validation (comparing two learning algorithms L_A and L_B)
1. Partition the available data D_0 into k disjoint subsets T_1, T_2, …, T_k of equal size, where this size is at least 30.
2. For i from 1 to k:
   - Use T_i as the test set and the remaining data as the training set S_i:  S_i <- D_0 - T_i
   - h_A <- L_A(S_i)
   - h_B <- L_B(S_i)
   - δ_i <- error_{T_i}(h_A) - error_{T_i}(h_B)
3. Return avg(δ) = (1/k) Σ_{i=1..k} δ_i
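A minimal sketch of this paired k-fold comparison. The dataset and the two learners standing in for L_A and L_B are illustrative assumptions, not part of the slides.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)
kf = KFold(n_splits=10, shuffle=True, random_state=0)

deltas = []
for train_idx, test_idx in kf.split(X):
    # S_i = D_0 - T_i: train both learners on everything outside the held-out fold.
    h_A = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    h_B = GaussianNB().fit(X[train_idx], y[train_idx])
    err_A = 1.0 - h_A.score(X[test_idx], y[test_idx])   # error_Ti(h_A)
    err_B = 1.0 - h_B.score(X[test_idx], y[test_idx])   # error_Ti(h_B)
    deltas.append(err_A - err_B)                        # delta_i

print("avg(delta) =", np.mean(deltas))
```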

11 Confidence of the k-Fold Estimate
The most commonly used approach to confidence estimation in machine learning is:
- Run the algorithm using 10-fold cross-validation and record the accuracy at each fold.
- Compute a confidence interval around the average of the difference between these reported accuracies and a given gold standard, using the t-test, i.e., the following formula:

  δ ± t_{N,9} · s_δ

  where
  - δ is the average difference between the reported accuracy and the given gold standard,
  - t_{N,9} is a constant chosen according to the degree of confidence desired,
  - s_δ = sqrt( (1/90) · Σ_{i=1..10} (δ_i − δ)² ), where δ_i is the difference between the reported accuracy and the given gold standard at fold i.
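A minimal sketch of this interval computation. The per-fold differences are placeholder values; the 95% confidence level is an assumption for illustration, and 9 degrees of freedom correspond to the 10 folds.

```python
import numpy as np
from scipy import stats

# Per-fold differences from a 10-fold run (placeholder values).
deltas = np.array([0.03, -0.01, 0.02, 0.00, 0.04, -0.02, 0.01, 0.03, 0.00, 0.02])
k = len(deltas)                                       # 10 folds
delta_mean = deltas.mean()                            # delta in the formula above
s_delta = np.sqrt(np.sum((deltas - delta_mean) ** 2) / (k * (k - 1)))  # 1/90 for k = 10
t_const = stats.t.ppf(0.975, df=k - 1)                # 95% two-sided confidence, 9 d.o.f.
print(f"{delta_mean:.4f} +/- {t_const * s_delta:.4f}")
```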

12 What’s wrong with Accuracy?

  Left classifier          True class
  Hypothesized            Pos     Neg
  Yes                     200     100
  No                      300     400
                        P=500   N=500

  Right classifier         True class
  Hypothesized            Pos     Neg
  Yes                     400     300
  No                      100     200
                        P=500   N=500

- Both classifiers obtain 60% accuracy.
- They exhibit very different behaviours:
  - On the left: weak positive recognition rate / strong negative recognition rate.
  - On the right: strong positive recognition rate / weak negative recognition rate.
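A quick check of the two matrices above, as a sketch: both give 60% accuracy, yet the positive and negative recognition rates are reversed.

```python
def rates(tp, fp, fn, tn):
    p, n = tp + fn, fp + tn
    return (tp + tn) / (p + n), tp / p, tn / n   # accuracy, TP rate, TN rate

print("left : acc=%.2f TPrate=%.2f TNrate=%.2f" % rates(tp=200, fp=100, fn=300, tn=400))
print("right: acc=%.2f TPrate=%.2f TNrate=%.2f" % rates(tp=400, fp=300, fn=100, tn=200))
# Both print acc=0.60, but the TP/TN rates are swapped (0.40/0.80 vs. 0.80/0.40).
```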

13 What’s wrong with Precision/Recall?

  Left classifier          True class
  Hypothesized            Pos     Neg
  Yes                     200     100
  No                      300     400
                        P=500   N=500

  Right classifier         True class
  Hypothesized            Pos     Neg
  Yes                     200     100
  No                      300       0
                        P=500   N=100

- Both classifiers obtain the same precision and recall values of 66.7% and 40%.
- They exhibit very different behaviours:
  - Same positive recognition rate.
  - Extremely different negative recognition rates: strong on the left / nil on the right.
- Note: accuracy has no problem catching this!
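The same kind of check for this pair of matrices, as a sketch: precision and recall are identical, while the negative recognition rate (and accuracy) differ sharply.

```python
def summary(tp, fp, fn, tn):
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "TN rate": tn / (fp + tn),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }

print(summary(tp=200, fp=100, fn=300, tn=400))  # precision 0.667, recall 0.40, TN rate 0.80, accuracy 0.60
print(summary(tp=200, fp=100, fn=300, tn=0))    # precision 0.667, recall 0.40, TN rate 0.00, accuracy 0.33
```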

14 What’s wrong with the t-test?

         Fold1  Fold2  Fold3  Fold4  Fold5  Fold6  Fold7  Fold8  Fold9  Fold10
  C1      +5%    -5%    +5%    -5%    +5%    -5%    +5%    -5%    +5%    -5%
  C2     +10%    -5%    -5%     0%     0%     0%     0%     0%     0%     0%

- Classifiers 1 and 2 yield the same mean and very similar confidence intervals.
- Yet, Classifier 1 is relatively stable, while Classifier 2 is not.
- Problem: the t-test assumes a normal distribution. The difference in accuracy between Classifier 2 and the gold standard is not normally distributed.
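A minimal sketch using the fold differences from the table above: both sequences have mean zero and give t-based intervals of similar width, even though their fold-to-fold behaviour is very different. The 95% confidence level is an assumption for illustration.

```python
import numpy as np
from scipy import stats

c1 = np.array([5, -5, 5, -5, 5, -5, 5, -5, 5, -5]) / 100.0
c2 = np.array([10, -5, -5, 0, 0, 0, 0, 0, 0, 0]) / 100.0

for name, d in [("C1", c1), ("C2", c2)]:
    mean = d.mean()
    se = d.std(ddof=1) / np.sqrt(len(d))            # standard error of the mean
    half = stats.t.ppf(0.975, df=len(d) - 1) * se   # 95% two-sided half-width
    print(f"{name}: mean = {mean:+.3f}, 95% CI = [{mean - half:+.3f}, {mean + half:+.3f}]")
```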

15 So what can be done?
- Think about evaluation carefully prior to starting your experiments.
- Use performance measures other than accuracy and precision/recall, e.g., ROC analysis or combinations of measures. Also, think about the best measure for your problem.
- Use re-sampling methods other than cross-validation when necessary: bootstrapping? Randomization?
- Use statistical tests other than the t-test: non-parametric tests; tests appropriate for many classifiers compared on many domains (the t-test is not appropriate for this case, which is the most common one). A sketch of one non-parametric alternative follows.
- We will try to discuss some of these issues next time.
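As one example of a non-parametric alternative to the t-test, here is a minimal sketch of the Wilcoxon signed-rank test on paired per-fold accuracies. The accuracy values are illustrative, not taken from the slides.

```python
import numpy as np
from scipy.stats import wilcoxon

# Illustrative per-fold accuracies for two classifiers over the same 10 folds.
acc_A = np.array([0.81, 0.79, 0.84, 0.80, 0.77, 0.83, 0.82, 0.78, 0.80, 0.81])
acc_B = np.array([0.76, 0.78, 0.80, 0.77, 0.75, 0.79, 0.81, 0.74, 0.78, 0.77])

stat, p_value = wilcoxon(acc_A, acc_B)   # makes no normality assumption
print(f"Wilcoxon signed-rank: statistic = {stat}, p-value = {p_value:.4f}")
```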

