Evaluating Classifiers


1 Evaluating Classifiers
Villanova University Machine Learning Project

2 Evaluating Classifiers
Machine Learning: getting a computer to learn from data. A type of Artificial Intelligence, where a computer does something "intelligent." The ability of a computer to improve what it does in a way that mimics how humans learn, such as through repetition or experience.

3 Examples of Machine Learning
Face detection used by Facebook to help you automatically tag friends. Spam filters that get better over time at identifying and trashing spam emails. Fraud detection that notices suspicious patterns of credit card use so that you get a call. Optical character recognition that reads the numbers written on a check you deposit.

4 Evaluating Classifiers
When we use a machine learning tool to create a classifier, we want to use it to make a decision: Is this spam? Should the car honk? Can my band successfully play this arrangement? We need to know how well it is doing. We also need to know what kind of mistakes it is making. We want to evaluate our classifier. Note that the "car honking" refers to a current (2016) Google machine learning effort.

5 Evaluating Classifiers
In evaluating a classifier we care about two things: how many mistakes it makes, and what kind of mistakes it makes. Accuracy tells us the percent of correct decisions: the system correctly labels 90% of messages as spam or not spam. The types of errors tell us which direction the mistakes go: no legitimate messages are labeled spam; 10% of spam messages are labeled as not spam.

6 Evaluating Classifiers
Confusion Matrix. Typically, in evaluating a classifier we first create a confusion matrix. For a yes/no decision, this gives four kinds of results: True positives: the system says yes, the truth is yes. True negatives: the system says no, the truth is no. False positives: the system says yes, the truth is no. False negatives: the system says no, the truth is yes.

7 Evaluating Classifiers
Confusion Matrix: Is it spam?

                          System says it is spam    System says it is not spam
It is actually spam       A: True positives         B: False negatives
It is actually not spam   C: False positives        D: True negatives

Note that yes and no are arbitrary, and depend on how the question is phrased: is it spam vs. is it good mail? The application usually suggests an obvious direction, depending on the action to be taken. Note that "positive" vs. "negative" is likewise arbitrary.

8 Evaluating Classifiers
Confusion Matrix: Is it spam?

                          System says it is spam    System says it is not spam
It is actually spam       A: 10                     B: 10
It is actually not spam   C: 0                      D: 80

Note that "positive" vs. "negative" is arbitrary.
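As a concrete illustration (not part of the original slides), here is a minimal Python sketch that builds this confusion matrix by counting. The label lists are hypothetical, constructed to reproduce the counts on this slide.

```python
# Hypothetical labels reproducing the slide's counts: A=10, B=10, C=0, D=80.
y_true = ["spam"] * 20 + ["not spam"] * 80
y_pred = ["spam"] * 10 + ["not spam"] * 10 + ["not spam"] * 80

cells = {"A": 0, "B": 0, "C": 0, "D": 0}   # TP, FN, FP, TN
for truth, guess in zip(y_true, y_pred):
    if truth == "spam" and guess == "spam":
        cells["A"] += 1        # true positive
    elif truth == "spam" and guess == "not spam":
        cells["B"] += 1        # false negative
    elif truth == "not spam" and guess == "spam":
        cells["C"] += 1        # false positive
    else:
        cells["D"] += 1        # true negative

print(cells)   # {'A': 10, 'B': 10, 'C': 0, 'D': 80}
```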

9 Evaluating Classifiers
Accuracy: the percent of correctly classified instances; the simplest measure. (True positives + True negatives) / All instances = (A + D) / (A + B + C + D). Other things being equal, higher accuracy is better. But: always choosing the most common class gives you the "majority classifier". 80% of my messages are not spam, so if I label everything "not spam" my accuracy is 80%. Not helpful. You want accuracy better than the majority classifier.
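A short sketch (not from the slides) computing accuracy from the confusion-matrix cells above and comparing it with the majority-classifier baseline:

```python
# Cells from the spam example: TP, FN, FP, TN.
A, B, C, D = 10, 10, 0, 80

accuracy = (A + D) / (A + B + C + D)
majority_baseline = max(A + B, C + D) / (A + B + C + D)   # always guess the larger class

print(f"accuracy          = {accuracy:.0%}")           # 90%
print(f"majority baseline = {majority_baseline:.0%}")  # 80%
```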

10 Evaluating Classifiers
Thought Check. For binary classifiers A and B, on balanced data: Which is better: A is 80% accurate, or B is 60% accurate? Would you use a spam filter that was 80% accurate? 90%? Would you use a classifier for who needs major surgery that was 80% accurate? Would you ever use a two-class classifier that is 50% accurate? Spam filter: probably yes for both, if the false positives are 0; then we never have to look at the spam folder, at least. Surgery: doubtful; this isn't accurate enough for such an important decision. 50%: not normally; guessing would do as well. There are two ideas to bring out here: the acceptable level of accuracy depends on the importance of the decision, and the kind of mistake also matters: screening out 80% of spam is helpful; calling even 1% of real mail spam is not.

11 Evaluation: Precision and Recall
We also need to look at the kinds of errors. The confusion matrix directly identifies false negatives and false positives. False negative: "not spam" is actually spam and gets through the filter. False positive: real mail gets labeled spam and is lost. Two additional measures are useful. Recall: the % of instances in a class which are correctly classified as that class: A/(A+B). Precision: the % of instances classified in a class which are actually in that class: A/(A+C).

12 Evaluation: Precision and Recall
Precision: the % of instances classified in a class which are actually in that class: A/(A+C), i.e., correctly classified as i / total classified as i. For the spam matrix, 10/10: everything we call spam really is spam, so precision is 100%. Recall: the % of instances in a class which are correctly classified as that class: A/(A+B), i.e., correctly classified as i / total which are i. For the spam matrix, 10/20: we have successfully classified half the spam.
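A minimal sketch (not from the slides) of these two formulas applied to the spam matrix, confirming 100% precision and 50% recall:

```python
# Cells from the spam example: TP, FN, FP, TN.
A, B, C, D = 10, 10, 0, 80

precision = A / (A + C)    # of everything labeled spam, how much really is spam
recall    = A / (A + B)    # of all actual spam, how much did we catch

print(f"precision = {precision:.0%}")   # 100% -- no real mail was labeled spam
print(f"recall    = {recall:.0%}")      # 50%  -- half the spam still got through
```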

13 Evaluation: Sample Weka output
Partial output from the Weka Decision Tree machine learning tool. The task is to predict the recurrence of breast cancer based on a set of measurements. There are 286 cases, and 10-fold cross-validation was used. Accuracy is 75.5%. There are 201 cases with no recurrence and 85 with a recurrence. So, treating "no recurrence" as positive, precision is 76% and recall is 96%.
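As a rough consistency check (an approximation, not actual Weka output), the reported figures can be turned back into an approximate confusion matrix; the small gap from 75.5% comes from the rounding in the reported precision and recall.

```python
# Reconstruct an approximate confusion matrix from the reported (rounded) figures,
# treating "no recurrence" as the positive class.
n_pos, n_neg = 201, 85          # actual no-recurrence / recurrence cases
recall, precision = 0.96, 0.76  # reported, rounded

TP = round(recall * n_pos)            # ~193 no-recurrence cases caught
FN = n_pos - TP                       # ~8 missed
FP = round(TP / precision) - TP       # ~61 recurrence cases labeled no-recurrence
TN = n_neg - FP                       # ~24

accuracy = (TP + TN) / (n_pos + n_neg)
print(TP, FN, FP, TN, f"{accuracy:.1%}")   # 193 8 61 24 75.9% -- close to the reported 75.5%
```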

14 Evaluating Classifiers
Thought Check. For binary classifiers A and B, on balanced data: Which is better: A has 80% accuracy or B has 60% accuracy? Which is better: A has 90% precision or B has 70% precision? Would you use a spam filter whose recall was 95%? Whose precision was 95%? Would you use a classifier for who needs major surgery with a false positive rate of 20% and a false negative rate of 0%? (Balanced data means each class has about 50% of the data.) For "which is better": without more information, A is better in both cases. Spam filter recall: without more information, no; this could just mean that we are labeling almost everything spam. Precision: probably, although even at 95% we are going to have to check our spam folder occasionally. Surgery: as a screening device, to be followed by more tests, yes; if it says no, then no more testing is needed. As the final decision, probably not.

15 Evaluation: Overfitting
When we use a machine learning method, we are training a model to predict the classification, based on a training set which already has the answers. The model may capture chance variations in this set. This leads to overfitting: the model is too closely matched to the exact data set it's been given.

16 Evaluation: Overfitting
When we have overfitting, the machine learning method has matched random variation in our training data. When the model is applied to new data, its performance will be much less accurate. This is more likely when there is a small training set or a large number of features.

17 Training Data and Test Data
Overfitting means we do not have a good picture of how well the classifier will perform with new data. So to get a true picture of how good the classifier is, we have to try it out on separate data: test data. This will tell us whether overfitting is a problem, and how well our system will actually do.

18 Evaluating Classifying Systems
Standard methodology: Collect a large set of examples (all with correct classifications). Divide them into two separate sets: training and test. Train the classifier with the training set. Measure performance with the test set. This applies to any machine learning method which is a classifier, i.e., which is designed to separate cases into different groups.
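A sketch of this workflow in Python with scikit-learn (an assumption; the slides use Weka, but the steps are the same). The built-in breast-cancer dataset is only a stand-in labeled dataset, not the Weka dataset from slide 13.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)   # stand-in labeled dataset

# 66% training / 34% test, like Weka's default percentage split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.66, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# A large gap between these two numbers is the signature of overfitting.
print("train accuracy:", accuracy_score(y_train, model.predict(X_train)))
print("test  accuracy:", accuracy_score(y_test, model.predict(X_test)))
```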

19 What do we use for a test set?
We could use all our cases. This is testing on the training data. If the classifier doesn't do well on the training set, then your method isn't working: you don't have the right data, or you're using the wrong machine-learning method. However, this won't give you information about whether it will do as well on new data. You may have a separately created test set. This is expensive, and therefore unusual. It is typically expensive to collect training/test data with the decisions already made, since normally it means that someone must go through each case and make a decision. Separate test sets mostly get used in competitions and research, where groups are challenged to make the best decisions but are only given the training data to work with. Weka refers to a separate test set as a "supplied test set".

20 Evaluating Classifiers
Split Test Sets. More typically, we split the cases we have. Percentage split: randomly choose a subset to be the test cases. Easy, but it will perform poorly if you don't have enough data; it works best with a large number of cases and few features. We want as many training cases as possible: typically 66% to 90% for training. In Weka, if you choose Percentage Split for evaluation, it will use 66% training data by default.

21 Evaluating Classifiers
Cross-Validation. Cross-validation uses the entire data set to train and test the system, but not all at the same time: split the instances into multiple subsets; run the classifier multiple times, using one subset for testing and the rest for training each time; then average the results. All instances are used each time, and each instance is used as a test instance exactly once. The splits are often called folds. 10-fold means: divide the data into 10 sets and run the classifier 10 times, using one set as the test set each time. Weka will default to 10-fold cross-validation if no specific method is chosen.
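A minimal sketch of 10-fold cross-validation with scikit-learn (an assumption; the slides describe Weka's default, which works the same way). The dataset is again only a stand-in.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Train and test 10 times, each fold serving as the test set exactly once.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print("per-fold accuracy:", scores.round(3))
print("mean accuracy:    ", scores.mean().round(3))
```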

22 Evaluating Classifiers
Cross-Validation. Cross-validation is a good way to take maximum advantage of small datasets. However, it is computationally expensive, since the system is trained once for each fold. For very large datasets, a percentage split is a better choice.

23 Evaluating Classifiers
Weka Test Options. For a classifier, Weka offers all of these options; a test option must be chosen to run the classifier. The image shows the defaults. ZeroR is Weka's name for the majority classifier: just choose the most common class for everything. ZeroR is Weka's default classifier.

24 Evaluating Classifiers
How Many Test Cases? We want to train on as many cases as possible, but we need enough test cases to be meaningful, and we need to represent every class in the test set. So we need more test cases if there are a large number of classes or some very infrequent classes. Typically 10% of the cases are used as the test set.

25 One More Point on Evaluating Classifiers
We are training a classifier because there is some task we want to carry out. Is the classifier actually useful? Majority classifier: assign all cases to the most common class. Compare the trained classifier to this. This is especially relevant for very unbalanced classes. Consider classifying x-rays into cancer/non-cancer, with a cancer rate of 5%. We train a classifier and get 95% accuracy. Is this valuable? Expect the discussion to come up with "probably not, but it depends on the confusion matrix"; this brings us back to different kinds of errors.
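A sketch of this comparison using scikit-learn's DummyClassifier as the majority ("ZeroR-style") baseline; the 5%-cancer dataset here is simulated purely for illustration, not real x-ray data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))             # made-up features
y = (rng.random(2000) < 0.05).astype(int)  # ~5% positive (cancer) rate

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
print("majority-classifier accuracy:", baseline.score(X_test, y_test))  # ~0.95

# Any trained classifier has to beat this ~95% -- and, crucially, do it by
# actually catching some of the positive (cancer) cases.
```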

26 So Is My Classifier Good?
We need to decide if the classifier is actually useful. There is no single answer. Some relevant questions: How are decisions made without the classifier? Can we save time or money using it? How important are mistakes? What is the cost of a mistake? What is good enough? Spam filters: useful if there are no false positives, even if there are false negatives. Medical screening test: useful if there are no false negatives, even if there are false positives. Recommending a book: useful with accuracy greater than the majority classifier. Spam filter discussion: as long as we don't have to look at the spam folder, any gain in identifying spam saves work. If we have false positives, we have to look at the filtered messages anyhow, so it doesn't help much. Screening discussion: separate people who are fine from people who need more testing; we do not want to miss a possible problem. Book: the alternative is usually to do nothing, so any recommendation that causes a book to be bought is probably valuable, and mistakes are minor.
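A small sketch (not from the slides) of turning a confusion matrix into the two error rates these "good enough?" questions care about; the counts are the hypothetical spam example from earlier.

```python
TP, FN, FP, TN = 10, 10, 0, 80      # the spam example again

false_positive_rate = FP / (FP + TN)   # real mail wrongly labeled spam
false_negative_rate = FN / (FN + TP)   # spam that slips through

print(f"false positive rate = {false_positive_rate:.0%}")  # 0%  -- safe to skip the spam folder
print(f"false negative rate = {false_negative_rate:.0%}")  # 50% -- still filters half the spam
```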

27 Evaluating Classifiers
Summary. Classifiers should be evaluated for usefulness. Typical measures are accuracy, precision, and recall. The confusion matrix gives information about the kinds of errors. Evaluation should be done using separate test cases to detect overfitting. The usefulness of a classifier depends on its accuracy, the kinds of mistakes it makes, and how it will be used. Weka automatically carries out evaluation of classifiers.

