Evaluating Classifiers


Evaluating Classifiers
Esraa Ali Hassan
Data Mining and Machine Learning Group, Center for Informatics Science, Nile University, Giza, Egypt

Agenda
- Introduction
- Evaluation Strategies
  - Re-substitution
  - Leave one out
  - Hold out
  - Cross Validation
  - Bootstrap Sampling
- Classifier Evaluation Metrics
  - Binary classification confusion matrix
  - Sensitivity and specificity
  - ROC curve
  - Precision & Recall
  - Multiclass classification

Introduction Evaluation aims at selecting the most appropriate learning scheme for a specific problem. We evaluate the scheme's ability to generalize what it has learned from the training set to new, unseen instances.

Evaluation Strategies Different strategies exist for splitting the dataset into training, testing and validation sets. Error on the training data is not a good indicator of performance on future data. Why? Because new data will probably not be exactly the same as the training data! The classifier might be fitting the training data too precisely (called over-fitting).

Re-substitution Testing the model on the same dataset that was used to train it. The re-substitution error rate indicates only how good or bad our results are on the TRAINING data; it expresses some knowledge about the algorithm used. The error on the training data is NOT a good indicator of performance on future data, since it does not measure performance on any not-yet-seen data.

Leave one out Leave one instance out and train the model using all the other instances. Repeat that for all instances and compute the mean error.
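A minimal leave-one-out sketch, assuming scikit-learn and an illustrative k-NN classifier on the iris data (none of these choices come from the slides):

```python
# Leave-one-out: train on all instances but one, test on the held-out one,
# repeat for every instance and average the error.
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
clf = KNeighborsClassifier(n_neighbors=3)

errors = 0
for train_idx, test_idx in LeaveOneOut().split(X):
    clf.fit(X[train_idx], y[train_idx])
    errors += int(clf.predict(X[test_idx])[0] != y[test_idx][0])

print("leave-one-out error rate:", errors / len(X))
```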

Hold out Split the dataset into two subsets: use one for training and the other for validation (suitable for large datasets). Common practice is to train the classifier on two thirds of the dataset and test it on the unseen third. Repeated holdout method: the process is repeated several times with different subsamples, and the error rates of the different iterations are averaged to give an overall error rate.
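A hold-out sketch under the same assumptions (scikit-learn, illustrative dataset and classifier): train on two thirds, test on the unseen third.

```python
# Hold-out: a single random two-thirds / one-third split.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, stratify=y, random_state=0)

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("hold-out accuracy:", clf.score(X_test, y_test))
```

Repeating this with different random_state values and averaging the error rates gives the repeated-holdout estimate.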

K-Fold Cross Validation Divide the data into k equal parts (folds). One fold is used for testing while the others are used for training. The process is repeated for all the folds and the accuracy is averaged. 10 folds are most commonly used. Even better: repeated cross-validation.
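A 10-fold cross-validation sketch (scikit-learn assumed; dataset and classifier are illustrative):

```python
# 10-fold CV: each fold serves as the test set once; accuracies are averaged.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print("per-fold accuracy:", scores)
print("mean accuracy:", scores.mean())
```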

Bootstrap sampling CV uses sampling without replacement: the same instance, once selected, cannot be selected again for a particular training/test set. The bootstrap uses sampling with replacement to form the training set: sample a dataset of n instances n times with replacement to form a new dataset of n instances, use this data as the training set, and use the instances from the original dataset that don't occur in the new training set for testing.
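A bootstrap-sampling sketch (NumPy and scikit-learn assumed; dataset and classifier are illustrative): draw n instances with replacement for training and test on the instances that were never drawn.

```python
# Bootstrap: sample n instances with replacement as the training set,
# evaluate on the out-of-bag instances that were never selected.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
n = len(X)
rng = np.random.default_rng(0)

boot_idx = rng.integers(0, n, size=n)            # n draws with replacement
oob_mask = ~np.isin(np.arange(n), boot_idx)      # instances not in the sample

clf = DecisionTreeClassifier(random_state=0).fit(X[boot_idx], y[boot_idx])
print("out-of-bag accuracy:", clf.score(X[oob_mask], y[oob_mask]))
```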

Are Train/Test sets enough? (Diagram: data with known results is split into a training set fed to the model builder and a testing set whose predictions are evaluated.)

Train, Validation, Test split (Diagram: data with known results is split into a training set for building the model, a validation set for evaluating and tuning it, and a final test set used for the final evaluation of the final model.)

Evaluation Metrics Accuracy: it assumes equal cost for all classes, is misleading on unbalanced datasets, and doesn't differentiate between different types of errors. Ex 1: Cancer dataset: 10,000 instances, 9,990 normal, 10 ill. If our model classified all instances as normal, accuracy would be 99.9 %. Other unbalanced domains: medical diagnosis: 95 % healthy, 5 % disease; e-commerce: 99 % do not buy, 1 % buy; security: 99.999 % of citizens are not terrorists.
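A tiny numeric check of the cancer example (the "classifier" that labels everyone as normal is hypothetical, used only to reproduce the 99.9 % figure):

```python
# Predicting "normal" for all 10,000 patients: 99.9 % accuracy, 0 % recall on the ill.
import numpy as np

y_true = np.array([0] * 9990 + [1] * 10)   # 0 = normal, 1 = ill
y_pred = np.zeros_like(y_true)             # always predict "normal"

accuracy = (y_pred == y_true).mean()
recall_ill = (y_pred[y_true == 1] == 1).mean()
print(f"accuracy = {accuracy:.3f}, recall on ill patients = {recall_ill:.1f}")
```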

Binary classification Confusion Matrix

                   Predicted positive    Predicted negative
Actual positive    True Positive (TP)    False Negative (FN)
Actual negative    False Positive (FP)   True Negative (TN)

Binary classification Confusion Matrix True Positive Rate: TPR = TP / (TP + FN). False Positive Rate: FPR = FP / (FP + TN). Overall success rate (Accuracy) = (TP + TN) / (TP + TN + FP + FN). Error rate = 1 - success rate.
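A sketch of how these rates come out of the binary confusion matrix (scikit-learn assumed; the labels are made-up toy data):

```python
# Derive TPR, FPR, accuracy and error rate from the confusion-matrix counts.
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

tpr = tp / (tp + fn)                       # true positive rate (sensitivity, recall)
fpr = fp / (fp + tn)                       # false positive rate (1 - specificity)
accuracy = (tp + tn) / (tp + tn + fp + fn)
error_rate = 1 - accuracy
print(tpr, fpr, accuracy, error_rate)
```

The same counts give the sensitivity and specificity of the next slide: sensitivity = TPR and specificity = TN / (TN + FP) = 1 - FPR.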

Sensitivity & Specificity Sensitivity measures the classifier's ability to detect positive classes (its positivity): Sensitivity = TP / (TP + FN). A very sensitive classifier will find a number of false positives, but no true positives will be missed. The maximum value for sensitivity is 1 (very sensitive classifier) and the minimum is 0 (the classifier never predicts positive values). Specificity measures how accurate the classifier is in not detecting too many false positives (its negativity); it is calculated as the percentage of true negatives out of all negative instances: Specificity = TN / (TN + FP). The maximum value for specificity is 1 (very specific, no false positives) and the minimum is 0 (the classifier labels everything as positive). Tests with high specificity are used to confirm the results of a sensitive test.

ROC Curves Stands for "receiver operating characteristic". Used in signal detection to show the tradeoff between hit rate and false alarm rate over a noisy channel. The y axis shows the percentage of true positives in the sample; the x axis shows the percentage of false positives. Jagged curve: one set of test data. Smooth curve: use cross-validation.

Moving threshold (figure: overlapping score distributions for class 1, positives, and class 0, negatives, around the decision threshold). Identify a threshold in your classifier that you can shift, and plot the ROC curve while you shift that parameter.

ROC curve (Cont'd) Overall performance is measured by the area under the ROC curve (AUC). An area of 1 represents a perfect test; an area of .5 represents a worthless test: .90-1 = excellent, .80-.90 = good, .70-.80 = fair, .60-.70 = poor, .50-.60 = fail.
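A ROC/AUC sketch, assuming scikit-learn, a synthetic imbalanced dataset and a logistic-regression scorer (all illustrative choices):

```python
# Sweep the decision threshold over predicted scores; AUC summarizes the curve.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

scores = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
fpr, tpr, thresholds = roc_curve(y_te, scores)   # one (FPR, TPR) point per threshold
print("AUC:", roc_auc_score(y_te, scores))       # 1.0 = perfect, 0.5 = worthless
```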

ROC curve characteristics

ROC curve (Cont'd) For example, M1 is more conservative (more specific and less sensitive), meaning that it tends to produce few false positives and more false negatives, whereas M2 is more liberal (more sensitive and less specific), meaning that it tends to produce many false positives with few false negatives (it tends to classify everything as positive). For a small, focused sample, M1 is the better choice; for a larger one, M2 might be the better model; in between, choose between M1 and M2 with appropriate probabilities. Ultimately the choice depends on the application in which these models are applied.

Recall & Precision These measures are used by information retrieval researchers to measure the accuracy of a search engine: they define recall as (number of relevant documents retrieved) divided by (total number of relevant documents), and precision as (number of relevant documents retrieved) divided by (total number of documents retrieved). For a classifier: Recall = TP / (TP + FN), Precision = TP / (TP + FP).

F-measure The F-measure is the harmonic mean (the appropriate average for rates) of precision and recall, taking both measures into account: F = 2 * Precision * Recall / (Precision + Recall). It is biased towards all cases except the true negatives, which appear in neither precision nor recall.
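A precision/recall/F-measure sketch on the same toy labels used earlier (scikit-learn assumed):

```python
# Precision, recall and their harmonic mean (F1).
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

p = precision_score(y_true, y_pred)   # TP / (TP + FP)
r = recall_score(y_true, y_pred)      # TP / (TP + FN)
f = f1_score(y_true, y_pred)          # 2 * p * r / (p + r)
print(f"precision={p:.2f} recall={r:.2f} F1={f:.2f}")
```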

Notes on Metrics As we can see, True Positive rate = Recall = Sensitivity; all measure how good the classifier is at finding true positives. When the FP rate increases, specificity and precision decrease, and vice versa. This doesn't mean that specificity and precision are correlated: for example, on unbalanced datasets the precision can be very low while the specificity is high, because the number of instances in the negative class is much higher than the number of positive instances.

Multiclass classification For a multiclass prediction task, the result is usually displayed in a confusion matrix with a row and a column for each class. Each matrix element shows the number of test instances for which the actual class is the row and the predicted class is the column. Good results correspond to large numbers down the diagonal and small values (ideally zero) in the rest of the matrix. Taking class a as the class of interest, the cells are labelled:

                  Classified as a   Classified as b   Classified as c
Actual class a    TPaa              FNab              FNac
Actual class b    FPab              TNbb              FNbc
Actual class c    FPac              FNcb              TNcc

Multiclass classification (Cont'd) For example, in a three-class task {a, b, c} with the confusion matrix on the previous slide, if we select a to be the class of interest, then TP = TPaa, FN = FNab + FNac, FP = FPab + FPac, and TN is everything else. Note that we don't care about the values FNcb and FNbc, as we are concerned with evaluating how the classifier performs on class a, so misclassifications between the other classes are outside our interest.
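A sketch of pulling per-class counts out of a three-class confusion matrix, with class a as the class of interest (scikit-learn assumed; the toy labels are made up):

```python
# Rows are actual classes, columns are predicted classes.
from sklearn.metrics import confusion_matrix

y_true = ["a"] * 5 + ["b"] * 5 + ["c"] * 5
y_pred = ["a", "a", "a", "b", "c",
          "a", "b", "b", "b", "c",
          "a", "b", "c", "c", "c"]
cm = confusion_matrix(y_true, y_pred, labels=["a", "b", "c"])

i = 0                                 # index of class "a"
tp = cm[i, i]
fn = cm[i, :].sum() - tp              # actual a predicted as b or c
fp = cm[:, i].sum() - tp              # actual b or c predicted as a
tn = cm.sum() - tp - fn - fp          # everything else, incl. b/c confusions
print(tp, fp, fn, tn)
```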

Multiclass classification (Cont'd) To evaluate the overall performance of the classifier, we average the per-class metrics. Averaged per category (macro average): gives equal weight to each class, including rare ones.

Multiclass classification (Cont'd) Micro average: obtained by pooling the true positives (TP), false positives (FP), and false negatives (FN) over all classes; the micro-averaged F-measure is the harmonic mean of the micro-averaged precision and recall. The micro average gives equal weight to each sample regardless of its class, so it is dominated by the classes with the largest number of samples.
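A macro vs. micro averaging sketch (scikit-learn assumed; the class distribution is made up to show how the micro average is pulled toward the majority class):

```python
# Macro F1 weights each class equally; micro F1 weights each sample equally.
from sklearn.metrics import f1_score

y_true = ["a", "a", "a", "a", "a", "a", "b", "b", "c", "c"]
y_pred = ["a", "a", "a", "a", "a", "b", "b", "c", "c", "a"]

print("macro F1:", f1_score(y_true, y_pred, average="macro"))
print("micro F1:", f1_score(y_true, y_pred, average="micro"))
```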

Example: Area under ROC

Example: F-Measure

Thank you :-)