Supervised classification performance (prediction) assessment. Dr. Huiru Zheng, Dr. Francisco Azuaje. School of Computing and Mathematics, Faculty of Engineering, University of Ulster, N. Ireland, UK

Building prediction models. Different models, tools and applications; the problem of prediction (classification). [Diagram: data describing an event, process or condition (its properties) feed a prediction model, which outputs predictions such as a category, values, an action or a response.]

Building prediction models: supervised learning methods. Training phase: a set of cases and their respective labels are used to build a classification model. [Diagram: prediction model takes cases A with labels C and produces predicted labels C'.]

Test phase: the trained classifier is used to predict new cases. [Diagram: prediction model maps new cases A, with their labels (C) withheld, to predicted labels C*.] Prediction models, such as ANNs, aim to achieve an ability to generalise: the capacity to correctly classify cases or problems unseen during training. Quality indicator: accuracy during the test phase.

Building prediction models: assessing their quality. A classifier will be able to generalise if: a) its architecture and learning parameters have been properly defined, and b) enough training data are available. The second condition is often difficult to meet due to resource and time constraints. Key limitations appear when dealing with small data samples, a common feature of many studies. A small test data set may also lead to an inaccurate performance assessment.

Key questions: How to measure classification quality? How should training and test cases be selected? How many experiments are needed? How to estimate prediction accuracy? What are the effects of small versus large data sets?

What is Accuracy?

Accuracy = (No. of correct predictions) / (No. of predictions) = (TP + TN) / (TP + TN + FP + FN)
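
As a minimal, hypothetical sketch (not part of the original slides), accuracy can be computed directly from the four confusion-matrix counts; the example numbers below are made up for illustration.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)

# Illustrative counts: 40 TP, 45 TN, 5 FP, 10 FN out of 100 predictions.
print(accuracy(tp=40, tn=45, fp=5, fn=10))  # 0.85
```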

Examples (1). Clearly, B, C and D are all better than A. Is B better than C and D? Is C better than B and D? Is D better than B and C? Accuracy may not tell the whole story.

Examples (2). Clearly, D is better than A. Is B better than A, C and D?

What is Sensitivity (recall)? Sensitivity = (No. of correct positive predictions) / (No. of positives) = TP / (TP + FN). This is the true positive rate; the true negative rate, TN / (TN + FP), is termed specificity.
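
A similar sketch (again illustrative, not from the slides) for sensitivity and specificity, reusing the same hypothetical counts:

```python
def sensitivity(tp, fn):
    """True positive rate (recall): correct positive predictions / all positives."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """True negative rate: correct negative predictions / all negatives."""
    return tn / (tn + fp)

print(sensitivity(tp=40, fn=10))  # 0.8
print(specificity(tn=45, fp=5))   # 0.9
```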

What is Precision? Precision = (No. of correct positive predictions) / (No. of positive predictions) = TP / (TP + FP), computed with respect to positives.
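
And the corresponding sketch for precision, with the same hypothetical counts:

```python
def precision(tp, fp):
    """Correct positive predictions / all positive predictions."""
    return tp / (tp + fp)

print(precision(tp=40, fp=5))  # ~0.889
```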

Precision-Recall Trade-off. A predicts better than B if A has better recall and precision than B. There is a trade-off between recall and precision: in some applications, once you reach a satisfactory precision, you optimize for recall; in others, once you reach a satisfactory recall, you optimize for precision. [Plot: precision vs. recall.]

Comparing Prediction Performance. Accuracy is the obvious measure, but it conveys the right intuition only when the positive and negative populations are roughly equal in size. Recall and precision together form a better measure, but what do you do when A has better recall than B and B has better precision than A?

Some alternative measures: the F-measure, the harmonic mean of recall and precision; adjusted accuracy, a weighted combination of sensitivity and specificity; and the ROC curve (Receiver Operating Characteristic analysis). F = (2 * recall * precision) / (recall + precision), computed with respect to positives.
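
A small sketch of the F-measure as defined above; the recall and precision values are the illustrative ones from the earlier examples, not results from the slides.

```python
def f_measure(recall, precision):
    """Harmonic mean of recall and precision."""
    return 2 * recall * precision / (recall + precision)

print(f_measure(recall=0.8, precision=40 / 45))  # ~0.84
```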

Adjusted Accuracy. Weigh by the importance of the classes: adjusted accuracy = α * Sensitivity + β * Specificity, where α + β = 1; typically α = β = 0.5. But how should the values of α and β be chosen?
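
A direct translation of the adjusted-accuracy formula into a sketch; the weights alpha and beta are the user's choice, as the slide notes.

```python
def adjusted_accuracy(sens, spec, alpha=0.5, beta=0.5):
    """Weighted combination of sensitivity and specificity; alpha + beta must be 1."""
    assert abs(alpha + beta - 1.0) < 1e-9
    return alpha * sens + beta * spec

print(adjusted_accuracy(sens=0.8, spec=0.9))  # 0.85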

ROC Curves. By changing the threshold t, we get a range of sensitivities and specificities for a classifier. A predicts better than B if A has better sensitivity than B at most specificities. This leads to the ROC curve, which plots sensitivity vs. (1 - specificity); the larger the area under the ROC curve, the better. [Plot axes: sensitivity vs. 1 - specificity.]
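
A sketch of the threshold sweep described above, assuming a classifier that outputs a score per case; the scores and labels below are synthetic, and the trapezoidal rule is one common way to approximate the area under the curve.

```python
import numpy as np

# Illustrative scores and true labels (1 = positive, 0 = negative); both hypothetical.
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2])
labels = np.array([1, 1, 0, 1, 0, 1, 0, 0])

# Sweep the threshold t from high to low and record (1 - specificity, sensitivity).
tpr, fpr = [], []
for t in np.sort(np.unique(scores))[::-1]:
    pred = scores >= t
    tp = np.sum(pred & (labels == 1))
    fp = np.sum(pred & (labels == 0))
    tpr.append(tp / np.sum(labels == 1))   # sensitivity
    fpr.append(fp / np.sum(labels == 0))   # 1 - specificity

# Area under the ROC curve via the trapezoidal rule, anchored at (0, 0).
auc = np.trapz([0.0] + tpr, [0.0] + fpr)
print(auc)  # ~0.81 for this toy example
```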

Key questions: How to measure classification quality? How should training and test cases be selected? How many experiments are needed? How to estimate prediction accuracy? What are the effects of small versus large data sets?

Data sampling techniques. Main goals: reduction of the estimation bias, and reduction of the variance introduced by a small data set. (Otherwise the estimate may be too optimistic or too conservative.)

Other important goals: a) to establish differences between data sampling techniques when applied to small and larger datasets, b) to study the response of these methods to the size and number of train-test sets, and c) to discuss criteria for the selection of sampling techniques.

Three data sampling techniques: cross-validation, leave-one-out and the bootstrap.

k-fold cross-validation. N samples are split into p training samples and q test samples (q = N - p). The data are randomly divided into training and test sets; this process is repeated k times, and the classification performance is the average of the individual test estimates. [Diagram: Experiments 1 to k, each using a different random split of the N cases.]
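
What the slide describes is a repeated random train/test split, with performance averaged over the k repetitions. A sketch under that reading follows; `train_and_score` is a hypothetical callback that trains a classifier on the training portion and returns its test accuracy.

```python
import random

def repeated_random_split(cases, labels, k, test_fraction, train_and_score):
    """Average test accuracy over k random train/test splits of the N cases."""
    n = len(cases)
    q = int(round(n * test_fraction))   # q test cases, p = n - q training cases
    accuracies = []
    for _ in range(k):
        idx = list(range(n))
        random.shuffle(idx)
        test_idx, train_idx = idx[:q], idx[q:]
        accuracies.append(train_and_score(
            [cases[i] for i in train_idx], [labels[i] for i in train_idx],
            [cases[i] for i in test_idx], [labels[i] for i in test_idx]))
    return sum(accuracies) / k
```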

k-fold cross-validation. The classifier may not be able to accurately predict new cases if the amount of data used for training is too small. At the same time, the quality assessment may not be accurate if the portion of data used for testing is too small. [Diagram: splitting procedure dividing the data into p% training and q% test portions.]

The Leave-One-Out Method. Given N cases available in a dataset, a classifier is trained on (N-1) cases and then tested on the case that was left out. This is repeated N times, until every case in the dataset has been included once as a cross-validation instance. The results are averaged across the N test cases to estimate the classifier's prediction performance. [Diagram: Experiments 1 to N, each holding out a different case.]
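
A sketch of the leave-one-out procedure, assuming `cases` and `labels` are plain Python lists and `train_and_predict` is a hypothetical callback that trains on N-1 cases and returns a prediction for the held-out case.

```python
def leave_one_out(cases, labels, train_and_predict):
    """Train on N-1 cases, test on the held-out case; repeat for every case."""
    n = len(cases)
    correct = 0
    for i in range(n):
        train_x = cases[:i] + cases[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        prediction = train_and_predict(train_x, train_y, cases[i])
        correct += int(prediction == labels[i])
    return correct / n   # accuracy averaged over the N single-case tests
```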

The Bootstrap Method. A training dataset is generated by sampling with replacement N times from the available N cases. The classifier is trained on this set and then tested on the original dataset. This process is repeated several times, and the classifier's accuracy estimate is the average of these individual estimates. [Diagram: from a dataset of Cases 1-5, Training (1) is drawn with replacement (e.g. containing Cases 1, 3 and 5, some possibly repeated), while Test (1) is the original Cases 1-5.]
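
A sketch of the bootstrap variant described here, in which each classifier is trained on a resample drawn with replacement and tested on the original dataset (other bootstrap variants test only on the out-of-bag cases); `train_and_score` is again a hypothetical callback.

```python
import random

def bootstrap_accuracy(cases, labels, runs, train_and_score):
    """Average accuracy of classifiers trained on bootstrap resamples
    and tested on the original dataset."""
    n = len(cases)
    estimates = []
    for _ in range(runs):
        idx = [random.randrange(n) for _ in range(n)]   # sample with replacement
        train_x = [cases[i] for i in idx]
        train_y = [labels[i] for i in idx]
        estimates.append(train_and_score(train_x, train_y, cases, labels))
    return sum(estimates) / runs
```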

An example. 88 cases categorised into four classes: Ewing family of tumors (EWS, 30), rhabdomyosarcoma (RMS, 11), Burkitt lymphomas (BL, 19) and neuroblastomas (NB, 28). Each case is represented by the expression values of 2308 genes with suspected roles in processes relevant to these tumors. PCA was applied to reduce the dimensionality of the cases, and the 10 dominant components per case were used to train the networks. All of the classifiers (BP-ANNs) were trained using the same learning parameters. The BP-ANN architectures comprised 10 input nodes, 8 hidden nodes and 4 output nodes; each output node encodes one of the tumor classes.
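
A hedged sketch of the kind of pipeline described (PCA down to 10 components feeding a network with 10 inputs, 8 hidden nodes and 4 outputs). scikit-learn's MLPClassifier stands in for the BP-ANN used in the study, and X and y below are random placeholders for the 88 x 2308 expression matrix and the four tumour labels; this is not the original implementation or data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

# Placeholders only: in the study, X is the 88 x 2308 gene-expression matrix
# and y holds the four tumour labels (EWS, RMS, BL, NB).
rng = np.random.default_rng(0)
X = rng.normal(size=(88, 2308))
y = rng.choice(["EWS", "RMS", "BL", "NB"], size=88)

# PCA reduces each case to its 10 dominant components; the network then has
# 10 inputs, one hidden layer of 8 nodes and 4 outputs (one per tumour class).
model = make_pipeline(
    PCA(n_components=10),
    MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0),
)
model.fit(X, y)
print(model.score(X, y))   # resubstitution accuracy on the placeholder data
```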

Analysing the k-fold cross-validation. The cross-validation results were analysed for three different data splitting methods: a) 50% of the available cases were used for training the classifiers and the remaining 50% for testing, b) 75% for training and 25% for testing, and c) 95% for training and 5% for testing.

Tumour classification: cross-validation method based on a 50%-50% splitting. A: 10 train-test runs, B: 25 train-test runs, C: 50 train-test runs, D: 100 train-test runs, E: 500 train-test runs (interval size equal to 0.01), F: 1000 train-test runs, G: 2000 train-test runs, H: 3000 train-test runs, I: 4000 train-test runs, J: 5000 train-test runs.

Tumour classification: cross-validation method based on a 75%-25% splitting. A: 10 train-test runs, B: 25 train-test runs, C: 50 train-test runs, D: 100 train-test runs, E: 500 train-test runs, F: 1000 train-test runs (interval size equal to 0.01), G: 2000 train-test runs, H: 3000 train-test runs, I: 4000 train-test runs, J: 5000 train-test runs.

Tumour classification: cross-validation method based on a 95%-5% splitting. A: 10 train-test runs, B: 25 train-test runs, C: 50 train-test runs, D: 100 train-test runs, E: 500 train-test runs, F: 1000 train-test runs, G: 2000 train-test runs, H: 3000 train-test runs, I: 4000 train-test runs, J: 5000 train-test runs (interval size equal to 0.01).

Tumour classification. The 50%-50% cross-validation produced the most conservative accuracy estimates, and the 95%-5% cross-validation produced the most optimistic ones. The leave-one-out method produced the highest accuracy estimate for this dataset (0.79). Higher accuracy estimates may be linked to increases in the size of the training datasets.

Tumour classification: bootstrap method. A: 100 train-test runs, B: 200 train-test runs, C: 300 train-test runs, D: 400 train-test runs, E: 500 train-test runs, F: 600 train-test runs, G: 700 train-test runs, H: 800 train-test runs, I: 900 train-test runs (interval size equal to 0.01), J: 1000 train-test runs.

Final remarks. The problem of estimating prediction quality should be carefully addressed and deserves further investigation. Sampling techniques can be implemented to assess the classification quality factors (such as accuracy) of classifiers (such as ANNs). In general there is variability among the three techniques. These experiments suggest that it is possible to achieve lower-variance estimates for different numbers of train-test runs.

Final remarks (II). Furthermore, one may identify conservative and optimistic accuracy predictors, whose overall estimates may be significantly different. This effect is more distinguishable in small-sample applications. The predicted accuracy of a classifier is generally proportional to the size of the training dataset. The bootstrap method may be applied to generate conservative and robust accuracy estimates based on a relatively small number of train-test experiments.

Final remarks (III). This presentation highlights the importance of applying more rigorous procedures to the selection of data and to the assessment of classification quality. In general, the application of more than one sampling technique may provide the basis for accurate and reliable predictions.