Supervised classification performance (prediction) assessment
Dr. Huiru Zheng, Dr. Francisco Azuaje
School of Computing and Mathematics, Faculty of Engineering, University of Ulster, N. Ireland, UK
Building prediction models
Different models, tools and applications. The problem of prediction (classification).
[Diagram: Data → Prediction model → Predictions. Labels: Event, Process, Condition, Properties, Category, Values, Action, Response.]
Building prediction models: supervised learning methods
Training phase: a set of cases and their respective labels are used to build a classification model.
[Diagram: (A, C) → Prediction model → C’; (A, C) → Prediction model → C*]
Test phase: the trained classifier is used to predict new cases.
[Diagram: (A, (C)) → Prediction model → C*]
Prediction models, such as ANNs, aim to achieve an ability to generalise: the capacity to correctly classify cases or problems unseen during training.
Quality indicator: accuracy during the test phase.
Building prediction models: assessing their quality
A classifier will be able to generalise if: a) its architecture and learning parameters have been properly defined, and b) enough training data are available. The second condition is difficult to achieve due to resource and time constraints. Key limitations appear when dealing with small data samples, which are common in many studies. A small test data set may contribute to an inaccurate performance assessment.
Key questions
How to measure classification quality?
How can I select training and test cases?
How many experiments?
How to estimate prediction accuracy?
Effects on small versus large data sets?
What is Accuracy?
$$\text{Accuracy} = \frac{\text{No. of correct predictions}}{\text{No. of predictions}} = \frac{TP + TN}{TP + TN + FP + FN}$$
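As a minimal sketch of the accuracy formula, the function below takes the four confusion-matrix counts: true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN). The function name and the example counts are hypothetical and chosen only for illustration.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all predictions that are correct."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("no predictions to score")
    return (tp + tn) / total


# Hypothetical counts: 40 true positives, 45 true negatives,
# 5 false positives and 10 false negatives (85 correct out of 100).
print(accuracy(tp=40, tn=45, fp=5, fn=10))
```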
Examples (1)
Clearly, B, C, D are all better than A. Is B better than C, D? Is C better than B, D? Is D better than B, C?
Accuracy may not tell the whole story.
Examples (2)
Clearly, D is better than A. Is B better than A, C, D?
What is Sensitivity (recall)?
$$\text{Sensitivity} = \frac{\text{No. of correct positive predictions}}{\text{No. of positives}} = \frac{TP}{TP + FN}$$
Sensitivity is also known as the true positive rate. The true negative rate is termed specificity.
What is Precision?
$$\text{Precision} = \frac{\text{No. of correct positive predictions}}{\text{No. of positive predictions}} = \frac{TP}{TP + FP} \quad \text{(wrt positives)}$$
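The three rates defined so far can be sketched side by side. Specificity is computed here as TN / (TN + FP), the usual definition of the true negative rate. The helper names and the counts are hypothetical.

```python
def _ratio(numerator: int, denominator: int) -> float:
    """Divide, but fail loudly when the measure is undefined."""
    if denominator == 0:
        raise ValueError("measure undefined: denominator is zero")
    return numerator / denominator


def sensitivity(tp: int, fn: int) -> float:
    """True positive rate (recall): TP / (TP + FN)."""
    return _ratio(tp, tp + fn)


def specificity(tn: int, fp: int) -> float:
    """True negative rate: TN / (TN + FP)."""
    return _ratio(tn, tn + fp)


def precision(tp: int, fp: int) -> float:
    """Fraction of positive predictions that are correct: TP / (TP + FP)."""
    return _ratio(tp, tp + fp)


# Hypothetical counts for illustration only.
tp, tn, fp, fn = 40, 45, 5, 10
print("sensitivity:", sensitivity(tp, fn))
print("specificity:", specificity(tn, fp))
print("precision:  ", precision(tp, fp))
```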
Precision-Recall Trade-off
A predicts better than B if A has better recall and precision than B.
There is a trade-off between recall and precision.
In some applications, once you reach a satisfactory precision, you optimize for recall.
In some applications, once you reach a satisfactory recall, you optimize for precision.
[Figure: trade-off curve; axes: recall, precision.]
Comparing Prediction Performance
Accuracy is the obvious measure
– But it conveys the right intuition only when the positive and negative populations are roughly equal in size
Recall and precision together form a better measure
– But what do you do when A has better recall than B and B has better precision than A?
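A small worked example shows why accuracy can mislead when the classes are unbalanced. The test set and the always-negative classifier below are hypothetical. The classifier reaches high accuracy even though it never finds a positive case.

```python
# Hypothetical, heavily imbalanced test set: 10 positives, 990 negatives.
actual = [1] * 10 + [0] * 990
# A trivial classifier that labels every case as negative.
predicted = [0] * 1000

pairs = list(zip(actual, predicted))
tp = sum(1 for a, p in pairs if a == 1 and p == 1)
tn = sum(1 for a, p in pairs if a == 0 and p == 0)
fp = sum(1 for a, p in pairs if a == 0 and p == 1)
fn = sum(1 for a, p in pairs if a == 1 and p == 0)

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)

print(f"accuracy = {accuracy:.2f}")  # 990 of the 1000 cases are correct
print(f"recall   = {recall:.2f}")    # none of the 10 positives is found
```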
Some Alternate measures
F-Measure: take the harmonic mean of recall and precision
$$F = \frac{2 \times \text{recall} \times \text{precision}}{\text{recall} + \text{precision}} \quad \text{(wrt positives)}$$
Adjusted accuracy: weigh by the importance of the classes
ROC curve: Receiver Operating Characteristic analysis
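The F-measure can be sketched as below. Returning 0 when both recall and precision are 0 is a convention assumed here, not something stated on the slide.

```python
def f_measure(recall: float, precision: float) -> float:
    """Harmonic mean of recall and precision."""
    if recall + precision == 0:
        # Assumed convention: report 0 when both inputs are 0.
        return 0.0
    return 2 * recall * precision / (recall + precision)


# Hypothetical classifier: perfect precision but only half the positives found.
# The harmonic mean stays closer to the weaker of the two values.
print(f_measure(recall=0.5, precision=1.0))
```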
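Adjusted Accuracy
Weigh by the importance of the classes:
$$\text{Adjusted accuracy} = \alpha \times \text{Sensitivity} + \beta \times \text{Specificity}$$
where $\alpha + \beta = 1$; typically $\alpha = \beta = 0.5$.
But which values should be used for $\alpha$ and $\beta$?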
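A minimal sketch of the weighted measure follows. The two class weights are named `alpha` and `beta` here as an assumption. Only `alpha` is passed in, so that the two weights always sum to 1.

```python
def adjusted_accuracy(sensitivity: float, specificity: float,
                      alpha: float = 0.5) -> float:
    """Weighted sum of sensitivity and specificity; the weights sum to 1."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    beta = 1.0 - alpha  # weight given to the negative class
    return alpha * sensitivity + beta * specificity


# Hypothetical always-negative classifier: it finds no positives
# (sensitivity 0) and rejects every negative (specificity 1).
print(adjusted_accuracy(sensitivity=0.0, specificity=1.0))
```

With equal weights, the always-negative classifier scores only 0.5, however unbalanced the classes are.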
ROC Curves
By changing the decision threshold t, we get a range of sensitivities and specificities of a classifier.
A predicts better than B if A has better sensitivities than B at most specificities.
This leads to the ROC curve, which plots sensitivity vs. (1 – specificity).
The larger the area under the ROC curve, the better.
[Figure: ROC curve; axes: sensitivity, 1 – specificity.]
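The threshold sweep can be sketched as follows, assuming the classifier outputs a numeric score per case and a case is called positive when its score is at least t. The scores and labels are hypothetical. The area is approximated with the trapezoidal rule.

```python
def roc_points(scores, labels):
    """Return (1 - specificity, sensitivity) pairs, one per threshold t."""
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative case")
    points = [(0.0, 0.0)]  # threshold above every score: nothing is positive
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as x-sorted (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical scores (higher means more likely positive) and true labels.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]
curve = roc_points(scores, labels)
print(curve)
print("area under ROC curve:", area_under_curve(curve))
```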
Key questions
How to measure classification quality?
How can I select training and test cases?
How many experiments?
How to estimate prediction accuracy?
Effects on small versus large data sets?
Data sampling techniques
Main goals:
Reduction of the estimation bias
Reduction of the variance introduced by a small data set
[Diagram labels: Too optimistic, Too conservative]
Other important goals
a) to establish differences between data sampling techniques when applied to small and larger datasets, b) to study the response of these methods to the size and number of train-test sets, and c) to discuss criteria for the selection of sampling techniques.
Three Data Sampling Techniques
cross-validation
leave-one-out
bootstrap
k-fold cross validation
N samples, p training samples, q test samples (q = N – p).
The data are randomly divided into training and test sets. This process is repeated k times, and the classification performance is the average of the individual test estimates.
[Diagram: the N samples are re-split for Experiment 1, Experiment 2, ..., Experiment k.]
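The procedure as described (k independent random train-test splits, then averaging) can be sketched as below. This form is often called repeated random sub-sampling. The nearest-centroid classifier is a toy stand-in rather than the classifier used in this study, and the data, function names and seed are hypothetical.

```python
import random
from statistics import mean


def nearest_centroid_fit(xs, ys):
    """Toy stand-in classifier: predict the class whose mean is closest."""
    centroids = {c: mean(x for x, y in zip(xs, ys) if y == c) for c in set(ys)}
    return lambda x: min(centroids, key=lambda c: abs(x - centroids[c]))


def random_split_estimate(xs, ys, train_fraction, k, fit, seed=0):
    """Average test accuracy over k random train-test splits."""
    rng = random.Random(seed)
    n = len(xs)
    p = int(round(train_fraction * n))  # number of training samples
    if not 0 < p < n:
        raise ValueError("both the training and the test set must be non-empty")
    accuracies = []
    for _ in range(k):
        indices = list(range(n))
        rng.shuffle(indices)
        train, test = indices[:p], indices[p:]
        predict = fit([xs[i] for i in train], [ys[i] for i in train])
        correct = sum(predict(xs[i]) == ys[i] for i in test)
        accuracies.append(correct / len(test))
    return mean(accuracies)


# Hypothetical one-dimensional data with two well-separated classes.
xs = [1.0, 1.2, 0.8, 1.1, 0.9, 1.3, 5.0, 5.2, 4.9, 5.1, 4.8, 5.3]
ys = [0] * 6 + [1] * 6
print(random_split_estimate(xs, ys, train_fraction=0.75, k=50,
                            fit=nearest_centroid_fit))
```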
k-fold cross validation: splitting procedure
The classifier may not be able to accurately predict new cases if the amount of data used for training is too small. At the same time, the quality assessment may not be accurate if the portion of data used for testing is too small.
[Diagram: p% training, q% test; which proportions to use is the open question.]
The Leave-One-Out Method
Given N cases available in a dataset, a classifier is trained on (N-1) cases and then tested on the case that was left out. This is repeated N times, until every case in the dataset has been included once as a cross-validation instance. The results are averaged across the N test cases to estimate the classifier’s prediction performance.
[Diagram: Experiment 1, Experiment 2, ..., Experiment N over the N cases.]
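A minimal sketch of leave-one-out follows. The nearest-centroid classifier is again a toy stand-in, and the data and function names are hypothetical.

```python
from statistics import mean


def nearest_centroid_fit(xs, ys):
    """Toy stand-in classifier: predict the class whose mean is closest."""
    centroids = {c: mean(x for x, y in zip(xs, ys) if y == c) for c in set(ys)}
    return lambda x: min(centroids, key=lambda c: abs(x - centroids[c]))


def leave_one_out_estimate(xs, ys, fit):
    """Train on N-1 cases, test on the held-out case, repeat for all N."""
    n = len(xs)
    correct = 0
    for i in range(n):
        # Every case except case i forms the training set.
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        predict = fit(train_x, train_y)
        correct += predict(xs[i]) == ys[i]
    return correct / n


# Hypothetical one-dimensional data with two well-separated classes.
xs = [1.0, 1.2, 0.8, 1.1, 5.0, 5.2, 4.9, 5.1]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
print(leave_one_out_estimate(xs, ys, nearest_centroid_fit))
```

Unlike random splitting, the result does not depend on a random seed: there is exactly one way to leave each case out.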
The Bootstrap Method
A training dataset is generated by sampling with replacement N times from the available N cases. The classifier is trained on this set and then tested on the original dataset. This process is repeated several times, and the classifier’s accuracy estimate is the average of these individual estimates.
[Diagram: Data: Cases 1 to 5. Training (1): Cases 1, 3, 5. Test (1): Cases 1 to 5.]
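The bootstrap procedure as described here (train on a resample, test on the original dataset) can be sketched as below. Other bootstrap variants test only on the cases left out of the resample; that is not what this sketch does. The classifier is a toy stand-in, and the data, names and seed are hypothetical.

```python
import random
from statistics import mean


def nearest_centroid_fit(xs, ys):
    """Toy stand-in classifier: predict the class whose mean is closest."""
    centroids = {c: mean(x for x, y in zip(xs, ys) if y == c) for c in set(ys)}
    return lambda x: min(centroids, key=lambda c: abs(x - centroids[c]))


def bootstrap_estimate(xs, ys, runs, fit, seed=0):
    """Average accuracy over `runs` bootstrap train-test experiments."""
    rng = random.Random(seed)
    n = len(xs)
    accuracies = []
    for _ in range(runs):
        # Draw N case indices with replacement to form the training set.
        sample = [rng.randrange(n) for _ in range(n)]
        predict = fit([xs[i] for i in sample], [ys[i] for i in sample])
        # Test on the original dataset, as described above.
        correct = sum(predict(x) == y for x, y in zip(xs, ys))
        accuracies.append(correct / n)
    return mean(accuracies)


# Hypothetical one-dimensional data with two well-separated classes.
xs = [1.0, 1.2, 0.8, 1.1, 0.9, 1.3, 5.0, 5.2, 4.9, 5.1, 4.8, 5.3]
ys = [0] * 6 + [1] * 6
print(bootstrap_estimate(xs, ys, runs=200, fit=nearest_centroid_fit))
```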
An example
88 cases categorised into four classes: Ewing family of tumors (EWS, 30), rhabdomyosarcoma (RMS, 11), Burkitt lymphomas (BL, 19) and neuroblastomas (NB, 28). Each case is represented by the expression values of 2308 genes with suspected roles in processes relevant to these tumors. PCA was applied to reduce the dimensionality of the cases; the 10 dominant components per case were used to train the networks. All of the classifiers (BP-ANN) were trained using the same learning parameters. The BP-ANN architectures comprised 10 input nodes, 8 hidden nodes and 4 output nodes. Each output node encodes one of the tumor classes.
Analysing the k-fold cross validation
The cross-validation results were analysed for three different data splitting methods: a) 50% of the available cases were used for training the classifiers and the remaining 50% for testing, b) 75% for training and 25% for testing, c) 95% for training and 5% for testing.
Tumour classification
Cross-validation method based on a 50%-50% splitting.
A: 10 train-test runs, B: 25 train-test runs, C: 50 train-test runs, D: 100 train-test runs, E: 500 train-test runs (interval size equal to 0.01), F: 1000 train-test runs, G: 2000 train-test runs, H: 3000 train-test runs, I: 4000 train-test runs, J: 5000 train-test runs.
Tumour classification Cross-validation method based on a 75%-25% splitting. A: 10 train-test runs, B: 25 train-test runs, C: 50 train-test runs, D: 100 train-test runs, E: 500 train-test runs, F: 1000 train-test runs (interval size equal to 0.01), G: 2000 train-test runs, H: 3000 train-test runs, I: 4000 train-test runs, J: 5000 train-test runs.
Tumour classification Cross-validation method based on a 95%-5% splitting. A: 10 train-test runs, B: 25 train-test runs, C: 50 train-test runs, D: 100 train-test runs, E: 500 train-test runs, F: 1000 train-test runs, G: 2000 train-test runs, H: 3000 train-test runs, I: 4000 train-test runs, J: 5000 train-test runs (interval size equal to 0.01).
Tumour classification
The 50%-50% cross-validation produced the most conservative accuracy estimates. The 95%-5% cross-validation method produced the most optimistic cross-validation accuracy estimates. The leave-one-out method produced the highest accuracy estimate for this dataset (0.79). The estimation of high accuracy values may be linked to an increase in the size of the training datasets.
Tumour classification
Bootstrap method.
A: 100 train-test runs, B: 200 train-test runs, C: 300 train-test runs, D: 400 train-test runs, E: 500 train-test runs, F: 600 train-test runs, G: 700 train-test runs, H: 800 train-test runs, I: 900 train-test runs (interval size equal to 0.01), J: 1000 train-test runs.
Final remarks
The problem of estimating prediction quality should be carefully addressed and deserves further investigation. Sampling techniques can be implemented to assess the classification quality factors (such as accuracy) of classifiers (such as ANNs). In general there is variability among the three techniques. These experiments suggest that it is possible to achieve lower variance estimates for different numbers of train-test runs.
Final remarks (II)
Furthermore, one may identify conservative and optimistic accuracy predictors, whose overall estimates may be significantly different. This effect is more distinguishable in small-sample applications. The predicted accuracy of a classifier generally increases with the size of the training dataset. The bootstrap method may be applied to generate conservative and robust accuracy estimates, based on a relatively small number of train-test experiments.
Final remarks (III)
This presentation highlights the importance of performing more rigorous procedures on the selection of data and classification quality assessment. In general the application of more than one sampling technique may provide the basis for accurate and reliable predictions.