A Posteriori Corrections to Classification Methods
Włodzisław Duch & Łukasz Itert
Department of Informatics, Nicholas Copernicus University, Torun, Poland.
Motivation
So you've got your model, but that's not the end: try to derive as much information from it as possible.
In pathological cases neural networks (NN), fuzzy systems (FS), rough set systems (RS) and other systems lead to results below the base rates. How to avoid it?
In a controlled experiment the class split was 50%-50%; in real life it is 5%-95%. How to deal with it?
So your model is accurate; that doesn't impress me much. How about the costs? Confidence in results? Sensitivity? Specificity?
Can you improve it quickly? A posteriori corrections may help and are (almost) free.
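Corrections increasing accuracy
NN, kNN, FIS, RS and other systems do not estimate probabilities rigorously, but some estimates of $p(C_i|X)$ are obtained. Many systems do not optimize error functions.
Idea: linear scaling of probabilities. For $K$ classes, with $C_K$ the majority class:
$$P(C_i|X) = \kappa_i\, p(C_i|X), \quad i < K; \qquad P(C_K|X) = 1 - \sum_{i<K} \kappa_i\, p(C_i|X)$$
$\kappa_i = 0$ gives the majority classifier, $\kappa_i = 1$ gives the original one. The coefficients $\kappa_i$ are optimized.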
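Softmax
If the scaling coefficients satisfy $\kappa_i \in [0,1]$, then the probability of the majority class may only grow. Solution: assume $\kappa_i \in [0,\infty)$ and $\kappa_K = 1$, and use softmax:
$$P(C_i|X) = \frac{\exp\bigl(\kappa_i\, p(C_i|X)\bigr)}{\sum_j \exp\bigl(\kappa_j\, p(C_j|X)\bigr)}$$
This will flatten probabilities; for 2 classes with all $\kappa_i = 1$: $P(C|X) \in [(1+e)^{-1}, (1+e^{-1})^{-1}] \approx [0.27, 0.73]$.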
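The flattening effect can be checked numerically. The sketch below assumes the softmax is applied to the scaled estimates $\kappa_i\, p(C_i|X)$; the function name `softmax_correction` and the NumPy implementation are illustrative choices, not part of the original slides.

```python
import numpy as np

def softmax_correction(p, kappa):
    """Softmax over scaled probability estimates kappa_i * p(C_i|X).

    p     : array of shape (N, K), estimated class probabilities
    kappa : array of shape (K,), non-negative scaling coefficients
    """
    z = np.asarray(p, dtype=float) * np.asarray(kappa, dtype=float)
    z = z - z.max(axis=-1, keepdims=True)   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Two classes, all coefficients equal to 1, most extreme inputs possible.
kappa = np.ones(2)
extremes = softmax_correction(np.array([[0.0, 1.0], [1.0, 0.0]]), kappa)
print(extremes[:, 0])   # approximately 0.27 and 0.73
```

Even fully confident inputs are mapped into roughly [0.27, 0.73] when the coefficients equal 1, which is why coefficients larger than 1 are allowed.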
Cost function
The cost function compares the corrected probabilities $P(C_i|X)$ with targets $P_i(X)$. The $P_i(X)$ are the "true" probabilities, if given; otherwise $P_i(X) = 1$ if the label of the training vector $X$ is $C_i$, and $P_i(X) = 0$ otherwise.
kNNs, Kohonen nets, decision trees, and many fuzzy and rough systems do not minimize such a cost function.
Alternative: stacking with a linear perceptron.
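Cost function with linear rescaling
Due to normalization, the corrected probability of the majority class is fixed by the others: with scaling coefficients $\kappa_i$, $P(C_K|X) = 1 - \sum_{i<K} \kappa_i\, p(C_i|X)$, so the cost depends only on $\kappa_1, \dots, \kappa_{K-1}$.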
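The role of normalization can be written out explicitly. The derivation below is a sketch under two assumptions that the surviving slide text does not state: the cost is quadratic (suggested by the LMS and MSE remarks elsewhere in the slides), and the scaling coefficients are written as $\kappa_i$. Any constant prefactor is omitted.

```latex
% Assumed quadratic cost over training vectors X and classes i = 1..K
E(\kappa) = \sum_X \sum_{i=1}^{K} \bigl( P(C_i|X) - P_i(X) \bigr)^2

% Linear rescaling: P(C_i|X) = \kappa_i p(C_i|X) for i < K,
%                   P(C_K|X) = 1 - \sum_{i<K} \kappa_i p(C_i|X).
% Normalized targets: P_K(X) = 1 - \sum_{i<K} P_i(X).
% Hence P(C_K|X) - P_K(X) = -\sum_{i<K} ( \kappa_i p(C_i|X) - P_i(X) ), and

E(\kappa) = \sum_X \Bigl[ \sum_{i<K} \bigl( \kappa_i p(C_i|X) - P_i(X) \bigr)^2
          + \Bigl( \sum_{i<K} \bigl( \kappa_i p(C_i|X) - P_i(X) \bigr) \Bigr)^2 \Bigr]
```

In this form $E(\kappa)$ is a quadratic function of the $K-1$ free coefficients.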
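Minimum of $E(\kappa)$: solution
An elegant solution is found in the LMS (least mean squares) sense: setting the gradient of the quadratic cost to zero gives a set of linear equations for the scaling coefficients $\kappa_i$.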
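A least-squares fit of the scaling coefficients can be sketched as follows. It assumes a quadratic cost with one-hot targets and linear rescaling in which the last column is the majority class and receives the remaining probability mass. The function names `fit_linear_scaling` and `apply_linear_scaling` are hypothetical, and `numpy.linalg.lstsq` stands in for whatever closed-form solution the authors used.

```python
import numpy as np

def fit_linear_scaling(p, labels):
    """Least-squares scaling coefficients for the first K-1 classes.

    p      : (N, K) estimated probabilities, last column = majority class
    labels : (N,) integer class labels in 0..K-1
    """
    p = np.asarray(p, dtype=float)
    n, k = p.shape
    targets = np.eye(k)[labels]                 # one-hot targets P_i(X)
    q = p[:, :k - 1]
    # Residuals for classes i < K: kappa_i * p_i - P_i (one row per sample and class).
    a_top = (q[:, :, None] * np.eye(k - 1)[None, :, :]).reshape(n * (k - 1), k - 1)
    b_top = targets[:, :k - 1].reshape(-1)
    # Residual for class K: (1 - sum_i kappa_i * p_i) - P_K.
    a_bot = -q
    b_bot = targets[:, k - 1] - 1.0
    a = np.vstack([a_top, a_bot])
    b = np.concatenate([b_top, b_bot])
    kappa, *_ = np.linalg.lstsq(a, b, rcond=None)
    return kappa

def apply_linear_scaling(p, kappa):
    """Scale the first K-1 columns; the last column takes the remainder."""
    p = np.asarray(p, dtype=float)
    scaled = p[:, :-1] * kappa
    last = 1.0 - scaled.sum(axis=1, keepdims=True)
    return np.hstack([scaled, last])

# Synthetic illustration only (random numbers, not data from the slides).
rng = np.random.default_rng(0)
raw = rng.random((200, 3))
p_demo = raw / raw.sum(axis=1, keepdims=True)
labels_demo = rng.integers(0, 3, size=200)
kappa_demo = fit_linear_scaling(p_demo, labels_demo)
```

Because all coefficients equal to 1 reproduces the original estimates, the fitted coefficients can never give a larger squared error on the training set than the uncorrected model. Note that the linearly rescaled values are not guaranteed to stay inside [0, 1].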
Numerical example
The primate splice-junction DNA gene sequences: given 60 nucleotides, distinguish whether there is an intron => exon boundary, an exon => intron boundary, or neither. The vectors were split into training (2000 vectors) and test sets.
kNN (k=11, Manhattan distance) gave the initial probabilities.
Before correction: 85.8% (train), 85.7% (test)
After correction: 86.4% (train), 86.9% (test)
$\kappa_1 = 1.0282$; $\kappa_2 = 0.8785$
The MSE also improves: the probabilities are better, even if the answers are not always correct.
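Changes in the a priori class distribution
The a priori class distribution may differ between training and test data. If the data come from the same process, the class-conditional densities $p(X|C_i)$ stay constant while the posteriors $p(C_i|X)$ change.
Bayes theorem for the training posteriors $p_t(C_i|X)$ and the test posteriors $p(C_i|X)$:
$$p_t(C_i|X) = \frac{p(X|C_i)\, p_t(C_i)}{p_t(X)}, \qquad p(C_i|X) = \frac{p(X|C_i)\, p(C_i)}{p(X)}$$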
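Eliminating the shared density $p(X|C_i)$ from the two Bayes formulas gives $p(C_i|X) \propto p_t(C_i|X)\, p(C_i)/p_t(C_i)$, normalized over classes. A minimal sketch of that reweighting, with a hypothetical function name and NumPy as the assumed tool:

```python
import numpy as np

def adjust_posteriors(train_posteriors, train_priors, new_priors):
    """Re-weight posteriors estimated under train_priors to new_priors.

    train_posteriors : (N, K) posteriors p_t(C_i|X) from the trained model
    train_priors     : (K,) class priors p_t(C_i) in the training data
    new_priors       : (K,) class priors p(C_i) expected in real use
    """
    weights = np.asarray(new_priors, dtype=float) / np.asarray(train_priors, dtype=float)
    unnormalized = np.asarray(train_posteriors, dtype=float) * weights
    return unnormalized / unnormalized.sum(axis=-1, keepdims=True)

# A case the model trained on a 50%-50% split considers undecided,
# re-evaluated for a 5%-95% real-life split.
adjusted = adjust_posteriors([[0.5, 0.5]], [0.5, 0.5], [0.05, 0.95])
print(adjusted)   # [[0.05 0.95]]
```

With no evidence either way from the model, the corrected posterior falls back to the new priors, as expected.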
Estimation of a priori probabilities
How to estimate the new $p(C_i)$? Estimate the confusion matrix $p_t(C_i|C_j)$ on the training set (McLachlan and Basford 1988); estimate the predicted class frequencies $p_{test}(C_i)$ by applying the classifier to the test data. Then solve the linear equations for $p(C_j)$:
$$p_{test}(C_i) = \sum_j p_t(C_i|C_j)\, p(C_j)$$
Experiment: use an MLP on a small training sample.
Table: results for the Diabetes, Breast cancer and Bupa Liver datasets, with columns 50-50%, $p(C_i|C_j)$ and True $p(C_i)$.
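The linear system can be solved directly. The sketch below assumes `confusion[i, j]` holds the training-set probability of predicting class i when the true class is j, so each column sums to 1. Clipping negative solutions and renormalizing is a practical safeguard added here, not something stated in the slides. The numbers in the example are made up for illustration.

```python
import numpy as np

def estimate_priors(confusion, predicted_freq):
    """Estimate new class priors from predicted class frequencies.

    confusion      : (K, K), confusion[i, j] = p_t(predict C_i | true C_j)
    predicted_freq : (K,), fraction of test cases assigned to each class
    """
    confusion = np.asarray(confusion, dtype=float)
    predicted_freq = np.asarray(predicted_freq, dtype=float)
    priors, *_ = np.linalg.lstsq(confusion, predicted_freq, rcond=None)
    priors = np.clip(priors, 0.0, None)    # assumed safeguard against noise
    return priors / priors.sum()

# Illustrative numbers: the classifier assigns 23.5% of test cases to class 0.
confusion = np.array([[0.9, 0.2],
                      [0.1, 0.8]])
estimated = estimate_priors(confusion, [0.235, 0.765])
print(estimated)   # approximately [0.05 0.95]
```

Although the classifier labels 23.5% of cases as class 0, the confusion matrix implies that only about 5% truly belong to it.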
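What to optimize?
Overall accuracy is not always the most important thing to optimize. Given a model $M$, the confusion matrix for a class $C_+$ and all other classes $C_-$ is (rows = true, columns = predicted by $M$):
$$P(C_i, C_j|M) = \begin{pmatrix} p_{++} & p_{+-} \\ p_{-+} & p_{--} \end{pmatrix}$$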
Quantities derived from $p(C_i|C_j)$
Several quantities are used to evaluate classification models $M$ created to distinguish the $C_+$ class, among them overall accuracy, sensitivity (recall), specificity and precision.
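The standard definitions of these quantities can be computed from a 2x2 confusion matrix as sketched below. The layout (rows = true class, columns = predicted class, the $C_+$ class first) follows the slides; the function name and the example counts are illustrative assumptions.

```python
import numpy as np

def binary_metrics(confusion):
    """Standard quantities from a 2x2 confusion matrix.

    confusion : [[n_pp, n_pm], [n_mp, n_mm]], rows = true class (+, -),
                columns = predicted class (+, -); counts or fractions.
    """
    c = np.asarray(confusion, dtype=float)
    c = c / c.sum()                      # convert counts to joint probabilities
    p_pp, p_pm = c[0]
    p_mp, p_mm = c[1]
    return {
        "accuracy": p_pp + p_mm,
        "sensitivity": p_pp / (p_pp + p_pm),   # recall for the + class
        "specificity": p_mm / (p_mp + p_mm),
        "precision": p_pp / (p_pp + p_mp),
    }

# Made-up counts: 50 true positives cases and 50 true negative cases.
metrics = binary_metrics([[40, 10], [5, 45]])
print(metrics)
```

The sketch does not guard against empty rows or columns, which would cause a division by zero.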
Error functions
The best classifier is selected using Recall(Precision) curves or ROC curves, Sensitivity(1 - Specificity), i.e. $S_+(1 - S_-)$.
Confidence in $M$ may be increased by rejecting some cases.
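Errors and costs
Optimization with explicit costs: minimize $C(M;\alpha) = p_{+-} + \alpha\, p_{-+}$, where $p_{+-}$ is the fraction of $C_+$ cases assigned to $C_-$ and $p_{-+}$ the fraction of $C_-$ cases assigned to $C_+$.
For $\alpha = 0$ this is equivalent to maximization of $p_{++}$ (sensitivity), and for large $\alpha$ to the maximization of $p_{--}$ (specificity).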
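One simple way to optimize with explicit costs is to scan decision thresholds on the model's score for the $C_+$ class. The sketch below assumes a cost of the form (fraction of missed $C_+$ cases) + alpha × (fraction of false alarms). Both this form and the name `alpha` are assumptions made for illustration, as are the function name and the toy data.

```python
import numpy as np

def best_threshold(scores, is_positive, alpha):
    """Threshold on the + class score minimizing p_pm + alpha * p_mp.

    scores      : (N,) model scores or probabilities for the + class
    is_positive : (N,) booleans, True for cases that truly belong to +
    alpha       : relative cost of a false alarm versus a missed + case
    """
    scores = np.asarray(scores, dtype=float)
    is_positive = np.asarray(is_positive, dtype=bool)
    best_t, best_cost = None, np.inf
    # Candidate thresholds: every observed score, plus "predict nothing as +".
    for t in np.unique(np.concatenate([scores, [np.inf]])):
        predicted_pos = scores >= t
        p_pm = np.mean(is_positive & ~predicted_pos)    # missed + cases
        p_mp = np.mean(~is_positive & predicted_pos)    # false alarms
        cost = p_pm + alpha * p_mp
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t, best_cost

scores = [0.1, 0.4, 0.35, 0.8]
truth = [False, False, True, True]
t_sensitive, _ = best_threshold(scores, truth, alpha=0.0)    # favors sensitivity
t_specific, _ = best_threshold(scores, truth, alpha=100.0)   # favors specificity
```

With `alpha = 0` the lowest threshold wins, so every $C_+$ case is caught. With a large `alpha` the threshold rises until no false alarms remain.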
Conclusions
Applying a trained model in a real-world application does not end with classification; it may be only the beginning.
Three types of a posteriori corrections to optimize the final model have been considered:
1. improving accuracy by scaling probabilities;
2. restoring the balance between the training and test distributions;
3. improving confidence, sensitivity or specificity of results.
They are especially useful for optimization of logical rules.
They may be combined; for example, a posteriori corrections may be applied to accuracy for a chosen class (sensitivity), confidence, cost optimization, etc.