Handling Imbalanced Datasets in Multistage Classification
Mauro López
Centro de Astrobiología - Madrid (ex-LAEFF)
Great Workshop La Palma - June 2011
Problem
● Real-world classification problems deal with imbalanced datasets
● Classifiers are usually biased towards the majority class
Problem: Misclassification Cost
● Most of the literature assumes that the minority class is more important
● The misclassification cost is usually lower for the majority class
● E.g. breast cancer detection, where missing a tumour (a false negative) is far costlier than a false alarm
Problem: Astronomy
● But in star classification the misclassification costs are the same for every class
● A class with very few instances can still be very well represented
Problem: Not Only the Classifiers
● Feature selection, discretization and other preprocessing filters suffer from the same problem
Multistage Classifier
● Several advantages:
● Specialized classifiers at each node
● Better selection of relevant features at each node
● Combination of different classification methods
● But there is a drawback: the "one class vs. the rest" splits worsen the imbalance problem (see the sketch below)
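A minimal sketch of the multistage idea, assuming a simple two-node cascade built on scikit-learn decision trees (the class name and overall design are illustrative, not the talk's actual pipeline): node 1 separates one class from the rest, node 2 classifies the remainder. Note how node 1's "one vs. rest" training set is typically far more imbalanced than the original dataset.

```python
# Illustrative two-stage cascade (hypothetical design, not the talk's pipeline).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

class TwoStageClassifier:
    """Node 1: first_class vs. the rest. Node 2: classify the remaining classes."""

    def fit(self, X, y, first_class):
        self.first_class = first_class
        # Node 1 sees a binary, typically imbalanced, "one vs. rest" problem.
        self.node1 = DecisionTreeClassifier().fit(X, y == first_class)
        rest = y != first_class
        self.node2 = DecisionTreeClassifier().fit(X[rest], y[rest])
        return self

    def predict(self, X):
        is_first = self.node1.predict(X)
        labels = np.asarray(self.node2.predict(X), dtype=object)
        labels[is_first] = self.first_class
        return labels
```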
Evaluation
Evaluation
● The most used measure in classification: accuracy
● Accuracy = (TP + TN) / (TP + TN + FP + FN)
● We cannot say a classifier is good just by looking at its accuracy
● Example: on a training set of 1000 instances labeled A and 1 instance labeled B, a classifier that always predicts A easily gets an "outstanding" 99.9%
● It can still be useful for comparing classifiers
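A quick sketch of the 1000-to-1 example (hypothetical setup, not from the talk): a classifier that always predicts the majority class already scores 99.9% accuracy.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# 1000 instances labeled "A", 1 labeled "B"; features are irrelevant here.
y = np.array(["A"] * 1000 + ["B"])
X = np.zeros((len(y), 1))

# Always predict the majority class...
clf = DummyClassifier(strategy="most_frequent").fit(X, y)
# ...and get an "outstanding" 0.999 accuracy.
print(accuracy_score(y, clf.predict(X)))
```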
Evaluation
● ROC analysis summarizes performance over a range of tradeoffs between the true positive and false positive error rates
● Useful if FN and FP errors have different costs
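A minimal sketch of this kind of evaluation with scikit-learn (the synthetic data and logistic-regression model are assumptions for illustration): ROC AUC summarizes the true-positive / false-positive tradeoff over all decision thresholds.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced problem: ~95% majority / 5% minority.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# The score ranking is evaluated over every possible threshold at once.
scores = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
print("ROC AUC:", roc_auc_score(y_te, scores))
```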
Evaluation
● The main goal for imbalanced datasets is to improve the recall without decreasing the precision
● The F-value combines both measures (β is usually set to 1):
● Precision = TP / (TP + FP)
● Recall = TP / (TP + FN)
● F-value = (1 + β²) · Precision · Recall / (β² · Precision + Recall)
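A minimal sketch computing the three measures with scikit-learn (the toy labels are made up for illustration):

```python
from sklearn.metrics import fbeta_score, precision_score, recall_score

# Toy labels: 3 true positives in y_true; the classifier finds 2 of them
# and raises 1 false alarm.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

print("precision:", precision_score(y_true, y_pred))        # TP/(TP+FP) = 2/3
print("recall:   ", recall_score(y_true, y_pred))           # TP/(TP+FN) = 2/3
print("F-value:  ", fbeta_score(y_true, y_pred, beta=1.0))  # 2/3 here
```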
Solutions: Undersampling
● (Random) removal of instances belonging to the majority class
● Problem: we can lose important instances
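A minimal sketch, assuming the imbalanced-learn library (not named in the talk), that randomly drops majority-class instances until the classes balance:

```python
from collections import Counter

from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

# Randomly discard majority instances; important examples may go with them.
X_res, y_res = RandomUnderSampler(random_state=0).fit_resample(X, y)
print(Counter(y), "->", Counter(y_res))
```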
Solutions: Oversampling
● Instances belonging to the minority class are replicated
● Problems: possible overfitting, and it does not enlarge the decision region for the class
● Advantage: fast
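The same sketch with random oversampling instead (again assuming imbalanced-learn): minority instances are duplicated, which is fast but only stacks copies on existing points.

```python
from collections import Counter

from imblearn.over_sampling import RandomOverSampler
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

# Replicate minority instances; the decision region does not grow.
X_res, y_res = RandomOverSampler(random_state=0).fit_resample(X, y)
print(Counter(y), "->", Counter(y_res))
```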
Solutions: SMOTE
● Synthetic Minority Oversampling Technique
● Generates new synthetic instances by interpolating between existing minority-class instances
● No overfitting (no exact copies)
● Forces the minority class to be more general (broader decision region)
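A minimal SMOTE sketch, again assuming imbalanced-learn: each synthetic instance is interpolated between a minority point and one of its k nearest minority neighbours, so the class region broadens instead of stacking duplicates.

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

# New minority points lie along segments to nearest minority neighbours.
X_res, y_res = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
print(Counter(y), "->", Counter(y_res))
```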
SMOTE - Warning
● "Real stupidity beats artificial intelligence every time." — Terry Pratchett (Hogfather)
● RV vs. ALL
● Extreme imbalance ratio
● Can it really be that good?
RV vs. All
[figure]
RV Smotified
[figure]
Solutions: Adding Weights
● Does not remove important examples
● Does not overfit
● But it needs algorithms prepared to handle weights
● 10-fold cross-validation can be tricky
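A minimal sketch with scikit-learn (an assumed stand-in; any weight-aware learner works): class weights penalize minority-class errors more heavily without adding or removing a single instance.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

# "balanced" weights each class inversely to its frequency, so minority
# errors cost more; no instance is duplicated or removed.
clf = DecisionTreeClassifier(class_weight="balanced", random_state=0).fit(X, y)
```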
Solutions: Boosting
● Creates an ensemble of weak classifiers, reweighting the training instances so that later classifiers focus on the hard ones
● It maintains accuracy over the entire dataset
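A minimal AdaBoost sketch with scikit-learn (an assumed example of the boosting family): each round reweights the training instances so the next weak learner concentrates on the ones misclassified so far.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

# 100 weak learners (depth-1 trees by default), each trained on a
# reweighted view of the data that emphasizes previous mistakes.
clf = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X, y)
print("training accuracy:", clf.score(X, y))
```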
Experiment
● Hipparcos dataset
● 1661 instances
● 47 attributes + class
● 23 classes
Multistage Hierarchy
[figure: classification hierarchy with the imbalance ratio at each node]
Experiment - J48
● Node 1: LPV vs. Other
● Imbalance ratio: 4.3
● Good classification in spite of the imbalance
● Little margin for improvement
Experiment - J48
● Node 3: Eclipsing vs. Other
● Imbalance ratio: 1.33
● When the dataset is already balanced, adding new instances does not improve the classification
Experiment
● Node 5: GDOR vs. Other
● Imbalance ratio: 28.07
Experiment
● Node 11: SPB+ACV vs. Other
● Imbalance ratio: 3.8
Results
● Using a balanced dataset improves the classification by about 10%
● Feature selection (FS) is especially affected by the imbalance
Thank you
● Time to wake up