InCob A particle swarm based hybrid system for imbalanced medical data sampling Pengyi Yang School of Information Technologies The University of Sydney, NSW 2006, Australia NICTA, Australian Technology Park, Eveleigh NSW 2015, Australia
Road map
Imbalanced class distribution in medical data
Sampling
– Over-sampling
– Under-sampling
Converting feature selection techniques into a sampling strategy
System overview
Results
Conclusion
Imbalanced class distribution in medical data
Medical data commonly have an imbalanced class distribution. Why?
– Positive samples are special cases (rare) while negative samples are abundant (or, conversely, only positive samples are collected)
– Data contain subtypes, each with a limited number of samples
Problem… Building a classification model on an imbalanced dataset causes the under-represented class to be overlooked or even ignored. Yet the rare classes often carry important biological implications. The difficulty becomes how to remedy the imbalanced class distribution.
Remedy
Via sampling (before the model building process)
– Over-sampling: increase the sample size of the minority class (could introduce noise and redundancy)
– Under-sampling: decrease the sample size of the majority class (could remove representative samples)
Via cost-sensitive learning (within the model building process)
– Need to choose an appropriate cost metric (hard to determine a priori)
Current methods
The most straightforward way: random over-sampling and under-sampling
– Naive methods, but they work well in different situations
Clustering and sampling
– Cluster the dataset and sample according to the characteristics of each cluster
Synthesizing new examples
– The most popular is SMOTE, which creates "artificial" samples to increase the size of the minority class
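
To make the two random baselines above concrete, here is a minimal sketch using only the Python standard library. The function names, the 90/10 toy labels, and the choice to balance the classes exactly are assumptions made for illustration; the slides do not specify an implementation.

```python
import random

def split_by_class(labels, majority_label):
    """Return (majority indices, minority indices) for a list of labels."""
    maj = [i for i, y in enumerate(labels) if y == majority_label]
    mino = [i for i, y in enumerate(labels) if y != majority_label]
    return maj, mino

def random_undersample(labels, majority_label, rng):
    """Keep a random subset of the majority class, equal in size to the minority class."""
    maj, mino = split_by_class(labels, majority_label)
    kept = rng.sample(maj, len(mino))  # sampling without replacement
    return sorted(kept + mino)

def random_oversample(labels, majority_label, rng):
    """Duplicate random minority samples until both classes have the same size."""
    maj, mino = split_by_class(labels, majority_label)
    extra = [rng.choice(mino) for _ in range(len(maj) - len(mino))]
    return maj + mino + extra

# Hypothetical toy data: 90 negative (majority) and 10 positive (minority) labels.
rng = random.Random(0)
labels = [0] * 90 + [1] * 10
under_idx = random_undersample(labels, majority_label=0, rng=rng)
over_idx = random_oversample(labels, majority_label=0, rng=rng)
```

Both functions return sample indices rather than copies of the data, so the same indices can be used to slice a feature matrix. Under-sampling discards 80 majority samples here, and over-sampling adds 80 duplicated minority samples, which illustrates the trade-offs noted on the "Remedy" slide.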
Our contribution: proposing a novel sampling strategy
Converting a feature selection technique into a sampling strategy
– Selecting a subset of "optimal" samples from the majority class
[Figure: supervised sample selection turns the imbalanced dataset into a balanced dataset]
Conceptual representation [Figure; labels: imbalanced, balanced, majority sample, minority sample, particle swarm optimization, classifier, train, predict, prediction, ranking (high to low), test set]
Particle swarm optimization: problem encoding. Each particle is a subset of samples from the majority class, encoded over sample 1, sample 2, …, sample m, where m is the sample size of the majority class.
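
The encoding above can be sketched as follows. This is only an illustration under stated assumptions: it assumes each particle is a binary vector of length m (bit 1 means "keep this majority sample"), it uses the standard binary PSO update with a sigmoid transfer function, and `toy_fitness` is a hypothetical placeholder. In the system described in the talk, the fitness would presumably come from training a classifier on the decoded sample set and scoring its predictions; that part is not reproduced here, and all parameter values are arbitrary.

```python
import math
import random

def decode(particle, majority_idx, minority_idx):
    """Training set = majority samples whose bit is 1, plus all minority samples."""
    chosen = [i for bit, i in zip(particle, majority_idx) if bit == 1]
    return chosen + list(minority_idx)

def binary_pso(fitness, m, n_particles=10, n_iters=30,
               w=0.7, c1=1.5, c2=1.5, seed=0):
    """Maximise fitness(particle) over binary vectors of length m."""
    rng = random.Random(seed)
    pos = [[rng.randint(0, 1) for _ in range(m)] for _ in range(n_particles)]
    vel = [[0.0] * m for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(p) for p in pos]
    g = max(range(n_particles), key=lambda k: pbest_fit[k])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]
    for _ in range(n_iters):
        for k in range(n_particles):
            for d in range(m):
                r1, r2 = rng.random(), rng.random()
                vel[k][d] = (w * vel[k][d]
                             + c1 * r1 * (pbest[k][d] - pos[k][d])
                             + c2 * r2 * (gbest[d] - pos[k][d]))
                # Sigmoid maps the velocity to the probability that the bit is 1.
                prob = 1.0 / (1.0 + math.exp(-vel[k][d]))
                pos[k][d] = 1 if rng.random() < prob else 0
            fit = fitness(pos[k])
            if fit > pbest_fit[k]:
                pbest[k], pbest_fit[k] = pos[k][:], fit
                if fit > gbest_fit:
                    gbest, gbest_fit = pos[k][:], fit
    return gbest, gbest_fit

# Hypothetical toy setup: 20 majority samples, 5 minority samples.
majority_idx = list(range(20))
minority_idx = list(range(20, 25))

def toy_fitness(particle):
    # Placeholder objective (an assumption): prefer selecting as many majority
    # samples as there are minority samples. A real fitness would evaluate a
    # classifier trained on decode(particle, majority_idx, minority_idx).
    return -abs(sum(particle) - len(minority_idx))

best, best_fit = binary_pso(toy_fitness, m=len(majority_idx))
train_idx = decode(best, majority_idx, minority_idx)
```

Treating sample selection as a bit vector is what lets a feature selection search be reused for sampling: the search space has the same shape, only the bits now index majority samples instead of features.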
Final Schema
Results (1) PSO achieved better classification results. (2) Different evaluation metrics could give a different indication.
Results (continued) (3) Different classifiers also perform differently under the same sampling method.
Key observation The evaluation of a data sampling strategy is complicated by the type of classifier applied and the evaluation metric used. Therefore, conclusions drawn from a single type of classifier or a single evaluation metric should be treated with caution.
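
A small worked example shows why the choice of metric matters on imbalanced data. The metrics used here (accuracy, sensitivity, specificity, and their geometric mean) and the two sets of predictions are assumptions chosen for illustration; they are not the results or necessarily the metrics reported in the talk.

```python
import math

def evaluate(y_true, y_pred, positive=1):
    """Compute accuracy, sensitivity, specificity and G-mean for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "g_mean": math.sqrt(sensitivity * specificity),
    }

# Hypothetical test set: 10 positive (minority) and 90 negative (majority) samples.
y_true = [1] * 10 + [0] * 90

# Model A ignores the minority class entirely.
pred_a = [0] * 100
# Model B finds 8 of 10 positives at the cost of 18 false alarms.
pred_b = [1] * 8 + [0] * 2 + [1] * 18 + [0] * 72

result_a = evaluate(y_true, pred_a)
result_b = evaluate(y_true, pred_b)
```

On this toy data, model A has the higher accuracy (about 0.9 against 0.8) even though it never detects a positive sample, while the G-mean ranks model B higher (about 0.8 against 0). The two metrics therefore point to different "best" models.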
Conclusion The study shows that, with proper modification, feature selection techniques can be applied to the sampling of imbalanced data. Applying such a technique in the medical domain demonstrates that it can help increase classification accuracy, which is valuable to prediction and decision support systems.
Publication P. Yang, L. Hsu, B. Zhou, Z. Zhang, A. Zomaya. A particle swarm based hybrid system for imbalanced medical data sampling. Accepted by BMC Genomics.
Questions!