Slide 1: Reflections
Robert Holte, University of Alberta
holte@cs.ualberta.ca
Slide 2: "unbalanced" vs. "imbalanced"
Google: searching the web for "imbalanced" returns about 53,800 results; searching for "unbalanced" returns about 465,000.
Shouldn't we favour the minority class???
Slide 3: Is "FP" meaningful?
- Elkan: individual examples have costs, so the number of misclassified positive examples is irrelevant.
- Moreover, if the test distribution can differ from the training distribution, the FP count measured on training data may have no relation to the FP count seen later (a numeric sketch follows).
BUT…
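To make the second point concrete, here is a minimal numeric sketch; the error rate and class mixes are illustrative assumptions, not figures from the talk. A classifier with a fixed per-class false-positive rate produces very different FP counts once the class mix shifts between training and deployment, so the training-time count predicts little.

```python
# Illustrative sketch: a fixed false-positive *rate* yields very different
# FP *counts* when the class mix shifts. All numbers are hypothetical.

fpr = 0.05           # assumed false-positive rate on negatives
n_examples = 10_000  # size of each evaluation set

for neg_fraction in (0.50, 0.99):  # training mix vs. a deployment mix
    n_negatives = int(n_examples * neg_fraction)
    expected_fp = fpr * n_negatives
    print(f"negatives = {neg_fraction:.0%}: expected FP count = {expected_fp:.0f}")

# negatives = 50%: expected FP count = 250
# negatives = 99%: expected FP count = 495
```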
Slide 4: Babies and Bathwater…
- Not every situation involves example-specific costs and drifting within-class distributions.
- ROC curves are far better than accuracy, and ROC curves are better than AUC or any scalar measure (see the sketch below)… and cost curves are even better??
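As a small illustration of why a full ROC curve carries more information than any scalar, here is a minimal Python sketch using scikit-learn; the toy labels and scores are made-up assumptions. The curve exposes every operating point (FPR, TPR) available to the scorer, while AUC collapses them all into one number and hides where the classifier is actually strong.

```python
# Minimal sketch (toy data assumed): the ROC curve lists every operating
# point of a scorer; AUC reduces the whole curve to a single scalar.
from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]                        # toy labels
scores = [0.1, 0.2, 0.3, 0.35, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]   # toy scores

fpr, tpr, thresholds = roc_curve(y_true, scores)
print("AUC =", roc_auc_score(y_true, scores))
for f, t, th in zip(fpr, tpr, thresholds):
    print(f"threshold {th:.2f}: FPR = {f:.2f}, TPR = {t:.2f}")
```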
Slide 5: And the question remains…
How do we select the training examples that give the best classifier for our circumstances? (Foster's budgeted learning problem)
Slide 6: Within-class imbalance
- Elkan: subpopulations in the test distribution are not evenly represented in training.
- Other presenters: subpopulations in the training data are not of equal size.
Slide 7: In Defense of Studies of C4.5 and Undersampling
- Foster's opening example ("budgeted learning") is very common.
- Undersampling is a common technique (it appears in the SAS manual); a sketch follows this list.
- Different algorithms react differently to undersampling.
- C4.5's reaction is not necessarily intuitive.
- Foster: the appropriate sampling method depends on the performance measure.
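A minimal sketch of random majority-class undersampling, assuming binary NumPy labels (0 = majority, 1 = minority); scikit-learn's DecisionTreeClassifier stands in for C4.5 here, which is an assumption of this sketch, not the setup of the studies being defended.

```python
# Minimal sketch: drop random majority-class examples until the classes
# are balanced, then train a tree. DecisionTreeClassifier is a CART-style
# learner standing in for C4.5; it is not C4.5 itself.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def undersample_majority(X, y, seed=0):
    """Return (X, y) with the majority class downsampled to minority size."""
    rng = np.random.default_rng(seed)
    minority = np.where(y == 1)[0]
    majority = np.where(y == 0)[0]
    kept_majority = rng.choice(majority, size=len(minority), replace=False)
    keep = np.concatenate([minority, kept_majority])
    return X[keep], y[keep]

# Toy imbalanced data: 95 majority vs. 5 minority examples.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.array([0] * 95 + [1] * 5)

X_bal, y_bal = undersample_majority(X, y)
tree = DecisionTreeClassifier().fit(X_bal, y_bal)
print("balanced class counts:", np.bincount(y_bal))  # [5 5]
```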
Slide 8: Endless tweaking? Definitely a danger
- Overtuning; a plethora of results/methods.
- But exploratory research is valid once a clear need is established.
- Some papers have presented specific hypotheses that can now be tested, e.g. "1-class SVM outperforms 2-class SVM when…"
Slide 9: Size matters
- Having a small number of examples is a different problem than having an imbalance.
- Both cause problems.
- We should be careful to separate them in our experiments.
Slide 10: No problem?
- Foster: the problem diminishes when datasets get large.
- Are some learning algorithms insensitive? Generative models? SVMs? (It seems not, after today.)
- Active learning, progressive sampling.
Slide 11: More problems?
- Imbalance is detrimental to feature selection.
- Imbalance is detrimental to clustering.
Slide 12: ELKAN: Bogosity about learning with unbalanced data
- "The goal is yes/no classification." No: ranking, or probability estimation. Often, P(c = minority | x) < 0.5 for all examples x.
- "Decision trees and C4.5 are well-suited." No: model each class separately, then use Bayes' rule (a sketch of this recipe follows the list):
  P(c|x) = P(x|c)P(c) / [ P(x|c)P(c) + P(x|~c)P(~c) ]
  No: avoid small disjuncts. With naïve Bayes: P(x|c) = ∏_i P(x_i | c).
- "Under/over-sampling are appropriate." No: do cost-based, example-specific sampling, then bagging.
- "ROC curves and AUC are important."
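A minimal Python sketch of the "model each class separately, then use Bayes' rule" recipe, using the naïve Bayes factorization from the slide; the toy binary features, counts, and Laplace smoothing are illustrative assumptions, not details from Elkan's talk.

```python
# Sketch of the recipe: fit P(x|c) per class, then combine with the class
# priors via Bayes' rule. Uses P(x|c) = prod_i P(x_i|c). Toy data only.
import numpy as np

def fit_class_model(X):
    """Per-feature Bernoulli estimates P(x_i = 1 | c), Laplace-smoothed."""
    return (X.sum(axis=0) + 1) / (len(X) + 2)

def log_likelihood(x, theta):
    """log P(x | c) under the naive Bayes factorization."""
    return np.sum(x * np.log(theta) + (1 - x) * np.log(1 - theta))

# Toy data: 8 majority and 2 minority examples, 3 binary features each.
rng = np.random.default_rng(0)
X_neg = rng.integers(0, 2, size=(8, 3))
X_pos = rng.integers(0, 2, size=(2, 3))
prior_pos = len(X_pos) / (len(X_pos) + len(X_neg))

theta_pos, theta_neg = fit_class_model(X_pos), fit_class_model(X_neg)

x = np.array([1, 0, 1])  # query example
# Bayes' rule: P(c|x) = P(x|c)P(c) / [P(x|c)P(c) + P(x|~c)P(~c)]
joint_pos = np.exp(log_likelihood(x, theta_pos)) * prior_pos
joint_neg = np.exp(log_likelihood(x, theta_neg)) * (1 - prior_pos)
print("P(minority | x) =", joint_pos / (joint_pos + joint_neg))
```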