Slide 1: Reflections
Robert Holte, University of Alberta
holte@cs.ualberta.ca
Slide 2: "unbalanced" vs. "imbalanced"
Google: searching the web for "imbalanced" returns about 53,800 results; searching for "unbalanced" returns about 465,000.
Shouldn't we favour the minority class???
Slide 3: Is "FP" meaningful?
- Elkan: individual examples have costs, so the number of misclassified positive examples is irrelevant.
- Moreover, if the test distribution can differ from the training distribution, the FP count measured on training data may have no relation to the FP count seen later (a numeric sketch follows).
BUT…
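To make the second point concrete, here is a minimal numeric sketch; the error rate and class mixes are illustrative assumptions, not figures from the talk. A classifier with a fixed per-class false-positive rate produces very different FP counts once the class mix shifts between training and deployment, so the training-time count predicts little.

```python
# Illustrative sketch: a fixed false-positive *rate* yields very different
# FP *counts* when the class mix shifts. All numbers are hypothetical.

fpr = 0.05           # assumed false-positive rate on negatives
n_examples = 10_000  # size of each evaluation set

for neg_fraction in (0.50, 0.99):  # training mix vs. a deployment mix
    n_negatives = int(n_examples * neg_fraction)
    expected_fp = fpr * n_negatives
    print(f"negatives = {neg_fraction:.0%}: expected FP count = {expected_fp:.0f}")

# negatives = 50%: expected FP count = 250
# negatives = 99%: expected FP count = 495
```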
Slide 4: Babies and Bathwater…
- Not every situation involves example-specific costs and drifting within-class distributions.
- ROC curves are far better than accuracy, and ROC curves are better than AUC or any scalar measure (see the sketch below)… and cost curves are even better??
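As a small illustration of why a full ROC curve carries more information than any scalar, here is a minimal Python sketch using scikit-learn; the toy labels and scores are made-up assumptions. The curve exposes every operating point (FPR, TPR) available to the scorer, while AUC collapses them all into one number and hides where the classifier is actually strong.

```python
# Minimal sketch (toy data assumed): the ROC curve lists every operating
# point of a scorer; AUC reduces the whole curve to a single scalar.
from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]                        # toy labels
scores = [0.1, 0.2, 0.3, 0.35, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]   # toy scores

fpr, tpr, thresholds = roc_curve(y_true, scores)
print("AUC =", roc_auc_score(y_true, scores))
for f, t, th in zip(fpr, tpr, thresholds):
    print(f"threshold {th:.2f}: FPR = {f:.2f}, TPR = {t:.2f}")
```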
Slide 5: And the question remains…
How do we select the training examples that give the best classifier for our circumstances? (Foster's budgeted learning problem)
Slide 6: Within-class imbalance
- Elkan: subpopulations in the test distribution are not evenly represented in training.
- Other presenters: subpopulations in the training data are not of equal size.
Slide 7: In Defense of Studies of C4.5 and Undersampling
- Foster's opening example ("budgeted learning") is very common.
- Undersampling is a common technique (it appears in the SAS manual); a sketch follows this list.
- Different algorithms react differently to undersampling.
- C4.5's reaction is not necessarily intuitive.
- Foster: the appropriate sampling method depends on the performance measure.
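A minimal sketch of random majority-class undersampling, assuming binary NumPy labels (0 = majority, 1 = minority); scikit-learn's DecisionTreeClassifier stands in for C4.5 here, which is an assumption of this sketch, not the setup of the studies being defended.

```python
# Minimal sketch: drop random majority-class examples until the classes
# are balanced, then train a tree. DecisionTreeClassifier is a CART-style
# learner standing in for C4.5; it is not C4.5 itself.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def undersample_majority(X, y, seed=0):
    """Return (X, y) with the majority class downsampled to minority size."""
    rng = np.random.default_rng(seed)
    minority = np.where(y == 1)[0]
    majority = np.where(y == 0)[0]
    kept_majority = rng.choice(majority, size=len(minority), replace=False)
    keep = np.concatenate([minority, kept_majority])
    return X[keep], y[keep]

# Toy imbalanced data: 95 majority vs. 5 minority examples.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.array([0] * 95 + [1] * 5)

X_bal, y_bal = undersample_majority(X, y)
tree = DecisionTreeClassifier().fit(X_bal, y_bal)
print("balanced class counts:", np.bincount(y_bal))  # [5 5]
```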
Slide 8: Endless tweaking? Definitely a danger
- Overtuning; a plethora of results/methods.
- But exploratory research is valid once a clear need is established.
- Some papers have presented specific hypotheses that can now be tested, e.g. "1-class SVM outperforms 2-class SVM when…"
Slide 9: Size matters
- Having a small number of examples is a different problem than having an imbalance.
- Both cause problems.
- We should be careful to separate them in our experiments.
Slide 10: No problem?
- Foster: the problem diminishes when datasets get large.
- Are some learning algorithms insensitive? Generative models? SVMs? (It seems not, after today.)
- Active learning, progressive sampling.
Slide 11: More problems?
- Imbalance is detrimental to feature selection.
- Imbalance is detrimental to clustering.
Slide 12: ELKAN: Bogosity about learning with unbalanced data
- "The goal is yes/no classification." No: ranking, or probability estimation. Often, P(c = minority | x) < 0.5 for all examples x.
- "Decision trees and C4.5 are well-suited." No: model each class separately, then use Bayes' rule (a sketch of this recipe follows the list):
  P(c|x) = P(x|c)P(c) / [ P(x|c)P(c) + P(x|~c)P(~c) ]
  No: avoid small disjuncts. With naïve Bayes: P(x|c) = ∏_i P(x_i | c).
- "Under/over-sampling are appropriate." No: do cost-based, example-specific sampling, then bagging.
- "ROC curves and AUC are important."
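A minimal Python sketch of the "model each class separately, then use Bayes' rule" recipe, using the naïve Bayes factorization from the slide; the toy binary features, counts, and Laplace smoothing are illustrative assumptions, not details from Elkan's talk.

```python
# Sketch of the recipe: fit P(x|c) per class, then combine with the class
# priors via Bayes' rule. Uses P(x|c) = prod_i P(x_i|c). Toy data only.
import numpy as np

def fit_class_model(X):
    """Per-feature Bernoulli estimates P(x_i = 1 | c), Laplace-smoothed."""
    return (X.sum(axis=0) + 1) / (len(X) + 2)

def log_likelihood(x, theta):
    """log P(x | c) under the naive Bayes factorization."""
    return np.sum(x * np.log(theta) + (1 - x) * np.log(1 - theta))

# Toy data: 8 majority and 2 minority examples, 3 binary features each.
rng = np.random.default_rng(0)
X_neg = rng.integers(0, 2, size=(8, 3))
X_pos = rng.integers(0, 2, size=(2, 3))
prior_pos = len(X_pos) / (len(X_pos) + len(X_neg))

theta_pos, theta_neg = fit_class_model(X_pos), fit_class_model(X_neg)

x = np.array([1, 0, 1])  # query example
# Bayes' rule: P(c|x) = P(x|c)P(c) / [P(x|c)P(c) + P(x|~c)P(~c)]
joint_pos = np.exp(log_likelihood(x, theta_pos)) * prior_pos
joint_neg = np.exp(log_likelihood(x, theta_neg)) * (1 - prior_pos)
print("P(minority | x) =", joint_pos / (joint_pos + joint_neg))
```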