KDD-09, Paris, France
Quantification and Semi-Supervised Classification Methods for Handling Changes in Class Distribution
Jack Chongjie Xue† and Gary M. Weiss
Department of Computer and Information Science, Fordham University, USA
† Also with the Office of Institutional Research, Fordham University
Important Research Problem
- Distributions may change after the model is induced
- Our research problem/scenario:
  - Class distribution changes but the "concept" does not
  - Let x represent an example and y its label. We assume:
    - P(y|x) is constant (i.e., the concept does not change)
    - P(y) changes (which means that P(x) must change)
  - Assume unlabeled data are available from the new class distribution (training and a separate test set)
Research Questions and Goals
- Two research questions:
  - How can we maximize classifier performance when the class distribution changes but is unknown?
  - How can we utilize unlabeled data from the changed class distribution to accomplish this?
- Our goals:
  - Outperform naïve methods that ignore these changes
  - Approach the performance of an "oracle" method that trains on labeled data from the new distribution
When Class Distribution Changes
Technical Approaches
- Quantification [Forman, KDD-06 & DMKD-08]
  - The task of estimating a class distribution (CD); much easier than classification
  - Adjust the model to compensate for the CD change [Elkan 01; Weiss & Provost 03]
  - New examples are not used directly in training
  - We call these class distribution estimation (CDE) methods
- Semi-Supervised Learning (SSL)
  - Exploits unlabeled data, which are used for training
- Other approaches are discussed later
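As a concrete illustration of "adjusting the model to compensate for a CD change," a standard prior-shift correction in the spirit of Elkan (2001) rescales each posterior by the ratio of new to old priors and renormalizes. This is a minimal sketch of that general idea, not necessarily the exact adjustment used in the talk:

```python
def adjust_posterior(p_pos, old_prior, new_prior):
    """Correct a classifier's positive-class posterior P(+|x) for a changed
    class prior: multiply each class posterior by (new prior / old prior)
    and renormalize (prior-shift correction in the style of Elkan 2001)."""
    num = p_pos * new_prior / old_prior
    den = num + (1 - p_pos) * (1 - new_prior) / (1 - old_prior)
    return num / den

# A model trained on a 50/50 distribution outputs P(+|x) = 0.6; if the new
# distribution is only 10% positive, the adjusted posterior is much lower.
adjusted = adjust_posterior(0.6, old_prior=0.5, new_prior=0.1)
```

Note that when the prior is unchanged the adjustment is a no-op, which is why an accurate estimate of the new CD (the quantification step) is the crux of the approach.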
CDE Methods
- CDE-Oracle (upper bound)
  - Determines the new CD by peeking at the class labels, then adjusts the model; the CDE upper bound
- CDE-Iterate-n
  - An iterative algorithm, because changes to the class distribution will be underestimated:
    1. Build model M on the original training data (using the last NEW CD estimate)
    2. Label the new-distribution data to estimate NEW CD
    3. Adjust M using the NEW CD estimate; output M
    4. Increment n; loop to step 1
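The CDE-Iterate loop above can be sketched on a toy problem. This is a simplified, self-contained illustration (a fixed set of posterior scores stands in for the J48 model, and the prior-shift correction stands in for the model adjustment), not the authors' actual implementation:

```python
def cde_iterate(posteriors, train_prior, n_iters=3):
    """Sketch of CDE-Iterate: repeatedly (1) adjust the model for the current
    estimate of the new class distribution, (2) re-label the new-distribution
    data, and (3) re-estimate the positive rate from those labels.

    `posteriors` are the unadjusted P(+|x) scores the original model assigns
    to the unlabeled new-distribution examples (hypothetical toy input)."""
    est_prior = train_prior  # start from the training-set class distribution
    for _ in range(n_iters):
        # keep the estimate strictly inside (0, 1) so the adjustment is defined
        est_prior = min(0.99, max(0.01, est_prior))
        # adjust each posterior for the current prior estimate (prior shift)
        adjusted = [
            p * est_prior / train_prior /
            (p * est_prior / train_prior +
             (1 - p) * (1 - est_prior) / (1 - train_prior))
            for p in posteriors
        ]
        # the predicted positive rate under the adjusted model is the new estimate
        est_prior = sum(p >= 0.5 for p in adjusted) / len(adjusted)
    return est_prior
```

On the first pass the estimate equals the training prior, so the adjustment is a no-op and the estimate is simply the raw predicted positive rate; later passes refine it, which is why iterating helps when the change is underestimated.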
CDE Methods
- CDE-AC
  - Based on Adjusted Count quantification; see [Forman, KDD-06 and DMKD-08] for details
  - Adjusted positive rate: pr* = (pr − fpr) / (tpr − fpr)
    - pr is calculated from the predicted class labels
    - fpr and tpr are obtained via cross-validation on the labeled training set
  - Essentially compensates for the fact that pr will underestimate changes to the class distribution
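The Adjusted Count formula is a one-liner; the only subtlety is that the corrected estimate can fall outside [0, 1] and is conventionally clipped. A minimal sketch:

```python
def adjusted_count(pr, tpr, fpr):
    """Adjusted Count quantification (Forman): correct the raw predicted
    positive rate `pr` using the classifier's tpr/fpr from cross-validation,
    then clip to the valid probability range."""
    pr_star = (pr - fpr) / (tpr - fpr)
    return min(1.0, max(0.0, pr_star))  # raw estimate may fall outside [0, 1]

# Classifier predicts 30% positives; with tpr = 0.8 and fpr = 0.1,
# the corrected positive-rate estimate is (0.3 - 0.1) / (0.8 - 0.1).
estimate = adjusted_count(pr=0.3, tpr=0.8, fpr=0.1)
```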
SSL Methods
- SSL-Naïve
  1. Build a model from the labeled training data
  2. Label the unlabeled data from the new distribution
  3. Build a new model from the predicted labels of the new-distribution data
  - Note: does not directly use the original training data
- SSL-Self-Train
  - Similar to SSL-Naïve, but the original training data are used, merged with the new-distribution examples that have the most confident predictions (above the median)
  - Iterates until all examples are merged or the maximum number of iterations (4) is reached
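The SSL-Self-Train loop can be sketched on a toy 1-D problem. A trivial midpoint-threshold classifier stands in for J48 here; this mirrors the structure of the loop (merge the above-median-confidence predictions each round), not the authors' actual implementation:

```python
def self_train(labeled, unlabeled, max_iters=4):
    """Sketch of SSL-Self-Train: `labeled` is a list of (x, y) pairs and
    `unlabeled` is a list of x values from the new distribution. Each round,
    a midpoint-threshold "model" is trained on the merged data, and the half
    of the pool with the most confident predictions (largest distance from
    the decision threshold) is labeled and merged into the training set."""
    data = list(labeled)
    pool = list(unlabeled)
    for _ in range(max_iters):
        if not pool:
            break  # all examples merged
        # "train": threshold at the midpoint between the two class means
        pos = [x for x, y in data if y == 1]
        neg = [x for x, y in data if y == 0]
        thresh = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
        # confidence = distance from the decision threshold
        scored = sorted(pool, key=lambda x: abs(x - thresh), reverse=True)
        cutoff = max(1, len(scored) // 2)  # keep the above-median half
        confident, pool = scored[:cutoff], scored[cutoff:]
        data += [(x, 1 if x > thresh else 0) for x in confident]
    return data, pool
```

Merging only the most confident predictions each round is what distinguishes self-training from SSL-Naïve, which trusts every predicted label at once.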
Hybrid Method
- A combination of SSL-Self-Train and CDE-Iterate
- Can be viewed as SSL-Self-Train where, at each iteration, the model is adjusted to compensate for the difference between the CD of the merged training data and the CD of the new data the model is applied to
Experimental Methodology
- Use 5 relatively large UCI data sets
- Partition the data to form "original" and "new" distributions
  - Original distribution made to be 50% positive
  - New distribution varied from 1% to 99% positive
  - Results averaged over 10 random runs
- Use WEKA's J48 (a C4.5-style decision tree learner) for the experiments
- Track accuracy and F-measure
  - F-measure places more emphasis on the minority class
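For reference, the F-measure tracked here is the standard F1 score, the harmonic mean of precision and recall, which is why it emphasizes minority-class performance where accuracy does not:

```python
def f_measure(tp, fp, fn):
    """Standard F1 score from positive-class counts. Unlike accuracy, it is
    driven entirely by performance on the (typically minority) positive
    class, since true negatives do not appear in the formula."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# 8 true positives, 2 false positives, 2 false negatives:
# precision = recall = 0.8, so F1 = 0.8
score = f_measure(tp=8, fp=2, fn=2)
```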
Results: Accuracy (Adult Data Set)
Results: Accuracy (SSL-Naïve)
Results: Accuracy (SSL-Self-Train)
Results: Accuracy (CDE-Iterate-1)
Results: Accuracy (CDE-Iterate-2)
Results: Accuracy (Hybrid)
Results: Accuracy (CDE-AC)
Results: Average Accuracy (over 99 positive rates)
Results: F-Measure (Adult Data Set)
Results: F-Measure (over 99 positive rates)
Why Do Oracle Methods Perform Poorly?
- Oracle method:
  - The oracle trains only on the new distribution
  - The new distribution is often very unbalanced
  - F-measure should do best with balanced data; Weiss and Provost (2003) show that a balanced distribution is best for AUC
- CDE-Oracle method:
  - CDE-Iterate underestimates the change in class distribution
  - That underestimation may actually help F-measure, since it better balances the importance of the minority class
Conclusion
- Performance can be substantially improved by not ignoring changes to the class distribution
  - Unlabeled data from the new distribution can be exploited, even if only to estimate the new CD
  - Quantification methods can be very helpful, and much better than semi-supervised learning alone
Future Work
- The problem is reduced with well-calibrated probability models (Zadrozny & Elkan '01)
  - Decision trees do not produce these
  - Evaluate methods that produce good probability estimates
- In our problem setting, p(x) changes
  - Try methods that measure this change and compensate for it (e.g., by weighting the x's)
- Experiment with an initial distribution that is not 1:1
  - Especially highly skewed distributions (e.g., diseases)
- Other issues: data streams / real-time updates
References
[Forman 06] G. Forman, Quantifying trends accurately despite classifier error and class imbalance, KDD-06.
[Forman 08] G. Forman, Quantifying counts and costs via classification, Data Mining and Knowledge Discovery, 17(2).
[Weiss & Provost 03] G. Weiss & F. Provost, Learning when training data are costly: the effect of class distribution on tree induction, Journal of Artificial Intelligence Research, 19.
[Zadrozny & Elkan 01] B. Zadrozny & C. Elkan, Obtaining calibrated probability estimates from decision trees and naïve Bayesian classifiers, ICML-01.