Published by Aubrie Heath. Modified over 9 years ago.
1
Learning from Imbalanced, Only Positive and Unlabeled Data
Yetian Chen
04-29-2009
2
Outline
Introduction and problem statement
  – 2008 UC San Diego Data Mining Competition
Task 1: Supervised Learning from Imbalanced Data Sets
  – Over-sampling and under-sampling
Task 2: Semi-Supervised Learning from Only Positive and Unlabeled Data
  – Two-step strategy
3
Statement of Problems
2008 UC San Diego Data Mining Competition
Task 1: Standard Binary Classification
A binary classification task involving 20 real-valued features from an experiment in the physical sciences. The training data consist of 40,000 examples, with roughly ten times as many negative examples as positive. The test set, however, is evenly split between positive and negative examples.
Task 2: Positive-Only Semi-Supervised Task
Also a binary classification task, but most of the training examples are unlabeled; only a few of the positive examples have labels. The unlabeled examples include both positives and negatives, with several times as many negative training examples as positive. This class distribution is reflected in the test sets.
4
Task 1: Learning from Imbalanced Data
Class imbalance is prevalent in many applications: fraud/intrusion detection, risk management, text classification, medical diagnosis/monitoring, etc.
Standard classifiers tend to be overwhelmed by the large classes and ignore the small ones; that is, they tend to produce high predictive accuracy on the majority class but poor predictive accuracy on the minority class.
5
Solutions to the Class Imbalance Problem
At the data level (re-sampling)
  – Over-sampling: increase the number of minority instances by duplicating them or generating synthetic ones
  – Under-sampling: extract a smaller set of majority instances while preserving all the minority instances
At the algorithmic level
  – Cost-sensitive learning: adjust the costs of the various classes so as to counter the class imbalance
  – ……
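To illustrate the cost-sensitive idea, here is a minimal sketch of one common heuristic: weight each class inversely to its frequency, so that errors on the rare class cost more. The function name `balanced_class_weights` and the weighting formula are assumptions chosen for illustration; the slides do not say which cost scheme would be used.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    Assumed heuristic: rarer classes receive proportionally larger weights.
    """
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Toy labels with a 10:1 negative-to-positive ratio, as in Task 1.
labels = [0] * 10 + [1]
weights = balanced_class_weights(labels)
```

With a 10:1 imbalance the positive class ends up weighted ten times as heavily as the negative class, which is the kind of adjustment a cost-sensitive learner would consume.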
6
Over-sampling
SMOTE (Synthetic Minority Over-sampling Technique): the minority class is over-sampled by taking each minority class sample and introducing synthetic examples along the line segments joining it to any/all of its k nearest minority class neighbors.
Over-sampling by duplication: the minority examples are simply duplicated.
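The SMOTE interpolation step can be sketched in a few lines of plain Python. This is a simplified illustration, not the implementation used for the competition: the function name `smote`, its parameters, and the choice to cycle through the minority samples in order are all assumptions.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Generate n_synthetic points between minority samples and their neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for i in range(n_synthetic):
        base_index = i % len(minority)  # cycle through every minority sample
        base = minority[base_index]
        # k nearest minority neighbours of base, excluding base itself
        others = [p for j, p in enumerate(minority) if j != base_index]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy minority class: the four corners of the unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(minority, n_synthetic=10, k=2)
```

Every synthetic point lies on a segment between two existing minority points, so the new examples stay inside the region the minority class already occupies.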
7
Under-sampling
Randomly select a subset of the majority class whose size is roughly equal to the size of the minority class.
After re-sampling, apply standard classifiers to the rebalanced data sets and compare the accuracies.
Classifiers: Decision Tree, Naïve Bayes, Neural Network (one hidden layer)
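Random under-sampling as described above might look like the following sketch. The function name `random_undersample` and the convention that label 1 marks the minority class are assumptions made for the example.

```python
import random


def random_undersample(X, y, minority_label=1, seed=0):
    """Keep all minority examples and an equally sized random majority subset."""
    rng = random.Random(seed)
    minority_idx = [i for i, label in enumerate(y) if label == minority_label]
    majority_idx = [i for i, label in enumerate(y) if label != minority_label]
    n_keep = min(len(minority_idx), len(majority_idx))
    kept = minority_idx + rng.sample(majority_idx, n_keep)
    rng.shuffle(kept)  # avoid leaving the classes in two contiguous runs
    return [X[i] for i in kept], [y[i] for i in kept]


# Toy data: 2 positives and 20 negatives.
X = [[float(i)] for i in range(22)]
y = [1, 1] + [0] * 20
X_balanced, y_balanced = random_undersample(X, y)
```

The rebalanced set can then be passed to any standard classifier; the discarded majority examples are simply not used for training.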
8
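Results for Task 1
        Regular   US      OSbD    SMOTE
DT      0.791     0.828   0.788   0.875
NB      0.834     0.827           0.838
NN      0.835     0.909   0.904   0.91
US = random under-sampling; OSbD = over-sampling by duplication; DT = Decision Tree; NB = Naïve Bayes; NN = Neural Network.
For the Neural Network classifiers, I experimented with different numbers of hidden units (5, 11, 15, 20); 11 gives the best accuracy.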
9
My Ranking (52nd/199) ……
10
Conclusion for Task 1
For Naïve Bayes classifiers, re-sampling does not improve the accuracy significantly.
For Decision Tree classifiers, random under-sampling and over-sampling with SMOTE significantly improve the accuracy.
For Neural Network classifiers, all three re-sampling techniques significantly improve the accuracy.
The Neural Network classifier with SMOTE over-sampling gives the best accuracy among all classifiers and re-sampling techniques.
11
Task 2: Learning from Only Positive and Unlabeled Data
Positive examples: one has a set of examples of a class P.
Unlabeled set: one also has a set U of unlabeled (or mixed) examples, with instances from P and also instances not from P (negative examples).
Build a classifier: build a classifier to classify the examples in U and/or future (test) data.
Key feature of the problem: no labeled negative training data.
We call this problem PU-learning.
12
Examples in Real Life
Specialized molecular biology databases: each defines a set of positive examples (genes/proteins related to a certain disease or function), but gives no information about examples that should not be included, and it is unnatural to build such a set.
Learning a user’s preference for web pages:
  – The user’s bookmarks can be considered positive examples
  – All other web pages are unlabeled examples
Direct marketing: a company’s current list of customers serves as positive examples
Text classification: labeling is labor intensive
13
Are Unlabeled Examples Helpful?
Function known to be either x1 < 0 or x2 > 0. Which one is it?
[Figure: positive (+) and unlabeled (u) examples, with the candidate regions x1 < 0 and x2 > 0 marked]
“Not learnable” with only positive examples. However, the addition of unlabeled examples makes it learnable.
14
Two-step strategy
Step 1: Identify a set of reliable negative examples from the unlabeled set.
  – S-EM [Liu et al, 2002] uses a Spy technique
  – PEBL [Yu et al, 2002] uses a 1-DNF technique
  – Roc-SVM [Li & Liu, 2003] uses the Rocchio algorithm
  – …
Step 2: Build a sequence of classifiers by iteratively applying a classification algorithm, then select a good classifier.
  – S-EM uses the Expectation Maximization (EM) algorithm, with an error-based classifier selection mechanism
  – PEBL uses SVM and returns the classifier at convergence, i.e., no classifier selection
  – Roc-SVM uses SVM with a heuristic method for selecting the final classifier
15
[Diagram: Step 1 splits the unlabeled set U into Reliable Negatives (RN) and Q = U - RN, alongside the positive set P. Step 2 either uses P, RN and Q to build the final classifier iteratively, or uses only P and RN to build a classifier.]
16
Step 1: The Spy technique
Sample a certain % of the positive examples and put them into the unlabeled set to act as “spies”.
Run a classification algorithm assuming all unlabeled examples are negative.
  – Through the “spies”, we learn how the actual positive examples in the unlabeled set behave.
  – Use the Expectation-Maximization (EM) algorithm to assign each unlabeled example a probabilistic class label.
We can then extract reliable negative examples from the unlabeled set more accurately.
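The spy idea can be sketched as follows. This is a simplified illustration under several assumptions: it fits a single Gaussian naive Bayes model instead of running the EM iterations, and the names `fit_gaussian_nb`, `positive_probability`, `spy_reliable_negatives`, and the parameters `spy_fraction` and `noise` are hypothetical, not taken from the slides or from S-EM.

```python
import math
import random


def fit_gaussian_nb(X, y):
    """Fit a Gaussian naive Bayes model: per-class prior, feature means, variances."""
    model = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        columns = list(zip(*rows))
        means = [sum(col) / len(rows) for col in columns]
        # A small floor keeps every variance strictly positive.
        variances = [
            sum((v - m) ** 2 for v in col) / len(rows) + 1e-6
            for col, m in zip(columns, means)
        ]
        model[label] = (len(rows) / len(X), means, variances)
    return model


def positive_probability(model, x):
    """Posterior probability that x belongs to class 1."""
    log_joint = {}
    for label, (prior, means, variances) in model.items():
        log_joint[label] = math.log(prior) + sum(
            -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
            for v, m, var in zip(x, means, variances)
        )
    top = max(log_joint.values())
    total = sum(math.exp(score - top) for score in log_joint.values())
    return math.exp(log_joint[1] - top) / total


def spy_reliable_negatives(P, U, spy_fraction=0.15, noise=0.05, seed=0):
    """Return the unlabeled examples that score below almost all spies."""
    if len(P) < 2:
        raise ValueError("need at least two positive examples")
    rng = random.Random(seed)
    n_spies = min(len(P) - 1, max(1, int(spy_fraction * len(P))))
    spy_idx = set(rng.sample(range(len(P)), n_spies))
    spies = [p for i, p in enumerate(P) if i in spy_idx]
    positives = [p for i, p in enumerate(P) if i not in spy_idx]

    # Train as if every unlabeled example, spies included, were negative.
    X = positives + U + spies
    y = [1] * len(positives) + [0] * (len(U) + len(spies))
    model = fit_gaussian_nb(X, y)

    # Threshold below which only a `noise` share of the spies falls.
    spy_scores = sorted(positive_probability(model, s) for s in spies)
    threshold = spy_scores[int(noise * len(spy_scores))]
    return [u for u in U if positive_probability(model, u) < threshold]


# Toy data: positives cluster near (2, 2), negatives near (-2, -2).
data_rng = random.Random(1)


def cloud(cx, cy, n):
    return [[data_rng.gauss(cx, 0.5), data_rng.gauss(cy, 0.5)] for _ in range(n)]


P = cloud(2, 2, 40)
U = cloud(2, 2, 20) + cloud(-2, -2, 100)  # hidden positives plus negatives
reliable_negatives = spy_reliable_negatives(P, U)
```

Because the spies are known positives, the scores they receive show how hidden positives in U are likely to score; anything in U that scores below nearly all spies is treated as a reliable negative.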
17
Step 1: The Spy technique
18
Step 2: Building the final classifier
Use a Naïve Bayes classifier as the final classifier.
Use P as the positive class and N (the reliable negative examples) as the negative class.
19
Results and Conclusion for Task 2
Baseline: use P as the positive class and U as the negative class, with SMOTE over-sampling P so that the size of P is roughly the same as U: F1 score = 0.545
Two-step algorithm: F1 score = 0.651
The highest score is F1 = 0.721
Learning from only positive and unlabeled data is feasible with the two-step strategy.
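For reference, the F1 score reported above is the harmonic mean of precision and recall on the positive class. A minimal sketch of the computation follows; the function name `f1_score` and the toy labels are illustrative only and are not the competition data.

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 = 2 * precision * recall / (precision + recall) for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example: 2 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 0]
score = f1_score(y_true, y_pred)  # precision = recall = 2/3, so F1 = 2/3
```

Unlike plain accuracy, F1 ignores true negatives, which makes it a sensible metric when negatives outnumber positives several times over, as in Task 2.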
20
Future Work
For Task 1, we can try cost-sensitive methods.
For Task 2, other variants of the two-step strategy:
  – Step 1: 1-DNF, Rocchio algorithm
  – Step 2: SVM
21
References
B. Liu, Y. Dai, X. Li, W. S. Lee, and P. S. Yu. Building text classifiers using positive and unlabeled examples. In Proceedings of the 3rd IEEE International Conference on Data Mining (ICDM 2003), pages 179–188, 2003.
B. Liu, W. S. Lee, P. S. Yu, and X. Li. Partially Supervised Classification of Text Documents. In Proceedings of the Nineteenth International Conference on Machine Learning (ICML-2002), July 8-12, 2002, Sydney, Australia.
W. S. Lee and B. Liu. Learning with Positive and Unlabeled Examples using Weighted Logistic Regression. In Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003), August 21-24, 2003, Washington, DC, USA.
G. H. Nguyen, A. Bouzerdoum, and S. L. Phung. A supervised learning approach for imbalanced data sets. ICPR 2008: 1-4.
N. V. Chawla, N. Japkowicz, and A. Kołcz. Editorial: special issue on learning from imbalanced data sets. SIGKDD Explorations 6(1): 1-6, 2004.
N. V. Chawla et al. SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, Vol. 16, pp. 321-357, 2002.