Intelligent Database Systems Lab, 國立雲林科技大學 National Yunlin University of Science and Technology
Extreme Re-balancing for SVMs: a case study
Advisor: Dr. Hsu
Reporter: Wen-Hsiang Hu
Authors: Bhavani Raskutti and Adam Kowalczyk
SIGKDD Explorations

Outline
Motivation
Objective
Related Research
Support Vector Machines
Re-balancing of the Data
─ Sample Balancing
─ Weight Balancing
Experiments
Discussion
Conclusion
Personal Opinion

Motivation
A standard recipe for two-class discrimination is to take examples from both classes and then generate a model that discriminates between them.
However, there are many applications where obtaining examples of a second class is difficult.
─ e.g. classifying sites of “interest” to a web surfer
There are also situations where the two classes of interest are heavily unbalanced in the data.
─ e.g. fraud detection and information filtering

Objective
Obtain better performance by using one-class learners.

Related Research (1/2)
Many solutions have been proposed to address the imbalance problem, including sampling and weighting examples.
─ Typically, these methods focus on cases where the imbalance ratio of minority to majority class is around 10:90.
In this paper, we focus on extreme imbalance in very high dimensional input spaces, where at the learning stage the minority class makes up around 1-3% of the data.

Related Research (2/2)
In both cases (image retrieval and document classification)
─ one-class models are much worse than two-class models.
In this paper, we show that for certain problems, such as the gene knock-out experiments for understanding the AHR (aryl hydrocarbon receptor) signalling pathway,
─ minority one-class SVMs significantly outperform models learnt using examples from both classes.

Support Vector Machines (1/4)
We are given a training sequence (x_i, y_i) of binary n-vectors x_i and bipolar labels y_i.
Our aim is to find a “good” discriminating function, here a kernel machine.

Support Vector Machines (2/4)

Support Vector Machines (3/4)

Support Vector Machines (4/4)
If the kernel k satisfies the assumptions of Mercer's theorem [7; 24; 25], then the minimiser of (2) has an explicit form.
We shall be using the popular polynomial kernel.

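As a rough illustration of the polynomial kernel named on this slide, the sketch below evaluates one on binary vectors. The slide does not preserve the kernel's exact definition, so the form (x · z + 1)^d is an assumption, and the names `poly_kernel` and `degree` are hypothetical.

```python
def poly_kernel(x, z, degree=2):
    """Polynomial kernel (x . z + 1) ** degree for equal-length vectors.

    Assumed form; the slide does not give the kernel's exact definition.
    """
    if len(x) != len(z):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(x, z))
    return (dot + 1) ** degree


# Two binary 5-vectors that share two active features (dot product = 2).
x = [1, 0, 1, 1, 0]
z = [1, 1, 1, 0, 0]
print(poly_kernel(x, z, degree=1))  # linear case: 2 + 1 = 3
print(poly_kernel(x, z, degree=2))  # quadratic case: 3 ** 2 = 9
```

With degree 1 the kernel reduces to a shifted dot product, which corresponds to a linear model; higher degrees add interactions between features.
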
Re-balancing of the Data - Sample Balancing

Re-balancing of the Data - Weight Balancing
B is a parameter called the balance factor.
─ B = 0: the case of “balanced proportions”
─ B = +1: learning from positive examples only
─ B = -1: learning from negative examples only

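The three cases of the balance factor can be sketched as per-class misclassification costs. The weighting formula itself is not preserved on the slide, so the scheme below is an assumption: it uses (1 + B) / (2 m+) for positive examples and (1 - B) / (2 m-) for negative examples, chosen only because it reproduces the three named cases. The range [-1, 1] for B and the names `class_costs`, `n_pos`, `n_neg`, and `c` are likewise hypothetical.

```python
def class_costs(n_pos, n_neg, balance, c=1.0):
    """Return (cost_pos, cost_neg), the per-example cost for each class.

    Assumed weighting, not taken from the slide:
        cost_pos = c * (1 + B) / (2 * n_pos)
        cost_neg = c * (1 - B) / (2 * n_neg)
    """
    if not -1.0 <= balance <= 1.0:
        raise ValueError("balance factor B must lie in [-1, 1]")
    if n_pos <= 0 or n_neg <= 0:
        raise ValueError("both classes need at least one example")
    cost_pos = c * (1.0 + balance) / (2.0 * n_pos)
    cost_neg = c * (1.0 - balance) / (2.0 * n_neg)
    return cost_pos, cost_neg


# 20 positive and 980 negative examples (a 2% minority class).
for b in (0.0, 1.0, -1.0):
    cost_pos, cost_neg = class_costs(20, 980, b)
    # Print B and the total weight carried by each class as a whole.
    print(b, cost_pos * 20, cost_neg * 980)
```

Under this assumed scheme, B = 0 gives both classes the same total weight regardless of their sizes, while B = +1 and B = -1 set the cost of the other class to zero, so that class no longer influences training.
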
Experiments - Real World Data Collections
AHR data set, used for task 2 of KDD Cup 2002
─ aryl hydrocarbon receptor (AHR) data set
─ for cancer research
─ three classes: change, control, nc
Reuters data
─ documents

Performance Measures
We have used AROC, the Area under the Receiver Operating Characteristic (ROC) curve, as our main performance measure.
The trivial uniform random predictor has an AROC of 0.5, while a perfect predictor has an AROC of 1.
In the AROC formula, x_i is drawn from the negative class and x_j from the positive class.

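The AROC can be read as the fraction of (negative, positive) pairs in which the positive example receives the higher score. The sketch below computes it that way; counting ties as 0.5 is a common convention assumed here, not something stated on the slide, and the function name `aroc` is hypothetical.

```python
def aroc(neg_scores, pos_scores):
    """Fraction of (negative, positive) pairs ranked correctly; ties count 0.5."""
    if not neg_scores or not pos_scores:
        raise ValueError("need at least one score from each class")
    total = 0.0
    for s_neg in neg_scores:
        for s_pos in pos_scores:
            if s_pos > s_neg:
                total += 1.0
            elif s_pos == s_neg:
                total += 0.5
    return total / (len(neg_scores) * len(pos_scores))


print(aroc([0.1, 0.2], [0.8, 0.9]))  # every positive outranks every negative: 1.0
print(aroc([0.5, 0.5], [0.5, 0.5]))  # constant scores, all ties: 0.5
print(aroc([0.1, 0.6], [0.5, 0.9]))  # 3 of 4 pairs ranked correctly: 0.75
```

Because the measure depends only on the ranking of scores, it is unaffected by the class proportions, which makes it a natural choice for heavily unbalanced data.
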
Experiments with Real World Data
The training:test split sizes were
─ 50%:50% for the Reuters data
─ 70%:30% for the AHR data

Impact of Regularization Constant
Curves compared: positive 1-class, balanced 2-class, un-balanced 2-class, negative 1-class

Experiments with Sample Balancing

Impact of feature selection (1/2)
Feature selection methods:
─ DocFreq (document frequency thresholding)
─ ChiSqua (χ²): measures the lack of independence between a feature and a class of interest
─ MutInfo (mutual information)
─ InfGain (information gain): a term goodness measure
We have used all of the minority cases and sampled the majority cases at different mixture ratios (MajorityOnly sample balancing).

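MajorityOnly sample balancing, as described above, keeps every minority case and draws only a subset of the majority cases. The sketch below is one way to do this, not the authors' procedure: expressing the mixture ratio as the number of majority cases kept per minority case is an assumption, and `majority_only_sample`, `majority_per_minority`, and `seed` are hypothetical names.

```python
import random


def majority_only_sample(minority, majority, majority_per_minority, seed=0):
    """Keep every minority case; draw a random subset of majority cases.

    majority_per_minority is the assumed mixture ratio: the number of
    majority cases kept per minority case (0 keeps the minority class only).
    """
    if majority_per_minority < 0:
        raise ValueError("ratio must be non-negative")
    rng = random.Random(seed)
    # Never ask for more majority cases than are available.
    n_keep = min(len(majority), round(majority_per_minority * len(minority)))
    return list(minority), rng.sample(majority, n_keep)


minority = ["p1", "p2", "p3"]
majority = [f"n{i}" for i in range(100)]
kept_min, kept_maj = majority_only_sample(minority, majority, 5)
print(len(kept_min), len(kept_maj))  # 3 minority cases, 15 sampled majority cases
```

Setting the ratio to 0 yields a minority-only training set, which corresponds to one-class learning from the minority class.
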
Impact of feature selection (2/2)

Experiments with Weight Balancing
To understand whether the impact of negative examples can be reduced using the balance factor B in Equation (4), we ran:
─ Tests on AHR data
─ Tests on Reuters

Tests on AHR data
─ B = 0: balanced 2-class
─ B = +1: positive 1-class
─ B = -1: negative 1-class

Tests on Reuters
Curves compared: balanced 2-class, positive 1-class

Experiments with Synthetic Data
S_1: n_inf = 1; n_noise = 999
S_2: n_inf = 10; n_noise = 990
S_3: n_inf = 1; n_noise = 19
Two polynomial kernels: a linear kernel and a non-linear kernel

Discussion

Conclusion
The Reuters data set
─ one-class learning provides quite good results, but using both classes always produces better results
The AHR data set
─ positive one-class learners perform significantly better than two-class learners
One-class learning from positive class examples can be a very robust classification technique when dealing with very unbalanced data and a high dimensional, noisy feature space.

Personal Opinion
Strength
─ many experiments
Weakness
─ equations are not clear
Application
─ SVM document classification
─ image retrieval