國立雲林科技大學 National Yunlin University of Science and Technology Learning from Imbalanced Data Sets with Boosting and Data Generation: The DataBoost-IM Approach.


1 國立雲林科技大學 National Yunlin University of Science and Technology Learning from Imbalanced Data Sets with Boosting and Data Generation: The DataBoost-IM Approach From: Hongyu Guo, Herna L. Viktor, SIGKDD Explorations, Volume 6, Issue 1, pp. 30-39. Presenter: Wei-Shen Tai Advisor: Professor Chung-Chian Hsu 2005/11/15

2 N.Y.U.S.T. I. M. Outline Introduction DataBoost-IM algorithm Identify Seed Examples Data Generation and Class Frequency Balancing Balancing the Training Weights Experiment results Conclusions Comments

3 N.Y.U.S.T. I. M. Motivation Learning from imbalanced data sets, where the number of examples of one (majority) class is much higher than that of the other, presents an important challenge to the machine learning community. Traditional machine learning algorithms may be biased towards the majority class, thus producing poor predictive accuracy over the minority class.

4 N.Y.U.S.T. I. M. Objective Combine boosting, an ensemble-based learning algorithm, with data generation to improve the predictive power of classifiers against imbalanced data sets consisting of two classes.

5 N.Y.U.S.T. I. M. Introduction The data imbalance problem corresponds to domains in which one class is represented by a large number of examples while the other is represented by only a few. Machine learning algorithms tend to produce high predictive accuracy over the majority class, but poor predictive accuracy over the minority class. The problem arises in many real-world applications, such as fraud detection, telecommunications management, oil spill detection and text classification.

6 N.Y.U.S.T. I. M. Feasible solution Ensembles improve the performance of weak classification algorithms. Boosting is an ensemble method in which the performance of weak classifiers is improved by focusing on hard examples that are difficult to classify. It produces a series of classifiers, and the outputs of these classifiers are combined using weighted voting in the final prediction of the model.

7 N.Y.U.S.T. I. M. Basic concept In each step of the series, the training examples are re-weighted and selected based on the performance of earlier classifiers in the training series. Weight-balancing schema: this produces a set of "easy" examples with low weights and a set of hard ones with high weights. Improvement is achieved by concentrating on classifying the hard examples correctly.

8 N.Y.U.S.T. I. M. DataBoost-IM Combines data generation and boosting to improve the predictive accuracies of both the majority and minority classes. Major steps: 1. Separately identify hard examples from each class. 2. Generate synthetic examples biased toward the hard examples, to prevent boosting from over-emphasizing them. 3. Rebalance the class frequencies in the new training set (through the utilization of a reduced number of examples). 4. Rebalance the total weights of the different classes in the new training set.

9 N.Y.U.S.T. I. M. DataBoost-IM pseudo-code
Input: a sequence of m examples with labels y_i ∈ Y = {1, …, k}; a weak learning algorithm WeakLearn; an integer T specifying the number of iterations.
Initialize D_1(i) = 1/m for all i.
Do for t = 1, 2, …, T:
1. Identify hard examples from the original data set for the different classes.
2. Generate synthetic data to balance the training knowledge of the different classes.
3. Add the synthetic data to the original training set to form a new training data set.
4. Update and balance the total weights of the different classes in the new training data set.
5. Call WeakLearn, providing it with the new training set with synthetic data and rebalanced weights.
6. Get back a hypothesis h_t : X → Y.
7. Calculate the error of h_t: ε_t = Σ_{i : h_t(x_i) ≠ y_i} D_t(i). If ε_t > 1/2, then set T = t − 1 and abort the loop.
8. Set β_t = ε_t / (1 − ε_t).
9. Update the distribution: D_{t+1}(i) = (D_t(i) / Z_t) × β_t if h_t(x_i) = y_i, and D_t(i) / Z_t otherwise, where Z_t is a normalization constant (chosen so that D_{t+1} will be a distribution).
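Steps 7-9 above follow the AdaBoost.M1 weight update. A minimal sketch of that update in NumPy (`adaboost_m1_update` is a hypothetical helper name; the real algorithm wraps this inside the full DataBoost-IM loop with data generation and class rebalancing):

```python
import numpy as np

def adaboost_m1_update(D, correct):
    """One AdaBoost.M1 weight update (steps 7-9 of the pseudo-code).

    D       -- current distribution over the training examples (sums to 1)
    correct -- boolean array, True where h_t classified the example correctly
    Returns (beta_t, D_next), or None if the error exceeds 1/2 (abort).
    """
    err = D[~correct].sum()                    # weighted error of h_t
    if err > 0.5:
        return None                            # set T = t - 1 and abort loop
    beta = err / (1.0 - err)                   # beta_t = eps_t / (1 - eps_t)
    D_next = np.where(correct, D * beta, D)    # shrink weights of easy examples
    return beta, D_next / D_next.sum()         # divide by Z_t (normalization)
```

After the update, correctly classified examples carry less weight and misclassified ones relatively more, so the next weak learner concentrates on the hard examples.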

10 N.Y.U.S.T. I. M. DataBoost-IM algorithm This approach extends the original DataBoost algorithm in two ways. Firstly, hard examples are identified separately, and synthetic examples are generated, for each class. Secondly, the class distribution and the total weights of the different classes are rebalanced to alleviate the learning algorithm's bias toward the majority class, by choosing a reduced number of representative (seed) examples from both classes.

11 N.Y.U.S.T. I. M. Major stages Firstly, each example of the original training set is assigned an equal weight (assume weight = 1). The original training set is used to train the first classifier of the DataBoost-IM ensemble. Secondly, the hard examples (so-called seed examples) are identified and, for each of these seed examples, a set of synthetic examples is generated. During the third stage of the algorithm, the synthetic examples are added to the original training set, and the class distribution and the total weights of the different classes are rebalanced. The second and third stages of the DataBoost-IM algorithm are re-executed until a user-specified number of iterations is reached or the current component classifier's error rate exceeds a threshold value. Following the AdaBoostM1 ensemble method, this threshold is set to 0.5 (1/2).

12 N.Y.U.S.T. I. M. Seed examples Firstly, the examples in the training set (E_train) are sorted in descending order based on their weights. The original training set E_train contains N_maj examples from the majority class and N_min examples from the minority class. The number of examples considered to be hard (denoted by N_s) is calculated as (|E_train| × Err), where Err is the error rate of the currently trained classifier. Next, the set E_s, which contains the N_s examples with the highest weights in E_train, is created. The set E_s consists of two subsets of examples, E_smin and E_smaj, i.e. examples from the minority and majority classes, respectively. Here, E_smin and E_smaj contain N_smin and N_smaj examples, where N_smin < N_min and N_smaj < N_maj. The number of seed examples of the majority class selected from E_smaj is M_L = min(N_maj / N_min, N_smaj). Correspondingly, the number of minority-class seed examples selected from E_smin is M_S = min((N_maj × M_L) / N_min, N_smin). These values of M_L and M_S were found, by inspection, to produce data-generation set sizes that augment the original training set well. Experimental results show that, among the seed examples, the higher-weighted ones come disproportionately from the minority class.
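The seed-selection arithmetic above can be sketched as follows. The slide does not specify how the ratios are rounded, so integer truncation for N_s and M_L and rounding for M_S are assumptions (chosen because they reproduce the Hepatitis numbers); the function names are my own:

```python
def hard_example_count(n_train, err):
    """N_s = |E_train| x Err, truncated to an integer (rounding is an assumption)."""
    return int(n_train * err)

def seed_counts(n_maj, n_min, n_smaj, n_smin):
    """Number of majority (M_L) and minority (M_S) seeds per the slide's formulas."""
    m_l = min(n_maj // n_min, n_smaj)                  # M_L = min(N_maj/N_min, N_smaj)
    m_s = min(round(n_maj * m_l / n_min), n_smin)      # M_S = min((N_maj x M_L)/N_min, N_smin)
    return m_l, m_s
```

With the Hepatitis figures (N_maj = 123, N_min = 32, N_smaj = 2, N_smin = 25, Err = 0.18) this gives N_s = 27, M_L = 2 and M_S = 8, matching the worked example.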

13 N.Y.U.S.T. I. M. Example: Hepatitis data set The Hepatitis data set contains 155 examples of hepatitis patients, described by 19 continuous and discrete attributes. Of these cases, 123 correspond to patients who survived treatment (class 'Live') and 32 are mortalities (class 'Die'). In the fifth iteration of the boosting, the currently trained classifier's error rate is 18% (below 0.5). The set E_s will consist of the 27 examples with the highest weights as selected from the sorted E_train (155 × 0.18 ≈ 27). Of these 27 hard examples, 2 correspond to the majority class 'Live', and 25 are of the class 'Die'. M_L is equal to 2, calculated as M_L = min(2, 3), and E_maj will thus contain 2 hard examples of the majority class 'Live'. M_S is equal to 8, and the set E_min will consist of the 8 highest-weighted examples of class 'Die'.

14 N.Y.U.S.T. I. M. Seed examples and their weights of the Hepatitis Data set

15 N.Y.U.S.T. I. M. Data Generation and Class Frequency Balancing The aim of the data generation process is to generate additional synthetic instances to add to the original training set E_train. New-value generation constraints: For a nominal attribute, the values are chosen to reflect the distribution of values contained in the original training attribute with respect to the particular class. For example, consider the attribute 'GENDER'. Assume that for the class 'Live', the number of occurrences of 'MALE' is 16 and of 'FEMALE' is 107. The data generation creates 16 occurrences of 'MALE' and 107 occurrences of 'FEMALE'; these 123 values are randomly assigned to the 123 examples created during data generation. For a continuous attribute, the values are chosen by considering the range [min, max] of the original attribute values with respect to the seed class. Also, the distribution of the original attribute values, in terms of the mean and deviation, is used during data generation. For example, for the 'ALBUMIN' attribute, the 123 values for class 'Live' lie between 2.1 and 6.4, and the mean and deviation are 3.817 and 0.652. The data generation randomly generates a total of 123 values between 2.1 and 6.4, following a mean of 3.817 and a deviation of 0.652; these 123 values are randomly assigned to the 123 examples created during data generation.
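The two generation rules can be sketched as follows. This is a sketch under assumptions: the paper does not say how out-of-range normal draws are handled, so clipping to [min, max] is assumed, and the function names are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def gen_nominal(values, n):
    """Nominal attribute: reproduce the observed per-class value counts
    (e.g. 16 'MALE' and 107 'FEMALE'), then assign them in random order."""
    vals, counts = np.unique(values, return_counts=True)
    pool = np.repeat(vals, counts)      # one copy per original occurrence
    rng.shuffle(pool)
    return pool[:n]

def gen_continuous(values, n):
    """Continuous attribute: draw from a normal with the class's mean and
    deviation, kept inside the observed [min, max] range (clipping assumed)."""
    draws = rng.normal(np.mean(values), np.std(values), size=n)
    return np.clip(draws, np.min(values), np.max(values))
```

For the 'GENDER' example, `gen_nominal` regenerates exactly 16 'MALE' and 107 'FEMALE' values; for 'ALBUMIN', `gen_continuous` stays within [2.1, 6.4] with roughly the original mean and deviation.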

16 N.Y.U.S.T. I. M. Hepatitis example Recall from Table 1 that E_maj contains 2 examples of the class 'Live' and E_min contains 8 instances of the class 'Die'. The data generation process creates 2 sets of examples for the class 'Live', each containing 123 synthetic examples, one set for each of the seeds in E_maj. It also creates 8 sets of examples, each containing 32 synthetic examples of the class 'Die', based on the 8 seed examples in E_min.

17 N.Y.U.S.T. I. M. Balancing the Training Weights In the final step prior to re-training, the total weights of the examples in the different classes are rebalanced. Each iteration should produce new classifiers that are better able to predict the hard examples; this is achieved by concentrating on classifying the examples with high weights correctly. In an imbalanced data set, the difference between the total weights of the different classes is large. By rebalancing the total weights of the different classes, boosting is forced to focus on hard as well as rare examples.

18 N.Y.U.S.T. I. M. Initial weight of each example Before the generated data are added to the original data set, each of the synthetic examples is assigned an initial weight. The initial weight of each example is calculated by dividing the weight of the seed example by the number of instances generated from it. In this way, the very high weights associated with the hard examples are balanced out. When the new training set is formed, the total weights of the majority class examples (denoted by W_maj) and the minority class examples (denoted by W_min) in the new training data are rebalanced as follows. If W_maj > W_min, the weight of each instance in the minority class is multiplied by W_maj / W_min; otherwise, the weight of each instance in the majority class is multiplied by W_min / W_maj. In this way, the total weights of the majority and minority classes will be balanced. Note that, prior to training, the weights of the new training set are renormalized so that their sum equals one.
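The weight assignment and rebalancing described above can be sketched as follows (hypothetical function names; the arrays hold per-example weights):

```python
import numpy as np

def initial_weights(seed_weight, n_generated):
    """Each synthetic example inherits an equal share of its seed's weight."""
    return np.full(n_generated, seed_weight / n_generated)

def rebalance(w_maj, w_min):
    """Scale the lighter class so both class totals match, then renormalize
    so the weights of the whole training set sum to one."""
    W_maj, W_min = w_maj.sum(), w_min.sum()
    if W_maj > W_min:
        w_min = w_min * (W_maj / W_min)
    else:
        w_maj = w_maj * (W_min / W_maj)
    total = w_maj.sum() + w_min.sum()
    return w_maj / total, w_min / total
```

After `rebalance`, each class holds exactly half of the total weight, so neither class dominates the weighted training of the next component classifier.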

19 N.Y.U.S.T. I. M. Hepatitis example Assume that seed example x in E_maj has a weight of 9.86 and seed example y, also in E_maj, has a weight of 9.62. This implies that each of the 123 synthetic examples generated based on x is assigned a weight of 9.86/123, and those based on y are assigned weights of 9.62/123. Similarly, an initial weight is assigned to each of the synthetic examples generated from the seed examples in E_min. After adding the synthetic data to the original data set, the new training data set contains 369 examples of class 'Live' and 288 cases of the class 'Die'. Assume that W_maj is equal to 122.51 and W_min equals 69.83. Since W_maj > W_min, the weight of each of the 288 examples describing the minority class is multiplied by a constant equal to 122.51/69.83. As a result, the total weights of the majority and minority classes in the new training set are both equal to 122.51, thus equally balancing the two classes.

20 N.Y.U.S.T. I. M. Experiment design Performance evaluation: the DataBoost-IM algorithm is compared with the C4.5 decision tree and the AdaBoostM1, DataBoost, AdaCost, CSB2 and SMOTEBoost boosting algorithms. Evaluation indices: overall accuracy, G-mean and F-measures.

21 N.Y.U.S.T. I. M. F-measure & ROC curve The F-measure incorporates the recall and precision into a single number: F = ((1 + β²) × Recall × Precision) / (β² × Recall + Precision), where Precision = TP / (TP + FP) and Recall = TP / (TP + FN), and β corresponds to the relative importance of precision versus recall; it is usually set to 1. It follows that the F-measure is high when both the recall and precision are high. This implies that the F-measure is able to measure the "goodness" of a learning algorithm on the current class of interest. The ROC curve is a technique for summarizing a classifier's performance over a range of operating points, by considering the trade-offs between the TP rate and the FP rate, where TP Rate = TP / (FN + TP) and FP Rate = FP / (FP + TN).

22 N.Y.U.S.T. I. M. G-mean The G-mean is defined as G-mean = sqrt(Positive Accuracy × Negative Accuracy), where Positive Accuracy and Negative Accuracy are calculated as TP / (FN + TP) and TN / (TN + FP). This measure relates to a point on the ROC curve, and the idea is to maximize the accuracy on each of the two classes while keeping these accuracies balanced. For instance, a high Positive Accuracy combined with a low Negative Accuracy will result in a poor G-mean.
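Both evaluation measures are straightforward to compute from the confusion-matrix counts; a minimal sketch (the function names are my own):

```python
from math import sqrt

def f_measure(tp, fp, fn, beta=1.0):
    """F = (1 + b^2) * Precision * Recall / (b^2 * Precision + Recall)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

def g_mean(tp, fn, tn, fp):
    """Geometric mean of the per-class accuracies."""
    pos_acc = tp / (tp + fn)      # TP rate (accuracy on the positive class)
    neg_acc = tn / (tn + fp)      # TN rate (accuracy on the negative class)
    return sqrt(pos_acc * neg_acc)
```

Because the G-mean multiplies the two per-class accuracies, a classifier that predicts only the majority class scores zero, which is exactly why these measures are preferred over overall accuracy for imbalanced data.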

23 N.Y.U.S.T. I. M. Summary of the data sets used

24 N.Y.U.S.T. I. M. Data sets Sixteen data sets from the UCI data repository, as well as the Oil Spill data set, were used. These data sets were carefully selected to ensure that they (a) are based on real-world problems, (b) vary in feature characteristics, and (c) vary extensively in size and class distribution.

25 N.Y.U.S.T. I. M. Methodology and results Results for the above data sets were averaged over five standard 10-fold cross-validation experiments. For each 10-fold cross-validation, the data set was first partitioned into 10 equal-sized sets, and each set was then in turn used as the test set while the classifier trained on the other nine sets. A stratified sampling technique was applied to ensure that each of the sets had the same proportion of the different classes.
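The stratified partitioning can be sketched as follows (a sketch, not the authors' code: fold assignment round-robins each class's shuffled indices so every fold gets nearly the same class proportions):

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=10, seed=0):
    """Assign each example index to one of k folds, keeping the class
    proportions roughly equal across folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)          # group example indices by class
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)              # randomize within each class
        for j, i in enumerate(idxs):
            folds[j % k].append(i)     # deal indices out round-robin
    return folds
```

Each fold is used once as the test set while the remaining nine folds form the training set; for the Hepatitis data this keeps 3-4 'Die' examples in every fold instead of leaving some folds with none.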

26 N.Y.U.S.T. I. M. ROC Curve of the Hepatitis Data Set Fig 3. ROC curve of the Hepatitis data set (average value). Fig 4. ROC curves of ten iterations of the DataBoost-IM algorithm. We conclude that the DataBoost-IM ensemble's ROC curve is of high quality.

27 N.Y.U.S.T. I. M. Conclusions A novel approach for learning from imbalanced data sets by combining boosting and data generation. Performance improvements: 1. The additional synthetic data provide complementary knowledge for the learning process. 2. Rebalancing the class frequencies alleviates the classifiers' learning bias toward the majority class. 3. Rebalancing the total weight distribution of the different classes forces the boosting algorithm to focus on hard as well as rare examples. 4. The synthetic data prevent boosting from over-emphasizing the hard examples. This property is especially important for the minority class, which contains few examples.

28 N.Y.U.S.T. I. M. Future work Determine the optimal number of new seed examples to generate. Experiment with other component classifiers and consider performance against noisy data. Further investigate weight-assignment methods. Study the voting mechanism of the boosting algorithm using different metrics, such as the ROC curve. Multi-class learning problems: although DataBoost-IM and the experiments addressed only two-class problems, we believe that a similar approach can be used for multi-class learning problems.

29 N.Y.U.S.T. I. M. Comments Performance improvement: the experimental results indicate that the imbalanced data set problem is alleviated. Synthetic data: the synthetic examples are derived from the seed examples in accordance with the generation constraints, but they should still be regarded as simulated data, even though they resemble real data. I therefore doubt how meaningful the reported improvement in classification accuracy is. The authors claim that the synthetic data (generated from randomly assigned values) provide complementary knowledge, but they lack sufficient proof for this viewpoint.

