國立雲林科技大學 National Yunlin University of Science and Technology Learning from Imbalanced Data Sets with Boosting and Data Generation: The DataBoost-IM Approach.


國立雲林科技大學 National Yunlin University of Science and Technology
Learning from Imbalanced Data Sets with Boosting and Data Generation: The DataBoost-IM Approach
From: Hongyu Guo and Herna L. Viktor, SIGKDD Explorations, Volume 6, Issue 1, pp. 30-39.
Presenter: Wei-Shen Tai
Advisor: Professor Chung-Chian Hsu
2005/11/15

Outline
- Introduction
- DataBoost-IM algorithm
  - Identify seed examples
  - Data generation and class frequency balancing
  - Balancing the training weights
- Experiment results
- Conclusions
- Comments

Motivation
Learning from imbalanced data sets, where the number of examples of one (majority) class is much higher than that of the others, presents an important challenge to the machine learning community. Traditional machine learning algorithms may be biased towards the majority class, producing poor predictive accuracy on the minority class.

Objective
Combine boosting, an ensemble-based learning algorithm, with data generation to improve the predictive power of classifiers on imbalanced data sets consisting of two classes.

Introduction
The data imbalance problem arises in domains where one class is represented by a large number of examples while the other is represented by only a few. Machine learning algorithms tend to produce high predictive accuracy on the majority class, but poor predictive accuracy on the minority class. The problem appears in many real-world applications such as fraud detection, telecommunications management, oil spill detection and text classification.

Feasible solution
Ensembles improve the performance of weak classification algorithms. Boosting is an ensemble method in which the performance of weak classifiers is improved by focusing on hard examples that are difficult to classify. It produces a series of classifiers whose outputs are combined by weighted voting in the final prediction of the model.

Basic concept
At each step of the series, the training examples are re-weighted and selected based on the performance of the earlier classifiers in the series. This weighting scheme produces a set of "easy" examples with low weights and a set of hard ones with high weights; improvement is achieved by concentrating on classifying the hard examples correctly.

DataBoost-IM
Combines data generation and boosting to improve the predictive accuracy of both the majority and minority classes. Major steps:
1. Separately identify hard examples from each of the two classes.
2. Generate synthetic examples biased toward the hard examples, to prevent boosting from over-emphasizing them.
3. Rebalance the class frequencies in the new training set (using a reduced number of seed examples).
4. Rebalance the total weights of the different classes in the new training set.

DataBoost-IM pseudo-code
Input: a sequence of m examples (x_1, y_1), …, (x_m, y_m) with labels y_i ∈ Y = {1, …, k};
       a weak learning algorithm WeakLearn;
       an integer T specifying the number of iterations.
Initialize D_1(i) = 1/m for all i.
Do for t = 1, 2, …, T:
1. Identify hard examples from the original data set for the different classes.
2. Generate synthetic data to balance the training knowledge of the different classes.
3. Add the synthetic data to the original training set to form a new training data set.
4. Update and balance the total weights of the different classes in the new training data set.
5. Call WeakLearn, providing it with the new training set with synthetic data and rebalanced weights.
6. Get back a hypothesis h_t : X → Y.
7. Calculate the error of h_t: ε_t = Σ_{i : h_t(x_i) ≠ y_i} D_t(i). If ε_t > 1/2, set T = t − 1 and abort the loop.
8. Set β_t = ε_t / (1 − ε_t).
9. Update the distribution: D_{t+1}(i) = D_t(i)/Z_t × β_t if h_t(x_i) = y_i, and D_t(i)/Z_t otherwise, where Z_t is a normalization constant (chosen so that D_{t+1} is a distribution).
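As a rough illustration, the following Python sketch mirrors the loop above. It is not the authors' implementation: identify_hard, generate_synthetic and rebalance are hypothetical stand-ins for steps 1-4, and for simplicity the boosting weights D are tracked over the original examples only.

```python
import numpy as np
from collections import defaultdict

def databoost_im(X, y, weak_learner, T, identify_hard, generate_synthetic, rebalance):
    """Sketch of the DataBoost-IM loop; helpers are hypothetical stand-ins."""
    m = len(y)
    D = np.full(m, 1.0 / m)                                  # D_1(i) = 1/m
    ensemble = []                                            # (hypothesis, beta_t) pairs
    for t in range(T):
        seeds = identify_hard(X, y, D)                       # step 1
        X_syn, y_syn, w_syn = generate_synthetic(seeds)      # steps 2-3
        X_new = np.vstack([X, X_syn])
        y_new = np.concatenate([y, y_syn])
        w_new = rebalance(np.concatenate([D, w_syn]), y_new) # step 4
        h = weak_learner(X_new, y_new, w_new)                # steps 5-6: predicts labels for an array
        miss = h(X) != y
        err = D[miss].sum()                                  # step 7: weighted error
        if err > 0.5:
            break
        beta = err / (1.0 - err)                             # step 8
        D = np.where(miss, D, D * beta)                      # step 9: shrink correctly classified examples
        D = D / D.sum()                                      # Z_t normalization
        ensemble.append((h, beta))

    def predict(Xq):
        # AdaBoost.M1-style weighted vote, each vote weighted by log(1/beta_t)
        votes = [defaultdict(float) for _ in range(len(Xq))]
        for h, beta in ensemble:
            w = np.log(1.0 / max(beta, 1e-12))
            for i, label in enumerate(h(Xq)):
                votes[i][label] += w
        return np.array([max(v, key=v.get) for v in votes])

    return predict
```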

DataBoost-IM algorithm
This approach extends the original DataBoost algorithm in two ways. Firstly, hard examples are identified, and synthetic examples generated, separately for each class. Secondly, the class distribution and the total weights of the different classes are rebalanced to alleviate the learning algorithm's bias toward the majority class, by choosing a reduced number of representative (seed) examples from both classes.

Major stages
Firstly, each example of the original training set is assigned an equal weight (assume weight = 1). The original training set is used to train the first classifier of the DataBoost-IM ensemble.
Secondly, the hard examples (so-called seed examples) are identified, and for each of these seed examples a set of synthetic examples is generated.
During the third stage of the algorithm, the synthetic examples are added to the original training set, and the class distribution and the total weights of the different classes are rebalanced.
The second and third stages are re-executed until a user-specified number of iterations is reached or the current component classifier's error rate exceeds a threshold value. Following the AdaBoostM1 ensemble method, this threshold is set to 1/2.

Seed examples
Firstly, the examples in the training set E_train are sorted in descending order of their weights. The original training set E_train contains N_maj examples of the majority class and N_min examples of the minority class.
The number of examples considered to be hard (denoted N_s) is calculated as |E_train| × Err, where Err is the error rate of the currently trained classifier. Next, the set E_s, which contains the N_s examples with the highest weights in E_train, is created. E_s consists of two subsets, E_smin and E_smaj, holding its examples from the minority and majority classes, respectively. Here E_smin and E_smaj contain N_smin and N_smaj examples, where N_smin < N_min and N_smaj < N_maj.
The number of majority-class seed examples selected from E_smaj is M_L = min(N_maj / N_min, N_smaj). Correspondingly, the number of minority-class seed examples selected from E_smin is M_S = min((N_maj × M_L) / N_min, N_smin). These values of M_L and M_S were found, by inspection, to produce data-generation set sizes that augment the original training set well. Experimental results show that, among the seed examples, the higher-weighted ones come predominantly from the minority class. (The sketch after this slide spells out this arithmetic in code.)
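A minimal sketch of the seed-selection arithmetic, assuming E_train is represented by per-example weight and label arrays; the function and variable names are ours, and the rounding behaviour is an assumption chosen to match the Hepatitis worked example on the next slide.

```python
import numpy as np

def select_seeds(weights, labels, err, minority_label):
    """Pick hard examples and cap how many of them become seeds.
    weights/labels: arrays over E_train; err: current classifier error rate."""
    n = len(weights)
    n_min = int(np.sum(labels == minority_label))
    n_maj = n - n_min
    n_s = int(n * err)                                  # N_s = |E_train| * Err
    hard = np.argsort(weights)[::-1][:n_s]              # N_s highest-weighted examples (E_s)
    hard_min = hard[labels[hard] == minority_label]     # E_smin
    hard_maj = hard[labels[hard] != minority_label]     # E_smaj
    m_l = min(round(n_maj / n_min), len(hard_maj))      # majority seeds: M_L
    m_s = min(round(n_maj * m_l / n_min), len(hard_min))  # minority seeds: M_S
    return hard_maj[:m_l], hard_min[:m_s]
```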

Example: Hepatitis data set
The Hepatitis data set contains 155 examples of hepatitis patients, described by 19 continuous and discrete attributes. Of these cases, 123 correspond to patients who survived treatment (class 'Live') and 32 to mortalities (class 'Die').
In the fifth iteration of boosting, the currently trained classifier's error rate is 18% (below the 0.5 threshold). The set E_s therefore consists of the 27 examples with the highest weights in the sorted E_train (155 × 0.18 ≈ 27). Of these 27 hard examples, 2 belong to the majority class 'Live' and 25 to the minority class 'Die'.
M_L is equal to 2, calculated as M_L = min(123/32, 2) = 2, so E_maj contains 2 hard examples of the majority class 'Live'. M_S is equal to 8, calculated as M_S = min((123 × 2)/32, 25) ≈ 8, so E_min consists of the 8 highest-weighted examples of class 'Die'.
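Plugging the slide's numbers into the sketch above reproduces the reported counts; this is only a consistency check with made-up weight values, not the authors' code.

```python
import numpy as np

# 155 examples: 123 'Live' (0) and 32 'Die' (1); the first 27 are the hard ones
# (25 'Die', 2 'Live'), given artificially high weights for the check.
labels = np.array([1] * 25 + [0] * 2 + [0] * 121 + [1] * 7)
weights = np.concatenate([np.full(27, 10.0), np.full(128, 0.1)])
maj_seeds, min_seeds = select_seeds(weights, labels, err=0.18, minority_label=1)
print(len(maj_seeds), len(min_seeds))   # expected: 2 and 8
```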

Table 1. Seed examples and their weights for the Hepatitis data set (table not reproduced in the transcript).

Data generation and class frequency balancing
The aim of the data generation process is to create additional synthetic instances to add to the original training set E_train. New values are generated under the following constraints (a sketch of both rules in code follows below).
For a nominal attribute, the values are chosen to reflect the distribution of values of that attribute in the original training set with respect to the particular class. For example, consider the attribute 'GENDER'. Assume that for the class 'Live' the number of occurrences of 'MALE' is 16 and of 'FEMALE' is 107. The data generation then creates 16 occurrences of 'MALE' and 107 occurrences of 'FEMALE', and these 123 values are randomly assigned to the 123 examples created during data generation.
For a continuous attribute, the values are chosen by considering the range [min, max] of the original attribute values with respect to the seed class; the distribution of the original values, in terms of their mean and deviation, is also used. For example, for the 'ALBUMIN' attribute the 123 values for class 'Live' lie between 2.1 and 6.4. The data generation randomly generates a total of 123 values between 2.1 and 6.4, following the mean and deviation of the original values, and these 123 values are randomly assigned to the 123 examples created during data generation.
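A sketch of the per-attribute generation rules, assuming the same-class examples arrive as a pandas DataFrame; clipping a Gaussian draw to [min, max] is our reading of "following the mean and deviation within the range", not a detail confirmed by the slide.

```python
import numpy as np
import pandas as pd

def generate_for_seed(class_df, rng):
    """Generate len(class_df) synthetic rows for one seed example.
    Nominal columns: replicate the observed value counts exactly, then shuffle.
    Continuous columns: Gaussian with the class mean/std, clipped to [min, max]."""
    n = len(class_df)
    out = {}
    for col in class_df.columns:
        vals = class_df[col]
        if vals.dtype == object:                          # nominal attribute
            vc = vals.value_counts()                      # e.g. 16 'MALE', 107 'FEMALE'
            pool = np.repeat(vc.index.to_numpy(), vc.to_numpy())
            rng.shuffle(pool)                             # random assignment to the n new rows
            out[col] = pool
        else:                                             # continuous attribute
            draw = rng.normal(vals.mean(), vals.std(), size=n)
            out[col] = np.clip(draw, vals.min(), vals.max())
    return pd.DataFrame(out)
```

For the 'GENDER' example above, a class column with 16 'MALE' and 107 'FEMALE' values yields a shuffled pool of exactly 16 'MALE' and 107 'FEMALE' synthetic values.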

Hepatitis example
Recall from Table 1 that E_maj contains 2 seed examples of the class 'Live' and E_min contains 8 of the class 'Die'. The data generation process produces:
- 2 sets of examples for the class 'Live', each containing 123 synthetic examples, one set for each seed in E_maj;
- 8 sets of examples for the class 'Die', each containing 32 synthetic examples (256 in total), one set for each seed in E_min.

Balancing the training weights
In the final step prior to re-training, the total weights of the examples in the different classes are rebalanced. Each iteration should produce new classifiers that are better able to predict the hard examples, which is achieved by concentrating on classifying the examples with high weights correctly. In an imbalanced data set, the difference between the total weights of the different classes is large. By rebalancing the total weights of the classes, boosting is forced to focus on rare as well as hard examples.

Initial weight of each example
Before the generated data are added to the original data set, each of the synthetic examples is assigned an initial weight, calculated by dividing the weight of its seed example by the number of instances generated from it. In this way, the very high weights associated with the hard examples are balanced out.
When the new training set is formed, the total weight of the majority-class examples (denoted W_maj) and of the minority-class examples (denoted W_min) are rebalanced as follows: if W_maj > W_min, the weight of each instance in the minority class is multiplied by W_maj / W_min; otherwise, the weight of each instance in the majority class is multiplied by W_min / W_maj. In this way, the total weights of the majority and minority classes are balanced. Note that, prior to training, the weights of the new training set are renormalized so that their sum equals one.
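A compact sketch of this weight bookkeeping; the function names are ours.

```python
import numpy as np

def assign_synthetic_weights(seed_weight, n_generated):
    # Each synthetic example inherits an equal share of its seed's weight.
    return np.full(n_generated, seed_weight / n_generated)

def rebalance_class_weights(weights, labels, minority_label):
    w = weights.astype(float).copy()
    w_min = w[labels == minority_label].sum()            # W_min
    w_maj = w[labels != minority_label].sum()            # W_maj
    if w_maj > w_min:
        w[labels == minority_label] *= w_maj / w_min     # lift the minority class
    else:
        w[labels != minority_label] *= w_min / w_maj     # lift the majority class
    return w / w.sum()                                   # renormalize to sum to one
```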

Hepatitis example
Assume that seed example x in E_maj has a weight of 9.86 and seed example y, also in E_maj, has a weight of 9.62. Each of the 123 synthetic examples generated from x is then assigned a weight of 9.86/123, and those generated from y are assigned weights of 9.62/123. Similarly, an initial weight is assigned to each of the synthetic examples generated from the seed examples in E_min.
After adding the synthetic data to the original data set, the new training set contains 369 examples of class 'Live' (123 original plus 2 × 123 synthetic) and 288 of class 'Die' (32 original plus 8 × 32 synthetic). Since W_maj > W_min, the weight of each of the 288 minority-class examples is multiplied by W_maj / W_min. As a result, the total weights of the majority and minority classes in the new training set are equal, thus balancing the two classes.

Experiment design
Performance of the DataBoost-IM algorithm is evaluated in comparison with the C4.5 decision tree and the AdaBoostM1, DataBoost, AdaCost, CSB2 and SMOTEBoost boosting algorithms.
Evaluation indices: overall accuracy, G-mean and F-measure.

F-measure and ROC curve
The F-measure incorporates recall and precision into a single number:
F = ((1 + β²) × Recall × Precision) / (β² × Precision + Recall),
where Precision = TP / (TP + FP) and Recall = TP / (TP + FN), and β corresponds to the relative importance of precision versus recall (usually set to 1). The F-measure is high only when both recall and precision are high, so it measures the "goodness" of a learning algorithm on the current class of interest.
A ROC curve is a technique for summarizing a classifier's performance over a range of operating points, by considering the trade-off between the TP rate and the FP rate, where TP Rate = TP / (FN + TP) and FP Rate = FP / (FP + TN).
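The same formulas in a short, self-contained Python form, taking the confusion-matrix counts as inputs:

```python
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f_measure(tp, fp, fn, beta=1.0):
    # F = (1 + b^2) * R * P / (b^2 * P + R)
    p, r = precision(tp, fp), recall(tp, fn)
    b2 = beta ** 2
    return (1 + b2) * r * p / (b2 * p + r)
```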

G-mean
The G-mean is defined as G-mean = sqrt(Positive Accuracy × Negative Accuracy), where Positive Accuracy = TP / (FN + TP) and Negative Accuracy = TN / (TN + FP). This measure corresponds to a point on the ROC curve, and the idea is to maximize the accuracy on each of the two classes while keeping these accuracies balanced. For instance, a high positive accuracy combined with a low negative accuracy results in a poor G-mean.
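And the G-mean over the same confusion-matrix counts:

```python
import math

def g_mean(tp, fn, tn, fp):
    pos_acc = tp / (tp + fn)    # TP rate (accuracy on the positive class)
    neg_acc = tn / (tn + fp)    # TN rate (accuracy on the negative class)
    return math.sqrt(pos_acc * neg_acc)
```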

Summary of the data sets used (table not reproduced in the transcript).

Data sets
Sixteen data sets from the UCI repository were used, as well as the Oil Spill data set. These data sets were carefully selected to ensure that they (a) are based on real-world problems, (b) vary in feature characteristics, and (c) vary extensively in size and class distribution.

Methodology and results
Results for the above data sets were averaged over five standard 10-fold cross-validation experiments. For each 10-fold cross-validation, the data set was first partitioned into 10 equal-sized sets; each set was then in turn used as the test set while the classifier was trained on the other nine. A stratified sampling technique was applied to ensure that each of the sets had the same proportion of the different classes.
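This protocol can be reproduced with scikit-learn's stratified splitter; the following is a sketch of the evaluation harness under our own naming, not the authors' setup (a plain decision tree stands in for the classifier under test).

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

def five_times_ten_fold(X, y, make_clf=DecisionTreeClassifier):
    scores = []
    for run in range(5):                                    # five repetitions
        skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=run)
        for train_idx, test_idx in skf.split(X, y):         # stratified 10-fold split
            clf = make_clf().fit(X[train_idx], y[train_idx])
            scores.append(clf.score(X[test_idx], y[test_idx]))
    return np.mean(scores)                                  # average over all 50 folds
```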

ROC curve of the Hepatitis data set
Fig. 3: ROC curve of the Hepatitis data set (average values). Fig. 4: ROC curves of ten iterations of the DataBoost-IM algorithm. (Figures not reproduced in the transcript.) The curves show that the DataBoost-IM ensemble's ROC curve is of high quality.

Conclusions
A novel approach for learning from imbalanced data sets that combines boosting and data generation. The performance improvements arise because:
1. The additional synthetic data provide complementary knowledge for the learning process.
2. Rebalancing the class frequencies alleviates the classifiers' learning bias toward the majority class.
3. Rebalancing the total weight distribution of the different classes forces the boosting algorithm to focus on rare as well as hard examples.
4. The synthetic data prevent boosting from over-emphasizing the hard examples; this property is especially important for the minority class, which contains few examples.

Future work
- Determining the optimal number of new seed examples to generate.
- Experimenting with other component classifiers, and evaluating the performance against noisy data.
- Further investigating weight-assignment methods.
- Exploring the voting mechanism of the boosting algorithm using different metrics, such as the ROC curve.
- Multi-class learning problems: although DataBoost-IM and the experiments addressed only two-class problems, the authors believe a similar approach can be used in the framework of multi-class learning problems.

Comments
Performance improvement: the experimental results indicate that the imbalanced data set problem is addressed.
Synthetic data: the synthetic examples are derived from seed examples according to the generation constraints, but they should still be regarded as simulated data, even if they look like real data. It is therefore questionable whether the reported improvement in classification accuracy is meaningful. The authors claim that the synthetic data (generated by random value assignment) provide complementary knowledge, but they offer insufficient proof for this viewpoint.