Imbalanced Data Set Learning with Synthetic Examples


Imbalanced Data Set Learning with Synthetic Examples
Benjamin X. Wang and Nathalie Japkowicz

The Class Imbalance Problem I
Data sets are said to be balanced if there are approximately as many positive examples of the concept as negative ones. Many domains do not have a balanced data set. Examples:
- Helicopter gearbox fault monitoring
- Discrimination between earthquakes and nuclear explosions
- Document filtering
- Detection of oil spills
- Detection of fraudulent telephone calls

The Class Imbalance Problem II
The problem with class imbalances is that standard learners are often biased towards the majority class. This is because these classifiers attempt to minimize global quantities such as the error rate, without taking the data distribution into consideration. As a result, examples from the majority class are well classified, whereas examples from the minority class tend to be misclassified.

Some Generalities
Evaluating the performance of a learning system on a class imbalance problem is not done appropriately with the standard accuracy/error-rate measures; ROC analysis is typically used instead.
There is a parallel between research on class imbalances and research on cost-sensitive learning.
There are four main ways to deal with class imbalances: re-sampling, re-weighting, adjusting the probabilistic estimate, and one-class learning.
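To make the evaluation point concrete, here is a minimal sketch (assuming scikit-learn and a synthetic 5% minority data set, both my assumptions) showing how accuracy can look strong on an imbalanced set while ROC AUC tells a different story.

```python
# Minimal sketch, assuming scikit-learn and a synthetic 5% minority data set.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Accuracy is inflated by the majority class; ROC AUC asks whether the
# classifier actually ranks minority examples above majority examples.
print("accuracy:", accuracy_score(y_te, clf.predict(X_te)))
print("ROC AUC :", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```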

Advantage of Resampling
Re-sampling provides a simple way of biasing the generalization process. It can do so by:
- generating synthetic samples with the desired bias, and
- controlling the amount and placement of the new samples.
Note: this type of control can also be achieved by smoothing the classifier's probabilistic estimate (e.g., Zadrozny & Elkan, 2001), but that control cannot be as localized as the one achieved with re-sampling techniques.

SMOTE: A State-of-the-Art Resampling Approach
SMOTE stands for Synthetic Minority Oversampling Technique. It is a technique designed by Chawla, Bowyer, Hall, & Kegelmeyer in 2002. It combines informed oversampling of the minority class with random undersampling of the majority class. SMOTE currently yields the best results as far as re-sampling and probabilistic-estimate modification techniques go (Chawla, 2003).
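For illustration, a sketch of this oversampling/undersampling combination using the imbalanced-learn package; the parameter values, pipeline layout, and variable names are illustrative assumptions, not the settings used in this work.

```python
# Sketch only, assuming the imbalanced-learn package; parameters are illustrative.
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline
from sklearn.svm import SVC

smote_pipeline = Pipeline(steps=[
    ("oversample", SMOTE(k_neighbors=5)),     # informed oversampling of the minority class
    ("undersample", RandomUnderSampler()),    # random undersampling of the majority class
    ("classifier", SVC(kernel="rbf")),
])
# smote_pipeline.fit(X_train, y_train)   # X_train, y_train assumed to exist;
#                                        # resampling is applied to training data only
```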

SMOTE's Informed Oversampling Procedure II
For each minority sample:
- find its k nearest minority neighbours;
- randomly select j of these neighbours;
- randomly generate synthetic samples along the lines joining the minority sample and its j selected neighbours (j depends on the amount of oversampling desired).
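A minimal NumPy/scikit-learn sketch of this generation step; the function name and default values are my assumptions, not the original SMOTE implementation.

```python
# Minimal sketch of SMOTE-style synthetic sample generation (assumed names/defaults).
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like_oversample(X_min, k=5, j=2, seed=None):
    """Generate j synthetic samples per minority sample (X_min: minority class only)."""
    rng = np.random.default_rng(seed)
    # +1 because each point is returned as its own nearest neighbour
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, neighbours = nn.kneighbors(X_min)
    synthetic = []
    for i, sample in enumerate(X_min):
        for n in rng.choice(neighbours[i][1:], size=j, replace=False):
            gap = rng.random()                        # random point along the joining line
            synthetic.append(sample + gap * (X_min[n] - sample))
    return np.asarray(synthetic)
```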

SMOTE's Informed vs. Random Oversampling
Random oversampling (with replacement) of the minority class has the effect of making the decision region for the minority class very specific. In a decision tree, it causes new splits and often leads to overfitting.
SMOTE's informed oversampling generalizes the decision region for the minority class. As a result, larger and less specific regions are learned, paying attention to minority-class samples without causing overfitting.

SMOTE's Informed Oversampling Procedure I
... But what if there is a majority sample nearby?
[Figure legend: minority sample, majority sample, synthetic sample]

SMOTE's Shortcomings
Overgeneralization: SMOTE's procedure is inherently risky, since it blindly generalizes the minority area without regard to the majority class. This strategy is particularly problematic for highly skewed class distributions: in such cases, the minority class is very sparse with respect to the majority class, so there is a greater chance of class mixture.
Lack of flexibility: the number of synthetic samples generated by SMOTE is fixed in advance, leaving no flexibility in the re-balancing rate.

SMOTE's Tendency for Overgeneralization
[Figure legend: minority sample, majority sample, synthetic sample]

Our Proposed Solution
To avoid overgeneralization, we propose three techniques:
- testing for data sparsity;
- clustering the minority class;
- 2-class (rather than 1-class) sample generation.
To address SMOTE's lack of flexibility, we propose one technique:
- multiple trials with feedback.
We call our approach the Adaptive Synthetic Minority Oversampling Method (ASMO).

ASMO's Strategy I
Overgeneralization avoidance I: testing for data sparsity. For each minority sample m, if m's g nearest neighbours are all majority samples, then the data set is sparse and ASMO should be used; otherwise, SMOTE can be used. (As a default, we used g = 20.)
Overgeneralization avoidance II: clustering. We use k-means or a similar clustering method on the minority class (for now, this step is done, but in a non-standard way).
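A hedged sketch of the sparsity test; the helper name and the choice to return a per-sample mask (rather than a single verdict for the whole data set) are my assumptions.

```python
# Hedged sketch of the sparsity test; aggregation of the flags is left to the caller.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def sparse_minority_mask(X, y, minority_label, g=20):
    """True for each minority sample whose g nearest neighbours are all majority."""
    nn = NearestNeighbors(n_neighbors=g + 1).fit(X)   # +1 to skip the query point itself
    minority_idx = np.where(y == minority_label)[0]
    _, neighbours = nn.kneighbors(X[minority_idx])
    neighbour_labels = y[neighbours[:, 1:]]
    return np.all(neighbour_labels != minority_label, axis=1)
```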

ASMO's Strategy II
Overgeneralization avoidance III: synthetic sample generation using two classes. Rather than using the k nearest neighbours within the minority class to generate new samples, we use the k nearest neighbours from the opposite class.
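A sketch of this two-class generation step; restricting the interpolation to the minority side of the segment is my assumption, since the slide does not specify how far along the line the synthetic point is placed.

```python
# Sketch of the two-class variant; the gap < 0.5 restriction is an assumption.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def asmo_like_oversample(X_min, X_maj, k=5, j=2, seed=None):
    """Interpolate between minority samples and their k nearest majority neighbours."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k).fit(X_maj)   # neighbours drawn from the opposite class
    _, neighbours = nn.kneighbors(X_min)
    synthetic = []
    for i, sample in enumerate(X_min):
        for n in rng.choice(neighbours[i], size=j, replace=False):
            gap = rng.random() * 0.5                   # stay closer to the minority sample
            synthetic.append(sample + gap * (X_maj[n] - sample))
    return np.asarray(synthetic)
```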

ASMO's Strategy: Overview of Overgeneralization Avoidance (clustering and 2-class sample generation)
[Figure legend: minority sample, majority sample, synthetic sample]

ASMO's Strategy III
Flexibility enhancement through multiple trials and feedback:
- For each cluster Ci, iterate through different rates of majority undersampling and synthetic minority generation, and keep the best combination subset Si.
- Merge the Si's into a single training set S.
- Apply the classifier to S.
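A heavily hedged sketch of this trials/feedback loop: the rate grids, the use of validation AUC as the feedback signal, the `minority_clusters` variable, and the two helper functions are hypothetical placeholders used only to show the loop structure.

```python
# Heavily hedged sketch of the loop structure only; helpers are hypothetical placeholders.
import numpy as np
from sklearn.svm import SVC

best_subsets = []
for cluster in minority_clusters:                        # clusters C_i (assumed to exist)
    best_auc, best_subset = -1.0, None
    for under_rate in (0.5, 0.75, 1.0):                 # majority undersampling rates (illustrative)
        for synth_rate in (1.0, 2.0, 3.0):              # synthetic generation rates (illustrative)
            X_i, y_i = build_subset(cluster, under_rate, synth_rate)   # hypothetical helper
            auc = validation_auc(X_i, y_i)                              # hypothetical helper
            if auc > best_auc:
                best_auc, best_subset = auc, (X_i, y_i)
    best_subsets.append(best_subset)

X_train = np.concatenate([s[0] for s in best_subsets])   # merge the S_i into S
y_train = np.concatenate([s[1] for s in best_subsets])
classifier = SVC().fit(X_train, y_train)                 # apply the classifier to S
```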

Discussion of our Technique I
Assumption made / justification: the problem is decomposable, i.e., optimizing each subset will yield an optimal merged set. As long as the base classifier performs some kind of local learning (not just global optimization), this assumption should hold.
Question / answer: why did we use different oversampling and undersampling rates? It was previously shown that optimal sampling rates are problem dependent and are therefore best set adaptively (Weiss & Provost, 2003; Estabrook & Japkowicz, 2001).

Experiment Setup I
We tested our system on three data sets:
- Lupus (thanks to James Malley of NIH): minority class 2.8%, data set size 3,839
- Abalone-5 (UCI): minority class 2.75%, data set size 4,177
- Connect-4 (UCI): minority class 9.5%, data set size 11,258

Experiment Setup II
ASMO was compared to two other techniques:
- SMOTE
- O-D, the combination of random over- and under-sampling; O-D outperformed both random oversampling and random undersampling in preliminary experiments.
The base classifier in all experiments is an SVM; k-NN was used in the synthetic generation process to identify each sample's nearest neighbours (within the minority class, or between the minority and majority classes).
The results are reported as ROC curves from 10-fold cross-validation experiments.
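For reference, a sketch of producing an ROC curve from 10-fold cross-validated SVM decision scores with scikit-learn; this mirrors the protocol described above rather than the authors' exact code (and in a full run the resampling step would be applied inside each training fold, e.g., via the pipeline shown earlier).

```python
# Sketch of the evaluation protocol, assuming scikit-learn and existing X, y.
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.svm import SVC

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_predict(SVC(kernel="rbf"), X, y, cv=cv,
                           method="decision_function")    # out-of-fold SVM scores
fpr, tpr, _ = roc_curve(y, scores)
print("10-fold ROC AUC:", auc(fpr, tpr))
```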

Results on Lupus

Results on Abalone-5

Results on Connect-4

Discussion of the Results
On every domain, ASMO slightly outperforms both O-D and SMOTE; in the ROC regions where ASMO does not outperform the other two systems, its performance equals theirs. ASMO's effect seems to be one of smoothing SMOTE's ROC curve.
SMOTE's performance is comparatively better in the two domains where the class imbalance is greater (Lupus, Abalone-5). We expect its relative performance to increase as the imbalance grows even more.

Summary
We presented a few modifications to the state-of-the-art re-sampling system, SMOTE. These modifications had two goals:
- to correct SMOTE's tendency to overgeneralize;
- to make SMOTE more flexible.
We observed slightly improved performance on three domains; however, that improvement came at the expense of greater computation time.

Future Work [This was a very preliminary study!]
- Clean up the system (e.g., use a standard clustering method).
- Test the system more rigorously (test for statistical significance; use TANGO, a test used in the medical domain).
- Test our system on highly imbalanced data sets, to see if, indeed, our design helps address this particular issue.
- Modify the data generation process so as to test biases other than the one proposed by SMOTE.