A Study of the Behavior of Several Methods for Balancing Machine Learning Training Data. Author: Gustavo E. A. Batista. Presenter: Hui Li, University of Ottawa.

Contents
– Introduction
– Why learning from imbalanced data sets might be difficult
– 10 methods
– Experimental evaluation
– Conclusion

Class Imbalance Problem
Problem:
– Class imbalance: examples in the training data belonging to one class heavily outnumber the examples in the other class.
– Most learning systems assume the training set is balanced.
Result:
– It can hurt the performance achieved by existing learning systems.
– The learning system may have difficulty learning the concept related to the minority class.

Why Learning from Imbalanced Data Sets Might Be Difficult?
Domains:
– Medical record databases regarding a rare disease
– Continuous fault-monitoring tasks
The ML community seems to agree on the hypothesis that class imbalance is the major obstacle to inducing good classifiers in imbalanced domains.

Why Learning from Imbalanced Data Sets Might Be Difficult? (Cont'd)
However, standard ML algorithms are capable of inducing good classifiers even from heavily imbalanced training sets, for example the Sick data set.
– Class imbalance is not the only problem responsible for the decrease in performance of learning algorithms.
– So, what are the other factors?

Why Learning from Imbalanced Data Sets Might Be Difficult? (Cont'd)
Sparse cases from the minority class may confuse a classifier such as k-Nearest Neighbor (k-NN).

Motivation and Methods
Answer: the degree of data overlap among the classes.
Motivation:
– Balance the training data
– Remove noisy examples lying on the wrong side of the decision border
Methods:
– Over-sampling method: Smote
– Data cleaning methods: Tomek links and the Edited Nearest Neighbor (ENN) rule

Methods
Baseline methods:
– Random over-sampling
– Random under-sampling
Under-sampling methods:
– Tomek links
– Condensed Nearest Neighbor (CNN) rule
– One-sided selection
– CNN + Tomek links
– Neighborhood Cleaning Rule
Over-sampling method:
– Smote
Combinations of over-sampling and under-sampling:
– Smote + Tomek links
– Smote + ENN

Baseline Methods
– Random over-sampling: random replication of minority class examples. Can increase the likelihood of overfitting.
– Random under-sampling: random elimination of majority class examples. Can discard potentially useful data that could be important for the induction process.
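The two baselines are simple enough to sketch directly. Below is a minimal NumPy sketch of both, assuming a binary problem where the minority label is known; the function names and the target of a fully balanced class ratio are illustrative choices, not taken from the paper.

```python
import numpy as np

def random_oversample(X, y, minority_label, rng=None):
    """Randomly replicate minority examples until both classes have equal size."""
    rng = np.random.default_rng() if rng is None else rng
    minority = np.flatnonzero(y == minority_label)
    majority = np.flatnonzero(y != minority_label)
    extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
    keep = np.concatenate([np.arange(len(y)), extra])
    return X[keep], y[keep]

def random_undersample(X, y, minority_label, rng=None):
    """Randomly discard majority examples until both classes have equal size."""
    rng = np.random.default_rng() if rng is None else rng
    minority = np.flatnonzero(y == minority_label)
    majority = np.flatnonzero(y != minority_label)
    kept_majority = rng.choice(majority, size=len(minority), replace=False)
    keep = np.concatenate([minority, kept_majority])
    return X[keep], y[keep]
```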

Four Groups of Negative Examples
– Noise examples
– Borderline examples: unsafe, since a small amount of noise can make them fall on the wrong side of the decision border.
– Redundant examples
– Safe examples

Tomek links [1]
Purpose: remove both noise and borderline examples.
Tomek link:
– Let Ei and Ej belong to different classes, and let d(Ei, Ej) be the distance between them.
– The pair (Ei, Ej) is called a Tomek link if there is no example El such that d(Ei, El) < d(Ei, Ej) or d(Ej, El) < d(Ei, Ej).
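The definition above reduces to two examples from different classes that are each other's nearest neighbor. A minimal sketch of how such pairs can be found, assuming numeric features and Euclidean distance (the paper does not prescribe this particular implementation):

```python
import numpy as np
from sklearn.metrics import pairwise_distances

def tomek_links(X, y):
    """Return index pairs (i, j) where i and j have different labels
    and are mutual nearest neighbors, i.e. they form a Tomek link."""
    d = pairwise_distances(X)
    np.fill_diagonal(d, np.inf)      # ignore the zero distance to oneself
    nn = d.argmin(axis=1)            # nearest neighbor of every example
    return [(i, int(j)) for i, j in enumerate(nn)
            if y[i] != y[j] and nn[j] == i and i < j]
```

Used as an under-sampling method, only the majority class member of each link is removed; used as a data cleaning step (as in Smote + Tomek links later), both members are removed.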

Tomek links (illustration)

Condensed Nearest Neighbor Rule (CNN rule) [2]
Purpose: pick out points near the boundary between the classes by finding a consistent subset of examples.
– A subset E' ⊆ E is consistent with E if, using a 1-nearest neighbor rule, E' correctly classifies the examples in E.
Algorithm:
– Let E be the original training set.
– Let E' contain all positive examples from E and one randomly selected negative example.
– Classify E with the 1-NN rule using the examples in E'.
– Move all misclassified examples from E to E'.
CNN is sensitive to noise: noisy examples are likely to be misclassified, so many of them will be added to the retained training set.
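The single pass described above can be sketched as follows. This is an illustrative reading of the slide (one classification pass, not iterated until consistency), using scikit-learn's 1-NN classifier; the function name is not from the paper.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def cnn_subset(X, y, minority_label, rng=None):
    """Return indices of E': all minority examples, one random majority
    example, plus every example that this initial subset misclassifies."""
    rng = np.random.default_rng() if rng is None else rng
    subset = np.flatnonzero(y == minority_label).tolist()
    subset.append(int(rng.choice(np.flatnonzero(y != minority_label))))
    knn = KNeighborsClassifier(n_neighbors=1).fit(X[subset], y[subset])
    misclassified = np.flatnonzero(knn.predict(X) != y)
    return np.unique(np.concatenate([subset, misclassified]))
```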

Condensed Nearest Neighbor Rule (illustration)

One-sided selection [3] vs. CNN + Tomek links
– One-sided selection: Tomek links followed by CNN.
– CNN + Tomek links: proposed by the authors. Since finding Tomek links is computationally demanding, it is cheaper to perform it on the reduced data set produced by CNN.

Neighborhood Cleaning Rule (NCL) [4]
Purpose: remove majority class examples. Unlike OSS, it emphasizes data cleaning more than data reduction.
Algorithm:
– Find the three nearest neighbors of each example Ei in the training set.
– If Ei belongs to the majority class and its three nearest neighbors classify it as minority, remove Ei.
– If Ei belongs to the minority class and its three nearest neighbors classify it as majority, remove those three neighbors.
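A sketch of this cleaning rule, assuming binary labels and Euclidean distance. In line with the slide's stated purpose, the sketch only ever removes majority class examples (so in the second case only the majority class neighbors are dropped); treat this as an illustrative reading, not the paper's exact implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def ncl_to_remove(X, y, majority_label):
    """Return indices that this reading of NCL would discard."""
    y = np.asarray(y)
    nn = NearestNeighbors(n_neighbors=4).fit(X)     # self plus 3 neighbors
    _, idx = nn.kneighbors(X)
    remove = set()
    for i, neigh in enumerate(idx[:, 1:]):          # drop the self match
        maj_votes = int((y[neigh] == majority_label).sum())
        if y[i] == majority_label and maj_votes < 2:
            remove.add(i)                           # 3-NN call it minority: drop it
        elif y[i] != majority_label and maj_votes >= 2:
            remove.update(int(j) for j in neigh[y[neigh] == majority_label])
    return sorted(remove)
```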

Smote: Synthetic Minority Over-sampling Technique [6]
Purpose: form new minority class examples by interpolating between minority class examples that lie close together, working in "feature space" rather than "data space".
Algorithm: for each minority class example, introduce synthetic examples along the line segments joining it to any or all of its k nearest minority class neighbors.
Note: depending on the amount of over-sampling required, neighbors are randomly chosen from the k nearest neighbors. For example, with 5 nearest neighbors and 200% over-sampling, only two of the five nearest neighbors are chosen and one sample is generated in the direction of each.

Smote: Synthetic Minority Over-sampling Technique
Synthetic samples are generated in the following way:
– Take the difference between the feature vector (sample) under consideration and its nearest neighbor.
– Multiply this difference by a random number between 0 and 1.
– Add it to the feature vector under consideration.
Example: consider the sample (6, 4) and let (4, 3) be its nearest neighbor; (6, 4) is the sample for which the k nearest neighbors are being identified, and (4, 3) is one of them.
Let f1_1 = 6 and f2_1 = 4, so f2_1 - f1_1 = -2; let f1_2 = 4 and f2_2 = 3, so f2_2 - f1_2 = -1.
The new sample is generated as (f1', f2') = (6, 4) + rand(0,1) * (-2, -1), where rand(0,1) is a random number between 0 and 1.
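A compact sketch of this interpolation, assuming numeric features and that X_min holds only the minority class examples (with more than k of them); the amount of over-sampling is expressed here simply as a count of synthetic points rather than the percentage used in the paper.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote(X_min, n_new, k=5, rng=None):
    """Create n_new synthetic minority examples by interpolating between a
    minority example and one of its k nearest minority class neighbors."""
    rng = np.random.default_rng() if rng is None else rng
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)   # +1: closest match is the point itself
    _, idx = nn.kneighbors(X_min)
    new = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))                      # pick a minority example
        j = idx[i, rng.integers(1, k + 1)]                # pick one of its k neighbors
        gap = rng.random()                                # random number in [0, 1)
        new.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(new)

# The slide's worked example: sample (6, 4) with neighbor (4, 3)
# gives (6, 4) + gap * (-2, -1) for some gap in [0, 1).
```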

Smote + Tomek links
Problem with Smote: it might introduce artificial minority class examples too deeply into the majority class space.
Tomek links are then used for data cleaning: instead of removing only the majority class examples that form Tomek links, examples from both classes are removed.
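For completeness, this combination is also available off the shelf. A sketch using the third-party imbalanced-learn library, assuming it is installed; this is illustrative and not the code used in the paper.

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.combine import SMOTETomek        # third-party library, assumed installed

# Toy data with roughly a 9:1 class ratio.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_res, y_res = SMOTETomek(random_state=0).fit_resample(X, y)
print("before:", Counter(y), "after:", Counter(y_res))
```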

Smote + Tomek links (illustration)

Smote + ENN
– ENN removes any example whose class label differs from the class of at least two of its three nearest neighbors.
– ENN removes more examples than Tomek links do.
– ENN removes examples from both classes.
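The ENN rule itself amounts to a single 3-nearest-neighbor query. A sketch, again assuming numeric features and Euclidean distance:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def enn_to_remove(X, y):
    """Indices ENN would discard: examples whose label disagrees with at
    least two of their three nearest neighbors (examples of either class)."""
    y = np.asarray(y)
    nn = NearestNeighbors(n_neighbors=4).fit(X)   # self plus 3 neighbors
    _, idx = nn.kneighbors(X)
    disagreements = (y[idx[:, 1:]] != y[:, None]).sum(axis=1)
    return np.flatnonzero(disagreements >= 2)
```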

Experimental Evaluation
– 10 methods
– 13 UCI data sets with different degrees of class imbalance

First Stage
Ran C4.5 over the original imbalanced data sets using 10-fold cross-validation.
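As a rough analogue of this setup: the paper uses C4.5, which has no scikit-learn implementation, so a CART decision tree stands in here, and the data set is synthetic rather than one of the 13 UCI sets.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier   # CART, a stand-in for C4.5

# Synthetic imbalanced data with roughly a 95:5 class ratio.
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
auc = cross_val_score(DecisionTreeClassifier(random_state=0), X, y,
                      cv=cv, scoring="roc_auc")
print(f"AUC over 10 folds: {auc.mean():.3f} +/- {auc.std():.3f}")
```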

First Stage
Facts:
– In spite of a large degree of imbalance, the Letter-a and Nursery data sets obtained almost 100% AUC.
Conclusions:
– Domains with non-overlapping classes do not seem to be problematic for learning, no matter the degree of imbalance.
– But when imbalance is allied to highly overlapped classes, it can significantly decrease the number of minority class examples correctly classified.
– On the relationship between training set size and performance: for small imbalanced data sets, when a large degree of class overlap exists and the class is further divided into subclusters, the minority class is poorly represented by an excessively reduced number of examples. For large data sets, the effect of these complicating factors seems to be reduced, since the minority class is better represented by a larger number of examples.

Second Stage
Applied the over-sampling and under-sampling methods to the original data sets.
Facts:
– Pruning rarely leads to an improvement in AUC, for either the original or the balanced data sets.
– All best results (results in bold) were obtained by the over-sampling methods.
– Over-sampling methods are better ranked than the under-sampling methods.
– Random over-sampling in particular is well ranked among the remaining methods.
– Two of the proposed methods, Smote+Tomek and Smote+ENN, are generally ranked among the best for data sets with a small number of positive examples.
Explanation:
– The loss of performance is directly related to the lack of minority class examples in conjunction with other complicating factors.
– Over-sampling is the approach that most directly attacks the problem of the lack of minority class examples.

Second Stage - results for the original and over-sampled data sets

Second Stage - results for the original and under-sampled data sets

Second Stage - performance ranking for the original and balanced data sets for pruned decision trees
– Light gray: results obtained with over-sampling methods.
– Dark gray: results obtained with the original data sets.
– Methods marked with an asterisk obtained statistically inferior results compared to the top-ranked method.

Second Stage - performance ranking for original and balanced data sets for unpruned decision trees

Third Stage
Purpose: measure the syntactic complexity of the induced models. Syntactic complexity is given by two main parameters:
– the mean number of induced rules
– the mean number of conditions per rule
Facts:
– Over-sampling leads to an increase in the number of induced rules compared to those induced from the original data sets.
– Random over-sampling and Smote+ENN provide a smaller increase in the mean number of rules.
– Smote+ENN provides a smaller increase in the mean number of conditions per rule.
Explanation:
– Over-sampling increases the total number of training examples, which usually generates larger decision trees.

Third Stage - mean number of induced rules
– Best results are shown in bold.
– The best results obtained by an over-sampling method are highlighted in light gray.

Third Stage - mean number of conditions per rule

Conclusion
– Class imbalance does not systematically hinder the performance of learning systems.
– Besides class imbalance, the degree of data overlap among the classes is another factor that leads to the decrease in performance of learning algorithms.
– Experiments show that, in general, over-sampling methods provide more accurate results than under-sampling methods considering the area under the ROC curve (AUC).
– Random over-sampling is very competitive with the more complex over-sampling methods.
– Among the over-sampling methods, random over-sampling usually produced the smallest increase in the mean number of induced rules.
– Among the over-sampling methods, Smote+ENN produced the smallest increase in the mean number of conditions per rule.

References
[1] Tomek, I. Two Modifications of CNN. IEEE Transactions on Systems, Man, and Cybernetics SMC-6 (1976).
[2] Hart, P. E. The Condensed Nearest Neighbor Rule. IEEE Transactions on Information Theory IT-14 (1968).
[3] Kubat, M., and Matwin, S. Addressing the Curse of Imbalanced Training Sets: One-Sided Selection. In ICML (1997).
[4] Laurikkala, J. Improving Identification of Difficult Small Classes by Balancing Class Distribution. Tech. Rep. A, University of Tampere.
[5] Wilson, D. L. Asymptotic Properties of Nearest Neighbor Rules Using Edited Data. IEEE Transactions on Systems, Man, and Cybernetics 2, 3 (1972).
[6] Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer, W. P. SMOTE: Synthetic Minority Over-sampling Technique. JAIR 16 (2002).

Thanks! Questions?