Learning from Imbalanced Data


Learning from Imbalanced Data
Prof. Haibo He, Electrical Engineering, University of Rhode Island, Kingston, RI 02881
Computational Intelligence and Self-Adaptive Systems (CISA) Laboratory
http://www.ele.uri.edu/faculty/he/ | Email: he@ele.uri.edu
These lecture notes are based on the following paper: H. He and E. A. Garcia, "Learning from Imbalanced Data," IEEE Trans. Knowledge and Data Engineering, vol. 21, no. 9, pp. 1263-1284, 2009

Learning from Imbalanced Data
The problem: Imbalanced Learning
The solutions: State-of-the-art
The evaluation: Assessment Metrics
The future: Opportunities and Challenges

The Nature of the Imbalanced Learning Problem

The Nature of Imbalanced Learning: The Problem
Explosive availability of raw data
Well-developed algorithms for data analysis
Requirement? Balanced distribution of data and equal costs of misclassification
What about data in reality?

Imbalance is Everywhere
Between-class
Within-class
Intrinsic and extrinsic
Relativity and rarity
Imbalance and small sample size

Growing interest

The Nature of Imbalanced Learning: Mammography Data Set, an example of between-class imbalance

                       Negative/healthy    Positive/cancerous
Number of cases        10,923              260
Category               Majority            Minority
Imbalanced accuracy    ≈ 100%              0-10%

Imbalance can be on the order of 100:1 up to 10,000:1!

Intrinsic and extrinsic imbalance
Intrinsic: imbalance due to the nature of the dataspace
Extrinsic: imbalance due to time, storage, and other factors
Example: data transmission over a specific interval of time with interruptions

The Nature of Imbalanced Learning: Data Complexity

Relative imbalance and absolute rarity
The minority class may be outnumbered, but not necessarily rare
In that case it can still be accurately learned with little disturbance

Imbalanced data with small sample size
Data with high dimensionality and small sample size (e.g., face recognition, gene expression)
Challenges with small sample size:
Embedded absolute rarity and within-class imbalances
Failure of learning algorithms to generalize inductive rules
Difficulty in forming a good classification decision boundary over more features but fewer samples
Risk of overfitting

The Solutions to the Imbalanced Learning Problem

Solutions to imbalanced learning
Sampling methods
Cost-sensitive methods
Kernel-based and active learning methods

Sampling methods
If data is imbalanced, create balance through sampling:
Create a balanced dataset
Modify the data distribution

Sampling methods: Random Sampling
S: training data set; Smin: set of minority class samples; Smaj: set of majority class samples; E: set of generated samples
Random oversampling appends replicated minority samples E to Smin; random undersampling removes randomly selected samples from Smaj
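The slide's sampling formulas are elided in this transcript; as an illustrative sketch only (function names are hypothetical; `s_min`/`s_maj` mirror the slide's Smin/Smaj), the two random schemes can be written as:

```python
import random

def random_oversample(s_min, s_maj, seed=0):
    """Replicate minority samples (drawn with replacement) until both
    classes have the same size, i.e., |Smin ∪ E| = |Smaj|."""
    rng = random.Random(seed)
    e = [rng.choice(s_min) for _ in range(len(s_maj) - len(s_min))]
    return s_min + e, s_maj

def random_undersample(s_min, s_maj, seed=0):
    """Drop randomly selected majority samples until |Smaj| = |Smin|."""
    rng = random.Random(seed)
    return s_min, rng.sample(s_maj, len(s_min))
```

Both produce a balanced class distribution, but with the trade-offs the survey discusses: oversampling risks overfitting on duplicated points, while undersampling discards potentially useful majority data.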

Sampling methods: Informed Undersampling
EasyEnsemble (unsupervised): use independent random subsets of the majority class to create balance and form multiple classifiers
BalanceCascade (supervised): iteratively create balance and pull out redundant samples in the majority class to form a final classifier
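A minimal sketch of EasyEnsemble's sampling step (the base-classifier training and combination are omitted; the function name is hypothetical): each subset pairs the full minority class with an independent random subset of the majority class.

```python
import random

def easy_ensemble_subsets(s_min, s_maj, n_subsets=3, seed=0):
    """Draw n balanced training sets, each = all of Smin plus a random
    |Smin|-sized subset of Smaj; in EasyEnsemble one base classifier is
    trained per subset and their outputs are combined."""
    rng = random.Random(seed)
    return [s_min + rng.sample(s_maj, len(s_min)) for _ in range(n_subsets)]
```

Because every subset sees different majority samples, the ensemble uses far more of the majority data than a single undersampled set would.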

Sampling methods: Synthetic Sampling with Data Generation
Synthetic minority oversampling technique (SMOTE)
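SMOTE's core step, sketched under assumptions (2-D tuples, brute-force neighbor search; the function name and defaults are illustrative): pick a minority point, pick one of its k nearest minority neighbors, and interpolate a synthetic point between them.

```python
import math
import random

def smote(s_min, k=2, n_new=4, seed=0):
    """Generate synthetic minority points: for each new point, choose a
    minority sample x, one of its k nearest minority neighbors nn, and a
    random delta in [0, 1]; emit x + delta * (nn - x)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(s_min)
        neighbors = sorted((p for p in s_min if p != x),
                           key=lambda p: math.dist(x, p))[:k]
        nn = rng.choice(neighbors)
        delta = rng.random()
        synthetic.append(tuple(xi + delta * (ni - xi)
                               for xi, ni in zip(x, nn)))
    return synthetic
```

Each synthetic point lies on a line segment between two real minority samples, which is also why SMOTE can overgeneralize near the class border, the weakness the adaptive variants below address.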

Sampling methods: Adaptive Synthetic Sampling
Borderline-SMOTE: overcomes overgeneralization in the SMOTE algorithm

Sampling methods: Sampling with Data Cleaning

Sampling methods: Cluster-based oversampling (CBO)

Sampling methods: CBO method

Sampling methods: Integration of Sampling and Boosting

Cost-Sensitive Methods
Instead of modifying the data, consider the cost of misclassification

Cost-Sensitive methods: Cost-Sensitive Learning Framework
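The framework slide's formulas are elided in this transcript. The standard formulation uses a cost matrix, C(i, j) = penalty for predicting class j when the true class is i, and predicts the class with minimum expected cost. A minimal sketch (function name and example values are illustrative):

```python
def min_cost_prediction(posteriors, cost):
    """Return the class index minimizing expected cost.
    posteriors[i] = estimated P(class i | x);
    cost[i][j]    = penalty for predicting j when the true class is i."""
    n = len(posteriors)
    expected = [sum(posteriors[i] * cost[i][j] for i in range(n))
                for j in range(n)]
    return min(range(n), key=expected.__getitem__)
```

With an asymmetric matrix such as cost = [[0, 1], [10, 0]], an example with P(minority) = 0.2 is still assigned to the minority class, because expected cost 0.2 * 10 = 2 for predicting the majority exceeds 0.8 * 1 = 0.8 for predicting the minority.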

Cost-Sensitive methods: Cost-Sensitive Dataspace Weighting with Adaptive Boosting

Cost-Sensitive methods: Cost-Sensitive Decision Trees
Cost-sensitive adjustment of the decision threshold: the final threshold should yield the most dominant point on the ROC curve
Cost-sensitive considerations for split criteria: the impurity function should be insensitive to unequal costs
Cost-sensitive pruning schemes: the probability estimate at each node needs improvement to reduce removal of leaves describing the minority concept; Laplace smoothing and Laplace pruning techniques

Cost-Sensitive methods: Cost-Sensitive Neural Networks
Four ways of applying cost sensitivity in neural networks:
Modifying the probability estimates of the outputs: applied only at the testing stage; the original network is kept
Altering the outputs directly: bias the network during training to focus on the expensive class
Modifying the learning rate: set η higher for costly examples and lower for low-cost examples
Replacing the error-minimizing function: use an expected-cost minimization function instead
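The third option (cost-scaled learning rate) can be illustrated on a deliberately tiny, hypothetical perceptron-style update; this is not the paper's method, only a sketch of the "set η higher for costly examples" idea with made-up names.

```python
def cost_sensitive_updates(examples, eta=0.1, class_cost=(1.0, 10.0)):
    """One pass of a single-feature threshold unit where the effective
    learning rate on a misclassified example is eta * class_cost[y],
    so mistakes on the costly (minority) class move the weights more."""
    w, b = 0.0, 0.0
    for x, y in examples:              # y in {0, 1}
        pred = 1 if w * x + b > 0 else 0
        if pred != y:
            step = eta * class_cost[y]          # cost-scaled step size
            sign = 1.0 if y == 1 else -1.0
            w += sign * step * x
            b += sign * step
    return w, b
```

Here a missed minority example (cost 10) shifts the boundary ten times as far as a missed majority example, which is exactly the asymmetry the slide describes.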

Kernel-Based Methods: Kernel-based learning framework
Based on statistical learning theory and Vapnik-Chervonenkis (VC) dimensions
Problems with kernel-based support vector machines (SVMs) on imbalanced data:
Support vectors from the minority concept may contribute less to the final hypothesis
The optimal hyperplane is biased toward the majority class, since minimizing the total error favors the majority

Kernel-Based Methods: Integration of Kernel Methods with Sampling Methods
SMOTE with Different Costs (SDCs) method
Ensembles of over/under-sampled SVMs
SVM with asymmetric misclassification costs
Granular Support Vector Machines with Repetitive Undersampling (GSVM-RU) algorithm

Kernel-Based Methods: Kernel Modification Methods
Kernel classifier construction: orthogonal forward selection (OFS) and regularized orthogonal weighted least squares (ROWLS) estimator
SVM class boundary adjustment: boundary movement (BM), biased penalties (BP), class-boundary alignment (CBA), kernel-boundary alignment (KBA)
Integrated approaches: total margin-based adaptive fuzzy SVM (TAF-SVM); k-category proximal SVM (PSVM) with Newton refinement; support cluster machines (SCMs); kernel neural gas (KNG); the P2PKNNC algorithm; hybrid kernel machine ensemble (HKME); AdaBoost relevance vector machine (RVM); …

Active Learning Methods
SVM-based active learning
Active learning with sampling techniques:
Undersampling and oversampling with active learning for the word sense disambiguation (WSD) imbalanced learning problem
New stopping mechanisms based on maximum confidence and minimal error
Simple active learning heuristic (SALH) approach

Additional methods
One-class learning / novelty detection methods: Mahalanobis-Taguchi System
Combination of imbalanced data and the small sample size problem: rank metrics and multitask learning
Multiclass imbalanced learning: AdaC2.M1; rescaling approach for multiclass cost-sensitive neural networks; the ensemble knowledge for imbalance sample sets (eKISS) method

The Evaluation of the Imbalanced Learning Problem

Assessment Metrics: How to evaluate the performance of imbalanced learning algorithms?
1. Singular assessment metrics
2. Receiver operating characteristic (ROC) curves
3. Precision-Recall (PR) curves
4. Cost curves
5. Assessment metrics for multiclass imbalanced learning

Singular assessment metrics
Limitation of accuracy: sensitivity to data distributions
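This limitation is easy to make concrete with the mammography counts from the earlier slide: a degenerate classifier that labels everyone "healthy" scores high accuracy while never detecting a single cancerous case.

```python
# Counts from the mammography slide earlier in these notes.
negatives, positives = 10_923, 260

# A trivial classifier that predicts "healthy" for every case:
accuracy = negatives / (negatives + positives)
minority_recall = 0 / positives     # it never flags a cancerous case

print(round(accuracy, 3))   # 0.977
```

An accuracy near 98% here says nothing about performance on the class that actually matters, which is why distribution-sensitive accuracy is a poor yardstick for imbalanced problems.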

Singular assessment metrics
Insensitive to data distributions

Singular assessment metrics
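The metric formulas on this slide are elided in the transcript; as a sketch of the singular metrics most often cited for imbalanced learning (the function name and dictionary layout are illustrative):

```python
import math

def singular_metrics(tp, fp, fn, tn, beta=1.0):
    """Standard confusion-matrix metrics for a binary problem, with the
    minority class treated as the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)              # sensitivity / true positive rate
    specificity = tn / (tn + fp)         # true negative rate
    f_beta = ((1 + beta**2) * precision * recall
              / (beta**2 * precision + recall))
    g_mean = math.sqrt(recall * specificity)
    return {"precision": precision, "recall": recall,
            "f_measure": f_beta, "g_mean": g_mean}
```

Unlike accuracy, the F-measure balances precision against recall, and the G-mean couples performance on both classes, so a classifier that ignores the minority class scores near zero on both.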

Assessment Metrics: Receiver Operating Characteristic (ROC) curves
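The curve details are elided here; a minimal sketch of how an ROC curve and its area are computed from classifier scores (assuming no tied scores; function names are illustrative):

```python
def roc_points(scores, labels):
    """Sweep the decision threshold down through the scores and return
    the resulting (FPR, TPR) points, starting at (0, 0)."""
    pos = sum(labels)
    neg = len(labels) - pos
    pts = [(0.0, 0.0)]
    tp = fp = 0
    # Descending score order: lowering the threshold admits one example
    # at a time, moving up (true positive) or right (false positive).
    for s, y in sorted(zip(scores, labels), reverse=True):
        if y:
            tp += 1
        else:
            fp += 1
        pts.append((fp / neg, tp / pos))
    return pts

def auc(points):
    """Trapezoidal area under the (FPR, TPR) curve."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))
```

A perfect ranking (all positives scored above all negatives) gives AUC 1.0; a ranking that is exactly inverted gives 0.0; chance-level ranking hovers around 0.5.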

Assessment Metrics: Precision-Recall (PR) curves
Plot the precision rate over the recall rate
A curve dominates in ROC space (resides in the upper-left) if and only if it dominates in PR space (resides in the upper-right)
PR space has all the analogous benefits of ROC space
Provides a more informative representation of performance under highly imbalanced data

Assessment Metrics: Cost Curves

The Future of the Imbalanced Learning Problem

Opportunities and Challenges
Understanding the fundamental problem
Need for a uniform benchmark platform
Need for standardized evaluation practices
Incremental learning from imbalanced data streams
Semi-supervised learning from imbalanced data

Opportunities and Challenges: Understanding the Fundamental Problem
What kinds of assumptions make imbalanced learning algorithms work better than learning from the original distributions?
To what degree should one balance the original data set?
How do imbalanced data distributions affect the computational complexity of learning algorithms?
What is the general error bound given an imbalanced data distribution?

Opportunities and Challenges: Need for a Uniform Benchmark Platform
Lack of a uniform benchmark for standardized performance assessment
Lack of data sharing and data interoperability across disciplinary domains
Increased procurement costs (time and labor) for the research community as a whole, since each research group must collect and prepare its own data sets

Opportunities and Challenges: Need for Standardized Evaluation Practices
Establish the practice of using curve-based evaluation techniques
A standardized set of evaluation practices for proper comparisons

Opportunities and Challenges: Incremental Learning from Imbalanced Data Streams
How can we autonomously adjust the learning algorithm if an imbalance is introduced in the middle of the learning period?
Should we consider rebalancing the data set during the incremental learning period? If so, how can we accomplish this?
How can we accumulate previous experience and use this knowledge to adaptively improve learning from new data?
How do we handle newly introduced concepts that are themselves imbalanced (the imbalanced concept drift issue)?

Opportunities and Challenges: Semi-supervised Learning from Imbalanced Data
How can we identify whether an unlabeled data example comes from a balanced or imbalanced underlying distribution?
Given imbalanced labeled training data, what are effective and efficient methods for recovering the unlabeled data examples?
What kinds of biases may be introduced in the recovery process (through conventional semi-supervised learning techniques) given imbalanced labeled data?

Reference
These lecture notes are based on the following paper: H. He and E. A. Garcia, "Learning from Imbalanced Data," IEEE Trans. Knowledge and Data Engineering, vol. 21, no. 9, pp. 1263-1284, 2009
Should you have any comments or suggestions regarding these lecture notes, please contact Dr. Haibo He at he@ele.uri.edu; Web: http://www.ele.uri.edu/faculty/he/