Data Mining: Class Imbalance (last modified 4/3/19)
Class Imbalance Problem
- Lots of classification problems have skewed classes (more records from one class than another):
  - Credit card fraud
  - Intrusion detection
  - Telecommunication switching equipment failures
Challenges
- Evaluation measures such as accuracy are not well-suited for imbalanced classes
- Prediction algorithms are generally focused on maximizing accuracy
- Detecting the rare class is like finding a needle in a haystack
Confusion Matrix

                          PREDICTED CLASS
                       Class=P     Class=N
ACTUAL     Class=P     a (TP)      b (FN)
CLASS      Class=N     c (FP)      d (TN)

a: TP (true positive)    b: FN (false negative)
c: FP (false positive)   d: TN (true negative)
Accuracy

                          PREDICTED CLASS
                       Class=P     Class=N
ACTUAL     Class=P     a (TP)      b (FN)
CLASS      Class=N     c (FP)      d (TN)

Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
Problem with Accuracy
- Consider a 2-class problem:
  - Number of Class NO examples = 990
  - Number of Class YES examples = 10
- If a model predicts everything to be class NO, accuracy is 990/1000 = 99%
- This is misleading because the model does not detect any class YES example
- Detecting the rare class is usually more interesting (e.g., frauds, intrusions, defects, etc.)
- Underlying issue is that the two classes are not equally important
Metrics Suited to Class Imbalance
- We need metrics tailored to class imbalance
- Note that if a metric treats all classes equally, then, relatively speaking, the minority class counts for more than it does when accuracy is used
- For example, balanced accuracy is the average of the accuracy on each class, so the minority class counts as much as the majority class; with plain accuracy, classes are effectively weighted by their number of examples
- Alternatively, we could use metrics that describe performance on the minority class
- See the sketch after this list for accuracy vs. balanced accuracy on the 990/10 example
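A minimal sketch of the 990/10 example above, assuming scikit-learn is available; it compares accuracy and balanced accuracy for a model that predicts everything as the majority class NO (encoded here as 0).

```python
# Sketch: accuracy vs. balanced accuracy on the 990 NO / 10 YES example.
# Assumes scikit-learn is installed; labels are 0 = NO (majority), 1 = YES (minority).
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

y_true = np.array([0] * 990 + [1] * 10)   # 990 NO, 10 YES
y_pred = np.zeros_like(y_true)            # model predicts NO for everything

print(accuracy_score(y_true, y_pred))           # 0.99 -- looks great
print(balanced_accuracy_score(y_true, y_pred))  # 0.50 -- reveals the YES class is never detected
```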
Precision, Recall, and F-measure

                          PREDICTED CLASS
                       Class=P     Class=N
ACTUAL     Class=P     a (TP)      b (FN)
CLASS      Class=N     c (FP)      d (TN)

Precision (p) = a / (a + c) = TP / (TP + FP)
Recall (r)    = a / (a + b) = TP / (TP + FN)
F-measure (F) = 2rp / (r + p) = 2a / (2a + b + c)

A small code sketch of these measures follows.
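A small sketch computing the three measures directly from confusion-matrix counts, using the same a/b/c/d notation as the table above; the helper function name is just for illustration.

```python
# Sketch: precision, recall, and F-measure from confusion-matrix counts.
# a = TP, b = FN, c = FP, d = TN (same notation as the matrix above).
def precision_recall_f(a, b, c, d):
    precision = a / (a + c)
    recall = a / (a + b)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Example 1 from the next slide: TP=10, FN=0, FP=10, TN=980
print(precision_recall_f(10, 0, 10, 980))  # (0.5, 1.0, 0.666...)
```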
F-measure vs. Accuracy: Example 1

                          PREDICTED CLASS
                       Class=Yes   Class=No
ACTUAL     Class=Yes      10           0
CLASS      Class=No       10         980

Precision = 10/(10+10) = 0.5
Recall    = 10/(10+0)  = 1.0
F-measure = 2 x 1 x 0.5 / (1+0.5) = 0.66
Accuracy  = 990/1000 = 0.99

Accuracy is significantly higher than F-measure. Now compare this to the first matrix on the next slide: precision, recall, and F-measure are identical, since the TN (bottom right) value has no impact on them.
F-measure vs. Accuracy: Example 2

                          PREDICTED CLASS
                       Class=Yes   Class=No
ACTUAL     Class=Yes      10           0
CLASS      Class=No       10          10

Precision = 10/(10+10) = 0.5
Recall    = 10/(10+0)  = 1.0
F-measure = 2 x 1 x 0.5 / (1+0.5) = 0.66
Accuracy  = 20/30 = 0.66

                          PREDICTED CLASS
                       Class=Yes   Class=No
ACTUAL     Class=Yes       1           9
CLASS      Class=No        0         990

Precision = 1/(1+0) = 1.0
Recall    = 1/(1+9) = 0.1
F-measure = 2 x 0.1 x 1 / (1+0.1) = 0.18
Accuracy  = 991/1000 = 0.991
F-measure vs. Accuracy: Example 3

                          PREDICTED CLASS
                       Class=Yes   Class=No
ACTUAL     Class=Yes      40          10
CLASS      Class=No       10          40

Precision = 40/(40+10) = 0.8
Recall    = 40/(40+10) = 0.8
F-measure = 0.8
Accuracy  = 80/100 = 0.8

Note that in this case F-measure = accuracy. Also, there are equal numbers of examples in each class.
F-measure vs. Accuracy: Example 4

                          PREDICTED CLASS
                       Class=Yes   Class=No
ACTUAL     Class=Yes      40          10
CLASS      Class=No       10          40

Precision = 0.8   Recall = 0.8   F-measure = 0.8   Accuracy = 0.8

                          PREDICTED CLASS
                       Class=Yes   Class=No
ACTUAL     Class=Yes      40          10
CLASS      Class=No     1000        4000

Precision = 40/1040 ≈ 0.04   Recall = 0.8   F-measure ≈ 0.07   Accuracy = 4040/5050 = 0.8

Accuracy is unchanged, but precision and F-measure collapse once the classes become highly imbalanced.
Measures of Classification Performance

                          PREDICTED CLASS
                          Yes         No
ACTUAL     Yes            TP          FN
CLASS      No             FP          TN

TP Rate (TPR) = Sensitivity = TP / (TP + FN)
FP Rate (FPR) = FP / (FP + TN)
Specificity   = TN / (TN + FP)

Sensitivity and specificity are often used in statistics. You are only responsible for knowing (by memory) accuracy, error rate, precision, and recall; the others are for reference. However, TP Rate (TPR) and FP Rate (FPR) are used in ROC analysis. Since we study this, they are important, but I will give them to you on an exam.
Alternative Measures: the same two confusion matrices as in Example 4 above (40/10/10/40 and 40/10/1000/4000); accuracy is 0.8 for both, while precision and F-measure differ sharply.
Alternative Measures: three further confusion-matrix examples.
ROC (Receiver Operating Characteristic)
- A graphical approach for displaying the trade-off between detection rate and false alarm rate
- Developed in the 1950s for signal detection theory, to analyze noisy signals
- An ROC curve plots TPR against FPR
- Performance of a model is represented as a point on the ROC curve
- Changing the threshold parameter of the classifier changes the location of the point
ROC Curve
Points are (TPR, FPR):
- (0,0): declare everything to be the negative class
- (1,1): declare everything to be the positive class
- (1,0): ideal
- Diagonal line: random guessing
- Below the diagonal line: prediction is opposite of the true class
Using ROC for Model Comparison
- No model consistently outperforms the other:
  - M1 is better for small FPR
  - M2 is better for large FPR
- Area Under the ROC Curve (AUC):
  - Ideal: AUC = 1
  - Random guess: AUC = 0.5
  - A larger AUC is better, but the model with the largest AUC may not be best for a given set of costs
ROC Curve
- To draw an ROC curve, the classifier must produce continuous-valued output
  - Outputs are used to rank test records, from the record most likely to be in the positive class to the least likely
- Many classifiers produce only discrete outputs (i.e., the predicted class)
- How do we get continuous-valued outputs from decision trees, rule-based classifiers, neural networks, Bayesian classifiers, k-nearest neighbors, and SVMs? (See the sketch below for how this is commonly done in practice.)
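A brief sketch (assuming scikit-learn) of one common way to obtain continuous scores from classifiers that nominally output discrete labels: a decision tree's leaf class proportions via predict_proba, and an SVM's signed distance to the separating hyperplane via decision_function. The synthetic dataset is purely illustrative.

```python
# Sketch: obtaining continuous-valued scores for ROC analysis with scikit-learn.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

# Decision tree: predict_proba returns the class proportions in the leaf each record falls into
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
tree_scores = tree.predict_proba(X)[:, 1]        # estimate of P(class = 1 | x)

# SVM: decision_function returns the signed distance to the separating hyperplane
svm = SVC(kernel="linear").fit(X, y)
svm_scores = svm.decision_function(X)            # larger => more likely positive

# Either score vector can be used to rank records and sweep thresholds for an ROC curve
print(tree_scores[:5], svm_scores[:5])
```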
Example: Decision Trees. Continuous-valued outputs: the fraction of positive training records at each leaf can be used as the score.
ROC Curve Example
How to Construct an ROC Curve
- Use a classifier that produces a continuous-valued score
  - The more likely the instance is to be in the + class, the higher the score
- Sort the instances in decreasing order according to the score
- Apply a threshold at each unique value of the score
- Count the number of TP, FP, TN, FN at each threshold
  - TPR = TP / (TP + FN)
  - FPR = FP / (FP + TN)

Instance   Score   True Class
   1        0.95       +
   2        0.93
   3        0.87       -
   4        0.85
   5
   6
   7        0.76
   8        0.53
   9        0.43
  10        0.25
How to Construct an ROC Curve (continued): apply each threshold (predict + when score >= threshold), compute TPR and FPR at that threshold, and plot the resulting (FPR, TPR) points to form the ROC curve. A code sketch of this construction follows.
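A minimal sketch of the construction just described, with scikit-learn's roc_curve used to check the manual sweep; the labels here are illustrative stand-ins, not the exact values from the table above.

```python
# Sketch: constructing an ROC curve by sweeping a threshold over the scores.
# The labels below are illustrative, not the exact table from the slide.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

scores = np.array([0.95, 0.93, 0.87, 0.85, 0.76, 0.53, 0.43, 0.25])
labels = np.array([1,    1,    0,    1,    0,    1,    0,    0])  # 1 = +, 0 = -

# Manual construction: one (FPR, TPR) point per unique threshold
P, N = labels.sum(), (1 - labels).sum()
for t in sorted(np.unique(scores), reverse=True):
    pred = (scores >= t).astype(int)          # predict + when score >= threshold
    tp = ((pred == 1) & (labels == 1)).sum()
    fp = ((pred == 1) & (labels == 0)).sum()
    print(f"threshold={t:.2f}  TPR={tp / P:.2f}  FPR={fp / N:.2f}")

# scikit-learn performs the same sweep and also gives the AUC
fpr, tpr, thresholds = roc_curve(labels, scores)
print("AUC =", roc_auc_score(labels, scores))
```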
Handling the Class Imbalance Problem
- Class-based ordering (e.g., RIPPER)
  - Rules for the rare class have higher priority
- Cost-sensitive classification
  - Misclassifying rare class as majority class is more expensive than misclassifying majority as rare class
- Sampling-based approaches
Cost Matrix

C(i, j): cost of misclassifying a class i example as class j

Counts f(i, j):
                          PREDICTED CLASS
                       Class=Yes      Class=No
ACTUAL     Class=Yes   f(Yes, Yes)    f(Yes, No)
CLASS      Class=No    f(No, Yes)     f(No, No)

Costs C(i, j):
                          PREDICTED CLASS
                       Class=Yes      Class=No
ACTUAL     Class=Yes   C(Yes, Yes)    C(Yes, No)
CLASS      Class=No    C(No, Yes)     C(No, No)

Total cost of a model = sum over i, j of f(i, j) x C(i, j)
Computing Cost of Classification

Cost Matrix:
                          PREDICTED CLASS
               C(i, j)       +        -
ACTUAL            +         -1       100
CLASS             -          1         0

Model M1:
                          PREDICTED CLASS
                             +        -
ACTUAL            +         150       40
CLASS             -          60      250

Accuracy = 80%    Cost = 150(-1) + 40(100) + 60(1) + 250(0) = 3910

Model M2:
                          PREDICTED CLASS
                             +        -
ACTUAL            +         250       45
CLASS             -           5      200

Accuracy = 90%    Cost = 250(-1) + 45(100) + 5(1) + 200(0) = 4255

A code sketch of these cost computations follows.
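A small sketch reproducing the two costs above in plain NumPy, with the cost matrix and confusion matrices laid out in (actual, predicted) order.

```python
# Sketch: total cost = sum over cells of count(actual i, predicted j) * cost(i, j).
import numpy as np

# Rows = actual class (+, -), columns = predicted class (+, -)
cost = np.array([[-1, 100],
                 [ 1,   0]])

m1 = np.array([[150,  40],
               [ 60, 250]])
m2 = np.array([[250,  45],
               [  5, 200]])

for name, cm in [("M1", m1), ("M2", m2)]:
    accuracy = np.trace(cm) / cm.sum()
    total_cost = (cm * cost).sum()
    print(name, "accuracy =", accuracy, "cost =", total_cost)
# M1: accuracy = 0.8, cost = 3910;  M2: accuracy = 0.9, cost = 4255
```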
Cost-Sensitive Classification
- Example: Bayesian classifier
- Given a test record x:
  - Compute p(i|x) for each class i
  - Decision rule: classify x as class k if p(k|x) is the maximum over all classes i
  - For 2 classes, classify x as + if p(+|x) > p(-|x)
- This decision rule implicitly assumes that C(+,+) = C(-,-) = 0 and C(+,-) = C(-,+)
Cost-Sensitive Classification
- General decision rule: classify test record x as the class k that minimizes the expected cost
  Cost(k) = sum over classes i of p(i|x) C(i, k)
- 2-class case:
  - Cost(+) = p(+|x) C(+,+) + p(-|x) C(-,+)
  - Cost(-) = p(+|x) C(+,-) + p(-|x) C(-,-)
  - Decision rule: classify x as + if Cost(+) < Cost(-)
- A code sketch of this rule follows this list
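A minimal sketch of the expected-cost rule, assuming a classifier that can output class probabilities (scikit-learn's predict_proba); the cost values are the ones from the earlier example, and the dataset is synthetic and illustrative.

```python
# Sketch: cost-sensitive prediction by minimizing expected cost instead of maximizing p(k|x).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

# cost[i, j] = cost of predicting class j when the true class is i (classes: 0 = -, 1 = +)
cost = np.array([[0.0,   1.0],    # actual -
                 [100.0, -1.0]])  # actual +  (missing a + is very expensive)

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

proba = clf.predict_proba(X)                 # columns: p(-|x), p(+|x)
expected_cost = proba @ cost                 # expected_cost[:, j] = sum_i p(i|x) * C(i, j)
cost_sensitive_pred = expected_cost.argmin(axis=1)

print("argmax-probability predictions of +:", clf.predict(X).sum())
print("cost-sensitive predictions of +:    ", cost_sensitive_pred.sum())
```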
Sampling-Based Approaches
- Modify the distribution of the training data so the rare class is well represented in the training set
  - Undersample the majority class
  - Oversample the rare class
  - Or try something smarter like SMOTE
    - SMOTE = Synthetic Minority Oversampling Technique
    - Generates new minority examples near existing ones
- Advantages and disadvantages
  - Undersampling means throwing away data (bad)
  - Oversampling without SMOTE means duplicating examples, which can lead to overfitting
- A SMOTE sketch follows this list
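A short sketch using the imbalanced-learn package (a separate install from scikit-learn, assumed available here) to apply SMOTE to an imbalanced training set.

```python
# Sketch: rebalancing a training set with SMOTE (requires the imbalanced-learn package).
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
print("before:", Counter(y))        # heavily skewed toward class 0

smote = SMOTE(random_state=0)       # interpolates between a minority example and its neighbors
X_res, y_res = smote.fit_resample(X, y)
print("after: ", Counter(y_res))    # classes now balanced

# X_res / y_res would then be used to train the classifier; the test set is left untouched.
```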
Discussion of "Mining with Rarity: A Unifying Framework"
Mining with Rarity: A Unifying Framework
This paper discusses:
- What is rarity?
- Why does rarity make learning difficult?
- How can rarity be addressed?
What is Rarity?
- Rare classes
  - Absolute rarity: the number of examples belonging to the rare class is small, and this by itself makes it difficult to learn the class well
  - Relative rarity: the number of examples in the rare class(es) is much smaller than in the common class(es); the relative difference causes problems
- Rare cases
  - Within a class there may be rare cases, which correspond to sub-concepts. For example, in medical diagnosis there may be a class associated with "disease", and rare cases correspond to specific rare diseases.
- Examples belong to 2 classes: + (minority) and - (majority)
- Solid lines represent true decision boundaries; dashed lines represent the learned decision boundaries
- A1-A5 cover the minority class, with A1 being a common case and A2-A5 being rare cases
- B1-B2 cover the majority class, with B1 being a very common case and B2 being a rare case
Why Does Rarity Make Learning Difficult?
- Improper evaluation metrics (e.g., accuracy)
- Lack of data: absolute rarity (see next slide)
- Relative rarity (class imbalance): learners are biased toward the more common class
- Data fragmentation: rare cases/classes are affected more
- Inappropriate inductive bias: most classifiers use a maximum-generality bias
- Noise (see the second slide following)
Problem: Absolute Rarity
- Data from the same distribution is shown on the left and right sides, but more data is sampled on the right side
- Note that the learned decision boundaries (dashed) are much closer to the true decision boundaries (solid) on the right side; this is due to having more data
- The left side shows the problem with absolute rarity
Problem: Noise
- Noise has been added on the right side: some "-" examples appear in regions that previously had only "+", and vice versa
- A1 now has 4 "-" examples, but its decision boundary is not really affected because there are so many "+" examples; the common case A1 is not really impacted by noise
- A3 now has 2 "-" examples and, as a result, its "+" example is no longer learned; no decision boundary (dashed) is learned for it. Rare cases are heavily impacted by noise.
Methods for Addressing Rarity
- Use more appropriate evaluation metrics, like AUC and the F1-measure
- Non-greedy search techniques
  - More robust search methods (e.g., genetic algorithms) can identify subtle interactions between many features
- Use a more appropriate inductive bias
- Knowledge/human interaction
  - A human can insert domain knowledge by using good features
Methods for Addressing Rarity
- Learn only the rare class
  - Some methods, like RIPPER, use examples from all classes but can focus on learning the rare class (RIPPER uses a default rule for the majority class)
- One-class learning / recognition-based methods
  - Learn using data from only one class, in this case the minority class
  - This is very different from methods that learn to distinguish between classes
  - Example: gait biometrics using sensor data. Given a sample of data from person X, assume that if new data is close to X's data (within a threshold), then it is from X.
  - A one-class sketch follows this list
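A brief sketch of recognition-based learning using scikit-learn's OneClassSVM: the model is fit on minority-class data only and then flags new records as belonging to that class or not. The data here is synthetic and purely illustrative.

```python
# Sketch: one-class (recognition-based) learning -- fit on minority-class data only.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
minority_train = rng.normal(loc=0.0, scale=1.0, size=(100, 3))   # data from the target class only

occ = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1).fit(minority_train)

new_from_class = rng.normal(loc=0.0, scale=1.0, size=(5, 3))
new_outliers   = rng.normal(loc=6.0, scale=1.0, size=(5, 3))

print(occ.predict(new_from_class))   # mostly +1 -> accepted as the learned class
print(occ.predict(new_outliers))     # mostly -1 -> rejected
```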
Methods for Addressing Rarity
- Cost-sensitive learning
  - If we know the actual costs of the different types of errors, we can use cost-sensitive learning, which will tend to favor the minority class (whose errors are usually more costly); see the class-weight sketch after this list
- Sampling
  - A very common and well-studied method (see next slide)
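A minimal sketch of one common way to get cost-sensitive behavior in practice, via scikit-learn's class_weight parameter; the specific weights are illustrative stand-ins for known error costs.

```python
# Sketch: approximating cost-sensitive learning with per-class weights.
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)

# Errors on class 1 (the minority) are treated as 20x more costly than errors on class 0.
clf = LogisticRegression(class_weight={0: 1, 1: 20}, max_iter=1000).fit(X, y)

# class_weight="balanced" instead weights classes inversely to their frequency.
print("predicted minority count:", clf.predict(X).sum())
```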
Sampling
- Can handle class imbalance by eliminating or reducing it:
  - Oversample the minority class (duplicate examples)
  - Undersample the majority class (discard examples)
  - Or do a bit of both
- Instead of random over/under-sampling, create new minority-class examples
  - SMOTE: Synthetic Minority Oversampling Technique
  - Creates new minority examples between existing ones
- A random over/under-sampling sketch follows this list
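A short sketch of plain random over- and undersampling using scikit-learn's resample utility (no extra packages assumed); SMOTE itself is shown in the earlier sketch.

```python
# Sketch: random oversampling of the minority class and undersampling of the majority class.
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = np.array([0] * 950 + [1] * 50)          # 0 = majority, 1 = minority

X_min, X_maj = X[y == 1], X[y == 0]

# Oversample: draw minority examples with replacement until the classes match
X_min_over = resample(X_min, replace=True, n_samples=len(X_maj), random_state=0)

# Undersample: draw a majority subsample without replacement down to the minority size
X_maj_under = resample(X_maj, replace=False, n_samples=len(X_min), random_state=0)

print(len(X_min_over), len(X_maj_under))    # 950 and 50
```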
Some Final Points
- In other research, I showed that examples belonging to the rare class almost always have a higher error rate
- I also showed that small disjuncts have a higher error rate than large disjuncts
  - Small disjuncts are rules/leaf nodes that cover few examples
- There is no really good way to handle the class imbalance problem, especially with absolute rarity
  - Ideally, get more data / more minority examples