Imbalanced data David Kauchak CS 451 – Fall 2013.

Admin: Assignment 3 – how did it go? Do the experiments help? Assignment 4. Course feedback.

Phishing

Setup: for 1 hour, Google collects 1M e-mails at random. They pay people to label each one as “phishing” or “not-phishing”. They give the data to you to learn to classify e-mails as phishing or not. You, having taken ML, try out a few of your favorite classifiers. You achieve an accuracy of 99.997%. Should you be happy?

Imbalanced data: The phishing problem is what is called an imbalanced data problem. This occurs when there is a large discrepancy between the number of examples with each class label; e.g. in our 1M-example dataset only about 30 would actually be phishing e-mails (99.997% not-phishing, 0.003% phishing). What is probably going on with our classifier?

Imbalanced data: always predicting not-phishing gives 99.997% accuracy. Why does the classifier learn this?

Imbalanced data: Many classifiers are designed to optimize error/accuracy. This tends to bias performance towards the majority class. It can happen any time there is an imbalance in the data, but it is particularly pronounced when the imbalance is more extreme.

Imbalanced problem domains: Besides phishing (and spam), what are some other imbalanced problem domains?

Imbalanced problem domains: medical diagnosis; predicting faults/failures (e.g. hard-drive failures, mechanical failures, etc.); predicting rare events (e.g. earthquakes); detecting fraud (credit card transactions, internet traffic).

Imbalanced data: current classifiers. Given the labeled data (99.997% not-phishing, 0.003% phishing), how will our current classifiers do on this problem?

Imbalanced data: current classifiers. All will do fine if the data can be easily separated/distinguished. Decision trees: explicitly minimize training error when pruning and pick the “majority” label at leaves, so they tend to do very poorly on imbalanced problems. k-NN: even for small k, the majority class will tend to overwhelm the vote. Perceptron: can be reasonable since it only updates when a mistake is made, but it can take a long time to learn.

Part of the problem: evaluation. Accuracy is not the right measure of classifier performance in these domains. Other ideas for evaluation measures?

“Identification” tasks: View the task as trying to find/identify the “positive” examples (i.e. the rare events). Precision: the proportion of test examples predicted as positive that are correct = (# correctly predicted as positive) / (# examples predicted as positive). Recall: the proportion of test examples labeled as positive that are correctly predicted = (# correctly predicted as positive) / (# positive examples in test set).

“Identification” tasks: precision = (# correctly predicted as positive) / (# examples predicted as positive); recall = (# correctly predicted as positive) / (# positive examples in test set). [Diagram: the set of examples predicted positive vs. the set of all positive examples; precision is the overlap relative to the predicted-positive set, recall is the overlap relative to the all-positive set.]
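
The following is a minimal sketch (mine, not from the slides) of computing these two quantities, assuming true and predicted labels are encoded as 1 for positive and 0 for negative:

```python
def precision_recall(true_labels, predicted_labels):
    """Precision and recall for the positive class (label 1)."""
    true_positives = sum(1 for t, p in zip(true_labels, predicted_labels)
                         if t == 1 and p == 1)
    predicted_positives = sum(1 for p in predicted_labels if p == 1)
    actual_positives = sum(1 for t in true_labels if t == 1)

    precision = true_positives / predicted_positives if predicted_positives else 0.0
    recall = true_positives / actual_positives if actual_positives else 0.0
    return precision, recall

# A hypothetical dataset consistent with the worked example below
# (3 positive examples, 4 predicted positive, 2 of those correct):
print(precision_recall([1, 1, 1, 0, 0, 0, 0],
                       [1, 0, 1, 1, 1, 0, 0]))  # (0.5, 0.666...)
```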

Precision and recall: [worked example] Given a small test set with columns for the true label and the predicted label, compute precision = (# correctly predicted as positive) / (# examples predicted as positive) and recall = (# correctly predicted as positive) / (# positive examples in test set).

Precision and recall: [worked example, continued] For this dataset, precision = 2/4 = 0.5 (2 of the 4 examples predicted positive are truly positive) and recall = 2/3 ≈ 0.67 (2 of the 3 truly positive examples are predicted positive).

Precision and recall: Why do we have both measures? How can we maximize precision? How can we maximize recall?

Maximizing precision: don’t predict anything as positive!

Maximizing recall: predict everything as positive!

Precision vs. recall: Often there is a tradeoff between precision and recall; increasing one tends to decrease the other. For our algorithms, how might we increase/decrease precision/recall?

Precision/recall tradeoff: For many classifiers we can get some notion of the prediction confidence. Only predict positive if the confidence is above a given threshold; by varying this threshold, we can vary precision and recall. [Example table with columns: data, true label, predicted label, and confidence; the confidences are 0.90, 0.80, 0.75, 0.60, 0.55, 0.50, 0.20.]

Precision/recall tradeoff: Sort the examples, putting the most confident positive predictions at the top and the most confident negative predictions at the bottom. Then calculate precision/recall at each break point/threshold, classifying everything above the threshold as positive and everything else as negative. [The example table, now sorted.]

Precision/recall tradeoff: with the threshold set just below the top two examples, precision = 1/2 = 0.5 and recall = 1/3 = 0.33.

Precision/recall tradeoff: with the threshold set just below the top five examples, precision = 3/5 = 0.6 and recall = 3/3 = 1.0.

Precision/recall tradeoff: with the threshold below all seven examples (everything predicted positive), precision = 3/7 = 0.43 and recall = 3/3 = 1.0.

Precision/recall tradeoff: repeating this at every break point gives, from top to bottom, the (precision, recall) pairs (1.0, 0.33), (0.5, 0.33), (0.66, 0.66), (0.5, 0.66), (0.6, 1.0), (0.5, 1.0), (0.43, 1.0).
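
Here is a small sketch of the break-point calculation (my own code, with a hypothetical label ordering chosen to match the numbers above): walk down the sorted list, treat the top k examples as positive, and record precision and recall at each k.

```python
def precision_recall_points(sorted_true_labels):
    """Given true labels (1 = positive, 0 = negative) in the sorted order described
    above, return the (precision, recall) pair at every break point."""
    total_positives = sum(sorted_true_labels)
    points = []
    true_positives = 0
    for k, label in enumerate(sorted_true_labels, start=1):
        true_positives += label
        points.append((true_positives / k, true_positives / total_positives))
    return points

# Hypothetical ordering consistent with the slides' worked example:
for precision, recall in precision_recall_points([1, 0, 1, 0, 1, 0, 0]):
    print(round(precision, 2), round(recall, 2))
# 1.0 0.33 / 0.5 0.33 / 0.67 0.67 / 0.5 0.67 / 0.6 1.0 / 0.5 1.0 / 0.43 1.0
```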

precision-recall curve

Which system is better? [Two precision-recall curves, precision on the y-axis and recall on the x-axis.] How can we quantify this?

Area under the curve: The area under the curve (AUC) is one metric that encapsulates both precision and recall: calculate the precision/recall values for all thresholdings of the test set (like we did before), then calculate the area under the resulting curve. It can also be calculated as the average precision over all of the recall points.
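
As a rough sketch of the “average precision” view (my own code, not the course’s): average the precision values at the points where a new positive example is recovered, i.e. where recall increases.

```python
def average_precision(sorted_true_labels):
    """Area under the precision-recall curve, approximated as the mean of the
    precision values at each recall point (each correctly retrieved positive)."""
    total_positives = sum(sorted_true_labels)
    true_positives = 0
    precision_at_recall_points = []
    for k, label in enumerate(sorted_true_labels, start=1):
        if label == 1:
            true_positives += 1
            precision_at_recall_points.append(true_positives / k)
    return sum(precision_at_recall_points) / total_positives

# Same hypothetical ordering as above:
print(average_precision([1, 0, 1, 0, 1, 0, 0]))  # (1.0 + 0.67 + 0.6) / 3 ≈ 0.76
```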

Area under the curve? [The same two precision-recall curves.] Any concerns/problems?

Area under the curve? [Two precision-recall curves.] For real use, we are often only interested in performance in a particular range. Eventually, we need to deploy: how do we decide what threshold to use?

Area under the curve? Ideas? We’d like a compromise between precision and recall.

A combined measure: F. A combined measure that assesses the precision/recall tradeoff is the F measure (a weighted harmonic mean): F = 1 / (α · 1/P + (1 − α) · 1/R), where P is precision and R is recall.

F1-measure: The most common choice is α = 0.5, an equal balance/weighting between precision and recall: F1 = 2PR / (P + R).

A combined measure: F. A combined measure that assesses the precision/recall tradeoff is the F measure (a weighted harmonic mean). Why the harmonic mean? Why not the normal mean (i.e. the arithmetic average)?

F1 and other averages Harmonic mean encourages precision/recall values that are similar!
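
A tiny illustration (my own example, not from the slides): on a very imbalanced problem, “predict everything positive” gets recall 1.0 but near-zero precision. The arithmetic mean still scores this around 0.5, while the harmonic mean (F1) stays near zero, which is why the harmonic mean is preferred.

```python
def arithmetic_mean(precision, recall):
    return (precision + recall) / 2

def f1(precision, recall):
    # Harmonic mean of precision and recall (alpha = 0.5).
    return 2 * precision * recall / (precision + recall)

# Predict-everything-positive on a highly imbalanced test set:
precision, recall = 0.003, 1.0
print(arithmetic_mean(precision, recall))  # ~0.50  (misleadingly decent)
print(f1(precision, recall))               # ~0.006 (reflects the terrible precision)
```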

Evaluation summarized: Accuracy is often NOT an appropriate evaluation metric for imbalanced data problems. Precision/recall capture different characteristics of our classifier. AUC and F1 can be used as a single metric to compare algorithm variations (and to tune hyperparameters).

Phishing – imbalanced data

Black box approach. Abstraction: we have a generic binary classifier that outputs +1 or -1 (and optionally also a confidence/score); how can we use it to solve our new problem? Can we do some pre-processing/post-processing of our data to allow us to still use our binary classifiers?

Idea 1: subsampling. Create a new training data set by including all k “positive” examples and randomly picking k “negative” examples. This turns the 99.997% not-phishing / 0.003% phishing labeled data into a 50% not-phishing / 50% phishing training set. Pros/cons?

Subsampling. Pros: easy to implement; training becomes much more efficient (smaller training set); for some domains, can work very well. Cons: throws away a lot of data/information.
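
A minimal sketch of idea 1 (mine, not course code), assuming each example is a (features, label) pair with label 1 for the rare positive class and 0 otherwise:

```python
import random

def subsample(examples, seed=0):
    """Keep all positive examples and an equal-sized random sample of negatives."""
    positives = [ex for ex in examples if ex[1] == 1]
    negatives = [ex for ex in examples if ex[1] == 0]
    rng = random.Random(seed)
    balanced = positives + rng.sample(negatives, k=len(positives))
    rng.shuffle(balanced)
    return balanced
```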

Idea 2: oversampling. Create a new training data set by including all m “negative” examples and m “positive” examples: repeat each positive example a fixed number of times, or sample the positives with replacement. Again, the 99.997% / 0.003% labeled data becomes a 50% / 50% training set. Pros/cons?

Oversampling. Pros: easy to implement; utilizes all of the training data; tends to perform well in a broader set of circumstances than subsampling. Cons: computationally expensive to train the classifier.
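
A matching sketch of idea 2, sampling the positives with replacement (same hypothetical (features, label) representation as before):

```python
import random

def oversample(examples, seed=0):
    """Keep all negative examples and sample positives with replacement until
    the two classes are the same size."""
    positives = [ex for ex in examples if ex[1] == 1]
    negatives = [ex for ex in examples if ex[1] == 0]
    rng = random.Random(seed)
    balanced = negatives + [rng.choice(positives) for _ in range(len(negatives))]
    rng.shuffle(balanced)
    return balanced
```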

Idea 2b: weighted examples. Add costs/weights to the training set: “negative” examples get weight 1, “positive” examples get a much larger weight (for our data, 99.997/0.003 ≈ 33332), and change the learning algorithm to optimize the weighted training error. Pros/cons?

Weighted examples. Pros: achieves the effect of oversampling without the computational cost; utilizes all of the training data; tends to perform well in a broader set of circumstances. Cons: requires a classifier that can deal with weights. Of our three classifiers, can all be modified to handle weights?
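
Concretely, the weighted training error just replaces each counted mistake with that example’s weight. A sketch (mine), assuming a predict(features) function is available:

```python
def weighted_training_error(examples, weights, predict):
    """Total weight of misclassified examples divided by the total weight."""
    mistake_weight = sum(weight
                         for (features, label), weight in zip(examples, weights)
                         if predict(features) != label)
    return mistake_weight / sum(weights)
```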

Building decision trees. Otherwise: calculate the “score” for each feature if we used it to split the data, pick the feature with the highest score, partition the data based on that feature’s value, and call recursively. We used the training error to decide which feature to choose: now use the weighted training error. In general, any time we do a count, use the weighted count (e.g. in calculating the majority label at a leaf).
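
For instance, picking the majority label at a leaf with weighted counts might look like this sketch (mine, not the course’s assignment code):

```python
from collections import defaultdict

def weighted_majority_label(labels, weights):
    """Return the label with the largest total weight rather than the largest count."""
    weight_per_label = defaultdict(float)
    for label, weight in zip(labels, weights):
        weight_per_label[label] += weight
    return max(weight_per_label, key=weight_per_label.get)

# With uniform weights the lone positive example is outvoted;
# with a large positive weight (e.g. ~33332 from the earlier slide) it wins.
print(weighted_majority_label([0, 0, 0, 1], [1, 1, 1, 1]))      # 0
print(weighted_majority_label([0, 0, 0, 1], [1, 1, 1, 33332]))  # 1
```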

Idea 3: optimize a different error metric. Train classifiers that try to optimize the F1 measure or AUC or …, or come up with another learning algorithm designed specifically for imbalanced problems. Pros/cons?

Idea 3: optimize a different error metric. Train classifiers that try to optimize the F1 measure or AUC or … Challenge: not all classifiers are amenable to this. Or, come up with another learning algorithm designed specifically for imbalanced problems. We don’t want to reinvent the wheel! That said, there are a number of approaches that have been developed specifically to handle imbalanced problems.