Cost-Sensitive Boosting algorithms: Do we really need them?


Cost-Sensitive Boosting algorithms: Do we really need them? Professor Gavin Brown School of Computer Science, University of Manchester

“The important thing in science is not so much to obtain new facts, but to discover new ways of thinking about them.” Sir William Bragg, Nobel Prize in Physics, 1915

Cost-sensitive Boosting algorithms: Do we really need them?
Nikolaos Nikolaou, Narayanan Edakunni, Meelis Kull, Peter Flach and Gavin Brown. Machine Learning Journal, Vol. 104, Issue 2, Sept 2016.
Nikos Nikolaou. Cost-sensitive Boosting: A Unified Approach. PhD thesis, University of Manchester, 2016.

The “Usual” Machine Learning Approach: data + labels → Learning Algorithm → Model. New data (no labels) → Model → predicted label.

The “Boosting” approach: data + labels → Model 1; the data is reweighted into Dataset 2 → Model 2; Dataset 3 → Model 3; Dataset 4 → Model 4. Each model corrects the mistakes of its predecessor.

The “Boosting” approach at test time: new data → Model 1, Model 2, Model 3, Model 4 → weighted vote.

Boosting: A Rich Theoretical History. “Can we turn a weak learner (marginally more accurate than random guessing) into a strong learner (arbitrarily high accuracy)?” Kearns, 1988. “Yes!” Schapire, 1990. The AdaBoost algorithm: Freund & Schapire, 1997 (Gödel Prize 2003).

The AdaBoost algorithm (informal)
Start with all datapoints equally important.
for t = 1 to T do
  Build a classifier h_t
  Calculate the voting weight for h_t
  Update the distribution over the datapoints:
    increase emphasis where h_t makes mistakes,
    decrease emphasis where h_t is already correct
end for
Calculate the answer for a test point by weighted majority vote.

The AdaBoost algorithm. Voting weight for classifier #t: $\alpha_t = \frac{1}{2}\ln\left(\frac{1-\epsilon_t}{\epsilon_t}\right)$, where $\epsilon_t$ is the weighted error of $h_t$. Distribution update: $D_{t+1}(i) = \frac{D_t(i)\, e^{-\alpha_t y_i h_t(x_i)}}{Z_t}$, where $Z_t$ is a normalizer. Majority vote on test example $x'$: $H(x') = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x')\right)$.
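
To make the loop above concrete, here is a minimal runnable sketch of discrete AdaBoost. The choice of scikit-learn decision stumps as the weak learner is an assumption for illustration; the deck does not specify one.

```python
# Minimal sketch of discrete AdaBoost; illustrative, not the authors' code.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    """X: (n, d) features; y: labels in {-1, +1}."""
    y = np.asarray(y)
    n = len(y)
    D = np.full(n, 1.0 / n)                      # all datapoints equally important
    hs, alphas = [], []
    for t in range(T):
        h = DecisionTreeClassifier(max_depth=1)  # decision stump as weak learner
        h.fit(X, y, sample_weight=D)
        pred = h.predict(X)
        eps = D[pred != y].sum()                 # weighted error of h_t
        if eps == 0 or eps >= 0.5:               # perfect fit, or weak-learning fails
            break
        alpha = 0.5 * np.log((1 - eps) / eps)    # voting weight alpha_t
        D *= np.exp(-alpha * y * pred)           # emphasize mistakes, relax correct ones
        D /= D.sum()                             # renormalize (the Z_t step)
        hs.append(h)
        alphas.append(alpha)
    return hs, alphas

def adaboost_predict(hs, alphas, X):
    """Weighted majority vote: sign of the ensemble score."""
    score = sum(a * h.predict(X) for h, a in zip(hs, alphas))
    return np.sign(score)
```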

Cost-sensitive Problems. What happens with differing False Positive / False Negative costs? …can we minimize the expected cost (a.k.a. risk)? Cost matrix (rows: truth, columns: prediction; correct predictions cost zero):

            predict 1   predict 0
truth 1         0          c_FN
truth 0        c_FP         0
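
With this cost matrix, the expected cost (risk) of each possible prediction on an example x is:

$$R(\hat{y}=1 \mid x) = c_{FP}\, p(y=0 \mid x), \qquad R(\hat{y}=0 \mid x) = c_{FN}\, p(y=1 \mid x)$$

Minimizing expected cost means predicting whichever class has the smaller risk; this is the rule made explicit on the Decision Theory slide below.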

Cost-sensitive AdaBoost? 15+ boosting variants over 20 years. Some re-invented multiple times. Most proposed as heuristic modifications to the original AdaBoost. Many treat FP/FN costs as hyperparameters.

The AdaBoost algorithm, made cost-sensitive? Different variants modify different parts of the algorithm: some modify the voting weights, some the distribution update, others the initial distribution, and some the final decision rule.

“The important thing in science is not so much to obtain new facts, but to discover new ways of thinking about them.” Let’s take a step back… AdaBoost follows from four theoretical perspectives: Functional Gradient Descent (Mason et al., 2000), Decision Theory (Freund & Schapire, 1997), Margin Theory (Schapire et al., 1998), Probabilistic Modelling (Lebanon & Lafferty, 2001).

So, for a cost-sensitive boosting algorithm “Algorithm X”, we ask: “Does my new algorithm still follow from each perspective: Functional Gradient Descent, Decision Theory, Margin Theory, Probabilistic Modelling?”

Functional Gradient Descent. Distribution update = direction in function space; voting weight = step size. Property 1: FGD-consistency. Are the voting weights and distribution updates consistent with each other (i.e. both a consequence of FGD on a given loss)?
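
As a brief reminder of the FGD view (the standard formulation of Mason et al., 2000, summarized here for context), boosting builds an additive model by taking steps in function space:

$$F_t(x) = F_{t-1}(x) + \alpha_t h_t(x), \qquad \text{with } L(F) = \sum_i e^{-y_i F(x_i)} \text{ for AdaBoost,}$$

where the weak learner $h_t$ approximates the direction of steepest descent of $L$ (which is what the distribution update encodes), and the step size $\alpha_t$ is found by line search (the voting weight).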

Decision Theory. Ideally: assign each example to the risk-minimizing class, i.e. the Bayes optimal rule: predict class y = 1 iff

$$c_{FN}\, p(y=1 \mid x) \;>\; c_{FP}\, p(y=0 \mid x), \quad \text{i.e.} \quad p(y=1 \mid x) \;>\; \frac{c_{FP}}{c_{FP} + c_{FN}}.$$

Property 2: Cost-consistency. Does the algorithm use the above to make decisions? (assuming ‘good’ probability estimates)
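
As a minimal sketch of this decision rule, assuming we already hold (calibrated) estimates of p(y=1|x):

```python
import numpy as np

def bayes_optimal_predict(p_pos, c_fp, c_fn):
    """Predict y=1 iff p(y=1|x) > c_FP / (c_FP + c_FN)."""
    threshold = c_fp / (c_fp + c_fn)
    return (np.asarray(p_pos) > threshold).astype(int)

# Example: false negatives 4x as costly as false positives -> threshold 0.2
# bayes_optimal_predict([0.1, 0.3, 0.9], c_fp=1.0, c_fn=4.0)  ->  [0, 1, 1]
```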

Margin theory. AdaBoost minimizes a surrogate exponential loss, an upper bound on the 0/1 loss. Large margins encourage small generalization error, and AdaBoost promotes large margins.
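
Concretely, writing the margin of example (x, y) as m = yF(x), the exponential surrogate bounds the 0/1 loss from above:

$$e^{-yF(x)} \;\ge\; \mathbf{1}\!\left[\, yF(x) \le 0 \,\right],$$

since $e^{-m} \ge 1$ whenever the margin $m \le 0$ (a misclassification), and the surrogate keeps shrinking as the margin grows.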

Margin theory… with costs… Different surrogates for each class.
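
The variants differ in exactly how the costs enter the surrogate; one representative form (an illustrative example, not the definition used by every variant) weights each class’s exponential loss by its cost:

$$\ell(y, F(x)) = \begin{cases} c_{FN}\, e^{-F(x)} & \text{if } y = +1 \\ c_{FP}\, e^{+F(x)} & \text{if } y = -1 \end{cases}$$

Other variants instead move the costs inside the exponent; either way, the positive and negative classes see different surrogate losses.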

Margin theory… with costs… We expect the costlier class to incur the larger loss at every margin value. But in some algorithms the two surrogates cross, swapping the relative importance of the classes over part of the margin range… Property 3: Asymmetry preservation. Does the loss function preserve the relative importance of each class, for all margin values?

Probabilistic Models. ‘AdaBoost does not produce good probability estimates.’ Niculescu-Mizil & Caruana, 2005. ‘AdaBoost is successful at [..] classification [..] but not class probabilities.’ Mease et al., 2007. ‘This increasing tendency of [the margin] impacts the probability estimates by causing them to quickly diverge to 0 and 1.’ Mease & Wyner, 2008.

Probabilistic Models. AdaBoost tends to produce probability estimates close to 0 or 1. [Figures: histogram of AdaBoost output (score) frequencies; reliability diagram of empirically observed probability vs. AdaBoost output (score).]

Property 4: Calibrated estimates. Does the algorithm generate “calibrated” probability estimates?

The results are in… All algorithms produce uncalibrated probability estimates! So what if we calibrate the last three (AdaMEC, CGAda & AsymAda)?

Platt scaling (logistic calibration). Training phase: reserve part of the training data (here 50%) to fit a sigmoid that corrects the distortion. [Figure: reliability diagram, empirically observed probability vs. AdaBoost output (score).] Test phase: apply the sigmoid transform to the score (output of the ensemble) to get a probability estimate.
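
A minimal sketch of Platt scaling, fitting the sigmoid with an effectively unregularized logistic regression on held-out scores (Platt’s original procedure also smooths the 0/1 targets; that refinement is omitted here):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_platt_sigmoid(scores_cal, y_cal):
    """Fit p(y=1|s) = 1 / (1 + exp(-(w*s + b))) on held-out calibration data."""
    lr = LogisticRegression(C=1e10)  # large C: effectively no regularization
    lr.fit(np.asarray(scores_cal).reshape(-1, 1), y_cal)
    return lr

def platt_probs(lr, scores):
    """Apply the fitted sigmoid to new ensemble scores."""
    return lr.predict_proba(np.asarray(scores).reshape(-1, 1))[:, 1]
```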

Experiments (lots) 15 algorithms. 18 datasets. 21 degrees of cost imbalance.

In summary… [Chart: algorithms ranked by average Brier score.] AdaMEC, CGAda & AsymAda (which have all properties except calibrated probabilities) outperform all others. Their calibrated versions (which have all four properties) outperform the uncalibrated ones.

In summary… “Calibrated AdaMEC” was the algorithm ranked 2nd overall. So what is it? 1. Take the original AdaBoost. 2. Calibrate it (we use Platt scaling). 3. Shift the decision threshold according to the costs… It is consistent with all theory perspectives, adds no extra hyperparameters, needs no retraining if the cost ratio changes, and is consistently top (or joint top) in empirical comparisons.
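
Putting the three steps together, a minimal scikit-learn sketch of Calibrated AdaMEC, assuming the 50%-50% split described earlier (an illustrative reconstruction, not the authors’ reference implementation, which is linked on the “Code to play with” slide below):

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def train_calibrated_adamec(X, y, n_estimators=100, seed=0):
    # Step 1: train original AdaBoost on half of the training data.
    X_tr, X_cal, y_tr, y_cal = train_test_split(
        X, y, test_size=0.5, stratify=y, random_state=seed)
    ada = AdaBoostClassifier(n_estimators=n_estimators, random_state=seed)
    ada.fit(X_tr, y_tr)
    # Step 2: Platt-scale its scores on the held-out half.
    s_cal = ada.decision_function(X_cal).reshape(-1, 1)
    sigmoid = LogisticRegression(C=1e10).fit(s_cal, y_cal)
    return ada, sigmoid

def predict_calibrated_adamec(ada, sigmoid, X_new, c_fp, c_fn):
    # Step 3: shift the decision threshold according to the cost ratio;
    # changing c_fp/c_fn needs no retraining, only a new threshold.
    s = ada.decision_function(X_new).reshape(-1, 1)
    p = sigmoid.predict_proba(s)[:, 1]
    return (p > c_fp / (c_fp + c_fn)).astype(int)
```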

The results are in… All algorithms produce uncalibrated probability estimates!

Question: What if we calibrate all algorithms? Answer: in theory…
… calibration will improve the probability estimates.
… if a method is not cost-sensitive, calibration will not fix it.
… if the gradient steps/direction are not consistent, calibration will not fix it.
… if class importance is swapped, calibration will not fix it.

Question: What if we calibrate all algorithms? Empirically, once calibrated, the original AdaBoost (1997) has all four properties and matches the best of the cost-sensitive variants.

Conclusions. We analyzed the cost-sensitive boosting literature: 15+ variants over 20 years, from 4 different theoretical perspectives. “Cost-sensitive” modifications to the original AdaBoost seem unnecessary… provided the scores are properly calibrated, and the decision threshold is shifted according to the cost matrix.

Code to play with. All software at http://www.cs.man.ac.uk/~gbrown/costsensitiveboosting/ Including… a step-by-step walkthrough of ‘Calibrated AdaMEC’, implemented in Python; a detailed tutorial on all aspects, allowing the experiments to be reproduced.

Results (extended). Isotonic regression beats Platt scaling for larger datasets. We can do better than a 50%-50% train-calibration split. (Calibrated) Real AdaBoost > (Calibrated) Discrete AdaBoost…