Predicting Good Probabilities With Supervised Learning


1 Predicting Good Probabilities With Supervised Learning
Alexandru Niculescu-Mizil Rich Caruana Cornell University

2 What are good probabilities?
Ideally, if the model predicts 0.75 for an example, then the conditional probability of that example being positive, given its attributes, is 0.75.
In practice:
Good calibration: out of all the cases for which the model predicts 0.75, 75% are positive.
Low Brier score (squared error).
Low cross-entropy (log-loss).
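As a concrete reference, the two scores on this slide can be computed with a few lines of NumPy (a minimal sketch; the function names are ours, not from the talk):

```python
import numpy as np

def brier_score(y_true, p_pred):
    """Mean squared error between predicted probabilities and 0/1 labels."""
    y_true = np.asarray(y_true, dtype=float)
    p_pred = np.asarray(p_pred, dtype=float)
    return float(np.mean((p_pred - y_true) ** 2))

def cross_entropy(y_true, p_pred, eps=1e-15):
    """Log-loss; predictions are clipped away from 0 and 1 to avoid log(0)."""
    y_true = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))
```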

3 Why good probabilities?
Intelligibility.
If the classifier is part of a larger system: speech recognition, handwriting recognition.
If the classifier is used for decision making: cost-sensitive decisions, medical applications, meteorology, risk analysis.
Despite its high accuracy, we cannot use boosting in such applications.

4 What did we do?
We analyzed the predictions made by ten supervised learning algorithms on eight binary classification problems.
Limitations:
Only binary problems, no multiclass.
No high-dimensional problems (all have fewer than 200 attributes).
Only moderately sized training sets.

5 Questions addressed in this talk
Which models are well calibrated and which are not?
Can we fix the models that are not well calibrated?
Which learning algorithm makes the best probabilistic predictions?

6 Reliability diagrams Put the cases with predicted values between 0 and 0.1 in the first bin, between 0.1 and 0.2 in the second, etc. For each bin, plot the mean predicted value against the true fraction of positives.

7 Which models are well calibrated?
[Figure: reliability diagrams for the ten learning algorithms: ANN, LOGREG, BAG-DT, SVM, BST-DT, BST-STMP, DT, RF, KNN, NB.]

8 Questions addressed in this talk
Which models are well calibrated and which are not?
Can we fix the models that are not well calibrated?
Which learning algorithm makes the best probabilistic predictions?

9 Can we fix the models that are not well calibrated?
Platt Scaling
Method used by Platt to obtain calibrated probabilities from SVMs [Platt '99].
Converts the outputs by passing them through a sigmoid.
The sigmoid is fitted using an independent calibration set.
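A minimal sketch of the idea, fitting the sigmoid p = 1 / (1 + exp(A*f + B)) with scikit-learn's logistic regression on held-out scores (Platt's original procedure also smooths the 0/1 targets with a Bayesian correction; that detail is omitted here):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_platt(scores_cal, y_cal):
    """Fit the sigmoid on raw model outputs from an independent calibration set."""
    sigmoid = LogisticRegression()
    sigmoid.fit(np.asarray(scores_cal).reshape(-1, 1), y_cal)
    return sigmoid

def platt_transform(sigmoid, scores):
    """Map raw model outputs (e.g. SVM margins) to calibrated probabilities."""
    return sigmoid.predict_proba(np.asarray(scores).reshape(-1, 1))[:, 1]
```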

10 Can we fix the models that are not well calibrated?
Isotonic Regression [Robertson et al. '88]
More general calibration method, used by Zadrozny and Elkan [Zadrozny & Elkan '01, '02].
Converts the outputs by passing them through a general isotonic (monotonically increasing) function.
The isotonic function is fitted using an independent calibration set.
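The same interface with isotonic regression, using scikit-learn's pair-adjacent-violators implementation as one possible realization (again a sketch, not the paper's code):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def fit_isotonic(scores_cal, y_cal):
    """Fit a monotonically increasing step function on an independent
    calibration set; 'clip' handles test scores outside the fitted range."""
    iso = IsotonicRegression(out_of_bounds="clip")
    iso.fit(np.asarray(scores_cal, dtype=float), np.asarray(y_cal, dtype=float))
    return iso

# Usage: p_test = fit_isotonic(scores_cal, y_cal).predict(scores_test)
```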

11 Max-margin methods
Predictions are pushed away from 0 and 1, so reliability plots have a sigmoidal shape.
Calibration undoes the shift in predictions: more cases have predicted values closer to 0 and 1.
[Figure: histograms of predicted values and reliability plots (HIST, PLATT, ISO) for SVM, BST-DT and BST-STMP.]

12 Boosted decision trees
[Figure: histograms and reliability plots (HIST, PLATT, ISO) for boosted decision trees on the eight problems: P1 COVT, P2 ADULT, P3 LET1, P4 LET2, P5 MEDIS, P6 SLAC, P7 HS, P8 MG.]

13 Platt calibration for boosting
[Figure: histograms of boosted-tree predictions on COVT, ADULT, LET1, LET2, MEDIS, SLAC, HS and MG, before and after Platt calibration.]

14 Naive Bayes
Naive Bayes pushes predictions toward 0 and 1 because of its unrealistic independence assumptions.
This generates reliability plots that have an inverted sigmoid shape.
Although Platt Calibration helps improve the calibration, it is clear that a sigmoid is not the right function for Naive Bayes models.
Isotonic Regression provides a better fit.
[Figure: histograms and reliability plots (HIST, PLATT, ISO) for NB.]

15 Platt Scaling vs. Isotonic Regression
[Figure: Brier score (y-axis, 0.28 to 0.38) for uncalibrated (UNCAL), Platt-scaled (PLATT) and isotonic-calibrated (ISO) models, shown for BST-DT, ANN, NB and RF.]

16 Questions addressed in this talk
Which models are well calibrated and which are not?
Can we fix the models that are not well calibrated?
Which learning algorithm makes the best probabilistic predictions?

17 Empirical Comparison
For every learning algorithm we train different models using many parameter settings and variations:
For SVMs we vary the kernel, kernel parameters and the tradeoff parameter.
For neural nets we vary the number of hidden units, momentum, etc.
For boosted trees we vary the type of decision tree used as a base learner, the number of steps of boosting, etc.
Each model is trained on 4000 points and calibrated with Platt Scaling and Isotonic Regression on 1000 points.
For each data set, learning algorithm and calibration method, we select the best model using the same 1000 points used for calibration.
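A schematic version of this protocol for a single algorithm (an SVM) and one parameter axis, with synthetic stand-in data; everything below is illustrative, not the paper's actual code or datasets:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 20))                        # stand-in data
y = (X[:, 0] + rng.normal(size=5000) > 0).astype(int)
X_tr, y_tr = X[:4000], y[:4000]                        # 4000 training points
X_cal, y_cal = X[4000:], y[4000:]                      # 1000 calibration points

best = None
for C in [0.1, 1.0, 10.0]:                             # one axis of the grid
    svm = SVC(C=C).fit(X_tr, y_tr)
    margins = svm.decision_function(X_cal).reshape(-1, 1)
    platt = LogisticRegression().fit(margins, y_cal)   # Platt Scaling
    p_cal = platt.predict_proba(margins)[:, 1]
    loss = brier_score_loss(y_cal, p_cal)              # select on the same 1000 points
    if best is None or loss < best[0]:
        best = (loss, C)
print("best C:", best[1], "Brier score on calibration set:", round(best[0], 4))
```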

18 Empirical Comparison
[Figure: Brier score for each learning algorithm: BST-DT, SVM, RF, ANN, BAG, KNN, STMP, DT, LR, NB.]

19 Summary and Conclusions
We examined the quality of the probabilities predicted by ten supervised learning algorithms.
Neural nets, bagged trees and logistic regression have well calibrated predictions.
Max-margin methods such as boosting and SVMs push the predicted values away from 0 and 1. This yields a sigmoid-shaped reliability diagram.
Learning algorithms such as Naive Bayes distort the probabilities in the opposite way, pushing them closer to 0 and 1.

20 Summary and Conclusions
We examined two methods to calibrate the predictions.
Max-margin methods and Naive Bayes benefit a lot from calibration, while well-calibrated methods do not.
Platt Scaling is more effective when the calibration set is small, but Isotonic Regression is more powerful when there is enough data to prevent overfitting.
The methods that predict the best probabilities are calibrated boosted trees, calibrated random forests, calibrated SVMs, uncalibrated bagged trees and uncalibrated neural nets.

21 Thank you! Questions?

