Predicting Good Probabilities With Supervised Learning

Predicting Good Probabilities With Supervised Learning Alexandru Niculescu-Mizil Rich Caruana Cornell University

What are good probabilities? Ideally, if the model predicts 0.75 for an example, then the conditional probability that the example is positive, given the available attributes, is 0.75. In practice: Good calibration: of all the cases for which the model predicts 0.75, 75% are positive. Low Brier score (squared error). Low cross-entropy (log-loss).
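A minimal sketch in Python (not from the original slides), assuming NumPy arrays of 0/1 labels y_true and predicted probabilities p, of how the two scores above can be computed:

```python
import numpy as np

def brier_score(y_true, p):
    """Brier score: mean squared error between predicted probabilities and 0/1 outcomes."""
    y_true, p = np.asarray(y_true, float), np.asarray(p, float)
    return np.mean((p - y_true) ** 2)

def log_loss(y_true, p, eps=1e-15):
    """Cross-entropy (log-loss); probabilities are clipped to avoid log(0)."""
    y_true = np.asarray(y_true, float)
    p = np.clip(np.asarray(p, float), eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
```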

Why good probabilities? Intelligibility. If the classifier is part of a larger system: speech recognition, handwriting recognition. If the classifier is used for decision making: cost-sensitive decisions, medical applications, meteorology, risk analysis. In spite of its great accuracy, we can't use boosting in such applications because its predictions are poorly calibrated.

What did we do? We analyzed the predictions made by ten supervised learning algorithms on eight binary classification problems. Limitations: only binary problems, no multiclass; no high-dimensional problems (all data sets have fewer than 200 attributes); only moderately sized training sets.

Questions addressed in this talk Which models are well calibrated and which are not? Can we fix the models that are not well calibrated? Which learning algorithm makes the best probabilistic predictions?

Reliability diagrams Put the cases with predicted values between 0 and 0.1 in the first bin, those between 0.1 and 0.2 in the second, and so on. For each bin, plot the mean predicted value against the true fraction of positives.
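A sketch of this binning, assuming NumPy arrays of 0/1 labels and predicted probabilities and using matplotlib for the plot (bin count and plotting details are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

def reliability_diagram(y_true, p, n_bins=10):
    """Plot mean predicted value vs. fraction of positives per equal-width bin."""
    y_true, p = np.asarray(y_true, float), np.asarray(p, float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Assign each prediction to a bin; clip so p == 1.0 lands in the last bin.
    idx = np.clip(np.digitize(p, edges) - 1, 0, n_bins - 1)
    mean_pred, frac_pos = [], []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            mean_pred.append(p[mask].mean())
            frac_pos.append(y_true[mask].mean())
    plt.plot([0, 1], [0, 1], "k--", label="perfect calibration")
    plt.plot(mean_pred, frac_pos, "o-", label="model")
    plt.xlabel("Mean predicted value")
    plt.ylabel("Fraction of positives")
    plt.legend()
    plt.show()
```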

Which models are well calibrated? [Reliability diagrams for the ten learning algorithms: ANN, LOGREG, BAG-DT, SVM, BST-DT, BST-STMP, DT, RF, KNN, NB]

Questions addressed in this talk Which models are well calibrated and which are not? Can we fix the models that are not well calibrated? Which learning algorithm makes the best probabilistic predictions?

Can we fix the models that are not well calibrated? Platt Scaling: the method used by Platt to obtain calibrated probabilities from SVMs [Platt '99]. It converts the outputs by passing them through a sigmoid. The sigmoid is fitted using an independent calibration set.
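A minimal sketch of this idea, assuming raw classifier scores (e.g. SVM margins) and a held-out calibration set; for brevity the sigmoid is fitted with scikit-learn's logistic regression on the one-dimensional scores rather than Platt's original fitting procedure:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_platt(scores_cal, y_cal):
    """Fit a sigmoid p = 1 / (1 + exp(-(a*f + b))) on the calibration set."""
    lr = LogisticRegression(C=1e10)  # large C: effectively unregularized
    lr.fit(np.asarray(scores_cal).reshape(-1, 1), y_cal)
    return lr

def platt_transform(lr, scores):
    """Map raw scores to calibrated probabilities."""
    return lr.predict_proba(np.asarray(scores).reshape(-1, 1))[:, 1]
```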

Can we fix the models that are not well calibrated? Isotonic Regression [Robertson et al. '88]: a more general calibration method, used by Zadrozny and Elkan [Zadrozny & Elkan '01, '02]. It converts the outputs by passing them through a general isotonic (monotonically increasing) function. The isotonic function is fitted using an independent calibration set.
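A corresponding sketch using scikit-learn's IsotonicRegression (which implements pair-adjacent violators); the score and label arrays are placeholders:

```python
from sklearn.isotonic import IsotonicRegression

def fit_isotonic(scores_cal, y_cal):
    """Fit a monotonically increasing map from raw scores to probabilities
    on an independent calibration set."""
    iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
    iso.fit(scores_cal, y_cal)
    return iso

# Usage: p_test = fit_isotonic(scores_cal, y_cal).predict(scores_test)
```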

Max-margin methods Predictions are pushed away from 0 and 1, so reliability plots have a sigmoidal shape. Calibration undoes the shift in predictions: after calibration, more cases have predicted values close to 0 and 1. [Histograms of predicted values and reliability plots, before and after Platt Scaling and Isotonic Regression, for SVM, BST-DT and BST-STMP]

Boosted decision trees [Histograms of predicted values and reliability plots with Platt Scaling and Isotonic Regression on the eight problems: COVT, ADULT, LET1, LET2, MEDIS, SLAC, HS, MG]

Platt calibration for boosting [Reliability plots for boosted trees on COVT, ADULT, LET1, LET2, MEDIS, SLAC, HS and MG, before and after Platt Scaling]

Naive Bayes Naive Bayes pushes predictions toward 0 and 1 because of its unrealistic independence assumptions. This generates reliability plots that have an inverted sigmoid shape. Even though Platt Scaling helps improve the calibration, it is clear that a sigmoid is not the right function for Naive Bayes models; Isotonic Regression provides a better fit. [Histograms of predicted values and reliability plots for NB, before and after Platt Scaling and Isotonic Regression]

Platt Scaling vs. Isotonic Regression [Brier score as a function of calibration-set size (10 to 10,000 points) for BST-DT, ANN, NB and RF, comparing uncalibrated models with Platt Scaling and Isotonic Regression]

Questions addressed in this talk Which models are well calibrated and which are not? Can we fix the models that are not well calibrated? Which learning algorithm makes the best probabilistic predictions?

Empirical Comparison For every learning algorithm we train many models using different parameter settings and variations: for SVMs we vary the kernel, kernel parameters and the tradeoff parameter; for neural nets we vary the number of hidden units, momentum, etc.; for boosted trees we vary the type of decision tree used as the base learner and the number of boosting steps; and so on. Each model is trained on 4000 points and calibrated with Platt Scaling and Isotonic Regression on 1000 points. For each data set, learning algorithm and calibration method, we select the best model using the same 1000 points used for calibration.
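A hedged sketch of this protocol, using a synthetic data set, a single SVM family and a tiny placeholder parameter grid for illustration (the paper's actual data sets, algorithms and parameter grids are much larger):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.isotonic import IsotonicRegression
from sklearn.metrics import brier_score_loss

# 4000 training points plus 1000 points used both for calibration and selection.
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, y_train = X[:4000], y[:4000]
X_cal, y_cal = X[4000:], y[4000:]

best = None
for C in [0.1, 1.0, 10.0]:  # placeholder parameter grid
    svm = SVC(C=C, kernel="rbf").fit(X_train, y_train)
    iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
    iso.fit(svm.decision_function(X_cal), y_cal)
    p_cal = iso.predict(svm.decision_function(X_cal))
    score = brier_score_loss(y_cal, p_cal)  # selection on the calibration points
    if best is None or score < best[0]:
        best = (score, C)

print("selected C = %g, calibration Brier score = %.3f" % (best[1], best[0]))
```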

Empirical Comparison [Brier score comparison across the learning algorithms: BST-DT, SVM, RF, ANN, BAG, KNN, STMP, DT, LR, NB]

Summary and Conclusions We examined the quality of the probabilities predicted by ten supervised learning algorithms. Neural nets, bagged trees and logistic regression have well calibrated predictions. Max-margin methods such as boosting and SVMs push the predicted values away from 0 and 1, which yields a sigmoid-shaped reliability diagram. Learning algorithms such as Naive Bayes distort the probabilities in the opposite way, pushing them closer to 0 and 1.

Summary and Conclusions We examined two methods to calibrate the predictions. Max-margin methods and Naive Bayes benefit greatly from calibration, while well-calibrated methods do not. Platt Scaling is more effective when the calibration set is small, but Isotonic Regression is more powerful when there is enough data to prevent overfitting. The methods that predict the best probabilities are calibrated boosted trees, calibrated random forests, calibrated SVMs, uncalibrated bagged trees and uncalibrated neural nets.

Thank you! Questions?