An Empirical Comparison of Supervised Learning Algorithms


An Empirical Comparison of Supervised Learning Algorithms (Caruana & Niculescu-Mizil, ICML 2006). Presented by Anilkumar Kopparthy.

What is Supervised Learning?
Supervised learning is the task of learning a function that maps an input to an output from labeled example pairs (x, y); the trained model is then used to predict labels for new, unseen inputs.
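A minimal illustration of this setup, not from the slides; scikit-learn and the synthetic dataset are assumed stand-ins:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Labeled examples: inputs X with known target labels y.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# "Supervision" = fitting to the known labels, then predicting unseen cases.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```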

Supervised Learning Methods
- Support vector machines (SVM)
- Neural nets (ANN)
- Logistic regression
- Naïve Bayes
- Memory-based learning (KNN)
- Random forests
- Decision trees
- Bagged trees
- Boosted trees
- Boosted stumps

Questions
- Is one algorithm "better" than the others?
- Are some learning methods best for certain loss functions? SVMs for classification? ANNs for regression or predicting probabilities?
- If no method(s) dominate, can we at least ignore some algorithms?
- What should you use?

Methodology
- We attempt to explore the space of free parameters and common variations of each learning algorithm as thoroughly as is computationally feasible.
- Each algorithm has many variations and free parameters:
  - SVM: margin parameter C, kernel, kernel parameters (e.g. gamma), ...
  - ANN: number of hidden units, number of hidden layers, learning rate, momentum, ...
  - DT: splitting criterion, pruning options, smoothing options, ...
  - KNN: K, distance metric, distance-weighted averaging, ...
- Each method must be optimized to each problem: failure to optimize can make a superior algorithm look inferior, and the best settings depend on the size of the training set (see the sketch after this list).
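To make the search concrete, here is a toy sketch of per-algorithm hyper-parameter tuning. Scikit-learn, the synthetic data, and these particular small grids are all assumptions for illustration; the grids the paper actually searched were far larger.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in for one of the benchmark problems.
X_train, y_train = make_classification(n_samples=500, n_features=20, random_state=0)

searches = {
    # SVM: margin parameter C, kernel, and kernel width gamma.
    "SVM": GridSearchCV(SVC(),
                        {"C": [0.1, 1, 10, 100],
                         "kernel": ["rbf", "poly"],
                         "gamma": [1e-3, 1e-2, 1e-1]},
                        cv=5),
    # KNN: number of neighbors K and distance weighting.
    "KNN": GridSearchCV(KNeighborsClassifier(),
                        {"n_neighbors": [1, 5, 25, 125],
                         "weights": ["uniform", "distance"]},
                        cv=5),
}
for name, search in searches.items():
    search.fit(X_train, y_train)
    print(name, search.best_params_, round(search.best_score_, 3))
```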

Data Sets
- 11 binary classification data sets: ADULT, COV_TYPE, LETTER.P1, BACT, CALHOUS, HS, MEDIS, MG, SLAC, COD, ...
- Sources: the UCI repository and medical records.
- For each problem: 4000-case training sets, 1000-case validation sets, and large final test sets (usually 20,000 records).

Binary Classification Performance Metrics
- Threshold metrics: Accuracy (ACC), F-Score (FSC), Lift (LFT)
- Ordering/ranking metrics: ROC Area (ROC), Average Precision (APR), Precision/Recall Break-Even Point (BEP)
- Probability metrics: Root-Mean-Squared Error (RMS), Cross-Entropy (MXE)
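As a rough illustration of the three metric families, the sketch below computes six of the metrics with scikit-learn on synthetic data (both assumed here); lift and the precision/recall break-even point have no standard one-liners and are omitted.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, f1_score, roc_auc_score,
                             average_precision_score, brier_score_loss, log_loss)

X, y = make_classification(n_samples=2000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
p = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
y_hat = (p >= 0.5).astype(int)              # hard labels for threshold metrics

print("ACC:", accuracy_score(y_te, y_hat))          # threshold metric
print("FSC:", f1_score(y_te, y_hat))                # threshold metric
print("ROC:", roc_auc_score(y_te, p))               # ranking metric
print("APR:", average_precision_score(y_te, p))     # ranking metric
print("RMS:", np.sqrt(brier_score_loss(y_te, p)))   # probability metric
print("MXE:", log_loss(y_te, p))                    # probability metric
```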

Normalized Scores
Difficulty: the metrics live on different scales:
- for some metrics, 1.00 is best (e.g. ACC)
- for some metrics, 0.00 is best (e.g. RMS)
- for some metrics, the baseline is 0.50 (e.g. AUC)
- for some metrics, the best achievable value depends on the data (e.g. Lift)
- on some problems/metrics, 0.60 is excellent performance
- on some problems/metrics, 0.99 is poor performance
Solution: normalized scores:
- baseline performance maps to 0.00
- best observed performance maps to 1.00 (a proxy for Bayes-optimal performance)
- this puts all metrics and problems on an equal footing (see the sketch after this list)
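The normalization itself is a simple linear rescaling. A minimal sketch, where the baseline and best values are inputs you would measure per problem and metric:

```python
def normalized_score(raw, baseline, best):
    """Map baseline performance to 0.0 and best observed performance to 1.0.

    Works whether larger or smaller raw values are better, as long as
    `best` is the best observed score and `baseline` the trivial model's.
    """
    return (raw - baseline) / (best - baseline)

# Accuracy example: majority-class baseline 0.80, best observed 0.95.
print(normalized_score(0.93, baseline=0.80, best=0.95))  # ~0.867
# RMS example (lower is better): baseline 0.45, best observed 0.30.
print(normalized_score(0.36, baseline=0.45, best=0.30))  # 0.60
```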

Massive Empirical Comparison
- 10 learning methods × hundreds of parameter settings per method × 5-fold cross-validation = 10,000+ models trained per problem
- × 11 Boolean classification test problems = 110,000+ models
- × 9 performance metrics = 1,000,000+ model evaluations

Probability Metrics: Results on Test Sets (Normalized Scores)

Model       Squared Error   Cross-Entropy   Calibration   Mean
ANN         0.872           0.878           0.826         0.859
BAG-DT      0.875           0.901           0.637         0.804
RND-FOR     0.882           0.899           0.567         0.783
KNN         0.769           0.782           0.684         0.745
LOG-REG     0.614           0.620           0.800         0.678
DT          0.583           0.638           0.512         0.578
BST-DT      0.596           0.598           0.045         0.413
SVM [0,1]   0.484           0.447           0.000         0.310
BST-STMP    0.355           0.339           0.123         0.272
NAÏVE-B     0.271           0.000           0.000         0.090

- Best probabilities overall: neural nets, bagged decision trees, random forests.
- Not competitive: boosted decision trees and stumps (exponential loss) and SVMs (standard hinge loss).
- SVM predictions here were scaled to [0, 1] via simple min/max scaling.

Calibration & Reliability Diagrams
- If on 100 days the forecast is a 70% chance of rain, the forecasts are well calibrated at p = 0.70 if it actually rains on about 70 of those 100 days.
- To build a reliability diagram, put cases with predicted values between 0.0 and 0.1 in the first bin, cases between 0.1 and 0.2 in the second bin, and so on up to 1.0.
- For each bin, plot the mean predicted value against the true fraction of positives.

[Figure: predictions histogrammed into ten equal-width bins spanning 0.0 to 1.0.]
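A small sketch of this binning procedure; NumPy and matplotlib are assumed, and the miscalibrated synthetic predictions are invented purely for the demo:

```python
import numpy as np
import matplotlib.pyplot as plt

def reliability_curve(y_true, p, n_bins=10):
    """Mean predicted value vs. empirical fraction of positives per bin."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.digitize(p, edges[1:-1])        # bin index 0..n_bins-1 per case
    xs, ys = [], []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():                       # skip empty bins
            xs.append(p[mask].mean())
            ys.append(y_true[mask].mean())
    return xs, ys

# Demo: synthetic over-confident predictions (true slope flatter than 1).
rng = np.random.default_rng(0)
p = rng.uniform(0, 1, 5000)
y = (rng.uniform(0, 1, 5000) < 0.5 + 0.3 * (p - 0.5)).astype(int)

xs, ys = reliability_curve(y, p)
plt.plot([0, 1], [0, 1], "k--", label="perfect calibration")
plt.plot(xs, ys, "o-", label="model")
plt.xlabel("mean predicted value"); plt.ylabel("fraction of positives")
plt.legend(); plt.show()
```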

SVM Reliability Plots

Platt Scaling by Fitting a Sigmoid
- Linear scaling of SVM predictions from (-∞, +∞) to [0, 1] produces poorly calibrated probabilities.
- Platt's method [Platt 1999]: scale the predictions by fitting a sigmoid P(y=1 | f) = 1 / (1 + exp(A·f + B)) on a validation set, using 3-fold cross-validation and Bayes-motivated smoothing of the targets to avoid overfitting.
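Below is a minimal Platt-scaling sketch under stated assumptions: scikit-learn, synthetic data, and plain logistic regression on the SVM margin as the common shortcut for Platt's full recipe (the Bayes-motivated target smoothing mentioned above is omitted).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, random_state=0)
X_fit, X_rest, y_fit, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

svm = SVC().fit(X_fit, y_fit)
f_val = svm.decision_function(X_val)                 # unbounded margins
# Fit the sigmoid P(y=1|f) = 1 / (1 + exp(A*f + B)) on held-out margins.
platt = LogisticRegression().fit(f_val.reshape(-1, 1), y_val)

f_test = svm.decision_function(X_test)
p_platt = platt.predict_proba(f_test.reshape(-1, 1))[:, 1]   # calibrated probabilities
```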

Isotonic Regression
- Isotonic regression fits a free-form line to a sequence of observations under two constraints: the fitted line must be non-decreasing everywhere, and it must lie as close to the observations as possible.
- Equivalently, it finds the non-decreasing approximation of a function that minimizes mean squared error on the training data.
- The benefit of such a model is that it assumes no parametric form for the target function, such as linearity.
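A sketch of isotonic calibration, reusing the held-out margins f_val, y_val and test margins f_test from the Platt sketch above (scikit-learn's IsotonicRegression; same assumptions as before):

```python
from sklearn.isotonic import IsotonicRegression

# Non-decreasing map from raw margin to probability, learned on validation data.
iso = IsotonicRegression(out_of_bounds="clip")  # clip test scores outside the fitted range
iso.fit(f_val, y_val)
p_iso = iso.predict(f_test)                     # calibrated probabilities
```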

Results After Platt Scaling SVMs: Probability Metrics

Model       Squared Error   Cross-Entropy   Calibration   Mean
ANN         0.872           0.878           0.826         0.859
SVM-PLT     0.882           0.880           0.769         0.844
BAG-DT      0.875           0.901           0.637         0.804
RND-FOR     0.882           0.899           0.567         0.783
KNN         0.769           0.782           0.684         0.745
LOG-REG     0.614           0.620           0.800         0.678
DT          0.583           0.638           0.512         0.578
BST-DT      0.596           0.598           0.045         0.413
SVM [0,1]   0.484           0.447           0.000         0.310
BST-STMP    0.355           0.339           0.123         0.272
NAÏVE-B     0.271           0.000           0.000         0.090

- Platt's method (Platt 1999) obtains posterior probabilities from SVMs by fitting a sigmoid to the SVM outputs.
- After scaling with Platt's method, SVM probabilities are as good as neural net probabilities.
- SVMs are slightly better than neural nets on 2 of the 3 probability metrics.
- Would other learning methods also benefit from calibration with Platt's method?

Results After Platt Scaling All Models
- Models that benefit from calibration: SVMs, boosted decision trees, boosted stumps, random forests, naïve Bayes, and vanilla decision trees.
- Models that do not benefit from calibration: neural nets, bagged trees, logistic regression, and MBL/KNN.
- Boosting full decision trees dominates once calibrated.

Probability Metrics

Model       Squared Error   Cross-Entropy   Calibration   Mean
BST-DT      0.929           0.932           0.808         0.890
ANN         0.872           0.878           0.826         0.859
SVM-PLT     0.882           0.880           0.769         0.844
RND-FOR     0.892           0.898           0.702         0.831
BAG-DT      0.875           0.901           0.637         0.804
KNN         0.786           0.805           0.706         0.766
BST-STMP    0.740           0.783           0.678         0.734
LOG-REG     0.614           0.620           –             –
DT          0.586           0.625           0.688         0.633
NAÏVE-B     0.539           0.565           0.161         0.422

(– : value missing in the source.)

Normalized Scores for Each Learning Algorithm, by Metric

            Threshold Metrics       Rank/Ordering Metrics       Probability Metrics
Model       ACC     FSC     LFT     ROC     APR     BEP     RMS     MXE     CAL     Mean
BST-DT      0.860   0.854   0.956   0.977   0.958   0.952   0.929   0.932   0.808   0.914
RND-FOR     0.866   0.871   0.957   0.948   –       –       0.892   0.898   0.702   0.897
ANN         0.817   0.875   0.947   0.963   0.926   –       0.872   0.878   0.826   –
SVM         0.823   0.851   0.928   0.961   0.931   0.931   0.882   0.880   0.769   0.884
BAG-DT      0.836   0.849   0.953   0.972   0.950   –       0.875   0.901   0.637   –
KNN         0.759   0.820   0.937   0.893   –       –       0.786   0.805   0.706   0.835
BST-STMP    0.698   0.760   –       –       –       –       0.740   0.783   0.678   0.801
DT          0.611   0.771   0.856   0.789   –       –       0.586   0.625   0.688   0.734
LOG-REG     0.602   0.623   0.829   0.732   0.714   0.730   0.614   0.620   0.800   0.696
NAÏVE-B     0.536   0.615   0.833   0.733   0.730   –       0.539   0.565   0.161   –

(– : value missing in the source.)

- After Platt scaling, boosted trees are the best models overall across all metrics.
- Neural nets are the best models overall if no calibration is applied after training.

Platt Scaling vs. Isotonic Regression
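In the companion study on predicting good probabilities, Platt scaling tends to work better when calibration data is scarce, while the more flexible isotonic regression overtakes it once enough calibration data is available (on the order of a thousand points). A sketch comparing the two with scikit-learn's built-in wrapper, reusing X_fit, y_fit, X_test, y_test from the Platt sketch above (method="sigmoid" is Platt scaling):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import log_loss
from sklearn.svm import SVC

for method in ("sigmoid", "isotonic"):
    # Fits the SVM and its calibrator with internal 3-fold cross-validation.
    cal = CalibratedClassifierCV(SVC(), method=method, cv=3)
    cal.fit(X_fit, y_fit)
    p_cal = cal.predict_proba(X_test)[:, 1]
    print(method, "cross-entropy:", round(log_loss(y_test, p_cal), 4))
```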

Conclusions
- Calibration via Platt scaling or isotonic regression improves the probabilities produced by maximum-margin methods such as boosted trees and SVMs.
- With excellent performance on all eight metrics, calibrated boosted trees were the best learning algorithm overall.
- Random forests are a close second, followed by uncalibrated bagged trees, calibrated SVMs, and uncalibrated neural nets.
- The poorest-performing models were naïve Bayes, logistic regression, decision trees, and boosted stumps.
- Although some methods clearly perform better or worse than others on average, there is significant variability across problems and metrics.