An Empirical Comparison of Supervised Learning Algorithms. Rich Caruana, Alexandru Niculescu-Mizil. Presented by Varun Sudhakar.

Presentation transcript:

Rich Caruana, Alexandru Niculescu-Mizil. Presented by Varun Sudhakar

Importance: An empirical comparison of different learning algorithms answers questions such as: Which is the best learning algorithm? How well does a particular learning algorithm perform compared to another algorithm on the same data?

The last comprehensive empirical comparison was STATLOG in 1995. Several new learning algorithms have been developed since STATLOG (random forests, bagging, SVMs), and there has been no extensive evaluation of these new methods.

Algorithms compared: SVMs, ANNs, logistic regression, naïve Bayes, KNN, random forests, decision trees (Bayes, CART, CART0, ID3, C4, MML, SMML), bagged trees, boosted trees, boosted stumps.
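A rough sketch mapping the algorithms above to approximate scikit-learn stand-ins; these are illustrative modern equivalents chosen for the sketch, not the implementations used in the study:

from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import (RandomForestClassifier, BaggingClassifier,
                              GradientBoostingClassifier, AdaBoostClassifier)
from sklearn.tree import DecisionTreeClassifier

learners = {
    "SVM": SVC(),
    "ANN": MLPClassifier(),
    "LOGREG": LogisticRegression(),
    "NB": GaussianNB(),
    "KNN": KNeighborsClassifier(),
    "RF": RandomForestClassifier(),
    "DT": DecisionTreeClassifier(),
    "BAG-DT": BaggingClassifier(DecisionTreeClassifier(), n_estimators=100),
    "BST-DT": GradientBoostingClassifier(),
    "BST-STMP": AdaBoostClassifier(DecisionTreeClassifier(max_depth=1)),  # boosted stumps
}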

Threshold metrics: Accuracy - the proportion of correct predictions relative to the size of the dataset. F-score - the harmonic mean of precision and recall at a given threshold. Lift - the % of true positives above the threshold divided by the % of the dataset above the threshold.
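A minimal sketch of the three threshold metrics, assuming NumPy arrays y_true (0/1 labels) and scores (classifier outputs); the variable names and the 0.5 default threshold are assumptions for the sketch, not the paper's:

import numpy as np

def threshold_metrics(y_true, scores, threshold=0.5):
    """Accuracy, F-score and lift for predictions thresholded at `threshold`."""
    y_pred = (scores >= threshold).astype(int)

    accuracy = np.mean(y_pred == y_true)  # fraction of correct predictions

    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)  # harmonic mean of precision and recall

    # lift: % of true positives above the threshold / % of the dataset above it
    pct_pos_above = tp / max(np.sum(y_true == 1), 1)
    pct_data_above = np.mean(y_pred == 1)
    lift = pct_pos_above / pct_data_above if pct_data_above else 0.0

    return accuracy, f_score, lift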

Ordering/rank metrics: ROC curve - a plot of sensitivity vs. (1 - specificity) over all possible thresholds, summarized by the area under the curve. APR - average precision. BEP (break-even point) - the precision at the point (threshold value) where precision and recall are equal.
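A hedged sketch of the ordering metrics using scikit-learn, again assuming y_true and scores arrays; the break-even point is approximated here as the point on the precision-recall curve where precision and recall are closest:

import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score, precision_recall_curve

def rank_metrics(y_true, scores):
    """Area under the ROC curve, average precision, and precision/recall break-even point."""
    auc = roc_auc_score(y_true, scores)            # area under the ROC curve
    apr = average_precision_score(y_true, scores)  # average precision

    # BEP: precision where precision and recall are (closest to) equal
    precision, recall, _ = precision_recall_curve(y_true, scores)
    bep = precision[np.argmin(np.abs(precision - recall))]
    return auc, apr, bep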

Probability metrics: RMSE (root mean squared error) - a measure of total error, defined as the square root of the sum of the variance and the squared bias. MXE (mean cross entropy) - used in the probabilistic setting, when we are interested in predicting the probability that an example is positive: MXE = -(1/N) Σ [ true(c)·ln(pred(c)) + (1 - true(c))·ln(1 - pred(c)) ]
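The two probability metrics can be sketched directly from the definitions above; y_true holds 0/1 labels, p_pred holds predicted probabilities of the positive class, and the clipping constant eps is an assumption added to avoid log(0):

import numpy as np

def probability_metrics(y_true, p_pred, eps=1e-12):
    """Root mean squared error and mean cross entropy of predicted probabilities."""
    p = np.clip(p_pred, eps, 1 - eps)  # keep probabilities away from 0 and 1
    rmse = np.sqrt(np.mean((y_true - p) ** 2))
    mxe = -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
    return rmse, mxe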

Lift is appropriate for marketing, medicine prefers ROC, and precision/recall is used in information retrieval. It is also possible for an algorithm to perform well on one metric and poorly on another.

Datasets: Letter, Cover Type, Adult, Protein coding (COD), MEDIS, MG, IndianPine92, California Housing, Bacteria, SLAC (Stanford Linear Accelerator).

For each dataset, 5000 random instances are used for training and the rest are used as one large test set. 5-fold cross-validation on the 5000 training instances is used to select the best parameters for each learning algorithm. The cross-validation folds are also used to calibrate the different algorithms using either Platt scaling or isotonic regression.
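A hedged sketch of this setup for one hypothetical problem and one learner, using scikit-learn's GridSearchCV and CalibratedClassifierCV as stand-ins for the paper's procedure; the synthetic data and the small SVM grid are assumptions for illustration, not the paper's:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import SVC

# Placeholder data standing in for one of the problems above.
X, y = make_classification(n_samples=20000, n_features=20, random_state=0)

# 5000 random instances for training, the rest as one large test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=5000, random_state=0)

# 5-fold cross-validation on the 5000 training points to pick the best parameters.
search = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}, cv=5)
search.fit(X_train, y_train)

# Cross-validated Platt scaling ("sigmoid") of the selected model's scores.
calibrated = CalibratedClassifierCV(search.best_estimator_, method="sigmoid", cv=5)
calibrated.fit(X_train, y_train)
test_probs = calibrated.predict_proba(X_test)[:, 1]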

SVM predictions are transformed into posterior probabilities by passing them through a sigmoid (Platt scaling). Platt's method also works well for boosted trees and boosted stumps, but a sigmoid might not be the correct transformation for all learning algorithms. Isotonic regression provides a more general solution, since the only restriction it makes is that the mapping function be isotonic (monotonically increasing).
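A minimal sketch of the two calibration maps fit on a held-out fold; Platt scaling is approximated here by fitting a logistic regression to the raw scores (Platt's exact method uses slightly smoothed targets), and f_val, y_val, f_test are hypothetical score and label arrays:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.isotonic import IsotonicRegression

def platt_scale(f_val, y_val, f_test):
    """Fit a sigmoid from uncalibrated scores to probabilities (logistic-regression approximation)."""
    lr = LogisticRegression()
    lr.fit(f_val.reshape(-1, 1), y_val)
    return lr.predict_proba(f_test.reshape(-1, 1))[:, 1]

def isotonic_scale(f_val, y_val, f_test):
    """Fit a monotone (isotonic) map from scores to probabilities."""
    iso = IsotonicRegression(out_of_bounds="clip")
    iso.fit(f_val, y_val)
    return iso.predict(f_test)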

SVMs: radial kernel width {0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 2}; the regularization parameter is varied by factors of ten from 10^-7 to 10^3. ANN: hidden units {1, 2, 4, 8, 32, 128}, momentum {0, 0.2, 0.5, 0.9}.

Logistic regression: the ridge (regularization) parameter is varied by factors of 10 from 10^-8 to 10^4. KNN: 26 values of K ranging from K = 1 to K = |trainset|. Random forests: the size of the feature set considered at each split is 1, 2, 4, 6, 8, 12, 16, or 20.

Boosted trees: 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, and 2048 steps of boosting. Boosted stumps: single-level decision trees generated with 5 different splitting criteria, each boosted for 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, and 8192 steps.
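As an illustrative sketch, a couple of the grids above can be written as scikit-learn parameter dictionaries; GradientBoostingClassifier stands in for the paper's boosted trees, and the feature-count grid assumes the data has at least 20 features:

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Two of the grids above as scikit-learn parameter dictionaries.
param_grids = {
    "random_forest": (RandomForestClassifier(),
                      {"max_features": [1, 2, 4, 6, 8, 12, 16, 20]}),
    "boosted_trees": (GradientBoostingClassifier(),
                      {"n_estimators": [2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048]}),
}

# X_train, y_train as in the earlier sketch (or any dataset with >= 20 features).
searches = {name: GridSearchCV(est, grid, cv=5).fit(X_train, y_train)
            for name, (est, grid) in param_grids.items()}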

Without calibration, the best algorithms were bagged trees, random forests, and neural nets. After calibration, the best algorithms were calibrated boosted trees, calibrated random forests, bagged trees, PLT-calibrated SVMs, and neural nets. SVMs and boosted trees improve in the rankings with calibration.

Interestingly, calibrating neural nets with PLT or ISO hurts their calibration, and some algorithms, such as memory-based methods (e.g. KNN), are unaffected by calibration.

Best model per problem:
Letter - Boosted DT (PLT)
Cover Type - Boosted DT (PLT)
Adult - Boosted STMP (PLT)
Protein coding - Boosted DT (PLT)
MEDIS - Random Forest (PLT)
MG - Bagged DT
IndianPine92 - Boosted DT (PLT)
California Housing - Boosted DT (PLT)
Bacteria - Bagged DT
SLAC - Random Forest (ISO)

Neural nets perform well on all metrics on 10 of the 11 problems, but perform poorly on COD. If the COD problem had not been included, neural nets would move up 1-2 places in the rankings.

Bootstrap analysis

1. Randomly select a bootstrap sample from the original 11 problems.
2. Randomly select a bootstrap sample of 8 metrics from the original 8 metrics.
3. Rank the ten algorithms by mean performance across the sampled problems and metrics.
Repeat the bootstrap sampling 1000 times, yielding 1000 potentially different rankings of the learning methods.
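A sketch of this bootstrap ranking analysis, assuming a hypothetical array perf[i, j, k] holding the (higher-is-better) score of algorithm k on problem i under metric j; random numbers stand in here for the actual scores:

import numpy as np

rng = np.random.default_rng(0)

# perf[i, j, k]: score of algorithm k on problem i under metric j (higher is better).
# Placeholder array; the paper has 11 problems, 8 metrics, and 10 algorithms.
perf = rng.random((11, 8, 10))
n_problems, n_metrics, n_algorithms = perf.shape

rank_counts = np.zeros((n_algorithms, n_algorithms), dtype=int)
for _ in range(1000):
    probs = rng.integers(0, n_problems, size=n_problems)  # bootstrap sample of problems
    mets = rng.integers(0, n_metrics, size=n_metrics)     # bootstrap sample of metrics
    mean_perf = perf[np.ix_(probs, mets)].mean(axis=(0, 1))  # mean over sampled problems/metrics
    order = np.argsort(-mean_perf)                            # best algorithm first
    for rank, alg in enumerate(order):
        rank_counts[alg, rank] += 1

rank_freq = rank_counts / 1000.0  # how often each algorithm lands at each rank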

Table: for each model (Bst DT, RF, Bag DT, SVM, ANN, KNN, Bst stm, DT, logreg, NB), the bootstrap frequency of finishing in each rank from 1st to 10th.

The models that performed poorest were naïve Bayes, logistic regression, decision trees, and boosted stumps. Bagged trees, random forests, and neural nets give the best average performance without calibration. After calibration with Platt's method, boosted trees predict better probabilities than all other methods. At the same time, boosted stumps and logistic regression, which perform poorly on average, are the best models for some metrics. The effectiveness of an algorithm therefore depends on both the metric used and the dataset.

The End