Data Analytics CMIS Short Course part II Day 1 Part 3: Ensembles Sam Buttrey December 2015.


Combining Models
- Models can be combined in different ways
- "Ensembles" refers specifically to combining large sets of large classifiers built with randomness applied to the data or the classifier
- Some theoretical work on tree ensembles
- Other combining strategies possible

Tree Pros (Recap)
- Easy to interpret and explain
- Pretty pictures
- Automatic variable selection
  – Noise removal, detection of redundancy
- Good missing-value handling
- Robust to transformations, outliers
- Can handle the case where p > n
- "Everything" is automatic (sorta kinda)

Tree Cons (Recap)
- Unstable: small changes in the data can make big changes in the model
  – Don't make policy decisions based on splits!
- Predictive power often not as good as "standard" linear/logistic regression
- Estimates are not smooth as a function of the predictors
- But trees can serve as building blocks in combinations of models

Why Combine?
- Flexible models mean that one "best" model runs a risk of over-fitting
- By averaging over a set of "weak" models we can "smooth" to reduce over-fitting
- Flexible models have low bias, high variance: averaging should produce low bias, low variance -- smaller overall error

Benefits of Combining
- Heterogeneous data means some models fit some X's better than others -- if we can find what model fits well where, perhaps we can exploit that
- If different models produce different predictions for the same X, that is information about uncertainty that may be worth having

How To Combine
A number of approaches have been suggested:
1. Chaining: use the predictions from one model as a predictor in the next
  – E.g., use the leaf that each observation falls in as a predictor in a logistic regression (a sketch follows below)
  – Can combine the strengths of two different types of models (here, interaction plus linear)
  – Over-fitting is always an issue
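A minimal sketch of the chaining idea in R (the kyphosis data that ships with rpart is my choice here, not the course example): fit a small classification tree, record the leaf each training observation falls in, and feed that leaf label into a logistic regression alongside an ordinary linear term.

library(rpart)

tree <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)

# tree$where records which leaf each *training* observation falls in;
# treat it as a categorical predictor
kyph <- kyphosis
kyph$leaf <- factor(tree$where)

# logistic regression combining the tree's interactions (via 'leaf')
# with a plain linear term in Age
chain <- glm(Kyphosis ~ leaf + Age, family = binomial, data = kyph)
summary(chain)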

Combining Models
2. Hybrid: use different models in different parts of the X space, with the regions chosen by a model
  – E.g., k-nearest-neighbor inside the leaves of a classification tree
  – Treed regression (a small sketch follows below)
  – Trees are an obvious choice to build the partition, but it's not clear whether that's the right thing to do
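A rough sketch of treed regression, using the built-in mtcars data purely for illustration: a shallow tree partitions the X space, and a separate linear model is fit inside each leaf.

library(rpart)

# shallow tree defines the partition
part <- rpart(mpg ~ wt + hp, data = mtcars,
              control = rpart.control(maxdepth = 2))
leaf <- factor(part$where)            # leaf membership of each training row

# a separate linear model of mpg on wt inside each leaf
fits <- lapply(split(mtcars, leaf), function(d) lm(mpg ~ wt, data = d))

# stitch the per-leaf fitted values back into the original row order
fitted_vals <- unsplit(lapply(fits, fitted), leaf)
sqrt(mean((mtcars$mpg - fitted_vals)^2))   # in-sample RMSE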

Voting Is (Usually) Good
- A combination of "weak learners" may often be stronger (voting collectively) than any one "strong learner"
- Example: two-class problem; the strong learner has 75% accuracy
- A weak learner has 52% accuracy, but… …we have 2000 independent ones
- What is the accuracy if each one votes?
- The hard part is getting these to be independent
- If each is < 50% accurate, voting gives mob rule and chaos

Voting Independent Models
- In this example, Pr(unweighted vote is correct) is about 96% (see the check below)
- How to construct independent classifiers? Answer: we can't (no new data), but we can try to keep the correlation low
- Goal: a set of models in which each individual is strong and the models are uncorrelated
  – But we'll settle for a set of "weak learners," as long as each is "strong enough" and they're all "roughly uncorrelated"
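The 96% figure can be checked directly: the unweighted majority vote of 2000 independent classifiers, each correct with probability 0.52, is correct whenever more than 1000 of them are right, which is a binomial tail probability.

# P(more than 1000 of 2000 independent 52%-accurate voters are correct)
1 - pbinom(1000, size = 2000, prob = 0.52)   # about 0.96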

Remember this? Bias vs Variance
[Diagram: the space of all relationships, with nested regions "Linear" and "Linear Plus"; the best linear model, the best LM with transformations and interactions, and the best tree sit at varying distances (bias) from the true model.]

Why Averaging Might Help
[Diagram: the same space of relationships. Best LMs built with randomness sit far from the true model (high bias, low variance); best trees built with randomness scatter around it (low bias, high variance). The average of the randomized trees should be close to the "truth": low bias, low variance.]

Example: Bagging
- Bootstrap Aggregating (Breiman, 1996)
- Idea: build hundreds of trees, each on a different bootstrap sample (a hand-rolled sketch follows below)
  – i.e., re-sample the original data, to the original size, but with replacement
  – Some original items are duplicated, others omitted
- Let the trees vote (or take the average, for a regression tree)
- This "smooths" the abrupt boundaries of a single tree: an easy increase in quality
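A hand-rolled sketch of the mechanism (iris used purely as an illustration, with the training data re-used for prediction just to keep the code short): each tree is grown on a bootstrap re-sample of the rows, and the trees then vote.

library(rpart)

set.seed(1)
B <- 100
n <- nrow(iris)

# grow B trees, each on a bootstrap sample: n rows drawn with replacement
trees <- lapply(seq_len(B), function(b) {
  boot <- iris[sample(n, n, replace = TRUE), ]
  rpart(Species ~ ., data = boot)
})

# every tree votes; the majority class wins
votes  <- sapply(trees, function(tr)
  as.character(predict(tr, iris, type = "class")))
bagged <- apply(votes, 1, function(v) names(which.max(table(v))))

mean(bagged == iris$Species)   # (in-sample) accuracy of the bagged vote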

Examples
- R: the package ipred has one implementation of bagging (see the call below)
- Easy gain in predictive power
- Examples: wage model (again); optical digits (again)
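For reference, a minimal call to the ipred implementation (again on iris rather than the course data sets); nbagg sets the number of bootstrapped trees and coob = TRUE requests an out-of-bag error estimate.

library(ipred)

set.seed(1)
fit <- bagging(Species ~ ., data = iris, nbagg = 100, coob = TRUE)
fit                          # prints the out-of-bag misclassification estimate
predict(fit, head(iris))     # class predictions for (here, not really) new data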

Drawbacks of Combining
- Interpretability is mostly gone
- May be no improvement if using a low-variance model (regression, k-NN)
- Computationally expensive (but parallelizable)
- Often error on the training set → 0 as the number of models → ∞; need to keep close track of test/validation-set error

Bagging 1-NN Classifiers
- Recall the rule:
  – find the point in the training set nearest to the test item;
  – assign that training item's class to the test item
- Suppose that x_0's nearest neighbor is x_j
- What's the probability that x_j appears in a bootstrap sample?
- A: 1 − [(n − 1)/n]^n ≈ 1 − 1/e = .632
  – (n − 1)/n is the chance that x_j is not picked on the first draw
  – [(n − 1)/n]^n is the chance that x_j is not picked n times in a row
  – 1 − [(n − 1)/n]^n is the chance that x_j ISN'T not picked – IS picked – in n tries
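A quick numerical check of the 0.632 figure:

n <- 1000
1 - ((n - 1) / n)^n    # 0.6323..., already close to
1 - exp(-1)            # 0.6321206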

Bagging 1-NN, cont'd
- So most bootstrap samples will include x_j, and voting produces no benefit
- Similar story for k-NN, k > 1
- Related story for regression: the model doesn't change much under re-sampling
- In trees (and neural nets), small changes in the data can produce large changes in the model
  – This is where ensembles will help most!

Other Ensembles
- Random Forests (Breiman, 2001)
- Take a bootstrap sample, as with bagging
- At each split, select a random subset of the available predictors; use the best split from among those
  – In bagging, a few strong predictors might be chosen in almost every tree
  – This induces correlation among the trees, which random forests reduce
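A minimal random-forest fit (iris again, purely as an illustration); mtry is the number of predictors sampled at each split, so here each split considers a random 2 of the 4 available predictors.

library(randomForest)

set.seed(1)
rf <- randomForest(Species ~ ., data = iris, ntree = 500, mtry = 2)
rf    # the printout includes the OOB error estimate and confusion matrix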

More on Random Forests
- RFs give fast, often unbiased error estimates
  – Could use a test set, but it's more common to use the subset of the data that is not "in the bag" on any iteration – the OOB ("out of bag") set
- RFs give a measure of variable importance
  – For variable j, compare the performance of the RF on the OOB sets to the performance when j is randomly permuted
- RFs give one measure of proximity
  – Measure how close two observations are by how often they land in the same leaf
  – Useful (?) for clustering – stay tuned
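Continuing the sketch above, the OOB error, permutation importance, and proximity matrix can all be read off the fitted object, provided importance = TRUE and proximity = TRUE are requested at fit time.

library(randomForest)

set.seed(1)
rf2 <- randomForest(Species ~ ., data = iris,
                    importance = TRUE, proximity = TRUE)

tail(rf2$err.rate[, "OOB"], 1)   # OOB error estimate after the last tree
importance(rf2)                  # permutation-based variable importance
dim(rf2$proximity)               # 150 x 150 matrix of pairwise proximities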

Other Ensembles
- Boosting (Freund and Schapire, 1997)
- Tree n+1 is built with a weighted re-sampling of the data, the weights depending on the classifications in tree n
  – Later trees focus more on the cases that are "harder" to classify
- Trees are combined by a weighted vote based on quality
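A small AdaBoost-style sketch using the adabag package (listed among the R packages later in these slides); mfinal is the number of boosted trees and boos = TRUE draws the weighted bootstrap samples described above.

library(adabag)

set.seed(1)
boost <- boosting(Species ~ ., data = iris, boos = TRUE, mfinal = 50)
head(boost$weights)                # per-tree voting weights
pred <- predict(boost, newdata = iris)
pred$error                         # (in-sample) misclassification rate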

Yet More Ensembles
- Add randomness to the data (as bagging and RF do):
  – Perturb y directly by "smearing" (regression) or "flipping" (classification)
  – Perturbing x seems to work, too; the goal is to produce less-correlated classifiers
  – Sub-sampling
- Add randomness to the algorithm (as RF does):
  – Instead of the best split, choose randomly from the top 20 (Dietterich)
  – Choose from among all "good" splits with probability proportional to the decrease in deviance (Kobayashi)
- These ideas can be used on models other than trees…

Implementation
- Bagging, boosting, and RF have all been successful in practice
  – Good addition to the analyst's toolbox
- R: packages ada, adabag [includes boosting], gbev, gbm, ipred, ModelMap, mboost, randomForest, others
- More examples! Maybe start with this?