ECE 5424: Introduction to Machine Learning Topics: Ensemble Methods: Bagging, Boosting PAC Learning Readings: Murphy 16.4; Hastie 16 Stefan Lee Virginia Tech
Fighting the bias-variance tradeoff Simple (a.k.a. weak) learners e.g., naïve Bayes, logistic regression, decision stumps (or shallow decision trees) Good: Low variance, don’t usually overfit Bad: High bias, can’t solve hard learning problems Sophisticated learners Kernel SVMs, Deep Neural Nets, Deep Decision Trees Good: Low bias, have the potential to learn with Big Data Bad: High variance, difficult to generalize Can we combine these properties? In general, no! But often yes… (C) Dhruv Batra Slide Credit: Carlos Guestrin
Ensemble Methods: Bagging, Boosting. Core Intuition: a combination of multiple classifiers will perform better than a single classifier. (C) Stefan Lee
Ensemble Methods. Instead of learning a single predictor, learn many predictors. Output class: (weighted) combination of each predictor. On average, do better than a single classifier! With sophisticated learners and uncorrelated errors, expected error goes down: Bagging. With weak learners, each one good at different parts of the input space: Boosting. (C) Dhruv Batra
Bagging (Bootstrap Aggregating / Bootstrap Averaging) Core Idea: Average multiple strong learners trained from resamples of your data to reduce variance and overfitting! (C) Stefan Lee
Bagging. Given: a dataset $\{(x_i, y_i)\}$ of N training examples. Sample N training points with replacement and train a predictor; repeat M times to get predictors $h_1(x), \dots, h_M(x)$ (Sample 1 → $h_1(x)$, …, Sample M → $h_M(x)$). At test time, output the (weighted) average output of these predictors. (C) Stefan Lee
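As a concrete illustration, here is a minimal bagging sketch in Python (an illustrative assumption, not code from the course; the function names are made up for this example, and scikit-learn decision trees stand in for the strong base learner):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bag(X, y, M=25, base_learner=DecisionTreeRegressor):
    # Train M predictors, each on a bootstrap resample of the N training points.
    N = len(y)
    models = []
    for _ in range(M):
        idx = np.random.choice(N, size=N, replace=True)  # sample N points with replacement
        models.append(base_learner().fit(X[idx], y[idx]))
    return models

def bagged_predict(models, X):
    # At test time, average the outputs of the M predictors.
    return np.mean([m.predict(X) for m in models], axis=0)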
Why Use Bagging. Let $e_m$ be the error of the m-th predictor trained through bagging and $e_{avg}$ be the error of the ensemble average. If $E[e_m] = 0$ (unbiased) and $E[e_m e_k] = E[e_m]E[e_k]$ for $m \ne k$ (uncorrelated), then $E\big[e_{avg}^2\big] = \frac{1}{M}\cdot\frac{1}{M}\sum_{m=1}^{M} E\big[e_m^2\big]$. The expected squared error of the average is a fraction ($1/M$) of the average expected squared error of the individual predictors! (C) Stefan Lee
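To make the claim precise, here is the standard committee-averaging derivation (a sketch in the style of Bishop's treatment, not verbatim from the slide), writing $\epsilon_m(x)$ for the m-th predictor's error and $e_{\text{avg}}$ for the error of the averaged ensemble:

\[
e_{\text{avg}}(x) = \frac{1}{M}\sum_{m=1}^{M} \epsilon_m(x)
\;\;\Rightarrow\;\;
E\!\left[e_{\text{avg}}^2\right]
= \frac{1}{M^2}\sum_{m=1}^{M} E\!\left[\epsilon_m^2\right]
+ \frac{1}{M^2}\sum_{m \ne k} E\!\left[\epsilon_m \epsilon_k\right]
= \frac{1}{M}\left(\frac{1}{M}\sum_{m=1}^{M} E\!\left[\epsilon_m^2\right]\right),
\]

where the cross terms vanish because $E[\epsilon_m \epsilon_k] = E[\epsilon_m]E[\epsilon_k] = 0$ for $m \ne k$.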
When To Use Bagging. In practice, completely uncorrelated predictors don’t really happen, but there likely won’t be perfect correlation either, so bagging may still help! Use bagging when… … you have overfit, sophisticated learners (averaging lowers variance) … you have a reasonably sized dataset … you want an extra bit of performance from your models. (C) Stefan Lee
Example: Decision Forests. We’ve seen that single decision trees can easily overfit! Train M trees on different samples of the data and call it a forest. Uncorrelated errors result in better ensemble performance — can we force this? Could assign trees random max depths. Could give each tree only a random subset of the possible splits (features). Some work optimizes for low correlation as part of the objective! (C) Stefan Lee
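In practice, libraries already package these decorrelation tricks. A usage sketch with scikit-learn's RandomForestClassifier (the hyperparameter values are illustrative choices, not from the lecture):

from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(
    n_estimators=100,      # M trees, each trained on a bootstrap resample of the data
    max_features="sqrt",   # each split only considers a random subset of the features
    max_depth=None,        # deep, low-bias trees; the averaging handles the variance
)
# forest.fit(X_train, y_train); forest.predict(X_test)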
A Note From Statistics. Bagging is a general method to reduce (or estimate) the variance of an estimator. Looking at the distribution of an estimator across multiple resamples of the data can give confidence intervals and bounds on that estimator. Typically just called bootstrapping in this context. (C) Stefan Lee
Boosting Core Idea: Combine multiple weak learners to reduce error/bias by reweighting hard examples! (C) Stefan Lee
Some Intuition About Boosting. Consider a weak learner h(x), for example a decision stump that splits on a single feature $x_j$ at a threshold t: $h(x) = a$ if $x_j \ge t$, $h(x) = b$ if $x_j < t$. Example for binary classification: $h(x) = +1$ on one side of t and $h(x) = -1$ on the other. Example for regression: $h(x) = w_R^T x$ if $x_j \ge t$, $h(x) = w_L^T x$ if $x_j < t$. (C) Stefan Lee
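For concreteness, a minimal brute-force decision stump for binary labels $y \in \{-1, +1\}$ (an illustrative sketch; the function names and the optional example weights w are assumptions made for this example, anticipating the weighted fits used later in boosting):

import numpy as np

def fit_stump(X, y, w=None):
    # Brute-force the (feature, threshold, polarity) stump minimizing weighted 0/1 error.
    # Returns (j, t, s), predicting s if x[j] >= t and -s otherwise.
    N, D = X.shape
    w = np.ones(N) / N if w is None else w / w.sum()
    best, best_err = (0, 0.0, 1), np.inf
    for j in range(D):
        for t in np.unique(X[:, j]):
            for s in (+1, -1):
                pred = np.where(X[:, j] >= t, s, -s)
                err = w[pred != y].sum()
                if err < best_err:
                    best_err, best = err, (j, t, s)
    return best

def predict_stump(stump, X):
    j, t, s = stump
    return np.where(X[:, j] >= t, s, -s)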
Some Intuition About Boosting. This learner will make mistakes often, but what if we combine multiple such learners to combat these errors, so that our final predictor is $f(x) = \alpha_1 h_1(x) + \alpha_2 h_2(x) + \dots + \alpha_{M-1} h_{M-1}(x) + \alpha_M h_M(x)$? This is a big optimization problem now: $\min_{\alpha_1,\dots,\alpha_M,\; h_i \in H} \frac{1}{N}\sum_i L\big(y_i,\; \alpha_1 h_1(x_i) + \dots + \alpha_M h_M(x_i)\big)$. Boosting will do this greedily, training one classifier at a time to correct the errors of the existing ensemble (see the stagewise step below). (C) Stefan Lee
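Concretely, the greedy step at round t (a sketch of the standard forward-stagewise formulation, with $f_{t-1}$ the ensemble built so far and $f_0(x) = 0$):

\[
(\alpha_t, h_t) = \arg\min_{\alpha,\; h \in H} \frac{1}{N}\sum_{i=1}^{N}
L\!\bigl(y_i,\; f_{t-1}(x_i) + \alpha\, h(x_i)\bigr),
\qquad
f_t(x) = f_{t-1}(x) + \alpha_t\, h_t(x).
\]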
Boosting Algorithm [Schapire, 1989]. Pick a class of weak learners. You have a black box that picks the best weak learner by minimizing either an unweighted or a weighted sum of per-example losses. On each iteration t: compute the error based on the current ensemble; update the weight of each training example based on its error; learn a predictor $h_t$ and a strength $\alpha_t$ for this predictor; update the ensemble: $f_t(x) = f_{t-1}(x) + \alpha_t h_t(x)$.
Boosting Demo — Matlab demo by Antonio Torralba: http://people.csail.mit.edu/torralba/shortCourseRLOC/boosting/boosting.html (C) Dhruv Batra
Boosting: Weak to Strong. [Plot: training error vs. size of the boosted ensemble M — training error decreases toward zero as M grows.] As we add more boosted learners to our ensemble, training error approaches zero (in the limit). Need to decide when to stop based on a validation set. Don’t use this on already-overfit strong learners — they will just become worse. (C) Stefan Lee
Boosting Algorithm [Schapire, 1989]. We’ve assumed we have some tools to find optimal weak learners, either unweighted, $h^* = \arg\min_{h \in H} \frac{1}{N}\sum_i L\big(y_i, h(x_i)\big)$ given $\{(x_i, y_i)\}_{i=1}^{N}$, or weighted, $h^* = \arg\min_{h \in H} \frac{1}{N}\sum_i w_i\, L\big(y_i, h(x_i)\big)$ given $\{(x_i, y_i, w_i)\}_{i=1}^{N}$. To train the t-th predictor, our job is to express the optimization for the new predictor, $\min_{\alpha_t,\; h_t \in H} \frac{1}{N}\sum_i L\big(y_i,\; f_{t-1}(x_i) + \alpha_t h_t(x_i)\big)$, in one of these forms. Typically done by either changing $y_i$ or $w_i$, depending on L (the two standard reductions are sketched below). (C) Stefan Lee
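For example (a sketch, not verbatim from the slides): with squared loss the new-predictor objective reduces to an unweighted fit against residuals (change $y_i$), while with exponential loss it reduces to a weighted fit (change $w_i$):

\[
\sum_i \bigl(y_i - f_{t-1}(x_i) - \alpha_t h_t(x_i)\bigr)^2
= \sum_i \bigl(r_i - \alpha_t h_t(x_i)\bigr)^2,
\qquad r_i := y_i - f_{t-1}(x_i),
\]
\[
\sum_i \exp\!\bigl(-y_i\,[f_{t-1}(x_i) + \alpha_t h_t(x_i)]\bigr)
= \sum_i w_i^{(t)} \exp\!\bigl(-y_i\, \alpha_t h_t(x_i)\bigr),
\qquad w_i^{(t)} := \exp\!\bigl(-y_i\, f_{t-1}(x_i)\bigr).
\]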
Types of Boosting (C) Dhruv Batra
Loss Name                         | Loss Formula                            | Boosting Name
Regression: Squared Loss          | $L(y, f(x)) = (y - f(x))^2$             | L2Boosting
Regression: Absolute Loss         | $L(y, f(x)) = |y - f(x)|$               | Gradient Boosting
Classification: Exponential Loss  | $L(y, f(x)) = e^{-y f(x)}$              | AdaBoost
Classification: Log/Logistic Loss | $L(y, f(x)) = \log(1 + e^{-y f(x)})$    | LogitBoost
L2Boosting (regression: squared loss) — algorithm on board. (C) Dhruv Batra
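Since the L2Boosting algorithm is worked on the board, here is only a rough Python sketch of the usual recipe — fit each new weak learner to the current residuals — using scikit-learn regression stumps; the shrinkage factor lr is an extra assumption, not something stated on the slide:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def l2_boost(X, y, M=100, lr=0.1):
    f = np.zeros(len(y))            # current ensemble prediction f_{t-1}(x_i)
    learners = []
    for _ in range(M):
        r = y - f                   # residuals = negative gradient of the squared loss
        h = DecisionTreeRegressor(max_depth=1).fit(X, r)
        f += lr * h.predict(X)      # add the new weak learner (with shrinkage)
        learners.append(h)
    return learners

def l2_boost_predict(learners, X, lr=0.1):
    return lr * sum(h.predict(X) for h in learners)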
AdaBoost (classification: exponential loss) — algorithm: you will derive it in HW4! (C) Dhruv Batra
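For reference, a sketch of the standard AdaBoost updates for labels $y \in \{-1, +1\}$ (these are the usual formulas that HW4 asks you to derive from the exponential loss; scikit-learn stumps stand in for the weak learners):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, M=50):
    N = len(y)
    w = np.ones(N) / N
    ensemble = []
    for _ in range(M):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = h.predict(X)
        err = np.clip(w[pred != y].sum(), 1e-10, 1 - 1e-10)  # weighted training error
        alpha = 0.5 * np.log((1 - err) / err)                 # predictor strength
        w *= np.exp(-alpha * y * pred)                        # upweight the mistakes
        w /= w.sum()
        ensemble.append((alpha, h))
    return ensemble

def adaboost_predict(ensemble, X):
    return np.sign(sum(a * h.predict(X) for a, h in ensemble))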
What you should know. Voting/ensemble methods. Bagging: how to sample; under what conditions error is reduced. Boosting: general algorithm; L2Boosting derivation; AdaBoost derivation (from HW4). (C) Dhruv Batra
Learning Theory Probably Approximately Correct (PAC) Learning What does it formally mean to learn? (C) Dhruv Batra
Learning Theory We have explored many ways of learning from data But… How good is our classifier, really? How much data do I need to make it “good enough”? (C) Dhruv Batra Slide Credit: Carlos Guestrin
A simple setting… Classification, N data points. Finite space H of possible hypotheses (e.g., decision trees on categorical variables of depth d). A learner finds a hypothesis h that is consistent with the training data — gets zero error in training: $\text{error}_{\text{train}}(h) = 0$. What is the probability that h has true error at least $\epsilon$? $P(\text{error}_{\text{true}}(h) \ge \epsilon)$ (C) Dhruv Batra Slide Credit: Carlos Guestrin
Generalization error in finite hypothesis spaces [Haussler ’88]. Theorem: for a finite hypothesis space H, a dataset D with N i.i.d. samples, and $0 < \epsilon < 1$, for any learned hypothesis h that is consistent (0 training error) with the training data: $P(\text{error}_{\text{true}}(h) \ge \epsilon) \le |H|\, e^{-N\epsilon}$. Even if h makes zero errors on the training data, it may make errors at test time. (C) Dhruv Batra Slide Credit: Carlos Guestrin
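Where the bound comes from (a standard union-bound sketch, not verbatim from the slide): a single hypothesis with true error at least $\epsilon$ remains consistent with N i.i.d. samples with probability at most $(1-\epsilon)^N$, and there are at most |H| such hypotheses:

\[
P\bigl(\exists\, h \in H:\ h \text{ consistent with } D \text{ and } \text{error}_{\text{true}}(h) \ge \epsilon\bigr)
\;\le\; |H|\,(1-\epsilon)^N \;\le\; |H|\, e^{-N\epsilon},
\]

using $1 - \epsilon \le e^{-\epsilon}$.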
Using a PAC bound. Let the maximum acceptable failure probability be $P(\text{error}_{\text{true}}(h) \ge \epsilon) = \delta$. Typically, two use cases: 1: pick $\epsilon$ and $\delta$, solve for N — “I want at most $\epsilon$ error with probability at least $1-\delta$; how much data do I need?” 2: pick N and $\delta$, solve for $\epsilon$ — “I have N data points and want to know my error $\epsilon$ with confidence $1-\delta$.” (C) Dhruv Batra Slide Credit: Carlos Guestrin
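Both use cases amount to setting $|H|\,e^{-N\epsilon} \le \delta$ and solving (a worked rearrangement of the bound above):

\[
|H|\,e^{-N\epsilon} \le \delta
\;\Longleftrightarrow\;
N\epsilon \ge \ln|H| + \ln\tfrac{1}{\delta},
\]

so for case 1, $N \ge \dfrac{\ln|H| + \ln\frac{1}{\delta}}{\epsilon}$, and for case 2, $\epsilon \ge \dfrac{\ln|H| + \ln\frac{1}{\delta}}{N}$.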
Haussler ’88 bound. Strengths: holds for all (finite) H; holds for all data distributions. Weaknesses: requires a consistent classifier (0 training error); requires a finite hypothesis space. (C) Dhruv Batra Slide Credit: Carlos Guestrin
Generalization bound for |H| hypotheses. Theorem: for a finite hypothesis space H, a dataset D with N i.i.d. samples, and $0 < \epsilon < 1$, for any learned hypothesis h: $P\bigl(|\text{error}_{\text{true}}(h) - \text{error}_{\text{train}}(h)| > \epsilon\bigr) \le 2|H|\, e^{-2N\epsilon^2}$. (C) Dhruv Batra Slide Credit: Carlos Guestrin
PAC bound and the bias-variance tradeoff. Or, after moving some terms around, with probability at least $1-\delta$: $\text{error}_{\text{true}}(h) \le \text{error}_{\text{train}}(h) + \sqrt{\frac{\ln|H| + \ln\frac{2}{\delta}}{2N}}$. Important: the PAC bound holds for all h, but it doesn’t guarantee that the algorithm finds the best h! (C) Dhruv Batra Slide Credit: Carlos Guestrin
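The “moving some terms around” step, sketched: set the two-sided bound from the previous slide equal to $\delta$ and solve for $\epsilon$:

\[
2|H|\,e^{-2N\epsilon^2} = \delta
\;\Longleftrightarrow\;
\epsilon = \sqrt{\frac{\ln|H| + \ln\frac{2}{\delta}}{2N}},
\]

which, combined with $P\bigl(|\text{error}_{\text{true}}(h) - \text{error}_{\text{train}}(h)| > \epsilon\bigr) \le \delta$, gives the stated inequality simultaneously for all $h \in H$ with probability at least $1-\delta$.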