Bias-Variance in Machine Learning

Bias-Variance: Outline
- Underfitting/overfitting: why are complex hypotheses bad?
- A simple example of bias/variance
- Error as bias + variance for regression, with brief comments on how it extends to classification
- Measuring bias, variance, and error
- Bagging: a way to reduce variance
- Bias-variance for classification

Bias/Variance is a Way to Understand Overfitting and Underfitting
[Figure: error/loss vs. classifier complexity. Error on the training set D_train falls as the classifier grows more complex, while error on an unseen test set D_test is high at both extremes: “too simple” (a simple classifier underfits) and “too complex” (a complex classifier overfits).]

Bias-Variance: An Example

Example (Tom Dietterich, Oregon State)

Example (Tom Dietterich, Oregon State): the same experiment, repeated with 50 samples of 20 points each.

Two sources of error:
- The true function f can’t be fit perfectly with hypotheses from our class H (lines) → Error 1. Fix: a more expressive set of hypotheses H.
- We don’t get the best hypothesis from H because of noise / small sample size → Error 2. Fix: a less expressive set of hypotheses H.
(Noise is similar to Error 1.)

Bias-Variance Decomposition: Regression

Bias and variance for regression
For regression, we can easily decompose the error of the learned model into two parts: bias (Error 1) and variance (Error 2).
- Bias: the class of models can’t fit the data. Fix: a more expressive model class.
- Variance: the class of models could fit the data, but doesn’t, because it’s hard to fit. Fix: a less expressive model class.

Bias-variance decomposition of error
Fix a test case x, then do this experiment:
1. Draw a size-n sample D = (x_1, y_1), ..., (x_n, y_n), where each y_i = f(x_i) + ε (f is the true function; ε is noise).
2. Train a linear regressor h_D using D.
3. Draw one test example (x, f(x) + ε).
4. Measure the squared error of h_D on that example.
What’s the expected error?
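A minimal simulation of this experiment, as a sketch: the true function f(x) = sin(x), the noise level σ = 0.3, and the sampling range are illustrative assumptions, not choices made in the slides (x = 5 matches the test point used on a later slide).

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Assumed true function, for illustration only.
    return np.sin(x)

def draw_sample(n, sigma=0.3):
    x = rng.uniform(0.0, 6.0, n)
    y = f(x) + rng.normal(0.0, sigma, n)   # y_i = f(x_i) + noise
    return x, y

def train_linear(x, y):
    # h_D: the best-fit line through D, by least squares.
    a, b = np.polyfit(x, y, deg=1)
    return lambda x_new: a * x_new + b

# Repeat the experiment over many datasets D, fixing the test case x = 5.
x_test, sigma, errors = 5.0, 0.3, []
for _ in range(1000):
    h_D = train_linear(*draw_sample(n=20))
    y_test = f(x_test) + rng.normal(0.0, sigma)   # one noisy test example
    errors.append((h_D(x_test) - y_test) ** 2)
print("expected squared error ~", np.mean(errors))
```

Averaging the squared errors over many repetitions approximates the expectation that the decomposition below breaks into bias², variance, and noise.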

Bias-variance decomposition of error
Notation, to simplify: E_D[h_D(x)] denotes the long-term expectation of the learner’s prediction on this x, averaged over many datasets D.

Bias-variance decomposition of error (continued)

The expected squared error splits into two learner-dependent terms:
- BIAS²: the squared difference between the best possible prediction for x, f(x), and our “long-term” expectation of what the learner will do if we averaged over many datasets D, E_D[h_D(x)].
- VARIANCE: the squared difference between our long-term expectation of the learner’s performance, E_D[h_D(x)], and what we expect in a representative run on a dataset D (ŷ = h_D(x)).
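Assembled into one formula (with y = f(x) + ε and noise variance σ²), the decomposition of the expected squared error at a fixed x is:

```latex
\mathbb{E}_{D,\varepsilon}\!\left[(y - h_D(x))^2\right]
  = \underbrace{\left(f(x) - \mathbb{E}_D[h_D(x)]\right)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}_D\!\left[\left(h_D(x) - \mathbb{E}_D[h_D(x)]\right)^2\right]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{noise}}
```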

[Figure: bias and variance illustrated at the test point x = 5.]

Bias-variance decomposition
This is something real that you can (approximately) measure experimentally, if you have synthetic data.
Different learners and model classes have different tradeoffs:
- large bias / small variance: few features, heavy regularization, heavily pruned decision trees, large-k k-NN, ...
- small bias / high variance: many features, less regularization, unpruned trees, small-k k-NN, ...

Bias-Variance Decomposition: Classification

A generalization of the bias-variance decomposition to other loss functions (Domingos, “A Unified Bias-Variance Decomposition and its Applications”, ICML 2000):
- Take an “arbitrary” real-valued loss L(y, y′) with L(y, y′) = L(y′, y), L(y, y) = 0, and L(y, y′) ≠ 0 if y ≠ y′.
- Define the “optimal prediction”: y* = argmin_y′ E_t[L(t, y′)].
- Define the “main prediction of the learner”: y_m = y_{m,D} = argmin_y′ E_D[L(y, y′)], where y = h_D(x) and D ranges over training sets of size m = |D|. For 0/1 loss, the main prediction is the most common class predicted by h_D(x), weighting the h_D’s by Pr(D).
- Define the “bias of the learner”: Bias(x) = L(y*, y_m).
- Define the “variance of the learner”: Var(x) = E_D[L(y_m, y)].
- Define the “noise for x”: N(x) = E_t[L(t, y*)].
Claim: E_{D,t}[L(t, y)] = c_1 N(x) + Bias(x) + c_2 Var(x), where c_1 = 2·Pr_D[y = y*] − 1, and c_2 = 1 if y_m = y*, −1 otherwise.
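In display form (the constants shown are Domingos’s values for two-class 0/1 loss; for squared loss the claim holds with c_1 = c_2 = 1):

```latex
\mathbb{E}_{D,t}\big[L(t, y)\big]
  = c_1\, N(x) + \mathrm{Bias}(x) + c_2\, \mathrm{Var}(x),
\qquad
c_1 = 2\Pr\nolimits_D[y = y^*] - 1,
\qquad
c_2 = \begin{cases} +1 & \text{if } y_m = y^* \\ -1 & \text{otherwise} \end{cases}
```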

Bias and variance
For classification, we can also decompose the error of a learned classifier into two terms: bias and variance.
- Bias: the class of models can’t fit the data. Fix: a more expressive model class.
- Variance: the class of models could fit the data, but doesn’t, because it’s hard to fit. Fix: a less expressive model class.

Bias-Variance Decomposition: Measuring

Bias-variance decomposition
This is something real that you can (approximately) measure experimentally:
- if you have synthetic data,
- ... or if you’re clever.
You need to somehow approximate E_D[h_D(x)], i.e., construct many variants of the dataset D.

Background: “bootstrap” sampling
Input: dataset D. Output: many variants of D: D_1, ..., D_T.
For t = 1, ..., T:
- D_t = { }
- For i = 1, ..., |D|: pick (x, y) uniformly at random from D (i.e., with replacement) and add it to D_t.
Some examples never get picked (~37%); some are picked 2x, 3x, ....
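A minimal sketch of this loop (the helper name bootstrap_variants is mine, not from the slides):

```python
import random

def bootstrap_variants(D, T, seed=0):
    """Return T bootstrap resamples of dataset D (a list of (x, y) pairs)."""
    rng = random.Random(seed)
    variants = []
    for _ in range(T):
        # Draw |D| examples uniformly at random *with replacement*.
        D_t = [rng.choice(D) for _ in range(len(D))]
        variants.append(D_t)
    return variants
```

The probability that a given example is never picked is (1 − 1/|D|)^|D| ≈ e⁻¹ ≈ 0.37, which is where the ~37% figure comes from.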

Measuring bias-variance with “bootstrap” sampling
Create B bootstrap variants of D (approximating many draws of D). For each bootstrap dataset:
- T_b is the bootstrap dataset; U_b are the “out of bag” examples (those never picked for T_b).
- Train a hypothesis h_b on T_b.
- Test h_b on each x in U_b.
Now for each example (x, y) we have many predictions h_1(x), h_2(x), ..., so we can estimate (ignoring noise):
- variance: the ordinary variance of h_1(x), ..., h_n(x)
- bias: average(h_1(x), ..., h_n(x)) − y
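A sketch of this estimator for regression; the function name and the train argument (any learner mapping a list of (x, y) pairs to a callable hypothesis) are my own framing. Sampling indices, rather than examples, makes out-of-bag membership easy to track.

```python
import numpy as np

def bias_variance_oob(D, train, B=200, seed=0):
    """Estimate per-example bias and variance from out-of-bag predictions."""
    rng = np.random.default_rng(seed)
    n = len(D)
    preds = [[] for _ in range(n)]                # OOB predictions per example
    for _ in range(B):
        idx = rng.integers(0, n, size=n)          # bootstrap sample T_b (indices)
        oob = set(range(n)) - set(idx.tolist())   # out-of-bag examples U_b
        h_b = train([D[i] for i in idx])          # train hypothesis h_b on T_b
        for i in oob:
            preds[i].append(h_b(D[i][0]))         # test h_b on each x in U_b
    bias, var = [], []
    for (x, y), p in zip(D, preds):
        if len(p) < 2:                            # need several predictions
            continue
        bias.append(np.mean(p) - y)               # average(h_1(x),...,h_n(x)) - y
        var.append(np.var(p))                     # ordinary variance of the h_i(x)
    return np.array(bias), np.array(var)
```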

Applying bias-variance analysis
By measuring the bias and variance on a problem, we can determine how to improve our model:
- If bias is high, we need to allow our model to be more complex.
- If variance is high, we need to reduce the complexity of the model.
Bias-variance analysis also suggests a way to reduce variance: bagging (later).

Bagging

Bootstrap Aggregation (Bagging)
- Use the bootstrap to create B variants of D.
- Learn a classifier from each variant.
- Vote the learned classifiers to predict on a test example.

Bagging (bootstrap aggregation)
Breaking it down (see the sketch below):
- input: dataset D and YFCL (your favorite classifier learner)
- output: a classifier h_D-BAG
- use the bootstrap to construct variants D_1, ..., D_T
- for t = 1, ..., T: train YFCL on D_t to get h_t
- to classify x with h_D-BAG: classify x with h_1, ..., h_T and predict the most frequently predicted class for x (majority vote)
Note that you can use any learner you like! You can also test each h_t on its “out of bag” examples.
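A minimal sketch of this procedure (learner plays the role of YFCL; any function from a dataset to a callable classifier will do):

```python
import random
from collections import Counter

def bag(D, learner, T=25, seed=0):
    """Train T classifiers on bootstrap variants of D; return a voting classifier."""
    rng = random.Random(seed)
    hypotheses = []
    for _ in range(T):
        D_t = [rng.choice(D) for _ in range(len(D))]  # bootstrap variant of D
        hypotheses.append(learner(D_t))               # h_t = YFCL trained on D_t
    def h_bag(x):
        votes = Counter(h(x) for h in hypotheses)     # each h_t votes on x
        return votes.most_common(1)[0][0]             # majority vote
    return h_bag
```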

Experiments (Freund and Schapire)

Bagged, minimally pruned decision trees

Generally, bagged decision trees eventually outperform the linear classifier, if the data is large enough and clean enough.

Bagging (bootstrap aggregation)
Experimentally:
- Decision trees, especially with minimal pruning, have low bias but high variance.
- Bagging usually improves performance for decision trees and similar methods.
- It reduces variance without increasing the bias (much).

More detail on bias-variance and bagging for classification (thanks to Tom Dietterich, MLSS 2014)

(Recap: Domingos’s unified bias-variance decomposition, as defined above.)

More detail on Domingos’s model
Noisy channel: y_i = noise(f(x_i))
- f(x_i) is the true label of x_i
- the noise process noise(·) may change y → y′
h = h_D is the learned hypothesis, trained from D = {(x_1, y_1), ..., (x_m, y_m)}.
For a test case (x*, y*) with predicted label h(x*), the loss is L(h(x*), y*).
- For instance, L(h(x*), y*) = 1 if error, else 0.

More detail on Domingos’s model
We want to decompose E_{D,P}[L(h(x*), y*)], where m is the size of D and (x*, y*) ~ P.
- Main prediction of the learner: y_m(x*) = argmin_y′ E_{D,P}[L(h(x*), y′)]; i.e., y_m(x*) is the “most common” h_D(x*) among all possible D’s, weighted by Pr(D).
- Bias: B(x*) = L(y_m(x*), f(x*)).
- Variance: V(x*) = E_{D,P}[L(h_D(x*), y_m(x*))].
- Noise: N(x*) = L(y*, f(x*)).

More detail on Domingos’s model
We want to decompose E_{D,P}[L(h(x*), y*)]:
- Main prediction y_m(x*): the “most common” h_D(x*) over D’s, for 0/1 loss.
- Bias B(x*) = L(y_m(x*), f(x*)): main prediction vs. true label.
- Variance V(x*) = E_{D,P}[L(h_D(x*), y_m(x*))]: this hypothesis vs. main prediction.
- Noise N(x*) = L(y*, f(x*)): true label vs. observed label.

More detail on Domingos’s model
We will decompose E_{D,P}[L(h(x*), y*)] into:
- Bias B(x*) = L(y_m(x*), f(x*)): main prediction vs. true label. This is 0 or 1, not a random variable.
- Variance V(x*) = E_{D,P}[L(h_D(x*), y_m(x*))]: this hypothesis vs. main prediction.
- Noise N(x*) = L(y*, f(x*)): true label vs. observed label.
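For 0/1 loss these quantities are easy to estimate at a single test point from a pool of per-dataset predictions; a small sketch (the preds list stands in for h_D(x*) over many bootstrap replicates):

```python
from collections import Counter

def domingos_terms(preds, true_label):
    """Main prediction, bias, and variance for 0/1 loss at one test point.

    preds: predictions h_D(x*) obtained from many (bootstrap) datasets D.
    """
    y_m = Counter(preds).most_common(1)[0][0]              # main prediction
    bias = int(y_m != true_label)                          # 0 or 1, not random
    variance = sum(p != y_m for p in preds) / len(preds)   # E_D[L(y_m, h_D(x*))]
    return y_m, bias, variance
```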

Case analysis of error

Analysis of error: unbiased case (the main prediction is correct)
A run makes an error when exactly one disturbance occurs:
- noise but no variance, or
- variance but no noise.

Analysis of error: biased case (the main prediction is wrong)
A run makes an error when the disturbances are absent or cancel:
- no noise and no variance, or
- noise and variance together.

Analysis of error: overall
Interaction terms are usually small. Hopefully we’ll be in the unbiased case more often, if we’ve chosen a good classifier.

Analysis of error: without noise (which is hard to estimate anyway)
As with regression, we can experimentally (and approximately) measure bias and variance with bootstrap replicates. Typically the variance is broken down into biased variance, V_b (variance on test points where the main prediction is wrong), and unbiased variance, V_u (variance on test points where it is right).
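Ignoring noise, the decomposition over the test distribution then takes the form below; this follows from c_2 = ±1 in Domingos’s claim: on biased points, variance can flip the wrong main prediction back to the correct class, so it subtracts.

```latex
\mathbb{E}_{x}\big[\text{0/1 error}(x)\big] \;=\; \mathbb{E}_{x}\big[B(x)\big] \;+\; V_u \;-\; V_b
```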

K-NN Experiments

Tree Experiments

Tree “stump” experiments (depth 2): bias is reduced (!)

Large tree experiments (depth 10): bias is not changed much; variance is reduced.