The Bias-Variance Trade-Off
Oliver Schulte, Machine Learning 726

Estimating Generalization Error
The basic problem: once I've built a classifier, how accurate will it be on future test data? Problem of induction: "It's hard to make predictions, especially about the future" (Yogi Berra). Cross-validation: clever computation on the training data to predict test performance. Other variants: jackknife, bootstrapping. Today: theoretical insights into generalization performance.
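
To make the cross-validation idea concrete, here is a minimal sketch using scikit-learn (an illustrative addition, not part of the original deck; the dataset and model are assumed choices):

```python
# Estimating generalization error with 10-fold cross-validation.
# Illustrative sketch: the dataset and model below are assumed choices.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

# Each fold fits on 9/10 of the training data and scores on the held-out 1/10,
# so every accuracy estimate comes from data the model never saw during fitting.
scores = cross_val_score(model, X, y, cv=10)
print(f"estimated accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```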

The Bias-Variance Trade-Off
The short story: generalization error = bias² + variance + noise. Bias and variance typically trade off in relation to model complexity.
[Figure: bias², variance, and total error plotted against model complexity.]
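
The trade-off is easy to see empirically (an illustrative sketch, not from the slides; the true function, noise level, and degrees below are assumed choices): as polynomial degree grows, test error first falls as bias shrinks, then rises as variance takes over.

```python
# Test error vs. model complexity (polynomial degree) on a noisy regression task.
import numpy as np

rng = np.random.default_rng(1)

def h(x):                                    # the true model
    return np.sin(2 * np.pi * x)

x_train = rng.uniform(0, 1, 30)
t_train = h(x_train) + rng.normal(0, 0.3, x_train.size)  # noisy targets
x_test = np.linspace(0, 1, 200)

for degree in (1, 3, 9, 15):
    coeffs = np.polyfit(x_train, t_train, degree)        # least-squares fit
    test_mse = np.mean((np.polyval(coeffs, x_test) - h(x_test)) ** 2)
    print(f"degree {degree:2d}: test MSE = {test_mse:.3f}")
# Typically: error falls from degree 1 to 3 (bias shrinks), then climbs again
# at high degree (variance explodes).
```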

Dart Example
[Figure: dartboard illustration of bias (distance of the average throw from the bullseye) and variance (scatter of the throws around their average).]

Analysis Set-up
Random training data D.
Learned model y(x;D).
True model h.
Quantity of interest: the average squared difference {y(x;D) − h(x)}² for fixed input features x.

Formal Definitions
All expectations are over the random training set D, for a fixed input x:
E[{y(x;D) − h(x)}²] = average squared error.
E[y(x;D)] = average prediction.
bias = E[y(x;D)] − h(x) = average prediction minus true value.
variance = E[{y(x;D) − E[y(x;D)]}²] = average squared difference between the prediction and the average prediction.
Theorem: average squared error = bias² + variance.
For a set of input features x_1, …, x_n, take the average squared error over the x_i.
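
The theorem can be checked numerically (an illustrative sketch, not from the slides; the true model, learner, and sizes are assumed choices): train on many random datasets, then compare the two sides of the identity.

```python
# Empirical check that average squared error = bias² + variance
# (no label noise in this setup, so there is no noise term).
import numpy as np

rng = np.random.default_rng(0)

def h(x):                                    # the true model
    return np.sin(2 * np.pi * x)

x_test = np.linspace(0, 1, 50)               # fixed input features x_1, …, x_n
degree, n_train, n_datasets = 3, 20, 2000

# y_preds[d, i] = y(x_i; D_d): prediction at x_i after training on dataset D_d.
y_preds = np.empty((n_datasets, x_test.size))
for d in range(n_datasets):
    x_tr = rng.uniform(0, 1, n_train)        # a fresh random training set D
    y_preds[d] = np.polyval(np.polyfit(x_tr, h(x_tr), degree), x_test)

avg_pred = y_preds.mean(axis=0)                     # E[y(x;D)] at each x_i
bias_sq = np.mean((avg_pred - h(x_test)) ** 2)      # bias², averaged over the x_i
variance = np.mean((y_preds - avg_pred) ** 2)
avg_sq_err = np.mean((y_preds - h(x_test)) ** 2)

print(f"bias² + variance  = {bias_sq + variance:.6f}")
print(f"avg squared error = {avg_sq_err:.6f}")      # matches up to float error
```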

Bias-Variance Decomposition for Target Values
Observed target value: t(x) = h(x) + noise. The same analysis can be done for t(x) rather than h(x). Result: average squared prediction error = bias² + variance + average noise.
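
For completeness, here is the standard derivation behind this result (a math sketch, not part of the original deck), writing ε for the noise with E[ε] = 0 and E[ε²] = σ², and assuming ε is independent of the training set D:

```latex
\begin{aligned}
\mathbb{E}\big[\{y(x;D) - t(x)\}^2\big]
  &= \mathbb{E}\big[\{y(x;D) - h(x) - \varepsilon\}^2\big] \\
  &= \mathbb{E}\big[\{y(x;D) - h(x)\}^2\big]
     - 2\,\mathbb{E}\big[y(x;D) - h(x)\big]\,\mathbb{E}[\varepsilon]
     + \mathbb{E}[\varepsilon^2] \\
  &= \underbrace{\big\{\mathbb{E}[y(x;D)] - h(x)\big\}^2}_{\text{bias}^2}
   + \underbrace{\mathbb{E}\big[\{y(x;D) - \mathbb{E}[y(x;D)]\}^2\big]}_{\text{variance}}
   + \underbrace{\sigma^2}_{\text{noise}}.
\end{aligned}
```

The cross term vanishes because ε is independent of D and has mean zero; the remaining first term then splits into bias² + variance by the theorem on the previous slide.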

Training Error and Cross-Validation
Suppose we use the training error to estimate the squared difference between the true model's predictions and the learned model's predictions. The training error is downward biased: on average it underestimates the generalization error, because the model was fit to exactly those examples. Cross-validation is nearly unbiased; it slightly overestimates the generalization error, since each fold is trained on less data than the full training set.
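
The downward bias of training error is easy to demonstrate (an illustrative sketch, not from the slides; dataset and model are assumed choices): an unrestricted decision tree can memorize its training set, and cross-validation exposes the gap.

```python
# Training accuracy vs. cross-validated accuracy for an overfitting model.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
tree = DecisionTreeClassifier(random_state=0)       # unrestricted depth: overfits

train_acc = tree.fit(X, y).score(X, y)              # evaluated on memorized data
cv_acc = cross_val_score(tree, X, y, cv=10).mean()  # evaluated on held-out folds

print(f"training accuracy:   {train_acc:.3f}")      # typically 1.000
print(f"10-fold CV accuracy: {cv_acc:.3f}")         # noticeably lower
```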

Classification
Bias-variance analysis can be carried out for classifiers as well. General principle: variance dominates bias. Very roughly, this is because the classifier only needs to make a discrete decision rather than produce an exact value, so a systematic error is harmless as long as it does not push predictions across the decision boundary.
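
One way to make classifier variance concrete (an assumed illustration, not from the slides, in the spirit of Domingos-style 0-1 loss decompositions) is the rate at which a learner's prediction at a test point disagrees with its majority prediction across random training sets; a deep tree shows far more such variance than a decision stump.

```python
# Estimating classifier variance: how often does the prediction at a test point
# differ from the majority ("main") prediction over random training sets?
# Minimal sketch with assumed choices of dataset and models.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_test = X[:200]                                 # fixed evaluation points
X_pool, y_pool = X[200:], y[200:]

def variance_proxy(model, n_datasets=200, n_train=300):
    preds = np.empty((n_datasets, len(X_test)), dtype=int)
    for d in range(n_datasets):
        idx = rng.choice(len(X_pool), size=n_train, replace=False)  # random D
        preds[d] = model.fit(X_pool[idx], y_pool[idx]).predict(X_test)
    main_pred = (preds.mean(axis=0) > 0.5).astype(int)  # majority vote per point
    return (preds != main_pred).mean()                  # disagreement rate

print(f"stump variance:     {variance_proxy(DecisionTreeClassifier(max_depth=1)):.3f}")
print(f"deep tree variance: {variance_proxy(DecisionTreeClassifier()):.3f}")
```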
