The Bias-Variance Trade-Off


The Bias-Variance Trade-Off Oliver Schulte, Machine Learning 726

Estimating Generalization Error The basic problem: once we have built a classifier, how accurate will it be on future test data? (Building a classifier may also involve setting parameters.) The problem of induction: "It's hard to make predictions, especially about the future" (Yogi Berra). Cross-validation: clever computation on the training data to predict test performance; variants include the jackknife and bootstrapping. Today: theoretical insights into generalization performance.
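As a concrete illustration of the cross-validation idea, here is a minimal scikit-learn sketch (my own, not from the slides; the dataset and classifier are arbitrary placeholders):

# Minimal sketch of k-fold cross-validation; the dataset and classifier
# are placeholders, not from the lecture.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
clf = LogisticRegression(max_iter=5000)

# 5-fold CV: train on 4/5 of the data, test on the held-out 1/5, rotate.
scores = cross_val_score(clf, X, y, cv=5)
print("per-fold accuracy:", scores)
print("estimated generalization accuracy: %.3f" % scores.mean())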

The Bias-Variance Trade-off The short story: generalization error = bias² + variance + noise. Bias and variance typically trade off against model complexity: as complexity increases, bias² goes down while variance goes up, and both contribute to the error.
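To see the trade-off empirically, here is a small NumPy sketch (an illustration assuming a noisy sine target, not an example from the lecture): as the polynomial degree grows, training error keeps falling while test error eventually rises.

# Sketch: training vs. test error as model complexity (polynomial degree)
# grows. The sine target, noise level, and sample sizes are illustrative.
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    x = rng.uniform(0, 1, n)
    t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)  # h(x) + noise
    return x, t

x_train, t_train = sample(20)
x_test, t_test = sample(1000)

for degree in [0, 1, 3, 9]:
    coeffs = np.polyfit(x_train, t_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - t_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - t_test) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")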

Dart Example (figure: dart throws around a bull's-eye, illustrating the four combinations of high/low bias and high/low variance as systematic offset and scatter)

Analysis Set-up Random training data D yields a learned model y(x;D); the true model is h. Fix the input to keep things simple for now, and consider the average squared difference {y(x;D) − h(x)}² over random training sets, for fixed input features x.

(Figure: Duda and Hart, Figure 9.4. Legend: red g(x) is the learned model, black F(x) is the truth. (a) A poor model, fixed: high bias, low variance. (b) A better model, also fixed. (c) A cubic model, trained: lower bias, higher variance (the other extreme). (d) A linear model, trained: intermediate bias, intermediate variance.)

Formal Definitions Over random training sets D: E[{y(x;D) − h(x)}²] = average squared error. E[y(x;D)] = average prediction. Bias = E[y(x;D)] − h(x): the average prediction versus the true value. Variance = E[{y(x;D) − E[y(x;D)]}²]: the average squared difference between the prediction and the average prediction. Theorem: average squared error = bias² + variance. For a set of input features x1,…,xn, take the average squared error at each xi. (Compare the Duda and Hart example above.)
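The theorem is easy to check numerically. Below is a small sketch (my own, with an assumed true model h(x) = sin(2πx) and a cubic fit): draw many random training sets D, record the prediction y(x0;D) at a fixed point x0, and compare the average squared error with bias² + variance.

# Sketch: verify average squared error = bias² + variance at a fixed input.
# The true model, noise level, and fit degree are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)

def h(x):                              # assumed true model
    return np.sin(2 * np.pi * x)

x0 = 0.25                              # fixed query point
preds = []
for _ in range(2000):                  # 2000 random training sets D
    x = rng.uniform(0, 1, 15)
    t = h(x) + rng.normal(0, 0.3, 15)
    coeffs = np.polyfit(x, t, 3)       # learned model y(x; D)
    preds.append(np.polyval(coeffs, x0))

preds = np.array(preds)
avg_sq_err = np.mean((preds - h(x0)) ** 2)
bias2 = (preds.mean() - h(x0)) ** 2
variance = preds.var()
print(f"avg squared error {avg_sq_err:.4f} "
      f"= bias² {bias2:.4f} + variance {variance:.4f}")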

Bias-Variance Decomposition for Observed Target Values The observed target value is t(x) = h(x) + noise. We can carry out the same analysis for t(x) rather than h(x). Result: average squared prediction error = bias² + variance + average noise. (Figure: Bishop. As we increase the regularization trade-off parameter, we overfit less, so bias goes up and variance goes down.)
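For reference, the standard derivation of this result, with t = h(x) + ε, E[ε] = 0, Var(ε) = σ², and writing ȳ(x) = E_D[y(x;D)], in LaTeX:

\begin{align*}
\mathbb{E}_{D,\epsilon}\big[(y(x;D) - t)^2\big]
  &= \mathbb{E}_D\big[(y(x;D) - h(x))^2\big] + \sigma^2 \\
  &= \underbrace{\big(\bar{y}(x) - h(x)\big)^2}_{\text{bias}^2}
   + \underbrace{\mathbb{E}_D\big[(y(x;D) - \bar{y}(x))^2\big]}_{\text{variance}}
   + \underbrace{\sigma^2}_{\text{noise}}
\end{align*}

The cross terms vanish because the noise is independent of D and has zero mean.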

Training Error and Cross-Validation Suppose we use the training error to estimate the difference between the true model prediction and the learned model prediction. The training error is downward biased: on average over datasets, it underestimates the generalization error. Cross-validation is nearly unbiased; it slightly overestimates the generalization error.
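The following sketch illustrates the point (the polynomial-regression setup is my own assumption, not the lecture's): averaged over many random training sets, the training MSE comes out below the MSE on fresh data, while the 5-fold cross-validation estimate lands close to it, slightly above.

# Sketch: training error is downward biased; cross-validation is nearly
# unbiased. The noisy-sine setup is illustrative, not from the slides.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(2)
train_errs, cv_errs, gen_errs = [], [], []

for _ in range(200):                       # average over random training sets
    x = rng.uniform(0, 1, 30)
    t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 30)
    model = make_pipeline(PolynomialFeatures(5), LinearRegression())
    model.fit(x[:, None], t)
    train_errs.append(np.mean((model.predict(x[:, None]) - t) ** 2))
    cv_errs.append(-cross_val_score(model, x[:, None], t, cv=5,
                                    scoring="neg_mean_squared_error").mean())
    x_new = rng.uniform(0, 1, 1000)        # fresh data: generalization error
    t_new = np.sin(2 * np.pi * x_new) + rng.normal(0, 0.3, 1000)
    gen_errs.append(np.mean((model.predict(x_new[:, None]) - t_new) ** 2))

print(f"training MSE {np.mean(train_errs):.3f} < "
      f"generalization MSE {np.mean(gen_errs):.3f} "
      f"<~ CV MSE {np.mean(cv_errs):.3f}")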

Classification We can carry out a bias-variance analysis for classifiers as well. General principle: variance dominates bias. Very roughly, this is because we only need to make a discrete decision rather than produce an exact value. (Not in Bishop; see Duda and Hart.)

Classification (Figure legend: (a) A full Gaussian model, trained: high variance in decision boundaries and in errors. (b) An intermediate Gaussian model with diagonal covariance: lower variance in boundaries and errors. (c) Unit covariance (a linear model): decision boundaries do not change much; higher bias.)

Variance and Big Parameters Neural networks are models with many parameters. Generally, many parameters mean low bias and high variance. Big data reduces variance. Why? Consider a model trained until it reaches 0 training error, and hence 0 bias: the prediction error is then entirely due to variance in the data, i.e. to whether the training data are representative of the true model.

Variance and Big Data Suppose we have independent and identically distributed data points (random samples). Then the variance of an estimate computed from a dataset of size n is Var(estimate from a single sample)/n, so its standard deviation shrinks by a factor of √n. E.g., n = 10⁴ ⇒ standard deviation reduced by a factor of 100. In the limit, "overfitting" an infinite dataset yields correct predictions.
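A quick numerical check of the 1/n scaling (illustrative numbers only):

# Sketch: the variance of a sample mean scales as 1/n
# (so its standard deviation scales as 1/sqrt(n)).
import numpy as np

rng = np.random.default_rng(3)
for n in [1, 100, 10_000]:
    # 1000 independent datasets of size n, each reduced to its mean
    means = rng.normal(0, 1, size=(1000, n)).mean(axis=1)
    print(f"n = {n:6d}: Var(mean) ~ {means.var():.6f} (theory: {1 / n:.6f})")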