The Bias Variance Tradeoff and Regularization

Presentation transcript:

The Bias Variance Tradeoff and Regularization. Slides by: Joseph E. Gonzalez (jegonzal@cs.berkeley.edu). Spring '18 updates: Fernando Perez (fernando.perez@berkeley.edu).

Linear models for non-linear relationships. Advice for people who are dealing with non-linear relationship issues but would really prefer the simplicity of a linear relationship.

Is this data linear? What does it mean to be linear?

What does it mean to be a linear model? In what sense is the model linear? Are linear models linear in the features? In the parameters?

Introducing Non-linear Feature Functions. One reasonable feature function might be: [feature-function equations not captured in the transcript]. This is still a linear model, in the parameters.
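
As a minimal sketch of this point (the particular feature function below, φ(x) = [1, x, sin x], and the toy data are my own illustration, not necessarily the one on the slide): the model is non-linear in x, but because the prediction is a linear function of the parameters, ordinary least squares still applies.

    import numpy as np

    # Toy 1-D data with a non-linear relationship (illustrative only).
    rng = np.random.default_rng(42)
    x = np.linspace(0, 10, 100)
    y = 2.0 + 0.3 * x + 1.5 * np.sin(x) + rng.normal(scale=0.2, size=x.shape)

    # Non-linear feature function: phi(x) = [1, x, sin(x)].
    Phi = np.column_stack([np.ones_like(x), x, np.sin(x)])

    # y_hat = Phi @ theta is linear in theta, so least squares still works.
    theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    print(theta)   # roughly [2.0, 0.3, 1.5]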

What are the fundamental challenges in learning?

Fundamental Challenges in Learning? Fit the Data: provide an explanation for what we observe. Generalize to the World: predict the future, explain the unobserved. Is this cat grumpy, or are we overfitting to human faces?

Fundamental Challenges in Learning? Bias: the expected deviation between the predicted value and the true value. Variance: two sources. Observation Variance: the variability of the random noise in the process we are trying to model. Estimated Model Variance: the variability in the predicted value across different training datasets.

Bias: the expected deviation between the predicted value and the true value. It depends on both the choice of f and the learning procedure. High bias leads to under-fitting. [Figure: the set of all possible functions, the subset reachable by the possible θ values, the true function, and the bias as the gap between them.]

Observation Variance: the variability of the random noise in the process we are trying to model, e.g., measurement variability, stochasticity, or missing information. Beyond our control (usually).

Estimated Model Variance: the variability in the predicted value across different training datasets. Sensitivity to variation in the training data leads to poor generalization and overfitting.

The Bias-Variance Tradeoff: decreasing bias typically increases the estimated model variance, and vice versa.

Visually: the red center of the board is the true behavior of the universe. Each dart is a specific function that models the universe; our parameters select these functions. From "Understanding the Bias-Variance Tradeoff", by Scott Fortmann-Roe.

Demo
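
The demo itself isn't captured in the transcript. Below is a stand-in sketch (the true function, noise level, and all names are assumptions, not the course's actual demo): it refits polynomial models of increasing degree on many freshly sampled training sets and estimates the (Bias)² and model variance of the prediction at a single test point.

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    h = lambda x: np.sin(2 * x)        # assumed "true" function h(x)
    sigma, x0 = 0.3, 1.0               # noise std and the test point
    degrees = range(1, 9)

    bias_sq, model_var = [], []
    for d in degrees:
        preds = []
        for _ in range(500):           # 500 resampled training sets
            x_tr = rng.uniform(0, 3, 30)
            y_tr = h(x_tr) + rng.normal(scale=sigma, size=30)
            preds.append(np.polyval(np.polyfit(x_tr, y_tr, d), x0))
        preds = np.array(preds)
        bias_sq.append((preds.mean() - h(x0)) ** 2)    # (Bias)^2
        model_var.append(preds.var())                  # model variance

    test_err = np.array(bias_sq) + np.array(model_var) + sigma ** 2
    plt.plot(degrees, bias_sq, label="(Bias)^2")
    plt.plot(degrees, model_var, label="Model variance")
    plt.plot(degrees, test_err, label="Expected test error")
    plt.xlabel("Model complexity (polynomial degree)")
    plt.legend()
    plt.show()

As complexity grows, (Bias)² falls while the model variance rises, which is exactly the tradeoff plotted a few slides below.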

Analysis of the Bias-Variance Trade-off

Analysis of Squared Error. Setup: assume noisy observations, y = h(x) + ϵ, where h is the true function and ϵ is a zero-mean noise term, so y is a random variable. Assume the training data is random, so the fitted parameters θ are a random variable. For a test point x, we analyze the expected error E[(y − f_θ(x))²]. (On the slide, random variables are shown in red.)

Analysis of Squared Error. Goal: show that E[(y − f_θ(x))²] = Obs. Variance + (Bias)² + Model Variance. Other terminology: "Noise" + (Bias)² + Variance.

Step 1, subtracting and adding h(x) inside the square:
E[(y − f_θ(x))²] = E[((y − h(x)) + (h(x) − f_θ(x)))²]

Step 2, expanding in terms of a = y − h(x) and b = h(x) − f_θ(x), using (a + b)² = a² + 2ab + b²:
= E[(y − h(x))²] + 2·E[(y − h(x))(h(x) − f_θ(x))] + E[(h(x) − f_θ(x))²]

Step 3, the cross term vanishes: y − h(x) = ϵ, and ϵ (the noise at the test point) is independent of θ (which depends only on the training data), so E[(y − h(x))(h(x) − f_θ(x))] = E[ϵ]·E[h(x) − f_θ(x)] = 0.

= E[(y − h(x))²] + E[(h(x) − f_θ(x))²] = Obs. Variance + Model Estimation Error. The first ("noise") term compares the observed value to the true value; the second compares the true value to the predicted value.

Next we will show that the Model Estimation Error itself splits into (Bias)² + Model Variance. How? By adding and subtracting E[f_θ(x)]:
E[(h(x) − f_θ(x))²] = E[((h(x) − E[f_θ(x)]) + (E[f_θ(x)] − f_θ(x)))²]

Expanding in terms of a = h(x) − E[f_θ(x)] and b = E[f_θ(x)] − f_θ(x):
= E[a²] + 2·E[a·b] + E[b²]

Because a = h(x) − E[f_θ(x)] is a constant (it does not depend on the training data), E[a²] = a² and E[a·b] = a·E[b].

And E[b] = E[E[f_θ(x)] − f_θ(x)] = 0, so the cross term is again zero.

= (h(x) − E[f_θ(x)])² + E[(f_θ(x) − E[f_θ(x)])²] = (Bias)² + Model Variance

Putting the pieces together: E[(y − f_θ(x))²] = Obs. Variance ("Noise") + (Bias)² + Model Variance
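
As an informal sanity check on this algebra (a sketch with an assumed data-generating process, not part of the original slides), the three terms can be estimated by simulation and compared against a direct estimate of the expected squared error:

    import numpy as np

    rng = np.random.default_rng(1)
    h = lambda x: x ** 2               # assumed true function
    sigma, x0, n_reps = 0.5, 0.8, 20_000

    preds = np.empty(n_reps)
    errs = np.empty(n_reps)
    for i in range(n_reps):
        # Fresh training set and fresh test observation each repetition.
        x_tr = rng.uniform(-1, 1, 20)
        y_tr = h(x_tr) + rng.normal(scale=sigma, size=20)
        coef = np.polyfit(x_tr, y_tr, 1)        # deliberately simple model (high bias)
        preds[i] = np.polyval(coef, x0)
        y0 = h(x0) + rng.normal(scale=sigma)    # noisy observation at the test point
        errs[i] = (y0 - preds[i]) ** 2

    lhs = errs.mean()                           # E[(y - f_theta(x))^2]
    rhs = sigma ** 2 + (preds.mean() - h(x0)) ** 2 + preds.var()
    print(lhs, rhs)   # should agree up to Monte-Carlo error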

Bias Variance Plot. [Plot: (Bias)², variance, and test error as functions of increasing model complexity; the optimal model complexity minimizes the test error.]

Bias Variance with Increasing Data. [Plot: (Bias)², variance, and test error vs. increasing model complexity, repeated for increasing amounts of data, with the optimal value marked.]

How do we control model complexity? So far: the number of features and the choices of features. Next: regularization. [Plot: test error, variance, and (Bias)² vs. increasing model complexity, with the optimal value marked.]

Bias Variance Derivation Quiz (http://bit.ly/ds100-fa18-bva). Match each of the following: (Bias)², Model Variance, Obs. Variance, h(x), h(x) + ϵ — to the expressions (1)–(4) shown on the slide (not captured in this transcript).

Regularization: parametrically controlling the model complexity with a parameter 𝜆. The tradeoff: increase bias, decrease variance.

Basic Idea of Regularization: minimize an objective that combines fitting the data with a penalty on complex models (a loss term plus 𝜆·R(θ), where 𝜆 is the regularization parameter). How should we define R(θ)? How do we determine 𝜆?
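
As a minimal sketch of this objective (variable names and data are assumptions): with a squared-loss data-fit term and R(θ) = ‖θ‖², the regularized problem has a closed-form solution, which makes it easy to see 𝜆 acting as a knob on model complexity.

    import numpy as np

    def ridge_fit(X, y, lam):
        """Minimize ||y - X @ theta||^2 + lam * ||theta||^2 (closed form)."""
        d = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

    rng = np.random.default_rng(7)
    X = rng.normal(size=(50, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

    # Larger lam shrinks theta toward 0: more bias, less variance.
    for lam in [0.0, 1.0, 100.0]:
        print(lam, ridge_fit(X, y, lam))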

The Regularization Function R(θ). Goal: penalize model complexity. Recall earlier: more features lead to overfitting. How can we control overfitting through θ? Proposal: set weights to 0 to remove features. Notes: - Any regularization is a type of bias: it tells us how to select a model among others, absent additional evidence from the data. - Regularization terms are independent of the data; therefore they reduce the variance of the model.

Common Regularization Functions. Ridge Regression (L2-Reg): distributes weight across related features (robust); has an analytic solution (easy to compute); does not encourage sparsity, giving small but non-zero weights. LASSO (L1-Reg, Least Absolute Shrinkage and Selection Operator): encourages sparsity by setting weights to 0; used to select informative features; does not have an analytic solution, so numerical methods are required.
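
A small sketch contrasting the two penalties (scikit-learn assumed available; the synthetic data is purely illustrative): LASSO drives uninformative coefficients exactly to zero, while ridge shrinks them but keeps them non-zero.

    import numpy as np
    from sklearn.linear_model import Lasso, Ridge

    rng = np.random.default_rng(8)
    X = rng.normal(size=(200, 10))
    # Only the first two features actually matter.
    y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)

    ridge = Ridge(alpha=1.0).fit(X, y)
    lasso = Lasso(alpha=0.1).fit(X, y)
    print("ridge:", np.round(ridge.coef_, 3))   # small but non-zero weights
    print("lasso:", np.round(lasso.coef_, 3))   # many weights exactly 0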

Regularization and Norm Balls. [Figure: contours of the loss in the (𝜃1, 𝜃2) plane, with the unconstrained optimum 𝜃opt.]

Regularization and Norm Balls. L1 Norm (LASSO): sparsity inducing; the solution snaps to corners. L2 Norm (Ridge): weight sharing. L1 + L2 Norm (Elastic Net): a compromise; two parameters; also snaps to corners. Notes: - Note the isosurfaces for the various Lp norms. - The L1 isosurface shows why any increase in the value of one parameter has to be offset by an equal decrease in the value of another. [Figure: loss contours and the corresponding norm ball for each penalty, with the constrained optimum 𝜃opt in the (𝜃1, 𝜃2) plane.]

Python Demo! The shapes of the norm balls. Maybe show reg. effects on actual models.
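
One possible version of that demo, sketched below (matplotlib assumed; this is my guess at what "the shapes of the norm balls" would show): fill in the unit ball ‖θ‖p ≤ 1 for a few values of p, which makes the corners of the L1 ball, and hence its sparsity-inducing behavior, easy to see.

    import numpy as np
    import matplotlib.pyplot as plt

    t1, t2 = np.meshgrid(np.linspace(-1.5, 1.5, 400),
                         np.linspace(-1.5, 1.5, 400))

    fig, axes = plt.subplots(1, 3, figsize=(9, 3))
    for ax, p in zip(axes, [1, 2, 4]):
        norm = (np.abs(t1) ** p + np.abs(t2) ** p) ** (1 / p)
        ax.contourf(t1, t2, (norm <= 1).astype(float))   # the unit Lp "ball"
        ax.set_title(f"L{p} norm ball")
        ax.set_aspect("equal")
    plt.show()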

Determining the Optimal 𝜆. The value of 𝜆 determines the bias-variance tradeoff: larger values mean more regularization, hence more bias and less variance.
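
The transcript doesn't record how 𝜆 is chosen; a common recipe (sketched below with scikit-learn as an assumption, not necessarily the course's exact procedure) is to sweep a grid of 𝜆 values and keep the one with the best cross-validated error.

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(9)
    X = rng.normal(size=(100, 5))
    y = X @ np.array([1.0, 0.5, 0.0, 0.0, -1.0]) + rng.normal(scale=0.3, size=100)

    lambdas = np.logspace(-3, 3, 13)
    scores = [cross_val_score(Ridge(alpha=lam), X, y, cv=5,
                              scoring="neg_mean_squared_error").mean()
              for lam in lambdas]
    best_lam = lambdas[int(np.argmax(scores))]   # highest score = lowest CV error
    print("best lambda:", best_lam)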

Summary: Regularization.