1 Regression and the Bias-Variance Decomposition William Cohen 10-601 April 2008 Readings: Bishop 3.1,3.2.


2 Regression
Technically: learning a function f(x)=y where y is real-valued, rather than discrete.
– Replace livesInSquirrelHill(x1,x2,…,xn) with averageCommuteDistanceInMiles(x1,x2,…,xn)
– Replace userLikesMovie(u,m) with usersRatingForMovie(u,m)
– …

3 Example: univariate linear regression Example: predict age from number of publications

4 Linear regression
Model: yi = a·xi + b + εi, where εi ~ N(0,σ)
Training data: (x1,y1),…,(xn,yn)
Goal: estimate a,b with ŵ = (â, b̂), assuming MLE

5 Linear regression
Model: yi = a·xi + b + εi, where εi ~ N(0,σ)
Training data: (x1,y1),…,(xn,yn)
Goal: estimate a,b with ŵ = (â, b̂)
Ways to estimate the parameters:
– Find the derivative wrt the parameters a,b
– Set it to zero and solve
– Or use gradient ascent to solve
– Or …
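As an illustration of the gradient route, here is a minimal numpy sketch (not the lecture's code; the function name, learning rate, and data are invented) that fits a and b by gradient descent on the mean squared error, which is equivalent to gradient ascent on the Gaussian log-likelihood:

    import numpy as np

    def fit_univariate(x, y, lr=0.01, n_steps=5000):
        # Fit y = a*x + b by gradient descent on the mean squared error.
        a, b = 0.0, 0.0
        n = len(x)
        for _ in range(n_steps):
            resid = a * x + b - y                  # prediction error
            grad_a = 2.0 / n * np.dot(resid, x)    # d MSE / d a
            grad_b = 2.0 / n * resid.sum()         # d MSE / d b
            a -= lr * grad_a
            b -= lr * grad_b
        return a, b

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 10, 50)
    y = 2.0 * x + 1.0 + rng.normal(0, 1, 50)
    print(fit_univariate(x, y))   # roughly (2, 1)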

6 Linear regression
[Figure: scatter plot of the (xi, yi) points with a fitted line and vertical residuals d1, d2, d3]
How to estimate the slope?
â = Σi (xi − x̄)(yi − ȳ) / Σi (xi − x̄)², i.e., n·cov(X,Y) / (n·var(X))

7 Linear regression
How to estimate the intercept?
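A minimal numpy sketch of the closed-form MLE (an assumption-labeled illustration, not copied from the slides), matching the slope formula above and the usual intercept estimate b̂ = ȳ − â·x̄:

    import numpy as np

    def fit_closed_form(x, y):
        # Closed-form least-squares / MLE estimates for univariate regression.
        x_bar, y_bar = x.mean(), y.mean()
        a_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)  # cov(X,Y)/var(X)
        b_hat = y_bar - a_hat * x_bar
        return a_hat, b_hat

    x = np.array([1.0, 2.0, 3.0, 4.0])
    y = np.array([2.1, 3.9, 6.2, 7.8])
    print(fit_closed_form(x, y))   # approximately (1.94, 0.15)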

8 Bias/Variance Decomposition of Error

9 Return to the simple regression problem
f: X → Y, with y = f(x) + ε, where f is deterministic and the noise ε ~ N(0,σ)
What is the expected error for a learned h?
Bias – Variance decomposition of error

10 Bias – Variance decomposition of error
(Notation: h_D is the hypothesis learned from the dataset D; f is the true function.)
Experiment (the error of which I'd like to predict):
1. Draw a size-n sample D = (x1,y1),…,(xn,yn)
2. Train a linear function h_D using D
3. Draw a test example (x, f(x)+ε)
4. Measure the squared error of h_D on that example

11 Bias – Variance decomposition of error (2)
(Same notation: h_D learned from D, f the true function.)
Fix x, then do this experiment:
1. Draw a size-n sample D = (x1,y1),…,(xn,yn)
2. Train a linear function h_D using D
3. Draw the test example (x, f(x)+ε)
4. Measure the squared error of h_D on that example

12 Bias – Variance decomposition of error
We want the expected squared error E_D,ε[(t − ŷ)²], where t = f(x) + ε is the observed target and ŷ = h_D(x) is the learned prediction (really ŷ_D, since it depends on the training set D).

13 Bias – Variance decomposition of error
E_D,ε[(t − ŷ_D)²] = E_D[(f(x) − ŷ_D)²] + σ²
The first term depends on how well the learner approximates f; the second is the intrinsic noise.
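A short derivation sketch of this split and of the bias/variance split named on the next slide (standard algebra, not copied verbatim from the slides), writing ŷ_D = h_D(x) and t = f(x) + ε with E[ε] = 0, E[ε²] = σ²:

    \begin{aligned}
    E_{D,\varepsilon}\big[(t-\hat y_D)^2\big]
      &= E_{D,\varepsilon}\big[(f(x)-\hat y_D+\varepsilon)^2\big]
       = E_D\big[(f(x)-\hat y_D)^2\big] + \sigma^2 \\
    E_D\big[(f(x)-\hat y_D)^2\big]
      &= \big(f(x)-E_D[\hat y_D]\big)^2 + E_D\big[(\hat y_D-E_D[\hat y_D])^2\big]
       = \mathrm{Bias}^2 + \mathrm{Variance}
    \end{aligned}

The cross terms vanish because E[ε] = 0 (and ε is independent of D), and because E_D[ŷ_D − E_D[ŷ_D]] = 0.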

14 Bias – Variance decomposition of error
VARIANCE: the squared difference between our long-term expectation for the learner's performance, E_D[h_D(x)], and what we expect in a representative run on a dataset D (that is, ŷ_D)
BIAS²: the squared difference between the best possible prediction for x, f(x), and our long-term expectation of what the learner will do if we averaged over many datasets D, E_D[h_D(x)]
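A hedged Monte Carlo sketch of the experiment from slides 10-11 at a single fixed x (the true function, noise level, sample size, and test point are all invented for illustration), checking that the measured error roughly equals noise + bias² + variance:

    import numpy as np

    rng = np.random.default_rng(0)
    f = lambda x: np.sin(x)          # "true" function (assumption)
    sigma, n, x0 = 0.3, 20, 1.5

    preds, errs = [], []
    for _ in range(2000):            # repeat the experiment many times
        x = rng.uniform(0, 3, n)
        y = f(x) + rng.normal(0, sigma, n)
        a, b = np.polyfit(x, y, deg=1)          # train a linear h_D
        y_hat = a * x0 + b                      # prediction at the fixed x0
        preds.append(y_hat)
        errs.append((f(x0) + rng.normal(0, sigma) - y_hat) ** 2)

    preds = np.array(preds)
    bias2 = (f(x0) - preds.mean()) ** 2
    variance = preds.var()
    print(np.mean(errs), sigma**2 + bias2 + variance)   # the two should roughly agree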

15 Bias-variance decomposition
How can you reduce the bias of a learner? Make the long-term average better approximate the true function f(x).
How can you reduce the variance of a learner? Make the learner less sensitive to variations in the data.

16 A generalization of bias-variance decomposition to other loss functions
“Arbitrary” real-valued loss L(t,y), with L(y,y’) = L(y’,y), L(y,y) = 0, and L(y,y’) ≠ 0 if y ≠ y’
Define the “optimal prediction”: y* = argmin_y’ E_t[L(t,y’)]
Define the “main prediction of the learner”: y_m = argmin_y’ E_D[L(y_D,y’)], where y_D is the learner’s prediction after training on D
Define the “bias of the learner”: B(x) = L(y*, y_m)
Define the “variance of the learner”: V(x) = E_D[L(y_m, y_D)]
Define the “noise for x”: N(x) = E_t[L(t, y*)]
Claim: E_D,t[L(t, y_D)] = c1·N(x) + B(x) + c2·V(x), where c1 = 2·Pr_D[y_D = y*] − 1 and c2 = 1 if y_m = y*, −1 otherwise (these are the 0/1-loss values; for squared loss c1 = c2 = 1)
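For squared loss the definitions reduce to the familiar quantities (y* = E_t[t], y_m = E_D[y_D], c1 = c2 = 1). A small numpy sketch checking the claim numerically in that case; the distributions of t and of the learner's prediction y_D at this x are invented, not from the paper:

    import numpy as np

    rng = np.random.default_rng(1)
    t   = 2.0 + rng.normal(0, 0.5, 100_000)   # targets t at this x (assumed)
    y_D = 2.4 + rng.normal(0, 0.8, 100_000)   # predictions over random datasets D (assumed)

    y_star = t.mean()                 # optimal prediction for squared loss
    y_main = y_D.mean()               # main prediction for squared loss
    N = np.mean((t - y_star) ** 2)    # noise
    B = (y_star - y_main) ** 2        # bias
    V = np.mean((y_main - y_D) ** 2)  # variance
    lhs = np.mean((t - y_D) ** 2)     # E_{D,t}[L(t, y_D)], with t and y_D independent
    print(lhs, N + B + V)             # should roughly agree (c1 = c2 = 1)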

17 Other regression methods

18 Example: univariate linear regression
Example: predict age from number of publications
Paul Erdős, Hungarian mathematician: x ≈ 1500 publications, so the fitted line predicts an age of about 240

19 Linear regression
[Figure: scatter plot with fitted line and residuals, as before]
Summary. To simplify, assume zero-centered data, as we did for PCA; let x = (x1,…,xn) and y = (y1,…,yn); then…

20 Onward: multivariate linear regression
[Formulas for the univariate vs. multivariate case; in the multivariate case the data matrix X has one row per example and one column per feature]
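A minimal numpy sketch of the multivariate least-squares fit via the normal equations, ŵ = (XᵀX)⁻¹Xᵀy (standard textbook form, not copied from the slides; the data here is invented):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))               # 100 examples, 3 features
    w_true = np.array([1.0, -2.0, 0.5])
    y = X @ w_true + rng.normal(0, 0.1, 100)

    # Solve the normal equations directly rather than forming an explicit inverse.
    w_hat = np.linalg.solve(X.T @ X, X.T @ y)
    print(w_hat)                                # close to w_true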

21 Onward: multivariate linear regression
Regularized (ridge) solution: ŵ = (XᵀX + λI)⁻¹ Xᵀy

22 Onward: multivariate linear regression Multivariate, multiple outputs

23 Onward: multivariate linear regression
Regularized (ridge) solution: ŵ = (XᵀX + λI)⁻¹ Xᵀy
What does increasing λ do?
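A hedged numpy sketch of the ridge estimate so you can inspect the effect of λ yourself (the data and the λ grid are invented):

    import numpy as np

    # Ridge regression: w_hat = (X^T X + lam*I)^{-1} X^T y.
    # Sweep lambda and watch the coefficients shrink toward zero as it grows.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 5))
    y = X @ np.array([3.0, -1.0, 0.0, 2.0, 0.5]) + rng.normal(0, 0.5, 50)

    d = X.shape[1]
    for lam in [0.0, 0.1, 1.0, 10.0, 100.0]:
        w_hat = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
        print(lam, np.round(w_hat, 3), round(float(np.linalg.norm(w_hat)), 3))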

24 Onward: multivariate linear regression
Regularized solution, with w = (w1, w2)
What does fixing w2 = 0 do (if λ = 0)?

25 Regression trees - summary [Quinlan's M5]
Growing tree:
– Split to optimize information gain
At each leaf node:
– Predict the majority class (M5: build a linear model, then greedily remove features)
Pruning tree:
– Prune to reduce error on holdout (M5: use error estimated on the training data, adjusted by (n+k)/(n−k), where n = #cases and k = #features)
Prediction:
– Trace the path to a leaf and predict the associated majority class (M5: use a linear interpolation of the predictions made by every node on the path)
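M5 itself is not in the common Python libraries, but a plain regression tree (constant prediction per leaf, not a linear model) can sketch the grow/prune workflow. This uses scikit-learn's DecisionTreeRegressor with cost-complexity pruning; the data is invented:

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    X = rng.uniform(0, 6, size=(200, 1))
    y = np.sin(X[:, 0]) + rng.normal(0, 0.2, 200)

    full_tree = DecisionTreeRegressor(random_state=0).fit(X, y)               # grown out fully
    pruned_tree = DecisionTreeRegressor(ccp_alpha=0.01, random_state=0).fit(X, y)  # cost-complexity pruned
    print(full_tree.get_n_leaves(), pruned_tree.get_n_leaves())               # pruning gives fewer leaves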

26 Regression trees: example - 1

27 Regression trees – example 2 What does pruning do to bias and variance?

28 Kernel regression aka locally weighted regression, locally linear regression, LOESS, …
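A compact numpy sketch of locally weighted linear regression (an illustrative assumption, not the lecture's code): for each query point, fit a weighted least-squares line, with a Gaussian kernel giving nearby points more weight:

    import numpy as np

    def loess_predict(x, y, x0, tau=0.5):
        # Locally weighted linear regression at a single query point x0;
        # the bandwidth tau is an invented choice.
        w = np.exp(-(x - x0) ** 2 / (2 * tau ** 2))       # kernel weights
        A = np.column_stack([np.ones_like(x), x])          # design matrix [1, x]
        W = np.diag(w)
        beta = np.linalg.solve(A.T @ W @ A, A.T @ W @ y)   # weighted normal equations
        return beta[0] + beta[1] * x0

    rng = np.random.default_rng(0)
    x = np.sort(rng.uniform(0, 6, 80))
    y = np.sin(x) + rng.normal(0, 0.2, 80)
    print(loess_predict(x, y, 3.0))                        # roughly sin(3.0)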

29 Kernel regression
aka locally weighted regression, locally linear regression, …
Close approximation to kernel regression:
– Pick a few values z1,…,zk up front
– Preprocess: for each example (x,y), replace x with x′ = ⟨K(x,z1),…,K(x,zk)⟩, where K(x,z) = exp(−(x−z)² / 2σ²)
– Use multivariate regression on the (x′,y) pairs
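A small numpy sketch of that recipe (the centers, bandwidth, and data are invented): map each x to a vector of kernel evaluations at fixed centers, then do ordinary linear regression on those features:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 6, 100)
    y = np.sin(x) + rng.normal(0, 0.2, 100)

    centers = np.linspace(0, 6, 10)        # the z_j, chosen up front
    sigma = 0.7
    Phi = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * sigma ** 2))   # K(x_i, z_j)
    w = np.linalg.solve(Phi.T @ Phi + 1e-8 * np.eye(10), Phi.T @ y)          # least squares (tiny ridge for stability)

    x_new = 3.0
    phi_new = np.exp(-(x_new - centers) ** 2 / (2 * sigma ** 2))
    print(phi_new @ w)                     # roughly sin(3.0)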

30 Kernel regression aka locally weighted regression, locally linear regression, LOESS, … What does making the kernel wider do to bias and variance?

31 Additional readings
P. Domingos, A Unified Bias-Variance Decomposition and its Applications. Proceedings of the Seventeenth International Conference on Machine Learning, Stanford, CA: Morgan Kaufmann, 2000.
J. R. Quinlan, Learning with Continuous Classes. 5th Australian Joint Conference on Artificial Intelligence, 1992.
Y. Wang & I. Witten, Inducing Model Trees for Continuous Classes. 9th European Conference on Machine Learning, 1997.
D. A. Cohn, Z. Ghahramani & M. Jordan, Active Learning with Statistical Models. JAIR, 1996.