
1 Regression and the Bias-Variance Decomposition. William Cohen, 10-601, April 2008. Readings: Bishop 3.1, 3.2.

2 Regression. Technically: learning a function f(x) = y where y is real-valued rather than discrete.
– Replace livesInSquirrelHill(x1, x2, …, xn) with averageCommuteDistanceInMiles(x1, x2, …, xn)
– Replace userLikesMovie(u,m) with usersRatingForMovie(u,m)
– …

3 Example: univariate linear regression. Example: predict age from number of publications.

4 Linear regression. Model: y_i = a·x_i + b + ε_i, where ε_i ~ N(0, σ). Training data: (x_1, y_1), …, (x_n, y_n). Goal: estimate a, b with ŵ = (â, b̂); assume MLE.

5 Linear regression. Model: y_i = a·x_i + b + ε_i, where ε_i ~ N(0, σ). Training data: (x_1, y_1), …, (x_n, y_n). Goal: estimate a, b with ŵ = (â, b̂). Ways to estimate the parameters:
– Find the derivative with respect to the parameters a, b, set it to zero, and solve
– Or use gradient ascent (on the log-likelihood) to solve
– Or …
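
A minimal numpy sketch of the closed-form solution (the function name and toy data below are illustrative, not from the slides): setting the derivatives of the squared error with respect to a and b to zero gives â = cov(x, y) / var(x) and b̂ = ȳ − â·x̄.

```python
import numpy as np

def fit_univariate(x, y):
    """Least-squares / MLE estimates for y = a*x + b + noise."""
    x_bar, y_bar = x.mean(), y.mean()
    a_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    b_hat = y_bar - a_hat * x_bar
    return a_hat, b_hat

# Toy data drawn from the model y = 2x + 1 + N(0, 0.5)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2 * x + 1 + rng.normal(0, 0.5, size=50)
print(fit_univariate(x, y))   # roughly (2.0, 1.0)
```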

6 Linear regression. [Figure: scatter of points (x_1, y_1), (x_2, y_2), … with vertical residuals d_1, d_2, d_3 to the fitted line.] How to estimate the slope? â = n·cov(X,Y) / (n·var(X)) = cov(X,Y) / var(X).

7 Linear regression. [Figure: the same scatter plot with residuals d_1, d_2, d_3.] How to estimate the intercept?

8 Bias/Variance Decomposition of Error

9 Bias-Variance decomposition of error. Return to the simple regression problem f: X → Y, y = f(x) + ε, where f is deterministic and the noise ε ~ N(0, σ). What is the expected error for a learned h?

10 Bias-Variance decomposition of error. [Diagram labels: f is the true function, D is the dataset, h_D is learned from D.] Experiment (the error of which I'd like to predict):
1. Draw a size-n sample D = (x_1, y_1), …, (x_n, y_n)
2. Train a linear function h_D using D
3. Draw a test example (x, f(x) + ε)
4. Measure the squared error of h_D on that example

11 Bias-Variance decomposition of error (2). [Diagram labels: f is the true function, D is the dataset, h_D is learned from D.] Fix x, then do this experiment:
1. Draw a size-n sample D = (x_1, y_1), …, (x_n, y_n)
2. Train a linear function h_D using D
3. Draw the test example (x, f(x) + ε)
4. Measure the squared error of h_D on that example
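
A hedged Monte Carlo sketch of this fixed-x experiment (the true function, noise level, and sample sizes below are illustrative assumptions): repeatedly draw datasets D, fit a line, and compare noise + bias² + variance against the directly measured expected squared error.

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sin(x)           # assumed "true" function, for illustration only
sigma, n, trials = 0.3, 20, 2000  # noise level, dataset size, number of datasets D
x0 = 1.5                          # the fixed test point x

preds = np.empty(trials)
for i in range(trials):
    xs = rng.uniform(0, 3, size=n)             # 1. draw a size-n sample D
    ys = f(xs) + rng.normal(0, sigma, size=n)
    a, b = np.polyfit(xs, ys, deg=1)           # 2. train a linear h_D on D
    preds[i] = a * x0 + b                      # h_D(x0)

bias2 = (preds.mean() - f(x0)) ** 2            # (E_D[h_D(x)] - f(x))^2
variance = preds.var()                         # E_D[(h_D(x) - E_D[h_D(x)])^2]
noise = sigma ** 2                             # E_t[(t - f(x))^2]

# 3-4. draw test examples (x0, f(x0)+eps) and measure squared error directly
errors = (f(x0) + rng.normal(0, sigma, size=trials) - preds) ** 2
print(bias2 + variance + noise, errors.mean())  # the two should roughly agree
```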

12 Bias-Variance decomposition of error. [Derivation slide: with t = f(x) + ε the true target and ŷ_D = h_D(x) the learned prediction, expand the expected squared error E_{D,t}[(t − ŷ_D)²] into the terms named on the next two slides.]

13 Bias-Variance decomposition of error. One term is the intrinsic noise; the remaining terms depend on how well the learner approximates f.

14 Bias-Variance decomposition of error. BIAS²: the squared difference between the best possible prediction for x, f(x), and our "long-term" expectation of what the learner will do if we averaged over many datasets D, E_D[h_D(x)]. VARIANCE: the squared difference between our long-term expectation for the learner's performance, E_D[h_D(x)], and what we expect in a representative run on a dataset D, ŷ_D.

15 Bias-variance decomposition. How can you reduce the bias of a learner? Make the long-term average better approximate the true function f(x). How can you reduce the variance of a learner? Make the learner less sensitive to variations in the data.

16 A generalization of the bias-variance decomposition to other loss functions. Take an "arbitrary" real-valued loss L(t, y) with L(y, y') = L(y', y), L(y, y) = 0, and L(y, y') ≠ 0 if y ≠ y'. Writing y = h_D(x) for the learner's prediction:
– Define the "optimal prediction": y* = argmin_{y'} E_t[L(t, y')]
– Define the "main prediction of the learner": y_m = argmin_{y'} E_D[L(y, y')]
– Define the "bias of the learner": B(x) = L(y*, y_m)
– Define the "variance of the learner": V(x) = E_D[L(y_m, y)]
– Define the "noise for x": N(x) = E_t[L(t, y*)]
Claim (Domingos, 2000): E_{D,t}[L(t, y)] = c_1·N(x) + B(x) + c_2·V(x), where c_1 = c_2 = 1 for squared loss, and for two-class zero-one loss c_1 = 2·Pr_D[y = y*] − 1 and c_2 = 1 if y_m = y*, −1 otherwise.
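
A small numerical check of the claim for two-class zero-one loss (the probabilities p_t and p_y below are made-up illustrative values): if, at a fixed x, the test label t and the learner's prediction y are independent coin flips, every quantity in the decomposition can be computed exactly and the two sides of the identity compared.

```python
# Zero-one loss, two classes {0, 1}; all quantities at a single fixed x.
p_t = 0.8   # assumed P(t = 1): distribution of the true label
p_y = 0.6   # assumed P(y = 1): distribution of the learner's prediction over datasets D

L = lambda a, b: float(a != b)                     # zero-one loss

y_star = 1 if p_t > 0.5 else 0                     # optimal prediction
y_m    = 1 if p_y > 0.5 else 0                     # main prediction (mode over D)
N = min(p_t, 1 - p_t)                              # noise  E_t[L(t, y*)]
B = L(y_star, y_m)                                 # bias   L(y*, y_m)
V = min(p_y, 1 - p_y)                              # variance E_D[L(y_m, y)]

p_y_eq_star = p_y if y_star == 1 else 1 - p_y
c1 = 2 * p_y_eq_star - 1
c2 = 1 if y_m == y_star else -1

lhs = p_t * (1 - p_y) + (1 - p_t) * p_y            # E_{D,t}[L(t, y)], t and y independent
rhs = c1 * N + B + c2 * V
print(lhs, rhs)                                    # both 0.44 for these numbers
```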

17 Other regression methods

18 Example: univariate linear regression. Example: predict age from number of publications. Paul Erdős, Hungarian mathematician, 1913-1996: with x ≈ 1500 publications, the fitted line predicts an age of about 240.

19 Linear regression. [Figure: scatter of points (x_1, y_1), (x_2, y_2), … with residuals d_1, d_2, d_3.] Summary: to simplify, assume zero-centered data, as we did for PCA; let x = (x_1, …, x_n) and y = (y_1, …, y_n); then…
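
For concreteness, a worked step (not text from the slide) showing where the zero-centered simplification leads: minimizing the squared error over the slope a alone gives

```latex
\hat a = \arg\min_a \sum_{i=1}^n (y_i - a x_i)^2
\quad\Longrightarrow\quad
0 = -2 \sum_{i=1}^n x_i (y_i - \hat a x_i)
\quad\Longrightarrow\quad
\hat a = \frac{\sum_i x_i y_i}{\sum_i x_i^2}
       = \frac{\mathbf{x}^{\top}\mathbf{y}}{\mathbf{x}^{\top}\mathbf{x}}.
```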

20 Onward: multivariate linear regression. Univariate vs. multivariate: the inputs become a data matrix X in which each row is an example and each column is a feature.

21 Onward: multivariate linear regression. [Equation slide: the least-squares solution ŵ = (XᵀX)⁻¹ Xᵀ y and the regularized solution ŵ = (XᵀX + λI)⁻¹ Xᵀ y.]
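
A minimal numpy sketch of the regularized closed form above (the function name, variable names, and toy data are illustrative):

```python
import numpy as np

def ridge_fit(X, y, lam=0.0):
    """w_hat = (X^T X + lam*I)^{-1} X^T y  (lam = 0 recovers ordinary least squares)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Toy multivariate data: 100 examples (rows), 3 features (columns)
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + rng.normal(0, 0.1, size=100)

print(ridge_fit(X, y, lam=0.0))    # close to w_true
print(ridge_fit(X, y, lam=100.0))  # shrunk toward zero: larger lambda, smaller weights
```

Increasing lam shrinks the weights toward zero, which previews the question on slide 23.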

22 Onward: multivariate linear regression. Multivariate, with multiple outputs.

23 Onward: multivariate linear regression. [Equation slide: the regularized solution ŵ = (XᵀX + λI)⁻¹ Xᵀ y.] What does increasing λ do?

24 Onward: multivariate linear regression. [Equation slide: the regularized solution with w = (w_1, w_2).] What does fixing w_2 = 0 do (if λ = 0)?

25 Regression trees: summary [Quinlan's M5]. Standard tree recipe: grow the tree by splitting to optimize information gain; at each leaf node, predict the majority class; prune to reduce error on a holdout; to predict, trace the path to a leaf and output its associated value. M5's modifications for regression: at each leaf, build a linear model, then greedily remove features; error estimates are adjusted by (n+k)/(n−k), where n = #cases and k = #features; error is estimated on the training data, using a linear interpolation of every prediction made by every node on the path.
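
A much-simplified regression-tree sketch (constant mean predictions at the leaves and variance-reduction splits; this is an illustration only, not M5's linear-model leaves, adjusted error estimates, or smoothing):

```python
import numpy as np

def build_tree(X, y, depth=0, max_depth=3, min_leaf=5):
    """Greedy regression tree: split to minimize weighted variance, predict the mean at leaves."""
    if depth == max_depth or len(y) < 2 * min_leaf:
        return ("leaf", y.mean())
    best = None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left = X[:, j] <= t
            if left.sum() < min_leaf or (~left).sum() < min_leaf:
                continue
            score = left.sum() * y[left].var() + (~left).sum() * y[~left].var()
            if best is None or score < best[0]:
                best = (score, j, t, left)
    if best is None:
        return ("leaf", y.mean())
    _, j, t, left = best
    return ("node", j, t,
            build_tree(X[left], y[left], depth + 1, max_depth, min_leaf),
            build_tree(X[~left], y[~left], depth + 1, max_depth, min_leaf))

def predict(tree, x):
    while tree[0] == "node":
        _, j, t, lo, hi = tree
        tree = lo if x[j] <= t else hi
    return tree[1]

# Toy usage: a step function plus noise
rng = np.random.default_rng(3)
X = rng.uniform(0, 10, size=(200, 1))
y = np.where(X[:, 0] < 5, 1.0, 4.0) + rng.normal(0, 0.2, size=200)
tree = build_tree(X, y)
print(predict(tree, np.array([2.0])), predict(tree, np.array([8.0])))  # ~1 and ~4
```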

26 Regression trees: example 1

27 Regression trees: example 2. What does pruning do to bias and variance?

28 Kernel regression, aka locally weighted regression, locally linear regression, LOESS, …

29 Kernel regression, aka locally weighted regression, locally linear regression, … A close approximation to kernel regression:
– Pick a few values z_1, …, z_k up front
– Preprocess: for each example (x, y), replace x with x′ = ⟨K(x, z_1), …, K(x, z_k)⟩, where K(x, z) = exp(−(x − z)² / (2σ²))
– Use multivariate regression on the (x′, y) pairs
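
A sketch of the approximation described on this slide (the centers, bandwidth σ, and small ridge penalty below are illustrative choices, not values from the slides):

```python
import numpy as np

def rbf_features(x, centers, sigma):
    """Map each scalar x to ( K(x, z_1), ..., K(x, z_k) ) with a Gaussian kernel."""
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * sigma ** 2))

rng = np.random.default_rng(4)
x = rng.uniform(0, 6, size=100)
y = np.sin(x) + rng.normal(0, 0.1, size=100)

centers = np.linspace(0, 6, 10)     # the z_1, ..., z_k picked up front
sigma = 0.5                         # kernel width
Phi = rbf_features(x, centers, sigma)

# Multivariate (ridge) regression on the transformed examples
lam = 1e-3
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(len(centers)), Phi.T @ y)

x_test = np.array([1.0, 3.0, 5.0])
print(rbf_features(x_test, centers, sigma) @ w)   # roughly sin(x_test)
```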

30 Kernel regression, aka locally weighted regression, locally linear regression, LOESS, … What does making the kernel wider do to bias and variance?

31 Additional readings:
– P. Domingos, A Unified Bias-Variance Decomposition and its Applications. Proceedings of the Seventeenth International Conference on Machine Learning (pp. 231-238), 2000. Stanford, CA: Morgan Kaufmann.
– J. R. Quinlan, Learning with Continuous Classes. 5th Australian Joint Conference on Artificial Intelligence, 1992.
– Y. Wang & I. Witten, Inducing Model Trees for Continuous Classes. 9th European Conference on Machine Learning, 1997.
– D. A. Cohn, Z. Ghahramani, & M. Jordan, Active Learning with Statistical Models. JAIR, 1996.

