CS489/698: Intro to ML. Lecture 02: Linear Regression. Yao-Liang Yu, 9/14/17.
"I'd rather die than tell you my password!" ... "Transfer success!"
Outline: Announcements, Linear Regression, Regularization, Cross-validation
Announcements. Assignment 1 is out, due in two weeks. TA office hour? Enrollment: CS698 permission numbers sent; CS489: ~10 seats available on Quest, ask the CS advisors!
Outline: Announcements, Linear Regression, Regularization, Cross-validation
How much should I bid? Interpolation vs. extrapolation; linear vs. nonlinear.
Regression. Given a pair (X, Y), find a function f such that f(X) ≈ Y. X: feature vector, a d-dimensional real vector. Y: response, an m-dimensional real vector (say m = 1). Two problems: (X, Y) is uncertain, we only see samples from an unknown distribution; and we must decide how to measure the error, i.e., we need a loss function.
Risk minimization. Minimize the expected loss, aka the risk: R(f) = E[ℓ(f(X), Y)]. Which loss to use? Not always clear; convenience dominates for now. Least squares: ℓ(f(X), Y) = ||f(X) − Y||^2.
The regression function. m(x) = E[Y | X = x] is the best predictor under least squares; the risk it cannot remove, E[Var(Y | X)], is the inherent noise variance. Many ways to estimate m(X). Simplest: let's assume it is linear (affine)!
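Why m(x) is the target, and where the inherent noise variance comes from: a standard decomposition, sketched here in LaTeX (my own write-up, not copied from the slide):

```latex
% Least-squares risk of any predictor f, decomposed around m(x) = E[Y | X = x].
\begin{align*}
\mathbb{E}\,[Y - f(X)]^2
  &= \mathbb{E}\,[Y - m(X) + m(X) - f(X)]^2 \\
  &= \underbrace{\mathbb{E}\,[Y - m(X)]^2}_{\text{inherent noise variance}}
   + \underbrace{\mathbb{E}\,[m(X) - f(X)]^2}_{\text{estimation error}}.
\end{align*}
% The cross term vanishes: E[(Y - m(X)) g(X)] = 0 for any g(X),
% by the tower property of conditional expectation.
```

So no predictor can beat m, and the leftover risk is exactly the noise variance.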
Linear regression. Assumption: f(x) = Wx + b. Dream: minimize the true risk E||WX + b − Y||^2. Law of Large Numbers: sample averages converge to expectations. Reality: the distribution is unknown, so we minimize the empirical risk (1/n) Σ_i ||W x_i + b − y_i||^2 instead.
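A minimal numpy sketch of the Law of Large Numbers at work (the toy model y = w·x + noise and all names here are mine): the empirical risk of the true weights approaches the true risk sigma^2 as n grows.

```python
import numpy as np

rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0])
sigma = 0.5  # noise std; the true risk of w_true is sigma**2

def empirical_risk(w, n):
    """Average squared error on n fresh samples from the toy model."""
    X = rng.normal(size=(n, 2))
    y = X @ w_true + sigma * rng.normal(size=n)
    return np.mean((X @ w - y) ** 2)

for n in [10, 100, 10_000, 1_000_000]:
    print(n, empirical_risk(w_true, n))  # -> approaches sigma**2 = 0.25
```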
Simplification, again: pad each x with a constant 1 so the offset is absorbed into the weights, Wx + b = [W, b] [x; 1]. From now on, W denotes the padded weight.
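In code, the padding is one line (variable names are mine):

```python
import numpy as np

X = np.array([[0.0], [1.0], [2.0]])             # n x d raw features
X_tilde = np.hstack([X, np.ones((len(X), 1))])  # n x (d+1): last weight now plays the role of b
```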
Finally: min_W ||XW − Y||_F^2, the sum of squared residuals between the predictions of the hyperplane (again!) parameterized by W and the true responses Y.
Why least squares? Theorem (Sondermann'86; Friedland and Torokhti'07; Yu and Schuurmans'11). Among all minimizers of min_W ||AWB − C||_F, W = A^+ C B^+ is the one with minimal Frobenius norm. The pseudo-inverse A^+ is the unique matrix G such that AGA = A, GAG = G, (AG)^T = AG, (GA)^T = GA. Via the singular value decomposition A = U S V^T, we get A^+ = V S^+ U^T, where S^+ inverts the nonzero singular values.
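A sketch, on my own toy matrix, of the SVD construction and the four Penrose conditions; numpy ships the same thing as np.linalg.pinv:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(5, 2)) @ rng.normal(size=(2, 3))  # 5x3 matrix of rank 2

# Pseudo-inverse from the thin SVD: invert only the nonzero singular values.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
s_plus = np.array([1.0 / x if x > 1e-10 else 0.0 for x in s])
A_plus = Vt.T @ np.diag(s_plus) @ U.T

# The four Penrose conditions; each holds up to round-off.
for lhs, rhs in [(A @ A_plus @ A, A), (A_plus @ A @ A_plus, A_plus),
                 ((A @ A_plus).T, A @ A_plus), ((A_plus @ A).T, A_plus @ A)]:
    assert np.allclose(lhs, rhs)

assert np.allclose(A_plus, np.linalg.pinv(A))  # matches numpy's built-in
```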
Optimization detour. Fermat's theorem: if x is a local minimizer of a differentiable f, then necessarily Df(x) = 0, where Df(x) is the (Fréchet) derivative at x. Example: f(x) = x^T A x + x^T b + c has Df(x) = (A + A^T) x + b.
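A quick finite-difference check of the example's derivative formula (the random A, b, c are my own choices):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(4, 4)); b = rng.normal(size=4); c = 1.7

f  = lambda x: x @ A @ x + x @ b + c
Df = lambda x: (A + A.T) @ x + b        # claimed derivative

x = rng.normal(size=4)
eps = 1e-6
num = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                for e in np.eye(4)])    # central differences, coordinate by coordinate
assert np.allclose(num, Df(x), atol=1e-4)
```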
Solving least squares. Normal equation: X^T X W = X^T Y. X^T X may not be invertible, but there is always a solution. Even when it is invertible, never ever compute W = (X^T X)^{-1} X^T Y! Instead, solve the linear system.
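A sketch of that advice: hand the normal equations to a linear solver, or better, call np.linalg.lstsq, which factors X directly and never forms X^T X (toy data is mine):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))
y = rng.normal(size=100)

# Solve X^T X w = X^T y without ever forming an inverse.
w = np.linalg.solve(X.T @ X, X.T @ y)

# Numerically safer still: least squares via a factorization of X itself.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(w, w_lstsq)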
Prediction. Once we have W, we can predict ŷ = Wx on a new x. How to evaluate? Sometimes we train with one loss but evaluate with a different one; this leads to a beautiful theory of calibration.
Robustness: how sensitive is the least-squares fit to outliers?
Gauss vs. Laplace. Squared loss corresponds to Gaussian noise; absolute loss corresponds to Laplace noise and is far more robust to outliers.
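One concrete way to see the contrast (my own illustration, not from the slides): for a constant predictor, squared loss is minimized by the mean and absolute loss by the median, and a single outlier wrecks only the former.

```python
import numpy as np

y = np.array([1.0, 1.1, 0.9, 1.05, 100.0])  # one gross outlier

# argmin_c sum (y_i - c)^2 is the mean; argmin_c sum |y_i - c| is the median.
print("squared-loss fit (mean):   ", y.mean())      # ~20.8, dragged by the outlier
print("absolute-loss fit (median):", np.median(y))  # 1.05, barely moved
```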
Multi-task learning. Everything we've shown still holds if Y is m-dimensional. But then we can solve for each column of W (one per component of Y) independently, as the sketch below shows. Things get more interesting once we add regularization.
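A sketch of the column-independence claim on toy data of my own: solving with a matrix right-hand side agrees with solving one task at a time.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 3))
Y = rng.normal(size=(50, 2))  # m = 2 tasks

W_joint, *_ = np.linalg.lstsq(X, Y, rcond=None)          # all tasks at once
W_cols = np.column_stack([np.linalg.lstsq(X, Y[:, j], rcond=None)[0]
                          for j in range(Y.shape[1])])   # one column at a time
assert np.allclose(W_joint, W_cols)
```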
Outline: Announcements, Linear Regression, Regularization, Cross-validation
Ill-posedness. Let x1 = 0, x2 = ε, y1 = 1, y2 = −1. With padding, X = [[0, 1], [ε, 1]], y = (1, −1), and w = X^{-1} y = (−2/ε, 1). A slight perturbation of the input leads to chaotic behaviour in the solution.
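The same example in numpy: shrinking ε sends the first weight off to infinity.

```python
import numpy as np

for eps in [1e-1, 1e-4, 1e-8]:
    X = np.array([[0.0, 1.0], [eps, 1.0]])  # rows are (x_i, 1) after padding
    y = np.array([1.0, -1.0])
    w = np.linalg.solve(X, y)
    print(eps, w)  # w = (-2/eps, 1): blows up as eps -> 0
```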
Tikhonov regularization (Hoerl and Kennard'70), aka ridge regression: min_W ||XW − Y||_F^2 + λ ||W||_F^2, with regularization constant (hyperparameter) λ > 0. With positive λ, a slight perturbation of the input leads to a proportional (with respect to 1/λ) perturbation of the output.
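A minimal ridge solver for the objective above; with λ > 0 the system is always invertible. It tames the two-point example from the previous slide (data reused here so the block is self-contained):

```python
import numpy as np

def ridge(X, y, lam):
    """Solve (X^T X + lam * I) w = X^T y; well-posed for any lam > 0."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# On the ill-posed two-point example, a little regularization stabilizes w:
X = np.array([[0.0, 1.0], [1e-8, 1.0]])
y = np.array([1.0, -1.0])
print(ridge(X, y, 0.1))  # small, stable weights instead of ~ -2e8
```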
Data augmentation: ridge regression is ordinary least squares on an augmented data set, obtained by appending the d virtual rows √λ·I to X and d zero responses to y.
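A numerical check of this equivalence on toy data of my own:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(30, 4)); y = rng.normal(size=30); lam = 0.5

w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y)

# Augment: d extra "virtual" rows sqrt(lam)*I with target 0, then plain lstsq.
X_aug = np.vstack([X, np.sqrt(lam) * np.eye(4)])
y_aug = np.concatenate([y, np.zeros(4)])
w_aug, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)
assert np.allclose(w_ridge, w_aug)  # ||X_aug w - y_aug||^2 = ||Xw - y||^2 + lam*||w||^2
```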
Sparsity. The ridge regression weight is always dense: computationally heavy and interpretationally cumbersome. Lasso (Tibshirani'96): replace the squared ℓ2 penalty with the ℓ1 norm, min_w ||Xw − y||^2 + λ ||w||_1, which drives many weights exactly to zero.
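The lasso has no closed form; one standard solver is iterative soft-thresholding (ISTA), sketched here as my own minimal version rather than anything prescribed by the course:

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    """Minimize 0.5*||Xw - y||^2 + lam*||w||_1 by proximal gradient steps."""
    L = np.linalg.norm(X, 2) ** 2          # Lipschitz constant of the smooth part
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        w = soft_threshold(w - X.T @ (X @ w - y) / L, lam / L)
    return w

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 20))
w_true = np.zeros(20); w_true[:3] = [3.0, -2.0, 1.5]   # sparse ground truth
y = X @ w_true + 0.1 * rng.normal(size=100)
print(np.round(lasso_ista(X, y, lam=5.0), 2))  # exact zeros off the true support
```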
Regularization vs. constraint. The penalized form min_W ||XW − Y||^2 + λ||W||^2 is computationally appealing; the constrained form min_W ||XW − Y||^2 subject to ||W|| ≤ C is theoretically appealing. Every solution of the penalized problem solves the constrained problem for some C (always true); the converse holds under mild conditions.
Outline: Announcements, Linear Regression, Regularization, Cross-validation
Cross-validation. Split the training set into k folds 1, 2, ..., k, keeping the test set aside entirely. For each λ: train on all folds except fold i, validate on fold i to get perf_i, and accumulate perf(λ) = perf_1 + perf_2 + ... + perf_k. Pick λ* = argmax_λ perf(λ), refit W_{λ*} on the full training set, and only then evaluate on the test set.
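Putting the animation into code: a sketch of k-fold selection of λ for ridge (the fold handling, scoring by negative squared error, and toy data are all my own choices):

```python
import numpy as np

def ridge(X, y, lam):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def cv_select(X, y, lambdas, k=5):
    """k-fold CV: sum validation performance over folds for each lambda,
    pick the best lambda, then refit on the whole training set."""
    folds = np.array_split(np.random.default_rng(7).permutation(len(y)), k)
    def perf(lam):
        total = 0.0
        for i in range(k):
            val = folds[i]
            trn = np.concatenate([folds[j] for j in range(k) if j != i])
            w = ridge(X[trn], y[trn], lam)
            total += -np.mean((X[val] @ w - y[val]) ** 2)  # higher is better
        return total
    lam_star = max(lambdas, key=perf)          # argmax_lambda perf(lambda)
    return lam_star, ridge(X, y, lam_star)     # W_{lambda*} refit on all data

rng = np.random.default_rng(8)
X = rng.normal(size=(200, 10))
y = X @ rng.normal(size=10) + 0.3 * rng.normal(size=200)
lam_star, w = cv_select(X, y, lambdas=[0.01, 0.1, 1.0, 10.0])
print("selected lambda:", lam_star)
```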
Questions?