1 Linear Methods for Regression Lecture Notes for CMPUT 466/551 Nilanjan Ray
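2 Assumption: Linear Regression Function
Model assumption: the output Y is linear in the inputs $X = (X_1, X_2, X_3, \ldots, X_p)$.
Predict the output by: $\hat{Y} = \hat\beta_0 + \sum_{j=1}^{p} X_j \hat\beta_j$.
Vector notation, with 1 included in X: $\hat{Y} = X^T \hat\beta$, where $X = (1, X_1, \ldots, X_p)^T$ and $\hat\beta = (\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_p)^T$.
Also known as multiple regression when p > 1.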
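3 Least Squares Solution
Residual sum of squares: $RSS(\beta) = \sum_{i=1}^{N} (y_i - x_i^T \beta)^2$, where each term $y_i - x_i^T \beta$ is a residual.
In matrix-vector notation: $RSS(\beta) = (y - X\beta)^T (y - X\beta)$.
Vector differentiation: $\partial RSS / \partial \beta = -2 X^T (y - X\beta) = 0$.
Solution: $\hat\beta = (X^T X)^{-1} X^T y$, known as the least squares solution.
For a new input $x_0$, the regression output is $\hat{y}_0 = x_0^T \hat\beta$.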
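A minimal NumPy sketch of the least squares solution on synthetic data. The data, the random seed, and the helper names `add_intercept`, `fit_least_squares` and `predict` are illustrative assumptions, not part of the notes. The sketch uses `np.linalg.lstsq` instead of forming $(X^T X)^{-1}$ explicitly, which gives the same minimizer when $X$ has full column rank.

```python
import numpy as np

def add_intercept(X):
    """Prepend a column of ones so the intercept is part of beta."""
    return np.column_stack([np.ones(X.shape[0]), X])

def fit_least_squares(X, y):
    """Return the beta minimizing ||y - X beta||^2 (X already holds the ones column)."""
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta_hat

def predict(x0, beta_hat):
    """Regression output x0^T beta_hat for a new input x0 (with leading 1)."""
    return x0 @ beta_hat

# Synthetic data (assumed for illustration only).
rng = np.random.default_rng(0)
N, p = 200, 3
X = add_intercept(rng.normal(size=(N, p)))
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=N)

beta_hat = fit_least_squares(X, y)

# At the solution the normal equations X^T (y - X beta_hat) = 0 hold.
gradient = X.T @ (y - X @ beta_hat)

x0 = np.array([1.0, 0.5, -0.2, 1.0])   # new input, 1 included
y0_hat = predict(x0, beta_hat)
```

The `gradient` vector is a quick sanity check: it should vanish up to floating-point error, which is exactly the condition obtained from the vector differentiation step.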
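4 Bias-Variance Decomposition
Model: $y = f(x) + \varepsilon = x^T \beta + \varepsilon$, where the errors $\varepsilon$ have zero expectation, the same variance $\sigma^2$, and are uncorrelated.
Estimator: $\hat{f}(x_0) = x_0^T \hat\beta = x_0^T (X^T X)^{-1} X^T y$, with $E[\hat{f}(x_0)] = x_0^T (X^T X)^{-1} X^T X \beta = x_0^T \beta = f(x_0)$. Unbiased estimator! Ex. Show the last step.
Bias: $E[\hat{f}(x_0)] - f(x_0) = 0$.
Variance: $\mathrm{Var}(\hat{f}(x_0)) = \sigma^2 x_0^T (X^T X)^{-1} x_0$.
Decomposition of EPE: $EPE(x_0) = \sigma^2 + \mathrm{Bias}^2(\hat{f}(x_0)) + \mathrm{Var}(\hat{f}(x_0))$.
For the linear model, averaged over the training inputs (with p the number of fitted coefficients): irreducible error = $\sigma^2$, squared bias = 0, variance = $\sigma^2 (p/N)$.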
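The zero-bias and $\sigma^2 (p/N)$ variance claims can be checked numerically. The sketch below is a small simulation under assumed settings (fixed design with p columns and no intercept, Gaussian noise, arbitrary seed and sizes); none of these choices come from the notes. It compares the analytic average variance $\sigma^2 \,\mathrm{trace}(H)/N$, where $H = X (X^T X)^{-1} X^T$, with a Monte Carlo estimate over repeated training responses.

```python
import numpy as np

rng = np.random.default_rng(1)
N, p, sigma = 50, 5, 1.0

X = rng.normal(size=(N, p))       # fixed design, p columns, no intercept (assumption)
beta = rng.normal(size=p)
f = X @ beta                      # true regression function at the training inputs

# Hat matrix H maps y to the fitted values X beta_hat.
H = X @ np.linalg.solve(X.T @ X, X.T)

# Analytic variance of the fit, averaged over the N training inputs.
analytic_var = sigma**2 * np.trace(H) / N    # equals sigma^2 * p / N

# Monte Carlo: redraw the noise R times, refit, and look at the fitted values.
R = 2000
Y = f + sigma * rng.normal(size=(R, N))      # each row is one training response vector
fitted = Y @ H                               # H is symmetric, so row r equals H @ Y[r]

mc_bias = fitted.mean(axis=0) - f            # should be close to zero everywhere
mc_var = fitted.var(axis=0, ddof=1).mean()   # should be close to sigma^2 * p / N

print(analytic_var, mc_var, np.abs(mc_bias).max())
```

Increasing p with N fixed raises `analytic_var` proportionally, which is the sense in which variance grows with model complexity.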
5 Gauss-Markov Theorem
Gauss-Markov Theorem: the least squares estimate has the minimum variance among all linear unbiased estimators.
Interpretation: the estimator found by least squares is linear in y: $\hat{f}(x_0) = x_0^T \hat\beta = x_0^T (X^T X)^{-1} X^T y$.
We have noticed that this estimator is unbiased, i.e., $E[\hat{f}(x_0)] = f(x_0)$.
If we find any other unbiased estimator $g(x_0)$ of $f(x_0)$ that is linear in y too, i.e., $g(x_0) = c^T y$, then $E[g(x_0)] = f(x_0)$ and $\mathrm{Var}(\hat{f}(x_0)) \le \mathrm{Var}(g(x_0))$.
Question: Is LS the best estimator for the given linear additive model?
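A short sketch of why the variance inequality holds, assuming uncorrelated errors with common variance $\sigma^2$. The auxiliary vectors $a$ and $d$ are introduced here for the argument and do not appear in the notes.

```latex
\begin{align*}
&\text{Let } a = X (X^T X)^{-1} x_0, \text{ so that } \hat{f}(x_0) = a^T y,
  \text{ and write } c = a + d. \\
&\text{Unbiasedness of } c^T y \text{ for every } \beta:\quad
  E[c^T y] = c^T X \beta = x_0^T \beta
  \;\Rightarrow\; X^T c = x_0 . \\
&\text{Since } X^T a = x_0 \text{ as well, } X^T d = 0, \text{ hence }
  a^T d = x_0^T (X^T X)^{-1} X^T d = 0 . \\
&\mathrm{Var}(c^T y) = \sigma^2 c^T c
  = \sigma^2 \left( a^T a + 2 a^T d + d^T d \right)
  = \mathrm{Var}(\hat{f}(x_0)) + \sigma^2 \lVert d \rVert^2
  \;\ge\; \mathrm{Var}(\hat{f}(x_0)) .
\end{align*}
```

Equality holds only when $d = 0$, that is, when the competing estimator coincides with least squares. The theorem says nothing about biased estimators, which is what the closing question is probing.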
6 Subset Selection
The LS solution often has large variance (remember that the variance is proportional to the number of inputs p, i.e., model complexity).
If we decrease the number of input variables p, we can decrease the variance; however, we then sacrifice the zero bias.
If this trade-off decreases the test error, the solution can be accepted.
This reasoning leads to subset selection, i.e., selecting a subset of the p inputs for the regression computation.
Subset selection has another advantage: an easier and more focused interpretation of the influence of the input variables on the output.
7 Subset Selection…
Can we determine which $\beta_j$'s are insignificant? Yes, we can, by statistical hypothesis testing!
However, we need a model assumption: $y = x^T \beta + \varepsilon$, where $\varepsilon$ is zero-mean Gaussian with standard deviation $\sigma$.
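8 Subset Selection: Statistical Significance Test
The linear model with additive Gaussian noise has the following properties: $\hat\beta \sim N(\beta, (X^T X)^{-1} \sigma^2)$ and $(N - p - 1)\,\hat\sigma^2 \sim \sigma^2 \chi^2_{N-p-1}$. Ex. Show this.
So we can form a standardized coefficient, or Z-score, test for each coefficient: $z_j = \hat\beta_j / (\hat\sigma \sqrt{v_j})$, where $\hat\sigma^2 = \frac{1}{N - p - 1} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$ and $v_j$ is the j-th diagonal element of $(X^T X)^{-1}$.
The hypothesis testing principle says that a Z-score of large magnitude means we should retain the coefficient, and a Z-score of small magnitude means we should discard it.
How large or small depends on the significance level.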
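A minimal sketch of the Z-score computation on synthetic data, assuming the design matrix already contains the column of ones. The function name `z_scores`, the data, and the cutoff of 2 used at the end are illustrative assumptions; one coefficient is set to zero on purpose so that its Z-score can be compared with the others.

```python
import numpy as np

def z_scores(X, y):
    """Least squares fit plus standard errors and Z-scores.

    X must include the column of ones, so it has k = p + 1 columns.
    """
    N, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    resid = y - X @ beta_hat
    sigma_hat2 = resid @ resid / (N - k)      # unbiased noise variance estimate
    v = np.diag(XtX_inv)                      # v_j, diagonal of (X^T X)^{-1}
    std_err = np.sqrt(sigma_hat2 * v)
    return beta_hat, std_err, beta_hat / std_err

# Synthetic data (assumed): the third coefficient is truly zero.
rng = np.random.default_rng(2)
N = 500
X = np.column_stack([np.ones(N), rng.normal(size=(N, 3))])
beta_true = np.array([1.0, 2.0, 0.0, -1.5])
y = X @ beta_true + rng.normal(size=N)

beta_hat, std_err, z = z_scores(X, y)
keep = np.abs(z) > 2.0    # rough 5% rule of thumb

for j in range(len(z)):
    print(j, round(beta_hat[j], 3), round(std_err[j], 3), round(z[j], 2), keep[j])
```

Under the null hypothesis $\beta_j = 0$ the Z-score follows a t distribution with $N - p - 1$ degrees of freedom, so a truly zero coefficient will still exceed the cutoff in about 5% of samples.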
9 Case Study: Prostate Cancer
Output = log prostate-specific antigen.
Input = (log cancer volume, log prostate weight, age, log of benign prostatic hyperplasia, seminal vesicle invasion, log of capsular penetration, Gleason score, % of Gleason scores 4 or 5).
Goal: (1) predict the output given a novel input; (2) interpret the influence of the inputs on the output.
10 Case Study…
[Figure: scatter plot]
It is hard to tell from the scatter plot which inputs are the most influential.
We also want to find out how the inputs jointly influence the output.
11 Subset Selection on Prostate Cancer Data
Table columns: Term, Coefficient, Std. Error, Z-score.
Table rows: Intercept, Lcavol, Lweight, Age, Lbph, Svi, Lcp, Gleason, Pgg.
Z-scores with magnitude greater than 2 indicate variables that are significant at roughly the 5% significance level.
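12 Coefficient Shrinkage: Ridge Regression
Method: $\hat\beta^{ridge} = \arg\min_\beta \left\{ \sum_{i=1}^{N} \left( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right\}$, where $\lambda \ge 0$ is a non-negative penalty.
Solution (with centered inputs): $\hat\beta^{ridge} = (X^T X + \lambda I)^{-1} X^T y$.
One computational advantage is that the matrix $X^T X + \lambda I$ is always invertible when $\lambda > 0$.
If the $L_2$ norm is replaced by the $L_1$ norm, the corresponding regression is called the LASSO (see [HTF]).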
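A minimal sketch of the ridge closed form, assuming the common convention of centering the inputs and leaving the intercept unpenalized. The function name `ridge_fit`, the synthetic data, and the grid of penalties are assumptions for illustration. One input column is duplicated on purpose so that $X^T X$ is singular and plain least squares has no unique solution, while the ridge system is still solvable.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge regression via (Xc^T Xc + lam I) beta = Xc^T yc.

    X has no column of ones; the intercept is recovered from the means
    and is not penalized (an assumed, commonly used convention).
    """
    x_mean = X.mean(axis=0)
    y_mean = y.mean()
    Xc = X - x_mean
    p = X.shape[1]
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ (y - y_mean))
    intercept = y_mean - x_mean @ beta
    return intercept, beta

# Synthetic data (assumed) with an exactly duplicated input column.
rng = np.random.default_rng(3)
N, p = 100, 4
X = rng.normal(size=(N, p))
X[:, 3] = X[:, 2]                     # X now has rank 3, so X^T X is singular
y = X @ np.array([1.0, -2.0, 0.5, 0.5]) + 0.1 * rng.normal(size=N)

intercept, beta = ridge_fit(X, y, lam=1.0)

# The coefficient norm shrinks as the penalty grows.
lambdas = [0.1, 1.0, 10.0, 100.0]
norms = [np.linalg.norm(ridge_fit(X, y, lam)[1]) for lam in lambdas]
print(beta, norms)
```

Because the two duplicated columns are indistinguishable, the penalty splits their weight equally between them, which is one way to see how shrinkage stabilizes the fit under collinearity.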
13 Ridge Regression…
[Figure: coefficient values plotted against decreasing $\lambda$]
One way to determine $\lambda$ is cross-validation; we'll learn about it later.