Linear regression J.-F. Pâris, University of Houston
Introduction Special case of regression analysis
Regression Analysis Models the relationship between Values of a dependent variable (also called a response variable) Values of one or more independent variables Main outcome is a function y = f(x1, …, xn)
Linear regression Studies linear dependencies y = ax + b And more, such as y = ax² + bx + c, which is linear in a, b and c Uses the least-squares method Assumes that departures from the ideal line are due to random noise
Basic Assumptions (I) Sample is representative of the whole population The error is assumed to be a random variable with a mean of zero conditional on the independent variables. Independent variables are error-free and linearly independent. Errors are uncorrelated
Basic Assumptions (II) The variance of the error is constant across observations For very small samples, the errors must be Gaussian Does not apply to large samples (n ≥ 30)
General Formulation n samples of the dependent variable: y1, y2, …, yn n samples of each of the p independent variables: x11, x12, …, x1n x21, x22, …, x2n … xp1, xp2, …, xpn
Objective Finding Y = b0 + b1X1 + b2X2 + … + bpXp Minimizing the sum of squares of the deviations Σi (yi - b0 - b1x1i - b2x2i - … - bpxpi)²
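The objective above can be sketched in Python; the data values and trial coefficients are made-up illustrations, not from the slides:

```python
import numpy as np

# Hypothetical data: n = 4 samples of y and of two independent variables.
y = np.array([2.0, 4.1, 6.2, 8.1])
X = np.column_stack([np.ones(4),                       # column of ones for b0
                     np.array([1.0, 2.0, 3.0, 4.0]),   # x1 samples
                     np.array([0.0, 1.0, 0.0, 1.0])])  # x2 samples

def sum_of_squares(b, X, y):
    """Sum over i of (y_i - b0 - b1*x1i - ... - bp*xpi)^2."""
    residuals = y - X @ b
    return float(residuals @ residuals)

# Score one trial coefficient vector (b0, b1, b2); regression picks
# the b that makes this quantity as small as possible.
print(sum_of_squares(np.array([0.0, 2.0, 0.1]), X, y))
```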
Why the sum of squares It penalizes big deviations, which are less likely to result from random noise than small variations Our objective is to estimate the function linking the dependent variable to the independent variables, assuming that the experimental points represent random variations
Simplest case (I) One independent variable We must find Y = a + bX Minimizing the sum of squares of errors Σi (yi - a - bxi)²
Simplest case (II) Differentiate the previous expression with respect to the parameters a and b and set each derivative to zero: -2 Σi (yi - a - bxi) = 0, or na + Σi xi b = Σi yi -2 Σi xi (yi - a - bxi) = 0, or Σi xi a + Σi xi² b = Σi xi yi
Simplest case (III) We obtain b = (n Σi xi yi - Σi xi Σi yi) / (n Σi xi² - (Σi xi)²) and a = (Σi yi - b Σi xi) / n The second expression can be rewritten a = ȳ - b x̄
More notations x̄ = (1/n) Σi xi ȳ = (1/n) Σi yi Sxx = Σi (xi - x̄)² Syy = Σi (yi - ȳ)² Sxy = Σi (xi - x̄)(yi - ȳ)
Simplest case (IV) Solution can be rewritten b = Sxy / Sxx a = ȳ - b x̄
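The closed-form solution for the one-variable case can be sketched as follows; the data points are invented for illustration and were chosen to lie near the line y = 1 + 2x:

```python
import numpy as np

# Hypothetical sample: points scattered around y = 1 + 2x.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.0])

x_bar, y_bar = x.mean(), y.mean()

# b = Sxy / Sxx, then a = y_bar - b * x_bar.
S_xx = np.sum((x - x_bar) ** 2)
S_xy = np.sum((x - x_bar) * (y - y_bar))
b = S_xy / S_xx
a = y_bar - b * x_bar
print(a, b)   # intercept close to 1, slope close to 2
```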
Coefficient of correlation r = Sxy / √(Sxx Syy) |r| = 1 would indicate a perfect fit r = 0 would indicate no linear dependency
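A minimal check of the two extremes, using made-up data where y is an exact linear function of x (so r must come out equal to 1):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 3.0 * x - 2.0          # exact linear dependence on x

# r = Sxy / sqrt(Sxx * Syy)
S_xx = np.sum((x - x.mean()) ** 2)
S_yy = np.sum((y - y.mean()) ** 2)
S_xy = np.sum((x - x.mean()) * (y - y.mean()))
r = S_xy / np.sqrt(S_xx * S_yy)
print(r)   # → 1.0, a perfect fit
```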
More complex case (I) Use matrix formulation Y = Xb + e where Y is a column vector of the n observations and X is the n×(p+1) matrix whose first column is all ones (for b0) and whose remaining columns contain the samples of the p independent variables
More complex case (II) Solution to the problem is b = (XᵀX)⁻¹Xᵀy
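The matrix solution can be sketched as below; the data are fabricated so that y is exactly 1 + 2·x1 + 3·x2, which the normal equations should recover. Solving the linear system XᵀX b = Xᵀy is numerically preferable to forming the inverse explicitly, though both compute the same b:

```python
import numpy as np

# Hypothetical data with p = 2 independent variables.
x1 = np.array([0.0, 1.0, 2.0, 3.0])
x2 = np.array([1.0, 0.0, 1.0, 0.0])
y = 1.0 + 2.0 * x1 + 3.0 * x2

# Design matrix X: a column of ones for b0, then one column per variable.
X = np.column_stack([np.ones_like(x1), x1, x2])

# b = (X^T X)^{-1} X^T y, computed by solving the normal equations.
b = np.linalg.solve(X.T @ X, X.T @ y)
print(b)   # recovers the coefficients (1, 2, 3)
```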
Non-linear dependencies Can use polynomial model Y = b0 + b1X + b2X² + … + bpXᵖ Or do a logarithmic transform Replace y = Ke^(at) by log y = log K + at
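The logarithmic transform can be sketched as follows; the data are generated from y = 2·e^(0.5t) (K = 2 and a = 0.5 are made-up values), and an ordinary straight-line fit in (t, log y) recovers them:

```python
import numpy as np

# Data from y = K * exp(a * t) with K = 2, a = 0.5 (illustrative values).
t = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * np.exp(0.5 * t)

# Transform: log y = log K + a*t, then fit a degree-1 polynomial.
log_y = np.log(y)
a_hat, logK_hat = np.polyfit(t, log_y, 1)   # slope, intercept
print(a_hat, np.exp(logK_hat))   # estimates of a and K
```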