OLS Regression What is it? Closely allied with correlation – interested in the strength of the linear relationship between two variables One variable is specified as the dependent variable The other variable is the independent (or explanatory) variable
Regression Model Y = a + bx + e What is Y? What is a? What is b? What is x? What is e? What is Y-hat?
Elements of the Regression Line a = Y intercept (what Y is predicted to equal when X = 0) b = Slope (indicates the change in Y associated with a unit increase in X) e = error (the difference between the predicted Y (Y-hat) and the observed Y)
Regression Has the ability to quantify precisely the relative importance of a variable Has the ability to quantify how much variance is explained by a variable(s) Used more often than any other statistical technique
The Regression Line Y = a + bx + e Y = sentence length X = prior convictions Each point represents the number of priors (X) and sentence length (Y) of a particular defendant The regression line is the best fit line through the overall scatter of points
X and Y are observed. We need to estimate a & b
Calculus 101 Least Squares Method and differential calculus Differentiation is a very powerful tool that is used extensively in model estimation. Practical examples of differentiation are usually in the form of minimization/optimization problems or rate of change problems.
Calculus 101: Calculating the rate of change or slope of a line For a straight line it is relatively simple to calculate the slope
Calculating the rate of change or slope of a line for a curve is a bit harder Differential Calculus: We have a curve describing the variable Y as some function of the variable X: y = x²
It is possible to find a general expression involving the function f(x) that describes the slopes of the approximating sequence of secant lines, where h = x₁ − x₀ (a small difference from the point of interest)
Let's take a cost curve example: C(x) = x². What is the derivative at x = 3? [f(3 + h) − f(3)] / h = [(3 + h)² − 3²] / h = [(9 + 6h + h²) − 9] / h = (6h + h²) / h = 6 + h = 6 (as h approaches 0) ∆y/∆x = 6
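The limit above can be checked numerically: the slope of the secant line through x = 3 and x = 3 + h approaches 6 as h shrinks. A minimal sketch (the function name `difference_quotient` is my own label, not from the slides):

```python
# Limit definition of the derivative applied to the cost curve C(x) = x^2.
def difference_quotient(f, x, h):
    """Slope of the secant line through (x, f(x)) and (x + h, f(x + h))."""
    return (f(x + h) - f(x)) / h

C = lambda x: x ** 2

# As h approaches 0, the secant slope approaches the derivative at x = 3.
for h in [1.0, 0.1, 0.001]:
    print(h, difference_quotient(C, 3, h))  # 7.0, then ~6.1, then ~6.001
```

Each secant slope equals 6 + h, exactly as the algebra on the slide shows, so the values close in on 6.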
How does this relate to our Regression model that is a straight line?
How do you draw a line when the line can be drawn in almost any direction? The Method of Least Squares: drawing the line that minimizes the sum of the squared distances from the line (Σe²) This is a minimization problem and therefore we can use differential calculus to estimate this line.
X and Y are observed. We need to estimate a & b
Least Squares Method
x   y   Deviation = y − (a + bx)   Squared deviation
0   1   1 − a                      1 − 2a + a²
1   3   3 − a − b                  9 − 6a + a² − 6b + 2ab + b²
2   2   2 − a − 2b                 4 − 4a + a² − 8b + 4ab + 4b²
3   4   4 − a − 3b                 16 − 8a + a² − 24b + 6ab + 9b²
4   5   5 − a − 4b                 25 − 10a + a² − 40b + 8ab + 16b²
Summing the squares of the deviations yields: f(a, b) = 55 − 30a + 5a² − 78b + 20ab + 30b² Calculate the first-order partial derivatives of f(a, b): fa = −30 + 10a + 20b and fb = −78 + 20a + 60b
Set each partial derivative to zero. Manipulate fa: 0 = −30 + 10a + 20b, so 10a = 30 − 20b and a = 3 − 2b
Substitute (3 − 2b) into fb: 0 = −78 + 20(3 − 2b) + 60b = −18 − 40b + 60b = −18 + 20b, so 20b = 18 and b = 0.9 Slope = 0.9
Substituting this value of b back into fa to obtain a: 10a = 30 − 20(0.9) = 30 − 18 = 12, so a = 1.2 Y-intercept = 1.2
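As a check on the derivation, a short script can confirm that both first-order conditions hold at (a, b) = (1.2, 0.9) and that nearby values of a and b give a larger sum of squared deviations:

```python
# f(a, b) is the sum of squared deviations from the worked example.
def f(a, b):
    return 55 - 30*a + 5*a**2 - 78*b + 20*a*b + 30*b**2

def f_a(a, b):  # partial derivative of f with respect to a
    return -30 + 10*a + 20*b

def f_b(a, b):  # partial derivative of f with respect to b
    return -78 + 20*a + 60*b

a_hat, b_hat = 1.2, 0.9
print(f_a(a_hat, b_hat), f_b(a_hat, b_hat))  # both approximately 0

# The minimized sum of squared errors:
print(f(a_hat, b_hat))  # smaller than at any nearby (a, b)
```

Both partials vanish at the estimates, and perturbing either coefficient (for example, a = 1.3 or b = 1.0) increases f, so the line Y-hat = 1.2 + 0.9X is indeed the least-squares line for these five points.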
Estimating the model (the easy way) Calculating the slope (b): b = SSxy / SSxx Sum of Squares for X: SSxx = Σ(X − X̄)² Sum of Squares for Y: SSyy = Σ(Y − Ȳ)² Sum of Products: SSxy = Σ(X − X̄)(Y − Ȳ)
Calculating the Y-intercept (a): a = Ȳ − bX̄ Calculating the error term (e): e = Y − Y-hat Y-hat = predicted value of Y e will be different for every observation. It is a measure of how much we are off in our prediction.
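A minimal sketch of "the easy way", applied to the five data points from the worked example (the (x, y) pairs implied by the deviation column: (0,1), (1,3), (2,2), (3,4), (4,5)):

```python
# Slope and intercept via sums of squares, for the example's five points.
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 2, 4, 5]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))  # sum of products
ss_xx = sum((x - x_bar) ** 2 for x in xs)                       # sum of squares for X

b = ss_xy / ss_xx      # slope
a = y_bar - b * x_bar  # Y-intercept

# Residuals e = Y - Y-hat, one per observation
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
print(a, b)  # 1.2 and 0.9, matching the calculus derivation
```

The shortcut formulas give exactly the same a and b as setting the partial derivatives to zero, because they are the closed-form solution of those same equations. Note also that the residuals sum to zero, a general property of OLS with an intercept.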
Regression is strongly related to Correlation
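One way to see the connection: the OLS slope equals Pearson's r rescaled by the standard deviations, b = r(sY/sX), and r² is the share of variance in Y explained by X. A sketch using the same five points as above:

```python
# Link between the OLS slope b and Pearson's correlation r: b = r * (s_y / s_x)
import math

xs = [0, 1, 2, 3, 4]
ys = [1, 3, 2, 4, 5]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

# Sample standard deviations of X and Y
s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))

# Pearson's correlation coefficient
r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)

b = r * (s_y / s_x)  # recovers the OLS slope
print(r, b)          # here both equal 0.9, since s_x happens to equal s_y
```

For this data set r = 0.9, so r² = 0.81: the regression of sentence length on priors in this toy example would explain 81% of the variance.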