Lesson #32 Simple Linear Regression
Regression is used to model and/or predict a variable, called the dependent variable, Y, based on one or more independent variables, the X's. Here, all independent variables are numeric.
Simple linear regression: one independent variable, X
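Assume the "true" relationship is $Y = \alpha + \beta X + \varepsilon$ (or $Y = \beta_0 + \beta_1 X + \varepsilon$)
- $\alpha$ (or $\beta_0$) is the intercept
- $\beta$ (or $\beta_1$) is the slope
- $\varepsilon$ represents random error, $\varepsilon \sim N(0, \sigma^2)$
The intercept and slope are the regression coefficients.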
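To make the assumed model concrete, here is a minimal sketch that generates data from a straight line plus normal error. The function name `simulate`, the X values, and the parameter values are all hypothetical choices for illustration, not part of the lesson.

```python
import random

def simulate(xs, intercept, slope, sigma, seed=0):
    """Draw Y = intercept + slope * X + error, with error ~ N(0, sigma^2)."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    return [intercept + slope * x + rng.gauss(0.0, sigma) for x in xs]

xs = [1.0, 2.0, 3.0, 4.0, 5.0]  # hypothetical X values

# With sigma > 0 the points scatter around the true line.
ys = simulate(xs, intercept=2.0, slope=0.5, sigma=1.0)

# With sigma = 0 every point falls exactly on the true line.
noiseless = simulate(xs, intercept=2.0, slope=0.5, sigma=0.0)
```

Setting `sigma=0.0` is a quick way to see the systematic part of the model on its own; the random error is what regression has to average out.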
[Figure: the line $Y = \alpha + \beta X$, plotted with X on the horizontal axis and Y on the vertical axis.]
Assumptions (LINE)
- Linear relationship
- Independent observations
- Normal errors
- Equal variances
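Estimation
We want to estimate the regression coefficients (the intercept and the slope).
The estimates are
- $\hat{\alpha}$, the estimated intercept
- $\hat{\beta}$, the estimated slope
The estimated line is then $\hat{Y} = \hat{\alpha} + \hat{\beta} X$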
We want the line that "best fits" the data.
[Figure: plot of Y against X, marking a data point $(X_i, Y_i)$ and its residual $e_i$.]
The predicted value at the i-th point is: $\hat{Y}_i = \hat{\alpha} + \hat{\beta} X_i$
The residual at the i-th point is: $e_i = Y_i - \hat{Y}_i$
Ordinary least squares (OLS) chooses the line that minimizes the sum of the squared residuals, $\sum_{i=1}^{n} e_i^2$
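For OLS:
$\hat{\beta} = \dfrac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}$
$\hat{\alpha} = \bar{Y} - \hat{\beta} \bar{X}$
The estimated line, or prediction equation, is $\hat{Y} = \hat{\alpha} + \hat{\beta} X$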
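The OLS estimates have a standard closed form: the slope is the sum of cross-products of deviations divided by the sum of squared X deviations, and the intercept is the mean of Y minus the slope times the mean of X. A minimal sketch follows; the function name `fit_ols` and the toy data are hypothetical, chosen so the points lie exactly on a known line.

```python
def fit_ols(xs, ys):
    """Return (intercept, slope) of the least-squares line through (xs, ys)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of cross-products of deviations from the means
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Sum of squared deviations of X from its mean
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Toy data (hypothetical) lying exactly on y = 1 + 2x
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
intercept, slope = fit_ols(xs, ys)
```

Because the toy points have no scatter, the fitted line recovers the generating intercept and slope; with real data the estimates would differ from the true coefficients by sampling error.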
Sums of Squares
- $SS_{TOTAL} = \sum (Y_i - \bar{Y})^2$: total variation in the dependent variable
- $SS_{ERROR} = SS_{RESIDUAL} = \sum (Y_i - \hat{Y}_i)^2$: unexplained variation after fitting the model
- $SS_{REGRESSION} = SS_{TOTAL} - SS_{ERROR}$: variation "explained" by the model
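The three sums of squares can be computed directly from the observed and fitted values. The sketch below uses hypothetical toy data; the fitted values shown are those of the least-squares line for X = 1, ..., 5 with these Y values, and the helper name `sums_of_squares` is an illustrative choice.

```python
def sums_of_squares(ys, fitted):
    """Return (SS_TOTAL, SS_ERROR, SS_REGRESSION) for observed and fitted values."""
    y_bar = sum(ys) / len(ys)
    ss_total = sum((y - y_bar) ** 2 for y in ys)             # total variation in Y
    ss_error = sum((y - f) ** 2 for y, f in zip(ys, fitted))  # sum of squared residuals
    ss_regression = ss_total - ss_error                       # variation explained by the model
    return ss_total, ss_error, ss_regression

# Hypothetical toy data: X = 1..5, least-squares line is 2.2 + 0.6 X
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
fitted = [2.8, 3.4, 4.0, 4.6, 5.2]
ss_total, ss_error, ss_regression = sums_of_squares(ys, fitted)
```

Here the total variation of 6 splits into 2.4 unexplained and 3.6 explained, up to floating-point rounding.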
ANOVA Table
Source        df      SS              MS              F
Regression    1       $SS_{REG}$      $MS_{REG}$      $F_0$
Error         n-2     $SS_{ERROR}$    $MS_{ERROR}$
Total         n-1     $SS_{TOTAL}$
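$E(MS_{ERROR}) = \sigma^2$
$F_0 = MS_{REG} / MS_{ERROR}$ tests $H_0: \beta = 0$ (zero slope) against $H_1: \beta \neq 0$, where each mean square is its sum of squares divided by its df.
Reject $H_0$ if $F_0 > F_{(1,n-2),\,1-\alpha}$, where $\alpha$ here denotes the significance level.
$R^2 = SS_{REG} / SS_{TOTAL}$
$R^2 = (r)^2$, where $r$ is the sample correlation between X and Y.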
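As a rough illustration of how the F statistic and R-squared come out of the sums of squares, and of the fact that R-squared equals the squared sample correlation in simple linear regression, here is a sketch on hypothetical toy data. The function name `f_and_r_squared` is illustrative; the critical value would still be looked up in an F table with 1 and n-2 degrees of freedom.

```python
import math

def f_and_r_squared(xs, ys):
    """Return (F0, R^2, r) for the least-squares line through (xs, ys)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)  # this is SS_TOTAL
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # SS_ERROR is the sum of squared residuals from the fitted line
    ss_error = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_reg = syy - ss_error
    ms_reg = ss_reg / 1            # regression has 1 degree of freedom
    ms_error = ss_error / (n - 2)  # error has n - 2 degrees of freedom
    f0 = ms_reg / ms_error
    r_squared = ss_reg / syy
    r = sxy / math.sqrt(sxx * syy)  # sample correlation between X and Y
    return f0, r_squared, r

# Hypothetical toy data
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
f0, r_squared, r = f_and_r_squared(xs, ys)
```

For this toy data set the two routes to R-squared agree to within rounding, which is the identity stated on the slide.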
Example: X = fat cal., Y = chol
$\sum (X_i - \bar{X})(Y_i - \bar{Y})$ = ( )( ) + … + ( )( ) = 2180
$\sum (X_i - \bar{X})^2 = (28-34)^2 + \dots + (40-34)^2 = 800$
[Figure: plot of Y = Cholesterol against X = % Calories from Fat.]
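$\hat{\beta} = 2180 / 800 = 2.725$
$\hat{\alpha} = 199 - (2.725)(34) = 106.35$
$\hat{Y} = 106.35 + 2.725\,(\text{fat cal.})$
At fat cal. = 30: $\hat{Y} = 106.35 + 2.725(30) = 188.10$
At fat cal. = 28: $\hat{Y} = 106.35 + 2.725(28) = 182.65$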
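The arithmetic for the fitted line can be checked from the summary quantities given in the example: the cross-product sum 2180, the sum of squared X deviations 800, and the means 34 and 199 that appear in the intercept calculation. The variable and function names below are hypothetical.

```python
# Summary quantities taken from the worked example
sxy = 2180.0   # sum of (X - mean X)(Y - mean Y)
sxx = 800.0    # sum of (X - mean X)^2
x_bar = 34.0   # mean % calories from fat
y_bar = 199.0  # mean cholesterol

slope = sxy / sxx                  # 2180 / 800
intercept = y_bar - slope * x_bar  # 199 - (slope)(34)

def predict(fat_cal):
    """Predicted cholesterol at a given % calories from fat."""
    return intercept + slope * fat_cal

pred_30 = predict(30.0)
pred_28 = predict(28.0)
```

The slope says predicted cholesterol rises by about 2.7 units for each additional percentage point of calories from fat, within the range of the data.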
Table columns: X = fat cal., Y = chol, e = residual
$SS_{ERROR} = (-5.650)^2 + \dots + (-9.350)^2 = 1379.5$
$SS_{TOTAL}$ = ( )$^2$ + … + ( )$^2$ = 7320
$SS_{REGRESSION} = 7320 - 1379.5 = 5940.5$
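Source        df    SS        MS        F
Regression    1     5940.5    5940.5    25.84
Error         6     1379.5    229.92
Total         7     7320
Reject $H_0$ if $F_0 > F_{(1,6),\,.95} = 5.99$
Since $F_0 = 25.84 > 5.99$, reject $H_0$.
Conclude there is a positive linear relationship between calories from fat and cholesterol.
$R^2 = 5940.5 / 7320 = .8115$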
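The ANOVA quantities for the example can be rebuilt from figures that appear in the lesson: the total sum of squares 7320, the slope 2.725, the X sum of squares 800, and the critical value 5.99. The sample size n = 8 is inferred from the error degrees of freedom (n - 2 = 6), and the regression sum of squares uses the identity that, for a least-squares fit, it equals the squared slope times the X sum of squares. Variable names are hypothetical.

```python
n = 8              # inferred from the error df of 6 (n - 2 = 6)
ss_total = 7320.0  # total sum of squares from the example
slope = 2.725      # estimated slope from the example
sxx = 800.0        # sum of squared X deviations from the example

# For an OLS fit, SS_REGRESSION = slope^2 * Sxx, which equals SS_TOTAL - SS_ERROR
ss_regression = slope ** 2 * sxx
ss_error = ss_total - ss_regression

ms_regression = ss_regression / 1  # 1 df for regression
ms_error = ss_error / (n - 2)      # n - 2 df for error
f0 = ms_regression / ms_error
r_squared = ss_regression / ss_total

f_critical = 5.99  # F(1,6) at .95, as given in the example
reject_h0 = f0 > f_critical

print(f"SS_REG={ss_regression:.1f} SS_ERROR={ss_error:.1f} "
      f"F0={f0:.2f} R^2={r_squared:.4f} reject={reject_h0}")
```

The R-squared computed this way agrees with the .8115 reported in the example, which is a useful check that the reconstructed sums of squares are consistent with the lesson's numbers.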