Basic Statistics Linear Regression
Simple Linear Regression
Predicting Y from X
Recall that when we looked at scatter plots in our discussion of correlation, we showed roughly how to estimate Y for a given value of X when the correlation was not perfect. We will now use our knowledge of the correlation to predict a value of Y when we know a value of X.
[Figure: Scatter Plot of Y and X. Axes: Variable X and Variable Y, each running from low to high. The green line shows our prediction or regression line; a marked point shows the estimated Y value.]
Prediction Equation
The green line in the previous slide showed us our prediction line. We will use the mathematical formula for a straight line to predict a value for Y when we know the value of X. The process is called “Linear Regression” because, in this class, we will only deal with relationships that can be fitted by a straight line. The general formula for a straight line is:
Ŷ = a_y + b_y X
The Prediction Equation
Ŷ = the predicted value of Y for a given value of X
a_y = the intercept, or where the prediction line crosses the Y-axis (the value of Y when X = 0)
b_y = the regression coefficient, which indicates the amount of change in Y when the value of X increases by one unit
A Simple Example
Suppose that a club charges a flat $25 to use their facilities. They also charge a $10 fee per hour for using the tennis courts. Now, assume that you want to play tennis for 2 hours at this club. How much would you have to pay?
Ŷ = $25 + (2)($10) = $25 + $20 = $45 for two hours of tennis
Linking the Simple Example to Regression
Ŷ = $25 + (2)($10) = $25 + $20 = $45 for two hours of tennis
In our example:
– $25 is a_y, the intercept. Even if we didn’t play any tennis (X = 0), it would cost $25 to use the club.
– $10 is b_y, the regression coefficient (it costs $10 for each hour of tennis played).
In this case we predicted how much it would cost (Y) when we knew how long we wanted to play tennis (X).
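The club example can be written as a tiny function to make the roles of the intercept and the regression coefficient explicit. This is only an illustrative sketch; the function name `predict` and its parameter names are assumptions, not part of the slides.

```python
def predict(intercept, slope, x):
    """Return the predicted Y on the straight line Y-hat = intercept + slope * X."""
    return intercept + slope * x

# Club example: $25 flat fee (the intercept) plus $10 per hour of tennis (the slope).
cost_two_hours = predict(25, 10, 2)
cost_no_tennis = predict(25, 10, 0)

print(cost_two_hours)  # 45
print(cost_no_tennis)  # 25
```

Setting X to 0 returns the intercept alone, which is the $25 charged even when no tennis is played.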
Formulae for Sums of Squares
These were introduced in our discussion of correlation.
SS_X = ΣX² − (ΣX)²/N
SS_Y = ΣY² − (ΣY)²/N
SP_XY = ΣXY − (ΣX)(ΣY)/N
Calculating the Regression Coefficient (b)
b = SP_XY / SS_X
or
b = [ΣXY − (ΣX)(ΣY)/N] / [ΣX² − (ΣX)²/N]
Calculating the Intercept (a)
a = Ȳ − b X̄
You will notice that you must calculate the regression coefficient (b) before you can calculate the intercept (a), since the calculation of a uses b.
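The order of calculation (first the sum of squares for X and the sum of products, then b, then a) can be sketched in a few lines. The function name `fit_line` and the four data points are hypothetical, invented only to exercise the arithmetic; they are not the professor's data.

```python
def fit_line(xs, ys):
    """Return (a, b) for the line Y-hat = a + b * X using the computational formulas."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    # Sum of squares for X: sum of X squared minus (sum of X) squared over N.
    ss_x = sum(x * x for x in xs) - sum_x ** 2 / n
    # Sum of products: sum of XY minus (sum of X)(sum of Y) over N.
    sp_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    b = sp_xy / ss_x                   # regression coefficient comes first
    a = sum_y / n - b * (sum_x / n)    # the intercept needs b
    return a, b

# Hypothetical data for illustration only.
hours = [1, 2, 3, 4]
errors = [9, 7, 6, 2]
a, b = fit_line(hours, errors)
print(round(a, 2), round(b, 2))  # 11.5 -2.2
```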
An Example
From our earlier example, suppose that our college statistics professor is interested in predicting how many errors students might make on the mid-term examination based on how many hours they studied. Specifically, the professor wants to know how many errors a student might make if the student studied for 5 hours.
The Stats Professor’s Data
Columns: Student, X (hours studied), Y (errors on the mid-term), X², Y², XY
Total (N = 10 students): ΣX = 70, ΣY = 73, ΣX² = 546, ΣY² = 695, ΣXY = 429
The Resulting Sums of Squares
SS_X = ΣX² − (ΣX)²/N = 546 − (70)²/10 = 546 − 490 = 56
SS_Y = ΣY² − (ΣY)²/N = 695 − (73)²/10 = 695 − 532.9 = 162.1
SP_XY = ΣXY − (ΣX)(ΣY)/N = 429 − (70)(73)/10 = 429 − 511 = −82
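The three quantities can be checked directly from the column totals. The variable names below are assumptions chosen for this sketch; the numbers are the totals given for the 10 students.

```python
n = 10                      # number of students
sum_x, sum_y = 70, 73       # totals of hours studied (X) and errors (Y)
sum_x2, sum_y2 = 546, 695   # totals of X squared and Y squared
sum_xy = 429                # total of the X * Y products

ss_x = sum_x2 - sum_x ** 2 / n        # 546 - 490
ss_y = sum_y2 - sum_y ** 2 / n        # 695 - 532.9
sp_xy = sum_xy - sum_x * sum_y / n    # 429 - 511

print(round(ss_x, 1), round(ss_y, 1), round(sp_xy, 1))  # 56.0 162.1 -82.0
```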
Calculating the Regression Coefficient (b)
b = SP_XY / SS_X = −82 / 56 = −1.46
This can be interpreted as the change in the value of Y (in our case, errors made on the mid-term) for a unit change in X (for us, each additional hour studied). Thus, study for another hour and make 1.46 fewer mistakes, on average.
Calculating the Intercept (a)
a = Ȳ − b X̄ = 7.3 − (−1.46)(7) = 7.3 + 10.22 = 17.52
Therefore, our prediction equation is Ŷ = 17.52 + (−1.46)(X)
Using Our Prediction Equation
Ŷ = 17.52 + (−1.46)(X)
If the professor wanted to predict the number of errors a student might make if the student had studied for 5 hours, then we would substitute 5 for X in the above equation and obtain:
Ŷ = 17.52 + (−1.46)(5) = 17.52 + (−7.3) = 10.22
Thus, the professor would predict 10.22 errors for a student who had studied for 5 hours.
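The whole chain, from the sum of products to the predicted number of errors, can be reproduced as a short sketch. The names `predict_errors`, `mean_x` and `mean_y` are assumptions; b is rounded to two decimals before it is reused, on the assumption that this matches the hand calculation with −1.46.

```python
ss_x, sp_xy = 56, -82         # sum of squares for X and sum of products
mean_x, mean_y = 7.0, 7.3     # 70 / 10 and 73 / 10

b = round(sp_xy / ss_x, 2)           # regression coefficient, rounded to -1.46
a = round(mean_y - b * mean_x, 2)    # intercept: 7.3 + 10.22 = 17.52

def predict_errors(hours):
    """Predicted mid-term errors for a given number of hours studied."""
    return a + b * hours

print(b, a)                          # -1.46 17.52
print(round(predict_errors(5), 2))   # 10.22
```

Carrying b at full precision (−82/56 ≈ −1.4643) instead of rounding it first gives an intercept of about 17.55 and a prediction of about 10.23, so the small difference is purely a rounding effect.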
Measuring Prediction Errors: The Standard Error of the Estimate
Since we know that the estimate is not exact, as statisticians we must report how much error we believe is in our estimate. The formula is:
s_est = √[(SS_Y − SP_XY²/SS_X) / (N − 2)]
OR
s_est = √[SS_Y (1 − r²) / (N − 2)]
Calculating the Standard Error of the Estimate
s_est = √[(SS_Y − SP_XY²/SS_X) / (N − 2)] = √[(162.1 − (−82)²/56) / 8] = √(42.03 / 8) = 2.29
Thus, when we estimated 10.22 errors, we would also report that the Standard Error of the Estimate is 2.29.
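As a check on the arithmetic, the standard error can be computed from the sums of squares. This sketch assumes the form s_est = sqrt((SS_Y − SP_XY² / SS_X) / (N − 2)), which reproduces the reported 2.29; the function and argument names are hypothetical.

```python
import math

def standard_error_of_estimate(ss_y, ss_x, sp_xy, n):
    """Square root of the residual sum of squares divided by N - 2."""
    residual_ss = ss_y - sp_xy ** 2 / ss_x   # variation in Y left after using X
    return math.sqrt(residual_ss / (n - 2))

s_est = standard_error_of_estimate(ss_y=162.1, ss_x=56, sp_xy=-82, n=10)
print(round(s_est, 2))  # 2.29
```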
Summarizing Prediction Equations
– The existence of a relationship between two variables allows us to use that knowledge to make predictions.
– The prediction based on our equation will result in less error in prediction than using the mean of the dependent variable.
– Two sums of squares are required to calculate the regression coefficient and the intercept.
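The claim that the prediction equation produces less error than the mean can be illustrated with the professor's sums of squares. This is a sketch under the assumption that "error" is measured as a sum of squared deviations; the variable names are hypothetical.

```python
ss_x, ss_y, sp_xy = 56, 162.1, -82

# Squared error when every student is predicted to make the mean number of errors.
error_using_mean = ss_y
# Squared error left over when the regression line is used instead.
error_using_line = ss_y - sp_xy ** 2 / ss_x
# Proportion of the variation in Y accounted for by X.
r_squared = 1 - error_using_line / error_using_mean

print(round(error_using_mean, 2), round(error_using_line, 2))  # 162.1 42.03
print(round(r_squared, 2))                                     # 0.74
```

Under these assumptions, the regression line removes roughly three quarters of the squared error that remains when only the mean is used.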