LSRL: Least-Squares Regression Line
Bivariate data
REMEMBER:
x-variable: the explanatory variable
y-variable: the response variable
Use x to predict y.
Regression Line
The line that gives the best fit to the data set.
Definition: A regression line is a line that describes how a response variable y changes as an explanatory variable x changes. We often use a regression line to predict the value of y for a given value of x.
ŷ = a + bx
Be sure to put the hat on the y.
ŷ (y-hat) is the predicted value of the response variable y for a given value of the explanatory variable x.
b is the slope: the approximate amount by which y is predicted to change when x increases by 1 unit.
a is the y-intercept: the predicted value of y when x = 0. In some situations, the y-intercept has no meaning.
Interpretations
Slope: For each one-unit increase in x, there is an approximate increase/decrease of b in y.
Y-intercept: The predicted value of y when x is zero. Sometimes this is nonsensical in the context of the question.
Use your calculator to find the LSRL.
The ages (in months) and heights (in inches) of seven children are given.
x (age, months):     16  24  42  60  75  102  120
y (height, inches):  24  30  35  40  48   56   60
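As a cross-check on the calculator, the same fit can be sketched in plain Python. The helper name `lsrl` is hypothetical (it is not from the slides), and the slope is computed from sums of deviations from the means.

```python
# Ages (months) and heights (inches) of the seven children.
ages = [16, 24, 42, 60, 75, 102, 120]
heights = [24, 30, 35, 40, 48, 56, 60]

def lsrl(xs, ys):
    """Return (a, b) for the least-squares line y-hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of products of deviations, and sum of squared x deviations.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

a, b = lsrl(ages, heights)
print(f"y-hat = {a:.2f} + {b:.3f}x")  # prints y-hat = 20.40 + 0.342x
```

The result should agree with the slope of about 0.34 and y-intercept of about 20.4 used in the interpretations.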
Interpretations
Correlation coefficient: Describe the direction, strength, and type of association between x and y.
There is a strong, positive, linear association between the age and height of children.
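The "strong, positive" description can be checked numerically. This is a minimal sketch using the children's data; the function name `correlation` is hypothetical and the value of r is computed here, not quoted from the slides.

```python
import math

ages = [16, 24, 42, 60, 75, 102, 120]
heights = [24, 30, 35, 40, 48, 56, 60]

def correlation(xs, ys):
    """Pearson correlation coefficient r of two equal-length lists."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

r = correlation(ages, heights)
print(f"r = {r:.3f}")  # prints r = 0.994
```

A value this close to +1 supports "strong" and "positive"; whether the form is linear still has to be judged from the scatterplot.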
Interpretations
Slope: For each one-unit increase in x, there is an approximate increase/decrease of b in y.
For each one-month increase in age, there is an approximate increase of 0.34 inches in the height of children.
Interpretations
Y-intercept: The predicted value of y when x is zero.
We expect that a child who is 0 months old would be 20.4 inches tall (the child's length at birth).
The ages (in months) and heights (in inches) of seven children are given.
x (age, months):     16  24  42  60  75  102  120
y (height, inches):  24  30  35  40  48   56   60
Graph the data, find the LSRL, and examine the means of x and y.
Predict the height of a child who is 4.5 years old (54 months): ŷ = 20.4 + 0.342(54), about 38.9 inches.
Predict the height of someone who is 20 years old (240 months): ŷ = 20.4 + 0.342(240) = 102.48 inches, or about 8.5 feet?
Interpolation (good): Using a regression line to predict y for values of x that fall between the known values.
Extrapolation (bad)
The LSRL should not be used to predict y for values of x outside the data set. It is unknown whether the pattern observed in the scatterplot continues outside this range, so such predictions are often inaccurate.
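One way to make the distinction concrete is to flag any prediction whose x falls outside the observed range. The sketch below refits the line from the children's data; the function `predict` and its return shape are assumptions made for illustration.

```python
ages = [16, 24, 42, 60, 75, 102, 120]
heights = [24, 30, 35, 40, 48, 56, 60]

# Fit the LSRL y-hat = a + b*x from the data.
n = len(ages)
x_bar = sum(ages) / n
y_bar = sum(heights) / n
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(ages, heights))
     / sum((x - x_bar) ** 2 for x in ages))
a = y_bar - b * x_bar

def predict(x):
    """Return (y_hat, is_extrapolation) for an age x in months."""
    outside = x < min(ages) or x > max(ages)
    return a + b * x, outside

for months in (54, 240):
    y_hat, outside = predict(months)
    label = "extrapolation" if outside else "interpolation"
    print(f"{months} months: {y_hat:.1f} in ({label})")
```

The 54-month prediction lies inside the 16 to 120 month range, while 240 months is far outside it, which is why a predicted height of roughly 8.5 feet should not be trusted.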
Least-Squares Regression: Residuals
In most cases, no line will pass exactly through all the points in a scatterplot. A good regression line makes the vertical distances of the points from the line as small as possible.
Definition: A residual is the difference between an observed value of the response variable and the value predicted by the regression line. That is,
residual = observed y - predicted y
residual = y - ŷ
Think AP: Actual y - Predicted y
Positive residuals: points above the line. Negative residuals: points below the line.
What would happen if we added up all the residuals?
Find the residuals (observed y - predicted y) from the regression line ŷ = 0.5x + 4 for the points (0,0), (3,10), (6,2).
(0,0): ŷ = 0.5(0) + 4 = 4, residual = 0 - 4 = -4
(3,10): ŷ = 0.5(3) + 4 = 5.5, residual = 10 - 5.5 = 4.5
(6,2): ŷ = 0.5(6) + 4 = 7, residual = 2 - 7 = -5
Sum of the residuals = -4 + 4.5 - 5 = -4.5
Sum of the squared residuals = 16 + 20.25 + 25 = 61.25
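The residual arithmetic for this line can be reproduced in a few lines of Python. This is only a sketch of the calculation on the slide; the names `points`, `predict`, and `sse` are chosen for illustration.

```python
points = [(0, 0), (3, 10), (6, 2)]

def predict(x):
    """Predicted y from the candidate line y-hat = 0.5x + 4."""
    return 0.5 * x + 4

# residual = observed y - predicted y
residuals = [y - predict(x) for x, y in points]
total = sum(residuals)
sse = sum(e ** 2 for e in residuals)

print(residuals)  # prints [-4.0, 4.5, -5.0]
print(total)      # prints -4.5
print(sse)        # prints 61.25
```

Positive and negative residuals partly cancel in the plain sum, which is why the squared residuals are used to measure fit.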
What is the sum of the deviations from the line? Will it always be zero?
Use a calculator to find the line of best fit for (0,0), (3,10), (6,2).
Find y - ŷ for each point: -3, 6, -3
Sum of the residuals = -3 + 6 - 3 = 0. For the LSRL, the residuals always sum to zero.
Sum of the squared residuals = 9 + 36 + 9 = 54
The line that minimizes the sum of the squares of the deviations from the line is the LSRL.
Least-Squares Regression Line
Different regression lines produce different residuals. The regression line we want is the one that minimizes the sum of the squared residuals.
Definition: The least-squares regression line of y on x is the line that makes the sum of the squared residuals as small as possible.
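The definition can be illustrated with the three points (0,0), (3,10), (6,2) from the residual example. The sketch below compares the sum of squared residuals (SSE) for the least-squares line with the candidate line ŷ = 0.5x + 4, then nudges the fitted coefficients to see that the SSE only grows. The function name `sse` and the nudge sizes are arbitrary choices for illustration.

```python
points = [(0, 0), (3, 10), (6, 2)]

def sse(a, b):
    """Sum of squared residuals for the line y-hat = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in points)

# Least-squares slope and intercept from deviations about the means.
n = len(points)
x_bar = sum(x for x, _ in points) / n
y_bar = sum(y for _, y in points) / n
b = (sum((x - x_bar) * (y - y_bar) for x, y in points)
     / sum((x - x_bar) ** 2 for x, _ in points))
a = y_bar - b * x_bar

best = sse(a, b)
print(f"LSRL: y-hat = {a:.2f} + {b:.3f}x, SSE = {best:.2f}")
print(f"Candidate: y-hat = 4 + 0.5x, SSE = {sse(4, 0.5):.2f}")

# Moving the intercept or slope away from the LSRL values increases the SSE.
for da, db in [(0.5, 0), (-0.5, 0), (0, 0.1), (0, -0.1)]:
    assert sse(a + da, b + db) > best
```

The LSRL gives an SSE of 54, smaller than the 61.25 obtained from ŷ = 0.5x + 4.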
Will this point always be on the LSRL?
The ages (in months) and heights (in inches) of seven children are given.
x (age, months):     16  24  42  60  75  102  120
y (height, inches):  24  30  35  40  48   56   60
Calculate x̄ and ȳ, and s_x and s_y.
Plot the point (x̄, ȳ) on the scatterplot.
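A short derivation suggests the answer is yes. It assumes the standard y-intercept formula for the LSRL, a = ȳ - b·x̄ (y-intercept = mean of y - slope × mean of x), and substitutes x = x̄ into ŷ = a + bx:

```latex
\hat{y}(\bar{x}) = a + b\,\bar{x} = (\bar{y} - b\,\bar{x}) + b\,\bar{x} = \bar{y}
```

So the predicted value at x̄ is exactly ȳ, which places the point (x̄, ȳ) on the line for any data set.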
Least-Squares Regression Line
We can use technology to find the equation of the least-squares regression line. We can also write it in terms of the means and standard deviations of the two variables and their correlation.
Definition: Equation of the least-squares regression line
We have data on an explanatory variable x and a response variable y for n individuals. From the data, calculate the means x̄ and ȳ, the standard deviations s_x and s_y, and the correlation r. The least-squares regression line is the line ŷ = a + bx with
Slope: b = r(s_y / s_x)
y-intercept: a = ȳ - b·x̄
Formulas (on chart)
Predicted y = y-intercept + slope(x)
Slope = correlation coefficient × (st. dev. of y / st. dev. of x)
Y-intercept = mean of y - slope × (mean of x)
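As a sketch of how these formulas work on real numbers, the block below applies them to the children's age and height data using Python's standard `statistics` module. Computing r as the average product of z-scores (dividing by n - 1) is one common definition and is an assumption here, since the slides do not show how r is calculated.

```python
import statistics

ages = [16, 24, 42, 60, 75, 102, 120]
heights = [24, 30, 35, 40, 48, 56, 60]

x_bar = statistics.mean(ages)
y_bar = statistics.mean(heights)
s_x = statistics.stdev(ages)      # sample standard deviation (divides by n - 1)
s_y = statistics.stdev(heights)

# Correlation: sum of products of z-scores, divided by n - 1.
n = len(ages)
r = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
        for x, y in zip(ages, heights)) / (n - 1)

slope = r * s_y / s_x                 # slope = r * (s_y / s_x)
intercept = y_bar - slope * x_bar     # y-intercept = y-bar - slope * x-bar

print(f"x-bar = {x_bar:.2f}, y-bar = {y_bar:.2f}")
print(f"s_x = {s_x:.2f}, s_y = {s_y:.2f}, r = {r:.3f}")
print(f"y-hat = {intercept:.2f} + {slope:.3f}x")
```

The slope and intercept should match what a calculator's linear regression reports for the same data.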
What are the explanatory and response variables?
The following statistics are found for the variables posted speed limit and the average number of accidents.
Find and interpret the LSRL, and predict the number of accidents for a posted speed limit of 50 mph.
For the LSRL we need the slope and y-intercept:
Slope: b1 = 0.7228, rounded to 0.723
Y-intercept: b0 = 18 - 0.723(40) = -10.92
Predicted # of accidents = -10.92 + 0.723(posted speed limit)
Slope: For each one-mph increase in the posted speed limit, the predicted number of accidents goes up by 0.723.
Y-intercept: If the speed limit were 0 mph, the predicted number of accidents would be -10.92 (nonsensical).
Predicted number of accidents if the posted speed limit is 50 mph: -10.92 + 0.723(50) = 25.23
The correlation coefficient and the LSRL are both non-resistant measures. They are changed by outliers and influential points.
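A small experiment can show how little it takes to move both measures. The sketch below fits the children's age and height data, then adds a single hypothetical outlier (a 120-month-old recorded at 20 inches, which is not from the slides) and refits. The helper `fit` is also an illustrative name.

```python
import math

def fit(xs, ys):
    """Return (slope, r) for the least-squares line of ys on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sxx, sxy / math.sqrt(sxx * syy)

ages = [16, 24, 42, 60, 75, 102, 120]
heights = [24, 30, 35, 40, 48, 56, 60]

b, r = fit(ages, heights)

# Hypothetical outlier, invented for illustration only.
b_out, r_out = fit(ages + [120], heights + [20])

print(f"original:     slope = {b:.3f}, r = {r:.3f}")
print(f"with outlier: slope = {b_out:.3f}, r = {r_out:.3f}")
```

With one added point, the slope drops to roughly half its original value and r falls from about 0.99 to below 0.5.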
Example: The average annual cost per person due to traffic delays for 70 US cities in 2000 was $298.96 with a standard deviation of $180.83. The peak-period average freeway speed was 54.34 mph with a standard deviation of 4.494 mph. The correlation between cost per person and freeway speed is -0.90. Write a regression model to estimate costs per person associated with traffic delays.
What are the explanatory and response variables? Explanatory: freeway speed. Response: cost per person.
Slope: b = r(s_y / s_x) = -0.90(180.83 / 4.494) = -36.21
Y-intercept: a = ȳ - b·x̄ = 298.96 + 36.21(54.34) = 2266.61
Predicted cost = 2266.61 - 36.21(freeway speed)
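The model can be rebuilt from the summary statistics alone. In this sketch the slope is rounded to two decimals before the intercept is computed, which is an assumption made so that the result reproduces the 2266.61 in the model; the function name `predicted_cost` is illustrative.

```python
r = -0.90
mean_cost, sd_cost = 298.96, 180.83      # response: cost per person ($)
mean_speed, sd_speed = 54.34, 4.494      # explanatory: freeway speed (mph)

# slope = r * (s_y / s_x), rounded to 2 decimals before further use
slope = round(r * sd_cost / sd_speed, 2)
# y-intercept = mean of y - slope * mean of x
intercept = mean_cost - slope * mean_speed

def predicted_cost(speed_mph):
    """Predicted annual delay cost per person at a given freeway speed."""
    return intercept + slope * speed_mph

print(f"Predicted cost = {intercept:.2f} + ({slope:.2f})(freeway speed)")
# prints Predicted cost = 2266.61 + (-36.21)(freeway speed)
```

Evaluating `predicted_cost(54.34)` returns the mean cost of about 298.96, since the line passes through the point of means.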
Example: A scatterplot of house prices (in thousands of dollars) vs. house size (in thousands of square feet) shows a relationship that is straight, with only moderate scatter and no outliers. The correlation between house price and house size is 0.85.
a) If a house is 1 SD above the mean in size (making it about 2170 sq ft), how many SDs above the mean would you predict its sale price to be?
For each 1 SD increase in x, the predicted y increases by r SDs, so the predicted price is 0.85 SD above the mean.
b) What would you predict about the sale price of a house that's 2 SDs below average in size?
The sale price would be predicted to be 2(0.85) = 1.7 SDs below the mean.
c) The regression model is Price = 9.564 + 122.74 size. What does the slope of 122.74 mean?
Predicted price (in thousands of dollars) = a + b(size in thousands of sq ft). For every 1000 sq ft increase in size, the predicted price goes up $122,740 on average.
d) What are the units?
Thousands of dollars per thousand square feet.
e) How much can a homeowner expect the value of his house to increase if he builds on an additional 2000 sq ft?
2(122.74) = 245.48 thousand dollars, or $245,480.
f) How much would you expect to pay for a house of 3000 sq ft?
Predicted price = 9.564 + 122.74(3) = 377.784 thousand dollars, or $377,784.
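The answers to this example can be checked with a short sketch. The function names are hypothetical; the key point it illustrates is that size must be entered in thousands of square feet and price comes out in thousands of dollars.

```python
R = 0.85  # correlation between house price and house size

def predicted_price_z(size_z):
    """Predicted price, in SDs from the mean, for a size given in SDs from the mean."""
    return R * size_z

def predicted_price(size_thousand_sqft):
    """Predicted price in thousands of dollars from Price = 9.564 + 122.74 size."""
    return 9.564 + 122.74 * size_thousand_sqft

print(predicted_price_z(1))    # part a: SDs above the mean
print(predicted_price_z(-2))   # part b: SDs relative to the mean (negative = below)
print(f"${122.74 * 2 * 1000:,.0f}")          # part e: added value for 2000 sq ft
print(f"${predicted_price(3) * 1000:,.0f}")  # part f: price of a 3000 sq ft house
```

Entering 3000 instead of 3 for the size would overstate the prediction by a factor of about a thousand, so keeping the units straight matters as much as the arithmetic.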