Week 5 Lecture 1, Chapter 7. Linear Regression (Residuals and $R^2$ - The Variation Accounted For)

Residuals
A residual is the difference between an observed value of the response and the value predicted by the regression line. That is,
residual = observed y - predicted y = $y - \hat{y}$
We denote the residual by the lower-case letter e:
$e = y - \hat{y}$
Some residuals are negative and some are positive. Some are very close to zero, or exactly zero when there is no error of prediction.
For a least-squares line, the sum (and therefore the mean) of the residuals is zero.
The residual is positive when $y > \hat{y}$, which means that the prediction was too low (we underestimated).
The residual is negative when $y < \hat{y}$, which means that the prediction was too high (we overestimated).
The residual is zero when $y = \hat{y}$ (no error of prediction).
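The definitions above can be sketched in a few lines of Python. This is only an illustration: the data values and the helper names `fit_line` and `residuals` are made up for the example and do not come from the lecture.

```python
def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line for y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def residuals(x, y, intercept, slope):
    """Residual e = observed y - predicted y, one per data point."""
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


# Hypothetical toy data (NOT the survey data from the lecture).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_line(x, y)
e = residuals(x, y, b0, b1)

for xi, yi, ei in zip(x, y, e):
    label = "underestimated" if ei > 0 else "overestimated" if ei < 0 else "exact"
    print(f"x={xi}, y={yi}, residual={ei:+.2f} ({label})")

# Up to floating-point rounding, least-squares residuals sum to zero.
print("sum of residuals is (nearly) zero:", abs(sum(e)) < 1e-9)
```

Some residuals come out positive and some negative, and their sum is zero apart from rounding, as stated above.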

Example
A survey conducted in the United States and 10 countries of Western Europe determined the percentage of teenagers who had used marijuana and the percentage who had used other drugs. The results are summarized in the table. Based on the scatterplot, we saw that a linear regression model was appropriate. The regression equation (fitted line) was:
$\hat{y} = -3.068 + 0.615x$

Example
The regression equation was:
$\hat{y} = -3.068 + 0.615x$
For the USA, the percent of marijuana usage was $x = 34$ and the percent of other drug usage was $y = 24$.
The predicted percent of other drug usage was:
$\hat{y} = -3.068 + 0.615(34) = 17.84$
The residual (observed - predicted) is:
$e = 24 - 17.84 = 6.16$ (%)
The residual is positive because $y > \hat{y}$, which means that we underestimated the percent of teens who use other drugs.
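The same arithmetic can be checked with a short script. The coefficients and the USA values are taken from the example above; the function name `predict_other_drug_pct` is just a label chosen for this sketch.

```python
# Fitted line from the example: y-hat = -3.068 + 0.615 x
INTERCEPT = -3.068
SLOPE = 0.615


def predict_other_drug_pct(marijuana_pct):
    """Predicted percent of other drug usage for a given marijuana percent."""
    return INTERCEPT + SLOPE * marijuana_pct


observed = 24.0                            # USA: observed other drug usage (%)
predicted = predict_other_drug_pct(34.0)   # USA: marijuana usage x = 34 (%)
residual = observed - predicted            # observed - predicted

print(round(predicted, 2), round(residual, 2))  # prints: 17.84 6.16
```

A positive residual here confirms that the line underestimates the USA value.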

In StatCrunch
The table below shows the residual values for the data, computed with StatCrunch.
StatCrunch command: Stat > Regression > Simple Linear; X-variable: Marijuana%; Y-variable: Other drug%; Save "Residuals"; click Compute.

Residual Plots
Residual plots help us assess the regression model assumptions. We check the assumptions about the random errors using the residuals. There are three assumptions to check:
1. Residuals are approximately normally distributed. Check the normal quantile (QQ) plot, the histogram of residuals, or the histogram of standardized residuals.
2. Residuals have mean zero. The points should be scattered randomly around the zero line (the mean of the residuals). Use the plot of residuals versus the predictor or the fitted values. This is a scatterplot, except that here we do not want to see any obvious pattern.
3. Residuals have constant variance. The points should be evenly spread around the zero line. Use the same plot of residuals versus the predictor or the fitted values, and look for patterns such as fanning: an increasing dispersion as the fitted values increase.
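As a rough numeric companion to the plots (not a replacement for looking at them), one can compare the spread of the residuals in the lower and upper halves of the fitted values. The sketch below is only a crude heuristic; the helper `spread_by_half` and both sets of residuals are invented for illustration.

```python
from statistics import mean, pstdev


def spread_by_half(fitted, resid):
    """Standard deviation of residuals for the lower and upper half of fitted values."""
    pairs = sorted(zip(fitted, resid))      # order the points by fitted value
    mid = len(pairs) // 2
    low = [r for _, r in pairs[:mid]]
    high = [r for _, r in pairs[mid:]]
    return pstdev(low), pstdev(high)


# Hypothetical fitted values and two hypothetical sets of residuals.
fitted = [1, 2, 3, 4, 5, 6, 7, 8]
even_resid = [0.5, -0.4, 0.3, -0.5, 0.4, -0.3, 0.5, -0.5]      # even spread
fanning_resid = [0.1, -0.1, 0.2, -0.2, 1.0, -1.2, 1.5, -1.3]   # spread grows

for name, resid in [("even", even_resid), ("fanning", fanning_resid)]:
    low_sd, high_sd = spread_by_half(fitted, resid)
    print(f"{name}: mean={mean(resid):.3f}, "
          f"low-half sd={low_sd:.2f}, high-half sd={high_sd:.2f}")
```

Both sets have mean zero, but in the second set the upper-half spread is several times the lower-half spread, which is the numeric counterpart of a fanning pattern.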

Checking the Assumptions In StatCrunch
StatCrunch command: Stat > Regression > Simple Linear; X-variable: Marijuana%; Y-variable: Other drug%; Graphs: Histogram of residuals, QQ plot of residuals, Residuals vs X-values; click Compute.

Histogram of Residuals

Checking the Assumption of Normality
There is no major departure from the straight line in the QQ plot; therefore, the assumption that the residuals are normally distributed is met.

Checking the Assumption of Mean Zero and Constant Variances
Assumption #2: Residual points are scattered randomly around the horizontal zero line (the mean of the residuals), and no major pattern is seen. Therefore, the assumption that the residuals have mean zero is met.
Assumption #3: Residual points are evenly spread out around the zero line. Therefore, the assumption that the residuals have constant variance is met.

Regression Model Is Correct: All Three Assumptions Are Met

Example of Regression Model NOT Correct
The curvature pattern suggests the need for a higher-order model or a transformation.
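To see how curvature shows up in residuals, here is a small sketch that fits a straight line to data that are exactly quadratic. The data ($y = x^2$) and the helper `fit_line` are made up for this illustration.

```python
def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line for y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical curved data: y = x squared.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [xi ** 2 for xi in x]

b0, b1 = fit_line(x, y)
resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

print([round(r, 2) for r in resid])  # prints: [2.0, -1.0, -2.0, -1.0, 2.0]
```

The residuals are positive at both ends and negative in the middle. This systematic U-shape, rather than random scatter around zero, is the kind of pattern that signals a straight line is the wrong model.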

Example of Regression Model NOT Correct
There is a trend in dispersion: the spread of the residuals increases as the fitted values increase. In this case a transformation of the response, such as taking the log or the square root, may help.

$R^2$ - The Coefficient of Determination
The square of the correlation ($r^2$) is the fraction of the variation in the values of the response (y) that is explained by the least-squares regression of y on x (the explanatory variable).
In our example: $r = 0.93$ and $r^2 = (0.93)^2 \approx 0.87$ (squaring the rounded value 0.93 gives 0.8649; the reported 0.87 presumably reflects the unrounded correlation).
$1 - R^2$ is the proportion of the variation in y that is not explained by the linear relationship with x (it is left in the residuals).
In our example: $1 - R^2 = 1 - 0.87 = 0.13$
Interpretation: About 87% of the variation in the percent of teens who used other drugs (other than marijuana) is explained by the linear regression on the percent of teens who used marijuana.
Checking the above numbers with StatCrunch (actually reading from StatCrunch):
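The link between "variation explained" and the squared correlation can be verified numerically. The sketch below computes $R^2$ as one minus the ratio of residual variation to total variation and compares it with $r^2$. The data are hypothetical (not the survey data) and the variable names are chosen for this example only.

```python
import math

# Hypothetical toy data (NOT the survey data from the lecture).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)   # total variation in y
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

slope = sxy / sxx
intercept = mean_y - slope * mean_x
r = sxy / math.sqrt(sxx * syy)              # correlation coefficient

# Variation left in the residuals (sum of squared residuals).
sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

r_squared = 1 - sse / syy                   # fraction of variation explained
unexplained = sse / syy                     # this is 1 - R^2

print("R^2 equals r^2:", abs(r_squared - r ** 2) < 1e-9)
print("explained + unexplained = 1:", abs(r_squared + unexplained - 1) < 1e-9)
```

For a simple linear regression with an intercept, the two quantities agree up to rounding, which is why $r^2$ can be read as the fraction of variation accounted for.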

Finding r from $R^2$
Recall the association between Adult smokers % and ACT scores. The regression function was: Adult Smokers % = ACT
$R^2 = 0.20$
What is r (the estimate of the correlation)?
$r = (\text{sign of slope}) \times \sqrt{R^2}$
Here the sign is negative, so:
$r = -\sqrt{0.20} \approx -0.45$
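The rule "r takes the sign of the slope" is easy to express in code. In this sketch the function name `r_from_r_squared` is made up, and the slope value -1.0 is only a placeholder: the rule uses just its sign, which is negative in this example.

```python
import math


def r_from_r_squared(r_squared, slope):
    """Correlation r = (sign of slope) * sqrt(R^2)."""
    return math.copysign(math.sqrt(r_squared), slope)


# R^2 = 0.20 from the example; -1.0 is a placeholder for a negative slope.
r = r_from_r_squared(0.20, slope=-1.0)
print(round(r, 2))  # prints: -0.45
```

With a positive slope the same call would return +0.45, so $R^2$ alone is not enough to recover r; the direction of the association is also needed.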

Steps in Doing Regression
1. Start with a scatterplot. If it does not look like a straight-line relationship, stop.
2. Otherwise, calculate the correlation, and the intercept and slope of the regression line.
3. Check whether the regression is OK by looking at plots of the residuals against anything relevant (for example, the predictor or the fitted values).
4. If it is not OK, do not use the regression: we cannot say that the explanatory variable is a useful predictor.
Our aim: a regression for which the line is OK, confirmed by looking at both the scatterplot and the residual plots.
