STATS 330: Lecture 10 (5/18)
Diagnostics 2

Aim of today's lecture:
- To describe some more remedies for non-planar data
- To look at diagnostics and remedies for non-constant scatter
Remedies for non-planar data (cont.)
- Last time we looked at diagnostics for non-planar data, and discussed what to do if the diagnostics indicate a problem.
- The short answer was: transform, so that the model fits the transformed data.
- How to choose a transformation?
  - Theory
  - The ladder of powers
  - Polynomials
- We illustrate with a few examples.
Example: using theory - cherry trees
- A tree trunk is a bit like a cylinder:
  Volume = pi * (diameter/2)^2 * height
- Taking logs:
  log(volume) = log(pi/4) + 2 log(diameter) + log(height)
  so a linear regression using the logged variables should work!
- In fact R^2 increases from 95% to 98%, and the residual plots are better.
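The cylinder argument can be checked numerically. A minimal sketch in Python/numpy (synthetic tree-like data generated from the cylinder formula, not the real cherry-tree measurements): regressing log(volume) on log(diameter) and log(height) should recover slopes of 2 and 1 and an intercept near log(pi/4).

```python
import numpy as np

# Hypothetical "tree" data: volumes generated from the cylinder formula
# V = (pi/4) * d^2 * h, with a little multiplicative noise.
rng = np.random.default_rng(0)
d = rng.uniform(8, 21, size=31)        # diameters
h = rng.uniform(60, 90, size=31)       # heights
v = (np.pi / 4) * d**2 * h * np.exp(rng.normal(0, 0.01, size=31))

# Regress log(volume) on log(diameter) and log(height).
X = np.column_stack([np.ones_like(d), np.log(d), np.log(h)])
beta, *_ = np.linalg.lstsq(X, np.log(v), rcond=None)

# Theory predicts slopes of 2 (diameter) and 1 (height),
# and an intercept near log(pi/4).
print(beta)
```

With noise this small the fitted coefficients land very close to the theoretical values, which is exactly why the logged regression works so well for the cherry trees.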
Example: cherry trees (cont.)

> new.reg <- lm(log(volume) ~ log(diameter) + log(height), data = cherry.df)
> summary(new.reg)

Call:
lm(formula = log(volume) ~ log(diameter) + log(height), data = cherry.df)

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)                                  e-09 ***
log(diameter)                              < 2e-16 ***
log(height)                                  e-06 ***
---
Residual standard error: on 28 degrees of freedom
Multiple R-squared: , Adjusted R-squared:
F-statistic: on 2 and 28 DF, p-value: < 2.2e-16

(R^2 was previously 94.8%.)
Example: cherry trees (original)
[plot: diagnostic plots for the untransformed model]
Example: cherry trees (logs)
[plot: diagnostic plots for the logged model]
Tyre abrasion data: gam plots

rubber.gam <- gam(abloss ~ s(hardness) + s(tensile), data = rubber.df)
par(mfrow = c(1, 2)); plot(rubber.gam)
Tyre abrasion data: polynomial
- The GAM curve looks like a polynomial, so fit a polynomial (i.e. include terms tensile^2, tensile^3, ...):
  lm(abloss ~ hardness + poly(tensile, 4), data = rubber.df)
  (the 4 is the degree of the polynomial)
- Usually a lot of trial and error is involved! We have succeeded when R^2 improves and the residual plots show no pattern.
- A 4th-degree polynomial works for the rubber data: R^2 increases from 84% to 94%.
Why 4th degree? Try a 5th-degree fit and look for the highest significant power:

> rubber.lm <- lm(abloss ~ poly(tensile, 5) + hardness, data = rubber.df)
> summary(rubber.lm)
Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)                                    e-16 ***
poly(tensile, 5)1                              e-10 ***
poly(tensile, 5)2
poly(tensile, 5)3                              e-05 ***
poly(tensile, 5)4                                   ***
poly(tensile, 5)5
hardness                                       e-13 ***

Residual standard error: on 23 degrees of freedom
Multiple R-squared: , Adjusted R-squared:
F-statistic: on 6 and 23 DF, p-value: 3.931e-13

The highest significant power is 4, so a 4th-degree polynomial suffices.
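The trial-and-error strategy can be sketched in a few lines of Python/numpy (synthetic data standing in for the rubber measurements, with a quadratic rather than quartic truth): fit polynomials of increasing degree and watch where R^2 stops improving.

```python
import numpy as np

# Synthetic stand-in for the rubber data: a curved relationship plus noise.
rng = np.random.default_rng(1)
x = rng.uniform(120, 240, size=30)                  # "tensile"
y = 400 - 3 * x + 0.012 * x**2 + rng.normal(0, 5, 30)

def r_squared(x, y, degree):
    """R^2 of a polynomial fit of the given degree."""
    coefs = np.polyfit(x, y, degree)
    fitted = np.polyval(coefs, x)
    ss_res = np.sum((y - fitted) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot

r2 = [r_squared(x, y, d) for d in range(1, 6)]
# R^2 can only increase with degree; the question is where it stops
# improving appreciably (degree 2 here, degree 4 for the rubber data).
print([round(v, 3) for v in r2])
```

Because the fits are nested, R^2 never decreases as the degree rises; the stopping rule is the highest degree whose term is still significant, as on the previous slide.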
Ladder of powers
- Rather than fitting polynomials in the explanatory variables, guided by gam plots, we can transform the response using the "ladder of powers": use y^p as the response rather than y, for some power p.
- Choose p either by trial and error using R^2, or by using a Box-Cox plot (see later in this lecture).
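The trial-and-error route up the ladder can be sketched in Python/numpy. This is a toy illustration with synthetic data built so that y^2 is exactly linear in x, so the power p = 2 should win the R^2 comparison:

```python
import numpy as np

# Try y^p for several rungs of the ladder and keep the power with the
# best R^2 for a straight-line fit.
x = np.linspace(0.0, 10.0, 40)
y = np.sqrt(1.0 + x)          # so the true relationship is y^2 = 1 + x

def r_squared(x, yp):
    """R^2 of a simple linear regression of yp on x."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, yp, rcond=None)
    res = yp - X @ beta
    return 1 - np.sum(res**2) / np.sum((yp - yp.mean()) ** 2)

powers = [-2, -1, -0.5, 0.5, 1, 2, 3]          # rungs of the ladder
scores = {p: r_squared(x, y**p) for p in powers}
best = max(scores, key=scores.get)
print(best, round(scores[best], 4))
```

With real data the comparison is noisier, which is why the Box-Cox plot later in the lecture gives a more principled way to choose p.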
Checking for equal scatter
- The model specifies that the scatter about the regression plane is uniform.
- In practice this means that the scatter doesn't depend on the explanatory variables or on the mean of the response.
- All tests and confidence intervals rely on this.
Scatter
- Scatter is measured by the size of the residuals.
- A common problem is scatter that increases as the mean response increases: the big residuals happen when the fitted values are big.
- Recognize this by a "funnel effect" in the plot of residuals versus fitted values.
Example: education expenditure data
- Data for the 50 states of the USA.
- Variables:
  - educ: per capita expenditure on education (the response)
  - percap: per capita income
  - under18: number of residents per 1000 under 18
  - urban: number of residents per 1000 in urban areas
- Fit the model educ ~ percap + under18 + urban.
[plot: the response, showing an outlier]
Outlier, pt 50 (California)
[plot]
Basic fit, outlier in

> educ.lm <- lm(educ ~ urban + percap + under18, data = educ.df)
> summary(educ.lm)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)                            e-05 ***
urban
percap                                 e-07 ***
under18                                e-05 ***

Residual standard error: on 46 degrees of freedom
Multiple R-squared: , Adjusted R-squared:
F-statistic: on 3 and 46 DF, p-value: 5.271e-09

R^2 is 59%.
Basic fit, outlier out

> educ50.lm <- lm(educ ~ urban + percap + under18, data = educ.df, subset = -50)
> summary(educ50.lm)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)                                 *
urban
percap                                    ***
under18                                     *
---
Residual standard error: on 45 degrees of freedom
Multiple R-squared: , Adjusted R-squared:
F-statistic: on 3 and 45 DF, p-value: 8.365e-07

R^2 is now 49%. Note how we exclude pt 50 with subset = -50.
> par(mfrow = c(1, 2))
> plot(educ50.lm, which = c(1, 3))

[plots: the residuals-vs-fitted plot shows a funnel effect; the scale-location plot shows an increasing relationship]
Remedies
Either:
- transform the response, or
- estimate the variances of the observations and use "weighted least squares".
Transforming the response

> tr.educ50.lm <- lm(I(1/educ) ~ urban + percap + under18, data = educ.df[-50,])
> plot(tr.educ50.lm)

Transforming to the reciprocal makes the diagnostic plots better!
What power to choose?
- How did we know to use reciprocals? Think of a more general model
  I(educ^p) ~ percap + under18 + urban
  where p is some power.
- Then estimate p from the data using a Box-Cox plot.
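The idea behind the Box-Cox plot can be sketched directly. This Python/numpy toy (not the R330 boxcoxplot function) evaluates the Box-Cox profile log-likelihood over a grid of powers, on synthetic data built so that 1/y is linear in x; the best power should come out near p = -1:

```python
import numpy as np

# Synthetic data: 1/y is linear in x, plus tiny multiplicative noise.
rng = np.random.default_rng(2)
x = np.linspace(0.5, 20.0, 60)
y = 1.0 / (2.0 + 0.5 * x) * np.exp(rng.normal(0, 0.005, x.size))
X = np.column_stack([np.ones_like(x), x])

def profile_loglik(p):
    """Box-Cox profile log-likelihood for power p (p = 0 means log)."""
    z = np.log(y) if p == 0 else (y**p - 1.0) / p
    beta, *_ = np.linalg.lstsq(X, z, rcond=None)
    rss = np.sum((z - X @ beta) ** 2)
    n = y.size
    return -n / 2 * np.log(rss / n) + (p - 1) * np.sum(np.log(y))

grid = np.linspace(-2, 1, 13)
best = max(grid, key=profile_loglik)
print(best)
```

A Box-Cox plot is just this profile traced over the grid; you read off the power at the extremum, exactly as on the next slide.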
Transforming the response (how?)

boxcoxplot(educ ~ urban + percap + under18, educ.df[-50,])

This draws a "Box-Cox plot" (boxcoxplot is an R330 function). The minimum is at about p = -1, which suggests the reciprocal transformation.
Weighted least squares
- Tests are invalid if the observations do not have constant variance.
- If the ith observation has variance v_i, then we can get valid tests by using "weighted least squares": minimise the sum of weighted squared residuals, sum of r_i^2 / v_i, rather than the sum of squared residuals, sum of r_i^2.
- We need to know the variances v_i.
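Minimising sum(r_i^2 / v_i) has a closed form, and it is the same as ordinary least squares after rescaling each row by the square root of its weight. A minimal Python/numpy sketch with made-up funnel-shaped data (the weights = 1/vars argument of R's lm does the equivalent rescaling internally):

```python
import numpy as np

# Made-up data whose variance grows with x (a funnel).
rng = np.random.default_rng(3)
n = 50
x = rng.uniform(0, 10, n)
X = np.column_stack([np.ones(n), x])
v = (1 + x) ** 2                       # variances of the observations
y = 2 + 3 * x + rng.normal(0, np.sqrt(v))

# Weighted least squares: beta = (X' W X)^{-1} X' W y, W = diag(1/v_i).
w = 1.0 / v                            # weights are reciprocal variances
W = np.diag(w)
beta_wls = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

# Equivalent: rescale rows by sqrt(w) and do ordinary least squares.
sw = np.sqrt(w)
beta_check, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
print(beta_wls, beta_check)
```

The two computations agree, which is why downweighting the noisy observations is both easy to implement and gives valid standard errors when the v_i are right.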
Finding the weights
- Step 1: plot the squared residuals versus the fitted values.
- Step 2: smooth the plot.
- Step 3: estimate the variance of each observation by its smoothed squared residual.
- Step 4: the weight is the reciprocal of the smoothed squared residual.
- Rationale: the variance is a function of the mean.
- Use the R330 function funnel.
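The four steps can be sketched in Python/numpy. This is only a stand-in for the R330 funnel function, on synthetic data, using a simple running-mean smoother in place of whatever smoother funnel actually uses:

```python
import numpy as np

# Synthetic data whose scatter grows with the mean.
rng = np.random.default_rng(4)
n = 100
x = np.sort(rng.uniform(0, 10, n))
y = 2 + 3 * x + rng.normal(0, 1 + 0.5 * x)

# Ordinary least squares fit.
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta
res2 = (y - fitted) ** 2                     # Step 1: squared residuals

# Steps 2-3: smooth res2 against the fitted values (running mean of 21),
# with an edge correction so partial windows are averaged properly.
order = np.argsort(fitted)
kernel = np.ones(21) / 21
smooth = np.empty(n)
smooth[order] = (np.convolve(res2[order], kernel, mode="same")
                 / np.convolve(np.ones(n), kernel, mode="same"))

weights = 1.0 / smooth                       # Step 4: reciprocal variances
```

The smoothed squared residuals rise with the fitted values, so the weights shrink for the noisy observations, which is exactly what the funnel output feeds into the weighted fit on the next slides.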
Doing it in R (1)

vars <- funnel(educ50.lm)   # an R330 function

The slope of the funnel plot is about 1.7, indicating p = 1 - 1.7 = -0.7.
Compare with the unweighted fit (note the p-values):

> educ50.lm <- lm(educ ~ urban + percap + under18, data = educ.df[-50,])
> summary(educ50.lm)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)                                 *
urban
percap                                    ***
under18                                     *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: on 45 degrees of freedom
Multiple R-squared: , Adjusted R-squared:
F-statistic: on 3 and 45 DF, p-value: 8.365e-07
> weighted.lm <- lm(educ ~ urban + percap + under18, weights = 1/vars, data = educ.df[-50,])
> summary(weighted.lm)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)                                 *
urban
percap                                 e-07 ***
under18                                      **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: on 45 degrees of freedom
Multiple R-squared: 0.629, Adjusted R-squared:
F-statistic: on 3 and 45 DF, p-value: 8.944e-10

Note the reciprocals in weights = 1/vars, and note the changes in the p-values!
Conclusion: unequal variances matter; they can change the results!