STATS 330: Lecture 10
Diagnostics 2

Aim of today's lecture:
- To describe some more remedies for non-planar data
- To look at diagnostics and remedies for non-constant scatter
Remedies for non-planar data (cont.)

Last time we looked at diagnostics for non-planar data and discussed what to do if the diagnostics indicate a problem. The short answer was: we transform, so that the model fits the transformed data.

How to choose a transformation?
- Theory
- Ladder of powers
- Polynomials

We illustrate with a few examples.
Example: using theory - cherry trees

A tree trunk is a bit like a cylinder:

    Volume = π × (diameter/2)² × height = (π/4) × diameter² × height

so

    log(volume) = log(π/4) + 2 log(diameter) + log(height)

and a linear regression using the logged variables should work! In fact R² increases from 95% to 98%, and the residual plots are better.
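This calculation can be checked with R's built-in trees data, which records the same 31 cherry trees (with Girth playing the role of diameter); here it stands in for the course's cherry.df:

```r
# Built-in cherry-tree data: Girth (diameter, inches), Height (ft), Volume (cu ft)
data(trees)

raw.fit <- lm(Volume ~ Girth + Height, data = trees)                 # original scale
log.fit <- lm(log(Volume) ~ log(Girth) + log(Height), data = trees)  # logged scale

summary(raw.fit)$r.squared   # about 0.948
summary(log.fit)$r.squared   # about 0.978

# The fitted power of diameter is close to the theoretical value 2
coef(log.fit)["log(Girth)"]
```

The coefficient on log(Girth) comes out near 2 and the coefficient on log(Height) near 1, just as the cylinder model predicts.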
Example: cherry trees (cont.)

> new.reg <- lm(log(volume) ~ log(diameter) + log(height), data = cherry.df)
> summary(new.reg)

Call:
lm(formula = log(volume) ~ log(diameter) + log(height), data = cherry.df)

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)   -6.63162    0.79979  -8.292 5.06e-09 ***
log(diameter)  1.98265    0.07501  26.432  < 2e-16 ***
log(height)    1.11712    0.20444   5.464 7.81e-06 ***
---
Residual standard error: 0.08139 on 28 degrees of freedom
Multiple R-squared: 0.9777, Adjusted R-squared: 0.9761
F-statistic: 613.2 on 2 and 28 DF, p-value: < 2.2e-16

Previously R² was 94.8%.
Example: cherry trees (original)

[Residual plots for the fit on the original scale.]
Example: cherry trees (logs)

[Residual plots for the fit on the log scale.]
Tyre abrasion data: gam plots

rubber.gam = gam(abloss ~ s(hardness) + s(tensile), data = rubber.df)
par(mfrow = c(1, 2)); plot(rubber.gam)
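The rubber.df data ship with the course's R330 package, so as a self-contained sketch of this kind of check, here is a gam fit on simulated data using the recommended mgcv package (the variable names and true curves are made up):

```r
library(mgcv)   # provides gam() with s() smooth terms

set.seed(1)
n  <- 100
x1 <- runif(n); x2 <- runif(n)
# Linear in x1, genuinely curved in x2
y  <- 2 * x1 + 10 * (x2 - 0.5)^2 + rnorm(n, sd = 0.3)
sim.df <- data.frame(y = y, x1 = x1, x2 = x2)

sim.gam <- gam(y ~ s(x1) + s(x2), data = sim.df)
par(mfrow = c(1, 2))
plot(sim.gam)   # s(x1) looks straight; s(x2) shows the quadratic curvature
```

A straight gam curve suggests the variable can enter linearly; a curved one suggests a transformation or polynomial terms, as on the next slide.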
Tyre abrasion data: polynomial

The GAM curve looks like a polynomial, so fit a polynomial (i.e. include terms tensile², tensile³, …):

lm(abloss ~ hardness + poly(tensile, 4), data = rubber.df)

Usually a lot of trial and error is involved! We have succeeded when:
- R² improves
- the residual plots show no pattern

A 4th-degree polynomial works for the rubber data: R² increases from 84% to 94%.
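Since rubber.df comes from the R330 package, the trial-and-error idea can be sketched on simulated data with a known quartic shape:

```r
set.seed(42)
n <- 60
x <- runif(n, -1, 1)
y <- 1 + x - 2 * x^2 + 3 * x^3 - 4 * x^4 + rnorm(n, sd = 0.2)  # true degree is 4

# Fit increasing polynomial degrees and watch R^2
r2 <- sapply(1:6, function(d) summary(lm(y ~ poly(x, d)))$r.squared)
round(r2, 3)

# R^2 climbs up to degree 4, then levels off; the degree-5 coefficient
# is estimating pure noise, so degree 4 is enough.
summary(lm(y ~ poly(x, 5)))$coefficients
```

Because poly() generates orthogonal polynomials, the t-test on each term can be read off directly, which is exactly the "highest significant power" check on the next slide.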
Why 4th degree?

Try a 5th-degree polynomial:

> rubber.lm = lm(abloss ~ poly(tensile, 5) + hardness, data = rubber.df)
> summary(rubber.lm)
Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
(Intercept)        615.3617    29.8178  20.637 2.44e-16 ***
poly(tensile, 5)1 -264.3933    25.0612 -10.550 2.76e-10 ***
poly(tensile, 5)2   23.6148    25.3437   0.932 0.361129
poly(tensile, 5)3  119.9500    24.6356   4.869 6.46e-05 ***
poly(tensile, 5)4  -91.6951    23.6920  -3.870 0.000776 ***
poly(tensile, 5)5    9.3811    23.6684   0.396 0.695495
hardness            -6.2608     0.4199 -14.911 2.59e-13 ***

Residual standard error: 23.67 on 23 degrees of freedom
Multiple R-squared: 0.9427, Adjusted R-squared: 0.9278
F-statistic: 63.11 on 6 and 23 DF, p-value: 3.931e-13

The 4th-degree term is the highest significant power, so degree 4 suffices.
Ladder of powers

Rather than fit polynomials in the explanatory variables, guided by gam plots, we can transform the response using the "ladder of powers" (i.e. use y^p as the response rather than y, for some power p). Choose p either by trial and error using R², or by using a "Box-Cox plot" (see later in this lecture).
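A sketch of the trial-and-error version on the built-in trees data (R² values for different response transforms are not strictly comparable, but this is the quick screen the ladder suggests):

```r
data(trees)
powers <- c(-1, -0.5, 1/3, 0.5, 1, 2)   # rungs of the ladder (p = 0 would mean log)

r2 <- sapply(powers, function(p)
  summary(lm(I(Volume^p) ~ Girth + Height, data = trees))$r.squared)
names(r2) <- powers
round(r2, 3)

# The cube root (p = 1/3) does well here, which makes physical sense:
# volume has units of length^3, so Volume^(1/3) is linear in the dimensions.
```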
Checking for equal scatter

The model specifies that the scatter about the regression plane is uniform. In practice this means that the scatter does not depend on the explanatory variables or on the mean of the response. All tests and confidence intervals rely on this assumption.
Scatter

Scatter is measured by the size of the residuals. A common problem is that the scatter increases as the mean response increases: the big residuals occur when the fitted values are big. Recognize this by a "funnel effect" in the plot of residuals versus fitted values.
Example: education expenditure data

Data for the 50 states of the USA. The variables are:
- Per capita expenditure on education (response), variable educ
- Per capita income, variable percap
- Number of residents per 1000 under 18, variable under18
- Number of residents per 1000 in urban areas, variable urban

Fit the model educ ~ percap + under18 + urban.
[Plot of the data: one observation is an outlier in the response.]
Outlier: point 50 (California)
Basic fit, outlier in

> educ.lm = lm(educ ~ urban + percap + under18, data = educ.df)
> summary(educ.lm)
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -555.92562  123.46634  -4.503 4.56e-05 ***
urban         -0.00476    0.05174  -0.092    0.927
percap         0.07236    0.01165   6.211 1.40e-07 ***
under18        1.55134    0.31545   4.918 1.16e-05 ***

Residual standard error: 40.53 on 46 degrees of freedom
Multiple R-squared: 0.5902, Adjusted R-squared: 0.5634
F-statistic: 22.08 on 3 and 46 DF, p-value: 5.271e-09

R² is 59%.
Basic fit, outlier out

> educ50.lm = lm(educ ~ urban + percap + under18, data = educ.df, subset = -50)
> summary(educ50.lm)
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -278.06430  132.61422  -2.097 0.041664 *
urban          0.06624    0.04966   1.334 0.188948
percap         0.04827    0.01220   3.958 0.000266 ***
under18        0.88983    0.33159   2.684 0.010157 *
---
Residual standard error: 35.88 on 45 degrees of freedom
Multiple R-squared: 0.4947, Adjusted R-squared: 0.461
F-statistic: 14.68 on 3 and 45 DF, p-value: 8.365e-07

R² is now 49%. Note the subset = -50 argument, which excludes point 50.
> par(mfrow = c(1, 2))
> plot(educ50.lm, which = c(1, 3))

[The residuals-versus-fitted plot shows a funnel effect; the scale-location plot shows an increasing relationship.]
Remedies

Either transform the response, or estimate the variances of the observations and use "weighted least squares".
Transforming the response

Transform to the reciprocal:

> tr.educ50.lm <- lm(I(1/educ) ~ urban + percap + under18, data = educ.df[-50, ])
> plot(tr.educ50.lm)

[The residual plots look better: no funnel effect.]
What power to choose?

How did we know to use reciprocals? Think of a more general model

    I(educ^p) ~ percap + under18 + urban

where p is some power, and then estimate p from the data using a Box-Cox plot.
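Outside the course's R330 boxcoxplot function, the standard tool for this is MASS::boxcox, which plots the profile log-likelihood for the power p (its maximum plays the role of the Box-Cox plot's minimum). Since educ.df is an R330 dataset, here is a sketch on the built-in trees data, where the likelihood peaks near p = 1/3:

```r
library(MASS)   # boxcox() lives in the recommended MASS package

data(trees)
fit <- lm(Volume ~ Girth + Height, data = trees)

bc <- boxcox(fit, lambda = seq(-1, 1, 0.05), plotit = FALSE)
lambda.hat <- bc$x[which.max(bc$y)]
lambda.hat   # close to 1/3: a cube-root transformation of Volume
```

With plotit = TRUE (the default) the same call draws the profile likelihood with a 95% confidence interval for p.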
Transforming the response (how?)

> boxcoxplot(educ ~ urban + percap + under18, educ.df[-50, ])

This draws a "Box-Cox plot" (boxcoxplot is an R330 function). The minimum is at about p = -1, which supports the reciprocal transformation.
Weighted least squares

Tests are invalid if the observations do not have constant variance. If the ith observation has variance v_i, then we can get valid tests by using "weighted least squares": minimise the weighted sum of squared residuals, the sum of the r_i²/v_i, rather than the ordinary sum of the r_i². We need to know the variances v_i.
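The definition can be checked directly: lm(..., weights = 1/v) gives the same coefficients as ordinary least squares after dividing the response and every design column by sqrt(v). A sketch with simulated heteroscedastic data (all names are made up):

```r
set.seed(7)
n <- 80
x <- runif(n, 1, 10)
v <- x^2                              # variance grows with x
y <- 2 + 3 * x + rnorm(n, sd = sqrt(v))

# Weighted least squares: minimises sum(r_i^2 / v_i)
w.fit <- lm(y ~ x, weights = 1 / v)

# Equivalent by hand: divide y, the intercept column, and x by sqrt(v)
h.fit <- lm(I(y / sqrt(v)) ~ 0 + I(1 / sqrt(v)) + I(x / sqrt(v)))

coef(w.fit)   # intercept and slope
coef(h.fit)   # the same two numbers
```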
Finding the weights

Step 1: Plot the squared residuals versus the fitted values.
Step 2: Smooth the plot.
Step 3: Estimate the variance of each observation by its smoothed squared residual.
Step 4: Take the weight to be the reciprocal of the smoothed squared residual.

Rationale: the variance is a function of the mean. The R330 function funnel does all of this.
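Without the R330 funnel function at hand, the four steps can be sketched directly with lowess on simulated data (a stand-in, not a claim about what funnel does internally):

```r
set.seed(11)
n <- 100
x <- runif(n, 1, 10)
y <- 1 + 2 * x + rnorm(n, sd = 0.5 * x)   # scatter grows with the mean

fit <- lm(y ~ x)

# Steps 1-2: squared residuals vs fitted values, smoothed
sm <- lowess(fitted(fit), residuals(fit)^2)
# Step 3: variance estimate for each observation (interpolate the smooth)
vars <- approx(sm$x, sm$y, xout = fitted(fit))$y
vars <- pmax(vars, 1e-6)                  # guard against non-positive smooths
# Step 4: weights are reciprocals of the estimated variances
w.fit <- lm(y ~ x, weights = 1 / vars)
summary(w.fit)$coefficients
```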
Doing it in R (1)

> vars = funnel(educ50.lm)   # an R330 function

The slope of about 1.7 indicates p = 1 - 1.7 = -0.7, close to the reciprocal used above.
> educ50.lm <- lm(educ ~ urban + percap + under18, data = educ.df[-50, ])
> summary(educ50.lm)
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -278.06430  132.61422  -2.097 0.041664 *
urban          0.06624    0.04966   1.334 0.188948
percap         0.04827    0.01220   3.958 0.000266 ***
under18        0.88983    0.33159   2.684 0.010157 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 35.88 on 45 degrees of freedom
Multiple R-squared: 0.4947, Adjusted R-squared: 0.461
F-statistic: 14.68 on 3 and 45 DF, p-value: 8.365e-07

Note the p-values, for comparison with the weighted fit.
> weighted.lm <- lm(educ ~ urban + percap + under18, weights = 1/vars, data = educ.df[-50, ])
> summary(weighted.lm)
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -270.29363  102.61073  -2.634   0.0115 *
urban          0.01197    0.04030   0.297   0.7677
percap         0.05850    0.01027   5.694 8.88e-07 ***
under18        0.82384    0.27234   3.025   0.0041 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.019 on 45 degrees of freedom
Multiple R-squared: 0.629, Adjusted R-squared: 0.6043
F-statistic: 25.43 on 3 and 45 DF, p-value: 8.944e-10

Note the reciprocal weights (weights = 1/vars) and how the p-values change. Conclusion: unequal variances matter, and can change the results!