5/13/ lecture 91 STATS 330: Lecture 9
5/13/ lecture 92 Diagnostics Aim of today’s lecture: To give you an overview of the modelling cycle, and begin the detailed discussion of diagnostic procedures
5/13/ lecture 93 The modelling cycle We have seen that the regression model describes rather specialized forms of data Data are “planar” Scatter is uniform over the plane We have looked at some plots that help us decide if the data is suitable for regression modelling pairs reg3d coplots
5/13/ lecture 94 Residual analysis Another approach is to fit the model and examine the residuals If the model is appropriate the residuals have no pattern A pattern in the residuals usually indicates that the model is not appropriate If this is the case we have two options Option 1: Select another form of model eg non-linear regression – see other courses Option 2: transform the data so that the regression model fits the transformed data – see next slide
5/13/ lecture 95 Modelling cycle This leads us into a modelling cycle Fit Examine residuals Transform data or change model if necessary This cycle is repeated until we are happy with the fitted model Diagramatically….
5/13/ lecture 96 Modelling cycle Examine residuals Fit model Transform/ change Choose Model Bad fit Good fit Use model Plots, theory
5/13/ lecture 97 What makes a bad fit? Non-planar data Outliers in the data Errors depend on covariates (non-constant scatter) Errors not independent Errors not normally distributed First two points can be serious as they affect the meaning and accuracy of the estimated coefficients. The others affect mainly standard errors, not estimated coefficients – see subsequent lectures
5/13/ lecture 98 Detecting non-planar data For the next 2 lectures, we look at diagnosing non-planar data and choosing a transformation. We can diagnose non-planar data (nonlinearity) by fitting the model, and plotting residuals versus fitted values, residuals against explanatory variables fitting additive models In each case, a curved plot indicates non-planar data
5/13/ lecture 99 Plotting residuals versus fitted values plot(cherry.lm, which=1) which = 1 selects a plot of residuals versus fitted values
5/13/ lecture 910 Indication of a curve: so data non-planar
5/13/ lecture 911 Additive models These are models of the form Y=g 1 (x 1 ) + g 2 (x 2 ) + … + g k (x k ) + where g 1,…,g k are transformations Fitted using the gam function in R The transformations are estimated by the software Use the function to suggest good transformations
5/13/ lecture 912 Example: cherry trees > library(mgcv) This is mgcv > cherry.gam<- gam(volume~s(height) + s(diameter),data=cherry.df) > plot(cherry.gam) Don’t forget the s if you want to estimate the transformation
5/13/ lecture 913 Example Strong suggestion that diameter be transformed
5/13/ lecture 914 Fitting polynomials To fit model y = b 0 +b 1 x + b 2 x 2, use y ~ poly(x,2) To fit model y = b 0 +b 1 x + b 2 x 2 + b 3 x 3, use y ~ poly(x,3) and so on
5/13/ lecture 915 Orthogonal polynomials The model fitted by y~poly(x,2) is of the form Y = 0 + 1 p 1 (x) + 2 p 2 (x) where p 1 is a polynomial of degree 1 ie of form ax + b and p 2 is a polynomial of degree 2 ie of form ax 2 + bx + c p 1, p 2 chosen to be uncorrelated (best possible estimation)
5/13/ lecture 916 Adding a quadratic term (cherry trees) Suggestion of quadratic behaviour in diameter, so fit a quadratic model in diameter (add a term diameter^2) cherry2.lm=lm(volume~poly(diameter,2)+height,data=cherry.df) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) poly(diameter, 2) < 2e-16 *** poly(diameter, 2) e-06 *** height *** --- Residual standard error: on 27 degrees of freedom Multiple R-Squared: , Adjusted R-squared: How to add a quadratic term R2 improved from 94.8% to 97.7%
5/13/ lecture 917 Equation of quadratic To see which quadratic is fitted, use > cherry.quad = lm(volume ~ diameter + I(diameter^2) + height, data=cherry.df) > summary(cherry.quad) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) diameter * I(diameter^2) e-06 *** height *** Residual standard error: on 27 degrees of freedom Multiple R-Squared: , Adjusted R-squared: Surface fitted fitted is x diameter x diameter x height Same model!
Splines An alternative to polynomials are splines – these are piecewise cubics, which join smoothly at “knots” Give a more flexible fit to the data Values at one point not affected by values at distant points, unlike polynomials 5/13/ lecture 918
Example with 4 knots 5/13/ lecture 919
5/13/ lecture 920 Adding a spline term (cherry trees) library(splines) cherry3.lm=lm(volume~bs(diameter,6)+height,data=cherry.df) Coefficients: (Intercept) * bs(diameter, df = 6) bs(diameter, df = 6) bs(diameter, df = 6) * bs(diameter, df = 6) e-06 *** bs(diameter, df = 6) e-07 *** bs(diameter, df = 6) e-12 *** height *** Residual standard error: 2.8 on 23 degrees of freedom Multiple R-squared: , Adjusted R-squared: How to add a spline term (3 knots) R2 improved from 94.8% to 97.78%
5/13/ lecture 921
5/13/ lecture 922 Example: Tyre abrasion data Data collected in an experiment to study the abrasion resistance of tyres Variables are Hardness: Hardness of rubber Tensile: Tensile strength of rubber Abrasion Loss: Amount of rubber worn away in a standard test (response)
5/13/ lecture 923 Tyre abrasion data > rubber.df hardness tensile abloss … 30 observations in all
5/13/ lecture 924 Tyre abrasion data > rubber.lm<-lm(abloss~hardness + tensile, data=rubber.df) > summary(rubber.lm) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) e-14 *** hardness e-11 *** tensile e-07 *** --- Signif. codes: 0 `***' `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1 Residual standard error: on 27 degrees of freedom Multiple R-Squared: , Adjusted R-squared: F-statistic: 71 on 2 and 27 DF, p-value: 1.767e-11
5/13/ lecture 925 Tyre abrasion data We will use this example to illustrate all the methods we have discussed so far to check if the data are planar, scattered about a flat regression plane i.e. Pairs Spinning plot Coplot Residual versus fitted value plot Fitting GAMS
5/13/ lecture 926 Tyre abrasion data (pairs)
Spinning 5/13/ lecture 927
5/13/ lecture 928 Tyre abrasion data (coplot)
5/13/ lecture 929 Tyre abrasion data: residuals v fitted values data(rubber.df) rubber.lm = lm(abloss~., data=rubber.df) plot(rubber.lm, which=1)
5/13/ lecture 930 Tyre abrasion data: gams library(mgcv) rubber.gam<- gam(abloss~s(tensile) + s(hardness),data=rubber.df) plot(rubber.gam)
5/13/ lecture 931 Tyre abrasion data: Conclusions Pairs plot : not very informative Spinner: Hint of a “kink” Coplot: suggestion of non-linearity, a “kink” Residuals versus fitted values: weak suggestion that regression is not planar gam plots: hardness is OK, but strong suggestion that tensile needs transforming Suggested transformation: looks a bit like a cubic or 4 th degree polynomial, so try these (or a spline). Using a spline, the R 2 is increased from 84% to 94.3%.