STATS 330: Lecture 16 Case Study


1 STATS 330: Lecture 16 Case Study 7/17/2018 330 lecture 16

2 Case study
Aim of today’s lecture: to illustrate the modelling process using the evaporation data.

3 The Evaporation data
Data are in the data frame evap.df.
Aims of the analysis:
  Understand the relationships between the explanatory variables and the response
  Be able to predict evaporation loss given the other variables

4 Case Study: Evaporation data
Recall from Lecture 15 that the variables are:
  evap: the amount of moisture evaporating from the soil in the 24 hour period (response)
  maxst: maximum soil temperature over the 24 hour period
  minst: minimum soil temperature over the 24 hour period
  avst: average soil temperature over the 24 hour period
  maxat: maximum air temperature over the 24 hour period
  minat: minimum air temperature over the 24 hour period
  avat: average air temperature over the 24 hour period
  maxh: maximum humidity over the 24 hour period
  minh: minimum humidity over the 24 hour period
  avh: average humidity over the 24 hour period
  wind: average wind speed over the 24 hour period

5 Modelling cycle
  Choose model (using plots, theory)
  Fit model
  Examine residuals
  Bad fit: transform, then fit again
  Good fit: use model

6 Modelling cycle (2)
Our plan of attack:
  Graphical check
    Suitability for regression
    Gross outliers
  Preliminary fit
  Model selection (for prediction)
  Transforming if required
  Outlier check
  Use model for prediction

7 Step 1: Plots
Preliminary plots:
  We want an initial idea of how suitable the data are for regression modelling
  Check for linear relationships and outliers
  Use pairs plots and coplots
  The data look OK to proceed, but the evap/maxh plot looks curved
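The preliminary plots in Step 1 can be sketched as follows. This is a minimal illustration, not the lecture's own code: the data frame sim.df is a simulated stand-in for evap.df (which is not reproduced in this transcript), the coefficients used to generate it are arbitrary, and n = 46 is inferred from the degrees of freedom reported later in the lecture.

```r
# Simulated stand-in for evap.df (NOT the real data); all numbers are arbitrary.
set.seed(330)
n <- 46
avst <- rnorm(n, mean = 80, sd = 5)
sim.df <- data.frame(
  avst  = avst,
  maxst = avst + rnorm(n, mean = 8, sd = 2),  # maximum tracks the average closely
  maxat = rnorm(n, mean = 85, sd = 6),
  maxh  = rnorm(n, mean = 90, sd = 5),
  wind  = rexp(n, rate = 1 / 100)
)
sim.df$evap <- 0.5 * sim.df$maxat - 0.6 * sim.df$maxh + rnorm(n, sd = 5)

# Scatterplot matrix: look for curvature, gross outliers, related predictors.
pairs(sim.df)

# Pairwise correlations put numbers on what the pairs plot shows.
cor.mat <- round(cor(sim.df), 2)
print(cor.mat)
```

With the real data, pairs(evap.df) would be the direct equivalent.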

9 Points to note
  avh has very few distinct values
  Strong relationships between the response and some variables (particularly maxh, avst)
  Not much relationship between the response and minst, minat, wind
  Strong relationships between the min, av and max versions of each variable
  No obvious outliers

10 Step 2: preliminary fit
Coefficients (only the significance codes survive in this transcript):
  Term          Signif.
  (Intercept)
  avst          *
  minst
  maxst         *
  avat
  minat
  maxat
  avh
  minh
  maxh          **
  wind
Residual degrees of freedom: 35
F-statistic on 10 and 35 DF, p-value: 2.073e-11

12 Findings
  Plots OK, normality dubious
  Gam plots indicated no transformations
  Point 31 has quite a high Cook's distance, but removing it doesn’t change the regression much
  Model is OK. We could interpret the coefficients, but the variables are highly correlated.
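A check of the kind described for point 31 can be sketched as below. This is an illustration under stated assumptions: sim.df is simulated (not the evaporation data), the two-predictor model is a simplification, and the row flagged here has no connection to point 31 in the lecture.

```r
# Simulated stand-in data (NOT evap.df); coefficients are arbitrary.
set.seed(330)
n <- 46
sim.df <- data.frame(maxat = rnorm(n, 85, 6), maxh = rnorm(n, 90, 5))
sim.df$evap <- 0.5 * sim.df$maxat - 0.6 * sim.df$maxh + rnorm(n, sd = 5)

fit <- lm(evap ~ maxat + maxh, data = sim.df)

# Cook's distance for every observation; pick out the largest.
cd <- cooks.distance(fit)
worst <- which.max(cd)
plot(cd, type = "h", ylab = "Cook's distance")

# Refit without that observation and compare coefficients side by side.
refit <- lm(evap ~ maxat + maxh, data = sim.df[-worst, ])
coef.cmp <- round(cbind(all.points = coef(fit), dropped = coef(refit)), 3)
print(coef.cmp)
```

If the two columns are close, the point has a large Cook's distance without materially changing the fit, which is the situation the slide describes.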

13 Step 3: Model selection
Use all possible regressions (APR). The model selected was
  evap ~ maxat + maxh + wind
However, this model does not fit all that well (outliers, non-normality).
Try the “best AIC” model
  evap ~ avst + maxst + maxat + minh + maxh
Now proceed to step 4.
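The idea of searching all possible regressions for the lowest AIC can be sketched in base R as below. This is not the course's APR function, which is not shown in the transcript; it is a brute-force loop over every subset of six predictors, run on simulated data (sim.df, arbitrary coefficients), so the subset it picks says nothing about the real evaporation data.

```r
# Simulated stand-in data (NOT evap.df); coefficients are arbitrary.
set.seed(330)
n <- 46
avst <- rnorm(n, 80, 5)
sim.df <- data.frame(
  avst  = avst,
  maxst = avst + rnorm(n, 8, 2),
  maxat = rnorm(n, 85, 6),
  maxh  = rnorm(n, 90, 5),
  wind  = rexp(n, 1 / 100)
)
sim.df$minh <- sim.df$maxh - abs(rnorm(n, 40, 8))
sim.df$evap <- 0.5 * sim.df$maxat - 0.6 * sim.df$maxh +
  0.02 * sim.df$wind + rnorm(n, sd = 5)

preds <- setdiff(names(sim.df), "evap")

# Every non-empty subset of the predictors: 2^6 - 1 = 63 candidate models.
subsets <- unlist(
  lapply(seq_along(preds), function(k) combn(preds, k, simplify = FALSE)),
  recursive = FALSE
)

# Fit each candidate and record its AIC.
aic <- sapply(subsets, function(v) {
  AIC(lm(reformulate(v, response = "evap"), data = sim.df))
})

best <- subsets[[which.min(aic)]]
print(reformulate(best, response = "evap"))
```

Exhaustive search is only practical for a small number of predictors, since the number of candidate models doubles with each one added.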

14 Step 4: Diagnostic checks
For a quick check, plot the regression object produced by lm:
  model1.lm <- lm(evap ~ avst + maxst + maxat + minh + maxh, data = evap.df)
  plot(model1.lm)

15 Outliers? Non-normal?

16 Conclusions?
  No real evidence of non-linearity, but will check further with gams
  Normal plot looks curved
  Some largish outliers
  Points 2, 41 have largish Cook's D

17 Checking linearity
Check for linearity with gams:
  > library(mgcv)
  > plot(gam(evap ~ s(avst) + s(maxst) + s(maxat) + s(maxh) + s(wind), data = evap.df))

18 Transform avst, maxh?

19 Remedy
  Gam plots for avst and maxh are curved
  Try cubics in these variables
  Plots look better
  Cubic terms are significant
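One way to check whether the added polynomial terms earn their place is a partial F-test between the linear and cubic models. The sketch below uses simulated data (sim.df, with curvature deliberately built into the response) and is an assumption about how such a check might be run, not the lecture's own procedure; the slide's statement about significance refers to the coefficient table of the cubic fit.

```r
# Simulated stand-in data (NOT evap.df); curvature in avst is planted on purpose.
set.seed(330)
n <- 46
avst <- rnorm(n, 80, 5)
sim.df <- data.frame(
  avst  = avst,
  maxst = avst + rnorm(n, 8, 2),
  maxat = rnorm(n, 85, 6),
  maxh  = rnorm(n, 90, 5)
)
sim.df$minh <- sim.df$maxh - abs(rnorm(n, 40, 8))
sim.df$evap <- 0.5 * sim.df$maxat - 0.6 * sim.df$maxh +
  0.15 * (sim.df$avst - 80)^2 + rnorm(n, sd = 5)

m.lin <- lm(evap ~ avst + maxst + maxat + minh + maxh, data = sim.df)
m.cub <- lm(evap ~ poly(avst, 3) + maxst + maxat + minh + poly(maxh, 3),
            data = sim.df)

# The linear model is nested in the cubic one, so anova() gives a partial
# F-test of the four extra terms (quadratic and cubic in avst and maxh).
cmp <- anova(m.lin, m.cub)
print(cmp)
```

With 46 observations the cubic model leaves 36 residual degrees of freedom, the same count reported in the lecture's summary output.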

21 summary(model2.lm)
  > model2.lm <- lm(evap ~ poly(avst, 3) + maxst + maxat + minh + poly(maxh, 3), data = evap.df)
  > summary(model2.lm)
Coefficients (only p-value fragments and significance codes survive in this transcript):
  Term              Pr(>|t|)   Signif.
  (Intercept)                  **
  poly(avst, 3)1               **
  poly(avst, 3)2               *
  poly(avst, 3)3
  maxst             ...e-05    ***
  maxat                        **
  minh
  poly(maxh, 3)1    ...e-05    ***
  poly(maxh, 3)2
  poly(maxh, 3)3               *
Residual degrees of freedom: 36
F-statistic on 9 and 36 DF, p-value: 4.459e-15

22 New model
Let's now adopt the model
  lm(evap ~ poly(avst, 3) + maxst + maxat + poly(maxh, 3) + wind, data = evap.df)
Outliers are not too bad, but let's check:
  > influenceplots(model2.lm)

25 Deletion of points
Points 2, 6, 7 and 41 are affecting the fitted values and some coefficients.
Removing these one at a time and refitting indicates that the cubics are not very robust, so we revert to the non-polynomial model.
The coefficients of the non-polynomial model are fairly stable when we delete these points one at a time, so we decide to retain them.
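The one-at-a-time deletion check can be sketched as follows. Everything here is illustrative: sim.df is simulated, the two-variable formulas are simplified stand-ins for the lecture's models, and the row numbers 2, 6, 7, 41 are borrowed from the slide only as labels (rows of simulated data have no special status). The sketch measures how far the prediction at the predictor means moves when each row is dropped, for a linear and a cubic fit.

```r
# Simulated stand-in data (NOT evap.df); coefficients are arbitrary.
set.seed(330)
n <- 46
sim.df <- data.frame(avst = rnorm(n, 80, 5), maxh = rnorm(n, 90, 5))
sim.df$evap <- 0.8 * sim.df$avst - 0.6 * sim.df$maxh + rnorm(n, sd = 5)

f.lin  <- evap ~ avst + maxh
f.poly <- evap ~ poly(avst, 3) + poly(maxh, 3)
at.mean <- data.frame(avst = mean(sim.df$avst), maxh = mean(sim.df$maxh))

pts <- c(2, 6, 7, 41)  # row labels taken from the slide, for illustration only

# Prediction at the predictor means after refitting without row i.
drop.one <- function(f, i) predict(lm(f, data = sim.df[-i, ]), at.mean)

# Change in that prediction relative to the fit on all rows.
shift <- rbind(
  linear = sapply(pts, function(i) drop.one(f.lin, i)) -
    predict(lm(f.lin, data = sim.df), at.mean),
  cubic  = sapply(pts, function(i) drop.one(f.poly, i)) -
    predict(lm(f.poly, data = sim.df), at.mean)
)
colnames(shift) <- pts
print(round(shift, 3))
```

Predictions are compared rather than raw coefficients because poly() rebuilds its orthogonal basis on each refit, so cubic coefficients from different subsets are not directly comparable.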

26 Normality?
However, the normal plot for the non-polynomial model is not very straight; the Weisberg-Bingham (WB) test has p-value 0.
Normality of the polynomial model is better.
Try predictions with both.
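A residual normality check of this kind can be sketched in base R. Note the substitution: the lecture uses the WB test, whose function is not shown in the transcript, so this sketch uses shapiro.test() from base R as a closely related alternative. The data are simulated (sim.df, arbitrary coefficients), so the p-value here carries no information about the evaporation models.

```r
# Simulated stand-in data (NOT evap.df); coefficients are arbitrary.
set.seed(330)
n <- 46
sim.df <- data.frame(maxat = rnorm(n, 85, 6), maxh = rnorm(n, 90, 5))
sim.df$evap <- 0.5 * sim.df$maxat - 0.6 * sim.df$maxh + rnorm(n, sd = 5)

fit <- lm(evap ~ maxat + maxh, data = sim.df)
r <- residuals(fit)

# Normal Q-Q plot: points should follow the reference line if residuals are normal.
qqnorm(r)
qqline(r)

# Shapiro-Wilk test (stand-in for the WB test); a small p-value suggests non-normality.
sw <- shapiro.test(r)
print(sw$p.value)
```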

27 Predictions
  # New data: every predictor set to its sample mean
  predict.df = data.frame(avst = mean(evap.df$avst),
                          maxst = mean(evap.df$maxst),
                          maxat = mean(evap.df$maxat),
                          maxh = mean(evap.df$maxh),
                          minh = mean(evap.df$minh))
  # Prediction intervals from both models, one row per model
  rbind(predict(model1.lm, predict.df, interval = "prediction"),
        predict(model2.lm, predict.df, interval = "prediction"))
Output columns: fit lwr upr
CV fit:

