Published by Elaine Burns. Modified over 6 years ago.
1
STATS 330: Lecture 16 Case Study 7/17/2018 330 lecture 16
2
Case study
Aim of today’s lecture: to illustrate the modelling process using the evaporation data.
3
The Evaporation data
Data in data frame evap.df
Aims of the analysis:
Understand relationships between explanatory variables and the response
Be able to predict evaporation loss given the other variables
4
Case Study: Evaporation data
Recall from Lecture 15: the variables are
evap: the amount of moisture evaporating from the soil in the 24 hour period (response)
maxst: maximum soil temperature over the 24 hour period
minst: minimum soil temperature over the 24 hour period
avst: average soil temperature over the 24 hour period
maxat: maximum air temperature over the 24 hour period
minat: minimum air temperature over the 24 hour period
avat: average air temperature over the 24 hour period
maxh: maximum humidity over the 24 hour period
minh: minimum humidity over the 24 hour period
avh: average humidity over the 24 hour period
wind: average wind speed over the 24 hour period
5
Modelling cycle
Plots, theory → Choose model → Fit model → Examine residuals
Bad fit: transform, then choose a model again
Good fit: use model
6
Modelling cycle (2)
Our plan of attack:
Graphical check (suitability for regression, gross outliers)
Preliminary fit
Model selection (for prediction)
Transforming if required
Outlier check
Use model for prediction
7
Step 1: Plots
Preliminary plots
We want an initial idea of how suitable the data are for regression modelling
Check for linear relationships and outliers
Use pairs plots and coplots
The data look OK to proceed, but the evap/maxh plot looks curved
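The preliminary plots can be sketched as below. Since evap.df itself is not reproduced in these slides, the code builds a synthetic stand-in called evap_sim.df (a hypothetical name, with arbitrary simulated values and only four of the eleven variables), so the pictures will not match the lecture's.

```r
# Synthetic stand-in for evap.df (NOT the lecture data); all values are arbitrary
set.seed(330)
n <- 46
maxh  <- runif(n, 60, 100)
maxat <- rnorm(n, 85, 6)
wind  <- rexp(n, rate = 1 / 5)
evap  <- 60 - 0.5 * maxh + 0.4 * maxat + 0.3 * wind + rnorm(n, sd = 5)
evap_sim.df <- data.frame(evap, maxat, maxh, wind)

# Scatterplot matrix: look for linear trends, curvature and gross outliers
pairs(evap_sim.df)

# Coplot: evap against maxh, conditioning on maxat
coplot(evap ~ maxh | maxat, data = evap_sim.df)

# Numerical companion to the pairs plot
round(cor(evap_sim.df), 2)
```

With the real data, the same two calls would be applied to evap.df directly.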
9
Points to note
avh has very few distinct values
Strong relationships between the response and some variables (particularly maxh, avst)
Not much relationship between the response and minst, minat, wind
Strong relationships between the min, av and max versions of each measurement
No obvious outliers
10
Step 2: preliminary fit
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)
avst *
minst
maxst *
avat
minat
maxat
avh
minh
maxh **
wind
Residual standard error: on 35 degrees of freedom
Multiple R-squared: , Adjusted R-squared:
F-statistic: on 10 and 35 DF, p-value: 2.073e-11
12
Findings
Plots OK, normality dubious
GAM plots indicated that no transformations were needed
Point 31 has quite a high Cook's distance, but removing it doesn’t change the regression much
Model is OK. We could interpret the coefficients, but the variables are highly correlated.
13
Step 3: Model selection
Use all possible regressions (APR)
Model selected was evap ~ maxat + maxh + wind
However, this model does not fit all that well (outliers, non-normality)
Try the “best AIC” model: evap ~ avst + maxst + maxat + minh + maxh
Now proceed to step 4
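As a rough illustration of AIC-based selection, the sketch below runs backward elimination with base R's step(). This is a swapped-in technique: a stepwise search is not the same as the all-possible-regressions search used in the lecture, and it need not land on the same model. The data frame evap_sim.df is a synthetic stand-in with arbitrary values, not the lecture data.

```r
# Synthetic stand-in for evap.df (NOT the lecture data); all values are arbitrary
set.seed(330)
n <- 46
avst  <- rnorm(n, 80, 5)
maxst <- avst + runif(n, 2, 10)
maxat <- rnorm(n, 85, 6)
minh  <- runif(n, 20, 60)
maxh  <- minh + runif(n, 20, 40)
wind  <- rexp(n, rate = 1 / 5)
evap  <- 20 + 0.5 * maxat - 0.4 * maxh + 0.3 * wind + rnorm(n, sd = 5)
evap_sim.df <- data.frame(evap, avst, maxst, maxat, minh, maxh, wind)

# Start from the model containing every candidate predictor
full.lm <- lm(evap ~ ., data = evap_sim.df)

# Backward elimination by AIC (a stepwise search, not an exhaustive APR search)
aic.lm <- step(full.lm, direction = "backward", trace = 0)

formula(aic.lm)        # terms retained by the search
AIC(full.lm, aic.lm)   # compare the two fits
```

Whatever model a search suggests still has to pass the diagnostic checks that follow.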
14
Step 4: Diagnostic checks
For a quick check, plot the regression object produced by lm
model1.lm <- lm(evap ~ avst + maxst + maxat + minh + maxh, data=evap.df)
plot(model1.lm)
15
Outliers? Non-normal?
16
Conclusions?
No real evidence of non-linearity, but we will check further with GAMs
Normal plot looks curved
Some largish outliers
Points 2, 41 have largish Cook's D
17
Checking linearity
Check for linearity with GAMs
> library(mgcv)
> plot(gam(evap ~ s(avst) + s(maxst) + s(maxat) + s(maxh) + s(wind), data=evap.df))
18
Transform avst, maxh?
19
Remedy
GAM plots for avst and maxh are curved
Try cubics in these variables
Plots look better
Cubic terms are significant
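One way to check whether polynomial terms earn their place is a partial F test of the linear model against the cubic one. The sketch below does this on a synthetic stand-in (evap_sim.df, arbitrary values, with curvature in maxh built in on purpose). It illustrates the technique only and is not necessarily how significance was judged in the lecture.

```r
# Synthetic stand-in for evap.df (NOT the lecture data); all values are arbitrary
set.seed(330)
n <- 46
avst <- rnorm(n, 80, 5)
maxh <- runif(n, 40, 100)
# Response built with deliberate curvature in maxh
evap <- 30 + 0.5 * avst - 0.3 * maxh - 0.01 * (maxh - 70)^2 + rnorm(n, sd = 3)
evap_sim.df <- data.frame(evap, avst, maxh)

# Straight-line terms versus cubic polynomials in both variables
linear.lm <- lm(evap ~ avst + maxh, data = evap_sim.df)
cubic.lm  <- lm(evap ~ poly(avst, 3) + poly(maxh, 3), data = evap_sim.df)

# F test of the four extra terms (quadratic and cubic for each variable)
cubic.test <- anova(linear.lm, cubic.lm)
cubic.test
```

Because poly(x, 3) spans the same space as x, x^2 and x^3, the linear model is nested in the cubic one and the F test is valid.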
21
> model2.lm<-lm(evap ~ poly(avst,3) + maxst + maxat + minh+poly(maxh,3), data=evap.df)
> summary(model2.lm)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) **
poly(avst, 3)1 **
poly(avst, 3)2 *
poly(avst, 3)3
maxst e-05 ***
maxat **
minh
poly(maxh, 3)1 e-05 ***
poly(maxh, 3)2
poly(maxh, 3)3 *
---
Residual standard error: on 36 degrees of freedom
Multiple R-squared: , Adjusted R-squared:
F-statistic: on 9 and 36 DF, p-value: 4.459e-15
22
New model
Let's now adopt the model
lm(evap ~ poly(avst,3) + maxst + maxat + poly(maxh,3) + wind, data=evap.df)
Outliers are not too bad, but let's check
> influenceplots(model2.lm)
25
Deletion of points
Points 2, 6, 7, 41 are affecting the fitted values and some coefficients.
Removing these one at a time and refitting indicates that the cubics are not very robust, so we revert to the non-polynomial model.
The coefficients of the non-polynomial model are fairly stable when we delete these points one at a time, so we decide to retain the points.
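The delete-one-at-a-time check can be sketched as follows. The code uses a synthetic stand-in (evap_sim.df, arbitrary values), and it picks the four points with the largest Cook's distance rather than the lecture's points 2, 6, 7 and 41. That choice is an assumption made only so the example runs on simulated data.

```r
# Synthetic stand-in for evap.df (NOT the lecture data); all values are arbitrary
set.seed(330)
n <- 46
avst  <- rnorm(n, 80, 5)
maxst <- avst + runif(n, 2, 10)
maxat <- rnorm(n, 85, 6)
minh  <- runif(n, 20, 60)
maxh  <- minh + runif(n, 20, 40)
evap  <- 20 + 0.5 * maxat - 0.4 * maxh + rnorm(n, sd = 5)
evap_sim.df <- data.frame(evap, avst, maxst, maxat, minh, maxh)

fit <- lm(evap ~ avst + maxst + maxat + minh + maxh, data = evap_sim.df)

# Rows with the four largest Cook's distances (illustrative choice of suspects)
suspects <- order(cooks.distance(fit), decreasing = TRUE)[1:4]

# Refit with each suspect row dropped in turn and collect the coefficients
loo_coefs <- sapply(suspects, function(i) {
  coef(lm(formula(fit), data = evap_sim.df[-i, ]))
})
colnames(loo_coefs) <- paste0("drop_", suspects)

# Full-data coefficients next to the delete-one refits
round(cbind(all = coef(fit), loo_coefs), 3)
```

Columns that stay close to the "all" column indicate coefficients that are stable under deletion of that point.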
26
Normality?
However, the normal plot for the non-polynomial model is not very straight: the WB test has p-value 0.
Normality of the polynomial model is better.
Try predictions with both.
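A normality check on the residuals might look like the sketch below. The WB test used on the slide is not part of base R, so the code swaps in the Shapiro-Wilk test (shapiro.test), which addresses the same question but is a different test. The data are a synthetic stand-in with deliberately skewed errors, not the lecture data.

```r
# Synthetic stand-in for evap.df (NOT the lecture data); all values are arbitrary
set.seed(330)
n <- 46
maxat <- rnorm(n, 85, 6)
maxh  <- runif(n, 40, 100)
# Centred exponential errors, so the residuals are skewed on purpose
evap  <- 20 + 0.5 * maxat - 0.4 * maxh + (rexp(n, rate = 1 / 4) - 4)
evap_sim.df <- data.frame(evap, maxat, maxh)

fit <- lm(evap ~ maxat + maxh, data = evap_sim.df)
res <- residuals(fit)

# Normal Q-Q plot: points should follow the reference line if errors are normal
qqnorm(res)
qqline(res)

# Shapiro-Wilk test (stand-in for the WB test); a small p-value suggests non-normality
sw <- shapiro.test(res)
sw$p.value
```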
27
predict.df = data.frame(avst = mean(evap.df$avst),
                        maxst = mean(evap.df$maxst),
                        maxat = mean(evap.df$maxat),
                        maxh = mean(evap.df$maxh),
                        minh = mean(evap.df$minh))
rbind(predict(model1.lm, predict.df, interval="prediction"),
      predict(model2.lm, predict.df, interval="prediction"))
fit lwr upr
CV fit: