STATS 330: Lecture 16 Case Study


STATS 330: Lecture 16 Case Study (7/17/2018)

Case study
Aim of today's lecture: to illustrate the modelling process using the evaporation data.

The evaporation data
The data are in the data frame evap.df. Aims of the analysis:
- Understand the relationships between the explanatory variables and the response
- Be able to predict evaporation loss given the other variables

Case study: Evaporation data
Recall from Lecture 15 that the variables are:
- evap: the amount of moisture evaporating from the soil in the 24-hour period (response)
- maxst: maximum soil temperature over the 24-hour period
- minst: minimum soil temperature over the 24-hour period
- avst: average soil temperature over the 24-hour period
- maxat: maximum air temperature over the 24-hour period
- minat: minimum air temperature over the 24-hour period
- avat: average air temperature over the 24-hour period
- maxh: maximum humidity over the 24-hour period
- minh: minimum humidity over the 24-hour period
- avh: average humidity over the 24-hour period
- wind: average wind speed over the 24-hour period

Modelling cycle
[Diagram: choose model (using plots, theory); fit model; examine residuals. Bad fit: transform and re-fit. Good fit: use model.]

Modelling cycle (2)
Our plan of attack:
- Graphical check: suitability for regression, gross outliers
- Preliminary fit
- Model selection (for prediction)
- Transforming if required
- Outlier check
- Use model for prediction

Step 1: Preliminary plots
We want to get an initial idea of the suitability of the data for regression modelling: check for linear relationships and outliers, using pairs plots and coplots. The data look OK to proceed, but the evap/maxh plot looks curved.
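A minimal sketch of these preliminary plots in R; the coplot panel variables are chosen here only for illustration, not taken from the lecture:

```r
# Preliminary graphical checks on the evaporation data
pairs(evap.df)                              # all pairwise scatterplots
coplot(evap ~ maxh | avst, data = evap.df)  # response vs maxh, conditioning on avst
```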

[Figure: pairs plot of the evaporation data]

Points to note
- avh takes very few distinct values
- Strong relationships between the response and some variables (particularly maxh, avst)
- Not much relationship between the response and minst, minat, wind
- Strong relationships among the min, average and max versions of each variable
- No obvious outliers

Step 2: Preliminary fit
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -54.074877 130.720826  -0.414  0.68164
avst          2.231782   1.003882   2.223  0.03276 *
minst         0.204854   1.104523   0.185  0.85393
maxst        -0.742580   0.349609  -2.124  0.04081 *
avat          0.501055   0.568964   0.881  0.38452
minat         0.304126   0.788877   0.386  0.70219
maxat         0.092187   0.218054   0.423  0.67505
avh           1.109858   1.133126   0.979  0.33407
minh          0.751405   0.487749   1.541  0.13242
maxh         -0.556292   0.161602  -3.442  0.00151 **
wind          0.008918   0.009167   0.973  0.33733

Residual standard error: 6.508 on 35 degrees of freedom
Multiple R-squared: 0.8463, Adjusted R-squared: 0.8023
F-statistic: 19.27 on 10 and 35 DF, p-value: 2.073e-11
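This output corresponds to regressing evap on all ten explanatory variables. The call is not shown on the slide; a sketch along these lines (object name assumed) would reproduce it:

```r
# Fit the full model: evap on every other variable in evap.df
full.lm <- lm(evap ~ ., data = evap.df)
summary(full.lm)
```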

[Figure: diagnostic plots for the preliminary fit]

Findings
- Diagnostic plots OK, though normality is dubious
- Gam plots indicated no transformations were needed
- Point 31 has quite a high Cook's distance, but removing it doesn't change the regression much
- The model is OK. We could interpret the coefficients, but the variables are highly correlated.

Step 3: Model selection
- Using APR (all possible regressions), the model selected was evap ~ maxat + maxh + wind
- However, this model does not fit all that well (outliers, non-normality)
- Try the "best AIC" model instead: evap ~ avst + maxst + maxat + minh + maxh
- Now proceed to Step 4
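The course's own selection tools are not shown here; as one sketch of the AIC-based part of the search, base R's step() can be used (object names assumed):

```r
# AIC-based stepwise search starting from the full model;
# direction = "both" allows terms to be both added and dropped
full.lm <- lm(evap ~ ., data = evap.df)
aic.lm  <- step(full.lm, direction = "both", trace = 0)
formula(aic.lm)  # compare with evap ~ avst + maxst + maxat + minh + maxh
```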

Step 4: Diagnostic checks
For a quick check, plot the regression object produced by lm:
model1.lm <- lm(evap ~ avst + maxst + maxat + minh + maxh, data = evap.df)
plot(model1.lm)

[Figure: diagnostic plots for model1.lm, annotated "Outliers?" and "Non-normal?"]

Conclusions?
- No real evidence of non-linearity, but we will check further with gams
- The normal plot looks curved
- Some largish outliers
- Points 2 and 41 have largish Cook's distances
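One way to confirm which points have large Cook's distances; the 4/n cutoff is a common rule of thumb, not something from the lecture:

```r
# Cook's distances for model1.lm; flag points above the conventional 4/n cutoff
cd <- cooks.distance(model1.lm)
which(cd > 4 / length(cd))
```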

Checking linearity
Check for linearity with gams:
> library(mgcv)
> plot(gam(evap ~ s(avst) + s(maxst) + s(maxat) + s(maxh) + s(wind),
           data = evap.df))

[Figure: gam plots, annotated "Transform avst, maxh?"]

Remedy
- The gam plots for avst and maxh are curved
- Try cubics in these variables
- The plots now look better
- The cubic terms are significant

[Figure: plots for the model with cubic terms]

> model2.lm <- lm(evap ~ poly(avst,3) + maxst + maxat + minh + poly(maxh,3),
                  data = evap.df)
> summary(model2.lm)

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)      74.6521    25.4308   2.935  0.00577 **
poly(avst, 3)1   83.0106    27.3221   3.038  0.00441 **
poly(avst, 3)2   21.4666     8.3097   2.583  0.01399 *
poly(avst, 3)3   14.1680     7.2199   1.962  0.05749 .
maxst            -0.8167     0.1697  -4.814 2.65e-05 ***
maxat             0.4175     0.1177   3.546  0.00111 **
minh              0.4580     0.3253   1.408  0.16766
poly(maxh, 3)1  -89.0809    20.0297  -4.447 8.02e-05 ***
poly(maxh, 3)2  -10.7374     7.9265  -1.355  0.18398
poly(maxh, 3)3   15.1172     6.3209   2.392  0.02212 *
---
Residual standard error: 5.276 on 36 degrees of freedom
Multiple R-squared: 0.8961, Adjusted R-squared: 0.8701
F-statistic: 34.49 on 9 and 36 DF, p-value: 4.459e-15

New model
Let's now adopt the model
lm(evap ~ poly(avst,3) + maxst + maxat + poly(maxh,3) + wind)
The outliers are not too bad, but let's check:
> influenceplots(model2.lm)

[Figures: influence plots for model2.lm]

Deletion of points
Points 2, 6, 7 and 41 affect the fitted values and some coefficients. Removing these one at a time and refitting indicates that the cubic terms are not very robust, so we revert to the non-polynomial model. The coefficients of the non-polynomial model are fairly stable when we delete these points one at a time, so we decide to retain them.
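The one-at-a-time deletions described here can be sketched as a loop (point numbers from the slide; this compares only the coefficients of the non-polynomial model):

```r
# Refit the non-polynomial model with each suspect point removed in turn
for (i in c(2, 6, 7, 41)) {
  fit.i <- lm(evap ~ avst + maxst + maxat + minh + maxh,
              data = evap.df[-i, ])
  cat("dropping point", i, ":\n")
  print(round(coef(fit.i), 3))
}
```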

Normality?
However, the normal plot for the non-polynomial model is not very straight: the WB (Weisberg-Bingham) test has a p-value of 0. Normality of the polynomial model is better. Try predictions with both.
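The WB test comes from the course's R package; as a stand-in, base R's Shapiro-Wilk test makes a comparable check of residual normality:

```r
# Normality checks on the residuals of both candidate models
shapiro.test(residuals(model1.lm))  # non-polynomial model
shapiro.test(residuals(model2.lm))  # polynomial model
```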

predict.df <- data.frame(avst  = mean(evap.df$avst),
                         maxst = mean(evap.df$maxst),
                         maxat = mean(evap.df$maxat),
                         maxh  = mean(evap.df$maxh),
                         minh  = mean(evap.df$minh))
rbind(predict(model1.lm, predict.df, interval = "p"),
      predict(model2.lm, predict.df, interval = "p"))

       fit      lwr      upr
1 34.67391 21.75680 47.59103
1 32.38471 21.39857 43.37084

CV fit:
1 34.67391 21.02628 48.32154