STATS 330: Lecture 9 (5/13/2015) - Diagnostics

Slide 1: STATS 330: Lecture 9

Slide 2: Diagnostics

Aim of today's lecture: to give you an overview of the modelling cycle, and to begin the detailed discussion of diagnostic procedures.

Slide 3: The modelling cycle

We have seen that the regression model describes rather specialized forms of data:
- the data are "planar"
- the scatter is uniform over the plane

We have looked at some plots that help us decide whether the data are suitable for regression modelling: pairs, reg3d, coplots.

Slide 4: Residual analysis

- Another approach is to fit the model and examine the residuals.
- If the model is appropriate, the residuals have no pattern.
- A pattern in the residuals usually indicates that the model is not appropriate.
- If this is the case, we have two options:
  - Option 1: select another form of model, e.g. non-linear regression (see other courses).
  - Option 2: transform the data so that the regression model fits the transformed data (see next slide).

Slide 5: Modelling cycle

This leads us into a modelling cycle:
1. Fit the model.
2. Examine the residuals.
3. Transform the data or change the model if necessary.
The cycle is repeated until we are happy with the fitted model. Diagrammatically:

Slide 6: Modelling cycle (diagram)

[Flow diagram: choose a model (from plots, theory) -> fit model -> examine residuals. Bad fit: transform/change the model and refit. Good fit: use the model.]

Slide 7: What makes a bad fit?

- Non-planar data
- Outliers in the data
- Errors that depend on covariates (non-constant scatter)
- Errors that are not independent
- Errors that are not normally distributed

The first two can be serious, as they affect the meaning and accuracy of the estimated coefficients. The others mainly affect standard errors rather than the estimated coefficients (see subsequent lectures).

Slide 8: Detecting non-planar data

For the next two lectures, we look at diagnosing non-planar data and choosing a transformation. We can diagnose non-planar data (non-linearity) by fitting the model and then:
- plotting residuals versus fitted values
- plotting residuals against the explanatory variables
- fitting additive models
In each case, a curved plot indicates non-planar data.
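The residual checks above can be sketched in R. The course's cherry.df is not reproduced here, so this sketch builds a stand-in from R's built-in trees data (Girth, Height, Volume), which records the same kind of tree measurements; the names cherry.df, diameter, height and volume follow the slides.

```r
# Stand-in for the course's cherry.df, built from R's built-in 'trees'
# data (girth = diameter in inches, height in feet, volume in cubic feet)
cherry.df <- data.frame(diameter = trees$Girth,
                        height   = trees$Height,
                        volume   = trees$Volume)

# Fit the plane, then look for curvature in the residuals
cherry.lm <- lm(volume ~ diameter + height, data = cherry.df)

# Residuals versus fitted values
plot(fitted(cherry.lm), resid(cherry.lm),
     xlab = "Fitted values", ylab = "Residuals")

# Residuals against each explanatory variable
plot(cherry.df$diameter, resid(cherry.lm), xlab = "diameter")
plot(cherry.df$height,   resid(cherry.lm), xlab = "height")
```

A curved band in any of these plots is the signal to consider a transformation.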

Slide 9: Plotting residuals versus fitted values

plot(cherry.lm, which = 1)

The argument which = 1 selects a plot of residuals versus fitted values.

Slide 10: [Plot of residuals versus fitted values for the cherry data. There is an indication of a curve, so the data are non-planar.]

Slide 11: Additive models

- These are models of the form
  Y = g1(x1) + g2(x2) + ... + gk(xk) + e,
  where g1, ..., gk are transformations and e is the error term.
- They are fitted using the gam function in R.
- The transformations are estimated by the software.
- Use the estimated functions to suggest good transformations.

Slide 12: Example: cherry trees

> library(mgcv)
> cherry.gam <- gam(volume ~ s(height) + s(diameter), data = cherry.df)
> plot(cherry.gam)

Don't forget the s() if you want to estimate the transformation.

Slide 13: [gam plots for the cherry data: a strong suggestion that diameter should be transformed.]

Slide 14: Fitting polynomials

- To fit the model y = b0 + b1*x + b2*x^2, use y ~ poly(x, 2).
- To fit the model y = b0 + b1*x + b2*x^2 + b3*x^3, use y ~ poly(x, 3), and so on.
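As a quick sketch (again using the built-in trees data as a stand-in for cherry.df):

```r
# Stand-in for cherry.df, built from R's built-in 'trees' data
cherry.df <- data.frame(diameter = trees$Girth,
                        volume   = trees$Volume)

# Quadratic in diameter: y = b0 + b1*x + b2*x^2
quad.lm  <- lm(volume ~ poly(diameter, 2), data = cherry.df)

# Cubic in diameter: one extra term
cubic.lm <- lm(volume ~ poly(diameter, 3), data = cherry.df)

coef(quad.lm)   # intercept plus two polynomial coefficients
```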

Slide 15: Orthogonal polynomials

The model fitted by y ~ poly(x, 2) is of the form

  Y = b0 + b1*p1(x) + b2*p2(x),

where p1 is a polynomial of degree 1 (i.e. of the form a*x + b) and p2 is a polynomial of degree 2 (i.e. of the form a*x^2 + b*x + c). The polynomials p1 and p2 are chosen to be uncorrelated, which gives the best possible estimation.
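A quick way to see the orthogonality is to compute the correlation between the columns that poly() produces (a small sketch with an arbitrary x):

```r
x <- 1:20
P <- poly(x, 2)        # two orthogonal polynomial columns, p1(x) and p2(x)
cor(P[, 1], P[, 2])    # numerically zero: the columns are uncorrelated
```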

Slide 16: Adding a quadratic term (cherry trees)

There is a suggestion of quadratic behaviour in diameter, so we fit a model that is quadratic in diameter (i.e. add a diameter^2 term). This is how to add a quadratic term:

cherry2.lm <- lm(volume ~ poly(diameter, 2) + height, data = cherry.df)

In the summary, both poly(diameter, 2) coefficients and the height coefficient are significant (p-values below 2e-16, of order 1e-06, and below 0.001 respectively), with 27 residual degrees of freedom. The R-squared improves from 94.8% to 97.7%.

Slide 17: Equation of the quadratic

To see which quadratic is fitted, enter the raw terms with I():

> cherry.quad <- lm(volume ~ diameter + I(diameter^2) + height, data = cherry.df)
> summary(cherry.quad)

The diameter, I(diameter^2) and height coefficients are all significant, and the residual standard error (on 27 degrees of freedom) and R-squared match the poly() fit exactly: it is the same model, just written in a different basis. The fitted surface has the form

  volume = b0 + b1*diameter + b2*diameter^2 + b3*height.
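That the two parameterisations fit the same surface can be checked directly: the coefficients differ, but the fitted values agree (a sketch using the built-in trees data as a stand-in for cherry.df):

```r
# Stand-in for cherry.df, built from R's built-in 'trees' data
cherry.df <- data.frame(diameter = trees$Girth,
                        height   = trees$Height,
                        volume   = trees$Volume)

fit.orth <- lm(volume ~ poly(diameter, 2) + height, data = cherry.df)
fit.raw  <- lm(volume ~ diameter + I(diameter^2) + height, data = cherry.df)

# Different coefficients, identical fitted values: the same model
all.equal(unname(fitted(fit.orth)), unname(fitted(fit.raw)))  # TRUE
```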

Splines  An alternative to polynomials are splines – these are piecewise cubics, which join smoothly at “knots”  Give a more flexible fit to the data  Values at one point not affected by values at distant points, unlike polynomials 5/13/ lecture 918

Slide 19: [Example: a spline fit with 4 knots.]

Slide 20: Adding a spline term (cherry trees)

This is how to add a spline term (df = 6 gives 3 interior knots):

library(splines)
cherry3.lm <- lm(volume ~ bs(diameter, df = 6) + height, data = cherry.df)

In the summary, several of the bs(diameter, df = 6) coefficients and the height coefficient are highly significant. The residual standard error is 2.8 on 23 degrees of freedom, and the R-squared improves from 94.8% to 97.78%.

Slide 21: [figure]

Slide 22: Example: tyre abrasion data

Data collected in an experiment to study the abrasion resistance of tyres. The variables are:
- hardness: hardness of the rubber
- tensile: tensile strength of the rubber
- abrasion loss: amount of rubber worn away in a standard test (the response)

Slide 23: Tyre abrasion data

> rubber.df
  hardness tensile abloss
  …
(30 observations in all)

Slide 24: Tyre abrasion data

> rubber.lm <- lm(abloss ~ hardness + tensile, data = rubber.df)
> summary(rubber.lm)

In the summary, both hardness and tensile are highly significant (p-values of order 1e-11 and 1e-07; the intercept's is of order 1e-14). The residual standard error has 27 degrees of freedom, and the F-statistic is 71 on 2 and 27 DF, with p-value 1.767e-11.

Slide 25: Tyre abrasion data

We will use this example to illustrate all the methods we have discussed so far for checking whether the data are planar, i.e. scattered about a flat regression plane:
- pairs plot
- spinning plot
- coplot
- residuals versus fitted values plot
- fitting GAMs

Slide 26: [Pairs plot of the tyre abrasion data.]

Slide 27: [Spinning plot of the tyre abrasion data.]

Slide 28: [Coplot of the tyre abrasion data.]

Slide 29: Tyre abrasion data: residuals versus fitted values

data(rubber.df)
rubber.lm <- lm(abloss ~ ., data = rubber.df)
plot(rubber.lm, which = 1)

Slide 30: Tyre abrasion data: gams

library(mgcv)
rubber.gam <- gam(abloss ~ s(tensile) + s(hardness), data = rubber.df)
plot(rubber.gam)

Slide 31: Tyre abrasion data: conclusions

- Pairs plot: not very informative.
- Spinning plot: hint of a "kink".
- Coplot: suggestion of non-linearity, a "kink".
- Residuals versus fitted values: weak suggestion that the regression is not planar.
- gam plots: hardness is OK, but a strong suggestion that tensile needs transforming.
- Suggested transformation: looks a bit like a cubic or 4th-degree polynomial, so try these (or a spline). Using a spline, the R-squared increases from 84% to 94.3%.
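The suggested remedy can be sketched with the Rubber data in the MASS package, which appears to be the same 30-observation abrasion data under different variable names (loss, hard, tens); the df = 4 spline here is an illustrative choice, not necessarily the one used in the course:

```r
library(MASS)      # Rubber: accelerated testing of tyre rubber (30 obs.)
library(splines)

# Plane fit, then a fit with tensile strength entered as a spline
plain.lm  <- lm(loss ~ hard + tens, data = Rubber)
spline.lm <- lm(loss ~ hard + bs(tens, df = 4), data = Rubber)

summary(plain.lm)$r.squared    # about 0.84, as on this slide
summary(spline.lm)$r.squared   # noticeably higher
```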