Lecture 13: Diagnostics in MLR. Variance Inflation Factors, added variable plots, identifying outliers. BMTRY 701 Biostatistical Methods II.


Variance Inflation Factor (VIF)
- Diagnostic for multicollinearity
- Describes how much of an X is explained by the other X's in the model
- If the VIF is high, it suggests that the covariate should not be added
- Why?
  - it is redundant
  - it inflates the variance of the coefficient estimates
  - it creates 'instability' in the estimation

How to calculate VIF?
- Simple idea: $\mathrm{VIF}_j = \dfrac{1}{1 - R_j^2}$
- That is, the VIF for the jth covariate is computed from $R_j^2$, the coefficient of determination obtained from regressing $x_j$ on the remaining x's in the model
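As a small illustration of the definition, here is a sketch of one auxiliary regression. It assumes the built-in `mtcars` data and an arbitrary choice of covariates (neither is part of the lecture's SENIC example), and computes the VIF for `wt` only.

```r
# Illustrative covariates from the built-in mtcars data (an assumption,
# not the SENIC variables used in the lecture)
X <- mtcars[, c("disp", "hp", "wt", "drat")]

# Auxiliary regression: regress one covariate on the remaining covariates
aux <- lm(wt ~ disp + hp + drat, data = X)

# R_j^2 from the auxiliary regression, then VIF_j = 1 / (1 - R_j^2)
r2_wt <- summary(aux)$r.squared
vif_wt <- 1 / (1 - r2_wt)
vif_wt
```

The closer `r2_wt` is to 1, the larger the VIF becomes; a covariate unrelated to the others would have a VIF near 1.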

Sounds like a lot of work!
- You don't actually have to estimate the regressions for each $x_j$.
- Some matrix notation:
  - $X$ = matrix of covariates, including a column for the intercept
  - $X^T$ = transpose of $X$. That is, flip $X$ on its diagonal
  - $X^{-1}$ = the inverse of $X$. That is, what you multiply $X$ by to get the identity matrix
  - $I$ = the identity matrix. A matrix with 0's on the off-diagonal and 1's on the diagonal
- Useful matrix: $X^T X$ (see chapter 3 for lots on it!)
- Another useful matrix: $(X^T X)^{-1}$

$X^T X$
- Recall what it means to standardize a variable:
  - subtract off the mean
  - divide by the standard deviation
- Imagine that you standardize all of the variables in your model (x's).
- Call the new covariate matrix $W$
- Now, if you calculate $W^T W$ and divide by $n-1$, you get the correlation matrix
- Lastly, take the inverse of that scaled matrix, i.e., $[W^T W/(n-1)]^{-1}$

VIFs
- The diagonals of the inverse correlation matrix, $[W^T W/(n-1)]^{-1}$, are the VIFs
- This is a natural by-product of the regression
- The $(W^T W)^{-1}$ matrix is estimated when the regression is estimated
- Rules of thumb:
  - VIF larger than 10 suggests a serious multicollinearity problem
  - VIFs of 5 or greater suggest that coefficient estimates may be misleading due to multicollinearity

Getting the VIFs the old-fashioned way

    # standardize variables
    ages      <- (AGE - mean(AGE)) / sqrt(var(AGE))
    censuss   <- (CENSUS - mean(CENSUS)) / sqrt(var(CENSUS))
    xrays     <- (XRAY - mean(XRAY)) / sqrt(var(XRAY))
    infrisks  <- (INFRISK - mean(INFRISK)) / sqrt(var(INFRISK))
    sqrtcults <- (sqrtCULT - mean(sqrtCULT)) / sqrt(var(sqrtCULT))
    nurses    <- (NURSE - mean(NURSE)) / sqrt(var(NURSE))

    # create matrix of covariates
    xmat <- data.frame(ages, censuss, xrays, infrisks, sqrtcults, nurses)
    xmat <- as.matrix(xmat)
    n <- nrow(xmat)

    # estimate x-transpose x and divide by n-1
    cormat <- t(xmat) %*% xmat / (n - 1)

    # solve finds the inverse of a matrix
    vifmat <- solve(cormat)
    round(diag(vifmat), 2)
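A quick way to convince yourself that the matrix route and the regression definition give the same numbers is to compute both. This is a sketch on the built-in `mtcars` data with arbitrarily chosen covariates (an assumption; the SENIC data are not bundled with R), using `scale()` in place of the hand standardization.

```r
# Illustrative covariates from mtcars (an assumption, not the SENIC data)
X <- as.matrix(mtcars[, c("disp", "hp", "wt", "drat")])
n <- nrow(X)

# Standardize every column: subtract the mean, divide by the sd
W <- scale(X)

# W'W / (n - 1) is the correlation matrix of the covariates
cormat <- crossprod(W) / (n - 1)

# VIFs as the diagonal of the inverse correlation matrix
vif_matrix <- diag(solve(cormat))

# VIFs from the definition: 1 / (1 - R_j^2) for each auxiliary regression
vif_regression <- sapply(colnames(X), function(v) {
  aux <- lm(X[, v] ~ X[, colnames(X) != v])
  1 / (1 - summary(aux)$r.squared)
})

round(rbind(vif_matrix, vif_regression), 2)
```

The two rows of the printed table should agree up to rounding error.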

More practical way.

    library(HH)
    mlr <- lm(logLOS ~ AGE + CENSUS + XRAY + INFRISK + sqrtCULT + NURSE)

    round(diag(vifmat), 2)
    #  ages censuss xrays infrisks sqrtcults nurses

    vif(mlr)
    #  AGE CENSUS XRAY INFRISK sqrtCULT NURSE

What to do?
- Unlikely that only one variable will have high VIF
- You need to then determine which to include, which to remove
- Judgement should be based on science + statistics!

More diagnostics: the added variable plots
- These can help check for adequacy of model
- Is there curvature between Y and X after adjusting for the other X's?
- "Refined" residual plots
- They show the marginal importance of an individual predictor
- Help figure out a good form for the predictor

Example: SENIC
- Recall the difficulty determining the form for INFRISK in our regression model.
- Last time, we settled on including one term, INFRISK^2
- But, we could take an adjusted variable plot approach.
- How?
- We want to know, adjusting for all else in the model, what is the right form for INFRISK?

R code

    av1 <- lm(logLOS ~ AGE + XRAY + CENSUS + factor(REGION))
    av2 <- lm(INFRISK ~ AGE + XRAY + CENSUS + factor(REGION))
    resy <- av1$residuals
    resx <- av2$residuals
    plot(resx, resy, pch=16)
    abline(lm(resy~resx), lwd=2)
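One property worth checking for yourself: the slope of the line drawn through an added variable plot equals the coefficient that the added predictor would get in the full model. The sketch below uses the built-in `mtcars` data (an assumption, standing in for SENIC), with `wt` playing the role of the added predictor.

```r
# Full model, including the predictor of interest (wt)
full <- lm(mpg ~ wt + hp + disp, data = mtcars)

# Step 1: regress the response on the other predictors
av_y <- lm(mpg ~ hp + disp, data = mtcars)
# Step 2: regress the candidate predictor on the other predictors
av_x <- lm(wt ~ hp + disp, data = mtcars)

resy <- residuals(av_y)
resx <- residuals(av_x)

# Slope of residual-on-residual regression (the line in the AV plot)
av_slope <- unname(coef(lm(resy ~ resx))["resx"])
# Coefficient of wt in the full model
full_slope <- unname(coef(full)["wt"])

c(av_slope = av_slope, full_slope = full_slope)
```

Because the two slopes agree, curvature in the plot is informative about the form the predictor should take after adjusting for everything else.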

Added Variable Plot

What does that show?
- The relationship between logLOS and INFRISK if you added INFRISK to the regression
- But, is that what we want to see?
- How about looking at residuals versus INFRISK (before including INFRISK in the model)?

R code

    mlr8 <- lm(logLOS ~ AGE + XRAY + CENSUS + factor(REGION))
    smoother <- lowess(INFRISK, mlr8$residuals)
    plot(INFRISK, mlr8$residuals)
    lines(smoother)

R code

    # linear spline term: 0 up to INFRISK = 4, then INFRISK - 4
    infrisk.star <- ifelse(INFRISK > 4, INFRISK - 4, 0)
    mlr9 <- lm(logLOS ~ INFRISK + infrisk.star + AGE + XRAY +
               CENSUS + factor(REGION))
    summary(mlr9)

    Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
    (Intercept)     1.798e e < 2e-16 ***
    INFRISK         1.836e e
    infrisk.star    6.795e e *
    AGE             5.554e e *
    XRAY            1.361e e *
    CENSUS          3.718e e e-06 ***
    factor(REGION)  e e *
    factor(REGION)  e e ***
    factor(REGION)  e e e-07 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: on 104 degrees of freedom
    Multiple R-Squared: , Adjusted R-squared:
    F-statistic: on 8 and 104 DF, p-value: < 2.2e-16
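To see what the extra term does, here is a sketch on simulated data (the variable names, the knot at 4, and the true slopes are all made up for illustration). The coefficient on `x` is the slope below the knot, and the sum of the two coefficients is the slope above it.

```r
set.seed(701)

# Simulated predictor and response with a change in slope at x = 4
# (true slope 0.05 below the knot, 0.05 + 0.30 = 0.35 above it)
x <- runif(200, min = 1, max = 8)
y <- 2 + 0.05 * x + 0.30 * pmax(x - 4, 0) + rnorm(200, sd = 0.1)

# Spline term built the same way as infrisk.star
x.star <- ifelse(x > 4, x - 4, 0)

fit <- lm(y ~ x + x.star)

# Slope below the knot, and slope above the knot
slope_below <- unname(coef(fit)["x"])
slope_above <- unname(coef(fit)["x"] + coef(fit)["x.star"])

round(c(below = slope_below, above = slope_above), 3)
```

A test of the `x.star` coefficient is therefore a test of whether the slope changes at the knot.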

Residual Plots (two panels: spline for INFRISK; INFRISK^2)

Which is better?
- Cannot compare via ANOVA because they are not nested!
- But, we can compare statistics qualitatively
- R-squared: MLR7: 0.60, MLR9: 0.62
- Partial R-squared: MLR7: 0.17, MLR9: 0.19
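For reference, a partial R-squared can be computed from the error sums of squares of a reduced and a full model. The sketch below assumes the usual definition, (SSE reduced - SSE full) / SSE reduced, and uses the built-in `mtcars` data rather than the MLR7 and MLR9 fits, so the numbers are not comparable to those above.

```r
# Reduced model: without the term(s) of interest
reduced <- lm(mpg ~ hp + disp, data = mtcars)
# Full model: with the term of interest (wt) added
full <- lm(mpg ~ hp + disp + wt, data = mtcars)

sse_reduced <- sum(residuals(reduced)^2)
sse_full <- sum(residuals(full)^2)

# Share of the variation left over by the reduced model
# that is explained by adding wt
partial_r2 <- (sse_reduced - sse_full) / sse_reduced
partial_r2
```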

Identifying Outliers
- Harder to do in the MLR setting than in the SLR setting.
- Recall two concepts that make outliers important:
  - Leverage is a function of the explanatory variable(s) alone and measures the potential for a data point to affect the model parameter estimates.
  - Influence is a measure of how much a data point actually does affect the estimated model.
- Leverage and influence both may be defined in terms of matrices

“Hat” matrix
- We must do some matrix stuff to understand this
- Notation for a MLR with p predictors and data on n patients.
- The data: $(Y_i, X_{i1}, \ldots, X_{ip})$, $i = 1, \ldots, n$

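Matrix Format for the MLR model
- More notation: $Y$ (response vector), $X$ (covariate matrix, including the intercept column), $\beta$ (coefficient vector), $\varepsilon$ (error vector)
- THE MODEL: $Y = X\beta + \varepsilon$
- What are the dimensions of each?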

“Transpose” and “Inverse”
- X-transpose: $X'$ or $X^T$
- X-inverse: $X^{-1}$
- Hat matrix: $H = X (X^T X)^{-1} X^T$
- Why is H important? It transforms Y's to Yhat's: $\hat{Y} = HY$
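The hat matrix can be built directly from the design matrix and checked against what `lm()` reports. This is a sketch on the built-in `mtcars` data (an assumption; any fitted linear model would do).

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Design matrix, including the column of 1's for the intercept
X <- model.matrix(fit)
Y <- mtcars$mpg

# H = X (X'X)^{-1} X'
H <- X %*% solve(t(X) %*% X) %*% t(X)

# H "puts the hat on Y": fitted values are H %*% Y
yhat <- drop(H %*% Y)

# Compare with the fitted values and hat values R computes
max(abs(yhat - fitted(fit)))
max(abs(diag(H) - hatvalues(fit)))
```

Both printed differences should be zero up to rounding error. In practice `hatvalues()` is preferable, since forming the full n by n matrix is wasteful for large n.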

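Estimating, based on fitted model
- Variance-covariance matrix of residuals: $\widehat{\mathrm{Var}}(e) = MSE\,(I - H)$
- Variance of ith residual: $\widehat{\mathrm{Var}}(e_i) = MSE\,(1 - h_{ii})$
- Covariance of ith and jth residual: $\widehat{\mathrm{Cov}}(e_i, e_j) = -h_{ij}\,MSE$, for $i \neq j$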

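Other uses of H
- $e = (I - H)Y$, where $I$ = identity matrix
- Variance-covariance matrix of residuals: $\mathrm{Var}(e) = \sigma^2 (I - H)$
- Variance of ith residual: $\mathrm{Var}(e_i) = \sigma^2 (1 - h_{ii})$
- Covariance of ith and jth residual: $\mathrm{Cov}(e_i, e_j) = -h_{ij}\,\sigma^2$, for $i \neq j$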
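For completeness, a short derivation of the residual variance-covariance matrix, assuming the usual MLR error structure $\mathrm{Var}(Y) = \sigma^2 I$ and using the fact that $I - H$ is symmetric and idempotent:

```latex
\begin{align*}
e &= Y - \hat{Y} = Y - HY = (I - H)Y \\
\mathrm{Var}(e) &= (I - H)\,\mathrm{Var}(Y)\,(I - H)^T
                 = \sigma^2 (I - H)(I - H)^T
                 = \sigma^2 (I - H)
\end{align*}
```

Reading off the diagonal and off-diagonal entries of $\sigma^2 (I - H)$ gives $\sigma^2(1 - h_{ii})$ and $-h_{ij}\sigma^2$ respectively.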

Property of hij's
- $\sum_{j=1}^{n} h_{ij} = \sum_{i=1}^{n} h_{ij} = 1$ (for a model with an intercept)
- This means that each row of H sums to 1
- And, that each column of H sums to 1
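One way to see why the rows and columns sum to 1, assuming the model includes an intercept so that the first column of $X$ is the vector of ones $\mathbf{1}$:

```latex
\begin{align*}
HX &= X (X^T X)^{-1} X^T X = X \\
H\mathbf{1} &= \mathbf{1}
  \;\Longrightarrow\; \sum_{j=1}^{n} h_{ij} = 1 \quad \text{(each row)} \\
H^T &= H
  \;\Longrightarrow\; \sum_{i=1}^{n} h_{ij} = 1 \quad \text{(each column)}
\end{align*}
```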

Other use of H
- Identifies points of leverage

Using the Hat Matrix to identify outliers
- Look at hii to see if a datapoint is an outlier
- Large values of hii imply small values of var(ei)
- As hii gets close to 1, var(ei) approaches 0.
- Note that $\hat{Y}_i = h_{ii} Y_i + \sum_{j \neq i} h_{ij} Y_j$
- As hii approaches 1, yhat approaches y
- This gives hii the name “leverage”
- HIGH HAT VALUE IMPLIES POTENTIAL FOR OUTLIER!

R code

    hat <- hatvalues(reg)
    plot(1:102, hat)
    # flag points with hat value above 0.10
    highhat <- ifelse(hat > 0.10, 1, 0)
    plot(x, y)
    points(x[highhat==1], y[highhat==1], col=2, pch=16, cex=1.5)

Hat values versus index

Identifying points with high hii

Does a high hat mean it has a large residual?
- No.
- hii measures leverage, not influence
- Recall what hii is made of:
  - it depends ONLY on the X's
  - it does not depend on the actual Y value
- Look back at the plot: which of these is probably most “influential”?
- Standard cutoffs for “large” hii: hii > 2p/n; above 0.5 is very high, somewhat below that is high
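The 2p/n cutoff is easy to apply in R. The sketch below uses the built-in `mtcars` data (an assumption, not the lecture's data) and takes p to be the number of regression coefficients including the intercept, which is also an assumption about the convention intended here.

```r
fit <- lm(mpg ~ wt + hp + disp, data = mtcars)

h <- hatvalues(fit)
n <- nrow(mtcars)
p <- length(coef(fit))  # regression coefficients, intercept included

# Rule-of-thumb cutoff for "large" leverage
cutoff <- 2 * p / n

# Names of the observations whose leverage exceeds the cutoff
high_leverage <- names(h)[h > cutoff]
high_leverage

# The hat values always add up to p, so 2p/n is twice the average leverage
sum(h)
```

Flagged points are candidates to look at, not automatically influential: leverage says nothing about the Y value.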

Let’s look at our MLR9
- Any outliers?

Using the hat matrix in MLR
- Studentized residuals
- Acknowledge:
  - each residual has a different variance
  - magnitude of residual should be made relative to its variance (or sd)
- Studentized residuals recognize differences in sampling errors

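Defining Studentized Residuals
- From slide 15, $\widehat{\mathrm{Var}}(e_i) = MSE\,(1 - h_{ii})$
- We then define $r_i = \dfrac{e_i}{\sqrt{MSE\,(1 - h_{ii})}}$
- Comparing ei and ri:
  - ei have different variance due to sampling variations
  - ri have constant variance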
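The definition can be computed by hand and compared with R's `rstandard()`, which returns this (internally) studentized residual. A sketch on the built-in `mtcars` data, used here only for illustration:

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

e <- residuals(fit)
h <- hatvalues(fit)

# MSE = SSE / (n - p), with n - p the residual degrees of freedom
mse <- sum(e^2) / df.residual(fit)

# Studentized residual: residual divided by its estimated sd
r <- e / sqrt(mse * (1 - h))

# Largest difference from R's built-in version
max(abs(r - rstandard(fit)))
```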

Deleted Residuals
- Influence is more intuitively quantified by how things change when an observation is in versus out of the estimation process
- Would be more useful to have residuals in the situation when the observation is removed.
- Example:
  - if a Yi is far out, then it may be very influential in the regression and the residual will be small
  - but, if that case is removed before estimating and then the residual is calculated based on the fit, the residual would be large

Deleted Residuals, di
- Process:
  - delete ith case
  - fit regression with all other cases
  - obtain estimate of E(Yi) based on its X's and fitted model, $\hat{Y}_{i(i)}$
  - the deleted residual is $d_i = Y_i - \hat{Y}_{i(i)}$

Deleted Residuals, di
- Nice result: you don't actually have to refit without the ith case!
- $d_i = \dfrac{e_i}{1 - h_{ii}}$, where ei is the 'plain' residual from the ith case and hii is the hat value. Both are from the regression INCLUDING the case
- For small hii: ei and di will be similar
- For large hii: ei and di will be different
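The shortcut can be verified by actually doing the leave-one-out refits. This is a sketch on the built-in `mtcars` data (an assumption; small enough that n refits are instant).

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Shortcut: deleted residual from the full fit only
d_shortcut <- residuals(fit) / (1 - hatvalues(fit))

# Brute force: drop case i, refit, predict case i, take the difference
d_refit <- sapply(seq_len(nrow(mtcars)), function(i) {
  fit_i <- lm(mpg ~ wt + hp, data = mtcars[-i, ])
  mtcars$mpg[i] - predict(fit_i, newdata = mtcars[i, ])
})

# Largest discrepancy between the two approaches
max(abs(unname(d_shortcut) - unname(d_refit)))
```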

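Studentized Deleted Residuals
- Recall the need to standardize, based on the knowledge of the variance
- $t_i = \dfrac{e_i}{\sqrt{MSE_{(i)}\,(1 - h_{ii})}}$, where $MSE_{(i)}$ is the MSE from the fit with the ith case deleted
- The difference between ti and ri?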

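Another nice result
- You can calculate $MSE_{(i)}$ without refitting the model:
  $(n - p)\,MSE = (n - p - 1)\,MSE_{(i)} + \dfrac{e_i^2}{1 - h_{ii}}$
- So $t_i = e_i \sqrt{\dfrac{n - p - 1}{SSE\,(1 - h_{ii}) - e_i^2}}$, with p counting the regression coefficients (intercept included)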
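Both results can be checked numerically against R's `rstudent()`, which returns studentized deleted residuals. A sketch on the built-in `mtcars` data (an assumption), with p taken as the number of regression coefficients including the intercept:

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

e <- residuals(fit)
h <- hatvalues(fit)
n <- nrow(mtcars)
p <- length(coef(fit))  # regression coefficients, intercept included
sse <- sum(e^2)

# MSE with case i deleted, obtained without refitting
mse_i <- (sse - e^2 / (1 - h)) / (n - p - 1)

# Studentized deleted residuals
t_i <- e / sqrt(mse_i * (1 - h))

# Compare with R's built-in versions
max(abs(t_i - rstudent(fit)))
max(abs(sqrt(mse_i) - influence(fit)$sigma))
```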

Testing for outliers
- outlier = Y observations whose studentized deleted residuals are large (in absolute value)
- $t_i \sim t$ with $n-p-1$ degrees of freedom
- Two examples:
  - simulated data
  - mlr9
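A sketch of how such a test might be carried out in R, on the built-in `mtcars` data. Two things here are assumptions rather than part of the slides: the Bonferroni adjustment for testing all n cases at once, and the choice of alpha = 0.05.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

t_i <- rstudent(fit)    # studentized deleted residuals
n <- nrow(mtcars)
p <- length(coef(fit))  # regression coefficients, intercept included

# Bonferroni-adjusted two-sided critical value (an assumed procedure):
# each of the n tests is run at level alpha / n
alpha <- 0.05
crit <- qt(1 - alpha / (2 * n), df = n - p - 1)

# Cases whose |t_i| exceeds the critical value
flagged <- names(t_i)[abs(t_i) > crit]

crit
flagged
```

Without the adjustment, about 5% of cases would exceed the unadjusted cutoff by chance alone, so some correction for looking at every case is usually sensible.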