Published by Jemima Stevens. Modified over 9 years ago.
1
Lecture 5: SLR Diagnostics (Continued), Correlation, Introduction to Multiple Linear Regression
BMTRY 701 Biostatistical Methods II
2
From last lecture
What were the problems we diagnosed?
We shouldn’t just give up!
Some possible approaches for improvement:
  remove the outliers: does the model change?
  transform LOS: do we better adhere to model assumptions?
3
Outlier Quandary
To remove or not to remove outliers?
Are they real data?
If they are truly reflective of the data, then what does removing them imply?
Use caution!
  better to be true to the data
  having a perfect model should not come at the expense of using ‘real’ data!
4
Removing the outliers: How to?
I am always reluctant.
My approach in this example:
  remove each separately
  remove both together
  compare each model with the model that includes the outliers
How to decide: compare slope estimates.
5
SENIC Data
> par(mfrow=c(1,2))
> hist(data$LOS)
> plot(data$BEDS, data$LOS)
6
How to fit regression removing outlier(s)?
> keep.remove.both <- ifelse(data$LOS<16,1,0)
> keep.remove.20 <- ifelse(data$LOS<19,1,0)
> keep.remove.18 <- ifelse(data$LOS<16 | data$BEDS<600,1,0)
>
> table(keep.remove.both)
keep.remove.both
  0   1
  2 111
> table(keep.remove.20)
keep.remove.20
  0   1
  1 112
> table(keep.remove.18)
keep.remove.18
  0   1
  1 112
7
Regression Fitting
reg <- lm(LOS ~ BEDS, data=data)
reg.remove.both <- lm(LOS ~ BEDS, data=data[keep.remove.both==1,])
reg.remove.20 <- lm(LOS ~ BEDS, data=data[keep.remove.20==1,])
reg.remove.18 <- lm(LOS ~ BEDS, data=data[keep.remove.18==1,])
8
How much do our inferences change?

              reg       remove both   remove 20   remove 18
β1 estimate   0.00406   0.00299       0.00393     0.00314
se(β1)        0.00086   0.00070       0.00073     0.00085
% change      0 (ref)   26%           3%          23%

Why is “18” a bigger outlier than “20”?
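The comparison in the table can be scripted rather than copied by hand. The sketch below is only an illustration: it uses simulated data as a hypothetical stand-in for SENIC (the data frame `sim`, its values, and the two hand-placed outliers are all made up), and reuses the same cutoffs as the `keep.remove.*` indicators.

```r
# Hypothetical stand-in for the SENIC data: 111 simulated hospitals
# plus two long-stay outliers appended by hand (all values are made up).
set.seed(701)
n <- 111
BEDS <- round(runif(n, 30, 800))
LOS <- 8.6 + 0.003 * BEDS + rnorm(n, sd = 1.2)
sim <- data.frame(BEDS = c(BEDS, 300, 830), LOS = c(LOS, 19.6, 17.9))

# Full model and the three reduced models, using the slide's cutoffs
fits <- list(
  reg         = lm(LOS ~ BEDS, data = sim),
  remove.both = lm(LOS ~ BEDS, data = subset(sim, LOS < 16)),
  remove.20   = lm(LOS ~ BEDS, data = subset(sim, LOS < 19)),
  remove.18   = lm(LOS ~ BEDS, data = subset(sim, LOS < 16 | BEDS < 600))
)

# Slope, its standard error, and % change relative to the full model
slope <- sapply(fits, function(f) coef(f)[["BEDS"]])
se    <- sapply(fits, function(f) coef(summary(f))["BEDS", "Std. Error"])
pct   <- 100 * (slope[["reg"]] - slope) / slope[["reg"]]
comparison <- round(rbind(slope, se, pct), 5)
print(comparison)
```

The numbers will not match the SENIC table, since the data are simulated; the point is the layout of the comparison.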
9
Leverage and Influence
Leverage is a function of the explanatory variable(s) alone and measures the potential for a data point to affect the model parameter estimates.
Influence is a measure of how much a data point actually does affect the estimated model.
Leverage and influence both may be defined in terms of matrices.
More later in MLR (MPV ch. 6)
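Base R reports both quantities for a fitted `lm`. The sketch below uses `hatvalues()` for leverage and `cooks.distance()` as one common influence measure (the slides do not say which influence measure is intended, so that choice is an assumption). The data are made up, not SENIC.

```r
# Hypothetical data: 50 ordinary points plus one point that is
# extreme in x and well off the trend in y.
set.seed(5)
x <- c(runif(50, 0, 100), 300)
y <- c(5 + 0.02 * x[1:50] + rnorm(50, sd = 0.5), 20)
fit <- lm(y ~ x)

lev  <- hatvalues(fit)        # leverage: depends on x only
infl <- cooks.distance(fit)   # influence: depends on both x and y

# The added point (observation 51) stands out on both measures
which.max(lev)
which.max(infl)

# Leverage does not change when y is replaced, because it is a
# function of the explanatory variable alone
fit2 <- lm(rnorm(51) ~ x)
all.equal(hatvalues(fit), hatvalues(fit2))
```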
10
Graphically
11
R code
par(mfrow=c(1,1))
plot(data$BEDS, data$LOS, pch=16)

# plain old regression model
abline(reg, lwd=2)

# plot “20” to show which point we are removing, then
# add regression line
points(data$BEDS[keep.remove.20==0], data$LOS[keep.remove.20==0], col=2, cex=1.5, pch=16)
abline(reg.remove.20, col=2, lwd=2)

# plot “18” and then add regression line
points(data$BEDS[keep.remove.18==0], data$LOS[keep.remove.18==0], col=4, cex=1.5, pch=16)
abline(reg.remove.18, col=4, lwd=2)

# add regression line where we removed both outliers
abline(reg.remove.both, col=5, lwd=2)

# add a legend to the plot; label order must match col=c(1,2,4,5),
# i.e. col 2 is the fit without “20” and col 4 the fit without “18”
legend(1,19, c("reg","w/out 20","w/out 18","w/out both"), lwd=rep(2,4), lty=rep(1,4), col=c(1,2,4,5))
12
What to do?
Let’s try something else.
What was our other problem?
  heteroskedasticity (great word…try that at Scrabble)
  non-normality of residuals
Common way to solve: transform the outcome
13
Determining the Transformation
Box-Cox transformation approach
Finds the “best” power transformation to bring a distribution as close as possible to normality
Can be applied to a single variable or to a linear regression model
When applied to a regression model, the result tells you the ‘best’ power transform of Y to achieve normal residuals
14
Review of power transformation
Assume we want to transform Y
Box-Cox considers Y^a for all values of a
Solution is the a that provides the “most normal” looking Y^a
Practical powers
  a = 1: identity
  a = ½: square-root
  a = 0: log
  a = -1: 1/Y. Usually we also take the negative (-1/Y) so that order is maintained (see example)
Often there is no practical interpretation, e.g. Y^(-0.136)
15
Box-Cox for linear regression
library(MASS)
bc <- boxcox(reg)
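`boxcox()` returns the grid of powers it tried (`x`) and the profile log-likelihood at each (`y`), so the "best" power can be read off numerically as well as from the plot. A minimal sketch on simulated data (a hypothetical outcome, not LOS, constructed so that the log scale is the right one):

```r
library(MASS)

# Hypothetical right-skewed outcome: log(y) is linear in x,
# so the profile likelihood should peak near a = 0.
set.seed(15)
x <- runif(200, 0, 10)
y <- exp(1 + 0.2 * x + rnorm(200, sd = 0.3))
fit <- lm(y ~ x)

# Evaluate the profile log-likelihood on a grid of powers, without plotting
bc <- boxcox(fit, lambda = seq(-2, 2, by = 0.05), plotit = FALSE)

# Power with the highest log-likelihood
a.hat <- bc$x[which.max(bc$y)]
a.hat

# Approximate 95% interval: powers whose log-likelihood is within
# qchisq(0.95, 1) / 2 of the maximum
range(bc$x[bc$y > max(bc$y) - qchisq(0.95, 1) / 2])
```

In practice one usually rounds the estimate to a nearby "practical" power (here, the log) rather than using the exact maximizer.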
16
Transform
ty <- -1/data$LOS
plot(data$LOS, ty)
17
New regression: transform is -1/LOS
plot(data$BEDS, ty, pch=16)
reg.ty <- lm(ty ~ data$BEDS)
abline(reg.ty, lwd=2)
18
More interpretable?
LOS is often analyzed in the literature
Common transform is log
  it is well-known that LOS is skewed in most applications
  most people take the log
  people are used to seeing and interpreting it on the log scale
How good is our model if we just take the log?
19
Regression with log(LOS)
20
Let’s compare: residual plots
21
Let’s compare: distribution of residuals
22
Let’s Compare: |Residuals| p=0.59 p=0.12
23
Let’s Compare: QQ-plot
24
R code
logy <- log(data$LOS)
par(mfrow=c(1,2))
plot(data$LOS, logy)
plot(data$BEDS, logy, pch=16)
reg.logy <- lm(logy ~ data$BEDS)
abline(reg.logy, lwd=2)

par(mfrow=c(1,2))
plot(data$BEDS, reg.ty$residuals, pch=16)
abline(h=0, lwd=2)
plot(data$BEDS, reg.logy$residuals, pch=16)
abline(h=0, lwd=2)

boxplot(reg.ty$residuals)
title("Residuals where Y = -1/LOS")
boxplot(reg.logy$residuals)
title("Residuals where Y = log(LOS)")

qqnorm(reg.ty$residuals, main="TY")
qqline(reg.ty$residuals)
qqnorm(reg.logy$residuals, main="LogY")
qqline(reg.logy$residuals)
25
Regression results
> summary(reg.ty)
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.169e-01  2.522e-03 -46.371  < 2e-16 ***
data$BEDS    3.953e-05  7.957e-06   4.968 2.47e-06 ***
---
> summary(reg.logy)
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.1512591  0.0251328  85.596  < 2e-16 ***
data$BEDS   0.0003921  0.0000793   4.944 2.74e-06 ***
---
26
Let’s compare: results ‘untransformed’
27
R code
par(mfrow=c(1,2))

# left panel: full range of LOS
plot(data$BEDS, data$LOS, pch=16)
abline(reg, lwd=2)
# back-transform fitted values to the LOS scale:
# dashed = -1/LOS model, dotted = log(LOS) model
lines(sort(data$BEDS), -1/sort(reg.ty$fitted.values), lwd=2, lty=2)
lines(sort(data$BEDS), exp(sort(reg.logy$fitted.values)), lwd=2, lty=3)

# right panel: same plot, zoomed in to LOS between 7 and 12
plot(data$BEDS, data$LOS, pch=16, ylim=c(7,12))
abline(reg, lwd=2)
lines(sort(data$BEDS), -1/sort(reg.ty$fitted.values), lwd=2, lty=2)
lines(sort(data$BEDS), exp(sort(reg.logy$fitted.values)), lwd=2, lty=3)
28
So, what to do? What are the pros and cons of each transform? Should we transform at all?!
29
Switching Gears: Correlation
“Pearson” Correlation
Measures linear association between two variables
A natural by-product of linear regression
Notation: r or ρ (rho)
30
Correlation versus slope?
They measure different aspects of the association between X and Y
  Slope: measures whether there is a linear trend
  Correlation: measures how close the data points fall to the line
Statistical significance is IDENTICAL: the p-value for testing that the correlation is 0 is the SAME as the p-value for testing that the slope is 0.
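The equality of the two p-values is easy to check numerically. A minimal sketch on simulated data (the variables `x` and `y` are made up for illustration):

```r
set.seed(30)
x <- rnorm(40)
y <- 1 + 0.5 * x + rnorm(40)

# p-value for H0: slope = 0, from the regression table
p.slope <- coef(summary(lm(y ~ x)))["x", "Pr(>|t|)"]

# p-value for H0: correlation = 0, from the Pearson correlation test
p.cor <- unname(cor.test(x, y)$p.value)

c(slope = p.slope, correlation = p.cor)
all.equal(p.slope, p.cor)
```

Both tests use a t statistic on n - 2 degrees of freedom, which is why they agree even though the slope and the correlation themselves are different numbers.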
31
Example: Same slope, different correlation
[Two scatterplot panels: r = 0.46, b1 = 2; r = 0.95, b1 = 2]
32
Example: Same correlation, different slope
[Two scatterplot panels: r = 0.46, b1 = 4; r = 0.46, b1 = 2]
33
Correlation
Scaled version of Covariance between X and Y
Recall Covariance:
Estimating the Covariance:
34
Correlation
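For reference, a sketch of the standard textbook definitions these two slides refer to. The notation is assumed here (population means μ and standard deviations σ, sample means x̄ and ȳ), not taken from the slides:

```latex
\begin{align*}
\operatorname{Cov}(X, Y) &= E\left[(X - \mu_X)(Y - \mu_Y)\right] \\
\widehat{\operatorname{Cov}}(X, Y) &= \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \\
\rho &= \frac{\operatorname{Cov}(X, Y)}{\sigma_X \, \sigma_Y} \\
r &= \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
          {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \; \sum_{i=1}^{n} (y_i - \bar{y})^2}}
\end{align*}
```

Dividing the covariance by the two standard deviations is what makes the correlation unit-free and bounded between -1 and 1.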
35
Interpretation
Correlation tells how closely two variables “track” one another
Provides information about ability to predict Y from X
Regression output: look for R^2; for SLR, sqrt(R^2) = |correlation| (the sign is the sign of the slope)
Can have low correlation yet significant association
With correlation, a 95% confidence interval is helpful
36
LOS ~ BEDS
> summary(lm(data$LOS ~ data$BEDS))

Call:
lm(formula = data$LOS ~ data$BEDS)

Residuals:
    Min      1Q  Median      3Q     Max
-2.8291 -1.0028 -0.1302  0.6782  9.6933

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.6253643  0.2720589  31.704  < 2e-16 ***
data$BEDS   0.0040566  0.0008584   4.726 6.77e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.752 on 111 degrees of freedom
Multiple R-squared: 0.1675, Adjusted R-squared: 0.16
F-statistic: 22.33 on 1 and 111 DF, p-value: 6.765e-06
37
95% Confidence Interval for Correlation
The computation of a confidence interval on the population value of Pearson's correlation (ρ) is complicated by the fact that the sampling distribution of r is not normal. The solution lies with Fisher's z' transformation, described in the section on the sampling distribution of Pearson's r.
The steps in computing a confidence interval for ρ are:
  Convert r to z'
  Compute a confidence interval in terms of z'
  Convert the confidence interval back to r.
freeware!
  http://www.danielsoper.com/statcalc/calc28.aspx
  http://glass.ed.asu.edu/stats/analysis/rci.html
  http://faculty.vassar.edu/lowry/rho.html
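The three steps can also be done directly in R. The helper below is a sketch, and its name `fisher.ci` is ours; it assumes the usual large-sample standard error 1/sqrt(n - 3) for z'. The inputs come from the LOS ~ BEDS output: r = sqrt(0.1675) and n = 113 (111 residual degrees of freedom plus 2).

```r
# Hypothetical helper: confidence interval for rho from r and n
# via Fisher's z' transformation
fisher.ci <- function(r, n, level = 0.95) {
  z    <- atanh(r)                       # step 1: z' = 0.5 * log((1 + r) / (1 - r))
  se   <- 1 / sqrt(n - 3)                # approximate standard error of z'
  crit <- qnorm(1 - (1 - level) / 2)
  tanh(z + c(-1, 1) * crit * se)         # steps 2 and 3: interval in z', then back to r
}

ci <- fisher.ci(sqrt(0.1675), 113)
round(ci, 2)
```

For these inputs the interval is roughly 0.24 to 0.55. With the raw data in hand, `cor.test(x, y)$conf.int` reports an interval built the same way.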
38
log(LOS) ~ BEDS
> summary(lm(log(data$LOS) ~ data$BEDS))

Call:
lm(formula = log(data$LOS) ~ data$BEDS)

Residuals:
      Min        1Q    Median        3Q       Max
-0.296328 -0.106103 -0.005296  0.084177  0.702262

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.1512591  0.0251328  85.596  < 2e-16 ***
data$BEDS   0.0003921  0.0000793   4.944 2.74e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1618 on 111 degrees of freedom
Multiple R-squared: 0.1805, Adjusted R-squared: 0.1731
F-statistic: 24.44 on 1 and 111 DF, p-value: 2.737e-06
39
Multiple Linear Regression
Most regression applications include more than one covariate
Allows us to make inferences about the relationship between two variables (X and Y) adjusting for other variables
Used to account for confounding
Especially important in observational studies
  smoking and lung cancer
  we know people who smoke tend to expose themselves to other risks and harms
  if we didn’t adjust, we would overestimate the effect of smoking on the risk of lung cancer.
40
Importance of including ‘important’ covariates
If you leave out relevant covariates, your estimate of β1 will be biased
How biased?
Assume:
  true model:
  fitted model:
41
Fun derivation
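A sketch of the usual omitted-variable argument, assuming the true model has two covariates X1 and X2 and the fitted model leaves out X2. The starred symbols, b1*, r12, s1 and s2 are notation introduced here for illustration, with the X values treated as fixed:

```latex
\begin{align*}
\text{true model:}\quad Y_i &= \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \varepsilon_i \\
\text{fitted model:}\quad Y_i &= \beta_0^* + \beta_1^* X_{i1} + \varepsilon_i^* \\[4pt]
b_1^* &= \frac{\sum_i (X_{i1} - \bar{X}_1)\, Y_i}{\sum_i (X_{i1} - \bar{X}_1)^2} \\[4pt]
E[b_1^*] &= \frac{\sum_i (X_{i1} - \bar{X}_1)\,(\beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2})}
                 {\sum_i (X_{i1} - \bar{X}_1)^2} \\[4pt]
         &= \beta_1 + \beta_2\, \frac{\sum_i (X_{i1} - \bar{X}_1)(X_{i2} - \bar{X}_2)}
                                     {\sum_i (X_{i1} - \bar{X}_1)^2}
          = \beta_1 + \beta_2\, r_{12}\, \frac{s_2}{s_1}
\end{align*}
```

Here r12 is the sample correlation between X1 and X2, and s1, s2 are their sample standard deviations. The bias term β2 r12 s2/s1 vanishes when X1 and X2 are uncorrelated or when β2 = 0.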
44
Implications
The bias is a function of the correlation between the two covariates, X1 and X2
If the correlation is high, the bias will be high
If the correlation is low, the bias may be quite small.
If there is no correlation between X1 and X2, then omitting X2 does not bias inferences
However, it is not a good model for prediction if X2 is related to Y
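A small simulation makes these implications concrete. This is a sketch under assumed values (true β1 = 2, β2 = 3, standard-normal covariates); the function name `sim.slope` is ours.

```r
set.seed(44)
n <- 5000

# Fit Y ~ X1 only, when the true model also contains X2,
# for a chosen correlation rho between X1 and X2
sim.slope <- function(rho) {
  x1 <- rnorm(n)
  x2 <- rho * x1 + sqrt(1 - rho^2) * rnorm(n)   # cor(x1, x2) is about rho
  y  <- 1 + 2 * x1 + 3 * x2 + rnorm(n)          # true beta1 = 2, beta2 = 3
  coef(lm(y ~ x1))[["x1"]]                      # x2 is omitted from the fit
}

b1 <- sapply(c(0, 0.5, 0.9), sim.slope)
round(b1, 2)
```

With both covariates on unit variance, the estimated slope should land near 2 + 3 * rho, that is about 2, 3.5 and 4.7: no bias when the covariates are uncorrelated, and growing bias as the correlation rises.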
45
Example: LOS ~ BEDS analysis.
> cor(cbind(data$BEDS, data$NURSE, data$LOS))
          [,1]      [,2]      [,3]
[1,] 1.0000000 0.9155042 0.4092652
[2,] 0.9155042 1.0000000 0.3403671
[3,] 0.4092652 0.3403671 1.0000000
46
R code
reg.beds <- lm(log(data$LOS) ~ data$BEDS)
reg.nurse <- lm(log(data$LOS) ~ data$NURSE)
reg.beds.nurse <- lm(log(data$LOS) ~ data$BEDS + data$NURSE)
summary(reg.beds)
summary(reg.nurse)
summary(reg.beds.nurse)
47
SLRs

Call:
lm(formula = log(data$LOS) ~ data$BEDS)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.1512591  0.0251328  85.596  < 2e-16 ***
data$BEDS   0.0003921  0.0000793   4.944 2.74e-06 ***
---

Call:
lm(formula = log(data$LOS) ~ data$NURSE)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.1682138  0.0250054  86.710  < 2e-16 ***
data$NURSE  0.0004728  0.0001127   4.195 5.51e-05 ***
---
48
BEDS + NURSE
> summary(reg.beds.nurse)

Call:
lm(formula = log(data$LOS) ~ data$BEDS + data$NURSE)

Residuals:
      Min        1Q    Median        3Q       Max
-0.291537 -0.108447 -0.006711  0.087594  0.696747

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.1522361  0.0252758  85.150   <2e-16 ***
data$BEDS    0.0004910  0.0001977   2.483   0.0145 *
data$NURSE  -0.0001497  0.0002738  -0.547   0.5857
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1624 on 110 degrees of freedom
Multiple R-squared: 0.1827, Adjusted R-squared: 0.1678
F-statistic: 12.29 on 2 and 110 DF, p-value: 1.519e-05