
Lecture 5: SLR Diagnostics (Continued), Correlation, and Introduction to Multiple Linear Regression
BMTRY 701 Biostatistical Methods II

From last lecture  What were the problems we diagnosed?  We shouldn't just give up!  Some possible approaches for improvement: remove the outliers (does the model change?); transform LOS (do we better adhere to the model assumptions?)

Outlier Quandary  To remove or not to remove outliers?  Are they real data?  If they are truly reflective of the data, then what does removing them imply?  Use caution! It is better to be true to the data; having a perfect model should not come at the expense of discarding 'real' data!

Removing the outliers: How to?  I am always reluctant.  My approach in this example: remove each outlier separately, remove both together, and compare each model with the model that includes the outliers.  How to decide: compare slope estimates.

SENIC Data

> par(mfrow=c(1,2))
> hist(data$LOS)
> plot(data$BEDS, data$LOS)

How to fit regression removing outlier(s)?

> keep.remove.both <- ifelse(data$LOS<16, 1, 0)
> keep.remove.20 <- ifelse(data$LOS<19, 1, 0)
> keep.remove.18 <- ifelse(data$LOS<16 | data$BEDS<600, 1, 0)
>
> table(keep.remove.both)
keep.remove.both
> table(keep.remove.20)
keep.remove.20
> table(keep.remove.18)
keep.remove.18

Regression Fitting

reg <- lm(LOS ~ BEDS, data=data)
reg.remove.both <- lm(LOS ~ BEDS, data=data[keep.remove.both==1,])
reg.remove.20 <- lm(LOS ~ BEDS, data=data[keep.remove.20==1,])
reg.remove.18 <- lm(LOS ~ BEDS, data=data[keep.remove.18==1,])

How much do our inferences change?

               reg       remove both   remove 20   remove 18
β1 estimate
se(β1)
% change       0 (ref)   26%           3%          23%

Why is "18" a bigger outlier than "20"?

Leverage and Influence  Leverage is a function of the explanatory variable(s) alone and measures the potential for a data point to affect the model parameter estimates.  Influence is a measure of how much a data point actually does affect the estimated model.  Leverage and influence both may be defined in terms of matrices  More later in MLR (MPV ch. 6)
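Not from the slides: a minimal R sketch of how these two quantities can be computed for the SLR fit, assuming the reg object and data frame used earlier (leverage from the hat values, influence from Cook's distance and DFBETAS).

# sketch (not from the slides): leverage and influence for the SLR fit 'reg'
lev <- hatvalues(reg)          # leverage: depends only on the BEDS values
cd  <- cooks.distance(reg)     # influence: overall effect of deleting each observation
dfb <- dfbetas(reg)            # influence: standardized change in each coefficient
p <- length(coef(reg)); n <- nrow(data)
which(lev > 2*p/n)                 # flag points above the common 2p/n leverage cutoff
head(sort(cd, decreasing=TRUE))    # the most influential observations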

Graphically

R code

par(mfrow=c(1,1))
plot(data$BEDS, data$LOS, pch=16)

# plain old regression model
abline(reg, lwd=2)

# plot "20" to show which point we are removing, then add its regression line
points(data$BEDS[keep.remove.20==0], data$LOS[keep.remove.20==0], col=2, cex=1.5, pch=16)
abline(reg.remove.20, col=2, lwd=2)

# plot "18" and then add its regression line
points(data$BEDS[keep.remove.18==0], data$LOS[keep.remove.18==0], col=4, cex=1.5, pch=16)
abline(reg.remove.18, col=4, lwd=2)

# add the regression line where we removed both outliers
abline(reg.remove.both, col=5, lwd=2)

# add a legend to the plot
legend(1, 19, c("reg", "w/out 18", "w/out 20", "w/out both"),
       lwd=rep(2,4), lty=rep(1,4), col=c(1,2,4,5))

What to do?  Let's try something else.  What was our other problem? Heteroskedasticity (great word… try that at Scrabble) and non-normality of the residuals.  Common way to solve: transform the outcome.

Determining the Transformation  Box-Cox transformation approach  Finds the "best" power transformation to achieve the distribution closest to normality  Can be applied to a single variable or to a linear regression model  When applied to a regression model, the result tells you the 'best' power transform of Y to achieve normal residuals

Review of power transformation  Assume we want to transform Y  Box-Cox considers Y^a for all values of a  The solution is the a that provides the "most normal" looking Y^a  Practical powers: a = 1: identity; a = 1/2: square root; a = 0: log; a = -1: 1/Y (we usually take -1/Y so that the ordering of the data is maintained; see example)  Often there is no practical interpretation for Y^a
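For reference, the scaled form of the Box-Cox family (the standard definition, and the form typically profiled by software such as MASS::boxcox) is

\[
y^{(\lambda)} =
\begin{cases}
\dfrac{y^{\lambda}-1}{\lambda}, & \lambda \neq 0,\\[4pt]
\log y, & \lambda = 0,
\end{cases}
\]

where dividing by λ keeps the transformation increasing in y for every λ, which is why the a = -1 case is usually written as -1/Y.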

Box-Cox for linear regression

library(MASS)
bc <- boxcox(reg)
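A small follow-up sketch (not on the slide): boxcox() returns the grid of candidate powers and the corresponding profile log-likelihoods, so the suggested power can be read off directly.

# not on the slide: read off the power that maximizes the profile log-likelihood
lambda.hat <- bc$x[which.max(bc$y)]
lambda.hat   # the next slides use -1/LOS, i.e. a power of -1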

Transform

ty <- -1/data$LOS
plot(data$LOS, ty)

New regression: transform is -1/LOS

plot(data$BEDS, ty, pch=16)
reg.ty <- lm(ty ~ data$BEDS)
abline(reg.ty, lwd=2)

More interpretable?  LOS is often analyzed in the literature  The most common transform is the log: it is well known that LOS is skewed in most applications, most people take the log, and readers are used to seeing and interpreting LOS on the log scale  How good is our model if we just take the log?

Regression with log(LOS)

Let’s compare: residual plots

Let’s compare: distribution of residuals

Let’s Compare: |Residuals| p=0.59 p=0.12

Let’s Compare: QQ-plot

R code

logy <- log(data$LOS)
par(mfrow=c(1,2))
plot(data$LOS, logy)
plot(data$BEDS, logy, pch=16)
reg.logy <- lm(logy ~ data$BEDS)
abline(reg.logy, lwd=2)

par(mfrow=c(1,2))
plot(data$BEDS, reg.ty$residuals, pch=16)
abline(h=0, lwd=2)
plot(data$BEDS, reg.logy$residuals, pch=16)
abline(h=0, lwd=2)

boxplot(reg.ty$residuals)
title("Residuals where Y = -1/LOS")
boxplot(reg.logy$residuals)
title("Residuals where Y = log(LOS)")

qqnorm(reg.ty$residuals, main="TY")
qqline(reg.ty$residuals)
qqnorm(reg.logy$residuals, main="LogY")
qqline(reg.logy$residuals)

Regression results

> summary(reg.ty)
Coefficients:
             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)                                  < 2e-16 ***
data$BEDS    3.953e                             e-06 ***
---

> summary(reg.logy)
Coefficients:
             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)                                  < 2e-16 ***
data$BEDS                                       e-06 ***
---

Let’s compare: results ‘untransformed’

R code

par(mfrow=c(1,2))
plot(data$BEDS, data$LOS, pch=16)
abline(reg, lwd=2)
lines(sort(data$BEDS), -1/sort(reg.ty$fitted.values), lwd=2, lty=2)
lines(sort(data$BEDS), exp(sort(reg.logy$fitted.values)), lwd=2, lty=3)

plot(data$BEDS, data$LOS, pch=16, ylim=c(7,12))
abline(reg, lwd=2)
lines(sort(data$BEDS), -1/sort(reg.ty$fitted.values), lwd=2, lty=2)
lines(sort(data$BEDS), exp(sort(reg.logy$fitted.values)), lwd=2, lty=3)

So, what to do?  What are the pros and cons of each transform?  Should we transform at all?!

Switching Gears: Correlation  “Pearson” Correlation  Measures linear association between two variables  A natural by-product of linear regression  Notation: r or ρ (rho)

Correlation versus slope?  Measure different aspects of the association between X and Y  Slope: measures if there is a linear trend  Correlation: provides measure of how close the datapoints fall to the line  Statistical significance is IDENTICAL p-value for testing that correlation is 0 is the SAME as the p-value for testing that the slope is 0.
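A quick way to see the last point in R (a sketch, assuming the same data frame with LOS and BEDS used throughout): the t-test for the slope and the test that the correlation is 0 return the same p-value.

# sketch: the slope test and the correlation test give identical p-values
fit <- lm(LOS ~ BEDS, data=data)
summary(fit)$coefficients["BEDS", "Pr(>|t|)"]
cor.test(data$BEDS, data$LOS)$p.value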

Example: Same slope, different correlation  r = 0.46, b1 = 2 vs. r = 0.95, b1 = 2

Example: Same correlation, different slope  r = 0.46, b1 = 4 vs. r = 0.46, b1 = 2

Correlation  Scaled version of Covariance between X and Y  Recall Covariance:  Estimating the Covariance:
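For reference, the standard definitions being recalled here are

\[
\mathrm{Cov}(X,Y) = E\big[(X-\mu_X)(Y-\mu_Y)\big],
\qquad
\widehat{\mathrm{Cov}}(X,Y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y}).
\]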

Correlation
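The correlation is that covariance scaled by the two standard deviations; the standard population and sample versions are

\[
\rho = \frac{\mathrm{Cov}(X,Y)}{\sigma_X\,\sigma_Y},
\qquad
r = \frac{\sum_{i}(x_i-\bar{x})(y_i-\bar{y})}
         {\sqrt{\sum_{i}(x_i-\bar{x})^{2}\sum_{i}(y_i-\bar{y})^{2}}}.
\]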

Interpretation  Correlation tells how closely two variables "track" one another  Provides information about the ability to predict Y from X  Regression output: look for R^2; for SLR, sqrt(R^2) = |correlation|  Can have low correlation yet a significant association  With correlation, a 95% confidence interval is helpful

LOS ~ BEDS

> summary(lm(data$LOS ~ data$BEDS))

Call:
lm(formula = data$LOS ~ data$BEDS)

Residuals:
    Min      1Q  Median      3Q     Max

Coefficients:
             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)                                  < 2e-16 ***
data$BEDS                                       e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error:       on 111 degrees of freedom
Multiple R-squared:      ,  Adjusted R-squared:  0.16
F-statistic:       on 1 and 111 DF,  p-value: 6.765e-06

95% Confidence Interval for Correlation  The computation of a confidence interval on the population value of Pearson's correlation (ρ) is complicated by the fact that the sampling distribution of r is not normally distributed. The solution lies with Fisher's z' transformation, described in the section on the sampling distribution of Pearson's r. The steps in computing a confidence interval for ρ are: convert r to z'; compute a confidence interval in terms of z'; convert the confidence interval back to r.
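A minimal R sketch of those steps (the values of r and n below are hypothetical, chosen only for illustration):

# sketch of Fisher's z' confidence interval for a correlation
# r and n are hypothetical illustration values
r <- 0.41
n <- 113
z  <- atanh(r)                         # step 1: convert r to z' = 0.5*log((1+r)/(1-r))
se <- 1/sqrt(n - 3)                    # standard error of z'
ci.z <- z + c(-1, 1)*qnorm(0.975)*se   # step 2: confidence interval on the z' scale
tanh(ci.z)                             # step 3: back-transform the limits to the r scale
# cor.test(x, y) reports this interval automatically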

log(LOS) ~ BEDS

> summary(lm(log(data$LOS) ~ data$BEDS))

Call:
lm(formula = log(data$LOS) ~ data$BEDS)

Residuals:
    Min      1Q  Median      3Q     Max

Coefficients:
             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)                                  < 2e-16 ***
data$BEDS                                       e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error:       on 111 degrees of freedom
Multiple R-squared:      ,  Adjusted R-squared:
F-statistic:       on 1 and 111 DF,  p-value: 2.737e-06

Multiple Linear Regression  Most regression applications include more than one covariate  Allows us to make inferences about the relationship between two variables (X and Y) while adjusting for other variables  Used to account for confounding  Especially important in observational studies: for smoking and lung cancer, we know that people who smoke tend to expose themselves to other risks and harms, so if we didn't adjust, we would overestimate the effect of smoking on the risk of lung cancer.

Importance of including 'important' covariates  If you leave out relevant covariates, your estimate of β1 will be biased  How biased?  Assume the true model is Y = β0 + β1 X1 + β2 X2 + ε, but the fitted model is Y = β0 + β1 X1 + ε

Fun derivation
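A sketch of the standard result this derivation arrives at, assuming the true and fitted models on the previous slide:

\[
E\!\left[\hat{\beta}_{1}\right]
= \beta_{1} + \beta_{2}\,
\frac{\sum_{i}(x_{i1}-\bar{x}_{1})(x_{i2}-\bar{x}_{2})}
     {\sum_{i}(x_{i1}-\bar{x}_{1})^{2}},
\]

that is, the bias equals β2 times the slope you would get from regressing X2 on X1.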

Implications  The bias is a function of the correlation between the two covariates, X1 and X2  If the correlation is high, the bias will be high  If the correlation is low, the bias may be quite small  If there is no correlation between X1 and X2, then omitting X2 does not bias inferences  However, it is not a good model for prediction if X2 is related to Y
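A toy simulation (not from the slides) that illustrates the point: when X1 and X2 are correlated and X2 is omitted, the coefficient on X1 absorbs part of X2's effect.

# toy simulation (not from the slides): omitted-variable bias
set.seed(1)
n  <- 1000
x1 <- rnorm(n)
x2 <- 0.7*x1 + rnorm(n)             # x2 is correlated with x1
y  <- 1 + 2*x1 + 3*x2 + rnorm(n)    # true model involves both covariates
coef(lm(y ~ x1))["x1"]              # biased: roughly 2 + 3*0.7 = 4.1
coef(lm(y ~ x1 + x2))["x1"]         # close to the true value of 2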

Example: LOS ~ BEDS analysis

> cor(cbind(data$BEDS, data$NURSE, data$LOS))
     [,1] [,2] [,3]
[1,]
[2,]
[3,]

R code

reg.beds <- lm(log(data$LOS) ~ data$BEDS)
reg.nurse <- lm(log(data$LOS) ~ data$NURSE)
reg.beds.nurse <- lm(log(data$LOS) ~ data$BEDS + data$NURSE)
summary(reg.beds)
summary(reg.nurse)
summary(reg.beds.nurse)

SLRs

Call:
lm(formula = log(data$LOS) ~ data$BEDS)
Coefficients:
             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)                                  < 2e-16 ***
data$BEDS                                       e-06 ***
---

Call:
lm(formula = log(data$LOS) ~ data$NURSE)
Coefficients:
             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)                                  < 2e-16 ***
data$NURSE                                      e-05 ***
---

BEDS + NURSE

> summary(reg.beds.nurse)

Call:
lm(formula = log(data$LOS) ~ data$BEDS + data$NURSE)

Residuals:
    Min      1Q  Median      3Q     Max

Coefficients:
             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)                                   <2e-16 ***
data$BEDS                                            *
data$NURSE
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error:       on 110 degrees of freedom
Multiple R-squared:      ,  Adjusted R-squared:
F-statistic:       on 2 and 110 DF,  p-value: 1.519e-05