Applied Biostatistics: Lecture 2
Brian Healy, Ph.D.
Goals
At the end of this lecture, you will be able to:
Perform a two sample comparison with a continuous outcome in R
Perform a linear regression analysis in R
Assess the assumptions of linear regression in R
Review
In the previous class, we focused on:
Reading data into R
Calculating summary statistics
Creating graphics
Performing a chi-squared test
Results for review from previous class
Let’s import the dataset from last week
To assess the treatment effect on death, we can use the following commands:
chisq.test(table(rct$group,rct$death), correct=F)
prop.test(table(rct$group,rct$death), correct=F)
What happens if we use this command:
prop.test(table(rct$death,rct$group), correct=F)
Example
For today’s class, we will focus on investigating whether there is a difference between male MS patients and female MS patients in terms of cognitive functioning
The dataset we will use is the Practice_data.csv dataset from the course website
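Before summarizing the data, the file needs to be read into R. A minimal sketch, assuming Practice_data.csv has been downloaded to the current working directory and has a header row (the object name data_prac matches the commands used below):
data_prac<-read.csv("Practice_data.csv")  # read the comma-separated file into a data frame
dim(data_prac)  # number of subjects (rows) and variables (columns)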
Variables
We can see the variables in the dataset using the names command:
names(data_prac)
From this command, we can see that there are seven variables in our dataset:
male: indicator variable of whether the subject is male or female (=1 for male; =0 for female)
age: subject’s age
dep: subject’s depression score
mcs, pcs: subject’s mental and physical quality of life
fatig: subject’s fatigue score
sdmt: subject’s score on the symbol digit modalities test
Roles of variables
Outcome variable: Symbol Digit Modalities Test (SDMT), a brief measure of cognitive functioning that measures speed of information processing; this is a continuous variable
Explanatory variable: male; this is a dichotomous variable
Other variables: age, fatigue score; these are continuous variables that could influence the analysis
Descriptive statistics
For a dichotomous variable, the descriptive statistics of interest are the number of subjects and the proportion:
table(data_prac$male)
prop.table(table(data_prac$male))
For a continuous variable, the descriptive statistics of interest are the mean and standard deviation:
mean(data_prac$sdmt)
sd(data_prac$sdmt)
Graphics
For a continuous variable, it is also a good idea to plot the data
Some plots look only at the outcome:
hist(data_prac$sdmt)
boxplot(data_prac$sdmt)
We can also plot the data along with predictor variables:
boxplot(data_prac$sdmt~data_prac$male)
plot(x=data_prac$age, y=data_prac$sdmt, ylab="SDMT", xlab="Age (years)")
Study design
In the previous lecture, we investigated the effect of pravastatin on specific events based on a randomized clinical trial (RCT)
In an RCT, subjects are randomized to treatment group, which balances the groups on measured and unmeasured confounders
We will infer that any statistically significant difference between the groups is due solely to the treatment
Today, we are performing an analysis of an observational study
We can’t randomize subjects to gender, so we could observe differences for reasons other than gender
A simple two group comparison might not be enough to analyze the data
Univariate analysis
The next step in this analysis will be to assess if there is an association between gender and SDMT score
Summary statistics in each group:
Males:
mean(data_prac[data_prac$male==1,]$sdmt)
sd(data_prac[data_prac$male==1,]$sdmt)
Females:
mean(data_prac[data_prac$male==0,]$sdmt)
sd(data_prac[data_prac$male==0,]$sdmt)
Based on these statistics, there does not seem to be much difference between the groups
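As a compact alternative to the subsetting commands above, the group-specific summaries can be computed in one step with tapply; this is just a sketch using base R, not a command from the slides:
tapply(data_prac$sdmt, data_prac$male, mean)  # mean SDMT by gender (0 = female, 1 = male)
tapply(data_prac$sdmt, data_prac$male, sd)    # standard deviation of SDMT by gender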
Two sample t-test
Despite the limited difference between the groups, we would like to formally test if there is a difference
To test for a difference between groups with a continuous outcome, we use a two sample t-test
For the two sample t-test, we compare the mean SDMT score in the males to the mean SDMT score in the females, accounting for the variability in the data and the sample size
If there were no difference, what would we expect the difference in the means to equal?
Formula
Basic form of the test statistic:
t = (observed difference in means − difference in means under the null hypothesis) / standard error of the difference
Is the observed mean difference similar to the hypothesized mean difference relative to the variability in the data?
The denominator is calculated differently based on whether we assume equal variance in the two groups
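To make the formula concrete, here is a sketch that computes the equal-variance (pooled) version of the test statistic by hand; the result should agree with the t.test output shown on the next slide:
x_f<-data_prac[data_prac$male==0,]$sdmt  # SDMT scores for females
x_m<-data_prac[data_prac$male==1,]$sdmt  # SDMT scores for males
n_f<-length(x_f); n_m<-length(x_m)
s2_pool<-((n_f-1)*var(x_f)+(n_m-1)*var(x_m))/(n_f+n_m-2)  # pooled variance estimate
se_diff<-sqrt(s2_pool*(1/n_f+1/n_m))                      # standard error of the difference in means
t_stat<-(mean(x_f)-mean(x_m)-0)/se_diff                   # hypothesized difference under H0 is 0
2*pt(-abs(t_stat), df=n_f+n_m-2)                          # two-sided p-value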
Two sample t-test in R
To perform a two sample t-test in R, we use the following command:
t.test(data_prac$sdmt~data_prac$male)
The default in R is the unequal variance version of the two sample t-test
This test is valid even if the variances in the two groups are not equal
We can fit the equal variance t-test using this command:
t.test(data_prac$sdmt~data_prac$male, var.equal=T)
We will focus on the equal variance t-test for now
Steps for hypothesis testing
State null hypothesis
State type of data for explanatory and outcome variable
Determine appropriate statistical test
State summary statistics if possible
Calculate p-value (stat package)
Decide whether to reject or not reject the null hypothesis
Write conclusion
Hypothesis test
H0: mean in females = mean in males
Explanatory: dichotomous, outcome: continuous
Two sample t-test
Estimated mean in females = 49.2, estimated mean in males = 47.0
Calculate p-value: p = 0.23
Fail to reject the null hypothesis because p-value is greater than 0.05
Conclusion: The difference between the groups is not statistically significant.
Confidence interval
In addition to the p-value from the hypothesis test, we might also like to estimate the difference in the means between the groups
The point estimate is the difference in the group means
Difference in group means = 2.2
How should we interpret the positive sign?
The confidence interval provides a range of plausible values for the difference in the means
The t.test command provides this confidence interval as part of the output
How should we interpret this confidence interval?
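The relevant pieces can also be pulled directly out of the t.test result; a short sketch, assuming the equal-variance test from the previous slides is stored in an object:
tt<-t.test(data_prac$sdmt~data_prac$male, var.equal=T)
tt$estimate[1]-tt$estimate[2]  # point estimate: mean in females minus mean in males
tt$conf.int                    # 95% confidence interval for the difference in means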
What happened here?
Here is another t-test command:
t.test(data_prac$sdmt, data_prac$male)
What is the difference between this command and the previous command?
Why was this command not appropriate?
Linear regression
An alternative approach to compare the groups is to use linear regression
Linear regression can be used to analyze a continuous outcome with dichotomous, continuous or categorical predictors
The linear regression model to assess the association between gender and SDMT is:
E(SDMTi) = b0 + b1*malei
Indicator variables
In the previous model, the variable male is equal to 0 for the females and equal to 1 for the males
For the females, the model is: E(SDMTi|malei=0) = b0
For the males, the model is: E(SDMTi|malei=1) = b0 + b1
How should we interpret each coefficient?
If there were no difference between the genders, what would b1 equal?
Fitting linear regression in R
There are two equivalent ways to fit a linear regression model with an indicator variable
Since the variable male is already equal to 0 or 1, we can enter it into the model directly:
model<-lm(sdmt~male, data=data_prac)
This command stores the results of the linear regression under the name model
Alternatively, we can tell R that male is a dichotomous/categorical variable using the factor notation:
model2<-lm(sdmt~factor(male), data=data_prac)
Linear regression results
To print the results from R, including the estimated coefficients and p-values, we use the summary command:
summary(model)
Hypothesis test
H0: mean in females = mean in males
H0: b1 = 0
Explanatory: dichotomous, outcome: continuous
Linear regression
Estimated b1 = -2.2
Calculate p-value: p = 0.23
Fail to reject the null hypothesis because p-value is greater than 0.05
Conclusion: The difference between the groups is not statistically significant.
Confidence interval
To get the confidence interval for the coefficients, we can use the confint command applied to model
This provides the confidence interval for both of the beta coefficients
Does the confidence interval for male include the null value?
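A minimal sketch of the command for the model fit earlier:
confint(model)             # 95% confidence intervals for the intercept and the male coefficient
confint(model, level=0.90) # a different confidence level, if desired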
Try on your own
In addition to assessing the association between gender and SDMT scores, we might be interested in the effect of age on SDMT scores
Fit a linear regression in R with SDMT score as the outcome and age as the predictor
What was the estimated regression equation?
Was age a statistically significant predictor?
How did SDMT score change with increasing age?
Additional variables
Although we did not see an association in the univariate analysis of gender, this might not be the end of the story since we did not have a randomized trial
We know that age and fatigue can have an important impact on SDMT scores
If there is a difference between the males and females in terms of age or fatigue, we would want to control for these in the analysis
Multiple regression
The most common approach for handling multiple predictors in the same model is multiple regression
Model: E(SDMTi) = b0 + b1*malei + b2*agei
Multiple regression includes extra predictors on the right side of the equation
All coefficients must be interpreted in light of the other variables in the model
Interpretation of coefficients
For this model, b0 is the mean SDMT score when male and age are both equal to 0
b1 is the change in the mean SDMT score comparing males to females, HOLDING age constant
b2 is the change in the mean SDMT score for a one-unit increase in age, HOLDING male constant
Multiple linear regression in R
To fit this model in R, we can use the same command, adding the additional variable:
model3<-lm(sdmt~male+age, data=data_prac)
summary(model3)
Hypothesis test
H0: There is no difference between males and females controlling for age
H0: b1 = 0
Explanatory: dichotomous, outcome: continuous
Linear regression
Estimated b1 = -2.3
Calculate p-value: p = 0.20
Fail to reject the null hypothesis because p-value is greater than 0.05
Conclusion: The difference between the groups controlling for age is not statistically significant.
Confidence interval
To get the confidence interval for the coefficients, we can use the confint command applied to model3
This provides the confidence interval for each of the beta coefficients
Does the confidence interval for male include the null value?
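The same command works for the age-adjusted model; a short sketch:
confint(model3)  # 95% confidence intervals for the intercept, male, and age coefficients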
Regression assumptions
Linear regression has four main assumptions:
Independence
Linearity/correct model
Homoscedasticity of residuals
Normality of residuals
Each of the final three can be investigated by looking at regression diagnostic plots
For more information about regression diagnostics, please see http://www.statmethods.net/stats/rdiagnostics.html
Regression diagnostics
R also easily allows us to create a set of regression diagnostic plots:
par(mfrow=c(2,2))
plot(model3)
Plots
In addition to the residual plots, we might want to plot the relationship between age and SDMT score with different symbols for the males and females:
plot(x=data_prac$age, y=data_prac$sdmt, type="n", xlab="Age (years)", ylab="SDMT")
points(x=data_prac[data_prac$male==0,]$age, y=data_prac[data_prac$male==0,]$sdmt, col="red")
points(x=data_prac[data_prac$male==1,]$age, y=data_prac[data_prac$male==1,]$sdmt, col="blue")
Add fitted lines
We can add the fitted lines to the graph using these commands:
data_prac$predval<-predict(model3)
data_prac<-data_prac[order(data_prac$age),]
lines(x=data_prac[data_prac$male==0,]$age, y=data_prac[data_prac$male==0,]$predval, col="red")
lines(x=data_prac[data_prac$male==1,]$age, y=data_prac[data_prac$male==1,]$predval, col="blue")
What can we say about the fitted lines?
Try on your own
Fit a linear regression in R with SDMT score as the outcome and male, age, and fatigue as the predictors
What was the estimated regression equation?
Was male a statistically significant predictor in this new model?
Interaction
In the previous model, we assumed that the lines in the two genders were parallel
This means that the change with age was the same in the two genders
We can relax this assumption by adding an interaction term:
E(SDMTi) = b0 + b1*malei + b2*agei + b3*malei*agei
Interpretation of coefficients
Females: E(SDMTi|malei=0) = b0 + b2*agei
Males: E(SDMTi|malei=1) = (b0 + b1) + (b2 + b3)*agei
b3 is the difference in the slope comparing males and females
If b3=0, what would that mean?
Multiple linear regression in R
To fit this model in R, we can use the same command, adding the interaction term:
model4<-lm(sdmt~male+age+male:age, data=data_prac)
summary(model4)
Hypothesis test
H0: There is no interaction between age and gender in terms of the effect on SDMT score
H0: b3 = 0
Explanatory: dichotomous, outcome: continuous
Linear regression
Estimated b3 = -0.27
Calculate p-value: p = 0.097
Fail to reject the null hypothesis because p-value is greater than 0.05
Conclusion: The interaction is not statistically significant.
Caution
Care must be taken when fitting interaction terms because the meaning of the other coefficients has changed
b1 is now the difference between the males and females when age=0
b2 is now the effect of a one-unit increase in age in the females
You must be careful to understand the meaning of your coefficients
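One way to keep the interpretation straight is to reconstruct the gender-specific slopes directly from the fitted coefficients; a sketch using the model4 fit from the earlier slide:
b<-coef(model4)
b["age"]                 # slope of age in the females (male=0): b2
b["age"]+b["male:age"]   # slope of age in the males (male=1): b2 + b3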
Goals
At the end of this lecture, you will be able to:
Perform a two sample comparison with a continuous outcome in R
Perform a linear regression analysis in R
Assess the assumptions of linear regression in R