Applied Biostatistics: Lecture 2
Brian Healy, Ph.D.

Goals
At the end of this lecture, you will be able to:
- Perform a two sample comparison with a continuous outcome in R
- Perform a linear regression analysis in R
- Assess the assumptions of linear regression in R

Review
In the previous class, we focused on:
- Reading data into R
- Calculating summary statistics
- Creating graphics
- Performing a chi-squared test

Results for review from previous class
Let's import the dataset from last week. To assess the treatment effect on death, we can use the following commands:
chisq.test(table(rct$group, rct$death), correct=F)
prop.test(table(rct$group, rct$death), correct=F)
What happens if we use this command?
prop.test(table(rct$death, rct$group), correct=F)
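A minimal sketch of what swapping the table arguments does. The rct data frame here is a hypothetical stand-in (100 subjects per arm, invented death counts), since the lecture's dataset is not reproduced in this transcript:

```r
# Hypothetical stand-in for the rct data frame used in the lecture:
# 100 subjects per arm; 30 deaths on placebo, 20 on pravastatin
rct <- data.frame(
  group = rep(c("placebo", "pravastatin"), each = 100),
  death = rep(c(1, 0, 1, 0), times = c(30, 70, 20, 80))
)

# Rows = group: compares the outcome distribution between the two arms
prop.test(table(rct$group, rct$death), correct = FALSE)

# Rows = death: compares the group distribution between the dead and
# alive strata. The chi-squared p-value is identical (the statistic is
# symmetric in rows and columns), but the estimated proportions now
# answer a different question.
prop.test(table(rct$death, rct$group), correct = FALSE)
```

The lesson: prop.test treats the rows of the table as the groups being compared, so the argument order determines which proportions are reported.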

Example
For today's class, we will investigate whether male and female MS patients differ in terms of cognitive functioning. The dataset we will use is the Practice_data.csv dataset from the course website.

Variables
We can see the variables in the dataset using the names command:
names(data_prac)
From this command, we can see that there are seven variables in our dataset:
- male: indicator variable for whether the subject is male (=1 for male; =0 for female)
- age: subject's age
- dep: subject's depression score
- mcs, pcs: subject's mental and physical quality of life
- fatig: subject's fatigue score
- sdmt: subject's score on the Symbol Digit Modalities Test

Roles of variables
Outcome variable: SDMT score. The Symbol Digit Modalities Test (SDMT) is a brief measure of cognitive functioning that measures speed of information processing. This is a continuous variable.
Explanatory variable: male. This is a dichotomous variable.
Other variables: age and fatigue score. These are continuous variables that could influence the analysis.

Descriptive statistics
For a dichotomous variable, the descriptive statistics of interest are the number of subjects and the proportion:
table(data_prac$male)
prop.table(table(data_prac$male))
For a continuous variable, the descriptive statistics of interest are the mean and standard deviation:
mean(data_prac$sdmt)
sd(data_prac$sdmt)

Graphics
For a continuous variable, it is also a good idea to plot the data. Some plots look only at the outcome:
hist(data_prac$sdmt)
boxplot(data_prac$sdmt)
We can also plot the data along with predictor variables:
boxplot(data_prac$sdmt~data_prac$male)
plot(x=data_prac$age, y=data_prac$sdmt, ylab="SDMT", xlab="Age (years)")

Study design
In the previous lecture, we investigated the effect of pravastatin on specific events based on a randomized clinical trial (RCT). In an RCT, subjects are randomized to treatment group, which balances the groups on measured and unmeasured confounders, so we can infer that any statistically significant difference between the groups is due to the treatment. Today, we are performing an analysis of an observational study. We can't randomize subjects to gender, so we could observe differences for reasons other than gender, and a simple two group comparison might not be enough to analyze the data.

Univariate analysis
The next step in this analysis is to assess whether there is an association between gender and SDMT score. Summary statistics in each group:
Males:
mean(data_prac[data_prac$male==1,]$sdmt)
sd(data_prac[data_prac$male==1,]$sdmt)
Females:
mean(data_prac[data_prac$male==0,]$sdmt)
sd(data_prac[data_prac$male==0,]$sdmt)
Based on these statistics, there seems to be little difference between the groups.
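The repeated subsetting above can be collapsed with tapply, which applies a function within each level of a grouping variable. This sketch uses a hypothetical simulated stand-in for data_prac (Practice_data.csv is not reproduced in the transcript):

```r
# Hypothetical stand-in for data_prac: a male indicator and an sdmt score
set.seed(1)
data_prac <- data.frame(
  male = rep(c(0, 1), each = 50),
  sdmt = c(rnorm(50, mean = 49, sd = 9), rnorm(50, mean = 47, sd = 9))
)

# Mean and SD of sdmt within each level of male, one call each
tapply(data_prac$sdmt, data_prac$male, mean)
tapply(data_prac$sdmt, data_prac$male, sd)
```

Either approach gives the same group summaries; tapply simply avoids writing the subsetting expression twice per statistic.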

Two sample t-test
Despite the limited difference between the groups, we would like to formally test whether there is a difference. To test for a difference between groups with a continuous outcome, we use a two sample t-test. For the two sample t-test, we compare the mean SDMT score in the males to the mean SDMT score in the females, accounting for the variability in the data and the sample size. If there were no difference, what would we expect the difference in the means to equal?

Formula
Basic form of the test statistic: is the observed mean difference similar to the hypothesized mean difference, relative to the variability in the data? The denominator is calculated differently depending on whether we assume equal variance in the two groups.
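The formula itself was an image on the slide and did not survive transcription. Consistent with the description above, the standard equal-variance two sample t statistic is:

```latex
t = \frac{(\bar{x}_{1} - \bar{x}_{2}) - 0}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}},
\qquad
s_p^2 = \frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}
```

In the unequal-variance (Welch) version, which is R's default, the denominator is instead $\sqrt{s_1^2/n_1 + s_2^2/n_2}$ and the degrees of freedom are adjusted.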

Two sample t-test in R
To perform a two sample t-test in R, we use the following command:
t.test(data_prac$sdmt~data_prac$male)
The default in R is the unequal variance version of the two sample t-test. This test is valid even if the variances in the two groups are not equal. We can fit the equal variance t-test using this command:
t.test(data_prac$sdmt~data_prac$male, var.equal=T)
We will focus on the equal variance t-test for now.

Steps for hypothesis testing
1. State the null hypothesis
2. State the type of data for the explanatory and outcome variables
3. Determine the appropriate statistical test
4. State summary statistics if possible
5. Calculate the p-value (stat package)
6. Decide whether to reject or not reject the null hypothesis
7. Write a conclusion

Hypothesis test
H0: mean_females = mean_males
Explanatory: dichotomous; outcome: continuous
Two sample t-test
Estimated mean in females = 49.2; estimated mean in males = 47.0
Calculate p-value: p = 0.23
Fail to reject the null hypothesis because the p-value is greater than 0.05
Conclusion: The difference between the groups is not statistically significant.

Confidence interval
In addition to the p-value from the hypothesis test, we might also like to estimate the difference in the means between the groups. The point estimate is the difference in the group means: difference in group means = 2.2. How should we interpret the positive sign? The confidence interval provides a range of plausible values for the difference in the means. The t.test command provides this confidence interval as part of the output. How should we interpret this confidence interval?

What happened here?
Here is another t-test command:
t.test(data_prac$sdmt, data_prac$male)
What is the difference between this command and the previous command? Why was this command not appropriate?

Linear regression
An alternative approach to compare the groups is to use linear regression. Linear regression can be used to analyze a continuous outcome with dichotomous, continuous, or categorical predictors. The linear regression model to assess the association between gender and SDMT is:
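The model equation was an image on the slide and was lost in transcription. Given the indicator-variable setup described on the next slide (using the slides' b notation for the coefficients), it takes the standard form:

```latex
E(\mathrm{SDMT}_i) = b_0 + b_1\,\mathrm{male}_i
```

Equivalently, $\mathrm{SDMT}_i = b_0 + b_1\,\mathrm{male}_i + \varepsilon_i$, with the errors $\varepsilon_i$ assumed independent and normally distributed with equal variance (the assumptions revisited later in the lecture).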

Indicator variables
In the previous model, the variable male is equal to 0 for the females and equal to 1 for the males.
For the females, the model is: E(SDMT_i | male_i = 0) = b0
For the males, the model is: E(SDMT_i | male_i = 1) = b0 + b1
How should we interpret each coefficient? If there were no difference between the genders, what would b1 equal?

Fitting linear regression in R
There are two equivalent ways to fit a linear regression model with an indicator variable. Since the variable male is already equal to 0 or 1, we can enter it into the model directly:
model<-lm(sdmt~male, data=data_prac)
This command stores the results of the linear regression under the name model. Alternatively, we can tell R that male is a dichotomous/categorical variable using the factor notation:
model2<-lm(sdmt~factor(male), data=data_prac)

Linear regression results
To print the results from R, including the estimated coefficients and p-values, we use the summary command:
summary(model)

Hypothesis test
H0: mean_females = mean_males, i.e., H0: b1 = 0
Explanatory: dichotomous; outcome: continuous
Linear regression
Estimated b1 = -2.2
Calculate p-value: p = 0.23
Fail to reject the null hypothesis because the p-value is greater than 0.05
Conclusion: The difference between the groups is not statistically significant.

Confidence interval
To get the confidence interval for the coefficients, we can use the confint command applied to model. This provides the confidence interval for both of the beta coefficients. Does the confidence interval for male include the null value?
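A sketch of the confint call. Since the lecture's fitted model is not reproducible from the transcript, the lm fit named model is rebuilt here on hypothetical simulated data standing in for data_prac:

```r
# Hypothetical stand-in for data_prac; the real fit uses Practice_data.csv
set.seed(1)
data_prac <- data.frame(male = rep(c(0, 1), each = 50))
data_prac$sdmt <- 49 - 2.2 * data_prac$male + rnorm(100, sd = 9)
model <- lm(sdmt ~ male, data = data_prac)

# 95% confidence intervals for the intercept and the male coefficient
confint(model)

# Other levels are available via the level argument
confint(model, level = 0.90)
```

The row labeled male gives the interval for b1; if it contains 0 (the null value), the result is consistent with the non-significant p-value from summary(model).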

Try on your own
In addition to assessing the association between gender and SDMT scores, we might be interested in the effect of age on SDMT scores. Fit a linear regression in R with SDMT score as the outcome and age as the predictor.
- What was the estimated regression equation?
- Was age a statistically significant predictor?
- How did SDMT score change with increasing age?

Additional variables
Although we did not see an association in the univariate analysis of gender, this might not be the end of the story, since we did not have a randomized trial. We know that age and fatigue can have an important impact on SDMT scores. If there is a difference between the males and females in terms of age or fatigue, we would want to control for these in the analysis.

Multiple regression
The most common approach for handling multiple predictors in the same model is multiple regression. Multiple regression includes extra predictors on the right side of the equation, and all coefficients must be interpreted in the context of all of the other predictors in the model.
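The model equation on this slide was also an image and was lost in transcription. With the two predictors actually fit on the following slides (male and age, in the slides' b notation), it reads:

```latex
E(\mathrm{SDMT}_i) = b_0 + b_1\,\mathrm{male}_i + b_2\,\mathrm{age}_i
```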

Interpretation of coefficients
For this model, b0 is the mean SDMT score when male and age are both equal to 0. b1 is the change in the mean SDMT score comparing males to females, HOLDING age constant. b2 is the change in the mean SDMT score for a one-unit increase in age, HOLDING male constant.

Multiple linear regression in R
To fit this model in R, we can use the same command, adding the additional variable:
model3<-lm(sdmt~male+age, data=data_prac)
summary(model3)

Hypothesis test
H0: There is no difference between males and females controlling for age, i.e., H0: b1 = 0
Explanatory: dichotomous; outcome: continuous
Linear regression
Estimated b1 = -2.3
Calculate p-value: p = 0.20
Fail to reject the null hypothesis because the p-value is greater than 0.05
Conclusion: The difference between the groups controlling for age is not statistically significant.

Confidence interval
To get the confidence interval for the coefficients, we can use the confint command applied to model3. This provides the confidence interval for each of the beta coefficients. Does the confidence interval for male include the null value?

Regression assumptions
Linear regression has four main assumptions:
- Independence
- Linearity/correct model
- Homoscedasticity of the residuals
- Normality of the residuals
Each of the final three can be investigated by looking at regression diagnostic plots. For more information about regression diagnostics, please see http://www.statmethods.net/stats/rdiagnostics.html

Regression diagnostics
R also easily allows us to create a set of regression diagnostic plots:
par(mfrow=c(2,2))
plot(model3)

Plots
In addition to the residual plots, we might want to plot the relationship between age and SDMT score with different symbols for the males and females:
plot(x=data_prac$age, y=data_prac$sdmt, type="n", xlab="Age (years)", ylab="SDMT")
points(x=data_prac[data_prac$male==0,]$age, y=data_prac[data_prac$male==0,]$sdmt, col="red")
points(x=data_prac[data_prac$male==1,]$age, y=data_prac[data_prac$male==1,]$sdmt, col="blue")

Add fitted lines
We can add the fitted lines to the graph using these commands:
data_prac$predval<-predict(model3)
data_prac<-data_prac[order(data_prac$age),]
lines(x=data_prac[data_prac$male==0,]$age, y=data_prac[data_prac$male==0,]$predval, col="red")
lines(x=data_prac[data_prac$male==1,]$age, y=data_prac[data_prac$male==1,]$predval, col="blue")
What can we say about the fitted lines?

Try on your own
Fit a linear regression in R with SDMT score as the outcome and male, age, and fatigue as the predictors.
- What was the estimated regression equation?
- Was male a statistically significant predictor in this new model?

Interaction
In the previous model, we assumed that the lines in the two genders were parallel, meaning that the change with age was the same in the two genders. We can relax this assumption by adding an interaction term.
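The interaction model equation was an image on the slide and was lost in transcription. The model fit as model4 later in the lecture (in the slides' b notation) adds a product term to the male-and-age model:

```latex
E(\mathrm{SDMT}_i) = b_0 + b_1\,\mathrm{male}_i + b_2\,\mathrm{age}_i + b_3\,(\mathrm{male}_i \times \mathrm{age}_i)
```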

Interpretation of coefficients
Females (male_i = 0): E(SDMT_i) = b0 + b2*age_i
Males (male_i = 1): E(SDMT_i) = (b0 + b1) + (b2 + b3)*age_i
b3 is the difference in the slope comparing males and females. If b3 = 0, what would that mean?

Multiple linear regression in R
To fit this model in R, we can use the same command, adding the interaction term:
model4<-lm(sdmt~male+age+male:age, data=data_prac)
summary(model4)

Hypothesis test
H0: There is no interaction between age and gender in terms of the effect on SDMT score, i.e., H0: b3 = 0
Explanatory: dichotomous; outcome: continuous
Linear regression
Estimated b3 = -0.27
Calculate p-value: p = 0.097
Fail to reject the null hypothesis because the p-value is greater than 0.05
Conclusion: The interaction is not statistically significant.

Caution
Care must be taken when fitting interaction terms because the meaning of the other coefficients has changed. b1 is now the difference between the males and females when age = 0, and b2 is now the effect of a one-unit increase in age in the females. You must be careful to understand the meaning of your coefficients.

Goals
At the end of this lecture, you will be able to:
- Perform a two sample comparison with a continuous outcome in R
- Perform a linear regression analysis in R
- Assess the assumptions of linear regression in R