Published by Clinton Ross. Modified over 8 years ago.
1
Marshall University School of Medicine
Department of Biochemistry and Microbiology
BMS 617 Lecture 13: Multiple, Logistic and Proportional Hazards Regression
Marshall University Genomics Core Facility
2
Multiple Regression
In linear regression, we had one independent variable and one dependent (outcome) variable
  – In lab experiments, this is fairly common
  – The investigator manipulates the value of one variable and keeps everything else the same
In some lab experiments, and in most observational studies, there is more than one independent variable
  – Multiple regression is used for these scenarios
  – "Multiple regression" really refers to a collection of different techniques
3
Aims of Multiple Regression
Quantifying the effect of one variable of interest while adjusting for the effects of other variables
  – Very common in observational studies
  – The other variables change outside the control of the investigator
  – These other variables are often called covariates
Creating an equation that is useful for predicting the value of the outcome variable given the values of the various independent variables
  – For example, predict the probability of cancer recurrence after surgery alone, given characteristics of the tumor (grade, stage, etc.) and of the patient (age, height, weight, etc.)
      – Might be used to decide whether or not to use chemotherapy in addition to surgery
Developing a scientific understanding of the impact of several variables on the outcome
4
Types of Multiple Regression
We will look at the following types of multiple regression (there are many others):
  – Multiple Linear Regression
      – The dependent variable is a linear function of the independent variables
  – Logistic Regression
      – The outcome variable is binary (dichotomous, i.e. categorical with two possible outcomes)
      – The log odds of the outcome is modeled as a linear function of the independent variables
  – Proportional Hazards Regression
      – Used when the outcome is the elapsed time to a non-recurring event
      – Effectively used to compute the effect of independent variables on a survival curve
5
Multiple Linear Regression
Multiple linear regression finds the linear equation which best predicts an outcome variable, $Y$, from multiple independent variables $X_1, X_2, \dots, X_k$
Example (from Motulsky): Lead Exposure and Kidney Function
  – Staessen et al. (1992) investigated the relationship between lead concentration in the blood and kidney function
      – Kidney function was measured by creatinine clearance
  – Observational study of 965 men
  – A naive approach would be to measure lead concentration and creatinine clearance and analyze just those two variables
  – However, kidney function is known to decrease with age, and lead accumulates in the blood over time
      – Age is a confounding variable
      – Must account for this
6
Multiple Regression Model
The model Staessen et al. used was
$$Y_i = \beta_0 + \beta_1 X_{i,1} + \beta_2 X_{i,2} + \beta_3 X_{i,3} + \beta_4 X_{i,4} + \beta_5 X_{i,5} + \varepsilon_i$$
where the variables are

Variable          Description
$Y_i$             Creatinine clearance of subject i
$X_{i,1}$         log(serum lead) of subject i
$X_{i,2}$         Age of subject i
$X_{i,3}$         Body mass of subject i
$X_{i,4}$         log(GGT) of subject i (liver function)
$X_{i,5}$         1 if subject i had previously taken diuretics, 0 otherwise
$\varepsilon_i$   Random scatter
7
Multiple Regression Parameters
The $\beta$ values in the equation for the model are the parameters of the model
  – Do not vary from data point to data point
  – Are values associated with the population
  – Will be estimated from the data
Note that one of the variables ($X_{i,5}$) is categorical, and we use a “dummy variable” in its place
8
What multiple regression does
Multiple linear regression finds values for the parameters that make the model predict the actual data as well as possible (by minimizing the sum of squared differences between predicted and actual values)
Estimates for $\beta_0, \dots, \beta_5$ are usually denoted $b_0, \dots, b_5$
Software performing the regression will report the best estimate for each parameter, a confidence interval and p-value for each estimate, and an $R^2$ value for the model
The null hypothesis for each p-value is that the variable provides no information to the model, i.e. that the parameter is zero
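As a rough illustration of what the software does, the sketch below fits a multiple linear regression by least squares using only the Python standard library. The data and the function name `fit_linear_model` are hypothetical (they are not from Staessen et al.), and real statistics packages additionally report confidence intervals, p-values and R², which this sketch omits.

```python
def fit_linear_model(rows, y):
    """Least-squares estimates b_0..b_k for y = b_0 + b_1*x_1 + ... + b_k*x_k.

    rows: list of observations, each a list of k independent-variable values.
    y:    list of outcome values, one per observation.
    Solves the normal equations (X^T X) b = X^T y by Gauss-Jordan elimination.
    """
    X = [[1.0] + [float(v) for v in row] for row in rows]  # leading 1 = intercept
    p = len(X[0])
    # Build the augmented matrix [X^T X | X^T y].
    A = [[sum(r[i] * r[j] for r in X) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(X, y))]
         for i in range(p)]
    for col in range(p):
        # Partial pivoting: bring the largest remaining entry onto the diagonal.
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        # Scale the pivot row, then eliminate this column from all other rows.
        A[col] = [v / A[col][col] for v in A[col]]
        for r in range(p):
            if r != col:
                factor = A[r][col]
                A[r] = [v - factor * w for v, w in zip(A[r], A[col])]
    return [A[i][p] for i in range(p)]


# Hypothetical data generated exactly from y = 1 + 2*x1 - 3*x2 (no random scatter),
# so the fit should recover the coefficients 1, 2 and -3 up to rounding error.
rows = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1], [3, 5]]
y = [1 + 2 * x1 - 3 * x2 for x1, x2 in rows]
b = fit_linear_model(rows, y)
print([round(v, 6) for v in b])
```

With real data the outcome includes random scatter, so the estimates only approximate the population parameters; that is why the software reports a confidence interval for each one.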
9
Interpreting the Coefficients
The coefficients can be interpreted in a similar way to the slope estimate in simple linear regression
  – Each represents the change in the dependent variable for a one-unit increase in the corresponding independent variable, keeping all the other independent variables fixed
In the example, $b_1$ (the estimate for log(lead concentration)) was -9.5 ml/min, with a 95% CI of [-18.1, -0.9].
  – This means that for every one-unit increase in log(lead concentration), creatinine clearance decreased by 9.5 ml/min on average, if all other variables were kept fixed.
10
Statistical Significance of the Coefficients
A one-unit increase in log(lead concentration) means a 10-fold increase in lead concentration (the logarithm here is base 10)
So the average decrease in creatinine clearance corresponding to a 10-fold increase in lead concentration was 9.5 ml/min, and the 95% confidence interval for the decrease was 0.9 ml/min to 18.1 ml/min.
  – Since the 95% CI does not contain 0, the p-value for this coefficient must be less than 0.05
  – This is the p-value for the null hypothesis that the coefficient is zero
Alternatively, think of this as a comparison of models:
  – Compare the full model (including this variable) to the model not including this variable
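The arithmetic behind this interpretation can be sketched as follows. The estimate and confidence interval are the ones quoted for log(lead concentration); the function names are hypothetical, and the sketch assumes a base-10 logarithm, as the 10-fold statement implies.

```python
import math

B1 = -9.5                 # ml/min per unit of log10(lead), as quoted for the example
B1_CI = (-18.1, -0.9)     # 95% confidence interval, as quoted for the example


def clearance_change(fold_increase, coef=B1):
    """Average change in creatinine clearance (ml/min) for a given fold
    increase in lead concentration, other variables held fixed.
    Assumes the model uses a base-10 logarithm of lead concentration."""
    return coef * math.log10(fold_increase)


def excludes_zero(ci):
    """True when the interval lies entirely on one side of zero."""
    low, high = ci
    return low > 0 or high < 0


print(clearance_change(10))      # a 10-fold increase is one unit of log10(lead)
print(clearance_change(2))       # a doubling of lead concentration
print(excludes_zero(B1_CI))      # True corresponds to p < 0.05 for this coefficient
```

Because the model is linear in log(lead), any fold change scales the coefficient by the log of that fold change, so a doubling corresponds to a smaller average decrease than a 10-fold increase.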
11
Interpreting coefficients for “dummy variables”
One of the variables in the model was really a binary variable
  – Has the subject previously taken diuretics?
  – Coded as 0 for no and 1 for yes
The estimate for the coefficient for this variable was -8.8 ml/min
  – An increase of one unit in this variable corresponds to a decrease in creatinine clearance of 8.8 ml/min, on average
  – Since the only values are 0 and 1, this means that participants who had previously taken diuretics had an average creatinine clearance 8.8 ml/min lower than those who had not, if all other variables are held equal
12
Interpreting the $R^2$ value for the model
Multiple linear regression reports an $R^2$ value
  – For our example, $R^2$ is 0.27
  – This means that 27% of the variation in creatinine clearance is accounted for by the model
  – The remaining 73% is due to random scatter, or is associated with variables not included in the model
Unlike simple linear regression, we cannot plot a graph of the model
  – One approach to visualizing the model is to plot the predicted outcome variable from the model against the actual measured value
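A minimal sketch of how the "variation accounted for" is computed from measured and predicted values. The numbers below are hypothetical, not the study data, and `r_squared` is an illustrative name; the result matches the model's R² when the predictions come from a least-squares fit that includes an intercept.

```python
def r_squared(observed, predicted):
    """Fraction of the variation in `observed` accounted for by `predicted`:
    R^2 = 1 - SS_residual / SS_total."""
    mean_obs = sum(observed) / len(observed)
    ss_total = sum((o - mean_obs) ** 2 for o in observed)
    ss_residual = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return 1 - ss_residual / ss_total


# Hypothetical measured and model-predicted values (not from the study).
observed = [90.0, 100.0, 110.0, 120.0, 130.0]
predicted = [95.0, 100.0, 105.0, 125.0, 125.0]
print(r_squared(observed, predicted))
```

The same pairs of (predicted, measured) values are what the predicted-versus-actual plot displays: the closer the points lie to the diagonal, the higher the R².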
13
Multiple Linear Regression Plot
[Figure not reproduced in this transcript]
14
Variable Selection
The authors of the article collected much more data
  – Stated that other variables did not improve the fit of the model
Adding additional parameters will almost always increase the $R^2$ value
  – Should use the sum-of-squares F test explained earlier to test if there really is an improvement in the model
  – Beware of overfitting (explained later)
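The F ratio for that comparison can be sketched as below. The sums of squares and sample size are hypothetical, and `extra_ss_f` is an illustrative name. The p-value would then come from the F distribution with (df_simple - df_full, df_full) degrees of freedom, which this sketch does not compute.

```python
def extra_ss_f(ss_simple, df_simple, ss_full, df_full):
    """F ratio comparing a simpler model to a fuller model that nests it.

    ss_*: residual sum of squares of each fitted model.
    df_*: residual degrees of freedom (data points minus fitted parameters).
    """
    numerator = (ss_simple - ss_full) / (df_simple - df_full)
    denominator = ss_full / df_full
    return numerator / denominator


# Hypothetical: 50 data points; the simpler model has 3 parameters, the fuller has 4.
f_ratio = extra_ss_f(ss_simple=1200.0, df_simple=47, ss_full=1000.0, df_full=46)
print(f_ratio)
```

A large F ratio means the extra variable reduced the residual sum of squares by more than would be expected from adding an uninformative parameter.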
15
Logistic Regression
Logistic regression is used when the outcome variable is binary
  – i.e. categorical with two possible outcomes
The general idea is to build a multiple linear model with the outcome variable being the log of the odds
  – i.e. we build a model predicting the log of the odds of one of the two outcomes from the independent variables
  – the parameters describe the change in the log odds when the variables change by one unit
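Written out, the idea is roughly the following. Here $p_i$, the probability that subject i has the outcome of interest, is notation introduced for this illustration; the first equation is the linear model for the log odds, and the second is the same relationship solved for the probability.

```latex
\log\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 X_{i,1} + \dots + \beta_k X_{i,k}
\quad\Longrightarrow\quad
p_i = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_{i,1} + \dots + \beta_k X_{i,k})}}
```

Increasing $X_{i,j}$ by one unit adds $\beta_j$ to the log odds, which is the same as multiplying the odds by $e^{\beta_j}$.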
16
Logistic Regression Example
We performed chart reviews on 99 post-menopausal women
Ran a logistic regression for an outcome of diabetes with age at menopause, smoking status, and BMI as independent variables
17
Logistic Regression Results
[Figure not reproduced in this transcript]
18
Interpreting Logistic Regression Results
The "Model Summary" box describes how well the model fits the data.
  – -2 Log likelihood is computed from the likelihood of our observed data given the model. Since the likelihood must be between 0 and 1, its log is never positive, so -2 log likelihood is never negative; a smaller value means a better fit. (Our data do not fit the model well.)
  – $R^2$ cannot be calculated in the same way for logistic regression. The remaining two values give two alternative approaches, and their interpretation is similar to that of a regular $R^2$. Again, our data do not fit the model well.
The "Classification Table" describes the accuracy of using the model as a predictor.
  – Use the independent variables to compute the predicted odds, and predict the class that is most likely
  – Note that adding more variables will almost always improve the accuracy on the data used to fit the model; accuracy should really be tested on an independent data set
19
Interpreting the Logistic Regression Parameters
The "Variables in the Equation" box gives the parameter estimates, 95% CIs, and p-values
The parameter for Smoking is 1.204.
  – This means that a one-unit increase in the smoking variable results in an increase in the log odds of 1.204.
  – Logs here are natural logs, so the odds increase $e^{1.204} \approx 3.33$-fold; this factor is the odds ratio
  – This is a dummy variable, so a smoker has about 3.3 times the odds of becoming diabetic as a non-smoker
The parameter for BMI is 0.072; $e^{0.072} = 1.075$, so an increase of one unit in BMI results in a 1.075-fold increase in the odds of being diabetic.
The p-values and 95% CIs show that the parameter for smoking is significant at a significance level of 0.05. BMI has a p-value of 0.055, so it is not significant at that level.
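The conversion from coefficients to fold changes in odds can be sketched as below, using the two coefficients quoted above. The dictionary keys are illustrative names. Values computed from these rounded coefficients can differ in the third decimal place from figures a statistics package derives from unrounded estimates.

```python
import math

# Coefficients (changes in log odds per unit increase) as quoted for the example.
coefficients = {"smoking": 1.204, "bmi": 0.072}

# Exponentiating a coefficient gives the fold change in odds per unit increase.
odds_ratios = {name: math.exp(b) for name, b in coefficients.items()}
for name, ratio in odds_ratios.items():
    print(f"{name}: odds multiplied by {ratio:.3f} per unit increase")

# Effects are multiplicative: a 5-unit BMI increase multiplies the odds by
# exp(5 * 0.072), which equals exp(0.072) raised to the 5th power.
five_unit_bmi = math.exp(5 * coefficients["bmi"])
print(f"5 BMI units: odds multiplied by {five_unit_bmi:.3f}")
```

This is why a coefficient that looks small per unit, such as the one for BMI, can still correspond to a sizeable change in odds across a realistic range of the variable.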
20
Mathematical Model for Logistic Regression
The mathematical setup for logistic regression is:
$$\log(\mathrm{odds}_i) = \beta_0 + X_{i,1}\beta_1 + \dots + X_{i,k}\beta_k$$
where the variables are
  – $\mathrm{odds}_i$: odds of the outcome for subject i
  – $X_{i,j}$: value of variable j for subject i
For our model, the estimates give (with $S$ the smoking dummy variable and $B$ the BMI)
$$\log(\mathrm{odds}) = -3.307 + 1.208\,S + 0.071\,B$$
$$\mathrm{odds} = e^{-3.307 + 1.208 S + 0.071 B}$$
$$\mathrm{odds} = e^{-3.307}\, e^{1.208 S}\, e^{0.071 B} \approx 0.037 \times 3.347^{S} \times 1.074^{B}$$
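As a sketch of how the fitted equation would be used for prediction, the code below evaluates it for two hypothetical subjects. Only the intercept, smoking and BMI terms shown in the equation are used; the function names and the example BMI value are assumptions made for illustration.

```python
import math


def predicted_odds(smoker, bmi):
    """Odds of diabetes from the fitted equation quoted for the example.
    smoker: 1 for a smoker, 0 for a non-smoker (the dummy variable S).
    bmi:    body mass index (the variable B)."""
    log_odds = -3.307 + 1.208 * smoker + 0.071 * bmi
    return math.exp(log_odds)


def odds_to_probability(odds):
    """Convert odds to a probability: p = odds / (1 + odds)."""
    return odds / (1 + odds)


# Hypothetical subjects: a non-smoker and a smoker, both with a BMI of 30.
for smoker in (0, 1):
    odds = predicted_odds(smoker, bmi=30)
    print(smoker, round(odds, 3), round(odds_to_probability(odds), 3))
```

Whatever BMI is chosen, the ratio of the smoker's odds to the non-smoker's odds is the same factor, which is what makes the exponentiated coefficient interpretable as an odds ratio.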
21
Proportional Hazards Regression
Proportional hazards regression is used when the outcome is elapsed time to a non-recurring event
  – i.e. the same basic scenario as for survival analysis
  – We previously compared two groups for different survival rates using the Mantel-Cox test
  – Computed the hazard ratio between the two groups
22
Proportional Hazards extends Mantel-Cox test
In proportional hazards regression, we estimate the effect of multiple factors on the hazard ratio
  – Can be used to correct the hazard ratio for confounding variables
Short et al. (2012) compared survival curves for two different treatments of COPD
  – Computed a “crude” hazard ratio using Mantel-Cox, and then a hazard ratio corrected for covariates (confounding variables)
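The slides do not write the model out; the standard form of the proportional hazards (Cox) model is sketched below. The symbols $h$ (hazard at time $t$) and $h_0$ (baseline hazard when all variables are zero) are notation introduced here for illustration.

```latex
h(t \mid X_1, \dots, X_k) = h_0(t)\, e^{\beta_1 X_1 + \dots + \beta_k X_k}
\qquad
\frac{h(t \mid X_j + 1)}{h(t \mid X_j)} = e^{\beta_j}
```

The second expression holds with all other variables fixed: $e^{\beta_j}$ is the hazard ratio for a one-unit increase in $X_j$, adjusted for the other covariates, in the same way that $e^{\beta_j}$ is an adjusted odds ratio in logistic regression.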
23
Summary
Multiple linear regression fits a dependent variable as a linear model of multiple independent variables
  – Provides parameter estimates for each independent variable, along with confidence intervals and p-values
  – The null hypothesis for each p-value is that the variable doesn't contribute to the model
  – Used for finding the effect of a variable while correcting for confounding variables
Logistic regression is used when the dependent variable is binary
  – Models the log odds as a linear function of the independent variables
  – Parameters are the increase in log odds per unit increase in the independent variable