Introduction to Logistic Regression Analysis Dr Tuan V. Nguyen Garvan Institute of Medical Research Sydney, Australia
Introductory example 1
Gender difference in preference for white wine. A group of 57 men and 167 women were asked to state their preference for a new white wine. The results are as follows:

Gender   Like   Dislike   ALL
Men        23        34    57
Women      35       132   167
ALL        58       166   224

Question: Is there a gender effect on the preference?
Introductory example 2
Fat concentration and preference. 435 samples of a sauce of varying fat concentration were tasted by consumers. There were two outcomes: like or dislike. The results are as follows:

Concentration   Like   Dislike   ALL
1.35              13         0    13
1.60              19         0    19
1.75              67         2    69
1.85              45         5    50
1.95              71         8    79
2.05              50        20    70
2.15              35        31    66
2.25               7        49    56
2.35               1        12    13

Question: Is there an effect of fat concentration on the preference?
Consideration …
The question in example 1 can be addressed by a “traditional” analysis such as the z-statistic or chi-square test.
The question in example 2 is more difficult to handle, as the factor (fat concentration) was a continuous variable and the outcome was a categorical variable (like or dislike).
However, there is a better and more systematic method to analyze both sets of data: logistic regression.
Odds and odds ratio
Let P be the probability of preference; then the odds of preference is: O = P / (1 - P)

Gender   Like   Dislike   ALL   P(like)
Men        23        34    57     0.404
Women      35       132   167     0.210
ALL        58       166   224     0.259

O_men   = 0.404 / (1 - 0.404) = 23/34  = 0.676
O_women = 0.210 / (1 - 0.210) = 35/132 = 0.265

Odds ratio: OR = O_men / O_women = 0.676 / 0.265 = 2.55
(Meaning: the odds of preference is 2.55 times higher in men than in women)
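The odds and odds-ratio arithmetic above can be verified with a short script. (The slides' analyses use R; Python is used here only as a self-contained sketch of the same calculation.)

```python
# Wine-preference counts from the slides:
# men 23 like / 34 dislike, women 35 like / 132 dislike
men_like, men_dislike = 23, 34
women_like, women_dislike = 35, 132

# Probability of liking in each group
p_men = men_like / (men_like + men_dislike)          # 23/57  ~ 0.404
p_women = women_like / (women_like + women_dislike)  # 35/167 ~ 0.210

# Odds = P / (1 - P), which reduces to like/dislike counts
odds_men = p_men / (1 - p_men)        # = 23/34  ~ 0.676
odds_women = p_women / (1 - p_women)  # = 35/132 ~ 0.265

odds_ratio = odds_men / odds_women
print(round(odds_ratio, 2))  # 2.55, as in the slides
```

Note that the odds simplify to the raw like/dislike counts, which is why the odds ratio can be read directly off the 2x2 table.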
Meanings of odds ratio
OR > 1: the odds of preference is higher in men than in women
OR < 1: the odds of preference is lower in men than in women
OR = 1: the odds of preference in men is the same as in women
How do we assess the “significance” of an OR?
Computing variance of odds ratio
The significance of the OR can be tested by calculating its variance.
The variance of the OR is obtained indirectly by working on the logarithmic scale:
- Convert OR to log(OR)
- Calculate the variance of log(OR)
- Calculate the 95% confidence interval of log(OR)
- Convert back to the 95% confidence interval of the OR
Computing variance of odds ratio

Gender   Like   Dislike
Men        23        34
Women      35       132
ALL        58       166

OR = (23/34) / (35/132) = 2.55
log(OR) = log(2.55) = 0.936
Variance of log(OR): V = 1/23 + 1/34 + 1/35 + 1/132 = 0.109
Standard error of log(OR): SE = sqrt(0.109) = 0.330
95% confidence interval of log(OR): 0.936 ± 1.96(0.330) = 0.289 to 1.584
Convert back to the 95% confidence interval of the OR:
exp(0.289) = 1.33 to exp(1.584) = 4.87
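The confidence-interval recipe above can be reproduced step by step. (Python here, as a self-contained check of the slide's arithmetic; the variance formula is the standard sum of reciprocal cell counts for a 2x2 table.)

```python
import math

# 2x2 table from the slides: men like/dislike, women like/dislike
a, b, c, d = 23, 34, 35, 132

or_hat = (a / b) / (c / d)          # odds ratio ~ 2.55
log_or = math.log(or_hat)           # ~ 0.936

# Variance of log(OR): sum of reciprocals of the four cell counts
var_log_or = 1/a + 1/b + 1/c + 1/d  # ~ 0.109
se_log_or = math.sqrt(var_log_or)   # ~ 0.330

# 95% CI on the log scale, then back-transform to the OR scale
lo = math.exp(log_or - 1.96 * se_log_or)  # lower limit, ~ 1.33
hi = math.exp(log_or + 1.96 * se_log_or)  # upper limit, ~ 4.87
print(round(or_hat, 2), round(lo, 2), round(hi, 2))
```

Because the interval excludes 1, the gender effect is statistically significant at the 5% level.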
Logistic analysis by R

Gender   Like   Dislike
Men        23        34
Women      35       132
ALL        58       166

sex     <- c(1, 2)
like    <- c(23, 35)
dislike <- c(34, 132)
total   <- like + dislike
prob    <- like/total
logistic <- glm(prob ~ sex, family="binomial", weights=total)

> summary(logistic)
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)
sex                                            **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance:     e+00 on 1 degrees of freedom
Residual deviance: e-15 on 0 degrees of freedom
AIC:
Logistic regression model for continuous factor

Concentration   Like   Dislike   % like
1.35              13         0     1.00
1.60              19         0     1.00
1.75              67         2     0.97
1.85              45         5     0.90
1.95              71         8     0.90
2.05              50        20     0.71
2.15              35        31     0.53
2.25               7        49     0.13
2.35               1        12     0.08
Analysis by using R conc <- c(1.35, 1.60, 1.75, 1.85, 1.95, 2.05, 2.15, 2.25, 2.35) like <- c(13, 19, 67, 45, 71, 50, 35, 7, 1) dislike <- c(0, 0, 2, 5, 8, 20, 31, 49, 12) total <- like+dislike prob <- like/total plot(prob ~ conc, pch=16, xlab="Concentration")
Logistic regression model for continuous factor – model
Let p = probability of preference.
The logit of p is: logit(p) = log[p / (1 - p)]
Model: logit(p) = α + β(FAT)
where α is the intercept and β is the slope, both of which have to be estimated from the data.
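The logit and its inverse can be written as two one-line functions (a Python sketch; the slides fit the model in R). The point of the transform is that the logit maps a probability in (0, 1) onto the whole real line, so a linear predictor α + β(FAT) can model it.

```python
import math

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1 - p))

def inv_logit(x):
    """Back-transform a log-odds value to a probability."""
    return 1 / (1 + math.exp(-x))

# p = 0.5 corresponds to log-odds 0; the two functions invert each other
print(logit(0.5))      # 0.0
print(inv_logit(0.0))  # 0.5
```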
Analysis by using R
logistic <- glm(prob ~ conc, family="binomial", weights=total)
summary(logistic)

Deviance Residuals:
    Min      1Q  Median      3Q     Max

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)                              <2e-16 ***
conc                                     <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance:     on 8 degrees of freedom
Residual deviance: on 7 degrees of freedom
AIC:
Logistic regression model for continuous factor – Interpretation
The odds ratio associated with each 0.1 increase in fat concentration was 2.90 (95% CI: 2.34, 3.59).
Interpretation: Each 0.1 increase in fat concentration was associated with a 2.9-fold increase in the odds of disliking the product. Since the 95% confidence interval excludes 1, this association was statistically significant at the p < 0.05 level.
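The per-0.1 odds ratio is obtained from the fitted slope as exp(β × 0.1). A small sketch of that conversion (Python; the slope value below is back-derived from the slide's reported OR of 2.90, not taken from the R output, so treat it as illustrative only):

```python
import math

def odds_ratio_for_increment(beta, delta):
    """OR associated with a `delta`-unit increase in the predictor,
    given a logistic-regression slope `beta`."""
    return math.exp(beta * delta)

# Illustrative slope, back-derived from the slide's OR of 2.90 per 0.1 unit.
# The sign depends on whether the model's outcome is coded as like or dislike.
beta = math.log(2.90) / 0.1  # ~ 10.65 on the "dislike" scale
print(round(odds_ratio_for_increment(beta, 0.1), 2))  # 2.9
```

The same exponentiation applied to the confidence limits of β × 0.1 yields the reported interval (2.34, 3.59).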
Multiple logistic regression
Data columns: id, fx (fracture: 0=no, 1=yes), age, bmi, bmd, ictp, pinp
Outcome: fx. Predictor variables: age, bmi, bmd, ictp, pinp
Question: Which variables are important for fracture?
Multiple logistic regression: R analysis
setwd("c:/works/stats")
fracture <- read.table("fracture.txt", header=TRUE, na.strings=".")
names(fracture)
fulldata <- na.omit(fracture)
attach(fulldata)
temp <- glm(fx ~ ., family="binomial", data=fulldata)
search <- step(temp)
summary(search)
Bayesian Model Average (BMA) analysis
library(BMA)
xvars <- fulldata[, 3:7]
y <- fx
bma.search <- bic.glm(xvars, y, strict=FALSE, OR=20, glm.family="binomial")
summary(bma.search)
imageplot.bma(bma.search)
Bayesian Model Average (BMA) analysis
> summary(bma.search)
Call:
Best 5 models (cumulative posterior probability = ):
           p!=0  EV  SD  model 1  model 2  model 3  model 4  model 5
Intercept
age
bmi
bmd
ictp
pinp

nVar
BIC
post prob
Bayesian Model Average (BMA) analysis > imageplot.bma(bma.search)
Summary of main points
The logistic regression model is used to analyze the association between a binary outcome and one or many determinants.
The determinants can be binary, categorical or continuous measurements.
The model is logit(p) = log[p / (1-p)] = α + βX, where X is a factor, and α and β must be estimated from observed data.
Summary of main points
Exp(β) is the odds ratio associated with a one-unit increment in the determinant X.
The logistic regression model can be extended to include many determinants:
logit(p) = log[p / (1-p)] = α + β1X1 + β2X2 + β3X3 + …
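Solving the multi-determinant model for p gives the familiar inverse-logit form, which the sketch below applies (Python for a self-contained check; the coefficient values are purely illustrative, not fitted values from any of the slides' data sets).

```python
import math

def predict_prob(alpha, betas, xs):
    """p = 1 / (1 + exp(-(alpha + sum(beta_i * x_i)))),
    the multiple logistic regression model solved for p."""
    eta = alpha + sum(b * x for b, x in zip(betas, xs))
    return 1 / (1 + math.exp(-eta))

# Illustrative coefficients only (hypothetical, not fitted values)
alpha, betas = -1.0, [0.5, -0.25]
p = predict_prob(alpha, betas, [2.0, 4.0])  # eta = -1 + 1 - 1 = -1
print(round(p, 3))  # 0.269
```

With no determinants and α = 0 the predicted probability is 0.5, i.e. even odds, which matches the OR = 1 case on the earlier slide.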