Logistic Regression and Generalized Linear Models: Blood Screening, Women’s Role in Society, and Colonic Polyps
Blood Screening ESR measurements (erythrocyte sedimentation rate) Study checks for the rate of increase of ESR when (blood proteins) fibrinogen and globulin increase Looking for an association between the probability of an ESR reading > 20mm/hr and the levels of the two plasma proteins. Less than 20mm/hr indicates a healthy individual
Multiple Regression Doesn’t Work Not Normally distributed The response variable is binary. In fact, the distribution is Binomial. The can be seen by looking at how the error term relates to the probability. If y = 1 then the error term is 1- P(y=1). We will assume for our Random Variables Y that
Logistic Regression, a Generalized Linear Model Modelling the expected value of the response requires a transformation This chapter is about the logit transformation of a model, which is of the form If the response variable is 1, the log odds of the response is the logit function of a probability. Fixing all explanatory variables but xj, exp(Bj) represents the odds that the response variable is 1 when xj increases by 1.
R commands Plasma_glm_1 <- glm(ESR ~ fibrinogen, data = plasma, family = binomial()) Fits the model, in this example specifying the logistic function is implied because of the binomial() parameter is already passed. Layout(matrix(1:2, ncol = 2)) Lets you choose the layout of your graph screen. Cdplot(ESR ~ fibrinogen, data = plasma) Cdplot(ESR ~ globulin, data = plasma) cdplot “plots conditional densities describing how the conditional distribution of a categorical variable y changes over a numerical variable x” (help(“cdplot”))
Interpretation The area of the dark region is the probability that ESR < 20 mm/hr, and this decreases as the protein levels increase There is not much shape to the globulin density function.
R Commands Confint(plasma_glm_1, parm = “fibrinogen”) Gives a confidence interval Output: Waiting for profiling to be done... 2.5 % 97.5 % 0.3389465 3.9988602 Exp(coef(plasma_glm_1)[“fibrinogen”]) This is necessary to do the reverse transformation to get the model coefficients fibrinogen 6.215715
R Commands Summary(plasma_glm_1) Output: Deviance Residuals: Min 1Q Median 3Q Max -0.9298 -0.5399 -0.4382 -0.3356 2.4794 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -6.8451 2.7703 -2.471 0.0135 * fibrinogen 1.8271 0.9009 2.028 0.0425 * --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 30.885 on 31 degrees of freedom Residual deviance: 24.840 on 30 degrees of freedom
R Commands Exp(confint(plasma_glm_1, parm = “fibrinogen”)) Output: 2.5 % 97.5 % 1.403468 54.535954 Plasma_glm_2 <- glm(ESR ~ fibrinogen + globulin, data = plasma, family = binomial()) Use logistic regression for both variables fibrinogen and globulin Summary(Plasma_glm_2)
R Commands Summary(Plasma_glm_2) Output Deviance Residuals: Min 1Q Median 3Q Max -0.9683 -0.6122 -0.3458 -0.2116 2.2636 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -12.7921 5.7963 -2.207 0.0273 * fibrinogen 1.9104 0.9710 1.967 0.0491 * globulin 0.1558 0.1195 1.303 0.1925 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 30.885 on 31 degrees of freedom Residual deviance: 22.971 on 29 degrees of freedom
Interpretation large confidence interval is because there are not many observations in all, and even fewer ESR > 20mm/hr The globulin coefficient is basically zero
R Commands Anova(plasma_glm_1, plasma_glm_2, test = “Chisq”) Compares the two models one of just fibrinogen, the other of both fibrinogen and globulin Why Chisquared test? ANOVA assumes normal distribution for each data set. Why is this OK? Output Analysis of Deviance Table Model 1: ESR ~ fibrinogen Model 2: ESR ~ fibrinogen + globulin Resid. Df Resid. Dev Df Deviance P(>|Chi|) 1 30 24.8404 2 29 22.9711 1 1.8692 0.1716 Comparing the residual deviance between the two models, we see that the difference is not significant. We must surmise that there is no association between globulin and ESR level.
R Commands Prob <- predict(plasma_glm_1, type = “response” The model of just fibrinogen useful in creating a neat looking bubble plot. First we must take the predicted values from the first model and use them to determine the size of the bubbles Plot(globulin ~ fibrinogen, data = plasma, xlim = c(2, 6), ylim = c(25, 50), pch = “.”) Plots the second model Symbols(plasma$fibrinogen, plasma$globulin, circles = prob, add = TRUE) Uses the values of the first model to create different bubble sizes.
Bubbles size is Probability The plot shows how the probability of having an ESR > 20mm/hr increases as fibrinogen increases
Generalized Linear Model Unifies the logistic regression, Analysis of Variance and multiple regression techniques. GLMs have three essential parts An error distribution- the distribution of the response variable. Normal for Analysis of Variance and Multiple Regression Binomial for Logistic Regression
Main parts of GLM (continued) A link function which links the explanatory variables to the expected value of the response. Logit function for logistic regression Identity function for ANOVA and multiple regression A variance function which shows the dependency of the response variable variability on the mean
Measure of Fit The deviance shows how well the model fits the data Comparing two model’s deviances Use a likelihood ratio test Compare using Chi-square distribution
Women’s Role in Society Response to “Women should take care of running their homes and leave the running the country up to men”. Factors: Education, Sex Response: Agree or Disagree data is presented as categories with counts for each education, sex combination
R Commands Womensrole_glm_1 <- glm(cbind(agree, disagree) ~ sex + education, data = womensrole, family= binomial()) This uses the cbind function to change data from two responses to one response that is a matrix of agree, disagree counts. Summary(womensrole_glm_1) We can see that education is a significant factor to the response.
Summary Output Deviance Residuals: Min 1Q Median 3Q Max -2.72544 -0.86302 -0.06525 0.84340 3.13315 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 2.50937 0.18389 13.646 <2e-16 *** Why is there a P-value for the intercept? sexFemale -0.01145 0.08415 -0.136 0.892 education -0.27062 0.01541 -17.560 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 451.722 on 40 degrees of freedom Residual deviance: 64.007 on 38 degrees of freedom
Declaring a function in R Myplot <- function(role.fitted) { Declares the function myplot and passes it an object “role.fitted” to be used in the function f <- womensrole$sex == “Female” Stores everything in the data with sex = femail in f plot(womensrole$education, role.fitted, type = "n", ylab = "probability of agreeing", xlab = "Education", ylim = c(0,1)) Plots education against the role.fitted object which will be some predicted values from our GLM defined as Role.fitted1 <- predict(womensrole_glm_1, type = “response”)
Myplot function (cont) lines(womensrole$education[!f], role.fitted[!f], lty = 1) lines(womensrole$education[f], role.fitted[f], lty = 2) These graph the lines, one for males and one for femails. Lty indicates the kind of line (in this case solid or dotted) lgtxt <- c("Fitted (Males)", "Fitted (Females)") legend("topright", lgtxt, lty = 1:2, bty = "n") These add a legend for each line.
Myplot Function (cont) y<-womensrole$agree/ (womensrole$agree + womensrole$disagree) A basic calculation of the proportion of women that agree text(womensrole$education, y, ifelse(f, "\\VE", "\\MA"), family = "HersheySerif", cex = 1.25) Plots y and not y using male/female symbols
Interpretation The two fitted lines indicate that sex does not change a probability of agreeing vs education The symbols of unfitted observations may indicate an interaction between sex and education
MyPlot Sex:Education By running the same analysis, e.g. Womensrole_glm_2 <- glm(cbind(agree, disagree) ~ sex*education, data = womensrole, family = binomial()) Summary(womensrole_glm_2) This summary shows a significant p-value for the sex:education interaction Role.fitted2 <- predict(womensrole_glm_1, type = “response”) Myplot(role.fitted2) The plot shows that less education is associated with agreement that women belong in the home.
The Deviance Residual One of the many methods for checking the adequacy of the model fit The deviance residual is the square root of the part of each observation that contributes to the deviance Res <- residuals( womensrole_glm_2, type = “deviance”) Pulls the residuals for many other models as well Plot(predict(womensrole_glm_2), res, xlab = “fitted values”, ylab = “Residuals”, lim = max(abs(res))*c(-1,1)) No visible pattern: fit appears ok Abline(h = 0, lty = 2) Adds a dotted line at height 0
Drug Treatment Testing Famlilial Andenomatous Polyposis (FAP) Counts of colonic polyps after 12 months of treatment Don’t want to be the guy that did that Placebo controlled (Binary factor) Age
GLM Analysis Multiple Regression won’t work. Count data strictly positive Normality is not probable Poisson Regression- GLM with a log link function Ensures a Poisson Distribution Ensures positive fitted amounts R command: Polyps_glm_1 <- glm(number ~ treat + age, data = polyps, family = poisson())
Model Summary treatdrug -1.359083 0.117643 -11.55 < 2e-16 *** Call: glm(formula = number ~ treat + age, family = poisson(), data = polyps) Deviance Residuals: Min 1Q Median 3Q Max -4.2212 -3.0536 -0.1802 1.4459 5.8301 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 4.529024 0.146872 30.84 < 2e-16 *** treatdrug -1.359083 0.117643 -11.55 < 2e-16 *** age -0.038830 0.005955 -6.52 7.02e-11 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for poisson family taken to be 1) Null deviance: 378.66 on 19 degrees of freedom Residual deviance: 179.54 on 17 degrees of freedom AIC: 273.88 Number of Fisher Scoring iterations: 5
Model Problem: Overdispersion In previous models the variance can be seen as dependent solely on the mean E.G. Binomial, Poisson In practice, this doesn’t always work. Sometimes the raw data points are not independent, there is some correlation, in this case, possible “clustering” Compare residual deviance and degrees of freedom to determine These should be basically equal
Model Solution: Quasi-Likelihood This procedure estimates the other factors that might contribute to the variance R Command Polyps_glm_2 <- glm(number ~ treat + age, data = polyps, family = quasipoisson()) Summary(polyps_glm_2) The coefficients are still significant, but less so
Homework Please run through the commands for the myplot function (pg 100) and send me the command script. Please send me what you thought of the presentation, give me a grade and add any constructive criticism. zweihanderdawg@gmail.com