Lecture 7: GLMs II, Binomial Family
Olivier MISSA, Advanced Research Skills
2 Outline
Continue our introduction to Generalized Linear Models.
In this lecture: illustrate the use of GLMs for proportion and binary data.
3 Reminder
Binary and proportion data tend to follow the Binomial distribution.
The canonical link of this GLM family is the logit function: logit(p) = log( p / (1 - p) ).
The variance reaches a maximum for intermediate values of p (at p = 0.5) and a minimum (zero) at either 0% or 100%.
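The reminder above can be checked numerically. The following is a minimal sketch: the helper names `logit`, `inv_logit` and `var_fun` are made up for illustration, while `qlogis()` and `plogis()` are the base R equivalents of the link and its inverse.

```r
# Logit link and its inverse, written out by hand (illustrative helper names)
logit     <- function(p) log(p / (1 - p))
inv_logit <- function(eta) 1 / (1 + exp(-eta))

p <- seq(0.1, 0.9, by = 0.1)

# Base R provides the same pair as qlogis() / plogis()
all.equal(logit(p), qlogis(p))
all.equal(inv_logit(logit(p)), p)

# Binomial variance function for a proportion: p * (1 - p)
var_fun <- function(p) p * (1 - p)
grid <- seq(0, 1, by = 0.01)
peak <- grid[which.max(var_fun(grid))]   # value of p where the variance is largest
range(var_fun(grid))                     # smallest and largest variance on the grid
```

The variance function is symmetric around its peak, which is why proportions near 0 or 1 carry much less variability than proportions near one half.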
4 Three ways to work with binary data
In R, binary/proportion data can be entered into a model as a response in three different ways:
- as a numeric vector holding the proportion of successes (between 0 and 1), with the number of trials supplied through the weights argument; a plain 0/1 vector needs no weights.
- as a logical vector or a factor (TRUE counts as a success; for a factor, the first level is treated as failure and every other level as success).
- as a two-column matrix (the first column holding the number of successes and the second column the number of failures).
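As a rough illustration of the three response formats, the sketch below fits the same logistic regression three times. The data (`x`, `succ`, `n`) are invented for this example and are not part of the lecture. The handling of factor levels follows the `?family` help page, where the first level counts as failure.

```r
# Hypothetical grouped data: 4 groups of 20 trials each (not from the lecture)
x    <- c(1, 2, 3, 4)
succ <- c(2, 5, 9, 14)
n    <- c(20, 20, 20, 20)

# (1) two-column matrix: successes first, failures second
fit_matrix <- glm(cbind(succ, n - succ) ~ x, family = binomial)

# (2) proportion of successes, with the number of trials given as weights
fit_prop <- glm(succ / n ~ x, family = binomial, weights = n)

# (3) one row per individual, factor response; "alive" is the first level,
#     so "dead" is what the model treats as a success
long <- data.frame(
  x = rep(rep(x, 2), c(succ, n - succ)),
  y = factor(rep(c("dead", "alive"), c(sum(succ), sum(n - succ))),
             levels = c("alive", "dead"))
)
fit_factor <- glm(y ~ x, data = long, family = binomial)

# Side-by-side coefficients of the three fits
cbind(matrix = coef(fit_matrix), prop = coef(fit_prop), factor = coef(fit_factor))
```

The coefficients agree up to numerical tolerance. The residual deviance of the individual-level fit differs from the grouped fits, because the saturated model it is compared against is not the same.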
5 1st Example
Toxicity to tobacco budworm (moth) of different doses of trans-cypermethrin. Batches of 20 moths (of each sex) were put in contact for three days with increasing doses of the pyrethroid.

Number of dead moths out of 20 tested
Dose (micrograms)    1    2    4    8   16   32
Male                 1    4    9   13   18   20
Female               0    2    6   10   12   16

> (dose <- rep(2^(0:5), 2))
 [1]  1  2  4  8 16 32  1  2  4  8 16 32
> numdead <- c(1,4,9,13,18,20,0,2,6,10,12,16)
> (sex <- factor( rep( c("M","F"), c(6,6) ) ))
 [1] M M M M M M F F F F F F
Levels: F M
> SF <- cbind(numdead, numalive=20-numdead)
6 1st Example
What is modelled is the proportion of successes.
> modb <- glm(SF ~ sex*dose, family=binomial)
> summary(modb)
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)                             e-07 ***
sexM
dose                                    e-06 ***
sexM:dose                                    **
---
(Dispersion parameter for binomial family taken to be 1)
    Null deviance:    on 11 degrees of freedom
Residual deviance:    on  8 degrees of freedom
AIC:
7 1st Example
> ldose <- log2(dose)
> modb2 <- glm(SF ~ sex*ldose, family=binomial)
> summary(modb2)
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)                             e-08 ***
sexM
ldose                                   e-08 ***
sexM:ldose
(Dispersion parameter for binomial family taken to be 1)
    Null deviance:    on 11 degrees of freedom
Residual deviance:    on  8 degrees of freedom
AIC:
8 1st Example
> drop1(modb2, test="Chisq")
Single term deletions
Model:
SF ~ sex * ldose
          Df Deviance AIC LRT Pr(Chi)
sex:ldose
> modb3 <- update(modb2, ~. - sex:ldose)
> summary(modb3)
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)                             e-13 ***
sexM                                         **
ldose                                   e-16 ***
---
(Dispersion parameter for binomial family taken to be 1)
    Null deviance:    on 11 degrees of freedom
Residual deviance:    on  9 degrees of freedom
AIC:
9 1st Example
> drop1(modb3, test="Chisq")
Single term deletions
Model:
SF ~ sex + ldose
       Df Deviance AIC LRT Pr(Chi)
sex                                  **
ldose                      < 2.2e-16 ***
> shapiro.test(residuals(modb3, type="deviance"))
Shapiro-Wilk normality test
data: residuals(modb3, type = "deviance")
W = , p-value =
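The single-term deletion test reported by `drop1()` is a likelihood-ratio test on the change in deviance. The sketch below reproduces it by hand for the `sex` term, using the budworm data from the slides. The object names `full`, `reduced`, `lrt`, `df` and `pval` are chosen for this illustration only.

```r
# Budworm data as entered on the earlier slide
dose    <- rep(2^(0:5), 2)
numdead <- c(1, 4, 9, 13, 18, 20, 0, 2, 6, 10, 12, 16)
sex     <- factor(rep(c("M", "F"), c(6, 6)))
SF      <- cbind(numdead, numalive = 20 - numdead)
ldose   <- log2(dose)

full    <- glm(SF ~ sex + ldose, family = binomial)   # same model as modb3
reduced <- update(full, . ~ . - sex)                  # drop the sex term

# Likelihood-ratio statistic = increase in residual deviance
lrt  <- deviance(reduced) - deviance(full)
df   <- df.residual(reduced) - df.residual(full)
pval <- pchisq(lrt, df = df, lower.tail = FALSE)

# Compare with the corresponding row of drop1()
d1 <- drop1(full, test = "Chisq")
c(by_hand = lrt, drop1 = d1["sex", "LRT"])
```

Because the dispersion of the binomial family is fixed at 1, the deviance difference can be referred directly to a chi-squared distribution with `df` degrees of freedom.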
10 1st Example
> par(mfrow=c(2,2))
> plot(modb3)
11 1st Example
> plot( c(0,1) ~ c(1,32), type="n", log="x", xlab="dose", ylab="Probability")
> text(dose, numdead/20, labels=as.character(sex) )
> ld <- seq(1,32,0.5)  ## doses to predict at; start at 1 since log2(0) is -Inf
> lines (ld, predict(modb3, data.frame(ldose=log2(ld),
+     sex=factor(rep("M", length(ld)), levels=levels(sex))),
+     type="response") )
> lines (ld, predict(modb3, data.frame(ldose=log2(ld),
+     sex=factor(rep("F", length(ld)), levels=levels(sex))),
+     type="response"), lty=2, col="red" )
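The fitted curves are drawn with `type="response"`, that is, on the probability scale. As a small sketch of what that option does, the code below predicts on the link (logit) scale and back-transforms with `plogis()`. The three doses in `newd` are arbitrary values picked for illustration.

```r
# Budworm data and the additive model from the slides
dose    <- rep(2^(0:5), 2)
numdead <- c(1, 4, 9, 13, 18, 20, 0, 2, 6, 10, 12, 16)
sex     <- factor(rep(c("M", "F"), c(6, 6)))
SF      <- cbind(numdead, numalive = 20 - numdead)
ldose   <- log2(dose)
modb3   <- glm(SF ~ sex + ldose, family = binomial)

# New data: males at three arbitrary doses (1, 4 and 16 micrograms)
newd <- data.frame(ldose = log2(c(1, 4, 16)),
                   sex   = factor("M", levels = levels(sex)))

eta <- predict(modb3, newd, type = "link")       # linear predictor (logit scale)
p   <- predict(modb3, newd, type = "response")   # fitted probabilities

# The response scale is the inverse logit of the link scale
all.equal(plogis(eta), p)
```

Working on the link scale first is useful when confidence bands are needed: one would typically build them from `eta` and its standard error, then transform both limits with `plogis()`.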
12 1st Example
> modbp <- glm(SF ~ sex*ldose, family=binomial(link="probit"))
> AIC(modbp)
[1]
> modbc <- glm(SF ~ sex*ldose, family=binomial(link="cloglog"))
> AIC(modbc)
[1]
> AIC(modb3)
[1]
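The comparison on this slide changes two things at once: the link function and the presence of the interaction term. One possible way to isolate the effect of the link is to refit the same additive formula under each link, as sketched below. The names `links`, `fits` and `aics` are illustrative, and no particular ranking of the links is implied.

```r
# Budworm data from the slides
dose    <- rep(2^(0:5), 2)
numdead <- c(1, 4, 9, 13, 18, 20, 0, 2, 6, 10, 12, 16)
sex     <- factor(rep(c("M", "F"), c(6, 6)))
SF      <- cbind(numdead, numalive = 20 - numdead)
ldose   <- log2(dose)

# Same linear predictor, three different binomial links
links <- c("logit", "probit", "cloglog")
fits  <- lapply(links, function(l) {
  glm(SF ~ sex + ldose, family = binomial(link = l))
})
names(fits) <- links

# AIC of each fit; lower values indicate a better trade-off of fit and complexity
aics <- sapply(fits, AIC)
aics
```

All three models have the same number of parameters here, so differences in AIC reflect differences in fit only.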
13 1st Example
> summary(modb3)
Coefficients: (logit scale)
            Estimate Std. Error z value Pr(>|z|)
(Intercept)                             e-13 ***
sexM                                         **
ldose                                   e-16 ***
---
> exp(coef(modb3)) ## careful it may be misleading
(Intercept)        sexM       ldose   ## odds scale; odds = p / (1-p)
> exp(coef(modb3)[1]+coef(modb3)[2]) ## odds for males at ldose = 0 (dose of 1 microgram)
(Intercept)
Every doubling of the dose multiplies the odds of dying over surviving by a factor of 2.899.
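The odds interpretation can be verified directly from fitted probabilities. This is a minimal sketch in which the helper `odds()` is a made-up function for illustration: it predicts a probability for a given sex and log2 dose and converts it to odds.

```r
# Budworm data and the additive model from the slides
dose    <- rep(2^(0:5), 2)
numdead <- c(1, 4, 9, 13, 18, 20, 0, 2, 6, 10, 12, 16)
sex     <- factor(rep(c("M", "F"), c(6, 6)))
SF      <- cbind(numdead, numalive = 20 - numdead)
ldose   <- log2(dose)
modb3   <- glm(SF ~ sex + ldose, family = binomial)
b       <- coef(modb3)

# Illustrative helper: fitted odds of dying for one sex at one log2(dose)
odds <- function(sex_level, ldose_value) {
  newd <- data.frame(sex   = factor(sex_level, levels = levels(sex)),
                     ldose = ldose_value)
  p <- predict(modb3, newd, type = "response")
  unname(p / (1 - p))
}

# Doubling the dose (4 -> 8 micrograms, i.e. ldose 2 -> 3) for females
ratio_dose <- odds("F", 3) / odds("F", 2)

# Males versus females at the same dose
ratio_sex <- odds("M", 2) / odds("F", 2)

c(ratio_dose = ratio_dose, exp_ldose = unname(exp(b["ldose"])))
c(ratio_sex  = ratio_sex,  exp_sexM  = unname(exp(b["sexM"])))
```

The ratios do not depend on the starting dose or on the sex chosen, because the model has no interaction term. The exponentiated intercept, by contrast, is an odds (for females at a dose of 1 microgram) and not an odds ratio, which is why reading `exp()` of the whole coefficient vector at once can mislead.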
14 2nd Example
Erythrocyte Sedimentation Rate (ESR) in a group of patients.
Two groups: ESR < 20 mm/hour (healthy) or ESR > 20 mm/hour (ill).
Q: Is it related to globulin & fibrinogen level in the blood?
> data("plasma", package="HSAUR")
> str(plasma)
'data.frame': 32 obs. of 3 variables:
 $ fibrinogen: num
 $ globulin  : int
 $ ESR       : Factor w/ 2 levels "ESR < 20","ESR > 20":
> summary(plasma)
   fibrinogen       globulin          ESR
 Min.   :2.090   Min.   :28.00   ESR < 20:26
 1st Qu.:        1st Qu.:31.75   ESR > 20: 6
 Median :2.600   Median :36.00
 Mean   :2.789   Mean   :
 3rd Qu.:        3rd Qu.:38.00
 Max.   :5.060   Max.   :46.00
15 2nd Example
> stripchart(globulin ~ ESR, vertical=TRUE, data=plasma,
+     xlab="Erythrocyte Sedimentation Rate (mm/hr)",
+     ylab="Globulin blood level", method="jitter" )
> stripchart(fibrinogen ~ ESR, vertical=TRUE, data=plasma,
+     xlab="Erythrocyte Sedimentation Rate (mm/hr)",
+     ylab="Fibrinogen blood level", method="jitter" )
16 2nd Example
The response ESR is a factor: its first level (ESR < 20) is treated as failure, so the model describes the probability of ESR > 20.
> mod1 <- glm(ESR~fibrinogen, data=plasma, family=binomial)
> summary(mod1)
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)                                  *
fibrinogen                                   *
---
(Dispersion parameter for binomial family taken to be 1)
    Null deviance:    on 31 degrees of freedom
Residual deviance:    on 30 degrees of freedom
AIC:
> mod2 <- glm(ESR~fibrinogen+globulin, data=plasma, family=binomial)
> AIC(mod2)
[1]
17 2nd Example
> anova(mod1, mod2, test="Chisq")
Analysis of Deviance Table
Model 1: ESR ~ fibrinogen
Model 2: ESR ~ fibrinogen + globulin
  Resid. Df Resid. Dev Df Deviance P(>|Chi|)
The difference in deviance between these models is not significant, which leads us to select the least complex model.
> summary(mod2)
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)                                  *
fibrinogen                                   *
globulin
(Dispersion parameter for binomial family taken to be 1)
    Null deviance:    on 31 degrees of freedom
Residual deviance:    on 29 degrees of freedom
AIC:
18 2nd Example
> shapiro.test(residuals(mod1, type="deviance"))
Shapiro-Wilk normality test
data: residuals(mod1, type = "deviance")
W = , p-value = 5.465e-07
> par(mfrow=c(2,2))
> plot(mod1)