Lecture 8: Generalized Additive Models
Olivier MISSA, Advanced Research Skills

Outline
An introduction to Generalized Additive Models.

GAMs vs. GLMs
[Slide figure: the GLM and GAM model formulas side by side; the GAM term is labelled "smooth function of x1".]

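In standard notation, the contrast between the two model classes can be sketched as follows for two predictors. The symbols g, the beta coefficients and the functions f_1, f_2 are assumptions chosen for illustration, not recovered from the slide.

```latex
\begin{align*}
\text{GLM:}\quad g\big(E[y]\big) &= \beta_0 + \beta_1 x_1 + \beta_2 x_2 \\
\text{GAM:}\quad g\big(E[y]\big) &= \beta_0 + f_1(x_1) + f_2(x_2)
\end{align*}
```

Here g is the link function, and each f_j is a smooth function whose shape is estimated from the data rather than fixed to a straight line.
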
"Smooth" functions
A number of algorithms are available to fit them, all known generically as "splines".
The most frequently used are Thin Plate Regression Splines (the default) and Cubic Regression Splines.

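More efficient than Polynomials
[Slide figure: the same data fitted with a thin plate regression spline and with polynomials of degree 5 and degree 10.]
Thin Plate Regression Splines: Est. degr. freedom: 8.69; R2 (adj): ; Dev. Expl.: 79.8%
Polynomials: degr. freedom: 10; R2 (adj): ; Dev. Expl.: 78%
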
More efficient than Polynomials
No need to specify the degrees of freedom (wiggliness) of the smooth function.
The algorithm finds the optimal amount of smoothing for us and avoids overfitting by generalised cross-validation (an approximation to the 'leave-one-out' trick).

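The contrast between a hand-chosen polynomial degree and an automatically chosen amount of smoothing can be sketched on simulated data. Everything about the data here (the signal, the noise level, the object names) is a hypothetical illustration and not the dataset behind the slide; `method = "GCV.Cp"` is spelled out only to make the use of generalised cross-validation explicit.

```r
library(mgcv)  # provides gam() and s()

set.seed(1)
# Hypothetical simulated data: a wiggly signal plus Gaussian noise
n <- 200
x <- seq(0, 1, length.out = n)
y <- sin(2 * pi * x) + 0.5 * cos(6 * pi * x) + rnorm(n, sd = 0.3)
sim <- data.frame(x = x, y = y)

# Polynomial fit: the degree has to be chosen by hand
fit_poly <- lm(y ~ poly(x, 10), data = sim)

# GAM fit: the amount of smoothing is selected by generalised cross-validation
fit_gam <- gam(y ~ s(x), data = sim, method = "GCV.Cp")

# Effective degrees of freedom actually used by the smooth
edf_smooth <- summary(fit_gam)$edf
edf_smooth

# Proportion of variation explained by each fit
# (for a Gaussian model, R-squared equals the deviance explained)
dev_poly <- summary(fit_poly)$r.squared
dev_gam  <- summary(fit_gam)$dev.expl
c(polynomial = dev_poly, gam = dev_gam)
```

The effective degrees of freedom reported for the smooth are a by-product of the fit, whereas the polynomial spends exactly as many degrees of freedom as its degree.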
Example: ozone pollution
Revisiting the ozone dataset (Lecture 4 Linear Models III)
> library(mgcv)   ## provides gam() and s()
> ozone.pollution <- read.table("ozone.data.txt", header=TRUE) ## in datasets folder
> names(ozone.pollution)
[1] "rad" "temp" "wind" "ozone"
> attach(ozone.pollution)
> modgam <- gam(ozone ~ s(rad) + s(temp) + s(wind) )
> plot(ozone ~ fitted(modgam), pch=16)   ## observed against fitted values
> abline(0,1, col="red")                 ## 1:1 line

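For readers without `ozone.data.txt`, the following sketch fits the same model formula to R's built-in `airquality` data. Treating `airquality` as a stand-in, and renaming its columns to match the slide, are assumptions made here; the numbers it produces are not claimed to match the lecture output. Passing `data =` also avoids the need for `attach()`.

```r
library(mgcv)

# Assumption: the built-in airquality data stand in for ozone.data.txt
oz <- na.omit(airquality[, c("Solar.R", "Temp", "Wind", "Ozone")])
names(oz) <- c("rad", "temp", "wind", "ozone")  # renamed to match the slide

# Same model formula as on the slide, with data= instead of attach()
modgam <- gam(ozone ~ s(rad) + s(temp) + s(wind), data = oz)

# Observed against fitted values, with the 1:1 line in red
plot(fitted(modgam), oz$ozone, pch = 16,
     xlab = "Fitted ozone", ylab = "Observed ozone")
abline(0, 1, col = "red")
```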
Ozone Example
> summary(modgam)
Family: gaussian
Link function: identity
...
Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)                               <2e-16 ***
Approximate significance of smooth terms:
        edf Ref.df F p-value
s(rad)                       **
s(temp)                 e-09 ***
s(wind)                 e-08 ***
R-sq.(adj) =     Deviance explained = 74.8%
GCV score = 338   Scale est. =     n = 111
(GCV: Generalised Cross-Validation)

Ozone Example
> plot(modgam, residuals=T, pch=16)
[Slide figure: the three fitted smooths with partial residuals; panels rad, temp, wind.]
> modgam2 <- update(modgam, ~. - s(rad) )
> modgam3 <- update(modgam, ~. - s(temp) )
> modgam4 <- update(modgam, ~. - s(wind) )

Ozone Example
> anova(modgam, modgam2, test="F")
Model 1: ozone ~ s(rad) + s(temp) + s(wind)
Model 2: ozone ~ s(temp) + s(wind)
  Resid. Df Resid. Dev Df Deviance F Pr(>F)
                                            **
> anova(modgam, modgam3, test="F")
Model 1: ozone ~ s(rad) + s(temp) + s(wind)
Model 2: ozone ~ s(rad) + s(wind)
  Resid. Df Resid. Dev Df Deviance F Pr(>F)
                                       e-09 ***
> anova(modgam, modgam4, test="F")
Model 1: ozone ~ s(rad) + s(temp) + s(wind)
Model 2: ozone ~ s(rad) + s(temp)
  Resid. Df Resid. Dev Df Deviance F Pr(>F)
                                       e-09 ***

Ozone Example
> modgam5 <- gam(ozone ~ rad + s(temp) + s(wind) )   ## rad now enters as a linear term
> summary(modgam5)
Family: gaussian
Link function: identity
...
Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)                                e-11 ***
rad                                             **
---
Approximate significance of smooth terms:
        edf Ref.df F p-value
s(temp)                 e-09 ***
s(wind)                 e-09 ***
---
R-sq.(adj) =     Deviance explained = 73.4%
GCV score =     Scale est. =     n = 111

Ozone Example
> anova(modgam, modgam5, test="F")
Analysis of Deviance Table
Model 1: ozone ~ s(rad) + s(temp) + s(wind)
Model 2: ozone ~ rad + s(temp) + s(wind)
  Resid. Df Resid. Dev Df Deviance F Pr(>F)
> shapiro.test(residuals(modgam5, type="deviance"))
Shapiro-Wilk normality test
data: residuals(modgam5, type = "deviance")
W = , p-value = 1.999e-06
> resid <- residuals(modgam5, type="deviance")

Ozone Example
> qqnorm(resid, pch=16)
> qqline(resid, lwd=2, col="red")
> plot(resid ~ fitted(modgam5), pch=16)
> abline(h=0, col="gray85", lty=2)

Ozone Example
> plot(sqrt(abs(resid)) ~ fitted(modgam5), pch=16)
> lines(lowess(sqrt(abs(resid)) ~ fitted(modgam5)), lwd=2, col="red")
> plot(cooks.distance(modgam5), type="h")

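The residual plots above are built by hand. As a complementary sketch, mgcv's `gam.check()` draws a comparable set of diagnostic plots in one call and prints information about convergence and basis dimensions. The built-in `airquality` data are again used as an assumed stand-in for the ozone file, so the output is illustrative only.

```r
library(mgcv)

# Assumption: the built-in airquality data stand in for ozone.data.txt
oz <- na.omit(airquality[, c("Solar.R", "Temp", "Wind", "Ozone")])
names(oz) <- c("rad", "temp", "wind", "ozone")

# Model with rad as a linear term, as in modgam5 on the slides
modgam5 <- gam(ozone ~ rad + s(temp) + s(wind), data = oz)

# Four diagnostic plots on one page: QQ plot, residuals against the
# linear predictor, residual histogram, response against fitted values
op <- par(mfrow = c(2, 2))
gam.check(modgam5)
par(op)

# Deviance residuals, as used for the hand-made plots
res <- residuals(modgam5, type = "deviance")
```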
Possible Refinements
Specify a different family than Gaussian:
> modgam <- gam(resp ~ pred1 + s(pred2) + s(pred3), family=poisson(link="log") )
Specify a different spline basis than Thin Plate (here "cr", a cubic regression spline):
> modgam <- gam(resp ~ pred1 + s(pred2, bs="cr") + s(pred3) )
Specify a maximum number of degrees of freedom for the spline (k sets the basis dimension, so the smooth can use at most k-1 degrees of freedom):
> modgam <- gam(resp ~ pred1 + s(pred2, k=5) + s(pred3) )

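The three refinements can be combined in a single call. The sketch below does so on simulated count data; the data-generating process and the data frame `sim` are hypothetical, and only the names `resp`, `pred1`, `pred2`, `pred3` are taken from the slide.

```r
library(mgcv)

set.seed(42)
# Hypothetical simulated count data
n <- 300
sim <- data.frame(pred1 = runif(n), pred2 = runif(n), pred3 = runif(n))
eta <- 0.5 + 0.8 * sim$pred1 + sin(2 * pi * sim$pred2) + sim$pred3^2
sim$resp <- rpois(n, lambda = exp(eta))

# Poisson family with log link, a cubic regression spline with k = 5
# for pred2, and the default thin plate spline for pred3
modgam <- gam(resp ~ pred1 + s(pred2, bs = "cr", k = 5) + s(pred3),
              family = poisson(link = "log"), data = sim)

# Effective degrees of freedom of the two smooths;
# the first cannot exceed k - 1 = 4
edf <- summary(modgam)$edf
names(edf) <- c("s(pred2)", "s(pred3)")
edf
```

If the effective degrees of freedom of a smooth sit very close to its ceiling, that is usually a hint to try a larger k.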
1st Example
> plot( c(0,1) ~ c(1,32), type="n", log="x", xlab="dose", ylab="Probability")
> text(dose, numdead/20, labels=as.character(sex) )
> ld <- seq(0,32,0.5)
> lines (ld, predict(modb3, data.frame(ldose=log2(ld),
      sex=factor(rep("M", length(ld)), levels=levels(sex))), type="response") )
> lines (ld, predict(modb3, data.frame(ldose=log2(ld),
      sex=factor(rep("F", length(ld)), levels=levels(sex))), type="response"),
      lty=2, col="red" )

1st Example
> modbp <- glm(SF ~ sex*ldose, family=binomial(link="probit"))
> AIC(modbp)
[1]
> modbc <- glm(SF ~ sex*ldose, family=binomial(link="cloglog"))
> AIC(modbc)
[1]
> AIC(modb3)
[1]

1st Example
> summary(modb3)
Coefficients (logit scale):
            Estimate Std. Error z value Pr(>|z|)
(Intercept)                                e-13 ***
sexM                                            **
ldose                                      e-16 ***
---
> exp(modb3$coeff)   ## careful it may be misleading
(Intercept) sexM ldose
## odds: p / (1-p)
> exp(modb3$coeff[1]+modb3$coeff[2])   ## odds for males
(Intercept)
Because ldose is log2(dose), every doubling of the dose multiplies the odds of dying over surviving by a factor of 2.899 (the exponentiated ldose coefficient).

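The "factor per doubling" reading can be checked numerically. The sketch below rests on two assumptions that are not confirmed by the slides: that the data are the budworm dose-response data used in R's `?predict.glm` example (whose variable names match), and that `modb3` is the additive logit model `SF ~ sex + ldose`. Under those assumptions the result should come out close to the factor quoted above.

```r
# Assumption: budworm data as in the example of ?predict.glm
ldose   <- rep(0:5, 2)                           # log2 of the dose
numdead <- c(1, 4, 9, 13, 18, 20, 0, 2, 6, 10, 12, 16)
sex     <- factor(rep(c("M", "F"), c(6, 6)))
SF      <- cbind(numdead, numalive = 20 - numdead)

# Assumption: modb3 is the additive logit model without interaction
modb3 <- glm(SF ~ sex + ldose, family = binomial(link = "logit"))

# Multiplicative change in the odds for one extra unit of ldose,
# that is, for one doubling of the dose
odds_per_doubling <- exp(coef(modb3)["ldose"])
odds_per_doubling

# Cross-check on the odds scale: males at dose 4 against males at dose 2
newd <- data.frame(ldose = c(1, 2),
                   sex = factor(c("M", "M"), levels = levels(sex)))
p <- predict(modb3, newd, type = "response")
odds <- p / (1 - p)
odds_ratio_check <- odds[2] / odds[1]
odds_ratio_check
```

The two quantities agree because, on the logit scale, one extra unit of ldose adds the ldose coefficient to the log-odds regardless of sex.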
2nd Example
Erythrocyte Sedimentation Rate (ESR) in a group of patients.
Two groups: ESR < 20 mm/hour and ESR > 20 mm/hour (ill).
Q: Is it related to the globulin and fibrinogen levels in the blood?
> data("plasma", package="HSAUR")
> str(plasma)
'data.frame': 32 obs. of 3 variables:
 $ fibrinogen: num
 $ globulin  : int
 $ ESR       : Factor w/ 2 levels "ESR < 20","ESR > 20":
> summary(plasma)
   fibrinogen       globulin          ESR
 Min.   :2.090   Min.   :28.00   ESR < 20:26
 1st Qu.:        1st Qu.:31.75   ESR > 20: 6
 Median :2.600   Median :36.00
 Mean   :2.789   Mean   :
 3rd Qu.:        3rd Qu.:38.00
 Max.   :5.060   Max.   :46.00

2nd Example
> stripchart(globulin ~ ESR, vertical=T, data=plasma,
      xlab="Erythrocyte Sedimentation Rate (mm/hr)",
      ylab="Globulin blood level", method="jitter" )
> stripchart(fibrinogen ~ ESR, vertical=T, data=plasma,
      xlab="Erythrocyte Sedimentation Rate (mm/hr)",
      ylab="Fibrinogen blood level", method="jitter" )

2nd Example
> mod1 <- glm(ESR~fibrinogen, data=plasma, family=binomial)   ## the response ESR is a factor
> summary(mod1)
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)                                     *
fibrinogen                                      *
---
(Dispersion parameter for binomial family taken to be 1)
Null deviance:      on 31 degrees of freedom
Residual deviance:      on 30 degrees of freedom
AIC:
> mod2 <- glm(ESR~fibrinogen+globulin, data=plasma, family=binomial)
> AIC(mod2)
[1]

2nd Example
> anova(mod1, mod2, test="Chisq")
Analysis of Deviance Table
Model 1: ESR ~ fibrinogen
Model 2: ESR ~ fibrinogen + globulin
  Resid. Df Resid. Dev Df Deviance P(>|Chi|)
> summary(mod2)
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)                                     *
fibrinogen                                      *
globulin
(Dispersion parameter for binomial family taken to be 1)
Null deviance:      on 31 degrees of freedom
Residual deviance:      on 29 degrees of freedom
AIC:
The difference in deviance between these two models is not significant, which leads us to select the less complex model.

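The chi-squared test reported by `anova()` for two nested GLMs can be reproduced by hand from the drop in residual deviance. The sketch below uses simulated binary data, since the HSAUR package may not be installed; the data frame `sim` and the model names are hypothetical.

```r
set.seed(123)
# Hypothetical simulated data: the response depends on x1 only
n <- 100
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- rbinom(n, size = 1, prob = plogis(-1 + 1.2 * sim$x1))

# Two nested logistic regressions
mod_small <- glm(y ~ x1, data = sim, family = binomial)
mod_big   <- glm(y ~ x1 + x2, data = sim, family = binomial)

# Analysis of deviance table, as on the slide
tab <- anova(mod_small, mod_big, test = "Chisq")
tab

# The same p-value by hand: the drop in deviance is compared with a
# chi-squared distribution on the difference in residual df
dev_drop <- deviance(mod_small) - deviance(mod_big)
df_drop  <- df.residual(mod_small) - df.residual(mod_big)
p_manual <- pchisq(dev_drop, df = df_drop, lower.tail = FALSE)
p_manual
```

A large p-value means the extra predictor does not reduce the deviance by more than chance would, which is the argument for keeping the smaller model.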
2nd Example
> shapiro.test(residuals(mod1, type="deviance"))
Shapiro-Wilk normality test
data: residuals(mod1, type = "deviance")
W = , p-value = 5.465e-07
> par(mfrow=c(2,2))
> plot(mod1)