Generalized Linear Models II: Distributions, link functions, diagnostics (linearity, homoscedasticity, leverage)
Dichotomous key: picking a distribution for your data
Discrete or continuous?

Discrete
  Possible values 0/1: Binomial (logistic regression)
  Possible values 0,1,2,…: Poisson or Binomial
    Check for overdispersion:
      Resid. deviance ~= Resid. df (~ n-p): Poisson ok
      Resid. deviance >> Resid. df (~ n-p): compare fit w/ quasi-Poisson, quasi-binomial, or negative binomial

Continuous
  Range of data -∞ to +∞: Normal
    Check residuals for normality
  Range of data >0 to +∞: Gamma or Inverse-Gaussian
    Check s.dev. residuals for normality

If distributional checks fail, examine the data/residuals and try to determine the source of deviance! Bimodality? Linearity? Fat tails? Excess zeros?
Check Resid. deviance ~= Resid. df (~ n-p) again and compare s.dev. resids to normality.

Common distributions (but see next slide for others and additional details)
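The overdispersion check in the key above can be sketched in R. This is a minimal illustration on simulated counts, not the lecture data; the sample size, coefficients, and the negative binomial `size` value are all assumptions chosen only to produce visible overdispersion.

```r
# Simulated example (not from the lecture data): counts that are more
# variable than a Poisson model allows.
set.seed(1)
n  <- 200
x  <- runif(n)
mu <- exp(1 + 1.5 * x)                   # true mean, log link
y  <- rnbinom(n, mu = mu, size = 1.5)    # negative binomial => overdispersed

fit_pois <- glm(y ~ x, family = poisson)

# Rule of thumb from the key: residual deviance should be close to residual df
dev_ratio <- deviance(fit_pois) / df.residual(fit_pois)
dev_ratio                                # values well above 1 suggest overdispersion

# Quasi-Poisson gives the same coefficients but estimates a dispersion parameter
fit_quasi  <- glm(y ~ x, family = quasipoisson)
dispersion <- summary(fit_quasi)$dispersion
dispersion

# Standard errors grow by sqrt(dispersion) relative to the Poisson fit
se_pois  <- summary(fit_pois)$coefficients[, "Std. Error"]
se_quasi <- summary(fit_quasi)$coefficients[, "Std. Error"]
se_quasi / se_pois
```

The point estimates do not change between the two fits; only the uncertainty does, which is why ignoring overdispersion tends to produce overly optimistic p-values.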
Possible values:

Discrete
  0/1: Bernoulli (success/failure, logistic regression)
  0,1,2,… N (known):
    Binomial (# successes in fixed # trials)
    Multinomial (more than 2 categories, fixed # trials)
  0,1,2,… ∞:
    Geometric (# trials to 1st success)
    Poisson (# successes in large # trials)
    Negative Binomial (# trials to nth success, or over-dispersed Poisson)

Continuous
  -∞ to +∞: Normal
  >0 to +∞:
    Exponential (time to 1st success)
    Gamma (time to nth success)
    Inverse-Gaussian (time for a random walk with drift to first reach a fixed level; despite the name, 1/x is not normal)
  0 to 1: Beta (fraction of total, proportions)

Check out the Wikipedia pages for each distribution for more info!
As sample sizes get large, many distributions converge on the normal distribution (for example, the binomial with many trials and the Poisson with a large mean).
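One way to see this convergence is to compare Poisson probabilities with a normal density that has the same mean and variance. The sketch below uses the Poisson mean as a stand-in for "sample size"; the helper `max_gap` and the three means tried are illustrative choices, not part of the lecture.

```r
# Largest difference between Poisson probabilities and a normal density
# with matching mean and variance; it shrinks as the mean grows.
max_gap <- function(lambda) {
  k <- 0:(lambda * 4 + 20)                  # covers essentially all the mass
  p_pois <- dpois(k, lambda)
  p_norm <- dnorm(k, mean = lambda, sd = sqrt(lambda))
  max(abs(p_pois - p_norm))
}

gaps <- sapply(c(1, 10, 100), max_gap)
names(gaps) <- c("lambda=1", "lambda=10", "lambda=100")
gaps
```

With a small mean the Poisson is strongly right-skewed and the normal is a poor match; by a mean of 100 the two are nearly indistinguishable.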
Group exercise
  Get a partner
  Describe a real dataset to your partner
  Partner picks a potentially appropriate distribution
  Switch roles
  Repeat!
Link Functions
  Enforce appropriate range for expected response (e.g. between 0 and 1 for ‘probability of success’, >0 for counts, etc.)
  Linearize relationship between expected response and predictors:
    $G(E(y)) = b_0 + b_1 x_1 + b_2 x_2 + \dots$
  Be careful to interpret coefficients properly given a link function!
    $E(y) = G^{-1}(b_0 + b_1 x_1 + b_2 x_2 + \dots)$
  E.g., writing $\eta = b_0 + b_1 x_1 + b_2 x_2 + \dots$ for the linear predictor:

    Link    Constraint       Inverse
    Log     E(y) > 0         $E(y) = \exp(\eta)$
    Logit   E(y) in (0,1)    $E(y) = \exp(\eta) / (1 + \exp(\eta))$

  See Table 15.1 in GLM chapter for lots more!
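The link and inverse link can be tried directly in R, because each family object carries both functions. The sketch below uses only base R; the simulated data and the coefficient values 0.5 and 0.7 are made up for illustration.

```r
# Family objects in base R carry the link G and its inverse G^-1
logit     <- binomial()$linkfun    # probability -> linear predictor
inv_logit <- binomial()$linkinv    # linear predictor -> probability

eta <- c(-5, 0, 5)                 # any real-valued linear predictor
p   <- inv_logit(eta)              # always strictly between 0 and 1
p
logit(p)                           # recovers eta

# Log link: the inverse is exp(), so expected counts stay positive
pos <- poisson()$linkinv(eta)
pos

# Interpreting a coefficient under a log link (simulated, hypothetical values)
set.seed(2)
x   <- runif(500, 0, 2)
y   <- rpois(500, lambda = exp(0.5 + 0.7 * x))
fit <- glm(y ~ x, family = poisson)
exp(coef(fit)["x"])                # multiplicative change in E(y) per unit of x
```

Under a log link a coefficient is additive on the log scale, so `exp(b)` is read as a ratio of expected responses, not as a difference.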
Canonical link functions
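In base R, the default link of each standard family is its canonical link, so the pairs can be listed by querying the family objects. This is a small sketch; the list name `families` is only for illustration.

```r
# Default link of each base-R family; for these families the default
# is the canonical link.
families <- list(
  gaussian         = gaussian(),
  binomial         = binomial(),
  poisson          = poisson(),
  Gamma            = Gamma(),
  inverse.gaussian = inverse.gaussian()
)
canonical <- sapply(families, function(f) f$link)
canonical
```

A non-canonical link is often still a reasonable choice, e.g. `Gamma(link = "log")` when a multiplicative interpretation of the coefficients is wanted.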
Sample problems for count data: Binomial vs. Poisson 202/poiss_bin.html
Leverage (see diagnostic plots & websites on next slide) Xxx et al 2006 PLoS Biology
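Leverage can be extracted from a fitted GLM with `hatvalues()`. The sketch below uses simulated data with one deliberately extreme predictor value; the data, the coefficients, and the 2p/n cutoff (a common rule of thumb, not a threshold stated in these slides) are assumptions for illustration.

```r
# Simulated data (hypothetical): one observation sits far from the rest in x
set.seed(3)
x   <- c(runif(30, 0, 1), 5)            # the 31st point is extreme in x
y   <- rpois(31, lambda = exp(0.2 + 0.5 * x))
fit <- glm(y ~ x, family = poisson)

h <- hatvalues(fit)                     # leverage of each observation
p <- length(coef(fit))
n <- length(y)

# Rule of thumb: flag leverage above 2p/n (some authors use 3p/n)
flagged <- which(h > 2 * p / n)
flagged

# Cook's distance combines leverage with the size of the residual
cd <- cooks.distance(fit)
most_leveraged <- unname(which.max(h))
most_leveraged
cd[most_leveraged]
```

High leverage alone is not a problem; it becomes one when the point also has a large residual, which is what Cook's distance and the last panel of `plot(fit)` summarize.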
R: example GLM with data

# read in data
bd <- read.csv("c:/marm/teaching/293qe/bat_lambda.csv")
str(bd); head(bd)

# What not to do - run models blindly!
b1 <- glm(Lambda ~ PreWNS_Pop, family = Gamma, data = bd); summary(b1)

# What to do - plot data
plot(Lambda ~ PreWNS_Pop, data = bd)

# What does it suggest would be a good idea?
bd$Lpop <- log(bd$PreWNS_Pop)           # log-transform population size
plot(Lambda ~ Lpop, data = bd)

# Refit with the log-transformed predictor, then add Species and the interaction
b1 <- glm(Lambda ~ Lpop, family = Gamma, data = bd); summary(b1)
b2 <- glm(Lambda ~ Lpop + Species, family = Gamma, data = bd); summary(b2)
b3 <- glm(Lambda ~ Lpop * Species, family = Gamma, data = bd); summary(b3)

# Compare the nested models, then look at the diagnostic plots
anova(b1, b2, b3, test = "Chisq")
AIC(b1, b2, b3)
plot(b3)
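The CSV above sits at a local path, so the same workflow can be tried on a simulated stand-in. Everything in this sketch is made up: the data frame `sim`, the two species labels, the coefficients, and the gamma shape. Only the column names mirror the example above.

```r
# Simulated stand-in for the bat data (all values below are made up)
set.seed(4)
n <- 120
sim <- data.frame(
  Lpop    = runif(n, 2, 9),                           # log population size
  Species = factor(sample(c("A", "B"), n, replace = TRUE))
)

# Gamma() uses the inverse link by default: 1 / E(Lambda) = linear predictor
eta <- 0.6 + 0.15 * sim$Lpop + 0.5 * (sim$Species == "B")
mu  <- 1 / eta
sim$Lambda <- rgamma(n, shape = 10, rate = 10 / mu)   # mean mu, CV about 0.32

s1 <- glm(Lambda ~ Lpop,           family = Gamma, data = sim)
s2 <- glm(Lambda ~ Lpop + Species, family = Gamma, data = sim)
s3 <- glm(Lambda ~ Lpop * Species, family = Gamma, data = sim)

anova(s1, s2, s3, test = "Chisq")   # sequential tests of the added terms
aic <- AIC(s1, s2, s3)
aic

# Estimated dispersion; for a gamma response this is roughly 1 / shape
disp <- summary(s2)$dispersion
disp
```

Because the simulated truth includes a Species effect but no interaction, the additive model would be expected to beat the Lpop-only model, while the interaction term should add little.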