Introduction to Generalized Linear Model (GLM)
Man Li, Research Fellow, International Food Policy Research Institute
Technical Training for Modeling Scenarios for Low Emission Development Strategies, September 9th–20th, 2013
What is GLM?
In statistics, the GLM is a flexible generalization of ordinary linear (OL) regression that allows the response variable (Y) to follow a distribution other than the normal. The GLM generalizes linear regression by relating the linear model to Y via a LINK FUNCTION, i.e., E(Y) = μ = g^{-1}(Xβ), where g is the link function such that g(μ) = Xβ.
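As a small illustration of the link-function idea, the sketch below writes out the logit link g and its inverse by hand and compares them with the versions stored in R's binomial() family object. The names g, g_inv, and the values in eta are illustrative only and do not come from the course material.

```r
# Logit link g(mu) = log(mu / (1 - mu)) and its inverse (the mean function).
g     <- function(mu)  log(mu / (1 - mu))
g_inv <- function(eta) exp(eta) / (1 + exp(eta))

eta <- c(-2, 0, 2)   # hypothetical values of the linear predictor X beta
mu  <- g_inv(eta)    # implied means, all strictly between 0 and 1

# Applying the link to the mean recovers the linear predictor.
round_trip <- g(mu)

# The same pair of functions is available in the binomial family object.
fam <- binomial(link = "logit")
mu_from_family <- fam$linkinv(eta)

print(cbind(eta, mu, round_trip, mu_from_family))
```

With the identity link, g and g_inv would both simply return their argument, which is the ordinary linear regression case.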
Common distributions with typical uses and canonical link functions

| Distribution | Support of distribution | Typical uses | Link name | Link function | Mean function |
|---|---|---|---|---|---|
| Normal | Real: (−∞, +∞) | Linear-response data | Identity | Xβ = μ | μ = Xβ |
| Bernoulli | Integer: {0, 1} | Outcome of single yes/no occurrence | Logit | Xβ = log(μ/(1−μ)) | μ = exp(Xβ)/(1+exp(Xβ)) |
| Binomial | Integer: [0, N] | Count of "yes" occurrences out of N yes/no occurrences | Logit | (as for Bernoulli) | (as for Bernoulli) |
| Categorical | K-vector of integers: [0, 1] | Outcome of single K-way occurrence | Similar to logit, but somewhat more complicated | | |
| Multinomial | K-vector of integers: [0, N] | Count of occurrences of each of the K types out of N total K-way occurrences | Similar to logit, but somewhat more complicated | | |
Logit Regression for Binary Responses
Example: Survival and gender in the Donner party, an observational study
In 1846 the Donner families left Springfield, Illinois for California by covered wagon. When they reached Fort Bridger, Wyoming in July, the Donner party decided to attempt a new and untested route to the Sacramento Valley. Having reached its full size of 87 people and 20 wagons, the party was delayed in the difficult crossing of the Wasatch Range and again in the crossing of the desert west of the Great Salt Lake. The group became stranded in the eastern Sierra Nevada mountains when hit by heavy snows in late October. By the time the last survivor was rescued on 21 April 1847, 40 of the 87 members had died from famine and exposure to extreme cold.
Example: Donner Party Deaths
These data were used to study the theory that females are better able to withstand harsh conditions than are males.

| AGE | SEX | STATUS |
|---|---|---|
| 23.00 | MALE | DIED |
| 40.00 | FEMALE | SURVIVED |
| 40.00 | MALE | SURVIVED |
| 30.00 | MALE | DIED |
| 28.00 | MALE | DIED |
| 40.00 | MALE | DIED |
| 45.00 | FEMALE | DIED |
| 62.00 | MALE | DIED |
| 65.00 | MALE | DIED |
| 45.00 | FEMALE | DIED |
| 25.00 | FEMALE | DIED |
| 28.00 | MALE | SURVIVED |
| 28.00 | MALE | DIED |
| 23.00 | MALE | DIED |
| 22.00 | FEMALE | SURVIVED |
| 23.00 | FEMALE | SURVIVED |
| 28.00 | MALE | SURVIVED |
| 15.00 | FEMALE | SURVIVED |
| 47.00 | FEMALE | DIED |
| 57.00 | MALE | DIED |
| 20.00 | FEMALE | SURVIVED |
| … | … | … |

Ages and sexes of the adults (over 15 years) in the party
Example: Donner Party Deaths
Question: For a given age, were women more likely to survive than men?
If a linear model is used:
– E(Y_i | X_i) = X_i β (i.i.d. errors)
– Y = 1 if survived, = 0 if died
– X = (age, sex)
Ordinary Linear Regression
Fitted model: Y = … − 0.013*age + …*I[sex=female]
Ordinary Linear Regression with Interaction Term
Fitted model: Y = … − 0.006*age + …*I[sex=female] − 0.025*age*I[sex=female]
Logit Regression
Model:
– Y_i | X_i ~ Bin(1, π_i) (independent)
– g(π_i) = log(π_i / (1 − π_i)) = X_i β
– Y = 1 if survived, = 0 if died
– X = (age, sex)
– Null model: log odds of survival = β_0 + β_1 age + β_2 I[sex=female]
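The null model above can be fitted with R's glm() function. The sketch below is only an illustration: it uses the 21 adult records printed on the data slide rather than the full "donner" data set, so its estimates will not match the course results. The names donner, female, survival, and fit0 are chosen here for illustration.

```r
# The 21 records visible on the data slide (a subset, not the full data).
donner <- data.frame(
  age      = c(23, 40, 40, 30, 28, 40, 45, 62, 65, 45, 25,
               28, 28, 23, 22, 23, 28, 15, 47, 57, 20),
  female   = c(0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1,
               0, 0, 0, 1, 1, 0, 1, 1, 0, 1),   # 1 = FEMALE, 0 = MALE
  survival = c(0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,
               1, 0, 0, 1, 1, 1, 1, 0, 0, 1)    # 1 = SURVIVED, 0 = DIED
)

# Null model: log odds of survival = b0 + b1 * age + b2 * I[sex = female].
fit0 <- glm(survival ~ age + female,
            family = binomial(link = "logit"),
            data   = donner)

# Coefficient table with Wald z statistics and p-values.
print(summary(fit0)$coefficients)

# Fitted survival probabilities, obtained through the inverse link.
p_hat <- fitted(fit0)
```

Unlike the fitted values of the linear model, p_hat always lies strictly between 0 and 1, because it is computed through the inverse logit.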
Possible problems
Logit is not a straight-line function of age
– Test a quadratic age term separately for males and females (Wald test): X = (age, agesq)
Slopes are not the same for males and females
– Test for the significance of the interaction term (Wald test): X = (age, sex, age*sex)
– Alternative to Wald: likelihood ratio test
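The interaction check can be sketched in R as follows, showing both the Wald test (from the coefficient table) and the likelihood ratio test (from comparing nested fits). As an assumption for illustration, the code again uses only the 21 records printed on the data slide, so the numbers are not the course results. The object names are hypothetical.

```r
# The 21 records visible on the data slide (a subset, not the full data).
donner <- data.frame(
  age      = c(23, 40, 40, 30, 28, 40, 45, 62, 65, 45, 25,
               28, 28, 23, 22, 23, 28, 15, 47, 57, 20),
  female   = c(0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1,
               0, 0, 0, 1, 1, 0, 1, 1, 0, 1),
  survival = c(0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,
               1, 0, 0, 1, 1, 1, 1, 0, 0, 1)
)

# Null model (common age slope) and alternative (sex-specific age slopes).
fit0 <- glm(survival ~ age + female, family = binomial, data = donner)
fit1 <- glm(survival ~ age * female, family = binomial, data = donner)

# Wald test: z statistic and p-value of the interaction coefficient.
wald <- summary(fit1)$coefficients["age:female", ]
print(wald)

# Likelihood ratio test: the drop in deviance is compared with a
# chi-squared distribution on 1 degree of freedom.
lrt <- anova(fit0, fit1, test = "Chisq")
print(lrt)
```

The quadratic-age check would follow the same pattern, with the models fitted separately for each sex and an added I(age^2) term.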
Exercise
Open the R program code located at ftp://ftp.cgiar.org/ifpri/leds2013sep/GLM/GLM_code.R
Load the data named “donner”
Define the indicator variables “survival” and “sex”
Draw a scatterplot: survival vs. age by gender
Exercise
Estimate the null model; examine the sign and the p-value of the age and sex variables
Test for the quadratic term of age by gender group
Test for the interaction of sex and age
Draw two fitted-value plots: the null model and the model with the interaction term
What Do the Results Look Like?
H_0 model: log odds of survival = … + …*age + 1.597*I[sex=female]
H_1 model: log odds of survival = … + …*age + 6.928*I[sex=female] − 0.025*age*I[sex=female]
Logit Regression for Multiple Responses
Y_i | X_i ~ Mult(m_i, π_{1i}, π_{2i}, …, π_{Ki}), ∑_k π_{ki} = 1
Y = 1, 2, …, K (K-category response)
There are K-1 logit models:
log(π_{1i} / π_{Ki}) = X_i β_1
log(π_{2i} / π_{Ki}) = X_i β_2
…
log(π_{K-1,i} / π_{Ki}) = X_i β_{K-1}
Note: β_K is normalized to be 0
Rewrite the probabilities:
Pr(Y_i = 1) = exp(X_i β_1) / ∑_k exp(X_i β_k)
Pr(Y_i = 2) = exp(X_i β_2) / ∑_k exp(X_i β_k)
…
Pr(Y_i = K-1) = exp(X_i β_{K-1}) / ∑_k exp(X_i β_k)
Pr(Y_i = K) = exp(X_i β_K) / ∑_k exp(X_i β_k)
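A short numerical check may make the link between the K-1 logit equations and the probability formulas concrete. In the sketch below the linear predictors for K = 3 categories are made-up values, with the last one fixed at 0 to reflect the normalization β_K = 0.

```r
# Hypothetical linear predictors X_i beta_k for one observation, K = 3.
# The last entry is 0 because beta_K is normalized to 0.
eta <- c(0.5, -1.0, 0)

# Probabilities: Pr(Y = k) = exp(eta_k) / sum_j exp(eta_j).
p <- exp(eta) / sum(exp(eta))

# Log odds of each category relative to the reference category K.
log_odds_vs_K <- log(p / p[3])

print(p)
print(log_odds_vs_K)
```

The log odds relative to category K reproduce the linear predictors, which is exactly the set of K-1 logit equations.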
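R Code
multinom() function
library(nnet)  # provides multinom()
# One column of counts per response category
count.matrix <- cbind(Y1, Y2, …, YK)
# Hess = TRUE also returns the Hessian, which is needed for standard errors
fit <- multinom(count.matrix ~ X1 + X2 + …, data = …, Hess = TRUE)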
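The template above can be tried on simulated data. The sketch below is an illustration under stated assumptions: the covariate x1, the coefficients used to generate the data, and the 10 trials per row are all invented, and K = 3. Note that multinom() takes the first column of the count matrix as the reference category, whereas the slides normalize the last category.

```r
library(nnet)  # provides multinom()

set.seed(1)
n  <- 200
x1 <- rnorm(n)   # hypothetical covariate

# Made-up linear predictors; category 1 is the reference (eta = 0).
eta  <- cbind(0, 0.5 + 1.0 * x1, -0.5 - 0.5 * x1)
prob <- exp(eta) / rowSums(exp(eta))

# Draw 10 multinomial trials per row: an n x 3 matrix of counts.
count.matrix <- t(sapply(seq_len(n), function(i) {
  rmultinom(1, size = 10, prob = prob[i, ])
}))
colnames(count.matrix) <- c("Y1", "Y2", "Y3")

# Fit the multinomial logit; trace = FALSE silences the iteration log.
fit <- multinom(count.matrix ~ x1, Hess = TRUE, trace = FALSE)

# One row of coefficients per non-reference category.
print(coef(fit))

# Fitted category probabilities for each row.
p_hat <- fitted(fit)
```

Standard errors can then be read from summary(fit), which uses the Hessian requested through Hess = TRUE.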
Some Extensions
Conditional logit
– X_{ik} is specific to the alternative choice, but β does not vary across choices, i.e., X_{ik} β
Nested logit
– Can be decomposed into two standard logit models
Mixed logit
– Integrals of standard logit probabilities over a density of parameters β
See Train (2003), Discrete Choice Methods with Simulation, for further discussion