Generalized Linear Models (GLM) in R
EPID 799C Fall 2017
Overview
- Review of GLMs
- Overview of R GLM functions
- Comparison to SAS code
Simple Linear Regression
Given data points (x and y), we draw a line of "best fit". The "slope" of the best-fit line can be used to summarize the relationship between x and y. We can find the best-fit line by minimizing the sum of squared errors (which, for normally distributed errors, is equivalent to the maximum likelihood estimate).
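The line-fitting idea above can be sketched in R with simulated data (the variable names and true coefficient values here are made up for illustration):

```r
# Hypothetical data: the variable names and true values are illustrative.
set.seed(1)
x <- 1:50
y <- 2 + 0.5 * x + rnorm(50)     # true intercept 2, true slope 0.5

fit <- lm(y ~ x)                 # least-squares line of best fit
coef(fit)                        # estimated intercept and slope
sum(residuals(fit)^2)            # the sum of squared errors being minimized
```

With simulated data the estimates should land close to, but not exactly on, the true values.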
Multiple Linear Regression
Given data in n-dimensional space (x1, …, xn and y), we draw a hyperplane (dimension n-1) of "best fit". The "slopes" of the hyperplane along each dimension can be used to summarize the relationship between each xi and y.
How the Computer Solves a Regression
Except in trivial cases, a computer (through SAS, R, etc.) uses the same painfully simple approach to determine the best line fit:
1. Choose a "guess" slope.
2. Calculate a fit statistic for the guess (higher = worse, lower = better).
3. Repeat steps 1 and 2 until we find a "good" guess (within a tolerance value defined by the user or set by default).
[Optional improvements] Keep track of your old guesses to see if you are getting "better" or "worse". Have a system for selecting a good "jump" size between guesses. Start guessing somewhere that makes sense. Etc., etc.
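The guess-and-check loop above can be sketched for a one-parameter model (slope only, no intercept). This is an illustrative toy fitter, not what R actually uses internally:

```r
# Toy "guess and check" fitter for a slope-only model (illustrative only --
# real fitters use smarter update rules).
set.seed(2)
x <- rnorm(100)
y <- 3 * x + rnorm(100)                  # true slope is 3

sse <- function(b) sum((y - b * x)^2)    # fit statistic: lower = better

b_hat <- 0                               # step 1: starting guess
step  <- 1                               # current "jump" size between guesses
while (step > 1e-8) {                    # tolerance defines "good enough"
  candidates <- c(b_hat - step, b_hat, b_hat + step)
  best <- candidates[which.min(sapply(candidates, sse))]  # step 2: score guesses
  if (best == b_hat) step <- step / 2    # no improvement: take smaller jumps
  b_hat <- best                          # step 3: keep the best guess, repeat
}
b_hat                                    # close to coef(lm(y ~ x - 1))
```

The loop keeps whichever nearby guess has the lowest fit statistic and shrinks the jump size whenever no neighbor improves on the current guess.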
Universal Fit Statistic: Likelihood
We typically think of linear regression as using the sum-of-squared-errors fit statistic (i.e., "least-squares" regression). It turns out that this is a special case of a more general fit statistic: the likelihood. The likelihood is the probability of observing the data you have under a given set of assumed "slope" values β0, β1, β2, etc. The maximum likelihood estimate (MLE) is the set of assumed "slope" values β0, β1, β2, etc. that gives the highest likelihood (our best guess). We can use the values of the likelihood function near the MLE to estimate the precision of our estimates (i.e., to calculate confidence intervals or statistical tests for β0, β1, β2).
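The least-squares/likelihood connection can be checked directly in R. This sketch assumes a normal outcome with the error SD fixed at 1 for simplicity, so the maximum likelihood slope estimates should match ordinary least squares:

```r
# Illustrative check that least squares is ML under a normal outcome.
set.seed(3)
x <- rnorm(80)
y <- 1 + 2 * x + rnorm(80)               # true beta0 = 1, beta1 = 2

# Normal log-likelihood for candidate coefficients b = (b0, b1),
# with the error SD fixed at 1 for simplicity (an assumption).
loglik <- function(b) sum(dnorm(y, mean = b[1] + b[2] * x, sd = 1, log = TRUE))

mle <- optim(c(0, 0), function(b) -loglik(b))$par  # maximize the likelihood
ols <- coef(lm(y ~ x))                             # least-squares estimates
rbind(mle = mle, ols = ols)                        # essentially identical
```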
Problem 1: Abnormal Variables
But what if the outcome variable is not normal?
- A binary variable? Binomial
- A categorical variable? Multinomial
- An ordinal variable? Ordinal logistic
- A count or rate variable? Poisson / quasi-Poisson
These problems can be addressed using various outcome distributions.
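For example, a count outcome can be modeled with a Poisson distribution. A minimal sketch with simulated data (the names and true values here are illustrative):

```r
# Simulated count outcome (names and true values are illustrative).
set.seed(6)
x      <- rnorm(200)
counts <- rpois(200, lambda = exp(1 + 0.5 * x))  # true log-rate slope 0.5

pfit <- glm(counts ~ x, family = poisson(link = "log"))
coef(pfit)                                        # estimates on the log scale
```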
Problem 2: Non-Independence
But what if observations are not independent?
- Interference between adjacent units?
- Correlation between outcomes for adjacent units?
- Shared variables between different observations (clustering)?
- Repeated measurements of the same units?
These problems can be addressed by adding terms to the regression.
Solution: Generalized Linear Models
We can solve these problems (and more) by extending the linear model with two new features:
- An outcome distribution. We can use something other than the normal distribution for our model. Conveniently, all that changes is the specific form of the likelihood expression.
- A link function. Since we are still using a linear model, we need to transform our data appropriately so that it appears "linear" (in other words, so that it looks more like a normal variable).
R lets us choose these directly! Unlike other statistical programs, we don't have to do any "tricks" to make our model work (although we still can).
Generalized Linear Models
(Base R) outcome distributions: binomial, gaussian, Gamma, inverse.gaussian, poisson, quasi, quasibinomial, quasipoisson
(Base R) link functions: identity, log, inverse, logit, probit, cauchit, cloglog
You can find other options in packages, or manually create anything you want.
glm() Syntax
You can fit regression models in R using the general-purpose glm() function. The ~ operator separates the outcome from the covariates.
m1 = glm(height ~ vitamins)
m2 = glm(outcome ~ x1 + x2 + x3, family = binomial("logit"))
The model results are best saved in an object (here, all of the m's) so that we can inspect or manipulate parts of our output.
Epidemiology Examples
Choose your distribution family and link based on what you are estimating:
- Risk difference: family = binomial(link = "identity")
- Risk ratio: family = binomial(link = "log")
- Rate difference: family = poisson(link = "identity")
- Rate ratio: family = poisson(link = "log")
- Odds ratio: family = binomial(link = "logit")
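A sketch of two of these choices on a simulated cohort (the data, names, and true effect sizes here are made up; note the log link is given explicit starting values, a common trick to help it converge):

```r
# Hypothetical simulated cohort: binary exposure, binary outcome.
set.seed(4)
n       <- 5000
exposed <- rbinom(n, 1, 0.5)
risk    <- ifelse(exposed == 1, 0.20, 0.10)      # true risk ratio = 2
outcome <- rbinom(n, 1, risk)

# Odds ratio: binomial family, logit link
or_fit <- glm(outcome ~ exposed, family = binomial(link = "logit"))
exp(coef(or_fit))[["exposed"]]

# Risk ratio: binomial family, log link; explicit starting values keep the
# fitted risks inside (0, 1) while the model converges (a common trick)
rr_fit <- glm(outcome ~ exposed, family = binomial(link = "log"),
              start = c(-2, 0))
exp(coef(rr_fit))[["exposed"]]                   # should be near 2
```

Exponentiating the coefficient gives the ratio measure on its natural scale because both models use a log-type link.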
Inspecting Output
We will consider the results of the following code (run it yourself):
m1 = glm(WIC == "Y" ~ MAGE, data = births, family = binomial("logit"))
The data = births argument tells R to look into the births data frame. The WIC == "Y" is an in-line specification to set the TRUE value for the regression (in other words, what counts as a success). You can also make a new variable with T/F values first if you prefer.
Simple Inspection Options
Try out these options. What is different about each one?
m1
summary(m1)
plot(m1)
names(m1)
coef(m1)
exp(coef(m1))
confint(m1)
exp(confint(m1))
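Since the births data are not included here, this sketch refits a similar logistic model on simulated stand-ins for WIC and MAGE so the inspection calls can be tried:

```r
# The births data are not included here, so this refits a similar logistic
# model on simulated stand-ins for WIC and MAGE (purely illustrative).
set.seed(5)
MAGE <- round(runif(300, 15, 45))                    # maternal age
WIC  <- rbinom(300, 1, plogis(3 - 0.1 * MAGE)) == 1  # TRUE/FALSE outcome
m1   <- glm(WIC ~ MAGE, family = binomial("logit"))

summary(m1)          # coefficients, standard errors, deviance
names(m1)            # everything stored inside the model object
coef(m1)             # estimates on the log-odds scale
exp(coef(m1))        # the same estimates as odds ratios
exp(confint(m1))     # confidence intervals as odds ratios
```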