Logistic Regression and Generalized Linear Models:

Name: Logistic Regression and Generalized Linear Models:
Uploaded: 2017-10-20T17:45:18+00:00
Duration: PTM19S19
Channel: Blaise Harrell
Description: Logistic Regression and Generalized Linear Models:

Logistic Regression and Generalized Linear Models:
Blood Screening, Women’s Role in Society, and Colonic Polyps

Blood Screening ESR measurements (erythrocyte sedimentation rate)‏
Study checks for the rate of increase of ESR when (blood proteins) fibrinogen and globulin increase Looking for an association between the probability of an ESR reading > 20mm/hr and the levels of the two plasma proteins. Less than 20mm/hr indicates a healthy individual

Multiple Regression Doesn’t Work
Not Normally distributed The response variable is binary. In fact, the distribution is Binomial. The can be seen by looking at how the error term relates to the probability. If y = 1 then the error term is 1- P(y=1). We will assume for our Random Variables Y that

Logistic Regression, a Generalized Linear Model
Modelling the expected value of the response requires a transformation This chapter is about the logit transformation of a model, which is of the form If the response variable is 1, the log odds of the response is the logit function of a probability. Fixing all explanatory variables but xj, exp(Bj) represents the odds that the response variable is 1 when xj increases by 1.

R commands Plasma_glm_1 <- glm(ESR ~ fibrinogen, data = plasma, family = binomial())‏ Fits the model, in this example specifying the logistic function is implied because of the binomial() parameter is already passed. Layout(matrix(1:2, ncol = 2))‏ Lets you choose the layout of your graph screen. Cdplot(ESR ~ fibrinogen, data = plasma)‏ Cdplot(ESR ~ globulin, data = plasma)‏ cdplot “plots conditional densities describing how the conditional distribution of a categorical variable y changes over a numerical variable x” (help(“cdplot”))‏

Interpretation The area of the dark region is the probability that ESR < 20 mm/hr, and this decreases as the protein levels increase There is not much shape to the globulin density function.

R Commands Confint(plasma_glm_1, parm = “fibrinogen”)‏
Gives a confidence interval Output: Waiting for profiling to be done... 2.5 % % Exp(coef(plasma_glm_1)[“fibrinogen”])‏ This is necessary to do the reverse transformation to get the model coefficients fibrinogen

R Commands Summary(plasma_glm_1)‏ Output: Deviance Residuals:
Min Q Median Q Max Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) * fibrinogen * --- Signif. codes: 0 ‘***’ ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: on 31 degrees of freedom Residual deviance: on 30 degrees of freedom

R Commands Exp(confint(plasma_glm_1, parm = “fibrinogen”))‏
Output: 2.5 % % Plasma_glm_2 <- glm(ESR ~ fibrinogen + globulin, data = plasma, family = binomial())‏ Use logistic regression for both variables fibrinogen and globulin Summary(Plasma_glm_2)‏

R Commands Summary(Plasma_glm_2)‏ Output Deviance Residuals:
Min Q Median Q Max Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) * fibrinogen * globulin --- Signif. codes: 0 ‘***’ ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: on 31 degrees of freedom Residual deviance: on 29 degrees of freedom

Interpretation large confidence interval is because there are not many observations in all, and even fewer ESR > 20mm/hr The globulin coefficient is basically zero

R Commands Anova(plasma_glm_1, plasma_glm_2, test = “Chisq”)‏
Compares the two models one of just fibrinogen, the other of both fibrinogen and globulin Why Chisquared test? ANOVA assumes normal distribution for each data set. Why is this OK? Output Analysis of Deviance Table Model 1: ESR ~ fibrinogen Model 2: ESR ~ fibrinogen + globulin Resid. Df Resid. Dev Df Deviance P(>|Chi|) Comparing the residual deviance between the two models, we see that the difference is not significant. We must surmise that there is no association between globulin and ESR level.

R Commands Prob <- predict(plasma_glm_1, type = “response”
The model of just fibrinogen useful in creating a neat looking bubble plot. First we must take the predicted values from the first model and use them to determine the size of the bubbles Plot(globulin ~ fibrinogen, data = plasma, xlim = c(2, 6), ylim = c(25, 50), pch = “.”)‏ Plots the second model Symbols(plasma$fibrinogen, plasma$globulin, circles = prob, add = TRUE)‏ Uses the values of the first model to create different bubble sizes.

Bubbles size is Probability
The plot shows how the probability of having an ESR > 20mm/hr increases as fibrinogen increases

Generalized Linear Model
Unifies the logistic regression, Analysis of Variance and multiple regression techniques. GLMs have three essential parts An error distribution- the distribution of the response variable. Normal for Analysis of Variance and Multiple Regression Binomial for Logistic Regression

Main parts of GLM (continued)‏
A link function which links the explanatory variables to the expected value of the response. Logit function for logistic regression Identity function for ANOVA and multiple regression A variance function which shows the dependency of the response variable variability on the mean

Measure of Fit The deviance shows how well the model fits the data
Comparing two model’s deviances Use a likelihood ratio test Compare using Chi-square distribution

Women’s Role in Society
Response to “Women should take care of running their homes and leave the running the country up to men”. Factors: Education, Sex Response: Agree or Disagree data is presented as categories with counts for each education, sex combination

R Commands Womensrole_glm_1 <- glm(cbind(agree, disagree) ~ sex + education, data = womensrole, family= binomial())‏ This uses the cbind function to change data from two responses to one response that is a matrix of agree, disagree counts. Summary(womensrole_glm_1)‏ We can see that education is a significant factor to the response.

Summary Output Deviance Residuals: Min 1Q Median 3Q Max
Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) <2e-16 *** Why is there a P-value for the intercept? sexFemale education <2e-16 *** --- Signif. codes: 0 ‘***’ ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: on 40 degrees of freedom Residual deviance: on 38 degrees of freedom

Declaring a function in R
Myplot <- function(role.fitted) { Declares the function myplot and passes it an object “role.fitted” to be used in the function f <- womensrole$sex == “Female” Stores everything in the data with sex = f in f plot(womensrole$education, role.fitted, type = "n", ylab = "probability of agreeing", xlab = "Education", ylim = c(0,1))‏ Plots education against the role.fitted object which will be some predicted values from our GLM defined as Role.fitted1 <- predict(womensrole_glm_1, type = “response”)‏

Myplot function (cont)‏
lines(womensrole$education[!f], role.fitted[!f], lty = 1) lines(womensrole$education[f], role.fitted[f], lty = 2) These graph the lines, one for males and one for f s. Lty indicates the kind of line (in this case solid or dotted)‏ lgtxt <- c("Fitted (Males)", "Fitted (Females)") legend("topright", lgtxt, lty = 1:2, bty = "n") These add a legend for each line.

Myplot Function (cont)‏
y<-womensrole$agree/ (womensrole$agree + womensrole$disagree)‏ A basic calculation of the proportion of women that agree text(womensrole$education, y, ifelse(f, "\\VE", "\\MA"), family = "HersheySerif", cex = 1.25)‏ Plots y and not y using male/female symbols

Interpretation The two fitted lines indicate that sex does not change a probability of agreeing vs education The symbols of unfitted observations may indicate an interaction between sex and education

MyPlot Sex:Education By running the same analysis, e.g.
Womensrole_glm_2 <- glm(cbind(agree, disagree) ~ sex*education, data = womensrole, family = binomial())‏ Summary(womensrole_glm_2)‏ This summary shows a significant p-value for the sex:education interaction Role.fitted2 <- predict(womensrole_glm_1, type = “response”)‏ Myplot(role.fitted2)‏ The plot shows that less education is associated with agreement that women belong in the home.

The Deviance Residual One of the many methods for checking the adequacy of the model fit The deviance residual is the square root of the part of each observation that contributes to the deviance Res <- residuals( womensrole_glm_2, type = “deviance”)‏ Pulls the residuals for many other models as well Plot(predict(womensrole_glm_2), res, xlab = “fitted values”, ylab = “Residuals”, lim = max(abs(res))*c(-1,1))‏ No visible pattern: fit appears ok Abline(h = 0, lty = 2)‏ Adds a dotted line at height 0

Drug Treatment Testing Famlilial Andenomatous Polyposis (FAP)‏
Counts of colonic polyps after 12 months of treatment Don’t want to be the guy that did that Placebo controlled (Binary factor)‏ Age

GLM Analysis Multiple Regression won’t work.
Count data strictly positive Normality is not probable Poisson Regression- GLM with a log link function Ensures a Poisson Distribution Ensures positive fitted amounts R command: Polyps_glm_1 <- glm(number ~ treat + age, data = polyps, family = poisson())‏

Model Summary treatdrug -1.359083 0.117643 -11.55 < 2e-16 ***
Call: glm(formula = number ~ treat + age, family = poisson(), data = polyps)‏ Deviance Residuals: Min Q Median Q Max Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) < 2e-16 *** treatdrug < 2e-16 *** age e-11 *** --- Signif. codes: 0 ‘***’ ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for poisson family taken to be 1)‏ Null deviance: on 19 degrees of freedom Residual deviance: on 17 degrees of freedom AIC: Number of Fisher Scoring iterations: 5

Model Problem: Overdispersion
In previous models the variance can be seen as dependent solely on the mean E.G. Binomial, Poisson In practice, this doesn’t always work. Sometimes the raw data points are not independent, there is some correlation, in this case, possible “clustering” Compare residual deviance and degrees of freedom to determine These should be basically equal

Model Solution: Quasi-Likelihood
This procedure estimates the other factors that might contribute to the variance R Command Polyps_glm_2 <- glm(number ~ treat + age, data = polyps, family = quasipoisson())‏ Summary(polyps_glm_2)‏ The coefficients are still significant, but less so

Homework Please run through the commands for the myplot function (pg 100) and send me the command script. Please send me what you thought of the presentation, give me a grade and add any constructive criticism.

Logistic Regression and Generalized Linear Models:

Similar presentations

Presentation on theme: "Logistic Regression and Generalized Linear Models:"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Logistic Regression and Generalized Linear Models:

Similar presentations

Presentation on theme: "Logistic Regression and Generalized Linear Models:"— Presentation transcript:

Similar presentations

About project

Feedback