
Logistic Regression and Generalized Linear Models: Blood Screening, Women’s Role in Society, and Colonic Polyps

Blood Screening
ESR (erythrocyte sedimentation rate) measurements.
The study asks whether ESR rises as the levels of two blood plasma proteins, fibrinogen and globulin, increase. It looks for an association between the probability of an ESR reading > 20 mm/hr and the levels of the two plasma proteins. A reading below 20 mm/hr indicates a healthy individual.

Multiple Regression Doesn't Work
The response is not Normally distributed. The response variable is binary, so its distribution is in fact Binomial. This can be seen by looking at how the error term relates to the probability: if y = 1 the error term is 1 - P(y = 1), and if y = 0 it is -P(y = 1). We will assume for our random variables Y that [equation not preserved in the transcript]

Logistic Regression, a Generalized Linear Model
Modelling the expected value of the response requires a transformation. This chapter is about the logit transformation, which gives a model of the form
    $\log\frac{P(y=1)}{1-P(y=1)} = \beta_0 + \beta_1 x_1 + \dots + \beta_q x_q$
The left-hand side, the logit of the probability that the response is 1, is the log odds of that response. Fixing all explanatory variables but $x_j$, $\exp(\beta_j)$ is the factor by which the odds that the response is 1 are multiplied when $x_j$ increases by 1.
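The logit transformation and the odds interpretation of an exponentiated coefficient can be checked numerically. This is a minimal sketch in base R. The coefficients b0 and b1 are hypothetical values chosen only for illustration; they do not come from the plasma data.

```r
# Base R: qlogis() is the logit, plogis() is its inverse.
p <- 0.3
log_odds <- qlogis(p)        # same as log(p / (1 - p))
p_back <- plogis(log_odds)   # maps the log odds back to a probability

# Hypothetical coefficients, chosen only for illustration.
b0 <- -2
b1 <- 0.7

# Odds that y = 1 at a given value of x under the logistic model.
odds_at <- function(x) exp(b0 + b1 * x)

# Raising x by one unit multiplies the odds by exp(b1), whatever x is.
ratio <- odds_at(4) / odds_at(3)
print(c(ratio = ratio, exp_b1 = exp(b1)))
```

The two printed values agree, which is why exponentiated coefficients are reported as odds ratios.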

R commands
    plasma_glm_1 <- glm(ESR ~ fibrinogen, data = plasma, family = binomial())
Fits the model. The logit link does not need to be specified because it is the default for the binomial() family.
    layout(matrix(1:2, ncol = 2))
Lets you choose the layout of your graph screen (here, two plots side by side).
    cdplot(ESR ~ fibrinogen, data = plasma)
    cdplot(ESR ~ globulin, data = plasma)
cdplot "plots conditional densities describing how the conditional distribution of a categorical variable y changes over a numerical variable x" (help("cdplot")).

Interpretation
The dark region shows the conditional probability that ESR < 20 mm/hr, and this decreases as the protein levels increase. The conditional density plot for globulin shows much less structure.

R Commands
    confint(plasma_glm_1, parm = "fibrinogen")
Gives a 95% confidence interval for the fibrinogen coefficient (on the log-odds scale). Output:
    Waiting for profiling to be done...
        2.5 %    97.5 %
    0.3389465 3.9988602
    exp(coef(plasma_glm_1)["fibrinogen"])
Exponentiating moves the coefficient from the log-odds scale to the odds scale, giving the multiplicative change in the odds per unit increase in fibrinogen. Output:
    fibrinogen
      6.215715
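The back-transformation can be reproduced by hand from the numbers reported in these slides. The sketch below only redoes the arithmetic on those printed values; it does not refit the model, and the variable names are made up for the example.

```r
# Values copied from the output shown in the slides (log-odds scale).
beta_fibrinogen <- 1.8271
ci_log_odds <- c(lower = 0.3389465, upper = 3.9988602)

# Exponentiate to move to the odds scale.
odds_ratio <- exp(beta_fibrinogen)
ci_odds <- exp(ci_log_odds)

print(odds_ratio)
print(ci_odds)
```

The interval on the odds scale is obtained by exponentiating the endpoints of the interval for the coefficient, not by computing a new interval from scratch.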

R Commands
    summary(plasma_glm_1)
Output:
    Deviance Residuals:
        Min       1Q   Median       3Q      Max
    -0.9298  -0.5399  -0.4382  -0.3356   2.4794
    Coefficients:
                Estimate Std. Error z value Pr(>|z|)
    (Intercept)  -6.8451     2.7703  -2.471   0.0135 *
    fibrinogen    1.8271     0.9009   2.028   0.0425 *
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    (Dispersion parameter for binomial family taken to be 1)
        Null deviance: 30.885  on 31  degrees of freedom
    Residual deviance: 24.840  on 30  degrees of freedom

R Commands
    exp(confint(plasma_glm_1, parm = "fibrinogen"))
Output:
       2.5 %    97.5 %
    1.403468 54.535954
    plasma_glm_2 <- glm(ESR ~ fibrinogen + globulin, data = plasma, family = binomial())
Fits a logistic regression with both fibrinogen and globulin as explanatory variables.
    summary(plasma_glm_2)

R Commands
    summary(plasma_glm_2)
Output:
    Deviance Residuals:
        Min       1Q   Median       3Q      Max
    -0.9683  -0.6122  -0.3458  -0.2116   2.2636
    Coefficients:
                Estimate Std. Error z value Pr(>|z|)
    (Intercept) -12.7921     5.7963  -2.207   0.0273 *
    fibrinogen    1.9104     0.9710   1.967   0.0491 *
    globulin      0.1558     0.1195   1.303   0.1925
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    (Dispersion parameter for binomial family taken to be 1)
        Null deviance: 30.885  on 31  degrees of freedom
    Residual deviance: 22.971  on 29  degrees of freedom

Interpretation
The confidence interval is wide because there are not many observations in all, and even fewer with ESR > 20 mm/hr. The globulin coefficient (0.1558, p = 0.19) is not significantly different from zero.

R Commands
    anova(plasma_glm_1, plasma_glm_2, test = "Chisq")
Compares the two models, one with just fibrinogen, the other with both fibrinogen and globulin.
Why a chi-squared test? ANOVA assumes a Normal distribution for each data set. Why is this OK? For GLMs, anova() compares deviances rather than sums of squares, and the difference in deviance between two nested models is approximately chi-squared distributed.
Output:
    Analysis of Deviance Table
    Model 1: ESR ~ fibrinogen
    Model 2: ESR ~ fibrinogen + globulin
      Resid. Df Resid. Dev Df Deviance P(>|Chi|)
    1        30    24.8404
    2        29    22.9711  1   1.8692    0.1716
Comparing the residual deviance between the two models, we see that the difference is not significant. There is no evidence of an association between globulin and ESR level once fibrinogen is in the model.
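The p-value in the analysis of deviance table can be recovered by hand from the two residual deviances. The sketch below uses only the numbers printed in the table and the base R function pchisq(); the variable names are illustrative.

```r
# Residual deviances and residual degrees of freedom from the table.
dev_1 <- 24.8404   # ESR ~ fibrinogen
dev_2 <- 22.9711   # ESR ~ fibrinogen + globulin
df_1 <- 30
df_2 <- 29

# Likelihood ratio statistic: the drop in deviance from adding globulin.
lr_stat <- dev_1 - dev_2
df_diff <- df_1 - df_2

# Upper tail of the chi-squared distribution with df_diff degrees of freedom.
p_value <- pchisq(lr_stat, df = df_diff, lower.tail = FALSE)
print(c(statistic = lr_stat, df = df_diff, p = p_value))
```

The result matches the 0.1716 reported by anova(), up to rounding of the printed deviances.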

R Commands
The model with just fibrinogen is useful for creating a neat-looking bubble plot. First we take the predicted probabilities from the first model, to be used as the sizes of the bubbles:
    prob <- predict(plasma_glm_1, type = "response")
    plot(globulin ~ fibrinogen, data = plasma, xlim = c(2, 6), ylim = c(25, 50), pch = ".")
Plots globulin against fibrinogen, with each observation drawn as a small dot.
    symbols(plasma$fibrinogen, plasma$globulin, circles = prob, add = TRUE)
Adds a circle around each observation whose radius is proportional to the probability predicted by the first model.
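What predict(..., type = "response") returns can be illustrated on a small example. This is a sketch on hypothetical data (the vectors x and y below are made up, not the plasma measurements): the response-scale predictions are the inverse logit of the linear predictor, so they always lie between 0 and 1.

```r
# Hypothetical binary data, made up for illustration only.
demo <- data.frame(
  x = 1:10,
  y = c(0, 0, 0, 1, 0, 1, 0, 1, 1, 1)
)

fit <- glm(y ~ x, data = demo, family = binomial())

# Default predictions are on the link (log-odds) scale.
eta <- predict(fit)

# type = "response" gives fitted probabilities.
prob <- predict(fit, type = "response")

# The two are related by the inverse logit.
print(cbind(eta = eta, prob = prob, plogis_eta = plogis(eta)))
```

Because the fitted probabilities are bounded, they are a convenient choice for bubble radii.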

Bubble Size Is Probability
The plot shows how the probability of having an ESR > 20 mm/hr increases as fibrinogen increases.

Generalized Linear Model
Unifies logistic regression, analysis of variance, and multiple regression. GLMs have three essential parts. The first is an error distribution, the distribution of the response variable: Normal for analysis of variance and multiple regression, Binomial for logistic regression.

Main parts of GLM (continued)
The second is a link function, which links the explanatory variables to the expected value of the response: the logit function for logistic regression, the identity function for ANOVA and multiple regression. The third is a variance function, which describes how the variability of the response depends on its mean.
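In R, the three parts are bundled in a family object, which can be inspected directly. The sketch below uses only base R; poisson() is included because it appears later in these slides.

```r
# Family objects carry the error distribution, link, and variance function.
fam_normal <- gaussian()
fam_binom <- binomial()
fam_pois <- poisson()

# Default link for each family.
links <- c(
  normal = fam_normal$link,
  binomial = fam_binom$link,
  poisson = fam_pois$link
)
print(links)

# Variance functions: how the variance depends on the mean mu.
mu <- 0.25
var_binom <- fam_binom$variance(mu)   # mu * (1 - mu)
var_pois <- fam_pois$variance(3)      # equal to the mean
var_normal <- fam_normal$variance(3)  # constant, does not depend on the mean

# The link function itself and its inverse are also available.
eta <- fam_binom$linkfun(mu)          # logit of mu
mu_back <- fam_binom$linkinv(eta)     # back to the probability scale
```

Passing a family object to glm() is therefore what selects the error distribution, link, and variance function in one step.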

Measure of Fit
The deviance shows how well the model fits the data. To compare two nested models' deviances, use a likelihood ratio test: the difference in deviance is compared against a chi-square distribution.

Women's Role in Society
Response to the statement "Women should take care of running their homes and leave running the country up to men".
Factors: education, sex. Response: agree or disagree.
The data are presented as counts of agree and disagree for each combination of education and sex.

R Commands
    womensrole_glm_1 <- glm(cbind(agree, disagree) ~ sex + education, data = womensrole, family = binomial())
This uses the cbind function to combine the two count columns into a single response: a two-column matrix of agree and disagree counts.
    summary(womensrole_glm_1)
We can see that education is a significant predictor of the response.
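The two-column response can be tried out on a small grouped data set. This is a sketch on hypothetical counts (the numbers below are made up and are not the womensrole data). It also shows an equivalent formulation, assumed here for comparison only, in which the response is the observed proportion and the group sizes are passed as weights.

```r
# Hypothetical grouped counts, made up for illustration only.
demo <- data.frame(
  education = 0:5,
  agree = c(40, 35, 28, 20, 12, 6),
  disagree = c(10, 15, 22, 30, 38, 44)
)

# Response as a two-column matrix of successes and failures.
fit_counts <- glm(cbind(agree, disagree) ~ education,
                  data = demo, family = binomial())

# Equivalent fit: proportion as the response, group size as the weight.
demo$n <- demo$agree + demo$disagree
demo$prop <- demo$agree / demo$n
fit_props <- glm(prop ~ education, data = demo,
                 family = binomial(), weights = n)

print(coef(fit_counts))
print(coef(fit_props))
```

Both calls describe the same binomial likelihood, so the estimated coefficients agree.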

Summary Output
    Deviance Residuals:
         Min        1Q    Median        3Q       Max
    -2.72544  -0.86302  -0.06525   0.84340   3.13315
    Coefficients:
                Estimate Std. Error z value Pr(>|z|)
    (Intercept)  2.50937    0.18389  13.646   <2e-16 ***
    sexFemale   -0.01145    0.08415  -0.136    0.892
    education   -0.27062    0.01541 -17.560   <2e-16 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    (Dispersion parameter for binomial family taken to be 1)
        Null deviance: 451.722  on 40  degrees of freedom
    Residual deviance:  64.007  on 38  degrees of freedom
Why is there a p-value for the intercept? It tests whether the intercept is zero, that is, whether the log odds of agreeing are zero (probability 0.5) for males with zero years of education.

Declaring a function in R
    myplot <- function(role.fitted) {
Declares the function myplot and passes it an object "role.fitted" to be used in the function.
        f <- womensrole$sex == "Female"
Stores in f a logical vector that is TRUE for the rows of the data with sex equal to Female.
        plot(womensrole$education, role.fitted, type = "n", ylab = "Probability of agreeing", xlab = "Education", ylim = c(0, 1))
Sets up a plot of education against the role.fitted object (type = "n" draws the axes only). role.fitted will be some predicted values from our GLM, defined outside the function as
    role.fitted1 <- predict(womensrole_glm_1, type = "response")

myplot function (cont)
        lines(womensrole$education[!f], role.fitted[!f], lty = 1)
        lines(womensrole$education[f], role.fitted[f], lty = 2)
These draw the fitted lines, one for males and one for females. lty indicates the kind of line (in this case solid or dashed).
        lgtxt <- c("Fitted (Males)", "Fitted (Females)")
        legend("topright", lgtxt, lty = 1:2, bty = "n")
These add a legend entry for each line.

myplot function (cont)
        y <- womensrole$agree / (womensrole$agree + womensrole$disagree)
A basic calculation of the observed proportion of respondents who agree in each sex and education group.
        text(womensrole$education, y, ifelse(f, "\\VE", "\\MA"), family = "HersheySerif", cex = 1.25)
    }
Plots the observed proportions using the female and male symbols from the Hershey fonts. The closing brace ends the function.

Interpretation
The two fitted lines nearly coincide, indicating that sex has little effect on the probability of agreeing at a given level of education. The symbols for the observed proportions may indicate an interaction between sex and education.

myplot with Sex:Education
We run the same analysis with an interaction term:
    womensrole_glm_2 <- glm(cbind(agree, disagree) ~ sex * education, data = womensrole, family = binomial())
    summary(womensrole_glm_2)
This summary shows a significant p-value for the sex:education interaction.
    role.fitted2 <- predict(womensrole_glm_2, type = "response")
    myplot(role.fitted2)
The plot shows that less education is associated with agreement that women belong in the home.

The Deviance Residual
One of the many methods for checking the adequacy of the model fit. The deviance residual is the signed square root of each observation's contribution to the deviance.
    res <- residuals(womensrole_glm_2, type = "deviance")
Extracts the deviance residuals. residuals() works for many other kinds of model as well.
    plot(predict(womensrole_glm_2), res, xlab = "Fitted values", ylab = "Residuals", ylim = max(abs(res)) * c(-1, 1))
    abline(h = 0, lty = 2)
abline adds a dashed line at height 0. No visible pattern: the fit appears OK.
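The link between deviance residuals and the deviance can be verified directly: the squared deviance residuals sum to the residual deviance. The sketch below uses hypothetical grouped counts (made up for illustration, not the womensrole data).

```r
# Hypothetical grouped counts, made up for illustration only.
demo <- data.frame(
  education = 0:5,
  agree = c(40, 35, 28, 20, 12, 6),
  disagree = c(10, 15, 22, 30, 38, 44)
)

fit <- glm(cbind(agree, disagree) ~ education,
           data = demo, family = binomial())

# One deviance residual per observation (here, per education group).
res <- residuals(fit, type = "deviance")

# Their squares add up to the residual deviance of the model.
sum_sq <- sum(res^2)
print(c(sum_of_squares = sum_sq, deviance = deviance(fit)))
```

A single large residual therefore points to the observation that contributes most to the lack of fit.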

Drug Treatment Testing
Familial Adenomatous Polyposis (FAP).
Response: counts of colonic polyps after 12 months of treatment. (Don't want to be the guy that did that.)
Explanatory variables: treatment, drug versus placebo (a binary factor), and age.

GLM Analysis
Multiple regression won't work: count data are non-negative integers, so Normality is not plausible.
Poisson regression is a GLM with a Poisson error distribution and a log link function. The log link ensures positive fitted values.
R command:
    polyps_glm_1 <- glm(number ~ treat + age, data = polyps, family = poisson())

Model Summary
    Call:
    glm(formula = number ~ treat + age, family = poisson(), data = polyps)
    Deviance Residuals:
        Min       1Q   Median       3Q      Max
    -4.2212  -3.0536  -0.1802   1.4459   5.8301
    Coefficients:
                 Estimate Std. Error z value Pr(>|z|)
    (Intercept)  4.529024   0.146872   30.84  < 2e-16 ***
    treatdrug   -1.359083   0.117643  -11.55  < 2e-16 ***
    age         -0.038830   0.005955   -6.52 7.02e-11 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    (Dispersion parameter for poisson family taken to be 1)
        Null deviance: 378.66  on 19  degrees of freedom
    Residual deviance: 179.54  on 17  degrees of freedom
    AIC: 273.88
    Number of Fisher Scoring iterations: 5

Model Problem: Overdispersion
In the previous models (Binomial, Poisson) the variance is taken to depend solely on the mean. In practice, this doesn't always hold. Sometimes the raw data points are not independent and there is some correlation, in this case possibly "clustering".
To check, compare the residual deviance with its degrees of freedom. These should be roughly equal if the model is adequate. Here the residual deviance is 179.54 on 17 degrees of freedom, which indicates overdispersion.
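The check described above is a one-line calculation. The sketch below uses the residual deviance and degrees of freedom printed in the Poisson model summary in these slides; the variable names are illustrative.

```r
# Values copied from the Poisson model summary in the slides.
residual_deviance <- 179.54
residual_df <- 17

# A ratio near 1 is what the Poisson assumption implies;
# a ratio well above 1 suggests overdispersion.
ratio <- residual_deviance / residual_df
print(ratio)
```

A ratio this far above 1 is the motivation for moving to a quasi-likelihood fit.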

Model Solution: Quasi-Likelihood
This procedure estimates a dispersion parameter from the data, which scales the variance to allow for the extra variability.
R command:
    polyps_glm_2 <- glm(number ~ treat + age, data = polyps, family = quasipoisson())
    summary(polyps_glm_2)
The coefficients are still significant, but less so.
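The effect of switching to quasipoisson() can be seen by fitting both families to the same data. This is a sketch on hypothetical counts (the data frame below is made up and is not the polyps data). The coefficient estimates are unchanged, while the standard errors are multiplied by the square root of the estimated dispersion.

```r
# Hypothetical count data, made up for illustration only.
demo <- data.frame(
  number = c(40, 5, 22, 9, 55, 2, 12, 14, 35, 8, 18, 3),
  treat = factor(rep(c("placebo", "drug"), times = 6),
                 levels = c("placebo", "drug")),
  age = c(20, 18, 25, 22, 16, 30, 34, 27, 19, 24, 28, 33)
)

fit_pois <- glm(number ~ treat + age, data = demo, family = poisson())
fit_quasi <- glm(number ~ treat + age, data = demo, family = quasipoisson())

# Same point estimates under both families.
print(cbind(poisson = coef(fit_pois), quasipoisson = coef(fit_quasi)))

# Standard errors differ by a constant factor: sqrt(dispersion).
se_pois <- summary(fit_pois)$coefficients[, "Std. Error"]
se_quasi <- summary(fit_quasi)$coefficients[, "Std. Error"]
dispersion <- summary(fit_quasi)$dispersion
print(c(dispersion = dispersion, se_ratio = unname(se_quasi / se_pois)[1]))
```

This is why the coefficients can remain significant under the quasi-likelihood fit but with larger p-values when the estimated dispersion exceeds 1.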

Homework Please run through the commands for the myplot function (pg 100) and send me the command script. Please send me what you thought of the presentation, give me a grade and add any constructive criticism. zweihanderdawg@gmail.com