Logistic Regression: Predicting Dichotomous Data

Predicting a Dichotomy
The response variable has only two states: male/female, present/absent, yes/no, etc. Linear regression fails because it cannot keep the prediction within the bounds of 0 and 1. Both continuous and non-continuous predictors are possible.
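The failure of linear regression can be seen directly. A minimal sketch with simulated 0/1 data (all variable names hypothetical, not the Snodgrass data): lm() predictions can escape the 0–1 range, while binomial glm() predictions cannot.

```r
# Simulated dichotomous data (hypothetical)
set.seed(1)
x <- 1:100
y <- rbinom(100, size = 1, prob = plogis((x - 50) / 10))

lin <- lm(y ~ x)                          # straight line through 0/1 data
log_fit <- glm(y ~ x, family = binomial)  # logistic model

range(predict(lin))                           # may fall below 0 or above 1
range(predict(log_fit, type = "response"))    # always strictly within (0, 1)
```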

Logistic Model
Explanatory variables are used to predict the probability that the response will be present (male, yes, etc.). We fit a linear model to the log of the odds that an event will occur. If the probability that an event will occur is p, then the odds are p/(1-p).

Logits
Equations:
– logit(p) = log(p/(1-p))
– logit(p) = b0 + b1*x1 + b2*x2 + ...
So logistic regression is a linear regression of logits (logs of odds).
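The logit and its inverse can be checked by hand; a quick sketch (the value of p is arbitrary):

```r
p <- 0.8
odds <- p / (1 - p)        # 0.8 / 0.2 = 4
logit_p <- log(odds)       # the log-odds, about 1.386

# R's built-ins: qlogis() is the logit, plogis() its inverse
stopifnot(all.equal(logit_p, qlogis(p)))
stopifnot(all.equal(plogis(logit_p), p))
```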

Assumptions
– Dichotomous response (only two states possible)
– Outcomes statistically independent
– Model contains all relevant predictors and no irrelevant ones
– Sample sizes of about 50 cases per predictor

Two Approaches
– Data consisting of individual cases with a dichotomous variable
– Grouped data, where the number present and number absent are known for each combination of explanatory variables (in practice these will usually be categorical/ordinal)
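For the grouped case, glm() accepts a two-column response of (successes, failures) built with cbind(). A sketch with made-up counts (data frame and column names hypothetical):

```r
dat <- data.frame(
  size    = c(1, 2, 3, 4),   # hypothetical explanatory variable
  present = c(2, 5, 8, 9),   # events observed at each level
  absent  = c(8, 5, 2, 1)    # non-events at each level
)
fit <- glm(cbind(present, absent) ~ size, family = binomial, data = dat)
coef(fit)   # intercept and slope on the logit scale
```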

Inverting Snodgrass
Instead of testing whether houses inside the white wall are larger than those outside, we can use area to predict where a house is located.

# Use Rcmdr to create a dichotomous variable In
library(RcmdrMisc)  # provides bin.var()
Snodgrass$In <- with(Snodgrass, ifelse(Inside == "Inside", 1, 0))

# Use Rcmdr to bin Area into 10 bins, labeled with numbers
Snodgrass$AreaBin <- bin.var(Snodgrass$Area, bins = 10,
                             method = "intervals", labels = FALSE)

# Use Rcmdr to aggregate: compute mean Area and In for each AreaBin
AggregatedData <- aggregate(Snodgrass[, c("Area", "In"), drop = FALSE],
                            by = list(AreaBin = Snodgrass$AreaBin), FUN = mean)

# Plot raw data
plot(In ~ Area, data = Snodgrass, las = 1)

# Plot means by AreaBin groups
points(AggregatedData[, 2:3], type = "b", pch = 16)

Fitting a Simple Model We start with a simple model using Area only Statistics | Fit Models | Generalized Linear Model In is the response, Area is the explanatory variable Family is binomial, Link function is logit

> GLM.1 <- glm(In ~ Area, family=binomial(logit), data=Snodgrass)
> summary(GLM.1)

Call:
glm(formula = In ~ Area, family = binomial(logit), data = Snodgrass)

Deviance Residuals:
    Min      1Q  Median      3Q     Max

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)                                e-06 ***
Area                                       e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance:  on 90 degrees of freedom
Residual deviance:  on 89 degrees of freedom
AIC:

Number of Fisher Scoring iterations: 6

Results
The slope for Area is highly significant: Area is a significant predictor of the odds of being inside the white wall. The residual deviance is less than its degrees of freedom, an indication that the binomial model fits (no evidence of overdispersion).

# Rcmdr command
> GLM.1 <- glm(In ~ Area, family=binomial(logit), data=Snodgrass)

# Typed commands
> x <- seq(20, 470, 5)
> y <- predict(GLM.1, data.frame(Area=x), type="response")
> plot(In ~ Area, data=Snodgrass, las=1)
> points(AggregatedData[, 2:3], type="b", lty=2, pch=16)
> lines(x, y, col="red", lwd=2)

# Rcmdr command
> Snodgrass$Predicted <- with(Snodgrass,
+   factor(ifelse(fitted.GLM.1 < .5, "Outside", "Inside")))

# Use Rcmdr to produce a crosstabulation of Inside and Predicted
> .Table <- xtabs(~ Inside + Predicted, data=Snodgrass)
> .Table
         Predicted
Inside    Inside Outside
  Inside      29       9
  Outside      5      48
> (29 + 48)/(29 + 9 + 5 + 48)
[1] 0.8461538

Predictions are correct 84.6% of the time.

Expanding the Model
Expand the model by adding Total and Types. Check the results: neither of the new variables is significant, but this could be due to the high correlation between the two (+.94). Delete Types and try again.

Third Model
Without Types, Total is now highly significant. An ANOVA comparing the 2nd and 3rd models shows no difference, so the 3rd (simpler) model is preferred. Also, AIC (Akaike's Information Criterion) is lower, which is better. The new model is 89% accurate.
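The model-comparison steps above can be sketched as follows. Since the Snodgrass data are not reproduced here, the data frame is simulated (all column names and numeric results are illustrative only):

```r
# Simulated stand-in for the Snodgrass data (hypothetical)
set.seed(42)
n <- 91
Snod <- data.frame(Area = runif(n, 20, 470))
Snod$Total <- rpois(n, Snod$Area / 20)         # correlated with Area
Snod$Types <- rpois(n, Snod$Total / 2 + 1)     # correlated with Total
Snod$In <- rbinom(n, 1, plogis(-3 + 0.015 * Snod$Area))

GLM.2 <- glm(In ~ Area + Total + Types, family = binomial, data = Snod)
GLM.3 <- update(GLM.2, . ~ . - Types)          # drop Types

anova(GLM.3, GLM.2, test = "Chisq")  # likelihood-ratio test of the two models
AIC(GLM.2, GLM.3)                    # prefer the model with the lower AIC
```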

Akaike Information Criterion
AIC measures the relative goodness of fit of a statistical model. Roughly, it describes the tradeoff between the accuracy and the complexity of the model. It is a method of comparing different statistical models: generally, prefer the model with the lower AIC.
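Concretely, AIC = 2k - 2*log(L), where k is the number of estimated parameters and L the maximized likelihood. A sketch verifying this against R's AIC() on a toy binomial fit (simulated data, names hypothetical):

```r
set.seed(7)
x <- runif(50)
y <- rbinom(50, 1, plogis(2 * x - 1))
fit <- glm(y ~ x, family = binomial)

k <- length(coef(fit))       # parameters (binomial family: no dispersion term)
aic_by_hand <- 2 * k - 2 * as.numeric(logLik(fit))
all.equal(aic_by_hand, AIC(fit))   # TRUE
```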