Logistic Regression

Logistic regression is used to study or model the association between a binary response variable (y) and a set of explanatory variables.

Simple logistic regression model:

    log( p(x) / (1 - p(x)) ) = β0 + β1 x,   equivalently   p(x) = exp(β0 + β1 x) / (1 + exp(β0 + β1 x)),

where p(x) is the probability that y equals 1 (success) when the independent variable is x.

Remarks: The linear expression β0 + β1 x involving the explanatory variable x can take on any value from -∞ to ∞. However, p(x) is between 0 and 1, and the odds p(x)/(1 - p(x)) are between 0 and ∞, while the log(odds) ranges from -∞ to ∞; this is why the model is stated on the log-odds scale. The parameter β1 measures the degree of association between the probability of the event occurring and the value of the independent variable x.
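
To see how the model keeps probabilities bounded, the short R sketch below uses hypothetical values for β0 and β1 (not the estimates from the example data) and evaluates the linear predictor and the corresponding probabilities over a range of x values.

# Minimal sketch: beta0 and beta1 are illustrative, not fitted from the example data.
beta0 <- -3
beta1 <- 0.05
x   <- seq(-100, 200, by = 50)
eta <- beta0 + beta1 * x           # linear predictor: can be any real number
p   <- exp(eta) / (1 + exp(eta))   # inverse logit: always strictly between 0 and 1
cbind(x, eta, p)
log(p / (1 - p))                   # log-odds: recovers eta, which is linear in x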

Simple Logistic Regression Example

The data record a patient's creatine kinase level (ck) and whether the patient had a heart attack (y = 1) or not (y = 0). Note that this is a summarized version of the data.

data.ex12.22 = read.csv("data_example12_22.csv", header = T)
head(data.ex12.22)
  y ck
1 1 20
2 1 20
3 0 20

table(data.ex12.22)
   ck
y   20 60 100 140 180 220 260 300 340 380 420 460 500
  0 88 26   8   5   0   1   1   1   0   0   0   0   0
  1  2 13  30  30  21  19  18  13  19  15   7   8  35

Simple Logistic Regression Using R

data.ex12.22 = read.csv("data_example12_22.csv", header = T)
attach(data.ex12.22)     # so that y and ck can be used directly
result.ex12.22 = glm(y ~ ck, family = binomial(link = "logit"))
summary(result.ex12.22)$coef
               Estimate  Std. Error   z value     Pr(>|z|)
(Intercept) -3.02835960 0.366971041 -8.252312 1.553589e-16
ck           0.03510439 0.004080992  8.601927 7.838875e-18

These estimates are obtained using a maximum likelihood procedure. The p-value for ck can be used to test the null hypothesis H0: β1 = 0 vs. Ha: β1 ≠ 0. Since the estimate of β1 is positive, patients with higher levels of CK have a higher estimated probability of having had a heart attack.

The model implies that an increase of 1 unit in x multiplies the odds of a "success" by exp(β1). In our example, exp(0.0351) = 1.035723. Hence, as CK increases by 1 unit, the odds that the person had a heart attack increase by a factor of about 1.036, that is, by about 3.6%.
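
A minimal sketch of how this odds-ratio interpretation can be computed directly from the fitted object result.ex12.22; the 10-unit multiplier is an added illustration, not part of the original example.

b1 <- coef(result.ex12.22)["ck"]
exp(b1)        # about 1.036: odds multiplier per 1-unit increase in CK
exp(10 * b1)   # odds multiplier for a 10-unit increase in CK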

Simple Logistic Regression Using R

confint(result.ex12.22, level = .90)
Waiting for profiling to be done...
                   5 %        95 %
(Intercept) -3.66938129 -2.45759221
ck           0.02881989  0.04227805

Therefore, we are 90% confident that β1 is between 0.029 and 0.042.

To compare the fitted curve with the observed proportion of heart attacks at each CK level:

x = seq(20, 500, by = 40)
proportions = tapply(y, ck, mean)
plot(x, proportions)
curve(exp(-3.0284 + 0.035*x) / (1 + exp(-3.0284 + 0.035*x)), 20, 500, lwd = 2, col = "blue", add = T)
points(x, exp(-3.0284 + 0.035*x) / (1 + exp(-3.0284 + 0.035*x)), pch = 20)

predict(result.ex12.22, newdata = data.frame(ck = x))   # predictions on the log-odds scale
result.ex12.22$fitted                                   # fitted probabilities
result.ex12.22$residuals
However, the residuals are not very informative, since the response variable is binary.
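
A small sketch, reusing the fitted object result.ex12.22, showing that predict() returns values on the log-odds scale by default, while type = "response" returns estimated probabilities; the CK values in new.ck are hypothetical.

new.ck <- data.frame(ck = c(100, 300, 500))                    # hypothetical CK levels
predict(result.ex12.22, newdata = new.ck)                      # log-odds scale
predict(result.ex12.22, newdata = new.ck, type = "response")   # probability scale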

Classifying Outcomes

After we have obtained the estimated probability that a patient had a heart attack ("success"), we still have to decide whether to classify that patient as having had a heart attack or not. One approach is to choose a cut off value and classify a patient as a success whenever the estimated probability exceeds it; here a cut off of 0.7 is used.

rm(list = ls())     # This will clear the memory
data.ex12.22 = read.csv("data_example12_22.csv", header = T)
attach(data.ex12.22)
result.ex12.22 = glm(y ~ ck, family = binomial(link = "logit"))

y.hat = result.ex12.22$fitted          # fitted values are already probabilities, P(y = 1)
prediction = as.numeric(y.hat > .7)    # classify as 1 when the fitted probability exceeds 0.7
error = y - prediction
data.frame(y, prob = y.hat, prediction, error)[1:3, ]
  y       prob prediction error
1 1 0.08897039          0     1
2 1 0.08897039          0     1
3 0 0.08897039          0     0

percent.error = sum(abs(error)) / length(error)
percent.error                                    # 0.1472222
false.pos = sum((y == 0) * (prediction == 1))    # 8
false.neg = sum((y == 1) * (prediction == 0))    # 45
true.pos  = sum((y == 1) * (prediction == 1))    # 185
true.neg  = sum((y == 0) * (prediction == 0))    # 122

                Predicted = 1   Predicted = 0
Actual Y = 1              185              45
Actual Y = 0                8             122
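
As a side note, the same confusion matrix and error rate can be obtained more compactly with table(); this is a minimal sketch reusing the y and prediction vectors defined above.

table(Actual = y, Predicted = prediction)   # cross-tabulates actual vs. predicted outcomes
mean(y != prediction)                       # overall error rate, 0.1472222 for cut off 0.7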

Finding the Optimal Cut Off Value

results = result.ex12.22; success = y
# Finding the best cut off value
cutoff = seq(0, 1, by = .05)
m = length(cutoff)
False.pos = array(99, m)
False.neg = array(99, m)
Errors = array(99, m)
for(i in 1:m)
{
  prediction = as.numeric(results$fitted > cutoff[i])
  error = success - prediction
  False.pos[i] = sum((success == 0) * (prediction == 1))
  False.neg[i] = sum((success == 1) * (prediction == 0))
  Errors[i] = False.pos[i] + False.neg[i]
}
Percent.error = Errors / length(success)    # divide by the total number of patients
data.frame(CutOff = cutoff, FalsePositive = False.pos, FalseNegative = False.neg, ErrorRate = Percent.error)
   CutOff FalsePositive FalseNegative  ErrorRate
1    0.00           130             0 0.36111111
2    0.05           130             0 0.36111111
3    0.10            42             2 0.12222222
4    0.15            42             2 0.12222222
5    0.20            42             2 0.12222222
6    0.25            42             2 0.12222222
7    0.30            16            15 0.08611111
8    0.35            16            15 0.08611111
9    0.40            16            15 0.08611111
10   0.45            16            15 0.08611111
11   0.50            16            15 0.08611111
12   0.55            16            15 0.08611111
13   0.60            16            15 0.08611111
14   0.65             8            45 0.14722222
15   0.70             8            45 0.14722222
16   0.75             8            45 0.14722222
17   0.80             8            45 0.14722222
18   0.85             8            45 0.14722222
19   0.90             3            75 0.21666667
20   0.95             3            75 0.21666667
21   1.00             0           230 0.63888889

The error rate is minimized (0.08611111) for any cut off between 0.30 and 0.60, so a value near the middle of that range, such as 0.45, is a reasonable choice.
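
A short sketch, reusing the cutoff and Percent.error vectors from the loop above, that locates the minimizing cut offs programmatically and plots the error rate against the cut off value.

best <- which(Percent.error == min(Percent.error))
cutoff[best]                                  # all cut offs achieving the minimum error rate
plot(cutoff, Percent.error, type = "b",
     xlab = "Cut off value", ylab = "Error rate")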