STAT E-150 Statistical Methods
Logistic Regression
So far we have considered regression analyses where the response variable is quantitative. What if this is not the case? If a response variable is categorical, a different regression model applies, called logistic regression.
A categorical variable which has only two possible values is called a binary variable. We can code the two outcomes as 1 to represent the presence of some condition (“success”) and 0 to represent the absence of the condition (“failure”). The logistic regression model describes how the probability of “success” is related to the values of the explanatory variables, which can be categorical or quantitative.
Logistic regression models work with odds rather than proportions
The odds are just the ratio of the proportions for the two possible outcomes: if π is the proportion for one outcome, then 1 − π is the proportion for the second outcome. The odds of the first outcome occurring are

odds = π / (1 − π)
Here's an example: Suppose that a coin is weighted so that heads are more likely than tails, with P(heads) = .6. Then P(tails) = 1 − P(heads) = .4.
The odds of getting heads in a toss of this coin are .6/.4 = 1.5.
The odds of getting tails in a toss of this coin are .4/.6 ≈ .667.
The odds ratio is 1.5/.667 = 2.25.
This tells us that the odds of getting "heads" are 2.25 times the odds of getting "tails".
You can also convert the odds of an event back to the probability of the event:
For an event A, P(A) = odds/(1 + odds). For example, if the odds of a horse winning are 9 to 1, then the probability of the horse winning is 9/(1 + 9) = .9.
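A minimal Python sketch of these two conversions (the function names are ours, chosen for illustration):

```python
def odds(p):
    """Odds of an event that has probability p."""
    return p / (1 - p)

def prob(o):
    """Probability of an event that has odds o."""
    return o / (1 + o)

# The weighted coin: P(heads) = .6
print(odds(0.6))               # 1.5   -> odds of heads
print(odds(0.4))               # 0.667 -> odds of tails
print(odds(0.6) / odds(0.4))   # 2.25  -> odds ratio

# The horse whose odds of winning are 9 to 1
print(prob(9))                 # 0.9
```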
The Logistic Regression Model
The relationship between the probability of success of a binary response variable and a single quantitative predictor variable follows an S-shaped curve. Here is a plot of p vs. x for different logistic regression models: the points on each curve represent P(Y = 1) for each value of x. The associated model is the logistic or logit model.
The general logistic regression model is

log(π/(1 − π)) = β0 + β1x1 + β2x2 + … + βkxk

where π = P(Y = 1) and E(Y) = π, the probability of success. The xi are independent quantitative or qualitative variables.
Odds and log(odds)
Let π = P(Y = 1) be a probability with 0 < π < 1. Then the odds that Y = 1 is the ratio

odds = π/(1 − π)

and so

log(odds) = log(π/(1 − π))
This transformation from π to log(odds) is called the logistic or logit transformation.
The relationship is one-to-one: for every value of π (except for 0 and 1) there is one and only one value of log(odds).
The log(odds) can have any value from -∞ to ∞, and so we can use a linear predictor.
That is, we can model the log odds as a linear function of the explanatory variable: y = β0 + β1x. (To verify this, solve the probability form of the model for π/(1 − π) and then take the log of both sides, as the derivation below shows.)
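To see why, start from the probability form of the model (given on the Logistic Regression Model slide below) and solve for the odds:

π = e^(β0 + β1x) / (1 + e^(β0 + β1x))
1 − π = 1 / (1 + e^(β0 + β1x))
π/(1 − π) = e^(β0 + β1x)
log(π/(1 − π)) = β0 + β1x

so the log odds is exactly a linear function of x.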
For any fixed value of the predictor x, there are four probabilities:

                          True value                       Fitted value
Actual probability p      true P(Yes) for this x           p̂ = #Yes/(#Yes + #No)
Model probability π       true P(Yes) from the model       π̂ = fitted P(Yes) from the model

If the model is exactly correct, then p = π and the two fitted values estimate the same number.
To go from log(odds) to odds, use the exponential function e^x:
1. odds = e^(log(odds))
2. You can check that if odds = π/(1 − π), then you can solve for π to find that π = odds/(1 + odds).
3. Since log(odds) = log(π/(1 − π)), we have the result π = e^(log(odds)) / (1 + e^(log(odds)))
The Logistic Regression Model
The logistic regression model for the probability of success π of a binary response variable based on a single predictor x is:
Logit form: log(π/(1 − π)) = β0 + β1x
Probability form: π = e^(β0 + β1x) / (1 + e^(β0 + β1x))
Example: A study was conducted to analyze behavioral variables and stress in people recently diagnosed with cancer. For our purposes we will look at patients who have been in the study for at least a year, and the dependent variable (Outcome) is coded 1 to indicate that the patient is improved or in complete remission, and 0 if the patient has not improved or has died. The predictor variable is the survival rating assigned by the patient's physician at the time of diagnosis. This is a number between 0 and 100 and represents the estimated probability of survival at five years. Out of 66 cases there are 48 patients who have improved and 18 who have not.
The scatterplot shows us that a linear regression analysis is not appropriate for these data. The scatterplot clearly has no linear trend, but it does show that the proportion of patients who improve is much higher when the survival rating is high, as would be expected. However, if we transform the response from whether the patient improved to the odds of improvement, and then consider the log of the odds, we will have a variable that is a linear function of the survival rating, and we will be able to use linear regression.
Let p = the probability of improvement
Then 1 − p is the probability of no improvement. We will look for an equation of the form

log(p/(1 − p)) = β0 + β1 × SurvRate

Here β1 will be the amount of increase in the log odds for a one-unit increase in SurvRate.
Variables in the Equation
Here are the results of this analysis. We can see that the logistic regression equation is

log(odds) = -2.684 + .081 × SurvRate

                     B        S.E.    Wald      df   Sig.    Exp(B)   95% C.I. for Exp(B)
                                                                      Lower     Upper
Step 1(a)  survrate  .081     .019    17.755    1    .000    1.085    1.044     1.126
           Constant  -2.684   .811    10.941    1    .001    .068
a. Variable(s) entered on step 1: survrate.
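For readers working outside SPSS, here is a minimal Python sketch of the same kind of fit using statsmodels; the data below are simulated stand-ins generated from the fitted equation, not the study's actual records:

```python
import numpy as np
import statsmodels.api as sm

# Simulated stand-in data: 66 patients with survival ratings 0-100,
# outcomes drawn from the fitted equation log(odds) = -2.684 + .081*SurvRate
rng = np.random.default_rng(0)
survrate = rng.uniform(0, 100, size=66)
p = 1 / (1 + np.exp(-(-2.684 + 0.081 * survrate)))
outcome = rng.binomial(1, p)

X = sm.add_constant(survrate)        # design matrix with an intercept column
model = sm.Logit(outcome, X).fit()   # maximum likelihood logistic fit
print(model.summary())               # coefficients, z statistics, CIs
print(np.exp(model.params))          # odds ratios, i.e. SPSS's Exp(B)
```

Note that statsmodels reports a z statistic for each coefficient; its square is the Wald chi-square that SPSS prints.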
Assessing the Model
In linear regression, we used the p-values associated with the test statistic t to assess the contribution of each predictor. In logistic regression, we can use the Wald statistic in the same way. Note that in this example, the Wald statistic for the predictor is 17.755, which is significant at the .05 level of significance (see the table above). This is evidence that SurvRate is a significant predictor in this model.
The hypotheses for this test are
H0: β1 = 0
Ha: β1 ≠ 0
Since p is close to zero, the null hypothesis is rejected. This indicates that the survival rating is a useful predictor of the patient's outcome. The resulting regression equation is log(odds) = -2.684 + .081 × SurvRate.
Here are scatterplots of the data and of the values predicted by the model:
Note how well the results fit the data:
The fitted curve is quite close to the points in the lower left, rises rapidly across the points in the center, where the values of SurvRate have roughly equal numbers of patients who improve and who don't improve, and finally comes close to the cluster of points in the upper right. The fitted values all fall between 0 and 1.
SPSS takes an iterative approach to this solution: it begins with some starting values for β0 and β1, sees how well the estimated log odds fit the data, adjusts the coefficients, and then reexamines the fit. This continues until no further adjustment produces a better fit. What do all of the SPSS results tell us?
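Before turning to that question, here is what such an iterative fit can look like: a bare-bones Newton-Raphson sketch in Python, offered as a teaching illustration rather than SPSS's exact algorithm:

```python
import numpy as np

def fit_logistic(x, y, tol=1e-3, max_iter=20):
    """Newton-Raphson for log(odds) = b0 + b1*x (teaching sketch)."""
    X = np.column_stack([np.ones_like(x), x])   # intercept column plus x
    beta = np.zeros(2)                          # starting values
    for _ in range(max_iter):
        pi = 1 / (1 + np.exp(-X @ beta))        # current fitted probabilities
        grad = X.T @ (y - pi)                   # gradient of the log likelihood
        W = pi * (1 - pi)                       # weights for the Hessian
        hess = X.T @ (X * W[:, None])           # negative Hessian
        step = np.linalg.solve(hess, grad)      # Newton adjustment
        beta += step
        if np.all(np.abs(step) < tol):          # stop when estimates settle
            break
    return beta
```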
Case Processing Summary
Starting with Block 0: Beginning Block. The Case Processing Summary tells us that all 66 cases were included:

Unweighted Cases(a)                       N     Percent
Selected Cases   Included in Analysis     66    100.0
                 Missing Cases             0       .0
                 Total                    66    100.0
Unselected Cases                           0       .0
Total                                     66    100.0
a. If weight is in effect, see classification table for the total number of cases.
Variables in the Equation / Variables not in the Equation
The Variables in the Equation table shows that in this first iteration only the constant was used. The second table lists the variables that were not included in this model; it indicates that if SurvRate were to be included, it would be a significant predictor:

Variables in the Equation
                   B      S.E.   Wald     df   Sig.   Exp(B)
Step 0   Constant  .981   .276   12.594   1    .000   2.667

Variables not in the Equation
                              Score    df   Sig.
Step 0   Variables  survrate  34.538   1    .000
         Overall Statistics
The Iteration History shows the fit for this constant-only model; since SurvRate is not yet included, there is little change from one iteration to the next. The -2 Log likelihood can be used to assess how well a model fits the data. It is based on summing the logs of the probabilities that the model assigns to the observed outcomes. The lower the -2LL value, the better the fit.

Iteration History(a,b,c)
                      -2 Log likelihood   Coefficients: Constant
Step 0   Iteration 1      77.414              .909
         Iteration 2      77.346              .980
         Iteration 3      77.346              .981
         Iteration 4      77.346              .981
a. Constant is included in the model.
b. Initial -2 Log Likelihood: 77.346
c. Estimation terminated at iteration number 4 because parameter estimates changed by less than .001.
Classification Table
You can see from the Classification Table that with only the constant in the model, every case is classified into the larger category; no predictor is being used yet. You can also see that there were 48 patients who improved and 18 who did not.

Classification Table(a,b)
                                 Predicted outcome      Percentage
Observed                         0         1            Correct
Step 0   outcome      0          0         18              .0
                      1          0         48           100.0
         Overall Percentage                              72.7
a. Constant is included in the model.
b. The cut value is .500
Hosmer and Lemeshow Test
One way to test the overall model is the Hosmer-Lemeshow goodness-of-fit test, which is a chi-square test comparing the observed and expected frequencies of subjects falling in the two categories of the response variable. Large values of χ2 (and the corresponding small p-values) indicate a lack of fit for the model. This table tells us that our model is a good fit, since the p-value is large:

Hosmer and Lemeshow Test
Step   Chi-square   df   Sig.
1      6.887        7    .441
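For those curious how the statistic is computed, here is a rough Python sketch; group boundaries and tie handling vary across implementations, so this is illustrative rather than SPSS's exact algorithm:

```python
import numpy as np
from scipy.stats import chi2

def hosmer_lemeshow(y, p_hat, groups=10):
    """Compare observed and expected successes within groups of
    cases that have similar predicted probabilities."""
    order = np.argsort(p_hat)
    y, p_hat = y[order], p_hat[order]
    stat = 0.0
    for idx in np.array_split(np.arange(len(y)), groups):
        n = len(idx)
        observed = y[idx].sum()        # observed successes in the group
        expected = p_hat[idx].sum()    # expected successes in the group
        pbar = expected / n
        stat += (observed - expected) ** 2 / (n * pbar * (1 - pbar))
    return stat, chi2.sf(stat, groups - 2)   # statistic and p-value
```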
Iteration History
Now consider the next block, Block 1: Method = Enter. The Iteration History table shows the progress as the model is reassessed; the value of the coefficient of SurvRate converges to .081. To assess whether this larger model provides a significantly better fit than the smaller model, consider the difference between the -2LL values. The value for the smaller model was 77.346, which is larger than 37.323, the value for the model with SurvRate included, indicating that the larger model is a significantly better fit.

Iteration History(a,b,c,d)
                      -2 Log likelihood   Coefficients
                                          Constant   survrate
Step 1   Iteration 1      45.042          -1.547     .042
         Iteration 2      38.630          -2.184     .063
         Iteration 3      37.410          -2.552     .076
         Iteration 4      37.324          -2.673     .081
         Iteration 5      37.323          -2.684     .081
         Iteration 6      37.323          -2.684     .081
a. Method: Enter
b. Constant is included in the model.
c. Initial -2 Log Likelihood: 77.346
d. Estimation terminated at iteration number 6 because parameter estimates changed by less than .001.
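This comparison of -2LL values is a likelihood-ratio test: the drop in -2LL is treated as a chi-square statistic with degrees of freedom equal to the number of added parameters. A quick check of the p-value with scipy:

```python
from scipy.stats import chi2

drop = 77.346 - 37.323          # decrease in -2 Log likelihood
p_value = chi2.sf(drop, df=1)   # one parameter (survrate) was added
print(drop, p_value)            # about 40.02, p well below .001
```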
The Variables in the Equation table shown earlier now confirms that SurvRate is a significant predictor (p is close to 0), and we can read off the coefficients in the resulting regression equation, y = log(odds) = -2.684 + .081 × SurvRate.
Why is the odds ratio Exp(B)?
Suppose that we have the logistic regression equation y = log(odds) = β1x + β0. Then β1 represents the change in y associated with a unit change in x; that is, y will increase by β1 when x increases by 1. But y is log(odds), so log(odds) will increase by β1 when x increases by 1.
Exp(B) is an indicator of the change in odds resulting from a unit change in the predictor. Let's see how this happens: Suppose we start with the regression equation y = β1x + β0. Now if x increases by 1, we have y = β1(x + 1) + β0. How much has y changed?
New value − old value = [β1(x + 1) + β0] − [β1x + β0]
= [β1x + β1 + β0] − [β1x + β0]
= β1
So y has increased by β1. That is, β1 is the change in y associated with a unit change in x. But y = log(odds), so now we know that log(odds) will increase by β1 when x increases by 1.
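Exponentiating both sides turns this additive change in log(odds) into a multiplicative change in the odds:

odds(x + 1) / odds(x) = e^(β1(x + 1) + β0) / e^(β1x + β0) = e^β1

so each unit increase in x multiplies the odds by e^β1, which is exactly the quantity SPSS labels Exp(B).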
If log(odds) changes by β1, then the odds are multiplied by e^β1. In other words, the change in odds associated with a unit change in x is e^β1, which can be denoted as Exp(β1), or Exp(B) in SPSS. In our example, then, with each unit increase in SurvRate, y = log(odds) will increase by .081. That is, the odds of improving will increase by a factor of e^.081 ≈ 1.085 for each unit increase in SurvRate; this is the Exp(B) value of 1.085 reported in the table above.
Another example: The sales director for a chain of appliance stores wants to find out what circumstances encourage customers to purchase extended warranties after a major appliance purchase. The response variable is an indicator of whether or not a warranty is purchased. The predictor variables are:
- Customer gender
- Age of the customer
- Whether a gift is offered with the warranty
- Price of the appliance
- Race of the customer (coded with three indicator variables to represent White, African-American, and Hispanic)
Variables in the Equation
Here are the results with all predictors. The odds ratio is shown in the Exp(B) column; this is the change in odds for each unit change in the predictor. For example, the odds of a customer purchasing a warranty are 1.096 times greater for each additional year in the customer's age. Which of these variables might be removed from this model? Use α = .10.

                              B        S.E.     Wald    df   Sig.   Exp(B)
Step 1(a)  Gender             -3.772   2.568    2.158   1    .142   .023
           Gift               2.715    1.567    3.003   1    .083   15.112
           Age                .091     .056     2.638   1    .104   1.096
           Price              .001     .000     3.363   1    .067   1.001
           White              3.773    13.863   .074    1    .785   43.518
           AfricanAmerican    1.163    13.739   .007    1    .933   3.199
           Hispanic           6.347    14.070   .203    1    .652
           Constant           14.921            .649         .421
a. Variable(s) entered on step 1: Gender, Gift, Age, Price, White, AfricanAmerican, Hispanic.
Variables in the Equation
If the analysis is rerun with only the three predictors Gift, Age, and Price, these are the results. In this model, all three predictors are significant. These results indicate that the odds that a customer who is offered a gift will purchase a warranty are more than ten times greater (Exp(B) = 10.368) than the corresponding odds for a customer having the same other characteristics but who is not offered a gift. Also, the odds ratio for Age is greater than 1; this tells us that older buyers are more likely to purchase a warranty.

                      B       S.E.    Wald    df   Sig.   Exp(B)
Step 1(a)  Gift       2.339   1.131   4.273   1    .039   10.368
           Age        .064    .032    4.132   1    .042   1.066
           Price      .000    .000    6.165   1    .013   1.000
           Constant   -6.096  2.142   8.096   1    .004   .002
a. Variable(s) entered on step 1: Gift, Age, Price.
With these three predictors, the resulting regression equation is

log(odds) = -6.096 + 2.339 × Gift + .064 × Age + .000 × Price

where the Price coefficient is nonzero but too small to display at three decimal places (see the note below).
Note: in this example, the coefficient of Price is too small to be expressed in three decimal places. This situation can be remedied by dividing the price by 100 and fitting a new model with this rescaled predictor (Price100).
Variables in the Equation
The resulting equation is now

log(odds) = -6.096 + 2.339 × Gift + .064 × Age + .040 × Price100

                      B       S.E.    Wald    df   Sig.   Exp(B)
Step 1(a)  Gift       2.339   1.131   4.273   1    .039   10.368
           Age        .064    .032    4.132   1    .042   1.066
           Price100   .040    .016    6.165   1    .013   1.041
           Constant   -6.096  2.142   8.096   1    .004   .002
a. Variable(s) entered on step 1: Gift, Age, Price100.
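As a usage example, here is a short Python sketch that turns this fitted equation into a predicted probability; the customer profile is hypothetical:

```python
import math

def purchase_probability(gift, age, price100):
    """Predicted P(warranty purchase) from the fitted model above."""
    log_odds = -6.096 + 2.339 * gift + 0.064 * age + 0.040 * price100
    odds = math.exp(log_odds)
    return odds / (1 + odds)

# Hypothetical customer: offered a gift, 40 years old, $500 appliance
print(purchase_probability(gift=1, age=40, price100=5))  # about 0.27
```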
To produce the output for your analysis:
Choose > Analyze > Regression > Binary Logistic.
Choose the response and predictor variables.
Click on Options and check CI for Exp(B) to create the confidence intervals. Click on Continue.
Click on Save and check the Probabilities box. Click on Continue and then on OK.
To produce the graph of the results, create a simple scatterplot using Predicted Probability as the dependent variable, and the predictor as a covariate.
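Outside SPSS, the same picture can be drawn with matplotlib; here is a minimal sketch that plots the fitted curve from the cancer-study model above (the individual data points are omitted):

```python
import numpy as np
import matplotlib.pyplot as plt

# Fitted model from the study: log(odds) = -2.684 + .081 * SurvRate
x = np.linspace(0, 100, 200)
p = 1 / (1 + np.exp(-(-2.684 + 0.081 * x)))

plt.plot(x, p)
plt.xlabel("SurvRate")
plt.ylabel("Predicted probability of improvement")
plt.title("Fitted logistic curve")
plt.show()
```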