1
Logistic Regression
2
Aims
When and Why do we Use Logistic Regression?
  Binary
  Multinomial
Theory Behind Logistic Regression
Assessing the Model
Assessing Predictors
Things that can go Wrong
Interpreting Logistic Regression
3
When And Why
To predict a dichotomous variable from one or more categorical or continuous predictor variables. In logistic regression, instead of predicting the value of a variable Y from a predictor variable X1 or several predictor variables (Xs), we predict the probability of Y occurring given known values of X1 (or Xs). Used because having a categorical outcome variable violates the assumption of linearity in normal regression.
4
Model
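The equation on this slide did not survive extraction. As a sketch only, the standard form of the logistic regression model, written with the b coefficients used elsewhere in these slides (the subscripts and the number of predictors n are assumptions), is:

```latex
P(Y) = \frac{1}{1 + e^{-(b_0 + b_1 X_1 + b_2 X_2 + \dots + b_n X_n)}}
```

With a single predictor the exponent reduces to b_0 + b_1 X_1.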
5
Assessing the Model
The Log-likelihood statistic
Analogous to the residual sum of squares in multiple regression. It is an indicator of how much unexplained information there is after the model has been fitted. Large values indicate poorly fitting statistical models. The log-likelihood is based on summing the probabilities associated with the predicted and actual outcomes (Tabachnick & Fidell, 2007).
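The summation described above can be sketched in a few lines of Python. The function name and the data are hypothetical, and the formula used is the usual one for a binary outcome, the sum over cases of Y·ln(P) + (1 − Y)·ln(1 − P).

```python
import math

def log_likelihood(outcomes, probabilities):
    """Sum Y*ln(P) + (1 - Y)*ln(1 - P) over all cases."""
    total = 0.0
    for y, p in zip(outcomes, probabilities):
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total

# Hypothetical observed outcomes and two sets of predicted probabilities.
observed = [1, 0, 1, 1, 0]
good_fit = [0.9, 0.1, 0.8, 0.7, 0.2]   # predictions close to the outcomes
poor_fit = [0.5, 0.5, 0.5, 0.5, 0.5]   # predictions no better than chance

print(log_likelihood(observed, good_fit))
print(log_likelihood(observed, poor_fit))
```

Both values are negative; the poorer fit has the larger magnitude, which is the sense in which large values indicate a poorly fitting model.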
6
Assessing Changes in Models
It’s possible to calculate a log-likelihood for different models and to compare these models by looking at the difference between their log-likelihoods.
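A minimal sketch of this comparison, assuming the usual likelihood-ratio form χ² = 2[LL(new) − LL(baseline)] with degrees of freedom equal to the difference in the number of parameters. The log-likelihood values are hypothetical, and the p-value helper only covers the 1-degree-of-freedom case (one added predictor).

```python
import math

def likelihood_ratio_chi2(ll_baseline, ll_new):
    """Chi-square for the improvement of the new model over the baseline."""
    return 2.0 * (ll_new - ll_baseline)

def chi2_p_value_df1(chi2):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))

# Hypothetical log-likelihoods: baseline (constant only) and one added predictor.
ll_baseline = -77.04
ll_new = -72.08

chi2 = likelihood_ratio_chi2(ll_baseline, ll_new)
print(chi2, chi2_p_value_df1(chi2))
```

For more than one added predictor the same statistic is referred to a chi-square distribution with correspondingly more degrees of freedom, which this helper does not handle.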
7
Assessing Predictors: The Wald Statistic
Similar to the t-statistic in regression.
Tests the null hypothesis that b = 0.
Has a chi-square distribution.
Is biased when b is large: when the regression coefficient (b) is large, the standard error tends to become inflated, resulting in the Wald statistic being underestimated (see Menard, 1995). The inflation of the standard error increases the probability of rejecting a predictor as being significant when in reality it is making a significant contribution to the model (i.e. you are more likely to make a Type II error).
Better to look at likelihood-ratio statistics.
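As an illustration, the sketch below assumes the squared form of the statistic, Wald = (b / SE)², referred to a chi-square distribution with 1 degree of freedom. The coefficient and standard errors are hypothetical; the second call shows how an inflated standard error shrinks the statistic for the same b.

```python
import math

def wald_chi2(b, se):
    """Wald statistic in its squared (chi-square) form."""
    return (b / se) ** 2

def wald_p_value(b, se):
    """Upper-tail p-value from a chi-square distribution with 1 df."""
    return math.erfc(math.sqrt(wald_chi2(b, se) / 2.0))

b = 1.2                                   # hypothetical coefficient
print(wald_chi2(b, 0.4), wald_p_value(b, 0.4))   # modest standard error
print(wald_chi2(b, 0.8), wald_p_value(b, 0.8))   # inflated standard error
```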
8
Assessing Predictors: The Odds Ratio or Exp(b)
Indicates the change in odds resulting from a unit change in the predictor.
OR > 1: as the predictor increases, the probability of the outcome occurring increases.
OR < 1: as the predictor increases, the probability of the outcome occurring decreases.
p. 271
9
Assessing the model
To calculate the change in odds that results from a unit change in the predictor, we must first calculate the odds of becoming pregnant given that a condom wasn’t used using these equations. We then calculate the odds of becoming pregnant given that a condom was used. Finally, we calculate the proportionate change in these two odds.
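The equations referred to on this slide were not captured. The three steps can be sketched as follows, assuming the usual definitions odds = P / (1 − P) and P = 1 / (1 + e^−(b0 + b1·X)), with X coded 0 (condom not used) and 1 (condom used). The coefficient values are hypothetical, not taken from the slides.

```python
import math

def probability(b0, b1, x):
    """Predicted probability of the outcome for predictor value x."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def odds(p):
    """Odds of the outcome: probability of it occurring over it not occurring."""
    return p / (1.0 - p)

b0, b1 = -1.0, 0.8   # hypothetical intercept and slope

odds_before = odds(probability(b0, b1, 0))   # step 1: predictor = 0
odds_after = odds(probability(b0, b1, 1))    # step 2: predictor = 1
change_in_odds = odds_after / odds_before    # step 3: proportionate change

print(change_in_odds, math.exp(b1))
```

The two printed values agree, which is why the proportionate change in odds is reported as Exp(b).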
10
Model Assessment
13
Odds Ratio
14
Methods of Regression
Forced Entry: All variables entered simultaneously.
Hierarchical: Variables entered in blocks. Blocks should be based on past research, or theory being tested. Good method.
Stepwise: Variables entered on the basis of statistical criteria (i.e. relative contribution to predicting outcome). Should be used only for exploratory analysis.
Also, as I mentioned for ordinary regression, if you do decide to use a stepwise method then the backward method is preferable to the forward method. This is because of suppressor effects, which occur when a predictor has a significant effect but only when another variable is held constant. Forward selection is more likely than backward elimination to exclude predictors involved in suppressor effects. As such, the forward method runs a higher risk of making a Type II error.
15
Things That Can go Wrong
Linearity
Independence of Errors
Multicollinearity
Overdispersion

1. Linearity: The assumption of linearity in logistic regression is that there is a linear relationship between any continuous predictors and the logit of the outcome variable. This assumption can be tested by looking at whether the interaction term between the predictor and its log transformation is significant (Hosmer & Lemeshow, 1989).

2. Independence of errors: Cases of data should not be related; for example, you cannot measure the same people at different points in time. Violating this assumption produces overdispersion.

4. Overdispersion: The observed variance is bigger than expected from the logistic regression model. This can happen for two reasons. The first is correlated observations (i.e. when the assumption of independence is broken) and the second is variability in success probabilities. SPSS produces a chi-square goodness-of-fit statistic, and overdispersion is present if the ratio of this statistic to its degrees of freedom is greater than 1 (this ratio is called the dispersion parameter, φ). Overdispersion is likely to be problematic if the dispersion parameter approaches or is greater than 2. (Incidentally, underdispersion is shown by values less than 1, but this problem is much less common in practice.) There is also the deviance goodness-of-fit statistic, and the dispersion parameter can be based on this statistic instead (again by dividing by the degrees of freedom). When the chi-square and deviance statistics are very discrepant, overdispersion is likely. The effects of overdispersion can be reduced by using the dispersion parameter to rescale the standard errors and confidence intervals. For example, the standard errors are multiplied by √φ to make them bigger (as a function of how big the overdispersion is). You can base these corrections on the deviance statistic too.
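The dispersion calculation described above is simple arithmetic and can be sketched as follows. The goodness-of-fit value, degrees of freedom and standard error are hypothetical.

```python
import math

def dispersion_parameter(goodness_of_fit, df):
    """phi = goodness-of-fit statistic (chi-square or deviance) / degrees of freedom."""
    return goodness_of_fit / df

def rescale_standard_error(se, phi):
    """Multiply a standard error by sqrt(phi) to correct for overdispersion."""
    return se * math.sqrt(phi)

phi = dispersion_parameter(180.0, 100)       # hypothetical chi-square and df
print(phi)                                   # above 1, so some overdispersion
print(rescale_standard_error(0.25, phi))     # corrected (larger) standard error
```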
16
Output: Initial Model
To open the above Logistic Regression dialog box select Analyze, then Regression, and finally Binary Logistic. Click on Kibblemix and drag it to the Dependent box. To specify a main effect, simply select the predictors (age, environ4 and gender in the above example) and then drag them to the Covariates box. To input an interaction, click on more than one variable on the left-hand side of the dialog box (i.e. click on several variables while holding down the Ctrl key) and then click on the >a*b> button to move them to the Covariates box. In this example there are only two predictors and therefore there is only one possible interaction (the age × environ4 interaction), but if you have three predictors then you can select several interactions using two predictors, and an interaction involving all three. Now click on Enter and then click on a method in the resulting drop-down menu. For this analysis select a Forward:LR method of regression.
17
Output: Initial Model
To specify categorical predictor variables, click the Categorical… button to invoke the Logistic Regression: Define Categorical Variables dialog box. In this dialog box, the covariates are listed on the left-hand side, and there is a space on the right-hand side in which categorical covariates can be placed. Highlight any categorical variables you have (in this example we have only one, so click on gender) and drag them to the Categorical Covariates box. Categorical predictors could be incorporated into regression by recoding them using zeros and ones (known as dummy coding). Actually, there are different ways that you can code categorical variables. By default SPSS uses Indicator coding, which is the standard dummy variable coding. To change to a different kind of contrast click on Change to access a drop-down list of possible contrasts.
18
Output: Initial Model
As with linear regression, it is possible to save a set of residuals as new variables in the data editor. These residual variables can then be examined to see how well the model fits the observed data. To save residuals click on Save in the main Logistic Regression dialog box. Two residuals that are unique to logistic regression are the predicted probabilities and the predicted group memberships. The predicted probabilities are the probabilities of Y occurring given the values of each predictor for a given participant. The predicted group membership is self-explanatory in that it predicts to which of the two outcome categories a participant is most likely to belong based on the model. The group memberships are based on the predicted probabilities. Make the selections as shown above and click Continue to go back to the main dialog box.
19
Output: Initial Model
There is a final dialog box that offers further options. This box is shown above and is accessed by clicking on the Options button in the main Logistic Regression dialog box. Make the selections as shown above. A classification plot is a histogram of the actual and predicted values of the outcome variable. This plot is useful for assessing the fit of the model to the observed data. It is also possible to do a Casewise listing of residuals either for any cases for which the standardized residual is greater than 2 standard deviations (this value can be changed but the default is sensible), or for all cases. You can ask SPSS to display a confidence interval for the odds ratio Exp(B), and by default a 95% confidence interval is used, which is appropriate and a useful statistic to have. More important, you can request the Hosmer-Lemeshow goodness-of-fit statistic, which can be used to assess how well the chosen model fits the data. Now click Continue and then OK to get the output.
20
Output: Initial Model
26
Output: Initial Model
27
Output: Initial Model
28
Output: Initial Model
29
Output: Step 1
30
Output: Step 1
31
Output: Step 1
32
Classification Plot
33
Summary
The overall fit of the final model is shown by the −2 log-likelihood statistic. If the significance of the chi-square statistic is less than .05, then the model is a significant fit of the data.
Check the table labelled Variables in the equation to see which variables significantly predict the outcome.
Use the odds ratio, Exp(B), for interpretation.
  If OR > 1, then as the predictor increases, the odds of the outcome occurring increase.
  If OR < 1, then as the predictor increases, the odds of the outcome occurring decrease.
  The confidence interval of the OR should not cross 1!
Check the table labelled Variables not in the equation to see which variables did not significantly predict the outcome.
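The check that the confidence interval does not cross 1 can be sketched as below. This assumes a Wald-type 95% interval, exp(b ± 1.96 × SE), which is an assumption about how the interval is formed rather than something stated on the slides; the coefficients and standard errors are hypothetical.

```python
import math

def odds_ratio_ci(b, se, z=1.96):
    """Return (odds ratio, lower limit, upper limit) for a Wald-type interval."""
    return math.exp(b), math.exp(b - z * se), math.exp(b + z * se)

def ci_crosses_one(lower, upper):
    """True when the interval contains 1, i.e. the direction of the effect is unclear."""
    return lower <= 1.0 <= upper

for b, se in [(1.2, 0.4), (0.2, 0.4)]:       # hypothetical predictors
    odds_ratio, lower, upper = odds_ratio_ci(b, se)
    print(odds_ratio, lower, upper, ci_crosses_one(lower, upper))
```

An interval that contains 1 means the data are compatible with the predictor either raising or lowering the odds.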
34
Reporting the Analysis
35
Multinomial logistic regression
Logistic regression to predict membership of more than two categories.
It (basically) works in the same way as binary logistic regression.
The analysis breaks the outcome variable down into a series of comparisons between two categories.
E.g., if you have three outcome categories (A, B and C), then the analysis will consist of two comparisons that you choose:
  Compare everything against your first category (e.g. A vs. B and A vs. C),
  Or your last category (e.g. A vs. C and B vs. C),
  Or a custom category (e.g. B vs. A and B vs. C).
The important parts of the analysis and output are much the same as we have just seen for binary logistic regression.
36
I may not be Fred Flintstone …
How successful are chat-up lines?
The chat-up lines used by 348 men and 672 women in a night-club were recorded.
Outcome: Whether the chat-up line resulted in one of the following three events:
  The person got no response or the recipient walked away,
  The person obtained the recipient’s phone number,
  The person left the night-club with the recipient.
Predictors: The content of the chat-up lines was rated for:
  Funniness (0 = not funny at all, 10 = the funniest thing that I have ever heard)
  Sexuality (0 = no sexual content at all, 10 = very sexually direct)
  Moral values (0 = the chat-up line does not reflect good characteristics, 10 = the chat-up line is very indicative of good characteristics)
  Gender of recipient
37
Output
38
Output
39
Output
40
Output
41
Interpretation
Good_Mate: Whether the chat-up line showed signs of good moral fibre significantly predicted whether you got a phone number or no response/walked away, b = 0.13, Wald χ²(1) = 6.02, p < .05.
Funny: Whether the chat-up line was funny did not significantly predict whether you got a phone number or no response, b = 0.14, Wald χ²(1) = 1.60, p > .05.
Gender: The gender of the person being chatted up significantly predicted whether they gave out their phone number or gave no response, b = −1.65, Wald χ²(1) = 4.27, p < .05.
Sex: The sexual content of the chat-up line significantly predicted whether you got a phone number or no response/walked away, b = 0.28, Wald χ²(1) = 9.59, p < .01.
Funny×Gender: The success of funny chat-up lines depended on whether they were delivered to a man or a woman because in interaction these variables predicted whether or not you got a phone number, b = 0.49, Wald χ²(1) = 12.37, p < .001.
Sex×Gender: The success of chat-up lines with sexual content depended on whether they were delivered to a man or a woman because in interaction these variables predicted whether or not you got a phone number, b = −0.35, Wald χ²(1) = 10.82, p < .01.
42
Interpretation
Good_Mate: Whether the chat-up line showed signs of good moral fibre did not significantly predict whether you went home with the date or got a slap in the face, b = 0.13, Wald χ²(1) = 2.42, p > .05.
Funny: Whether the chat-up line was funny significantly predicted whether you went home with the date or no response, b = 0.32, Wald χ²(1) = 6.46, p < .05.
Gender: The gender of the person being chatted up significantly predicted whether they went home with the person or gave no response, b = −5.63, Wald χ²(1) = 17.93, p < .001.
Sex: The sexual content of the chat-up line significantly predicted whether you went home with the date or got a slap in the face, b = 0.42, Wald χ²(1) = 11.68, p < .01.
Funny×Gender: The success of funny chat-up lines depended on whether they were delivered to a man or a woman because in interaction these variables predicted whether or not you went home with the date, b = 1.17, Wald χ²(1) = 34.63, p < .001.
Sex×Gender: The success of chat-up lines with sexual content depended on whether they were delivered to a man or a woman because in interaction these variables predicted whether or not you went home with the date, b = −0.48, Wald χ²(1) = 8.51, p < .01.
43
Reporting the Results
44
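Multiple Logistic Regression
$E(Y \mid X) = P(Y = 1 \mid X) = \pi(X) = \dfrac{e^{\beta_0 + \beta_1 X_1 + \dots + \beta_k X_k}}{1 + e^{\beta_0 + \beta_1 X_1 + \dots + \beta_k X_k}}$
The relationship between $\pi(X)$ and X is S shaped.
The logit (log-odds) transformation (link function), $\ln\left[\pi(X) / (1 - \pi(X))\right] = \beta_0 + \beta_1 X_1 + \dots + \beta_k X_k$, has many of the desirable properties of the linear regression model, while relaxing some of the assumptions.
Maximum Likelihood (ML): model parameters are estimated by iteration.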
45
Assumptions for Logistic Regression
The independent variables are linear in the logit. It is also possible to add explicit interaction and power terms, as in OLS regression.
The dependent variable need not be normally distributed (it is assumed to be distributed within the range of the exponential family of distributions, such as normal, Poisson, binomial, gamma).
The dependent variable need not be homoscedastic for each level of the independents; that is, there is no homogeneity of variance assumption.
Normally distributed error terms are not assumed.
The independent variables may be binary, categorical or continuous.
46
Applications
Identify risk factors: H0: β0 = 0, while controlling for confounders and other important determinants of the event
Classification: Predict outcome for a new observation with a particular constellation of risk factors (a form of discriminant analysis)
47
Design Variables (coding)
In SPSS, designate Categorical to get k-1 indicators for a k-level factor.

            design variable
RACE        D1    D2
White       0     0
Black       1     0
Other       0     1
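The coding in this table can be reproduced with a short sketch. The function name is hypothetical, and the reference level (coded all zeros) is taken to be White, as in the table.

```python
def indicator_code(values, reference):
    """Return (levels, rows): k-1 indicator columns for a k-level factor.

    The reference level gets zeros in every column; each other level gets
    a 1 in its own column, in order of first appearance.
    """
    levels = []
    for value in values:
        if value != reference and value not in levels:
            levels.append(value)
    rows = [[1 if value == level else 0 for level in levels] for value in values]
    return levels, rows

race = ["White", "Black", "Other", "Black"]   # hypothetical cases
levels, rows = indicator_code(race, reference="White")

print(levels)   # which level each design variable (D1, D2) stands for
for value, row in zip(race, rows):
    print(value, row)
```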
48
Interpretation of the parameters
If p is the probability of an event and O is the odds for that event then … the link function in logistic regression gives the log-odds
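The equations on this slide were images and are missing from the extracted text. A sketch of the standard definitions, using the slide's own symbols p and O, is:

```latex
O = \frac{p}{1 - p}, \qquad \operatorname{logit}(p) = \ln(O) = \ln\!\left(\frac{p}{1 - p}\right)
```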
49
…and the odds ratio, OR, is
[2×2 table: rows X=1 and X=0, columns Y=1 and Y=0]
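The formula for this slide was not captured. As a sketch, for a 2×2 table of X against Y the odds ratio is the odds of Y=1 when X=1 divided by the odds of Y=1 when X=0. The cell names a, b, c, d and the counts below are hypothetical.

```python
def odds_ratio_2x2(a, b, c, d):
    """Odds ratio from a 2x2 table of counts.

    a: X=1, Y=1    b: X=1, Y=0
    c: X=0, Y=1    d: X=0, Y=0
    """
    odds_exposed = a / b       # odds of Y=1 when X=1
    odds_unexposed = c / d     # odds of Y=1 when X=0
    return odds_exposed / odds_unexposed

print(odds_ratio_2x2(30, 70, 10, 90))   # hypothetical counts
```

The same value is obtained from the cross-product (a × d) / (b × c).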
50
Definitions and Annotated SPSS output for Logistic Regression
"Virtually any sin that can be committed with least squares regression can be committed with logistic regression. These include stepwise procedures and arriving at a final model by looking at the data. All of the warnings and recommendations made for least squares regression apply to logistic regression as well ..."
Gerard Dallal

logistic regression
    logistic regression var=honcomp
      /method=enter read socst.
51
Assessing the Model Fit
There are several R²-like measures; they are not goodness-of-fit tests but rather attempt to measure strength of association.

Cox and Snell's R-Square is an attempt to imitate the interpretation of multiple R-Square based on the likelihood, but its maximum can be (and usually is) less than 1.0, making it difficult to interpret. It is part of SPSS output.

Nagelkerke's R-Square is a further modification of the Cox and Snell coefficient to assure that it can vary from 0 to 1. That is, Nagelkerke's R² divides Cox and Snell's R² by its maximum in order to achieve a measure that ranges from 0 to 1. Therefore Nagelkerke's R-Square will normally be higher than the Cox and Snell measure. It is part of SPSS output and is the most-reported of the R-squared estimates. See Nagelkerke (1991).
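The relationship between the two measures can be sketched from the log-likelihoods of the constant-only model and the fitted model. The formulas below are the commonly quoted ones (Cox and Snell: 1 − exp(2(LL0 − LL1)/n); its maximum: 1 − exp(2·LL0/n)); the log-likelihoods and sample size are hypothetical.

```python
import math

def cox_snell_r2(ll_null, ll_model, n):
    """Cox and Snell R-square from the null and fitted log-likelihoods."""
    return 1.0 - math.exp(2.0 * (ll_null - ll_model) / n)

def nagelkerke_r2(ll_null, ll_model, n):
    """Cox and Snell R-square divided by its maximum attainable value."""
    maximum = 1.0 - math.exp(2.0 * ll_null / n)
    return cox_snell_r2(ll_null, ll_model, n) / maximum

ll_null, ll_model, n = -69.3, -55.0, 100     # hypothetical values
print(cox_snell_r2(ll_null, ll_model, n))
print(nagelkerke_r2(ll_null, ll_model, n))
```

Because the maximum is below 1, dividing by it always makes the Nagelkerke value the larger of the two.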
52
Hosmer and Lemeshow's Goodness of Fit Test
Tests the null hypothesis that the data were generated by the fitted model:
  divide subjects into deciles based on predicted probabilities
  compute a chi-square from observed and expected frequencies
  compute a probability (p) value from the chi-square distribution with 8 degrees of freedom to test the fit of the logistic model
If the Hosmer and Lemeshow Goodness-of-Fit test statistic has p = .05 or less, we reject the null hypothesis that there is no difference between the observed and model-predicted values of the dependent. (This means the model predicts values significantly different from the observed values.)
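The steps above can be sketched as follows. This is a simplified illustration, not SPSS's implementation: groups are formed by sorting on predicted probability and cutting into equal-sized chunks, ties are not treated specially, and the p-value helper only handles an even number of degrees of freedom (which covers the 8 used here). The data are hypothetical.

```python
import math

def hosmer_lemeshow(outcomes, probabilities, groups=10):
    """Return (chi-square, df) comparing observed and expected counts by group."""
    pairs = sorted(zip(probabilities, outcomes))   # order cases by predicted probability
    n = len(pairs)
    chi_square = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        size = len(chunk)
        observed = sum(y for _, y in chunk)        # observed events in the group
        expected = sum(p for p, _ in chunk)        # expected events in the group
        chi_square += (observed - expected) ** 2 / expected
        chi_square += ((size - observed) - (size - expected)) ** 2 / (size - expected)
    return chi_square, groups - 2

def chi2_p_value_even_df(chi_square, df):
    """Upper-tail chi-square probability; closed form valid for even df only."""
    half = chi_square / 2.0
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(df // 2))

# Hypothetical predicted probabilities and observed outcomes for 20 cases.
predicted = [(i + 0.5) / 20 for i in range(20)]
actual = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1]

statistic, df = hosmer_lemeshow(actual, predicted)
print(statistic, df, chi2_p_value_even_df(statistic, df))
```

A p-value above .05 from this test is the desirable result, since it means no significant difference between observed and predicted values was detected.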
53
Observed vs. Predicted
This particular model performs better when the event rate is low.
[Figure: observed vs. expected]
54
Check for Linearity in the LOGIT
Box-Tidwell Transformation (Test): Add to the logistic model interaction terms which are the cross-product of each independent times its natural logarithm [(X)ln(X)]. If these terms are significant, then there is nonlinearity in the logit. This method is not sensitive to small nonlinearities.

Orthogonal polynomial contrasts, an option in SPSS, may be used. This option treats each independent as a categorical variable and computes logit (effect) coefficients for each category, testing for linear, quadratic, cubic, or higher-order effects. The logit should not change over the contrasts. This method is not appropriate when the independent has a large number of values, inflating the standard errors of the contrasts.
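Constructing the Box-Tidwell term is the only computational step described here, and it can be sketched as below. The function name and the example values are hypothetical; the term is only defined for positive values of the predictor, so the sketch rejects anything else.

```python
import math

def box_tidwell_term(x_values):
    """Return X * ln(X) for each value of a continuous predictor (X must be > 0)."""
    for x in x_values:
        if x <= 0:
            raise ValueError("X * ln(X) requires positive predictor values")
    return [x * math.log(x) for x in x_values]

age = [18.0, 25.0, 40.0, 63.0]        # hypothetical continuous predictor
age_ln_age = box_tidwell_term(age)    # extra column to enter alongside age
print(age_ln_age)
```

The new column is entered into the model together with the original predictor, and the significance of its coefficient is what the test examines.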
55
Residual Plots
  Plot the Cook’s distance against
  Several other plots suggested in Hosmer & Lemeshow (p. 177) involve further manipulation of the statistics produced by SPSS
External Validation
  a new sample
  a hold-out sample
Cross Validation (classification)
  n-fold (leave 1 out)
  V-fold (divide data into V subsets)
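The two cross-validation schemes listed above differ only in the number of subsets, which the sketch below makes explicit. It only produces index splits (no model is fitted), the function name is hypothetical, and cases are assigned to folds in order; in practice the cases would normally be shuffled first.

```python
def v_fold_indices(n_cases, v):
    """Yield (training indices, held-out indices) for each of V subsets."""
    folds = [list(range(start, n_cases, v)) for start in range(v)]
    for held_out in folds:
        held = set(held_out)
        training = [i for i in range(n_cases) if i not in held]
        yield training, held_out

# V-fold: divide 10 cases into 5 subsets.
for training, held_out in v_fold_indices(10, 5):
    print(held_out, training)

# n-fold (leave 1 out): as many subsets as cases.
leave_one_out = list(v_fold_indices(4, 4))
print(leave_one_out)
```

Each case is held out exactly once, so every predicted classification used for validation comes from a model that did not see that case.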
56
Pitfalls
1. Multiple comparisons (data-driven model/data dredging)
2. Overfitting
-complex models fit to a small dataset: good fit in THIS dataset, but does not generalize (you’re modeling the random error)
-at least 10 events per independent variable
-validation: new data to check predictive ability and calibration; hold-out sample
-look for sensitivity to a single observation (residuals)
3. Violating the assumptions: more serious in prediction models than association
There are many strategies: don’t try them all
-choose one based on the structure of the question
-draw primary conclusions based on that one
-examine robustness to other strategies
57
CASE STUDY
Develop a strategy for analyzing Hosmer & Lemeshow’s Low Birth Weight data using LOW as the dependent variable.
Try ANCOVA for the same data with BWT (birth weight in grams) as the dependent variable.
LBW.SAV is on the S drive under GCRC data analysis.
58
References
Hosmer, D. W. and Lemeshow, S. (2000) Applied Logistic Regression, 2nd ed., John Wiley & Sons, New York, NY.
Harrell, F. E., Lee, K. L., Mark, D. B. (1996) “Multivariable Prognostic Models: Issues in Developing Models, Evaluating Assumptions and Adequacy, and Measuring and Reducing Errors”, Statistics in Medicine, 15,
Nagelkerke, N. J. D. (1991) “A note on a general definition of the coefficient of determination”, Biometrika, Vol. 78, No. 3. Covers the two measures of R-square for logistic regression which are found in SPSS output.
Agresti, A. (1990) Categorical Data Analysis, John Wiley & Sons, New York, NY.