Multinomial Logistic Regression: Basic Relationships, Describing Relationships, Classification Accuracy, Sample Problem, Steps in Solving Problems
Multinomial logistic regression Multinomial logistic regression is used to analyze relationships between a non-metric dependent variable and metric or dichotomous independent variables. Multinomial logistic regression compares multiple groups through a combination of binary logistic regressions. The group comparisons are equivalent to the comparisons for a dummy-coded dependent variable, with the group with the highest numeric score used as the reference group. For example, if we wanted to study differences in BSW, MSW, and PhD students using multinomial logistic regression, the analysis would compare BSW students to PhD students and MSW students to PhD students. For each independent variable, there would be two comparisons.
What multinomial logistic regression predicts Multinomial logistic regression provides a set of coefficients for each of the two comparisons. The coefficients for the reference group are all zeros, similar to the coefficients for the reference group for a dummy-coded variable. Thus, there are three equations, one for each of the groups defined by the dependent variable. The three equations can be used to compute the probability that a subject is a member of each of the three groups. A case is predicted to belong to the group associated with the highest probability. Predicted group membership can be compared to actual group membership to obtain a measure of classification accuracy.
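As a minimal sketch of how the three equations become group probabilities, the Python below assumes made-up intercepts and coefficients (they are not taken from any SPSS output in these slides) and two hypothetical helpers, linear_predictor and group_probabilities. The reference group's equation is fixed at zero, each equation is exponentiated, and each result is divided by the sum, so the probabilities add to 1.

```python
import math

def linear_predictor(intercept, coefs, values):
    """Evaluate one group's equation: intercept + sum of b * x."""
    return intercept + sum(b * x for b, x in zip(coefs, values))

def group_probabilities(non_reference_predictors):
    """Turn the non-reference equations into probabilities for all groups.
    The reference group's equation is all zeros, so its predictor is 0."""
    predictors = list(non_reference_predictors) + [0.0]  # reference group last
    exps = [math.exp(v) for v in predictors]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical coefficients and case values, for illustration only.
case = [30, 4]
bsw = linear_predictor(0.5, [0.02, -0.10], case)   # BSW vs. PhD equation
msw = linear_predictor(1.0, [0.01, -0.05], case)   # MSW vs. PhD equation

probs = group_probabilities([bsw, msw])            # [P(BSW), P(MSW), P(PhD)]
groups = ["BSW", "MSW", "PhD"]
predicted = groups[probs.index(max(probs))]        # group with highest probability
```

With these invented numbers the MSW group has the highest probability, so the case would be classified as MSW.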
Level of measurement requirements Multinomial logistic regression analysis requires that the dependent variable be non-metric. Dichotomous, nominal, and ordinal variables satisfy the level of measurement requirement. It also requires that the independent variables be metric or dichotomous. Nominal level independent variables can still be included because SPSS automatically dummy-codes them, which dichotomizes them for the analysis. In SPSS, non-metric independent variables are entered as “factors” and metric independent variables are entered as “covariates.” If an independent variable is ordinal, we will attach the usual caution.
Assumptions and outliers Multinomial logistic regression does not make any assumptions of normality, linearity, or homogeneity of variance for the independent variables. Because it does not impose these requirements, it is preferred to discriminant analysis when the data do not satisfy these assumptions. SPSS does not compute any diagnostic statistics for outliers in this procedure. To evaluate outliers, the advice is to run multiple binary logistic regressions and use those results to test the exclusion of outliers.
Sample size requirements The minimum number of cases per independent variable is 10, using a guideline provided by Hosmer and Lemeshow, authors of Applied Logistic Regression, one of the main resources for Logistic Regression. For preferred case-to-variable ratios, we will use 20 to 1.
Methods for including variables Beginning with version 13, SPSS supports stepwise entry of variables, as well as simultaneous or direct entry. In previous versions, the only method for selecting independent variables in SPSS was simultaneous or direct entry.
Overall test of relationship - 1 The overall test of relationship among the independent variables and groups defined by the dependent is based on the reduction in the likelihood values for a model which does not contain any independent variables and the model that contains the independent variables. This difference in likelihood follows a chi-square distribution, and is referred to as the model chi-square. The significance test for the final model chi-square (after the independent variables have been added) is our statistical evidence of the presence of a relationship between the dependent variable and the combination of the independent variables.
Overall test of relationship - 2 The presence of a relationship between the dependent variable and combination of independent variables is based on the statistical significance of the final model chi-square in the SPSS table titled "Model Fitting Information". In this analysis, the probability of the model chi-square (18.457) was 0.005, less than or equal to the level of significance of 0.05. The null hypothesis that there was no difference between the model without independent variables and the model with independent variables was rejected. The existence of a relationship between the independent variables and the dependent variable was supported.
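The significance of the model chi-square can be checked by hand. The sketch below is only an illustration: the slides do not state the degrees of freedom, so df = 6 is an assumption (3 covariates times 2 group comparisons), and chi_square_p_even_df is a hypothetical helper that uses the closed-form upper-tail probability available when df is even.

```python
import math

def chi_square_p_even_df(chi_square, df):
    """Upper-tail probability of a chi-square statistic for even df:
    exp(-x/2) * sum over i < df/2 of (x/2)^i / i!"""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only covers positive even df")
    half = chi_square / 2.0
    terms = sum(half ** i / math.factorial(i) for i in range(df // 2))
    return math.exp(-half) * terms

model_chi_square = 18.457   # reported in 'Model Fitting Information'
df = 6                      # ASSUMPTION: 3 covariates x 2 comparisons
alpha = 0.05

p_value = chi_square_p_even_df(model_chi_square, df)
reject_null = p_value <= alpha   # True means a relationship is supported
```

Under that assumption the computed probability rounds to 0.005, matching the value reported on the slide.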
Strength of multinomial logistic regression relationship While multinomial logistic regression does compute correlation measures to estimate the strength of the relationship (pseudo R square measures, such as Nagelkerke's R²), these correlation measures do not really tell us much about the accuracy or errors associated with the model. A more useful measure to assess the utility of a multinomial logistic regression model is classification accuracy, which compares predicted group membership based on the logistic model to the actual, known group membership, which is the value for the dependent variable.
Evaluating usefulness for logistic models The benchmark that we will use to characterize a multinomial logistic regression model as useful is a 25% improvement over the rate of accuracy achievable by chance alone. Even if the independent variables had no relationship to the groups defined by the dependent variable, we would still expect to be correct in our predictions of group membership some percentage of the time. This is referred to as by chance accuracy. The estimate of by chance accuracy that we will use is the proportional by chance accuracy rate, computed by squaring the proportion of cases in each group and summing the squares. The only difference between by chance accuracy for binary logistic models and by chance accuracy for multinomial logistic models is the number of groups defined by the dependent variable.
Computing by chance accuracy The percentage of cases in each group defined by the dependent variable is found in the ‘Case Processing Summary’ table. The proportional by chance accuracy rate was computed by calculating the proportion of cases for each group based on the number of cases in each group in the 'Case Processing Summary', and then squaring and summing the proportion of cases in each group (0.371² + 0.557² + 0.072² = 0.453). The proportional by chance accuracy criterion is 56.6% (1.25 x 45.3% = 56.6%).
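The arithmetic for the by chance accuracy criterion can be sketched in a few lines of Python. The group proportions and the 60.5% classification accuracy rate are the values quoted in these slides; the function name proportional_by_chance_accuracy is a hypothetical label chosen for this illustration.

```python
def proportional_by_chance_accuracy(proportions):
    """Sum of the squared proportions of cases in each group."""
    return sum(p ** 2 for p in proportions)

# Proportions of cases in each group, from the 'Case Processing Summary'.
group_proportions = [0.371, 0.557, 0.072]

chance_rate = proportional_by_chance_accuracy(group_proportions)  # about 0.453
criterion = 1.25 * chance_rate                                     # about 0.566

# Overall classification accuracy reported by SPSS for this example.
observed_accuracy = 0.605
model_is_useful = observed_accuracy >= criterion
```

Here the observed accuracy of 60.5% exceeds the criterion of about 56.6%, so the model would be characterized as useful.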
Comparing accuracy rates To characterize our model as useful, we compare the overall percentage accuracy rate produced by SPSS at the last step in which variables are entered to 25% more than the proportional by chance accuracy. (Note: SPSS does not compute a cross-validated accuracy rate for multinomial logistic regression.) The classification accuracy rate was 60.5%, which was greater than or equal to the proportional by chance accuracy criterion of 56.6% (1.25 x 45.3% = 56.6%). The criterion for classification accuracy is satisfied in this example.
Numerical problems The maximum likelihood method used to calculate multinomial logistic regression is an iterative fitting process that attempts to cycle through repetitions to find an answer. Sometimes, the method will break down and not be able to converge or find an answer. Sometimes the method will produce wildly improbable results, reporting that a one-unit change in an independent variable increases the odds of the modeled event by hundreds of thousands or millions. These implausible results can be produced by multicollinearity, categories of predictors having no cases or zero cells, and complete separation whereby the two groups are perfectly separated by the scores on one or more independent variables. The clue that we have numerical problems and should not interpret the results is a standard error larger than 2.0 for one or more independent variables.
Relationship of individual independent variables and the dependent variable There are two types of tests for individual independent variables. The likelihood ratio test evaluates the overall relationship between an independent variable and the dependent variable. The Wald test evaluates whether or not the independent variable is statistically significant in differentiating between the two groups in each of the embedded binary logistic comparisons. If an independent variable has an overall relationship to the dependent variable, it might or might not be statistically significant in differentiating between pairs of groups defined by the dependent variable.
Relationship of individual independent variables and the dependent variable The interpretation for an independent variable focuses on its ability to distinguish between pairs of groups and the contribution which it makes to changing the odds of being in one dependent variable group rather than the other. We should not interpret the significance of an independent variable’s role in distinguishing between pairs of groups unless the independent variable also has an overall relationship to the dependent variable in the likelihood ratio test. The interpretation of an independent variable’s role in differentiating dependent variable groups is the same as we used in binary logistic regression. The difference in multinomial logistic regression is that we can have multiple interpretations for an independent variable in relation to different pairs of groups.
Relationship of individual independent variables and the dependent variable SPSS identifies the comparisons it makes for groups defined by the dependent variable in the table of ‘Parameter Estimates,’ using either the value codes or the value labels, depending on the options settings for pivot table labeling. The reference category is identified in the footnote to the table. In this analysis, two comparisons will be made: the TOO LITTLE group (coded 1, shaded blue) will be compared to the TOO MUCH group (coded 3, shaded purple), and the ABOUT RIGHT group (coded 2, shaded orange) will be compared to the TOO MUCH group (coded 3, shaded purple). The reference category plays the same role in multinomial logistic regression that it plays in the dummy-coding of a nominal variable: it is the category that would be coded with zeros for all of the dummy-coded variables and against which all other categories are interpreted.
Relationship of individual independent variables and the dependent variable In this example, there is a statistically significant relationship between the independent variable CONLEGIS and the dependent variable (0.010 < 0.05). As well, the independent variable CONLEGIS is significant in distinguishing category 1 of the dependent variable from category 3 of the dependent variable (0.027 < 0.05). And the independent variable CONLEGIS is significant in distinguishing category 2 of the dependent variable from category 3 of the dependent variable (0.007 < 0.05).
Interpreting relationship of individual independent variables to the dependent variable Survey respondents who had greater confidence in Congress (higher values correspond to greater confidence) were less likely to be in the group of survey respondents who thought we spend too little money on highways and bridges (DV category 1), rather than the group of survey respondents who thought we spend too much money on highways and bridges (DV category 3). For each unit increase in confidence in Congress, the odds of being in the group of survey respondents who thought we spend too little money on highways and bridges decreased by 74.7%. (0.253 – 1.0 = -0.747)
Interpreting relationship of individual independent variables to the dependent variable Survey respondents who had greater confidence in Congress (higher values correspond to greater confidence) were less likely to be in the group of survey respondents who thought we spend about the right amount of money on highways and bridges (DV category 2), rather than the group of survey respondents who thought we spend too much money on highways and bridges (DV category 3). For each unit increase in confidence in Congress, the odds of being in the group of survey respondents who thought we spend about the right amount of money on highways and bridges decreased by 80.9%. (0.191 – 1.0 = -0.809)
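The conversion from Exp(B) to a percentage change in the odds is simple enough to sketch in Python. The two Exp(B) values are the ones quoted on these two slides; percent_change_in_odds is a hypothetical helper name used only for this illustration.

```python
def percent_change_in_odds(exp_b):
    """Percentage change in the odds for a one-unit increase in the
    independent variable: (Exp(B) - 1) * 100. Negative means a decrease."""
    return (exp_b - 1.0) * 100.0

# Exp(B) for confidence in Congress in each comparison with the reference group.
too_little_vs_too_much = percent_change_in_odds(0.253)    # about -74.7
about_right_vs_too_much = percent_change_in_odds(0.191)   # about -80.9
```

An Exp(B) below 1.0 gives a negative result (the odds decrease), an Exp(B) above 1.0 gives a positive result (the odds increase), and an Exp(B) of exactly 1.0 means the odds do not change.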
Relationship of individual independent variables and the dependent variable In this example, there is a statistically significant relationship between SEX and the dependent variable, spending on childcare assistance. As well, SEX plays a statistically significant role in differentiating the TOO LITTLE group from the TOO MUCH (reference) group (0.007 < 0.05). However, SEX does not differentiate the ABOUT RIGHT group from the TOO MUCH (reference) group (0.51 > 0.05).
Interpreting relationship of individual independent variables and the dependent variable Survey respondents who were male (code 1 for sex) were less likely to be in the group of survey respondents who thought we spend too little money on childcare assistance (DV category 1), rather than the group of survey respondents who thought we spend too much money on childcare assistance (DV category 3). Survey respondents who were male were 88.5% less likely (0.115 – 1.0 = -0.885) to be in the group of survey respondents who thought we spend too little money on childcare assistance.
Interpreting relationships for independent variables in problems In the multinomial logistic regression problems, the problem statement will ask about only one of the independent variables. The answer will be true or false based on only the stated relationship between the specified independent variable and the dependent variable. The individual relationships between other independent variables and the dependent variable are not incorporated in the determination of whether or not the answer is true or false.
Level of Measurement - question The first question requires us to examine the level of measurement requirements for multinomial regression. Multinomial logistic regression requires that the dependent variable be non-metric and the independent variables be metric or dichotomous.
Level of Measurement – evidence and answer True with caution is the correct answer, since we satisfy the level of measurement requirements, but include ordinal level variables in the analysis.
Sample Size - question The second question asks about the sample size requirements for the multinomial regression. To answer this question, we will run a baseline logistic regression to obtain some basic data about the problem and solution. The phrase “simultaneous entry” dictates the method for including variables in the model.
Request multinomial logistic regression Select the Regression | Multinomial Logistic… command from the Analyze menu.
Selecting the dependent variable First, highlight the dependent variable natroad in the list of variables. Second, click on the right arrow button to move the dependent variable to the Dependent text box.
Selecting metric independent variables Metric independent variables are specified as covariates in multinomial logistic regression. Metric variables can be either interval or, by convention, ordinal. Move the metric independent variables, age, educ and conlegis to the Covariate(s) list box. In this analysis, there are no non-metric independent variables. Non-metric independent variables would be moved to the Factor(s) list box.
Specifying statistics to include in the output While we will accept most of the SPSS defaults for the analysis, we need to specifically request the classification table. Click on the Statistics… button to make a request.
Requesting the classification table First, keep the SPSS defaults for Model and Parameters. Second, mark the checkbox for the Classification table. Third, click on the Continue button to complete the request.
Completing the multinomial logistic regression request Click on the OK button to request the output for the multinomial logistic regression. The multinomial logistic procedure supports additional commands to specify the model computed for the relationships (we will use the default main effects model), additional specifications for computing the regression, and saving classification results. We will not make use of these options.
Sample size – ratio of cases to variables evidence and answer Multinomial logistic regression requires that the minimum ratio of valid cases to independent variables be at least 10 to 1. The ratio of valid cases (167) to number of independent variables (3) was 55.7 to 1, which was equal to or greater than the minimum ratio. The requirement for a minimum ratio of cases to independent variables was satisfied. The preferred ratio of valid cases to independent variables is 20 to 1. The ratio of 55.7 to 1 was equal to or greater than the preferred ratio. The preferred ratio of cases to independent variables was satisfied. The answer to this question is true.
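The sample size check reduces to one division and two comparisons. The sketch below uses the 167 valid cases and 3 independent variables from this slide; the function and variable names are hypothetical.

```python
def cases_per_variable(valid_cases, independent_variables):
    """Ratio of valid cases to independent variables."""
    return valid_cases / independent_variables

MINIMUM_RATIO = 10    # minimum guideline cited in the slides
PREFERRED_RATIO = 20  # preferred ratio used in these problems

ratio = cases_per_variable(167, 3)           # about 55.7 to 1
meets_minimum = ratio >= MINIMUM_RATIO
meets_preferred = ratio >= PREFERRED_RATIO
```

Both comparisons hold for this example, which is why the answer is true rather than true with caution.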
Multicollinearity and Numerical Problems - question Multicollinearity in the logistic regression solution is detected by examining the standard errors for the b coefficients. A standard error larger than 2.0 indicates numerical problems, such as multicollinearity among the independent variables, cells with a zero count for a dummy-coded independent variable because all of the subjects have the same value for the variable, and 'complete separation' whereby the two groups in the dependent event variable can be perfectly separated by scores on one of the independent variables. Analyses that indicate numerical problems should not be interpreted.
Multicollinearity and Numerical Problems – evidence and answer None of the independent variables in this analysis had a standard error larger than 2.0. (We are not interested in the standard errors associated with the intercept.) The answer to this question is true.
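As a sketch of this screening step, the Python below scans a list of standard errors and flags any above 2.0, skipping the intercepts. The standard error values are invented for illustration (the slides do not list them), and flag_large_standard_errors is a hypothetical helper.

```python
def flag_large_standard_errors(rows, threshold=2.0):
    """Return the (comparison, term, standard_error) rows whose standard
    error exceeds the threshold, ignoring intercept rows."""
    return [
        (comparison, term, se)
        for comparison, term, se in rows
        if term != "Intercept" and se > threshold
    ]

# HYPOTHETICAL standard errors, laid out like the 'Parameter Estimates' table.
parameter_rows = [
    ("TOO LITTLE", "Intercept", 2.41),   # ignored: intercept
    ("TOO LITTLE", "age", 0.02),
    ("TOO LITTLE", "educ", 0.14),
    ("TOO LITTLE", "conlegis", 0.62),
    ("ABOUT RIGHT", "Intercept", 2.35),  # ignored: intercept
    ("ABOUT RIGHT", "age", 0.02),
    ("ABOUT RIGHT", "educ", 0.13),
    ("ABOUT RIGHT", "conlegis", 0.61),
]

flagged = flag_large_standard_errors(parameter_rows)
no_numerical_problems = len(flagged) == 0
```

If any row were flagged, the analysis would be halted until the numerical problem was resolved.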
Overall Relationship - question The presence of a relationship between the dependent variable and combination of independent variables is based on the statistical significance of the final model chi-square in the SPSS table titled 'Model Fitting Information'.
Overall Relationship – evidence and answer In this analysis, the probability of the model chi-square (18.457) was p=0.005, less than or equal to the level of significance of 0.05. The null hypothesis that there was no difference between the model without independent variables and the model with independent variables was rejected. The existence of a relationship between the independent variables and the dependent variable was supported. The answer to this question is true with caution. Caution in interpreting the relationship should be exercised because the ordinal level variable "confidence in Congress" [conlegis] was treated as metric.
Individual Relationships – Age question The statistical significance of the relationship between age and opinion about spending on highways and bridges is based on the statistical significance of the chi-square statistic in the SPSS table titled "Likelihood Ratio Tests" and the interpretation of the odds ratio.
Individual Relationships – Age evidence and answer The statistical significance of the relationship between age and opinion about spending on highways and bridges is based on the statistical significance of the chi-square statistic in the SPSS table titled "Likelihood Ratio Tests". The likelihood ratio test of the relationship between "age" and "opinion about spending on highways and bridges" did not support the existence of a relationship (chi-square=2.652, p=0.265). False is the correct answer to this question.
Individual Relationships – highest year of school question The statistical significance of the relationship between highest year of school completed and opinion about spending on highways and bridges is based on the statistical significance of the chi-square statistic in the SPSS table titled "Likelihood Ratio Tests" and the interpretation of the odds ratio.
Individual Relationships – highest year of school evidence and answer The statistical significance of the relationship between highest year of school completed and opinion about spending on highways and bridges is based on the statistical significance of the chi-square statistic in the SPSS table titled "Likelihood Ratio Tests". The likelihood ratio test of the relationship between "highest year of school completed" and "opinion about spending on highways and bridges" did not support the existence of a relationship (chi-square=4.423, p=0.110). False is the correct answer to this question.
Individual Relationships – confidence in Congress question The statistical significance of the relationship between confidence in Congress and opinion about spending on highways and bridges is based on the statistical significance of the chi-square statistic in the SPSS table titled "Likelihood Ratio Tests" and the interpretation of the odds ratio.
Individual Relationships – confidence in Congress evidence and answer - 1 The statistical significance of the relationship between confidence in Congress and opinion about spending on highways and bridges is based on the statistical significance of the chi-square statistic in the SPSS table titled "Likelihood Ratio Tests". For this relationship, the probability of the chi-square statistic (9.221) was 0.010, less than or equal to the level of significance of 0.05. The null hypothesis that all of the b coefficients associated with confidence in Congress were equal to zero was rejected. The existence of a relationship between confidence in Congress and opinion about spending on highways and bridges was supported.
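The likelihood ratio tests for the three independent variables can be reproduced approximately by hand. The sketch below assumes each test has 2 degrees of freedom (one b coefficient in each of the two comparisons); the slides do not state the df, so treat that as an assumption. With df = 2 the upper-tail chi-square probability simplifies to exp(-x/2).

```python
import math

def lr_test_p_value_df2(chi_square):
    """Upper-tail chi-square probability for df = 2: exp(-x / 2).
    ASSUMPTION: each covariate contributes one coefficient per comparison."""
    return math.exp(-chi_square / 2.0)

# Chi-square statistics quoted from the 'Likelihood Ratio Tests' table.
lr_chi_squares = {"age": 2.652, "educ": 4.423, "conlegis": 9.221}
alpha = 0.05

p_values = {name: lr_test_p_value_df2(x) for name, x in lr_chi_squares.items()}
significant = {name: p <= alpha for name, p in p_values.items()}
```

Because the chi-square statistics are themselves rounded, the computed probabilities agree with the reported p values (0.265, 0.110, 0.010) only to within rounding. Only confidence in Congress falls at or below 0.05.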
Individual Relationships – confidence in Congress evidence and answer - 2 In the comparison of survey respondents who thought we spend too little money on highways and bridges to survey respondents who thought we spend too much money on highways and bridges, the probability of the Wald statistic (4.913) for the variable confidence in Congress [conlegis] was 0.027. Since the probability was less than or equal to the level of significance of 0.05, the null hypothesis that the b coefficient for confidence in Congress was equal to zero for this comparison was rejected.
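The probability attached to a Wald statistic can be checked the same way. A Wald test of a single b coefficient is referred to a chi-square distribution with 1 degree of freedom, and that upper-tail probability can be written with the complementary error function. The function name wald_p_value is hypothetical; the two Wald statistics are the ones reported in these slides for confidence in Congress.

```python
import math

def wald_p_value(wald):
    """Upper-tail chi-square probability for df = 1,
    computed as erfc(sqrt(wald / 2))."""
    return math.erfc(math.sqrt(wald / 2.0))

alpha = 0.05

# Wald statistics for conlegis in each comparison with the TOO MUCH group.
p_too_little = wald_p_value(4.913)    # TOO LITTLE vs. TOO MUCH
p_about_right = wald_p_value(7.298)   # ABOUT RIGHT vs. TOO MUCH

reject_too_little = p_too_little <= alpha
reject_about_right = p_about_right <= alpha
```

Both probabilities fall below 0.05, so the null hypothesis that the b coefficient equals zero is rejected in each comparison.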
Individual Relationships – confidence in Congress evidence and answer - 3 The value of Exp(B) was 3.948, which implies that for each unit increase in confidence in Congress the odds were multiplied by approximately four. The relationship stated in the problem is supported. Survey respondents who had more confidence in Congress were more likely to be in the group of survey respondents who thought we spend too little money on highways and bridges, rather than the group of survey respondents who thought we spend too much money on highways and bridges. For each unit increase in confidence in Congress, the odds of being in the group of survey respondents who thought we spend too little money on highways and bridges became approximately four times as large.
Individual Relationships – confidence in Congress evidence and answer - 4 In the comparison of survey respondents who thought we spend about the right amount of money on highways and bridges to survey respondents who thought we spend too much money on highways and bridges, the probability of the Wald statistic (7.298) for the variable confidence in Congress [conlegis] was 0.007. Since the probability was less than or equal to the level of significance of 0.05, the null hypothesis that the b coefficient for confidence in Congress was equal to zero for this comparison was rejected.
Individual Relationships – confidence in Congress evidence and answer - 5 The value of Exp(B) was 5.242, which implies that for each unit increase in confidence in Congress the odds were multiplied by approximately five and a quarter. The relationship stated in the problem is supported. Survey respondents who had more confidence in Congress were more likely to be in the group of survey respondents who thought we spend about the right amount of money on highways and bridges, rather than the group of survey respondents who thought we spend too much money on highways and bridges. For each unit increase in confidence in Congress, the odds of being in the group of survey respondents who thought we spend about the right amount of money on highways and bridges became approximately five and a quarter times as large.
Individual Relationships – confidence in Congress evidence and answer - 6 True with caution is the correct answer to this question. Caution in interpreting the relationship should be exercised because the ordinal level variable "confidence in Congress" [conlegis] was treated as metric.
Classification Accuracy - question The independent variables could be characterized as useful predictors distinguishing survey respondents who thought we spend too little money on highways and bridges, survey respondents who thought we spend about the right amount of money on highways and bridges and survey respondents who thought we spend too much money on highways and bridges if the classification accuracy rate was substantially higher than the accuracy attainable by chance alone.
Classification Accuracy – evidence and answer 1 The proportional by chance accuracy rate was computed by calculating the proportion of cases for each group based on the number of cases in each group in the 'Case Processing Summary', and then squaring and summing the proportion of cases in each group (0.371² + 0.557² + 0.072² = 0.453).
Classification Accuracy – evidence and answer 2 The classification accuracy rate was 60.5%, which was greater than or equal to the proportional by chance accuracy criterion of 56.6% (1.25 x 45.3% = 56.6%). The criterion for classification accuracy is satisfied. True is the correct answer to this question.
Steps in solving multinomial logistic regression problems: level of measurement
Question: Variables included in the analysis satisfy the level of measurement requirements?
1. Dependent non-metric? Independent variables metric or dichotomous? No: Inappropriate application of a statistic. Yes: go to step 2.
2. Ordinal independent variable included in analysis? Yes: True with caution. No: True.
Steps in solving multinomial logistic regression problems: sample size
Question: Number of variables and cases satisfy sample size requirements?
1. Run multinomial logistic regression.
2. Ratio of cases to independent variables at least 10 to 1? No: Inappropriate application of a statistic. Yes: go to step 3.
3. Ratio of cases to independent variables at least 20 to 1? No: True with caution. Yes: True.
Steps in solving multinomial logistic regression problems: multicollinearity/numerical problems
Question: No evidence of multicollinearity or numerical problems?
1. Standard errors of coefficients indicate presence of numerical problems (s.e. > 2.0)? Yes: False. If a numerical problem is found, halt analysis until the problem is resolved. No: True.
Steps in solving multinomial logistic regression problems: overall relationship
Question: Overall relationship between independent variables and dependent variable?
1. Overall relationship statistically significant (model chi-square test)? No: False. Yes: go to step 2.
2. Caution for ordinal variable or sample size not meeting preferred requirements? Yes: True with caution. No: True.
Steps in solving multinomial logistic regression problems: relationships between IV's and DV
Question: Interpretation of relationship between independent variable and dependent variable groups?
1. Overall relationship between specific IV and DV is statistically significant (likelihood ratio test)? No: False. Yes: go to step 2.
2. Role of specific IV and DV groups statistically significant and interpreted correctly (Wald test and Exp(B))? No: False. Yes: go to step 3.
3. Ordinal independent variable or sample size less than preferred requirements? Yes: True with caution. No: True.
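The decision steps for an individual relationship can be written as a small function. This is only a sketch of the flow described on this slide: the function name, argument names, and return strings are hypothetical, and the p values in the usage lines are the ones reported in the sample problem.

```python
def answer_individual_relationship(lr_p, wald_p=None,
                                   interpreted_correctly=False,
                                   caution=False, alpha=0.05):
    """Walk the three checks in order and return the problem's answer."""
    # Step 1: overall relationship (likelihood ratio test).
    if lr_p > alpha:
        return "False"
    # Step 2: role in the specific group comparison (Wald test and Exp(B)).
    if wald_p is None or wald_p > alpha or not interpreted_correctly:
        return "False"
    # Step 3: ordinal IV or sample size below the preferred ratio.
    return "True with caution" if caution else "True"

# Age: the likelihood ratio test is not significant, so the answer is False.
age_answer = answer_individual_relationship(lr_p=0.265)

# Confidence in Congress (TOO LITTLE vs. TOO MUCH): significant in both
# tests, stated direction accepted, but the variable is ordinal.
conlegis_answer = answer_individual_relationship(
    lr_p=0.010, wald_p=0.027, interpreted_correctly=True, caution=True)
```

The Wald test is never consulted when the likelihood ratio test fails, which mirrors the rule that pairwise comparisons are not interpreted without an overall relationship.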
Steps in solving multinomial logistic regression problems: classification accuracy
Question: Classification accuracy sufficient to be characterized as a useful model?
1. Overall accuracy rate is at least 25% higher than the proportional by chance accuracy rate? No: False. Yes: True.