Published by Cornelia Butler. Modified over 9 years ago.
SW388R7 Data Analysis & Computers II Slide 1 Logistic Regression – Hierarchical Entry of Variables Sample Problem Steps in Solving Problems
SW388R7 Data Analysis & Computers II Slide 2 Level of Measurement - question The first question requires us to examine the level of measurement requirements for binary logistic regression. Binary logistic regression requires that the dependent variable be dichotomous and the independent variables be metric or dichotomous.
SW388R7 Data Analysis & Computers II Slide 3 Level of Measurement – evidence and answer True with caution is the correct answer, since we satisfy the level of measurement requirements, but include ordinal level variables in the analysis.
SW388R7 Data Analysis & Computers II Slide 4 Sample Size - question The second question asks about the sample size requirements for binary logistic regression. To answer this question, we will run a baseline logistic regression to obtain some basic data about the problem and solution. The phrase “hierarchical entry” dictates the method for including variables in the model.
SW388R7 Data Analysis & Computers II Slide 5 Request hierarchical logistic regression Select the Regression | Binary Logistic… command from the Analyze menu.
SW388R7 Data Analysis & Computers II Slide 6 Selecting the dependent variable First, highlight the dependent variable grass in the list of variables. Second, click on the right arrow button to move the dependent variable to the Dependent text box.
SW388R7 Data Analysis & Computers II Slide 7 Selecting the control independent variables First, move the control independent variable, sex, listed in the problem to the Covariates list box. This will be the only variable in Block 1. Second, make sure that Enter is selected in the Method drop down menu. This tells SPSS that all of the variables in Block 1 will be included at the same time.
SW388R7 Data Analysis & Computers II Slide 8 Selecting the block for the predictors Next, click on the Next button to add the second block that will contain the predictors.
SW388R7 Data Analysis & Computers II Slide 9 Adding the predictor independent variables First, move the predictors to the Covariates list box. Block 2 of 2 tells us that we are entering variables in the second block.
SW388R7 Data Analysis & Computers II Slide 10 Specifying the method for including variables In our hierarchical regression, we will specify that all of the variables in Block 2 be entered simultaneously when the block is entered.
SW388R7 Data Analysis & Computers II Slide 11 Including the option for listing outliers SPSS will include a table of outliers in the output if we include the option to produce the table.
SW388R7 Data Analysis & Computers II Slide 12 Set the option for listing outliers First, mark the checkbox for Casewise listing of residuals, accepting the default of outliers outside 2 standard deviations. Second, click on the At last step option to display the table of outliers only at the end of the analysis.
SW388R7 Data Analysis & Computers II Slide 13 Requesting statistics needed for identifying outliers SPSS will calculate the values for studentized residuals and save them to the data set so that we can remove the outliers easily. Click on the Save… button to request the statistics that we want to save.
SW388R7 Data Analysis & Computers II Slide 14 Saving statistics needed for removing outliers First, mark the checkbox for Studentized residuals in the Residuals panel. Second, click on the Continue button to complete the specifications.
SW388R7 Data Analysis & Computers II Slide 15 Completing the logistic regression request Click on the OK button to request the output for the logistic regression. The logistic procedure supports the selection of subsets of cases, automatic recoding of nominal variables, saving other diagnostic statistics like standardized residuals, and options for additional statistics. However, none of these are needed for this analysis.
SW388R7 Data Analysis & Computers II Slide 16 Sample size – evidence and answer The minimum ratio of valid cases to independent variables for logistic regression is 10 to 1, with a preferred ratio of 20 to 1. In this analysis, there are 163 valid cases and 3 independent variables. The ratio of cases to independent variables is 54.33 to 1, which satisfies the minimum requirement. In addition, the ratio of 54.33 to 1 satisfies the preferred ratio of 20 to 1. The question which precipitated computing the logistic regression in SPSS was the question about sample size. We can now answer that question.
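The ratio check on this slide can be sketched in a few lines of Python. This is only an illustration of the arithmetic; the function name and the verdict labels are ours, chosen to mirror the 10 to 1 and 20 to 1 thresholds stated above.

```python
def sample_size_check(valid_cases: int, n_independent: int) -> tuple[float, str]:
    """Return the cases-to-variables ratio and a verdict based on the 10:1 and 20:1 rules."""
    ratio = valid_cases / n_independent
    if ratio < 10:
        verdict = "inappropriate application"  # below the minimum ratio
    elif ratio < 20:
        verdict = "true with caution"          # meets the minimum, not the preferred ratio
    else:
        verdict = "true"                       # meets the preferred ratio
    return ratio, verdict


# Values from this analysis: 163 valid cases, 3 independent variables.
ratio, verdict = sample_size_check(163, 3)
print(f"{ratio:.2f} to 1 -> {verdict}")  # 54.33 to 1 -> true
```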
SW388R7 Data Analysis & Computers II Slide 17 Outliers - question Outliers are defined as cases that have a studentized residual of +/-2.0 or larger.
SW388R7 Data Analysis & Computers II Slide 18 Outliers – evidence and answer False is the correct answer for the statement that there are no outliers. Using the criterion of studentized residuals greater than +/- 2.0, SPSS identified three outliers: case number 29, case number 92, and case number 173. Note that the cases are identified by the information in the footnote, and not by the list of standardized residuals (zresid) in the table.
SW388R7 Data Analysis & Computers II Slide 19 Model Selected for Interpretation - question Since we have found outliers, we need to determine whether we will interpret the model that includes all cases or the model that excludes outliers.
SW388R7 Data Analysis & Computers II Slide 20 Accuracy rate for baseline model The accuracy rate for the model used to detect outliers (70.6%) is used for the baseline accuracy rate. We will compare this to the accuracy rate for the model excluding outliers. In hierarchical logistic regression, we interpret the output for Block 2, when both the controls and the predictors have been entered into the analysis.
SW388R7 Data Analysis & Computers II Slide 21 Removing the outliers from the analysis - 1 Our next step is to run the revised logistic regression model that omits outliers. Our first step in this process is to tell SPSS to exclude the outliers from the analysis. We accomplish this by telling SPSS to include in the analysis all of the cases that are not outliers. To begin, choose the Select Cases… command from the Data menu.
SW388R7 Data Analysis & Computers II Slide 22 Removing the outliers from the analysis - 2 First, mark the If condition is satisfied option button to indicate that we will enter a specific condition for including cases. Second, click on the If… button to specify the criteria for inclusion in the analysis.
SW388R7 Data Analysis & Computers II Slide 23 Removing the outliers from the analysis - 3 To eliminate the outliers, we request that the cases that are not outliers be selected into the analysis. The formula specifies that we should include cases if the absolute value of the studentized residual (sre_1) is less than or equal to 2.00. The abs() or absolute value function tells SPSS to ignore the sign of the value. After typing in the formula, click on the Continue button to close the dialog box.
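The selection rule described on this slide can be sketched outside SPSS as a simple filter. The Python below is a minimal illustration only: the case records and residual values are invented for the example, and the function name is ours. Only the variable name sre_1 and the cutoff of 2.0 come from the slide.

```python
def keep_non_outliers(cases: list[dict], cutoff: float = 2.0) -> list[dict]:
    """Keep cases whose studentized residual is within +/- cutoff (sign ignored)."""
    return [case for case in cases if abs(case["sre_1"]) <= cutoff]


# Hypothetical cases; the residuals are made up to show the rule at work.
cases = [
    {"id": 1, "sre_1": 0.85},
    {"id": 2, "sre_1": -2.31},  # outlier: absolute value exceeds 2.0
    {"id": 3, "sre_1": 1.97},
    {"id": 4, "sre_1": 2.64},   # outlier
    {"id": 5, "sre_1": -0.12},
]

kept_ids = [case["id"] for case in keep_non_outliers(cases)]
print(kept_ids)  # [1, 3, 5]
```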
SW388R7 Data Analysis & Computers II Slide 24 Removing the outliers from the analysis - 4 To complete the request, we click on the OK button.
SW388R7 Data Analysis & Computers II Slide 25 Revised logistic regression omitting outliers - 1 To run the logistic regression eliminating the outliers, select the Logistic Regression command from the menu that drops down when you click on the Dialog Recall button.
SW388R7 Data Analysis & Computers II Slide 26 Revised logistic regression omitting outliers - 2 When we wanted to detect outliers, we asked SPSS to save the studentized residuals to the data editor. Since we no longer need the studentized residuals, we will omit saving them from this analysis. Click on the Save button to open the dialog box.
SW388R7 Data Analysis & Computers II Slide 27 Revised logistic regression omitting outliers - 3 Clear the checkbox for Studentized Residuals so that SPSS does not save a new set of them in the data editor when it runs the new regression. Click on the Continue button to close the dialog box.
SW388R7 Data Analysis & Computers II Slide 28 Revised logistic regression omitting outliers - 4 Click on the OK button to obtain the output for the revised model. The other specifications for the logistic regression are the same as previously marked.
SW388R7 Data Analysis & Computers II Slide 29 Accuracy rate for revised model Prior to the removal of outliers, the accuracy rate of the logistic regression model was 70.6%. After removing outliers, the accuracy rate of the logistic regression model was 71.3%. Since the logistic regression omitting outliers was less than two percent more accurate in classifying cases than the logistic regression with all cases, the logistic regression model with all cases is interpreted. False is the correct answer to the statement that we will interpret the model that excludes outliers. We will interpret the model that includes all cases. In hierarchical logistic regression, we interpret the output for Block 2, when both the controls and the predictors have been entered into the analysis.
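The model-selection rule on this slide reduces to a single comparison. The following Python sketch shows it with the two accuracy rates reported above; the function name and return labels are ours.

```python
def choose_model(baseline_accuracy: float, revised_accuracy: float,
                 required_gain: float = 2.0) -> str:
    """Pick the model omitting outliers only if it gains at least required_gain points."""
    gain = revised_accuracy - baseline_accuracy
    return "omitting outliers" if gain >= required_gain else "all cases"


# Accuracy rates in percent, as reported for the two models.
gain = round(71.3 - 70.6, 1)
choice = choose_model(70.6, 71.3)
print(gain, choice)  # 0.7 all cases
```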
SW388R7 Data Analysis & Computers II Slide 30 Restore all cases and run the baseline model again Since we will interpret the model including the outliers, we need to add the excluded cases back into the analysis. Choose the Select Cases… command from the Data menu.
SW388R7 Data Analysis & Computers II Slide 31 Select all cases First, mark the option button for All cases. Second, click on the OK button to close the dialog box.
SW388R7 Data Analysis & Computers II Slide 32 Re-run baseline model - 1 To re-run the baseline logistic regression including the outliers, select the Logistic Regression command from the menu that drops down when you click on the Dialog Recall button.
SW388R7 Data Analysis & Computers II Slide 33 Re-run baseline model - 2 We want to run the same logistic regression analysis we have previously run. All we need to do is click on the OK button.
SW388R7 Data Analysis & Computers II Slide 34 Multicollinearity and Numerical Problems - question Multicollinearity in the logistic regression solution is detected by examining the standard errors for the b coefficients. A standard error larger than 2.0 indicates numerical problems, such as multicollinearity among the independent variables, cells with a zero count for a dummy-coded independent variable because all of the subjects have the same value for the variable, and 'complete separation' whereby the two groups in the dependent event variable can be perfectly separated by scores on one of the independent variables. Analyses that indicate numerical problems should not be interpreted.
SW388R7 Data Analysis & Computers II Slide 35 Multicollinearity and Numerical Problems – evidence and answer The standard errors for the variables included in the analysis were: "liberal or conservative political views" (.133), "general happiness" (.362) and "sex" (.356). None of the independent variables in this analysis had a standard error larger than 2.0. True is the correct answer.
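The screening rule for numerical problems can be sketched as a filter over the standard errors. The Python below uses the three values reported on this slide; the dictionary keys are the SPSS variable names used in the problem, and the function name is ours.

```python
def flag_numerical_problems(standard_errors: dict[str, float],
                            cutoff: float = 2.0) -> dict[str, float]:
    """Return the variables whose standard error is larger than the cutoff."""
    return {name: se for name, se in standard_errors.items() if se > cutoff}


standard_errors = {"polviews": 0.133, "happy": 0.362, "sex": 0.356}
flagged = flag_numerical_problems(standard_errors)
print(flagged)  # {} (no standard error exceeds 2.0)
```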
SW388R7 Data Analysis & Computers II Slide 36 Overall Relationship - question The presence of a relationship between the dependent variable and combination of independent variables is based on the statistical significance of the model chi-square at Block 2 after the independent variables have been added to the analysis.
SW388R7 Data Analysis & Computers II Slide 37 Overall Relationship – evidence and answer True is the correct answer. In a hierarchical logistic regression, the presence of a relationship between the dependent variable and combination of independent variables entered after the control variables have been included is based on the statistical significance of the block chi-square for the second block of variables in which the predictor independent variables are included. In this analysis, the probability of the block chi-square (20.308) was p<0.001, less than or equal to the level of significance of 0.05. The null hypothesis that there is no difference between the model with only a constant and the control variables versus the model with the predictor independent variables was rejected. The contribution of the relationship between the predictor independent variables and the dependent variable was supported.
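As a rough check on the reported probability, the sketch below converts the block chi-square into a p-value. It assumes 2 degrees of freedom, one for each of the two predictors entered in Block 2; the slide does not state the degrees of freedom, so this is our assumption. For 2 degrees of freedom the upper-tail chi-square probability has the closed form exp(-x/2), which is why no statistics library is needed. The function name is ours.

```python
import math


def chi_square_p_value_df2(chi_square: float) -> float:
    """Upper-tail probability of a chi-square statistic with 2 degrees of freedom.

    Valid only for df = 2 (an assumption here), where the tail area is exp(-x/2).
    """
    return math.exp(-chi_square / 2.0)


p_value = chi_square_p_value_df2(20.308)
print(p_value < 0.001)  # True
print(p_value <= 0.05)  # True
```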
SW388R7 Data Analysis & Computers II Slide 38 Individual Relationships – Political Views - question To answer the question about an individual relationship, we look to the significance of the Wald test of the B coefficient and the interpretation of the odds ratio.
SW388R7 Data Analysis & Computers II Slide 39 Individual Relationships – Political Views – evidence and answer The probability of the Wald statistic for the variable "liberal or conservative political views" [polviews] was p=0.008, less than or equal to the level of significance of 0.05. The null hypothesis that the b coefficient for "liberal or conservative political views" [polviews] was equal to zero was rejected. "Liberal or conservative political views" [polviews] is an ordinal variable that is coded so that higher numeric values are associated with survey respondents who were more conservative.
SW388R7 Data Analysis & Computers II Slide 40 Individual Relationships – Political Views – evidence and answer The value of Exp(B) was 0.704 which implies a decrease in the odds of 29.6% (0.704 - 1.0 = -0.296). The correct interpretation of the relationship is that 'survey respondents who were more conservative were 29.6% less likely to have been more supportive that the use of marijuana should be made legal.' True with caution is the correct answer. Caution in interpreting the relationship should be exercised because the ordinal level variable "liberal or conservative political views" [polviews] was treated as metric.
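The conversion from Exp(B) to a percentage change in the odds is the same for every predictor, so it can be sketched once. The Python below applies it to the Exp(B) value on this slide and, for comparison, to the 0.286 reported for general happiness; the function name is ours.

```python
def odds_change_percent(exp_b: float) -> float:
    """Percentage change in the odds for a one-unit increase in the predictor."""
    return (exp_b - 1.0) * 100.0


# Negative values mean the odds decrease as the predictor increases.
print(round(odds_change_percent(0.704), 1))  # -29.6
print(round(odds_change_percent(0.286), 1))  # -71.4
```

An Exp(B) above 1.0 would give a positive percentage, that is, an increase in the odds.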
SW388R7 Data Analysis & Computers II Slide 41 Individual Relationships – General Happiness - question To answer the question about an individual relationship, we look to the significance of the Wald test of the B coefficient and the interpretation of the odds ratio.
SW388R7 Data Analysis & Computers II Slide 42 Individual Relationships – General Happiness – evidence and answer The probability of the Wald statistic for the variable "general happiness" [happy] was p=0.001, less than or equal to the level of significance of 0.05. The null hypothesis that the b coefficient for "general happiness" [happy] was equal to zero was rejected. "General happiness" [happy] is an ordinal variable that is coded so that higher numeric values are associated with survey respondents who were happier overall.
SW388R7 Data Analysis & Computers II Slide 43 Individual Relationships – General Happiness – evidence and answer The value of Exp(B) was 0.286 which implies a decrease in the odds of 71.4% (0.286 - 1.0 = -0.714). The correct interpretation of the relationship is that 'survey respondents who were happier overall were 71.4% less likely to have been more supportive that the use of marijuana should be made legal.' True with caution is the correct answer. Caution in interpreting the relationship should be exercised because the ordinal level variable "general happiness" [happy] was treated as metric.
SW388R7 Data Analysis & Computers II Slide 44 Classification Accuracy - question The independent variables could be characterized as useful predictors distinguishing survey respondents who have been more supportive that the use of marijuana should be made legal from survey respondents who have been less supportive that the use of marijuana should be made legal if the classification accuracy rate was substantially higher than the accuracy attainable by chance alone. Operationally, the classification accuracy rate should be 25% or more higher than the proportional by chance accuracy rate.
SW388R7 Data Analysis & Computers II Slide 45 Classification Accuracy computing by chance accuracy rate The proportional by chance accuracy rate was computed by calculating the proportion of cases for each group based on the number of cases in each group in the classification table at Step 0. The proportion in the Not Legal group was 0.644, making the proportion in the Legal group 0.356 (1.0 – 0.644). The proportions of cases in each group are then squared and summed (0.644² + 0.356² = 0.541). The proportional by chance accuracy criterion is 25% higher, or 67.7% (1.25 x 54.1% = 67.7%).
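The by chance accuracy computation can be reproduced with a short Python sketch. It uses the group proportions 0.644 and 0.356 that appear in the squared-sum formula on this slide, together with the 70.6% accuracy rate reported for the model; the function name is ours.

```python
def proportional_by_chance(proportions: list[float]) -> float:
    """Sum of squared group proportions: the accuracy expected by chance alone."""
    return sum(p * p for p in proportions)


p_not_legal = 0.644
p_legal = 1.0 - p_not_legal                 # 0.356

chance_rate = proportional_by_chance([p_not_legal, p_legal])
criterion = 1.25 * chance_rate              # 25% higher than the chance rate

print(round(chance_rate, 3))                # 0.541
print(round(criterion, 3))                  # 0.677
print(0.706 >= criterion)                   # True: the model meets the criterion
```

Note that 67.7% comes from multiplying the unrounded chance rate by 1.25; multiplying the rounded 54.1% gives 67.6%.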
SW388R7 Data Analysis & Computers II Slide 46 Classification Accuracy – evidence and answer The classification accuracy rate computed by SPSS was 70.6% which was greater than or equal to the proportional by chance accuracy criterion of 67.7% (1.25 x 54.1% = 67.7%). The criterion for classification accuracy is satisfied. True is the correct answer to the question.
SW388R7 Data Analysis & Computers II Slide 47 Validation - question For a hierarchical logistic regression, the 75%-25% cross-validation must verify the overall contribution of the independent variables entered after the control variables have been included. In addition, the pattern of significance for the individual relationships between the dependent variable and the predictors for the training sample should be the same as the pattern for the full data set. And finally, the classification accuracy rate for the validation sample must be within 2% of the accuracy rate for the training sample.
SW388R7 Data Analysis & Computers II Slide 48 Validation analysis: set the random number seed To set the random number seed, select the Random Number Seed… command from the Transform menu.
SW388R7 Data Analysis & Computers II Slide 49 Set the random number seed First, click on the Set seed to option button to activate the text box. Second, type in the random seed stated in the problem. Third, click on the OK button to complete the dialog box. Note that SPSS does not provide you with any feedback about the change.
SW388R7 Data Analysis & Computers II Slide 50 Validation analysis: compute the split variable To enter the formula for the variable that will split the sample in two parts, click on the Compute… command.
SW388R7 Data Analysis & Computers II Slide 51 The formula for the split variable First, type the name for the new variable, split, into the Target Variable text box. Second, the formula for the value of split is shown in the text box. The uniform(1) function generates a random decimal number between 0 and 1. The random number is compared to the value 0.75. If the random number is less than or equal to 0.75, the value of the formula will be 1, the SPSS numeric equivalent to true. If the random number is larger than 0.75, the formula will return a 0, the SPSS numeric equivalent to false. Third, click on the OK button to complete the dialog box.
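The logic of the split formula can be sketched in Python as follows. This is an analogue, not the SPSS computation: Python's random number generator differs from the one in SPSS, so the same seed will not select the same cases. The seed value and the number of cases below are hypothetical, and the function name is ours.

```python
import random


def make_split(n_cases: int, seed: int, train_share: float = 0.75) -> list[int]:
    """Assign 1 (training) when a uniform draw is <= train_share, otherwise 0 (validation)."""
    rng = random.Random(seed)  # a seeded generator makes the split repeatable
    return [1 if rng.random() <= train_share else 0 for _ in range(n_cases)]


# Hypothetical seed and sample size, for illustration only.
split = make_split(n_cases=20, seed=12345)
print(split)
print(sum(split), "training cases,", len(split) - sum(split), "validation cases")
```

Because each case is assigned independently, the training share is only approximately 75% in any one sample.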
SW388R7 Data Analysis & Computers II Slide 52 Running the logistic regression again with the training sample We repeat the logistic regression analysis for the training sample. Select the Regression | Binary Logistic… command from the Analyze menu.
SW388R7 Data Analysis & Computers II Slide 53 Using "split" as the selection variable First, scroll down the list of variables and highlight the variable split. Second, click on the right arrow button to move the split variable to the Selection Variable text box.
SW388R7 Data Analysis & Computers II Slide 54 Setting the value of split to select cases When the variable named split is moved to the Selection Variable text box, SPSS adds "=?" after the name to prompt us to enter a specific value for split. Click on the Rule… button to enter a value for split.
SW388R7 Data Analysis & Computers II Slide 55 Completing the value selection First, type the value for the first half of the sample, 1, into the Value text box. Second, click on the Continue button to complete the value entry.
SW388R7 Data Analysis & Computers II Slide 56 Requesting output for the validation sample When the value entry dialog box is closed, SPSS adds the value we entered after the equal sign. This specification now tells SPSS to include in the analysis only those cases that have a value of 1 for the split variable. Click on the OK button to request the output.
SW388R7 Data Analysis & Computers II Slide 57 Validation – evidence and answer Overall relationship The significance of the overall relationship between the individual independent variables and the dependent variable supports the interpretation of the model using the full data set. For a hierarchical logistic regression, the cross-validation must verify the contribution of the independent variables entered after the control variables have been included. This is based on the statistical significance of the block chi-square for the second block of variables. In the cross-validation analysis, the relationship between the independent variables and the dependent variable taking into account the effect of the control variables was statistically significant. The probability for the block chi-square (23.287) testing the block of independent variables was p<0.001.
SW388R7 Data Analysis & Computers II Slide 58 Validation – evidence and answer Individual relationship – Political Views The relationship between "liberal or conservative political views" [polviews] and "support for legalization of marijuana" [grass] was statistically significant for the model using the full data set (p=0.008). Similarly, the relationship in the cross-validation analysis was statistically significant. In the cross-validation analysis, the probability for the test of relationship between "liberal or conservative political views" [polviews] and "support for legalization of marijuana" [grass] was p=0.004, which was less than or equal to the level of significance of 0.05 and statistically significant.
SW388R7 Data Analysis & Computers II Slide 59 Validation – evidence and answer Individual relationship – General Happiness The pattern of significance for the individual relationships between the dependent variable and the independent variables was the same for the analysis using the full data set and the 75% training sample. The relationship between "general happiness" [happy] and "support for legalization of marijuana" [grass] was statistically significant for the model using the full data set (p=0.001). Similarly, the relationship in the cross-validation analysis was statistically significant. In the cross-validation analysis, the probability for the test of relationship between "general happiness" [happy] and "support for legalization of marijuana" [grass] was p<0.001, which was less than or equal to the level of significance of 0.05 and statistically significant.
SW388R7 Data Analysis & Computers II Slide 60 Validation – evidence and answer Classification accuracy The classification accuracy rate for the model using the training sample was 66.9%, compared to 66.7% for the validation sample. The shrinkage in classification accuracy for the validation analysis is the difference between the accuracy for the training sample (66.9%) and the accuracy for the validation sample (66.7%), which equals 0.2% in this analysis. The shrinkage was within the 2% criterion for minimal shrinkage, small enough to support a conclusion that the logistic regression model based on this analysis would be effective in predicting scores for cases other than those included in the calculation of the regression analysis. The validation analysis supports the generalizability of the analysis. The answer to the question is true.
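The shrinkage check on this slide is a subtraction followed by a comparison. The Python sketch below uses the two accuracy rates reported above and treats "within 2%" as an absolute difference of at most 2 percentage points, which is our reading of the criterion; the function names are ours.

```python
def shrinkage(training_accuracy: float, validation_accuracy: float) -> float:
    """Drop in classification accuracy (percentage points) from training to validation."""
    return training_accuracy - validation_accuracy


def minimal_shrinkage(training_accuracy: float, validation_accuracy: float,
                      limit: float = 2.0) -> bool:
    """True when the two accuracy rates are within `limit` percentage points."""
    return abs(shrinkage(training_accuracy, validation_accuracy)) <= limit


print(round(shrinkage(66.9, 66.7), 1))  # 0.2
print(minimal_shrinkage(66.9, 66.7))    # True
```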
SW388R7 Data Analysis & Computers II Slide 61 Summary of Findings - question The final question is a summary of the findings of the analysis: overall relationship, individual relationships, and usefulness of the model. Cautions are added, if needed, for sample size and level of measurement issues.
SW388R7 Data Analysis & Computers II Slide 62 Summary of Findings – evidence and answer True with caution is the correct answer.
SW388R7 Data Analysis & Computers II Slide 63 Hierarchical binary logistic regression: level of measurement Question: Variables included in the analysis satisfy the level of measurement requirements? Decision 1: Dependent dichotomous? Independent variables metric or dichotomous? If no: Inappropriate application of a statistic. If yes: go to Decision 2. Decision 2: Ordinal independent variable included in analysis? If yes: True with caution. If no: True.
SW388R7 Data Analysis & Computers II Slide 64 Hierarchical binary logistic regression: sample size Question: Number of variables and cases satisfy sample size requirements? Step: Run baseline logistic regression, using hierarchical method for including variables identified in the research question. Record classification accuracy for evaluation of the effect of removing outliers. Decision 1: Ratio of cases to independent variables at least 10 to 1? If no: Inappropriate application of a statistic. If yes: go to Decision 2. Decision 2: Ratio of cases to independent variables at least 20 to 1? If no: True with caution. If yes: True.
SW388R7 Data Analysis & Computers II Slide 65 Hierarchical binary logistic regression: detecting outliers Question: Outliers were not detected in the analysis? Decision: Outliers for the solution identified by studentized residuals > ±2.0? If yes: False. If no: True.
SW388R7 Data Analysis & Computers II Slide 66 Hierarchical binary logistic regression: selecting model for interpretation Question: Interpret baseline model or model excluding outliers? Decision 1: Outliers for the solution identified by studentized residuals > ±2.0? If no: Pick baseline logistic regression for interpretation. If yes: Run revised logistic regression excluding outliers, using method for including variables identified in research question, then go to Decision 2. Decision 2: Classification accuracy omitting outliers better than baseline by 2% or more? If no: Pick baseline logistic regression for interpretation. If yes: Pick logistic regression that omits outliers for interpretation. The answer is True or False according to whether the model named in the question is the one picked.
SW388R7 Data Analysis & Computers II Slide 67 Hierarchical binary logistic regression: multicollinearity or numerical problems Question: No evidence of multicollinearity or numerical problems? Decision: Standard errors of coefficients indicate presence of numerical problems (s.e. > 2.0)? If yes: False. If no: True. If numerical problem found, halt analysis until problem is resolved.
SW388R7 Data Analysis & Computers II Slide 68 Hierarchical binary logistic regression: overall relationship Question: Overall relationship between independent variables and dependent variable? Decision 1: Relationship confirmed by significance of block chi-square for predictors at step 2? If no: False. If yes: go to Decision 2. Decision 2: Caution for ordinal variable or sample size not meeting preferred requirements? If yes: True with caution. If no: True.
SW388R7 Data Analysis & Computers II Slide 69 Hierarchical binary logistic regression: relationships between IV's and DV Question: Interpretation of relationship between independent variable and dependent variable groups? Decision 1: Individual relationship confirmed by significance of Wald statistic? If no: False. If yes: go to Decision 2. Decision 2: Direction and size of odds ratio interpreted correctly? If no: False. If yes: go to Decision 3. Decision 3: Caution for ordinal variable or sample size not meeting preferred requirements? If yes: True with caution. If no: True.
SW388R7 Data Analysis & Computers II Slide 70 Hierarchical binary logistic regression: classification accuracy Question: Classification accuracy sufficient to be characterized as a useful model? Decision: Overall accuracy rate is at least 25% higher than the proportional by chance accuracy rate? If no: False. If yes: True.
SW388R7 Data Analysis & Computers II Slide 71 Hierarchical binary logistic regression: validation - 1 Question: Validation analysis supports generalizability of model? Step: Compute 75-25 split variable. Re-run logistic regression, using method for including variables identified in the research question. Decision: Probability of block chi-square for predictors at Block 2 <= level of significance? If no: False. If yes: continue with validation - 2.
SW388R7 Data Analysis & Computers II Slide 72 Hierarchical binary logistic regression: validation - 2 Decision 1: Significance of predictors in training sample matches pattern for model using full data set? If no: False. If yes: go to Decision 2. Decision 2: Shrinkage in classification accuracy for holdout sample < 2%? If no: False. If yes: True.
SW388R7 Data Analysis & Computers II Slide 73 Hierarchical binary logistic regression: summary of findings - 1 Question: Summary of findings correctly stated, including cautions? Decision 1: Overall relationship correctly stated? If no: False. If yes: go to Decision 2. Decision 2: Individual relationship with IV and DV correctly stated? If no: False. If yes: go to Decision 3. Decision 3: Classification accuracy supports useful model? If no: False. If yes: continue with summary of findings - 2.
SW388R7 Data Analysis & Computers II Slide 74 Hierarchical binary logistic regression: summary of findings - 2 Decision 1: One or more IV's are ordinal level variables? If yes: True with caution. If no: go to Decision 2. Decision 2: Satisfies preferred ratio of cases to IV's of 20 to 1? If no: True with caution. If yes: True.