Published by Felicity Walton. Modified over 9 years ago.
SW388R7 Data Analysis & Computers II Slide 1 Analyzing Missing Data: Introduction; Practice Problems; Homework Problems; Using Scripts
SW388R7 Data Analysis & Computers II Slide 2 Missing data and data analysis Missing data is a problem in multivariate data because a case will be excluded from the analysis if it is missing data for any variable included in the analysis. If our sample is large, we may be able to allow cases to be excluded. If our sample is small, we will try to use a substitution method so that we can retain enough cases to have sufficient power to detect effects. In either case, we need to make certain that we understand the potential impact that missing data may have on our analysis.
SW388R7 Data Analysis & Computers II Slide 3 Tools for evaluating missing data SPSS has a specific package for evaluating missing data, but it is not included under the UT license. In place of this package, we will first examine missing data using standard SPSS statistics and procedures. After studying these procedures, we will use an SPSS script that produces the output needed for missing data analysis without requiring us to issue all of the SPSS commands individually.
SW388R7 Data Analysis & Computers II Slide 4 Key issues in missing data analysis We will focus on three key issues for evaluating missing data: (1) the number of cases missing per variable; (2) the number of variables missing per case; and (3) the pattern of correlations among variables created to represent missing and valid data. Further analysis may be required depending on the problems identified in these analyses.
SW388R7 Data Analysis & Computers II Slide 5 The number of cases missing per variable If a variable is missing data for half or more of the cases, it should be considered a candidate for exclusion from the analysis, and certainly not one for which we could substitute values. The rule for calculating what constitutes half or more of the cases is to always round down. Thus, in a data set containing 100 cases, 50 cases represent half or more of the cases. In a data set containing 101 cases, 50 cases would also represent half or more of the cases (101 / 2 = 50.5, which equals 50 when rounded down).
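The round-down rule above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not part of the SPSS workflow; the function names are hypothetical.

```python
def half_or_more_threshold(n_cases: int) -> int:
    """Smallest number of missing cases that counts as 'half or more'.

    Follows the round-down rule stated on the slide: n / 2, rounded down.
    """
    return n_cases // 2


def is_exclusion_candidate(n_missing: int, n_cases: int) -> bool:
    """True when a variable is missing data for half or more of the cases."""
    return n_missing >= half_or_more_threshold(n_cases)


if __name__ == "__main__":
    for n in (100, 101, 270):
        print(n, "cases -> threshold", half_or_more_threshold(n))
```

Integer division (`//`) performs the rounding down, so no separate rounding step is needed.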
SW388R7 Data Analysis & Computers II Slide 6 The number of variables missing per case If a case is missing more than half of the variables, it should be considered a candidate for exclusion from the analysis. To compute the number of variables that constitute more than half, we divide the number of variables in the analysis by 2, round the result down if there is a decimal remainder, and add one to make our criterion more than half. For example, if an analysis contains 6 variables, more than half would be 4 (6 / 2 = 3; 3 + 1 = 4). If an analysis contained 7 variables, more than half would also be 4 (7 / 2 = 3.5, rounded down to 3; 3 + 1 = 4).
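The "more than half" calculation can be written the same way. This is a minimal sketch of the rule as stated on the slide; the function name is hypothetical.

```python
def more_than_half_threshold(n_variables: int) -> int:
    """Smallest number of missing variables that counts as 'more than half'.

    Divide by 2, round down, then add one, as described on the slide.
    """
    return n_variables // 2 + 1


if __name__ == "__main__":
    for k in (5, 6, 7):
        print(k, "variables -> more than half is", more_than_half_threshold(k))
```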
SW388R7 Data Analysis & Computers II Slide 7 Pattern of missing data among variables To detect a pattern of missing data among variables, we create special dichotomous variables where a value of 1 means the case has valid data and a value of 0 means the case is missing data. If the correlation between a pair of these dichotomous variables is statistically significant, it indicates that they have a similar pattern of missing values that is not random. The pattern should be examined further to determine what impact it might have on the generalizability of our findings.
SW388R7 Data Analysis & Computers II Slide 8 Practice Problem 1
SW388R7 Data Analysis & Computers II Slide 9 Identifying the number of cases in the data set This problem wants to know if a variable is missing data for half or more of the cases. Our first task is to identify the number of cases that satisfy that criterion. If we scroll to the bottom of the data set, we see that there are 270 cases in the data set. 270 ÷ 2 = 135. If any variable included in the analysis has 135 or more missing cases, the answer to the problem will be true.
SW388R7 Data Analysis & Computers II Slide 10 Request frequency distributions We will use the output for frequency distributions to find the number of missing cases for each variable. Select the Descriptive Statistics | Frequencies… command from the Analyze menu.
SW388R7 Data Analysis & Computers II Slide 11 Completing the specification for frequencies First, move the five variables included in the problem statement to the list box for variables. Second, click on the OK button to complete the request for statistical output.
SW388R7 Data Analysis & Computers II Slide 12 Number of missing cases for each variable The table of statistics at the top of the Frequencies output details the number of missing cases for each variable in the analysis.
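Outside SPSS, the same per-variable count of missing cases can be sketched in plain Python. The rows below are made-up illustration data (not the course data set), missing values are assumed to be represented by `None`, and the function name is hypothetical.

```python
def missing_per_variable(rows, variables):
    """Count, for each variable, how many cases have a missing (None) value."""
    return {v: sum(1 for row in rows if row.get(v) is None) for v in variables}


# Hypothetical toy data: four cases, three of the variables named in the problem.
rows = [
    {"wrkstat": 1, "hrs1": 40, "wrkslf": 2},
    {"wrkstat": 1, "hrs1": None, "wrkslf": 2},
    {"wrkstat": 5, "hrs1": None, "wrkslf": None},
    {"wrkstat": 2, "hrs1": 25, "wrkslf": 1},
]
variables = ["wrkstat", "hrs1", "wrkslf"]

counts = missing_per_variable(rows, variables)
threshold = len(rows) // 2  # half or more of the cases, rounded down
flagged = [v for v in variables if counts[v] >= threshold]

if __name__ == "__main__":
    print(counts)
    print("missing half or more:", flagged)
```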
SW388R7 Data Analysis & Computers II Slide 13 Answering the problem With 270 subjects in the data set, variables missing data for 135 or more cases would correctly be characterized as missing data for half or more of the cases in the data set. One variable was incorrectly characterized as missing half or more of the 270 cases: "self-employment" [wrkslf] was missing data for 20 of the 270 cases (7.4%). None of the variables in this analysis was missing cases for half or more of the 270 cases in the data set. False is the correct answer.
SW388R7 Data Analysis & Computers II Slide 14 Practice Problem 2
SW388R7 Data Analysis & Computers II Slide 15 Create a variable that counts missing data We want to know how many of the five variables in the analysis were missing data for each case in the data set. We will create a variable containing this information that uses an SPSS function to count the number of variables with missing data. To compute a new variable, select the Compute… command from the Transform menu.
SW388R7 Data Analysis & Computers II Slide 16 Enter specifications for new variable First, type in the name for the new variable, nmiss, in the Target Variable text box. Second, scroll down the list of functions and highlight the NMISS function. Third, click on the up arrow button to move the NMISS function into the Numeric Expression text box.
SW388R7 Data Analysis & Computers II Slide 17 Enter specifications for new variable The NMISS function is moved into the Numeric Expression text box. To add the list of variables to count missing data for, we first highlight the first variable to include in the function, wrkstat. Second, click on the right arrow button to move the variable name into the function arguments.
SW388R7 Data Analysis & Computers II Slide 18 Enter specifications for new variable First, before we add another variable to the function, we type a comma to separate the names of the variables. Second, to add the next variable we highlight the second variable to include in the function, hrs1. Third, click on the right arrow button to move the variable name into the function arguments.
SW388R7 Data Analysis & Computers II Slide 19 Complete specifications for new variable Continue adding variables to the function until all of the variables specified in the problem have been added. Be sure to type a comma between the variable names. When all of the variables have been added to the function, click on the OK button to complete the specifications.
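What the NMISS function computes for each case can be sketched in Python as a count of missing values across the listed variables. The rows are made-up illustration data, `None` is assumed to stand for a missing value, and `count_missing` is a hypothetical helper, not an SPSS call.

```python
VARIABLES = ["wrkstat", "hrs1", "wrkslf", "wrkgovt", "prestg80"]


def count_missing(row, variables=VARIABLES):
    """Number of the listed variables that are missing (None) for one case."""
    return sum(1 for v in variables if row.get(v) is None)


# Hypothetical toy cases.
rows = [
    {"wrkstat": 1, "hrs1": 40, "wrkslf": 2, "wrkgovt": 2, "prestg80": 51},
    {"wrkstat": 1, "hrs1": None, "wrkslf": 2, "wrkgovt": 2, "prestg80": 36},
    {"wrkstat": 5, "hrs1": None, "wrkslf": None, "wrkgovt": None, "prestg80": None},
]

# Add the new variable to every case, mirroring the nmiss column in the data editor.
for row in rows:
    row["nmiss"] = count_missing(row)

if __name__ == "__main__":
    print([row["nmiss"] for row in rows])
```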
SW388R7 Data Analysis & Computers II Slide 20 The nmiss variable in the data editor If we scroll the worksheet to the right, we see the new variable that SPSS has just computed for us.
SW388R7 Data Analysis & Computers II Slide 21 A frequency distribution for nmiss To answer the question of how many cases had each of the possible numbers of missing values, we create a frequency distribution. Select the Descriptive Statistics | Frequencies… command from the Analyze menu.
SW388R7 Data Analysis & Computers II Slide 22 Completing the specification for frequencies First, move the nmiss variable to the list of variables. Second, click on the OK button to complete the request for statistical output.
SW388R7 Data Analysis & Computers II Slide 23 The frequency distribution SPSS produces a frequency distribution for the nmiss variable. 170 cases had valid, non-missing values for all 5 variables. 85 cases had one missing value; 1 case had 2 missing values; and 14 cases had missing values for 4 variables.
SW388R7 Data Analysis & Computers II Slide 24 Answering the problem The problem asked whether or not 14 cases had missing data for more than half the variables. For a set of five variables, cases that had 3, 4, or 5 missing values would meet this requirement. The number of cases with 3 missing variables is 0 (not shown in table), with 4 missing variables is 14, and with 5 missing variables is 0, for a total of 14. The answer to the problem is true.
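The tally behind this answer can be checked with a short Python sketch. The frequencies are the ones reported in the nmiss frequency distribution for this problem; the variable names are chosen here for illustration.

```python
# Frequency distribution of nmiss reported for the five-variable problem:
# number of missing variables -> number of cases.
nmiss_frequencies = {0: 170, 1: 85, 2: 1, 4: 14}

n_variables = 5
threshold = n_variables // 2 + 1  # more than half of 5 variables

# Cases missing more than half of the variables (3, 4, or 5 missing).
cases_over_threshold = sum(
    count for n_missing, count in nmiss_frequencies.items() if n_missing >= threshold
)
total_cases = sum(nmiss_frequencies.values())

if __name__ == "__main__":
    print("threshold:", threshold)
    print("cases missing more than half:", cases_over_threshold)
    print("total cases:", total_cases)
```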
SW388R7 Data Analysis & Computers II Slide 25 Practice Problem 3
SW388R7 Data Analysis & Computers II Slide 26 Compute valid/missing dichotomous variables To evaluate the pattern of missing data, we need to compute dichotomous valid/missing variables for each of the five variables included in the analysis. We will compute each new variable using the Recode command. To create the new variable, select the Recode | Into Different Variables… command from the Transform menu.
SW388R7 Data Analysis & Computers II Slide 27 Enter specifications for new variable First, move the first variable in the analysis, wrkstat, into the Numeric Variable -> Output Variable text box. Second, type the name for the new variable into the Name text box. My convention is to add an underscore character to the end of the variable name. If this would make the variable more than 8 characters long, delete characters from the end of the original variable name.
SW388R7 Data Analysis & Computers II Slide 28 Enter specifications for new variable Next, type the label for the new variable into the Label text box. My convention is to add the phrase (Valid/Missing) to the end of the variable label for the original variable. Finally, click on the Change button to add the name of the dichotomous variable to the Numeric Variable -> Output Variable text box.
SW388R7 Data Analysis & Computers II Slide 29 Enter specifications for new variable To specify the values for the new variable, click on the Old and New Values… button.
SW388R7 Data Analysis & Computers II Slide 30 Change the value for missing data The dichotomous variable should be coded 1 if the variable has a valid value, 0 if the variable has a missing value. First, mark the System- or user-missing option button. Second, type 0 in the Value text box. Third, click on the Add button to include this change in the Old->New list box.
SW388R7 Data Analysis & Computers II Slide 31 Change the value for valid data First, mark the All other values option button. Second, type 1 in the Value text box. Third, click on the Add button to include this change in the Old->New list box.
SW388R7 Data Analysis & Computers II Slide 32 Complete the value specifications Having entered the values for recoding the variable into dichotomous values, we click on the Continue button to complete this dialog box.
SW388R7 Data Analysis & Computers II Slide 33 Complete the recode specifications Having entered specifications for the new variable and the values for recoding the variable into dichotomous values, we click on the OK button to produce the new variable.
SW388R7 Data Analysis & Computers II Slide 34 The dichotomous variable The procedure for creating a dichotomous valid/missing variable is repeated for the four other variables in the analysis: hrs1, wrkslf, wrkgovt, and prestg80.
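The recode applied to all five variables can be sketched in Python as follows. The naming helper imitates the convention described earlier (append an underscore, trimming the original name so the result stays within 8 characters); the rows are made-up illustration data with `None` assumed to mark a missing value, and both function names are hypothetical.

```python
VARIABLES = ["wrkstat", "hrs1", "wrkslf", "wrkgovt", "prestg80"]


def indicator_name(name: str, max_len: int = 8) -> str:
    """Append an underscore, trimming the original name to fit max_len characters."""
    return name[: max_len - 1] + "_"


def valid_missing_indicators(row, variables=VARIABLES):
    """Return 1 for each variable with valid data and 0 for each missing one."""
    return {indicator_name(v): 0 if row.get(v) is None else 1 for v in variables}


# Hypothetical toy case with two missing values.
example = {"wrkstat": 1, "hrs1": None, "wrkslf": 2, "wrkgovt": None, "prestg80": 51}
indicators = valid_missing_indicators(example)

if __name__ == "__main__":
    print(indicators)
```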
SW388R7 Data Analysis & Computers II Slide 35 Filtering cases with excessive missing variables If we include the cases that have more than half of the variables missing, we will inflate the correlations. To prevent this, we exclude these cases before creating the correlation matrix. We do this by selecting in, or filtering, cases that are not missing more than half of the variables, i.e. cases with fewer than 3 missing variables. To filter the cases included in further analysis, we choose the Select Cases… command from the Data menu.
SW388R7 Data Analysis & Computers II Slide 36 Enter specifications for selecting cases First, click on the If condition is satisfied option button on the Select panel. Second, click on the If… button to enter the criteria for including cases.
SW388R7 Data Analysis & Computers II Slide 37 Enter specifications for selecting cases First, enter the criteria for including cases: nmiss < 3. Second, click on the Continue button to complete the If specification.
SW388R7 Data Analysis & Computers II Slide 38 Complete the specifications for selecting cases To complete the specifications, click on the OK button.
SW388R7 Data Analysis & Computers II Slide 39 Cases excluded from further analyses SPSS marks the cases that will not be included in further analyses by drawing a slash mark through the case number. We can verify that the selection is working correctly by noting that the case which is omitted had 4 missing variables.
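The effect of the `nmiss < 3` selection can be sketched in Python. The cases below are made-up illustration data with an already computed `nmiss` value, and `select_cases` is a hypothetical helper that only imitates what the Select Cases filter does.

```python
def select_cases(rows, n_variables):
    """Keep cases that are NOT missing more than half of the variables."""
    threshold = n_variables // 2 + 1  # e.g. 3 when there are 5 variables
    return [row for row in rows if row["nmiss"] < threshold]


# Hypothetical toy cases.
rows = [
    {"id": 1, "nmiss": 0},
    {"id": 2, "nmiss": 4},
    {"id": 3, "nmiss": 2},
    {"id": 4, "nmiss": 3},
]

selected = select_cases(rows, n_variables=5)
selected_ids = [row["id"] for row in selected]

if __name__ == "__main__":
    print("cases kept for the correlation matrix:", selected_ids)
```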
SW388R7 Data Analysis & Computers II Slide 40 Correlating the dichotomous variables To compute a correlation matrix for the dichotomous variables, select the Correlate | Bivariate command from the Analyze menu.
SW388R7 Data Analysis & Computers II Slide 41 Specifications for correlations First, move the dichotomous variables to the variables list box. Second, click on the OK button to complete the request.
SW388R7 Data Analysis & Computers II Slide 42 The correlation matrix The correlation matrix is symmetric along the diagonal (shown by the blue line). The correlation for any pair of variables is included twice in the table. So we only count the correlations below the diagonal (the cells with the yellow background).
SW388R7 Data Analysis & Computers II Slide 43 The correlation matrix The correlations marked with footnote letter a could not be computed because one of the variables was a constant, i.e. the dichotomous variable has the same value for all cases. This happens when one of the valid/missing variables has no missing cases, so that all of the cases have a value of 1 and none have a value of 0.
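Why a constant valid/missing variable has no computable correlation can be seen from the Pearson formula itself: its denominator contains the variable's sum of squared deviations, which is zero for a constant. A minimal Python sketch, using made-up indicator values and a hypothetical `pearson_r` helper:

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation of two equal-length lists, or None if either is constant."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        # A constant variable has zero variance, so r is undefined.
        return None
    ss_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return ss_xy / sqrt(ss_x * ss_y)


# Hypothetical indicators: the first has no missing cases (all 1s).
no_missing = [1, 1, 1, 1]
some_missing = [1, 0, 1, 0]

if __name__ == "__main__":
    print(pearson_r(no_missing, some_missing))
    print(pearson_r(some_missing, some_missing))
```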
SW388R7 Data Analysis & Computers II Slide 44 The correlation matrix In the cells for which the correlation could be computed, the probabilities indicating significance are 0.437, 0.501, and 0.877. The correlation of -.042 between the missing/valid pair for "number of hours worked in the past week" [hrs1] and "occupational prestige score" [prestg80] was not statistically significant (p=0.501) and should not be interpreted as indicating a non-random pattern of missing data.
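As a rough check on the reported probability, the sketch below converts a correlation into a two-tailed p-value. Two assumptions are made that the slides do not state: the correlation is based on 256 cases (270 minus the 14 excluded cases), and a large-sample normal approximation stands in for the t distribution that SPSS uses, so the result will only be close to the SPSS value, not identical. The function name is hypothetical.

```python
from math import erfc, sqrt


def approx_two_tailed_p(r: float, n: int) -> float:
    """Approximate two-tailed p-value for a Pearson correlation.

    Computes t = r * sqrt((n - 2) / (1 - r^2)) and treats it as a standard
    normal deviate, which is reasonable only when n is large.
    """
    t = r * sqrt((n - 2) / (1 - r * r))
    return erfc(abs(t) / sqrt(2))


if __name__ == "__main__":
    # r from the slide; n = 256 is an assumption (270 cases minus 14 excluded).
    print(round(approx_two_tailed_p(-0.042, 256), 3))
```

The approximation lands near the p = 0.501 shown in the SPSS output, well above any conventional alpha level.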
SW388R7 Data Analysis & Computers II Slide 45 Answering the problem False is the correct answer. None of the correlations among the missing/valid variables were statistically significant. The correlation matrix does not indicate a non-random pattern of missing data. Fourteen cases were excluded from the calculations for the correlation matrix because they were missing more than half of the variables.
SW388R7 Data Analysis & Computers II Slide 46 Using scripts The process of evaluating missing data requires numerous SPSS procedures and outputs that are time-consuming to produce. These procedures can be automated by creating an SPSS script. A script is a program that executes a sequence of SPSS commands. Though writing scripts is not part of this course, we can take advantage of a script that I use to reduce the burdensome tasks of evaluating missing data.
SW388R7 Data Analysis & Computers II Slide 47 Using a script for missing data The script: “EvaluatingAssumptions_MissingData_Outliers_2006.SBS” will produce all of the output we have used for evaluating missing data, as well as other diagnostic outputs described in the textbook. Navigate to the link “SPSS Scripts and Syntax” on the course web page. Download the script file: “EvaluatingAssumptions_MissingData_Outliers_2006.SBS” to your computer.
SW388R7 Data Analysis & Computers II Slide 48 Open the data set in SPSS Before using a script, a data set should be open in the SPSS data editor.
SW388R7 Data Analysis & Computers II Slide 49 Invoke the script To invoke the script, select the Run Script… command in the Utilities menu.
SW388R7 Data Analysis & Computers II Slide 50 Select the missing data script First, navigate to the folder where you put the script. Second, click on the script name to highlight it. Third, click on the Run button to start the script.
SW388R7 Data Analysis & Computers II Slide 51 The script dialog The script dialog box acts similarly to SPSS dialog boxes. You select the variables to include in the analysis and choose options for the output.
SW388R7 Data Analysis & Computers II Slide 52 Complete the specifications Select the variables for the analysis. This analysis uses the variables for the last problem we worked. For the missing data check, it does not matter what role or what level of measurement we assign to the variables. We accept the default option to Check missing data. Click on the OK button to produce the output.
SW388R7 Data Analysis & Computers II Slide 53 The script finishes Since it may take a while to produce the output, and since there are times when it appears that nothing is happening, there is an alert to tell you when the script is finished. When you see this alert, click on the OK button and view your output. Note: the script dialog box does not close by itself. This is purposeful so that you can test assumptions or detect outliers without having to redo variable selection.
SW388R7 Data Analysis & Computers II Slide 54 Output from the script The script will produce lots of output. Additional descriptive material in the titles should help link specific outputs to specific tasks.
SW388R7 Data Analysis & Computers II Slide 55 Homework Problem The homework problems ask whether or not there is a missing data issue for a subset of variables that could affect our findings. Specifically, we are interested in whether or not a variable is missing data for half or more of the cases and, thus, is a candidate for automatic removal; or the matrix of correlations for missing/valid variables suggests a non-random pattern of missing data.
SW388R7 Data Analysis & Computers II Slide 56 Run the script for analyzing missing data Open the dataset mentioned in the problem statement. Select the Run Script command from the Utilities menu.
SW388R7 Data Analysis & Computers II Slide 57 Select the missing data script First, navigate to the folder where you put the script. Second, click on the script name to highlight it. Third, click on the Run button to start the script.
SW388R7 Data Analysis & Computers II Slide 58 Specify the variables to analyze Move the variables listed in the problem into the list boxes for the dependent and independent variables. The missing data analysis does not depend on either the level of measurement or role of the variable. I put the variables in the list boxes shown just for convenience. Make sure the Check missing data radio button is marked. Click on the OK button to produce the SPSS output.
SW388R7 Data Analysis & Computers II Slide 59 The script finishes producing output Since it may take a while to produce the output, and since there are times when it appears that nothing is happening, there is an alert to tell you when the script is finished. When you see this alert, click on the OK button and view your output. Note: the script dialog box does not close by itself. This is purposeful so that you can test assumptions or detect outliers without having to redo variable selection. To close the script dialog, click on the red close box on the window title bar, or the Close button.
SW388R7 Data Analysis & Computers II Slide 60 Variables missing half or more cases With 270 subjects in the data set, variables missing data for 135 or more cases would correctly be characterized as missing data for half or more of the cases in the data set. None of the variables in this analysis was missing cases for half or more of the 270 cases in the data set.
SW388R7 Data Analysis & Computers II Slide 61 Cases missing more than half the variables With six variables included in the analysis, subjects missing data for four or more variables would correctly be characterized as missing more than half of the variables in the analysis. There were 6 cases missing data for four variables. There were 2 cases missing data for five variables. The eight cases missing data for more than half of the variables (four or more) are excluded from the calculations for the correlation matrix of missing/valid variables.
SW388R7 Data Analysis & Computers II Slide 62 Randomness of the pattern of missing data The correlation of -0.407 between the missing/valid pair for "attitude toward abortion when a woman wants one for any reason" [abany] and "opinion about government reducing income differences between rich and poor" [eqwlth] was statistically significant (p<0.001). Other correlations were also statistically significant. Ten of the correlations among the missing/valid variables were statistically significant, indicating a non-random pattern of missing data. The answer to the question is false. The correlation matrix identified a missing data process that could affect the generalizability of the results.
SW388R7 Data Analysis & Computers II Slide 63 Feedback in Blackboard After you have submitted a test in Blackboard, the graded results will show the feedback for the problem to explain how the answer was obtained. The feedback will contain the statistical data from SPSS used to create the problem. This will enable you to compare your output to mine. If your output does not agree with mine, check to see that you used the correct variables in the analysis.
SW388R7 Data Analysis & Computers II Slide 64 Feedback for homework problem - 1 VARIABLES MISSING HALF OR MORE CASES With 270 subjects in the data set, variables missing data for 135 or more cases would correctly be characterized as missing data for half or more of the cases in the data set. None of the variables in this analysis was missing cases for half or more of the 270 cases in the data set. The tally of missing cases for the variables missing data for less than half the cases was: "attitude toward abortion when a woman wants one for any reason" [abany] was missing data for 88 cases (32.6%), "opinion about government reducing income differences between rich and poor" [eqwlth] was missing data for 88 cases (32.6%), "favor or oppose death penalty for murder" [cappun] was missing data for 32 cases (11.9%), "favor or oppose gun permits" [gunlaw] was missing data for 87 cases (32.2%), "should marijuana be made legal" [grass] was missing data for 99 cases (36.7%) and "bible prayer in public schools" [prayer] was missing data for 110 cases (40.7%).
SW388R7 Data Analysis & Computers II Slide 65 Feedback for homework problem - 2 CASES MISSING MORE THAN HALF OF THE VARIABLES With six variables included in the analysis, subjects missing data for four or more variables would correctly be characterized as missing more than half of the variables in the analysis. There were 85 cases missing data for one variable. There were 146 cases missing data for two variables. There were 31 cases missing data for three variables. There were 6 cases missing data for four variables. There were 2 cases missing data for five variables. The eight cases missing data for more than half of the variables (four or more) are excluded from the calculations for the correlation matrix of missing/valid variables. The correlation matrix of missing/valid variables will be based on 262 cases.
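The count of 262 cases follows directly from the tally in this feedback, as the short Python check below shows. The numbers are those reported above; the variable names are chosen for illustration.

```python
# Tally from the feedback: number of missing variables -> number of cases.
tally = {1: 85, 2: 146, 3: 31, 4: 6, 5: 2}

n_cases = 270
n_variables = 6
threshold = n_variables // 2 + 1  # more than half of 6 variables

# Cases missing more than half of the variables are excluded.
excluded = sum(count for n_missing, count in tally.items() if n_missing >= threshold)
remaining = n_cases - excluded

# Cases not listed in the tally have no missing values at all.
cases_with_no_missing = n_cases - sum(tally.values())

if __name__ == "__main__":
    print("threshold:", threshold)
    print("excluded:", excluded)
    print("cases in correlation matrix:", remaining)
    print("cases with complete data:", cases_with_no_missing)
```

Because the tally already accounts for all 270 cases, no case in this problem has valid data on all six variables.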
SW388R7 Data Analysis & Computers II Slide 66 Feedback for homework problem - 3 RANDOMNESS OF THE PATTERN OF MISSING DATA Ten of the correlations among the missing/valid variables were statistically significant at alpha=0.01, indicating a non-random pattern of missing data. The missing/valid variable for "attitude toward abortion when a woman wants one for any reason" [abany] had a statistically significant correlation with the missing/valid variables: "opinion about government reducing income differences between rich and poor" [eqwlth] (r=-0.407, p<0.001), "favor or oppose gun permits" [gunlaw] (r=0.938, p<0.001), "should marijuana be made legal" [grass] (r=-0.359, p<0.001) and "bible prayer in public schools" [prayer] (r=-0.460, p<0.001). The missing/valid variable for "opinion about government reducing income differences between rich and poor" [eqwlth] had a statistically significant correlation with the missing/valid variables: "favor or oppose gun permits" [gunlaw] (r=-0.402, p<0.001), "should marijuana be made legal" [grass] (r=0.891, p<0.001) and "bible prayer in public schools" [prayer] (r=-0.532, p<0.001). The missing/valid variable for "favor or oppose gun permits" [gunlaw] had a statistically significant correlation with the missing/valid variables: "should marijuana be made legal" [grass] (r=-0.336, p<0.001) and "bible prayer in public schools" [prayer] (r=-0.489, p<0.001). The missing/valid variable for "should marijuana be made legal" [grass] had a statistically significant correlation with the missing/valid variable: "bible prayer in public schools" [prayer] (r=-0.530, p<0.001). The missing/valid variable for "attitude toward abortion when a woman wants one for any reason" [abany] did not have a statistically significant correlation with the missing/valid variable: "favor or oppose death penalty for murder" [cappun] (r=-0.079, p=0.202).
The missing/valid variable for "opinion about government reducing income differences between rich and poor" [eqwlth] did not have a statistically significant correlation with the missing/valid variable: "favor or oppose death penalty for murder" [cappun] (r=0.002, p=0.971). The missing/valid variable for "favor or oppose death penalty for murder" [cappun] …
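The count of ten significant correlations can be cross-checked against the number of distinct variable pairs. The sketch below lists only the correlations reported as significant in this feedback and enumerates the pairs that are left over; it makes no claim about results the feedback does not show.

```python
from itertools import combinations

variables = ["abany", "eqwlth", "cappun", "gunlaw", "grass", "prayer"]

# Correlations reported as statistically significant in the feedback.
significant = {
    ("abany", "eqwlth"): -0.407,
    ("abany", "gunlaw"): 0.938,
    ("abany", "grass"): -0.359,
    ("abany", "prayer"): -0.460,
    ("eqwlth", "gunlaw"): -0.402,
    ("eqwlth", "grass"): 0.891,
    ("eqwlth", "prayer"): -0.532,
    ("gunlaw", "grass"): -0.336,
    ("gunlaw", "prayer"): -0.489,
    ("grass", "prayer"): -0.530,
}

# Six variables give 6 * 5 / 2 = 15 distinct pairs below the diagonal.
all_pairs = list(combinations(variables, 2))
remaining_pairs = [pair for pair in all_pairs if pair not in significant]

if __name__ == "__main__":
    print("pairs below the diagonal:", len(all_pairs))
    print("reported significant:", len(significant))
    print("pairs not in the significant list:", remaining_pairs)
```

Every pair outside the significant list involves cappun, which is consistent with the non-significant cappun correlations quoted in the feedback.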
SW388R7 Data Analysis & Computers II Slide 67 Steps in answering questions about missing data [Flowchart; the Yes/No branches lead to answers of True or False.] Decision: Is any variable missing data for half or more of the cases? Step: Remove cases missing data for more than half of the variables. Decision: Are any correlations of valid/missing variables statistically significant?