Published by Ashley Clark. Modified over 9 years ago.
Slide 1 Analyzing Patterns of Missing Data
While SPSS contains a rich set of procedures for analyzing patterns of missing data, they are not included in the set of tools licensed by the University. However, we can replicate much of the analysis with other SPSS procedures.
The first set of tasks in the missing data analysis involves creating diagnostic variables that support the analysis: first, a variable that counts the number of variables with missing data for each case; second, one new dichotomous variable for each original variable that indicates whether or not the original variable had a missing data value; and third, a single pattern variable for each case that summarizes the missing or valid status of values for all of the variables in the analysis.
Using the diagnostic variable that counts the missing values for each case, we can identify cases with large concentrations of missing data as candidates for elimination from the analysis. After we remove specific cases with large numbers of missing values, we run a frequency distribution for the remaining cases to see if any variables have so many missing cases that the variable should be considered a candidate for exclusion.
Next, we compute a frequency distribution for the pattern variable to identify patterns that occur often in the data, indicating a problematic missing data process.
Next, using the valid/missing variables as grouping variables, we examine whether or not the missing cases are statistically different from the valid cases on all of the other variables in the analysis. If the variable is metric, we do a t-test for group differences; if the variable is nonmetric, we do a chi-square test of independence to detect group differences.
Finally, we compute a correlation matrix of the valid/missing variables to detect concentrations of missing data across multiple variables.
Slide 2 1. Download the data set
Download the HATMISS data set from the course web page and save it in your C:\SW388R7 folder.
Slide 3 2. Tallying the Number of Missing Variables
One of the major information items we need for the missing data analysis is the number of variables that have missing data for each case in the sample. We will create a new variable, num_miss, that will contain the number of variables with missing data among the first ten in the data set, x1 through x10. We include only the first ten variables in this calculation to maintain consistency with the text.
The SPSS function NMISS counts the number of variables in its argument list that have missing values. We will use this function to calculate the value of num_miss for each case.
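The counting logic can be sketched outside SPSS as follows. This is a minimal Python illustration, not the SPSS procedure itself: the helper name `count_missing`, the toy cases, and the use of `None` to stand in for a missing value are all assumptions made for the example.

```python
VARIABLES = [f"x{i}" for i in range(1, 11)]  # x1 through x10


def count_missing(case, variables=VARIABLES):
    """Count how many of the named variables are missing (None) for one case."""
    return sum(1 for name in variables if case.get(name) is None)


# Hypothetical toy cases; None stands in for an SPSS missing value.
complete = {name: 1.0 for name in VARIABLES}
cases = [
    dict(complete),                                    # nothing missing
    {**complete, "x1": None},                          # one variable missing
    {**complete, "x2": None, "x5": None, "x9": None},  # three variables missing
]

# Store the tally on each case, mirroring the num_miss variable.
for case in cases:
    case["num_miss"] = count_missing(case)

print([case["num_miss"] for case in cases])
```

Only x1 through x10 are inspected, so adding `num_miss` to a case does not affect later counts.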
Slide 4 Computing the Number Missing by Case
Slide 5 Specifying the Variables in the Function
Slide 6 3. Creating Dichotomous Valid/Missing Variables for Diagnosing Missing Data
To determine whether or not the pattern of missing data is random, we create, for each original variable, a diagnostic variable that indicates whether the original variable is missing or valid for each case in the data set. Each diagnostic variable is dichotomous, using the value 1 for 'Valid' and the value 0 for 'Missing'.
Since we may need to refer back to the original variables in the course of the missing data analysis, I recommend a naming convention for the diagnostic variables that makes it easy to identify the original variable. If the original variable name is fewer than eight characters, an underscore is appended to the end of the original variable name, e.g. the diagnostic variable for race would be race_. If the original variable name is eight characters, the last character is replaced with an underscore, e.g. the diagnostic variable name for response would be respons_. If replacing the last character with an underscore duplicates the name already assigned to another diagnostic variable, we drop the last two characters from the original name and append an underscore followed by a sequence letter or digit, e.g. the diagnostic variable name for response would be respon_1 if we had already used the name respons_ for a diagnostic variable.
When we assign variable labels to the diagnostic variables, we can add a keyword to the original variable label to designate it as a valid/missing diagnostic variable, e.g. the diagnostic variable for an original variable labeled Grade Level could be labeled Grade Level (Valid/Missing).
We will demonstrate the process of creating dichotomous Valid/Missing variables for diagnosing missing data using the variables in the HATMISS.SAV data set.
If the copy of HATMISS.SAV that you are working with does not have variable labels and value labels, do the exercise Applying a Data Dictionary to apply the data labels from the HATCO.SAV data set to the HATMISS.SAV data set. A quick test for the presence of variable labels is to position the mouse over a variable name in the data editor. If a variable label appears in a yellow tip box, a variable label has been added for that variable.
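The naming convention and the 1/0 coding can be sketched in Python as below. This is an illustration under stated assumptions rather than anything SPSS provides: `diagnostic_name` and `valid_flag` are hypothetical helpers, `respons2` is an invented variable name used only to trigger a collision, the sequence suffix is restricted to the digits 1 through 9, and applying the collision rule to names shorter than eight characters goes beyond what the convention above spells out.

```python
def diagnostic_name(original, used):
    """Build a diagnostic variable name of at most eight characters."""
    if len(original) < 8:
        candidate = original + "_"       # race -> race_
    else:
        candidate = original[:7] + "_"   # response -> respons_
    if candidate in used:
        # Collision: keep the first six characters (for an eight-character
        # name, this drops the last two) and add "_" plus a sequence digit.
        stem = original[:6]
        for digit in "123456789":
            candidate = f"{stem}_{digit}"
            if candidate not in used:
                break
        else:
            raise ValueError(f"no free diagnostic name for {original!r}")
    used.add(candidate)
    return candidate


def valid_flag(value):
    """Return 1 for a valid value and 0 for a missing one (None)."""
    return 0 if value is None else 1


used_names = set()
names = [diagnostic_name(n, used_names) for n in ["race", "response", "respons2"]]
print(names)

# Hypothetical case: x1 is missing, x2 is valid.
case = {"x1": None, "x2": 3.5}
flags = {diagnostic_name(name, used_names): valid_flag(value)
         for name, value in case.items()}
print(flags)
```

Passing the set of names already in use makes the collision rule explicit and keeps the function free of hidden state.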
Slide 7 Recoding Diagnostic Variables for Missing Data
Slide 8 Opening the Dialog for Old and New Values
Slide 9 Add the Value for Missing Data
Slide 10 Add the Value for Valid Data
Slide 11 Completing the Values Dialog Box
Slide 12 Adding Diagnostic Variables for the Remaining Variables
Slide 13 Adding Value Labels to the Diagnostic Variables
Slide 14 Adding the Value Label for Missing
Slide 15 Add the Value Label for Valid
Slide 16 Apply the Value Labels
Slide 17 Displaying the Value Labels for the Variables
Slide 18 The Diagnostic Variables
Slide 19 4. Adding a Pattern Variable to the Data Set
Another indication of a problematic missing data process is the frequent occurrence of the same pattern of missing data among the variables. While patterns can be detected by sorting and scanning the data set, this task is facilitated by the creation of a pattern variable.
The pattern variable is a string variable containing one character for each variable in the analysis. Each character in the pattern variable is set either to a character indicating missing data or to a character indicating valid data. To make the pattern easy to read, the characters selected should have the same width when printed. If we do not use same-width characters, we cannot scan down the values to compare them, because the column alignment of the characters differs from one value to the next. We will use an X for missing data and a tilde, ~, for valid data, because both are full-width characters.
To create the pattern variable, we first create a one-character string variable for each of the original variables. Then, we use the SPSS 'CONCAT' function to join the string variables into a single variable.
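As a rough illustration of the two steps (recode each variable to one character, then concatenate), here is a short Python sketch. The helper `missing_pattern` and the toy case are assumptions for the example, with `None` standing in for a missing value.

```python
VARIABLES = [f"x{i}" for i in range(1, 11)]  # x1 through x10


def missing_pattern(case, variables=VARIABLES):
    """Return one character per variable: 'X' if missing, '~' if valid."""
    # Step 1: recode each variable to a single character.
    marks = ["X" if case.get(name) is None else "~" for name in variables]
    # Step 2: join the characters into one pattern string.
    return "".join(marks)


# Hypothetical case with x2 and x5 missing.
complete = {name: 1.0 for name in VARIABLES}
case = {**complete, "x2": None, "x5": None}
print(missing_pattern(case))  # ~X~~X~~~~~
```

The position of each character matches the position of its variable in the list, so the second and fifth characters flag x2 and x5.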
Slide 20 Recode the Original Variables into String Variables
Slide 21 Opening the Dialog for Old and New Values
Slide 22 Add the Value for Missing Data
Slide 23 Add the Value for Valid Data
Slide 24 Completing the Values Dialog Box
Slide 25 Adding String Variables for the Other Original Variables
Slide 26 The String Variables
Slide 27 Create the Variable Containing the Concatenated Data
Slide 28 Enter the Formula for the Concatenated Variable
Slide 29 The Missing Data Pattern Variable
Slide 30 5. Removing Cases with a Large Proportion of Missing Variables
To identify the cases that we should consider removing, we will sort the data set in descending order by the number of missing variables. The candidates for elimination will appear at the top of the data set.
Once we have located the cases that we want to eliminate, we specify a filter condition to exclude those cases from further analysis. The cases are not deleted from the data set, so we can include them in later analysis should we desire to do so.
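The sort-then-filter idea can be sketched in Python as follows. The toy cases, the `keep` flag, and the variable names are assumptions for illustration; the 50% cutoff is the one this walkthrough applies when it filters the HATMISS cases.

```python
N_VARIABLES = 10  # x1 through x10

# Hypothetical toy cases carrying the num_miss tally.
cases = [
    {"id": 1, "num_miss": 0},
    {"id": 2, "num_miss": 7},
    {"id": 3, "num_miss": 2},
    {"id": 4, "num_miss": 5},
]

# Sort in descending order so the candidates for elimination come first.
ranked = sorted(cases, key=lambda c: c["num_miss"], reverse=True)
print([c["id"] for c in ranked])

# Flag cases instead of deleting them: keep is False when 50% or more
# of the variables are missing.
for case in cases:
    case["keep"] = case["num_miss"] / N_VARIABLES < 0.5

# Later analyses use only the flagged-in cases; the full list is untouched.
active = [c for c in cases if c["keep"]]
print([c["id"] for c in active], len(cases))
```

Using a flag rather than removing rows mirrors the filter approach: the excluded cases can be brought back by ignoring the flag.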
Slide 31 Sorting the Cases
Slide 32 The Cases Sorted by Number Missing
Slide 33 Excluding the Cases
Slide 34 Specifying the If Condition
Slide 35 Specify Filtering for Unselected Cases
Slide 36 The Data Set with Filtered Cases
Slide 37 6. Summary Statistics for the Unfiltered Cases
Filtering cases with 50% or more missing data removed six cases from the data set, reducing our effective sample size to 64 cases. We next look at a frequency distribution for each variable to see if any variables have such a high proportion of missing data that they should be considered candidates for removal from the analysis.
We can see the distribution of missing data on each of our variables by using the Frequencies command, which produces the SPSS output equivalent to Table 2.2 on page 56 of the text. We use the Frequencies command instead of the Descriptives command because Frequencies provides a count of the remaining missing cases for each variable.
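The per-variable tally that the frequency output supplies can be approximated with a short Python sketch. The helper `missing_summary` and the four toy cases are assumptions for illustration and do not reproduce the HATMISS results.

```python
def missing_summary(cases, variables):
    """Tally valid and missing counts, and percent missing, per variable."""
    summary = {}
    for name in variables:
        n_missing = sum(1 for case in cases if case.get(name) is None)
        summary[name] = {
            "valid": len(cases) - n_missing,
            "missing": n_missing,
            "pct_missing": 100.0 * n_missing / len(cases),
        }
    return summary


# Hypothetical cases that remain after filtering; None marks a missing value.
remaining = [
    {"x1": 4.1, "x2": 0.6},
    {"x1": None, "x2": 3.0},
    {"x1": 3.4, "x2": None},
    {"x1": 2.7, "x2": None},
]
summary = missing_summary(remaining, ["x1", "x2"])
for name, row in summary.items():
    print(name, row)
```

A variable whose percentage of missing values stands out in this summary is the kind of candidate for exclusion that the frequency distribution is meant to reveal.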
Slide 38 Requesting the Frequency Distributions
Slide 39 Requesting Specific Statistics
Slide 40 The Frequencies Output
Slide 41 Changing the Orientation of the Table
Slide 42 The Transposed Frequencies Table
Slide 43 7. Tabulating Missing Data Patterns
In a previous exercise, Adding a Pattern Variable to the Data Set, we created a pattern variable that contained a single string of ten characters representing valid or missing data for the first ten variables in the data set.
To create Table 2.4 on page 58, we run a frequency distribution on the pattern variable. This frequency distribution will tell us if there are one or two patterns of missing data that occur with sufficient frequency to require further investigation.
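Tabulating the pattern variable amounts to counting identical strings. A minimal Python sketch follows, using the standard library's `collections.Counter`; the list of pattern strings is invented for illustration.

```python
from collections import Counter

# Hypothetical pattern values, one per case ('X' = missing, '~' = valid).
patterns = [
    "~~~~~~~~~~",
    "X~~~~~~~~~",
    "~~~~~~~~~~",
    "X~~~~~~~~~",
    "~~X~~~X~~~",
    "~~~~~~~~~~",
]

# Count each distinct pattern and list the most frequent first.
counts = Counter(patterns)
for pattern, n in counts.most_common():
    print(pattern, n, f"{100.0 * n / len(patterns):.1f}%")
```

Because every pattern uses same-width characters, the printed rows line up and a recurring missing-data pattern is easy to spot.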
Slide 44 Request a Frequency Distribution for the Pattern Variable
Slide 45 The Frequency of Different Patterns
Slide 46 8. T-tests and Chi-square Tests for Diagnosing Randomness of Missing Data
In previous exercises, we created dichotomous grouping variables for the variables X1 through X10, where the grouping variable was assigned a 1 if the data was valid and a 0 if the data was missing.
We will use these grouping variables to determine whether the valid and missing groups differ in their relationship to other variables in the data set. If the missing and valid groups are statistically equivalent on other variables, then the missing cases can be characterized as random, and of no consequence to our analysis. If the missing group shows a statistically significant relationship to the other variable, it suggests that there is a missing data process that requires further understanding.
The statistical tests that we use in this analysis are chi-square tests of independence, if the variable to be tested is nonmetric, or t-tests for two independent samples, if the variable to be tested is metric. The authors use the separate variance output for all t-tests instead of examining individual tests of homogeneity. We will follow this practice.
When this analysis is conducted, there are usually a large number of statistical relationships tested. Using an alpha level of 0.05 in these tests implies that, when there is truly no relationship, we will still incorrectly reject the null hypothesis in about one out of every twenty tests. With a large number of tests, we will get some statistically significant relationships even when there is no serious problem with our data. We are less concerned with individual test results than with the overall pattern of relationships.
NOTE. I cannot reconcile the findings on these tests to the discussion of findings on page 58 of the text. The statistical results are consistent with Table 2.5 on page 59, while the text discussion appears to be a carryover from the fourth edition of the text, which does not contain the same statistical results as the fifth edition.
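The arithmetic behind the "one in twenty" caution can be made concrete with a small Python sketch. The function names are hypothetical, and the second calculation assumes the tests are independent, which is unlikely to hold exactly for tests run on the same cases; treat the numbers as a rough guide only.

```python
ALPHA = 0.05


def expected_false_positives(n_tests, alpha=ALPHA):
    """Expected number of significant results when every null hypothesis is true."""
    return n_tests * alpha


def prob_at_least_one(n_tests, alpha=ALPHA):
    """Chance of at least one false positive, assuming independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


for n_tests in (1, 9, 20):
    print(n_tests,
          round(expected_false_positives(n_tests), 2),
          round(prob_at_least_one(n_tests), 3))
```

With twenty tests the expected number of spurious rejections is one, which is why an isolated significant result carries little weight compared with a consistent pattern.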
Slide 47 The Statistical Tests to Be Computed
We will use the grouping variable 'Delivery Speed (Valid/Missing)' (X1_) to explore differences among the next nine variables in the data set, 'Price Level' through 'Satisfaction Level' (X2 through X10).
In each statistical test, we are testing the null hypothesis of no relationship associated with the grouping variable, 'Delivery Speed (Valid/Missing)'. If we reject the null hypothesis, we would conclude that persons who did not answer the question on Delivery Speed had a different pattern of responses than did persons who did provide Delivery Speed.
The variable 'Firm Size' (x8) is a nonmetric variable and we will do a chi-square test of independence for this variable.
The variables 'Price Level' (x2), 'Price Flexibility' (x3), 'Manufacturer Image' (x4), 'Service' (x5), 'Salesforce Image' (x6), 'Product Quality' (x7), 'Usage Level' (x9), and 'Satisfaction Level' (x10) are all metric and we will do t-tests for these variables.
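To show what the two tests compute, here is a standard-library Python sketch of the separate-variance (Welch) t statistic and the Pearson chi-square statistic. The helper names and the toy records are assumptions for illustration, not HATMISS values, and the sketch stops at the test statistics and degrees of freedom: turning them into p-values requires the t and chi-square distributions, which SPSS or a statistics library would supply.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Separate-variance t statistic and approximate degrees of freedom."""
    va = variance(group_a) / len(group_a)
    vb = variance(group_b) / len(group_b)
    t = (mean(group_a) - mean(group_b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(group_a) - 1) + vb ** 2 / (len(group_b) - 1))
    return t, df


def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat, (len(table) - 1) * (len(table[0]) - 1)


# Hypothetical toy records: (x1_ flag, x2 value, x8 category).
records = [
    (1, 2.0, 0), (1, 3.0, 1), (1, 4.0, 0), (1, 3.5, 1),
    (0, 1.0, 0), (0, 2.5, 1), (0, 1.5, 1),
]

# Metric variable: compare x2 between the valid and missing groups on x1.
valid_x2 = [x2 for flag, x2, _ in records if flag == 1]
missing_x2 = [x2 for flag, x2, _ in records if flag == 0]
t_stat, t_df = welch_t(valid_x2, missing_x2)

# Nonmetric variable: cross-tabulate x1_ (rows 0/1) by x8 (columns 0/1).
table = [[0, 0], [0, 0]]
for flag, _, x8 in records:
    table[flag][x8] += 1
chi_stat, chi_df = chi_square(table)

print(t_stat, t_df, table, chi_stat, chi_df)
```

In a real run, cases that are also missing on the variable being tested would need to be dropped from both groups first; the toy records avoid that complication.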
Slide 48 The Chi-square Test of Independence
Slide 49 Requesting the Chi-square Test
Slide 50 Specifying Cell Contents
Slide 51 Chi-square Test Results
Slide 52 Requesting the T-tests
Slide 53 Specifying the Groups by Code Number
Slide 54 Results of the T-tests
Slide 55 9. The Correlation Matrix for Diagnosing Randomness of Missing Data
To continue our missing data analysis, we run a correlation matrix for the dichotomous grouping variables: 'Delivery Speed (Valid/Missing)', 'Price Level (Valid/Missing)', 'Price Flexibility (Valid/Missing)', 'Manufacturer Image (Valid/Missing)', 'Service (Valid/Missing)', 'Salesforce Image (Valid/Missing)', 'Product Quality (Valid/Missing)', 'Usage Level (Valid/Missing)', and 'Satisfaction Level (Valid/Missing)'.
We examine the pattern of correlations to see if there are large correlations among multiple pairs of variables that do not have an obvious explanation. An obvious explanation would be that subjects only answered these questions if their answer to another question had a particular value, e.g. only answer the question about job satisfaction if you are employed.
If there are variables that show a strong pattern of systematic missing data without an obvious explanation, we should evaluate the impact that this pattern has on our research questions, and make our decision about including, eliminating, or substituting for these variables.
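A correlation between two 1/0 valid/missing variables is an ordinary Pearson correlation computed on the flags. The Python sketch below uses invented flag columns and a hypothetical `pearson` helper; it returns `None` when a flag never varies, since the correlation is undefined in that case.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists; None if either is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None  # a flag with no missing (or no valid) cases has no variance
    return sxy / sqrt(sxx * syy)


# Hypothetical valid/missing flags (1 = valid, 0 = missing), one entry per case.
flags = {
    "x1_": [1, 1, 0, 0, 1, 1],
    "x2_": [1, 1, 0, 0, 1, 1],  # missing on exactly the same cases as x1_
    "x3_": [1, 0, 1, 0, 1, 0],
    "x4_": [1, 1, 1, 1, 1, 1],  # never missing
}

# Build the full matrix as a dict keyed by pairs of variable names.
matrix = {(a, b): pearson(flags[a], flags[b]) for a in flags for b in flags}
for a in flags:
    print(a, [matrix[(a, b)] for b in flags])
```

A coefficient near 1 between two flags, as for the first two columns here, means the variables tend to be missing on the same cases, which is the concentration this step looks for.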
Slide 56 Requesting the Correlation Matrix
Slide 57 The Correlation Matrix Output