Slide 1 Analyzing Patterns of Missing Data
While SPSS contains a rich set of procedures for analyzing patterns of missing data, they are not included in the set of tools licensed by the University. However, we can replicate much of the analysis with other SPSS procedures.
The first set of tasks in the missing data analysis involves the creation of diagnostic variables that support the analysis: first, a variable that counts the number of variables with missing data for each case; second, one new dichotomous variable for each original variable that indicates whether or not the original variable had a missing data value; and third, a single pattern variable for each case that summarizes the missing or valid status of values for all of the variables in the analysis.
Using the diagnostic variable that counts the missing values for each case, we can identify cases with large concentrations of missing data as candidates for elimination from the analysis.
After we remove specific cases with large numbers of missing variables, we run a frequency distribution for the remaining cases to see if any variables have so many missing cases that the variable should be considered a candidate for exclusion.
Next, we compute a frequency distribution for the pattern variable to identify patterns that occur often in the data, indicating a problematic missing data process.
Next, using the valid/missing variables as grouping variables, we examine whether or not the missing cases are statistically different from the valid cases for all of the other variables in the analysis. If the variable is metric, we do a t-test for group differences; if the variable is non-metric, we do a chi-square test of independence to detect group differences.
Finally, we compute a correlation matrix of the valid/missing variables to detect concentrations of missing data across multiple variables.

Slide 2 1. Download the data set
Download the HATMISS data set from the course web page and save it in your C:\SW388R7 folder.

Slide 3 2. Tallying the Number of Missing Variables
One of the major information items we need for the missing data analysis is the number of variables that have missing data for each case in the sample. We will create a new variable, which we will name num_miss, that will contain the number of variables with missing data among the first ten in the data set, x1 through x10. We include only the first ten variables in this calculation to maintain consistency with the text.
The SPSS function NMISS counts how many of the variables passed to it have missing values for a case. We will use this function to calculate the value of our num_miss variable for each case.
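For readers who want to see the counting logic outside SPSS, the following is a minimal Python sketch of what a per-case missing count does. It is an illustration only, not the SPSS procedure itself: the function name count_missing, the use of None to stand for a missing value, and the two sample cases are all made up for this example.

```python
from typing import List, Optional, Sequence


def count_missing(values: Sequence[Optional[float]]) -> int:
    """Return how many entries in one case's values are missing (None)."""
    return sum(1 for v in values if v is None)


# Hypothetical cases: each row holds the ten values x1..x10 for one case.
# None stands in for a missing value; the numbers are invented.
cases: List[List[Optional[float]]] = [
    [4.1, 0.6, 6.9, 4.7, 2.4, 2.3, 5.2, 0.0, 32.0, 4.2],
    [None, 3.0, None, 6.6, 2.5, None, 8.4, 1.0, 43.0, None],
]

# One count per case, playing the role of the num_miss variable.
num_miss = [count_missing(row) for row in cases]
```

The result is one integer per case, which is the role num_miss plays in the rest of the analysis.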

Slide 4 Computing the Number Missing by Case

Slide 5 Specifying the Variables in the Function

Slide 6 3. Creating Dichotomous Valid/Missing Variables for Diagnosing Missing Data
To determine whether or not the pattern of missing data is random, we create, for each original variable, a diagnostic variable that indicates whether that variable is missing or valid for each case in the data set. Each diagnostic variable is dichotomous, using the value 1 for 'Valid' and the value 0 for 'Missing'.
Since we may need to refer back to the original variables in the course of the missing data analysis, I recommend a naming convention for the diagnostic variables that makes it easy to identify the original variable.
If the original variable name is fewer than eight characters, an underscore is appended to the end of the original variable name, e.g. the diagnostic variable for race would be race_.
If the original variable name is eight characters, the last character is replaced with an underscore, e.g. the diagnostic variable name for response would be respons_.
If replacing the last character with an underscore duplicates the name assigned to another diagnostic variable for an eight-character variable name, we drop the last two characters from the original name and append an underscore followed by a sequence letter or digit, e.g. the diagnostic variable name for response would be respon_1 if we had already used the name respons_ for a diagnostic variable.
When we assign variable labels to the diagnostic variables, we can add a keyword to the original variable label to designate it as a missing/valid diagnostic variable, e.g. the variable label for the diagnostic variable that had an original variable label of Grade Level could be Grade Level (Valid/Missing).
We will demonstrate the process of creating dichotomous Valid/Missing variables for diagnosing missing data using the variables in the HATMISS.SAV data set.
If the copy of HATMISS.SAV that you are working with does not have variable labels and value labels, do the exercise Applying a Data Dictionary to apply the data labels from the HATCO.SAV data set to the HATMISS.SAV data set. A quick test for the presence of variable labels is to position the mouse over a variable name in the data editor. If a variable label appears in a yellow tooltip box, a variable label has been added for that variable.
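The naming convention described above is mechanical enough to write down as a small function. The Python sketch below is one possible reading of the rule, not part of SPSS: the name diagnostic_name and the `used` set are hypothetical, original names are assumed to be at most eight characters, and the order of sequence characters (digits first, then letters) is an assumption. The slides do not say what to do when a short name collides, so this sketch applies the same fallback there.

```python
from typing import Set


def diagnostic_name(original: str, used: Set[str]) -> str:
    """Build a diagnostic variable name of at most eight characters.

    `used` holds diagnostic names already assigned and is updated in place.
    """
    if len(original) < 8:
        candidate = original + "_"       # race -> race_
    else:
        candidate = original[:7] + "_"   # response -> respons_
    if candidate not in used:
        used.add(candidate)
        return candidate
    # Collision: keep the first six characters (i.e. drop the last two of an
    # eight-character name), then add "_" and a sequence character.
    for suffix in "123456789abcdefghijklmnopqrstuvwxyz":
        candidate = original[:6] + "_" + suffix
        if candidate not in used:
            used.add(candidate)
            return candidate
    raise ValueError("no free diagnostic name for " + original)


used_names: Set[str] = set()
first = diagnostic_name("race", used_names)
second = diagnostic_name("response", used_names)
third = diagnostic_name("response", used_names)  # respons_ is already taken
```

With the examples from the slide, the three calls yield race_, respons_, and respon_1 in that order.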

Slide 7 Recoding Diagnostic Variables for Missing Data

Slide 8 Opening the Dialog for Old and New Values

Slide 9 Add the Value for Missing Data

Slide 10 Add the Value for Valid Data

Slide 11 Completing the Values Dialog Box

Slide 12 Adding Diagnostic Variables for the Remaining Variables
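The recoding carried out in the dialogs above amounts to mapping each value to 1 if it is valid and 0 if it is missing. As a rough Python sketch of that mapping (not SPSS syntax), assuming each case is a dictionary, None marks a missing value, and the diagnostic name is simply the original name plus an underscore; the function name and sample data are hypothetical:

```python
from typing import Dict, List, Optional

Case = Dict[str, Optional[float]]

# Value labels for the diagnostic variables, as used in the slides.
VALUE_LABELS = {0: "Missing", 1: "Valid"}


def add_valid_missing_flags(cases: List[Case], variables: List[str]) -> None:
    """Add a 1/0 diagnostic variable (name + "_") for each listed variable."""
    for case in cases:
        for name in variables:
            case[name + "_"] = 0 if case[name] is None else 1


# Hypothetical two-variable data set; the values are invented.
cases: List[Case] = [
    {"x1": 4.1, "x2": None},
    {"x1": None, "x2": 0.6},
]
add_valid_missing_flags(cases, ["x1", "x2"])
```

After the call, each case carries x1_ and x2_ alongside the original variables, so the originals remain available for later steps.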

Slide 13 Adding Value Labels to the Diagnostic Variables

Slide 14 Adding the Value Label for Missing

Slide 15 Add the Value Label for Valid

Slide 16 Apply the Value Labels

Slide 17 Displaying the Value Labels for the Variables

Slide 18 The Diagnostic Variables

Slide 19 Adding a Pattern Variable to the Data Set
Another indication of a problematic missing data process is the frequent occurrence of the same pattern of missing data among the variables. While patterns can be detected by sorting and scanning the data set, this task is facilitated by the creation of a pattern variable.
The pattern variable is a string variable containing one character for each variable in the analysis. Each character in the pattern variable is set either to a character indicating missing data or to a character indicating valid data.
To make the pattern more visually intuitive, the characters selected should have the same width when printed. If we do not use same-width characters, we cannot scan down the values to compare them, because the column alignment of the characters is not the same from one value to the next. We will use an X for missing data and a tilde, ~, for valid data, because both are full-width characters.
To create the pattern variable, we first create a one-character string variable for each of the original variables. Then, we use the SPSS CONCAT function to join the string variables into a single variable.
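The two steps (recode each variable to a one-character string, then concatenate) can be sketched in Python as a single expression. This is an illustration of the idea rather than the SPSS procedure; the function name missing_pattern and the sample case are hypothetical, and None stands for a missing value.

```python
from typing import Optional, Sequence


def missing_pattern(values: Sequence[Optional[float]]) -> str:
    """Return one character per variable: 'X' if missing, '~' if valid."""
    return "".join("X" if v is None else "~" for v in values)


# Hypothetical case with ten variables; the first and third are missing.
case = [None, 3.0, None, 6.6, 2.5, 1.8, 8.4, 1.0, 43.0, 4.4]
pattern = missing_pattern(case)
```

Because every case produces a string of the same length, the patterns line up in columns when listed one above another, which is what makes them easy to scan.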

Slide 20 Recode the Original Variables into String Variables

Slide 21 Opening the Dialog for Old and New Values

Slide 22 Add the Value for Missing Data

Slide 23 Add the Value for Valid Data

Slide 24 Completing the Values Dialog Box

Slide 25 Adding String Variables for the Other Original Variables

Slide 26 The String Variables

Slide 27 Create the Variable Containing the Concatenated Data

Slide 28 Enter the Formula for the Concatenated Variable

Slide 29 The Missing Data Pattern Variable

Slide 30 Removing Cases with a Large Proportion of Missing Variables
To identify the cases that we should consider removing, we will sort the data set in descending order by the number of missing variables. The candidates for elimination will appear at the top of the data set.
Once we have located the cases that we want to eliminate, we specify a filter condition to exclude those cases from further analysis. The cases are not deleted from the data set, so we can include them in later analyses if we wish.
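The sort-then-filter step can be sketched in Python as follows. This is an illustration under assumptions, not SPSS output: the function name flag_cases and the sample counts are hypothetical, and the 50% cutoff is the one the transcript applies to the ten variables x1 through x10.

```python
from typing import List, Tuple


def flag_cases(
    num_miss: List[int], n_vars: int = 10, cutoff: float = 0.5
) -> Tuple[List[int], List[int]]:
    """Return (order, keep).

    order: case indices sorted by descending number of missing variables.
    keep:  1 for cases kept in the analysis, 0 for filtered-out cases
           (those with a share of missing variables at or above the cutoff).
    """
    order = sorted(range(len(num_miss)), key=lambda i: num_miss[i], reverse=True)
    keep = [1 if n / n_vars < cutoff else 0 for n in num_miss]
    return order, keep


# Hypothetical missing counts for five cases.
num_miss = [0, 7, 2, 5, 4]
order, keep = flag_cases(num_miss)
```

Like an SPSS filter, the keep flags leave every case in the data; they only mark which cases later steps should use.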

Slide 31 Sorting the Cases

Slide 32 The Cases Sorted by Number Missing

Slide 33 Excluding the Cases

Slide 34 Specifying the If Condition

Slide 35 Specify Filtering for Unselected Cases

Slide 36 The Data Set with Filtered Cases

Slide 37 Summary Statistics for the Unfiltered Cases
Filtering cases with 50% or more missing data removed six cases from the data set, reducing our effective sample size to 64 cases. We next look at a frequency distribution for each variable to see if any variables have such a high proportion of missing data that they should be considered candidates for removal from the analysis.
We can see the distribution of missing data on each of our variables by using the Frequencies command, which produces the SPSS output equivalent to Table 2.2 on page 56 of the text. We use the Frequencies command instead of the Descriptives command because Frequencies provides a count of the remaining missing cases for each variable.
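The part of the Frequencies output that matters here is the count of valid and missing cases per variable among the cases that passed the filter. A minimal Python sketch of that tally, assuming dictionary cases, None for missing values, and a 1/0 keep flag per case (all names and data are hypothetical):

```python
from typing import Dict, List, Optional

Case = Dict[str, Optional[float]]


def valid_missing_counts(
    cases: List[Case], variables: List[str], keep: List[int]
) -> Dict[str, Dict[str, int]]:
    """Count valid and missing values per variable, using only kept cases."""
    counts: Dict[str, Dict[str, int]] = {}
    for name in variables:
        selected = [case[name] for case, flag in zip(cases, keep) if flag == 1]
        n_missing = sum(1 for v in selected if v is None)
        counts[name] = {"valid": len(selected) - n_missing, "missing": n_missing}
    return counts


# Hypothetical data: the second case has been filtered out (keep flag 0).
cases: List[Case] = [
    {"x1": 1.0, "x2": None},
    {"x1": None, "x2": None},
    {"x1": 2.0, "x2": 3.0},
]
keep = [1, 0, 1]
counts = valid_missing_counts(cases, ["x1", "x2"], keep)
```

A variable whose missing count remains high after filtering is the kind of candidate for exclusion the slide describes.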

Slide 38 Requesting the Frequency Distributions

Slide 39 Requesting Specific Statistics

Slide 40 The Frequencies Output

Slide 41 Changing the Orientation of the Table

Slide 42 The Transposed Frequencies Table

Slide 43 Tabulating Missing Data Patterns
In a previous exercise, Adding a Pattern Variable to the Data Set, we created a pattern variable that contained a single string of ten characters representing valid or missing data for the first ten variables in the data set.
To create Table 2.4 on page 58, we run a frequency distribution on the pattern variable. This frequency distribution will tell us if there are one or two patterns of missing data that occur with sufficient frequency to require further investigation.
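Tabulating the pattern variable is a plain frequency count of strings. A small Python sketch of that count, using the standard library Counter; the function name and the three-character sample patterns are hypothetical and shortened for readability:

```python
from collections import Counter
from typing import List, Tuple


def pattern_frequencies(patterns: List[str]) -> List[Tuple[str, int]]:
    """Return (pattern, count) pairs, most frequent pattern first."""
    return Counter(patterns).most_common()


# Hypothetical pattern values for six cases ('X' missing, '~' valid).
patterns = ["~~~", "X~~", "~~~", "X~~", "~~~", "~~X"]
table = pattern_frequencies(patterns)
```

Patterns other than the all-valid one that appear near the top of the table are the ones worth investigating.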

Slide 44 Request a Frequency Distribution for the Pattern Variable

Slide 45 The Frequency of Different Patterns

Slide 46 T-tests and Chi-square Tests for Diagnosing Randomness of Missing Data
In previous exercises, we created dichotomous grouping variables for the variables X1 through X10, where the grouping variable was assigned a 1 if the data was valid and a 0 if the data was missing. We will use these grouping variables to determine whether the valid and missing groups differ in their relationship to other variables in the data set.
If the missing and valid groups are statistically equivalent on other variables, then the missing cases can be characterized as random, and of no consequence to our analysis. If the missing group shows a statistically significant relationship to the other variable, it suggests that there is a missing data process that requires further understanding.
The statistical tests that we use in this analysis are chi-square tests of independence, if the variable to be tested is nonmetric, or t-tests for two independent samples, if the variable to be tested is metric. The authors use the separate variance output for all t-tests instead of examining individual tests of homogeneity. We will follow this practice.
When this analysis is conducted, there are usually a large number of statistical relationships tested. Using an alpha level of 0.05 in these tests implies that, when there is in fact no relationship, we can expect to make an incorrect inference in about one out of every twenty tests. With a large number of tests, we will get some statistically significant relationships even when there is no serious problem with our data. We are less concerned with the individual test results than with the overall pattern of relationships.
NOTE. I cannot reconcile the findings on these tests with the discussion of findings on page 58 of the text. The statistical results are consistent with Table 2.5 on page 59, while the text discussion appears to be a carryover from the fourth edition of the text, which does not contain the same statistical results as the fifth edition.
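The arithmetic behind the caution about many tests can be made concrete with a short Python sketch. It rests on two simplifying assumptions that the slides do not state: every null hypothesis is true, and the tests are independent of one another. The function names are hypothetical.

```python
def expected_false_positives(n_tests: int, alpha: float = 0.05) -> float:
    """Expected number of significant results when every null is true."""
    return n_tests * alpha


def chance_of_any_false_positive(n_tests: int, alpha: float = 0.05) -> float:
    """Probability of at least one significant result among independent
    tests when every null hypothesis is true: 1 - (1 - alpha) ** n_tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


# Nine tests, as when one grouping variable is tested against nine others.
nine_tests = chance_of_any_false_positive(9)
per_twenty = expected_false_positives(20)
```

Under these assumptions, twenty tests produce one false positive on average, and even nine tests have a better than one-in-three chance of producing at least one, which is why the overall pattern matters more than any single result.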

Slide 47 The Statistical Tests to Be Computed
We will use the grouping variable 'Delivery Speed (Valid/Missing)' (X1_) to explore differences among the next nine variables in the data set, 'Price Level' through 'Satisfaction Level' (X2 through X10).
In each statistical test, we are testing the null hypothesis of no relationship with the grouping variable, 'Delivery Speed (Valid/Missing)'. If we reject the null hypothesis, we would conclude that persons who did not answer the question on Delivery Speed had a different pattern of responses than persons who did answer it.
The variable 'Firm Size' (x8) is a nonmetric variable and we will do a chi-square test of independence for this variable.
The variables 'Price Level' (x2), 'Price Flexibility' (x3), 'Manufacturer Image' (x4), 'Service' (x5), 'Salesforce Image' (x6), 'Product Quality' (x7), 'Usage Level' (x9), and 'Satisfaction Level' (x10) are all metric and we will do t-tests for these variables.

Slide 48 The Chi-square Test of Independence

Slide 49 Requesting the Chi-square Test

Slide 50 Specifying Cell Contents

Slide 51 Chi-square Test Results
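To show what the chi-square test of independence computes, here is a minimal Python sketch of the Pearson chi-square statistic for a valid/missing grouping variable crossed with a nonmetric variable. It is not the SPSS Crosstabs procedure: it applies no continuity correction, the p-value helper covers only the one-degree-of-freedom (2 x 2) case, and the function names and data are hypothetical.

```python
import math
from collections import Counter
from typing import Hashable, List, Optional, Tuple


def chi_square_independence(
    group: List[int], category: List[Optional[Hashable]]
) -> Tuple[float, int]:
    """Return (Pearson chi-square, degrees of freedom) for group x category.

    Cases whose category value is missing (None) are left out of the table.
    """
    pairs = [(g, c) for g, c in zip(group, category) if c is not None]
    n = len(pairs)
    row_totals = Counter(g for g, _ in pairs)
    col_totals = Counter(c for _, c in pairs)
    observed = Counter(pairs)
    chi2 = 0.0
    for g in row_totals:
        for c in col_totals:
            expected = row_totals[g] * col_totals[c] / n
            chi2 += (observed[(g, c)] - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df


def p_value_df1(chi2: float) -> float:
    """Upper-tail p-value of a chi-square statistic with exactly 1 df."""
    return math.erfc(math.sqrt(chi2 / 2.0))


# Hypothetical data: 10 valid (1) and 10 missing (0) cases on the grouping
# variable, crossed with a two-category variable coded 0/1.
group = [1] * 10 + [0] * 10
category = [0] * 8 + [1] * 2 + [0] * 2 + [1] * 8
chi2, df = chi_square_independence(group, category)
p = p_value_df1(chi2)
```

A small p-value here would suggest that missingness on the grouping variable is related to the nonmetric variable, which is the signal this step looks for.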

Slide 52 Requesting the T-tests

Slide 53 Specifying the Groups by Code Number

Slide 54 Results of the T-tests
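The separate variance t-test used for the metric variables is commonly known as Welch's t-test. The Python sketch below computes its t statistic and approximate degrees of freedom after splitting a metric variable by a valid/missing flag. It does not compute a p-value, which needs the t distribution, and the function names and data are hypothetical rather than taken from the HATMISS data set.

```python
import math
from statistics import mean, variance
from typing import List, Optional, Tuple


def split_by_flag(
    values: List[Optional[float]], flags: List[int]
) -> Tuple[List[float], List[float]]:
    """Split a metric variable into (valid group, missing group) using a
    1/0 diagnostic flag, dropping cases where the metric value is missing."""
    valid = [v for v, f in zip(values, flags) if f == 1 and v is not None]
    missing = [v for v, f in zip(values, flags) if f == 0 and v is not None]
    return valid, missing


def welch_t(sample_a: List[float], sample_b: List[float]) -> Tuple[float, float]:
    """Return (t, df) for the separate-variance two-sample t-test.

    Each sample needs at least two values and some variation.
    """
    na, nb = len(sample_a), len(sample_b)
    va = variance(sample_a) / na  # squared standard error of mean A
    vb = variance(sample_b) / nb  # squared standard error of mean B
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


# Hypothetical metric values and diagnostic flag (1 = valid, 0 = missing).
values = [1.0, 2.0, 3.0, 4.0, 5.0, 2.0, 4.0, 6.0, 8.0, 10.0, None]
flags = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
valid_group, missing_group = split_by_flag(values, flags)
t, df = welch_t(valid_group, missing_group)
```

The degrees of freedom are generally not a whole number, which matches what the separate variance row of a t-test table reports.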

Slide 55 The Correlation Matrix for Diagnosing Randomness of Missing Data
To continue our missing data analysis, we run a correlation matrix for the dichotomous grouping variables: 'Delivery Speed (Valid/Missing)', 'Price Level (Valid/Missing)', 'Price Flexibility (Valid/Missing)', 'Manufacturer Image (Valid/Missing)', 'Service (Valid/Missing)', 'Salesforce Image (Valid/Missing)', 'Product Quality (Valid/Missing)', 'Usage Level (Valid/Missing)', and 'Satisfaction Level (Valid/Missing)'.
We examine the pattern of correlations to see if there are large correlations among multiple pairs of variables that do not have an obvious explanation. An obvious explanation would be that subjects only answered these questions if their answer to another question was a particular value, e.g. only answer the question about job satisfaction if you are employed.
If there are variables that show a strong pattern of systematic missing data without an obvious explanation, we should evaluate the impact that this pattern has on our research questions, and make our decision about including, eliminating, or substituting for these variables.
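A correlation matrix of the 1/0 diagnostic variables is an ordinary Pearson correlation applied to each pair of flags. The following Python sketch shows the computation under stated assumptions: the function names and flag data are hypothetical, and a flag that never varies (for example, a variable with no missing values) yields None because its correlation is undefined.

```python
import math
from typing import Dict, List, Optional


def pearson(x: List[int], y: List[int]) -> Optional[float]:
    """Pearson correlation of two equal-length lists, or None if either
    list is constant (the correlation is then undefined)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(
    flags: Dict[str, List[int]]
) -> Dict[str, Dict[str, Optional[float]]]:
    """Correlate every pair of valid/missing diagnostic variables."""
    names = list(flags)
    return {a: {b: pearson(flags[a], flags[b]) for b in names} for a in names}


# Hypothetical diagnostic flags for six cases (1 = valid, 0 = missing).
flags = {
    "x1_": [1, 1, 0, 0, 1, 0],
    "x2_": [1, 1, 0, 0, 1, 0],  # missing on exactly the same cases as x1_
    "x3_": [0, 0, 1, 1, 0, 1],  # missing exactly when x1_ is valid
    "x4_": [1, 1, 1, 1, 1, 1],  # never missing
}
matrix = correlation_matrix(flags)
```

A correlation near 1 between two flags means the two variables tend to be missing on the same cases, which is the concentration of missing data this step is meant to reveal.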

Slide 56 Requesting the Correlation Matrix

Slide 57 The Correlation Matrix Output