Data cleaning GAP Toolkit 5 Training in basic drug abuse data management and analysis Training session 12
Objectives To establish methods of uncovering coding errors To discuss techniques for implementing logical tests To present methods of selecting cases To reinforce the SPSS skills presented to date
Boolean operators: AND The AND operator is a logical operator in Boolean algebra Imagine two statements: X and Y For the operation (X AND Y) to be true X has to be true and Y has to be true The rules for Boolean operators are commonly displayed in Truth Tables
Truth table: AND
X      Y      X AND Y
True   True   True
True   False  False
False  True   False
False  False  False
Boolean operators: OR The OR operator is a logical operator in Boolean algebra Imagine two statements: X and Y For the operation (X OR Y) to be true either X is true or Y is true or both X and Y are true
Truth table: OR
X      Y      X OR Y
True   True   True
True   False  True
False  True   True
False  False  False
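The AND and OR rules described above can also be generated programmatically. The following is a minimal sketch in Python, used here only for illustration (the course itself works in SPSS); the function name truth_table is a hypothetical helper.

```python
from itertools import product


def truth_table(op):
    """Return rows (x, y, result) for a two-input Boolean operator."""
    return [(x, y, op(x, y)) for x, y in product([True, False], repeat=2)]


# X AND Y is true only when both statements are true.
and_table = truth_table(lambda x, y: x and y)
# X OR Y is true when either statement, or both, is true.
or_table = truth_table(lambda x, y: x or y)

for name, table in (("AND", and_table), ("OR", or_table)):
    print(f"X      Y      X {name} Y")
    for x, y, result in table:
        print(f"{x!s:<6} {y!s:<6} {result}")
    print()
```

Only the last row of the OR table is false, and only the first row of the AND table is true.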
Data cleaning Check the data for errors Clean the data before any data analysis
Types of error There are two broad areas of error: –Coding errors –Logical errors
Coding error Data entry errors Out-of-range values
Detecting out-of-range values For categorical variables, having declared valid values, frequency counts will highlight any peculiar entries For continuous variables, descriptive statistics, in particular the range and a histogram, will highlight any peculiar values
Examples Age: generate descriptive statistics Treatment type: generate a frequency distribution
Descriptives: Age (columns: Statistic, Std. Error)
Confidence Interval for Mean, Lower Bound: 31.16
Trimmed Mean: 31.31
Median: 31.00
Minimum: 1
Maximum: 77
Range: 76
Interquartile Range: 20.00
Rows without a surviving value: Mean, Upper Bound, Variance, Std. Deviation, Skewness, Kurtosis
Treatment type (frequency table)
Columns: Frequency, Percent, Valid Percent, Cumulative Percent
Rows: Valid (Inpatient, Outpatient, Total); Missing (System): 8.5; Total
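The two checks described above (range for a continuous variable, frequency counts against declared valid values for a categorical one) can be sketched outside SPSS as follows. This is a minimal Python illustration under stated assumptions: the records, the IDs and the numeric treatment codes (1 = Inpatient, 2 = Outpatient) are invented for the example and do not come from main.sav.

```python
from collections import Counter

# Hypothetical records; IDs, ages and codes are invented for illustration.
records = [
    {"id": 1, "age": 31, "treatment": 1},
    {"id": 2, "age": 1, "treatment": 2},
    {"id": 3, "age": 77, "treatment": 1},
    {"id": 4, "age": 28, "treatment": 3},
    {"id": 5, "age": 45, "treatment": None},  # None stands for a missing value
]

# Assumed coding scheme for the categorical variable.
VALID_TREATMENT = {1: "Inpatient", 2: "Outpatient"}

# Continuous variable: the minimum, maximum and range expose extreme values.
ages = [r["age"] for r in records if r["age"] is not None]
age_min, age_max = min(ages), max(ages)
age_range = age_max - age_min

# Categorical variable: a frequency count exposes codes that were never declared.
treatment_counts = Counter(r["treatment"] for r in records)
invalid_codes = sorted(
    code for code in treatment_counts
    if code is not None and code not in VALID_TREATMENT
)

print("Age min/max/range:", age_min, age_max, age_range)
print("Treatment counts:", dict(treatment_counts))
print("Undeclared treatment codes:", invalid_codes)
```

In this invented data the age range flags the values 1 and 77 for checking, and the frequency count flags the undeclared code 3.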
Resolving errors The questionnaires should be checked If possible, return to the interviewer or interviewee If still unresolved, consider setting the value as missing Note the importance of ID numbers for linking the computer to the questionnaire
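As an illustration of the last two points, the sketch below sets an unresolved value to missing for a single case located by its ID. It is a minimal Python example, not an SPSS procedure; the function set_missing, the field names and the records are hypothetical.

```python
def set_missing(records, case_id, variable):
    """Set one variable to missing (None) for the case with the given ID.

    Returns the number of records changed, so that a result other than 1
    (no match, or a duplicated ID) can be noticed and investigated.
    """
    changed = 0
    for record in records:
        if record["id"] == case_id:
            record[variable] = None
            changed += 1
    return changed


# Hypothetical data: the age of 1 for ID 101 could not be resolved.
records = [{"id": 101, "age": 1}, {"id": 102, "age": 34}]
n_changed = set_missing(records, 101, "age")
print(n_changed, records)
```

The ID is what ties the computer record back to the paper questionnaire, which is why the lookup is done on ID rather than on row position.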
Selecting cases The ability to select a set of cases according to a criterion is essential in data cleaning Generating statistics for subsets of the data is also a useful analytical tool
Example: Age Descriptive statistics of Age indicate that there is a case with a value of 1 and a case with the value 77. It is advisable to check the extreme values.
Descriptive Statistics (columns: N, Minimum, Maximum, Mean, Std. Deviation)
Age
Valid N (listwise): 1563
Example: Age It would be reasonable to check for values of 10 and under or 70 and over. The task is to select those cases and display the results. Data/Select Cases generates the following dialogue box
Choose these options to define selection criteria.
Data/Select Cases SPSS creates a new variable in the data set called filter_$ which = 1 when AGE <= 10 OR AGE >= 70 (the selected cases) and 0 otherwise. All subsequent analysis will be on the reduced data set until Data/Select Cases/All Cases is chosen. Cases that have been filtered out are identified by a slash through the case number
Age (frequency table for the selected cases; columns: Frequency, Percent, Valid Percent, Cumulative Percent)
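The selection logic can be sketched as follows, using the criterion from the Age example (values of 10 and under or 70 and over). This is a minimal Python illustration with invented ages; the key "filter" only mimics the role of the SPSS filter variable and is not SPSS syntax.

```python
# Hypothetical cases; the ages are invented for illustration.
ages = [8, 25, 31, 70, 77, 10, 69]
records = [{"id": i, "age": a} for i, a in enumerate(ages, start=1)]

# Flag each case: 1 if it meets the criterion, 0 otherwise.
for r in records:
    r["filter"] = 1 if (r["age"] <= 10 or r["age"] >= 70) else 0

# Work on the reduced data set only.
selected = [r for r in records if r["filter"] == 1]
selected_ids = [r["id"] for r in selected]
print("Selected IDs:", selected_ids)

# The full list `records` is untouched, so "All Cases" is simply `records`.
print("All cases:", len(records), "Selected:", len(selected))
```

Note that the criterion uses OR: no single age can be both 10 and under and 70 and over, so an AND would select nothing.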
Generating a report Analyse/Reports/Case Summaries Select the variables to be included in the summary
Case summaries (limited to first 100 cases)
Columns: Case number, ID, Age, Race, Education, Employment, Marital status, Treatment type, 1st most frequently used drug. From the Case number, ID and Age columns only the fragment "116 8" (first row) survives; the remaining columns are listed below.
Race     Education  Employment         Marital status         Treatment type  1st most frequently used drug
White    Secondary  Working full-time  Married liv w. spouse  Inpatient       ALCOHOL
White    Tertiary   Pensioner          Widowed                Inpatient       ALCOHOL
White    Secondary  Pensioner          Married liv w. spouse  Inpatient       ALCOHOL
White    Tertiary   Pensioner          Married liv w. spouse  Inpatient       ALCOHOL
White    .          Student/pupil      Never married          Inpatient       DAGGA
African  Primary    Student/pupil      Never married          Outpatient      DAGGA
African  Primary    Student/pupil      Never married          Outpatient      DAGGA
African  Primary    Student/pupil      Never married          Outpatient      DAGGA
African  Primary    Student/pupil      Never married          Outpatient      DAGGA
African  Primary    Student/pupil      Never married          Outpatient      DAGGA
African  Primary    Student/pupil      Never married          Outpatient      WHITE PIPE
African  Primary    Student/pupil      Never married          Outpatient      WHITE PIPE
African  Primary    Student/pupil      Never married          Outpatient      WHITE PIPE
African  Primary    Student/pupil      Never married          Outpatient      WHITE PIPE
Total N
Note: All Cases Don’t forget that, once certain cases have been selected, all subsequent analysis is on the selected cases only Once you have finished working with the subset, restore the file to All Cases before doing any further analysis –Data/Select Cases… –Select the All Cases radio button –OK
Locating a case From the Data Editor: –Data/Go To Case OR –Select a variable, then Edit/Find
Logical errors Detecting logical errors involves comparing answers to ensure that they are consistent The type of logical checks appropriate to identify particular errors will depend on the questions in the questionnaire
Detecting logical errors Cross-tabulations between categorical variables can be used to highlight errors Check criteria using conditional statements and the Compute facility Some software, such as SPSS Databuilder, allows tests for logical and coding errors to be built into a data entry form
Example: Cross-tabulation Cross-tabulations provide a simple method of investigating the joint distribution of two variables The following slide is a cross-tabulation of Drug1 against Mode1 to check that appropriate modes of ingestion have been reported
Most Frequently Used Drug (cross-tabulation): Drug1 by Mode of ingestion
Columns: Swallow, Smoke, Snort, Inject, Total
Cell counts and row totals are run together as they appear in the source; the column position of each count is not recoverable.
DAGGA
HEROIN
CODEINE 55
COCAINE 24446
CRACK 97198
AMPHETAMINE 4127
ECSTASY 24125
SEDATIVES & TRANQUILLIZERS 33
BENZODIAZEPINES 16
MANDRAX 12
VALIUM 22
LSD 55
SOLVENTS & INHALANTS 2136
WHITE PIPE 309
ALCOHOL 717
ROHYPNOL 33
MISC. PRESCRIPTION DRUGS 9110
MISC. DRUGS 11
Total
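A cross-tabulation check of this kind can be sketched as below. This is a minimal Python illustration, not SPSS output: the cases are invented, and the PLAUSIBLE mapping of drugs to acceptable modes of ingestion is a hypothetical rule set chosen only to make the example work. The real rules would have to be agreed for the questionnaire in use.

```python
from collections import Counter

# Hypothetical (drug, mode of ingestion) pairs, one per case.
cases = [
    ("ALCOHOL", "Swallow"),
    ("ALCOHOL", "Swallow"),
    ("DAGGA", "Smoke"),
    ("ALCOHOL", "Inject"),
    ("WHITE PIPE", "Smoke"),
]

# Assumed plausibility rules, for illustration only.
PLAUSIBLE = {
    "ALCOHOL": {"Swallow"},
    "DAGGA": {"Smoke", "Swallow"},
    "WHITE PIPE": {"Smoke"},
}

# The cross-tabulation: a count for every (drug, mode) cell that occurs.
crosstab = Counter(cases)

# Cells whose mode is not listed as plausible for that drug need checking.
suspect = {
    cell: n for cell, n in crosstab.items()
    if cell[1] not in PLAUSIBLE.get(cell[0], set())
}

for (drug, mode), n in sorted(crosstab.items()):
    print(f"{drug:<12} {mode:<8} {n}")
print("Cells to check:", suspect)
```

A flagged cell is not automatically an error; it marks cases whose questionnaires should be checked.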
Example: conditional statements Main.sav contains information on the three most frequently used drugs: Drug1, Drug2 and Drug3 In a single case, no drug should appear in more than one of the three variables To check this, generate a test variable on the basis of a conditional statement; the test variable should take the value 0 if all three drug variables are different and the value 1 if there is any duplication
Compute: Test = 0 Transform/Compute Enter the name of the new variable: TEST Click the Type and Label button and declare the variable as numeric with the label: TEST VARIABLE FOR DRUG DUPLICATION Set TEST = 0
Compute: TEST = 1 If any of the drug options are the same, TEST should equal 1, EXCEPT when Drug2 = Drug3 = 77 (not applicable). The condition is if –Drug1 = Drug2 OR –Drug1 = Drug3 OR –(Drug2 = Drug3 AND Drug2 ≠ 77) –THEN TEST = 1
Click If… button to define the conditional statement.
Case summaries (limited to first 100 cases)
Columns: 1st most frequently used drug, 2nd most frequently used drug, 3rd most frequently used drug, ID
Entries are listed in source order; some cells and row boundaries are not recoverable.
1: BENZODIAZEPINES; MISC. PRESCRIPTION DRUGS; ID 734
2: CRACK; ECSTASY; ID 807
3: CRACK; WHITE PIPE; CRACK; ID 835
4 to 7: HEROIN; SEDATIVES & TRANQUILLIZERS; SEDATIVES & TRANQUILLIZERS; MISC. PRESCRIPTION DRUGS; SEDATIVES & TRANQUILLIZERS; MISC. PRESCRIPTION DRUGS; MISC. PRESCRIPTION DRUGS; Not Applicable; ID 1245
8: MISC. PRESCRIPTION DRUGS; ALCOHOL; ID 1250
Total N: 8 8 8 8
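The TEST variable defined in this example can be sketched as a single conditional. The following minimal Python illustration mirrors the stated rule (TEST = 1 on any duplication, except when Drug2 and Drug3 are both 77, not applicable); the function name, the field names and the drug codes other than 77 are hypothetical.

```python
NOT_APPLICABLE = 77  # code given in the example for "not applicable"


def duplication_test(drug1, drug2, drug3):
    """Return 1 if any drug is reported more than once, otherwise 0.

    Drug2 = Drug3 = 77 is not a duplication: it means that neither a
    second nor a third drug was reported.
    """
    if (
        drug1 == drug2
        or drug1 == drug3
        or (drug2 == drug3 and drug2 != NOT_APPLICABLE)
    ):
        return 1
    return 0


# Hypothetical cases; drug codes other than 77 are invented.
cases = [
    {"id": 1, "drug1": 3, "drug2": 9, "drug3": 3},    # Drug1 = Drug3
    {"id": 2, "drug1": 5, "drug2": 77, "drug3": 77},  # no 2nd or 3rd drug
    {"id": 3, "drug1": 5, "drug2": 6, "drug3": 7},    # all different
    {"id": 4, "drug1": 4, "drug2": 8, "drug3": 8},    # Drug2 = Drug3
]

for case in cases:
    case["test"] = duplication_test(case["drug1"], case["drug2"], case["drug3"])

flagged_ids = [case["id"] for case in cases if case["test"] == 1]
print("Cases with duplicated drugs:", flagged_ids)
```

The flagged IDs are the cases to list in a case summary and to check against the questionnaires.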
Exercise Check for consistency between the drug reported and the method of ingestion for the second and third drugs of use What additional logical tests could be completed on the data in main.sav?
Summary Data entry errors Out-of-range errors Logical errors Conditional statements Selecting cases Reports