Data cleaning GAP Toolkit 5 Training in basic drug abuse data management and analysis Training session 12.


1 Data cleaning GAP Toolkit 5 Training in basic drug abuse data management and analysis Training session 12

2 Objectives To establish methods of uncovering coding errors To discuss techniques for implementing logical tests To present methods of selecting cases To reinforce the SPSS skills presented to date

3 Boolean operators: AND The AND operator is a logical operator in Boolean algebra Imagine two statements: X and Y For the operation (X AND Y) to be true X has to be true and Y has to be true The rules for Boolean operators are commonly displayed in Truth Tables

4 Truth table: AND

5 Boolean operators: OR The OR operator is a logical operator in Boolean algebra Imagine two statements: X and Y For the operation (X OR Y) to be true either X is true or Y is true or both X and Y are true

6 Truth table: OR
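The two truth tables can be generated directly from the definitions on slides 3 and 5. The following is a minimal Python sketch, not part of the original SPSS material; the helper name truth_table is illustrative.

```python
from itertools import product

def truth_table(op):
    """Return rows (X, Y, result) for a two-input Boolean operator."""
    return [(x, y, op(x, y)) for x, y in product([True, False], repeat=2)]

# (X AND Y) is true only when both X and Y are true.
and_rows = truth_table(lambda x, y: x and y)
# (X OR Y) is true when X is true, Y is true, or both are true.
or_rows = truth_table(lambda x, y: x or y)

for title, rows in (("X AND Y", and_rows), ("X OR Y", or_rows)):
    print(f"{'X':<7}{'Y':<7}{title}")
    for x, y, result in rows:
        print(f"{str(x):<7}{str(y):<7}{result}")
    print()
```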

7 Data cleaning Check the data for errors Clean the data before any data analysis

8 Types of error There are two broad areas of error: –Coding errors –Logical errors

9 Coding error Data entry errors Out-of-range values

10 Detecting out-of-range values For categorical variables, having declared valid values, frequency counts will highlight any peculiar entries For continuous variables, descriptive statistics, in particular the range and a histogram, will highlight any peculiar values
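The same two checks can be sketched outside SPSS. The Python example below is an illustration under stated assumptions: the records, the field names and the numeric codes for treatment type (1 = Inpatient, 2 = Outpatient) are hypothetical and are not taken from main.sav.

```python
from collections import Counter

# Hypothetical records; field names and codes are assumptions for illustration.
records = [
    {"id": 1, "age": 31, "treatment": 1},
    {"id": 2, "age": 1, "treatment": 2},
    {"id": 3, "age": 77, "treatment": 4},     # 4 is not a declared code
    {"id": 4, "age": 45, "treatment": None},  # missing value
]

# Declared valid values for the categorical variable (assumed coding).
VALID_TREATMENT = {1: "Inpatient", 2: "Outpatient"}

# Categorical variable: a frequency count exposes undeclared codes.
freq = Counter(r["treatment"] for r in records)
undeclared = sorted(
    code for code in freq if code is not None and code not in VALID_TREATMENT
)

# Continuous variable: the minimum and maximum expose peculiar values.
ages = [r["age"] for r in records if r["age"] is not None]
age_min, age_max = min(ages), max(ages)

print("Treatment frequencies:", dict(freq))
print("Undeclared treatment codes:", undeclared)
print("Age range:", age_min, "to", age_max)
```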

11 Examples Age: generate descriptive statistics Treatment type: generate a frequency distribution

12 Descriptives: Age
Measure                                        Statistic   Std. Error
Mean                                           31.78       .315
95% Confidence Interval for Mean, Lower Bound  31.16
95% Confidence Interval for Mean, Upper Bound  32.40
5% Trimmed Mean                                31.31
Median                                         31.00
Variance                                       154.614
Std. Deviation                                 12.434
Minimum                                        1
Maximum                                        77
Range                                          76
Interquartile Range                            20.00
Skewness                                       -.427       .062
Kurtosis                                       -.503       .124

13

14 Treatment type
                     Frequency  Percent  Valid Percent  Cumulative Percent
Valid    Inpatient   1027       65.4     65.7           65.7
         Outpatient  535        34.1     34.2           99.9
         4           1          .1       .1             100.0
         Total       1563       99.5     100.0
Missing  System      8          .5
Total                1571       100.0

15 Resolving errors The questionnaires should be checked If possible, return to the interviewer or interviewee If still unresolved, consider setting the value as missing Note the importance of ID numbers for linking the computer to the questionnaire

16 Selecting cases The ability to select a set of cases according to a criterion is essential in data cleaning Generating statistics for subsets of the data is also a useful analytical tool

17 Example: Age Descriptive statistics of Age indicate that there is a case with a value of 1 and a case with the value 77 It is advisable to check the extreme values
Descriptive Statistics
                    N     Minimum  Maximum  Mean   Std. Deviation
Age                 1563  1        77       31.78  12.434
Valid N (listwise)  1563

18 Example: Age It would be reasonable to check for values 10 and under and 70 and over The task is to select those cases and display the results Data/Select Cases generates the following dialogue box

19 Choose these options to define selection criteria.

20

21 Data/Select Cases SPSS creates a new variable in the data set called filter_$ which = 1 when AGE <= 10 OR AGE >= 70 All subsequent analysis will be on the reduced data set until Data/Select Cases/All Cases is chosen The filtered-out (unselected) cases are identified by a slash through the case number
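The selection logic can be sketched in Python as follows. This is an illustration rather than SPSS behaviour reproduced exactly: the function name flag_extreme_ages and the field name filter_ are hypothetical, and the case with ID 200 is invented to show an unselected case.

```python
# Illustrative cases: IDs and ages are sample values, ID 200 is invented.
cases = [
    {"id": 16, "age": 8},
    {"id": 85, "age": 77},
    {"id": 200, "age": 31},
    {"id": 903, "age": 1},
    {"id": 1521, "age": 10},
    {"id": 183, "age": 70},
]

def flag_extreme_ages(cases, low=10, high=70):
    """Add a 0/1 filter variable: 1 when age <= low OR age >= high."""
    return [
        dict(c, filter_=1 if (c["age"] <= low or c["age"] >= high) else 0)
        for c in cases
    ]

flagged = flag_extreme_ages(cases)

# Analysis on the reduced data set uses only the cases where the filter is 1.
selected = [c for c in flagged if c["filter_"] == 1]

# Restoring "all cases" simply means going back to the full list.
all_cases = flagged

print([c["id"] for c in selected])
```

Keeping the filter as a separate 0/1 variable, rather than deleting rows, mirrors the point on the slide that the full data set remains available once the selection is switched off.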

22 Age
Value  Frequency  Percent  Valid Percent  Cumulative Percent
1      1          7.1      7.1            7.1
7      5          35.7     35.7           42.9
8      1          7.1      7.1            50.0
9      1          7.1      7.1            57.1
10     3          21.4     21.4           78.6
70     1          7.1      7.1            85.7
72     1          7.1      7.1            92.9
77     1          7.1      7.1            100.0
Total  14         100.0    100.0

23 Generating a report Analyse/Reports/Case Summaries Select the variables to be included in the summary

24

25 Case summaries (limited to first 100 cases)
Case  ID    Age  Race     Education  Employment         Marital status         Treatment type  1st most frequently used drug
1     16    8    White    Secondary  Working full-time  Married liv w. spouse  Inpatient       ALCOHOL
2     85    77   White    Tertiary   Pensioner          Widowed                Inpatient       ALCOHOL
3     183   70   White    Secondary  Pensioner          Married liv w. spouse  Inpatient       ALCOHOL
4     184   72   White    Tertiary   Pensioner          Married liv w. spouse  Inpatient       ALCOHOL
5     903   1    White    .          Student/pupil      Never married          Inpatient       DAGGA
6     1041  7    African  Primary    Student/pupil      Never married          Outpatient      DAGGA
7     1042  7    African  Primary    Student/pupil      Never married          Outpatient      DAGGA
8     1043  7    African  Primary    Student/pupil      Never married          Outpatient      DAGGA
9     1044  7    African  Primary    Student/pupil      Never married          Outpatient      DAGGA
10    1045  7    African  Primary    Student/pupil      Never married          Outpatient      DAGGA
11    1518  9    African  Primary    Student/pupil      Never married          Outpatient      WHITE PIPE
12    1519  10   African  Primary    Student/pupil      Never married          Outpatient      WHITE PIPE
13    1520  10   African  Primary    Student/pupil      Never married          Outpatient      WHITE PIPE
14    1521  10   African  Primary    Student/pupil      Never married          Outpatient      WHITE PIPE
Total N: 14 for each variable except Education (13)

26 Note: All Cases Don’t forget that, once certain cases have been selected, all subsequent analysis is on the selected cases only Once you have finished working with the subset, restore the file to All Cases before doing any further analysis –Data/Select Cases… –Select the All Cases radio button –OK

27 Locating a case From the Data Editor: –Data/Go To Case OR –Select a variable, then Edit/Find

28 Logical errors Detecting logical errors involves comparing answers to ensure that they are consistent The type of logical checks appropriate to identify particular errors will depend on the questions in the questionnaire

29 Detecting logical errors Cross-tabulations between categorical variables can be used to highlight errors Check criteria using conditional statements and the Compute facility Some software, such as SPSS Databuilder, allows tests for logical and coding errors to be built into a data entry form

30 Example: Cross-tabulation Cross-tabulations provide a simple method of investigating the joint distribution of two variables The following slide is a cross-tabulation of Drug1 against Mode1 to check that appropriate modes of ingestion have been reported

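31 Most frequently used drug (Drug1) by mode of ingestion (cross-tabulation)
Columns: Swallow, Smoke, Snort, Inject, Total. Empty cells are not shown in the transcript, so each row lists its non-empty counts in column order, followed by the row total.
Drug1                       Non-empty counts  Total
DAGGA                       1, 180            181
HEROIN                      31, 11, 29        71
CODEINE                     5                 5
COCAINE                     2, 44             46
CRACK                       97, 1             98
AMPHETAMINE                 4, 1, 2           7
ECSTASY                     24, 1             25
SEDATIVES & TRANQUILLIZERS  3                 3
BENZODIAZEPINES             16                16
MANDRAX                     12                12
VALIUM                      2                 2
LSD                         5                 5
SOLVENTS & INHALANTS        2, 1, 3           6
WHITE PIPE                  309               309
ALCOHOL                     717               717
ROHYPNOL                    3                 3
MISC. PRESCRIPTION DRUGS    9, 1              10
MISC. DRUGS                 1                 1
Column totals: Swallow 791, Smoke 634, Snort 62, Inject 30, Total 1517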
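A cross-tabulation of this kind can be sketched in Python with a counter keyed on (drug, mode) pairs. The records below and the table of expected modes are assumptions made for illustration; they are not taken from main.sav, and which combinations count as implausible is a judgement for the analyst.

```python
from collections import Counter

# Hypothetical records for illustration only.
records = [
    {"id": 1, "drug1": "DAGGA", "mode1": "Smoke"},
    {"id": 2, "drug1": "DAGGA", "mode1": "Smoke"},
    {"id": 3, "drug1": "ALCOHOL", "mode1": "Swallow"},
    {"id": 4, "drug1": "ALCOHOL", "mode1": "Inject"},
    {"id": 5, "drug1": "HEROIN", "mode1": "Inject"},
]

# Joint distribution of the two variables: one count per (drug, mode) cell.
crosstab = Counter((r["drug1"], r["mode1"]) for r in records)

# Assumed plausibility rules; adjust to the questionnaire and local context.
EXPECTED_MODES = {
    "DAGGA": {"Smoke", "Swallow"},
    "ALCOHOL": {"Swallow"},
    "HEROIN": {"Smoke", "Snort", "Inject"},
}

# Cases whose reported mode is not expected for the reported drug.
suspect_ids = [
    r["id"]
    for r in records
    if r["mode1"] not in EXPECTED_MODES.get(r["drug1"], set())
]

for (drug, mode), count in sorted(crosstab.items()):
    print(f"{drug:<10}{mode:<10}{count}")
print("IDs to check against the questionnaire:", suspect_ids)
```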

32 Example: conditional statements Main.sav contains information on the three most frequently used drugs: Drug1, Drug2 and Drug3 In a single case, no drug should appear in more than one of the three variables To check this, generate a test variable on the basis of a conditional statement; the test variable should take the value 0 if all three drug variables are different and the value 1 if there is any duplication

33 Compute: Test = 0 Transform/Compute Enter the name of the new variable: TEST Click the Type and Label button and declare the variable as numeric with the label: TEST VARIABLE FOR DRUG DUPLICATION Set TEST = 0

34 Compute: TEST = 1 If any of the drug options are the same, TEST should equal 1 EXCEPT when Drug2 = Drug3 = 77 (not applicable) The condition is: IF –Drug1 = Drug2 OR –Drug1 = Drug3 OR –(Drug2 = Drug3 AND Drug2 ≠ 77) –THEN TEST = 1
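The conditional statement on slides 33 and 34 can be written as a small Python function. This is a sketch of the logic only: the function name is illustrative, and the drug codes other than 77 (not applicable) are hypothetical.

```python
NOT_APPLICABLE = 77  # code used on the slide for "not applicable"

def duplication_test(drug1, drug2, drug3):
    """Return 1 if any drug is reported more than once, otherwise 0.

    Drug2 = Drug3 = 77 (not applicable) does not count as a duplication.
    """
    test = 0  # corresponds to the initial Compute: TEST = 0
    if (
        drug1 == drug2
        or drug1 == drug3
        or (drug2 == drug3 and drug2 != NOT_APPLICABLE)
    ):
        test = 1
    return test

# Hypothetical drug codes (only 77 comes from the slide).
print(duplication_test(1, 2, 3))    # all different
print(duplication_test(1, 2, 1))    # Drug1 repeated as Drug3
print(duplication_test(1, 77, 77))  # second and third drugs not applicable
```

The parentheses around the last clause matter: without them the exclusion of code 77 would not be tied to the Drug2 = Drug3 comparison.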

35 Click If… button to define the conditional statement.

36

37 Case summaries (limited to first 100 cases)
Drugs are listed in the order 1st, 2nd, 3rd most frequently used; where a case shows fewer than three entries, the repeated drug appears only once in the transcript.
Case  ID    Drugs reported
1     734   BENZODIAZEPINES; MISC. PRESCRIPTION DRUGS
2     807   CRACK; ECSTASY
3     835   CRACK; WHITE PIPE; CRACK
4     1182  HEROIN; SEDATIVES & TRANQUILLIZERS
5     1230  SEDATIVES & TRANQUILLIZERS; MISC. PRESCRIPTION DRUGS
6     1231  SEDATIVES & TRANQUILLIZERS; MISC. PRESCRIPTION DRUGS
7     1245  MISC. PRESCRIPTION DRUGS; Not Applicable
8     1250  MISC. PRESCRIPTION DRUGS; ALCOHOL
Total N: 8

38 Exercise Check for consistency between the drug reported and the method of ingestion for the second and third drugs of use What additional logical tests could be completed on the data in main.sav?

39 Summary Data entry errors Out-of-range errors Logical errors Conditional statements Selecting cases Reports

