Data cleaning GAP Toolkit 5 Training in basic drug abuse data management and analysis Training session 12.

Presentation transcript:

Data cleaning GAP Toolkit 5 Training in basic drug abuse data management and analysis Training session 12

Objectives
To establish methods of uncovering coding errors
To discuss techniques for implementing logical tests
To present methods of selecting cases
To reinforce the SPSS skills presented to date

Boolean operators: AND
The AND operator is a logical operator in Boolean algebra
Imagine two statements: X and Y
For the operation (X AND Y) to be true, X has to be true and Y has to be true
The rules for Boolean operators are commonly displayed in truth tables

Truth table: AND

Boolean operators: OR
The OR operator is a logical operator in Boolean algebra
Imagine two statements: X and Y
For the operation (X OR Y) to be true, either X is true or Y is true or both X and Y are true

Truth table: OR
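The truth tables for both operators can be generated rather than memorised. The following is a minimal Python sketch (Python is not part of the original training material, and the helper name `truth_table` is an assumption made for illustration); it lists every combination of X and Y with the result of each operator.

```python
from itertools import product


def truth_table(op):
    """Return rows (X, Y, result) for a two-input Boolean operator."""
    return [(x, y, op(x, y)) for x, y in product([True, False], repeat=2)]


and_table = truth_table(lambda x, y: x and y)
or_table = truth_table(lambda x, y: x or y)

for label, table in (("AND", and_table), ("OR", or_table)):
    print(f"X      Y      X {label} Y")
    for x, y, result in table:
        print(f"{x!s:<6} {y!s:<6} {result}")
    print()
```

AND is true in only one of the four rows, whereas OR is false in only one.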

Data cleaning
Check the data for errors
Clean the data before any data analysis

Types of error
There are two broad areas of error:
–Coding errors
–Logical errors

Coding error
Data entry errors
Out-of-range values

Detecting out-of-range values
For categorical variables, having declared valid values, frequency counts will highlight any peculiar entries
For continuous variables, descriptive statistics, in particular the range and a histogram, will highlight any peculiar values
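The two checks can be sketched outside SPSS as well. The Python example below is illustrative only: the records, the numeric codes for treatment type and the names `VALID_TREATMENT` and `undeclared` are all assumptions, not values from the training data set.

```python
from collections import Counter

# Hypothetical records; the codes and values are invented for illustration.
records = [
    {"id": 1, "age": 31, "treatment": 1},
    {"id": 2, "age": 1, "treatment": 2},
    {"id": 3, "age": 77, "treatment": 1},
    {"id": 4, "age": 45, "treatment": 7},  # 7 is not a declared code
]

# Declared valid values for the categorical variable (assumed coding).
VALID_TREATMENT = {1: "Inpatient", 2: "Outpatient"}

# Categorical variable: a frequency count exposes undeclared codes.
treatment_counts = Counter(r["treatment"] for r in records)
undeclared = sorted(c for c in treatment_counts if c not in VALID_TREATMENT)

# Continuous variable: minimum, maximum and range expose extreme values.
ages = [r["age"] for r in records]
age_min, age_max = min(ages), max(ages)
age_range = age_max - age_min

print("Treatment frequencies:", dict(treatment_counts))
print("Undeclared treatment codes:", undeclared)
print("Age min/max/range:", age_min, age_max, age_range)
```

A code that appears in the frequency count but not in the declared list, or a minimum or maximum that is implausible for the variable, marks a case to be checked against the questionnaire.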

Examples
Age: generate descriptive statistics
Treatment type: generate a frequency distribution

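Descriptives: Age
Columns: Statistic, Std. Error (entries without a value were not preserved in this transcript)
Mean
Confidence Interval for Mean, Lower Bound: 31.16
Confidence Interval for Mean, Upper Bound
Trimmed Mean: 31.31
Median: 31.00
Variance
Std. Deviation
Minimum: 1
Maximum: 77
Range: 76
Interquartile Range: 20.00
Skewness
Kurtosis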

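Treatment type (frequency table)
Columns: Frequency, Percent, Valid Percent, Cumulative Percent (entries without a value were not preserved in this transcript)
Valid: Inpatient
Valid: Outpatient
Valid: Total
Missing: System, frequency 8, .5 per cent
Total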

Resolving errors
The questionnaires should be checked
If possible, return to the interviewer or interviewee
If still unresolved, consider setting the value as missing
Note the importance of ID numbers for linking the computer to the questionnaire
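As a rough illustration of the last two points, the Python sketch below keys each record by its questionnaire ID and replaces an unresolved value with a missing marker. The IDs, values and the helper `set_missing` are hypothetical, and `None` merely stands in for a system-missing value.

```python
# Hypothetical data set keyed by questionnaire ID; values are invented.
data = {
    101: {"age": 8, "treatment": 1},
    102: {"age": 34, "treatment": 2},
}

MISSING = None  # stand-in for a system-missing value


def set_missing(dataset, case_id, variable):
    """Mark one variable of one case as missing and return the old value."""
    old_value = dataset[case_id][variable]
    dataset[case_id][variable] = MISSING
    return old_value


# Suppose the questionnaire for ID 101 could not confirm the recorded age.
previous = set_missing(data, 101, "age")
print(f"ID 101: age {previous} replaced by missing")
```

Because the lookup is by ID, the same number locates both the paper questionnaire and the computer record.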

Selecting cases
The ability to select a set of cases according to a criterion is essential in data cleaning
Generating statistics for subsets of the data is also a useful analytical tool

Example: Age
Descriptive statistics of Age indicate that there is a case with a value of 1 and a case with the value 77
It is advisable to check the extreme values
Descriptive Statistics
Columns: N, Minimum, Maximum, Mean, Std. Deviation
Age: (values not preserved in this transcript)
Valid N (listwise): 1563

Example: Age
It would be reasonable to check for values 10 and under and 70 and over
The task is to select those cases and display the results
Data/Select Cases generates the following dialogue box

Choose these options to define selection criteria.

Data/Select Cases
SPSS creates a new variable in the data set called filter_$, which equals 1 when AGE <= 10 or AGE >= 70
All subsequent analysis will be on the reduced data set until Data/Select Cases/All Cases is chosen
Cases that have been filtered out are identified by a slash through the case number
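The selection rule for the Age example (values 10 and under, or 70 and over) can be sketched in Python as a 0/1 flag, loosely mirroring the filter variable. The case data and the function name `add_filter` are assumptions for illustration; this is not how SPSS implements the filter internally.

```python
# Hypothetical cases; IDs and ages are invented for illustration.
cases = [
    {"id": 1, "age": 8},
    {"id": 2, "age": 31},
    {"id": 3, "age": 70},
    {"id": 4, "age": 10},
    {"id": 5, "age": 69},
]


def add_filter(rows, low=10, high=70):
    """Add a 0/1 filter flag: 1 when age <= low OR age >= high."""
    for row in rows:
        row["filter"] = 1 if (row["age"] <= low or row["age"] >= high) else 0
    return rows


add_filter(cases)
selected = [row for row in cases if row["filter"] == 1]
print([row["id"] for row in selected])
```

The full list `cases` is left intact, so dropping the flag returns the analysis to all cases.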

Age (frequency table)
Columns: Frequency, Percent, Valid Percent, Cumulative Percent (cell values were not preserved in this transcript)

Generating a report
Analyse/Reports/Case Summaries
Select the variables to be included in the summary
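A case listing of this kind amounts to printing the chosen variables for each selected case. The Python sketch below is a loose analogue under stated assumptions: the function `case_summaries`, its `limit` parameter and the sample rows are invented for illustration and do not reproduce the SPSS output format.

```python
def case_summaries(rows, variables, limit=100):
    """Return a plain-text listing of the chosen variables, one case per line."""
    lines = ["  ".join(f"{name:<18}" for name in variables)]
    for row in rows[:limit]:
        # A full stop marks a value that is absent, as in the listing style above.
        cells = [str(row.get(name, ".")) for name in variables]
        lines.append("  ".join(f"{cell:<18}" for cell in cells))
    lines.append(f"Total N = {min(len(rows), limit)}")
    return "\n".join(lines)


# Hypothetical selected cases; all values are invented.
selected_cases = [
    {"id": 11, "age": 8, "employment": "Working full-time"},
    {"id": 12, "age": 74, "employment": "Pensioner"},
]

report = case_summaries(selected_cases, ["id", "age", "employment"])
print(report)
```

Including the ID among the listed variables keeps each line traceable to its questionnaire.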

Case summaries (limited to first 100 cases)
Columns: Case number, ID, Age, Race, Education, Employment, Marital status, Treatment type, 1st most frequently used drug
Case number, ID and Age values were mostly lost in this transcript; the surviving digits are shown as found
116 8; White; Secondary; Working full-time; Married liv w. spouse; Inpatient; ALCOHOL
White; Tertiary; Pensioner; Widowed; Inpatient; ALCOHOL
White; Secondary; Pensioner; Married liv w. spouse; Inpatient; ALCOHOL
White; Tertiary; Pensioner; Married liv w. spouse; Inpatient; ALCOHOL
White; .; Student/pupil; Never married; Inpatient; DAGGA
African; Primary; Student/pupil; Never married; Outpatient; DAGGA
African; Primary; Student/pupil; Never married; Outpatient; DAGGA
African; Primary; Student/pupil; Never married; Outpatient; DAGGA
African; Primary; Student/pupil; Never married; Outpatient; DAGGA
African; Primary; Student/pupil; Never married; Outpatient; DAGGA
African; Primary; Student/pupil; Never married; Outpatient; WHITE PIPE
African; Primary; Student/pupil; Never married; Outpatient; WHITE PIPE
African; Primary; Student/pupil; Never married; Outpatient; WHITE PIPE
African; Primary; Student/pupil; Never married; Outpatient; WHITE PIPE
Total N

Note: All Cases
Don’t forget that, once certain cases have been selected, all subsequent analysis is on the selected cases only
Once you have finished working with the subset, restore the file to All Cases before doing any further analysis
–Data/Select Cases…
–Select the All Cases radio button
–OK

Locating a case
From the Data Editor:
–Data/Go To Case
OR
–Select a variable, then Edit/Find

Logical errors
Detecting logical errors involves comparing answers to ensure that they are consistent
The type of logical checks appropriate to identify particular errors will depend on the questions in the questionnaire

Detecting logical errors
Cross-tabulations between categorical variables can be used to highlight errors
Check criteria using conditional statements and the Compute facility
Some software, such as SPSS Databuilder, allows tests for logical and coding errors to be built into a data entry form

Example: Cross-tabulation
Cross-tabulations provide a simple method of investigating the joint distribution of two variables
The following slide is a cross-tabulation of Drug1 against Mode1 to check that appropriate modes of ingestion have been reported
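The idea of counting each (drug, mode) pair and then inspecting unlikely cells can be sketched in Python. The observations and the plausibility rules in `PLAUSIBLE` are invented for illustration; they are not taken from main.sav, and which combinations count as implausible is a judgement for the analyst.

```python
from collections import Counter

# Hypothetical (drug, mode) pairs, invented for illustration.
observations = [
    ("ALCOHOL", "Swallow"),
    ("ALCOHOL", "Swallow"),
    ("DAGGA", "Smoke"),
    ("ALCOHOL", "Inject"),
    ("LSD", "Swallow"),
]

# Joint distribution: one count per (drug, mode) cell.
crosstab = Counter(observations)

# Assumed rules: modes regarded as plausible for each drug.
PLAUSIBLE = {
    "ALCOHOL": {"Swallow"},
    "DAGGA": {"Smoke", "Swallow"},
    "LSD": {"Swallow"},
}

# Cells whose mode is not listed as plausible for that drug.
suspect_cells = {
    cell: count
    for cell, count in crosstab.items()
    if cell[1] not in PLAUSIBLE.get(cell[0], set())
}

modes = ["Swallow", "Smoke", "Snort", "Inject"]
print("Drug".ljust(10) + "".join(m.rjust(9) for m in modes))
for drug in sorted({d for d, _ in observations}):
    print(drug.ljust(10) + "".join(str(crosstab[(drug, m)]).rjust(9) for m in modes))
print("Cells to check:", suspect_cells)
```

A non-zero count in a suspect cell does not prove an error; it identifies cases whose questionnaires should be checked.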

Most frequently used drug (Drug1) by mode of ingestion (cross-tabulation)
Columns: Swallow, Smoke, Snort, Inject, Total
Cell counts were run together in this transcript and are shown as found; rows without digits lost their values
DAGGA
HEROIN
CODEINE 55
COCAINE 24446
CRACK 97198
AMPHETAMINE 4127
ECSTASY 24125
SEDATIVES & TRANQUILLIZERS 33
BENZODIAZEPINES 16
MANDRAX 12
VALIUM 22
LSD 55
SOLVENTS & INHALANTS 2136
WHITE PIPE 309
ALCOHOL 717
ROHYPNOL 33
MISC. PRESCRIPTION DRUGS 9110
MISC. DRUGS 11
Total

Example: conditional statements
Main.sav contains information on the three most frequently used drugs: Drug1, Drug2 and Drug3
In a single case, no drug should appear in more than one of the three variables
To check this, generate a test variable on the basis of a conditional statement; the test variable should take the value 0 if all three drug variables are different and the value 1 if there is any duplication

Compute: TEST = 0
Transform/Compute
Enter the name of the new variable: TEST
Click the Type and Label button and declare the variable as numeric with the label: TEST VARIABLE FOR DRUG DUPLICATION
Set TEST = 0

Compute: TEST = 1
If any of the drug options are the same, TEST should equal 1, EXCEPT when Drug2 = Drug3 = 77 (not applicable)
The condition is: if
–Drug1 = Drug2 OR
–Drug1 = Drug3 OR
–(Drug2 = Drug3 AND Drug2 ≠ 77)
–THEN TEST = 1
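The conditional statement translates directly into code. The Python sketch below follows the rule as stated (TEST starts at 0 and becomes 1 on any duplication, except when Drug2 and Drug3 are both 77); the function name `duplication_test` and every drug code other than 77 are assumptions for illustration.

```python
NOT_APPLICABLE = 77  # code for "not applicable", as stated in the rule


def duplication_test(drug1, drug2, drug3):
    """Return 1 if any drug is duplicated across the three variables, else 0."""
    test = 0
    if (drug1 == drug2
            or drug1 == drug3
            or (drug2 == drug3 and drug2 != NOT_APPLICABLE)):
        test = 1
    return test


# Hypothetical cases: (ID, Drug1, Drug2, Drug3); codes other than 77 are invented.
cases = [
    (1, 3, 5, 9),    # all different
    (2, 3, 5, 3),    # Drug1 repeated as Drug3
    (3, 4, 77, 77),  # only one drug reported
    (4, 4, 6, 6),    # Drug2 repeated as Drug3
]

flagged = [case_id for case_id, d1, d2, d3 in cases if duplication_test(d1, d2, d3)]
print(flagged)
```

The parentheses around the third clause matter: without them the exception for code 77 would not be tied to the Drug2 = Drug3 comparison alone.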

Click If… button to define the conditional statement.

Case summaries (limited to first 100 cases)
Columns: Case number, 1st most frequently used drug, 2nd most frequently used drug, 3rd most frequently used drug, ID
Some cells and row boundaries were lost in this transcript; surviving cells are shown in the order found
1: BENZODIAZEPINES; MISC. PRESCRIPTION DRUGS; ID 734
2: CRACK; ECSTASY; ID 807
3: CRACK; WHITE PIPE; CRACK; ID 835
4 to 7: HEROIN; SEDATIVES & TRANQUILLIZERS; SEDATIVES & TRANQUILLIZERS; MISC. PRESCRIPTION DRUGS; SEDATIVES & TRANQUILLIZERS; MISC. PRESCRIPTION DRUGS; MISC. PRESCRIPTION DRUGS; Not Applicable; ID 1245
8: MISC. PRESCRIPTION DRUGS; ALCOHOL; ID 1250
Total N: 8 in each column

Exercise
Check for consistency between the drug reported and the method of ingestion for the second and third drugs of use
What additional logical tests could be completed on the data in main.sav?

Summary
Data entry errors
Out-of-range errors
Logical errors
Conditional statements
Selecting cases
Reports