Dale Rhoda & Mary Kay Trimner Stata Conference 2018

Slides:



Advertisements
Similar presentations
Stata as a Data Entry Management Tool
Advertisements

Maintaining data quality: fundamental steps
Chapter 07: Lecture Notes (CSIT 104) 1111 Exploring Microsoft Office Excel 2007 Chapter 7 Data Consolidation, Links, and Formula Auditing.
Generating new variables and manipulating data with STATA Biostatistics 212 Lecture 3.
©2004, 2006, 2008 UIW Department of Instructional Technology Meat and Potatoes SPSS Presented by Terence Peak.
Adding Automated Functionality to Office Applications.
Database Software Application
Microsoft Office Word 2013 Expert Microsoft Office Word 2013 Expert Courseware # 3251 Lesson 4: Working with Forms.
Electronic EDI e-EDI. The EDI has been in use since 1999 using a paper-based system and computerized spreadsheets to collect and manage EDI data. Over.
MGT-491 QUANTITATIVE ANALYSIS AND RESEARCH FOR MANAGEMENT OSMAN BIN SAIF Session 15.
SAS Workshop Lecture 1 Lecturer: Annie N. Simpson, MSc.
Working with a Database
Gadgets & More…. “Date Range” Gadgets Allows you to choose a specific date, before or after a date or a range of dates using the Workflows calendar.
PREPARING DATA FOR STATISTICAL ANALYSIS Data Cleaning Data Cleaning Dataset Preparation Dataset Preparation Documentation Documentation 9 September 2008.
King Fahd University of Petroleum & Minerals Department of Management and Marketing MKT 345 Marketing Research Dr. Alhassan G. Abdul-Muhmin Editing and.
Chapter 17 Creating a Database.
Session Session 15 FAFSA on the Web - Onward and Upward!
Microsoft® Excel Key and format dates and times. 1 Use Date & Time functions. 2 Use date and time arithmetic. 3 Use the IF function. 4 Create.
Copyright © 2008 Pearson Prentice Hall. All rights reserved Copyright © 2008 Prentice-Hall. All rights reserved. Committed to Shaping the Next.
Writing ODK Surveys in XLSFORM
TIMOTHY SERVINSKY PROJECT MANAGER CENTER FOR SURVEY RESEARCH Data Preparation: An Introduction to Getting Data Ready for Analysis.
1 PEER Session 02/04/15. 2  Multiple good data management software options exist – quantitative (e.g., SPSS), qualitative (e.g, atlas.ti), mixed (e.g.,
An electronic document that stores various types of data.
Day in the Life (DITL) Production Operations with Energy Builder Copyright © 2015 EDataViz LLC.
Invoices and Service Invoices Training Presentation for Raytheon Supply Chain Platform (RSCP) April 2016.
Mannheim Research Institute for the Economics of Aging SHARE Data Cleaning General rules and procedures Stephanie Stuck MEA Antwerp.
University of Colorado at Denver and Health Sciences Center Department of Preventive Medicine and Biometrics Contact:
SJSU College of Business Business Productivity Tools Fall 2016 Summary of Lessons and Learning Objectives.
Understanding Microsoft Excel
Understanding Microsoft Excel
Excel Tutorial 8 Developing an Excel Application
Session 15 Merging Data in SPSS
Microsoft Office Access 2010 Lab 1
Session 5 – Questionnaire Checklists
Understanding Microsoft Excel
Exploring Primary Sources
AP CSP: Cleaning Data & Creating Summary Tables
Integrity Developer Team Promotion Process
Systems Planning and Analysis
Data Collection Interview
Practical Office 2007 Chapter 10
Microsoft Excel Basics
Configuration Management and Prince2
7/19/2018 Data, and Metrics, and Reports! Oh, my!: Just Follow the Yellow Brick Road Presented at the CSM Symposium 2016 Andrew Brubaker strategic_data_gathering_and_dashboarding-3.pptx.
Use of CAPI for agricultural surveys
Introduction to WRDS data platform
Mobile Data collection
Template library tool and Kestrel training
Microsoft Access 2003 Illustrated Complete
Page Layout Header & Footer Font Styles Image wrapping List Styles
The webinar will start at 12:10pm.
REDCap Data Migration from CSV file
Collaboration with Google Docs
MS-Office It is a Software Package It contains some programs like
ECONOMETRICS ii – spring 2018
Understanding Microsoft Excel
ETS Submission Process for New Project Applications
2018 NM Community Survey Data Entry Training
Warm up – Unit 4 Test – Financial Analysis
Secondary Data Analysis Lec 10
Spreadsheets, Modelling & Databases
Understanding Microsoft Excel
PA430 - Data coding March 7/8, 2000.
Learning Intention I will learn about selection with multiple conditions.
By A.Arul Xavier Department of mathematics
WSP-ATR Submission Process 2019.
Indicator 3.05 Interpret marketing information to test hypotheses and/or to resolve issues.
WSP-ATR Submission Process 2019.
Presentation transcript:

Dale Rhoda & Mary Kay Trimner Stata Conference 2018 New data cleaning command: assertlist improves speed & accuracy of collaborative correction Dale Rhoda & Mary Kay Trimner Stata Conference 2018

Context: Data Cleaning Primary data collection, entry & cleaning (e.g., household survey) There are artifacts (paper, photos, phone numbers) that might be used to confirm correct values Data sometimes stored on paper forms Or, with touchscreen, there is sometimes a photo of important docs (e.g., child’s vaccination record) Sometimes collect respondent’s mobile phone number to coordinate interview time & place or ask follow-up questions

Where Do We Do It?

Distributed Data Cleaning Team Constructing the Analysis Dataset Check values for correctness, completeness, consistency Flag disallowed or inconsistent values for review Amend the dataset to include corrections; document as appropriate (Sometimes) Change uncorrectable values to special missing Team with Access to Survey Forms Check flagged values (via paper form, photo, or phone call) Identify corrected data values where possible

Example: Clean Dataset Note: hhid = household ID; respid = respondent ID; iid = interviewer ID; idate = interview date; vxd = vaccinated

List of Expectations No variable should be missing for any respondent No duplicate combos of ID variables: stratum cluster hhid respid The variable vxd holds only values of “Yes” and “No” Hint: If there appear to be problems in the ID variables, make a variable named line that is never missing and always unique and use it for corrections to ID variables

Example: Dataset with Problems

One Approach

Another Approach

Conceptually, This Is What We Want

Stata Command Assertlist Syntax: assertlist boolean_assertion [if] [in], options boolean_assertion is syntax that resolves to either true (1) or false (0) Can involve the ‘and’ (&) or the ‘or’ (|) operators sex must be Male or Female: sex == “M” | sex == “F” response must be 1 or 2 or 99: response == 1 | response == 2 | response == 99

Assertlist Options list(varlist) variables to list if expression is false tag(string) description to list along with output excel(filename) name of excel file for output note: excel option requires sheet option sheet(sheetname) name of worksheet for output fix provide option to correct some values note: fix requires idlist & checklist idlist(varlist) variables that uniquely identify rows checklist(varlist) variables to check & correct

assertlist approach

Output: Assertlist_Summary Tab Includes one row for every check Lists the boolean_assertion & tag Lists # of observations checked, # passed and # failed

Output: Checks Sheet The sheet named ‘checks_fix’ provides a space for making corrections Column N here contains an Excel formula that writes Stata code as soon as you type a correction in Column M

Output: checks_fix sheet You can paste the Stata syntax into a downstream .do file (and add some comments)

Data Corrections in Stata Program

Notes We’re moving quickly thru this talk, but use a 2-stage approach Clean up the ID variables first, so they will serve as a unique key: a) to the dataset, and b) to the forms, photos or phone numbers Then clean up everything else, knowing you have a clean idlist The boolean_assertion can be as complex as needs be Both idlist and checklist can include numerous variables (Example of correction columns when checklist holds two variables)

Workflow for Collaborative Correction Establish list of expectations as soon as questionnaire is finalized Valid values Skip patterns No duplicates Responses internally consistent (i.e., date of birth ≤ vaccination dates ≤ interview date) Convert to assertlist syntax Run the job & send the spreadsheet to data managers Receive back the spreadsheet populated with corrections Copy replace syntax into .do file [Repeat Steps 3-5 as needed] Decide what to do with uncorrected violations of expectations

Companion Program to Clean Up Title Rows When you are finished sending output to a particular Excel file, run assertlist_cleanup to switch to nicer column titles (assertlist uses some nerdy variable names as column titles because it re-reads & writes its output if you send results from 2+ assertions to the same Excel worksheet)

Can Also Flag Problems in Near-Real-Time When data are collected via touchscreen, they are often uploaded to a server on a daily basis Assertlist can be wrapped inside a loop that makes a daily list of concerns for each survey team E-mail the appropriate output to each supervisor to: Correct errors right away Coach (or replace) error-prone workers

Summary Assertlist flags data elements that are illegal or illogical Useful to document prevalence of problems Useful for replacing those elements with corrected values Many benefits: Facilitates clear communication Helps document corrections Saves time Prevents typing errors Caution: Requires some care with ID variables

Where Do I Find It? Today: Link to our GitHub page here: www.biostatglobal.com Soon: Find it on SSC

Questions. Dale. Rhoda@biostatglobal. com MaryKay Questions? Dale.Rhoda@biostatglobal.com MaryKay.Trimner@biostatglobal.com Link to GitHub: www.biostatglobal.com