Preparing Research Datasets: Data Request, Data Cleaning, Dataset Preparation, Documentation. Beverly Musick

Presentation transcript:


Research Data Request
A concept proposal is a detailed plan for a research project and typically includes:
- The aims of the study
- Associated hypotheses
- Statistical analysis plan
- Description of the cohort
- Specific data/variables needed
(Example: Hunt Proposal)


Steps to Fulfill Research Data Request
1. Identify and resolve questions regarding requirements (may be repeated)
2. Determine data source and variables needed
3. Define any derived variables
4. Define cohort and subset of visits/observations to be included
5. Data cleaning
6. Dataset preparation
7. Documentation

Raw Data Cleaning
For data stored in Access, Excel, or text files, data cleaning should begin with the original table, spreadsheet, or file.
- Back up the original data files.
- Eliminate blank records and any records used for testing.
- Locate duplicate records and resolve them.
- For numeric variables, identify outliers by sorting and reviewing the overall minimum and maximum. This is particularly useful for continuous variables such as dates, ages, and weights.
- For categorical variables such as gender or marital status, sorting will reveal invalid response codes or use of mixed case (f, F, m, M for gender).
- Review the frequency of missing data when records are sorted. Does it make sense that x records have no value for variable y?

Raw Data to SAS Datasets
Create a SAS program that converts the database file(s) to permanent SAS dataset(s).

For Access or Excel files, you can use PROC IMPORT:

    PROC IMPORT OUT=WORK.demog
        DATATABLE="tblDEMOG"
        DBMS=ACCESS REPLACE;
        DATABASE="I:\Projects\Kenya\CFAR\cfar.mdb";
        DBPWD='password';
    RUN;

For text files, you can write a specific INPUT statement:

    data copd ;
        infile 'c:\kenya\hiv\copd.txt' ;
        input patientid $9. ;
    run ;

SAS Dataset Creation
- Merge or append (concatenate) tables as necessary.
- Double-check the merging process by looking at the number of observations in each dataset before and after the merge.
- The number of records in the merged dataset depends on the overlap among the input datasets. This relationship should be known in advance and the expected outcome confirmed.

Merge Example

    data vis1 ;
        set h.visitDemo(keep=patient_id apptdate age weight height bmi cd4) ;
        if patient_id in (1,2,3,4,5) ;
    run ;

    NOTE: There were observations read from the data set H.visitDemo.
    NOTE: The data set WORK.VIS1 has 71 observations and 7 variables.

    data vis2 ;
        set h.visitDemo(keep=patient_id apptdate clinic hgb sao2) ;
        if patient_id in (4,5,6) ;
    run ;

    NOTE: There were observations read from the data set H.visitDemo.
    NOTE: The data set WORK.VIS2 has 46 observations and 5 variables.

    data bothvis ;
        merge vis1 vis2 ;
        by patient_id apptdate ;
    run ;

    NOTE: There were 71 observations read from the data set WORK.VIS1.
    NOTE: There were 46 observations read from the data set WORK.VIS2.
    NOTE: The data set WORK.BOTHVIS has 83 observations and 10 variables.
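One way to see where the 83 merged observations come from is to tag each row with its source during the merge. The sketch below reuses vis1 and vis2 from the example; the flag names in_vis1 and in_vis2 and the variable source are assumptions made for illustration, and the merge assumes each patient_id/apptdate pair is unique within each dataset.

```sas
/* Sketch only: in_vis1, in_vis2 and source are hypothetical names. */
proc sort data=vis1; by patient_id apptdate; run;
proc sort data=vis2; by patient_id apptdate; run;

data bothvis;
    merge vis1(in=in_vis1) vis2(in=in_vis2);
    by patient_id apptdate;
    length source $10;
    /* Record which input dataset(s) contributed to each merged row */
    if in_vis1 and in_vis2 then source = 'both';
    else if in_vis1 then source = 'vis1 only';
    else source = 'vis2 only';
run;

/* The three counts should add up to the merged total */
proc freq data=bothvis;
    tables source;
run;
```

Under the uniqueness assumption, the log counts above imply 71 + 46 - 83 = 34 matched rows, leaving 37 rows from vis1 only and 12 from vis2 only.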

SAS Dataset Creation (cont.)
Confirm that the total number of variables in the merged dataset is correct. The number should be the sum of the variables across all datasets minus (number of key fields * (number of datasets in the merge - 1)).
In the previous example: 7 + 5 - 2*(2-1) = 10
If the number of variables is less than this, then the same non-key variable(s) appear in more than one of the datasets. This should be strictly avoided.
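Rather than reading the counts off the log, the numbers of observations and variables can be pulled side by side from the SAS dictionary tables. This is a minimal sketch that assumes the three datasets from the merge example are still in the WORK library.

```sas
/* List observation and variable counts for the inputs and the merged result */
proc sql;
    select memname, nobs, nvar
    from dictionary.tables
    where libname = 'WORK'
      and memname in ('VIS1', 'VIS2', 'BOTHVIS');
quit;
```

For the example above, nvar should read 7, 5, and 10, matching the formula.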


SAS Dataset Creation (cont.)
Always review the SAS log ERROR, WARNING, and NOTE messages. The following messages are often overlooked but do require action:
- "NOTE: MERGE statement has more than one data set with repeats of BY values." This indicates that one or more of the datasets that you are trying to merge contains multiple observations that are not uniquely distinguishable based on the variables listed in the BY statement. The merged dataset will contain spurious and unexpected results. Further processing should not continue until this note has been resolved.
- "NOTE: Variable _____ is uninitialized" and "NOTE: Variable _____ has never been referenced". These indicate that the variable has not been properly defined. Many times a variable name has just been misspelled.
- "NOTE: Character values have been converted to numeric values…" This indicates that SAS has automatically converted a character variable to numeric. Because unexpected results can occur, it's best to do the conversion manually with the INPUT or PUT function.
- "WARNING: Multiple lengths were specified for the BY variable _____ by input data sets. This may cause unexpected results." This indicates that the length of the BY variable is not consistent across all datasets.
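Two of these notes can be headed off before the merge. The sketch below first isolates repeated BY values for review, then shows explicit type conversion. The dataset raw and the variables cd4_char and id_char are hypothetical names used only for illustration.

```sas
/* Rows that repeat a patient_id/apptdate key (beyond the first)
   are written to vis2_dups so they can be reviewed and resolved. */
proc sort data=vis2 out=vis2_unique nodupkey dupout=vis2_dups;
    by patient_id apptdate;
run;

/* Explicit conversion instead of relying on automatic conversion.
   raw, cd4_char and id_char are hypothetical. */
data converted;
    set raw;
    cd4     = input(cd4_char, 8.);    /* character to numeric */
    id_char = put(patient_id, 8.);    /* numeric to character */
run;
```

If vis2_dups comes back empty, the key is unique in that dataset and the "repeats of BY values" note should not be triggered by it.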

SAS Dataset Creation (cont.)
To create permanent datasets for analysis:
- Recode missing values used in the raw data tables/files to appropriate SAS missing values. For example, if 9's were used to indicate missing data for numeric fields in a data table, then these should be converted to .'s (the SAS numeric missing value).
- Calculate appropriate summary scores (ex. AUDIT-3, BMI).
- Calculate differences between dates, such as time from enrollment to ART initiation.
- Label all calculated and created variables.
- Attach formats to the variable values where necessary.
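The steps above can be sketched in a single DATA step. Everything named here is an assumption for illustration: the input dataset raw_demog, the variables marital, educ, smoke, sex, enroll_date and art_start_date, the coding of sex, and the units of weight (kg) and height (cm).

```sas
proc format;
    value sexfmt 1 = 'Male' 2 = 'Female';   /* hypothetical coding */
run;

data analysis;   /* in practice this would be a permanent libref.dataset */
    set raw_demog;

    /* Recode the raw missing code 9 to the SAS missing value */
    array coded {*} marital educ smoke;
    do i = 1 to dim(coded);
        if coded{i} = 9 then coded{i} = .;
    end;
    drop i;

    /* Summary score: BMI, assuming weight in kg and height in cm */
    if nmiss(weight, height) = 0 and height > 0 then
        bmi = weight / (height / 100)**2;

    /* SAS dates are day counts, so subtraction gives days */
    if nmiss(art_start_date, enroll_date) = 0 then
        days_to_art = art_start_date - enroll_date;

    label bmi         = 'Body mass index (kg/m2)'
          days_to_art = 'Days from enrollment to ART initiation';
    format sex sexfmt. enroll_date art_start_date date9.;
run;
```

Only fields where 9 is genuinely a missing code belong in the array; a continuous field such as weight could legitimately hold the value 9.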

Cleaning Data in SAS
- Create a cleanup program.
- Generate frequencies, means, and univariates to better understand the dataset and to check for invalid data.
- Plot the data.
- For the numeric and date fields, look at minimums and maximums to verify that all values are within the expected range.
- Locate duplicate records and resolve them.
- Compare fields when appropriate (e.g. check dob against age; confirm date of initial visit < date of follow-up).
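A cleanup program along these lines might start as follows. The datasets demog and visits and the variables gender, marital, first_visit_date and followup_date are assumed names, not taken from the original project.

```sas
/* Frequencies, including missing, for categorical fields */
proc freq data=demog;
    tables gender marital / missing;
run;

/* Counts, missing counts, and ranges for numeric fields */
proc means data=demog n nmiss min max mean;
    var age weight height;
run;

/* Distribution and a histogram for a key lab value */
proc univariate data=demog;
    var cd4;
    histogram cd4;
run;

/* Keep only records where the follow-up date does not come after
   the initial visit date, so they can be reviewed */
data bad_dates;
    set visits;
    if n(first_visit_date, followup_date) = 2
       and first_visit_date >= followup_date;
run;
```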

Cleaning Data in SAS (cont.)
- Identify important fields such as summary scores and verify their values.
- Merge all longitudinal datasets to identify date inconsistencies and variable format inconsistencies, and to locate missing questionnaires.
- Merge the cross-sectional (demographics) dataset with the longitudinal datasets to identify subjects in one but not the other.
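The last check can be sketched with IN= flags on a merge by subject. The dataset names demog and visits and the flag names are assumptions for illustration.

```sas
proc sort data=demog;  by patient_id; run;
proc sort data=visits; by patient_id apptdate; run;

/* Route unmatched subjects to two review datasets */
data demog_only visits_only;
    merge demog(in=in_demog keep=patient_id)
          visits(in=in_visits keep=patient_id);
    by patient_id;
    if in_demog and not in_visits then output demog_only;
    else if in_visits and not in_demog then output visits_only;
run;
```

Note that visits_only holds one row per unmatched visit, so a subject with several visits appears several times.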

SAS Program Files
- Save all logs and outputs from SAS programs, especially when creating analysis datasets for publication.
- Naming conventions: studyx.sas, studyx.log, studyx.lst
- Only the program that generates the permanent dataset should overwrite it. Never overwrite a permanent dataset (even with a proc sort) from any other program.

Documentation
- Internally document SAS programs. At minimum include file name, location, purpose, author, date, and revisions. It may be helpful to include the names of any permanent SAS datasets created within the program.
- All SAS printouts should have at least one title, which includes the project name (ex. title 'Treatment Interruptions Analysis Dataset' ;)
- It's helpful to use the FOOTNOTE statement to display the path and file name of the SAS program on the listing (ex. footnote 'R:\AMPATH\Research\Braitstein\TxInt\TxInterrupt.sas' ;)
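A program header covering these items could look like the sketch below. The file name, path and title come from the examples above; the purpose text, dataset name and the blank author, date and revision fields are placeholders to be filled in.

```sas
/*------------------------------------------------------------------
  File:      TxInterrupt.sas
  Location:  R:\AMPATH\Research\Braitstein\TxInt\
  Purpose:   (placeholder) create the analysis dataset
  Author:    (placeholder)
  Date:      (placeholder)
  Revisions: (placeholder: date, initials, what changed)
  Creates:   (placeholder) names of permanent datasets written
------------------------------------------------------------------*/

title 'Treatment Interruptions Analysis Dataset';
footnote 'R:\AMPATH\Research\Braitstein\TxInt\TxInterrupt.sas';
```

Because TITLE and FOOTNOTE are global statements, they stay in effect for every later procedure in the session until they are changed or cleared.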

Documentation (cont.)
- If any variable values have been formatted, include a copy of the "proc format" section in the documentation.
- Generate form keys.
- Provide a description of any variables included in the datasets that are not found on the form keys.

Summary Score Documentation
Detailed algorithms of how summary scores are calculated should include the following:
a. Which variables are used to calculate which summary scores.
b. Which variables (if any) are recoded and how.
c. The minimum number of non-missing items needed to calculate the score.
d. How missing values are addressed. Typically, when calculating a total or sum score, the mean should be imputed for missing data. If the summary score is a mean itself, then the missing data can be ignored. In both of these cases it is essential that c. above is followed and that summary scores are coded as missing if there is insufficient data to calculate them.
e. The meaning of the score and how it is scaled. Indicate the possible range and how a high score differs from a low score. For example, include something like "Higher score indicates more depression".
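Items c. and d. can be made concrete with a short sketch. It assumes a hypothetical 10-item scale (q1 to q10) in a dataset called items, and a hypothetical rule that at least 8 items must be answered; the real threshold should come from the instrument's scoring instructions.

```sas
data scored;
    set items;
    array q {10} q1-q10;          /* hypothetical 10-item scale */

    n_answered = n(of q{*});      /* count of non-missing items */

    /* Mean of answered items times the number of items equals the
       sum with the person-mean imputed for each missing item.    */
    if n_answered >= 8 then total = mean(of q{*}) * dim(q);
    else total = .;               /* insufficient data: score stays missing */

    label total      = 'Total score (mean-imputed sum of q1-q10)'
          n_answered = 'Number of items answered';
run;
```

Keeping n_answered in the dataset makes it easy to report how many scores relied on imputation.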

Dataset Cover Sheet
Notes on Analysis Datasets
- Project Name:
- Principal Investigator:
- Date of Original Data Request {please attach}:
- Datasets Created:
- Name and Location of (SAS) Program used to Generate Datasets:
- Creation Date:
- Created By:
- Biostatisticians:
- Cohort:
- Derived Variables (name of variable, coding, precise description):
- SAS Formats: proc format
- Preliminary Statistics:

Practicum
Create a Dataset Cover Sheet for the dataset created during the Programming Standards Practicum (male patients with at least 2 CD4's).

SAS General Notes
- If the study is longitudinal, at least two datasets are needed: one containing the demographics and other information which does not change over time, and one containing the data for multiple time points.
- Never put cross-sectional variables such as gender in the longitudinal dataset.
- Format all date fields with a 4-digit year (ddmmyy10. or date9.).
- Choose data type numeric whenever possible.

Distributing SAS Datasets
If possible, have another data manager review the datasets and documentation before distributing.
The following should be included:
- The form keys
- All appropriate SAS datasets (should have the extension .sas7bdat)
- Dataset Cover Sheet
- Latest Data Request Form
- Any other documents or notes which would further explain the data

Distributing SAS Datasets (cont.)
In most cases the following should not be distributed:
- Any Protected Health Information (PHI) such as the study subject's name, address, phone numbers, social security number, or hospital id number.
- Date of birth should only be included if absolutely necessary; usually age can be calculated and given instead.
- SAS generation programs. These often contain PHI.
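A distribution copy can be built by computing age and then dropping the identifying fields. This is a sketch under several assumptions: a libref dist has already been assigned, the source dataset is called demog_full, the PHI variables have the names shown, and the SAS release supports the 'AGE' basis of YRDIF (9.3 or later).

```sas
data dist.demog;
    set work.demog_full;

    /* Age in completed years at enrollment, in place of date of birth */
    if n(dob, enroll_date) = 2 then
        age_at_enroll = floor(yrdif(dob, enroll_date, 'AGE'));

    /* Hypothetical PHI variable names; adjust to the actual dataset */
    drop name address phone ssn hospital_id dob;

    label age_at_enroll = 'Age at enrollment (completed years)';
run;
```

Running PROC CONTENTS on the output is a quick way to confirm that no identifying variable survived the DROP statement.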

File Maintenance & Archiving
For your own records, at minimum you should have:
- A copy of everything you give to the biostatistician or investigator, and the date given.
- A copy of the log of all the SAS programs, especially those that create any permanent SAS datasets which were passed along to others.
- Grant protocols, meeting notes, scoring algorithms, instructions for data entry, corrections made, etc.
It may be helpful to maintain a subdirectory that exactly mirrors the subdirectory of the PC where the data is actually being entered. This subdirectory would include all the RDMS programs, format files, and tables.
For longitudinal studies in particular, it is important to archive the datasets and SAS programs/logs that were used for analysis for abstracts, papers, grant proposals, and other publications.

Data Managers Working with Investigators and Biostatisticians
- Attend study meetings
- Date all documents and meeting notes
- Comment on proposed study changes
- Understand the statistical analysis plan
- Review statistical reports (preferably before they are presented to the research team)
- Review and critique abstracts/manuscripts
Your contribution is EXTREMELY important!
