PREPARING DATA FOR STATISTICAL ANALYSIS: Data Cleaning, Dataset Preparation, Documentation. 9 September 2008.


PREPARING DATA FOR STATISTICAL ANALYSIS
Data Cleaning, Dataset Preparation, Documentation
9 September 2008
Beverly Musick, Indiana University

Raw Data Cleaning
For data that are stored in Access, Excel, or text files, data cleaning should begin with the original table, spreadsheet, or file.
- Back up the original data files.
- Eliminate blank records and any records used for testing.
- Locate duplicate records and resolve them.
- For numeric variables, identify outliers by sorting and reviewing the overall minimum and maximum. This is particularly useful for continuous variables such as dates, ages, and weights.
- For categorical variables such as gender or travel time to clinic, sorting will reveal invalid response codes or use of mixed case (f, F, m, M for gender).
- Sorting also makes it possible to assess the amount of missing data. Does it make sense that x records have no value for variable y?

Raw Data to SAS Datasets
Create a SAS program that converts the database file(s) to permanent SAS dataset(s).
- For Access or Excel files, 'Proc Import' can be used:

    /* Read the Access table tblDEMOG into WORK.demog */
    PROC IMPORT OUT=WORK.demog
                DATATABLE="tblDEMOG"
                DBMS=ACCESS REPLACE;
        DATABASE="I:\Projects\Kenya\CFAR\cfar.mdb";
        DBPWD='password';
    RUN;

- For text files, write a specific input statement:

    /* Read a 9-character patient ID from each line of the text file */
    data copd ;
        infile 'c:\kenya\hiv\copd.txt' ;
        input patientid $9. ;
    run ;
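The import example above writes to the temporary WORK library, while the stated goal is a permanent dataset. A minimal sketch of the extra step follows; the libref `proj` and the `sasdata` folder are assumptions made for illustration and are not from the original slides.

```sas
/* Assumption: 'sasdata' is a hypothetical folder; substitute the project's own path. */
libname proj 'I:\Projects\Kenya\CFAR\sasdata';

/* Copy the imported WORK table into the permanent library. */
data proj.demog;
    set work.demog;
run;

/* Confirm the variables and the observation count of the stored dataset. */
proc contents data=proj.demog;
run;
```

Datasets in WORK are deleted when the SAS session ends, so anything intended for later analysis needs a two-level name that points to a library such as the one sketched here.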

Raw Data to SAS Datasets (cont.)
- Merge or append (concatenate) tables as necessary.
- Double-check the merging process by looking at the number of observations in each dataset before and after the merge.

    831  data visit ; set h.hivvisit2(keep=patientid apptdate age weight height bmi cd4) ;
    832  if patientid in ('1271BS-1',' ','3280CH-4','4709KT-6','625NT-5') ;
    833  run ;
    NOTE: There were observations read from the data set H.HIVVISIT2.
    NOTE: The data set WORK.VISIT has 71 observations and 7 variables.

    843  data vis2 ; set h.hivvisit2(keep=patientid apptdate clinic hgb sao2) ;
    844  if patientid in ('13836MT-4','4709KT-6','625NT-5') ;
    845  run ;
    NOTE: There were observations read from the data set H.HIVVISIT2.
    NOTE: The data set WORK.VIS2 has 46 observations and 5 variables.

    846  data bothvis ; merge visit vis2 ;
    847  by patientid apptdate ;
    848  run ;
    NOTE: There were 71 observations read from the data set WORK.VISIT.
    NOTE: There were 46 observations read from the data set WORK.VIS2.
    NOTE: The data set WORK.BOTHVIS has 83 observations and 10 variables.

- The number of records depends on the overlap among the datasets. This relationship should be known in advance and the expected outcome confirmed.
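One way to see the overlap directly, rather than inferring it from the observation counts, is to keep IN= flags during the merge. This is a sketch only: the dataset and key names come from the log above, while the flag variables `from_visit` and `from_vis2` are hypothetical names introduced here.

```sas
/* MERGE with a BY statement expects both inputs sorted by the keys. */
proc sort data=visit; by patientid apptdate; run;
proc sort data=vis2;  by patientid apptdate; run;

data bothvis;
    merge visit(in=in_visit) vis2(in=in_vis2);
    by patientid apptdate;
    /* IN= variables are temporary, so copy them to keep them in the output. */
    from_visit = in_visit;
    from_vis2  = in_vis2;
run;

/* The cross-tabulation shows records found in both datasets or in only one. */
proc freq data=bothvis;
    tables from_visit*from_vis2 / norow nocol nopercent;
run;
```

If each patientid/apptdate pair occurs at most once per dataset, the log counts above imply 71 + 46 - 83 = 34 matched records, and the cell where both flags equal 1 should agree with that figure. The two flag variables would be dropped before the dataset is stored.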

Raw Data to SAS Datasets (cont.)
- Confirm that the total number of variables in the merged dataset is correct.
- The number should be the sum of all variables minus (number of key fields * (number of datasets in merge minus 1)).
  In the previous example: 7 + 5 - 2*(2-1) = 10
- If the number of variables is less than this, then the same non-key variable(s) appear in more than one of the datasets. This should be strictly avoided.
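The variable counts can also be pulled from the SAS dictionary tables instead of being read off the log. The query below is a sketch that assumes the three datasets from the previous example (VISIT with 7 variables, VIS2 with 5, and the merged BOTHVIS) still exist in the WORK library.

```sas
/* Count the variables in each dataset involved in the merge. */
proc sql;
    select memname, count(*) as n_variables
    from dictionary.columns
    where libname = 'WORK'
      and memname in ('VISIT', 'VIS2', 'BOTHVIS')
    group by memname;
quit;
```

Library and member names are stored in upper case in the dictionary tables, which is why the WHERE clause uses capitals.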

Raw Data to SAS Datasets (cont.)
Investigate messages such as:
- "NOTE: MERGE statement has more than one data set with repeats of BY values."
- "Variable _____ is uninitialized"
- "Variable _____ has never been referenced"
- "Character values have been converted to numeric…"
- "Variable _____ has been defined as both character and numeric"
- "WARNING: Multiple lengths were specified for the BY variable _____ by input data sets. This may cause unexpected results."

SAS Dataset Creation
To create permanent datasets for analysis:
- Recode missing values used in the raw data tables/files to appropriate SAS missing values. For example, if 9's were used to indicate missing data for numeric fields in a data table, then these should be converted to periods (.), the SAS numeric missing value.
- Calculate appropriate summary scores (e.g., AUDIT-3, BMI).
- Calculate differences between dates, such as time from enrollment to ART initiation.
- Label all calculated and created variables.
- Attach formats to the variable values where necessary.
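The steps above can be sketched in a single DATA step. Everything in this example is hypothetical and chosen only for illustration: the dataset names, the variable names, the missing-value codes 9 and 999, and the coding of `sex`.

```sas
/* Hypothetical raw data: 9 and 999 stand for "missing" in the source table. */
data rawvisit;
    input patientid $ sex weight height enrolldate :date9. artdate :date9.;
    datalines;
A001 1 62.5 1.70 05JAN2007 20MAR2007
A002 2 999 1.58 12FEB2007 .
A003 9 55.0 1.62 01MAR2007 15MAR2007
;
run;

proc format;
    value sexf 1 = 'Male' 2 = 'Female';
run;

data visitclean;
    set rawvisit;
    /* Recode the raw missing-value codes to the SAS numeric missing value. */
    if sex = 9 then sex = .;
    if weight = 999 then weight = .;
    /* Summary score: BMI stays missing when either input is missing. */
    if weight ne . and height ne . then bmi = weight / height**2;
    /* SAS dates are day counts, so subtraction gives elapsed days. */
    if artdate ne . and enrolldate ne . then days_to_art = artdate - enrolldate;
    label bmi         = 'Body mass index (kg/m^2)'
          days_to_art = 'Days from enrollment to ART initiation';
    format sex sexf. enrolldate artdate date9.;
run;
```

Recoding before calculating matters: if 999 were left in `weight`, the BMI for that record would be computed from the placeholder rather than set to missing.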

Cleaning Data in SAS
Create a cleanup program.
- Generate frequencies, means, and univariates to better understand the dataset and to check for invalid data.
- Plot the data.
- For the numeric and date fields, look at minimums and maximums to verify all values are within expected range.
- Locate duplicate records and resolve them.
- Compare fields when appropriate (e.g., dob and age; confirm date of initial visit < date of follow-up).
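A cleanup program along these lines might look like the sketch below. It assumes the WORK.VISIT and WORK.VIS2 datasets from the earlier merge example; the output dataset names `visit_nodup` and `visit_dups` are hypothetical.

```sas
title 'Cleanup checks';

/* Count, missing count, minimum, and maximum for the numeric fields. */
proc means data=visit n nmiss min max;
    var age weight height bmi cd4;
run;

/* Earliest and latest appointment dates, displayed as dates. */
proc sql;
    select min(apptdate) as first_appt format=date9.,
           max(apptdate) as last_appt  format=date9.
    from visit;
quit;

/* Distribution details and extreme observations, labeled by patient. */
proc univariate data=visit;
    var cd4;
    id patientid;
run;

/* Frequencies for a categorical field, with missing values counted. */
proc freq data=vis2;
    tables clinic / missing;
run;

/* Route records with a repeated key to a separate dataset for review. */
proc sort data=visit out=visit_nodup dupout=visit_dups nodupkey;
    by patientid apptdate;
run;
```

Writing the de-duplicated result to a new dataset with OUT= leaves the input untouched, so the duplicates in `visit_dups` can be resolved before anything is overwritten.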

Cleaning Data in SAS (cont.)
- Identify important fields such as summary scores and verify their values.
- Merge all longitudinal datasets to identify date inconsistencies, variable format inconsistencies, and to locate missing questionnaires.
- Merge cross-sectional (demographics) dataset with longitudinal datasets to identify subjects in one but not the other.
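The last check can be sketched with IN= flags. The example assumes that the demographics dataset `demog` carries the same `patientid` key as the longitudinal dataset `visit`; the slides do not confirm this, and the output names `demog_only` and `visit_only` are hypothetical.

```sas
proc sort data=demog; by patientid; run;
proc sort data=visit; by patientid; run;

data demog_only visit_only;
    merge demog(in=in_demog keep=patientid)
          visit(in=in_visit keep=patientid);
    by patientid;
    /* Keep one row per subject. */
    if first.patientid;
    if in_demog and not in_visit then output demog_only;
    else if in_visit and not in_demog then output visit_only;
run;
```

Both output datasets should normally be empty or explainable, for example subjects who enrolled but have not yet had a visit.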

SAS Program Files
- Save all logs and outputs from SAS programs, especially when creating analysis datasets for publication.
- Naming conventions: studyx.sas, studyx.log, studyx.lst
- Only the program that generates the permanent dataset should overwrite it.
- Never overwrite a permanent dataset (even with a proc sort) from any other program.
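When a program is run interactively, one way to capture the log and listing under the naming convention above is PROC PRINTTO. This is a sketch; the folder is taken from the example project structure elsewhere in this presentation and is assumed to exist.

```sas
/* Route the log and the listing output to files; NEW replaces old copies. */
proc printto log='I:\projects\studyname\studyx.log'
             print='I:\projects\studyname\studyx.lst' new;
run;

/* Statements of studyx.sas go here. */

/* Restore the default log and output destinations. */
proc printto;
run;
```

Programs submitted in batch mode normally write a .log and .lst file next to the program, in which case this step may be unnecessary.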

Documentation
- Internally document SAS programs. At minimum include file name, location, purpose, author, date, and revisions.
- It may be helpful to include the names of any permanent SAS datasets created within the program.
- All SAS printouts should have at least one title, which includes the project name ("title" statement).
- It's helpful to use the footnote statement to display the path and file name of the SAS program on the listing. [EX: footnote 'I:\alz\clin\cperm.sas' ; ]
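A header block covering the minimum items listed above might be laid out as follows. The layout is only a suggestion, and the parenthesized entries are placeholders to be filled in; the file path is the one used in the footnote example.

```sas
/*---------------------------------------------------------------
  File:      cperm.sas
  Location:  I:\alz\clin\
  Purpose:   (what the program does)
  Author:    (name)
  Date:      (date written)
  Revisions: (date, initials, change made)
  Creates:   (permanent SAS datasets written by this program)
----------------------------------------------------------------*/
title1 'Project name: (fill in)';
footnote1 'I:\alz\clin\cperm.sas';
```

TITLE and FOOTNOTE are global statements, so they stay in effect for every later procedure in the session until they are changed or cleared.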

Documentation (cont.)
- If any variable values have been formatted, include a copy of the "proc format" section in the documentation.
- Generate form keys.
- Provide a description of any variables included in the datasets that are not found on the form keys.

Documentation (cont.)
Detailed algorithms of how summary scores are calculated should include the following:
a. Which variables are used to calculate which summary scores.
b. Which variables (if any) are recoded and how.
c. The minimum number of non-missing items needed to calculate the score.
d. How missing values are addressed. Typically, when calculating a total or sum score, the mean should be imputed for missing data. If the summary score is a mean itself, then the missing data can be ignored. In both of these cases it is essential that item c above is followed and that summary scores are coded as missing if there is insufficient data to calculate them.
e. The meaning of the score and how it is scaled. Indicate the possible range and how a high score differs from a low score. For example, include something like "Higher score indicates more depression".
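Items c and d can be illustrated with a short scoring step. The five-item scale, the variable names, and the rule that at least 4 of 5 items must be present are all hypothetical; a real instrument's scoring manual would set these.

```sas
/* Hypothetical item responses; a period marks a missing item. */
data scores;
    input id item1-item5;
    datalines;
1 1 2 3 2 1
2 1 . 3 2 1
3 . . . 2 1
;
run;

data scored;
    set scores;
    /* Number of non-missing items. */
    n_items = n(of item1-item5);
    /* Hypothetical rule: at least 4 of the 5 items are required. */
    if n_items >= 4 then do;
        /* MEAN ignores missing items. */
        mean_score = mean(of item1-item5);
        /* Mean imputation: the item mean times the number of items. */
        total_score = mean_score * 5;
    end;
    label mean_score  = 'Mean of items 1-5'
          total_score = 'Total of items 1-5, missing items mean-imputed';
run;
```

Multiplying the item mean by the number of items gives the same total as replacing each missing item with that subject's mean. The third subject has only two answered items, so both scores are left missing.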

SAS General Notes
- If the study is longitudinal, at least two datasets are needed: one containing the demographics and other information which does not change over time, and one containing the data for multiple time points.
- Never put cross-sectional variables such as gender in the longitudinal dataset.
- Format all date fields with 4-digit year (ddmmyy10. or date9.).
- Choose data type numeric whenever possible.

Distributing SAS Datasets
After a senior data manager has reviewed the datasets and documentation, the statistician should be given READ ONLY access to:
- The form keys
- All appropriate SAS datasets (should have the extension .sas7bdat)
- A description of any variables included in the datasets that are not found on the form keys
- Notes on calculation of the summary scores
- Proc format statements
- Any other documents or notes which would further explain the data.

Distributing SAS Datasets (cont.)
Statisticians should not be given nor have access to:
- Any Protected Health Information (PHI) such as the study subject's name, address, phone numbers, social security number, or hospital id number. Date of birth should only be included if absolutely necessary; usually age can be calculated and given instead.
- Your SAS generation programs. These often contain PHI. If you must share SAS programs with the statisticians, please carefully review the programs and then copy them to a separate folder to which they have read access rather than giving access to your main folder.
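Replacing date of birth with age before distribution can be sketched as below. The dataset names, the variable names, and the choice of enrollment date as the reference date are assumptions made for illustration.

```sas
/* Hypothetical demographics containing date of birth. */
data demog_phi;
    input patientid $ dob :date9. enrolldate :date9.;
    format dob enrolldate date9.;
    datalines;
A001 15JUN1970 05JAN2007
A002 29FEB1980 28FEB2007
;
run;

data demog_share;
    set demog_phi;
    /* Completed years: count the month boundaries crossed, subtract one
       month if the birth day has not yet been reached in the final month,
       then convert months to whole years. */
    age = floor((intck('month', dob, enrolldate)
                 - (day(enrolldate) < day(dob))) / 12);
    label age = 'Age at enrollment (years)';
    /* Date of birth is not passed on. */
    drop dob;
run;
```

Only the derived dataset, with `dob` dropped, would be copied to the folder that statisticians can read.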

Distributing SAS Datasets (cont.)
For your own records, at minimum you should have:
- A copy of everything you give to the statistician and the date given.
- A copy of the log of all the SAS programs, especially those that create any permanent SAS datasets which were passed along to others.
- Grant protocols, meeting notes, scoring algorithms, instructions for data entry, corrections made, etc.
- It may be helpful to maintain a subdirectory that exactly mirrors the subdirectory of the pc where the data is actually being entered. This subdirectory would include all the RDMS programs, format files, and tables.
- For longitudinal studies in particular, it is important to archive datasets and SAS programs/logs which were used for analysis for abstracts, papers, grant proposals, and other publications.

Organizing Project Folders
Example of folder structure:
- I:\projects\studyname: contains raw data, documentation, SAS programs, etc.
- I:\projects\studyname\Datasets: stores datasets that have been approved for distribution. May also include the SAS formats in this folder. Statisticians should have READ ONLY access to this folder.
- I:\projects\studyname\Keys: stores the form keys, the scoring algorithms, and other data documentation. Statisticians should have READ ONLY access to this folder.
- I:\projects\studyname\Grant: stores the original grant application, protocols, papers, etc. All data management staff and statisticians involved in this project should have full access to this folder.
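Within SAS, a library can also be assigned read-only so that a session cannot modify the distributed datasets by accident. The sketch below uses the Datasets folder from the example structure; the libref `dist` is a hypothetical name.

```sas
/* ACCESS=READONLY stops this session from writing to the library. */
libname dist 'I:\projects\studyname\Datasets' access=readonly;

/* List the datasets available in the folder without their details. */
proc contents data=dist._all_ nods;
run;
```

This protects only the current SAS session and is a complement to, not a substitute for, read-only permissions set on the folder itself.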

DM Working with Biostatisticians
- Attend study meetings
- Date all documents and meeting notes
- Comment on proposed study changes
- Understand the statistical analysis plan
- Review statistical reports (preferably before they are presented to the research team)
- Review and critique abstracts/manuscripts