PREPARING DATA FOR STATISTICAL ANALYSIS
Data Cleaning, Dataset Preparation, Documentation
9 September 2008
Beverly Musick, Indiana University

Raw Data Cleaning
For data stored in Access, Excel, or text files, data cleaning should begin with the original table, spreadsheet, or file.
- Back up the original data files.
- Eliminate blank records and any records used for testing.
- Locate duplicate records and resolve them.
- For numeric variables, identify outliers by sorting and reviewing the overall minimum and maximum. This is particularly useful for continuous variables such as dates, ages, and weights.
- For categorical variables such as gender or travel time to clinic, sorting will reveal invalid response codes or use of mixed case (f, F, m, M for gender).
- Sorting also shows how much data is missing. Does it make sense that x records have no value for variable y?

Raw Data to SAS Datasets
Create a SAS program that converts the database file(s) to permanent SAS dataset(s).
- For Access or Excel files, PROC IMPORT can be used:
    /* read the Access table tblDEMOG into WORK.demog */
    PROC IMPORT OUT=WORK.demog
        DATATABLE="tblDEMOG"
        DBMS=ACCESS REPLACE;
      DATABASE="I:\Projects\Kenya\CFAR\cfar.mdb";
      DBPWD='password';
    RUN;
- For text files, write a specific INPUT statement:
    /* read a 9-character patient ID starting in column 1 of each record */
    data copd ;
      infile 'c:\kenya\hiv\copd.txt' ;
      input @1 patientid $9. ;
    run ;

Raw Data to SAS Datasets (cont.)
- Merge or append (concatenate) tables as necessary.
- Double-check the merging process by looking at the number of observations in each dataset before and after the merge. Example log:
    data visit ; set h.hivvisit2(keep=patientid apptdate age weight height bmi cd4) ;
      if patientid in ('1271BS-1','26277-4','3280CH-4','4709KT-6','625NT-5') ;
    run ;
    NOTE: There were 933654 observations read from the data set H.HIVVISIT2.
    NOTE: The data set WORK.VISIT has 71 observations and 7 variables.

    data vis2 ; set h.hivvisit2(keep=patientid apptdate clinic hgb sao2) ;
      if patientid in ('13836MT-4','4709KT-6','625NT-5') ;
    run ;
    NOTE: There were 933654 observations read from the data set H.HIVVISIT2.
    NOTE: The data set WORK.VIS2 has 46 observations and 5 variables.

    data bothvis ; merge visit vis2 ;
      by patientid apptdate ;
    run ;
    NOTE: There were 71 observations read from the data set WORK.VISIT.
    NOTE: There were 46 observations read from the data set WORK.VIS2.
    NOTE: The data set WORK.BOTHVIS has 83 observations and 10 variables.
- The number of records depends on the overlap among the datasets. This relationship should be known in advance and the expected outcome confirmed.
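One way to make the overlap explicit, rather than inferring it from the observation counts alone, is to tag each merged record with its source using the IN= dataset option. The sketch below uses small made-up datasets; the variable `source` and all data values are assumptions for illustration and do not come from the study data.

```sas
/* Hypothetical miniature inputs; values are for illustration only */
data visit ;
  input patientid :$9. apptdate :date9. cd4 ;
  format apptdate date9. ;
  datalines ;
A-1 01JAN2008 350
A-1 01FEB2008 410
B-2 15JAN2008 220
;
run ;

data vis2 ;
  input patientid :$9. apptdate :date9. hgb ;
  format apptdate date9. ;
  datalines ;
A-1 01FEB2008 12.1
C-3 20JAN2008 10.4
;
run ;

/* MERGE with BY requires both inputs to be sorted by the key fields */
proc sort data=visit ; by patientid apptdate ; run ;
proc sort data=vis2  ; by patientid apptdate ; run ;

data bothvis ;
  merge visit(in=a) vis2(in=b) ;
  by patientid apptdate ;
  length source $10 ;
  /* a and b are temporary 0/1 flags showing which input contributed */
  if a and b then source = 'both' ;
  else if a then source = 'visit only' ;
  else source = 'vis2 only' ;
run ;

/* Tabulate where the merged records came from */
proc freq data=bothvis ;
  tables source ;
run ;
```

With these toy inputs the merged dataset should have 3 + 2 - 1 = 4 records, since exactly one patientid/apptdate combination appears in both inputs. The same arithmetic applies to the log above: 71 + 46 - 83 = 34 records were matched.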

Raw Data to SAS Datasets (cont.)
- Confirm that the total number of variables in the merged dataset is correct.
- The number should be the sum of the variables in all input datasets minus (number of key fields * (number of datasets in the merge - 1)).
  In the previous example: 7 + 5 - 2*(2-1) = 10
- If the number of variables is less than this, the same non-key variable appears in more than one of the datasets. This should be strictly avoided, because the merged dataset keeps only one copy of that variable and the values from one dataset overwrite the other.
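The variable-count check can also be done by listing the names that occur in more than one input dataset before merging. The following is a minimal sketch that queries the DICTIONARY.COLUMNS table in PROC SQL; the two tiny datasets and the deliberately repeated variable `weight` are assumptions made up for the example.

```sas
/* Hypothetical inputs: weight is deliberately present in both */
data visit ;
  length patientid $9 ;
  patientid = 'A-1' ; apptdate = '01JAN2008'd ; age = 34 ; weight = 61 ;
run ;

data vis2 ;
  length patientid $9 ;
  patientid = 'A-1' ; apptdate = '01JAN2008'd ; hgb = 12.1 ; weight = 61 ;
run ;

proc sql ;
  title 'Variables present in more than one input dataset' ;
  /* DICTIONARY.COLUMNS stores library and member names in upper case */
  select upcase(name) as varname, count(*) as n_datasets
    from dictionary.columns
    where libname = 'WORK' and memname in ('VISIT', 'VIS2')
    group by calculated varname
    having count(*) > 1 ;
quit ;
title ;
```

The key fields (here patientid and apptdate) are expected to appear in the listing. Any other name, such as weight in this toy case, is a variable that would be overwritten during the merge and should be renamed or dropped from one input first.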

Raw Data to SAS Datasets (cont.)
Investigate log messages such as:
- "NOTE: MERGE statement has more than one data set with repeats of BY values."
- "Variable _____ is uninitialized"
- "Variable _____ has never been referenced"
- "Character values have been converted to numeric..."
- "Variable _____ has been defined as both character and numeric"
- "Warning: Multiple lengths were specified for the BY variable _____ by input data sets. This may cause unexpected results."
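As an illustration of resolving one of these messages, the "Character values have been converted to numeric" note typically appears when a character variable is used in arithmetic. A minimal sketch of one possible fix, an explicit conversion with the INPUT function, is shown below; the variable names `cd4_txt`, `cd4`, and `cd4_half` and the data values are hypothetical.

```sas
/* Hypothetical raw data in which a lab value was stored as text */
data raw ;
  input patientid :$9. cd4_txt :$8. ;
  datalines ;
A-1 350
B-2 410
;
run ;

data fixed ;
  set raw ;
  /* cd4_half = cd4_txt / 2 ;  would work, but only through an      */
  /* automatic conversion that writes the NOTE to the log           */
  cd4 = input(cd4_txt, 8.) ;   /* explicit character-to-numeric conversion */
  cd4_half = cd4 / 2 ;
  drop cd4_txt ;
run ;

proc print data=fixed ; run ;
```

Making the conversion explicit keeps the log clean, so any remaining conversion notes point to places where a variable's type is genuinely unexpected.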

SAS Dataset Creation
To create permanent datasets for analysis:
- Recode missing values used in the raw data tables/files to appropriate SAS missing values. For example, if 9's were used to indicate missing data for numeric fields in a data table, these should be converted to the SAS missing value (.).
- Calculate appropriate summary scores (e.g. AUDIT-3, BMI).
- Calculate differences between dates, such as time from enrollment to ART initiation.
- Label all calculated and created variables.
- Attach formats to the variable values where necessary.
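The steps above can be sketched in a single DATA step. Everything specific in this example is an assumption for illustration: the variable names, the missing-value codes (9 for sex, 999 for weight and height), the format `sexf`, and the data values.

```sas
proc format ;
  value sexf 1 = 'Male' 2 = 'Female' ;
run ;

data analysis ;
  input patientid :$9. sex weight_kg height_cm
        enroll_dt :date9. art_dt :date9. ;

  /* Recode the raw missing codes (assumed here) to SAS missing */
  if sex = 9 then sex = . ;
  if weight_kg = 999 then weight_kg = . ;
  if height_cm = 999 then height_cm = . ;

  /* BMI = kg / m**2, calculated only when both inputs are present */
  if nmiss(weight_kg, height_cm) = 0 and height_cm > 0 then
    bmi = weight_kg / (height_cm / 100)**2 ;

  /* SAS dates are day counts, so a difference is a number of days */
  if nmiss(enroll_dt, art_dt) = 0 then
    days_to_art = art_dt - enroll_dt ;

  label bmi         = 'Body mass index (kg/m2)'
        days_to_art = 'Days from enrollment to ART initiation' ;
  format sex sexf. enroll_dt art_dt date9. bmi 5.1 ;
  datalines ;
A-1 1 70 175 01JAN2008 15MAR2008
B-2 9 999 160 10FEB2008 .
;
run ;

proc print data=analysis label ; run ;
```

Guarding each calculation with NMISS avoids the "Missing values were generated" note in the log, which makes genuinely unexpected notes easier to spot.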

Cleaning Data in SAS
Create a cleanup program.
- Generate frequencies, means, and univariates to better understand the dataset and to check for invalid data.
- Plot the data.
- For the numeric and date fields, look at minimums and maximums to verify that all values are within the expected range.
- Locate duplicate records and resolve them.
- Compare fields when appropriate (e.g. check dob against age, and confirm that date of initial visit < date of follow-up).
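A cleanup program along these lines might look like the sketch below. The dataset `visits`, its variables, and the planted problems (a duplicate record, mixed-case gender, an implausible weight, and a follow-up date before the first visit) are all made up for illustration.

```sas
/* Hypothetical data containing several deliberate problems */
data visits ;
  input patientid :$9. gender :$1. age weight
        first_dt :date9. followup_dt :date9. ;
  format first_dt followup_dt date9. ;
  datalines ;
A-1 F 34 61 01JAN2008 01FEB2008
A-1 F 34 61 01JAN2008 01FEB2008
B-2 m 29 580 15JAN2008 10JAN2008
C-3 M 41 72 20JAN2008 20MAR2008
;
run ;

/* Frequencies expose invalid codes and mixed case */
proc freq data=visits ;
  tables gender / missing ;
run ;

/* Minimum and maximum expose out-of-range values */
proc means data=visits n nmiss min max mean ;
  var age weight ;
run ;

proc univariate data=visits ;
  var weight ;
run ;

/* Records that repeat the key are written to visits_dups for review */
proc sort data=visits out=visits_nodup dupout=visits_dups nodupkey ;
  by patientid first_dt ;
run ;

/* Field comparison: the first visit should precede the follow-up */
data date_problems ;
  set visits ;
  if nmiss(first_dt, followup_dt) = 0 and first_dt >= followup_dt ;
run ;

proc print data=date_problems ; run ;
```

Writing the sorted result to a new dataset with OUT= leaves the input untouched, and DUPOUT= keeps the removed records so each duplicate can be resolved deliberately rather than silently dropped.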

Cleaning Data in SAS (cont.)
- Identify important fields such as summary scores and verify their values.
- Merge all longitudinal datasets to identify date inconsistencies and variable format inconsistencies, and to locate missing questionnaires.
- Merge the cross-sectional (demographics) dataset with the longitudinal datasets to identify subjects in one but not the other.
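Subjects that appear in one dataset but not the other can be listed directly. The slide describes doing this with a merge; the sketch below swaps in PROC SQL subqueries as an alternative way to get the same lists. The datasets `demog` and `visits` and their contents are hypothetical.

```sas
/* Hypothetical cross-sectional dataset: one record per subject */
data demog ;
  input patientid :$9. gender :$1. ;
  datalines ;
A-1 F
B-2 M
D-4 F
;
run ;

/* Hypothetical longitudinal dataset: one record per visit */
data visits ;
  input patientid :$9. apptdate :date9. ;
  format apptdate date9. ;
  datalines ;
A-1 01JAN2008
A-1 01FEB2008
B-2 15JAN2008
C-3 20JAN2008
;
run ;

proc sql ;
  /* Subjects with visits but no demographics record */
  create table visits_no_demog as
    select distinct patientid
    from visits
    where patientid not in (select patientid from demog) ;

  /* Subjects with a demographics record but no visits */
  create table demog_no_visits as
    select patientid
    from demog
    where patientid not in (select patientid from visits) ;
quit ;

proc print data=visits_no_demog ; run ;
proc print data=demog_no_visits ; run ;
```

In this toy example C-3 has a visit but no demographics record, and D-4 has a demographics record but no visits.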

SAS Program Files
- Save all logs and outputs from SAS programs, especially when creating analysis datasets for publication.
- Naming conventions: studyx.sas, studyx.log, studyx.lst
- Only the program that generates the permanent dataset should overwrite it.
- Never overwrite a permanent dataset (even with a proc sort) from any other program.
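The PROC SORT case is easy to get wrong, because PROC SORT without OUT= replaces its input dataset in place. A minimal sketch of the safer pattern follows; `sashelp.class` (a sample dataset shipped with SAS) stands in for a project's permanent dataset, and the name `class_by_age` is made up.

```sas
/* Risky when run against a permanent dataset from another program:   */
/*   proc sort data=h.hivvisit2 ; by patientid ; run ;                */
/* With no OUT=, the sorted data replace the permanent dataset.       */

/* Safer: direct the sorted copy to a temporary WORK dataset */
proc sort data=sashelp.class out=work.class_by_age ;
  by age ;
run ;

proc print data=work.class_by_age(obs=5) ; run ;
```

The permanent dataset stays exactly as its generating program left it, and the sorted copy disappears when the SAS session ends.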

Documentation
- Internally document SAS programs. At minimum include file name, location, purpose, author, date, and revisions.
- It may be helpful to include the names of any permanent SAS datasets created within the program.
- All SAS printouts should have at least one title, which includes the project name ("title" statement).
- It's helpful to use the footnote statement to display the path and file name of the SAS program on the listing. [EX: footnote 'I:\alz\clin\cperm.sas' ; ]
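A program header along these lines could look like the following sketch. The header fields follow the list above; the purpose text, the project name in the title, and the use of `sashelp.class` as sample data are placeholders, not details from any real project.

```sas
/*----------------------------------------------------------------
  File:       cperm.sas
  Location:   I:\alz\clin
  Purpose:    (placeholder) create permanent clinical datasets
  Author:     (name)
  Date:       (date written)
  Revisions:  (date, initials, what changed)
  Permanent datasets created: (list dataset names here)
----------------------------------------------------------------*/

/* The title carries the project name, the footnote the program path */
title1 'Project name (placeholder)' ;
footnote1 'I:\alz\clin\cperm.sas' ;

proc print data=sashelp.class(obs=5) ;
run ;

/* Clear the title and footnote so they do not carry over */
title ;
footnote ;
```

Because TITLE and FOOTNOTE stay in effect until changed or cleared, every listing produced between the two pairs of statements shows both the project and the program that generated it.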

Documentation (cont.)
- If any variable values have been formatted, include a copy of the "proc format" section in the documentation.
- Generate form keys.
- Provide a description of any variables included in the datasets that are not found on the form keys.

Documentation (cont.)
Detailed algorithms of how summary scores are calculated should include the following:
a. Which variables are used to calculate which summary scores.
b. Which variables (if any) are recoded, and how.
c. The minimum number of non-missing items needed to calculate the score.
d. How missing values are addressed. Typically, when calculating a total or sum score, the mean of the non-missing items should be imputed for missing data. If the summary score is itself a mean, the missing data can be ignored. In both cases it is essential that c. above is followed and that summary scores are coded as missing if there is insufficient data to calculate them.
e. The meaning of the score and how it is scaled. Indicate the possible range and how a high score differs from a low score. For example, include something like "Higher score indicates more depression".
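Points c. and d. can be sketched in a DATA step as follows. The five-item scale `q1-q5`, the threshold of four answered items, and the data values are assumptions for illustration; a real score would use the rules in its own scoring algorithm.

```sas
data scores ;
  input id q1-q5 ;
  array q{5} q1-q5 ;

  /* c. count the non-missing items */
  n_answered = n(of q{*}) ;

  /* Assumed rule: at least 4 of the 5 items must be answered */
  if n_answered >= 4 then do ;
    /* MEAN ignores missing items */
    mean_score = mean(of q{*}) ;
    /* d. total with the person's item mean imputed for missing items */
    total_score = mean_score * dim(q) ;
  end ;
  /* otherwise both scores are left missing */

  label n_answered  = 'Number of items answered'
        mean_score  = 'Mean item score'
        total_score = 'Total score (item mean imputed for missing items)' ;
  datalines ;
1 2 3 1 0 2
2 2 . 1 0 1
3 . . 1 0 2
;
run ;

proc print data=scores ; run ;
```

Multiplying the mean of the answered items by the number of items gives the same total as replacing each missing item with that mean. Subject 2 answers four items summing to 4, so the imputed total is 5. Subject 3 answers only three items and is left missing.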

SAS General Notes
- If the study is longitudinal, at least two datasets are needed: one containing the demographics and other information which does not change over time, and one containing the data for multiple time points.
- Never put cross-sectional variables such as gender in the longitudinal dataset.
- Format all date fields with a 4-digit year (ddmmyy10. or date9.).
- Choose data type numeric whenever possible.

Distributing SAS Datasets
After a senior data manager has reviewed the datasets and documentation, the statistician should be given READ ONLY access to:
- The form keys
- All appropriate SAS datasets (should have the extension .sas7bdat)
- A description of any variables included in the datasets that are not found on the form keys
- Notes on calculation of the summary scores
- Proc format statements
- Any other documents or notes which would further explain the data

Distributing SAS Datasets (cont.)
Statisticians should not be given, or have access to:
- Any Protected Health Information (PHI) such as the study subject's name, address, phone numbers, social security number, or hospital id number. Date of birth should only be included if absolutely necessary; usually age can be calculated and given instead.
- Your SAS generation programs. These often contain PHI. If you must share SAS programs with the statisticians, carefully review the programs and then copy them to a separate folder to which the statisticians have read access, rather than giving access to your main folder.
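Replacing date of birth with a calculated age can be sketched as below. The datasets `demog_phi` and `demog_share`, the choice of enrollment date as the reference date, and the data values are assumptions for illustration.

```sas
/* Hypothetical source dataset that still contains date of birth */
data demog_phi ;
  input patientid :$9. dob :date9. enroll_dt :date9. ;
  format dob enroll_dt date9. ;
  datalines ;
A-1 15MAR1970 14MAR2008
B-2 15MAR1970 15MAR2008
;
run ;

data demog_share ;
  set demog_phi ;
  /* Completed years: count month boundaries crossed, subtract one  */
  /* month if the birthday has not yet occurred within the month    */
  if nmiss(dob, enroll_dt) = 0 then
    age = floor((intck('month', dob, enroll_dt)
                 - (day(enroll_dt) < day(dob))) / 12) ;
  label age = 'Age at enrollment (completed years)' ;
  drop dob ;   /* date of birth is not passed on */
run ;

proc print data=demog_share label ; run ;
```

In the toy data, subject A-1 is one day short of a 38th birthday at enrollment and gets age 37, while B-2 enrolls on the birthday and gets age 38. The shared dataset carries age but no date of birth.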

Distributing SAS Datasets (cont.)
For your own records, at minimum, you should have:
- A copy of everything you give to the statistician and the date given.
- A copy of the log of all the SAS programs, especially those that create any permanent SAS datasets which were passed along to others.
- Grant protocols, meeting notes, scoring algorithms, instructions for data entry, corrections made, etc.
- It may be helpful to maintain a subdirectory that exactly mirrors the subdirectory of the PC where the data is actually being entered. This subdirectory would include all the RDMS programs, format files, and tables.
- For longitudinal studies in particular, it is important to archive the datasets and SAS programs/logs that were used for analysis for abstracts, papers, grant proposals, and other publications.

Organizing Project Folders
Example of folder structure:
- I:\projects\studyname: contains raw data, documentation, SAS programs, etc.
- I:\projects\studyname\Datasets: stores datasets that have been approved for distribution. The SAS formats may also be included in this folder. Statisticians should have READ ONLY access to this folder.
- I:\projects\studyname\Keys: stores the form keys, the scoring algorithms, and other data documentation. Statisticians should have READ ONLY access to this folder.
- I:\projects\studyname\Grant: stores the original grant application, protocols, papers, etc. All data management staff and statisticians involved in this project should have full access to this folder.

DM Working with Biostatisticians
- Attend study meetings
- Date all documents and meeting notes
- Comment on proposed study changes
- Understand the statistical analysis plan
- Review statistical reports (preferably before they are presented to the research team)
- Review and critique abstracts/manuscripts
