Biostatistics Analysis Center
Center for Clinical Epidemiology and Biostatistics
University of Pennsylvania School of Medicine

Minimum Documentation Requirements

Amy Praestgaard
March 6, 2008

Project Directory File Structure
- Located on the network to ensure regular back-up.
- Name of directory usually includes the project number (e.g., BIO999 or EPI888).
- Has (at least) the following subdirectories:
    Documentation
    Data
        Original
        Analytic
    Programs
        Current
        Archived
    Results
        Current
        Archived
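
The subdirectory layout above could be created with a short script. This is only a sketch: the function name `create_project_skeleton` is hypothetical, and a temporary directory stands in for the network location, which the slides do not specify.

```python
import tempfile
from pathlib import Path

# Subdirectories listed on the slide, relative to the project directory.
SUBDIRECTORIES = [
    "Documentation",
    "Data/Original",
    "Data/Analytic",
    "Programs/Current",
    "Programs/Archived",
    "Results/Current",
    "Results/Archived",
]


def create_project_skeleton(root, project_number):
    """Create the minimum subdirectories under root/project_number.

    Returns the path of the project directory. Existing directories are
    left untouched, so the function is safe to run more than once.
    """
    project_dir = Path(root) / project_number
    for sub in SUBDIRECTORIES:
        (project_dir / sub).mkdir(parents=True, exist_ok=True)
    return project_dir


if __name__ == "__main__":
    # Assumption: a temporary directory stands in for the network share.
    with tempfile.TemporaryDirectory() as tmp:
        project = create_project_skeleton(tmp, "BIO999")
        for path in sorted(p for p in project.rglob("*") if p.is_dir()):
            print(path.relative_to(project))
```

Because `exist_ok=True` is set, rerunning the script against an existing project only fills in whichever subdirectories are missing.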

Documentation Subdirectory
- Contact sheet (REQUIRED)
  - Must include names and contact information for the PI, Biostatistics faculty members, and BAC staff members.
  - Project registration is sufficient for many projects.
  - Larger projects may need to include project managers, research assistants, and CRCU contacts.
- BAC Project Plan (REQUIRED; will overlap with the overall project plan)
  - Study objective(s), outcome definition, statement of analytic methodology (analysis plan).
  - At minimum, a note in the issues log should contain the BAC project goal, even if it is as simple as “this project converts a SAS file to an SPSS file.”
  - Analytic efforts related to the timeline for abstracts, meetings, manuscripts.
- Study information (STRONGLY RECOMMENDED)
  - Study protocol or grant proposal, data collection forms, meeting minutes or summaries, to-do lists (if applicable), all abstract and manuscript drafts.
- Issues log (STRONGLY RECOMMENDED)
  - Includes brief description, priority, and estimated resolution date.
  - Keep resolved issues on the list, noted with date resolved.

Data Subdirectory: BIO999\Data
- Good idea to separate the original data from the created data.
- Original (Raw) Data: BIO999\Data\Original
  - Original versions of source data (e.g., DVD, external hard drive, Excel spreadsheet) must be stored in a secure location, such as a locked desk drawer.
  - An electronic “gold copy” of the source data must be retained in the data subdirectory. Consider including “GOLD” in the file name.
- Analytic Data: BIO999\Data\Analytic
  - Original extract from source data
  - “Cleaned” data, in original format
    - Incorporates data changes, deletions
    - Does not include derived variables
  - Final analytic data set, including derived variables

Data Dictionary
- At minimum, must contain:
  - Results from PROC CONTENTS on the analytic dataset containing labeled variables.
  - Results from PROC MEANS that include N for each variable.
- A separate document may be necessary for some projects.
- If the database is very large, spend billable time creating labels only for the variables you will use in the analysis, unless the client specifically asks for labels on all variables.
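
The slide asks for SAS output (PROC CONTENTS and PROC MEANS). For readers without SAS at hand, the same minimum content (variable name, label, and N) can be sketched in Python. This is a stand-in, not the SAS procedures themselves: the function `data_dictionary`, the sample data, and the labels are all hypothetical, and N is taken here to mean the count of non-blank values.

```python
import csv
import io

# Hypothetical variable labels; in practice these come from the study documentation.
LABELS = {
    "id": "Subject identifier",
    "age": "Age at enrollment (years)",
    "sbp": "Systolic blood pressure (mmHg)",
}

# Hypothetical analytic data set with two missing values.
SAMPLE_CSV = "id,age,sbp\n1,54,120\n2,,135\n3,61,\n"


def data_dictionary(csv_text, labels):
    """Return one row per variable: name, label, and N (non-missing count)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = {name: 0 for name in reader.fieldnames}
    for record in reader:
        for name in counts:
            value = record.get(name)
            # Blank or absent cells are treated as missing.
            if value is not None and value.strip() != "":
                counts[name] += 1
    return [
        {"variable": name, "label": labels.get(name, ""), "n": counts[name]}
        for name in counts
    ]


if __name__ == "__main__":
    for row in data_dictionary(SAMPLE_CSV, LABELS):
        print(f"{row['variable']:<6} N={row['n']:<3} {row['label']}")
```

A variable with an empty label in the output is a quick flag that it still needs labeling before the dictionary is considered complete.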

Program Headers
- All programs have a header with program name, purpose, location, project name, faculty and BAC name, input, output, last modified date, and other relevant details.
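
One way to keep headers consistent is to generate them from a fixed field list. The sketch below is illustrative only: the field labels are one reading of the slide ("faculty and BAC name" is split into two fields), the `render_header` helper is hypothetical, and every value in the example is made up.

```python
# Header fields taken from the slide; "Notes" covers "other relevant details".
HEADER_FIELDS = [
    "Program name",
    "Purpose",
    "Location",
    "Project name",
    "Faculty",
    "BAC analyst",
    "Input",
    "Output",
    "Last modified",
    "Notes",
]


def render_header(values, comment_prefix="# "):
    """Format a program header, refusing to omit any required field."""
    missing = [f for f in HEADER_FIELDS if f != "Notes" and not values.get(f)]
    if missing:
        raise ValueError("Missing header fields: " + ", ".join(missing))
    width = max(len(f) for f in HEADER_FIELDS)
    return "\n".join(
        f"{comment_prefix}{field.ljust(width)} : {values.get(field, '')}"
        for field in HEADER_FIELDS
    )


if __name__ == "__main__":
    # All values below are hypothetical placeholders.
    example = {
        "Program name": "gmm_8",
        "Purpose": "Fit growth mixture model",
        "Location": r"BIO999\Programs\Current",
        "Project name": "BIO999",
        "Faculty": "(faculty name)",
        "BAC analyst": "(analyst name)",
        "Input": r"BIO999\Data\Analytic",
        "Output": r"BIO999\Results\Current",
        "Last modified": "8Dec07",
    }
    print(render_header(example))
```

The `comment_prefix` argument can be changed to match the comment syntax of whatever language the program itself is written in.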

Code Documentation
- Code should have a sufficient number of comments, especially on macros, complex merges, and code for analysis.
- See the “Code Complete” book for a fuller discussion of good documentation.
- Within a program, the following types of code must be annotated with a description of purpose and expected outcome:
  - All macros or functions
  - Any code generating a data set that is to be retained for interim or final analysis
  - Any code generating output (e.g., frequency tables, test statistics, p-values) for an interim or final deliverable
  - Any code used for data validation, data changes, and derived variable calculation

Program Archiving
- During active analytic periods, log and listing files must be archived as follows:
  - With the date clearly indicated in the file name, as in “gmm_8_8Dec07.log” and “gmm_8_8Dec07.lst”
  - In a separate “archive” or “history” section of the appropriate BAC project folder
  - At least weekly, and more frequently if substantial changes are made
- Log and listing files can be generated in Windows; there is no need to use Unix to produce them.
- Programs generally should have a date in their names in order to keep track of versions.
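
The naming pattern in the example file names (program name, underscore, day, month abbreviation, two-digit year) can be applied automatically. The following is a minimal sketch, assuming the log and listing files sit in one directory and are copied, not moved, into the archive; `dated_name` and `archive_outputs` are hypothetical helper names.

```python
import shutil
import tempfile
from datetime import date
from pathlib import Path

# Fixed English abbreviations so the result does not depend on the system locale.
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def dated_name(filename, run_date):
    """Insert a date stamp before the extension, e.g. gmm_8.log -> gmm_8_8Dec07.log."""
    path = Path(filename)
    stamp = f"{run_date.day}{MONTHS[run_date.month - 1]}{run_date.year % 100:02d}"
    return f"{path.stem}_{stamp}{path.suffix}"


def archive_outputs(current_dir, archive_dir, run_date, suffixes=(".log", ".lst")):
    """Copy log and listing files into the archive under date-stamped names."""
    archive = Path(archive_dir)
    archive.mkdir(parents=True, exist_ok=True)
    copied = []
    for source in sorted(Path(current_dir).iterdir()):
        if source.is_file() and source.suffix.lower() in suffixes:
            target = archive / dated_name(source.name, run_date)
            shutil.copy2(source, target)  # copy2 also preserves file timestamps
            copied.append(target)
    return copied


if __name__ == "__main__":
    # Assumption: temporary directories stand in for the project folders.
    with tempfile.TemporaryDirectory() as tmp:
        current = Path(tmp) / "Current"
        current.mkdir()
        (current / "gmm_8.log").write_text("log contents")
        (current / "gmm_8.lst").write_text("listing contents")
        for path in archive_outputs(current, Path(tmp) / "Archived", date(2007, 12, 8)):
            print(path.name)
```

Copying leaves the current files in place, so the working directory always holds the latest run while the archive accumulates dated snapshots.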

Documenting Deliverables
- All interim and final deliverables must include the following information:
  - Analyst name
  - Deliveree (recipient)
  - Date produced (good practice to include in file name)
  - Input data, output data, and program location
  - Statistical software (including version) used to produce results
- Once distributed, the deliverable will be retained in the appropriate project folder.