SIR and the Columbia University Health Sciences Data Coordinating Center
Why SIR should be the database system of choice for research in the health sciences
Howard Andrews, John Pittman

Columbia University DCC: First Decade
1976: A Computer Science Section (CSS) is established within the Epidemiology Department to meet the data management and computer technology needs of researchers in the department. Staffing level: 3-4 FTEs.
1977: John Pittman purchases SIR software and begins to use SIR for certain research projects.
1988: CSS, now called the Statistical Analysis Unit (SAU), is asked by the Sergievsky Center to manage data for WHICAP, a longitudinal follow-up study of Alzheimer’s disease in a cohort of 2,000 elderly residents of Northern Manhattan. SIR is used to manage WHICAP data. Staffing level: 4-5 FTEs.

Columbia University DCC: Second Decade
WHICAP investigators are highly successful: the grant is renewed for an additional 5 years, and many additional projects by the same investigators are funded. All data are managed using SIR.
The HIV Center for Clinical and Behavioral Studies and the Center for Child Environmental Health ask the SAU (now the Data Coordinating Center, DCC) to provide comprehensive data management services using SIR. Staffing level: 6-8 FTEs.

Columbia University DCC Today: Highlights
Databases: 25 active research databases using SIR.
Staffing: 10 full-time and 4 part-time staff.
Services: the DCC provides a full range of data-related services, including data analysis, subject recruitment and project management as well as data management.
Hardware: dual-processor server with 512 MB RAM and three 18 GB RAID-configured hard drives; 24 GB tape drive; Windows NT operating system.

Factors that Promoted Growth of the Columbia DCC
Professional staff with training and experience in data management, data analysis and/or programming.
Commitment to high standards of data management.
Enormous need for a centralized institutional resource providing expertise in research data coordination.
Increasing concern about, and regulatory oversight of, data management, data security and confidentiality on the part of granting agencies.
‘Buy-in and Accountability’ concept: each funded grant directly supports DCC staff and infrastructure.
Use of SIR software: maximizing power, minimizing cost.

Buy-in and Accountability
Infrastructure for Managing Data in Funded Research Projects
The DCC is involved at the proposal stage, providing a description of the role of the DCC and of individual staff.
Staff, equipment and software needs are budgeted ‘up front’ and become available to the DCC if the grant is funded.
The buy-in mechanism creates more accountability to the investigator than an institutionally funded center does.
Buy-in therefore becomes an attractive alternative for investigators who would otherwise hire and manage their own data management staff and infrastructure.
The efficiencies inherent in an accountable, professional DCC are also attractive to institutions and funding agencies.

Why SIR continues to be the software of choice
SIR was designed for the research community.
SIR’s case-structured approach is ideal for health-related (and most other) research, in which all data are invariably ‘owned’ by a subject, case or proband.
SIR is the only database software that provides integrated variable and value labeling and user-defined missing values. These features carry over automatically to files for SAS and SPSS, the statistical packages used by most statisticians.
The ability to create self-documenting, analysis-ready statistical files from SIR greatly reduces the delay between data entry, data analysis and publication of research findings.
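
To make ‘self-documenting’ concrete, the following is a minimal sketch in Python of how variable labels, value labels and user-defined missing values can travel with the data into a statistical package. It is not SIR's export mechanism: the Variable class and the two example variables are hypothetical, and only the generated SPSS commands (VARIABLE LABELS, VALUE LABELS, MISSING VALUES) are standard SPSS syntax.

```python
from dataclasses import dataclass, field


@dataclass
class Variable:
    """One entry in a hypothetical data dictionary (not a SIR structure)."""
    name: str
    label: str
    value_labels: dict[int, str] = field(default_factory=dict)
    missing_values: tuple[int, ...] = ()


def quote(text: str) -> str:
    """Quote a string for SPSS syntax, doubling any embedded double quotes."""
    return '"' + text.replace('"', '""') + '"'


def spss_syntax(variables: list[Variable]) -> str:
    """Emit labeling and missing-value commands for each variable."""
    lines = []
    for var in variables:
        lines.append(f"VARIABLE LABELS {var.name} {quote(var.label)}.")
        if var.value_labels:
            pairs = " ".join(
                f"{code} {quote(text)}" for code, text in var.value_labels.items()
            )
            lines.append(f"VALUE LABELS {var.name} {pairs}.")
        if var.missing_values:
            codes = ", ".join(str(code) for code in var.missing_values)
            lines.append(f"MISSING VALUES {var.name} ({codes}).")
    return "\n".join(lines)


# Hypothetical variables, for illustration only.
dictionary = [
    Variable("sex", "Sex of subject", {1: "Male", 2: "Female", 9: "Not recorded"}, (9,)),
    Variable("age", "Age at baseline (years)", missing_values=(-1,)),
]
print(spss_syntax(dictionary))
```

Because the labels and missing-value codes are generated from the same dictionary that defines the data, the analyst receives a file that needs no separate codebook before analysis can begin.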

DCC Operating Principles and Features I. Database and implementation
Determine data structures that will be required for analysis.
Identify reports that will be required.
Work with investigators on instrument (paper form) design to ensure adherence to good data entry practices.
Require formal DCC review and approval of any changes to paper instruments.
Plan the mode and location of data entry for all data types, e.g.:
  – Entry by DCC staff of full evaluation packets collected on paper.
  – Batch entry of laboratory results.
  – Error correction by project staff via secured remote access.

DCC Operating Principles and Features II. Database and implementation
Remote data entry and data delivery:
  – Secured remote access to the project database on the DCC server is provided over the Columbia intranet.
  – SIR MASTER is used to allow concurrent data entry and database access from the DCC and authorized remote locations.
  – Database products (SPSS and SAS files and programmed reports) are deposited in secured project-specific directories on the DCC server, to which authorized project staff have access.
Daily backup and product generation:
  – All SIR databases are verified and backed up nightly.
  – SIR PQL programs are run to generate updated reports and SAS and SPSS files reflecting the day’s data entry activities.
  – WINBATCH software is used to automate these nightly jobs.
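
The nightly backup step can be illustrated with a short Python sketch. This is a stand-in, not the DCC's actual procedure, which relies on SIR's own verification and on WINBATCH. The function name, directory layout and file names here are assumptions. Note that comparing checksums only confirms that each copy matches its source; it does not check the internal integrity of a database.

```python
import hashlib
import shutil
import tempfile
from datetime import date
from pathlib import Path


def sha256(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def nightly_backup(source_dir: Path, backup_root: Path, today: date) -> list[Path]:
    """Copy every file in source_dir to a dated folder and confirm each copy."""
    target_dir = backup_root / today.isoformat()
    target_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for source in sorted(p for p in source_dir.iterdir() if p.is_file()):
        target = target_dir / source.name
        shutil.copy2(source, target)  # copies contents and timestamps
        if sha256(source) != sha256(target):
            raise IOError(f"backup of {source.name} does not match the original")
        copied.append(target)
    return copied


if __name__ == "__main__":
    # Demonstration in a throwaway directory with a hypothetical file name.
    with tempfile.TemporaryDirectory() as tmp:
        databases = Path(tmp) / "databases"
        databases.mkdir()
        (databases / "study.dat").write_bytes(b"example contents")
        for path in nightly_backup(databases, Path(tmp) / "backups", date.today()):
            print("backed up:", path.name)
```

Writing each night's copies to a folder named for the date keeps earlier backups intact, so a problem discovered days later can still be traced to the last good copy.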

Promoting & Supporting SIR in the Research Community
Build SIR licensing into the budget of each new project (departmental or project-specific).
Establish courses in database management and data coordination.
Professionalize data management: promote certification and academic degrees in research data management.

What SIR can do to increase its market share in research
Advertise in scientific and statistical journals: most researchers have never heard of SIR.
Develop a research-based alternative to the COMPANY database for SIR presentations.
Don’t neglect appearances: creating attractive, ‘modern’-looking (dare we say ACCESS-like?) PQL/Forms is as important for promotion as making PQL/Forms powerful and easy to use (ditto for DataVisor forms).