Data Migration (ETL) Approach
Scheduler Roll-Out at HCCRC Institutions
Open Source Community Call
January 26, 2016
http://catalyst.harvard.edu

Contents
- State of Data at the Institutions
- Data Loaded into Scheduler
- Approach to ETL
- Challenges & Lessons Learned

State of Data at the Institutions
- Varying methods for scheduling used at each of the four CRCs
- Methods include:
  - Paper and pencil (scheduling book)
  - Turbo software
  - Outlook
  - Epic
- Each method yielded different levels of available data in varying formats

Data Loaded into Scheduler
- Scheduled visits
  - Each visit was put in as an overbook
- Subjects
- Resource intensities added to existing visit templates
Notes
- Prior to data migration, studies and visit templates were modeled out and then built directly in the system; resources were added directly to the database
- For some sites, we were able to move data over programmatically. For other sites, we decided to enter certain data manually based on:
  - Levels of effort for each approach
  - Data available
  - Developer and operational staff availability

Approach to ETL
- Established formats in which each institution would provide their data
  - File format (CSV) and layout
  - Data formats (e.g. dates, strings)
  - Required close communication and trial and error between Harvard Catalyst engineers and institution engineers
- Coded a program to extract institution data from the CSV files and transform it into our domain objects
- Created a code branch with a copy of Scheduler code used to populate the database
- Workflow between the engineer at the institution and the Harvard Catalyst engineer:
  - Send sample data (no PHI)
  - Code program, including proper validation messages
  - Run the program on a test system and report errors

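Illustration (not the actual migration code): a minimal Python sketch of the extract-and-transform step with validation messages. The column names, the date format, and the ScheduledVisit class are all assumptions made for this example; the real file layout and Scheduler domain objects are not shown in these slides.

```python
import csv
import io
from dataclasses import dataclass
from datetime import datetime

DATE_FORMAT = "%Y-%m-%d %H:%M"  # assumed agreed-upon date format
REQUIRED_COLUMNS = ("subject_id", "visit_template", "start", "end")  # hypothetical layout


@dataclass
class ScheduledVisit:
    """Hypothetical stand-in for a Scheduler domain object."""
    subject_id: str
    visit_template: str
    start: datetime
    end: datetime
    overbook: bool = True  # per the slides, each migrated visit went in as an overbook


def parse_visits(csv_text):
    """Return (visits, errors); errors are human-readable validation messages."""
    visits, errors = [], []
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return visits, [f"header: missing column(s): {', '.join(missing)}"]
    for row in reader:
        line_no = reader.line_num  # physical line in the file, header is line 1
        blank = [c for c in REQUIRED_COLUMNS if not (row[c] or "").strip()]
        if blank:
            errors.append(f"line {line_no}: empty value for {', '.join(blank)}")
            continue
        try:
            start = datetime.strptime(row["start"].strip(), DATE_FORMAT)
            end = datetime.strptime(row["end"].strip(), DATE_FORMAT)
        except ValueError:
            errors.append(f"line {line_no}: dates must look like 2016-01-26 09:00")
            continue
        if end <= start:
            errors.append(f"line {line_no}: end must be after start")
            continue
        visits.append(ScheduledVisit(row["subject_id"].strip(),
                                     row["visit_template"].strip(), start, end))
    return visits, errors


# Made-up sample data (no PHI), in the spirit of the sample files exchanged.
SAMPLE = """subject_id,visit_template,start,end
S-001,Baseline,2016-01-26 09:00,2016-01-26 10:30
S-002,Follow-up,01/27/2016,2016-01-27 11:00
S-003,,2016-01-28 08:00,2016-01-28 09:00
"""

visits, errors = parse_visits(SAMPLE)
for message in errors:
    print(message)
```

Reporting errors by file line number is one way to let an engineer at the institution correct the source file without sharing the real data.
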
Challenges & Lessons Learned
- Because we are outside of the hospital and there is sensitive patient data, we were only able to use sample data when coding and testing
  - This caused a lot of back and forth between engineers at Harvard Catalyst and at the hospital
- The method used to populate the Scheduler database is not re-usable and was a one-time effort