Moving Biomedical Big Data Sharing Forward: An Adoption of the RDA Data Citation of Evolving Data Recommendation to Electronic Health Records
Leslie McIntosh, PhD, MPH (@mcintold) and Cynthia Hudson Vitale, MA (@cynhudson)
RDA P8, Denver, USA, September 2016
Background
Millions of dollars are being put into data sharing and brokerage using clinical data, and clinical decisions are being made based upon these (and other) data. Hence, it is important to make this process rigorous. But why do I care?
Leslie McIntosh, PhD, MPH: Director, Center for Biomedical Informatics
Cynthia Hudson Vitale, MA: Data Services Coordinator
BDaaS: Biomedical Data as a Service
[Diagram: Researchers, Data Broker, i2b2 Application, and Biomedical Data Repository at the Center for Biomedical Informatics.]
Move some of the responsibility for reproducibility
[Diagram: Biomedical Researcher and Biomedical Pipeline.]
How do we facilitate moving some of the responsibility for reproducibility from the biomedical researcher and incorporate those principles into the biomedical pipeline?
RDA/MacArthur Grant
We saw the RDA/MacArthur grant as an opportunity to move forward with our vision.
Biomedical Adoption Project Goals
Specifically, through this project, we set out to:
- Implement the RDA Data Citation WG recommendation in the local Washington University i2b2 instance
- Engage other i2b2 community adoptees in the conversation on reproducibility
- Contribute source code back to the i2b2 community
RDA Data Citation WG Recommendations
- R1, Data Versioning: To retrieve earlier states of datasets, the data need to be versioned. Markers shall indicate inserts, updates, and deletes of data in the database.
- R2, Data Timestamping: Ensure that operations are timestamped, i.e., any additions or deletions are marked with a timestamp.
- R3 and R9, Query Store: Provide means for storing queries and their associated metadata so they can be re-executed in the future.
- R7, Query Timestamping: Assign a timestamp to the query based on the last update to the entire database.
- R8, Query PID: The data shall be identified via a persistent identifier (PID) pointing to a timestamped query and resolving to a landing page.
- R10, Query Citation: Provide a recommended citation text and the PID to the user.
Internal Implementation Requirements
- Scalable: must handle large volumes of data efficiently and relatively quickly, because versioning requires storing a history of changes as well as scanning and updating large amounts of data.
- Available for PostgreSQL: must be implementable on PostgreSQL, the database system CBMI uses for EHR data.
- Actively supported: the tools used must have regular updates and maintenance from the community or a reliable commercial source; otherwise they should be well understood by in-house developers so the CBMI team can provide support and updates itself.
- Easy to maintain: should be relatively easy to implement, to allow for easy maintenance and upkeep.
- Easy for data brokers to use: data brokers need to run queries and access the query audit.
RDA-MacArthur Grant Focus
[Diagram: Data Broker and Researchers, the portion of the BDaaS pipeline targeted by this grant.]
R1 and R2 Implementation
[Diagram: RDC.table (columns c1, c2, c3, sys_period), the versioning triggers, and RDC.hist_table (same columns), which stores the history of data changes, built on the PostgreSQL extension temporal_tables.]
In our proof of concept, RDC implements timestamping and versioning (R1 and R2) with the temporal_tables extension for PostgreSQL. Each live data table that needs logging was given an additional column named "sys_period", which holds a timestamp range identifying the start timestamp, end timestamp, and boundary condition of when the record is or was valid. The name "sys_period" was chosen for consistency with the temporal_tables documentation. A second copy of each live data table was created to hold historical data. These history tables have the same structure as the live data tables, but without the primary keys, and are used to log each data change. For consistency, every history table was given the same name as the table being logged, prefaced with "hist_". The trigger provided by the temporal_tables extension was added to each live data table and is called before insert, update, or delete transactions: it inserts a copy of the existing row from the live data table into the history table, with the timestamp range modified so that the current date becomes the upper bound of the valid range. A sketch of this setup follows below.
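The following is a minimal sketch of that setup using the temporal_tables extension, with a hypothetical table rdc.demo and columns c1, c2, c3 standing in for the actual RDC tables (all names here are illustrative):

```sql
-- Hypothetical example; the real RDC schema, table, and column names differ.
CREATE EXTENSION IF NOT EXISTS temporal_tables;
CREATE SCHEMA IF NOT EXISTS rdc;

CREATE TABLE rdc.demo (
    c1 integer PRIMARY KEY,
    c2 text,
    c3 date,
    -- validity period of the row: open-ended (NULL upper bound) while live
    sys_period tstzrange NOT NULL DEFAULT tstzrange(current_timestamp, NULL)
);

-- History table: same structure as the live table, but without the primary
-- key; it only logs superseded row versions.
CREATE TABLE rdc.hist_demo (LIKE rdc.demo);

-- Trigger provided by temporal_tables: on UPDATE or DELETE it copies the
-- outgoing row into the history table with a closed sys_period range; on
-- INSERT it simply sets sys_period on the new row.
CREATE TRIGGER versioning_trigger
    BEFORE INSERT OR UPDATE OR DELETE ON rdc.demo
    FOR EACH ROW
    EXECUTE PROCEDURE versioning('sys_period', 'rdc.hist_demo', true);
```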
ETL Incrementals
[Diagram: incremental ETL from the clinical data source into RDC.table, with prior row versions routed by the triggers into RDC.hist_table; sys_period is open-ended (NULL upper bound) for live rows and closed for historical rows.]
During the incremental ETL from the clinical data source, our ETL identifies whether a row of data is an insert or an update to existing RDC data. If it is an update, the data in RDC.table are updated to the new values; however, the trigger that fires before each update (or delete, if any) first copies the original field values into RDC.hist_table and only then allows the update of RDC.table. In live data tables, sys_period holds the start of the range during which the record is valid and NULL as the end date; inside hist_table, sys_period holds both the start and end dates of the record's validity.
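As a small illustration, continuing the hypothetical rdc.demo table from the previous sketch, a single update leaves the superseded values in the history table and the new values, with an open-ended sys_period, in the live table:

```sql
-- Illustrative only; in practice the UPDATE would come from the incremental ETL.
INSERT INTO rdc.demo (c1, c2, c3) VALUES (1, 'original value', '2016-09-01');

UPDATE rdc.demo SET c2 = 'corrected value' WHERE c1 = 1;

-- Live table: the current row, sys_period open-ended (upper bound NULL).
SELECT c1, c2, sys_period FROM rdc.demo WHERE c1 = 1;

-- History table: the superseded row, sys_period closed at the update time.
SELECT c1, c2, sys_period FROM rdc.hist_demo WHERE c1 = 1;
```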
R3, R7, R8, R9, and R10 Implementation
[Diagram: RDC.table and RDC.hist_table combined into the RDC.table_with_history view, with functions and triggers populating the query audit tables, built on the PostgreSQL extension temporal_tables.]
To implement the remaining RDA recommendations in RDC (storing the query and its metadata, query timestamping, storing a query PID (persistent identifier), and suggesting a query citation), we did the following. In addition to using the temporal_tables extension, we created a view as the union of each live data table and its corresponding history table, so that data can be queried as they were at any given time. These views were named after the live table with the suffix "_with_history"; each view contains the currently valid data plus all historical versions of the same data, along with the sys_period ranges during which each version was valid. Several functions and triggers were set up to store user data and query data in query audit tables whenever data brokers query the views. A sketch of the view and audit table follows below.
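The sketch below shows the two pieces just described for the hypothetical rdc.demo table: the "_with_history" view and a query audit table. The audit table layout and the use of a UUID as the PID are assumptions for illustration; the actual RDC audit tables, PID scheme, and the functions and triggers that populate them are not shown.

```sql
CREATE EXTENSION IF NOT EXISTS pgcrypto;  -- provides gen_random_uuid() on older PostgreSQL

-- Union of live and historical row versions, queryable as of any time.
CREATE VIEW rdc.demo_with_history AS
    SELECT c1, c2, c3, sys_period FROM rdc.demo
    UNION ALL
    SELECT c1, c2, c3, sys_period FROM rdc.hist_demo;

-- One row per audited broker query: query text, metadata, execution
-- timestamp, assigned PID, and a suggested citation string.
CREATE TABLE rdc.query_audit (
    query_pid   uuid PRIMARY KEY DEFAULT gen_random_uuid(),
    query_text  text NOT NULL,
    executed_by text NOT NULL,
    executed_at timestamptz NOT NULL DEFAULT current_timestamp,
    citation    text
);
```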
Data Reproducibility Workflow
1) Today: The data broker runs a query against the special view (the union of the live data and the hist_table). The query is audited automatically, storing the query text, its metadata, the date of execution, the query PID, and a suggested citation. The data broker gets back the data and the suggested citation, which contains the query date and query PID.
2) Some time later: The data broker can reproduce the results associated with a query PID by passing that PID to a special function that re-runs the query. The function looks up the query date using the provided PID, then operates over the union of the live data and all historical versions of that data, selecting only the versions that were valid at the time the query with that PID was executed. The data broker gets back the same dataset as before.
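Step 2 can be sketched as follows, reusing the hypothetical objects from the previous examples. For brevity this sketch simply returns every row version valid at the recorded execution time, rather than re-executing the stored query text as the actual function does; the range-containment operator @> keeps only rows whose sys_period covered that timestamp:

```sql
CREATE FUNCTION rdc.rows_as_of(pid uuid)
RETURNS SETOF rdc.demo
LANGUAGE sql STABLE AS $$
    SELECT c1, c2, c3, sys_period
    FROM rdc.demo_with_history
    WHERE sys_period @> (SELECT executed_at
                         FROM rdc.query_audit
                         WHERE query_pid = pid);
$$;

-- Usage: SELECT * FROM rdc.rows_as_of('<query PID>');
```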
Bonus Feature: Determine if Change Occurred
A nice feature we added lets the data broker find out whether the version of the data returned for a query with a given PID has changed over time, by comparing it to the current version of the data. If the data have not changed, the function returns false; if they have changed, the comparison is provided to the data broker.
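A simplified sketch of that check, reusing the hypothetical rdc.rows_as_of() function above; this version only returns true or false, whereas the actual implementation returns the comparison itself when the data have changed:

```sql
CREATE FUNCTION rdc.data_changed_since(pid uuid)
RETURNS boolean
LANGUAGE sql STABLE AS $$
    -- True if any row differs, in either direction, between the snapshot
    -- valid at the PID's execution time and the current live data.
    SELECT EXISTS (
        (SELECT c1, c2, c3 FROM rdc.rows_as_of(pid)
         EXCEPT
         SELECT c1, c2, c3 FROM rdc.demo)
        UNION ALL
        (SELECT c1, c2, c3 FROM rdc.demo
         EXCEPT
         SELECT c1, c2, c3 FROM rdc.rows_as_of(pid))
    );
$$;
```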
Future Developments
- Develop a process for sharing query PIDs with researchers in an automated way
- Resolve query PIDs to a landing page with query metadata
- Implement research reproducibility requirements in other systems where possible
Outcomes and Support
Obtained Outcomes
- Implemented the WG recommendations
- Engaged with other i2b2 adoptees (Harvard, Nationwide Children's Hospital)
Success of this project will be based on a number of factors, including (1) the integration of the WGDC recommendations into the WU CBMI instance of i2b2 and (2) feedback from existing i2b2 installations on the recommended changes and the RDA WGDC-compliant code.
Dissemination
- Poster presentation (Harvard University, July 2016)
- Scientific manuscript based on our proof of concept, submitted to the AMIA TBI/CRI 2017 conference
- Sharing the code with the community
Return on Investment (ROI) - Estimated
- 20 hours to complete 1 study
- $150/hour (unsubsidized)
- $3,000 per study
- 115 research studies per year
- 14 replication studies
There are different ways to calculate the financial ROI. One is to estimate the hours of time we potentially save (or do not waste) trying to replicate a study. Given our calculations, it would take about 14 replication studies to recoup this investment.
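A rough worked version of those numbers; reading the 14 replication studies as the break-even point is our interpretation of the slide:

```latex
20\ \text{hours/study} \times \$150/\text{hour} = \$3000\ \text{per study}
14\ \text{studies} \times \$3000/\text{study} \approx \$42{,}000\ \text{of replication effort at break-even}
```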
Funding Support
- MacArthur Foundation, 2016 Adoption Seeds program, through a sub-contract with the Research Data Alliance
- Washington University Institute of Clinical and Translational Sciences, NIH CTSA Grant Numbers UL1TR and UL1TR S1
- Siteman Cancer Center at Washington University, NIH/NCI Grant P30 CA
Center for Biomedical Informatics @WUSTL
Teams for Reproducible Research
NIH-NLM Supplement: Leslie McIntosh, Cynthia Hudson-Vitale, Anthony Juehne, Rosalia Alcoser, Xiaoyan 'Sean' Liu, Brad Evanoff
Research Data Alliance: Leslie McIntosh, Cynthia Hudson-Vitale, Anthony Juehne, Snehil Gupta, Connie Zabarovskaya, Brian Romine, Dan Vianello
RDA Collaborators: Andreas Rauber, Stefan Pröll
WashU CBMI Research Reproducibility Resources
Repository Slides Bibliography