ADVANCE CDRN: ETL and QA Successes and Challenges


1 ADVANCE CDRN: ETL and QA Successes and Challenges
Jon Puro, ADVANCE PI

2 I have no relevant financial disclosures.
Jon Puro, ADVANCE PI

3 Presentation Overview
- PCORnet and ADVANCE overview (2 min)
- ADVANCE data and infrastructure (3 min)
- ETL Processes and Challenges (5 min)
- QA Processes and Challenges (5 min)

4 Introducing PCORnet

5 PCORnet at a glance (as of February 2017)
- More than 110 million patients with an ambulatory visit, inpatient admission, or ED visit in the past 5 years
- More than 55 million patients with an ambulatory visit, inpatient admission, or ED visit in the past year
- Data standardized to the PCORnet Common Data Model at ~80 DataMarts
- Data are routinely updated on a quarterly basis
- Recency and completeness of data vary by DataMart
- Data generally available from 2010 onward (varies by DataMart and domain)

6 PCORnet is… A Distributed Research Network
- Network partners maintain possession of their own data
- Network partners provide aggregate results, not patient-level data, to the Coordinating Center for non-study queries
- Predominantly EHR data
- Standardized across diverse health systems
- Biased toward patients who seek care
- Highly heterogeneous
- Well-suited for counts & summary statistics

7 Here’s how PCORnet’s distributed research network works
- The Researcher sends a question to the PCORnet Coordinating Center through the Front Door
- The Coordinating Center converts the question into a query with underlying executable code, and sends it to PCORnet partners
- PCORnet partners review the query and provide a response, which is sent back through the Front Door to the Researcher
[Diagram labels: Researcher, Question, Front Door, PCORnet Coordinating Center, Query, Response]
Let’s take a look at how PCORnet works in practice. If you want to tap PCORnet to answer a research question, you submit your question to PCORnet. It goes through the Front Door into PCORnet’s Coordinating Center, a network of partners who collaborate to lead PCORnet’s operations and data activities and maintain its infrastructure. These partners include the Harvard Pilgrim Health Care Institute, Duke Clinical Research Institute, and Genetic Alliance. Our administrators review all requests for use of PCORnet resources and interface with the Distributed Research Network Operations Center (DRN OC) and the Research Committee (RC) to facilitate an efficient review process. We begin by working with the requestor to make sure we know specifically what information is sought. Then we convert the question to a query, phrasing it in a way that PCORnet’s system will understand so that it gives us the result we need. We convert it into SAS code and send it securely to PCORnet network partners, who run the query against their data and send back the result. Alternatively, we build a SQL query using the PopMedNet Menu Driven Query tool. The entire process is performed locally at PCORnet partner networks; the data never move and remain secure.

8 Types of Clinical Data Research Networks (CDRNs)

9 Introducing the ADVANCE CDRN
- Partnership between large safety-net networks
- Promotes research of safety net clinics and the populations they serve
- Research goals: improve care, access, and outcomes for safety net patients and clinics nationwide

10 ADVANCE CDRN Partners
- OCHIN, Inc.: 97 health systems; 597 clinics; 17 states
- Health Choice Network (HCN): 24 health systems; 466 clinics; 8 states
- Fenway Health: 1 health system; 3 clinics; 1 state
- American Academy of Family Physicians, Robert Graham Center (HealthLandscape)
- CareOregon Medicaid Managed Care Plan
- Kaiser Permanente NW Center for Health Research
- Legacy Health System: 6 hospitals
ADVANCE is a partnership between OCHIN, Legacy Health System (hospital data), Health Choice Network (HCN, Florida), Kaiser Permanente NW Center for Health Research, Fenway Health, CareOregon Medicaid Managed Care Plan, and the American Academy of Family Physicians, Robert Graham Center.

11 The ADVANCE CDRN
- >3.5 million patients
- 128 health systems
- >1,000 clinic sites
- 22 states
- 310 cities
- >10,000 primary care clinicians
- >20 government institutions
- >50 researchers
Current statistics as of February 2017. This map shows the breadth of reach of ADVANCE, with states color-coded by concentration of patients. Also note in particular the large number of health systems (115) and clinic sites (951).

12 ADVANCE Research Data Warehouse (RDW) includes:
PCORnet CDM:
- Demographics (DOB, sex, race, ethnicity, etc.)
- Enrollment
- Encounter
- Diagnosis
- Labs
- Prescribing and Dispensing
- Death date and cause
- Vital Signs (height, weight, smoking)
- Condition (incl. Problem List)
- Patient Reported Outcomes
Plus additional data needed for research on the safety net:
- Federal Poverty Level (FPL)
- Household income and size
- Insurance status (incl. uninsured)
- Homeless status
- Migrant/seasonal worker status
- Veteran status
- Community Vital Signs
The left column contains the names of the tables in the PCORnet Common Data Model (CDM). All PCORnet Clinical Data Research Networks (CDRNs) have at least one data mart that contains all of these tables. Different CDRNs have these tables populated in their data marts to varying degrees. A note about CDRNs and data marts: a “data mart”, in PCORnet-speak, is one instance of the PCORnet CDM. Most CDRNs have multiple data marts scattered through their networks. ADVANCE is one of 3 CDRNs that have all data from all data partners contained in one central data mart. We refer to ADVANCE as having a “centralized data mart”. Data elements that are unique to ADVANCE (safety net data) are shown in the right-hand column. ADVANCE added these columns to its expanded CDM in order to support safety net research. We refer to this expanded CDM as the ADVANCE Research Data Warehouse, or RDW.

13

14 ADVANCE IT and Analytic Resources
Staff at central site (OCHIN):
- Data Warehouse Architect: full-time on ADVANCE
- Data Warehouse Programmer: part-time on ADVANCE
- Analysts and biostatisticians: 8 staff, about 2 FTE on ADVANCE (not including PCORnet projects)
- Other OCHIN IT support
Staff at data partner sites:
- Data managers, IT, other support
Systems (OCHIN):
- SQL Server 2012 (upgrading to 2016 next month)
- SAS Enterprise Guide 7.1

15 ADVANCE ETL Successes and Challenges

16 ADVANCE ETL Volume and Durations
Data Partner | Refresh Frequency | Avg Incremental Patients | Avg Incremental Encounters (all types) | Avg ETL Duration
OCHIN (Epic-ambulatory) | Weekly | 8,000 | 200,000 | 6.5 hrs
HCN (Intergy-ambulatory) | Quarterly | 100,000 | 895,000 | 1 hr
CareOregon (claims) | N/A | 55,000 | 30 mins
Legacy (hospital) | 18,000 | 2 mins
Fenway (Centricity-ambulatory) | 2,100 | 35,000 | 3 mins
HealthLandscape (geospatial) | 75,000 (addresses) | N/A

17 ETL Successes and Challenges (1/4)
Updating central CDM / RDW:
- OCHIN manages RDW ingestion
- Most partners send data in the CDM format
- Ingestion time is mostly for creating surrogate IDs and updating the CDM
- For OCHIN the duration includes the Clarity ETL, but not data checks
- Centralized model provides efficiencies by centralizing key functions, esp. ETL and QA
- Requires broader expertise and more time to deal with varied data sources
Most of our partners send their data already in the CDM format, so the reported duration is just the time to create surrogate IDs and update the CDM, whereas for OCHIN it also includes the actual Clarity ETL. However, this doesn’t include the time we spend checking partner data and addressing data issues. The centralized model provides efficiencies by centralizing key functions, but it also requires broader expertise and more time to deal with varied data sources.

18 ETL Successes and Challenges (2/4)
ADVANCE implemented various methods for proactive detection of unexpected data changes during loads, which could potentially affect data quality. Various factors can result in missing, out-of-range, or otherwise unacceptable data, e.g.:
- Routine EHR maintenance and upgrades
- New EHR tools
- New billing requirements
- New or altered workflows
- Data entry not following prescribed workflows

19 ETL Successes and Challenges (3/4)
ETL Challenges:
- Patient merges: merging/reconciliation of orphan encounters and patients. These are often the result of differential data updates, which include new data created in a given period but not necessarily updates to existing data.
- Internal use codes: internal/custom codes mixed in with standard codes (e.g. ICD, CPT, SNOMED-CT, LOINC). It is hard to know whether these internal codes should be mapped to standard codes.
- Free text: standardization of free-text values and ongoing maintenance of the normalization process, such as med and lab units, historical dates, etc.

20 ETL Successes and Challenges (4/4)
ETL Challenges: Integrating claims data with EHR data
- ADVANCE gets Medicaid claims data from an Oregon managed care plan (CareOregon)
- We flag claims data with the source in the ADVANCE-added SiteID field
- Many claims are duplicative of EHR records; de-duplicating encounters is a future project (we already de-dup dispensing records)
- Challenge: getting unique provider IDs (e.g., NPI)
De-duping claims: the dispensing table has only a few fields and no provider ID. About 50% of all dispensing claims records are also in Surescripts (same patient, NDC, dispensing date and amount), so the dispensing table only includes claims data that are not also in Surescripts.
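The dispensing de-duplication rule described on this slide (same patient, NDC, dispensing date and amount) can be sketched as follows. This is a minimal in-memory illustration and not the ADVANCE implementation, which runs in SQL Server; the record type, its field names, and the function names are hypothetical.

```python
from typing import Iterable, List, NamedTuple


class Dispensing(NamedTuple):
    # Hypothetical record layout; field names are illustrative only.
    patid: str
    ndc: str
    dispense_date: str  # ISO date string, e.g. "2017-01-15"
    dispense_amt: float
    source: str  # e.g. "claims" or "surescripts"


def match_key(rec: Dispensing) -> tuple:
    # Two records describe the same fill when patient, NDC,
    # dispensing date and amount all agree.
    return (rec.patid, rec.ndc, rec.dispense_date, rec.dispense_amt)


def claims_not_in_surescripts(
    claims: Iterable[Dispensing], surescripts: Iterable[Dispensing]
) -> List[Dispensing]:
    # Keep only the claims records that have no Surescripts counterpart.
    seen = {match_key(rec) for rec in surescripts}
    return [rec for rec in claims if match_key(rec) not in seen]
```

Matching on an exact four-field key is the simplest reading of the rule; a real load might need to normalize NDC formats or tolerate small date differences, which this sketch does not attempt.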

21 Optimizations
- No heap tables: all tables have clustered primary-key indexes
- Non-clustered indexes on PCORnet category columns and many other frequently used columns
- Logical partitions for large tables, e.g. the physical lab_result_cm table is partitioned into logical tables by year of result_date
- All tables are available to internal ADVANCE analysts as views. This allows us to include or exclude data and to add columns to the PCORnet CDM as needed
- Plan to use columnstore indexes and other SQL Server 2016 DW enhancements after upgrading in April

22 QA-ing the Data

23 QA Processes (1/3)
Basic row counts by table:
- Reduction in record count stops ETL (critical error)
- Patients in CDM are never removed or archived
Referential integrity checks enforced with primary-foreign keys:
- Prevent orphan PATIDs, ENCOUNTERIDs
- Prevent any discrepancies between tables containing PATID and ENCOUNTERID (e.g. same ENCOUNTERID between tables, but different PATID)
Basic row counts by table: any reduction in the number of records between the current and previous load stops the ETL process (critical error) and prevents the refresh of production data with staged data. Once patients are in the CDM, their data are never removed or archived (PCORnet requirement). Thus, we don’t expect fewer records in any table after updates. Referential integrity checks enforced with primary-foreign keys prevent orphan PATIDs and ENCOUNTERIDs, as well as discrepancies between tables that contain both PATID and ENCOUNTERID (e.g. same ENCOUNTERID between tables, but different PATID).
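The two checks on this slide can be illustrated with a short sketch. The slide states that referential integrity is enforced in the database with primary-foreign keys, so the Python below is only an in-memory illustration of the same rules; the function names, the exception class, and the dictionary-based inputs are assumptions made for the example.

```python
class CriticalLoadError(Exception):
    """Raised when staged data must not replace production data."""


def check_row_counts(previous_counts, staged_counts):
    # Return one message per table whose staged row count is lower
    # than the count recorded for the previous load.
    problems = []
    for table, previous in previous_counts.items():
        staged = staged_counts.get(table, 0)
        if staged < previous:
            problems.append(f"{table}: {previous} -> {staged} rows")
    return problems


def find_patid_conflicts(encounter_patids, child_rows):
    # encounter_patids maps ENCOUNTERID -> PATID from the encounter table.
    # child_rows holds (ENCOUNTERID, PATID) pairs from another table.
    # A row is flagged when its ENCOUNTERID is unknown (orphan) or when
    # its PATID differs from the one recorded on the encounter.
    conflicts = []
    for encounterid, patid in child_rows:
        if encounter_patids.get(encounterid) != patid:
            conflicts.append((encounterid, patid))
    return conflicts


def gate_refresh(previous_counts, staged_counts):
    # Stop the load (critical error) if any table lost rows.
    problems = check_row_counts(previous_counts, staged_counts)
    if problems:
        raise CriticalLoadError("; ".join(problems))
```

Raising an exception in gate_refresh mirrors the "critical error" behavior: the refresh step that would follow never runs when a table has shrunk.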

24 QA Processes (2/3)
After each load, run processes to gather statistical information on all tables and columns, by data partner and for ADVANCE as a whole:
- All columns: total rows, null counts (including NI, UN and blank), distinct values (uniqueness)
- Date columns: minimum and maximum
- Numeric columns: minimum, maximum, mean, and standard deviation
- Text columns: minimum length, maximum length, and average length
- Column value distribution for all columns (except primary keys or those with a very high level of uniqueness). These are the data we use to detect potential data issues between updates at the column level.
We store these statistics for each load, and they can be used to detect potential data quality issues over time (e.g. run charts to detect unexpected shifts, trends and patterns).
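A per-column profile of the kind listed on this slide might look like the sketch below. It is an illustration under stated assumptions: the function name, the result keys, and the max_unique_ratio threshold used to skip value distributions for highly unique columns are all hypothetical, and the standard deviation is the sample standard deviation.

```python
import statistics
from collections import Counter
from datetime import date


def is_null(value):
    # Treat None, blank strings, and the NI / UN codes as nulls.
    return value is None or (
        isinstance(value, str) and value.strip() in {"", "NI", "UN"}
    )


def profile_column(values, max_unique_ratio=0.9):
    values = list(values)
    non_null = [v for v in values if not is_null(v)]
    profile = {
        "total_rows": len(values),
        "null_count": len(values) - len(non_null),
        "distinct": len(set(non_null)),
    }
    if not non_null:
        return profile
    if all(isinstance(v, (int, float)) and not isinstance(v, bool)
           for v in non_null):
        # Numeric column: range, mean, and sample standard deviation.
        profile.update(min=min(non_null), max=max(non_null),
                       mean=statistics.mean(non_null))
        if len(non_null) > 1:
            profile["stdev"] = statistics.stdev(non_null)
    elif all(isinstance(v, date) for v in non_null):
        # Date column: earliest and latest value.
        profile.update(min=min(non_null), max=max(non_null))
    elif all(isinstance(v, str) for v in non_null):
        # Text column: length statistics.
        lengths = [len(v) for v in non_null]
        profile.update(min_len=min(lengths), max_len=max(lengths),
                       avg_len=statistics.mean(lengths))
    # Skip the value distribution for near-unique columns such as keys.
    if profile["distinct"] / len(non_null) <= max_unique_ratio:
        profile["value_counts"] = dict(Counter(non_null))
    return profile
```

Storing one such profile per column, per data partner, and per load gives the history needed for the run charts mentioned above.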

25 QA Processes (3/3)
Mapping of raw EHR values to CDM dataset values relies on crosswalk tables (as much as possible):
- The system notifies us of new raw values in the source data
- We then map these new values
- And update the crosswalk tables
We keep column-level value distribution stats for all tables and columns in the CDM for each data load. We use these statistics to monitor changes over time. For example, we monitor counts of all diabetes diagnoses (ICD-9/ICD-10) in the condition table by partner site and the total for ADVANCE, then compare them to previous counts. Data profile/characterization reports are available to other users as SSRS reports on a report server site.
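The crosswalk workflow on this slide (map known raw values, surface new ones for review, then extend the crosswalk) can be sketched as follows. The function name, the lower-casing and trimming of raw values, the use of None as a placeholder for unmapped values, and the sample sex crosswalk are all assumptions made for illustration.

```python
def apply_crosswalk(raw_values, crosswalk):
    # Map each raw source value to its CDM value. Values missing from
    # the crosswalk are collected for review rather than guessed at.
    mapped = []
    unmapped = set()
    for raw in raw_values:
        key = raw.strip().lower()
        if key in crosswalk:
            mapped.append(crosswalk[key])
        else:
            mapped.append(None)  # placeholder until a mapping is added
            unmapped.add(key)
    return mapped, sorted(unmapped)


# Illustrative crosswalk for a sex column (sample data, not ADVANCE's).
SEX_CROSSWALK = {"female": "F", "male": "M"}

mapped, new_values = apply_crosswalk(["Female", "male ", "Fem."], SEX_CROSSWALK)

# After review, the new raw value is added so later loads map it directly.
for value in new_values:
    if value == "fem.":
        SEX_CROSSWALK[value] = "F"
```

Keeping the mapping in a table and only reporting unknown values means a new raw value never silently changes the CDM; it waits until someone has decided how it should be mapped.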

26 QA Challenges
- Keeping up with dirty and incomplete data as more clinics and providers join systems.
- Keeping up with data model changes (CDM is stable now, but we continue to add elements to RDW).
- No direct access to external partner data or source systems for validation purposes.
- Limited access to staff and resources at partner sites to help with questions and QA.

27 PCORnet Information and Contacts
Website: PCORnet Commons: YouTube: PCORI YouTube Playlist Vimeo: PCORI Vimeo Playlist PCORnet communications contacts:

28 ADVANCE Contacts
- Jon Puro, ADVANCE PI, puroj@ochin.org
- Jen DeVoe, ADVANCE Co-I
- Pedro Rivera, ADVANCE Data Warehouse Architect
- Vance Bauer, VP of Research and ADVANCE Director
- Molly Krancari, ADVANCE Project Manager
- Nikki Stover, ADVANCE Project Coordinator

29 OCHIN Inc. @OCHINinc OCHIN Inc.

