Architecture-Specific Considerations and Methods for Data Quality Assessment in Collaborative Clinical Data Research Networks
Chunhua Weng, PhD
Associate Professor of Biomedical Informatics
Columbia University, New York City
May 27, 2015
Disclosures
No competing interests to disclose
EHR data are subject to quality problems
“With the advent of the information era in medicine, we are pouring out a torrent of medical record misinformation. Medical records, which have long been faulty, contain more distorted, deleted, and misleading information than ever before.”
Burnum (1989), The misinformation era: the fall of the medical record.
What data quality problems should we try to prevent when creating clinical data research networks for CER?
Three CDRN Architecture Models
- Centralized Query of Everything
- Federated Query of Patient Counts
- Federated Query of Research Results
CDRN Model 1
[Diagram: local warehouses CDW-1, CDW-2, ..., CDW-n each feed CDM-based de-identified data through ETL into a central CDW, where the data are indexed.]
What are data quality concerns for model 1?
When and Where | Data Quality Concerns
At local level before ETL | Sampling bias [1], correctness, currency [2], completeness [3], concordance, plausibility
At local level during ETL | Information loss, transformation/coding error
At the central level during indexing | Inconsistency or redundancy across sites
At the central level after indexing | Currency, data provenance, research suitability
References
1. Rusanov A*, Weiskopf NG*, Wang S, Weng C. Hidden in plain sight: bias towards sick patients when sampling patients with sufficient electronic health record data for research. BMC Medical Informatics and Decision Making. 2014 Jun 11;14(1):51.
2. Weiskopf NG, Weng C. Methods and Dimensions of EHR Data Quality Assessment: Enabling Reuse for Clinical Research. J Am Med Inform Assoc. 2013 Jan 1;20(1):144-51.
3. Weiskopf NG, Hripcsak G, Sushmita S, Weng C. Defining and measuring completeness for electronic health records for secondary use. J Biomed Inform. 2013 Oct;46(5):830-6.
Example Data Quality Issues at Local Level
- Bias: Biases in lab ordering for specific population subgroups
- Bias: Biases towards specific population subgroups: do data represent the overall population?
- Bias: Bias in data measurement
- Completeness: Missing required data elements or variables for CER of particular diseases
- Completeness: Missing data due to data fragmentation
- Correctness: Accuracy of ICD-9 diagnosis
- Correctness: Incorrect coding/use of CDM or terminology
- Currency: Outdated gender information for transgender patients
- Plausibility: Discharge date is 25 days earlier than admission date
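Two of the local-level issues above, missing required data elements and an implausible discharge date, lend themselves to simple rule-based checks. The following is a minimal sketch under stated assumptions, not a method from the talk: the field names, the `REQUIRED_FIELDS` tuple, and the `check_record` function are hypothetical and chosen only for illustration.

```python
from datetime import date

# Hypothetical required fields; a real study would define its own list.
REQUIRED_FIELDS = ("patient_id", "admission_date", "discharge_date", "diagnosis_code")


def check_record(record):
    """Return a list of (dimension, message) issues found in one encounter record."""
    issues = []

    # Completeness: every required field must be present and non-empty.
    for field in REQUIRED_FIELDS:
        if record.get(field) in (None, ""):
            issues.append(("completeness", f"missing {field}"))

    # Plausibility: a discharge date must not precede the admission date.
    admitted = record.get("admission_date")
    discharged = record.get("discharge_date")
    if admitted and discharged and discharged < admitted:
        days = (admitted - discharged).days
        issues.append(("plausibility", f"discharge is {days} days before admission"))

    return issues


# Illustrative record (made-up values, not real patient data).
example = {
    "patient_id": "p001",
    "admission_date": date(2015, 3, 26),
    "discharge_date": date(2015, 3, 1),
    "diagnosis_code": "",
}
issues = check_record(example)
for dimension, message in issues:
    print(f"{dimension}: {message}")
```

Checks like these only cover what can be decided from a single record. Bias and correctness, also listed above, generally need comparison against a reference population or a gold standard and are not addressed by this sketch.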
Example Data Quality Issues at Central Level
- Concordance: The same patient has different values from data submitted by different sources
- Redundancy: The same patient is created multiple times in the central database and treated as multiple patients
- Currency: Lack of timely sync between local and central data
- Data provenance: Unable to trace back to data sources, e.g., did you put in discharge diagnosis or admission diagnosis?
- Not granular: Lack of granularity of coding
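The redundancy and concordance issues above can be illustrated with a small grouping check at the central level. This is a sketch only, and it rests on a strong assumption: that records from different sites share a linkage key (the hypothetical `link_id` below). With de-identified data such a key may not exist, and some form of record linkage would be needed first. All field names and values are invented for illustration.

```python
from collections import defaultdict


def find_cross_site_issues(rows, fields=("birth_year", "sex")):
    """Group rows by a shared linkage key and report redundancy and discordance.

    Returns (redundant, discordant):
      redundant  - link_ids that appear in more than one row
      discordant - (link_id, field, distinct values) where the rows disagree
    """
    by_patient = defaultdict(list)
    for row in rows:
        by_patient[row["link_id"]].append(row)

    redundant = []
    discordant = []
    for link_id, group in by_patient.items():
        if len(group) < 2:
            continue
        # Redundancy: the same patient was submitted more than once.
        redundant.append(link_id)
        # Concordance: compare each field across the duplicate rows.
        for field in fields:
            values = {row[field] for row in group}
            if len(values) > 1:
                discordant.append((link_id, field, sorted(values, key=str)))
    return redundant, discordant


# Illustrative rows (made-up values).
rows = [
    {"link_id": "a1", "site": "CDW-1", "birth_year": 1950, "sex": "F"},
    {"link_id": "a1", "site": "CDW-2", "birth_year": 1951, "sex": "F"},
    {"link_id": "b2", "site": "CDW-1", "birth_year": 1980, "sex": "M"},
]
redundant, discordant = find_cross_site_issues(rows)
print(redundant)
print(discordant)
```

Keeping the `site` value on every row is a deliberate choice in this sketch: it is the minimum provenance needed to trace a discordant value back to its source.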
Defining and Measuring Data Completeness
CDRN Model 2: SHRINE Source of Information: https://open.med.harvard.edu/project/shrine/
CDRN Model 2: i2b2-SHRINE
[Diagram: each site loads its local warehouse (CDW-1, CDW-2, CDW-3, ..., CDW-n) through ETL into CDM-based data; a shared query is sent to the sites and aggregated counts are returned.]
What are data quality concerns for model 2?
When and Where | Data Quality Concerns
At local level before ETL | Sampling bias [1], correctness, completeness [3], concordance, plausibility, currency [2]
At local level during ETL | Information loss, transformation/coding error
At the central level during indexing | Inconsistency or redundancy across sites
At the central level | Currency, data provenance, research suitability
References
1. Rusanov A*, Weiskopf NG*, Wang S, Weng C. Hidden in plain sight: bias towards sick patients when sampling patients with sufficient electronic health record data for research. BMC Medical Informatics and Decision Making. 2014 Jun 11;14(1):51.
2. Weiskopf NG, Weng C. Methods and Dimensions of EHR Data Quality Assessment: Enabling Reuse for Clinical Research. J Am Med Inform Assoc. 2013 Jan 1;20(1):144-51.
3. Weiskopf NG, Hripcsak G, Sushmita S, Weng C. Defining and measuring completeness for electronic health records for secondary use. J Biomed Inform. 2013 Oct;46(5):830-6.
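The federated count pattern of model 2 can be sketched in a few lines: the same query runs at every site, and only counts leave the site. This is an illustrative sketch and not the SHRINE API. The functions `run_local_count` and `federated_count`, the record fields, and the example query are all hypothetical.

```python
def run_local_count(site_records, predicate):
    """Runs inside a site: evaluate the shared query, return only a count."""
    return sum(1 for record in site_records if predicate(record))


def federated_count(sites, predicate):
    """Send the same query to every site and aggregate the returned counts."""
    per_site = {
        name: run_local_count(records, predicate)
        for name, records in sites.items()
    }
    return per_site, sum(per_site.values())


# Illustrative site data (made-up values); in practice the rows never leave the site.
sites = {
    "CDW-1": [{"age": 70, "dx": "250.00"}, {"age": 45, "dx": "401.9"}],
    "CDW-2": [{"age": 66, "dx": "250.00"}, {"age": 81, "dx": "250.00"}],
    "CDW-3": [],
}


def older_diabetic(record):
    # Hypothetical shared query: age 65 or over with ICD-9 code 250.00.
    return record["age"] >= 65 and record["dx"] == "250.00"


per_site, total = federated_count(sites, older_diabetic)
print(per_site, total)
```

The sketch also shows why the concerns in the table still apply. A count of zero from a site cannot by itself be told apart from missing or poorly coded data, and a simple sum may count a patient seen at several sites more than once.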
CDRN Model 3: OHDSI
CDRN Model 3: OHDSI
[Diagram: each site loads its local warehouse (CDW-1, CDW-2, CDW-3, ..., CDW-n) through ETL into CDM-based data; a shared protocol is sent to the sites and aggregated evidence is returned.]
What are data quality concerns for model 3?
When and Where | Data Quality Concerns
At local level before ETL | Sampling bias [1], correctness, completeness [3], concordance, plausibility, currency [2]
At local level during ETL | Information loss, transformation/coding error
At the central level during indexing | Inconsistency or redundancy across sites
At the central level | Currency, data provenance, research suitability, metadata transparency and completeness
References
1. Rusanov A*, Weiskopf NG*, Wang S, Weng C. Hidden in plain sight: bias towards sick patients when sampling patients with sufficient electronic health record data for research. BMC Medical Informatics and Decision Making. 2014 Jun 11;14(1):51.
2. Weiskopf NG, Weng C. Methods and Dimensions of EHR Data Quality Assessment: Enabling Reuse for Clinical Research. J Am Med Inform Assoc. 2013 Jan 1;20(1):144-51.
3. Weiskopf NG, Hripcsak G, Sushmita S, Weng C. Defining and measuring completeness for electronic health records for secondary use. J Biomed Inform. 2013 Oct;46(5):830-6.
Tool-Assisted Data Quality Assessment
Take home
- Different CDRN architectures entail different data quality assessment requirements
- Federated approach involving autonomous sites may minimize data query checking complexities
- Tools such as Achilles from OHDSI can be used for data quality assessment
- We need a rapid learning system for DQA