A Pragmatic Model for Data Quality Assessment in Clinical Research
Michael G. Kahn, M.D., Ph.D.
Department of Pediatrics, University of Colorado, Denver
Colorado Clinical and Translational Sciences Institute
Department of Clinical Informatics, The Children’s Hospital
Electronic Data Methods Forum Methods Symposium, 17-October-2011

Funding was provided by a contract from AcademyHealth. Additional support was provided by AHRQ 1R01HS (Scalable PArtnering Network for CER: Across Lifespan, Conditions, and Settings), AHRQ 1R01HS (Scalable Architecture for Federated Translational Inquiries Network), and NIH/NCRR Colorado CTSI Grant Number UL1 RR (Colorado Clinical and Translational Sciences Institute).
Disclosures
Presentation based on AcademyHealth-supported paper: “A Pragmatic Framework for Single-Site and Multi-Site Data Quality Assessment in Electronic Health Record-Based Clinical Research”
Michael G. Kahn*,1,2; Marsha A. Raebel 3,4; Jason M. Glanz 3,5; Karen Riedlinger 6; John F. Steiner 3
1. Department of Pediatrics, University of Colorado Anschutz Medical Center, Aurora, Colorado
2. Colorado Clinical and Translational Sciences Institute, University of Colorado Anschutz Medical Center, Aurora, Colorado
3. Institute for Health Research, Kaiser Permanente Colorado, Denver, Colorado
4. School of Pharmacy, University of Colorado, Aurora, Colorado
5. Department of Epidemiology, Colorado School of Public Health, Aurora, Colorado
6. Northwest Kaiser Center for Health Research, Portland, Oregon
What is the issue?
Poor data quality can invalidate research findings
  – Cohort identification
  – Risk factors / exposures / confounders
  – Interventions
  – Outcomes
Data quality in non-research settings is even more problematic
  – Documentation practices
  – Workflow
  – Diligence to data quality
Our focus: how to assess data quality systematically?
Why is a systematic data quality assessment framework useful?
We all do various data quality reviews
  – We know what we’ve looked at
  – We may not know what we haven’t looked at
A comprehensive evaluation of data quality is too resource intensive
  – Need to focus on aspects that matter
  – If needs change, are existing DQ assessments adequate?
Key Features of this Presentation
A comprehensive data quality framework adapted from information sciences for clinical research
  – Definitions of data quality
      – Multi-dimensional
      – Context-dependent
Operationalize DQ assessments
  – Uses framework to ensure coverage
A formative proposal: data quality meta-data tags
Data Quality Assessment Stages
Stage 1: Initial assessments of source data sets prior to analysis
  – Simple global analyses, visualizations, descriptive statistics
  – Both single-site and multi-site
Stage 2: Study-specific analytic subsets with complex models and detailed data validations focused on dependent and independent variables
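A Stage 1 assessment of this kind can be as simple as a per-site table of descriptive statistics. The following is a minimal sketch, not the authors' method: the record layout, the `site` and `glucose` field names, and the sample values are all hypothetical, and `None` is assumed to mark a missing value.

```python
from statistics import median


def summarize(values):
    """Return simple descriptive statistics for one variable.

    `values` is a list of numbers in which None marks a missing value.
    """
    present = [v for v in values if v is not None]
    n = len(values)
    return {
        "n": n,
        "missing_fraction": (n - len(present)) / n if n else None,
        "min": min(present) if present else None,
        "median": median(present) if present else None,
        "max": max(present) if present else None,
    }


def summarize_by_site(records, variable):
    """Group records (dicts with a 'site' key) by site and summarize one variable."""
    by_site = {}
    for rec in records:
        by_site.setdefault(rec["site"], []).append(rec.get(variable))
    return {site: summarize(vals) for site, vals in by_site.items()}


# Hypothetical serum glucose values from two sites (illustrative only).
records = [
    {"site": "A", "glucose": 92},
    {"site": "A", "glucose": 110},
    {"site": "A", "glucose": None},
    {"site": "B", "glucose": 5.1},  # much lower scale: possibly different units
    {"site": "B", "glucose": 6.0},
]

for site, stats in summarize_by_site(records, "glucose").items():
    print(site, stats)
```

Placing the site summaries side by side is what makes a multi-site discrepancy, such as a difference in units or in missingness, visible before any study-specific modeling begins.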
A trivial example: Marital Status by Age
Would this result be worrisome?
It’s tough being 6 years old...
Should we be worried?
No
  – Large numbers will swamp the effect of anomalous data, or trimmed data can be used
  – Simulation techniques are insensitive to small errors
Yes
  – Observed site variation may be driven by differences in data quality, not clinical practices
  – Genomic associations look for small signals (small differences in risks) among populations
Hyperkalemia with K+-sparing Agents
Comparative Temporal Trends: Serum Glucose
Data quality assessment lifecycles
1. Site level: extraction from EMR, data quality assessments
2. Multi-site level: data merging, data quality assessments
3. Final analytic data set
Multi-site quality assessment workflows
Many loops
Many decisions (diamonds)
Data quality dimensions from the IS/CS literature
Terms used in Information Sciences literature to describe data quality
Wand Y, Wang R. Anchoring data quality dimensions in ontological foundations. Comm ACM. 1996;39(11):86-95.
Defining data quality: The “Fit for Use” Model
Borrowed from industrial quality frameworks
  – Juran (1951): “Fitness for Use”
  – Design, conformance, availability, safety, and field use
Multiple adaptations by information science community
  – Not all adaptations are clearly specified
  – Not all adaptations are consistent
  – Not linked to measurement/assessment methods
The Wang and Strong Data Quality Model
Interviews with a broad set of data consumers
Yielded 118 data quality features
Four categories
Fifteen dimensions
Includes features of the data and the data system
Our modification: two data categories, five dimensions
Wang, R. and D. Strong (1996). "Beyond accuracy: What data quality means to data consumers." J. Management Information Systems 12(4): 5-34.
How to measure data quality?
Need to link conceptual framework with methods
Maydanchik: Five classes of data quality rules
  – Attribute domain: validate individual values
  – Relational integrity: accurate relationships between tables, records, and fields across multiple tables
  – Historical: time-varying data
  – State-dependent: changes follow expected transitions
  – Dependency: follow real-world behaviors
Maydanchik, A. (2007). Data quality assessment. Bradley Beach, NJ, Technics Publications.
Data Quality Assessment METHODS
Five classes of data quality rules
30 assessment methods
  – Attribute domain (5 methods)
  – Relational integrity (4 methods)
  – Historical (9 methods)
  – State-dependent (7 methods)
  – Dependency (5 methods)
Time and change assessments dominate!
Dimension 1: Attribute domain constraints
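Attribute domain constraints validate individual values. As a minimal sketch under stated assumptions, the check below tests one field against an allowed code set or an inclusive numeric range. The function name, field names, code set, and plausibility limits are hypothetical illustrations, not clinical reference values and not part of the original framework.

```python
def domain_violations(records, field, allowed=None, low=None, high=None):
    """Return (index, value) pairs for records whose `field` is outside its domain.

    `allowed` is an optional set of permitted codes; `low` and `high` are
    optional inclusive numeric bounds. Missing values (None) are skipped
    because missingness is assessed separately.
    """
    bad = []
    for i, rec in enumerate(records):
        value = rec.get(field)
        if value is None:
            continue
        if allowed is not None and value not in allowed:
            bad.append((i, value))
        elif low is not None and value < low:
            bad.append((i, value))
        elif high is not None and value > high:
            bad.append((i, value))
    return bad


# Hypothetical records; codes and limits are for illustration only.
patients = [
    {"sex": "F", "potassium": 4.1},
    {"sex": "M", "potassium": 41.0},  # possibly a misplaced decimal point
    {"sex": "Q", "potassium": None},  # code outside the allowed set
]

print(domain_violations(patients, "sex", allowed={"F", "M", "U"}))
print(domain_violations(patients, "potassium", low=1.0, high=10.0))
```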
Dimension 2: Relational integrity rules
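Relational integrity rules concern relationships between tables, records, and fields. Two common checks are sketched below: child rows that reference a non-existent parent, and primary keys that are not unique. The table and column names (`patient_id`, `encounter_id`) are hypothetical.

```python
def orphaned_rows(child_rows, parent_rows, foreign_key, primary_key):
    """Return child rows whose foreign key has no matching parent primary key."""
    parent_keys = {row[primary_key] for row in parent_rows}
    return [row for row in child_rows if row.get(foreign_key) not in parent_keys]


def duplicate_keys(rows, primary_key):
    """Return the sorted primary-key values that occur more than once."""
    seen, dupes = set(), set()
    for row in rows:
        key = row[primary_key]
        if key in seen:
            dupes.add(key)
        else:
            seen.add(key)
    return sorted(dupes)


# Hypothetical patient and encounter tables.
patients = [{"patient_id": 1}, {"patient_id": 2}, {"patient_id": 2}]
encounters = [
    {"encounter_id": 10, "patient_id": 1},
    {"encounter_id": 11, "patient_id": 3},  # no such patient
]

print(orphaned_rows(encounters, patients, "patient_id", "patient_id"))
print(duplicate_keys(patients, "patient_id"))
```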
Dimension 3: Historical rules
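Historical rules, the third of the five rule classes, apply to time-varying data. The sketch below illustrates three simple checks on one patient's dated observations: no observation before birth, none after the extraction date, and stored order matching chronological order. The rule names and sample dates are hypothetical, and these three checks are examples rather than the nine methods counted in the framework.

```python
from datetime import date


def historical_violations(birth_date, observations, extract_date):
    """Check one patient's dated observations against simple time rules.

    `observations` is a list of (date, value) tuples in stored order.
    Returns a list of (rule_name, observation_date) pairs.
    """
    problems = []
    previous = None
    for obs_date, _value in observations:
        if obs_date < birth_date:
            problems.append(("before_birth", obs_date))
        if obs_date > extract_date:
            problems.append(("after_extraction", obs_date))
        if previous is not None and obs_date < previous:
            problems.append(("out_of_order", obs_date))
        previous = obs_date
    return problems


# Hypothetical observation history for a single patient.
history = [
    (date(2004, 12, 31), 90),  # dated before the birth date
    (date(2010, 1, 5), 95),
    (date(2009, 6, 1), 99),    # stored after a later observation
    (date(2030, 1, 1), 100),   # dated after the data were extracted
]

for problem in historical_violations(date(2005, 3, 1), history, date(2011, 10, 17)):
    print(problem)
```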
Dimension 4: State-dependent rules
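State-dependent rules check that changes follow expected transitions. One way to sketch this is a table of allowed transitions plus a scan over each record's state history. The encounter states and the transition table below are a hypothetical model chosen for illustration; a real project would derive them from its own data model.

```python
# Hypothetical state model for an inpatient encounter (illustrative only).
ALLOWED_TRANSITIONS = {
    "scheduled": {"admitted", "cancelled"},
    "admitted": {"transferred", "discharged"},
    "transferred": {"transferred", "discharged"},
    "discharged": set(),
    "cancelled": set(),
}


def invalid_transitions(states, allowed=ALLOWED_TRANSITIONS):
    """Return (position, from_state, to_state) for each disallowed step.

    `states` is the chronological list of states recorded for one encounter.
    An unknown starting state is treated as having no allowed successors.
    """
    bad = []
    for i in range(1, len(states)):
        before, after = states[i - 1], states[i]
        if after not in allowed.get(before, set()):
            bad.append((i, before, after))
    return bad


print(invalid_transitions(["scheduled", "admitted", "discharged"]))
print(invalid_transitions(["scheduled", "discharged", "admitted"]))
```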
Dimension 5: Attribute dependency rules
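Attribute dependency rules check that related values follow real-world behavior. The marital-status-by-age example from earlier in the talk fits this class: a young child recorded as married is implausible. The sketch below assumes hypothetical field names, status codes, and an age threshold (`MIN_PLAUSIBLE_AGE`) that each project would need to set for itself.

```python
MIN_PLAUSIBLE_AGE = 15  # hypothetical threshold, not taken from the source
EVER_MARRIED = {"married", "divorced", "widowed", "separated"}  # hypothetical codes


def marital_age_conflicts(records, min_age=MIN_PLAUSIBLE_AGE):
    """Return records whose marital status is implausible for the recorded age.

    Records with a missing age are skipped, since the dependency cannot be
    evaluated for them.
    """
    return [
        rec
        for rec in records
        if rec.get("age") is not None
        and rec["age"] < min_age
        and rec.get("marital_status") in EVER_MARRIED
    ]


# Hypothetical demographic records.
people = [
    {"id": 1, "age": 6, "marital_status": "divorced"},
    {"id": 2, "age": 6, "marital_status": "single"},
    {"id": 3, "age": 42, "marital_status": "married"},
    {"id": 4, "age": None, "marital_status": "married"},
]

print(marital_age_conflicts(people))
```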
How to use this framework
Determine which aspects of data quality matter most at Stage 1
  – What is needed to support Stage 2?
  – What is doable with data sources?
  – What can the project afford to do?
  – What needs to be done once versus repeatedly?
Write up a data quality assessment plan
  – What’s in, what’s out
  – And why
Extra credit slides: A formative proposal
President’s Council of Advisors on Science and Technology (PCAST)
  – Recommended mandatory “metadata” tags attached to all HIT data elements
Metadata are descriptions of the data
PCAST proposed tags: data provenance, privacy permissions/restrictions
Extra credit slides: A formative proposal
CER community defines metadata tags that describe data quality for data elements and data sets
  – Simple distributions (mean, median, min, max, missingness, histograms) à la OMOP OSCAR
  – More comprehensive set of measures derived from this framework
If you are interested in this concept, contact me!
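To make the simple-distribution tag concrete, the sketch below computes the listed measures (mean, median, min, max, missingness, histogram) for one data element and serializes them as JSON. The tag layout, key names, and fixed-width binning are assumptions made for illustration; this is not the OMOP OSCAR format or a format proposed in the talk.

```python
import json
from collections import Counter
from statistics import mean, median


def quality_tag(element, values, bin_width):
    """Build a data quality metadata tag for one data element.

    `values` is a list of numbers in which None marks a missing value.
    Histogram keys are the lower edges of fixed-width bins.
    """
    present = [v for v in values if v is not None]
    histogram = Counter(int(v // bin_width) * bin_width for v in present)
    n = len(values)
    return {
        "element": element,
        "n": n,
        "missingness": (n - len(present)) / n if n else None,
        "mean": mean(present) if present else None,
        "median": median(present) if present else None,
        "min": min(present) if present else None,
        "max": max(present) if present else None,
        "histogram": {str(edge): histogram[edge] for edge in sorted(histogram)},
    }


# Hypothetical values for a single data element.
tag = quality_tag("serum_glucose", [92, 110, 101, None], bin_width=20)
print(json.dumps(tag, indent=2))
```

A tag like this could travel with a data set so that a downstream analyst sees its missingness and distribution before deciding whether the data are fit for a given use.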