Data Heterogeneity Study (Not Data Quality)
(OR) “Type 2 Diabetes: A modern-day St. Valentine’s Day Massacre”
Feb. 14, 2011
Purposes
Compare Mayo and Intermountain data
– ICD-9 / diagnostic codes
– CPT procedure codes
– Medications
– Labs (fasting glucose / HbA1c)
– Associated conditions (obesity, …?)
– Practice characteristics (specialties)
– Health access characteristics
Specific Aims
To determine the relative frequency of each ICD-9 code for T2DM and for significant comorbidities or prediabetes syndromes, including obesity indicators, at the 2 institutions, by age, sex, and ethnicity
To determine the relative frequency of medications documented for treatment of T2DM
To determine the relative frequency of diagnostic tests performed for T2DM (fasting or non-fasting blood glucose, HbA1c), and the values of their results
To pilot test the Northwestern algorithm for electronically defining T2DM in an equivalent way at the 2 institutions, and describe differences attributable to variation in the EHR data
Study Design Phase 1
Determine sample space of findings
– Association mining (Susan Welch). Use a set of “seed” codes to retrieve the broader set of codes / findings associated with them
– Run separately at each institution, then merge the results
– Avoids human subjectivity in selection of codes / findings
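The seed-code idea in Phase 1 can be illustrated with a minimal sketch. The slides do not say which association-mining method is used, so the lift measure, the thresholds, the function name `associated_codes`, and the toy patient records below are all assumptions made for illustration, not the study's actual procedure.

```python
from collections import Counter

def associated_codes(patients, seeds, min_lift=1.5, min_support=2):
    """Return non-seed codes that co-occur with any seed code more often
    than chance, scored by lift (an assumed measure, not the study's)."""
    n = len(patients)
    seeds = set(seeds)
    # Patients who carry at least one seed code.
    seed_patients = [codes for codes in patients if codes & seeds]
    if n == 0 or not seed_patients:
        return {}
    p_seed = len(seed_patients) / n
    # How often each non-seed code occurs overall and among seed patients.
    overall = Counter(c for codes in patients for c in codes - seeds)
    with_seed = Counter(c for codes in seed_patients for c in codes - seeds)
    result = {}
    for code, joint in with_seed.items():
        if joint < min_support:
            continue
        # lift = P(code and seed) / (P(code) * P(seed))
        lift = (joint / n) / (p_seed * overall[code] / n)
        if lift >= min_lift:
            result[code] = round(lift, 2)
    return result

# Toy records: one set of codes per patient (illustrative values only).
patients = [
    {"250.00", "278.00", "401.9"},
    {"250.00", "278.00"},
    {"250.02", "278.00", "401.9"},
    {"401.9"},
    {"401.9", "530.81"},
    {"530.81"},
]
found = associated_codes(patients, seeds={"250.00", "250.02"})
print(found)
```

Under these assumptions, each institution would run the function on its own records, and the merged sample space would be the union of the codes each site returns.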
Study Design Phase 2
Retrieve data at each institution
– 1 observation / patient / episode / data category
– Assign random patient ID (discard link)
– Assign random date shift
– Assemble 1 dataset / data category
Exchange data and merge into 1 common dataset
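The two de-identification steps in Phase 2 (random patient ID with the link discarded, and a random date shift) might look like the sketch below. The row schema, the function name `deidentify`, the 180-day shift window, and the choice to use one shift per patient (so that intervals within a patient are preserved) are assumptions; the slides do not specify them.

```python
import secrets
from datetime import date, timedelta

def deidentify(rows, max_shift_days=180):
    """Replace patient IDs with random IDs and shift dates.

    rows: list of dicts with 'patient_id' and 'date' keys (hypothetical schema).
    The mapping from real to random ID lives only inside this call and is
    never returned, so the link is discarded when the function exits.
    """
    id_map = {}
    out = []
    for row in rows:
        pid = row["patient_id"]
        if pid not in id_map:
            # One random shift per patient, uniform in [-max, +max] days (assumed).
            shift = secrets.randbelow(2 * max_shift_days + 1) - max_shift_days
            id_map[pid] = (secrets.token_hex(8), timedelta(days=shift))
        new_id, delta = id_map[pid]
        out.append({**row, "patient_id": new_id, "date": row["date"] + delta})
    return out

# Illustrative rows only; identifiers and values are made up.
rows = [
    {"patient_id": "MRN-001", "date": date(2010, 3, 1), "code": "250.00"},
    {"patient_id": "MRN-001", "date": date(2010, 3, 31), "code": "278.00"},
    {"patient_id": "MRN-002", "date": date(2010, 5, 10), "code": "250.00"},
]
shifted = deidentify(rows)
```

Shifting all of a patient's dates by the same amount keeps the time between episodes intact, which matters if the later analysis looks at the effect of time interval.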
Phase 3
Analyze data (within institution, and between)
– Raw frequencies of codes, procedures
– Distributions of glucose, HbA1c, BMI
– Relative frequencies of T2DM findings
– Associations between / among T2DM findings
– Associations between demographics / T2DM findings
– Associations between health access / T2DM findings
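As a rough illustration of the within- and between-institution comparison of relative frequencies, the sketch below tabulates each code's share of all code occurrences at two sites. The function names, the denominator (code occurrences rather than patients), and the toy data are assumptions, not results or methods from the study.

```python
from collections import Counter

def relative_frequencies(codes):
    """Share of all code occurrences taken by each code at one site."""
    counts = Counter(codes)
    total = sum(counts.values())
    return {code: counts[code] / total for code in counts}

def compare(site_a, site_b):
    """Map each code to (relative frequency at A, relative frequency at B)."""
    fa = relative_frequencies(site_a)
    fb = relative_frequencies(site_b)
    return {
        code: (fa.get(code, 0.0), fb.get(code, 0.0))
        for code in sorted(set(fa) | set(fb))
    }

# Toy code lists, one entry per recorded occurrence (illustrative only).
site_a = ["250.00", "250.00", "250.02", "278.00"]
site_b = ["250.00", "278.00", "278.00", "278.00"]
table = compare(site_a, site_b)
for code, (a, b) in table.items():
    print(f"{code}: A={a:.2f} B={b:.2f}")
```

A code that is frequent at one site and absent at the other (here `250.02`) is the kind of difference Phase 4 then tries to explain.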
Phase 4
Interpretation
– What data are different between institutions?
– Why are they different?
– What else affects them?
– What data are not different?
– What is impact of time interval, time
– NOT “Who has Type 2 DM?”
Conclusions
Other Issues
Relationship / synergy with Centerphase Project
Relationship to other SHARP projects
What about unstructured data / NLP?
Is this dataset a shared resource (or when does it become one)?