Phenotype generation from EMR by tensor factorization
SEDI Durham Cohort
James Lu, M.D., Ph.D.
Department of Electrical and Computer Engineering
Department of Medicine
Health System Under Pressure
3.2 trillion / yr (~21% of GDP)
Novel technology
● Small molecules, medical devices, biologics, diagnostics, genomics, transcriptomics, …
Operations
● Align incentives, risk sharing, quality metrics, reducing readmissions, six sigma / lean, …
● Where do I achieve cost arbitrage?
● How do we identify which patients to study?
● Where is my patient going to go next?
● Can we reorganize patient flow?
Computable phenotypes are a top-down process
(PheKB, Northwestern)
Many variations of computable phenotypes require adjudication by physicians (Richesson et al.)
● Expensive and time-consuming
EMR data is large and complicated
Durham County
Patient level
● >240,000 patients
● Birthday
● Death (where available)
● Gender
● Race
● Ethnicity
Visit level
● 4.4 million patient visits
● Average of 18 measurements recorded per visit
● Indicator of presence/absence of particular diseases (computed)
● Encounter date (start, end)
● Location (DHRH, DUH, DRH)
● Path (for example, ED -> inpatient)
● Inpatient / outpatient
Intervention level
● >60,000 types of observations
● CPT
● ICD9 diagnoses
● ICD9 procedures
● Lab values
● Medications
● Vitals
Caveats
● Temporal gaps: people are only patients when they are sick
● We want to incorporate all of this information
● We don't want to be fooled by mistakes and bias
Decompose each touch with the health care system into its parts
● Each visit is a 5-D tensor (~1 billion elements)
  ● Patient
  ● Diagnosis / Billing Codes
  ● Labs
  ● Medications
  ● Time
● Model as counts
● Decompose into a set of K rank-1 components (one vector per mode)
With Piyush Rai and Changwei Hui
[Figure: tensor with axes labeled Codes, Labs, Medications, Time]
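The decomposition described above can be sketched in a few lines. The slide does not state which fitting algorithm was used, so this minimal example substitutes a standard technique: a non-negative CP (rank-K) decomposition fitted with multiplicative updates that decrease the Poisson negative log-likelihood, which suits count data. The function names `reconstruct` and `poisson_cp`, the toy tensor dimensions, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def reconstruct(factors):
    """Sum of K rank-1 outer products, one factor matrix (dim x K) per mode."""
    letters = "abcdefgh"[:len(factors)]
    spec = ",".join(f"{l}z" for l in letters) + "->" + letters
    return np.einsum(spec, *factors)

def poisson_cp(X, K, n_iter=200, seed=0, eps=1e-10):
    """Non-negative CP fit of a count tensor X via multiplicative updates.

    Illustrative stand-in only; not the method used in the talk.
    """
    rng = np.random.default_rng(seed)
    factors = [rng.random((dim, K)) + 0.1 for dim in X.shape]
    letters = "abcdefgh"[:X.ndim]
    losses = []
    for _ in range(n_iter):
        for n in range(X.ndim):
            M = reconstruct(factors)            # current model of the counts
            ratio = X / (M + eps)               # elementwise data / model
            others = [m for m in range(X.ndim) if m != n]
            spec = (letters + ","
                    + ",".join(f"{letters[m]}z" for m in others)
                    + "->" + letters[n] + "z")
            numer = np.einsum(spec, ratio, *[factors[m] for m in others])
            denom = np.prod([factors[m].sum(axis=0) for m in others], axis=0)
            factors[n] *= numer / (denom + eps)  # keeps entries non-negative
        M = reconstruct(factors)
        # Poisson negative log-likelihood, dropping the constant log(X!) term
        losses.append(float(np.sum(M - X * np.log(M + eps))))
    return factors, losses

# Toy data: four small modes standing in for patient, codes, labs, medications.
rng = np.random.default_rng(1)
dims, K = (20, 12, 10, 8), 3
truth = [rng.gamma(1.0, 1.0, size=(dim, K)) for dim in dims]
X = rng.poisson(reconstruct(truth))

factors, losses = poisson_cp(X, K, n_iter=50)
print([f.shape for f in factors])
print("loss after first and last sweep:", losses[0], losses[-1])
```

Each column of a factor matrix is one of the K vectors for that mode. A real EMR tensor of this size is extremely sparse, so a practical implementation would operate on the non-zero entries only rather than on a dense array as done here.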
Computational phenotypes are a bottom-up process.
● Factors represent latent phenotypes
● Evaluate patients with ~23 million data points, with morbidity outcomes in diabetes
[Figure: top-loading items for Factor 2 and Factor 10. Labels shown: Alprazolam, Urate, Malignant Neoplasm Prostate, Clinical Trial Participation, Secondary Malignant Neoplasms of Bone, External Catheter Set, CEA, AG 15-3, Allopurinol, Evening Primrose Oil, Systemic Lupus Erythematosus, Side Effects from Statins, Shoulder Pain, Calcidiol, Jo-1]
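One common way to read a factor as a latent phenotype is to list the items with the largest loadings in that factor. The sketch below assumes a loading matrix with one row per item and one column per factor. The helper `top_items`, the placeholder item names, and the loading values are all invented for illustration and are not results from the cohort.

```python
import numpy as np

def top_items(loadings, names, factor, n=5):
    """Return the n item names with the largest loading in one factor."""
    column = loadings[:, factor]
    order = np.argsort(column)[::-1][:n]   # indices sorted by descending loading
    return [(names[i], float(column[i])) for i in order]

# Hypothetical items and loadings (rows: items, columns: factors).
names = ["lab_A", "med_B", "code_C", "lab_D", "code_E"]
loadings = np.array([
    [0.90, 0.00],
    [0.70, 0.10],
    [0.00, 0.80],
    [0.10, 0.30],
    [0.20, 0.05],
])

for k in range(loadings.shape[1]):
    print(f"Factor {k}:", top_items(loadings, names, factor=k, n=2))
```

Clinical interpretation of the resulting lists still requires expert review; ranking by loading only shows which items co-occur in the factor.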
Patients are composites of common and rare latent phenotypes.
[Figure: patient-by-factor score matrix, 40 most common phenotypes. Labeled factors: ER / EKG; Standard Labs (e.g. CBC / BMP); Kidney Disease; Hypertension; Surgical Patient]
Compare outcome prediction to a known algorithm (UKPDS)
● UKPDS: UK Prospective Diabetes Study outcomes model, used to predict MI, death, and stroke
● 7 demographic + lab variables: age, ethnicity, smoking status, A1c, HDL, total cholesterol, and systolic BP
Datasets
● Original 7-variable model
● All data
● Non-negative Matrix Factorization
● Tensor Factorization
Can we predict outcomes in the next year?
● Death
● AMI
● Stroke
Classification model
● Fit data with Random Forests
● 10-fold cross-validation
With Joseph Lucas
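The comparison described above (random forests, 10-fold cross-validation, several feature sets) could be set up roughly as follows with scikit-learn. This is a sketch on synthetic data, not the study's pipeline: the feature-set names, the use of stratified folds, the AUC metric, and all hyperparameters are assumptions, since the slide specifies only the classifier and the number of folds.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for one binary outcome (e.g. an event in the next year).
X_all, y = make_classification(
    n_samples=400, n_features=40, n_informative=8,
    weights=[0.8, 0.2], random_state=0,
)

# Hypothetical feature sets: a 7-column subset versus a wider feature matrix.
feature_sets = {
    "seven_variable": X_all[:, :7],
    "factor_scores": X_all,
}

# Same 10 folds for every feature set, so the scores are comparable.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

results = {}
for name, X in feature_sets.items():
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    auc = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    results[name] = auc
    print(f"{name}: mean AUC = {auc.mean():.3f} over {len(auc)} folds")
```

Reusing one fold assignment across feature sets means any difference in score reflects the features rather than the split. The same loop would be repeated once per outcome (death, AMI, stroke).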
Tensor-derived factors perform better than the original UKPDS model on all outcomes and provide performance comparable to the “all-data” model
Stroke is similar to Dat