Identifying RA patients from the electronic medical records at Partners HealthCare Robert Plenge, M.D., Ph.D. VA Hospital July 20, 2010 HARVARD MEDICAL SCHOOL
genotype phenotype clinical care
genotype phenotype clinical care bottleneck
July 2010: >30 RA risk loci PTPN “shared epitope” hypothesis HLA DR PADI4CTLA4 TNFAIP 3 STAT4 TRAF1- C5 IL2-IL21 CD40 CCL21 CD244 IL2RB TNFRSF 14 PRKCQ PIP4K2C IL2RA AFF3 Latest GWAS in 25,000 case-control samples with replication in 20,000 additional samples 2009 REL BLK TAGAP CD28 TRAF6 PTPRC FCGR2A PRDM1 CD2- CD58 Together explain ~35% of the genetic burden of disease IL6ST SPRED2 5q21 RBPJ IRF5 CCR6 PXK 2010 (Q2)
genotype phenotype clinical care bottleneck
Genetic predictors of response to anti-TNF therapy in RA PTPRC/CD45 allele n=1,283 patients P= Cui et al (2010) Arth & Rheum
How can we collect DNA and detailed clinical data on >20,000 RA patients?
What are the options for collecting clinical data and DNA for genetic studies?
Options for clinical + DNA designClinical data DNASample size cost clinical trial +++ +$$$ registry $$ claims data +n/a+++$ EMR $
Narrative data = free-form written text –info about symptoms, medical history, medications, exam, impression/plan Codified data = structured format –age, demographics, and billing codes Content of EMRs EMRs are increasingly utilized!
Gabriel (1994) Arthritis and Rheumatism This is not a new idea… Sens: 89% PPV: 57% Sens: 89% PPV: 57%
Gabriel (1994) Arthritis and Rheumatism Conclusion: The sole reliance on such databases for the diagnosis of RA can result in substantial misdiagnosis. …but EMR data are “dirty”
Partners HealthCare: 4 million patients
Partners HealthCare: linked by EMR
Partners HealthCare: organized by i2b2
4 million patients 31,171 patients ICD9 RA and/or CCP checked (goal = high sensitivity) 3,585 RA patients Classification algorithm (goal = high PPV)
Natural language processing (NLP) –disease terms (e.g., RA, lupus) –medications (e.g., methotrexate) –autoantibodies (e.g., CCP, RF) –radiographic erosions Codified data –ICD9 disease codes –prescription medications –laboratory autoantibodies Our library of RA phenotypes Qing Zeng Concept/termAccuracy of concept presence of erosion88% seropositive96% CCP positive98.7% RF positive99.3% etanercept100% methotrexate100% Guergana Savova
Natural language processing (NLP) –disease terms (e.g., RA, lupus) –medications (e.g., methotrexate) –autoantibodies (e.g., CCP, RF) –radiographic erosions Codified data –ICD9 disease codes –prescription medications –laboratory autoantibodies Our library of RA phenotypes Shawn Murphy
‘Optimal’ algorithm to classify RA: NLP + codified data Regression model with a penalty parameter (to avoid over-fitting) Codified dataNLP data Tianxi Cai, Kat Liao
High PPV with adequate sensitivity ✪ 392 out of 400 (98%) had definite or possible RA!
This means more patients! ~25% more subjects with the complete algorithm: 3,585 subjects (3,334 with true RA) 3,046 subjects (2,680 with true RA)
Liao et (2010) Arth. Care Research Characteristicsi2b2 RACORRONA total number 3,5857,971 Mean age (SD) 57.5 (17.5)58.9 (13.4) Female (%) Anti-CCP(%) 63N/A RF (%) Erosions (%) MTX (%) Anti-TNF (%) Clinical features of patients CCP has an OR = 1.5 for predicting erosions
4 million patients 31,171 patients ICD9 RA and/or CCP checked (goal = high sensitivity) 3,585 RA patients Classification algorithm (goal = high PPV) Discarded blood for DNA
Linking the Datamart-Crimson NLP data Codified data
OR similar in EMR cohort 1,500 RA multi-ethnic RA cases and 1,500 matched controls
Genetic risk score also similar
4 million patients 31,171 patients ICD9 RA and/or CCP checked (goal = high sensitivity) 3,585 RA patients Classification algorithm (goal = high PPV) Clinical subsets Discarded blood for DNA
Response to therapy
Non-responder to anti-TNF therapy NLP+codified data, together with statistical modeling, to define treatment response
Responder to anti-TNF therapy NLP+codified data, together with statistical modeling, to define treatment response
Responder to anti-TNF therapy 5-year NIH grant as part of the PharmacoGenomics Research Network (PGRN)
Conclusions
Options for clinical + DNA designClinical data DNASample size cost clinical trial +++ +$$$ registry $$ claims data +n/a+++$ EMR $ Conclusion: NLP + codified data, together with appropriate statistical modeling, can yield accurate clinical data.
Options for clinical + DNA designClinical data DNASample size cost clinical trial +++ +$$$ registry $$ claims data +n/a+++$ EMR $ Conclusion: Genetic studies in our EMR cohort yield effect sizes similar to traditional cohorts.
Options for clinical + DNA designClinical data DNASample size cost clinical trial +++ +$$$ registry $$ claims data +n/a+++$ EMR $ Conclusion: It should be possible to extend this same framework to classify response vs non-response to drugs used to treat RA.