1
Taming EHR Data Using Semantic Similarity to Reduce Dimensionality
Jim Weatherall, PhD
Head, Advanced Analytics Centre, AstraZeneca
Visiting Lecturer, School of Computer Science, University of Manchester
14th World Congress on Medical & Health Informatics, August 2013, Copenhagen
On behalf of the authors:
  Leila Kalankesh, School of Computer Science, UoM
  James Weatherall, AstraZeneca
  Thamer Ba-Dhfari, School of Computer Science, UoM
  Iain Buchan, Institute of Population Health, UoM
  Andy Brass, School of Computer Science, UoM
Brief: Talk for 14 mins, then 2 mins questions, then 2 mins switchover
2
Problems with mining healthcare data
Introduction: Problems with mining healthcare data
Large collections not easily visualised or interpreted
  Read Code   Rubric
  C10F.       Type II Diabetes Mellitus
  1372.       Trivial smoker < 1 cig/day
  bd3j.       Prescription of “Atenolol 25mg tablets”
  G20.        Essential hypertension
  2469.       Measurement of Diastolic Blood Pressure
  246A.       Assessment of Diastolic Blood Pressure
Research not primary purpose for collection
10s of 1000s of dimensions
100s of 1000s of codes
J.Weatherall | August 2013 | Biometrics & Information Sciences | GMD
3
The Salford Integrated Record (SIR)
Data: The Salford Integrated Record (SIR)
Population ~220,000
Integrated primary and secondary care information
Individual Read Code entries captured in primary care information systems
  Codes for diagnosis
  Codes for procedures
  All clinical transactions in primary care and some in secondary care
Data extract for this analysis based on:
  GP data in date range
    Containing 136M Read code entries
  Selected 24K patients with chronic conditions
    Containing 443K Read code entries
Type 1 DM, Type 2 DM, MI, Angina, Stroke / CVA, TIA, CKD, Liver disease
4
Measure ontological distance?
Methods: Semantic Similarity?
How alike are the meanings of two terms?
Measure depth? Or not?
Measure ontological distance?
Also:
  Relative depth
  Multiple inheritance
  Lateral connections
These are edge-based; what about node-based?
All this is ontology-based; what about corpus-based?
From Sanchez, J.Biomed.Inform, 2011
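The edge-based questions on this slide (ontological distance, depth) can be illustrated with a minimal sketch. The toy hierarchy and its term names below are hypothetical, not the real Read code tree, and the two measures shown (shortest path through the lowest common ancestor, and a Wu-Palmer-style depth-aware score with the root at depth 0) are common textbook choices rather than anything the slides confirm the authors used. Multiple inheritance and lateral connections are not modelled.

```python
# Hypothetical toy taxonomy: child -> parent (None marks the root).
PARENT = {
    "root": None,
    "endocrine": "root",
    "circulatory": "root",
    "diabetes": "endocrine",
    "type1_dm": "diabetes",
    "type2_dm": "diabetes",
    "hypertension": "circulatory",
}

def ancestors(term):
    """Return the path from term up to the root, term first."""
    path = []
    while term is not None:
        path.append(term)
        term = PARENT[term]
    return path

def depth(term):
    """Number of edges between term and the root."""
    return len(ancestors(term)) - 1

def lowest_common_ancestor(a, b):
    """Deepest term that is an ancestor of (or equal to) both a and b."""
    seen = set(ancestors(a))
    for term in ancestors(b):  # walks upward, so the first hit is the deepest
        if term in seen:
            return term
    raise ValueError("terms share no ancestor")

def path_length(a, b):
    """Edge count of the shortest path between a and b via their LCA."""
    lca = lowest_common_ancestor(a, b)
    return depth(a) + depth(b) - 2 * depth(lca)

def wu_palmer(a, b):
    """Depth-aware similarity in [0, 1]: 2*depth(LCA) / (depth(a) + depth(b))."""
    total = depth(a) + depth(b)
    if total == 0:
        return 1.0  # both terms are the root
    return 2 * depth(lowest_common_ancestor(a, b)) / total

print(path_length("type1_dm", "type2_dm"), wu_palmer("type1_dm", "type2_dm"))
print(path_length("type1_dm", "hypertension"), wu_palmer("type1_dm", "hypertension"))
```

Path length ignores how deep the two terms sit, whereas the depth-aware score treats siblings low in the tree as more alike than siblings near the root, which is the "Measure depth? Or not?" trade-off raised on the slide.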
5
Semantic Similarity Method
Methods: Semantic Similarity, which method? An ontology of methods!
Semantic Similarity Method (diagram labels): Ontological; Node-based; Edge-based; Hybrid; Corpus-based; Frequency; Context; Proximity; Combined
Corpus-based methods are realistic and grounded in data, but rely on having a large enough corpus and can be biased and computationally expensive.
Ontology-based methods are often computationally lighter and rooted in the knowledge domain of interest, but can lack realism because they do not use real data.
6
Semantic similarity calculation
The Resnik measure
1. Term probability, based on frequency, including descendants and annotations
2. Log transformation, gives “Information Content” (IC)
3. IC of “Most Informative Common Ancestor” gives similarity measure
N = total number of all codes in data set
c ϵ codes(c): count of all instances of the code, as well as annotations and descendants
P. Resnik, “Semantic similarity in a taxonomy: An information-based measure and its application to problems of ambiguity in natural language”, J Artif Intell Res, 1999
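The three steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the taxonomy and the observed code list are hypothetical, the natural logarithm is assumed (the slide says only "log transformation"), and annotations are ignored so that frequency counts cover a code and its descendants only. It follows the standard form of Resnik's measure, p(c) = freq(c) / N, IC(c) = -log p(c), sim(c1, c2) = IC of the most informative common ancestor.

```python
import math
from collections import Counter

# Hypothetical toy taxonomy: child -> parent (None marks the root).
PARENT = {
    "root": None,
    "diabetes": "root",
    "type1_dm": "diabetes",
    "type2_dm": "diabetes",
    "hypertension": "root",
}

def ancestors(code):
    """Path from code up to the root, code first."""
    path = []
    while code is not None:
        path.append(code)
        code = PARENT[code]
    return path

def information_content(observed_codes):
    """Steps 1 and 2: p(c) = freq(c) / N, then IC(c) = -log p(c).

    freq(c) counts occurrences of c and of every descendant of c,
    so each observed code also increments all of its ancestors.
    """
    n_total = len(observed_codes)
    freq = Counter()
    for code in observed_codes:
        for term in ancestors(code):
            freq[term] += 1
    return {term: -math.log(count / n_total) for term, count in freq.items()}

def resnik(a, b, ic):
    """Step 3: IC of the most informative common ancestor of a and b."""
    common = set(ancestors(a)) & set(ancestors(b))
    # Only terms seen in the data (directly or via a descendant) have an IC.
    return max(ic[term] for term in common if term in ic)

# Hypothetical data set of N = 4 code entries.
observed = ["type1_dm", "type2_dm", "type2_dm", "hypertension"]
ic = information_content(observed)
print(resnik("type1_dm", "type2_dm", ic))      # IC of "diabetes"
print(resnik("type1_dm", "hypertension", ic))  # IC of "root"
```

Because the root subsumes every entry, its probability is 1 and its information content is zero, so two codes that meet only at the root score zero similarity.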
7
Stepwise approach to dimensionality reduction
Analysis Plan: Stepwise approach to dimensionality reduction
1. Map patient records from diagnosis space into a similarity space
2. Map patient records into a low-dimensional vector space via PCA
3. Project patient records onto low-dimensional vector space and cluster patients by similarity
8
“The Similarity Matrix”
Analysis, Step 1: Mapping from diagnosis space to similarity space
“The Similarity Matrix”
        p1           p2           …   pn
  p1    sim(p1,p1)   sim(p1,p2)   …   sim(p1,pn)
  p2    sim(p2,p1)   sim(p2,p2)   …   sim(p2,pn)
  …
  pn    sim(pn,p1)   sim(pn,p2)   …   sim(pn,pn)
pi = patient i
sim(pi,pj) = similarity score between patients i and j
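A minimal sketch of building such a matrix follows. The slides do not state how code-level similarities are aggregated into a patient-level score, so the best-match average used here is an assumption, and `toy_code_sim` is a hypothetical stand-in for a real code-level measure such as Resnik. The patient records are invented for illustration.

```python
def toy_code_sim(a, b):
    """Hypothetical stand-in for a code-level measure such as Resnik."""
    if a == b:
        return 1.0
    return 0.5 if a[0] == b[0] else 0.0

def patient_similarity(codes_a, codes_b, code_sim):
    """Best-match average between two non-empty code lists (assumed aggregation)."""
    def one_way(src, dst):
        # For each code in src, take its best match in dst, then average.
        return sum(max(code_sim(s, d) for d in dst) for s in src) / len(src)
    return 0.5 * (one_way(codes_a, codes_b) + one_way(codes_b, codes_a))

def similarity_matrix(patients, code_sim):
    """Build the symmetric n x n matrix with entries sim(p_i, p_j)."""
    n = len(patients)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):  # fill the upper triangle and mirror it
            score = patient_similarity(patients[i], patients[j], code_sim)
            matrix[i][j] = matrix[j][i] = score
    return matrix

# Invented patient records, each a list of Read-style codes.
patients = [["C10F.", "G20."], ["G20.", "2469."], ["246A."]]
matrix = similarity_matrix(patients, toy_code_sim)
for row in matrix:
    print(row)
```

Each row of the result describes one patient by their similarity to every other patient, which is the representation that the later PCA step operates on.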
9
Analysis, Steps 2 + 3: PCA on the similarity matrix, visualisation & clustering
Natural co-morbidity: Diabetes is a risk factor for angina due to its accelerating effect on atherosclerosis
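Steps 2 and 3 can be sketched without external libraries. This is an illustration only: the 4-patient similarity matrix is hypothetical, only the first principal component is extracted (by power iteration), and the slides do not say which PCA implementation or clustering method the authors used. In practice a numerical library would normally do this work.

```python
import math

def center_columns(rows):
    """Subtract each column's mean so every feature has mean zero."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def covariance(centered):
    """Sample covariance matrix of the centred rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def leading_component(cov, iterations=200):
    """Approximate the top eigenvector of cov by power iteration."""
    d = len(cov)
    # Ramp start vector: an all-ones start is orthogonal to the leading
    # component of this toy example and would collapse to zero.
    vec = [float(i + 1) for i in range(d)]
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    return vec

def project(centered, component):
    """Score of each patient (row) along the given component."""
    return [sum(x * w for x, w in zip(row, component)) for row in centered]

# Hypothetical similarity matrix: patients 0-1 and 2-3 form two groups.
SIM = [
    [1.0, 0.9, 0.0, 0.0],
    [0.9, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.9],
    [0.0, 0.0, 0.9, 1.0],
]
centered = center_columns(SIM)
pc1 = leading_component(covariance(centered))
scores = project(centered, pc1)
print(scores)
```

In this toy case the two groups land on opposite sides of zero along the first component; scores like these (usually over two or three components) are what would be plotted and passed to a clustering method.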
10
Discussion & Conclusion
Review & Outlook
Review:
  Patients with similar diagnosis codes are grouped together
  Therefore, the semantic similarity technique works, to some degree
  Therefore, this is a viable route to dimensionality reduction in complex healthcare data sets
Outlook:
  Exploring co-morbidity and co-treatment effects?
  New biomedical hypotheses?
  Transferability of method?
  Population level characterisation?
  New data mining paradigms?
11
Thank You!