An Unsupervised Approach to Biomedical Term Disambiguation: Integrating UMLS and Medline

Bridget T McInnes
University of Minnesota Twin Cities

Background and Introduction

Word Sense Disambiguation is the problem of determining the appropriate sense of a word that has multiple senses. This is a problem for biomedical applications such as medical coding and indexing. We explore the question of whether biomedical knowledge sources, such as the Unified Medical Language System (UMLS) and Medline, can be used to help identify the appropriate sense of a word. To do this, we introduce an unsupervised vector approach to disambiguate words in biomedical text using contextual information from the UMLS, and we compare our results to Humphrey et al. (JAMIA, 2006) and SenseClusters (Pedersen et al.).

Algorithm

[Flowchart panels: UMLS; Extract Context for Possible Concepts; Medline (Training Data); Test Data]

NLM-WSD Results

Conclusion

The CUI->ST definition obtains the highest accuracy when compared to the other context definitions.
Our approach makes disambiguation distinctions for words that have the same ST, unlike Humphrey et al.
Our approach can be used to perform all-words disambiguation, unlike SenseClusters.

Extract Possible Concepts

C : Mole, unit of measurement
It is the amount of substance that contains as many elementary units as there are atoms in kg of carbon-12.

C : Melanocytic nevus
A benign growth on the skin that contains a cluster of melanocytes and surrounding supportive tissue.

Three vectors

term           C vector (Mole, unit)   C vector (Melanocytic nevus)   Target word vector
amount         4                       0                              0
substance      4                       0                              4
elementary     8                       0                              0
units          12                      0                              0
atoms          32                      0                              0
carbon-12      3                       0                              0
benign         0                       10                             0
growth         0                       12                             0
skin           0                       34                             0
cluster        0                       11                             0
melanocytes    0                       5                              0
tissue         0                       6                              0

EXAMPLE: Disambiguating mole

Instance: He calculated three moles of the substance in the first sample and five in the second.
Data and Resources

National Library of Medicine WSD dataset
Conflate Dataset
  actin - antigens (a_a)
  angiotensin II - oligomycin (a_o)
  endogenous - extracellular matrix (e_e)
  allogenic - arginine - ischemic (a_a_i)
  X chromosome - peptide - plasmid (x_p_p)
  diacetate - apamin - meatus - enterocyte (d_a_m_e)
CuiTools Software Package version 0.13

[Flowchart panels: Calculate Cosine; Concept of Target Word; Vectors of Possible Concepts; Vector of Target Word; Possible Concepts and their context; Create Vectors]

Conflate Results

Training Data
... was around 1 mole mole dose of angiotensin large mole with brown ...

Test Data
He calculated three moles of the substance in the first sample and five in the second.

Calculate the Cosine
[Figure: angles Θ¹ and Θ²]

Assign Sense
He calculated three moles of the substance in the first sample and five in the second.
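The Calculate the Cosine and Assign Sense steps can be illustrated with the three example vectors given for mole. The following is a minimal Python sketch, not the CuiTools implementation: the function names `cosine` and `assign_sense` and the dictionary layout are assumptions made for illustration, and only the counts come from the poster.

```python
import math

# Dimensions of the example vectors, in the order used below:
# amount, substance, elementary, units, atoms, carbon-12,
# benign, growth, skin, cluster, melanocytes, tissue

# Concept vectors copied from the "Three vectors" example.
CONCEPT_VECTORS = {
    "Mole, unit of measurement": [4, 4, 8, 12, 32, 3, 0, 0, 0, 0, 0, 0],
    "Melanocytic nevus": [0, 0, 0, 0, 0, 0, 10, 12, 34, 11, 5, 6],
}

# Target word vector for the test instance (only "substance" is non-zero).
TARGET_VECTOR = [0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


def cosine(u, v):
    """Cosine of the angle between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an all-zero vector shares nothing with any concept
    return dot / (norm_u * norm_v)


def assign_sense(target, concept_vectors):
    """Return the concept whose vector has the highest cosine with the target."""
    scores = {name: cosine(target, vec) for name, vec in concept_vectors.items()}
    best = max(scores, key=scores.get)
    return best, scores


best_sense, scores = assign_sense(TARGET_VECTOR, CONCEPT_VECTORS)
print(best_sense)
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

With these counts the target vector overlaps only with the unit-of-measurement concept (cosine of about 0.112, versus 0 for melanocytic nevus), so the instance is assigned the sense Mole, unit of measurement.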
Create Vectors

Context

Context of Possible Concepts:
  Definition of the possible concept's Concept Unique Identifier (CUI)
  Definition of the possible concept's Semantic Types (ST)
  Definition of the possible concept's CUI; if one does not exist, the definition of its ST is used instead (CUI->ST)
  Definition of the possible concept's CUI and ST (CUI+ST)

Acknowledgements

Ted Pedersen, University of Minnesota Duluth
John Carlis, University of Minnesota Twin Cities
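As a supplementary illustration of the four context definitions listed under Create Vectors, the sketch below selects the text used as a concept's context. It is written under assumptions: the `Concept` class, its field names, and the `concept_context` function are hypothetical, and the semantic type definition string is a placeholder rather than actual UMLS text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Concept:
    """Hypothetical container for the definitions attached to one concept."""
    name: str
    cui_definition: Optional[str]  # may be missing for some concepts
    st_definition: str             # definition of the concept's semantic type


def concept_context(concept: Concept, mode: str) -> str:
    """Return the context text for a concept under one of four definitions."""
    cui_def = concept.cui_definition or ""
    if mode == "CUI":
        return cui_def
    if mode == "ST":
        return concept.st_definition
    if mode == "CUI->ST":
        # Fall back to the semantic type definition only when no CUI definition exists.
        return cui_def if cui_def else concept.st_definition
    if mode == "CUI+ST":
        return (cui_def + " " + concept.st_definition).strip()
    raise ValueError(f"unknown context definition: {mode}")


# The CUI definition is taken from the poster; the ST text is a placeholder.
nevus = Concept(
    name="Melanocytic nevus",
    cui_definition=(
        "A benign growth on the skin that contains a cluster of "
        "melanocytes and surrounding supportive tissue."
    ),
    st_definition="placeholder semantic type definition",
)
no_definition = Concept(
    name="concept without a CUI definition",
    cui_definition=None,
    st_definition="placeholder semantic type definition",
)

for mode in ("CUI", "ST", "CUI->ST", "CUI+ST"):
    print(mode, "|", concept_context(nevus, mode))
print("CUI->ST", "|", concept_context(no_definition, "CUI->ST"))
```

The CUI->ST case differs from CUI only for concepts that lack a definition, where it falls back to the semantic type definition.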