Download presentation
Presentation is loading. Please wait.
1
K NOWLEDGE - BASED M ETHOD FOR D ETERMINING THE M EANING OF A MBIGUOUS B IOMEDICAL T ERMS U SING I NFORMATION C ONTENT M EASURES OF S IMILARITY Bridget McInnes Ted Pedersen Ying Liu Genevieve B. Melton Serguei Pakhomov 1
2
O BJECTIVE OF THIS WORK Develop and evaluate a method than can disambiguate terms in biomedical text by exploiting similarity information extrapolated from the Unified Medical Language System Evaluate the efficacy of Information Content-based similarity measures over path-based similarity measures for Word Sense Disambiguation, WSD 2
3
W ORD S ENSE D ISAMBIGUATION Word sense disambiguation is the task of determining the appropriate sense of a term given context in which it is used. TERM: tolerance Drug Tolerance Immune Tolerance 3
4
W ORD S ENSE D ISAMBIGUATION Word sense disambiguation is the task of determining the appropriate sense of a term given context in which it is used. Busprione attenuates tolerance to morphine in mice with skin cancer Drug Tolerance Immune Tolerance 4
5
S ENSE INVENTORY : U NIFIED M EDICAL L ANGUAGE S YSTEM Unified Medical Language Sources (UMLS) Semantic Network Metathesaurus ~1.7 million biomedical and clinical concepts; integrated semi-automatically CUIs (Concept Unique Identifiers), linked: Hierarchical: PAR/CHD and RB/RN Non-hierarchical: SIB, RO Sources viewed together or independently Medical Subject Heading (MSH) SPECIALIST Lexicon Biomedical and clinical terms, including variants 5
6
W ORD S ENSE D ISAMBIGUATION Busprione attenuates tolerance to morphine in mice with skin cancer Drug Tolerance: C0013220 Immune Tolerance: C0020963 Concept Unique Identifiers: CUIs 6
7
S ENSE R ELATE ALGORITHM Each possible sense of a target word is assigned a score [sum similarity between it and its surrounding terms] Assign target word the sense with highest score Proposed by Patwardhan and Pedersen 2003 using WordNet UMLS::SenseRelate is a modification of this algorithm using information from the UMLS NEXT UP: an example 7
8
S ENSE R ELATE E XAMPLE Busprione attenuates tolerance to morphine in mice with skin cancer 8
9
S ENSE R ELATE E XAMPLE Busprione attenuates tolerance to morphine in mice with skin cancer Drug Tolerance: C0013220 Immune Tolerance: C0020963 9
10
S ENSE R ELATE E XAMPLE Busprione attenuates tolerance to morphine in mice with skin cancer Drug Tolerance: C0013220 Immune Tolerance: C0020963 Busprione: C0006462 Morphine: C0026549 Mice: C0026809 Skin cancer: C0007114 10
11
S ENSE R ELATE E XAMPLE 0.09 0.16 0.11 Busprione attenuates tolerance to morphine in mice with skin cancer 0.09 Drug Tolerance: C0013220 Immune Tolerance: C0020963 Busprione: C0006462 Morphine: C0026549 Mice: C0026809 Skin cancer: C0007114 11
12
S ENSE R ELATE E XAMPLE 0.09 0.16 0.11 Busprione attenuates tolerance to morphine in mice with skin cancer 0.09 Drug Tolerance: C0013220 Immune Tolerance: C0020963 Busprione: C0006462 Morphine: C0026549 Mice: C0026809 Skin cancer: C0007114 Drug Tolerance Score = 0.09 + 0.09 + 0.16 + 0.11 = 0.45 12
13
S ENSE R ELATE E XAMPLE 0.09 0.16 0.11 0.09 0.05 0.04 Busprione attenuates tolerance to morphine in mice with skin cancer 0.09 Drug Tolerance: C0013220 Immune Tolerance: C0020963 Busprione: C0006462 Morphine: C0026549 Mice: C0026809 Skin cancer: C0007114 Drug Tolerance Score = 0.09 + 0.09 + 0.16 + 0.11 = 0.45 13
14
S ENSE R ELATE E XAMPLE 0.09 0.16 0.11 0.09 0.05 0.04 Busprione attenuates tolerance to morphine in mice with skin cancer 0.09 Drug Tolerance: C0013220 Immune Tolerance: C0020963 Busprione: C0006462 Morphine: C0026549 Mice: C0026809 Skin cancer: C0007114 Drug Tolerance Score = 0.09 + 0.09 + 0.16 + 0.11 = 0.45 Immune Tolerance Score = 0.09 + 0.09 + 0.05 + 0.05 = 0.27 14
15
S ENSE R ELATE E XAMPLE 0.09 0.16 0.11 0.09 0.05 0.04 Busprione attenuates tolerance to morphine in mice with skin cancer 0.09 Drug Tolerance: C0013220 Immune Tolerance: C0020963 Busprione: C0006462 Morphine: C0026549 Mice: C0026809 Skin cancer: C0007114 Drug Tolerance Score = 0.09 + 0.09 + 0.16 + 0.11 = 0.45 Immune Tolerance Score = 0.09 + 0.09 + 0.05 + 0.05 = 0.27 15
16
S ENSE R ELATE A SSUMPTION An ambiguous word is often used in the sense that is most similar to the sense of the terms that surround it 16
17
S ENSE R ELATE C OMPONENTS Identifying the concepts of surrounding terms Calculating semantic similarity 17
18
I DENTIFYING THE CONCEPTS OF THE SURROUNDING TERMS Use the SPECIALIST LEXICON to identify the terms and map the terms doing a string match to the MRCONSO table in the UMLS 18
19
I DENTIFYING THE CONCEPTS OF THE SURROUNDING TERMS Use the SPECIALIST LEXICON to identify the terms and map the terms doing a string match to the MRCONSO table in the UMLS Busprione attenuates tolerance to morphine in mice with skin cancer 19
20
I DENTIFYING THE CONCEPTS OF THE SURROUNDING TERMS Use the SPECIALIST LEXICON to identify the terms and map the terms doing a string match to the MRCONSO table in the UMLS... skin cancer skin grafting skin disease... SPECIALIST LEXICON Busprione attenuates tolerance to morphine in mice with skin cancer 20
21
I DENTIFYING THE CONCEPTS OF THE SURROUNDING TERMS Use the SPECIALIST LEXICON to identify the terms and map the terms doing a string match to the MRCONSO table in the UMLS... skin cancer skin grafting skin disease... skin cancer C0007114 skin grafting C0037297 skin disease C0037274... SPECIALIST LEXICON MRCONSO Busprione attenuates tolerance to morphine in mice with skin cancer 21
22
S EMANTIC S IMILARITY M EASURES Path-based measures Path Wu and Palmer Leacock and Chodorow Ngyuen and Al-Mubaid Information content (IC)-based measures Resnik Lin Jiang and Conrath 22
23
P ATH - BASED SIMILARITY MEASURES Use only the path information obtained from a taxonomy 23
24
P ATH - BASED SIMILARITY MEASURES Use only the path information obtained from a taxonomy Path measure sim(c1,c2) = 1 / minpath(c2,c2) where minpath is the shortest path between the two concepts 24
25
P ATH - BASED SIMILARITY MEASURES Use only the path information obtained from a taxonomy Path measure sim(c1,c2) = 1/minpath(c2,c2) where minpath is the shortest path between the two concepts Wu and Palmer, 1994 sim(c1,c2) = (2*depth(LCS(c2,c2))) / (depth(c1)+depth(c2)) where LCS is the least common subsumer of the two concepts 25
26
P ATH - BASED SIMILARITY MEASURES Use only the path information obtained from a taxonomy Path measure sim(c1,c2) = 1/ minpath(c2,c2) where minpath is the shortest path between the two concepts Wu and Palmer, 1994 sim(c1,c2) = (2*depth(LCS(c2,c2))) / (depth(c1)+depth(c2)) where LCS is the least common subsumer of the two concepts Leacock and Chodorow, 1998 sim(c1,c2) = -log( minpath(c1,c2) / (2D) ) where D is the total depth of the taxonomy 26
27
P ATH - BASED SIMILARITY MEASURES Use only the path information obtained from a taxonomy Path measure sim(c1,c2) = 1/ minpath(c2,c2) where minpath is the shortest path between the two concepts Leacock and Chodorow, 1998 sim(c1,c2) = -log( minpath(c1,c2) / (2D) ) where D is the total depth of the taxonomy Wu and Palmer, 1994 sim(c1,c2) = (2*depth(LCS(c2,c2))) / (depth(c1)+depth(c2)) where LCS is the least common subsumer of the two concepts Nyguen and Al-Mubaid, 2006 sim(c1,c2) = log ( (2 + minpath(c1,c2) - 1) * (D - depth(LCS(c1,c2))) ) 27
28
P ATH - BASED S IMILARITY M EASURES USE ONLY THE PATH INFORMATION OBTAINED FROM A TAXONOMY Disease: C0012634 Drug Related Disorder: C0277579 Drug Tolerance: C0013220 Neoplasm: C1302761 Neoplastic Disease: C1882062 Malignant Neoplasm: C0006826 Skin cancer: C0007114 28
29
I NFORMATION CONTENT - BASED M EASURES Incorporate the probability of the concepts IC = -log(P(concept)) 29
30
I NFORMATION CONTENT - BASED M EASURES Incorporate the probability of the concepts IC = -log(P(concept)) P(concept) Calculated by summing the probability of the concept and the probability of its descendants Probabilities are obtained from an external corpus 30
31
I NFORMATION CONTENT - BASED M EASURES Incorporate the probability of the concepts IC = -log(P(concept) Resnik, 1995 sim(c1,c2) = IC(LCS(c1,c2)) 31
32
I NFORMATION CONTENT - BASED M EASURES Incorporate the probability of the concepts IC = -log(P(concept) Resnik, 1995 sim(c1,c2) = IC(LCS(c2,c2)) Jiang and Conrath, 1997 sim(c1,c2) = 1 / (IC(c1)+IC(c2) – 2* IC(LCS(c1,c2)) 32
33
I NFORMATION CONTENT - BASED M EASURES Incorporate the probability of the concepts IC = -log(P(concept) Resnik, 1995 sim(c1,c2) = IC(LCS(c2,c2)) Jiang and Conrath, 1997 sim(c1,c2) = 1 ÷ (IC(c1)+IC(c2) – 2* IC(LCS(c1,c2)) Lin, 1998 sim(c1,c2) = (2*IC(LCS(c2,c2))) / (IC(c1)+IC(c2)) 33
34
IC- BASED SIMILARITY MEASURES Disease: C0012634 Drug Related Disorder: C0277579 Drug Tolerance: C0013220 Neoplasm: C1302761 Neoplastic Disease: C1882062 Malignant Neoplasm: C0006826 Skin cancer: C0007114 PATH INFORMATION PROBABILITY OF CONCEPTS EXTERNAL CORPUS 34
35
E XPERIMENTAL F RAMEWORK Use open-source UMLS::Similarity package to obtain the similarity between the terms and possible senses in the SenseRelate algorithm Path information: parent/child relations in MSH source Information content: calculated using the UMLSonMedline dataset created by NLM Consists of concepts from 2009AB UMLS and the frequency they occurred in Medline using the Essie Search Engine (Ide et al 2007) Medline: database of citations of biomedical/clinical articles 35
36
E VALUATION D ATA : MSH WSD MSH-WSD dataset (Jimeno-Yepes, et al 2011) 203 target words (ambiguous word) from Medline 106 termse.g. tolerance 88 acronymse.g. CA (calcium, california) 9 mixturese.g. bat (brown adipose tissue) Each target word contains ~187 instances (Medline abstracts) abstract = ~ 500 words Each target word in the instances assigned a concept from MSH by exploiting the manually assigned MSH concepts assigned to the abstract Average of 2.08 possible senses per target word Majority sense over all the target words is 54.5% 36
37
R ESULTS baseline path lchwup nam resjcn accuracyaccuracy Path-basedIC-based lin 37
38
C OMPARISON ACROSS SUBSETS OF MSH - WSD accuracyaccuracy 38
39
C OMPARISON ACROSS SUBSETS OF MSH - WSD accuracyaccuracy 39
40
C OMPARISON ACROSS SUBSETS OF MSH - WSD accuracyaccuracy 40
41
C OMPARISON ACROSS SUBSETS OF MSH - WSD accuracyaccuracy 41
42
C OMPARISON ACROSS SUBSETS OF MSH - WSD accuracyaccuracy 42
43
W INDOW SIZES Use the terms surrounding the target word within a specified window: 1, 2, 5, 10, 25, 50, 60, 70 Busprione attenuates tolerance to morphine in mice with skin_cancer WINDOW SIZE = 2 43
44
C OMPARISON OF WINDOW SIZES FOR LIN accuracyaccuracy window size 44
45
S URROUNDING TERMS Not all terms have a concept in the UMLS therefore Not all surrounding terms in the window mapped to CUIs 45
46
W INDOW SIZES VERSUS MAPPED TERMS numberofmappingsnumberofmappings window size 46
47
F UTURE WORK : MAPPING T ERMS Currently looking at mapping the terms to CUIs using information from the concept mapping system MetaMap Obtain the terms from MetaMap and do a dictionary look up in MRCONSO Hypothesis – the terms obtained by MetaMap are more accurate than using the SPECIALIST Lexicon Obtain the CUIs from MetaMap Hypothesis – the CUIs obtained by MetaMap will be more accurate than the dictionary look-up 47
48
O BJECTIVE #1 Develop and evaluate a method than can disambiguate terms in biomedical text by exploiting similarity information extrapolated from the UMLS UMLS::SenseRelate statistically significantly higher disambiguation accuracy than the baseline On par with previous unsupervised methods for terms 48
49
O BJECTIVE #2 Evaluate the efficacy of IC-based similarity measures over path- based measures on a secondary task There is no statistically significant difference between the accuracies obtained by the IC-based measures There is a statistically significant difference between the IC- based measures and the path-based measures 49
50
T AKE HOME MESSAGE : An ambiguous word is often used in the sense that is most similar to the sense of the concepts of the terms that surround it 50
51
R ESOURCES Software: UMLS::SenseRelate http://search.cpan.org/dist/UMLS-SenseRelate/ UMLS::Similarity http://search.cpan.org/dist/UMLS-Similarity/ Data MSH-WSD http://wsd.nlm.nih.gov/collaboration.shtml http://wsd.nlm.nih.gov/collaboration.shtml 51
52
R ESOURCES Software: UMLS::SenseRelate http://search.cpan.org/dist/UMLS-SenseRelate/ UMLS::Similarity http://search.cpan.org/dist/UMLS-Similarity/ Data MSH-WSD http://wsd.nlm.nih.gov/collaboration.shtml http://wsd.nlm.nih.gov/collaboration.shtml THANK YOU 52
53
R ESOURCES Software: UMLS::SenseRelate http://search.cpan.org/dist/UMLS-SenseRelate/ UMLS::Similarity http://search.cpan.org/dist/UMLS-Similarity/ Data MSH-WSD http://wsd.nlm.nih.gov/collaboration.shtml http://wsd.nlm.nih.gov/collaboration.shtml QUESTIONS? 53
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.