Download presentation
Presentation is loading. Please wait.
1
Lexical and Statistical Approaches to Acquiring Ontological Relations Formal Methods for Casual Ontology? Olivier Bodenreider Lister Hill National Center for Biomedical Communications Bethesda, Maryland - USA Ontology and Biomedical Informatics Rome, Italy – May 1, 2005
2
2 Introduction u Biomedical ontologies l Precisely defined (e.g., formal ontology) l Limited size l Built manually u Large amounts of knowledge l Not represented explicitly by symbolic relations l But expressed implicitly n By lexico-syntactic relations (i.e., embedded in terms) n By statistical relations (e.g., co-occurrence) l Can be extracted automatically
3
3 Formal vs. casual ontology u Formal ontology l Provides a framework for building sound ontologies l Too labor-intensive for building large ontologies u Casual ontology l Usually unsuitable for reasoning l Tools for automatic acquisition available
4
4 General framework u Ontology learning l l [Maedche & Staab, Velardi] l l ECAI, IJCAI u Term variation[Jacquemin] u Terminology / KnowledgeTKE, TIA u Knowledge acquisition/captureK-CAP u Information extraction
5
5 Sources of knowledge for casual ontology u Long tradition of terminology building l Over 100 terminologies available in electronic format u Large corpora available (e.g., MEDLINE) l Entity recognition tools available n E.g., MetaMap (UMLS-based) n Several for gene/protein names l Information extraction methods u Large annotation databases available l MEDLINE citations indexed with MeSH l Model organism databases annotated with GO
6
6 Formal methods for casual ontology u Lexico-syntactic methods l Lexico-syntactic patterns l Nominal modification l Prepositional phrases l Reified relations l Semantic interpretation u Statistical methods l Clustering l Statistical analysis of co-occurrence data l Association rule mining
7
Lexico-syntactic methods
8
8 Synonymy u Source: terminology u Lexical similarity l Lexical variant generation program (UMLS) l norm u Limitations l Clinical synonymy vs. Synonymy l Molecular biology [McCray & al., SCAMC, 1994]
9
9 Normalization Hodgkin’s diseases, NOS Hodgkin diseases, NOS Remove genitive Hodgkin diseases, Remove stop words hodgkin diseases, Lowercase hodgkin diseases Strip punctuation hodgkin disease UninflectSort words disease hodgkin
10
10 Normalization Example Hodgkin Disease HODGKINS DISEASE Hodgkin's Disease Disease, Hodgkin's Hodgkin's, disease HODGKIN'S DISEASE Hodgkin's disease Hodgkins Disease Hodgkin's disease NOS Hodgkin's disease, NOS Disease, Hodgkins Diseases, Hodgkins Hodgkins Diseases Hodgkins disease hodgkin's disease Disease, Hodgkin normalize disease hodgkin
11
11 Taxonomic relations Lexico-syntactic patterns u Source: text corpus u Example of patterns l Lamivudin is a nucleoside analogue with potent antiviral properties. l The treatment of schizophrenia with old typical antipsychotic drugs such as haloperidol can be problematic. [Hearst, COLING, 1992] [Fiszman & al., AMIA, 2003]
12
12 Taxonomic relations Nominal modification u Source: text corpus / terminology u Example of modifiers l Adjective n Tuberculous Addison’s disease n Acute hepatitis l Noun (noun-noun compounds) n Prostate cancer n Carbon monoxide poisoning [Jacquemin, ACL, 1999] [Bodenreider & al., TIA, 2001] Terminology: constrained environment (increased reliability)
13
13 Reified relations u Source: terminology u Example: reification of part of u Augmented relations from reified part-of relations l Reified: l Reified: l Augmented: l Augmented: [Zhang & al., ISWC/Sem. Int., 2003]
14
14 Prepositional attachment u Source: text corpus / terminology u Example: of l Lobe of lung part of Lung l Bone of femur part of Femur u Restrictions l Validity of preposition-to-relation correspondence may be limited to a subdomain (e.g., anatomy) l Not applicable to complex terms n Groove for arch of aorta NOT part of Aorta [Zhang & al., ISWC/Sem. Int., 2003]
15
15 Semantic interpretation u Source: text corpus / terminology u Correspondence between l Linguistic phenomena l Semantic relations u Semantic constraints provided by ontologies [Navigli & al., TKE, 2002] [Romacker, AIME, 2001] [Rindflesch & al., JBI, 2003]
16
16 Semantic interpretation Hemofiltration in digoxin overdose Disease or Syndrome Therapeutic or Preventive Procedure Digoxin overdoseHemofiltration Mapping to UMLS Syntactic analysis indigoxin overdose noun phrase Hemofiltration preposition treats Semantic rules Semantic network relationships AntibiotictreatsDisease or Syndrome Medical DevicetreatsInjury or Poisoning Pharmacologic SubstancetreatsCongenital Abnormality Ther. or Prev. ProceduretreatsDisease or Syndrome […] Select matching rule treats Semantic interpretation
17
17 Compositional features of terms u Lexical items u Terms within a vocabulary l Clinical vocabularies l Gene Ontology u Terms across vocabularies l SNOMED / LOINC l GO / ChEBI u Lexicon / Terms l Semantic lexicon [Baud & al., AMIA, 1998] [Ogren & al., PSB, 2004] [Mungall, CFG, 2004] [Johnson, JAMIA, 1999] [Verspoor, CFG, 2005] [Burgun, SMBM, 2005] [Dolin, JAMIA, 1998] [McDonald & al., AMIA, 1999]
18
Statistical methods
19
19 Taxonomic relations Clustering u Source: text corpus u Principle: similarity between words reflected in their contexts l Co-occurring words (+ frequencies) l Hierarchical clustering algorithms n Similarity measure (cosine, Kullback Leibler) u Can be refined using classification techniques (e.g., k nearest neighbors) [Faure & al., LREC, 1998] [Maedche & al., HoO, 2004]
20
20 Associative relations u Source: text corpus / annotation databases u Principle: dependence relations l Associations between terms u Several methods l Vector space model l Co-occuring terms l Association rule mining u Limitations: no semantics [Bodenreider & al., PSB, 2005]
21
21 Similarity in the vector space model Annotation database GO terms Genes g1g1 g2g2 …gngn t1t1 t2t2 … tntn GO terms Genes g1g1 g2g2 … gngn t1t1 t2t2 …tntn
22
22 Similarity in the vector space model GO terms t1t1 t2t2 … tntn t1t1 t2t2 …tntn Similarity matrix Sim(t i,t j ) = t i · t j GO terms Genes g1g1 g2g2 …gngn t1t1 t2t2 … tntn tjtj titi
23
23 Analysis of co-occurring GO terms Annotation database GO terms Genes g1g1 g2g2 … gngn t1t1 t2t2 …tntn g2g2 t2t2 t7t7 t9t9 … t3t3 t7t7 t9t9 g5g5 t5t5 t 2 -t 7 1 t 2 -t 9 1 t 7 -t 9 2 … t5t5t5t51 t7t7t7t72 t9t9t9t92 …
24
24 Analysis of co-occurring GO terms u Statistical analysis: test independence l Likelihood ratio test (G 2 ) l Chi-square test (Pearson’s χ 2 ) u Example from GOA (22,720 annotations) l C0006955 [BP]Freq. = 588 l C0008009 [MF]Freq. = 53 presentabsentTotalpresent46542588 absent721,58322,132 total5322,12522,720 GO:0008009 immune response GO:0006955chemokineactivity Co-oc. = 46 G 2 = 298.7 p < 0.000
25
25 Association rule mining Annotation database GO terms Genes g1g1 g2g2 … gngn t1t1 t2t2 …tntn g2g2 t2t2 t7t7 t9t9 transaction apriori Rules: t 1 => t 2 Confidence: >.9 Support:.05
26
26 Example of associations (GO) u Vector space model l l MF:ice binding l l BP:response to freezing u u Co-occurring terms l l MF:chromatin binding l l CC:nuclear chromatin u u Association rule mining l l MF:carboxypeptidase A activity l l BP:peptolysis and peptidolysis
27
Discussion and Conclusions
28
28 Combine methods u Affordable relations l Computer-intensive, not labor-intensive u Methods must be combined l Cross-validation l Redundancy as a surrogate for reliability l Relations identified specifically by one approach n False positives n Specific strength of a particular method u Requires (some) manual curation l Biologists must be involved
29
29 Limited overlap among approaches u Lexical vs. non-lexical u Among non-lexical 76655493 230 VSM COC ARM 3689 2587 453 305309 121 201 [Bodenreider & al., PSB, 2005]
30
30 Reusing thesauri u First approximation for taxonomic relations l No need for creating taxonomies from scratch in biomedicine u Beware of purpose-dependent relations l Addison’s disease isa Autoimmune disorder u Relations used to create hierarchies vs. hierarchical relations u Requires (some) manual curation [Wroe & al., PSB, 2003] [Hahn & al., PSB, 2003]
31
31 Formal vs. Casual u Formal ontology l Provides a framework for building sound ontologies l Too labor-intensive for building large ontologies u Casual ontology l Usually unsuitable for reasoning l Tools for automatic acquisition available What is not useful Formal ontology = righteousFormal ontology = righteous Casual ontology = sloppy Casual ontology = sloppy
32
32 Formal and Casual u Formal ontology l Provides a framework which can be used as a reference l Help us think clearly (?) about n Concepts n Relations (e.g., isa: is a kind of / is an instance of) u Casual ontology l Supported by “cheap” (but formal) methods l Extracted from large amounts of data l Helps populating the framework from formal ontology
33
33 Combining formal and casual u Formal ontology l Provides a framework for building sound ontologies l Too labor-intensive for building large ontologies l Can benefit from loosely defined ontologies u Casual ontology l Usually unsuitable for reasoning l Tools for automatic acquisition available l Can benefit from formal ontology n Organization n Validation
34
34 Casual ontology as a bridge u Casual ontology l Speaks the language of biologists n Extracted from text or terminologies l Passes (part of) the rigorous framework of formal ontology on to biologists u Casual ontologist l Not a sloppy ontologist n Uses the formal methods of casual ontology l Mediator between formal ontology and biology
35
Medical Ontology Research Olivier Bodenreider Lister Hill National Center for Biomedical Communications Bethesda, Maryland - USA Contact:Web:olivier@nlm.nih.govmor.nlm.nih.gov
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.