Presentation is loading. Please wait.

Presentation is loading. Please wait.

Enhancing Ontologies Through Annotations Training Course Olivier Bodenreider Lister Hill National Center for Biomedical Communications Bethesda, Maryland.

Similar presentations


Presentation on theme: "Enhancing Ontologies Through Annotations Training Course Olivier Bodenreider Lister Hill National Center for Biomedical Communications Bethesda, Maryland."— Presentation transcript:

1 Enhancing Ontologies Through Annotations Training Course Olivier Bodenreider Lister Hill National Center for Biomedical Communications Bethesda, Maryland - USA May 22, 2006 Schloß Dagstuhl Biomedical Ontology in

2 2 Outline u u Dependence relations in MeSH and co-occurrence in MEDLINE u u Identifying associative relations in the Gene Ontology u u Linking the Gene Ontology to other biological ontologies: GO-ChEBI

3 Using Dependence Relations in MeSH as a Framework for the Analysis of Disease Information in Medline

4 4 Acknowledgments u Lowell Vizenor National Library of Medicine, USA

5 5 Relations among biomedical entities u Symbolic relations l Represented in biomedical terminologies/ontologies l Explicit semantics n Hierarchical (isa, part of) n Associative (location of, causes, …) u Statistical relations l Represented in text n Among lexical items (entity recognition) n Annotations l No explicit semantics

6 6 Example Viral meningitis Viral meningitis CNS disease isa Infectious disease isa Meninges located in Virus caused by Anitviral agents treated by Herpesviridae Infections T-Lymphocytes 18 9

7 7 Statistical relations u Crucial for text mining applications l Entity recognition l Frequency of co-occurrence u No semantics u Frequency of co-occurrence used as an indicator of the salience of the relation

8 8 An example from MEDLINE Hurwitz JL, Korngold R, Doherty PC. Specific and nonspecific T-cell recruitment in viral meningitis: possible implications for autoimmunity. Cell Immunol. 1983 Mar;76(2):397-401. Specific and nonspecific T-cell invasion into cerebrospinal fluid has been investigated in the nonfatal viral meningoencephalitis induced by intracerebral inoculation of mice with vaccinia virus. At the peak of the inflammatory process on Day 7 approximately 5 to 10% of the Lyt 2+ T cells present are apparently specific for vaccinia virus. Concurrently, in mice primed previously with influenza virus, 0.5 to 1.0% of the appropriate T-cell set located in cerebrospinal fluid is reactive to influenza- infected target cells. This vaccinia virus-induced inflammatory exudate may thus contain as many as 500 influenza-immune memory T cells. These findings are discussed from the aspect that such nonspecific T-cell invasion into the central nervous system during aseptic viral meningitis could result in exposure of potentially brain-reactive T cells to central nervous system components. PMID: 6601524 u Brain/immunology u Cytotoxicity, Immunologic u Exudates and Transudates/cytology u Exudates and Transudates /immunology u Meningitis, Viral/immunology* u T-Lymphocytes/immunology* u Vaccinia virus u Animals u Humans u Mice u Research Support, Non-U.S. Gov't u Research Support, U.S. Gov't, P.H.S.

9 9 Ontological analysis u Formal ontological distinction l Dependence relations n n Every instance of a class is related to some instances of another class n n A is ontologically dependent on B if and only if A exists then B exists l Contingent relations n n Only some instances of a class are related to some instances of another class CNS diseaseCNS Viral meningitisVirus aspirinheadache

10 10 Statistical vs. ontological u u Can we use formal ontology to help analyze statistical relations? u u What is the relation between ontological and statistical relations? u u Hypothesis: l l Correspondence between n n Dependence relations(ontological) n n High frequencies of co-occurrence(statistical) l l Dependence relations ↔ Systematically high frequencies of co-occurrence

11 11 More formal-ontological distinctions Oxygen B-lymphocyte Mitral valve Streptococcus ContinuantsOccurrents continue to exist through time unfold through time in successive phases Dependent continuants Independent continuants require the existence of any other entity in order to exist? yesno Oxygen transport Glucose metabolism Echocardiography Oxygen transporter

12 12 Application to diseases u Diseases are (mostly) processes, i.e., occurrents u Diseases are dependent entities u Diseases depend on independent continuants l Anatomical structures n Classification by “location” (body system) l Agents (pathogens) n Classification by etiology

13 13 Participation relation u Participation relations are dependence relations u Between processes and biomedical continuants u Passive participation: has_participant l Viral meningitis has_participant Meninges u Active participation: has_agent l Viral meningitis has_agent Virus u Defined at the instance level but can be adapted at the class level

14 14 Statistical relations u Independent events l l P(E1 ∩ E2) = P(E1). P(E2) u Tests of independence l χ 2 test l G 2 test (likelihood ratio test)

15 15 Objectives u u Analyze dependence relations in MeSH and to compare them to statistical relations obtained from co-occurrence data u u Restricted to the relations between disease categories and other categories of biomedical interest u u Hypothesis: l l Co-occurrence relations between diseases and other categories n n Highest proportion for the dependent category, systematically across diseases n n Smaller proportions for other non-dependent categories

16 Materials

17 17 Medical Subject Headings (MeSH) u Controlled vocabulary used to index MEDLINE u 22,658 descriptors (2004 version) u 16 tree-like hierarchies l Anatomy l Organisms l Diseases l …

18 18 MEDLINE u 385,491 citations (year 2004) u u Indexed with 20,085 distinct MeSH descriptors u u Restrictions l Starred descriptors only (3.5 / citation, on average) l Frequency of co-occurrence ≥ 10 l Associations between diseases and other categories

19 Methods and Results

20

21 21 Identifying dependence relations u Manual examination of the 23 top-level disease categories [C tree] l Exceptions n Pathological conditions, signs and symptoms (C23) l Additions n Mental disorders (F03) u Identify the categories in (active or passive) participation relation with the process

22 22 Identifying dependence relations Results has_participant

23 23 Identifying dependence relations Results has_agent

24 24 Identifying statistical relations u Aggregation at the level of top-level categories Viral meningitis (C02, C10) Meninges (A08) (C02, A08) and (C10, A08)

25 25 Identifying statistical relations u Contingency table u Testing independence l G 2 test (likelihood ratio test) YesNo Yes n AB n Ab No n aB n ab Indexed with term B Indexed with term A

26 26 Identifying statistical relations Results u Quantitative results l 25,376 pairs of co-occurring descriptors l All but 68 of these statistically significant (G 2 test) l 7,896 pairs with frequency of co-occurrence ≥ 10 l 6,525 between diseases and other categories

27 27 Identifying statistical relations Results 23 disease top-level categories 88 other top-level categories 45.75

28 28 Identifying statistical relations Results u Qualitative results (1) l Generally one top-level category of the Anatomy and Organisms trees accounting for the highest frequency of co-occurrence for a given disease n n Cardiovascular Diseases  Cardiovascular System l Exceptions n n Neoplasms [C04] n n Congenital, Hereditary, and Neonatal Diseases and Abnormalities [C16] n n Endocrine Diseases [C19] n n Immunologic Diseases [C20]

29 29 Identifying statistical relations Results u Qualitative results (2) l l Most Anatomy and Organisms categories are preferentially associated with one disease category n n Cardiovascular System  Cardiovascular Diseases l Categories other than tend not to be associated with l Categories other than Anatomy and Organisms tend not to be associated with one particular disease category (contingent rather than dependent relations) n n Pathological Conditions, Signs and Symptoms [C23] n n Amino Acids, Peptides, and Proteins [D12] n n Diagnosis [E01] n n Therapeutics [E02] n n Surgical Procedures, Operative [E04]

30 Discussion

31 31 Applications u To semantic mining l Formal ontological analysis of relations provides a useful framework for elucidating statistical associations u To terminology creation and maintenance l Most terminologies do not represent trans-ontological relations explicitly l Concepts in dependence relation should not be modified independently of each other

32 32 Summary u We have studied statistical associations between MeSH terms co-occurring in MEDLINE citations u We have shown that the ontological relation of dependence is generally corroborated by a strong, systematic statistical association u These techniques l Provide a framework for semantic mining of diseases l Can help maintain terminologies

33 33 References u Vizenor L, Bodenreider O. Using dependence relations in MeSH as a framework for the analysis of disease information in Medline. Proceedings of the Second International Symposium on Semantic Mining in Biomedicine (SMBM-2006) 2006:76-83. http://mor.nlm.nih.gov/pubs/pdf/2006-smbm-lv.pdf http://mor.nlm.nih.gov/pubs/pdf/2006-smbm-lv.pdf

34 Non-lexical Approaches to Identifying Associative Relations in the Gene Ontology

35 35 Acknowledgments u Marc Aubry UMR 6061 CNRS, Rennes, France u Anita Burgun University of Rennes, France

36 36 Gene Ontology u Annotate gene products u Coverage l Molecular functions l Cellular components l Biological processes u Explicit relations to other terms within the same hierarchy u No (explicit) relations l To terms across hierarchies l To concepts from other biological ontologies

37 37 Gene Ontology Molecular functions Cellular components Biological processes BP:metal ion transport MF:metal ion transporter activity

38 38 Motivation u Richer ontology l Beyond hierarchies u Easier to maintain l Explicit dependence relations u More consistent annotations l Quality assurance l Assisted curation

39 39 Related work u Ontologizing GO l GONG u Identifying relations among GO terms across hierarchies l Lexical approach l Non-lexical approaches u Identifying relations between GO terms and OBO terms l ChEBI u Representing relations among GO terms and between GO terms and OBO terms l Obol u See also: [Ogren & al., PSB 2004-2005] [Bodenreider & al., PSB 2005] [Wroe & al., PSB 2003] [Mungall, CFG 2005] [Burgun & al., SMBM 2005] [Bada & al., 2004], [Kumar & al., 2004], [Dolan & al., 2005]

40 40 GO and annotation databases u 5 model organisms l FlyBase l GOA-Human l MGI l SGD l WormBase Brca1GO:0000793IDA Brca1GO:0003677IEA Brca1GO:0003684IDA Brca1GO:0003723ISS Brca1GO:0004553ISS Brca1GO:0005515 ISS, TAS Brca1GO:0005622IEA Brca1GO:0005634 IDA, ISS Brca1… 270,000 gene-term associations

41 41 Three non-lexical approaches All based on annotation databases Similarity in the vector space model  Similarity in the vector space model Statistical analysis of co-occurring GO terms  Statistical analysis of co-occurring GO terms Association rule mining  Association rule mining

42 42 Similarity in the vector space model Annotation database GO terms Genes g1g1 g2g2 …gngn t1t1 t2t2 … tntn GO terms Genes g1g1 g2g2 … gngn t1t1 t2t2 …tntn 

43 43 Similarity in the vector space model GO terms t1t1 t2t2 … tntn t1t1 t2t2 …tntn Similarity matrix Sim(t i,t j ) = t i · t j  GO terms Genes g1g1 g2g2 …gngn t1t1 t2t2 … tntn tjtj titi 

44 44 Analysis of co-occurring GO terms  Annotation database GO terms Genes g1g1 g2g2 … gngn t1t1 t2t2 …tntn g2g2 t2t2 t7t7 t9t9 … t3t3 t7t7 t9t9 g5g5 t5t5 t 2 -t 7 1 t 2 -t 9 1 t 7 -t 9 2 … t5t5t5t51 t7t7t7t72 t9t9t9t92 …

45 45 Analysis of co-occurring GO terms u Statistical analysis: test independence l Likelihood ratio test (G 2 ) l Chi-square test (Pearson’s χ 2 ) u Example from GOA (22,720 annotations) l C0006955 [BP]Freq. = 588 l C0008009 [MF]Freq. = 53 presentabsentTotalpresent46542588 absent721,58322,132 total5322,12522,720 GO:0008009 immune response GO:0006955chemokineactivity Co-oc. = 46 G 2 = 298.7 p < 0.000

46 46 Association rule mining  Annotation database GO terms Genes g1g1 g2g2 … gngn t1t1 t2t2 …tntn g2g2 t2t2 t7t7 t9t9 transaction apriori Rules: t 1 => t 2 Confidence: >.9 Support:.05

47 47 Examples of associations u Vector space model l l MF:ice binding l l BP:response to freezing u u Co-occurring terms l l MF:chromatin binding l l CC:nuclear chromatin u u Association rule mining l l MF:carboxypeptidase A activity l l BP:peptolysis and peptidolysis

48 48 Associations identified VSMCOCARMLEX MF-CC499893362917 MF-BP305716285772523 CC-BP76010473292053 Total4316356812685493 7665 by at least one approach

49 49 Associations identified VSM VSMCOCARMLEX MF-CC499893362917 MF-BP305716285772523 CC-BP76010473292053 Total4316356812685493 MF:ice binding BP:response to freezing

50 50 Associations identified COC VSMCOCARMLEX MF-CC499893362917 MF-BP305716285772523 CC-BP76010473292053 Total4316356812685493 MF:chromatin binding CC:nuclear chromatin

51 51 Associations identified ARM VSMCOCARMLEX MF-CC499893362917 MF-BP305716285772523 CC-BP76010473292053 Total4316356812685493 MF:carboxypeptidase A activity BP:peptolysis and peptidolysis

52 52 Limited overlap among approaches u Lexical vs. non-lexical u Among non-lexical 76655493 230 VSM COC ARM 3689 2587 453 305309 121 201

53 53 References u Bodenreider O, Aubry M, Burgun A. Non-lexical approaches to identifying associative relations in the Gene Ontology. In: Altman RB, Dunker AK, Hunter L, Jung TA, Klein TE, editors. Pacific Symposium on Biocomputing 2005: World Scientific; 2005. p. 91-102. http://mor.nlm.nih.gov/pubs/pdf/2005-psb-ob.pdf http://mor.nlm.nih.gov/pubs/pdf/2005-psb-ob.pdf

54 Linking the Gene Ontology to other biological ontologies

55 55 Acknowledgments u Anita Burgun University of Rennes, France

56 56 Related domains u Organisms u Cell types u Physical entities l Gross anatomy l Molecules u Functions l Organism functions l Cell functions u Pathology cytosolic ribosome (sensu Eukaryota) T-cell activation brain development transferrin receptor activity visual perception T-cell activation regulation of blood pressure

57 57 GO and other domains [adapted from B. Smith] Physical entity FunctionProcess Organism Gross anatomy Organism functions Organism processes Cell Cellular components Cellular functions Cellular processes MoleculeMolecules Molecular functions Molecular processes

58 58 GO and other domains (revisited) [adapted from B. Smith] Resolution Physical whole Physical part FunctionProcess Organism Organism components Organism functions Organism processes Cell Cellular components Cellular functions Cellular processes Molecule Molecular components Molecular functions Molecular processes

59 59 Biological ontologies (OBO)

60 60 GO and other domains (revisited) [adapted from B. Smith] Resolution Physical whole Physical part FunctionProcess Organism Organism components Organism functions Organism processes Cell Cellular components Cellular functions Cellular processes Molecule Molecular components Molecular functions Molecular processes

61 61 Integrating biological ontologies Cellular components Molecular functions Molecule Biologocal processes Organism Other OBO ontologies Other OBO ontologies Cell

62 Linking GO to ChEBI

63 63 ChEBI u Member of the OBO family u Ontology of Chemical Entities of Biological Interest l Atom l Molecule l Ion l Radical u 10,516 entities l 27,097 terms [Dec. 22, 2004]

64 64 Methods u Every ChEBI term searched in every GO term u Maximize precision l Ignored ChEBI terms of 3 characters or less l Proper substring u Maximize recall l Case insensitive matches l Normalized ChEBI names (generated singular forms from plurals)

65 65 Examples u u iron [CHEBI:18248] u u uronic acid [CHEBI:27252] u u carbon [CHEBI:27594] BPiron ion transport [GO:0006826] MFiron superoxide dismutase activity [GO:0008382] CCvanadium-iron nitrogenase complex [GO:0016613] BPresponse to carbon dioxide [GO:0010037] MFcarbon-carbon lyase activity [GO:0016830] BPuronic acid metabolism [GO:0006063] MFuronic acid transporter activity [GO:0015133]

66 66 Quantitative results u 2,700 ChEBI entities (27%) identified in some GO term u 9,431 GO terms (55%) include some ChEBI entity in their names 10,516 entities17,250 terms 20,497 associations

67 67 Generalization MFCCBP CHEBI FMA Cell types REXFIX Mouse Pathology cell membrane viscosity enzymatic reaction Leydic cell tumor iron deposition cerebellar aplasia

68 Conclusions

69 69 Conclusions (1) u Links across OBO ontologies need to be made explicit l Between GO terms across GO hierarchies l Between GO terms and OBO terms l Between terms across OBO ontologies u Automatic approaches l Effective (GO-GO, GO-ChEBI) l At least to bootstrap the process l Needs to be refined

70 70 Conclusions (2) u Affordable relations l Computer-intensive, not labor-intensive u Methods must be combined l Cross-validation l Redundancy as a surrogate for reliability l Relations identified specifically by one approach n False positives n Specific strength of a particular method u Requires (some) manual curation l Biologists must be involved

71 71 References u Burgun A, Bodenreider O. An ontology of chemical entities helps identify dependence relations among Gene Ontology terms. Proceedings of the First International Symposium on Semantic Mining in Biomedicine (SMBM-2005) Electronic proceedings: CEUR-WS/Vol-148 http://mor.nlm.nih.gov/pubs/pdf/2005-smbm-ab.pdf http://mor.nlm.nih.gov/pubs/pdf/2005-smbm-ab.pdf

72 Medical Ontology Research Olivier Bodenreider Lister Hill National Center for Biomedical Communications Bethesda, Maryland - USA Contact:Web:olivier@nlm.nih.govmor.nlm.nih.gov


Download ppt "Enhancing Ontologies Through Annotations Training Course Olivier Bodenreider Lister Hill National Center for Biomedical Communications Bethesda, Maryland."

Similar presentations


Ads by Google