Published by Kristopher O’Neal. Modified over 9 years ago.
1
Ontologies and Biomedicine What is the "right" amount of semantics?
2
Ontologies and Biomedicine The “right” amount of semantics depends on what you want to do with it
3
Ontologies and Biomedicine Research is based on inference from what is known, and therefore it demands rigor
4
Ontologies and Biomedicine Without rigor, we won’t know what we know, where to find it, or what to infer from it.
5
Semantic Spectrum: from Natural Language (highly expressive, ambiguous) to Computable Ontology (less expressive, logical and precise)
6
Ad hoc tagging approach: Let the users define words and phrases. Forgoes the use of an expertly curated vocabulary or ontology. A fast and distributed approach yields a vast amount of content. No recruitment and training of people to maintain the ontology is required. No recruitment and training of annotators to interpret the material is required.
8
Ad hoc tagging approach: Tagging places the burden of interpretation and classification on every end user. Overall this is more costly and wasteful. It is inappropriate in the scientific domain. The problem is not about people communicating; it is about computers and HCI.
9
Build, apply, and use: The ontology captures current scientific theory that seeks to explain all of the existing evidence, and it is used to draw inferences and make predictions. It acts like a review. It requires curators who are experts in both the science and logic. Ontology application is the real bottleneck, but overall this approach is less costly and wasteful.
10
1. Univocity: Terms should have the same meanings on every occasion of use.
2. Positivity: Terms such as ‘non-mammal’ or ‘non-membrane’ do not designate genuine classes.
3. Objectivity: Terms such as ‘unknown’ or ‘unclassified’ or ‘unlocalized’ do not designate biological natural kinds.
4. Single Inheritance: No class in a classification hierarchy should have more than one is_a parent on the immediate higher level.
5. Intelligible Definitions: The terms used in a definition should be simpler (more intelligible) than the term to be defined.
6. Reality Based: When building or maintaining an ontology, always think carefully about how classes relate to instances in reality.
7. Distinguish Classes and Instances: What is necessarily true for instances is not necessarily true for classes.
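Some of these principles can be checked mechanically. The following is a minimal sketch, not a method from the presentation: it assumes an ontology stored as (child, parent) is_a pairs, and the edge list, the function names, and the "blood cell" parent are all hypothetical illustrations. It flags classes with more than one is_a parent (principle 4) and term names that look negative or like placeholders (principles 2 and 3).

```python
from collections import defaultdict

# Hypothetical is_a edges as (child, parent) pairs; illustrative only.
IS_A_EDGES = [
    ("neuroblast", "neuronal stem cell"),
    ("neuronal stem cell", "stem cell"),
    ("stem cell", "cell"),
    ("hemocyte", "cell"),
    ("hemocyte", "blood cell"),  # second is_a parent: violates principle 4
]

# Name patterns taken from the examples given for principles 2 and 3.
NEGATIVE_PREFIXES = ("non-",)
PLACEHOLDER_WORDS = ("unknown", "unclassified", "unlocalized")


def multiple_parents(edges):
    """Return {class: sorted parents} for classes with more than one is_a parent."""
    parents = defaultdict(set)
    for child, parent in edges:
        parents[child].add(parent)
    return {c: sorted(p) for c, p in parents.items() if len(p) > 1}


def suspicious_terms(terms):
    """Return term names that look negative ('non-...') or like placeholders."""
    flagged = []
    for term in terms:
        lowered = term.lower()
        if lowered.startswith(NEGATIVE_PREFIXES) or any(
            word in lowered for word in PLACEHOLDER_WORDS
        ):
            flagged.append(term)
    return flagged


if __name__ == "__main__":
    print(multiple_parents(IS_A_EDGES))
    print(suspicious_terms(["non-mammal", "membrane", "unclassified protein"]))
```

A name-based check like this only catches the obvious cases; univocity, intelligible definitions, and the class/instance distinction still need a curator's judgment.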
11
Annotation bottleneck: An active lab can easily generate 10-100 GB of data per month, and it is very difficult to manage data on this scale. Even the best analytic schemes will be for naught if we cannot find our data. And the data is complex. Yet the annotation effort required will be utterly wasted if it cannot be reliably computed upon.
13
Implies numerous “light” ontologies 3-dimensions Protein function Cell type Tissue Stage Cellular component Organism And more…
14
Or it implies a single complex one 3-dimensions Protein function Cell type Tissue Stage Cellular anatomy Organism And more… Plus all of the relations between these elements
15
Practicalities
1. The ontology should be robust or the annotator’s time is wasted.
2. Research won’t wait; data must be annotated at the rate at which it is generated.
3. Complex ontologies are much more difficult to get right than lighter ones.
4. Light ontologies are easier to build and maintain.
5. Complex ontologies can be built from lighter ones.
16
A “successful” case study Gene Ontology
17
The aims of GO
1. To develop comprehensive shared vocabularies of terms describing aspects of molecular biology.
2. To describe the gene products held in each contributing model organism database.
3. To provide a scientific resource for access to the vocabularies, the annotations, and associated data.
4. To provide a software resource to assist in curation of GO term assignments to biological objects.
18
The primary strength of the GO: The GO covers three domains of biology: Molecular Function, Biological Process, and Cellular Component. These are “precisely defined” axes of classification.
19
The breakdown of work. Task 1: Building the ontology, a computable description of the biological world. Task 2: Describing your gene product (annotation): biological process, molecular function, cellular localization.
20
The early key decisions: The vocabulary itself requires a serious and ongoing effort. Carefully define every concept. Initially keep things as simple as possible and only use a minimally sufficient data representation. Focus initially on molecular aspects that are shared between many organisms.
21
GO databases: distributed and centralized. Support cross-database queries by having a mutual understanding of the definition and meaning of any word used to describe a gene product. Provide database access to a common repository of annotations by submitting a summary of gene products that have been annotated.
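A cross-database query of this kind can be sketched as follows. This is an illustration under stated assumptions, not the GO project's actual software: the database names, gene product names, and the `gene_products_for` helper are hypothetical, while the three GO IDs and their term names are ones listed elsewhere in this presentation. The point is that one query works across independently maintained databases because they share term identifiers.

```python
# Shared vocabulary: GO IDs and names as they appear in this presentation.
GO_TERMS = {
    "GO:0006810": "transport",
    "GO:0003677": "DNA binding",
    "GO:0005524": "ATP binding",
}

# Hypothetical annotation summaries submitted by two contributing databases.
# Database and gene product names are invented for illustration.
DATABASES = {
    "db_one": [("geneA", "GO:0006810"), ("geneB", "GO:0003677")],
    "db_two": [("geneX", "GO:0006810"), ("geneY", "GO:0005524")],
}


def gene_products_for(term_id, databases):
    """Return (database, gene product) pairs annotated to the given GO ID."""
    hits = []
    for db_name, annotations in databases.items():
        for gene_product, go_id in annotations:
            if go_id == term_id:
                hits.append((db_name, gene_product))
    return hits


if __name__ == "__main__":
    term = "GO:0006810"
    print(GO_TERMS[term], gene_products_for(term, DATABASES))
```

Had each database used its own wording for "transport", the same question would need a hand-built mapping for every pair of databases.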
22
GO CVS FTP Anonymous CVS GO data HTTPD Scripts
23
GO CVS Many Scripts GO Database AmiGO
24
GODatabase.org, averages per week: Hits = 77,012; Visits = 14,063; Sites = 6,638
26
Number of links to a site, as reported by Google
www.geneontology.org    7,240
www.godatabase.org      33
obo.sourceforge.net     10
song.sourceforge.net    6
genome.ucsc.edu         3,670
www.ncbi.nih.gov        12,000
www.ebi.ac.uk           14,900
sciencemag.org          14,900
www.ncbi.nlm.nih.gov    34,500
27
Most common GO IDs accessed via AmiGO
Count   GO ID        Term
72020   GO:0006810   transport
56862   GO:0005524   ATP binding
53622   GO:0019012   virion
47773   GO:0006955   immune response
46943   GO:0003677   DNA binding
41474   GO:0006508   proteolysis and peptidolysis
41126   GO:0006355   regulation of transcription, DNA-dependent
40427   GO:0004872   receptor activity
34943   GO:0005215   transporter activity
30890   GO:0007186   G-protein coupled receptor protein signaling pathway
30001   GO:0003700   transcription factor activity
28127   GO:0006118   electron transport
26636   GO:0005509   calcium ion binding
24007   GO:0006968   cellular defense response
21250   GO:0016486   peptide hormone processing
20440   GO:0008152   metabolism
19742   GO:0005515   protein binding
19316   GO:0007155   cell adhesion
18254   GO:0005198   structural molecule activity
28
Taxa covered by the GO (some)
Arabidopsis: TAIR, taxon:3702
Caenorhabditis: WormBase, taxon:6239
Candida albicans: CGD, taxon:5476
Danio: ZFIN, taxon:7955
Dictyostelium: DictyBase, taxon:5782
Drosophila: FlyBase, taxon:7227
Mus: MGI, taxon:10090
Oryza sativa: Gramene, taxon:39947 = Oryza sativa (japonica cultivar-group)
Rattus: RGD, taxon:10116
Saccharomyces: SGD, taxon:4932
Leishmania major: GeneDB, taxon:5664
Plasmodium falciparum: GeneDB, taxon:5833
Schizosaccharomyces pombe: GeneDB, taxon:4896
Trypanosoma brucei: GeneDB, taxon:185431
Bacillus anthracis: TIGR, taxon:198094
Coxiella burnetii: TIGR, taxon:227377
Geobacter sulfurreducens: TIGR, taxon:243231
Listeria monocytogenes: TIGR, taxon:265669
Methylococcus capsulatus: TIGR, taxon:243233
Pseudomonas syringae: TIGR, taxon:223283
Shewanella oneidensis: TIGR, taxon:211586
Vibrio cholerae: TIGR, taxon:686
29
NIH-funded experimental research that uses the GO
National Institute on Aging (NIA)
National Institute of Allergy and Infectious Diseases (NIAID)
National Cancer Institute (NCI)
National Institute on Drug Abuse (NIDA)
National Institute on Deafness and Other Communication Disorders (NIDCD)
National Institute of Dental & Craniofacial Research (NIDCR)
National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK)
National Institute of Biomedical Imaging and Bioengineering (NIBIB)
National Institute of Environmental Health Sciences (NIEHS)
National Eye Institute (NEI)
National Institute of General Medical Sciences (NIGMS)
National Institute of Child Health and Human Development (NICHD)
National Human Genome Research Institute (NHGRI)
National Heart, Lung and Blood Institute (NHLBI)
National Library of Medicine (NLM)
National Institute of Neurological Disorders and Stroke (NINDS)
National Center for Research Resources (NCRR)
30
Other funded experimental projects that use the GO
Public Health Service
Walter Reed Army Medical Center
United States Department of Agriculture
Department of Defense
USAID
National Science Foundation
31
A “successful” case study There are still challenges to meet
32
Building upon (sharing) light, axiomatic ontologies eliminates:
1. Spelling mistakes or differences: oesinophil vs. eosinophil
2. Differences in synonyms, names or naming conventions: Spermatazoon, sperm cell, spermatozoid, sperm
3. Differences in definitions: pericardial cell develops_from mesodermal cell vs. nothing develops_from pericardial cell
4. Inconsistent structure
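The first two items can be illustrated with a small sketch. It assumes annotators' free-text labels are resolved against a shared synonym table; the table layout, the choice of "sperm" as the preferred label, and the function names are hypothetical, and only the example words come from the slide. Known synonyms collapse to one term, and a misspelling that is not in the table is rejected instead of silently creating a new term.

```python
# Hypothetical shared synonym table: preferred label -> accepted synonyms.
# Choosing "sperm" as the preferred label is an assumption for illustration.
SYNONYMS = {
    "sperm": ["spermatazoon", "sperm cell", "spermatozoid", "sperm"],
    "eosinophil": ["eosinophil"],
}


def build_lookup(synonyms):
    """Invert the table into {lower-cased synonym: preferred label}."""
    lookup = {}
    for preferred, names in synonyms.items():
        for name in names:
            lookup[name.lower()] = preferred
    return lookup


def normalize(label, lookup):
    """Return the preferred label, or None if the label is not in the shared table."""
    return lookup.get(label.strip().lower())


if __name__ == "__main__":
    lookup = build_lookup(SYNONYMS)
    print(normalize("Spermatozoid", lookup))  # resolves to the preferred label
    print(normalize("oesinophil", lookup))    # None: misspelling is rejected
```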
33
Inconsistent structure. GO: lamellocyte differentiation; plasmatocyte differentiation; hemocyte differentiation (sensu Arthropoda). CL: hemocyte; lamellocyte; plasmocyte.
34
Finer granularity in the GO. GO: immune cell activation, migration, chemotaxis…; erythrocyte differentiation is_a myeloid blood cell differentiation. CL: no such term: “immune cell”; no such term: “myeloid blood cell”.
35
Coarser granularity in the GO. GO: neuroblast proliferation is_a cell proliferation. CL: neuroblast is_a neuronal stem cell is_a stem cell is_a cell.
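The CL chain on this slide (neuroblast is_a neuronal stem cell is_a stem cell is_a cell) shows what a reasoner gains from explicit is_a links. The sketch below is a minimal illustration that assumes a single-parent hierarchy stored as a dictionary; the function names are hypothetical. It walks the chain upward so that a query for "stem cell" also finds data annotated at the finer "neuroblast" level.

```python
# The is_a chain from the slide, stored as child -> parent.
IS_A = {
    "neuroblast": "neuronal stem cell",
    "neuronal stem cell": "stem cell",
    "stem cell": "cell",
}


def ancestors(term, is_a):
    """Return the is_a ancestors of a term, nearest first."""
    chain = []
    seen = {term}
    while term in is_a:
        term = is_a[term]
        if term in seen:  # guard against a malformed, cyclic hierarchy
            raise ValueError(f"cycle detected at {term!r}")
        seen.add(term)
        chain.append(term)
    return chain


def is_kind_of(term, candidate, is_a):
    """True if term is the candidate class or one of its is_a descendants."""
    return term == candidate or candidate in ancestors(term, is_a)


if __name__ == "__main__":
    print(ancestors("neuroblast", IS_A))
    print(is_kind_of("neuroblast", "stem cell", IS_A))
```

In the GO, by contrast, "neuroblast proliferation" links directly to "cell proliferation", so no intermediate stem-cell level is available to query.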
36
Even a “light” ontology like the GO is difficult enough. A methodology that enforces clear, coherent definitions: promotes quality assurance; intent is not hard-coded into software; meaning of relationships is defined, not inferred; guarantees automatic reasoning across ontologies and across data at different granularities. Consequences of inconsistencies: hard to synchronize manually; inconsistent user-search results.
37
Meeting the goal: Drawing inferences. (Figure: gene products A, B, C and D compared across human, toad and yeast, each linked to SP: accessions and PMID: citations; links are marked as direct evidence or indirect evidence, with one human entry shown as “?”. Panel labels: Human, Xenopus, Drosophila.)
38
Thank you: Chris Mungall, Sima Misra, NCBO, Reactome, GO, SO