An Introduction to Ontology for Scientists
Barry Smith, University at Buffalo
Multiple kinds of data in multiple kinds of silos: lab / pathology data, Electronic Health Record data, clinical trial data, patient histories, medical imaging, microarray data, protein chip data, flow cytometry, mass spec, genotype / SNP data
How to find your data? How to find other people’s data? How to reason with data when you find it? How to work out what data does not yet exist?
How to solve the problem of data re-use to address NIH mandates? Part of the solution must involve standardized terminologies and coding schemes.
NLM’s proposal: the Unified Medical Language System (UMLS), a collection of separate terminologies built by trained experts, useful for legacy information retrieval and information integration.
UMLS Metathesaurus: a system of post hoc mappings between overlapping source vocabularies.
SNOMED DEMONS UMLS
New York State Center of Excellence in Bioinformatics & Life Sciences R T U
For UMLS: local usage is respected; regimentation is frowned upon; cross-framework consistency is not important; there is no concern to establish consistency with basic science. Hence different grades of formal rigor, different degrees of completeness, different update policies.
In the olden days people measured lengths using inches, ulnas, perches, king’s feet, Swiss feet, leagues of Paris, etc., etc.
Data was not comparable
on June everything changed
We now have the International System of Units
UMLS Can we create something like the SI system of units for biomedical terminology?
Uses of ‘ontology’ in PubMed abstracts
By far the most successful: GO (Gene Ontology)
Hierarchical view of GO, showing relations between the types it represents
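A hierarchy of this kind can be sketched as a small graph of is_a links between types. The snippet below is only an illustration under stated assumptions: the parent chain is hand-written for the example rather than taken from a GO release, and `ancestors` and `is_subtype_of` are hypothetical helper names.

```python
# Toy is_a hierarchy: each type maps to its direct parents.
# The links are illustrative assumptions, not copied from a GO release.
IS_A = {
    "sphingolipid transporter activity": ["lipid transporter activity"],
    "lipid transporter activity": ["transporter activity"],
    "transporter activity": ["molecular function"],
    "molecular function": [],
}


def ancestors(term, is_a=IS_A):
    """Return every type reachable from `term` by following is_a links."""
    seen = set()
    stack = list(is_a.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(is_a.get(parent, []))
    return seen


def is_subtype_of(term, candidate, is_a=IS_A):
    """True if `candidate` is `term` itself or one of its ancestors."""
    return term == candidate or candidate in ancestors(term, is_a)
```

With such a structure, data tagged with a specific type can also be found by a query for any of its more general types.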
Gene Ontology: $100 million invested in literature and database curation using the Gene Ontology (GO), based on the idea of annotation. Over 11 million annotations relate gene products (proteins) described in the UniProt, Ensembl and other databases to terms in the GO. Multiple secondary uses, because the ontology was not built to meet one specific set of requirements.
GO provides a controlled system of terms for use in annotating (describing, tagging) data. It is multi-species, multi-disciplinary and open source, contributing to the cumulativity of scientific results obtained by distinct research communities.
Semantic annotation of data: where in the cell? what kind of molecular function? what kind of biological process?
natural language labels + definitions, to make the data cognitively accessible to human beings and algorithmically accessible to computers
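One way to picture this pairing of human-readable text with machine-readable structure is a term record plus an annotation that points at it. This is a minimal sketch, not GO's actual file format: the identifiers (`EX:0000001`, `EXAMPLE_PROTEIN_1`), the definition text and the `describe` helper are all assumptions made for illustration.

```python
from dataclasses import dataclass

# The three questions an annotation can answer, one per aspect.
ASPECTS = {
    "cellular component": "where in the cell?",
    "molecular function": "what kind of molecular function?",
    "biological process": "what kind of biological process?",
}


@dataclass(frozen=True)
class Term:
    term_id: str     # stable identifier, for computers
    label: str       # natural language label, for humans
    definition: str  # natural language definition, for humans
    aspect: str      # one of the keys of ASPECTS


@dataclass(frozen=True)
class Annotation:
    gene_product: str  # identifier in a source database
    term_id: str       # points at a Term


# Hypothetical identifier and definition text, for illustration only.
TERMS = {
    "EX:0000001": Term(
        term_id="EX:0000001",
        label="sphingolipid transporter activity",
        definition="Illustrative text: enables movement of sphingolipids "
                   "into, out of, or within a cell.",
        aspect="molecular function",
    ),
}


def describe(annotation, terms=TERMS):
    """Render an annotation as a line a human can read."""
    term = terms[annotation.term_id]
    return (f"{annotation.gene_product}: {ASPECTS[term.aspect]} "
            f"{term.label} ({term.term_id})")
```

The annotation itself stores only identifiers; the label and definition are looked up when a person needs to read it.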
OBO (Open Biomedical Ontology) Foundry proposal (Gene Ontology in yellow)

Columns, by RELATION TO TIME: CONTINUANT (INDEPENDENT, DEPENDENT) and OCCURRENT. Rows, by GRANULARITY.

ORGAN AND ORGANISM
    Independent continuant: Organism (NCBI Taxonomy); Anatomical Entity (FMA, CARO)
    Dependent continuant: Organ Function (FMP, CPRO); Phenotypic Quality (PaTO)
    Occurrent: Biological Process (GO)
CELL AND CELLULAR COMPONENT
    Independent continuant: Cell (CL); Cellular Component (FMA, GO)
    Dependent continuant: Cellular Function (GO)
MOLECULE
    Independent continuant: Molecule (ChEBI, SO, RnaO, PrO)
    Dependent continuant: Molecular Function (GO)
    Occurrent: Molecular Process (GO)
compare: legends for maps
ontologies are legends for data
ontologies are legends for databases
Databases shown: MouseEcotope, GlyProt, DiabetInGene, GluChem. Term shown: sphingolipid transporter activity
annotation using common ontologies yields integration of databases
Databases shown: MouseEcotope, GlyProt, DiabetInGene, GluChem. Term shown: Holliday junction helicase complex
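The integration claim can be made concrete with a small sketch: if records in separate databases carry the same term identifier, they can be brought together by grouping on that identifier. The database names come from the slide; the record identifiers, the `EX:` term identifiers and both helper functions are hypothetical.

```python
from collections import defaultdict

# Hypothetical records: (database, record id, term id).
# Database names come from the slide; everything else is invented.
RECORDS = [
    ("MouseEcotope", "me-001", "EX:0000002"),
    ("GlyProt", "gp-017", "EX:0000002"),
    ("DiabetInGene", "dg-203", "EX:0000002"),
    ("GluChem", "gc-044", "EX:0000002"),
    ("GlyProt", "gp-018", "EX:0000001"),
]

LABELS = {
    "EX:0000001": "sphingolipid transporter activity",
    "EX:0000002": "Holliday junction helicase complex",
}


def integrate(records):
    """Group (database, record) pairs by the ontology term they share."""
    by_term = defaultdict(list)
    for database, record_id, term_id in records:
        by_term[term_id].append((database, record_id))
    return dict(by_term)


def databases_for(label, records=RECORDS, labels=LABELS):
    """List, in sorted order, the databases holding data annotated with `label`."""
    wanted = {tid for tid, name in labels.items() if name == label}
    return sorted({db for db, _, tid in records if tid in wanted})
```

No mapping between the databases' own schemas is needed here; the shared term identifier does the work.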
annotation using common ontologies can support comparison of data
The goal: virtual science. Consistent (non-redundant) annotation and cumulative (additive) annotation, yielding, by incremental steps, a virtual map of the entirety of reality that is accessible to computational reasoning.
This goal is realizable if we have a common ontology framework. Data is retrievable, comparable, error-checkable, integratable and capable of being reasoned with only to the degree that it is annotated using a common controlled vocabulary.
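As one illustration of what "error-checkable" could mean in practice, annotations can be validated against the controlled vocabulary before they are accepted. This is a sketch under assumptions: the vocabulary contents, the `EX:` identifiers, the record identifiers and the `check_annotations` helper are all hypothetical.

```python
# Hypothetical controlled vocabulary: labels appear on earlier slides,
# identifiers are invented for illustration.
VOCABULARY = {
    "EX:0000001": "sphingolipid transporter activity",
    "EX:0000002": "Holliday junction helicase complex",
}


def check_annotations(annotations, vocabulary=VOCABULARY):
    """Split annotations into those using known terms and those that do not."""
    valid, errors = [], []
    for record_id, term_id in annotations:
        if term_id in vocabulary:
            valid.append((record_id, term_id))
        else:
            errors.append((record_id, term_id))
    return valid, errors


# Example input: the second entry uses a free-text tag instead of a term id.
submitted = [
    ("gp-017", "EX:0000002"),
    ("gp-099", "helicase thing"),
]
valid, errors = check_annotations(submitted)
```

A free-text tag cannot be checked this way, which is the sense in which these properties hold only to the degree that a common vocabulary is used.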