Download presentation
Presentation is loading. Please wait.
Published byPrimrose Theresa Ryan Modified over 9 years ago
1
Principles for Building Biomedical Ontologies Suzanna Lewis National Center Biomedical Ontology 22 October 2005 Advanced Bioinformatics, Cold Spring Harbor
2
National Center Biomedical Ontology http://bioontology.org/ http://bioontology.org/ Mark Musen Suzanna Lewis Barry Smith Sima Misra Daniel Rubin Michael Ashburner Monte Westerfield Ida Sim PI & Core 1: computer science (SMI) Co-PI & Core 2: bioinformatics (BiKR; GO) Core 6: Outreach and training (ECOR) Associate Program Director Program Director Core 3: Phenotype Project (Cambridge; FlyBase; and GO) Core 3: Phenotype Project (UOregon; PI of ZFIN) Core 3: HIV clinical trials Project (UCSF)
3
BiKRs Sima Misra Shu Shengqiang Christopher J. Mungall Nomi Harris John Day-Richter Karen Eilbeck Mark Gibson
4
Outline for the Morning A definition of “ontology” Four sessions: Organizational Challenges Principles for Ontology Construction Case Studies from the GO Case Studies for group discussion.
5
My newbie questions What data is missing? What I’ve heard Where is the data generated? What is the motivation? How will it be gathered? Organism, environment, data quality and attribution TIGR, Sanger, JGI, and coming soon to a 954 near you! Still an issue. Low threshold of effort relative to benefits of complying Data it is accumulating on disks across the world and we’d like to be able to locate and use it The hardest part: Sharing (semantics)
6
handy ontology tells us what’s there… Where should I eat…? Ontologies help with decision making
7
Type of cuisine(Presumable) country of origin Ontologies don’t just organize data; they also facilitate inference, and that creates new knowledge, often unconsciously in the user.
8
Where delicatessen food hails from… ‘Frozen Yogurt’ cuisine in search of a national identity? What a computer would likely infer about the world from this helpful ontology: Flag of fresh juiceFresh Juice is a national cuisine…
9
Ontology is all about meaning Communities form (scientific) theories that seek to explain all of the existing evidence and can be used for prediction We make inferences and decisions based upon what we know about (biological) reality.
10
Make our meanings clear enough for a computer to understand An ontology is a computable representation of this underlying (biological) reality. An ontology enables a computer to reason over the data in (some of) the ways that we do particularly to query and locate relevant data. A shared, common, backbone taxonomy of relevant entities, and the relationships between them, within an application domain. Referred to by information scientists as an ’Ontology'.
11
But really… What is an Ontology? From Aristotle to Artificial Intelligence It is ”a formalism of what exists” Follows formal rules for creating definitions originally laid down by Aristotle. A definition is: the specification of the essence (nature, invariant structure) shared by all the members of a class or natural kind.
12
The Aristotelian Methodology Topmost nodes are the undefinable primitives. The definition of a class lower down in the hierarchy is provided by specifying the parent of the class together with the relevant differentia. Differentia tells us what marks out instances of the defined class within the wider parent class as in Plasma membrane is a cell part [immediate parent] that surrounds the cytoplasm [differentia]
13
Siamese mammal cat organism Physical object (substance) classes animal instances frog leaf class all members of the class frog share a froggy nature
14
Thorax Lung Heart Cell Anatomical structures Cornelius Rosse
15
Content of FMA Challenge: Duplicate graphical model in symbolic model Adapted from Bloom & Fawcett: Textbook of Histology 1994 12th ed Chapman & Hall Universals or classes: Kinds of anatomical entities
16
Content of FMA
17
1. Organizational Challenges http://obo.sourceforge.net
18
So you want an ontology… What do you have to do to make/get/use/steal/beg one?
19
WhySurvey Improve Domain covered ? Public ? Active ? Applied ? Communit y? Develop Salvage Collaborate & Learn yes no
20
What you must do Justify exactly why there is a need Scope it very, very tightly Communicate with people
21
The decisions you must make What domain does it cover? It is privately held? Is it active? Is it applied?
22
Why Survey Improve Domain covered ? Public ? Active ? Applied ? Communit y? Develop Salvage Collaborate & Learn (Listen to Barry) yes no
23
Due diligence & background research Step 1: Learn what is out there The most comprehensive list is on the OBO site. http://obo.sourceforge.nethttp://obo.sourceforge.net Assess ontologies critically and realistically. Make contact
24
Why Survey Improve Domain covered ? Public? Active ? Applied ? Communit y? Develop Salvage Collaborate & Learn (Listen to Barry) yes no
25
Ontologies must be shared Proprietary ontologies Belief that ownership of the terminology gives the owners a competitive edge For example, Incyte or Monsanto in the past, SNOMED for non-US. Data cannot be shared if the ontologies describing the data are not shared. Don’t reinvent—Use the power of combination and collaboration
26
Why Survey Improve Domain covered ? Public ? Active? Applied ? Communit y? Develop Salvage Collaborate & Learn (Listen to Barry) yes no
27
Pragmatic assessment of an ontology Is there access to help, e.g.: help-me@weird.ontology.net ? Does a warm body answer help mail within a ‘reasonable’ time—say 2 working days ?
28
Why Survey Improve Domain covered ? Public ? Active? Applied? Communit y? Develop Salvage Collaborate & Learn (Listen to Barry) yes no
29
Use it to improve it Every ontology improves when it is applied to actual data It improves even more when these data are used to answer questions There will be fewer problems in the ontology and more commitment to fixing remaining problems when important research data is involved that scientists depend upon Be very wary of ontologies that have never been applied
30
Work with that community To improve (if you found one) To develop (if you did not) Getting it right It is impossible to get it right the 1st (or 2nd, or 3rd, …) time. What we know about reality is continually growing Improve Collaborate and Learn
31
Implication: “prepare for change” Establish a mechanism for change. Use CVS or Subversion. Changes must be reviewed by experts Unique Identifiers Versions Archives
32
Ontology development is hard Have a stake in seeing it work. Have broad, detailed domain knowledge. Will engage in vigorous debate without engaging egos. Will do concrete work and attend frequent working sessions (quarterly), phone conferences (weekly), e-mail correspondence (daily).
33
2. Principles for Ontology Construction
34
Why do we need rules for good ontology? Ontologies must be intelligible to humans (for annotation) and to machines (for reasoning and error-checking) Unintuitive rules for classification lead to entry errors (problematic links) Facilitate training of curators Overcome obstacles to alignment with other ontology and terminology systems Enhance harvesting of content through automatic reasoning systems Following basic rules makes more useful ontologies
35
Aristotle’s categories This is Aristotle’s list of types of predication, that is, the different ways in which things can be said to be. He identifies 10 mutually exclusive categories. 1. Substance. 2. Quantity. 3. Quality. 4. Relation. 5. Location. 6. Time. 7. Position. 8. Possession. 9. Doing. 10. Undergoing.
36
SNOMED-CT Top Level Substance Body Structure Specimen Context-Dependent Categories* Attribute Finding* Staging and Scales Organism Physical Object Events Environments and Geographic Locations Qualifier Value Special Concept* Pharmaceutical and Biological Products Social Context Disease Procedure Physical Force
37
Examples of Rules Don’t confuse instances with universals Your navel (instance) is not the abstract representation of all navels Your microarray result is not the abstract representation of all microarray results The meaning of an ontology should not change when the programming language changes
38
First Rule: Univocity Terms (including those describing relations) should have the same meanings on every occasion of use. In other words, they should refer to the same kinds of instances in reality
39
Example of univocity problem in case of part_of relation (Old) Gene Ontology: ‘part_of’ = ‘may be part of’ flagellum part_of cell ‘part_of’ = ‘is at times part of’ replication fork part_of the nucleoplasm ‘part_of’ = ‘is included as a sub-list in’
40
Second Rule: Positivity Complements of classes are not themselves classes. Terms such as ‘non-mammal’, or ‘non- frog’, or ‘non-membrane’ do not designate genuine classes.
41
Third Rule: Objectivity Which classes exist is not a function of our biological knowledge. Terms such as ‘unknown’ or ‘unclassified’ do not designate biological natural kinds.
42
Fourth Rule: Single Inheritance No class in a classificatory hierarchy should have more than one is_a parent on the immediate higher level I.e. no diamonds C is_a 2 B is_a 1 A
43
Following the single inheritance rule The position of a term within the hierarchy enriches its own definition by incorporating automatically the definitions of all the terms above it. The entire information content of the term hierarchy can be translated very cleanly into a computer representation
44
B C is_a 1 is_a 2 A ‘is_a’ no longer univocal Problems with multiple inheritance
45
Fifth Rule: Clarity of Text Definitions The terms used in a definition should be simpler (more intelligible) than the term to be defined otherwise the definition provides no assistance to human understanding Machines can cope with the full formal representation (it doesn’t need the text)
46
Sixth Rule: Basis in Reality When building or maintaining an ontology, always think carefully about how classes (types, kinds, species) relate to instances in reality Axioms governing instances Every class has at least one instance (exceptions will occur at top levels) Each child class has a smaller collection of instances than its parent class
47
Axiom: Every parent class has at least two children
48
The reason that rules are important: Interoperability Ontologies should work together Avoid redundancy in ontology building Support reuse Ontologies should be capable of being used by other ontologies (cumulation)
49
The problem of ontology re-use SNOMED MeSH UMLS NCIT HL7-RIM … None of these have clearly defined relations Still remain too much at the level of TERMINOLOGY Not based on a common set of rules Not based on a common set of relations
50
An example of unclear relationship use A is_a B ‘A’ is more specific in meaning than ‘B’ HL7-RIM: Individual Allele is_a Act of Observation cancer documentation is_a cancer disease prevention is_a disease
51
How to define A is_a B A is_a B = def. 1.A and B are names of universals (natural kinds, types) in reality 2.all instances of A are as a matter of biological science also instances of B
52
Benefits of well-defined relationships If the relations in an ontology are well- defined, then reasoning can cascade from one relational assertion (A R 1 B) to the next (B R 2 C). Relations used in ontologies thus far have not been well defined in this sense. Find all DNA binding proteins should also find all transcription factor proteins because Transcription factor is_a DNA binding protein
53
Biomedical data integration / interoperability Will never be achieved through integration of meanings or concepts The problem: different user communities use different concepts What is really needed is a well-defined, commonly used set of relationships
54
Seventh Rule: Distinguish Universals and Instances A good ontology must distinguish clearly between universals (types, kinds, classes) and instances (tokens, individuals, particulars)
55
Why distinguish classes from instances? What holds on the level of instances may not hold on the level of universals For example, my definition of an “adjacent_to” relation requires that it work in either direction (This particular) nucleus adjacent_to (this particular) cytoplasm Always true Cytoplasm adjacent_to nucleus Not always true
56
Using relations Between classes: is_a, part_of,... Between an instance and a class: this explosion instance_of the class explosion Between instances: Mary’s heart part_of Mary Relations must be defined to always work
57
Defining the part_of relation can be a problem part_of as a relation between classes versus part_of as a relation between instances nucleus part_of cell (classes) your heart part_of you (instances) testis part_of human being ? heart part_of human being ? human being has_part human testis ?
58
Similar considerations are required to clearly define nearly all relations A causes B A is_located in B A is_adjacent_to B A derives_from B Zygote derives_from ovum, sperm A transformation_of B Adult transformation_of child
59
The Rules 1.Univocity: Terms should have the same meanings on every occasion of use 2.Positivity: Terms such as ‘non-mammal’ or ‘non- membrane’ do not designate genuine classes. 3.Objectivity: Terms such as ‘unknown’ or ‘unclassified’ or ‘unlocalized’ do not designate biological natural kinds. 4.Single Inheritance: No class in a classification hierarchy should have more than one is_a parent on the immediate higher level 5.Intelligibility of Definitions: The terms used in a definition should be simpler (more intelligible) than the term to be defined 6.Basis in Reality: When building or maintaining an ontology, always think carefully at how classes relate to instances in reality 7.Distinguish Classes and Instances
60
Some rules are Rules of Thumb The world is full of difficult trade-offs The benefits of formal (logical and ontological) rigor need to be balanced Against the constraints of computer tractability, Against the needs of biomedical practitioners. BUT do the very best you can!
61
3. Case Studies from the GO http://www.geneontology.org
62
How has GO dealt with some specific aspects of ontology development? Univocity Positivity Objectivity Definitions Formal definitions Written definitions Ontology Re-use (Alignment)
63
Tactile sense Taction Tactition ? The Challenge of Univocity: People call the same thing by different names
64
Tactile sense Taction Tactition perception of touch ; GO:0050975 Univocity: GO uses one term and many characterized synonyms
65
= bud initiation The Challenge of Univocity: People use the same words to describe different things
66
Bud initiation? How is a computer to know?
67
= bud initiation sensu Metazoa = bud initiation sensu Saccharomyces = bud initiation sensu Viridiplantae Univocity: GO adds “sensu” descriptors to discriminate among organisms
68
The Challenge of Positivity Some organelles are membrane-bound. A centrosome is not a membrane bound organelle, but it still may be considered an organelle.
69
The Challenge of Positivity: Sometimes absence is a distinction in a Biologist’s mind non-membrane-bound organelle GO:0043228 membrane-bound organelle GO:0043227
70
Positivity Note the logical difference between “non-membrane-bound organelle” and “not a membrane-bound organelle” The latter includes everything that is not a membrane bound organelle!
71
The Challenge of Objectivity: Database users want to know if we don’t know anything (Exhaustiveness with respect to knowledge) We don’t know anything about a gene product with respect to these We don’t know anything about the ligand that binds this type of GPCR
72
Objectivity How can we use GO to annotate gene products when we know that we don’t have any information about them? Currently GO has terms in each ontology to describe unknown An alternative might be to annotate genes to root nodes and use an evidence code to describe that we have no data. Similar strategies could be used for things like receptors where the ligand is unknown.
73
GPCRs with unknown ligands We could annotate to this
74
GO Definitions A definition written by a biologist: necessary & sufficient conditions written definition (not computable) Graph structure: necessary conditions formal (computable)
75
Relationships and definitions Important considerations: Placement in the graph- selecting parents Appropriate relationships to different parents True path violation
76
True path violation What is it?..”the path from a child term all the way up to its top-level parent(s) must always be true". chromosome Mitochondrial chromosome Is_a relationship Part_of relationship nucleus
77
True path violation What is it? nucleuschromosome Nuclear chromosome Mitochondrial chromosome Is_a relationshipsPart_of relationship
78
The Importance of synonyms: is tRNA a function? Molecular_function Triplet codon amino acid adaptor activity GO Definition: Mediates the insertion of an amino acid at the correct point in the sequence of a nascent polypeptide chain during protein synthesis. Synonym: tRNA
79
Ontology integration One of the current goals of GO is integration cone cell fate commitment retinal_cone_cell keratinocyte differentiation keratinocyte adipocyte differentiation fat_cell dendritic cell activation dendritic_cell lymphocyte proliferation lymphocyte T-cell homeostasis T_lymphocyte garland cell differentiation garland_cell heterocyst cell differentiation heterocyst References to Cell Types in GO Cell Types in the Cell Ontology with
80
We can integrate the GO with other ontologies Chemical ontologies 3,4-dihydroxy-2-butanone-4-phosphate synthase activity Anatomy ontologies metanephros development GO itself mitochondrial inner membrane peptidase activity Nota bene: some time and effort will be required
81
Building Ontology Improve Collaborate and Learn
82
Applied Ontology: a summary Dedicated editors Practice good ontological hygiene Engage the community Reward compliance and get the ontology into use Plan for change over time KISS: Concentrate on what you can definitely agree upon: the steps you can take with certainty.
83
4. Case Studies for group discussion
84
mitosis and meiosis It's been a full lunar cycle since we last talked about this on the mailing list, and I would like to draw everyone's attention once again to the exciting topics of chromosome segregation, nuclear division and cell division. The basic problem is the multiplicity of meanings attached to 'mitosis'. The word are used in the literature and colloquially to represent everything from chromosome segregation up to a full round of nuclear and cell division and there is no consensus on how to define it in scientific or general dictionaries (check www.onelook.com for proof). To compound the problem, the only process common to all species which undergo 'mitosis' is chromosome segregation; not all species undergo nuclear division or cell division during the processes described in the literature as 'mitosis'. In the ontologies, we currently have 'mitosis' defined as chromosome segregation and nuclear division. This is therefore wrong for those species in which there is no nuclear division accompanying chromosome segregation. How are we going to define mitosis?www.onelook.com
85
Events of the mitotic cell cycle that need to be represented: mitotic chromosome segregation mitotic nuclear division mitotic cell division Only component common to all these is mitotic chromosome segregation. Structure must be flexible enough to accommodate any of the flavors of 'mitosis’, no matter what the species and no matter whether the annotator has read the definition or not.
87
Backing up assertions QUESTION: What evidence code is appropriate to use for statements of “common knowledge”?
88
The current documentation states that TAS may be used as the evidence code for statements of common knowledge. For example, let’s say you have a paper that says that Protein X is an xxxxx, with a direct assay for activity, so you can use IDA for this function term. Then it also makes a mutation in the gene for Protein X and shows that it is involved in process yyyy, so you can use IMP for the process term. But, the paper does not have any direct evidence about the localization of Protein X. However, everyone knows that process yyyy occurs in the cytoplasm, so you can annotate protein X to the component term “cytoplasm ; GO:5737” by TAS using a general reference like Biochemistry by Lupert Stryer.
89
There is not really a traceable statement in Stryer providing evidence that process yyyy occurs in this location in yeast. SGD feels that it is better to use the newer evidence code IC for these “common” knowledge types of annotations. Thus, if an SGD curator felt that it was reasonable to make the annotation “cytoplasm” based on the knowledge that Protein X the process annotation yyyy, then the curator could assign the component term “cytoplasm ; GO:5737” using IC and the GOid of the process term yyyyy.
90
many of these “common knowledge” types of statements are often not well based in actual experiments conducted on the organism of interest, that early biochemists would often perform experiments with materials that were easy to obtain, e.g. calf thymus, and assume that this accurately represented the situation for another organism, e.g. human. This may or may not be the case.
91
What is the most appropriate GO term for annotating a response to methylmercury? "Response to mercury ion" doesn't seem quite right, as it specifically states that the response is "as a result of exposure to mercuric ions (Hg2+)", but the more general-sounding "response to mercury" is a synonym of it. In the publication I am working on, they exposed zebrafish to methylmercury and documented the resulting changes in gene expression.
92
"Response to mercury ion Definition: A change in state or activity of the organism (in terms of movement, secretion, enzyme production, gene expression, etc.) as a result of exposure to mercuric ions (Hg2+). Synonyms: response to mercuric, response to mercury
93
Homeobox => DNA binding? http://www.geneontology.org/email- annotation/annotation-arc/annotation- 2005/0208.html http://www.geneontology.org/email- annotation/annotation-arc/annotation- 2005/0208.html
94
Bloggers and other online groups (eg. del.icio.us, Flickr [online photo archive], Technorati) have been self-categorizing or 'tagging' web sites and their content using user-defined words and phrases and not an expertly curated vocabulary or ontology. The end result is that a vast amount of content has been indexed using a rich vocabulary of tags (to date, technorati has over 1.2 billion links tagged with 1.2 million tags). Whilst this certainly lacks the formal consistency that would be obtained with curated annotation against a standard vocabulary, the quantity of content being categorized far exceeds what could be done by a group of annotators and perhaps is richer because the tags are defined by the users and creators of that content, not by a third party interpreting the material after the fact. Given the ever increasing quantity of scientific data, the proliferation of online publishing, etc., could scientists tagging their own data with their own terms be the way to go?
95
How can you recruit and train people, in both logic and biology, given that without a sufficient number of competent personnel the ontology cannot be maintained?
96
Thanks to NIH and HHMI for funding and support And to my fantastic colleagues (whose slides these are) MICHAEL ASHBURNTER, BARRY SMITH, DAVID HILL, CORNELIUS ROSSE & CHRIS MUNGALL
97
P.S. Graphical User Interfaces Semantics
98
Common pitfalls Don’t confuse instances with artifacts of your database representation...
99
Instances are not included! It is the universals that are important …though instances must be taken into account. Please keep this in mind, it is a crucial to understanding the tutorial Simon is an instance of the universal (class) human
100
Concept Concepts are in your head and will change as our understanding changes Universals exist and have an objective reality
101
Ontologies as Controlled Vocabularies expressing discoveries in the life sciences in a uniform way providing a uniform framework for managing annotation data deriving from different sources and with varying types and degrees of evidence
102
Structured definitions contain both genus and differentiae Essence = Genus + Differentiae neuron cell differentiation = Genus: differentiation (processes whereby a relatively unspecialized cell acquires the specialized features of..) Differentiae: acquires features of a neuron
103
Key idea: To define ontological relations Move from associative relations between meanings to strictly defined relations between the entities themselves. The relations can then be used computationally in the way required. For example: part_of, develops_from Definitions will enable computation To define relations we must look at more than the classes. We need also to take account of instances and time
104
‘is_a’ is pressed into service to mean a variety of different things shortfalls from single inheritance are often clues to multiple is_a classification meanings the resulting ambiguities make it difficult for curators to reliably enter new terms (errors). serves as obstacle to integration with neighboring ontologies The success of ontology alignment depends crucially on the degree to which basic ontological relations such as is_a and part_of can be relied on as having the same meanings in the different ontologies to be aligned.
105
Definitions of the all-some form Allow cascading inferences If A R 1 B and B R 2 C, then we know that Every A stands in R 1 to some B, but we know also that, whichever B this is, it can be plugged into the R 2 relation, because R 2 is defined for every B.
106
Not only relations We can apply the same methodology to other top- level categories in ontology, e.g. anatomical structure process function regulation, inhibition, suppression, co-factor... boundary, interior contact, separation, continuity tissue, membrane, sequence, cell
107
To the degree that the above rules are not satisfied, error checking and ontology alignment will be achievable, at best, only with human intervention and via force majeure
108
What we have argued for: A methodology which enforces clear, coherent definitions This promotes quality assurance intent is not hard-coded into software Meaning of relationships is defined, not inferred Guarantees automatic reasoning across ontologies and across data at different granularities
109
The importance of relationships Cyclin dependent protein kinase Complex has a catalytic and a regulatory subunit How do we represent these activities (function) in the ontology? Do we need a new relationship type (regulates)? Catalytic activity protein kinase activity protein Ser/Thr kinase activity Cyclin dependent protein kinase activity Cyclin dependent protein kinase regulator activity Molecular_function Enzyme regulator activity Protein kinase regulator activity
110
GO textual definitions: Related GO terms have similarly structured (normalized) definitions
111
Alignment of the Two Ontologies will permit the generation of consistent and complete definitions id: CL:0000062 name: osteoblast def: "A bone-forming cell which secretes an extracellular matrix. Hydroxyapatite crystals are then deposited into the matrix to form bone." [MESH:A.11.329.629] is_a: CL:0000055 relationship: develops_from CL:0000008 relationship: develops_from CL:0000375 GO Cell type New Definition + = Osteoblast differentiation: Processes whereby an osteoprogenitor cell or a cranial neural crest cell acquires the specialized features of an osteoblast, a bone-forming cell which secretes extracellular matrix.
112
Alignment of the Two Ontologies will permit the generation of consistent and complete definitions id: GO:0001649 name: osteoblast differentiation synonym: osteoblast cell differentiation genus: differentiation GO:0030154 (differentiation) differentium: acquires_features_of CL:0000062 (osteoblast) definition (text): Processes whereby a relatively unspecialized cell acquires the specialized features of an osteoblast, the mesodermal cell that gives rise to bone Formal definitions with necessary and sufficient conditions, in both human readable and computer readable forms
113
part_of part_of must be time-indexed for spatial classes A part_of B is defined as: Given any instance a and any time t, If a is an instance of the universal A at t, then there is some instance b of the universal B such that a is an instance-level part_of b at t
114
C c at t C 1 c 1 at t 1 C' c' at t time instances derives_from ovum sperm zygote
115
c at t 1 C c at t C 1 time same instance transformation_of pre-RNAmature RNA adultchild C 2 transformation_of C 1 is defined as Given any instance c of C 2 c was at some earlier time an instance of C 1
116
embryological development C c at t c at t 1 C 1
117
C c at t c at t 1 C 1 tumor development
118
Key In the following discussion: Classes are in upper case ‘A’ is the class Instances are in lower case ‘a’ is a particular instance
119
Placement in the graph Example- Proteasome complex
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.