1 Principles for Building Biomedical Ontologies Barry Smith.

Slides:



Advertisements
Similar presentations
1 Five Steps to Interoperability (in the domain of scientific ontology) Barry Smith.
Advertisements

Cell Communication Cells need to communicate with one another, whether they are located close to each other or far apart. Extracellular signaling molecules.
ARCHITECTURES FOR ARTIFICIAL INTELLIGENCE SYSTEMS
What is Ontology? Dictionary:A branch of metaphysics concerned with the nature and relations of being. Barry Smith:The science of what is, of the kinds.
Prof. Drs. Sutarno, MSc., PhD.. Biology is Study of Life Molecular Biology  Studying life at a molecular level Molecular Biology  modern Biology The.
Genome organization Lesk, Ch 2 (Lesk, 2008). Genomes and proteomes Genome of a typical bacterium comes as a single DNA molecule of about 5 million characters.
Application of OBO Foundry Principles in GO Chris Mungall Lawrence Berkeley Labs NCBO GO Consortium.
Gene Ontology John Pinney
OASIS Reference Model for Service Oriented Architecture 1.0
1 An Ontology of Relations for Biomedical Informatics Barry Smith 10 January 2005.
The Role of Foundational Relations in the Alignment of Biomedical Ontologies Barry Smith and Cornelius Rosse.
1 Ontology in 15 Minutes Barry Smith. 2 Main obstacle to integrating genetic and EHR data No facility for dealing with time and instances (particulars)
Thomas Bittner and Barry Smith IFOMIS (Saarbrücken) Normalizing Medical Ontologies Using Basic Formal Ontology.
Ontology The science of the kinds and structures of objects, and their properties and relations. Defined by a scientific field's vocabulary and by the.
The Ontology of the Gene Ontology Barry Smith Jennifer Williams Steffen Schulze-Kremer
STOP Barry Smith Smart Terminologies via Ontological Principles.
Lawrence Hunter, Ph.D. Director, Computational Bioscience Program University of Colorado School of Medicine
On the Application of Formal Principles to Life Science Data: A Case Study in the Gene Ontology Barry Smith * Jacob Köhler † Anand Kumar * *
Use of Ontologies in the Life Sciences: BioPax Graciela Gonzalez, PhD (some slides adapted from presentations available at
1 Logical Tools and Theories in Contemporary Bioinformatics Barry Smith
AN INTRODUCTION TO BIOMEDICAL ONTOLOGY Barry Smith University at Buffalo 1.
Tutorial on Ontology Design Barry Smith and Werner Ceusters.
Internet tools for genomic analysis: part 2
Major Constituents of Cell
1 The OBO Relation Ontology Genome Biology 2005, 6:R46 based on the fundamental distinction between instances and universals takes instances and time into.
Son of SN Barry Smith. The Virtues of Single Inheritance (= True Hierarchy) better coding clearer instructions better automatic reasoning better definitions.
Charles Dodgson aka Lewis Caroll Once master the machinery of Symbolic Logic, and you have a mental occupation always at hand, of absorbing interest, and.
1 The Future of Clinical Bioinformatics: Overcoming Obstacles to Information Integration Barry Smith Brussells, Eurorec Ontology Workshop, 25 November.
CSCI-383 Object-Oriented Programming & Design Lecture 15.
Chapter 3 The Biological Basis of Life. Chapter Outline  The Cell  DNA Structure  DNA Replication  Protein Synthesis  What is a Gene?  Cell Division:
Chapter 5: Cell Growth and Division
Click to edit Master title style  Click to edit Master text styles  Second level  Third level  Fourth level  Fifth level  Click to edit Master text.
Building Biomedical Ontologies Jennifer Clark, GO Editorial Office.
Principles for Building Biomedical Ontologies Suzanna Lewis National Center Biomedical Ontology 22 October 2005 Advanced Bioinformatics, Cold Spring Harbor.
Gene Ontology (GO) Project
GO : the Gene Ontology “because you know sometimes words have two meanings” Amelia Ireland GO Curator EBI, Cambridge, UK.
GO and OBO: an introduction. Jane Lomax EMBL-EBI What is the Gene Ontology? What is OBO? OBO-Edit demo & practical What is the Gene Ontology? What is.
Amo amos amot amomus amotis amont. Happy birthday Swiss-Prot Fortaleza August 2006.
Open Biomedical Ontologies. Open Biomedical Ontologies (OBO) An umbrella project for grouping different ontologies in biological/medical field –a repository.
GENE ONTOLOGY FOR THE NEWBIES Suparna Mundodi, PhD The Arabidopsis Information Resources, Stanford, CA.
Gene Ontology Consortium
AP Biology Discussion Notes Wednesday 01/28/2015.
The Gene Ontology project Jane Lomax. Ontology (for our purposes) “an explicit specification of some topic” – Stanford Knowledge Systems Lab Includes:
Gene Ontology Project
Blueprint of Life Topic 18: Protein Synthesis
Gene Ontology TM (GO) Consortium Jennifer I Clark EMBL Outstation - European Bioinformatics Institute (EBI), Hinxton, Cambridge CB10 1SD, UK Objectives:
Chapter 3 The Biological Basis of Life. Chapter Outline The Cell DNA Structure DNA Replication Protein Synthesis Cell Division: Mitosis and Meiosis New.
LOGIC AND ONTOLOGY Both logic and ontology are important areas of philosophy covering large, diverse, and active research projects. These two areas overlap.
Ontologies GO Workshop 3-6 August Ontologies  What are ontologies?  Why use ontologies?  Open Biological Ontologies (OBO), National Center for.
Ontological Foundations of Biological Continuants Stefan Schulz, Udo Hahn Text Knowledge Engineering Lab University of Jena (Germany) Department of Medical.
Cells and Tissues A&P Unit II.  Modern cell theory incorporates several basic concepts  Cells are the building blocks of all plants and animals  Cells.
A Biology Primer Part III: Transcription, Translation, and Regulation Vasileios Hatzivassiloglou University of Texas at Dallas.
1 How to Build an Ontology Barry Smith
Gene Ontology Project
Human Anatomy & Physiology FIFTH EDITION Elaine N. Marieb PowerPoint ® Lecture Slide Presentation by Vince Austin Copyright © 2003 Pearson Education, Inc.
Gene Ontology Consortium
Primary vs. Secondary Databases Primary databases are repositories of “raw” data. These are also referred to as archival databases. -This is one of the.
Winter 2011SEG Chapter 11 Chapter 1 (Part 1) Review from previous courses Subject 1: The Software Development Process.
1 The OBO Relation Ontology: Preliminaries Barry Smith
Tools in Bioinformatics Ontologies and pathways. Why are ontologies needed? A free text is the best way to describe what a protein does to a human reader.
Copyright © 2006 Pearson Education, Inc., publishing as Benjamin Cummings Protein Synthesis.
CSCI 383 Object-Oriented Programming & Design Lecture 15 Martin van Bommel.
Knowledge Representation Part I Ontology Jan Pettersen Nytun Knowledge Representation Part I, JPN, UiA1.
Ontology in 15 Minutes Barry Smith.
Principles for Building Biomedical Ontologies
Principles for Building Biomedical Ontologies
Principles for Building Biomedical Ontologies: A GO Perspective
Principles for Building Biomedical Ontologies:A GO Perspective
Ontology in 15 Minutes Barry Smith.
What is Ontology? s Dictionary:A branch of metaphysics concerned with the nature and relations of being. Barry Smith:The science of what is, of.
Presentation transcript:

1 Principles for Building Biomedical Ontologies Barry Smith

2 Computers are tools for scientists  this fact does not mean that the sciences themselves have new kinds of objects (data, information)  bio-ontologies are about genes, cells, organisms  not about terms, symbols, concepts, data

3 Overview  Following basic rules helps make better ontologies  We will work through the principles- based treatment of relations in ontologies, to show how ontologies can become more reliable and more powerful

4 Why do we need rules for good ontology?  Ontologies must be intelligible both to humans (for annotation) and to machines (for reasoning and error-checking)  Unintuitive rules for classification lead to entry errors (problematic links)  Facilitate training of curators  Overcome obstacles to alignment with other ontology and terminology systems  Enhance harvesting of content through automatic reasoning systems

5 First Rule: Univocity  Terms (including those describing relations) should have the same meanings on every occasion of use.  In other words, they should refer to the same kinds of entities in reality

6 MedDRA  a cold  cold (vs. hot)  C.O.L.D. (Chronic-Obstructive-Lung- Disease) code with ‘C.O.L.D.’ or call to check

7 Second Rule: Positivity  Complements of types are not themselves types.  Terms such as ‘non-mammal’ or ‘non- membrane’ do not designate genuine types.

8 Third Rule: Objectivity  Which types exist is not a function of our biological knowledge.  Terms such as ‘unknown’ or ‘unclassified’ or ‘unlocalized’ do not designate biological natural kinds.

9 Fourth Rule: Single Inheritance No type in a classificatory hierarchy should have more than one is_a parent on the immediate higher level

10 Rule of Single Inheritance  no diamonds: C is_a 2 B is_a 1 A

11 Problems with multiple inheritance B C is_a 1 is_a 2 A ‘is_a’ no longer univocal

12 ‘is_a’ is pressed into service to mean a variety of different things  shortfalls from single inheritance are often clues to incorrect entry of terms and relations  the resulting ambiguities make the rules for correct entry difficult to communicate to human curators

13 is_a Overloading  serves as obstacle to integration with neighboring ontologies  The success of ontology alignment depends crucially on the degree to which basic ontological relations such as is_a and part_of can be relied on as having the same meanings in the different ontologies to be aligned.

14 Use of multiple inheritance  The resultant mélange makes coherent integration across ontologies achievable (at best) only under the guidance of human beings with relevant biological knowledge  How much should reasoning systems be forced to rely on human guidance?

15 Fifth Rule: Intelligibility of Terms and Definitions  Terms should be intelligible  ‘apoptosis inhibitor activity’ is a function in GO  relations between function and the processes they enable become very difficult to state unless function terms designate functions in an intelligible way  structural constituent of tooth enamel

16  extracellular matrix structural constituent  puparial glue (sensu Diptera)  structural constituent of bone  structural constituent of chorion (sensu Insecta)  structural constituent of chromatin  structural constituent of cuticle  structural constituent of cytoskeleton  structural constituent of epidermis  structural constituent of eye lens  structural constituent of muscle  structural constituent of myelin sheath  structural constituent of nuclear pore  structural constituent of peritrophic membrane (sensu Insecta)  structural constituent of ribosome – note possibility of confusion with ‘major ribosome unit’ (check)  structural constituent of tooth enamel  structural constituent of vitelline membrane (sensu Insecta)

17 Fifth Rule: Intelligibility of Terms and Definitions  The terms used in a definition should be simpler (more intelligible) than the term to be defined  otherwise the definition provides no assistance  to human understanding  for machine processing

18 To the degree that the above rules are not satisfied, error checking and ontology alignment will be achievable, at best, only with human intervention and via brute force

19 Some rules are Rules of Thumb  The world of biomedical research is a world of difficult trade-offs  The benefits of formal (logical and ontological) rigor need to be balanced  Against the constraints of computer tractability,  Against the needs of biomedical practitioners.  BUT alignment and integration of biomedical information resources will be achieved only to the degree that such resources conform to these standard principles of classification and definition

20 Definitions should be intelligible to both machines and humans  Machines can cope with the full formal representation  Humans need to use modularity  Plasma membrane  is a cell part [immediate parent]  that surrounds the cytoplasm [differentia]

21 Terms and relations should have clear definitions  These tell us how the ontology relates to the world of biological instances, meaning the actual particulars in reality:  actual cells, actual portions of cytoplasm, and so on…

22 Sixth Rule: Basis in Reality  When building or maintaining an ontology, always think carefully at how types (types, kinds, species) relate to instances in reality

23 Axioms governing instances  Every type has at least one instance  Every genus (parent type) has an instantiated species (differentia + genus)  Each species (child type) has a smaller type of instances than its genus (parent type)

24 Axioms governing Instances  Distinct types on the same level never share instances  Distinct leaf types within a classification never share instances

25 siamese mammal cat organism substance species, genera animal instances frog leaf type

26 Interoperability  Ontologies should work together  ways should be found to avoid redundancy in ontology building and to support reuse  ontologies should be capable of being used by other ontologies (cumulation)

27 Main obstacle to integration  Current ontologies do not deal well with  Time and  Space and  Instances (particulars)  Our definitions should link the terms in the ontology to instances in spatio- temporal reality

28 Benefits of well-defined relationships  If the relations in an ontology are well- defined, then reasoning can cascade from one relational assertion (A R 1 B) to the next (B R 2 C). Relations used in ontologies thus far have not been well defined in this sense.  Find all DNA binding proteins should also find all transcription factor proteins because  Transcription factor is_a DNA binding protein

29 How to define A is_a B A is_a B =def. 1. A and B are names of types (natural kinds, universals) in reality 2.all instances of A are as a matter of biological science also instances of B

30 Biomedical ontology integration / interoperability  Will never be achieved through integration of meanings or concepts  The problem is precisely that different user communities use different concepts  What’s really needed is to have well- defined commonly used relationships

31 Idea:  Move from associative relations between meanings to strictly defined relations between the entities themselves.  The relations can then be used computationally in the way required

32 Key idea: To define ontological relations  For example: part_of, develops_from  Definitions will enable computation  It is not enough to look just at types or types.  We need also to take account of instances and time

33 Kinds of relations  Between types:  is_a, part_of,...  Between an instance and a type  this explosion instance_of the type explosion  Between instances:  Mary’s heart part_of Mary

34 Seventh Rule: Distinguish types and Instances  A good ontology must distinguish clearly between  types (universals, kinds, species) and  instances (tokens, individuals, particulars)

35 Don’t forget instances when defining relations  part_of as a relation between types versus part_of as a relation between instances  nucleus part_of cell  your heart part_of you

36 Part_of as a relation between types is more problematic than is standardly supposed  testis part_of human being ?  heart part_of human being ?  human being has_part human testis ?

37 Why distinguish types from instances?  What holds on the level of instances may not hold on the level of types  nucleus adjacent_to cytoplasm  Not: cytoplasm adjacent_to nucleus  seminal vesicle adjacent_to urinary bladder  Not: urinary bladder adjacent_to seminal vesicle

38 part_of  part_of must be time-indexed for spatial types  A part_of B is defined as: Given any instance a and any time t, If a is an instance of the type A at t, then there is some instance b of the type B such that a is an instance-level part_of b at t

39 C c at t C 1 c 1 at t 1 C' c' at t derives_from (ovum, sperm  zygote... ) time instances

40 transformation_of c at t 1 C c at t C 1 time same instance pre-RNA  mature RNA child  adult

41 transformation_of   C 2 transformation_of C 1 =def. any instance of C 2 was at some earlier time an instance of C 1

42 C c at t c at t 1 C 1 embryological development

43 C c at t c at t 1 C 1 tumor development

44 Time menopause part_of aging aging part_of death menopause part_of death

45 The simple, formal details “Relations in Biomedical Ontologies” Genome Biology, 2005, 6 (5)

46 Principles for Building Biomedical Ontologies:A GO Perspective David Hill Mouse Genome Informatics The Jackson Laoratory

47 How has GO dealt with some specific aspects of ontology development?  Univocity  Positivity  Objectivity  Single Inheritance  Definitions  Formal definitions  Written definitions  Basis in Reality  Universals & Instances  Ontology Alignment

48 Tactile sense Taction Tactition ? The Challenge of Univocity: People call the same thing by different names

49 Tactile sense Taction Tactition perception of touch ; GO: Univocity: GO uses 1 term and many characterized synonyms

50 = bud initiation The Challenge of Univocity: People use the same words to describe different things

51 Bud initiation? How is a computer to know?

52 = bud initiation sensu Metazoa = bud initiation sensu Saccharomyces = bud initiation sensu Viridiplantae Univocity: GO adds “sensu” descriptors to discriminate among organisms

53 The Importance of synonyms for utility: How do we represent the function of tRNA? Biologically, what does the tRNA do? Identifies the codon and inserts the amino acid in the growing polypeptide Molecular_function Triplet_codon amino acid adaptor activity GO Definition: Mediates the insertion of an amino acid at the correct point in the sequence of a nascent polypeptide chain during protein synthesis. Synonym: tRNA

54 But Univocity is also Dependent on a User’s Perspective Development (The biological process whose specific outcome is the progression of an organism over time from an initial condition to a later condition) --part_of hepatocyte differentiation ----part_of hepatocyte fate commitment part_of hepatocyte fate specification part_of hepatocyte fate determination ----part_of hepatocyte development

55 But Univocity is also Dependent on a User’s Perspective So from the perspective of GO a hepatocyte begins development after it is committed to its fate. Its initial condition is after cell fate commitment. But! A User may ask show me things that have do do with hepatocyte development. Do they mean show me things that have to do with ‘hepatocyte development” or do they mean show me things that have to do with ‘development’ and a ‘hepatocyte’?

56 The Challenge of Positivity Some organelles are membrane-bound. A centrosome is not a membrane bound organelle, but it still may be considered an organelle.

57 The Challenge of Positivity: Sometimes absence is a distinction in a Biologist’s mind non-membrane-bound organelle GO: membrane-bound organelle GO:

58 Positivity  Note the logical difference between  “non-membrane-bound organelle” and  “not a membrane-bound organelle”  The latter includes everything that is not a membrane bound organelle!

59 The Challenge of Objectivity: Database users want to know if we don’t know anything (Exhaustiveness with respect to knowledge) We don’t know anything about a gene product with respect to these We don’t know anything about the ligand that binds this type of GPCR

60 Objectivity  How can we use GO to annotate gene products when we know that we don’t have any information about them?  Currently GO has terms in each ontology to describe unknown  An alternative might be to annotate genes to root nodes and use an evidence code to describe that we have no data.  Similar strategies could be used for things like receptors where the ligand is unknown.

61 GPCRs with unknown ligands We could annotate to this

62 Single Inheritance  GO has a lot of is_a diamonds  Some are due to incompleteness of the graph  Some are due to a mixture of dissimilar types within the graph at the same level

63 Is_a diamond in GO Process behavior locomotory behaviorlarval behavior larval locomotory behavior

64 Is_a diamond in GO Function enzyme regulator activity GTPase regulator activity enzyme activator activity GTPase activator acivity

65 Is_a diamond in GO Cellular Component organelle non-membrane bound organelle intracellular organelle non-membrane bound intracellular organelle

66 Technically the diamonds are correct, but could be eliminated locomotory behaviorlarval behavior GTPase regulator activity enzyme activator activity non-membrane bound organelle intracellular organelle What do these pairs have in common?

67 What do the middle pair of terms all have in common? locomotory behaviorlarval behavior GTPase regulator activity enzyme activator activity non-membrane bound organelle intracellular organelle

68 They are all differentiated from the parent term by a different factor locomotory behaviorlarval behavior GTPase regulator activity enzyme activator activity non-membrane bound organelle intracellular organelle Type of behavior vs. what is behaving What is regulated vs. type of regulator Type of organelle vs. location of organelle

69 Insert an intermediate grouping term behavior locomotory behavior larval behavior larval locomotory behavior behavior of a thing descriptive behavior

70 Why insert terms that no one would use? behavior locomotory behavior larval behaviorrhythmic behavioradult behavior By the structure of this graph, locomotory behavior has the same relationship to larval behavior as to rhythmic behavior

71 Why insert terms that no one would use? behavior locomotory behavior larval behaviorrhythmic behavioradult behavior But actually, locomotory behavior/rhythmic behavior and larval behavior/adult behavior group naturally Descriptive behavior Behavior of a thing This type of single step differentiation of terms between levels would allow us to use distances between nodes and levels to compare similarity.

72 GO Definitions A definition written by a biologist: necessary & sufficient conditions written definition (not computable) Graph structure: necessary conditions formal (computable)

73 Relationships and definitions  The set of necessary conditions is determined by the graph  This can be considered a partial definition  Important considerations:  Placement in the graph- selecting parents  Appropriate relationships to different parents  True path violation

74 Placement in the graph  Example- Proteasome complex

75 The importance of relationships  Cyclin dependent protein kinase  Complex has a catalytic and a regulatory subunit  How do we represent these activities (function) in the ontology?  Do we need a new relationship type (regulates)? Catalytic activity protein kinase activity protein Ser/Thr kinase activity Cyclin dependent protein kinase activity Cyclin dependent protein kinase regulator activity Molecular_function Enzyme regulator activity Protein kinase regulator activity

76 We must avoid true path violations..”the pathway from a child term all the way up to its top-level parent(s) must always be true". chromosome Mitochondrial chromosome Is_a relationship Part_of relationship nucleus

77 We must avoid true path violations..”the pathway from a child term all the way up to its top-level parent(s) must always be true". nucleuschromosome Nuclear chromosome Mitochondrial chromosome Is_a relationshipsPart_of relationship

78 GO textual definitions: Related GO terms have similarly structured (normalized) definitions

79 Structured definitions contain both genus and differentiae Essence = Genus + Differentiae neuron cell differentiation = Genus: differentiation (processes whereby a relatively unspecialized cell acquires the specialized features of..) Differentiae: acquires features of a neuron

80 Basis in Reality  GO is designed by a consortium  As long as egos don’t get in the way, GO represents types rather than concepts  Large-scale developments of the GO are a result of compromise  Gene Annotators have a large say in GO content  Annotators are experts in their fields  Annotators constantly read the scientific literature But, since GO is representing a science, GO actually represents paradigms. Therefore, it is essential that GO is able to change!

81 types and Instances For the sake of GO, types are the terms and instances are the gene product attributes that are annotated to them.

82 types and Instances  When should we create a new type as opposed to multiple annotations?  When the the biology represents a universal principal. Receptor signaling protein tyrosine kinase activity does not represent receptor signaling protein activity and tyrosine kinase activity independently.

83 Ontology alignment One of the current goals of GO is to align:  cone cell fate commitment  retinal_cone_cell  keratinocyte differentiation  keratinocyte  adipocyte differentiation  fat_cell  dendritic cell activation  dendritic_cell  lymphocyte proliferation  lymphocyte  T-cell homeostasis  T_lymphocyte  garland cell differentiation  garland_cell  heterocyst cell differentiation  heterocyst Cell Types in GOCell Types in the Cell Ontology with

84 Alignment of the Two Ontologies will permit the generation of consistent and complete definitions id: CL: name: osteoblast def: "A bone-forming cell which secretes an extracellular matrix. Hydroxyapatite crystals are then deposited into the matrix to form bone." [MESH:A ] is_a: CL: relationship: develops_from CL: relationship: develops_from CL: GO Cell type New Definition + = Osteoblast differentiation: Processes whereby an osteoprogenitor cell or a cranial neural crest cell acquires the specialized features of an osteoblast, a bone-forming cell which secretes extracellular matrix.

85 Alignment of the Two Ontologies will permit the generation of consistent and complete definitions id: GO: name: osteoblast differentiation synonym: osteoblast cell differentiation genus: differentiation GO: (differentiation) differentium: acquires_features_of CL: (osteoblast) definition (text): Processes whereby a relatively unspecialized cell acquires the specialized features of an osteoblast, the mesodermal cell that gives rise to bone Formal definitions with necessary and sufficient conditions, in both human readable and computer readable forms

86 Other Ontologies that can be aligned with GO  Chemical ontologies  3,4-dihydroxy-2-butanone-4-phosphate synthase activity  Anatomy ontologies  metanephros development  GO itself  mitochondrial inner membrane peptidase activity

87 But Eventually…

88 Building Ontology Improve Collaborate and Learn

89 A tribute to Lewis Carroll Once master the machinery of Symbolic Logic, and you have a mental occupation always at hand, of absorbing interest, and one that will be of real use to you in any subject you may take up. It will give you clearness of thought - the ability to see your way through a puzzle - the habit of arranging your ideas in an orderly and get-at-able form - and, more valuable than all, the power to detect fallacies, and to tear to pieces the flimsy illogical arguments, which you will so continually encounter in books, in newspapers, in speeches, and even in sermons, and which so easily delude those who have never taken the trouble to master this fascinating Art. Lewis Carroll (a) All babies are illogical. (b) Nobody is despised who can manage a crocodile. (c) Illogical persons are despised Can a baby can manage a crocodile?