1 Gene Ontology and Semantic Similarity Measures.


1 Gene Ontology and Semantic Similarity Measures

2 Copyright notice: Many of the images in this PowerPoint presentation are from Bioinformatics and Functional Genomics by Jonathan Pevsner (ISBN 0-471-21004-8). Copyright © 2003 by John Wiley & Sons, Inc. Many slides in this presentation are taken from slides by Dr. Jonathan Pevsner and other people. The copyright belongs to the original authors. Thanks!

3 What is Ontology? Dictionary: A branch of metaphysics concerned with the nature and relations of being. Barry Smith: The science of what is, of the kinds and structures of objects, properties, events, processes and relations in every area of reality. 1606 1700s

4 Srinija Srinivasan, Chief Ontologist, Yahoo! The ontology. Dividing human knowledge into a clean set of categories is a lot like trying to figure out where to find that suspenseful black comedy at your corner video store. Questions inevitably come up, like are Movies part of Art or Entertainment? (Yahoo! lists them under the latter.) -Wired Magazine, May 1996

5 So what does that mean? From a practical view, an ontology is the representation of something we know about. "Ontologies" consist of a representation of things that are detectable or directly observable, and of the relationships between those things.

6 A biological ontology is: a (machine-)interpretable representation of some aspect of biological reality
–what kinds of things exist?
–what are the relationships between these things?
Example: sclera part_of eye; eye is_a sense organ; eye develops_from optic placode
http://www.macula.org/anatomy/eyeframe.html

7 Gene Ontology (GO) Consortium Formed to develop a shared language adequate for the annotation of molecular characteristics across organisms; a common language to share knowledge. Seeks to achieve a mutual understanding of the definition and meaning of any word used; thus we are able to support cross-database queries. Members agree to contribute gene product annotations and associated sequences to the GO database, thus facilitating data analysis and semantic interoperability.

8 Gene Ontology widely adopted (example: AgBase)

9 GO represents three biological domains
Molecular Function = elemental activity/task
–the tasks performed by individual gene products; examples are carbohydrate binding and ATPase activity
Biological Process = biological goal or objective
–broad biological goals, such as mitosis or purine metabolism, that are accomplished by ordered assemblies of molecular functions
Cellular Component = location or complex
–subcellular structures, locations, and macromolecular complexes; examples include nucleus, telomere, and RNA polymerase II holoenzyme

10 Example: Gene Product = hammer
Function (what) | Process (why)
Drive nail (into wood) | Carpentry
Drive stake (into soil) | Gardening
Smash roach | Pest Control
Clown's juggling object | Entertainment

11 Biological Examples Molecular Function Biological Process Cellular Component

12 Terms, Definitions, IDs
term: MAPKKK cascade (mating sensu Saccharomyces)
goid: GO:0007244
definition: OBSOLETE. MAPKKK cascade involved in transduction of mating pheromone signal, as described in Saccharomyces.
definition_reference: PMID:9561267
comment: This term was made obsolete because it is a gene product specific term. To update annotations, use the biological process term 'signal transduction during conjugation with cellular fusion ; GO:0000750'.

13 Cellular Component where a gene product acts

14 Molecular Function A gene product may have several functions; a function term refers to a reaction or activity, not a gene product. Sets of functions make up a biological process.

15 Molecular Function activities or “jobs” of a gene product glucose-6-phosphate isomerase activity

16 Molecular Function insulin binding insulin receptor activity

17 Biological Process a commonly recognized series of events cell division

18 Biological Process transcription

19 Biological Process regulation of gluconeogenesis

20 Biological Process limb development

21 Ontology Structure Terms are linked by two relationships
–is-a
–part-of

22 Ontology Structure (figure: example graph of the terms cell, membrane, chloroplast, mitochondrial membrane and chloroplast membrane, linked by is-a and part-of relationships)

23 Ontology Structure Ontologies are structured as a hierarchical directed acyclic graph (DAG). Terms can have more than one parent and zero, one or more children.

24 Ontology Structure Directed Acyclic Graph (DAG) - multiple parentage allowed (figure: example graph of the terms cell, membrane, chloroplast, mitochondrial membrane and chloroplast membrane)
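To make "multiple parentage" concrete, here is a minimal Python sketch of a DAG in which one term has two parents. The edges are an assumption read off the term names on the slide (they are not taken from the GO files), and `PARENTS` and `ancestors` are hypothetical names chosen for this illustration.

```python
# Toy DAG: each term maps to a list of (parent, relationship) pairs.
# The edges are assumed for illustration only.
PARENTS = {
    "cell": [],
    "membrane": [("cell", "part_of")],
    "chloroplast": [("cell", "part_of")],
    "mitochondrial membrane": [("membrane", "is_a")],
    # Two parents: this is what a tree cannot express but a DAG can.
    "chloroplast membrane": [("membrane", "is_a"), ("chloroplast", "part_of")],
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent links upward."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent, _relation in parents[current]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


print(sorted(ancestors("chloroplast membrane")))
```

Because the graph is acyclic, the upward walk always terminates at the root term.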

25 Ontology Structure http://www.ebi.ac.uk/ego

26 Anatomy of a GO term
id: GO:0006094 (unique GO ID)
name: gluconeogenesis (term name)
namespace: process (ontology)
def: The formation of glucose from noncarbohydrate precursors, such as pyruvate, amino acids and glycerol. [http://cancerweb.ncl.ac.uk/omd/index.html] (definition)
exact_synonym: glucose biosynthesis (synonym)
xref_analog: MetaCyc:GLUCONEO-PWY (database ref)
is_a: GO:0006006 (parentage)
is_a: GO:0006092 (parentage)
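A term record of this kind is a list of `tag: value` lines in which a tag such as `is_a` may repeat. The sketch below shows one way to read such a record into a dictionary. It is not an official GO parser; `parse_stanza` is a hypothetical helper, and the `def` line is omitted for brevity.

```python
from collections import defaultdict

# Fields copied from the slide (the long "def" line is left out).
STANZA = """\
id: GO:0006094
name: gluconeogenesis
namespace: process
exact_synonym: glucose biosynthesis
xref_analog: MetaCyc:GLUCONEO-PWY
is_a: GO:0006006
is_a: GO:0006092
"""


def parse_stanza(text):
    """Map each tag to the list of its values, since tags can repeat."""
    fields = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split on the first ": " only, because values contain colons too.
        tag, _, value = line.partition(": ")
        fields[tag].append(value.strip())
    return dict(fields)


term = parse_stanza(STANZA)
print(term["id"], term["is_a"])
```

Keeping every value in a list means the two `is_a` parents are both preserved, which is what makes the multiple parentage of the DAG visible in the data.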

27 Evidence Codes for GO Annotations http://www.geneontology.org/doc/GO.evidence.html

28 Evidence codes
IEA: Inferred from Electronic Annotation
ISS: Inferred from Sequence Similarity
IEP: Inferred from Expression Pattern
IMP: Inferred from Mutant Phenotype
IGI: Inferred from Genetic Interaction
IPI: Inferred from Physical Interaction
IDA: Inferred from Direct Assay
RCA: Inferred from Reviewed Computational Analysis
TAS: Traceable Author Statement
NAS: Non-traceable Author Statement
IC: Inferred by Curator
ND: No biological Data available

29 Semantic Similarity Measures between GO terms and proteins

30 Two kinds of information in GO are used for semantic similarity
–information content (IC) of GO terms
–structural information of the GO hierarchy

31 Information content
The IC of a GO term t is usually defined as the negative logarithm of the term's probability:
$$IC(t) = -\log p(t)$$
The probability of a given GO term t is defined as:
$$p(t) = \frac{anno(t)}{anno(root)}$$
–where anno(t) is the number of genes annotated with the term t. For the root to have probability 1, annotations to the descendants of t must be counted toward t as well.
Thus, the information content of the DAG root is 0. Moving up the DAG from a leaf toward the root, the information content never increases.
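The IC definition can be sketched in a few lines of Python. This is a toy illustration under stated assumptions: the four-term DAG and the gene sets are invented, the natural logarithm is used (the slide does not fix the base), and annotations are propagated to all ancestors so that the root has probability 1.

```python
import math

# Hypothetical toy DAG (child -> parents) and direct gene annotations.
PARENTS = {"root": [], "a": ["root"], "b": ["root"], "c": ["a", "b"]}
DIRECT_GENES = {"root": set(), "a": {"g1"}, "b": {"g2", "g3"}, "c": {"g4"}}


def ancestors_inclusive(term):
    """The term itself plus all of its ancestors."""
    result = {term}
    for parent in PARENTS[term]:
        result |= ancestors_inclusive(parent)
    return result


def propagated_genes():
    """Count a gene for its annotated term and for every ancestor."""
    genes = {term: set() for term in PARENTS}
    for term, direct in DIRECT_GENES.items():
        for ancestor in ancestors_inclusive(term):
            genes[ancestor] |= direct
    return genes


def information_content():
    """IC(t) = -log p(t), written as log(anno(root) / anno(t))."""
    genes = propagated_genes()
    total = len(genes["root"])
    return {term: math.log(total / len(g)) for term, g in genes.items()}


ic = information_content()
print({term: round(value, 3) for term, value in ic.items()})
```

In this toy data the root covers all four genes, so its IC is 0, and the leaf `c`, with a single gene, has the highest IC.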

32 Structural information
Two structural concepts that are frequently used in GO similarity measures are:
–the "lowest common ancestor" (LCA): the common ancestor of two GO terms at the lowest level (farthest from the root) in the DAG structure.
–the "most informative common ancestor" (MICA): the common ancestor of two GO terms with the lowest probability, and hence the highest IC, in the DAG structure.
–Two terms may have several common ancestors at the same level. In this case, the one with the lowest probability is taken as the LCA.
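The two concepts can be compared on a small example. The sketch below is an illustration only: the DAG and probabilities are invented, a term is treated as its own ancestor (an assumption), and "level" is taken to be the longest path from the root (also an assumption, since depth in a DAG can be defined in more than one way).

```python
# Hypothetical toy DAG (child -> parents) and term probabilities.
PARENTS = {
    "root": [],
    "a": ["root"],
    "b": ["root"],
    "c": ["a", "b"],
    "d": ["a", "b"],
}
PROB = {"root": 1.0, "a": 0.5, "b": 0.3, "c": 0.1, "d": 0.1}


def ancestors_inclusive(term):
    """The term itself plus all of its ancestors."""
    result = {term}
    for parent in PARENTS[term]:
        result |= ancestors_inclusive(parent)
    return result


def depth(term):
    """Length of the longest path from the root to the term."""
    if not PARENTS[term]:
        return 0
    return 1 + max(depth(parent) for parent in PARENTS[term])


def common_ancestors(t1, t2):
    return ancestors_inclusive(t1) & ancestors_inclusive(t2)


def mica(t1, t2):
    """Common ancestor with the lowest probability (highest IC)."""
    return min(common_ancestors(t1, t2), key=lambda t: PROB[t])


def lca(t1, t2):
    """Deepest common ancestor; ties are broken by lowest probability."""
    return max(common_ancestors(t1, t2), key=lambda t: (depth(t), -PROB[t]))


print(mica("c", "d"), lca("c", "d"))
```

Here `a` and `b` sit at the same level, so the tie-break rule from the slide selects `b`, the less probable of the two.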

33 Resnik measure
Resnik introduced a semantic similarity for "is_a" ontologies based on the highest IC value among all common ancestors of two terms:
$$sim_{Resnik}(t_1, t_2) = \max_{t \in S(t_1, t_2)} IC(t)$$
–where S(t_1, t_2) is the set of common ancestors of the two terms t_1 and t_2.
The Resnik similarity has a minimum of zero.

34 Lin similarity
Lin developed an information-theoretic similarity applicable to any domain that can be described by a probabilistic model. The Lin measure is based on the relative probability between two terms and their MICA. The Lin measure between two GO terms t_1 and t_2 is defined as:
$$sim_{Lin}(t_1, t_2) = \frac{2 \, IC(MICA)}{IC(t_1) + IC(t_2)}$$
–The value of the Lin similarity ranges from zero to one.

35 Jiang similarity
Jiang and Conrath integrated the edge-based method with the node-based approach of the information content calculation to develop a new distance measure. For the simple case in which factors related to local density, node depth and link type are ignored, the Jiang measure between two GO terms t_1 and t_2 is defined as:
$$dist_{Jiang}(t_1, t_2) = IC(t_1) + IC(t_2) - 2 \, IC(MICA)$$
The Jiang distance measure can easily be transformed into a similarity measure by adding one and inverting it [10]:
$$sim_{Jiang}(t_1, t_2) = \frac{1}{1 + dist_{Jiang}(t_1, t_2)}$$
If terms t_1 and t_2 are the same, dist_{Jiang}(t_1, t_2) is 0. Adding one avoids division by zero. The value of the Jiang similarity ranges from zero to one.
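The Resnik, Lin and Jiang measures all build on the IC of the MICA, so they can be sketched together. The following is a toy illustration, not a reference implementation: the DAG and probabilities are invented, base-2 logarithms are used to keep the numbers round (the slides do not fix the base), and the value returned when both terms have zero IC is an assumed convention.

```python
import math

# Hypothetical toy DAG (child -> parents) and term probabilities.
PARENTS = {"root": [], "a": ["root"], "b": ["a"], "c": ["a"]}
PROB = {"root": 1.0, "a": 0.5, "b": 0.25, "c": 0.125}


def ic(term):
    """Information content, here with a base-2 logarithm."""
    return math.log2(1.0 / PROB[term])


def ancestors_inclusive(term):
    result = {term}
    for parent in PARENTS[term]:
        result |= ancestors_inclusive(parent)
    return result


def mica_ic(t1, t2):
    """IC of the most informative common ancestor."""
    common = ancestors_inclusive(t1) & ancestors_inclusive(t2)
    return max(ic(t) for t in common)


def resnik(t1, t2):
    return mica_ic(t1, t2)


def lin(t1, t2):
    denominator = ic(t1) + ic(t2)
    if denominator == 0:
        return 0.0  # assumed convention when both terms carry no information
    return 2 * mica_ic(t1, t2) / denominator


def jiang_distance(t1, t2):
    return ic(t1) + ic(t2) - 2 * mica_ic(t1, t2)


def jiang_similarity(t1, t2):
    return 1.0 / (1.0 + jiang_distance(t1, t2))


print(resnik("b", "c"), lin("b", "c"), jiang_similarity("b", "c"))
```

For the siblings `b` and `c` the MICA is `a` with IC 1, which gives a Resnik score of 1, a Lin score of 2/5 and a Jiang similarity of 1/4. Only Lin and Jiang are normalized to the range zero to one.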

36 Wu & Palmer
Wu & Palmer, 1994
–Based on the depths of the two concepts in the taxonomy and the depth of their LCS (least common subsumer, that is, their deepest common ancestor)
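The slide gives no formula, so the sketch below uses the commonly cited form of the Wu & Palmer measure, 2 · depth(LCS) / (depth(t1) + depth(t2)). It assumes a simple tree rather than a DAG and counts the root as depth 1. Both choices are assumptions made for this illustration, and the taxonomy itself is invented.

```python
# Hypothetical toy taxonomy as a tree: child -> single parent.
PARENT = {"root": None, "a": "root", "b": "a", "c": "a", "d": "b"}


def path_to_root(term):
    """The term followed by its ancestors, ending at the root."""
    path = [term]
    while PARENT[path[-1]] is not None:
        path.append(PARENT[path[-1]])
    return path


def depth(term):
    """Number of nodes on the path to the root (root has depth 1)."""
    return len(path_to_root(term))


def lcs(t1, t2):
    """Least common subsumer: the deepest shared ancestor."""
    ancestors_of_t2 = set(path_to_root(t2))
    for term in path_to_root(t1):  # walks upward, so the first hit is deepest
        if term in ancestors_of_t2:
            return term
    return None


def wu_palmer(t1, t2):
    return 2 * depth(lcs(t1, t2)) / (depth(t1) + depth(t2))


print(lcs("d", "c"), wu_palmer("d", "c"))
```

Unlike the IC-based measures, this score depends only on the shape of the taxonomy and needs no annotation counts.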

37 Graph information content (GIC) similarity
Let DAG_{t_1} and DAG_{t_2} be the two ancestor DAGs induced by two GO terms t_1 and t_2. Then, the graph information content (GIC) measure is defined as the ratio between the sum of the information content of the GO terms in the intersection of DAG_{t_1} and DAG_{t_2} and the sum of the information content of the GO terms in their union:
$$sim_{GIC}(t_1, t_2) = \frac{\sum_{t \in DAG_{t_1} \cap DAG_{t_2}} IC(t)}{\sum_{t \in DAG_{t_1} \cup DAG_{t_2}} IC(t)}$$
The values of the GIC similarity range from zero to one.
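GIC is a Jaccard-style ratio weighted by IC, which a short sketch makes visible. The DAG and IC values below are invented for illustration, each term is included in its own ancestor DAG (an assumption), and returning 0 for an all-zero union is an assumed convention.

```python
# Hypothetical toy DAG (child -> parents) and precomputed IC values.
PARENTS = {"root": [], "a": ["root"], "b": ["root"], "c": ["a"], "d": ["a", "b"]}
IC = {"root": 0.0, "a": 1.0, "b": 2.0, "c": 3.0, "d": 4.0}


def ancestors_inclusive(term):
    """Term set of the ancestor DAG induced by `term`."""
    result = {term}
    for parent in PARENTS[term]:
        result |= ancestors_inclusive(parent)
    return result


def gic(t1, t2):
    dag1 = ancestors_inclusive(t1)
    dag2 = ancestors_inclusive(t2)
    union_ic = sum(IC[t] for t in dag1 | dag2)
    if union_ic == 0:
        return 0.0  # assumed convention when no term carries information
    return sum(IC[t] for t in dag1 & dag2) / union_ic


print(gic("c", "d"))
```

The terms `c` and `d` share only `a` and the root (total IC 1) out of a union with total IC 10, so the score is low even though they have a common parent.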

38 Schlicker's relevance similarity
Schlicker et al. pointed out that both the specificity of the MICA and the relation between the two GO terms and their MICA need to be considered in a similarity measure. They therefore developed a new relevance similarity measure by combining the Lin similarity with the probability of the MICA:
$$sim_{Rel}(t_1, t_2) = sim_{Lin}(t_1, t_2) \times (1 - p(MICA))$$
The values of the relevance similarity are also between zero and one. We also evaluate the
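A small numeric sketch shows how the relevance measure discounts the Lin score when the MICA is a general term. The function name `relevance_similarity` and the probabilities are hypothetical, and the formula used is the common simplified form, Lin similarity multiplied by (1 − p(MICA)).

```python
import math


def relevance_similarity(p_t1, p_t2, p_mica):
    """Lin similarity scaled by (1 - p(MICA)); inputs are term probabilities."""
    denominator = math.log(p_t1) + math.log(p_t2)
    if denominator == 0:
        return 0.0  # assumed convention: both terms have probability 1
    lin = 2 * math.log(p_mica) / denominator
    return lin * (1 - p_mica)


# Two terms whose MICA covers half of all annotations.
print(relevance_similarity(0.25, 0.125, 0.5))
# A term compared with itself: Lin gives 1, relevance gives 1 - p(t).
print(relevance_similarity(0.25, 0.25, 0.25))
```

The second call illustrates the design choice: even identical terms score below one unless the term is specific, that is, has a low probability.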

39 Wang similarity
For any term t in DAG_A = (A, T_A, E_A), its S-value related to term A, S_A(t), is defined as:
$$S_A(A) = 1, \qquad S_A(t) = \max \{\, w_e \cdot S_A(t') \mid t' \in \text{children of } t \,\} \quad \text{if } t \neq A$$
–where w_e is the semantic contribution factor for the edge e ∈ E_A linking term t with its child term t'.
In DAG_A, GO term A is the most specific term and we define its contribution to its own semantics as 1. Other terms in DAG_A are more general and, hence, contribute less to the semantics of GO term A. Therefore, we have 0 < w_e < 1.
The semantic value of GO term A, SV(A), is defined as:
$$SV(A) = \sum_{t \in T_A} S_A(t)$$
–The semantic contribution factors for "is-a" and "part-of" relations can be set to different values, for example, 0.8 and 0.6 respectively.
Given DAG_A = (A, T_A, E_A) and DAG_B = (B, T_B, E_B) for GO terms A and B respectively, the semantic similarity between these two terms, S_GO(A, B), is defined as:
$$S_{GO}(A, B) = \frac{\sum_{t \in T_A \cap T_B} \left( S_A(t) + S_B(t) \right)}{SV(A) + SV(B)}$$
–where S_A(t) is the S-value of GO term t related to term A and S_B(t) is the S-value of GO term t related to term B.
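The S-value recursion is easiest to follow as an upward propagation from the term itself. The sketch below is an illustration under stated assumptions: the small DAG is invented, the edge weights 0.8 and 0.6 are the example values from the slide, and `s_values` and `wang` are hypothetical helper names.

```python
# Hypothetical toy DAG: child -> list of (parent, relationship).
PARENTS = {
    "root": [],
    "x": [("root", "is_a")],
    "y": [("root", "is_a")],
    "A": [("x", "is_a"), ("y", "part_of")],
    "B": [("x", "is_a")],
}
WEIGHT = {"is_a": 0.8, "part_of": 0.6}  # example factors from the slide


def s_values(term):
    """S-value of every term in the ancestor DAG of `term`."""
    s = {term: 1.0}  # a term contributes 1 to its own semantics
    frontier = [term]
    while frontier:
        child = frontier.pop()
        for parent, relation in PARENTS[child]:
            contribution = WEIGHT[relation] * s[child]
            # Keep the maximum over all children; revisit on improvement.
            if contribution > s.get(parent, 0.0):
                s[parent] = contribution
                frontier.append(parent)
    return s


def wang(a, b):
    s_a, s_b = s_values(a), s_values(b)
    shared = set(s_a) & set(s_b)
    numerator = sum(s_a[t] + s_b[t] for t in shared)
    return numerator / (sum(s_a.values()) + sum(s_b.values()))


print(s_values("A"))
print(round(wang("A", "B"), 4))
```

The root is reached from `A` both through `x` (0.8 × 0.8) and through `y` (0.6 × 0.8), and the maximum rule keeps the larger contribution, 0.64.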

40 Semantic Similarity between Proteins
Best match average similarity. The similarity between two proteins is based on the average of the best-matching GO term pairs. Given two proteins p_1 and p_2, let go1 and go2 represent their corresponding sets of GO terms. First, the similarity between one GO term t and a set of GO terms, go = {t_1, t_2, …, t_k}, is defined as the maximum similarity between t and any member of the set go:
$$sim(t, go) = \max_{1 \le i \le k} sim(t, t_i)$$
Therefore, the similarity between two proteins can be defined as the weighted average of the term similarity scores:
$$sim_{BMA}(p_1, p_2) = \frac{\sum_{i=1}^{m} sim(t_{1i}, go2) + \sum_{j=1}^{n} sim(t_{2j}, go1)}{m + n}$$
–where m is the number of terms in go1 and n is the number of terms in go2.
In the best match average protein similarity measure, proteins with more GO terms annotated have more influence on the overall similarity score.
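The best match average can be sketched independently of the term-level measure by passing that measure in as a function. In this illustration `toy_sim` is a hypothetical lookup table standing in for any of the term similarities, and the term names and scores are invented.

```python
# Hypothetical term-to-term similarities for unordered pairs.
TOY_SIM = {
    frozenset(["t1", "t3"]): 0.5,
    frozenset(["t2", "t3"]): 0.25,
}


def toy_sim(a, b):
    """Stand-in term similarity: 1 for identical terms, table otherwise."""
    if a == b:
        return 1.0
    return TOY_SIM.get(frozenset([a, b]), 0.0)


def best_match(term, terms, sim):
    """Similarity between one term and a set: the best-scoring partner."""
    return max(sim(term, other) for other in terms)


def best_match_average(go1, go2, sim):
    """Sum of best matches in both directions, divided by m + n."""
    total = sum(best_match(t, go2, sim) for t in go1)
    total += sum(best_match(t, go1, sim) for t in go2)
    return total / (len(go1) + len(go2))


go1 = ["t1", "t2"]  # m = 2 terms for protein p1
go2 = ["t3"]        # n = 1 term for protein p2
print(best_match_average(go1, go2, toy_sim))
```

With these values the best matches are 0.5 and 0.25 for the terms of p1 and 0.5 for the single term of p2, so the protein with two terms contributes two of the three summands.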

41 Semantic Similarity between Proteins
Average similarity. The similarity between two proteins, p_1 and p_2, is the average of the similarities among all pairs of terms from the two GO term sets, go1 and go2:
$$sim_{Avg}(p_1, p_2) = \frac{1}{m \cdot n} \sum_{i=1}^{m} \sum_{j=1}^{n} sim(t_{1i}, t_{2j})$$
–where m is the number of terms in go1 and n is the number of terms in go2.
Maximum similarity. The similarity between two proteins, p_1 and p_2, is the maximum similarity among all pairs of terms from the two GO term sets, go1 and go2:
$$sim_{Max}(p_1, p_2) = \max_{1 \le i \le m, \; 1 \le j \le n} sim(t_{1i}, t_{2j})$$
–where m is the number of terms in go1 and n is the number of terms in go2.
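The average and maximum measures differ only in how they aggregate the same table of pairwise scores, as the sketch below shows. The term names, the scores and the `toy_sim` lookup are hypothetical stand-ins for a real term similarity.

```python
from itertools import product

# Hypothetical term-to-term similarities for unordered pairs.
TOY_SIM = {
    frozenset(["t1", "t3"]): 0.5,
    frozenset(["t2", "t3"]): 0.25,
}


def toy_sim(a, b):
    """Stand-in term similarity: 1 for identical terms, table otherwise."""
    if a == b:
        return 1.0
    return TOY_SIM.get(frozenset([a, b]), 0.0)


def pairwise_scores(go1, go2, sim):
    """Similarity of every (term of p1, term of p2) pair: m * n scores."""
    return [sim(a, b) for a, b in product(go1, go2)]


def average_similarity(go1, go2, sim):
    scores = pairwise_scores(go1, go2, sim)
    return sum(scores) / len(scores)


def maximum_similarity(go1, go2, sim):
    return max(pairwise_scores(go1, go2, sim))


go1 = ["t1", "t2"]
go2 = ["t3", "t1"]
print(average_similarity(go1, go2, toy_sim))
print(maximum_similarity(go1, go2, toy_sim))
```

Because the two proteins share the term `t1`, the maximum measure reaches 1 regardless of the other terms, while the average is pulled down by the unrelated pairs.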

