Published by Νῶε Ρόκας. Modified over 6 years ago.
1
GENE ANNOTATION AND NETWORK INFERENCE BY PHYLOGENETIC PROFILING
CISC BIOINFORMATICS
Authors: Jie Wu, Zhenjun Hu and Charles DeLisi, Boston University
Presented by Rajesh Ponnurangam
2
MOTIVATION
The need for an effective decision rule for using profile correlations
Inefficiencies of current methods
Effectiveness of phylogenetic analysis
Need for improved performance at various levels of resolution
New decision rule – Correlation Enrichment
3
OVERVIEW
Introduction
Concepts
What's wrong with existing decision-making methods?
Comparison of decision rules
Comparison with other published methods
Identifying functional and evolutionary modules
Standard Guilt by Association (SGA)
Correlation Enrichment (CE)
How does correlation enrichment (CE) prove to be more effective?
4
INTRODUCTION CONCEPTS
Gene annotation
Network inference
Phylogenetic profiling
Correlation Enrichment (CE)
Standard Guilt by Association (SGA)
KEGG pathways
COG ontology
5
INTRODUCTION CONCEPTS
Gene Annotation – The process of attaching biological information to sequences: identifying elements on the genome (gene finding) and attaching biological information to these elements
Network Inference – Inferring the topology of a biological network, such as a transcriptional regulatory network or a metabolite network
Phylogenetic Profiling – Used to infer the function of a gene by finding another gene of known function with an identical pattern of presence and absence across a set of phylogenetically distributed genomes
6
INTRODUCTION CONCEPTS
Correlation Enrichment (CE) – A new decision rule for assigning genes to functional categories at various levels of resolution
Standard Guilt by Association (SGA) – A simple decision rule that assigns an unannotated gene to all known categories of an annotated gene if the correlation between their phylogenetic profiles exceeds a specified threshold
KEGG – Kyoto Encyclopedia of Genes and Genomes; connects known information on molecular interaction networks
COG Ontology – Clusters of Orthologous Groups; a source of conserved-domain data
7
EXISTING METHOD AND PHYLOGENETIC PROFILING
Current methods such as SGA perform at a level well below what is possible, largely because the performance of the decision rule deteriorates rapidly as coverage increases
Restricted phylogenetic profiling, which requires full profile identity, is accurate but has low coverage
The phylogenetic profile of a gene is a binary string: presence – 1, absence – 0
8
PHYLOGENETIC PROFILING
N – number of genomes over which profiles are defined
With gene X occurring in x genomes, gene Y occurring in y genomes, and both occurring in z genomes, the probability of observing z co-occurrences purely by chance, given N, x and y, is given by eq. (1)
The mutual information MI(X,Y) is defined in terms of p(i,j), (i = 0,1; j = 0,1), the fraction of genomes in which gene X is in state i and gene Y is in state j
p(1,1) – fraction of genomes in which both are present
p(1,0) – fraction of genomes in which X is present and Y is absent
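The two quantities on this slide can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' code: it assumes eq. (1) has the standard hypergeometric form, P(z | N, x, y) = C(x, z) C(N − x, y − z) / C(N, y), and that MI is the usual mutual information computed with natural logarithms. Neither formula survives on the slide, and the function names and example profiles are hypothetical.

```python
from math import comb, log

def chance_cooccurrence(N, x, y, z):
    """Probability of exactly z co-occurrences by chance.
    ASSUMPTION: standard hypergeometric form for eq. (1)."""
    return comb(x, z) * comb(N - x, y - z) / comb(N, y)

def mutual_information(X, Y):
    """Mutual information (in nats) of two equal-length 0/1 profiles.
    ASSUMPTION: standard definition, sum of p(i,j) * log(p(i,j) / (p(i) p(j)))."""
    N = len(X)
    mi = 0.0
    for i in (0, 1):
        for j in (0, 1):
            p_ij = sum(1 for a, b in zip(X, Y) if a == i and b == j) / N
            p_i = sum(1 for a in X if a == i) / N
            p_j = sum(1 for b in Y if b == j) / N
            if p_ij > 0:  # terms with p(i,j) = 0 contribute nothing
                mi += p_ij * log(p_ij / (p_i * p_j))
    return mi

# Hypothetical profiles over N = 6 genomes (1 = present, 0 = absent).
X = [1, 1, 0, 0, 1, 0]
Y = [1, 1, 0, 0, 0, 0]
x, y = sum(X), sum(Y)
z = sum(a & b for a, b in zip(X, Y))
print(chance_cooccurrence(len(X), x, y, z), mutual_information(X, Y))
```

Identical profiles give the largest MI for a given presence fraction, and statistically independent profiles give MI = 0.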
9
PHYLOGENETIC PROFILING
A relation between MI and eq. (1) then follows
The paper defines a new measure of correlation between two binary strings, C, with 0 ≤ C ≤ 1 (3b)
10
COMPARISON OF SGA & CE
SGA assigns an unannotated gene to all known categories of an annotated gene if the correlation between their profiles exceeds some threshold
CE assigns an unannotated gene by ranking each category (pathway) with a score reflecting:
the number of annotated genes within the category whose profile correlation with that of the unannotated gene exceeds a pre-specified threshold
the magnitude of these correlations
CE substantially outperforms SGA in allocating genes to functional categories
SGA, for C*=0.35, links 1025 of 2918 unannotated orthologs to one pathway
CE was able to assign all 2918 KEGG unannotated orthologs to pathways and all COG unannotated orthologs to COG categories
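The SGA rule described on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the gene and pathway names, and the correlation values are hypothetical, and the comparison uses ">=" (a correlation that "meets" the threshold counts as a link), which is an assumption.

```python
def sga_assign(correlations, annotations, threshold):
    """Standard guilt by association: return the union of categories of all
    annotated genes whose correlation with the query gene is >= threshold.
    ASSUMPTION: '>=' is used; the slide says the threshold is 'exceeded'."""
    assigned = set()
    for gene, c in correlations.items():
        if c >= threshold and gene in annotations:
            assigned |= annotations[gene]
    return assigned

# Hypothetical data: correlations of one unannotated gene with three annotated genes.
correlations = {"geneA": 0.80, "geneB": 0.40, "geneC": 0.10}
annotations = {
    "geneA": {"glycolysis"},
    "geneB": {"glycolysis", "TCA cycle"},
    "geneC": {"ribosome"},
}
print(sga_assign(correlations, annotations, threshold=0.35))
```

Because every category of every linked gene is accepted, lowering the threshold raises coverage but also adds categories indiscriminately, which is the weakness CE is meant to address.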
11
PATHWAY ALLOCATION PERFORMANCE
12
COMPARISON OF DECISION RULES
13
COMPARISON OF DECISION RULES
SGA – assignment based on profile identity
For inferences based on identity, only 5.4% of unannotated orthologs are assignable to KEGG pathways
At C*=0.2, achieving a coverage of 90% requires accepting a PPV of 6%
For inferences based on CE, PPV is markedly increased at high coverage, exceeding its SGA value approximately 6-fold
The two decision rules perform similarly at coverages below 20%
PPV estimates are conservative
CE outperforms SGA
14
COMPARISON OF DECISION RULES
At C*=0.4, where the SGA and CE curves for PPV have reached about half their maximum divergence, CE performs substantially better than SGA across GO specificity levels.
15
COMPARISON WITH OTHER PUBLISHED MODELS
Different methods for drawing functional inferences, such as "majority vote" and Markov random fields, can assign function based on the network context of unannotated genes
Predictive reliability can be increased by combining them within a statistical framework such as support vector machines, Bayesian inference or Markov random fields
Using SGA to assign genes to GO categories, the fraction of genes assigned to at least one category decreases from 0.98 to ~0.10 as functional specificity increases, with coverage fixed at 40%
Using CE, the fraction correctly assigned to at least one category is 0.95 at the lowest specificity level and remains 0.78 at all specificity levels
16
INFERENCES BASED ON COG ONTOLOGY
COG functional categories provide a low-resolution but fully resolved annotation: a one-gene-to-one-functional-category mapping
Profiling by CE of the full set of 4826 genes at C*=0.55 returns 926 genes linked to at least one annotated gene
Each of the 926 genes, including 249 unannotated ones, is assignable to a COG category
Performance estimation – 68% (463/677, where 677 = 926 − 249 is the number of annotated genes)
17
INFERENCES BASED ON COG ONTOLOGY
18
INFERENCES BASED ON COG ONTOLOGY
A more detailed view of the category H true-positive (TP) set reveals two strikingly dense clusters – one with 7 orthologs, the other with 11
19
PHYLOGENETIC PROFILES OF THE 11-MEMBER CLUSTER
Phylogenetic profiles of the 11-member cluster of orthologs across 66 genomes uncovered by CE. Green represents absence and red presence of an ortholog
20
CLIQUES, CLUSTERS & INFERENCE QUALITY
As the threshold decreases from its most stringent value (C*=0.91), the number of clusters containing more than 3 nodes increases, peaks at C*=0.66, and then declines as the nodes coalesce into increasingly larger clusters
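The rise and fall in cluster count can be reproduced on a toy network. The sketch below assumes that a "cluster" is a connected component of the graph whose edges are gene pairs with correlation at or above C*; the slide does not say how clusters are defined, so this is an assumption, and the function name and the eight-gene example are hypothetical.

```python
def count_large_clusters(n_genes, edges, threshold, min_size=4):
    """Count connected components with at least min_size nodes
    (min_size=4 means 'more than 3 nodes'), keeping only edges whose
    correlation is >= threshold. edges is a list of (u, v, corr) tuples.
    ASSUMPTION: clusters are taken to be connected components."""
    parent = list(range(n_genes))

    def find(u):
        # Union-find lookup with path halving.
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    for u, v, corr in edges:
        if corr >= threshold:
            parent[find(u)] = find(v)

    sizes = {}
    for u in range(n_genes):
        root = find(u)
        sizes[root] = sizes.get(root, 0) + 1
    return sum(1 for s in sizes.values() if s >= min_size)

# Hypothetical network: two tight groups joined by one weak link.
edges = [(0, 1, 0.95), (1, 2, 0.95), (2, 3, 0.95),
         (4, 5, 0.80), (5, 6, 0.80), (6, 7, 0.80),
         (3, 4, 0.50)]
for c_star in (0.9, 0.7, 0.4):
    print(c_star, count_large_clusters(8, edges, c_star))
```

On this toy input the count goes from one cluster at the strictest threshold, to two at an intermediate one, and back to one once the weak link merges everything.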
21
CLIQUES, CLUSTERS & INFERENCE QUALITY
22
METHODS
Dataset – COG database. Accuracy is evaluated against KEGG
Assessment – positive predictive value (PPV), defined as a population average
23
METHODS
PPV as a product of two factors
Related metrics
SPE-ACC SEN-A0
24
STANDARD GUILT BY ASSOCIATION
Let i be the number of categories that contain gene I
Let J(I, J) be the set of categories that contain a gene J whose profile correlation with I meets the threshold C*, and let j(I, J) be its size
Let K(I, J) denote the set of categories common to I and J, and k(I, J) its size, where 0 ≤ k(I, J) ≤ min(i, j)
The unannotated gene is therefore correctly assigned to TP = k categories and incorrectly assigned to the remaining FP = j - k categories. Also, TN = T - i - j + k and FN = i - k, where T = 133 is the total number of pathways
Consequently, the PPV_I(J) with which gene I is assigned using linked gene J is PPV_I(J) = TP / (TP + FP) = k(I, J) / j(I, J)
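The counts defined on this slide follow directly from two category sets. The sketch below is only an illustration of that bookkeeping: the function name and the example category labels are hypothetical, while T = 133 and the formulas for TP, FP, FN and TN are taken from the slide.

```python
def sga_counts(cats_I, cats_J, T=133):
    """Confusion counts when gene I inherits every category of linked gene J.
    cats_I: categories gene I truly belongs to; cats_J: categories of gene J."""
    i, j = len(cats_I), len(cats_J)
    k = len(cats_I & cats_J)          # categories shared by I and J
    TP, FP, FN = k, j - k, i - k
    TN = T - i - j + k
    ppv = TP / (TP + FP) if j > 0 else None  # equals k / j
    return {"TP": TP, "FP": FP, "FN": FN, "TN": TN, "PPV": ppv}

# Hypothetical example: I is in 3 pathways, J in 4, and they share 2.
result = sga_counts({"a", "b", "c"}, {"b", "c", "d", "e"})
print(result)
```

The four counts always sum to T, which is a quick consistency check on the TN formula.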
25
STANDARD GUILT BY ASSOCIATION
The maximum of PPV_I(J) is not necessarily 1, but min(i, j)/j
For i < j, PPV_I(J) < 1, whereas when i ≥ j, PPV_I(J) can become 1 when the pathways of J are a subset of those of I
The positive predictive value for gene I is obtained by taking sums over all genes with which it is correlated, where G(I) is the number of genes correlated with gene I and Nc(I) is the subset of genes in G(I) that share at least one category with gene I
26
CORRELATION ENRICHMENT
Suppose an unannotated gene I is correlated (C > C*) with a total of g other genes from r categories, and let m1, m2, ..., mr be the number of correlated genes in categories k1, k2, ..., kr, where r ≤ g, the equality holding only when each gene is in one category
Further, the categories the gene is actually in are taken as known
For each of the r categories that have 1 or more genes meeting the correlation threshold with I, define a weighted sum score, S_v
α is a positive adjustable integer that gives disproportionately high weights to strong correlations
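The slide does not reproduce the formula for S_v. As a rough sketch only, the code below assumes the score of a category is the sum of the correlations of its linked genes, each raised to the power α; that functional form is an assumption chosen because it matches the description of α, not a formula confirmed by the slide. The function name and all data are hypothetical.

```python
def ce_scores(correlations, annotations, threshold, alpha=2):
    """Rank categories for one query gene.
    ASSUMED score: S_v = sum of C**alpha over genes in category v with C > threshold.
    Returns a list of (category, score) pairs, highest score first."""
    scores = {}
    for gene, c in correlations.items():
        if c > threshold:
            for cat in annotations.get(gene, ()):
                scores[cat] = scores.get(cat, 0.0) + c ** alpha
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical data: one strong link to P1, two moderate links to P2.
correlations = {"g1": 0.9, "g2": 0.5, "g3": 0.5, "g4": 0.2}
annotations = {"g1": {"P1"}, "g2": {"P2"}, "g3": {"P2"}, "g4": {"P3"}}
print(ce_scores(correlations, annotations, threshold=0.35, alpha=1))
print(ce_scores(correlations, annotations, threshold=0.35, alpha=3))
```

Under this assumed form, α = 1 favours the category with more linked genes (P2), while a larger α lets a single strong correlation dominate (P1), which is the trade-off the slide attributes to α.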
27
CORRELATION ENRICHMENT
FP = r0 – TP
FN = T1 – TP
TN = T – r0 – T1 + TP
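These identities can be turned into the usual summary metrics. In the sketch below, r0 is read as the number of categories assigned to the gene and T1 as the number of categories it truly belongs to; the slide does not define either symbol, so both readings are assumptions inferred from the identities themselves. The function name and example numbers are hypothetical, and T = 133 is the pathway total quoted earlier in the deck.

```python
def ce_confusion(TP, r0, T1, T=133):
    """Confusion counts and derived metrics from the slide's identities.
    ASSUMPTION: r0 = categories assigned, T1 = categories the gene is truly in."""
    FP = r0 - TP
    FN = T1 - TP
    TN = T - r0 - T1 + TP
    return {
        "TP": TP, "FP": FP, "FN": FN, "TN": TN,
        "PPV": TP / r0 if r0 else None,                # TP / (TP + FP)
        "SEN": TP / T1 if T1 else None,                # TP / (TP + FN)
        "SPE": TN / (T - T1) if T - T1 else None,      # TN / (TN + FP)
    }

# Hypothetical example: 3 categories assigned, 4 true categories, 2 correct.
m = ce_confusion(TP=2, r0=3, T1=4)
print(m)
```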
29
Thank you