GENE ANNOTATION AND NETWORK INFERENCE BY PHYLOGENETIC PROFILING

Slides:



Advertisements
Similar presentations
Original Figures for "Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring"
Advertisements

Using a Mixture of Probabilistic Decision Trees for Direct Prediction of Protein Functions Paper by Umar Syed and Golan Yona department of CS, Cornell.
Using phylogenetic profiles to predict protein function and localization As discussed by Catherine Grasso.
EE462 MLCV Lecture Introduction of Graphical Models Markov Random Fields Segmentation Tae-Kyun Kim 1.
The STRING database Michael Kuhn EMBL Heidelberg.
Profiles for Sequences
Seeing the forest for the trees : using the Gene Ontology to restructure hierarchical clustering Dikla Dotan-Cohen, Simon Kasif and Avraham A. Melkman.
Bi-correlation clustering algorithm for determining a set of co- regulated genes BIOINFORMATICS vol. 25 no Anindya Bhattacharya and Rajat K. De.
Clustering short time series gene expression data Jason Ernst, Gerard J. Nau and Ziv Bar-Joseph BIOINFORMATICS, vol
Biological Gene and Protein Networks
Microarrays and Cancer Segal et al. CS 466 Saurabh Sinha.
Gene Set Analysis 09/24/07. From individual gene to gene sets Finding a list of differentially expressed genes is only the starting point. Suppose we.
Evaluating Hypotheses
Predicting protein functions from redundancies in large-scale protein interaction networks Speaker: Chun-hui CAI
Comparative Expression Moran Yassour +=. Goal Build a multi-species gene-coexpression network Find functions of unknown genes Discover how the genes.
Detecting the Domain Structure of Proteins from Sequence Information Niranjan Nagarajan and Golan Yona Department of Computer Science Cornell University.
Protein Classification A comparison of function inference techniques.
1 Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data Presented by: Tun-Hsiang Yang.
Systematic Analysis of Interactome: A New Trend in Bioinformatics KOCSEA Technical Symposium 2010 Young-Rae Cho, Ph.D. Assistant Professor Department of.
TGCAAACTCAAACTCTTTTGTTGTTCTTACTGTATCATTGCCCAGAATAT TCTGCCTGTCTTTAGAGGCTAATACATTGATTAGTGAATTCCAATGGGCA GAATCGTGATGCATTAAAGAGATGCTAATATTTTCACTGCTCCTCAATTT.
Digital Camera and Computer Vision Laboratory Department of Computer Science and Information Engineering National Taiwan University, Taipei, Taiwan, R.O.C.
Functional Linkages between Proteins. Introduction Piles of Information Flakes of Knowledge AGCATCCGACTAGCATCAGCTAGCAGCAGA CTCACGATGTGACTGCATGCGTCATTATCTA.
Gene Set Enrichment Analysis (GSEA)
Digital Camera and Computer Vision Laboratory Department of Computer Science and Information Engineering National Taiwan University, Taipei, Taiwan, R.O.C.
Gene Regulatory Network Inference. Progress in Disease Treatment  Personalized medicine is becoming more prevalent for several kinds of cancer treatment.
Networks and Interactions Boo Virk v1.0.
Abstract Background: In this work, a candidate gene prioritization method is described, and based on protein-protein interaction network (PPIN) analysis.
Unraveling condition specific gene transcriptional regulatory networks in Saccharomyces cerevisiae Speaker: Chunhui Cai.
Bioinformatics Expression profiling and functional genomics Part II: Differential expression Ad 27/11/2006.
Inferring Functional Information from Domain co-evolution Yohan Kim, Mehmet Koyuturk, Umut Topkara, Ananth Grama and Shankar Subramaniam Gaurav Chadha.
CS5263 Bioinformatics Lecture 20 Practical issues in motif finding Final project.
Exploiting Context Analysis for Combining Multiple Entity Resolution Systems -Ramu Bandaru Zhaoqi Chen Dmitri V.kalashnikov Sharad Mehrotra.
Clustering Metabolic Networks Using Minimum Cut Trees Ryan Kellogg 1, Allison Heath 2, Lydia Kavraki 2,3 1 Carnegie Mellon University, Department of Electrical.
Module networks Sushmita Roy BMI/CS 576 Nov 18 th & 20th, 2014.
Anis Karimpour-Fard ‡, Ryan T. Gill †,
Evaluating Results of Learning Blaž Zupan
Computational Intelligence: Methods and Applications Lecture 16 Model evaluation and ROC Włodzisław Duch Dept. of Informatics, UMK Google: W Duch.
Metabolic Network Inference from Multiple Types of Genomic Data Yoshihiro Yamanishi Centre de Bio-informatique, Ecole des Mines de Paris.
Data Mining the Yeast Genome Expression and Sequence Data Alvis Brazma European Bioinformatics Institute.
PPI team Progress Report PPI team, IDB Lab. Sangwon Yoo, Hoyoung Jeong, Taewhi Lee Mar 2006.
I. Prolinks: a database of protein functional linkage derived from coevolution II. STRING: known and predicted protein-protein associations, integrated.
While gene expression data is widely available describing mRNA levels in different cancer cells lines, the molecular regulatory mechanisms responsible.
Molecular Classification of Cancer Class Discovery and Class Prediction by Gene Expression Monitoring.
Flat clustering approaches
. Finding Motifs in Promoter Regions Libi Hertzberg Or Zuk.
Discovering functional interaction patterns in Protein-Protein Interactions Networks   Authors: Mehmet E Turnalp Tolga Can Presented By: Sandeep Kumar.
HW7: Evolutionarily conserved segments ENCODE region 009 (beta-globin locus) Multiple alignment of human, dog, and mouse 2 states: neutral (fast-evolving),
Discovering functional linkages and uncharacterized cellular pathways using phylogenetic profile comparisons: a comprehensive assessment Raja Jothi, Teresa.
Using the Fisher kernel method to detect remote protein homologies Tommi Jaakkola, Mark Diekhams, David Haussler ISMB’ 99 Talk by O, Jangmin (2001/01/16)
Risk Identification and Evaluation Chapter 2
7. Performance Measurement
Networks and Interactions
Hyunghoon Cho, Bonnie Berger, Jian Peng  Cell Systems 
Evaluating Results of Learning
Predicting Active Site Residue Annotations in the Pfam Database
1 Department of Engineering, 2 Department of Mathematics,
Hidden Markov Models Part 2: Algorithms
1 Department of Engineering, 2 Department of Mathematics,
CISC 841 Bioinformatics (Spring 2006) Inference of Biological Networks
1 Department of Engineering, 2 Department of Mathematics,
Gene Family Ancestral State Phylogenetic Profiling
Recurrence-Associated Long Non-coding RNA Signature for Determining the Risk of Recurrence in Patients with Colon Cancer  Meng Zhou, Long Hu, Zicheng.
SEG5010 Presentation Zhou Lanjun.
Anastasia Baryshnikova  Cell Systems 
Bioinformatics, Vol.17 Suppl.1 (ISMB 2001) Weekly Lab. Seminar
Gautam Dey, Tobias Meyer  Cell Systems 
Predicting Gene Expression from Sequence
Network-Based Coverage of Mutational Profiles Reveals Cancer Genes
Hyunghoon Cho, Bonnie Berger, Jian Peng  Cell Systems 
Presentation transcript:

GENE ANNOTATION AND NETWORK INFERENCE BY PHYLOGENETIC PROFILING CISC 841 - BIOINFORMATICS GENE ANNOTATION AND NETWORK INFERENCE BY PHYLOGENETIC PROFILING Authors : Jie Wu, Zhenjun Hu and Charles DeLisi Boston University Presented by, Rajesh Ponnurangam

MOTIVATION The need for effective decision rule to use for correlation Inefficiencies of current methods Effectiveness of Phylogenetic analysis Need for improved performance at various levels of Resolution New Decision Rule – Correlation Enrichment

OVERVIEW Introduction Concepts What’s wrong with existing technologies of decision making? Comparison of Decision Rules Comparison with other published methods Identifying functional and evolutionary modules Standard Guilt by Association (SGA) Correlation Enrichment (CE) How correlation enrichment (CE) proves to be more effective?

INTRODUCTION CONCEPTS Gene Annotation Network inference Phylogenetic profiling Correlation Enrichment (CE) Standard guilt by association (SGA) KEGG Pathways COG Ontology

INTRODUCTION CONCEPTS Gene Annotation – The process of attaching biological information to sequences identifying elements on the genome (gene finding), and attaching biological information to these elements Network Inference – Knowing the topology of a biological network like transcriptional regulatory networks, metabolite networks etc. Phylogentic Profiling – Used to infer the function of a gene by finding another gene of known function with an identical pattern of presence and absence across a set of distributed genomes

INTRODUCTION CONCEPTS Correlation Enrichment – A new decision rule for assigning genes to functional categories at various levels of resolution Standard Guilt by Association (SGA) – Simple decision rule, which assigns an unannotated gene to all known categories of an annotated gene if the phylogenetic profiles exceed some specific correlation threshold KEGG – Kyoto Encyclopedia of Genes and Genomes, connects known information on molecular interaction networks COG Ontology – Cluster of Orthologous Genes, source of a conserved domain datasource

EXISTING METHOD AND PHYLOGENETIC PROFILING Current methods like SGA, perform at a level well below what is possible, largely because the performance of an effective decision rule to use the correlate deteriorates rapidly as coverage increases Phylogenetic profiling provides restricted profiling, requiring full profile identity, while accurate, has low coverage Phylogenetic profiling of a gene is a binary string Presence – 1 Absence – 0

PHYLOGENETIC PROFILING N -> Number of genomes over which profiles are defined with gene X occurring in x genomes and gene Y occurring in y genomes and both occurring in z genomes, the probability of observing z co-occurrences purely by chance, given N,x and y is, MI(X,Y) - p(i,j), (i=0,1; j=0,1), fraction of genomes in which gene X is in state i and gene Y is in state j p(1,1) – fraction of genomes in which both are present p(1,0) – fraction of genomes in which X is present and Y is absent

PHYLOGENETIC PROFILING Also Then the relation between MI and eq(1) is The paper defines a new measure of correlation between two binary strings                                                      0 ≤ C ≤ 1 (3b)

COMPARISON OF SGA & CE SGA assigns an unannotated gene to all known categories of an annotated gene if profiles exceed some correlation threshold. CE assigns an unannotated gene by ranking each category (pathway) with a score reflecting The number of genes (annotated) within a category, whose profile correlation with that of the unannotated gene exceeds a pre-specified threshold The magnitude of these correlations CE substantially outperforms SGA in allocating genes to functional categories SGA, for C*=0.35, links 1025 of 2918 unannotated orthologs to one pathway CE was able to assign all 2918 KEGG unannotated orthologs to pathways and all COG unannotated orthologs to COG categories

PATHWAY ALLOCATION PERFORMANCE

COMPARISON OF DECISION RULES

COMPARISON OF DECISION RULES SGA – assignment based on profile identity For inferences based on identity only 5.4% of unannotated orthologs are assignable to KEGG pathways When C*=0.2 to achieve a coverage of 90% requires accepting a PPV of 6% For inferences based on CE, PPV is markedly increased at high coverage, exceeding its SGA value approximately 6 fold The two decision rules perform similarly at coverages below 20% PPV estimates are conservative CE performs superior than SGA

COMPARISON OF DECISION RULES At C*=0.4 where SGA and CE curves for PPV have reached about half their maximum divergence, CE performs substantially better than SGA at GO specificity levels.

COMPARISON WITH OTHER PUBLISHED MODELS Different methods to draw functional inferences like “majority vote” and Markov Random Field can assign function based on the network context of unannotated genes Predictive reliability can be increased by combining them using one or another statistical framework such as support vector machines, Bayesian inferences and Markov Random field. Using SGA to assign genes to GO categories, fraction of genes assigned to at least one category decreases from 0.98 to ~0.10 as functional specificity increases with coverage fixed at 40% Using CE, the fraction correctly assigned to at least one category is 0.95 at the lowest specificity level and remains 0.78 at all specificity levels

INFERENCES BASED ON COG ONTOLOGY COG functional categories provide a low resolution, but fully resolved annotation 1 gene to 1 functional category mapping Profiling by CE of the full set of 4826 genes, at C*=0.55 returns a 926 genes linked to at least one annotated gene Each of the 926 genes, including 249 unannotated are assignable to COG category Performance estimation – 68% (463/677)

INFERENCES BASED ON COG ONTOLOGY

INFERENCES BASED ON COG ONTOLOGY A more detailed version of the category H TP set reveals two strikingly dense clusters – one with 7 orthologs, the other with 11

PHYLOGENETIC PROFILES OF THE 11-MEMBER CLUSTER Phlogenetic profiles of the 11-member cluster of orthologs across 66 genomes uncovered by CE. Green represents absence and red, presence of an ortholog

CLIQUES, CLUSTERS & INFERENCE QUALITY As the threshold decreases from its most stringent value (C*=0.91), the number of clusters containing more than 3 nodes increases, peaking at C*=0.66 and then declines as the nodes coalesce into increasingly larger clusters

CLIQUES, CLUSTERS & INFERENCE QUALITY

METHODS Dataset – COG database. Accuracy is evaluated against KEGG Assessment Positive Predictive Value – By definition, the population averaged positive predictive value is,

METHODS PPV as a product of two factors Related Metrics SPE-ACC SEN-A0

STANDARD GUILT BY ASSOCIATION let i be the number of the categories that contain the gene I let J(I, J) be the set of categories that contain a gene J whose profile correlation with I meets the threshold C*, j(I, J) is its size let K(I, J) denote the set of common categories and k(I, J) is its size; where 0 ≤ k(I, J) ≤ min(i, j). The unannotated gene is therefore correctly assigned to TP = k categories and incorrectly assigned to the remaining FP = j - k categories. Also TN = T - i - j + k and FN = i - k, where T = 133 is the total number of pathways. Consequently, the PPVI(J) with which gene I is assigned using linked gene J is

STANDARD GUILT BY ASSOCIATION Maximum PPVI(J) is not necessarily 1, but min(i,j)/j For j<I, PPVI(J)<1, whereas when i>j, PPVI(J) can become 1 when the pathways of J are a subset of I. The positive predictive value for gene I is obtained by taking sums over all genes to which its is correlated Where G(I) is the number of genes correlated with gene I and Nc(I) is the subset of genes in G(I) that share at least one category with gene I

CORRELATION ENRICHMENT Suppose an unannotated gene is correlated with in total g other genes (C > C*) from r categories, and let m1, m2, ..., mr be the number of correlated genes in categories k1, k2, ..., kr, where r ≤ g, the equality holding only when each gene is in one category. Further, let denote the categories the gene is in. For each of the r categories that have 1 or more genes meeting the correlation threshold with I, define a weighted sum score, Sv α is positive adjustable integer which gives disproportionately high weights to strong correlations

CORRELATION ENRICHMENT FP = r0 – TP FN = T1 – TP TN = T – r0 – T1 + TP

REFERENCES Wu J, Kasif S, DeLisi C: Identification of functional links between genes using phylogenetic profiles. Bioinformatics 2003, 19(12):1524-1530 Pellegrini M, Marcotte EM, Thompson MJ, Eisenberg D, Yeates TO: Assigning protein functions by comparative genome analysis: protein phylogenetic profiles.Proc Natl Acad Sci U S A 1999, 96(8):4285-4288 Aravind L: Guilt by association: contextual information in genome analysis.Genome Res 2000, 10(8):1074-1077 Nariai N, Tamada Y, Imoto S, Miyano S: Estimating gene regulatory networks and protein-protein interactions of Saccharomyces cerevisiae from multiple genome-wide data.Bioinformatics 2005, 21 Suppl 2:ii206-ii212.

Thank you