Discovering functional linkages and uncharacterized cellular pathways using phylogenetic profile comparisons: a comprehensive assessment Raja Jothi, Teresa.

Presentation transcript:

Discovering functional linkages and uncharacterized cellular pathways using phylogenetic profile comparisons: a comprehensive assessment
Raja Jothi, Teresa M. Przytycka, and L. Aravind
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
Presented by Jennifer Hiras, November 8, 2007, CISC-841
Jothi et al. BMC Bioinformatics 2007, 8:173

Outline
Background
 –Significance
 –Phylogenetic Profile Comparisons
Goals
Methods and Results
 –Data Set Construction
 –Phylogenetic Profile Construction
 –Assessing the Degree of Similarity Between Two Profiles
 –Pathway Similarity Calculation
 –Performance Measure Calculations
 –Statistical Significance Test
Conclusions

Background
Whole-genome sequencing has generated large amounts of genomic sequence data
Major Challenges
 –Discovering the function of proteins
 –Determining how proteins interact with each other in cellular pathways and networks
Phylogenetic Profile Comparisons (PPCs)
 –Widely used approach
 –Patterns of presence or absence of protein families across multiple genomes are used to infer functional linkages between proteins

What is a Phylogenetic Profile?
A phylogenetic profile is the description of the presence or absence of a given protein in a set of genomes
A profile is constructed using the BLAST similarity (E-value) score of the highest scoring sequence from a given genome
Profiles of proteins B and C are similar, whereas the profile of protein A is different
Genomes: Genome1, Genome2, Genome3, Genome4
Profile of protein A: Present, Absent
Profile of protein B: Present, Absent, Present
Profile of protein C: Present, Absent, Present

Phylogenetic Profile Comparisons
Proteins having matching or similar profiles are inferred to be functionally linked
Predict the function of uncharacterized proteins by relating profiles of proteins with known function to profiles of proteins with unknown function
Cluster protein profiles based on their similarity

Purpose of Investigation
Goal
 –Measure the accuracy and coverage of PPC
 –Identify its biases, strengths, and weaknesses
Studied E. coli and S. cerevisiae (yeast) proteins through comparative analysis of ~900,000 known proteins from 95 different organisms
Designed 16 reference sets of genomes (Table 1)
 –The choice of reference set of genomes affects the predictive power of the PPC approach

Functional Linkages
KEGG pathway maps used as the reference for functional linkages
 –to enable fair comparison of this work with previous studies that used KEGG for analysis
 –KEGG is one of the most widely used databases for verifying large-scale functional linkage predictions
Degree of similarity between two profiles measured by the mutual information score between the profiles
 –Two proteins are predicted to be functionally related if their mutual information score is above a certain threshold
Two proteins are considered to be truly related (linked) if they co-occur in at least one KEGG pathway

Methods and Results

Data Set Construction
Amino acid sequences of 894,522 proteins from 95 different organisms
 –41 Bacteria, 11 Archaea, and 43 Eukaryotes
 –Sources: NCBI, Ensembl, BROAD, WashU, PlasmoDB, and dictyBase
NCBI's Basic Local Alignment Search Tool (BLAST)
 –For comparison of protein sequences against each other

Phylogenetic Profile Construction
Every protein i is searched against the set of proteins from each organism j, and the presence/absence of the query protein's homolog in organism j is recorded in the form of the BLAST E-value Eij.
For each protein i, a phylogenetic vector/profile P is generated, with each entry Pij = -1/log Eij corresponding to the presence/absence information of i's homolog in organism j.
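The profile entry formula above can be illustrated with a minimal Python sketch. The slide does not give the logarithm base, how E-values of zero are treated, or whether entries are capped, so the base-10 logarithm, the `floor` parameter, and the cap at 1.0 are all assumptions made for illustration; the function names `profile_entry` and `build_profile` are hypothetical.

```python
import math

def profile_entry(e_value, floor=1e-300):
    """Map a best-hit BLAST E-value to a profile entry P_ij = -1 / log10(E_ij).

    Assumptions (not stated on the slide): base-10 logarithm, E-values of 0
    clamped to `floor`, and entries capped at 1.0 so that weak or missing
    hits all read as "absent".
    """
    if e_value >= 1.0:
        # log10(E) >= 0 here, so the formula is undefined or negative;
        # treat the homolog as absent.
        return 1.0
    e = max(e_value, floor)  # avoid log10(0)
    return min(1.0, -1.0 / math.log10(e))

def build_profile(best_e_values):
    """Build one protein's profile from its best E-value in each genome."""
    return [profile_entry(e) for e in best_e_values]

# Illustrative E-values for one query protein against four genomes.
profile = build_profile([1e-100, 1e-10, 0.5, 10.0])
print(profile)
```

Under these assumptions, entries close to 0 indicate a strong homolog in that genome and entries equal to 1.0 indicate absence.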

Table 1: Summary of genomes in reference sets

Figure 1 Reference sets of organisms. The relationships and divergence times (Mya) of 95 organisms (41 bacterial, 11 archaeal, and 43 eukaryotic genomes) are based on recent studies. The tree branch lengths are not proportional to time.

Assessing the Degree of Similarity Between Two Profiles
Used the mutual information (MI) measure
MI(A, B) = H(A) + H(B) - H(A, B)
where H(A) = -Σ_a p(a) log p(a) is the entropy of profile A, and H(A, B) = -Σ_a Σ_b p(a, b) log p(a, b) is the joint entropy of profiles A and B
p(a) is the frequency with which the value a is observed in profile vector A, and p(a, b) is the frequency with which the pair of values (a, b) is observed at the same position in A and B (with a and b appearing in A and B, respectively)
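As a rough illustration of the MI measure, the sketch below computes MI(A, B) = H(A) + H(B) - H(A, B) from observed value frequencies. Two details are assumptions rather than facts from the slides: the natural logarithm, and the `discretize` helper, which bins real-valued profile entries into fixed-width bins (0.1 here) so that frequencies can be counted. All function names are hypothetical.

```python
import math
from collections import Counter

def discretize(profile, bin_width=0.1):
    """Assumed preprocessing: map real-valued entries to integer bin indices."""
    # The small epsilon guards against float artifacts such as 0.3 / 0.1 = 2.999...
    return [math.floor(v / bin_width + 1e-9) for v in profile]

def entropy(values):
    """Shannon entropy (natural log) of the empirical distribution of values."""
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in Counter(values).values())

def mutual_information(a, b):
    """MI(A, B) = H(A) + H(B) - H(A, B) for two equal-length discrete profiles."""
    if len(a) != len(b):
        raise ValueError("profiles must cover the same genomes")
    joint = list(zip(a, b))  # pair up the values seen in the same genome
    return entropy(a) + entropy(b) - entropy(joint)

# Identical profiles share all their information; independent ones share none.
same = mutual_information([1, 0, 1, 0], [1, 0, 1, 0])
indep = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
print(same, indep)
```

A profile with the same value in every genome has zero entropy and therefore zero MI with any other profile, which is one reason the choice of reference genomes matters.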

Pathway Similarity Calculation
KEGG pathway database used to compile the pathway membership information for each protein
Pathway similarity between two proteins A and B is calculated by taking the Jaccard coefficient of their KEGG pathway annotations:
Pathway similarity (A, B) = 100 × (| KEGG A ⋂ KEGG B |)/(| KEGG A ⋃ KEGG B |)
where KEGGx is the set of specific pathways in which protein x is known to participate, and |KEGGx| is the number of unique pathways in the set.
A pathway similarity score of s between proteins A and B indicates that A is present in at least s% of the pathways that B is present in, and vice versa.
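The Jaccard-based pathway similarity can be sketched in a few lines of Python. The function name `pathway_similarity`, the pathway identifiers in the example, and the choice to return 0.0 when neither protein has any annotation are illustrative assumptions, not details from the slides.

```python
def pathway_similarity(kegg_a, kegg_b):
    """Return 100 * |A intersect B| / |A union B| for two sets of pathway IDs."""
    a, b = set(kegg_a), set(kegg_b)
    union = a | b
    if not union:
        # Assumption: two unannotated proteins get a similarity of 0.
        return 0.0
    return 100.0 * len(a & b) / len(union)

# Illustrative annotations: the proteins share one of three distinct pathways.
protein_a = {"eco00010", "eco00020"}
protein_b = {"eco00020", "eco00030"}
score = pathway_similarity(protein_a, protein_b)
print(round(score, 1))
```

A score above 0 means the two proteins co-occur in at least one pathway, which is the criterion the slides use for calling a pair functionally related.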

Performance Measure Calculations
Assess the performance of PPCs for prediction of functional linkages
Two proteins are considered to be functionally related (positive) if they co-occur in at least one KEGG pathway, and unrelated (negative) otherwise
Sensitivity = TP/(TP+FN)
Specificity = TN/(TN+FP)
Positive Predictive Value = TP/(TP+FP)
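The three measures can be computed from predicted and actual labels for each protein pair, as in the following sketch. Here a pair is "predicted" when its MI score exceeds the chosen threshold and "actual" when the two proteins share a KEGG pathway; the function names and the example labels are hypothetical, and returning NaN for an empty denominator is an assumption.

```python
def confusion_counts(predicted, actual):
    """Count TP, FP, TN, FN over parallel lists of booleans (one per protein pair)."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    tn = sum(not p and not a for p, a in zip(predicted, actual))
    fn = sum(not p and a for p, a in zip(predicted, actual))
    return tp, fp, tn, fn

def ratio(numerator, denominator):
    """Assumption: report NaN instead of failing when a class is empty."""
    return numerator / denominator if denominator else float("nan")

def performance(tp, fp, tn, fn):
    """Return sensitivity, specificity, and positive predictive value."""
    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
    }

# Five illustrative protein pairs.
predicted = [True, True, False, False, True]   # MI score above threshold?
actual = [True, False, False, True, True]      # share a KEGG pathway?
counts = confusion_counts(predicted, actual)
print(counts, performance(*counts))
```

Sweeping the MI threshold and recomputing these values gives the trade-off between coverage (sensitivity) and accuracy (PPV) for a given reference set.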

Figure 2: E. coli proteins
Figure 3: Yeast proteins
BAE3a

Structure and evolutionary history of individual pathways influence performance of PPCs
A KEGG pathway is an ensemble of many smaller pathways
Different pathways differ in terms of the conservation patterns of their components
Evaluated the role of diversity in conservation across different functional systems
Set of nine systems (second level of KEGG orthology)
 –each with 80 or more protein components
Repeated the same analysis for proteins in each of these KEGG pathways (using 7 of the 16 reference sets)
Analysis on individual pathways
 –A pair of proteins was positive if both co-occur in the pathway under consideration; otherwise, the pair was negative
 –Different from the overall analysis

Figure 5: E. coli
Translation apparatus is the most conserved pathway
 –Most non-redundant set (NR-8) performed best, not BAE3a
 –Phyletic patterns from diverse and non-redundant (NR) genomes capture all useful evolutionary information necessary for functional linkage
In E. coli, set B performed best for 5 of 8 KEGG pathways
 –Indicates that the nature of conservation has an effect on performance

Statistical Significance Test
t-test to determine whether the difference in mean values of two distributions is statistically significant
The higher the t-score, the higher the significance
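The slides do not say which variant of the t-test was used, so the sketch below assumes Welch's unequal-variance form purely for illustration. The function name `welch_t` and the sample values are hypothetical; they stand in for the per-reference-set score distributions being compared.

```python
import math
from statistics import mean, variance

def welch_t(x, y):
    """t-score for the difference in means of two samples (Welch's form, assumed)."""
    if len(x) < 2 or len(y) < 2:
        raise ValueError("each sample needs at least two values")
    # Standard error of the difference, using each sample's own variance.
    se = math.sqrt(variance(x) / len(x) + variance(y) / len(y))
    return (mean(x) - mean(y)) / se

# Illustrative scores obtained with two different reference sets.
set_one = [2.0, 4.0, 6.0]
set_two = [1.0, 2.0, 3.0]
t = welch_t(set_one, set_two)
print(round(t, 3))
```

A larger absolute t-score means the two means are further apart relative to the spread of the samples, which matches the reading that higher t-scores indicate higher significance.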

Figure 4: t-tests
Differences in mean values for all 16 reference sets of genomes are statistically significant
t-scores spanned a wide range
 –Suggests dramatic differences in the performance of individual reference sets

Conclusions

General Observations
Examined ~710,000 possible functional linkages in E. coli and ~640,000 in yeast
Overall performances using all 16 reference sets of genomes are depicted in Figures 2 and 3
Predictive power of the PPC approach was greatly influenced by the selection of the reference set of genomes

Selection of reference set genomes at the super-kingdom level is crucial
Reference sets composed of genomes from an individual super-kingdom (B or A)
 –Perform worse than the set combining only the bacterial and archaeal super-kingdoms (BA)
 –Suggests that phyletic patterns of proteins in bacterial and archaeal genomes alone provide sufficient information for reasonably accurate functional linkage predictions
Any further improvement to this performance requires the addition of a few eukaryotic genomes

Selection of reference set genomes at the super-kingdom level is crucial (cont'd)
Best performance: set BAE3a
 –contained all B and A genomes and a single representative from each of the major eukaryotic lineages
Further additions of eukaryotic genomes
 –had little or no effect on the performance in E. coli
 –reduced the performance in yeast

Genome choice within the super-kingdom is a notable determinant of PPC performance
Numerous parasitic or pathogenic representatives amongst currently available E and B genomes
 –These organisms lack many metabolic pathways
Sequencing efforts to date have been biased towards parasitic or pathogenic organisms
 –Resulted in their over-representation in the public databases
Evaluated the effects of their inclusion on the performance of the PPC approach
Inclusion or non-inclusion of these genomes in the reference set did not significantly alter the performance

Guidelines for selecting genomes for a reference set
Representatives from all three super-kingdoms of life are a must for highly accurate predictions
 –reasonably good accuracy can be achieved using prokaryotes alone
Specifically choose genomes representing the entire phylogenetic diversity of life
 –Fully sequenced large genomes of free-living organisms give the best information
Parasitic organisms, strains of the same species, and very closely related species are unlikely to provide new predictive information