Complementarity of network and sequence information in homologous proteins
Vesna Memišević 2, Tijana Milenković 2, and Nataša Pržulj 1
1 Department of Computing, Imperial College London, London, UK
2 Department of Computer Science, University of California, Irvine, USA
International Symposium on Integrative Bioinformatics, March
Motivation
Genetic sequences revolutionized our understanding of biology
Non-sequence-based data are also important, e.g.:
– Secondary and tertiary structure of RNA have the dominant role in RNA function (tRNA: Gautheret et al., Comput. Appl. Biosci., 1990; rRNA: Woese et al., Microbiological Reviews, 1983)
– A secondary-structure-based approach is more effective at finding new functional RNAs than sequence-based alignments (Webb et al., Science, 2009)
What about patterns of interconnections in protein-protein interaction (PPI) networks?
– Can they complement the knowledge learned from genomic sequence?
– Do wiring patterns of duplicated proteins in the PPI network give insights into evolutionary distance?
– Does the information about homologues captured by PPI network topology differ from that captured by their sequence?
Background
Homologs descend from a common ancestor:
1. Paralogs: in the same species, evolve through gene duplication events
2. Orthologs: in different species, evolve through speciation events
Background
Sequence-based homology data from:
1. Clusters of Orthologous Groups – COG [1]
2. KEGG Orthology System [2]
[1] Tatusov et al., BMC Bioinformatics, 4(41).
[2] Kanehisa et al., Nucleic Acids Res., 28:27–30, 2000.
Background
Sequence-based homology data from:
1. Clusters of Orthologous Groups – COG [1]
   – Protein sequences in different genomes are compared to find the best hits (BeTs)
   – The graph of BeTs is constructed
   – Triangles are found in it
   – Triangles sharing a side are merged into groups of orthologs and paralogs
   – No dependence on the absolute level of similarity between the compared proteins
2. KEGG Orthology System [2]
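The triangle-merging step can be sketched as follows. This is a minimal illustration, not the COG pipeline itself: it assumes the BeTs are already given as undirected edges between protein identifiers, and it does not check that the three proteins of a triangle come from different genomes. The function name `merge_bet_triangles` is hypothetical.

```python
from itertools import combinations


def merge_bet_triangles(edges):
    """Group proteins by merging triangles of a best-hit graph that share a side.

    edges: iterable of (protein_a, protein_b) pairs, treated as undirected
    best hits (an assumption of this sketch).
    """
    adj = {}
    for a, b in edges:
        if a == b:
            continue  # ignore self-hits
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)

    # Enumerate every triangle once: an edge plus a common neighbour.
    triangles = set()
    for a in adj:
        for b in adj[a]:
            for c in adj[a] & adj[b]:
                triangles.add(frozenset((a, b, c)))
    triangles = sorted(triangles, key=sorted)

    # Union-find over triangles; two triangles are joined if they share a side.
    parent = list(range(len(triangles)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    side_owner = {}
    for i, tri in enumerate(triangles):
        for side in combinations(sorted(tri), 2):
            if side in side_owner:
                parent[find(i)] = find(side_owner[side])
            else:
                side_owner[side] = i

    groups = {}
    for i, tri in enumerate(triangles):
        groups.setdefault(find(i), set()).update(tri)
    return sorted(sorted(g) for g in groups.values())
```

Proteins that take part in no triangle are left ungrouped, which matches the idea that grouping depends on consistent best hits rather than on an absolute similarity cutoff.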
Background
Sequence-based homology data from:
1. Clusters of Orthologous Groups – COG [1]
2. KEGG Orthology System [2]
   – Sequences are aligned
   – If alignment score < [...], 1 is assigned as the “similarity bit”; otherwise, 0 is assigned
   – A “bit vector” is constructed for each protein, over all proteins
   – A graph is constructed whose nodes are protein sequences and whose edges are the correlation coefficients of the nodes' bit vectors
   – Cliques found in the graph = orthology groups
   – Again, no dependence on the absolute level of similarity
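The bit-vector steps can be sketched as below. Several things here are assumptions rather than the KEGG procedure: the score cutoff is not given on the slide, so `threshold` is a hypothetical parameter; the comparison direction (`<`) simply follows the slide; and `min_corr`, a correlation cutoff used to decide which edges count when searching for cliques, is introduced only for illustration.

```python
from itertools import combinations
from math import sqrt


def bit_vectors(scores, threshold):
    """scores: dict protein -> list of alignment scores against all proteins."""
    return {p: [1 if s < threshold else 0 for s in row]
            for p, row in scores.items()}


def pearson(x, y):
    """Pearson correlation coefficient; 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def maximal_cliques(adj):
    """Basic Bron-Kerbosch enumeration of maximal cliques."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(sorted(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return sorted(cliques)


def orthology_groups(scores, threshold, min_corr):
    """Cliques of proteins whose bit vectors are strongly correlated."""
    bits = bit_vectors(scores, threshold)
    adj = {p: set() for p in bits}
    for p, q in combinations(sorted(bits), 2):
        if pearson(bits[p], bits[q]) >= min_corr:
            adj[p].add(q)
            adj[q].add(p)
    return [c for c in maximal_cliques(adj) if len(c) > 1]
```

Because grouping is driven by how similar two proteins' whole similarity profiles are, two proteins can fall into the same group even when their direct alignment score is modest.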
Background
Sequence-based homology data from COG [1] and KEGG [2]
We examine yeast proteins only:
– Extract all possible pairs of them in COG and KEGG groups = “orthologous pairs”
– There are 9,643 unique such pairs
– What are their topological similarities within the PPI network?
Background
Previous network-topology-assisted approaches:
– Network-alignment-based (ISORank)
– Yosef, Sharan & Noble, Bioinformatics, 2008 (hybrid Rankprop)
These rely heavily on sequence information and use only a limited amount of network topology
Our Method
We examine yeast proteins only:
– Extract all possible pairs of them in COG and KEGG groups = “orthologous pairs”
– There are 9,643 unique such pairs
– What are their topological similarities within the PPI network?
PPI networks are noisy:
– We analyze the high-confidence part of the yeast PPI network by Collins et al. [3]: 9,074 edges amongst 1,621 proteins
– Focus on proteins with degree > 3 to avoid noisy PPIs
– There are 175 orthologous pairs amongst 181 proteins
[3] Collins et al., Molecular and Cellular Proteomics, 6(3):439–450, 2007
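The pair extraction and the degree filter can be sketched as follows. This is an illustration under assumptions: orthology groups are given as lists of yeast protein identifiers, the PPI network as a list of undirected edges, and the function names are hypothetical.

```python
from itertools import combinations


def orthologous_pairs(groups):
    """All unique unordered protein pairs that share at least one group."""
    pairs = set()
    for members in groups:
        for p, q in combinations(sorted(set(members)), 2):
            pairs.add(frozenset((p, q)))
    return pairs


def filter_by_degree(pairs, ppi_edges, min_degree=3):
    """Keep pairs whose two proteins both have PPI degree > min_degree."""
    degree = {}
    for a, b in ppi_edges:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
    return {pair for pair in pairs
            if all(degree.get(x, 0) > min_degree for x in pair)}
```

Storing each pair as a `frozenset` means a pair that occurs in both a COG group and a KEGG group is counted once.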
Our Method
– Does PPI network topology contain homology information?
– Are similarly wired proteins homologous?
– Does homology information obtained from network topology differ from that obtained from sequence?
Our Method
Graphlets: small connected subgraphs of the network that are
– Induced
– Of any frequency
N. Przulj, D. G. Corneil, and I. Jurisica, “Modeling Interactome: Scale Free or Geometric?,” Bioinformatics, vol. 20, num. 18, 2004.
Our Method
Graphlet degrees generalize node degree
N. Przulj, “Biological Network Comparison Using Graphlet Degree Distribution,” ECCB, Bioinformatics, vol. 23, pg. e177-e183, 2007.
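To illustrate how graphlet degrees generalize node degree, the sketch below counts, for each node, how many times it touches the first four automorphism orbits: an edge (orbit 0, the ordinary degree), the end of an induced 3-node path (orbit 1), the middle of an induced 3-node path (orbit 2), and a triangle (orbit 3). The full vectors used in the cited work cover 73 orbits of 2- to 5-node graphlets; restricting to four orbits is a simplification made here, and the function name is hypothetical.

```python
from itertools import combinations


def graphlet_degrees_0_to_3(edges):
    """Return {node: [orbit0, orbit1, orbit2, orbit3]} for an undirected graph."""
    adj = {}
    for a, b in edges:
        if a == b:
            continue  # ignore self-loops
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)

    result = {}
    for v, nbrs in adj.items():
        degree = len(nbrs)
        # Orbit 3: triangles through v = adjacent pairs among v's neighbours.
        triangles = sum(1 for u, w in combinations(nbrs, 2) if w in adj[u])
        # Orbit 2: v in the middle of an induced path = non-adjacent neighbour pairs.
        path_middle = degree * (degree - 1) // 2 - triangles
        # Orbit 1: v at the end of an induced path v-u-w, with w not adjacent to v.
        path_end = sum(1 for u in nbrs for w in adj[u]
                       if w != v and w not in nbrs)
        result[v] = [degree, path_end, path_middle, triangles]
    return result
```

For a triangle a-b-c with a pendant node d attached to c, node c gets `[3, 0, 2, 1]` and node d gets `[1, 2, 0, 0]`: two nodes can differ sharply in wiring beyond what degree alone shows.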
Our Method
Graphlet Degree (GD) vectors, or “node signatures”
T. Milenkovic and N. Przulj, “Uncovering Biological Network Function via Graphlet Degree Signatures,” Cancer Informatics, vol. 4, 2008.
Our Method
Signature similarity measure: a similarity measure between nodes' Graphlet Degree vectors (T. Milenkovic and N. Przulj, Cancer Informatics, 2008)
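The slide's formula is not preserved here, so the following is a sketch under stated assumptions. It assumes the measure has the form described in the cited Milenković and Pržulj paper: a per-orbit distance w_i * |log(u_i + 1) - log(v_i + 1)| / log(max(u_i, v_i) + 2), averaged over orbits using the weights, with similarity defined as one minus that distance. The paper's orbit-dependency weights are replaced by uniform weights unless the caller supplies them; that default is a simplification of this sketch.

```python
from math import log


def signature_similarity(u, v, weights=None):
    """Similarity in (0, 1] between two graphlet degree vectors u and v."""
    if len(u) != len(v):
        raise ValueError("signatures must have the same length")
    if weights is None:
        weights = [1.0] * len(u)  # assumption: uniform orbit weights
    total = 0.0
    for ui, vi, wi in zip(u, v, weights):
        # The log scaling keeps high-count orbits from dominating the distance.
        total += wi * abs(log(ui + 1) - log(vi + 1)) / log(max(ui, vi) + 2)
    return 1.0 - total / sum(weights)
```

Identical signatures give a similarity of exactly 1; the value decreases as the orbit counts diverge.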
Our Method
For the 181 proteins in the 175 orthologous pairs, we find:
– Graphlet degree vectors (GDVs) in the entire PPI network
– GDV-similarities (GDS) = topological similarities
– Sequence identities, using Smith-Waterman local alignment with the BLOSUM50 substitution matrix as the scoring scheme
We compare GDV-similarity with sequence identity: topology vs. sequence
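The sequence-identity computation can be sketched with a small Smith-Waterman implementation. Two simplifications are assumed: a flat match/mismatch score stands in for the BLOSUM50 matrix, and a linear gap penalty is used; the parameter values are illustrative only. Identity is taken here as the fraction of identical residues over the columns of the best local alignment, which is one of several possible definitions.

```python
def smith_waterman_identity(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best local alignment score, identity over the aligned region)."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + s,
                          H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the local alignment score drops to 0.
    i, j = best_pos
    matches = length = 0
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            matches += a[i - 1] == b[j - 1]
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            i -= 1  # gap in b
        else:
            j -= 1  # gap in a
        length += 1
    return best, (matches / length if length else 0.0)
```

With a real substitution matrix, the line computing `s` would look up the score of the residue pair instead of testing equality.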
Results: Network Topology
– Orthologous pairs often perform the same or similar function
– Does GD vector similarity (GDS) imply shared biological function?
– Note: most GO annotations were obtained from sequences
– Similar topology ~ similar sequence ~ similar function
Results: Network Topology
– Orthologous proteins have high GD vector similarities (p-value < [...])
– > 20% of orthologous pairs have GDS > 85%
Results: Network Topology – Robustness
– PPI networks are noisy
– Random edge additions, deletions and rewirings in the PPI network
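A noise experiment of this kind can be sketched as below. The details are assumptions, since the slide does not give them: the noise level is a fraction of the current edge count, "rewire" is taken to mean deleting an edge and adding a random new one, and the function name and seed handling are hypothetical. The sketch also assumes a sparse network, so that a random non-edge is easy to find.

```python
import random


def perturb(edges, nodes, fraction, mode, seed=0):
    """Return a perturbed edge set; mode is 'add', 'delete' or 'rewire'."""
    if mode not in ("add", "delete", "rewire"):
        raise ValueError("unknown mode: %r" % (mode,))
    rng = random.Random(seed)
    edge_set = {frozenset(e) for e in edges}
    node_list = sorted(nodes)
    k = int(round(fraction * len(edge_set)))

    if mode in ("delete", "rewire"):
        # Remove k edges chosen uniformly at random.
        for e in rng.sample(sorted(edge_set, key=sorted), k):
            edge_set.remove(e)
    if mode in ("add", "rewire"):
        # Add k edges between random node pairs that are not yet connected.
        added = 0
        while added < k:
            a, b = rng.sample(node_list, 2)
            candidate = frozenset((a, b))
            if candidate not in edge_set:
                edge_set.add(candidate)
                added += 1
    return edge_set
```

Recomputing the topological similarities on the perturbed network and comparing them with the original values gives a measure of how sensitive the result is to noisy interactions.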
Results: Sequence
Sequence identities for the 175 orthologous pairs:
– ~70% of orthologous pairs have sequence identity < 35%
– ~20% of orthologous pairs have sequence identity > 90%
– “Twilight zone” for homology: 20-35% sequence identity
– No dependence on the absolute similarity in COG & KEGG, but on triangles in the graph of best matches
Results: Comparison
– ~20% of orthologous pairs have signature similarities above 85% (35 pairs)
– ~30% of orthologous pairs have sequence identities above 35% (53 pairs)
– Overlap: 22 pairs (~60% of the smaller set)
– Sequence and network topology capture somewhat complementary slices of homology information
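The percentages above follow directly from the pair counts. The short check below uses only the numbers on the slide (175 pairs in total, 35 above the topology cutoff, 53 above the sequence cutoff, 22 in both); the function name and the derived quantities it reports are additions for illustration.

```python
def overlap_summary(total, topo, seq, both):
    """Fractions and set sizes derived from the pair counts."""
    return {
        "topo_fraction": topo / total,             # pairs with GDS > 85%
        "seq_fraction": seq / total,               # pairs with identity > 35%
        "overlap_of_smaller": both / min(topo, seq),
        "topology_only": topo - both,
        "sequence_only": seq - both,
        "either": topo + seq - both,
    }


summary = overlap_summary(total=175, topo=35, seq=53, both=22)
for name, value in summary.items():
    print(name, round(value, 3) if isinstance(value, float) else value)
```

Of the 66 pairs picked up by either criterion, 13 are found by topology alone and 31 by sequence alone, which is the sense in which the two sources are complementary.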
Results: Examples
– 59 of the yeast ribosomal proteins have retained two genomic copies
– Are duplicated proteins functionally redundant? No: they have different genetic requirements for their assembly and localization, so they are functionally distinct
– Also note: the average sequence identity of structurally similar proteins is ~8-10%
Two pairs with identical sequence:
1. 100% sequence identity, 50% signature similarity, degrees 25 and 5
2. 100% sequence identity, 65% signature similarity, degrees 54 and 9
Conclusions
– Homology information captured by PPI network topology differs from that captured by sequence
– The two are complementary sources for identifying homologs
– Future work: could topological similarity be used to identify orthologs through best-hits graph analysis, as is done for sequences?
Acknowledgements
This project was supported by the NSF CAREER IIS grant