Complementarity of network and sequence information in homologous proteins (March 2010)

Presentation transcript:

Complementarity of network and sequence information in homologous proteins
Vesna Memišević (2), Tijana Milenković (2), and Nataša Pržulj (1)
(1) Department of Computing, Imperial College London, London, UK
(2) Department of Computer Science, University of California, Irvine, USA
International Symposium on Integrative Bioinformatics, March 2010

Motivation
- Genetic sequences revolutionized our understanding of biology.
- Non-sequence data are also important, e.g.:
  - Secondary and tertiary structure have the dominant role in RNA function (tRNA: Gautheret et al., Comput. Appl. Biosci., 1990; rRNA: Woese et al., Microbiological Reviews, 1983).
  - A secondary-structure-based approach is more effective at finding new functional RNAs than sequence-based alignments (Webb et al., Science, 2009).
- What about patterns of interconnections in protein-protein interaction (PPI) networks?
  - Can they complement the knowledge learned from genomic sequence?
  - Do the wiring patterns of duplicated proteins in a PPI network give insight into evolutionary distance?
  - Does the information about homologs captured by PPI network topology differ from that captured by their sequence?

Background
Homologs descend from a common ancestor:
1. Paralogs: in the same species; evolve through gene duplication events
2. Orthologs: in different species; evolve through speciation events

Background
Sequence-based homology data come from:
1. Clusters of Orthologous Groups (COG) [1]
2. KEGG Orthology System [2]
[1] Tatusov et al., BMC Bioinformatics, 4(41).
[2] Kanehisa et al., Nucleic Acids Res., 28:27–30, 2000.

Background: Clusters of Orthologous Groups (COG) [1]
- Protein sequences in different genomes are compared to find the best hits (BeTs).
- The graph of BeTs is constructed.
- Triangles are found in it.
- Triangles sharing a side are merged into groups of orthologs and paralogs.
- There is no dependence on the absolute level of similarity between the compared proteins.
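The COG steps listed above (best-hit graph, triangles, merging triangles that share a side) can be sketched as follows. This is a minimal illustration and not the COG pipeline itself: the `best_hits` map, the protein names, and all function names are hypothetical, and an edge is assumed to exist whenever either protein lists the other as a best hit.

```python
from itertools import combinations

def best_hit_graph(best_hits):
    """Undirected adjacency sets from a {protein: [best hits]} map (assumed input)."""
    adj = {}
    for p, hits in best_hits.items():
        for q in hits:
            adj.setdefault(p, set()).add(q)
            adj.setdefault(q, set()).add(p)
    return adj

def find_triangles(adj):
    """All triangles, each as a frozenset of three mutually adjacent proteins."""
    triangles = set()
    for u in adj:
        for v, w in combinations(sorted(adj[u]), 2):
            if w in adj[v]:
                triangles.add(frozenset((u, v, w)))
    return triangles

def merge_triangles(triangles):
    """Merge triangles that share a side (two proteins) using union-find."""
    tris = list(triangles)
    parent = list(range(len(tris)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    side_owner = {}  # side -> index of the first triangle seen with that side
    for i, tri in enumerate(tris):
        for side in combinations(sorted(tri), 2):
            if side in side_owner:
                parent[find(i)] = find(side_owner[side])
            else:
                side_owner[side] = i

    groups = {}
    for i, tri in enumerate(tris):
        groups.setdefault(find(i), set()).update(tri)
    return sorted(sorted(g) for g in groups.values())

# Hypothetical toy input: two triangles sharing the side (b1, c1), plus a lone edge.
best_hits = {"a1": ["b1", "c1"], "b1": ["c1", "d1"], "c1": ["d1"], "x1": ["y1"]}
groups = merge_triangles(find_triangles(best_hit_graph(best_hits)))
print(groups)
```

The lone edge between `x1` and `y1` forms no triangle, so it does not appear in any group. Only the pattern of best hits matters here, not the alignment scores, which is the sense in which the grouping is independent of the absolute level of similarity.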

Background: KEGG Orthology System [2]
- Sequences are aligned.
- If the alignment score < [threshold missing], 1 is assigned as the "similarity bit"; otherwise, 0 is assigned.
- A "bit vector" is constructed for each protein, over all proteins.
- A graph is constructed with protein sequences as nodes and correlation coefficients of the nodes' bit vectors as edges.
- Cliques found in the graph = orthology groups.
- Again, there is no dependence on the absolute level of similarity.
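The bit-vector steps above can be sketched roughly as below. Several things are assumptions: the cutoff value is not preserved on the slide, so `cutoff` is a made-up parameter; the comparison direction (`score < cutoff`) is kept as written on the slide, which fits a score where smaller means more similar; the `min_corr` threshold for drawing an edge and all names are hypothetical.

```python
from math import sqrt

def bit_vectors(proteins, score, cutoff):
    """One 0/1 vector per protein; a bit is 1 when score(p, q) < cutoff."""
    return {p: [1 if score(p, q) < cutoff else 0 for q in proteins]
            for p in proteins}

def pearson(x, y):
    """Pearson correlation coefficient; 0.0 when either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)

def correlation_graph(vectors, min_corr):
    """Connect proteins whose bit vectors correlate at least min_corr (assumed rule)."""
    names = sorted(vectors)
    adj = {p: set() for p in names}
    for i, p in enumerate(names):
        for q in names[i + 1:]:
            if pearson(vectors[p], vectors[q]) >= min_corr:
                adj[p].add(q)
                adj[q].add(p)
    return adj

def maximal_cliques(adj):
    """Bron-Kerbosch enumeration of maximal cliques (no pivoting)."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(sorted(r))
            return
        for v in sorted(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return sorted(cliques)

# Hypothetical scores: A, B, C are mutually similar; D is similar only to itself.
proteins = ["A", "B", "C", "D"]

def toy_score(p, q):
    if p == q:
        return 0.0
    return 1.0 if "D" in (p, q) else 1e-30

vectors = bit_vectors(proteins, toy_score, cutoff=1e-5)
graph = correlation_graph(vectors, min_corr=0.9)
groups = [c for c in maximal_cliques(graph) if len(c) > 1]
print(groups)
```

Single-node cliques are dropped at the end because a group of one protein yields no orthologous pair.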

Background: orthologous pairs in yeast
- We examine yeast proteins only.
- We extract all possible pairs of them within COG [1] and KEGG [2] groups = "orthologous pairs".
- There are 9,643 unique such pairs.
- What are their topological similarities within the PPI network?
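Extracting all unique pairs from the orthology groups can be sketched as below. The group identifiers and protein names are placeholders, not real COG or KEGG entries, and the function name is hypothetical.

```python
from itertools import combinations

def orthologous_pairs(groups):
    """All unique unordered protein pairs that co-occur in at least one group."""
    pairs = set()
    for members in groups.values():
        # Sorting gives each pair a canonical order, so duplicates across groups collapse.
        for a, b in combinations(sorted(set(members)), 2):
            pairs.add((a, b))
    return pairs

# Placeholder groups: the pair (P1, P2) appears in both and is counted once.
groups = {
    "cog_group_1": ["P1", "P2", "P3"],
    "kegg_group_1": ["P2", "P1"],
}
pairs = orthologous_pairs(groups)
print(sorted(pairs))
```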

Background: previous network-topology-assisted approaches
- Network-alignment-based (IsoRank)
- Yosef, Sharan & Noble, Bioinformatics, 2008 (hybrid Rankprop)
=> These rely heavily on sequence information.
=> They use only a limited amount of network topology.

Our Method
- We examine yeast proteins only: all possible pairs within COG and KEGG groups ("orthologous pairs"), 9,643 unique pairs in total.
- What are their topological similarities within the PPI network?
- PPI networks are noisy, so we analyze the high-confidence part of the yeast PPI network by Collins et al. [3]: 9,074 edges amongst 1,621 proteins.
- We focus on proteins with degree > 3 to avoid noisy PPIs.
- This leaves 175 orthologous pairs amongst 181 proteins.
[3] Collins et al., Molecular and Cellular Proteomics, 6(3):439–450, 2007.
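The degree filter can be sketched as follows, assuming that both proteins of a pair must exceed the degree cutoff (the slide does not spell this out) and that the edge list holds each undirected interaction once. The names and the toy network are hypothetical.

```python
from collections import Counter

def node_degrees(edges):
    """Degree of each protein in an undirected edge list."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg

def filter_pairs_by_degree(pairs, edges, min_degree=3):
    """Keep pairs whose two proteins both have degree > min_degree."""
    deg = node_degrees(edges)
    return {(a, b) for a, b in pairs
            if deg[a] > min_degree and deg[b] > min_degree}

# Toy network: P1 and P2 have degree 4, P3 has degree 1.
partners = ("Q1", "Q2", "Q3", "Q4")
edges = [("P1", q) for q in partners] + [("P2", q) for q in partners] + [("P3", "Q1")]
pairs = {("P1", "P2"), ("P1", "P3")}

kept = filter_pairs_by_degree(pairs, edges)
proteins = {p for pair in kept for p in pair}
print(kept, proteins)
```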

Our Method: questions
- Does PPI network topology contain homology information?
  => Are similarly wired proteins homologous?
- Does homology information obtained from network topology differ from that obtained from sequence?

Our Method: graphlets
- Induced
- Of any frequency
N. Pržulj, D. G. Corneil, and I. Jurisica, "Modeling Interactome: Scale Free or Geometric?," Bioinformatics, vol. 20, no. 18, 2004.

Our Method: generalize node degree
N. Pržulj, "Biological Network Comparison Using Graphlet Degree Distribution," ECCB, Bioinformatics, vol. 23, pp. e177-e183, 2007.

Our Method: Graphlet Degree (GD) vectors, or "node signatures"
T. Milenković and N. Pržulj, "Uncovering Biological Network Function via Graphlet Degree Signatures," Cancer Informatics, vol. 4, 2008.
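To make the idea of a node signature concrete, here is a small sketch that counts only the first four orbits, those of the 2- and 3-node graphlets. The full graphlet degree vector covers graphlets of up to 5 nodes and is not reproduced here. The orbit numbering (0: edge, 1: end of a 3-node path, 2: middle of a 3-node path, 3: triangle) is an assumption about the usual convention, and `small_gdv` is a hypothetical name.

```python
from itertools import combinations

def small_gdv(adj, v):
    """[orbit0, orbit1, orbit2, orbit3] for node v, counting induced graphlets."""
    nbrs = adj[v]
    orbit0 = len(nbrs)  # plain node degree
    # Triangles through v: pairs of neighbors that are themselves adjacent.
    orbit3 = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    # v in the middle of an induced path: pairs of neighbors that are not adjacent.
    orbit2 = orbit0 * (orbit0 - 1) // 2 - orbit3
    # v at the end of an induced path v-u-w: w is a neighbor of u but not of v.
    orbit1 = sum(1 for u in nbrs for w in adj[u] if w != v and w not in nbrs)
    return [orbit0, orbit1, orbit2, orbit3]

# Toy graph: triangle a-b-c with a pendant node d attached to c.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
for node in sorted(adj):
    print(node, small_gdv(adj, node))
```

Orbit 0 is the ordinary degree, which is why the graphlet degree vector is described as a generalization of node degree.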

Our Method: signature similarity measure
A similarity measure between nodes' Graphlet Degree vectors (Milenković and Pržulj, Cancer Informatics, 2008).
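The formula shown on this slide is not preserved in the transcript. The sketch below is one log-scaled distance between two signatures, turned into a similarity in [0, 1]; it is written in the spirit of the cited paper, but the exact form and the uniform orbit weights used by default are assumptions, and `signature_similarity` is a hypothetical name.

```python
from math import log

def signature_similarity(gdv_u, gdv_v, weights=None):
    """Similarity in [0, 1] between two graphlet degree vectors of equal length."""
    if len(gdv_u) != len(gdv_v):
        raise ValueError("signatures must have the same length")
    if weights is None:
        weights = [1.0] * len(gdv_u)  # assumption: every orbit counts equally
    total = 0.0
    for w, a, b in zip(weights, gdv_u, gdv_v):
        # Log scaling keeps orbits with very large counts from dominating.
        total += w * abs(log(a + 1) - log(b + 1)) / log(max(a, b) + 2)
    return 1.0 - total / sum(weights)

same = signature_similarity([3, 0, 2, 1], [3, 0, 2, 1])
different = signature_similarity([3, 0, 2, 1], [1, 2, 0, 0])
print(same, different)
```

Each per-orbit term is below its weight because log(max + 1) < log(max + 2), so the result always stays between 0 and 1, and identical signatures score exactly 1.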

Our Method
For the 181 proteins in the 175 orthologous pairs, we find:
- Graphlet degree vectors (GDVs) in the entire PPI network
- GDV similarities (GDS) = topological similarities
- Sequence identities, using Smith-Waterman local alignment with the BLOSUM50 substitution matrix as the scoring scheme
We compare GDV similarity with sequence identity (topology vs. sequence).
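The sequence side of the comparison can be illustrated with a compact Smith-Waterman sketch. To stay self-contained it uses a simple match/mismatch score and a linear gap penalty in place of BLOSUM50, and it defines identity as matches divided by the number of alignment columns; both choices are assumptions, not the scoring used in the talk.

```python
def smith_waterman_identity(s, t, match=2, mismatch=-1, gap=-2):
    """Return (best local alignment score, identity of that alignment)."""
    rows, cols = len(s) + 1, len(t) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + sub,  # align s[i-1] with t[j-1]
                          H[i - 1][j] + gap,      # gap in t
                          H[i][j - 1] + gap)      # gap in s
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    matches = columns = 0
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if s[i - 1] == t[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            if s[i - 1] == t[j - 1]:
                matches += 1
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
        columns += 1
    identity = matches / columns if columns else 0.0
    return best, identity

print(smith_waterman_identity("ACGT", "ACCT"))
print(smith_waterman_identity("HEAGAWGHEE", "HEAGAWGHEE"))
```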

Results: network topology
- Orthologous pairs often perform the same or similar function.
- Does GD vector similarity (GDS) imply shared biological function?
- Note: most GO annotations were obtained from sequences.
  => Similar topology ~ similar sequence ~ similar function

Results: network topology
- Orthologous proteins have high GD vector similarities (p-value < [value missing]%).
- ~20% of orthologous pairs have GDS > 85%.

Results: network topology, robustness
- PPI networks are noisy.
- Random edge additions, deletions, and rewirings were applied to the PPI network.
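A robustness test of this kind can be sketched as below. The slide does not say how the perturbations were generated, so everything here is an assumption: the `fraction` parameter, treating a rewiring as a deletion followed by an equal number of additions, and the function name.

```python
import random

def perturb_edges(nodes, edges, fraction, mode, seed=0):
    """Return a perturbed edge set; mode is 'add', 'delete', or 'rewire'."""
    if mode not in ("add", "delete", "rewire"):
        raise ValueError("mode must be 'add', 'delete', or 'rewire'")
    rng = random.Random(seed)
    nodes = sorted(nodes)
    current = {frozenset(e) for e in edges}
    k = int(round(fraction * len(current)))

    if mode in ("delete", "rewire"):
        # Remove k randomly chosen existing edges.
        for e in rng.sample(sorted(current, key=sorted), k):
            current.remove(e)

    if mode in ("add", "rewire"):
        capacity = len(nodes) * (len(nodes) - 1) // 2 - len(current)
        if k > capacity:
            raise ValueError("not enough absent edges to add")
        added = 0
        while added < k:
            e = frozenset(rng.sample(nodes, 2))
            if e not in current:
                current.add(e)
                added += 1
    return current

# Toy network: a path on six nodes (five edges), perturbed at 40%.
nodes = ["a", "b", "c", "d", "e", "f"]
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("e", "f")]
for mode in ("add", "delete", "rewire"):
    print(mode, len(perturb_edges(nodes, edges, 0.4, mode)))
```

After each perturbation, the signatures and their similarities would be recomputed on the perturbed network to see how far the results move.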

Results: sequence
- Sequence identities for the 175 orthologous pairs
- ~70% of orthologous pairs have sequence identity < 35%.
- ~20% of orthologous pairs have sequence identity > 90%.
- "Twilight zone" for homology: 20-35% sequence identity
  => COG and KEGG do not depend on the absolute similarity, but on triangles in the graph of best matches.

Results: comparison
- ~20% of orthologous pairs have signature similarities above 85% (35 pairs).
- ~30% of orthologous pairs have sequence identities above 35% (53 pairs).
- Overlap: 22 pairs (~60% of the smaller set).
=> Sequence and network topology give somewhat complementary slices of homology information.
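The percentages on this slide follow directly from the pair counts. The short check below uses only the numbers given on the slides (175 pairs in total, 35 and 53 above the two thresholds, 22 in both); the variable names are arbitrary.

```python
total_pairs = 175    # orthologous pairs analyzed
high_topology = 35   # pairs with signature similarity above 85%
high_sequence = 53   # pairs with sequence identity above 35%
overlap = 22         # pairs in both sets

frac_topology = high_topology / total_pairs                  # share found by topology
frac_sequence = high_sequence / total_pairs                  # share found by sequence
frac_overlap = overlap / min(high_topology, high_sequence)   # share of the smaller set

only_topology = high_topology - overlap  # high topology, low sequence identity
only_sequence = high_sequence - overlap  # high sequence identity, low topology
either = high_topology + high_sequence - overlap

print(f"{frac_topology:.0%} {frac_sequence:.0%} {frac_overlap:.0%}")
print(only_topology, only_sequence, either)
```

So 13 pairs are picked up by topology alone and 31 by sequence alone, which is what makes the two signals complementary rather than redundant.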

Results: examples
- 59 of the yeast ribosomal proteins have retained two genomic copies.
- Are duplicated proteins functionally redundant? No: they have different genetic requirements for their assembly and localization, so they are functionally distinct.
- Also note: the average sequence identity of structurally similar proteins is ~8-10%.
- Two pairs with identical sequence:
  - Pair 1: 100% sequence identity, 50% signature similarity, degrees 25 and 5
  - Pair 2: 100% sequence identity, 65% signature similarity, degrees 54 and 9

Conclusions
- Homology information captured by PPI network topology differs from that captured by sequence.
- The two are complementary sources for identifying homologs.
- Future work: could topological similarity be used to identify orthologs through best-hits graph analysis, as is done for sequences?

Acknowledgements
This project was supported by the NSF CAREER IIS grant.