Graph partitioning in genomic data analysis Roland Barriot, Petra Langendijk-Genevaux, Yves Quentin, Gwennaele Fichant « Génomique des systèmes intégrés.


1 Graph partitioning in genomic data analysis
Roland Barriot, Petra Langendijk-Genevaux, Yves Quentin, Gwennaele Fichant
« Génomique des systèmes intégrés » group
Laboratoire de Microbiologie et Génétique Moléculaires – UMR5100
Clustering and Block Methods, Toulouse, July 2015

2 Outline
Biological concepts and definitions
– Genomes, genes, gene functions
– Quest for Orthologs
Strategy for inferring orthologs
– Graph representation
– Community detection and bi-clustering
Application to ABC transporters
Current challenges
Clustering and Block Methods - 2015 - Roland Barriot

3 Central dogma of molecular biology
DNA → mRNA → protein
– DNA (chromosome, genes): alphabet = {A, T, C, G}, 4 letters (nucleotides); hereditary information
– mRNA: alphabet = {A, U, C, G}
– protein: alphabet = 20 letters (amino acids); molecular machines

4 Sequence alignment
Sequence comparisons and alignments
– Query = protein sequence from Escherichia coli
– Subject = protein sequence from Salmonella typhimurium
→ the similarity score reflects sequence similarity (conservation)
Derived from a common ancestor? (orthologous sequences)
Similar sequence = similar function?
[Figure: the chromosome (DNA) of a common ancestor diverges through speciation and evolution (mutations) into today's organisms, which retain sequence similarity]

5 Genomes
The genome (DNA) is the hereditary information that defines the potential of an organism. It is the sequence of nucleotides of the chromosomes of an organism.
Gene function is predicted by sequence similarity.
[Figure: full chromosome of Mycoplasma genitalium, 1.2 Mbp (small)]

6 Duplication events
An orthology 1:1 relationship is needed to predict gene function.

7 Strategy for orthology 1:1 inference
Infer putative links
– sequence similarity → graph
Prune the graph to remove false positives
– graph partitioning → communities
Identify subclasses by genomic neighborhood conservation
– bi-clustering → orthologs 1:1

8 Putative orthology 1:1 inference
Systematic all-against-all sequence comparison
– B:C is retained: sim(B,C) > max(sim(B,B2), sim(C,C2), sim(C,C22))
– B2:C22 is rejected: sim(B2,C22) < sim(C2,C22)
[Figure: gene tree rooted at ancestor A; speciation and duplication events yield genes B, B2 in one genome and C, C2, C22 in the other; evolution = mutations; sequence comparisons between today's genes]
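
The inequality on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the similarity scores are invented, and the helper names `score` and `is_putative_one_to_one` are hypothetical, not taken from the authors' pipeline. A cross-genome pair is kept when it is more similar than either gene is to any gene of its own genome.

```python
def score(sim, a, b):
    """Symmetric lookup of a similarity score; 0.0 when the pair was never scored."""
    return sim.get((a, b), sim.get((b, a), 0.0))


def is_putative_one_to_one(sim, x, y, genome_x, genome_y):
    """True when sim(x, y) exceeds every within-genome similarity of x and of y."""
    within = [score(sim, x, p) for p in genome_x if p != x]
    within += [score(sim, y, p) for p in genome_y if p != y]
    return score(sim, x, y) > max(within, default=0.0)


# Invented scores for the genes named on the slide (assumption, not real data).
sim = {
    ("B", "C"): 0.9, ("B", "B2"): 0.6,
    ("C", "C2"): 0.5, ("C", "C22"): 0.5,
    ("B2", "C22"): 0.4, ("C2", "C22"): 0.8,
}
genome_x, genome_y = ["B", "B2"], ["C", "C2", "C22"]

print(is_putative_one_to_one(sim, "B", "C", genome_x, genome_y))     # True
print(is_putative_one_to_one(sim, "B2", "C22", genome_x, genome_y))  # False
```

With these made-up scores, B:C passes because 0.9 exceeds 0.6, 0.5 and 0.5, while B2:C22 fails because C22 is closer to its paralog C2 (0.8) than to B2 (0.4).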

9 Gene losses
Rounds of duplications and losses lead to false positive links.
[Figure: the same gene tree (A; B, B2; C, C2, C22) with gene losses marked; BBH (bidirectional best hit) links from sequence comparisons]

10 False positive pruning
Graph representation: nodes = genes, edges = putative orthologs
Graph partitioning or community detection
[Figure: graph over the genes B, B2, C, C2, C22]
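
As a minimal sketch of the graph representation, the snippet below stores genes as nodes and putative orthology links as undirected edges, then splits the graph into connected components. Connected components are only the coarsest possible partition and stand in here for a real community detection method (slide 21 names walktrap, which this sketch does not implement). The edge list and function names are assumptions made for illustration.

```python
from collections import defaultdict


def build_graph(edges):
    """Adjacency sets for an undirected graph given as (gene, gene) pairs."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj


def connected_components(adj):
    """Partition the nodes into connected components with a depth-first search."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adj[node] - component)
        seen |= component
        components.append(component)
    return components


# Hypothetical putative links between the genes shown on the slide.
edges = [("B", "C"), ("B2", "C2"), ("C2", "C22")]
components = connected_components(build_graph(edges))
print(sorted(sorted(c) for c in components))
```

A dedicated community detection method would go further than this and split a single connected component when it contains densely linked subgroups joined by a few false positive edges.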

11 Orthologous systems: co-evolution
[Figure: in an extinct common ancestor (A), the pentose transport genes on the chromosome (DNA) encode the pentose transport proteins of one single ABC transporter. Speciation (orthology), duplication and evolution lead to the systems B, B2 and C, C2, C22 of today's organisms (Escherichia coli, Staphylococcus aureus, Streptococcus pneumoniae), annotated with pentose, xylose and arabinose transport]

12 Gene order conservation illustration

13 Orthologous systems
Crossing: isortholog communities and genomic neighborhood
[Figure: systems A, B1, C1, D1, B2, C2, D2’, D2’’]

14 Clustering of communities shared by the same genomes
[Figure: matrix of genomic-neighbor communities (B1, C1, B2, D1, C2) against g1, g2, g4, g3, g7, g5, g6]

15 Clustering of communities shared by the same genomes
Two core subfamilies
[Figure: the same matrix of genomic-neighbor communities (B1, C1, B2, D1, C2), reordered as g7, g5, g4, g3, g1, g2, g6 so that Bicluster 1 and Bicluster 2 appear as blocks]
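
To make the idea of a bicluster concrete, here is a deliberately crude sketch: communities are grouped when they occur in exactly the same set of genomes. It assumes that g1 to g7 denote genomes, uses an invented presence table, and is not the Iterative Signature Algorithm cited on slide 22, which tolerates noise and missing entries. The name `exact_biclusters` is hypothetical.

```python
from collections import defaultdict


def exact_biclusters(presence):
    """Group communities that are present in exactly the same set of genomes.

    `presence` maps a community name to the set of genomes where it occurs.
    Returns (communities, genomes) pairs for groups of two or more communities.
    """
    groups = defaultdict(list)
    for community, genomes in presence.items():
        groups[frozenset(genomes)].append(community)
    return sorted(
        (sorted(communities), sorted(genomes))
        for genomes, communities in groups.items()
        if len(communities) > 1
    )


# Invented presence table (assumption): which genomes carry each community.
presence = {
    "B1": {"g7", "g5", "g4"},
    "C1": {"g7", "g5", "g4"},
    "D1": {"g7", "g5", "g4"},
    "B2": {"g3", "g1", "g2"},
    "C2": {"g3", "g1", "g2"},
}

for communities, genomes in exact_biclusters(presence):
    print(communities, genomes)
```

Requiring identical genome sets is far too strict for real data, where gene losses leave gaps in the matrix, so a practical method has to accept approximate blocks.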

16 Application to ABC systems
ABC transporters
– large family of paralogous genes
– up to >100 in a genome
– up to 15% of the genes in a prokaryotic genome
In complete genomes of prokaryotes: ABCdb
– ~2,000 genomes
– ~350k genes encoding ABC systems
SBP: Solute Binding Protein; MSD: Membrane Spanning Domain; NBD: Nucleotide Binding Domain
[Figure: exporter and importer architectures built from MSD and NBD domains, with an SBP for the importer; ATP → ADP + Pi]

17 Pipeline for ABC systems reconstruction and annotation
Steps: complete genome; ABC proteins identification & classification; assembly into functional systems; sub-family classification; functional prediction; evolutionary studies
Resources: sequence similarity search; profiles; chromosome localization; sub-family compatibility; evolutionary rules
Example from Acidovorax citrulli AAC00-1: same subfamily (carbohydrates transport), same location
[Figure: domain organization (MSD, NBD, SBP) of the example systems]

18 ABC systems in Lactococcus lactis
A_1: galactoside import
A_2: oligopeptide import
A_3: macrolide resistance
A_4: amino acid import
A_5: disaccharide import
A_6: multidrug resistance
A_7: antibiotic export
A_8: siderophore import
A_9: peptide / antibiotic export
A_11: phosphate import
A_12 and A_14: ?
A_18: phosphonate import

19 ABC systems in Bacillus subtilis
206 proteins forming 80 systems

20 ABC Subfamily 1: carbohydrate importers
~100 curated genomes
– 221 MSDs, 158 SBPs, 137 NBDs
[Figure: system architecture (MSD, NBD, SBP) and three orthologs 1:1 graphs, one each for the MSDs, the SBPs and the NBDs]

21 Communities detected (walktrap method)

22 Bi-clustering results
Subfamily of ABC carbohydrate importers: 110 reconstructed systems out of 125
Iterative signature algorithm [Bergmann et al., 2003]
Detect and isolate paralogous systems within the subfamily
– Autoinducer-2 in E. coli and S. typhimurium (Xavier and Bassler, 2005)
  B. cereus, B. anthracis, M. loti, P. multocida, S. flexneri
– Galactose importer in E. coli (Harayama et al., 1983)
  B. halodurans, F. nucleatum, H. influenzae, M. loti, P. multocida, S. typhimurium, S. flexneri and V. vulnificus

23 Current challenges
Quest for orthologs
– Big data: pace of new genome releases
– new vertices and edges
Include more biological criteria
– communities should not contain the same genome more than once
Co-clustering
– cluster each domain in parallel
– problem: the graphs do not have the same vertices

24 Quest for Orthologs

25 Current dataset
~2,000 genomes
– 350k genes coding ABC systems
Subfamily 1
– MSDs: 2,836 vertices, 700k edges
– SBPs: 2,407 vertices, 350k edges
– NBDs: 1,670 vertices, 400k edges
Graphs change with each new genome release! >1 new genome / day
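
A quick worked computation helps to read these numbers. The density of an undirected graph is 2E / (V(V - 1)), the fraction of all possible edges that are present. The sketch below applies it to the slide's figures, treating the rounded edge counts ("700k") as exact, which is an assumption; the function name `density` is chosen for illustration.

```python
def density(vertices, edges):
    """Fraction of the possible undirected edges that are present."""
    return 2 * edges / (vertices * (vertices - 1))


# Vertex and edge counts as quoted on the slide (edge counts are rounded there).
subfamily_1 = {
    "MSDs": (2836, 700_000),
    "SBPs": (2407, 350_000),
    "NBDs": (1670, 400_000),
}

for domain, (v, e) in subfamily_1.items():
    print(f"{domain}: density = {density(v, e):.3f}")
```

Under that assumption the three graphs have densities between roughly 0.12 and 0.29, so they are far from sparse, which matters for the cost of any partitioning method that must be rerun as new genomes arrive.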

26 Communities should not contain the same genome more than once
Include more biological criteria
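
This criterion is easy to check after the fact. The sketch below reports, for one community, the genomes that contribute more than one gene. The gene-to-genome table and the name `repeated_genomes` are assumptions made for illustration; how such a constraint would be enforced inside the partitioning itself is left open by the slide.

```python
from collections import Counter


def repeated_genomes(community, genome_of):
    """Genomes that contribute more than one gene to the given community."""
    counts = Counter(genome_of[gene] for gene in community)
    return sorted(genome for genome, n in counts.items() if n > 1)


# Hypothetical mapping from gene to the genome that carries it.
genome_of = {"B": "X", "B2": "X", "C": "Y", "C2": "Y", "C22": "Y"}

print(repeated_genomes({"B", "C"}, genome_of))           # []
print(repeated_genomes({"B2", "C2", "C22"}, genome_of))  # ['Y']
```

A community that fails the check mixes paralogs from one genome and is therefore a candidate for further splitting.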

27 Co-clustering
Partition the graphs in parallel, making use of the topologies of the partner graphs

28 Conclusions
Various graph topologies
Need efficient methods
– graph partitioning
– bi-clustering
Need accurate methods
– essential for gene function prediction

29 Acknowledgements
Petra Langendijk-Genevaux, Gwennaele Fichant, Yves Quentin, Mathias Weyder

30 Available genomes and knowledge
Model organism Escherichia coli: ~30% of genes with unknown function
[Figure: distribution of genome size (in genes) in ~2500 bacteria; axes: genome size, count]

31 Orthology 1:1 inference
Requires a 1:1 relationship = isorthology (no duplication since the last speciation event)
Systematic all-against-all sequence comparison
– B1:C1 is retained: sim(B1,C1) > max(sim(B1,B2), sim(C1,C2), sim(C1,C3))
– B2:C2 is rejected: sim(B2,C2) < sim(C2,C3)
[Figure: gene tree rooted at ancestor A with Speciation 1, Speciation 2, Duplication 1 and Duplication 2, leading to genes B, B2 and C, C2, C22; BBH links from sequence comparisons]

32 Evolution
[Figure: gene tree (A; B, B2; C, C2, C22) with Speciation 2, Duplication 1 and Duplication 2 leading to today's organisms, annotated with pentose, xylose and arabinose transport; Archaea: Pyrococcus furiosus; Bacteria: Mycoplasma genitalium, Escherichia coli]

33 Bi-clustering
Genes encoding an ABC system should be neighbors on the chromosome

34 Communities detected

35 Definitions
Homologs: sequences derived from the same ancestral sequence
– Paralogs: homologs whose divergence is due to a duplication
– Orthologs: homologs whose divergence dates back to a speciation event
– Isorthologs: orthologs for which no duplication event occurred after speciation
Orthologs 1:1 are more likely to have retained the same function

36 Community detection

37 [Figure: the chromosome (DNA) of a common ancestor diverges through speciation and evolution (mutations) into today's organisms, which retain sequence similarity]

