Graph partitioning in genomic data analysis
Roland Barriot, Petra Langendijk-Genevaux, Yves Quentin, Gwennaele Fichant

Presentation transcript:

Graph partitioning in genomic data analysis
Roland Barriot, Petra Langendijk-Genevaux, Yves Quentin, Gwennaele Fichant
« Génomique des systèmes intégrés » group, Laboratoire de Microbiologie et Génétique Moléculaires – UMR5100
Clustering and Block Methods, Toulouse, July 2015

Outline
Biological concepts and definitions
– Genomes, genes, gene functions
– Quest for Orthologs
Strategy for inferring orthologs
– Graph representation
– Community detection and bi-clustering
Application to ABC transporters
Current challenges

Central dogma of molecular biology
DNA → mRNA → protein
– DNA (chromosome, genes): hereditary information; alphabet = {A, T, C, G}, 4 letters (nucleotides)
– mRNA: alphabet = {A, U, C, G}
– protein: molecular machines; alphabet = 20 letters (amino acids)
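
The two alphabet changes on this slide can be sketched in a few lines. This is an illustration only: the input is assumed to be the coding strand, the codon table is deliberately partial (four of the 64 codons), and the function names `transcribe` and `translate` are made up for this example.

```python
# Partial codon table for illustration only; the real genetic code has 64 codons.
CODON_TABLE = {"AUG": "M", "UUU": "F", "GGC": "G", "UAA": "*"}

def transcribe(coding_dna):
    """Return the mRNA matching the coding DNA strand (T replaced by U)."""
    return coding_dna.upper().replace("T", "U")

def translate(mrna):
    """Translate codon by codon until a stop codon ('*') or the end of the sequence."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

mrna = transcribe("ATGTTTGGCTAA")   # toy sequence, not a real gene
protein = translate(mrna)
print(mrna, protein)
```

A codon missing from the partial table raises a KeyError, which is acceptable for a toy example but not for real sequences.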

Sequence alignment
Sequence comparisons and alignments
– Query = protein sequence from Escherichia coli
– Subject = protein sequence from Salmonella typhimurium
→ the similarity score reflects sequence similarity (conservation)
Is the similarity derived from a common ancestor (orthologous sequences)?
Does a similar sequence imply a similar function?
[Figure: a common ancestor, a speciation event, then evolution (= mutations) leading to today's organisms; chromosome (DNA); sequence similarity]

Genomes
[Figure: full chromosome of Mycoplasma genitalium, 1.2 Mbp (small)]
The genome (DNA) is the hereditary information that defines the potential of an organism. It is the sequence of nucleotides of the chromosomes of an organism. Gene function is predicted by sequence similarity.

Duplication events
A 1:1 orthology relationship is needed to predict gene function.

Strategy for orthology 1:1 inference
Infer putative links
– sequence similarity → graph
Prune the graph to remove false positives
– graph partitioning → communities
Identify subclasses by genomic neighborhood conservation
– bi-clustering → 1:1 orthologs

Putative orthology 1:1 inference
Systematic all-against-all sequence comparison
– B:C is retained: sim(B,C) > max(sim(B,B2), sim(C,C2), sim(C,C22))
– B2:C22 is rejected: sim(B2,C22) < sim(C2,C22)
[Figure: gene tree from ancestor A with speciation and duplication events leading to genes B, B2 in one genome and C, C2, C22 in the other; sequence comparisons between today's genes; evolution = mutations]
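
The two inequalities above can be read as a filter: a cross-genome pair is kept only if its similarity exceeds the similarity of either gene to any paralog in its own genome. The sketch below implements that reading, which is an assumption about the rule on the slide. The scores are invented toy values and the name `putative_orthologs` is hypothetical.

```python
from itertools import combinations

def putative_orthologs(sim, genome_of):
    """Keep cross-genome pairs whose similarity beats every within-genome
    (paralog) similarity of either gene. One possible reading of the slide."""
    def s(a, b):
        # Similarity is treated as symmetric; missing pairs score 0.
        return sim.get((a, b), sim.get((b, a), 0.0))

    genes = sorted(genome_of)
    links = []
    for x, y in combinations(genes, 2):
        if genome_of[x] == genome_of[y]:
            continue  # same genome: a paralog pair, never an orthology link
        paralog_scores = [
            s(g, p)
            for g in (x, y)
            for p in genes
            if p != g and genome_of[p] == genome_of[g]
        ]
        if s(x, y) > max(paralog_scores, default=0.0):
            links.append((x, y))
    return links

# Toy data (invented scores): genome gB holds B, B2; genome gC holds C, C2, C22.
genome_of = {"B": "gB", "B2": "gB", "C": "gC", "C2": "gC", "C22": "gC"}
sim = {
    ("B", "C"): 90, ("B", "B2"): 40, ("C", "C2"): 45, ("C", "C22"): 42,
    ("C2", "C22"): 80, ("B2", "C2"): 60, ("B2", "C22"): 58,
    ("B", "C2"): 38, ("B", "C22"): 37, ("B2", "C"): 41,
}
links = putative_orthologs(sim, genome_of)
print(links)
```

With these toy scores only B:C survives. B2:C22 is dropped because C22 is closer to its paralog C2 than to B2, as in the second inequality.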

Gene losses
Rounds of duplications and losses lead to false-positive links.
[Figure: gene tree from ancestor A with speciation, duplication and loss events leading to genes B, B2 and C, C2, C22; BBH sequence comparisons between today's genes]

False positive pruning
Graph representation: nodes = genes, edges = putative orthology links
Graph partitioning or community detection
[Figure: graph over the genes B, B2, C, C2, C22]
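
One common way to judge whether a partition separates real groups from a false-positive link is Newman's modularity Q. The sketch below scores two candidate partitions of a toy graph. It does not detect communities and is not the method used in the talk; the graph, the node names and the function name `modularity` are made up for illustration.

```python
from collections import defaultdict

def modularity(edges, community_of):
    """Newman modularity Q of a partition of an unweighted, undirected graph."""
    m = len(edges)
    internal = defaultdict(int)  # edges with both ends in the same community
    degree = defaultdict(int)    # sum of node degrees per community
    for u, v in edges:
        degree[community_of[u]] += 1
        degree[community_of[v]] += 1
        if community_of[u] == community_of[v]:
            internal[community_of[u]] += 1
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)

# Toy graph: two triangles of genes joined by one suspicious link (a3, b1).
edges = [
    ("a1", "a2"), ("a2", "a3"), ("a1", "a3"),
    ("b1", "b2"), ("b2", "b3"), ("b1", "b3"),
    ("a3", "b1"),
]
split = {"a1": 0, "a2": 0, "a3": 0, "b1": 1, "b2": 1, "b3": 1}
single = {node: 0 for node in split}

q_split = modularity(edges, split)
q_single = modularity(edges, single)
print(round(q_split, 3), round(q_single, 3))
```

Cutting the single bridging edge gives the higher score, which is the intuition behind pruning false-positive links by partitioning.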

Orthologous systems: co-evolution
[Figure: a single ancestral ABC transporter for pentose transport (common ancestor A, extinct) evolves through speciation (orthology) and duplication events into the systems of today's organisms Escherichia coli, Staphylococcus aureus and Streptococcus pneumoniae (genes B, B2 and C, C2, C22), which transport pentose, xylose or arabinose; labels: chromosome (DNA), pentose transport genes, pentose transport proteins]

Gene order conservation illustration

Orthologous systems
Crossing isortholog communities with genomic neighborhood
[Figure: ancestor A; systems B1 C1 D1 and B2 C2 D2’ D2’’]

Clustering of communities shared by the same genomes
[Figure: matrix of genomic-neighbor communities (B1, C1, B2, D1, C2) against genomes (g1, g2, g4, g3, g7, g5, g6)]

Clustering of communities shared by the same genomes
[Figure: the same matrix of genomic-neighbor communities (B1, C1, B2, D1, C2) with genomes reordered (g7, g5, g4, g3, g1, g2, g6), showing Bicluster 1 and Bicluster 2]
Two core subfamilies
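
The idea of grouping communities by the genomes they share can be illustrated with a deliberately crude stand-in: communities found in exactly the same set of genomes are put in one block. This is far simpler than a real bi-clustering algorithm, which tolerates missing entries. The presence data below is invented, and the name `group_by_genome_profile` is hypothetical.

```python
from collections import defaultdict

def group_by_genome_profile(presence):
    """Group communities that occur in exactly the same set of genomes."""
    groups = defaultdict(list)
    for community, genomes in presence.items():
        groups[frozenset(genomes)].append(community)
    return [(sorted(comms), sorted(genomes)) for genomes, comms in groups.items()]

# Toy presence data (invented): community -> genomes in which it is found.
presence = {
    "B1": {"g1", "g2", "g6"},
    "C1": {"g1", "g2", "g6"},
    "D1": {"g1", "g2", "g6"},
    "B2": {"g3", "g4", "g5", "g7"},
    "C2": {"g3", "g4", "g5", "g7"},
}
biclusters = group_by_genome_profile(presence)
for communities, genomes in biclusters:
    print(communities, genomes)
```

Exact matching breaks as soon as one gene is lost in one genome, which is why a tolerant method is needed on real data.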

Application to ABC systems
ABC transporters: a large family of paralogous genes
– up to >100 in a genome
– up to 15% of the genes in a prokaryotic genome
In complete genomes of prokaryotes: ABCdb
– ~2,000 genomes
– ~350k genes encoding ABC systems
[Figure: domain architecture of an exporter (MSD, NBD) and an importer (SBP, MSD, NBD); ATP → ADP + Pi]
SBP: Solute Binding Protein; MSD: Membrane Spanning Domain; NBD: Nucleotide Binding Domain

Pipeline for ABC systems reconstruction and annotation
Steps: complete genome → ABC proteins identification & classification → assembly into functional systems → sub-family classification → functional prediction → evolutionary studies
Methods and criteria: sequence similarity search, profiles, chromosome localization, sub-family compatibility, evolutionary rules
Example from Acidovorax citrulli AAC00-1: same subfamily (carbohydrate transport), same location
[Figure: systems assembled from MSD, NBD and SBP domains]

ABC systems in Lactococcus lactis
A_1: galactoside import
A_2: oligopeptide import
A_3: macrolide resistance
A_4: amino acid import
A_5: disaccharide import
A_6: multidrug resistance
A_7: antibiotic export
A_8: siderophore import
A_9: peptide / antibiotic export
A_11: phosphate import
A_12 and A_14: ?
A_18: phosphonate import

ABC systems in Bacillus subtilis
206 proteins forming 80 systems

ABC Subfamily 1: carbohydrate importers
~100 curated genomes
– 221 MSDs, 158 SBPs, 137 NBDs
[Figure: MSD orthologs 1:1 graph; SBP orthologs 1:1 graph; NBD orthologs 1:1 graph]

Communities detected
Walktrap method

Bi-clustering results
Subfamily of ABC carbohydrate importers: 110 reconstructed systems out of 125
– Autoinducer-2 in E. coli and S. typhimurium (Xavier and Bassler, 2005)
  B. cereus, B. anthracis, M. loti, P. multocida, S. flexneri
– Galactose importer in E. coli (Harayama et al., 1983)
  B. halodurans, F. nucleatum, H. influenzae, M. loti, P. multocida, S. typhimurium, S. flexneri and V. vulnificus
Detect and isolate paralogous systems within the subfamily
Iterative signature algorithm [Bergmann et al., 2003]

Current challenges
Quest for orthologs
– Big data: pace of new genome release
– new vertices and edges
Include more biological criteria
– a community should not contain more than one gene from the same genome
Co-clustering
– cluster each domain in parallel
– problem: the graphs do not have the same vertices

Quest for Orthologs

Current dataset
~2,000 genomes
– 350k genes coding ABC systems
Subfamily 1
– MSDs: 2,836 vertices, 700k edges
– SBPs: 2,407 vertices, 350k edges
– NBDs: 1,670 vertices, 400k edges
Graphs change with each new genome release! >1 new genome / day

Include more biological criteria
A community should not contain more than one gene from the same genome.
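
The one-gene-per-genome criterion is easy to check once a community is known. A minimal sketch follows, assuming a mapping from gene to genome is available; the gene and genome names are toy values and `repeated_genomes` is a hypothetical helper.

```python
from collections import Counter

def repeated_genomes(community, genome_of):
    """Return the genomes contributing more than one gene to the community."""
    counts = Counter(genome_of[gene] for gene in community)
    return sorted(g for g, n in counts.items() if n > 1)

# Toy mapping from gene to genome (invented names).
genome_of = {"B": "gB", "B2": "gB", "C": "gC", "C2": "gC"}

ok = repeated_genomes(["B", "C"], genome_of)
bad = repeated_genomes(["B", "B2", "C"], genome_of)
print(ok, bad)
```

A non-empty result flags a community that still mixes paralogs and could be split further.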

Co-clustering
Partition the graphs in parallel, using the topologies of the partner graphs.

Conclusions
Various graph topologies
Need efficient methods
– graph partitioning
– bi-clustering
Need accurate methods
– essential for gene function prediction

Acknowledgements
Petra Langendijk-Genevaux, Gwennaele Fichant, Yves Quentin, Mathias Weyder

Available genomes and knowledge
Model organism Escherichia coli: ~30% of genes with unknown function
[Figure: distribution of genome size (in genes) in ~2500 bacteria; axes: genome size, count]

Orthology 1:1 inference
Requires a 1:1 relationship = isorthology (no duplication since the last speciation event)
Systematic all-against-all sequence comparison
– B1:C1 is retained: sim(B1,C1) > max(sim(B1,B2), sim(C1,C2), sim(C1,C3))
– B2:C2 is rejected: sim(B2,C2) < sim(C2,C3)
[Figure: gene tree from ancestor A with Speciation 1, Speciation 2, Duplication 1 and Duplication 2 leading to genes B, B2 and C, C2, C22; BBH sequence comparisons]

Evolution
[Figure: gene tree from ancestor A with Speciation 2, Duplication 1 and Duplication 2 leading to genes B, B2 and C, C2, C22 in today's organisms; functions: pentose transport, xylose transport, arabinose transport; taxa: Archaea, Bacteria; organisms: Pyrococcus furiosus, Mycoplasma genitalium, Escherichia coli]

Bi-clustering
Genes encoding an ABC system should be neighbors on the chromosome.
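
The neighborhood criterion can be sketched as a grouping of genes by their rank along the chromosome. The sketch assumes a linear gene order, ignores strand and chromosome circularity, and uses invented positions; `group_neighbors` and its `max_gap` parameter are hypothetical names.

```python
def group_neighbors(positions, max_gap=1):
    """Group genes whose ranks along the chromosome differ by at most max_gap
    from the previous gene in the group."""
    ordered = sorted(positions, key=positions.get)
    groups = []
    for gene in ordered:
        if groups and positions[gene] - positions[groups[-1][-1]] <= max_gap:
            groups[-1].append(gene)
        else:
            groups.append([gene])
    return groups

# Toy gene ranks on one chromosome (invented).
positions = {"sbp1": 10, "msd1": 11, "nbd1": 12, "nbd2": 57, "msd2": 58}
systems = group_neighbors(positions)
print(systems)
```

Each resulting group is a candidate system whose members can then be crossed with the ortholog communities.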

Communities detected

Definitions
Homologs: sequences derived from the same ancestral sequence
– Paralogs: homologs whose divergence is due to a duplication
– Orthologs: homologs whose divergence dates back to a speciation event
– Isorthologs: orthologs for which no duplication event occurred after speciation
1:1 orthologs are more likely to have retained the same function.
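
These definitions translate directly into a rule on a gene tree whose internal nodes are labeled with the event they represent: two genes are orthologs if their last common ancestor is a speciation, and paralogs if it is a duplication. The sketch below assumes such a labeled tree is already available. The tree itself is a toy example, and the tuple encoding and function names are made up for illustration.

```python
def leaves(node):
    """Set of gene names below a node; a leaf is a plain string."""
    if isinstance(node, str):
        return {node}
    _, left, right = node
    return leaves(left) | leaves(right)

def relation(tree, a, b):
    """Classify two distinct genes by the event at their last common ancestor."""
    event, left, right = tree
    for child in (left, right):
        if {a, b} <= leaves(child):
            return relation(child, a, b)  # both genes lie deeper in this subtree
    return "orthologs" if event == "speciation" else "paralogs"

# Toy tree: an ancestral duplication, then a speciation in each copy,
# then a second duplication in one lineage.
tree = (
    "duplication",
    ("speciation", "B", "C"),
    ("speciation", "B2", ("duplication", "C2", "C22")),
)
for pair in [("B", "C"), ("B", "B2"), ("B2", "C2"), ("C2", "C22")]:
    print(pair, relation(tree, *pair))
```

In this toy tree B2 and C2 come out as orthologs but not isorthologs, because a duplication follows their speciation; B and C have no later duplication and are 1:1.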

Community detection

[Figure: a common ancestor, a speciation event, then evolution (= mutations) leading to today's organisms; chromosome (DNA); sequence similarity]