Bioinformatics and Evolutionary Genomics Genome Evolution (I) and Genomics Context for function prediction.

Slides:



Advertisements
Similar presentations
Blast to Psi-Blast Blast makes use of Scoring Matrix derived from large number of proteins. What if you want to find homologs based upon a specific gene.
Advertisements

Journal Club Jenny Gu October 24, Introduction Defining the subset of Superfamilies in LUCA Examine adaptability and expansion of particular superfamilies.
GENE TREES Abhita Chugh. Phylogenetic tree Evolutionary tree showing the relationship among various entities that are believed to have a common ancestor.
Basics of Comparative Genomics Dr G. P. S. Raghava.
Phylogenetic reconstruction
Duplication, rearrangement, and mutation of DNA contribute to genome evolution Chapter 21, Section 5.
Predicting interactions between genes based on genome sequence comparisons The “genomic context” component of STRING Bioinformatics seminar series
Systems Biology Existing and future genome sequencing projects and the follow-on structural and functional analysis of complete genomes will produce an.
Predicting interactions between genes based on genome Sequence comparisons The “genomic context” component of STRING Bioinformatics seminar series
Current Approaches to Whole Genome Phylogenetic Analysis Hongli Li.
Genetica per Scienze Naturali a.a prof S. Presciuttini Human and chimpanzee genomes The human and chimpanzee genomes—with their 5-million-year history.
Finding Orthologous Groups René van der Heijden. What is this lecture about? What is ‘orthology’? Why do we study gene-ancestry/gene-trees (phylogenies)?
CS273a Lecture 8, Win07, Batzoglou Evolution at the DNA level …ACGGTGCAGTTACCA… …AC----CAGTCCACCA… Mutation SEQUENCE EDITS REARRANGEMENTS Deletion Inversion.
Evolution of minimal metabolic networks WANG Chao April 11, 2006.
Computational Molecular Biology (Spring’03) Chitta Baral Professor of Computer Science & Engg.
Protein-protein interactions
Adaptive evolution of bacterial metabolic networks by horizontal gene transfer Chao Wang Dec 14, 2005.
Tree Pattern Matching in Phylogenetic Trees Automatic Search for Orthologs or Paralogs in Homologous Gene Sequence Databases By: Jean-François Dufayard,
FOG: High-Resolution Fungal Orthologous Groups René van der Heijden Project 5.10: Comparative genomics for the prediction of protein function and pathways.
Alternative splicing and evolution Daniel Jeffares.
BACKGROUND E. coli is a free living, gram negative bacterium which colonizes the lower gut of animals. Since it is a model organism, a lot of experimental.
Bas E. Dutilh Phylogenomics Using complete genomes to determine the phylogeny of species.
Protein Modules An Introduction to Bioinformatics.
Finding Orthologous Groups René van der Heijden. What is this lecture about? What is ‘orthology’? Why do we study gene-ancestry/gene-trees (phylogenies)?
Genomic Rearrangements CS 374 – Algorithms in Biology Fall 2006 Nandhini N S.
Bioinformatics Genome anatomy Comparisons of some eukaryotic genomes Allignment of long genomic sequences Comparative genomics Oxford Grid Reconstruction.
TGCAAACTCAAACTCTTTTGTTGTTCTTACTGTATCATTGCCCAGAATAT TCTGCCTGTCTTTAGAGGCTAATACATTGATTAGTGAATTCCAATGGGCA GAATCGTGATGCATTAAAGAGATGCTAATATTTTCACTGCTCCTCAATTT.
Genome projects and model organisms Level 3 Molecular Evolution and Bioinformatics Jim Provan.
Multiple Sequence Alignments and Phylogeny.  Within a protein sequence, some regions will be more conserved than others. As more conserved,
Phylogeny Estimation: Traditional and Bayesian Approaches Molecular Evolution, 2003
Functional Linkages between Proteins. Introduction Piles of Information Flakes of Knowledge AGCATCCGACTAGCATCAGCTAGCAGCAGA CTCACGATGTGACTGCATGCGTCATTATCTA.
Functional Associations of Protein in Entire Genomes Sequences Bioinformatics Center of Shanghai Institutes for Biological Sciences Bingding.
Chapter 26: Phylogeny and the Tree of Life Objectives 1.Identify how phylogenies show evolutionary relationships. 2.Phylogenies are inferred based homologies.
Lecture 25 - Phylogeny Based on Chapter 23 - Molecular Evolution Copyright © 2010 Pearson Education Inc.
TGCAAACTCAAACTCTTTTGTTGTTCTTACTGTATCATTGCCCAGAATAT TCTGCCTGTCTTTAGAGGCTAATACATTGATTAGTGAATTCCAATGGGCA GAATCGTGATGCATTAAAGAGATGCTAATATTTTCACTGCTCCTCAATTT.
Ch. 21 Genomes and their Evolution. New approaches have accelerated the pace of genome sequencing The human genome project began in 1990, using a three-stage.
Identifying conserved segments in rearranged and divergent genomes Bob Mau, Aaron Darling, Nicole T. Perna Presented by Aaron Darling.
Protein and RNA Families
Anis Karimpour-Fard ‡, Ryan T. Gill †,
Genome Analysis II Comparative Genomics Jiangbo Miao Apr. 25, 2002 CISC889-02S: Bioinformatics.
Metabolic Network Inference from Multiple Types of Genomic Data Yoshihiro Yamanishi Centre de Bio-informatique, Ecole des Mines de Paris.
Comparative genomics Haixu Tang School of Informatics.
Anis Karimpour-Fard 1, Corrella Detweiler 2, Ryan T. Gill 3, and Lawrence Hunter 1 1 University of Colorado School of Medicine 2 MCD-Biology, University.
I. Prolinks: a database of protein functional linkage derived from coevolution II. STRING: known and predicted protein-protein associations, integrated.
Cédric Notredame (08/12/2015) Molecular Evolution Cédric Notredame.
Functional and Evolutionary Attributes through Analysis of Metabolism Sophia Tsoka European Bioinformatics Institute Cambridge UK.
Genome annotation and search for homologs. Genome of the week Discuss the diversity and features of selected microbial genomes. Link to the paper describing.
Nothing in (computational) biology makes sense except in the light of evolution after Theodosius Dobzhansky (1970) Comparative genomics, genome context.
1 Computational functional genomics Lital Haham Sivan Pearl.
Ribonucleotide reductases (RNRs) catalyse the reduction of ribonucleotides to their corresponding 2`-deoxyribonucleotides and therefore play an essential.
Bioinformatics Research Overview Li Liao Develop new algorithms and (statistical) learning methods > Capable of incorporating domain knowledge > Effective,
Discovering functional linkages and uncharacterized cellular pathways using phylogenetic profile comparisons: a comprehensive assessment Raja Jothi, Teresa.
The Biologist’s Wishlist A complete and accurate set of all genes and their genomic positions A set of all the transcripts produced by each gene The location.
Phylogenetic genome analysis, phylogenomics
Bioinformatics Overview
Evolution of eukaryotic genomes
Nothing in (computational) biology makes
Fig. 1. — The life cycle of S. papillosus. (A) The life cycle of S
Basics of Comparative Genomics
FLiPS Functional Linkage Prediction Service.
The Mimivirus Giant double stranded DNA virus Discovered in amoebas
Protein Interaction Networks
Chapter 19 Molecular Phylogenetics
Bioinformatics, Vol.17 Suppl.1 (ISMB 2001) Weekly Lab. Seminar
Gautam Dey, Tobias Meyer  Cell Systems 
Yamanishi, M., Itoh, M., Kanehisa, M.
Basics of Comparative Genomics
GENOMICS Copyright © 2009 Pearson Education, Inc..
Presentation transcript:

Bioinformatics and Evolutionary Genomics Genome Evolution (I) and Genomics Context for function prediction

(bacterial) Genome Evolution: added value of genomes How does gene content evolve?How does gene content evolve? How does gene order evolve?How does gene order evolve? How important are various evolutionary dynamics of genes on a genomic scale (e.g. gene fusion, gene loss, gene duplication): moving from anecdotes to trendsHow important are various evolutionary dynamics of genes on a genomic scale (e.g. gene fusion, gene loss, gene duplication): moving from anecdotes to trends How does gene content evolve?How does gene content evolve? How does gene order evolve?How does gene order evolve? How important are various evolutionary dynamics of genes on a genomic scale (e.g. gene fusion, gene loss, gene duplication): moving from anecdotes to trendsHow important are various evolutionary dynamics of genes on a genomic scale (e.g. gene fusion, gene loss, gene duplication): moving from anecdotes to trends

functionally associated proteins leave evolutionary traces of their relation in genomes Genomic context / in silico interaction prediction

Gene order evolution: -Establish orthologous relations between pairs of genomes (e.g. S-W best bidirectional hit approach -Put them in a dotplot, color the relative direction of transcription (Green for the same relative direction. Red for the opposite direction.) Gene order evolution: -Establish orthologous relations between pairs of genomes (e.g. S-W best bidirectional hit approach -Put them in a dotplot, color the relative direction of transcription (Green for the same relative direction. Red for the opposite direction.)

Evolution of genome organization: -In prokaryotes, genome inversions centered around the origin/terminus of replication are a major source of genome rearrangements. -This suggests that both replication forks are in close contact -> comparative genome analysis provides support for a hypothesis about genome replication “and a close proximity of the forks would increase the probability of reciprocal recombination or transposition between sequences at the two forks. That the forks are near each other is also consistent with the 'replication factory' model based on immunolocalization of components of the replication machinery in Bacillus subtilis” (Tillier and Collins, Nat. Gen)” b

Gene order evolves rapidly But …

Differential retention of divergent / convergent gene pairs suggests that conservation implies a functional association Operons Gene Order Evolution

Conserved gene order i.e. genes that are present over ‘sufficiently large’ evolutionary distances in the same gene clusteri.e. genes that are present over ‘sufficiently large’ evolutionary distances in the same gene cluster Contributes many reliable predictionsContributes many reliable predictions i.e. genes that are present over ‘sufficiently large’ evolutionary distances in the same gene clusteri.e. genes that are present over ‘sufficiently large’ evolutionary distances in the same gene cluster Contributes many reliable predictionsContributes many reliable predictions

Conserved gene order NB1 predicting operons is not trivial; in fact conserved gene order or functional association is a major clue NB2 using ‘only’ operons without requiring conservation results in much less reliable function prediction

Comparison to pathways conservation implies a functional association

Conserved gene order: an example from Conserved gene order: an example from metabolism of propionyl-CoA “query” “target”

Biochemical assays confirm the function of members of COG0346 as a DL- methylmalonyl-CoA racemase

Gene Fusion “Rare” (especially in prokaryotes): ~3000 linked COGs in STRING v6 (~180 genomes)“Rare” (especially in prokaryotes): ~3000 linked COGs in STRING v6 (~180 genomes) But what about domain recombination?But what about domain recombination? “Rare” (especially in prokaryotes): ~3000 linked COGs in STRING v6 (~180 genomes)“Rare” (especially in prokaryotes): ~3000 linked COGs in STRING v6 (~180 genomes) But what about domain recombination?But what about domain recombination? FusionFusion

Gene fusion i.e. the orthologs of two genes in another organism are fused into one polypeptidei.e. the orthologs of two genes in another organism are fused into one polypeptide A very reliable indicator for functional interaction; partly because it is an relatively infrequent evolutionary event:A very reliable indicator for functional interaction; partly because it is an relatively infrequent evolutionary event: i.e. the orthologs of two genes in another organism are fused into one polypeptidei.e. the orthologs of two genes in another organism are fused into one polypeptide A very reliable indicator for functional interaction; partly because it is an relatively infrequent evolutionary event:A very reliable indicator for functional interaction; partly because it is an relatively infrequent evolutionary event:

Gene fusion: an example

Gene Content Evolution

What about HGT? Genome trees based on gene content: Escherichia coli Haemophilus influenzae shared genes Species specific genes genes genes genes

Genome trees based on gene content ( ) # shared OGs (spA, spB) Weighted average Genome size(spA, spB) Weighted average Genome size(spA, spB) \s sp1 sp2 sp3 sp4 … sp1 \ … sp2 \ … sp3 \1 0.3 … sp4 \1 … … … … … … \s sp1 sp2 sp3 sp4 … sp1 \ … sp2 \ … sp3 \1 0.3 … sp4 \1 … … … … … … Neighbor joining d d dist (spA, spB) = 1 – OG1 OG2 OG3 OG4 … sp … sp … sp … … … … … … OG1 OG2 OG3 OG4 … sp … sp … sp … … … … … … Presence / absence matrix:

Genome trees based on gene content are remarkably similar to consensus on ToL C. pneumoniae C. trachomatis M. tuberculosis M. pneumoniae M. genitalium B. subtilis T. pallidum T. maritima B. burgdorferi P. horikoshii M. thermoautotrophicum A. fulgidus M. jannaschii S. cerevisiae C. elegans A. aeolicus E. coli H. influenzae R. prowazekii H. pylori Synechocystis sp. H. pylori J99 A. pernix Proteobacteria Eukarya Euryarchaeota Spirochaetales 100

Reconstruction of Gene Content b

Deletion Gain Ancestral Genome Reconstruction of LUCA : patchy gene distributions

ParsimonyParsimony Attach a cost (c) to HGT / independent gain in terms of loss; find scenario with lowest costAttach a cost (c) to HGT / independent gain in terms of loss; find scenario with lowest cost At g = 1.5, 733 genes in LUCAAt g = 1.5, 733 genes in LUCA At g = 2, 956 genes in LUCAAt g = 2, 956 genes in LUCA Evolution is not parsimonious, minimal estimate?Evolution is not parsimonious, minimal estimate? Why not use gene trees?Why not use gene trees? Attach a cost (c) to HGT / independent gain in terms of loss; find scenario with lowest costAttach a cost (c) to HGT / independent gain in terms of loss; find scenario with lowest cost At g = 1.5, 733 genes in LUCAAt g = 1.5, 733 genes in LUCA At g = 2, 956 genes in LUCAAt g = 2, 956 genes in LUCA Evolution is not parsimonious, minimal estimate?Evolution is not parsimonious, minimal estimate? Why not use gene trees?Why not use gene trees? b

Nice results e.g. nucleotide biosynthesis

Another attempt to reconstruct the genome of LUCA over 1000 gene families, of which more than 90% are also functionally characterized.over 1000 gene families, of which more than 90% are also functionally characterized. a fairly complex genome similar to those of free-living prokaryotes, with a variety of functional capabilities including metabolic transformation, information processing, membrane/transport proteins and complex regulation,a fairly complex genome similar to those of free-living prokaryotes, with a variety of functional capabilities including metabolic transformation, information processing, membrane/transport proteins and complex regulation, over 1000 gene families, of which more than 90% are also functionally characterized.over 1000 gene families, of which more than 90% are also functionally characterized. a fairly complex genome similar to those of free-living prokaryotes, with a variety of functional capabilities including metabolic transformation, information processing, membrane/transport proteins and complex regulation,a fairly complex genome similar to those of free-living prokaryotes, with a variety of functional capabilities including metabolic transformation, information processing, membrane/transport proteins and complex regulation,

Presence / absence of genes Gene content  co-evolution. (The easy case, few genomes. ) Genomes share genes for phenotypes they have in common Differences between gene Content reflect differences in Phenotypic potentialities Differences between gene Content reflect differences in Phenotypic potentialities

Qualitative differential genome analysis: -Find “pathogen specific” specific proteins that can serve as drug targets -Relate the differences between genomes to the differences in the phenotypes Qualitative differential genome analysis: -Find “pathogen specific” specific proteins that can serve as drug targets -Relate the differences between genomes to the differences in the phenotypes

Three-way comparisons Huynen et al., 1998, FEBS Lett

Convergence in functional classes of gene content in small intra cellular bacterial parasites Zomorodipour & Andersson FEBS Letters 1999

Although we can, qualitatively, interpret the variations in shared gene content in terms of the phenotypes of the species, quantitatively they depend on the relative phylogenetic positions of the species. The closer two species are the larger fraction of their genes they share.

Presence / absence of genes L. innocua (non-pathogen) L. monocytogenes (pathogen)

Occurrence of genes L. innocua (non-pathogenic) L. monocytogenes (pathogenic) Genes involved in pathogenecity

Generalization: phylogenetic profiles / co- occurence Gene 1: Gene 2: Gene 3:.... Gene 1: Gene 2: Gene 3:.... species 1 species 2 species 3 species 4 species species 1 species 2 species 3 species 4 species Gene 1: Gene 2: Gene 3: Gene 1: Gene 2: Gene 3: species 1 species 2 species 3 species 4 species species 1 species 2 species 3 species 4 species

Co-occurrence of genes across genomes i.e. two genes have the same presence/ absence pattern over multiple genomes: i.e. two genes have the same presence/ absence pattern over multiple genomes: AKA phylogenetic profilesAKA phylogenetic profiles NB complete genomes absenceNB complete genomes absence Correction for phylogenetic signal needed → eventsCorrection for phylogenetic signal needed → events b

Predicting function of a disease gene protein with unknown function, frataxin, using co-occurrence of genes across genomes Friedreich’s ataxiaFriedreich’s ataxia No (homolog with) known functionNo (homolog with) known function Friedreich’s ataxiaFriedreich’s ataxia No (homolog with) known functionNo (homolog with) known function

A. a e o l i c u s S y n e c h o c y s t i s B. s u b t i l i s M. g e n i t a l i u m M. t u b e r c u l o s i s D. r a d i o d u r a n s R. p r o w a z e k i i C. c r e s c e n t u s M. l o t i N. m e n i n g i t i d i s X. f a s t i d i o s a P. a e r u g i n o s a B u c h n e r a V. c h o l e r a e H. i n f l u e n z a e P. m u l t o c i d a E. coli A. p e r n i x M. j a n n a s c h i i A. t h a l i a n a S. c e r e v i s i a e s C. j e j u n i C. a l b i c a n s S. p o m b e H. s a p i e n s C. e l e g a n H. pylori D.melan. cyaY Yfh1 hscB Jac1 hscA ssq1 Nfu1 iscA Isa1-2 fdx Yah1 Arh1 RnaM IscR Hyp iscS Nfs1 iscU Isu1-2 Atm1 Atm1 Frataxin has co-evolved with hscA and hscB indicating that it plays a role in iron-sulfur cluster assembly

Iron-Sulfur (2Fe-2S) cluster in the Rieske protein

Prediction: ~Confirmation:

functionally associated proteins leave evolutionary traces of their relation in genomes Genomic context / in silico interaction prediction

Evolutionary rate Chen &Dokholyan TiG 2006

Co-evolution: mirrortree Pavos & Valencia PEDS 2001

Co-evolution: mirrortree

Integrating genomic context scores into one single score (post-hoc) Compare each individual method against an independent benchmark (KEGG), and find “equivalency” Compare each individual method against an independent benchmark (KEGG), and find “equivalency” Multiply the chances that two proteins are not interacting and subtract from 1; naive bayesian i.e. assuming independence Multiply the chances that two proteins are not interacting and subtract from 1; naive bayesian i.e. assuming independence