Microbial Evolution Zoology/Anthro/Botany 410 Nicole T. Perna April 24, 2014
A couple of key facts: Prokaryotes have been around a long time ( GYA). Bacteria and Archaea diverged a very long time ago and are not more closely related to each other than to eukaryotes. Prokaryotes exhibit tremendous diversity of habitats, lifestyles, and metabolic strategies.
Important applications of microbial evolution
Critical Topics Already Introduced Genomic revolution and genome evolution – Core vs. Variable fractions of genomes – Pan-genome – Genome size and organization Horizontal (Lateral) Gene Transfer (HGT) – There is no “tree of life” – How frequent is HGT? Bacterial species - is there such a thing? What do we mean?
Assigned reading
Microbial genome sequence availability is exponentially increasing
NCBI Genome Project List, as of 4/19/2012 (2013 values in parentheses):
2029 complete bacterial genomes (2510)
134 complete archaea (262)
3313 draft bacteria (>10K)
44 draft archaea
4600 bacteria – no data yet
49 archaea – no data yet
How well sampled is prokaryotic diversity by current genome sequences? Koonin and Wolf 2008 perspective: Uncultivated organisms remain problematic Only 10% of the genes in major metagenomic samplings have no detectable homologs “The possibility, certainly, remains that major new and, perhaps, unusual groups of archaea and bacteria dwell in complex and unusual habitats. Nevertheless, it appears likely that the current collections of archaeal and bacterial genomes provide a reasonable approximation of the diversity of prokaryotic life forms on earth.”
Genomic Encyclopedia of Bacteria and Archaea (GEBA) Project Objective – sequence genomes selected solely for their phylogenetic novelty (plus in depth sampling of a single phylum) …based on 16S rDNA tree Wu et al. Nature Dec 24; 462(7276):
DY Wu et al. Nature 462, (2009) doi: /nature08656 Maximum-likelihood phylogenetic tree of the bacterial domain based on a concatenated alignment of 31 broadly conserved protein-coding genes. Phyla are distinguished by colour of the branch and GEBA genomes are indicated in red in the outer circle of species names. 53 GEBA bacteria accounted for 2.8–4.4 times more phylogenetic diversity than randomly sampled subsets of 53 non-GEBA bacterial genomes.
DY Wu et al. Nature 462, (2009) doi: /nature08656 Rate of discovery of protein families as a function of phylogenetic breadth of genomes. The project even discovered a bacterial homolog of the eukaryotic cytoskeleton protein actin.
Evolution-oriented reasons to target genomes for sequencing Maximize sampling of diversity Understand structure of particular populations and/or species Make targeted comparisons to understand the genetic basis of phenotypic differences
Size and organization of microbial genomes (Koonin and Wolf 2008) Size Range = 180 Kbp – 13 Mbp
Structure of a prokaryotic genome One circular chromosome is typical. Some have other replicons, such as linear or circular plasmids. Some have more than one chromosome, generally distinguished from a plasmid by the presence of at least one “essential” gene. Some have linear chromosomes.
Fitch WM. Trends Genet May;16(5):
Analogy vs Homology Analogy The relationship of any two characters that have descended convergently from unrelated ancestors. Homology The relationship of any two characters that have descended, usually with divergence, from a common ancestral character.
Orthology The relationship of any two homologous characters whose common ancestor lies in the cenancestor of the taxa from which the two sequences were obtained. Paralogy The relationship of any two homologous characters arising from a duplication of the gene for that character. Xenology The relationship of any two homologous characters whose history, since their common ancestor, involves an interspecies (horizontal) transfer of the genetic material for at least one of those characters.
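The three definitions above can be read as a decision rule on a gene tree. The following is a minimal sketch, assuming a gene tree whose internal nodes are already labeled as speciation or duplication events and whose horizontally transferred branches are already flagged. The gene names (X1, Y1, X2, Y2, Z1), the tree, and the function names are hypothetical and are not the tree from the Test Yourself slide.

```python
from typing import Dict, List, Set


def path_to_root(node: str, parent: Dict[str, str]) -> List[str]:
    """Return the node followed by all of its ancestors up to the root."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def classify_pair(a: str, b: str, parent: Dict[str, str],
                  event: Dict[str, str], transfer_edges: Set[str]) -> str:
    """Classify two homologous genes following Fitch's definitions.

    parent         : child node -> parent node
    event          : internal node -> "speciation" or "duplication"
    transfer_edges : nodes whose incoming branch is a horizontal transfer
    """
    path_a = path_to_root(a, parent)
    path_b = path_to_root(b, parent)
    ancestors_b = set(path_b)
    # The first ancestor of a that is also an ancestor of b is their
    # most recent common ancestor in the gene tree.
    lca = next(n for n in path_a if n in ancestors_b)
    # All nodes strictly below the common ancestor on either lineage.
    below = path_a[:path_a.index(lca)] + path_b[:path_b.index(lca)]
    if any(n in transfer_edges for n in below):
        return "xenologs"   # a transfer occurred since the common ancestor
    if event[lca] == "duplication":
        return "paralogs"   # the lineages split at a duplication
    return "orthologs"      # the lineages split at a speciation


# Hypothetical gene tree: an ancient duplication gives two gene copies,
# each copy then follows the X/Y speciation, and the X1 lineage later
# donates a copy (Z1) to a third taxon by horizontal transfer.
parent = {"X1": "t", "Z1": "t", "t": "sp1", "Y1": "sp1", "sp1": "dup",
          "X2": "sp2", "Y2": "sp2", "sp2": "dup"}
event = {"dup": "duplication", "sp1": "speciation",
         "sp2": "speciation", "t": "transfer"}
transfer_edges = {"Z1"}

for pair in [("X1", "Y1"), ("X1", "Y2"), ("Y1", "Z1")]:
    print(pair, classify_pair(*pair, parent, event, transfer_edges))
```

In practice the hard part is inferring the event labels, which requires reconciling the gene tree with a species tree; the sketch simply takes them as given.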
Test Yourself A1 – B1 A1 – B2 A1 – C3 B1 – C2 C2 – C3 B2 – C3 C3 – AB1
Homology on a Genome-Scale How many and which genes are common to two or more organisms? Which genes differentiate one organism from another? How is homology related to function?
A phylogenetic perspective Orthologs are the set of genes/proteins with gene trees identical to the species tree. We can understand other types of homology relationships by comparison to the species tree. But often we don’t know the species tree, and phylogenetic methods are complex
Consider two genomes Use BLASTP to compare one set of proteins (proteome) to the other Which set will you use as the query and which as the database? What criteria will you use to define “a match”? GenomeA – gene 1 GenomeA – gene 2 GenomeA – gene 3 GenomeB– gene 1 GenomeB – gene 2 GenomeB – gene 3 A1, A3, B2 and B3 are homologs (assuming the aligned regions overlap)
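One way to make the "what counts as a match" question concrete is to filter BLASTP tabular output on E-value and query coverage. The following is a minimal sketch, assuming the default 12-column tabular output of BLAST+ (`-outfmt 6`). The cutoffs (E-value at most 1e-5, at least 50% of the query aligned), the example hits, and the protein lengths are illustrative assumptions, not values from the lecture.

```python
from typing import Dict, Iterable, List, NamedTuple


class Hit(NamedTuple):
    query: str
    subject: str
    pident: float
    qstart: int
    qend: int
    evalue: float
    bitscore: float


def parse_outfmt6(lines: Iterable[str]) -> List[Hit]:
    """Parse default BLAST+ tabular rows: qseqid sseqid pident length
    mismatch gapopen qstart qend sstart send evalue bitscore."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        f = line.split("\t")
        hits.append(Hit(f[0], f[1], float(f[2]), int(f[6]), int(f[7]),
                        float(f[10]), float(f[11])))
    return hits


def is_match(hit: Hit, query_lengths: Dict[str, int],
             max_evalue: float = 1e-5, min_coverage: float = 0.5) -> bool:
    """Assumed criteria: significant E-value and enough of the query aligned."""
    aligned = abs(hit.qend - hit.qstart) + 1
    coverage = aligned / query_lengths[hit.query]
    return hit.evalue <= max_evalue and coverage >= min_coverage


# Made-up hits for illustration only.
rows = [
    "A1\tB3\t45.0\t200\t100\t5\t1\t200\t10\t210\t1e-40\t150",  # passes both cutoffs
    "A2\tB1\t30.0\t40\t25\t2\t1\t40\t5\t45\t0.5\t22",           # E-value too high
    "A3\tB2\t60.0\t50\t20\t0\t1\t50\t1\t50\t1e-10\t80",         # short local match
]
query_lengths = {"A1": 250, "A2": 120, "A3": 300}  # hypothetical protein lengths

matches = [h for h in parse_outfmt6(rows) if is_match(h, query_lengths)]
print([(h.query, h.subject) for h in matches])
```

The coverage filter is one way to address the caveat on the slide that hits only imply homology across the aligned regions; a short shared domain can otherwise link proteins that are not homologous over their full length.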
Reciprocal Best Hits Use BLASTP to compare sets of proteins (proteome) to each other – First using GenomeA to query against GenomeB – Then using GenomeB to query against GenomeA – Save only one best match for each query – Save only the reciprocal best matches as “orthologs” [Figure: three panels, each listing GenomeA genes 1–3 and GenomeB genes 1–3] Lose A3-B2 and A1-B3 homology
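The reciprocal best hit procedure can be sketched in a few lines. This is a minimal illustration, assuming each BLASTP search has already been reduced to (query, subject, bit score) triples and that "best" means highest bit score. The scores are invented so that the outcome mirrors the slide: A1, A3, B2 and B3 are all homologs, but only two pairs survive.

```python
from typing import Dict, Iterable, List, Tuple

HitRow = Tuple[str, str, float]  # (query, subject, bit score)


def best_hits(hits: Iterable[HitRow]) -> Dict[str, str]:
    """Keep only the single highest-scoring subject for each query.
    On a tie, the hit seen first is kept."""
    best: Dict[str, Tuple[str, float]] = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b: Iterable[HitRow],
                         b_vs_a: Iterable[HitRow]) -> List[Tuple[str, str]]:
    """Return (geneA, geneB) pairs that are each other's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Invented bit scores for illustration.
a_vs_b = [("A1", "B2", 200.0), ("A1", "B3", 120.0),
          ("A3", "B3", 180.0), ("A3", "B2", 90.0)]
b_vs_a = [("B2", "A1", 200.0), ("B2", "A3", 90.0),
          ("B3", "A3", 180.0), ("B3", "A1", 120.0)]

print(reciprocal_best_hits(a_vs_b, b_vs_a))
# [('A1', 'B2'), ('A3', 'B3')]
```

The weaker A3-B2 and A1-B3 relationships are real homologies in this toy data, yet they are discarded, which is the information loss the slide points out.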
Software/Methods for Predicting Orthologs from Genome Sequences RBH RSD (Reciprocal Shortest Distance) INPARANOID RIO Orthostrapper Ortholuge TribeMCL OrthoMCL
Method Comparison Chen F, Mackey AJ, Vermunt JK, Roos DS. PLoS ONE Apr 18;2(4):e383.
Core and variable genes – single genome perspective A small number of genes have orthologs in all microbial genomes (core) More genes have orthologs in many genomes, but not all (shell) Some genes are rare and have orthologs in only a few genomes (cloud) Some are unique to one genome (ORFans)
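Given a table of which ortholog families occur in which genomes, the four categories above amount to a simple count. The following is a minimal sketch. The slide only says "many" versus "a few", so the boundary between shell and cloud (`cloud_max_fraction`, set here to 25% of genomes) is an arbitrary assumption, as are the family and genome names.

```python
from typing import Dict, Set


def classify_families(presence: Dict[str, Set[str]], n_genomes: int,
                      cloud_max_fraction: float = 0.25) -> Dict[str, str]:
    """Label each ortholog family by how many genomes contain it.

    presence           : family name -> set of genomes with a member
    cloud_max_fraction : assumed cutoff separating cloud from shell
    """
    labels = {}
    for family, genomes in presence.items():
        k = len(genomes)
        if k == n_genomes:
            labels[family] = "core"      # present in every genome
        elif k == 1:
            labels[family] = "ORFan"     # unique to one genome
        elif k / n_genomes <= cloud_max_fraction:
            labels[family] = "cloud"     # rare
        else:
            labels[family] = "shell"     # common but not universal
    return labels


# Hypothetical presence/absence data for ten genomes.
genomes = [f"g{i}" for i in range(10)]
presence = {
    "famA": set(genomes),        # 10 of 10
    "famB": set(genomes[:7]),    # 7 of 10
    "famC": set(genomes[:2]),    # 2 of 10
    "famD": {"g4"},              # 1 of 10
}

labels = classify_families(presence, n_genomes=len(genomes))
print(labels)
```

Because the shell/cloud boundary is a convention, it is worth checking how sensitive any downstream summary is to the chosen cutoff.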
Core and variable genes – species perspective (pan-genome) For some species as a whole, the number of core (plus shell) genes can be much smaller than the variable fraction (cloud plus ORFans), and the pan-genome can be very large. Touchon et al. PLoS Genetics. 2009
Different types of pan-genomes Figure 3. Power law regression for species with open and closed pan-genomes. Tettelin et al. Curr Opin Microbiology 2008:11(5).
Open vs Closed Pan-genomes Open – Number of new genes discovered continues to grow as additional genomes of the species are sequenced – Organisms live in diverse environments and are genetically amenable to horizontal gene transfer Closed – Number of new genes discovered is very small as additional genomes of the species are sequenced – Organisms have little exposure to other organisms and/or are refractory to horizontal gene transfer
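The open versus closed distinction can be made quantitative by fitting a power law to the number of new genes found as each genome is added, in the spirit of the regression in the preceding figure. The following is a minimal sketch, assuming the form n = κ·N^(−α), where N is the number of genomes sequenced and n the number of new genes contributed by the N-th genome. The cutoff at α = 1 (open if α ≤ 1, closed if α > 1) is assumed here as the usual convention for this kind of fit, and both data series are synthetic.

```python
import math
from typing import Sequence, Tuple


def fit_power_law(n_genomes: Sequence[int],
                  new_genes: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of log(n) = log(kappa) - alpha * log(N).

    All values must be positive so that the logarithms are defined.
    Returns (kappa, alpha).
    """
    xs = [math.log(n) for n in n_genomes]
    ys = [math.log(g) for g in new_genes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


def pan_genome_type(alpha: float) -> str:
    """Assumed convention: alpha > 1 means new-gene discovery dies off
    fast enough for the pan-genome to be treated as closed."""
    return "closed" if alpha > 1 else "open"


# Synthetic new-gene counts generated from exact power laws.
counts = list(range(2, 11))
slow_decay = [500 * n ** -0.5 for n in counts]   # new genes keep appearing
fast_decay = [500 * n ** -1.8 for n in counts]   # new genes quickly run out

for label, series in [("slow decay", slow_decay), ("fast decay", fast_decay)]:
    kappa, alpha = fit_power_law(counts, series)
    print(label, round(alpha, 3), pan_genome_type(alpha))
```

With real data the new-gene counts depend on the order in which genomes are added, so they are normally averaged over many random orderings before fitting.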
Horizontal Gene Transfer Mechanisms include conjugation, transduction and transformation Can introduce entirely new genes and gene clusters into genomes (grow the pan-genome) Can replace existing genes with functionally equivalent (?) xenologs (scramble phylogenetic history)
Horizontal Gene Transfer How prevalent is it? – We don’t know. Debates continue largely based on the challenges of separating the error associated with phylogenetic reconstruction from true differences in phylogenetic signal Who is doing it? – We don’t know. Same problem as above. – Good evidence that it is much more frequent within (some) species than between – Some evidence for relationship with evolutionary distance and/or commonality of environment
SSU rDNA perspective
EVOLUTION: Genome Data Shake Tree of Life E Pennisi - Science, sciencemag.org The ring of life provides evidence for a genome fusion origin of eukaryotes MC Rivera, JA Lake - Nature, 2004 The net of life: reconstructing the microbial phylogenetic network V Kunin, L Goldovsky, N Darzentas, CA … - Genome Research 2005 The tree of one percent T Dagan, W Martin - Genome biology, 2006 Uprooting the tree of life WF Doolittle - Evolution: a Scientific American reader, 2006
Comparison of phylogenies for nearly universally conserved genes 102 ML trees for 100 taxa Objective – compare topological distance between trees New metric called IS (inconsistency score) = fraction of splits two trees have in common The network of similarities among the nearly universal trees (NUTs). (a) Each node (green dot) denotes a NUT, and nodes are connected by edges if the similarity between the respective trees exceeds the indicated threshold. (b) The connectivity of 102 NUTs and the 14 1:1 NUTs depending on the topological similarity threshold.
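Comparing trees by the splits they share can be illustrated on a toy scale. Each internal branch of a tree divides the taxa into two groups (a split), and two topologies are similar to the extent that they contain the same splits. The following is a minimal sketch that computes shared splits divided by all distinct splits. That normalization is an assumption made for illustration and should not be taken as the published IS formula. Trees are written as nested tuples of hypothetical taxon names.

```python
from typing import FrozenSet, Set, Tuple, Union

Tree = Union[str, Tuple["Tree", ...]]
Split = FrozenSet[FrozenSet[str]]


def leaves(tree: Tree) -> FrozenSet[str]:
    """Return the set of taxon names under a (sub)tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def splits(tree: Tree) -> Set[Split]:
    """Collect the non-trivial bipartitions induced by internal branches."""
    all_leaves = leaves(tree)
    found: Set[Split] = set()

    def visit(node: Tree) -> None:
        if isinstance(node, str):
            return
        clade = leaves(node)
        other = all_leaves - clade
        # Ignore trivial splits that separate fewer than two taxa.
        if len(clade) >= 2 and len(other) >= 2:
            found.add(frozenset([clade, other]))
        for child in node:
            visit(child)

    visit(tree)
    return found


def shared_split_fraction(t1: Tree, t2: Tree) -> float:
    """Shared splits divided by all distinct splits (assumed normalization)."""
    if leaves(t1) != leaves(t2):
        raise ValueError("trees must contain the same taxa")
    s1, s2 = splits(t1), splits(t2)
    if not s1 and not s2:
        return 1.0
    return len(s1 & s2) / len(s1 | s2)


# Two hypothetical five-taxon trees that differ in the placement of B and C.
tree1 = ((("A", "B"), "C"), ("D", "E"))
tree2 = ((("A", "C"), "B"), ("D", "E"))

print(shared_split_fraction(tree1, tree1))
print(shared_split_fraction(tree1, tree2))
```

Both toy trees contain the split ABC|DE, but they disagree on whether A groups with B or with C, so one of three distinct splits is shared.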
Real trees are more similar to each other than randomly simulated trees Although no single tree appears to represent the evolutionary history of these organisms, there is distinctly preserved phylogenetic signal across the dataset as a whole
The big divide? Look for evidence of HGT between bacteria and archaea 56% of NUTs separated the groups perfectly 44% show at least one HGT – 13% from archaea to bacteria – 23% from bacteria to archaea – 8% both directions The supernetwork of the NUTs. Puigbò et al. Journal of Biology :59 doi: /jbiol159
Expanding to ~6800 other predicted ortholog clusters Network connectivity is greatly reduced Different functional categories of genes show different levels of connectedness Network representation of the 6,901 trees of the forest of life. The 102 NUTs are shown as red circles in the middle. The NUTs are connected to trees with similar topologies: trees with at least 50% of similarity with at least one NUT (P-value < 0.05) are shown as purple circles and connected to the NUTs. The rest of the trees are shown as green circles. Puigbò et al. Journal of Biology :59 doi: /jbiol159
Proc Natl Acad Sci U S A Oct 4;102(40):
Highways of obligate gene transfer within and among phyla and divisions of prokaryotes, based on analysis of the 22,348 protein trees for which a minimal edit path could be resolved Beiko R G et al. PNAS 2005;102: ©2005 by National Academy of Sciences
Horizontal Transfer within species Estimate that a given basepair is 100 times more likely to have undergone a recombination event than a point mutation within the species E. coli, so how can we justify representing the relationship between strains with a tree like structure? Modeling and simulation support inference of a tree summarizing dominant signal AS LONG AS patterns of recombination are more or less random between lineages Touchon et al. PLoS Genetics. 2009
Major processes affecting prokaryotic genome evolution (Koonin and Wolf, 2008) (1) Genome streamlining under strong selection. (2) Neutral gene loss and genome degradation under weak selection (or neutral). (3) Innovation and complexification via gene duplication. (4) Innovation via operon shuffling. (5) Innovation and complexification via HGT, in particular, of partially selfish operons, a process that often leads to nonorthologous gene displacement. (6) Replicon fusion, propagation of mobile elements and other interactions between the relatively stable chromosomes and the mobilome.
Test Yourself A1 – B1 = Ortho A1 – B2 = Ortho A1 – C3 = Ortho B1 – C2 = Para (out) C2 – C3 = Para (in) B2 – C3 = Ortho C3 – AB1 = Xeno