Accurate gene phylogeny across multiple complete genomes
  Species Informed Distance-based Reconstruction
  Matt Rasmussen and Manolis Kellis
The goal
  Determine the evolutionary history of every gene in multiple complete genomes
  From phylogenies, determine:
    Orthologs
    Paralogs
    Duplications
    Losses
    Family expansions
    Varying rates of evolution
    Etc…
Contrast of the phylogenetic method with alternative methods
  Pair-wise sequence comparison
    Best bi-directional BLAST hits
    Focuses on one-to-one orthologs (no duplications)
  Hit clustering methods
    Detect clusters in graph of pair-wise hits
    Difficult to separate large connected components
  Synteny methods
    Detect conserved regions, stretches of nearby hits
    Genome alignment methods focus on best hits
  Phylogenetic methods
    Phylogeny of family clusters orthologs near each other
    Traditionally applied to specific families
    Can they be applied genome-wide?
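As a rough illustration of the best bi-directional hit approach listed on this slide, the sketch below pairs genes from two genomes only when each is the other's top-scoring hit. The function name `best_bidirectional_hits`, the nested-dictionary input, and the toy scores are hypothetical; a real pipeline would parse BLAST output instead.

```python
def best_bidirectional_hits(scores_ab, scores_ba):
    """Return sorted (a, b) pairs where b is a's best hit and a is b's best hit.

    scores_ab maps each gene in genome A to {gene_in_B: score}; scores_ba is
    the same in the other direction. A higher score means a better hit.
    """
    def best(hits):
        # Highest-scoring target; ties go to the alphabetically first name.
        return max(sorted(hits), key=lambda g: hits[g]) if hits else None

    best_ab = {a: best(h) for a, h in scores_ab.items()}
    best_ba = {b: best(h) for b, h in scores_ba.items()}
    return sorted((a, b) for a, b in best_ab.items()
                  if b is not None and best_ba.get(b) == a)


# Toy scores (made up): a1 and a2 are duplicates that both hit b1 best.
scores_ab = {"a1": {"b1": 100, "b2": 90}, "a2": {"b1": 95, "b2": 80}}
scores_ba = {"b1": {"a1": 100, "a2": 95}, "b2": {"a1": 90, "a2": 80}}
pairs = best_bidirectional_hits(scores_ab, scores_ba)
print(pairs)  # [('a1', 'b1')]
```

In this toy case a2 and b2 are left unpaired, which is the limitation the slide points at: the method recovers one-to-one pairs and says nothing about duplicated genes.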
What is the accuracy of current phylogenetic methods?
  Tricky question: requires knowing the correct phylogeny by an independent means
  Previously:
    Simulation
    Or avoid accuracy and focus on robustness (bootstrap)
What is the accuracy of current phylogenetic methods?
  Use synteny to determine phylogeny by an independent means
  Trees found by Max Likelihood (PHYML)
  [Figure label: Matches species topology]
What is the accuracy of current phylogenetic methods?
  Phylogenies across 5154 syntenic one-to-one orthologs
  [Figure labels: Matches species topology; Etc…; 316 other topologies]
Reconstruction accuracy dependent on gene sequence length
Accuracy of current phylogenetic methods
  Average gene is too short
    Too few phylogenetically informative characters
    To make progress, must use additional information
  Current algorithms ignore species
    Designed for solving the species tree problem
  Whole genomes change the game
    We can assume species tree is known
    We would like to solve the gene tree problem
  Our approach: design an algorithm specifically for the gene tree problem
  Key insight: use species tree to inform the gene tree reconstruction
What is the connection between species and gene evolution?
  5154 gene trees
  [Figure: example gene trees, each with a scale bar of 1.0 sub/site]
Correlation between branch lengths
  [Figure panels: Total tree length; Relative branch lengths. Axes: asp branch lengths, Mer branch lengths. r = 0.957]
Correlation between branch lengths
  Average gene tree
  93% of trees have a correlation greater than 0.8 with the average gene tree
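A minimal sketch of how such a statistic could be computed, assuming every one-to-one tree shares the species topology so that each tree reduces to a list of branch lengths in a fixed branch order. The use of Pearson correlation, the helper names, and the toy numbers are assumptions for illustration; the slides do not state which correlation measure was used.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_tree(trees):
    """Branch-wise mean over trees given as lists in the same branch order."""
    n = len(trees)
    return [sum(t[i] for t in trees) / n for i in range(len(trees[0]))]


def fraction_correlated(trees, threshold=0.8):
    """Fraction of trees whose branches correlate above threshold with the average."""
    avg = average_tree(trees)
    return sum(pearson(t, avg) > threshold for t in trees) / len(trees)


# Toy data (made up): the second tree is the first one evolving twice as fast,
# the third has a different branch-length pattern.
trees = [[0.1, 0.2, 0.4], [0.2, 0.4, 0.8], [0.3, 0.1, 0.2]]
frac = fraction_correlated(trees)
print(frac)
```

Because Pearson correlation ignores overall scale, a fast-evolving family and a slow-evolving one with the same relative branch lengths both correlate well with the average tree.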
Effect of normalization on branch correlation
  [Figure axes: dvir, dana]
Effect of normalization on branch length distribution
  Relative branch lengths: normally distributed
  Absolute branch lengths: gamma distributed
A new model for gene family evolution: two forces
  1. Family rates: F_j ~ gamma(a, b)
  2. Species-specific rates: S_ij ~ normal(u_i, s_ij)
  b_ij = F_j * S_ij
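To make the two-force model concrete, here is a small simulation sketch that draws one family rate and one species-specific rate per branch, then multiplies them. It assumes gamma(a, b) means shape a and scale b (the slide does not say which parametrization is intended) and treats the second normal parameter as a standard deviation; `simulate_family` and all numeric values are hypothetical.

```python
import random


def simulate_family(species_means, species_sdevs, a, b, rng):
    """Draw the branch lengths b_ij = F_j * S_ij for one gene family j."""
    # F_j ~ gamma; random.gammavariate takes (shape, scale). Assumed parametrization.
    family_rate = rng.gammavariate(a, b)
    # S_ij ~ normal(u_i, s_ij): one draw per species branch i.
    species_rates = [rng.gauss(u, s) for u, s in zip(species_means, species_sdevs)]
    return family_rate, [family_rate * s for s in species_rates]


# Made-up parameters: three species branches, each sdev set to 1/4 of the mean.
means = [0.1, 0.2, 0.3]
sdevs = [m / 4 for m in means]
rate, branches = simulate_family(means, sdevs, a=2.0, b=0.5, rng=random.Random(0))
print(rate, branches)
```

A normal draw can in principle be negative, which would give a negative branch length; with standard deviations around a quarter of the mean this is rare, and the sketch does not guard against it.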
Effects that we have seen are consequences of this model
  b_ij = F_j * S_ij
  Total tree length L_j of one-to-one trees is proportional to family rate F_j
  If species rates have small standard deviations, we expect branch correlation
The standard deviation of every species-specific rate is nearly ¼ of the mean
What is the meaning of the species-specific rate?
  The normal is partly due to error in estimating evolutionary distance
  If we fit normals only on long sequences, the standard deviation goes down
  Species-specific means are not affected by sequence length
All of these effects also hold for 17 fungi and 4 mammals
  12 flies
    Absolute branches distributed by gamma
    Relative branches distributed by normal
    3 < mean / sdev < 4
  17 fungi
    Absolute branches distributed by gamma
    Relative branches distributed by normal
    3 < mean / sdev < 4
A new strategy for gene tree reconstruction
  Traditional Maximum Likelihood methods
    Propose many topologies
    For each topology, calculate the likelihood of seeing such a tree
    Return the tree that achieves max likelihood
  We show that one can calculate the likelihood of a tree being generated by our model
  Thus, we can create our own phylogenetic algorithm that uses species information to reconstruct gene trees
Likelihood calculation: simple case
  INPUT: a distance matrix with all pair-wise distances between genes
Likelihood calculation: simple case
  Propose a topology
  Fit branch lengths to topology
  Estimate family rate
  Normalize tree
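One simple way to realize the last two steps for a one-to-one tree is sketched here. It relies on the earlier observation that total tree length is proportional to the family rate, and it assumes the species-specific mean rates are already known. The estimator (total length divided by the summed species means) and the function names are illustrative assumptions, not necessarily the procedure used in the talk.

```python
def estimate_family_rate(branch_lengths, species_means):
    """Estimate F_j as observed tree length over the length expected at rate 1."""
    return sum(branch_lengths) / sum(species_means)


def normalize_tree(branch_lengths, family_rate):
    """Divide every absolute branch length by the family rate."""
    return [b / family_rate for b in branch_lengths]


# Made-up example: a family evolving twice as fast as the species means.
species_means = [0.1, 0.2, 0.3]
branch_lengths = [0.2, 0.4, 0.6]
rate = estimate_family_rate(branch_lengths, species_means)
relative = normalize_tree(branch_lengths, rate)
print(rate, relative)
```

After normalization the relative branch lengths can be compared directly against the per-species normal distributions, independent of how fast the family as a whole evolves.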
Likelihood calculation: simple case
  Reconcile gene tree to species tree
    Determines actual path of evolution through species tree
    Algorithms exist to do this fast (linear time)
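The sketch below shows the idea of reconciliation with a least-common-ancestor mapping: each gene-tree node is mapped to the smallest species-tree clade containing all species beneath it, and a node that maps to the same clade as one of its children is labelled a duplication. Trees are nested tuples whose leaves are species names. This is a deliberately simple version and is not linear time; the representation and the names `clade`, `species_clades`, and `reconcile` are assumptions made for illustration.

```python
def clade(tree):
    """Set of leaf species under a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(clade(c) for c in tree))


def species_clades(species_tree):
    """All clades of the species tree, smallest first."""
    out = []

    def walk(node):
        out.append(clade(node))
        if not isinstance(node, str):
            for child in node:
                walk(child)

    walk(species_tree)
    return sorted(out, key=len)


def reconcile(gene_tree, species_tree):
    """Return (species clade, event) for each internal gene node, in post-order."""
    clades = species_clades(species_tree)
    events = []

    def lca(species):
        # Species clades are nested, so the smallest superset is the LCA.
        return next(c for c in clades if species <= c)

    def walk(node):
        if isinstance(node, str):
            return lca(frozenset([node]))
        child_maps = [walk(child) for child in node]
        here = lca(frozenset().union(*child_maps))
        kind = "duplication" if here in child_maps else "speciation"
        events.append((sorted(here), kind))
        return here

    walk(gene_tree)
    return events


species = ((("mouse", "rat"), "human"), "dog")
misrooted = (("mouse", "rat"), ("human", "dog"))
for mapped, kind in reconcile(misrooted, species):
    print(mapped, kind)
```

In the misrooted example the (human, dog) node already maps to the root of the species tree, so the gene-tree root is inferred to be a duplication even though no gene was actually duplicated.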
Likelihood calculation: simple case
  Compare branch lengths to distributions
  Allows us to calculate a likelihood for every branch
  [Figure: tree with branch likelihoods Pa, Pb, Pc, Pd, Pe, Pf]
Likelihood calculation: simple case
  Because branches are independent, likelihood of tree is product of branch likelihoods
  Every branch is highly likely, so the tree is highly likely
  [Figure: tree with branch likelihoods Pa, Pb, Pc, Pd, Pe, Pf]
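A minimal sketch of this product, computed as a sum of log densities to avoid numerical underflow. It assumes each normalized branch length is scored against a normal distribution with a known per-species mean and standard deviation; the function names and numbers are hypothetical.

```python
import math


def normal_logpdf(x, mean, sdev):
    """Log density of a normal distribution at x."""
    z = (x - mean) / sdev
    return -0.5 * z * z - math.log(sdev * math.sqrt(2 * math.pi))


def tree_log_likelihood(relative_branches, species_means, species_sdevs):
    """Sum of per-branch log densities, i.e. the log of the product of branch likelihoods."""
    return sum(normal_logpdf(x, u, s)
               for x, u, s in zip(relative_branches, species_means, species_sdevs))


# Made-up species parameters with sdev equal to 1/4 of the mean.
means = [0.1, 0.2, 0.3]
sdevs = [0.025, 0.05, 0.075]
good = tree_log_likelihood([0.1, 0.2, 0.3], means, sdevs)  # every branch at its mean
bad = tree_log_likelihood([0.1, 0.2, 0.6], means, sdevs)   # one branch far too long
print(good, bad)
```

A single branch far from its expected length is enough to pull the whole tree's likelihood down, which is what lets the score discriminate between candidate topologies.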
Likelihood calculation: complex case
  Propose another topology
    This one differs only by rooting
  Most branches have the same length (just a different name)
    w = e (human)
    x = c (rat)
    y = d (mouse)
    z = b (rodent)
  Two branches are now merged
    v = a + f (dog/hmr)
  [Figure: first topology with branch likelihoods Pa to Pf; every branch is highly likely, so the tree is highly likely]
Likelihood calculation: complex case
  Reconcile gene tree to species tree
Likelihood calculation: complex case
  [Figure: species tree with leaves Rat, Mouse, Human, Dog]
Likelihood calculation: complex case
  Mouse and rat branches have the same likelihood as before
    Px = Pc
    Py = Pd
Likelihood calculation: complex case
  Same distribution for dog, but now dog branch is too long. Why?
    v = a + f
    Pv < Pf
Likelihood calculation: complex case
  Branch w goes from Eutherian to Human (two species branches, w1 and w2)
  Which distribution should we use?
Likelihood calculation: complex case
  The distribution is the sum of two independent normals
    w = w1 + w2 ~ N(u1, s1^2) + N(u2, s2^2) = N(u1 + u2, s1^2 + s2^2)
  Branch w is too short: Pw < Pe
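A short numeric sketch of this step: when a gene-tree branch spans two species branches, the means add and the variances add, and the branch is scored against the resulting normal. The parameter values are made up, and `sum_of_normals` is a hypothetical helper name.

```python
import math


def sum_of_normals(means, sdevs):
    """Mean and sdev of a sum of independent normals: means add, variances add."""
    return sum(means), math.sqrt(sum(s * s for s in sdevs))


def normal_logpdf(x, mean, sdev):
    """Log density of a normal distribution at x."""
    z = (x - mean) / sdev
    return -0.5 * z * z - math.log(sdev * math.sqrt(2 * math.pi))


# Made-up parameters: w1 ~ N(0.20, 0.05^2) and w2 ~ N(0.10, 0.025^2).
mean_w, sdev_w = sum_of_normals([0.20, 0.10], [0.05, 0.025])

# A branch of length 0.10 fits the single human branch well (log Pe),
# but is too short for the merged two-branch span (log Pw).
log_pe = normal_logpdf(0.10, 0.10, 0.025)
log_pw = normal_logpdf(0.10, mean_w, sdev_w)
print(mean_w, sdev_w, log_pe, log_pw)
```

Note that the standard deviations themselves do not add; only the variances do, so the merged distribution is narrower than a naive sum of sdevs would suggest.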
Likelihood calculation: complex case
  Same case for z: two species branches (z1 and z2)
  Distribution is the sum of two independent normals
    z = z1 + z2 ~ N(u1, s1^2) + N(u2, s2^2) = N(u1 + u2, s1^2 + s2^2)
Likelihood calculation: complex case
  Branch z is too short: Pz < Pb
Likelihood calculation: complex case
  Some branches are less likely, so the tree is less likely
Bringing it all together
  It turns out we can find the likelihood of any tree by breaking it down into one of three cases
  Main advantage: do not explicitly penalize dup/loss
    Only ensure branch lengths are close to what we expect given our model
Example of reconstructing tree with dup/loss: hemoglobin genes
  This is now the correct topology
  [Figure: gene tree with Hemoglobin alpha (D, H, M, R) and Hemoglobin beta (D, H, M, R)]
Example of reconstructing tree with dup/loss
  Branch z is now longer
  Branch w is now longer
  Branch v is just the right length
  All branches are highly likely, so the tree is highly likely
  [Figure: species tree (Rat, Mouse, Human, Dog) with branch likelihoods Px, Py, Pz, Pw, Pv; gene tree with Hemoglobin alpha (D, H, M, R) and Hemoglobin beta (D, H, M, R), branches z, w, v marked]
Evaluation: Datasets
  Real datasets
    5154 syntenic one-to-ones from 12 flies
    739 syntenic one-to-ones from 17 fungi
    200 neighboring fly orthologs
    220 whole genome duplicates in 7 yeasts
  Simulated (using our gene family model)
    More complex events
  [Figure labels: Neighboring orthologs; WGD trees; klac, kwal, agos; sbay, smik, spar, scer]
Evaluation
Apply genome-wide for 17 fungi
  Cluster genes
  Build alignment for each cluster
  Build tree for each alignment
  Reconcile to species tree to determine all duplications and losses
General trees follow the model we learned from one-to-one trees
GO enrichment in top 50 trees with most duplications
  term | pval
  plasma membrane | -1.50E-11
  helicase activity | 4.72E-12
  ammonium transporter activity | 1.46E-09
  telomere maintenance via recombination | 1.66E-09
  DNA helicase activity | 4.11E-09
  transport | 4.40E-07
  transporter activity | 4.80E-07
  membrane | 1.27E-06
  alcohol dehydrogenase activity | 6.29E-06
  nitrogen utilization |
  ATPase activity, coupled to transmembrane movement of ions, phosphorylative mechanism |
  alpha-1,3-mannosyltransferase activity |
  alcohol dehydrogenase (NADP+) activity | 3.84E-05
  magnesium ion transport |
  sodium ion transport |
  basic amino acid transporter activity |
  lysophospholipase activity |
  nuclear nucleosome | 4.17E-05
  cellular component unknown | 0.000129
  translational elongation | 0.000139
  oxidoreductase activity, acting on the CH-OH group of donors, NAD or NADP as acceptor | 0.00015
  translation elongation factor activity | 0.000231
  oxidoreductase activity | 0.000353
  alpha-glucosidase activity | 0.000365
  fermentation |
  proteasome complex (sensu Eukaryota) |
  transmembrane receptor activity |
  protein amino acid O-linked glycosylation | 0.000515
  ribosome | 0.000724
  thiamin biosynthesis |
  myo-inositol transport | 0.00114
  myo-inositol transporter activity |
  maltose catabolism |
  glycerophospholipid metabolism |
  NADPH dehydrogenase activity |
  permease activity |
  calcium-transporting ATPase activity |
GO enrichment in top 50 trees with most gene losses
  helicase activity | 4.49E-12
  telomere maintenance via recombination | 1.60E-09
  DNA helicase activity | 3.95E-09
  GTPase activity | 7.11E-09
  ubiquitin conjugating enzyme activity | 6.79E-08
  translational elongation | 4.24E-07
  alcohol dehydrogenase activity | 6.17E-06
  ATPase activity, coupled to transmembrane movement of ions, phosphorylative mechanism |
  alpha-1,3-mannosyltransferase activity |
  translation elongation factor activity | 9.27E-06
  alcohol dehydrogenase (NADP+) activity | 3.78E-05
  1,3-beta-glucan synthase activity |
  sodium ion transport |
  IMP dehydrogenase activity |
  ribosome | 4.35E-05
  protein serine/threonine phosphatase activity | 0.000139
  oxidoreductase activity, acting on the CH-OH group of donors, NAD or NADP as acceptor | 0.000148
  structural constituent of cytoskeleton | 0.00018
  alpha-glucosidase activity | 0.00036
  fermentation |
  proteasome complex (sensu Eukaryota) |
  transmembrane receptor activity |
  protein amino acid O-linked glycosylation | 0.000505
  ammonium transporter activity | 0.000701
  hydrogen-exporting ATPase activity, phosphorylative mechanism | 0.001129
  GTP biosynthesis |
  myo-inositol transport |
  regulation of pH |
  myo-inositol transporter activity |
  maltose catabolism |
  aconitate hydratase activity |
  calcium-transporting ATPase activity |
  IMP cyclohydrolase activity |
  plasma membrane | 0.00117
  transporter activity | 0.00179
  protein phosphatase type 2A activity | 0.001867
GO enrichment in top 50 trees with most genes
  term | pval
  helicase activity | 1.55E-11
  telomere maintenance via recombination | 3.83E-09
  DNA helicase activity | 1.05E-08
  plasma membrane | 2.84E-08
  1,3-beta-glucanosyltransferase activity | 7.94E-08
  transporter activity | 1.45E-07
  transport | 1.71E-07
  pyruvate decarboxylase activity | 2.09E-06
  protein amino acid O-linked glycosylation | 2.29E-06
  Golgi apparatus | 7.73E-06
  alcohol dehydrogenase activity | 1.01E-05
  ATPase activity, coupled to transmembrane movement of ions, phosphorylative mechanism |
  alpha-1,3-mannosyltransferase activity |
  GTPase activity | 1.57E-05
  cyclin-dependent protein kinase holoenzyme complex | 1.70E-05
  alpha-1,2-mannosyltransferase activity | 2.95E-05
  regulation of glycogen biosynthesis |
  cytosine-purine permease activity | 5.51E-05
  sodium ion transport |
  basic amino acid transporter activity |
  lysophospholipase activity |
  membrane | 0.000159
  cell wall (sensu Fungi) | 0.000386
  alpha-glucosidase activity | 0.00052
  fermentation |
  proteasome complex (sensu Eukaryota) |
  3-chloroallyl aldehyde dehydrogenase activity |
  oxidoreductase activity | 0.000557
  cyclin-dependent protein kinase regulator activity | 0.00097
  ammonium transporter activity | 0.001011
  myo-inositol transport | 0.00145
  myo-inositol transporter activity |
  maltose catabolism |
  glycerophospholipid metabolism |
  NADPH dehydrogenase activity |
GO enrichment in top 10 trees with most genes
  DNA helicase activity | 3.79E-13
  telomere maintenance via recombination | 4.61E-13
  helicase activity | 8.05E-13
  ATPase activity, coupled to transmembrane movement of ions, phosphorylative mechanism | 5.92E-08
  alcohol dehydrogenase activity |
  sodium ion transport | 1.15E-06
  fermentation | 1.13E-05
  calcium-transporting ATPase activity | 0.00011
  NADPH dehydrogenase activity |
  transport | 0.000146
  oxidoreductase activity | 0.000178
  membrane | 0.000235
  alcohol dehydrogenase (NADP+) activity | 0.000328
  alcohol metabolism | 0.000652
  transporter activity | 0.000735
  plasma membrane | 0.000846
  multidrug transport | 0.001078
Supplemental figure
# Duplications vs rel sub/site for each species branch
Orthologs and paralogs
  Orthologs: arise by speciation, typically keep same function
  Paralogs: arise by duplication, typically take on new functions
  Understanding orthologs and paralogs is a basic requirement for comparative genomic studies
  Orthologs arise from speciation and vertical descent from a single ancestral gene, and therefore typically preserve the ancestral function
  Paralogs, on the other hand, arise by gene duplication and are likely to take on new functions
  [Figure: tree over human, mouse, rat, dog, and rabbit, with paralogs and orthologs marked]
Likelihood calculation: complex case
  Mouse and rat branches have the same likelihood as before
    Px = Pc
    Py = Pd
Likelihood calculation: complex case
  Dog is now too long. Why?
    v = a + f
    Pv < Pf
Likelihood calculation: complex case
  Human is now too short, because it must now cross an extra species
Figure 4. Gene-tree evaluation with a richer species-tree model
  a. Gene-tree with correct topology scores highly
  b. Gene-tree with incorrect topology scores poorly