Accurate gene phylogeny across multiple complete genomes

Species Informed Distance-based Reconstruction
Matt Rasmussen and Manolis Kellis

The goal
Determine the evolutionary history of every gene in multiple complete genomes
From phylogenies determine:
- Orthologs
- Paralogs
- Duplications
- Losses
- Family expansions
- Varying rates of evolution
- Etc.

Contrast of the phylogenetic method with alternative methods
- Pair-wise sequence comparison
  - Best bi-directional BLAST hits
  - Focuses on one-to-one orthologs (no duplications)
- Hit clustering methods
  - Detect clusters in graph of pair-wise hits
  - Difficult to separate large connected components
- Synteny methods
  - Detect conserved regions, stretches of nearby hits
  - Genome alignment methods focus on best hits
- Phylogenetic methods
  - Phylogeny of family clusters orthologs near each other
  - Traditionally applied to specific families
  - Can they be applied genome-wide?
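For comparison with the phylogenetic approach, the best bi-directional hit rule can be sketched in a few lines. This is only an illustration: the score dictionaries and gene names are hypothetical stand-ins for parsed BLAST output, not data from the talk.

```python
def best_hits(scores):
    """Map each query gene to its highest-scoring target gene."""
    best = {}
    for (query, target), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {q: t for q, (t, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical bit scores between genes of species A and species B.
a_vs_b = {("a1", "b1"): 900, ("a1", "b2"): 850, ("a2", "b2"): 700, ("a3", "b1"): 400}
b_vs_a = {("b1", "a1"): 900, ("b2", "a1"): 850, ("b2", "a2"): 700, ("b1", "a3"): 400}

pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
```

In this toy input only one pair survives. Genes `a2`, `a3`, and `b2` are left unpaired, which is the one-to-one limitation noted on the slide.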

What is the accuracy of current phylogenetic methods?
- Tricky question: requires knowing the correct phylogeny by an independent means
- Previously: simulation
- Or avoid accuracy and focus on robustness (bootstrap)

Use synteny to determine phylogeny by an independent means
- Trees found by maximum likelihood (PHYML)
[Figure label: Matches species topology]

Phylogenies across 5154 syntenic one-to-one orthologs
[Figure: reconstructed topologies, labeled "Matches species topology" and "316 other topologies"]

Reconstruction accuracy dependent on gene sequence length

Accuracy of current phylogenetic methods
- Average gene is too short
  - Too few phylogenetically informative characters
  - To make progress, must use additional information
- Current algorithms ignore species
  - Designed for solving the species tree problem
- Whole genomes change the game
  - We can assume species tree is known
  - We would like to solve the gene tree problem
- Our approach: design an algorithm specifically for the gene tree problem
- Key insight: use species tree to inform the gene tree reconstruction

What is the connection between species and gene evolution?

5154 gene trees
[Figure: example gene trees, each drawn with a 1.0 sub/site scale bar]

Correlation between branch lengths
[Figure panels: Total tree length; Relative branch lengths]
[Scatter plot axes: asp branch lengths, Mer branch lengths; r = 0.957]

Average gene tree
93% of trees have a correlation greater than 0.8 with the average gene tree
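A statistic of this kind can be computed as sketched below, assuming every tree shares one topology so that branches can be listed in a fixed order. The branch lengths are made up for illustration, and the function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def fraction_correlated(trees, threshold=0.8):
    """Fraction of trees whose relative branch lengths correlate with the average tree."""
    # Relative branch lengths: divide each tree by its total length.
    relative = [[b / sum(tree) for b in tree] for tree in trees]
    n_branches = len(relative[0])
    average = [sum(t[i] for t in relative) / len(relative) for i in range(n_branches)]
    hits = sum(1 for t in relative if pearson(t, average) > threshold)
    return hits / len(trees)


# Hypothetical branch lengths (sub/site). The first three trees share proportions
# but differ in overall rate; the last one has a different shape.
trees = [
    [0.10, 0.20, 0.30, 0.40],
    [0.20, 0.40, 0.60, 0.80],
    [0.05, 0.10, 0.15, 0.20],
    [0.40, 0.30, 0.20, 0.10],
]
frac = fraction_correlated(trees)
```

Dividing by total tree length removes the overall rate of each gene, so trees that differ only in speed of evolution become identical after normalization.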

Effect of normalization on branch correlation
[Plot labels: dvir, dana]

Effect of normalization on branch length distribution
- Relative branch lengths: normally distributed
- Absolute branch lengths: gamma distributed

A new model for gene family evolution: Two forces
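1. Family rates: $F_j \sim \mathrm{gamma}(a, b)$
2. Species-specific rates: $S_{ij} \sim \mathrm{normal}(u_i, s_i^2)$
Branch length of family $j$ on species branch $i$: $b_{ij} = F_j \cdot S_{ij}$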
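A minimal simulation of this two-force model, for illustration only. The parameter values are invented, the function name is hypothetical, and two details are assumptions rather than statements from the slides: that b in gamma(a, b) is a rate (not a scale), and that negative draws are clamped to zero.

```python
import random


def simulate_branch_lengths(species_params, a, b, n_families, seed=0):
    """Draw branch lengths b_ij = F_j * S_ij for n_families one-to-one trees.

    species_params: list of (mean, sd) per species branch i (hypothetical values).
    a, b: shape and rate (assumed) of the gamma distribution for the family rate F_j.
    """
    rng = random.Random(seed)
    families = []
    for _ in range(n_families):
        # random.gammavariate takes shape and scale, so scale = 1 / rate.
        family_rate = rng.gammavariate(a, 1.0 / b)
        # Species-specific rates, clamped at zero because a length cannot be negative.
        species_rates = [max(0.0, rng.gauss(mu, sd)) for mu, sd in species_params]
        families.append([family_rate * s for s in species_rates])
    return families


# Three species branches whose sd is a quarter of the mean (illustrative numbers).
params = [(0.1, 0.025), (0.2, 0.05), (0.4, 0.1)]
families = simulate_branch_lengths(params, a=2.0, b=2.0, n_families=200, seed=1)
```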

Effects that we have seen are consequences of this model
$b_{ij} = F_j \cdot S_{ij}$
- Total tree length $L_j$ of one-to-one trees is proportional to family rate $F_j$
- If species rates have small standard deviations, we expect branch correlation
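Both consequences follow in one line from the model, assuming a one-to-one tree has exactly one gene branch per species branch $i$:

```latex
L_j \;=\; \sum_i b_{ij} \;=\; \sum_i F_j\, S_{ij} \;=\; F_j \sum_i S_{ij},
\qquad
\frac{b_{ij}}{L_j} \;=\; \frac{S_{ij}}{\sum_k S_{kj}} .
```

The family rate $F_j$ cancels in the relative branch length, so when each $S_{ij}$ stays close to its mean $u_i$, every family has nearly the same relative branch lengths, and the trees correlate.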

The standard deviation of every species-specific rate is nearly ¼ of the mean

What is the meaning of the species-specific rate?
- The normal is partly due to error in estimating evolutionary distance
- If we fit normals only on long sequences, the standard deviation goes down
- Species-specific means are not affected by sequence length

All of these effects also hold for 17 fungi and 4 mammals
12 Flies
- Absolute branches distributed by gamma
- Relative branches distributed by normal
- 3 < mean / sdev < 4
17 Fungi
- Absolute branches distributed by gamma
- Relative branches distributed by normal
- 3 < mean / sdev < 4

A new strategy for gene tree reconstruction
- Traditional maximum likelihood methods
  - Propose many topologies
  - For each topology, calculate the likelihood of seeing such a tree
  - Return tree that achieves max likelihood
- We show that one can calculate the likelihood of a tree being generated by our model
- Thus, we can create our own phylogenetic algorithm that uses species information to reconstruct gene trees

Likelihood calculation: simple case
INPUT: a distance matrix with all pair-wise distances between genes

1. Propose a topology
2. Fit branch lengths to topology
3. Estimate family rate
4. Normalize tree

- Reconcile gene tree to species tree
  - Determines actual path of evolution through species tree
  - Algorithms exist to do this fast (linear time)
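One standard way to reconcile is to map each gene-tree node to the lowest common ancestor (LCA) of the species found beneath it. The sketch below is a simple version of that idea, not the linear-time algorithm the slide refers to. The species tree, the internal node names (`rodent`, `hmr`, `root`), and the gene names are hypothetical.

```python
def ancestors(node, parent):
    """Path from a species-tree node up to the root, inclusive."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def lca(a, b, parent):
    """Lowest common ancestor of two species-tree nodes."""
    seen = set(ancestors(a, parent))
    for node in ancestors(b, parent):
        if node in seen:
            return node
    raise ValueError("nodes are not in the same tree")


def reconcile(gene_tree, gene_species, parent):
    """Map every gene-tree node to a species-tree node by LCA mapping.

    gene_tree: nested 2-tuples with gene names (str) at the leaves.
    Returns a dict from each gene-tree node (leaf name or tuple) to a species node.
    """
    mapping = {}

    def visit(node):
        if isinstance(node, str):
            mapping[node] = gene_species[node]
        else:
            left, right = node
            mapping[node] = lca(visit(left), visit(right), parent)
        return mapping[node]

    visit(gene_tree)
    return mapping


# Hypothetical species tree: (((rat, mouse)rodent, human)hmr, dog)root.
parent = {"rat": "rodent", "mouse": "rodent", "rodent": "hmr",
          "human": "hmr", "hmr": "root", "dog": "root"}
gene_species = {"g_rat": "rat", "g_mouse": "mouse", "g_human": "human", "g_dog": "dog"}

congruent = ((("g_rat", "g_mouse"), "g_human"), "g_dog")
mapping = reconcile(congruent, gene_species, parent)
```

The mapping tells us which species branches each gene branch passes through, which is what the branch-length comparison needs.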

[Figure: gene tree with per-branch likelihoods Pa, Pb, Pc, Pd, Pe, Pf]
- Compare branch lengths to distributions
- Allows us to calculate a likelihood for every branch

- Every branch is highly likely → tree is highly likely
- Because branches are independent, likelihood of tree is product of branch likelihoods
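The product of branch likelihoods can be sketched as a sum of log densities. This is a simplified illustration, assuming each gene branch corresponds to exactly one species branch and that the family rate has already been estimated. The parameter values and function names are hypothetical.

```python
import math


def normal_logpdf(x, mean, sd):
    """Log density of a normal distribution at x."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)


def tree_log_likelihood(branch_lengths, family_rate, species_params):
    """Sum of per-branch log-likelihoods for a one-to-one gene tree.

    branch_lengths: {species_branch: absolute length}
    species_params: {species_branch: (mean, sd)} of the relative-rate normals.
    """
    total = 0.0
    for branch, length in branch_lengths.items():
        mean, sd = species_params[branch]
        relative = length / family_rate  # normalize by the family rate
        total += normal_logpdf(relative, mean, sd)
    return total


# Illustrative parameters: sd is a quarter of the mean for every branch.
species_params = {"rat": (0.2, 0.05), "mouse": (0.2, 0.05),
                  "human": (0.3, 0.075), "dog": (0.3, 0.075)}
family_rate = 2.0

expected = {b: family_rate * m for b, (m, _) in species_params.items()}
distorted = dict(expected, dog=expected["dog"] * 2)  # dog branch twice too long

ll_expected = tree_log_likelihood(expected, family_rate, species_params)
ll_distorted = tree_log_likelihood(distorted, family_rate, species_params)
```

Working in log space avoids underflow when many small branch likelihoods are multiplied together.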

Likelihood calculation: complex case
[Figure: correctly rooted gene tree with per-branch likelihoods Pa, Pb, Pc, Pd, Pe, Pf. Every branch is highly likely → tree is highly likely]

Propose another topology
- This one differs only by rooting
- Most branches have the same length (just a different name)
  - w = e (human)
  - x = c (rat)
  - y = d (mouse)
  - z = b (rodent)
- Two branches are now merged
  - v = a + f (dog/hmr)

Reconcile gene tree to species tree
[Figure: species tree over Rat, Mouse, Human, Dog]

Mouse and rat branches have the same likelihood as before
- Px = Pc
- Py = Pd

Same distribution for dog, but now the dog branch is too long. Why?
- v = a + f
- Pv < Pf

Branch w goes from Eutherian to Human (two species branches)
- Which distribution should we use?

The distribution is the sum of two independent normals
$w = w_1 + w_2 \sim N(u_1, s_1^2) + N(u_2, s_2^2) = N(u_1 + u_2,\ s_1^2 + s_2^2)$
- Branch w is too short, Pw < Pe

Same case for z: two species branches
- Distribution is the sum of two independent normals
$z = z_1 + z_2 \sim N(u_1, s_1^2) + N(u_2, s_2^2) = N(u_1 + u_2,\ s_1^2 + s_2^2)$

Branch z is too short
- Pz < Pb

Some branches are less likely → tree is less likely
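The "too short" effect for a gene branch spanning two species branches can be checked numerically. This sketch assumes independent normals as on the slides. The means and standard deviations are invented, and the branch names `hmr` and `human` are hypothetical labels for the two species branches that w crosses.

```python
import math


def path_distribution(species_path, species_params):
    """Mean and sd for a gene branch spanning several species branches.

    Independent normals add: the means sum and the variances sum.
    """
    mean = sum(species_params[s][0] for s in species_path)
    var = sum(species_params[s][1] ** 2 for s in species_path)
    return mean, math.sqrt(var)


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    return math.exp(-((x - mean) ** 2) / (2 * sd * sd)) / (sd * math.sqrt(2 * math.pi))


# Hypothetical (mean, sd) for relative lengths of two species branches.
species_params = {"hmr": (0.10, 0.025), "human": (0.20, 0.05)}

# Correct rooting: gene branch e spans only the human species branch.
p_e = normal_pdf(0.20, *species_params["human"])

# Alternate rooting: same length, but branch w must span hmr + human.
mean_w, sd_w = path_distribution(["hmr", "human"], species_params)
p_w = normal_pdf(0.20, mean_w, sd_w)
```

The observed length is unchanged, but the expected length grows to the sum of both means, so the same branch now sits in the lower tail of its distribution.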

Bringing it all together
- It turns out we can find the likelihood of any tree by breaking it down into one of three cases
- Main advantage: do not explicitly penalize dup/loss
  - Only ensure branch lengths are close to what we expect given our model

Example of reconstructing tree with dup/loss: hemoglobin genes
[Figure: gene tree with a Hemoglobin alpha clade (D, H, M, R) and a Hemoglobin beta clade (D, H, M, R)]
This is now the correct topology

Example of reconstructing tree with dup/loss
[Figure: hemoglobin gene tree (Hemoglobin alpha: D, H, M, R; Hemoglobin beta: D, H, M, R) reconciled to the Rat, Mouse, Human, Dog species tree, with branch likelihoods Px, Py, Pz, Pw, Pv]
- Branch z is now longer
- Branch w is now longer
- Branch v is just the right length
- All branches are highly likely → tree is highly likely

Evaluation: Datasets
Real datasets
- 5154 syntenic one-to-ones from 12 flies
- 739 syntenic one-to-ones from 17 fungi
- 200 neighboring fly orthologs
- 220 whole genome duplicates in 7 yeasts
Simulated (using our gene family model)
- More complex events
[Figure panels: Neighboring orthologs; WGD trees (klac, kwal, agos; sbay, smik, spar, scer)]

Evaluation

Apply genome-wide for 17 fungi
1. Cluster genes
2. Build alignment for each cluster
3. Build tree for each alignment
4. Reconcile to species tree to determine all duplications and losses
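For the last step, a common rule in LCA-based reconciliation labels a gene-tree node as a duplication when it maps to the same species node as one of its children. The sketch below applies that rule to a precomputed mapping. The tiny two-species example and all node names are hypothetical, and loss counting is left out.

```python
def count_duplications(children, mapping):
    """Count internal gene-tree nodes labeled as duplications.

    children: {gene_node: (left_child, right_child)} for internal nodes.
    mapping:  {gene_node: species_node} from a reconciliation.
    A node is a duplication if it maps to the same species node as a child.
    """
    return sum(
        1
        for node, kids in children.items()
        if any(mapping[node] == mapping[k] for k in kids)
    )


# Hypothetical family: a duplication before the human/mouse split produced
# two copies (A and B), each then inherited by both species.
children = {
    "root": ("A", "B"),
    "A": ("a_human", "a_mouse"),
    "B": ("b_human", "b_mouse"),
}
mapping = {
    "root": "hm", "A": "hm", "B": "hm",
    "a_human": "human", "a_mouse": "mouse",
    "b_human": "human", "b_mouse": "mouse",
}

n_dups = count_duplications(children, mapping)
```

Here only the root is a duplication. Nodes A and B are speciations because their children map to different, lower species nodes.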

General trees follow the model we learned from one-to-one trees

GO enrichment in top 50 trees with most duplications
term | pval
plasma membrane | -1.50E-11
helicase activity | 4.72E-12
ammonium transporter activity | 1.46E-09
telomere maintenance via recombination | 1.66E-09
DNA helicase activity | 4.11E-09
transport | 4.40E-07
transporter activity | 4.80E-07
membrane | 1.27E-06
alcohol dehydrogenase activity | 6.29E-06
nitrogen utilization
ATPase activity, coupled to transmembrane movement of ions, phosphorylative mechanism
alpha-1,3-mannosyltransferase activity
alcohol dehydrogenase (NADP+) activity | 3.84E-05
magnesium ion transport
sodium ion transport
basic amino acid transporter activity
lysophospholipase activity
nuclear nucleosome | 4.17E-05
cellular component unknown
translational elongation
oxidoreductase activity, acting on the CH-OH group of donors, NAD or NADP as acceptor
translation elongation factor activity
oxidoreductase activity
alpha-glucosidase activity
fermentation
proteasome complex (sensu Eukaryota)
transmembrane receptor activity
protein amino acid O-linked glycosylation
ribosome
thiamin biosynthesis
myo-inositol transport
myo-inositol transporter activity
maltose catabolism
glycerophospholipid metabolism
NADPH dehydrogenase activity
permease activity
calcium-transporting ATPase activity
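The slides do not say which test produced these p-values. One common choice for GO enrichment is the one-sided hypergeometric test, sketched here purely as an illustration of how such a p-value can be computed. The function name and the counts in the example are made up.

```python
from math import comb


def hypergeom_enrichment_pvalue(k, n, K, N):
    """P(X >= k) when drawing n genes from N, of which K carry the GO term.

    k: selected genes annotated with the term
    n: number of selected genes
    K: genes annotated with the term in the whole background
    N: size of the background
    """
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)


# Toy numbers: 3 of 10 background genes carry the term, and all 3 selected genes do.
p = hypergeom_enrichment_pvalue(k=3, n=3, K=3, N=10)
```

With exact integer arithmetic from `math.comb`, the tail sum stays accurate even for very small p-values.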

GO enrichment in top 50 trees with most gene losses
term | pval
helicase activity | 4.49E-12
telomere maintenance via recombination | 1.60E-09
DNA helicase activity | 3.95E-09
GTPase activity | 7.11E-09
ubiquitin conjugating enzyme activity | 6.79E-08
translational elongation | 4.24E-07
alcohol dehydrogenase activity | 6.17E-06
ATPase activity, coupled to transmembrane movement of ions, phosphorylative mechanism
alpha-1,3-mannosyltransferase activity
translation elongation factor activity | 9.27E-06
alcohol dehydrogenase (NADP+) activity | 3.78E-05
1,3-beta-glucan synthase activity
sodium ion transport
IMP dehydrogenase activity
ribosome | 4.35E-05
protein serine/threonine phosphatase activity
oxidoreductase activity, acting on the CH-OH group of donors, NAD or NADP as acceptor
structural constituent of cytoskeleton
alpha-glucosidase activity
fermentation
proteasome complex (sensu Eukaryota)
transmembrane receptor activity
protein amino acid O-linked glycosylation
ammonium transporter activity
hydrogen-exporting ATPase activity, phosphorylative mechanism
GTP biosynthesis
myo-inositol transport
regulation of pH
myo-inositol transporter activity
maltose catabolism
aconitate hydratase activity
calcium-transporting ATPase activity
IMP cyclohydrolase activity
plasma membrane
transporter activity
protein phosphatase type 2A activity

GO enrichment in top 50 trees with most genes
term | pval
helicase activity | 1.55E-11
telomere maintenance via recombination | 3.83E-09
DNA helicase activity | 1.05E-08
plasma membrane | 2.84E-08
1,3-beta-glucanosyltransferase activity | 7.94E-08
transporter activity | 1.45E-07
transport | 1.71E-07
pyruvate decarboxylase activity | 2.09E-06
protein amino acid O-linked glycosylation | 2.29E-06
Golgi apparatus | 7.73E-06
alcohol dehydrogenase activity | 1.01E-05
ATPase activity, coupled to transmembrane movement of ions, phosphorylative mechanism
alpha-1,3-mannosyltransferase activity
GTPase activity | 1.57E-05
cyclin-dependent protein kinase holoenzyme complex | 1.70E-05
alpha-1,2-mannosyltransferase activity | 2.95E-05
regulation of glycogen biosynthesis
cytosine-purine permease activity | 5.51E-05
sodium ion transport
basic amino acid transporter activity
lysophospholipase activity
membrane
cell wall (sensu Fungi)
alpha-glucosidase activity
fermentation
proteasome complex (sensu Eukaryota)
3-chloroallyl aldehyde dehydrogenase activity
oxidoreductase activity
cyclin-dependent protein kinase regulator activity
ammonium transporter activity
myo-inositol transport
myo-inositol transporter activity
maltose catabolism
glycerophospholipid metabolism
NADPH dehydrogenase activity

GO enrichment in top 10 trees with most genes
term | pval
DNA helicase activity | 3.79E-13
telomere maintenance via recombination | 4.61E-13
helicase activity | 8.05E-13
ATPase activity, coupled to transmembrane movement of ions, phosphorylative mechanism | 5.92E-08
alcohol dehydrogenase activity
sodium ion transport | 1.15E-06
fermentation | 1.13E-05
calcium-transporting ATPase activity
NADPH dehydrogenase activity
transport
oxidoreductase activity
membrane
alcohol dehydrogenase (NADP+) activity
alcohol metabolism
transporter activity
plasma membrane
multidrug transport

Supplemental figure

Number of duplications vs. relative sub/site for each species branch

Orthologs and paralogs
[Figure: gene tree over human, mouse, rat, dog, rabbit with paralogs and orthologs marked]
Understanding orthologs and paralogs is a basic requirement for comparative genomic studies.
- Orthologs arise by speciation, descending vertically from a single ancestral gene, and therefore typically preserve the ancestral function
- Paralogs arise by gene duplication and are likely to take on new functions

[Figure: alternate rooting reconciled to the Rat, Mouse, Human, Dog species tree; w = e (human), x = c (rat), y = d (mouse), z = b (rodent), v = a + f (dog/hmr)]

Mouse and rat branches have the same likelihood as before
- Px = Pc
- Py = Pd

Dog is now too long. Why?
- v = a + f
- Pv < Pf

Human is now too short, because it must now cross an extra species

Figure 4. Gene-tree evaluation with a richer species-tree model
a. Gene tree with correct topology scores highly
b. Gene tree with incorrect topology scores poorly
