Matt Rasmussen and Manolis Kellis

Phylogenomics of mammalian, fly and fungal genomes
Matt Rasmussen and Manolis Kellis
Friday, April 2007
MIT Computer Science and Artificial Intelligence Lab
Broad Institute of MIT and Harvard

Multiple fully sequenced clades of genomes
32 mammals 12 flies ~20 fungi and many more…

Comparative genomics
Identify conserved regions
Annotate genes, regulatory elements
Study gene and genome evolution
Recognize orthologs / paralogs
Gene duplications / losses
Lineage-specific expansions
Varying rates of evolution
And much more…

Comparative genomics requires correct orthology/phylogeny
Orthologs and paralogs are best understood in the context of a phylogeny (Fitch 1970).
Phylogenies are necessary for inferring duplications and losses (Goodman 1979).
Goal of phylogenomics (Eisen 1998): determine the phylogeny of every gene family in multiple complete genomes.

Outline
Why use phylogeny?
Part I, Inaccuracies: What is the accuracy of phylogenetic methods? Understanding sources of error.
Part II, The Model: Modeling common properties of gene trees. Gene-specific and species-specific rates.
Part III, The Method: A new phylogenetic method: learning across complete genomes.
Part IV, The Results: Increase in phylogenetic accuracy. Additional applications.

Contrast of the phylogenetic method with alternative methods
Pair-wise sequence comparison: best reciprocal BLAST hits. Focuses on one-to-one orthologs (no duplications).
Hit clustering methods: detect clusters in the graph of pair-wise hits. Large connected components are difficult to separate.
Synteny methods: detect conserved regions, stretches of nearby hits. Genome alignment methods focus on best hits. Require close genomes. Unable to resolve tandem duplications.
Phylogenetic methods: the phylogeny of a family clusters orthologs near each other. Traditionally applied to specific families. Can they be applied genome-wide?
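The best-reciprocal-hit idea listed above can be sketched in a few lines. This is only an illustration, assuming the pair-wise scores have already been computed and stored in a dictionary; the function name `reciprocal_best_hits` and the toy gene names are hypothetical and do not come from the talk.

```python
def reciprocal_best_hits(scores):
    """Return sorted pairs (a, b) where b is a's best hit and a is b's best hit.

    `scores` maps (gene_in_genome_A, gene_in_genome_B) -> similarity score,
    where a higher score means a better hit (an assumption of this sketch).
    """
    best_for_a = {}  # gene in A -> (best gene in B, score)
    best_for_b = {}  # gene in B -> (best gene in A, score)
    for (a, b), score in scores.items():
        if a not in best_for_a or score > best_for_a[a][1]:
            best_for_a[a] = (b, score)
        if b not in best_for_b or score > best_for_b[b][1]:
            best_for_b[b] = (a, score)
    return sorted(
        (a, b) for a, (b, _) in best_for_a.items() if best_for_b[b][0] == a
    )


# Toy example: geneA2 is a second copy whose best hit is also geneB1.
toy_scores = {
    ("geneA1", "geneB1"): 950.0,
    ("geneA2", "geneB1"): 700.0,
    ("geneA1", "geneB2"): 300.0,
}
pairs = reciprocal_best_hits(toy_scores)
```

In the toy data only one pair survives, and the extra copies drop out entirely, which is the one-to-one limitation noted for this class of methods.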


What is the accuracy of current phylogenetic methods?
Tricky question: requires knowing the correct phylogeny independently.
Previous studies used simulation or experimental evolution in the lab.
Others mostly avoided the accuracy question and focused on robustness (i.e. bootstrap).

Importance of correct gene trees

[Figure: gene tree annotated with a duplication and a loss (marked X)]

Mammalian example
Fast-evolving rodents lead to lots of errors!
[Figure: gene trees over dog (D), human (H), mouse (M), rat (R) and opossum]
Implies 1 duplication and at least 3 losses.

Use synteny to determine phylogeny by an independent means
Trees found by maximum likelihood (PHYML).
[Figure: example tree labeled "Matches species topology"]

Phylogenies across 5154 syntenic one-to-one orthologs
[Figure: inferred topologies, showing the one that matches the species topology followed by 316 other topologies]

Inaccuracies depend on sequence length

Inaccuracies depend on divergence

Inaccuracies due to lack of information
Average gene is too short: too few phylogenetically informative characters.
To make progress, we must use additional information.
Current algorithms ignore species: they were designed for solving the species tree problem.
Whole genomes change the game: assume the species tree is known, and solve the gene tree problem.
Our approach: design an algorithm specifically for the gene tree problem.
Key insight: use the species tree to inform the gene tree reconstruction.

What is the connection between species and gene evolution?

5154 gene trees
[Figure: sample of the gene trees, each drawn with a 1.0 sub/site scale bar]

Branch lengths across the genome
[Figure: matrix of branch lengths with 5154 gene families as rows and 22 species branches as columns (dmel, dsec, dsim, dere, dyak, dpse, …)]
$b_{ij}$ = branch length in the jth species of the ith gene tree

Gene trees share similar rates: correlation
[Figure: asp (CG14228) branch lengths plotted against mer (CG6875) branch lengths across the 22 species branches; r = 0.957, slope = 2.10]

[Figure: the average gene tree over 5154 gene families and 22 species branches]
93% of trees have a correlation greater than 0.8 with the average gene tree.
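The comparison against the average gene tree can be sketched as follows. This is a minimal illustration, assuming branch lengths are stored as one row per gene tree and one column per species branch; the toy numbers and the helper names `pearson_r` and `average_tree` are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_tree(branch_matrix):
    """Column-wise mean: the average length of each species branch."""
    n_trees = len(branch_matrix)
    return [sum(column) / n_trees for column in zip(*branch_matrix)]


# Toy data: 3 gene trees x 4 species branches. Rows are roughly scaled
# copies of one another, mimicking genes that differ mainly in overall rate.
toy_branches = [
    [0.10, 0.20, 0.05, 0.40],
    [0.21, 0.39, 0.11, 0.80],
    [0.05, 0.11, 0.02, 0.19],
]
avg = average_tree(toy_branches)
correlations = [pearson_r(row, avg) for row in toy_branches]
```

Each entry of `correlations` plays the role of the per-tree correlation summarized on the slide.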

Initial results suggest two forces of gene evolution:
1. Gene-specific rates
2. Species-specific rates
Branch = Gene rate * Species rate
Can we really de-couple gene-specific and species-specific rates?

Study connection between species and gene evolution
[Figure: sample of the 5154 gene trees, each drawn with a 1.0 sub/site scale bar]

Tree normalization
[Figure: branch-length matrices (5154 gene families by 22 species branches) before and after normalization, labeled "Absolute lengths" and "Relative lengths"]
Tree normalization (a.k.a. row normalization) converts absolute lengths into relative lengths:
$b_{ij}$ = branch length in the jth species of the ith gene tree
$g_i = \sum_j b_{ij}$ (gene-specific rate)
$s_{ij} = b_{ij} / g_i$ (species-specific rate)
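Row normalization is easy to express directly. The sketch below assumes the branch lengths are held in a list of rows (one row per gene tree); the function name `normalize_trees` and the toy values are illustrative only.

```python
def normalize_trees(branch_matrix):
    """Split absolute branch lengths into a per-tree rate and relative lengths.

    Returns (gene_rates, relative), where gene_rates[i] is the total length
    of tree i and relative[i][j] is branch j of tree i divided by that total.
    """
    gene_rates = [sum(row) for row in branch_matrix]
    relative = [
        [length / rate for length in row]
        for row, rate in zip(branch_matrix, gene_rates)
    ]
    return gene_rates, relative


# Toy data: the second tree is the first one evolving twice as fast.
toy_absolute = [
    [0.1, 0.2, 0.7],
    [0.2, 0.4, 1.4],
]
gene_rates, relative = normalize_trees(toy_absolute)
```

After normalization the two toy trees have the same relative lengths, and the factor of two between them is carried entirely by `gene_rates`.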

Effect of normalization on branch length distributions
[Figure: distributions of the dvir branch length across 5154 gene families for absolute and relative lengths, plus the total tree length $g_i$]
Absolute lengths: gamma distributed. Cannot reject gamma for 14/22 branches.
Relative lengths: approximately normally distributed. Cannot reject normal for 9/22 branches. The standard deviation is typically 1/3 to 1/4 of the mean.
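The talk does not say how the gamma distributions were fitted. One simple option, shown here purely as an assumption, is a method-of-moments fit, which matches the sample mean and variance; the function name `fit_gamma_moments` and the toy branch lengths are hypothetical.

```python
import statistics


def fit_gamma_moments(samples):
    """Method-of-moments gamma fit: returns (shape, scale).

    For a gamma distribution, mean = shape * scale and
    variance = shape * scale**2, so both follow from the sample moments.
    """
    mean = statistics.fmean(samples)
    variance = statistics.variance(samples)
    shape = mean * mean / variance
    scale = variance / mean
    return shape, scale


# Toy absolute lengths of one species branch across several gene trees.
toy_lengths = [0.05, 0.08, 0.10, 0.12, 0.15, 0.20, 0.30]
shape, scale = fit_gamma_moments(toy_lengths)
```

A goodness-of-fit test (as implied by "cannot reject gamma") would be a separate step and is not sketched here.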

Effect of normalization on branch correlation
[Figure: dvir branch lengths plotted against dana branch lengths, for absolute and for relative lengths]
Absolute lengths: average r = 0.61
Relative lengths: average r = 0.09

A new model for gene family evolution: Two independent forces
1. Gene-specific rates: $g_i \sim \mathrm{Gamma}(\alpha, \beta)$
2. Species-specific rates: $s_{ij} \sim \mathcal{N}(\mu_j, \sigma_j^2)$
$b_{ij} = g_i \cdot s_{ij}$

Effects that we have seen are consequences of this model
$b_{ij} = g_i \cdot s_{ij}$
If species rates have small standard deviations, we expect branch correlation, because all branches of a gene tree share the same gene-specific factor.
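A small simulation makes this consequence concrete. It is a sketch under assumed parameters: the gamma shape and scale, the two species means, and the choice of a standard deviation equal to 1/4 of the mean are all illustrative, not values from the talk.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(0)   # fixed seed so the run is repeatable
n_genes = 2000
mu = [1.0, 2.0]          # hypothetical species-specific means
sigma = [0.25, 0.5]      # standard deviation set to 1/4 of each mean

# Gene-specific rates from a gamma, species-specific rates from normals.
gene_rates = [rng.gammavariate(2.0, 0.5) for _ in range(n_genes)]
species_rates = [
    [rng.gauss(mu[j], sigma[j]) for j in range(2)] for _ in range(n_genes)
]
# Absolute branch length = gene rate * species rate.
absolute = [[g * s for s in row] for g, row in zip(gene_rates, species_rates)]

# Correlation between the two branches, before and after removing the gene rate.
r_absolute = pearson_r([b[0] for b in absolute], [b[1] for b in absolute])
r_relative = pearson_r([s[0] for s in species_rates], [s[1] for s in species_rates])
```

Under these assumptions the two absolute branch lengths are strongly correlated through the shared gene rate, while the species-specific rates themselves are nearly uncorrelated, which is the same qualitative pattern as the branch correlation slide.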

What is the meaning of the species-specific rate?
The normal is partly due to error in estimating evolutionary distance.
If we fit normals only on long sequences, the standard deviation goes down.
Species-specific means are not affected by sequence length.

All these properties hold for 12 flies, 17 fungi, 4 mammals
12 Drosophila: absolute branches fit gamma (14/22 significant); relative branches fit normal (9/22 significant)
9 Saccharomycete: absolute (16/16); relative (15/16)
9 Candida: absolute (14/14); relative (12/14)

A new strategy for gene tree reconstruction
SPIDIR: SPecies Informed DIstance-based Reconstruction
Traditional maximum likelihood methods:
Propose many topologies.
For each topology, calculate the likelihood of seeing such a tree.
Return the tree that achieves maximum likelihood.
We show that one can calculate the likelihood of a tree being generated by our model. Thus, we can create our own phylogenetic algorithm that uses species information to reconstruct gene trees.

Likelihood calculation: simple case
INPUT: a distance matrix with all pair-wise distances between genes

Propose a topology.
Fit branch lengths to the topology.
Estimate the gene-specific rate.
Normalize the tree.

Reconcile gene tree to species tree.
Determines the actual path of evolution through the species tree.
Algorithms exist to do this fast (linear time): Page 1994, Eulenstein 1997, Zmasek 2001.
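Reconciliation can be illustrated with the standard last-common-ancestor (LCA) mapping: each gene-tree node maps to the smallest species-tree node that contains all of its species, and a node that maps to the same place as one of its children is a duplication. The sketch below is a simple, non-linear-time illustration rather than any of the cited algorithms; trees are nested 2-tuples, the `species_gene` naming convention for leaves is an assumption, and losses are not counted.

```python
def leaf_set(tree):
    """Set of leaf names under a node; a leaf is a string, a node a 2-tuple."""
    if isinstance(tree, str):
        return frozenset([tree])
    return leaf_set(tree[0]) | leaf_set(tree[1])


def all_nodes(tree):
    """Every node of the species tree, each represented by its leaf set."""
    if isinstance(tree, str):
        return [leaf_set(tree)]
    return [leaf_set(tree)] + all_nodes(tree[0]) + all_nodes(tree[1])


def count_duplications(gene_tree, species_tree, species_of):
    """Count duplication nodes implied by LCA reconciliation."""
    species_nodes = all_nodes(species_tree)
    duplications = 0

    def reconcile(node):
        nonlocal duplications
        if isinstance(node, str):
            return frozenset([species_of(node)])
        left = reconcile(node[0])
        right = reconcile(node[1])
        # LCA: the smallest species node containing both children's species.
        mapped = min((s for s in species_nodes if left | right <= s), key=len)
        if mapped == left or mapped == right:
            duplications += 1
        return mapped

    reconcile(gene_tree)
    return duplications


species_tree = ((("mouse", "rat"), "human"), "dog")
species_of = lambda gene: gene.split("_")[0]  # hypothetical naming scheme

congruent = ((("mouse_1", "rat_1"), "human_1"), "dog_1")
misplaced = (("mouse_1", "rat_1"), ("human_1", "dog_1"))

dups_congruent = count_duplications(congruent, species_tree, species_of)
dups_misplaced = count_duplications(misplaced, species_tree, species_of)
```

The second toy tree, in which the rodents are pulled away from human, is a one-to-one family that nevertheless reconciles with a spurious duplication, which is why topology errors matter for duplication and loss inference.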

[Figure: proposed gene tree with branches a to f and their likelihoods Pa, Pb, Pc, Pd, Pe, Pf]
Compare branch lengths to distributions. This allows us to calculate a likelihood for every branch.
Because branches are independent, the likelihood of the tree is the product of the branch likelihoods.
Every branch is highly likely → tree is highly likely.
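The product of branch likelihoods is usually computed as a sum of logs. The sketch below assumes each species branch has a normal distribution over relative lengths with known mean and standard deviation; the parameter values, the dictionary layout and the function names are hypothetical, and the sketch scores only the per-branch terms rather than the full likelihood from the talk.

```python
import math


def normal_logpdf(x, mu, sigma):
    """Log density of a normal distribution with mean mu and st. dev. sigma."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)


def tree_log_likelihood(branch_lengths, gene_rate, species_params):
    """Sum of per-branch log likelihoods of the normalized branch lengths.

    branch_lengths: species branch name -> absolute length in the gene tree
    gene_rate:      gene-specific rate used to normalize the tree
    species_params: species branch name -> (mean, st. dev.) of relative length
    """
    total = 0.0
    for branch, length in branch_lengths.items():
        mu, sigma = species_params[branch]
        total += normal_logpdf(length / gene_rate, mu, sigma)
    return total


# Hypothetical learned parameters for three species branches.
params = {"human": (0.2, 0.05), "mouse": (0.4, 0.1), "rat": (0.4, 0.1)}

expected_tree = {"human": 0.2, "mouse": 0.4, "rat": 0.4}
distorted_tree = {"human": 0.5, "mouse": 0.25, "rat": 0.25}

ll_expected = tree_log_likelihood(expected_tree, 1.0, params)
ll_distorted = tree_log_likelihood(distorted_tree, 1.0, params)
```

A tree whose relative branch lengths sit near the species means scores higher than one with the same total length distributed differently.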

A new phylogenetic method: Learning across complete genomes
Outline
Why use phylogeny?
Part I: Inaccuracies
Part II: The Model
Part III: The Method (a new phylogenetic method: learning across complete genomes). Simple case, complex case, general case.
Part IV: The Results

Likelihood calculation: complex case
[Figure: the previous gene tree (branches a to f with likelihoods Pa to Pf; every branch is highly likely → tree is highly likely) next to a newly proposed tree with branches v, w, x, y, z, reconciled to the species tree over rat, mouse, human and dog]
Propose another topology. This one differs only by rooting.
Most branches have the same length (just a different name):
w = e (human)
x = c (rat)
y = d (mouse)
z = b (rodent)
Two branches are now merged:
v = a + f (dog/hmr)
Reconcile gene tree to species tree.
Mouse and rat branches have the same likelihood as before: Px = Pc, Py = Pd.
Same distribution for dog, but now the dog branch is too long. Why? v = a + f, so Pv < Pf.
Branch w goes from Eutherian to Human (two species branches, w1 and w2). Which distribution should we use?
The distribution is the sum of two independent normals:
$w = w_1 + w_2 \sim \mathcal{N}(\mu_1, \sigma_1^2) + \mathcal{N}(\mu_2, \sigma_2^2) = \mathcal{N}(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)$
Branch w is too short: Pw < Pe.
Same case for z: two species branches (z1 and z2), so the distribution is again the sum of two independent normals:
$z = z_1 + z_2 \sim \mathcal{N}(\mu_1, \sigma_1^2) + \mathcal{N}(\mu_2, \sigma_2^2) = \mathcal{N}(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)$
Branch z is too short: Pz < Pb.
Some branches are less likely → tree is less likely.
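When a gene-tree branch spans several species branches, its expected length follows from adding independent normals: the means add and the variances add. A minimal sketch, with hypothetical branch names `w1` and `w2` and made-up parameters:

```python
import math


def path_distribution(species_params, path):
    """Mean and st. dev. of a gene branch spanning several species branches.

    species_params: species branch name -> (mean, st. dev.)
    path:           list of species branch names the gene branch passes through
    Assumes the species branches are independent normals.
    """
    mean = sum(species_params[name][0] for name in path)
    variance = sum(species_params[name][1] ** 2 for name in path)
    return mean, math.sqrt(variance)


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and st. dev. sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))


# Hypothetical parameters for the two species branches under gene branch w.
params = {"w1": (0.10, 0.03), "w2": (0.20, 0.04)}
mu_w, sigma_w = path_distribution(params, ["w1", "w2"])

# A branch whose length matches only the lower species branch is "too short".
density_expected = normal_pdf(0.30, mu_w, sigma_w)
density_short = normal_pdf(0.20, mu_w, sigma_w)
```

With these numbers a branch of length 0.20 is penalized relative to one of length 0.30, mirroring the "branch w is too short" step.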


Bringing it all together
It turns out we can find the likelihood of any tree by breaking it down into one of three cases.
Main advantage: we do not explicitly penalize duplications and losses. We only ensure branch lengths are close to what we expect given our model.

Example of reconstructing tree with dup/loss: hemoglobin genes
[Figure: gene tree with a hemoglobin alpha clade and a hemoglobin beta clade, each containing D, H, M and R]
This is now the correct topology.

Example of reconstructing tree with dup/loss
[Figure: the hemoglobin alpha / hemoglobin beta gene tree (D, H, M, R in each clade) reconciled to the species tree over rat, mouse, human and dog, with branch likelihoods Px, Py, Pz, Pw, Pv]
Branch z is now longer.
Branch w is now longer.
Branch v is just the right length.
All branches are highly likely → tree is highly likely.

Evaluation: Datasets
Real datasets: 5154 syntenic one-to-ones from 12 flies; 739 syntenic one-to-ones from 9 fungi; 138 whole genome duplicates in 9 fungi.
Simulated datasets (using our gene family model): more complex events.
[Figure: species trees for 12 Drosophila and 9 Saccharomycete]

Evaluation: real data
[Figure: gene tree through the whole genome duplication (WGD), with leaves klac, kwal, agos, scas1, cgla1, cgla2, scas2, s.stricto1 and s.stricto2, and the pre-duplication branch marked]

Branch lengths through WGD well approximated by model
[Figure: species tree (scer, spar, smik, sbay, scas, cgla, klac, kwal, agos) alongside the WGD gene tree (scas1, cgla1, s.stricto1, scas2, cgla2, s.stricto2), with the pre-duplication branch marked]

No apparent topology bias in reconstructing simulation data

Apply genome-wide for 17 fungi
Cluster genes.
Build an alignment for each cluster.
Build a tree for each alignment.
Reconcile to the species tree to determine all duplications and losses.

Identify lineage-specific acceleration
Not just fast-evolving genes, but genes that are faster than expected
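One way to express "faster than expected" is a z-score of each relative branch length against its species-specific distribution. This is an assumed formulation for illustration, not necessarily the authors' test; the parameter values, the cutoff of 3 and the function name are hypothetical.

```python
def acceleration_z_scores(relative_lengths, species_params):
    """z-score of each relative branch length against its species distribution.

    relative_lengths: species branch name -> relative length in one gene tree
    species_params:   species branch name -> (mean, st. dev.) across genes
    """
    return {
        branch: (relative_lengths[branch] - mu) / sigma
        for branch, (mu, sigma) in species_params.items()
    }


# Hypothetical species-specific parameters (st. dev. = 1/4 of the mean).
params = {"dmel": (0.10, 0.025), "dvir": (0.30, 0.075)}

# One gene tree after normalization: the dmel branch is twice its expectation.
relative = {"dmel": 0.20, "dvir": 0.30}

z_scores = acceleration_z_scores(relative, params)
accelerated = [branch for branch, z in z_scores.items() if z > 3.0]
```

Because the lengths are normalized by the gene-specific rate first, a uniformly fast gene is not flagged; only a branch that is long relative to the rest of its own tree stands out.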

Contributions
Better understanding of phylogenetic accuracy in real data.
New model for gene family evolution: gene-specific and species-specific rates.
New phylogenomic algorithm: increased accuracy for reconstruction.

Acknowledgements
Manolis Kellis
Kellis Lab: Ameya Deoras, Pouya Kheradpour, Mike Lin, Alex Stark
Fly datasets: NIH, Doug Smith and Fly analysis consortium
Candida datasets: NIAID, Bruce Birren and Christina Cuomo
NIH Training Grant
CSAIL / Broad / Whitehead

The standard deviation of every species-specific rate is nearly ¼ of the mean

GO enrichment in top 50 trees with most duplications
term    pval
plasma membrane    -1.50E-11
helicase activity    4.72E-12
ammonium transporter activity    1.46E-09
telomere maintenance via recombination    1.66E-09
DNA helicase activity    4.11E-09
transport    4.40E-07
transporter activity    4.80E-07
membrane    1.27E-06
alcohol dehydrogenase activity    6.29E-06
nitrogen utilization
ATPase activity, coupled to transmembrane movement of ions, phosphorylative mechanism
alpha-1,3-mannosyltransferase activity
alcohol dehydrogenase (NADP+) activity    3.84E-05
magnesium ion transport
sodium ion transport
basic amino acid transporter activity
lysophospholipase activity
nuclear nucleosome    4.17E-05
cellular component unknown
translational elongation
oxidoreductase activity, acting on the CH-OH group of donors, NAD or NADP as acceptor
translation elongation factor activity
oxidoreductase activity
alpha-glucosidase activity
fermentation
proteasome complex (sensu Eukaryota)
transmembrane receptor activity
protein amino acid O-linked glycosylation
ribosome
thiamin biosynthesis
myo-inositol transport
myo-inositol transporter activity
maltose catabolism
glycerophospholipid metabolism
NADPH dehydrogenase activity
permease activity
calcium-transporting ATPase activity

GO enrichment in top 50 trees with most gene losses
helicase activity    4.49E-12
telomere maintenance via recombination    1.60E-09
DNA helicase activity    3.95E-09
GTPase activity    7.11E-09
ubiquitin conjugating enzyme activity    6.79E-08
translational elongation    4.24E-07
alcohol dehydrogenase activity    6.17E-06
ATPase activity, coupled to transmembrane movement of ions, phosphorylative mechanism
alpha-1,3-mannosyltransferase activity
translation elongation factor activity    9.27E-06
alcohol dehydrogenase (NADP+) activity    3.78E-05
1,3-beta-glucan synthase activity
sodium ion transport
IMP dehydrogenase activity
ribosome    4.35E-05
protein serine/threonine phosphatase activity
oxidoreductase activity, acting on the CH-OH group of donors, NAD or NADP as acceptor
structural constituent of cytoskeleton
alpha-glucosidase activity
fermentation
proteasome complex (sensu Eukaryota)
transmembrane receptor activity
protein amino acid O-linked glycosylation
ammonium transporter activity
hydrogen-exporting ATPase activity, phosphorylative mechanism
GTP biosynthesis
myo-inositol transport
regulation of pH
myo-inositol transporter activity
maltose catabolism
aconitate hydratase activity
calcium-transporting ATPase activity
IMP cyclohydrolase activity
plasma membrane
transporter activity
protein phosphatase type 2A activity

GO enrichment in top 10 trees with most genes
DNA helicase activity    3.79E-13
telomere maintenance via recombination    4.61E-13
helicase activity    8.05E-13
ATPase activity, coupled to transmembrane movement of ions, phosphorylative mechanism    5.92E-08
alcohol dehydrogenase activity
sodium ion transport    1.15E-06
fermentation    1.13E-05
calcium-transporting ATPase activity
NADPH dehydrogenase activity
transport
oxidoreductase activity
membrane
alcohol dehydrogenase (NADP+) activity
alcohol metabolism
transporter activity
plasma membrane
multidrug transport

Gene functions that evolve rapidly
Analyses more robust at GO category level

Orthologs and paralogs
[Figure: gene tree over human, mouse, rat, dog and rabbit, with paralogs and orthologs marked]
Understanding orthologs and paralogs is a basic requirement for comparative genomic studies.
Orthologs arise by speciation, descending vertically from a single ancestral gene, and therefore typically preserve the ancestral function.
Paralogs, on the other hand, arise by gene duplication and are likely to take on new functions.

Complement Ka/Ks studies
Ks saturates very rapidly.
Gene-specific rates hold across much larger distances.

GO enrichment in top 50 trees with most genes
term    pval
helicase activity    1.55E-11
telomere maintenance via recombination    3.83E-09
DNA helicase activity    1.05E-08
plasma membrane    2.84E-08
1,3-beta-glucanosyltransferase activity    7.94E-08
transporter activity    1.45E-07
transport    1.71E-07
pyruvate decarboxylase activity    2.09E-06
protein amino acid O-linked glycosylation    2.29E-06
Golgi apparatus    7.73E-06
alcohol dehydrogenase activity    1.01E-05
ATPase activity, coupled to transmembrane movement of ions, phosphorylative mechanism
alpha-1,3-mannosyltransferase activity
GTPase activity    1.57E-05
cyclin-dependent protein kinase holoenzyme complex    1.70E-05
alpha-1,2-mannosyltransferase activity    2.95E-05
regulation of glycogen biosynthesis
cytosine-purine permease activity    5.51E-05
sodium ion transport
basic amino acid transporter activity
lysophospholipase activity
membrane
cell wall (sensu Fungi)
alpha-glucosidase activity
fermentation
proteasome complex (sensu Eukaryota)
3-chloroallyl aldehyde dehydrogenase activity
oxidoreductase activity
cyclin-dependent protein kinase regulator activity
ammonium transporter activity
myo-inositol transport
myo-inositol transporter activity
maltose catabolism
glycerophospholipid metabolism
NADPH dehydrogenase activity

Correlation between duplication rate and mutation rate
