1
Phylogenomics of mammalian, fly and fungal genomes
Matt Rasmussen and Manolis Kellis. Friday, April 2007. MIT Computer Science and Artificial Intelligence Lab; Broad Institute of MIT and Harvard
2
Multiple fully sequenced clades of genomes
32 mammals 12 flies ~20 fungi and many more…
3
Comparative genomics
Identify conserved regions
Annotate genes, regulatory elements
Study gene and genome evolution
Recognize orthologs / paralogs
Gene duplications / losses
Lineage-specific expansions
Varying rates of evolution
And much more…
4
Comparative genomics requires correct orthology/phylogeny
Orthologs and paralogs are best understood in the context of a phylogeny (Fitch 1970)
Phylogenies are necessary for inferring duplications and losses (Goodman 1979)
Goal of phylogenomics (Eisen 1998): determine the phylogeny of every gene family in multiple complete genomes
5
Outline
Why use phylogeny?
What is the accuracy of phylogenetic methods?
Part I: Inaccuracies
Understanding sources of error
Modeling common properties of gene trees
Part II: The Model
Gene-specific and species-specific rates
A new phylogenetic method: Learning across complete genomes
Part III: The Method
Increase in phylogenetic accuracy
Part IV: The Results
Additional applications
8
Contrast of the phylogenetic method with alternative methods
Pair-wise sequence comparison
  Best reciprocal BLAST hits
  Focuses on one-to-one orthologs (no duplications)
Hit clustering methods
  Detect clusters in graph of pair-wise hits
  Difficult to separate large connected components
Synteny methods
  Detect conserved regions, stretches of nearby hits
  Genome alignment methods focus on best hits
  Requires close genomes
  Unable to resolve tandem duplications
Phylogenetic methods
  Phylogeny of family clusters orthologs near each other
  Traditionally applied to specific families
  Can they be applied genome-wide?
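The best-reciprocal-hit idea from the pair-wise comparison item can be sketched in a few lines. This is only an illustration and not the pipeline used in the talk: the score dictionaries and gene names are hypothetical stand-ins for parsed BLAST output.

```python
def best_hits(scores):
    """Map each query gene to its highest-scoring target gene.

    `scores` maps (query, target) pairs to a similarity score
    (a hypothetical stand-in for parsed BLAST bit scores).
    """
    best = {}
    for (query, target), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted pairs (a, b) where a's best hit is b and b's best hit is a."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Toy example: humA has two mouse relatives, musA1 and musA2 (a duplication).
human_vs_mouse = {("humA", "musA1"): 200, ("humA", "musA2"): 180,
                  ("humB", "musB"): 150}
mouse_vs_human = {("musA1", "humA"): 200, ("musA2", "humA"): 180,
                  ("musB", "humB"): 150}
pairs = reciprocal_best_hits(human_vs_mouse, mouse_vs_human)
print(pairs)
```

In this toy input the second mouse copy, musA2, is never reported, which is the one-to-one limitation the slide points out.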
10
What is the accuracy of current phylogenetic methods?
Tricky question
Requires knowing the correct phylogeny independently
Previous studies used:
  Simulation
  Experimental evolution in lab
Others mostly avoided the accuracy question and focused on robustness (e.g. bootstrap)
13
Importance of correct gene trees
[Figure: gene tree annotated with a duplication and a loss (marked X)]
15
Importance of correct gene trees
Mammalian example
Fast-evolving rodents lead to lots of errors!
[Figure: gene trees over dog (D), human (H), mouse (M), rat (R) and opossum]
Implies 1 duplication and at least 3 losses
17
What is the accuracy of current phylogenetic methods?
Use synteny to determine phylogeny by an independent means
Trees found by Maximum Likelihood (PHYML)
[Figure: reconstructed topologies, one labelled "Matches species topology"]
18
What is the accuracy of current phylogenetic methods?
Phylogenies across 5154 syntenic one-to-one orthologs
[Figure: reconstructed topologies, one labelled "Matches species topology", followed by "Etc… 316 other topologies"]
20
Inaccuracies depend on sequence length
21
Inaccuracies depend on divergence
22
Inaccuracies due to lack of information
Average gene is too short
  Too few phylogenetically informative characters
  To make progress, must use additional information
Current algorithms ignore species
  Designed for solving the species tree problem
Whole genomes change the game
  Assume species tree is known
  Solve the gene tree problem
Our approach: design an algorithm specifically for the gene tree problem
Key insight: use species tree to inform the gene tree reconstruction
26
What is the connection between species and gene evolution?
5154 gene trees
[Figure: example gene trees, each drawn with a scale bar of 1.0 sub/site]
29
Branch lengths across the genome
[Figure: matrix of branch lengths, 5154 gene families by 22 species branches (dmel, dsec, dsim, dere, dyak, dpse, …)]
$b_{ij}$ = branch length in the $j$th species of the $i$th gene tree
31
Gene trees share similar rates: correlation
[Figure: branch lengths of asp (CG14228) plotted against branch lengths of mer (CG6875) over the 22 species branches]
r = 0.957, slope = 2.10
33
Gene trees share similar rates: correlation
[Figure: average gene tree computed over 5154 gene families and 22 species branches]
93% of trees have a correlation greater than 0.8 with the average gene tree
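The comparison against the average gene tree can be sketched as follows. This is a minimal illustration under the assumption that each gene tree is stored as a list of branch lengths in a fixed species-branch order; the three toy trees are made up and are not data from the talk.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_tree(trees):
    """Per-branch mean length over all gene trees."""
    return [sum(column) / len(trees) for column in zip(*trees)]


def fraction_correlated(trees, threshold=0.8):
    """Fraction of gene trees whose correlation with the average exceeds threshold."""
    avg = average_tree(trees)
    hits = sum(1 for tree in trees if pearson(tree, avg) > threshold)
    return hits / len(trees)


# Hypothetical toy data: rows are gene trees, columns are species branches.
trees = [
    [0.10, 0.20, 0.40],  # slow gene
    [0.20, 0.40, 0.80],  # same shape, twice as fast
    [0.40, 0.20, 0.10],  # different shape
]
print(fraction_correlated(trees))
```

Correlation ignores overall scale, so the first two toy trees count as similar even though one evolves twice as fast as the other.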
35
Initial results suggest two forces of gene evolution:
1. Gene-specific rates
2. Species-specific rates
Branch = Gene rate * Species rate
Can we really de-couple gene-specific and species-specific rates?
36
Study connection between species and gene evolution
5154 gene trees
[Figure: example gene trees, each drawn with a scale bar of 1.0 sub/site]
39
Tree normalization
[Figure: matrix of absolute lengths (5154 gene families by 22 species branches) converted to a matrix of relative lengths by tree normalization (a.k.a. row normalization)]
$b_{ij}$ = branch length in the $j$th species of the $i$th gene tree
$g_i = \sum_j b_{ij}$ (gene-specific rate)
$s_{ij} = b_{ij} / g_i$ (species-specific rate)
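Tree (row) normalization can be sketched directly from these definitions. The function name and the toy matrix are assumptions for illustration; each row is one gene tree and each column one species branch.

```python
def normalize_trees(branch_lengths):
    """Row-normalize a gene-by-branch matrix of absolute branch lengths.

    Returns (gene_rates, relative_lengths), where gene_rates[i] is the total
    length of gene tree i and relative_lengths[i][j] is the absolute length
    of branch j divided by that total.
    """
    gene_rates = [sum(row) for row in branch_lengths]
    relative_lengths = [
        [length / rate for length in row]
        for row, rate in zip(branch_lengths, gene_rates)
    ]
    return gene_rates, relative_lengths


# Hypothetical toy matrix: the second gene evolves twice as fast as the first.
absolute = [
    [0.25, 0.50, 1.25],
    [0.50, 1.00, 2.50],
]
gene_rates, relative = normalize_trees(absolute)
print(gene_rates)
print(relative)
```

After normalization both toy rows become identical, because the two genes differ only in their overall rate.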
43
Effect of normalization on branch length distributions
[Figure: distributions of the dvir branch over 5154 gene families, as absolute lengths and as relative lengths, plus the distribution of total tree length $g_i$]
Absolute lengths: gamma distributed; cannot reject gamma for 14/22
Relative lengths: approx. normally distributed; cannot reject normal for 9/22; st. dev. typically 1/3 to 1/4 of mean
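One simple way to fit a gamma distribution to the absolute lengths of a single species branch is the method of moments. The talk does not say how its gamma fits or tests were done, so this is an assumed stand-in, and the input values are hypothetical.

```python
import statistics


def fit_gamma_moments(lengths):
    """Method-of-moments gamma fit; returns (shape, rate).

    Uses mean = shape / rate and variance = shape / rate**2.
    The (shape, rate) parameterization is an assumption made here.
    """
    mean = statistics.mean(lengths)
    variance = statistics.variance(lengths)  # sample variance
    shape = mean * mean / variance
    rate = mean / variance
    return shape, rate


# Hypothetical absolute lengths of one species branch across five gene families.
branch_lengths = [0.02, 0.05, 0.03, 0.08, 0.04]
shape, rate = fit_gamma_moments(branch_lengths)
print(f"shape={shape:.2f} rate={rate:.2f} implied mean={shape / rate:.3f}")
```

A goodness-of-fit test would still be needed to decide whether gamma can be rejected for a branch; that step is not shown here.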
46
Effect of normalization on branch correlation
[Figure: dvir branch lengths plotted against dana branch lengths, as absolute lengths and as relative lengths]
Absolute lengths: average r = 0.61
Relative lengths: average r = 0.09
47
A new model for gene family evolution: Two independent forces
1. Gene-specific rates: $G \sim \mathrm{Gamma}(a, b)$
2. Species-specific rates: $S_j \sim \mathrm{Normal}(\mu_j, \sigma_j^2)$
$b_{ij} = g_i \cdot s_{ij}$
48
Effects that we have seen are consequences of this model
$b_{ij} = g_i \cdot s_{ij}$
If species rates have small standard deviations, we expect branch lengths to be correlated across gene trees
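A small simulation can illustrate why the model predicts correlated branch lengths. This is a sketch under assumed parameter values: the species means, the gamma shape and scale, and the clamping of negative draws to zero are all choices made here, not values from the talk.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def simulate_gene_tree(species_means, sd_fraction, shape, scale, rng):
    """Draw one gene tree's branch lengths as gene rate times species rates."""
    gene_rate = rng.gammavariate(shape, scale)  # gene-specific rate
    species_rates = [
        max(0.0, rng.gauss(mu, sd_fraction * mu))  # clamp rare negative draws
        for mu in species_means
    ]
    return [gene_rate * s for s in species_rates]


rng = random.Random(0)
species_means = [0.02, 0.05, 0.10, 0.15, 0.30, 0.38]  # hypothetical values
for sd_fraction in (0.0, 0.25, 1.0):
    tree_a = simulate_gene_tree(species_means, sd_fraction, 2.0, 0.5, rng)
    tree_b = simulate_gene_tree(species_means, sd_fraction, 2.0, 0.5, rng)
    print(sd_fraction, round(pearson(tree_a, tree_b), 3))
```

With a standard deviation of zero the two simulated trees are exact rescalings of each other, so their correlation is 1 regardless of the gene rates drawn.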
49
What is the meaning of the species-specific rate?
The normal variation is partly due to error in estimating evolutionary distance
If we fit normals only on long sequences, the standard deviation goes down
Species-specific means are not affected by sequence length
50
All these properties hold for 12 flies, 17 fungi, 4 mammals
12 Drosophila: absolute branches fit gamma (14/22 significant); relative branches fit normal (9/22 significant)
9 Saccharomycete: Abs (16/16), Relative (15/16)
9 Candida: Abs (14/14), Relative (12/14)
52
A new strategy for gene tree reconstruction
SPecies Informed DIstance-based Reconstruction (SPIDIR)
Traditional Maximum Likelihood methods
  Propose many topologies
  For each topology, calculate the likelihood of seeing such a tree
  Return the tree that achieves max likelihood
We show that one can calculate the likelihood of a tree being generated by our model
Thus, we can create our own phylogenetic algorithm that uses species information to reconstruct gene trees.
53
Likelihood calculation: simple case
INPUT: a distance matrix with all pair-wise distances between genes
54
Likelihood calculation: simple case
Propose a topology
Fit branch lengths to topology
Estimate gene-specific rate
Normalize tree
55
Likelihood calculation: simple case
Reconcile gene tree to species tree
Determines actual path of evolution through species tree
Algorithms exist to do this fast (linear time): Page 1994, Eulenstein 1997, Zmasek 2001
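Reconciliation can be illustrated with a small last-common-ancestor mapping. This sketch is a simple, non-linear-time version written for clarity and is not one of the cited algorithms; the nested-tuple tree encoding is an assumption, and each gene leaf is labelled only by its species. It counts duplications and does not count losses.

```python
def clades(tree):
    """Return the leaf set of every node in a nested-tuple tree.

    The last element of the returned list is the leaf set of `tree` itself.
    """
    if isinstance(tree, str):
        return [frozenset([tree])]
    result = []
    leaves = frozenset()
    for child in tree:
        sub = clades(child)
        result.extend(sub)
        leaves |= sub[-1]
    result.append(leaves)
    return result


def lca(species_clades, species):
    """Smallest species-tree clade that contains all the given species."""
    return min((c for c in species_clades if species <= c), key=len)


def count_duplications(gene_tree, species_clades):
    """Return (species clade the node maps to, number of duplication nodes)."""
    if isinstance(gene_tree, str):
        return frozenset([gene_tree]), 0
    child_maps = []
    duplications = 0
    for child in gene_tree:
        mapped, dups = count_duplications(child, species_clades)
        child_maps.append(mapped)
        duplications += dups
    here = lca(species_clades, frozenset().union(*child_maps))
    if here in child_maps:  # node maps to the same place as one of its children
        duplications += 1
    return here, duplications


species_tree = ("opossum", ("dog", ("human", ("mouse", "rat"))))
species_clades = clades(species_tree)
# Gene tree in which the rodents are misplaced outside dog + human.
gene_tree = ("opossum", (("dog", "human"), ("mouse", "rat")))
print(count_duplications(gene_tree, species_clades)[1])
```

A gene tree that agrees with the species tree needs no duplications, while the misplaced-rodent tree forces one.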
56
Likelihood calculation: simple case
[Figure: reconciled gene tree with branches a to f and their likelihoods Pa to Pf]
Compare branch lengths to distributions
Allows us to calculate a likelihood for every branch
57
Likelihood calculation: simple case
[Figure: reconciled gene tree with branches a to f and their likelihoods Pa to Pf]
Every branch is highly likely, so the tree is highly likely
Because branches are independent, likelihood of tree is product of branch likelihoods
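The product of branch likelihoods is easiest to compute as a sum of logs. The following sketch assumes each species branch has a normal distribution over relative lengths; the branch names a to f follow the figure, but every mean, standard deviation and length is hypothetical. Any term for the gene-specific rate is left out.

```python
import math


def normal_logpdf(x, mean, sd):
    """Log density of a normal distribution with the given mean and st. dev."""
    variance = sd * sd
    return -0.5 * math.log(2 * math.pi * variance) - (x - mean) ** 2 / (2 * variance)


def tree_log_likelihood(relative_lengths, species_params):
    """Sum of per-branch log-likelihoods (log of the product over branches)."""
    total = 0.0
    for branch, length in relative_lengths.items():
        mean, sd = species_params[branch]
        total += normal_logpdf(length, mean, sd)
    return total


# Hypothetical (mean, st. dev.) of the relative length of each species branch.
species_params = {
    "a": (0.10, 0.03), "b": (0.15, 0.04), "c": (0.20, 0.05),
    "d": (0.20, 0.05), "e": (0.15, 0.04), "f": (0.20, 0.05),
}
good_tree = {"a": 0.11, "b": 0.14, "c": 0.21, "d": 0.19, "e": 0.15, "f": 0.20}
bad_tree = dict(good_tree, f=0.45)  # one branch far longer than expected
print(tree_log_likelihood(good_tree, species_params))
print(tree_log_likelihood(bad_tree, species_params))
```

A single branch that is far from its expected length is enough to pull the whole tree's likelihood down.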
58
A new phylogenetic method: Learning across complete genomes
Outline Why use phylogeny? Part I Inaccuracies Part II The Model A new phylogenetic method: Learning across complete genomes Part III The Method Simple case Complex case General case Part IV The Results
60
Likelihood calculation: complex case
[Figure: first topology with branch likelihoods Pa to Pf; every branch is highly likely, so the tree is highly likely]
Propose another topology
This one differs only by rooting
Most branches have the same length (just a different name):
  w = e (human)
  x = c (rat)
  y = d (mouse)
  z = b (rodent)
Two branches are now merged:
  v = a + f (dog/hmr)
61
Likelihood calculation: complex case
[Figure: first topology next to the second topology, with w = e (human), x = c (rat), y = d (mouse), z = b (rodent), v = a + f (dog/hmr)]
Reconcile gene tree to species tree
62
Likelihood calculation: complex case
[Figure: second topology reconciled to the species tree, with leaves Rat, Mouse, Human and Dog]
63
Likelihood calculation: complex case
[Figure: second topology with likelihoods Px (rat) and Py (mouse) filled in]
Mouse and rat branches have the same likelihood as before: Px = Pc, Py = Pd
64
Likelihood calculation: complex case
[Figure: second topology with likelihood Pv (dog) filled in]
Same distribution for dog, but now the dog branch is too long. Why? v = a + f, so Pv < Pf
65
Likelihood calculation: complex case
[Figure: second topology with the human branch w split into w1 and w2]
Branch w goes from Eutherian to Human (two species branches)
Which distribution should we use?
66
Likelihood calculation: complex case
[Figure: second topology with likelihood Pw (human) filled in]
The distribution is the sum of two independent normals
$w = w_1 + w_2 \sim N(\mu_1, \sigma_1^2) + N(\mu_2, \sigma_2^2) = N(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)$
Branch w is too short, Pw < Pe
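The merged-branch case can be sketched by adding means and variances before evaluating the density. The numbers below are hypothetical; only the rule that independent normals add in mean and variance comes from the slide.

```python
import math


def merge_normals(parts):
    """Combine independent normals given as (mean, variance) pairs.

    The sum of independent normals is normal, with the means added and
    the variances added.
    """
    mean = sum(m for m, _ in parts)
    variance = sum(v for _, v in parts)
    return mean, variance


def normal_logpdf(x, mean, variance):
    """Log density of a normal distribution with the given mean and variance."""
    return -0.5 * math.log(2 * math.pi * variance) - (x - mean) ** 2 / (2 * variance)


# Hypothetical species branches w1 and w2 as (mean, variance) pairs.
w1 = (0.10, 0.03 ** 2)
w2 = (0.20, 0.05 ** 2)
mean_w, var_w = merge_normals([w1, w2])

observed_w = 0.20  # a gene-tree branch that is too short for two species branches
print(mean_w, var_w)
print(normal_logpdf(observed_w, mean_w, var_w))
print(normal_logpdf(mean_w, mean_w, var_w))
```

The observed branch scores lower than a branch at the merged mean, which is the sense in which branch w is too short.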
67
Likelihood calculation: complex case
[Figure: second topology with the rodent branch z split into z1 and z2]
Same case for z: two species branches
Distribution is the sum of two independent normals
$z = z_1 + z_2 \sim N(\mu_1, \sigma_1^2) + N(\mu_2, \sigma_2^2) = N(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)$
68
Likelihood calculation: complex case
[Figure: second topology with likelihood Pz (rodent) filled in]
Branch z is too short, Pz < Pb
69
Likelihood calculation: complex case
[Figure: both topologies with all branch likelihoods filled in]
Some branches are less likely, so the tree is less likely
71
Bringing it all together
It turns out we can find the likelihood of any tree by breaking it down into one of three cases
Main advantage: we do not explicitly penalize dup/loss
We only ensure branch lengths are close to what we expect given our model
72
Example of reconstructing tree with dup/loss: hemoglobin genes
[Figure: gene tree with a hemoglobin alpha clade and a hemoglobin beta clade, each over dog (D), human (H), mouse (M) and rat (R)]
This is now the correct topology
73
Example of reconstructing tree with dup/loss
[Figure: hemoglobin alpha and hemoglobin beta clades over D, H, M, R, reconciled to the species tree (Rat, Mouse, Human, Dog) with branch likelihoods Px, Py, Pz, Pw, Pv]
Branch z is now longer
Branch w is now longer
Branch v is just the right length
All branches are highly likely, so the tree is highly likely
75
Evaluation: Datasets
Real datasets
  5154 syntenic one-to-ones from 12 flies
  739 syntenic one-to-ones from 9 fungi
  138 whole genome duplicates in 9 fungi
Simulated (using our gene family model)
  More complex events
[Figure: species trees labelled "12 Drosophila" and "9 Saccharomycete"]
78
Evaluation real data
[Figure: tree through the whole genome duplication (WGD) with leaves klac, kwal, agos, scas1, cgla1, cgla2, scas2, s.stricto1, s.stricto2 and a pre-duplication branch]
79
Branch lengths through WGD well approximated by model
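[Figure: species tree (scer, spar, smik, sbay, scas, cgla, klac, kwal, agos) beside the tree through the whole genome duplication (WGD) with leaves scas1, cgla1, cgla2, scas2, s.stricto1, s.stricto2 and a pre-duplication branch]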
80
No apparent topology bias in reconstructing simulation data
82
Apply genome-wide for 17 fungi
Cluster genes
Build alignment for each cluster
Build tree for each alignment
Reconcile to species tree to determine all duplications and losses
83
Identify lineage-specific acceleration
Not just fast-evolving genes, but genes that are faster than expected
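One way to make "faster than expected" concrete is a z-score of each relative branch length against that species branch's distribution. The scoring rule, the threshold and all numbers below are assumptions for illustration; the talk does not state how acceleration was scored.

```python
def acceleration_scores(relative_lengths, species_params):
    """Z-score of each relative branch length against its species distribution.

    species_params maps a branch name to (mean, st. dev.) of the relative
    length expected for that species branch.
    """
    scores = {}
    for branch, length in relative_lengths.items():
        mean, sd = species_params[branch]
        scores[branch] = (length - mean) / sd
    return scores


def accelerated_branches(relative_lengths, species_params, threshold=3.0):
    """Branches whose relative length exceeds expectation by more than threshold."""
    scores = acceleration_scores(relative_lengths, species_params)
    return sorted(branch for branch, z in scores.items() if z > threshold)


# Hypothetical species distributions and one gene's relative branch lengths.
species_params = {"human": (0.10, 0.025), "mouse": (0.30, 0.075), "dog": (0.20, 0.05)}
gene = {"human": 0.11, "mouse": 0.28, "dog": 0.40}
scores = acceleration_scores(gene, species_params)
print(scores)
print(accelerated_branches(gene, species_params))
```

Because the lengths are relative, a gene that is uniformly fast in every lineage is not flagged; only a branch that is long compared with the rest of its own tree stands out.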
84
Contributions
Better understanding of phylogenetic accuracy in real data
New model for gene family evolution
  Gene-specific and species-specific rates
New phylogenomic algorithm
  Increased accuracy for reconstruction
85
Acknowledgements
Manolis Kellis
Kellis Lab: Ameya Deoras, Pouya Kheradpour, Mike Lin, Alex Stark
Fly datasets: Doug Smith and Fly analysis consortium
Candida datasets: Bruce Birren and Christina Cuomo
NIH, NIAID, NIH Training Grant
CSAIL / Broad / Whitehead
88
The standard deviation of every species-specific rate is nearly ¼ of the mean
89
GO enrichment in top 50 trees with most duplications
term pval
plasma membrane -1.50E-11
helicase activity 4.72E-12
ammonium transporter activity 1.46E-09
telomere maintenance via recombination 1.66E-09
DNA helicase activity 4.11E-09
transport 4.40E-07
transporter activity 4.80E-07
membrane 1.27E-06
alcohol dehydrogenase activity 6.29E-06
nitrogen utilization
ATPase activity, coupled to transmembrane movement of ions, phosphorylative mechanism
alpha-1,3-mannosyltransferase activity
alcohol dehydrogenase (NADP+) activity 3.84E-05
magnesium ion transport
sodium ion transport
basic amino acid transporter activity
lysophospholipase activity
nuclear nucleosome 4.17E-05
cellular component unknown
translational elongation
oxidoreductase activity, acting on the CH-OH group of donors, NAD or NADP as acceptor
translation elongation factor activity
oxidoreductase activity
alpha-glucosidase activity
fermentation
proteasome complex (sensu Eukaryota)
transmembrane receptor activity
protein amino acid O-linked glycosylation
ribosome
thiamin biosynthesis
myo-inositol transport
myo-inositol transporter activity
maltose catabolism
glycerophospholipid metabolism
NADPH dehydrogenase activity
permease activity
calcium-transporting ATPase activity
90
GO enrichment in top 50 trees with most gene losses
term pval
helicase activity 4.49E-12
telomere maintenance via recombination 1.60E-09
DNA helicase activity 3.95E-09
GTPase activity 7.11E-09
ubiquitin conjugating enzyme activity 6.79E-08
translational elongation 4.24E-07
alcohol dehydrogenase activity 6.17E-06
ATPase activity, coupled to transmembrane movement of ions, phosphorylative mechanism
alpha-1,3-mannosyltransferase activity
translation elongation factor activity 9.27E-06
alcohol dehydrogenase (NADP+) activity 3.78E-05
1,3-beta-glucan synthase activity
sodium ion transport
IMP dehydrogenase activity
ribosome 4.35E-05
protein serine/threonine phosphatase activity
oxidoreductase activity, acting on the CH-OH group of donors, NAD or NADP as acceptor
structural constituent of cytoskeleton
alpha-glucosidase activity
fermentation
proteasome complex (sensu Eukaryota)
transmembrane receptor activity
protein amino acid O-linked glycosylation
ammonium transporter activity
hydrogen-exporting ATPase activity, phosphorylative mechanism
GTP biosynthesis
myo-inositol transport
regulation of pH
myo-inositol transporter activity
maltose catabolism
aconitate hydratase activity
calcium-transporting ATPase activity
IMP cyclohydrolase activity
plasma membrane
transporter activity
protein phosphatase type 2A activity
91
GO enrichment in top 10 trees with most genes
term pval
DNA helicase activity 3.79E-13
telomere maintenance via recombination 4.61E-13
helicase activity 8.05E-13
ATPase activity, coupled to transmembrane movement of ions, phosphorylative mechanism 5.92E-08
alcohol dehydrogenase activity
sodium ion transport 1.15E-06
fermentation 1.13E-05
calcium-transporting ATPase activity
NADPH dehydrogenase activity
transport
oxidoreductase activity
membrane
alcohol dehydrogenase (NADP+) activity
alcohol metabolism
transporter activity
plasma membrane
multidrug transport
92
Gene functions that evolve rapidly
Analyses more robust at GO category level
93
Orthologs and paralogs
[Figure: gene tree over human, mouse, rat, dog and rabbit with orthologs and paralogs marked]
Understanding orthologs and paralogs is a basic requirement for comparative genomic studies. Orthologs arise by speciation and vertical descent from a single ancestral gene, and therefore typically preserve the ancestral function. Paralogs, on the other hand, arise by gene duplication and are likely to take on new functions.
Orthologs: arise by speciation, typically keep the same function
Paralogs: arise by duplication, typically take on new functions
94
Complement Ka/Ks studies
Ks saturates very rapidly
Gene-specific rates hold across much larger distances
95
GO enrichment in top 50 trees with most genes
term pval
helicase activity 1.55E-11
telomere maintenance via recombination 3.83E-09
DNA helicase activity 1.05E-08
plasma membrane 2.84E-08
1,3-beta-glucanosyltransferase activity 7.94E-08
transporter activity 1.45E-07
transport 1.71E-07
pyruvate decarboxylase activity 2.09E-06
protein amino acid O-linked glycosylation 2.29E-06
Golgi apparatus 7.73E-06
alcohol dehydrogenase activity 1.01E-05
ATPase activity, coupled to transmembrane movement of ions, phosphorylative mechanism
alpha-1,3-mannosyltransferase activity
GTPase activity 1.57E-05
cyclin-dependent protein kinase holoenzyme complex 1.70E-05
alpha-1,2-mannosyltransferase activity 2.95E-05
regulation of glycogen biosynthesis
cytosine-purine permease activity 5.51E-05
sodium ion transport
basic amino acid transporter activity
lysophospholipase activity
membrane
cell wall (sensu Fungi)
alpha-glucosidase activity
fermentation
proteasome complex (sensu Eukaryota)
3-chloroallyl aldehyde dehydrogenase activity
oxidoreductase activity
cyclin-dependent protein kinase regulator activity
ammonium transporter activity
myo-inositol transport
myo-inositol transporter activity
maltose catabolism
glycerophospholipid metabolism
NADPH dehydrogenase activity
96
Correlation between duplication rate and mutation rate