Matt Rasmussen and Manolis Kellis


1 Matt Rasmussen and Manolis Kellis
Phylogenomics of mammalian, fly and fungal genomes
Matt Rasmussen and Manolis Kellis
Friday, April 2007
MIT Computer Science and Artificial Intelligence Lab
Broad Institute of MIT and Harvard

2 Multiple fully sequenced clades of genomes
32 mammals, 12 flies, ~20 fungi, and many more…

3 Comparative genomics
Identify conserved regions
Annotate genes, regulatory elements
Study gene and genome evolution
Recognize orthologs / paralogs
Gene duplications / losses
Lineage-specific expansions
Varying rates of evolution
And much more…

4 Comparative genomics requires correct orthology/phylogeny
Orthologs and paralogs are best understood in the context of a phylogeny (Fitch 1970)
Phylogenies are necessary for inferring duplications and losses (Goodman 1979)
Goal of phylogenomics (Eisen 1998): determine the phylogeny of every gene family in multiple complete genomes

5 Outline
Part I, Inaccuracies: Why use phylogeny? What is the accuracy of phylogenetic methods? Understanding sources of error
Part II, The Model: Modeling common properties of gene trees; gene-specific and species-specific rates
Part III, The Method: A new phylogenetic method: learning across complete genomes
Part IV, The Results: Increase in phylogenetic accuracy; additional applications

7 Contrast of the phylogenetic method with alternative methods
Pair-wise sequence comparison
  Best reciprocal BLAST hits
  Focuses on one-to-one orthologs (no duplications)
Hit clustering methods
  Detect clusters in graph of pair-wise hits
  Difficult to separate large connected components
Synteny methods
  Detect conserved regions, stretches of nearby hits
  Genome alignment methods focus on best hits
  Requires close genomes
  Unable to resolve tandem duplications
Phylogenetic methods
  Phylogeny of family clusters orthologs near each other
  Traditionally applied to specific families

8 Contrast of the phylogenetic method with alternative methods
Pair-wise sequence comparison
  Best reciprocal BLAST hits
  Focuses on one-to-one orthologs (no duplications)
Hit clustering methods
  Detect clusters in graph of pair-wise hits
  Difficult to separate large connected components
Synteny methods
  Detect conserved regions, stretches of nearby hits
  Genome alignment methods focus on best hits
  Requires close genomes
  Unable to resolve tandem duplications
Phylogenetic methods
  Phylogeny of family clusters orthologs near each other
  Traditionally applied to specific families
  Can they be applied genome-wide?
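As an illustration of the first alternative, best reciprocal hits can be sketched in a few lines. This is a minimal sketch, not a tool used in the talk: the score dictionaries stand in for parsed BLAST results, and the gene names and scores are made up.

```python
def best_hits(scores):
    """Map each query gene to its highest-scoring subject gene.

    `scores` maps (query, subject) pairs to a similarity score; it is a
    hypothetical stand-in for parsed BLAST bit scores.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return pairs (a, b) where a's best hit is b and b's best hit is a."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Toy scores: geneA2 is a duplicate that also hits geneB1.
a_vs_b = {("geneA1", "geneB1"): 200.0, ("geneA1", "geneB2"): 50.0,
          ("geneA2", "geneB1"): 150.0}
b_vs_a = {("geneB1", "geneA1"): 198.0, ("geneB1", "geneA2"): 149.0,
          ("geneB2", "geneA1"): 52.0}
pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
```

Only the one-to-one pair survives; the duplicate geneA2 is silently dropped, which is the limitation the slide points out.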

9 Outline
Part I, Inaccuracies: Why use phylogeny? What is the accuracy of phylogenetic methods? Understanding sources of error
Part II, The Model: Modeling common properties of gene trees; gene-specific and species-specific rates
Part III, The Method: A new phylogenetic method: learning across complete genomes
Part IV, The Results: Increase in phylogenetic accuracy; additional applications

10 What is the accuracy of current phylogenetic methods?
Tricky question: requires knowing the correct phylogeny independently
Previous studies used:
  Simulation
  Experimental evolution in lab
Others mostly avoided the accuracy question and focused on robustness (i.e. bootstrap)

11 Importance of correct gene trees

12 Importance of correct gene trees

13 Importance of correct gene trees
Loss Duplication X

14 Importance of correct gene trees
Mammalian example: fast-evolving rodents lead to lots of errors!
[Figure: two gene trees over dog (D), human (H), mouse (M) and rat (R), each with opossum as outgroup]

15 Importance of correct gene trees
Mammalian example: fast-evolving rodents lead to lots of errors!
[Figure: gene trees over dog (D), human (H), mouse (M) and rat (R), with opossum as outgroup]
Implies 1 duplication and at least 3 losses

16 What is the accuracy of current phylogenetic methods?
Use synteny to determine phylogeny by an independent means

17 What is the accuracy of current phylogenetic methods?
Use synteny to determine phylogeny by an independent means
Trees found by maximum likelihood (PHYML)
[Figure: reconstructed tree that matches the species topology]

18 What is the accuracy of current phylogenetic methods?
Phylogenies across 5154 syntenic one-to-one orthologs
[Figure: reconstructed topologies; one matches the species topology, followed by 316 other topologies]

19 Outline
Part I, Inaccuracies: Why use phylogeny? What is the accuracy of phylogenetic methods? Understanding sources of error
Part II, The Model: Modeling common properties of gene trees; gene-specific and species-specific rates
Part III, The Method: A new phylogenetic method: learning across complete genomes
Part IV, The Results: Increase in phylogenetic accuracy; additional applications

20 Inaccuracies dependent on sequence length

21 Inaccuracies depend on divergence

22 Inaccuracies due to lack of information
Average gene is too short
  Too few phylogenetically informative characters
  To make progress, must use additional information
Current algorithms ignore species
  Designed for solving the species tree problem
Whole genomes change the game
  Assume species tree is known
  Solve the gene tree problem
Our approach: design an algorithm specifically for the gene tree problem
  Key insight: use species tree to inform the gene tree reconstruction

23 Outline
Part I, Inaccuracies: Why use phylogeny? What is the accuracy of phylogenetic methods? Understanding sources of error
Part II, The Model: Modeling common properties of gene trees; gene-specific and species-specific rates
Part III, The Method: A new phylogenetic method: learning across complete genomes
Part IV, The Results: Increase in phylogenetic accuracy; additional applications

24 What is the connection between species and gene evolution?

25 What is the connection between species and gene evolution?
5154 gene trees

26 What is the connection between species and gene evolution?
5154 gene trees
[Figure: example gene trees, each drawn with a 1.0 sub/site scale bar]

27 Branch lengths across the genome
22 species branches dmel dsec dsim dere dyak dpse …

28 Branch lengths across the genome
22 species branches dmel dsec dsim dere dyak dpse … 5154 gene families

29 Branch lengths across the genome
[Figure: matrix of branch lengths; columns are the 22 species branches (dmel, dsec, dsim, dere, dyak, dpse, …), rows are the 5154 gene families]
bij = branch length in the jth species branch of the ith gene tree

30 Gene trees share similar rates: correlation
22 species branches mer (CG6875) asp (CG14228) 5154 gene families

31 Gene trees share similar rates: correlation
[Figure: rows for mer (CG6875) and asp (CG14228) in the branch-length matrix (22 species branches × 5154 gene families), with a scatter plot of asp branch lengths against mer branch lengths]
r = 0.957, slope = 2.10
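The comparison of two gene trees' branch lengths can be sketched as below. This is a minimal sketch with made-up branch lengths, not the mer/asp data; the slide does not say how the slope was fitted, so an ordinary least-squares slope is assumed here.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def least_squares_slope(xs, ys):
    """Slope of the ordinary least-squares line of ys on xs (assumed fit)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    return cov / var_x


# Hypothetical branch lengths of two genes over the same species branches.
# gene_b evolves exactly twice as fast as gene_a on every branch.
gene_a = [0.1, 0.2, 0.4]
gene_b = [0.2, 0.4, 0.8]
r = pearson_r(gene_a, gene_b)
slope = least_squares_slope(gene_a, gene_b)
```

A high correlation with a slope different from 1 is what two genes with different overall rates but shared species effects would produce.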

32 Gene trees share similar rates: correlation
22 species branches 5154 gene families Average gene tree

33 Gene trees share similar rates: correlation
[Figure: branch-length matrix (22 species branches × 5154 gene families) and the average gene tree]
93% of trees have a correlation greater than 0.8 with the average gene tree

34 Outline
Part I, Inaccuracies: Why use phylogeny? What is the accuracy of phylogenetic methods? Understanding sources of error
Part II, The Model: Modeling common properties of gene trees; gene-specific and species-specific rates
Part III, The Method: A new phylogenetic method: learning across complete genomes
Part IV, The Results: Increase in phylogenetic accuracy; additional applications

35 Initial results suggest two forces of gene evolution:
1. Gene-specific rates
2. Species-specific rates
Branch = Gene rate * Species rate
Can we really de-couple gene-specific and species-specific rates?

36 Study connection between species and gene evolution
5154 gene trees
[Figure: example gene trees, each drawn with a 1.0 sub/site scale bar]

37 Tree normalization
[Figure: branch-length matrix, 22 species branches × 5154 gene families]

38 Tree normalization
[Figure: branch-length matrices (22 species branches × 5154 gene families) before and after normalization: absolute lengths and relative lengths]
Tree normalization (a.k.a. row normalization)

39 Tree normalization
[Figure: branch-length matrices (22 species branches × 5154 gene families) before and after normalization: absolute lengths and relative lengths]
Tree normalization (a.k.a. row normalization):
bij = branch length in the jth species branch of the ith gene tree
gi = sum over j of bij (gene-specific rate)
sij = bij / gi (species-specific rate)

40 Effect of normalization on branch length distributions
[Figure: absolute and relative branch-length matrices (22 species branches × 5154 gene families), with the dvir branch highlighted]

41 Effect of normalization on branch length distributions
[Figure: absolute and relative branch-length matrices (22 species branches × 5154 gene families), with the distribution of dvir branch lengths]
Absolute lengths: gamma distributed (cannot reject gamma for 14/22)

42 Effect of normalization on branch length distributions
[Figure: absolute and relative branch-length matrices (22 species branches × 5154 gene families), with the distribution of dvir branch lengths and the total tree length gi]
Absolute lengths: gamma distributed (cannot reject gamma for 14/22)

43 Effect of normalization on branch length distributions
[Figure: absolute and relative branch-length matrices (22 species branches × 5154 gene families), with distributions of dvir branch lengths]
Absolute lengths: gamma distributed (cannot reject gamma for 14/22)
Relative lengths: approximately normally distributed (cannot reject normal for 9/22); st. dev. typically 1/3 to 1/4 of mean

44 Effect of normalization on branch correlation
[Figure: scatter plots of dvir against dana branch lengths, for absolute and relative lengths]

45 Effect of normalization on branch correlation
[Figure: scatter plots of dvir against dana branch lengths, for absolute and relative lengths]
Absolute lengths: average r = 0.61

46 Effect of normalization on branch correlation
[Figure: scatter plots of dvir against dana branch lengths, for absolute and relative lengths]
Absolute lengths: average r = 0.61
Relative lengths: average r = 0.09

47 A new model for gene family evolution: Two independent forces
1. Gene-specific rates: G ~ gamma(a, b)
2. Species-specific rates: Sj ~ normal(uj, sigma_j) for species branch j
bij = gi * sij

48 Effects that we have seen are consequences of this model
bij = gi * sij
If species rates have small standard deviations, we expect branch correlation
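The claim that the model implies correlated branches can be checked with a small simulation. This is a minimal sketch under stated assumptions: the gamma shape and scale, the species means and the number of genes are made up, and the standard deviation is set to 1/4 of each species mean as a rough match to the earlier slide.

```python
import math
import random


def simulate(n_genes, species_means, shape, scale, cv=0.25, seed=0):
    """Draw an n_genes x n_branches matrix of gene rate * species rate."""
    rng = random.Random(seed)
    matrix = []
    for _ in range(n_genes):
        gene_rate = rng.gammavariate(shape, scale)   # gene-specific rate
        matrix.append([gene_rate * rng.gauss(mu, cv * mu)  # species rate
                       for mu in species_means])
    return matrix


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical parameters: five species branches, 2000 simulated genes.
branches = simulate(2000, [0.05, 0.10, 0.20, 0.30, 0.35], shape=2.0, scale=0.5)
first = [row[0] for row in branches]
second = [row[1] for row in branches]
r_absolute = pearson_r(first, second)
```

Because every branch of a gene tree is scaled by the same gene rate, absolute lengths of different branches come out strongly correlated across genes even though the species rates are drawn independently.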

49 What is the meaning of the species-specific rate?
The normal is partly due to error in estimating evolutionary distance.
If we fit normals only on long sequences, the standard deviation goes down.
Species-specific means are not affected by sequence length.

50 All these properties hold for 12 flies, 17 fungi, 4 mammals
12 Drosophila: absolute branches fit gamma (14/22 significant); relative branches fit normal (9/22 significant)
9 Saccharomycete: Abs (16/16); Relative (15/16)
9 Candida: Abs (14/14); Relative (12/14)

51 Outline
Part I, Inaccuracies: Why use phylogeny? What is the accuracy of phylogenetic methods? Understanding sources of error
Part II, The Model: Modeling common properties of gene trees; gene-specific and species-specific rates
Part III, The Method: A new phylogenetic method: learning across complete genomes
Part IV, The Results: Increase in phylogenetic accuracy; additional applications

52 A new strategy for gene tree reconstruction
SPecies Informed DIstance-based Reconstruction (SPIDIR)
Traditional maximum likelihood methods:
  Propose many topologies
  For each topology, calculate the likelihood of seeing such a tree
  Return the tree that achieves maximum likelihood
We show that one can calculate the likelihood of a tree being generated by our model.
Thus, we can create our own phylogenetic algorithm that uses species information to reconstruct gene trees.
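The generic maximum likelihood loop can be sketched as below. This is a minimal illustration, not the authors' implementation: `best_tree`, the candidate topologies and the log-likelihood values are hypothetical, and real methods search tree space heuristically rather than scoring a fixed list.

```python
def best_tree(candidates, log_likelihood):
    """Score every proposed topology and return the highest-scoring one."""
    best, best_score = None, float("-inf")
    for tree in candidates:
        score = log_likelihood(tree)
        if score > best_score:
            best, best_score = tree, score
    return best, best_score


# Hypothetical candidates (nested tuples) with made-up log-likelihoods.
toy_scores = {
    ("dog", ("human", ("mouse", "rat"))): -10.2,
    (("dog", "human"), ("mouse", "rat")): -14.7,
    (("dog", "mouse"), ("human", "rat")): -31.5,
}
tree, score = best_tree(list(toy_scores), toy_scores.get)
```

Any scoring function can be plugged in for `log_likelihood`, which is where a species-informed likelihood would replace a purely sequence-based one.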

53 Likelihood calculation: simple case
INPUT: a distance matrix with all pair-wise distances between genes

54 Likelihood calculation: simple case
Propose a topology
Fit branch lengths to topology
Estimate gene-specific rate
Normalize tree

55 Likelihood calculation: simple case
Reconcile gene tree to species tree
Determines actual path of evolution through species tree
Algorithms exist to do this fast (linear time): Page 1994, Eulenstein 1997, Zmasek 2001
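Reconciliation can be sketched with the standard lowest-common-ancestor mapping. This is a minimal sketch, not one of the linear-time algorithms cited above: trees are nested tuples, gene names are made up, the species tree (dog, (human, (mouse, rat))) is assumed from the later slides, and losses are not counted.

```python
def index_species_tree(tree):
    """Record the parent and depth of every node in a nested-tuple tree."""
    parent, depth = {tree: None}, {tree: 0}
    stack = [tree]
    while stack:
        node = stack.pop()
        if isinstance(node, tuple):
            for child in node:
                parent[child] = node
                depth[child] = depth[node] + 1
                stack.append(child)
    return parent, depth


def lca(a, b, parent, depth):
    """Lowest common ancestor of two species-tree nodes."""
    while a != b:
        if depth[a] >= depth[b]:
            a = parent[a]
        else:
            b = parent[b]
    return a


def reconcile(gene_tree, gene_species, parent, depth, events):
    """Map a gene-tree node to a species-tree node, logging each event."""
    if not isinstance(gene_tree, tuple):
        return gene_species[gene_tree]
    child_maps = [reconcile(child, gene_species, parent, depth, events)
                  for child in gene_tree]
    mapped = child_maps[0]
    for other in child_maps[1:]:
        mapped = lca(mapped, other, parent, depth)
    # A node that maps to the same species node as a child is a duplication.
    kind = "duplication" if mapped in child_maps else "speciation"
    events.append((kind, mapped))
    return mapped


def count_duplications(gene_tree, gene_species, species_tree):
    """Number of duplication events implied by the gene tree."""
    parent, depth = index_species_tree(species_tree)
    events = []
    reconcile(gene_tree, gene_species, parent, depth, events)
    return sum(1 for kind, _ in events if kind == "duplication")


species_tree = ("dog", ("human", ("mouse", "rat")))
gene_species = {"d1": "dog", "h1": "human", "m1": "mouse", "r1": "rat"}
congruent = ("d1", ("h1", ("m1", "r1")))
rodents_misplaced = (("m1", "r1"), ("d1", "h1"))
```

With these toy inputs the congruent tree needs no duplication, while the tree with the rodents pulled to the base forces one at the root, in line with the mammalian example earlier in the talk.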

56 Likelihood calculation: simple case
[Figure: gene tree with branch likelihoods Pa, Pb, Pc, Pd, Pe, Pf]
Compare branch lengths to distributions
Allows us to calculate a likelihood for every branch

57 Likelihood calculation: simple case
[Figure: gene tree with branch likelihoods Pa, Pb, Pc, Pd, Pe, Pf]
Every branch is highly likely → tree is highly likely
Because branches are independent, the likelihood of the tree is the product of the branch likelihoods
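The product of branch likelihoods is usually computed as a sum of logs. The sketch below assumes each normalized branch length is scored under a normal distribution for its species branch; the branch names a to f follow the slide, but all means and standard deviations are made up, and the gene-rate term of the full model is left out.

```python
import math


def normal_logpdf(x, mean, sd):
    """Log density of a normal distribution at x."""
    return (-0.5 * math.log(2.0 * math.pi * sd * sd)
            - (x - mean) ** 2 / (2.0 * sd * sd))


def tree_log_likelihood(relative_lengths, species_params):
    """Sum per-branch log densities (product of likelihoods in log space).

    relative_lengths: branch name -> normalized branch length
    species_params:   branch name -> (mean, sd) of that branch's normal
    """
    return sum(normal_logpdf(length, *species_params[name])
               for name, length in relative_lengths.items())


# Made-up parameters for branches a..f; each sd is 1/4 of its mean.
params = {name: (mean, mean / 4.0) for name, mean in
          zip("abcdef", [0.10, 0.15, 0.25, 0.25, 0.15, 0.10])}
good_tree = {name: mean for name, (mean, _) in params.items()}
bad_tree = dict(good_tree, b=0.02, f=0.30)  # b too short, f too long
good_score = tree_log_likelihood(good_tree, params)
bad_score = tree_log_likelihood(bad_tree, params)
```

A tree whose branches sit at their expected lengths scores higher than one with a branch that is too short or too long.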

58 Outline
Part I, Inaccuracies: Why use phylogeny?
Part II, The Model
Part III, The Method: A new phylogenetic method: learning across complete genomes
  Simple case
  Complex case
  General case
Part IV, The Results

59 Likelihood calculation: complex case
[Figure: gene tree with branch likelihoods Pa, Pb, Pc, Pd, Pe, Pf]
Every branch is highly likely → tree is highly likely

60 Likelihood calculation: complex case
[Figure: first gene tree (Pa to Pf; every branch is highly likely → tree is highly likely) next to a second proposed topology]
Propose another topology
This one differs only by rooting
Most branches have the same length (just a different name):
  w = e (human), x = c (rat), y = d (mouse), z = b (rodent)
Two branches are now merged:
  v = a + f (dog/hmr)

61 Likelihood calculation: complex case
[Figure: first gene tree (Pa to Pf; every branch is highly likely → tree is highly likely) next to the re-rooted topology]
w = e (human), x = c (rat), y = d (mouse), z = b (rodent), v = a + f (dog/hmr)
Reconcile gene tree to species tree

62 Likelihood calculation: complex case
[Figure: first gene tree (Pa to Pf; every branch is highly likely → tree is highly likely) next to the re-rooted topology reconciled to the species tree (Rat, Mouse, Human, Dog)]
w = e (human), x = c (rat), y = d (mouse), z = b (rodent), v = a + f (dog/hmr)

63 Likelihood calculation: complex case
[Figure: first gene tree (Pa to Pf; every branch is highly likely → tree is highly likely) next to the re-rooted topology reconciled to the species tree (Rat, Mouse, Human, Dog), with Px and Py marked]
w = e (human), x = c (rat), y = d (mouse), z = b (rodent), v = a + f (dog/hmr)
Mouse and rat branches have the same likelihood as before: Px = Pc, Py = Pd

64 Likelihood calculation: complex case
[Figure: first gene tree (Pa to Pf; every branch is highly likely → tree is highly likely) next to the re-rooted topology reconciled to the species tree (Rat, Mouse, Human, Dog), with Px, Py and Pv marked]
w = e (human), x = c (rat), y = d (mouse), z = b (rodent), v = a + f (dog/hmr)
Same distribution for dog, but now the dog branch is too long. Why? Because v = a + f, so Pv < Pf

65 Likelihood calculation: complex case
[Figure: first gene tree (Pa to Pf; every branch is highly likely → tree is highly likely) next to the re-rooted topology reconciled to the species tree (Rat, Mouse, Human, Dog), with Px, Py and Pv marked and branch w split into w1 and w2]
w = e (human), x = c (rat), y = d (mouse), z = b (rodent), v = a + f (dog/hmr)
Branch w goes from Eutherian to Human (two species branches)
Which distribution should we use?

66 Likelihood calculation: complex case
[Figure: first gene tree (Pa to Pf; every branch is highly likely → tree is highly likely) next to the re-rooted topology reconciled to the species tree (Rat, Mouse, Human, Dog), with Px, Py, Pw and Pv marked]
w = e (human), x = c (rat), y = d (mouse), z = b (rodent), v = a + f (dog/hmr)
The distribution is the sum of two independent normals:
w = w1 + w2 ~ N(u1, s1^2) + N(u2, s2^2) = N(u1 + u2, s1^2 + s2^2)
Branch w is too short: Pw < Pe
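The sum-of-normals rule for a gene branch that spans two species branches can be sketched as follows. This is a minimal sketch: the (mean, sd) values for w1 and w2 and the observed length are made up for illustration.

```python
import math


def combine_normals(params):
    """Mean and sd of a sum of independent normals, given (mean, sd) pairs."""
    mean = sum(m for m, _ in params)
    variance = sum(sd * sd for _, sd in params)  # variances add, not sds
    return mean, math.sqrt(variance)


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


# Hypothetical (mean, sd) values for the two species branches w1 and w2.
w_mean, w_sd = combine_normals([(0.30, 0.03), (0.10, 0.04)])
expected_density = normal_pdf(w_mean, w_mean, w_sd)
short_density = normal_pdf(0.25, w_mean, w_sd)  # a branch that is too short
```

A branch much shorter than the combined mean gets a lower density, which is how the re-rooted topology ends up penalized.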

67 Likelihood calculation: complex case
[Figure: first gene tree (Pa to Pf; every branch is highly likely → tree is highly likely) next to the re-rooted topology reconciled to the species tree (Rat, Mouse, Human, Dog), with Px, Py, Pw and Pv marked and branch z split into z1 and z2]
w = e (human), x = c (rat), y = d (mouse), z = b (rodent), v = a + f (dog/hmr)
Same case for z: two species branches
The distribution is the sum of two independent normals:
z = z1 + z2 ~ N(u1, s1^2) + N(u2, s2^2) = N(u1 + u2, s1^2 + s2^2)

68 Likelihood calculation: complex case
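[Figure: first gene tree (Pa to Pf; every branch is highly likely → tree is highly likely) next to the re-rooted topology reconciled to the species tree (Rat, Mouse, Human, Dog), with Px, Py, Pz, Pw and Pv marked]
w = e (human), x = c (rat), y = d (mouse), z = b (rodent), v = a + f (dog/hmr)
Branch z is too short: Pz < Pb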

69 Likelihood calculation: complex case
[Figure: first gene tree (Pa to Pf; every branch is highly likely → tree is highly likely) next to the re-rooted topology reconciled to the species tree (Rat, Mouse, Human, Dog), with Px, Py, Pz, Pw and Pv marked]
w = e (human), x = c (rat), y = d (mouse), z = b (rodent), v = a + f (dog/hmr)
Some branches are less likely → tree is less likely

70 Outline
Part I, Inaccuracies: Why use phylogeny?
Part II, The Model
Part III, The Method: A new phylogenetic method: learning across complete genomes
  Simple case
  Complex case
  General case
Part IV, The Results

71 Bringing it all together
It turns out we can find the likelihood of any tree by breaking it down into one of three cases
Main advantage: we do not explicitly penalize duplications/losses
We only ensure branch lengths are close to what we expect given our model

72 Example of reconstructing tree with dup/loss: hemoglobin genes
[Figure: gene tree with two D H M R subtrees, one for hemoglobin alpha and one for hemoglobin beta]
This is now the correct topology

73 Example of reconstructing tree with dup/loss
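[Figure: hemoglobin gene tree with two D H M R subtrees (hemoglobin alpha and hemoglobin beta), reconciled to the species tree (Rat, Mouse, Human, Dog) with branches z, w, v and likelihoods Px, Py, Pz, Pw, Pv]
Branch z is now longer
Branch w is now longer
Branch v is just the right length
All branches are highly likely → tree is highly likely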

74 Outline
Part I, Inaccuracies: Why use phylogeny? What is the accuracy of phylogenetic methods? Understanding sources of error
Part II, The Model: Modeling common properties of gene trees; gene-specific and species-specific rates
Part III, The Method: A new phylogenetic method: learning across complete genomes
Part IV, The Results: Increase in phylogenetic accuracy; additional applications

75 Evaluation: Datasets
Real datasets
  5154 syntenic one-to-ones from 12 flies
  739 syntenic one-to-ones from 9 fungi
  138 whole genome duplicates in 9 fungi
Simulated (using our gene family model)
  More complex events
[Figure: species trees for 12 Drosophila and 9 Saccharomycete]

76 Evaluation real data

77 Evaluation real data

78 Evaluation real data
[Figure: gene tree through the whole genome duplication (WGD); labels: klac, kwal, agos, scas1, cgla1, s.stricto1, scas2, cgla2, s.stricto2, pre-duplication]

79 Branch lengths through WGD well approximated by model
[Figure: species tree (scer, spar, smik, sbay, scas, cgla, klac, kwal, agos) and gene tree through the WGD (scas1, cgla1, s.stricto1, scas2, cgla2, s.stricto2, pre-duplication)]

80 No apparent topology bias in reconstructing simulation data

81 Outline
Part I, Inaccuracies: Why use phylogeny? What is the accuracy of phylogenetic methods? Understanding sources of error
Part II, The Model: Modeling common properties of gene trees; gene-specific and species-specific rates
Part III, The Method: A new phylogenetic method: learning across complete genomes
Part IV, The Results: Increase in phylogenetic accuracy; additional applications

82 Apply genome-wide for 17 fungi
Cluster genes
Build alignment for each cluster
Build tree for each alignment
Reconcile to species tree to determine all duplications and losses

83 Identify lineage-specific acceleration
Not just fast-evolving genes, but genes that are faster than expected

84 Contributions
Better understanding of phylogenetic accuracy in real data
New model for gene family evolution
  Gene-specific and species-specific rates
New phylogenomic algorithm
  Increased accuracy for reconstruction

85 Acknowledgements
Manolis Kellis
Kellis Lab: Ameya Deoras, Pouya Kheradpour, Mike Lin, Alex Stark
Fly datasets: NIH; Doug Smith and Fly analysis consortium
Candida datasets: NIAID; Bruce Birren and Christina Cuomo
NIH Training Grant
CSAIL / Broad / Whitehead

86 Outline
Part I, Inaccuracies: Why use phylogeny? What is the accuracy of phylogenetic methods? Understanding sources of error
Part II, The Model: Modeling common properties of gene trees; gene-specific and species-specific rates
Part III, The Method: A new phylogenetic method: learning across complete genomes
Part IV, The Results: Increase in phylogenetic accuracy; additional applications

88 The standard deviation of every species-specific rate is nearly ¼ of the mean

89 GO enrichment in top 50 trees with most duplications
term | pval
plasma membrane | -1.50E-11
helicase activity | 4.72E-12
ammonium transporter activity | 1.46E-09
telomere maintenance via recombination | 1.66E-09
DNA helicase activity | 4.11E-09
transport | 4.40E-07
transporter activity | 4.80E-07
membrane | 1.27E-06
alcohol dehydrogenase activity | 6.29E-06
nitrogen utilization
ATPase activity, coupled to transmembrane movement of ions, phosphorylative mechanism
alpha-1,3-mannosyltransferase activity
alcohol dehydrogenase (NADP+) activity | 3.84E-05
magnesium ion transport
sodium ion transport
basic amino acid transporter activity
lysophospholipase activity
nuclear nucleosome | 4.17E-05
cellular component unknown
translational elongation
oxidoreductase activity, acting on the CH-OH group of donors, NAD or NADP as acceptor
translation elongation factor activity
oxidoreductase activity
alpha-glucosidase activity
fermentation
proteasome complex (sensu Eukaryota)
transmembrane receptor activity
protein amino acid O-linked glycosylation
ribosome
thiamin biosynthesis
myo-inositol transport
myo-inositol transporter activity
maltose catabolism
glycerophospholipid metabolism
NADPH dehydrogenase activity
permease activity
calcium-transporting ATPase activity

90 GO enrichment in top 50 trees with most gene losses
helicase activity | 4.49E-12
telomere maintenance via recombination | 1.60E-09
DNA helicase activity | 3.95E-09
GTPase activity | 7.11E-09
ubiquitin conjugating enzyme activity | 6.79E-08
translational elongation | 4.24E-07
alcohol dehydrogenase activity | 6.17E-06
ATPase activity, coupled to transmembrane movement of ions, phosphorylative mechanism
alpha-1,3-mannosyltransferase activity
translation elongation factor activity | 9.27E-06
alcohol dehydrogenase (NADP+) activity | 3.78E-05
1,3-beta-glucan synthase activity
sodium ion transport
IMP dehydrogenase activity
ribosome | 4.35E-05
protein serine/threonine phosphatase activity
oxidoreductase activity, acting on the CH-OH group of donors, NAD or NADP as acceptor
structural constituent of cytoskeleton
alpha-glucosidase activity
fermentation
proteasome complex (sensu Eukaryota)
transmembrane receptor activity
protein amino acid O-linked glycosylation
ammonium transporter activity
hydrogen-exporting ATPase activity, phosphorylative mechanism
GTP biosynthesis
myo-inositol transport
regulation of pH
myo-inositol transporter activity
maltose catabolism
aconitate hydratase activity
calcium-transporting ATPase activity
IMP cyclohydrolase activity
plasma membrane
transporter activity
protein phosphatase type 2A activity

91 GO enrichment in top 10 trees with most genes
DNA helicase activity | 3.79E-13
telomere maintenance via recombination | 4.61E-13
helicase activity | 8.05E-13
ATPase activity, coupled to transmembrane movement of ions, phosphorylative mechanism | 5.92E-08
alcohol dehydrogenase activity
sodium ion transport | 1.15E-06
fermentation | 1.13E-05
calcium-transporting ATPase activity
NADPH dehydrogenase activity
transport
oxidoreductase activity
membrane
alcohol dehydrogenase (NADP+) activity
alcohol metabolism
transporter activity
plasma membrane
multidrug transport

92 Gene functions that evolve rapidly
Analyses more robust at GO category level

93 Orthologs and paralogs
[Figure: gene tree over human, mouse, rat, dog and rabbit, with paralogs and orthologs marked]
Understanding orthologs and paralogs is a basic requirement for comparative genomic studies. Orthologs arise from speciation and vertical descent from a single ancestral gene, and therefore typically preserve the ancestral function. Paralogs, on the other hand, arise by gene duplication and are likely to take on new functions.
Orthologs arise by speciation: typically keep the same function
Paralogs arise by duplication: typically take on new functions

94 Complement Ka/Ks studies
Ks saturates very rapidly
Gene-specific rates hold across much larger distances

95 GO enrichment in top 50 trees with most genes
term | pval
helicase activity | 1.55E-11
telomere maintenance via recombination | 3.83E-09
DNA helicase activity | 1.05E-08
plasma membrane | 2.84E-08
1,3-beta-glucanosyltransferase activity | 7.94E-08
transporter activity | 1.45E-07
transport | 1.71E-07
pyruvate decarboxylase activity | 2.09E-06
protein amino acid O-linked glycosylation | 2.29E-06
Golgi apparatus | 7.73E-06
alcohol dehydrogenase activity | 1.01E-05
ATPase activity, coupled to transmembrane movement of ions, phosphorylative mechanism
alpha-1,3-mannosyltransferase activity
GTPase activity | 1.57E-05
cyclin-dependent protein kinase holoenzyme complex | 1.70E-05
alpha-1,2-mannosyltransferase activity | 2.95E-05
regulation of glycogen biosynthesis
cytosine-purine permease activity | 5.51E-05
sodium ion transport
basic amino acid transporter activity
lysophospholipase activity
membrane
cell wall (sensu Fungi)
alpha-glucosidase activity
fermentation
proteasome complex (sensu Eukaryota)
3-chloroallyl aldehyde dehydrogenase activity
oxidoreductase activity
cyclin-dependent protein kinase regulator activity
ammonium transporter activity
myo-inositol transport
myo-inositol transporter activity
maltose catabolism
glycerophospholipid metabolism
NADPH dehydrogenase activity

96 Correlation between duplication rate and mutation rate

