Accurate gene phylogeny across multiple complete genomes


Accurate gene phylogeny across multiple complete genomes: Species Informed Distance-based Reconstruction. Matt Rasmussen and Manolis Kellis

The goal: determine the evolutionary history of every gene in multiple complete genomes. From the phylogenies, determine orthologs, paralogs, duplications, losses, family expansions, varying rates of evolution, etc.

Contrast of the phylogenetic method with alternative methods
Pair-wise sequence comparison: best bi-directional BLAST hits; focuses on one-to-one orthologs (no duplications).
Hit clustering methods: detect clusters in the graph of pair-wise hits; large connected components are difficult to separate.
Synteny methods: detect conserved regions (stretches of nearby hits); genome alignment methods focus on best hits.
Phylogenetic methods: the phylogeny of a family clusters orthologs near each other; traditionally applied to specific families.
Can they be applied genome-wide?
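To make the first alternative concrete, here is a minimal Python sketch of best bi-directional hits. The score dictionaries, gene names, and function names are hypothetical; real BLAST output would have to be parsed into this form first.

```python
def best_hits(scores):
    """Map each query gene to its highest-scoring subject gene."""
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return pairs (a, b) where a's best hit is b and b's best hit is a."""
    forward, reverse = best_hits(a_vs_b), best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Hypothetical alignment scores between genes of genome A and genome B.
a_vs_b = {("a1", "b1"): 200, ("a1", "b2"): 150, ("a2", "b2"): 90, ("a3", "b2"): 120}
b_vs_a = {("b1", "a1"): 190, ("b2", "a3"): 110, ("b2", "a1"): 100}
pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
print(pairs)
```

Gene a2 is left without a partner here, which illustrates why this approach is limited to one-to-one orthologs.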

What is the accuracy of current phylogenetic methods? Tricky question: it requires knowing the correct phylogeny by independent means. Previously: use simulation, or avoid accuracy and focus on robustness (bootstrap).

What is the accuracy of current phylogenetic methods? Use synteny to determine the phylogeny by independent means. Trees found by maximum likelihood (PHYML). [Figure label: matches species topology]

What is the accuracy of current phylogenetic methods? Phylogenies across 5154 syntenic one-to-one orthologs. [Figure labels: matches species topology; 316 other topologies; etc.]

Reconstruction accuracy dependent on gene sequence length

Accuracy of current phylogenetic methods
The average gene is too short: too few phylogenetically informative characters. To make progress, we must use additional information.
Current algorithms ignore species: they were designed for solving the species tree problem.
Whole genomes change the game: we can assume the species tree is known, and we would like to solve the gene tree problem.
Our approach: design an algorithm specifically for the gene tree problem.
Key insight: use the species tree to inform the gene tree reconstruction.

What is the connection between species and gene evolution? 5154 gene trees. [Figure labels: 1.0 sub/site]

Correlation between branch lengths. Total tree length; relative branch lengths. [Scatter plot axes: asp branch lengths, Mer branch lengths; r = 0.957]

Correlation between branch lengths: average gene tree. 93% of trees have a correlation greater than 0.8 with the average gene tree.
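The comparison against the average gene tree can be sketched as follows. This assumes that the correlation is a Pearson correlation over relative branch lengths and that every tree lists the same branches in the same order; the slides do not state either detail, and the function names are illustrative.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def relative_lengths(branches):
    """Divide each branch length by the total tree length."""
    total = sum(branches)
    return [b / total for b in branches]


def fraction_correlated(trees, threshold=0.8):
    """Fraction of trees whose relative branch lengths correlate with the average tree."""
    rel = [relative_lengths(t) for t in trees]
    average = [sum(column) / len(rel) for column in zip(*rel)]
    hits = sum(1 for r in rel if pearson(r, average) > threshold)
    return hits / len(rel)


# Toy data: two trees with the same shape at different rates, one with a different shape.
toy_trees = [[0.1, 0.2, 0.3], [0.2, 0.4, 0.6], [0.3, 0.2, 0.1]]
print(fraction_correlated(toy_trees))
```

Normalizing first means the second toy tree, which is simply twice as long as the first, counts as having the same shape.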

Effect of normalization on branch correlation. [Plot axes: dvir, dana]

Effect of normalization on branch length distribution. Relative branch lengths: normally distributed. Absolute branch lengths: gamma distributed.

A new model for gene family evolution: two forces. 1. Family rates: $F_j \sim \mathrm{gamma}(a, b)$. 2. Species-specific rates: $S_{ij} \sim \mathrm{normal}(u_i, s_{ij})$. Branch length: $b_{ij} = F_j \cdot S_{ij}$.
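A small simulation can make the two-force model concrete. The sketch below assumes a shape/scale parameterization of the gamma, one standard deviation per species branch, and a tiny positive floor on the normal draws so that branch lengths stay positive; all parameter values are hypothetical.

```python
import random
import statistics


def simulate_branch_lengths(species_means, species_sdevs, shape, scale, n_families, seed=0):
    """Draw branch lengths b_ij = F_j * S_ij for n_families one-to-one gene trees."""
    rng = random.Random(seed)
    trees = []
    for _ in range(n_families):
        # Family rate F_j; random.gammavariate takes (shape, scale).
        family_rate = rng.gammavariate(shape, scale)
        # Species-specific rates S_ij, floored at a tiny positive value (an assumption).
        rates = [max(1e-9, rng.gauss(u, s)) for u, s in zip(species_means, species_sdevs)]
        trees.append([family_rate * r for r in rates])
    return trees


# Hypothetical parameters: three species branches with sdev = mean / 4.
means = [0.2, 0.5, 0.3]
sdevs = [m / 4 for m in means]
trees = simulate_branch_lengths(means, sdevs, shape=2.0, scale=0.5, n_families=2000)

# Dividing by the total tree length removes the family rate from each tree.
relative = [[b / sum(t) for b in t] for t in trees]
print([round(statistics.mean(column), 3) for column in zip(*relative)])
```

The absolute lengths inherit the spread of the family rate, while the relative lengths stay close to the species means.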

Effects that we have seen are consequences of this model, $b_{ij} = F_j \cdot S_{ij}$. The total tree length $L_j$ of one-to-one trees is proportional to the family rate $F_j$. If species rates have small standard deviations, we expect branch correlation.

The standard deviation of every species-specific rate is nearly ¼ of the mean

What is the meaning of the species-specific rate? The spread of the normal is partly due to error in estimating evolutionary distance: if we fit normals only on long sequences, the standard deviation goes down. Species-specific means are not affected by sequence length.

All of these effects also hold for 17 fungi and 4 mammals. 12 flies: absolute branches are gamma distributed; relative branches are normally distributed; 3 < mean / sdev < 4. 17 fungi: absolute branches are gamma distributed; relative branches are normally distributed; 3 < mean / sdev < 4.

A new strategy for gene tree reconstruction
Traditional maximum likelihood methods: propose many topologies; for each topology, calculate the likelihood of seeing such a tree; return the tree that achieves the maximum likelihood.
We show that one can calculate the likelihood of a tree being generated by our model. Thus, we can create our own phylogenetic algorithm that uses species information to reconstruct gene trees.

Likelihood calculation: simple case. Input: a distance matrix with all pair-wise distances between genes.

Likelihood calculation: simple case. 1. Propose a topology. 2. Fit branch lengths to the topology. 3. Estimate the family rate. 4. Normalize the tree.
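The last two steps can be sketched as below. The slides do not say how the family rate is estimated; this sketch assumes it is the total tree length divided by the sum of the species-specific means, which follows from the earlier statement that total tree length is proportional to the family rate. Branch names and values are hypothetical.

```python
def estimate_family_rate(branch_lengths, species_means):
    """Assumed estimator: total tree length over the total of the species means."""
    expected_total = sum(species_means[name] for name in branch_lengths)
    return sum(branch_lengths.values()) / expected_total


def normalize_tree(branch_lengths, species_means):
    """Divide every branch by the estimated family rate."""
    rate = estimate_family_rate(branch_lengths, species_means)
    return {name: length / rate for name, length in branch_lengths.items()}


# Hypothetical fitted branch lengths and species-specific means.
fitted = {"a": 0.4, "b": 1.0, "c": 0.6}
species_means = {"a": 0.2, "b": 0.5, "c": 0.3}
normalized = normalize_tree(fitted, species_means)
print(estimate_family_rate(fitted, species_means), normalized)
```

After normalization, each branch can be compared directly with the distribution learned for its species branch.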

Likelihood calculation: simple case. Reconcile the gene tree to the species tree. This determines the actual path of evolution through the species tree. Algorithms exist to do this fast (linear time).
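One common way to reconcile is to map every gene-tree node to the last common ancestor of the species found below it. The sketch below uses that idea with nested tuples for trees; it is a simple version for illustration, not one of the linear-time algorithms the slide refers to, and the gene names are hypothetical.

```python
def index_species_tree(tree):
    """Return {node: parent} and {node: depth} for a nested-tuple species tree."""
    parent, depth = {tree: None}, {tree: 0}
    stack = [tree]
    while stack:
        node = stack.pop()
        if isinstance(node, tuple):
            for child in node:
                parent[child] = node
                depth[child] = depth[node] + 1
                stack.append(child)
    return parent, depth


def lca(a, b, parent, depth):
    """Walk the deeper node upward until both nodes meet."""
    while a != b:
        if depth[a] >= depth[b]:
            a = parent[a]
        else:
            b = parent[b]
    return a


def reconcile(gene_tree, species_of, parent, depth, events):
    """Map a gene-tree node to a species node, recording one event per internal node."""
    if not isinstance(gene_tree, tuple):
        return species_of[gene_tree]
    mapped = [reconcile(child, species_of, parent, depth, events) for child in gene_tree]
    here = mapped[0]
    for other in mapped[1:]:
        here = lca(here, other, parent, depth)
    # A node that maps to the same species node as one of its children is a duplication.
    events.append((here, "duplication" if here in mapped else "speciation"))
    return here


species = ((("rat", "mouse"), "human"), "dog")
alpha = ((("rat_a", "mouse_a"), "human_a"), "dog_a")
beta = ((("rat_b", "mouse_b"), "human_b"), "dog_b")
gene_tree = (alpha, beta)
species_of = {g: g.split("_")[0] for sub in (alpha, beta) for g in
              (sub[0][0][0], sub[0][0][1], sub[0][1], sub[1])}

parent, depth = index_species_tree(species)
events = []
root_species = reconcile(gene_tree, species_of, parent, depth, events)
print([kind for _, kind in events])
```

In this toy family, the root of the gene tree is labeled a duplication and every other internal node a speciation.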

Likelihood calculation: simple case. Compare branch lengths to their distributions. This allows us to calculate a likelihood for every branch. [Figure labels: Pa, Pb, Pc, Pd, Pe, Pf]

Likelihood calculation: simple case. Every branch is highly likely → the tree is highly likely. Because branches are independent, the likelihood of the tree is the product of the branch likelihoods.
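The product of branch likelihoods is easiest to compute as a sum of logs. The sketch below assumes each normalized branch is scored with a normal density using that species branch's mean and standard deviation; the parameter values and branch names are hypothetical.

```python
import math


def normal_logpdf(x, mean, sdev):
    """Log density of a normal distribution."""
    return -0.5 * math.log(2 * math.pi * sdev ** 2) - (x - mean) ** 2 / (2 * sdev ** 2)


def tree_log_likelihood(relative_lengths, params):
    """Sum of per-branch log likelihoods, i.e. the log of their product."""
    return sum(normal_logpdf(length, *params[name])
               for name, length in relative_lengths.items())


# Hypothetical (mean, sdev) per species branch, with sdev = mean / 4.
params = {"a": (0.2, 0.05), "b": (0.5, 0.125), "c": (0.3, 0.075)}
close_to_model = {"a": 0.21, "b": 0.48, "c": 0.31}
far_from_model = {"a": 0.45, "b": 0.25, "c": 0.30}
print(tree_log_likelihood(close_to_model, params) > tree_log_likelihood(far_from_model, params))
```

Working in log space avoids numerical underflow when a tree has many branches.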

Likelihood calculation: complex case. Every branch is highly likely → the tree is highly likely.

Likelihood calculation: complex case. Propose another topology. This one differs only by rooting. Most branches have the same length (just a different name): w = e (human), x = c (rat), y = d (mouse), z = b (rodent). Two branches are now merged: v = a + f (dog/hmr).

Likelihood calculation: complex case. Reconcile the gene tree to the species tree (rat, mouse, human, dog).

Likelihood calculation: complex case. Mouse and rat branches have the same likelihood as before: Px = Pc, Py = Pd.

Likelihood calculation: complex case. Same distribution for dog, but now the dog branch is too long. Why? v = a + f, so Pv < Pf.

Likelihood calculation: complex case. Branch w goes from Eutherian to Human (two species branches, w1 and w2). Which distribution should we use?

Likelihood calculation: complex case. The distribution is the sum of two independent normals: $w = w_1 + w_2 \sim N(u_1, s_1^2) + N(u_2, s_2^2) = N(u_1 + u_2,\ s_1^2 + s_2^2)$. Branch w is too short: Pw < Pe.
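The sum-of-normals rule can be checked with a short sketch: means add, variances add, and a branch that fits one species branch well becomes too short once it has to span two. The (mean, sdev) pairs below are hypothetical.

```python
import math


def normal_logpdf(x, mean, sdev):
    """Log density of a normal distribution."""
    return -0.5 * math.log(2 * math.pi * sdev ** 2) - (x - mean) ** 2 / (2 * sdev ** 2)


def merged_branch(segments):
    """Mean and sdev of a sum of independent normals given as (mean, sdev) pairs."""
    mean = sum(u for u, _ in segments)
    sdev = math.sqrt(sum(s ** 2 for _, s in segments))
    return mean, sdev


w1 = (0.10, 0.025)  # hypothetical (mean, sdev) for the first species branch
w2 = (0.20, 0.050)  # hypothetical (mean, sdev) for the second species branch
mean_w, sdev_w = merged_branch([w1, w2])

# A branch of length 0.20 fits w2 alone, but is too short for w1 + w2.
fits_single = normal_logpdf(0.20, *w2)
fits_merged = normal_logpdf(0.20, mean_w, sdev_w)
print(mean_w, sdev_w, fits_single > fits_merged)
```

The same helper applies to branch z, which also spans two species branches.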

Likelihood calculation: complex case. Same case for z: two species branches (z1 and z2). The distribution is the sum of two independent normals: $z = z_1 + z_2 \sim N(u_1, s_1^2) + N(u_2, s_2^2) = N(u_1 + u_2,\ s_1^2 + s_2^2)$.

Likelihood calculation: complex case. Branch z is too short: Pz < Pb.

Likelihood calculation: complex case. Some branches are less likely → the tree is less likely.

Bringing it all together. It turns out that we can find the likelihood of any tree by breaking it down into one of three cases. Main advantage: we do not explicitly penalize duplications and losses; we only ensure that branch lengths are close to what we expect given our model.

Example of reconstructing a tree with dup/loss: hemoglobin genes. [Figure: hemoglobin alpha and hemoglobin beta subtrees, each with leaves D, H, M, R] This is now the correct topology.

Example of reconstructing a tree with dup/loss. Branch z is now longer. Branch w is now longer. Branch v is just the right length. All branches are highly likely → the tree is highly likely.

Evaluation: datasets
Real datasets: 5154 syntenic one-to-ones from 12 flies; 739 syntenic one-to-ones from 17 fungi; 200 neighboring fly orthologs; 220 whole-genome duplicates in 7 yeasts.
Simulated datasets (using our gene family model): more complex events.
[Figure labels: neighboring orthologs; WGD trees; klac, kwal, agos; sbay, smik, spar, scer]

Evaluation

Apply genome-wide for 17 fungi. 1. Cluster genes. 2. Build an alignment for each cluster. 3. Build a tree for each alignment. 4. Reconcile to the species tree to determine all duplications and losses.

General trees follow the model we learned from one-to-one trees

GO enrichment in top 50 trees with most duplications (term: pval). plasma membrane: -1.50E-11; helicase activity: 4.72E-12; ammonium transporter activity: 1.46E-09; telomere maintenance via recombination: 1.66E-09; DNA helicase activity: 4.11E-09; transport: 4.40E-07; transporter activity: 4.80E-07; membrane: 1.27E-06; alcohol dehydrogenase activity: 6.29E-06; nitrogen utilization; ATPase activity, coupled to transmembrane movement of ions, phosphorylative mechanism; alpha-1,3-mannosyltransferase activity; alcohol dehydrogenase (NADP+) activity: 3.84E-05; magnesium ion transport; sodium ion transport; basic amino acid transporter activity; lysophospholipase activity; nuclear nucleosome: 4.17E-05; cellular component unknown: 0.000129; translational elongation: 0.000139; oxidoreductase activity, acting on the CH-OH group of donors, NAD or NADP as acceptor: 0.00015; translation elongation factor activity: 0.000231; oxidoreductase activity: 0.000353; alpha-glucosidase activity: 0.000365; fermentation; proteasome complex (sensu Eukaryota); transmembrane receptor activity; protein amino acid O-linked glycosylation: 0.000515; ribosome: 0.000724; thiamin biosynthesis; myo-inositol transport: 0.00114; myo-inositol transporter activity; maltose catabolism; glycerophospholipid metabolism; NADPH dehydrogenase activity; permease activity; calcium-transporting ATPase activity

GO enrichment in top 50 trees with most gene losses (term: pval). helicase activity: 4.49E-12; telomere maintenance via recombination: 1.60E-09; DNA helicase activity: 3.95E-09; GTPase activity: 7.11E-09; ubiquitin conjugating enzyme activity: 6.79E-08; translational elongation: 4.24E-07; alcohol dehydrogenase activity: 6.17E-06; ATPase activity, coupled to transmembrane movement of ions, phosphorylative mechanism; alpha-1,3-mannosyltransferase activity; translation elongation factor activity: 9.27E-06; alcohol dehydrogenase (NADP+) activity: 3.78E-05; 1,3-beta-glucan synthase activity; sodium ion transport; IMP dehydrogenase activity; ribosome: 4.35E-05; protein serine/threonine phosphatase activity: 0.000139; oxidoreductase activity, acting on the CH-OH group of donors, NAD or NADP as acceptor: 0.000148; structural constituent of cytoskeleton: 0.00018; alpha-glucosidase activity: 0.00036; fermentation; proteasome complex (sensu Eukaryota); transmembrane receptor activity; protein amino acid O-linked glycosylation: 0.000505; ammonium transporter activity: 0.000701; hydrogen-exporting ATPase activity, phosphorylative mechanism: 0.001129; GTP biosynthesis; myo-inositol transport; regulation of pH; myo-inositol transporter activity; maltose catabolism; aconitate hydratase activity; calcium-transporting ATPase activity; IMP cyclohydrolase activity; plasma membrane: 0.00117; transporter activity: 0.00179; protein phosphatase type 2A activity: 0.001867

GO enrichment in top 50 trees with most genes (term: pval). helicase activity: 1.55E-11; telomere maintenance via recombination: 3.83E-09; DNA helicase activity: 1.05E-08; plasma membrane: 2.84E-08; 1,3-beta-glucanosyltransferase activity: 7.94E-08; transporter activity: 1.45E-07; transport: 1.71E-07; pyruvate decarboxylase activity: 2.09E-06; protein amino acid O-linked glycosylation: 2.29E-06; Golgi apparatus: 7.73E-06; alcohol dehydrogenase activity: 1.01E-05; ATPase activity, coupled to transmembrane movement of ions, phosphorylative mechanism; alpha-1,3-mannosyltransferase activity; GTPase activity: 1.57E-05; cyclin-dependent protein kinase holoenzyme complex: 1.70E-05; alpha-1,2-mannosyltransferase activity: 2.95E-05; regulation of glycogen biosynthesis; cytosine-purine permease activity: 5.51E-05; sodium ion transport; basic amino acid transporter activity; lysophospholipase activity; membrane: 0.000159; cell wall (sensu Fungi): 0.000386; alpha-glucosidase activity: 0.00052; fermentation; proteasome complex (sensu Eukaryota); 3-chloroallyl aldehyde dehydrogenase activity; oxidoreductase activity: 0.000557; cyclin-dependent protein kinase regulator activity: 0.00097; ammonium transporter activity: 0.001011; myo-inositol transport: 0.00145; myo-inositol transporter activity; maltose catabolism; glycerophospholipid metabolism; NADPH dehydrogenase activity

GO enrichment in top 10 trees with most genes (term: pval). DNA helicase activity: 3.79E-13; telomere maintenance via recombination: 4.61E-13; helicase activity: 8.05E-13; ATPase activity, coupled to transmembrane movement of ions, phosphorylative mechanism: 5.92E-08; alcohol dehydrogenase activity; sodium ion transport: 1.15E-06; fermentation: 1.13E-05; calcium-transporting ATPase activity: 0.00011; NADPH dehydrogenase activity; transport: 0.000146; oxidoreductase activity: 0.000178; membrane: 0.000235; alcohol dehydrogenase (NADP+) activity: 0.000328; alcohol metabolism: 0.000652; transporter activity: 0.000735; plasma membrane: 0.000846; multidrug transport: 0.001078

Supplemental figure

Number of duplications vs. relative sub/site for each species branch

Orthologs and paralogs. [Figure labels: human, mouse, rat, dog, rabbit; paralogs; orthologs] Understanding orthologs and paralogs is a basic requirement for comparative genomic studies. Orthologs arise by speciation, descending vertically from a single ancestral gene, and therefore typically preserve the ancestral function. Paralogs, on the other hand, arise by gene duplication and are likely to take on new functions.

Likelihood calculation: complex case. Human is now too short, because it must now cross an extra species branch.

Figure 4. Gene-tree evaluation with a richer species-tree model. (a) A gene tree with the correct topology scores highly. (b) A gene tree with an incorrect topology scores poorly.