Current challenges for molecular phylogenetics Barbara Holland School of Mathematics & Physics University of Tasmania Mostly statistical.


Charles Darwin and Alfred Russel Wallace Evolution as descent with modification, implying relationships between organisms by unbroken genetic lines. Phylogenetics seeks to determine these genetic relationships. Darwin’s sketch: the first phylogenetic tree? [Images: Charles Darwin; Alfred Russel Wallace]

Since the publication of On the Origin of Species in 1859, people have been trying to infer the evolutionary “tree of life”. Ernst Haeckel’s Tree of Life (1866)

Haeckel’s Pedigree of Man

Why molecular phylogeny? Most molecules evolve independently of adaptations affecting morphology. It is fairly easy to find genes that are present in all species of interest, e.g., a 12S RNA molecule in mitochondria is functional in all mammals. Useful mathematical models of sequence evolution have been developed that underpin attempts to infer evolutionary trees.

Golden Mole Mole Whale ?

Golden Mole, Mole, Whale [clade labels: Laurasiatheria, Afrotheria]

? hedgehog tenrec elephant

hedgehog, tenrec, elephant [clade labels: Laurasiatheria, Afrotheria]

A brief and incomplete history of molecular phylogenetics [Timeline figure spanning the 60s, 70s, 80s, 90s and 00s; entries as they appear on the slide]
–Antibodies, DNA-DNA hybridisation (Sarich, Wilson)
–Assessing support: bootstrap
–Explicit models: maximum likelihood
–Systematic bias: the Felsenstein Zone (Felsenstein)
–Sequence data (amino acid, then DNA)
–Parsimony
–Distance based
–More complex models: Bayesian methods
–Various perils... anomalous gene trees, non-identifiable models
–Population processes, gene trees in species trees
–MORE sequence data
–PCR

The molecular phylogeny problem [Figure: sequences evolving along a tree through time: ACCGCTTA, ACTGCTTA, ACTGCTAA, ACCCCTTA, ACCCCATA] We see the aligned modern-day sequences (…ACCCCTTA…, …ACCCCATA…, …ACTGCTTA…, …ACTGCTAA…) and want to recover the underlying evolutionary tree.

Sequence evolution is modelled as a Markov process. Consider a single edge in a phylogeny, i.e. evolution of a single species, and the evolution of a single DNA base amongst the possible states {A, C, G, T}. The probability of mutating from state i to state j over a length of time t depends only on the current state i and the potential future state j, not on any of the previous history of the sequence, and can be written p_ij(t). [Figure: a single site changing state among A, C, G, T along an edge of duration t]

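Continuous time Markov chains (rows and columns ordered A, C, G, T)

Transition matrix:
\[
M = \begin{pmatrix}
p_{AA} & p_{AC} & p_{AG} & p_{AT} \\
p_{CA} & p_{CC} & p_{CG} & p_{CT} \\
p_{GA} & p_{GC} & p_{GG} & p_{GT} \\
p_{TA} & p_{TC} & p_{TG} & p_{TT}
\end{pmatrix}
\]

Instantaneous rate matrix:
\[
Q = \begin{pmatrix}
-q_{A*} & q_{AC} & q_{AG} & q_{AT} \\
q_{CA} & -q_{C*} & q_{CG} & q_{CT} \\
q_{GA} & q_{GC} & -q_{G*} & q_{GT} \\
q_{TA} & q_{TC} & q_{TG} & -q_{T*}
\end{pmatrix},
\qquad q_{i*} = \sum_{j \neq i} q_{ij}
\]
i.e. rows sum to zero. The two are related by
\[
M = \exp(Qt).
\]
Typically we restrict to stationary, reversible models, with the stationary distribution denoted by π. So πQ = 0, and D(π)Q is symmetric, where D(π) is the diagonal matrix with π on its diagonal.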
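A minimal sketch of the relation M = exp(Qt), assuming the Jukes-Cantor rate matrix with a single off-diagonal rate alpha. The function names (jc_rate_matrix, mat_exp) and the series-plus-squaring scheme are illustrative choices, not something taken from the slides. The result can be compared with the known Jukes-Cantor closed form p_ii(t) = 1/4 + 3/4 exp(-4 alpha t).

```python
import math

def jc_rate_matrix(alpha):
    """Jukes-Cantor Q: every off-diagonal rate is alpha, so each row sums to zero."""
    return [[-3.0 * alpha if i == j else alpha for j in range(4)] for i in range(4)]

def mat_mul(a, b):
    """Plain matrix product for small square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def mat_exp(q, t, squarings=10, terms=12):
    """Approximate exp(Q t): Taylor series on Q t / 2^s, then s repeated squarings."""
    n = len(q)
    scale = t / (2 ** squarings)
    a = [[q[i][j] * scale for j in range(n)] for i in range(n)]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        # term becomes a^k / k!
        term = [[x / k for x in row] for row in mat_mul(term, a)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result

alpha, t = 0.1, 2.0                      # illustrative values only
M = mat_exp(jc_rate_matrix(alpha), t)
closed_form_same = 0.25 + 0.75 * math.exp(-4.0 * alpha * t)
print(M[0][0], closed_form_same)         # the two numbers should agree closely
print([sum(row) for row in M])           # each row of M should sum to 1
```

In practice a library routine for the matrix exponential would normally be used; the hand-written version here only keeps the sketch free of dependencies.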

Models of nucleotide substitution
Jukes Cantor (JC) –All substitutions equally likely –Base frequencies equal
Kimura 2 Parameter (K2P) –Transitions and transversions at different rates –Base frequencies equal
HKY model –Transitions and transversions at different rates –Base frequencies different
General Time Reversible (GTR)
[Diagrams: the six substitution types among A, G, C, T labelled with rates. First diagram: a single rate α on all six. Second diagram: rate β on two of them and α on the other four. Third diagram: six distinct rates α, β, γ, δ, ε, ζ.]
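The models above can be viewed as special cases of one construction. The sketch below assumes the common parameterisation q_ij = s_ij π_j, with symmetric exchangeabilities s_ij; the function names and the dictionary layout are hypothetical. It checks the two properties stated earlier for reversible models: πQ = 0 and D(π)Q symmetric.

```python
BASES = "ACGT"

def gtr_rate_matrix(exch, pi):
    """Build Q with q_ij = s_ij * pi_j for i != j; the diagonal makes rows sum to zero.
    exch maps an alphabetically sorted base pair such as "AG" to its exchangeability."""
    q = [[0.0] * 4 for _ in range(4)]
    for i, a in enumerate(BASES):
        for j, b in enumerate(BASES):
            if i != j:
                q[i][j] = exch["".join(sorted(a + b))] * pi[j]
        q[i][i] = -sum(q[i])  # diagonal is still 0 here, so this sums the off-diagonals
    return q

def hky_exchangeabilities(kappa):
    """Transitions (A<->G, C<->T) get kappa, transversions get 1."""
    return {"AC": 1.0, "AG": kappa, "AT": 1.0, "CG": 1.0, "CT": kappa, "GT": 1.0}

pi = [0.1, 0.2, 0.3, 0.4]                      # illustrative base frequencies (A, C, G, T)
Q = gtr_rate_matrix(hky_exchangeabilities(2.0), pi)

# Stationarity: each entry of pi Q should be (numerically) zero.
print([sum(pi[i] * Q[i][j] for i in range(4)) for j in range(4)])
# Reversibility: pi_i q_ij should equal pi_j q_ji.
print(all(abs(pi[i] * Q[i][j] - pi[j] * Q[j][i]) < 1e-12 for i in range(4) for j in range(4)))
```

Under this parameterisation, equal exchangeabilities with equal base frequencies give JC, kappa different from 1 with equal frequencies gives K2P, and six free exchangeabilities give GTR.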

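Models define probability distributions on site patterns. The model θ consists of: the tree topology, edge weights, Q matrix*, and root distribution π. (* More generally, this could be a set of Q matrices.)
[Figure: rooted tree with leaves 1, 2, 3. Internal node x joins leaves 1 and 2 through edges carrying M_1 and M_2; the root y joins x (edge M_12) and leaf 3 (edge M_3). Edge weights t_1, t_2, t_3, t_12.]
\[
M_e = \exp(Q t_e), \qquad
p_{ijk} = \sum_{x,y} M_1(x,i)\, M_2(x,j)\, M_{12}(y,x)\, M_3(y,k)\, \pi(y)
\]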
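As a rough illustration of the site-pattern formula, the sketch below evaluates p_ijk on the three-leaf tree using Jukes-Cantor transition matrices in closed form. The edge lengths and function names are assumptions chosen for the example. Summing over all 64 patterns should give 1, which is a quick check that the model really defines a probability distribution on site patterns.

```python
import math
from itertools import product

def jc_transition(t, alpha=1.0 / 3.0):
    """Closed-form JC transition matrix; alpha = 1/3 makes t expected substitutions per site."""
    e = math.exp(-4.0 * alpha * t)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return [[same if i == j else diff for j in range(4)] for i in range(4)]

def pattern_probability(i, j, k, m1, m2, m12, m3, pi):
    """p_ijk = sum over x, y of M1(x,i) M2(x,j) M12(y,x) M3(y,k) pi(y)."""
    return sum(
        m1[x][i] * m2[x][j] * m12[y][x] * m3[y][k] * pi[y]
        for x in range(4)
        for y in range(4)
    )

pi = [0.25] * 4                                   # JC root distribution
m1, m2 = jc_transition(0.1), jc_transition(0.1)   # illustrative edge lengths
m12, m3 = jc_transition(0.05), jc_transition(0.2)

total = sum(
    pattern_probability(i, j, k, m1, m2, m12, m3, pi)
    for i, j, k in product(range(4), repeat=3)
)
print(total)  # should be 1 up to rounding
```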

Tree estimation using maximum likelihood. For a given set of parameters θ we can calculate the probability of any particular site pattern. The overall probability of an alignment is then taken to be the product of the probabilities for each site (i.i.d. assumption). This is the likelihood function, i.e. the probability of the data given the model. We can then use optimisation techniques to find the model parameters (tree topology, edge lengths, parameters of the substitution model) that maximise the likelihood.
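The smallest instance of this idea has two sequences and one parameter, the distance d between them. The sketch below assumes the Jukes-Cantor model, multiplies site probabilities (adds their logs) and maximises over d with a simple ternary search. The search routine and function names are illustrative assumptions, and the two sequences are taken from the modern-day sequences on the earlier slide. The numerical optimum can be compared with the standard JC distance formula.

```python
import math

def jc_pair_log_likelihood(d, n_same, n_diff):
    """Log-likelihood of two aligned sequences under JC, sites treated as i.i.d."""
    e = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * e          # probability a site is unchanged over distance d
    p_diff = 0.25 - 0.25 * e          # probability of one specific different base
    return n_same * math.log(0.25 * p_same) + n_diff * math.log(0.25 * p_diff)

def ml_distance(seq1, seq2, lo=1e-9, hi=5.0, iterations=200):
    """Maximise the log-likelihood over d by ternary search (it has a single peak in d)."""
    n_diff = sum(a != b for a, b in zip(seq1, seq2))
    n_same = len(seq1) - n_diff
    for _ in range(iterations):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if jc_pair_log_likelihood(m1, n_same, n_diff) < jc_pair_log_likelihood(m2, n_same, n_diff):
            lo = m1
        else:
            hi = m2
    return 0.5 * (lo + hi)

def jc_distance_closed_form(seq1, seq2):
    """Standard JC distance: d = -3/4 ln(1 - 4p/3), p = proportion of differing sites."""
    p = sum(a != b for a, b in zip(seq1, seq2)) / len(seq1)
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

s1, s2 = "ACCCCTTA", "ACTGCTAA"
print(ml_distance(s1, s2), jc_distance_closed_form(s1, s2))  # should agree closely
```

Full tree estimation repeats this kind of optimisation over all edge lengths and substitution parameters, and over candidate topologies, which is where the computational cost comes from.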

Extra features of sequence evolution that can be modelled: –Site-to-site rate variation (usually modelled by a gamma distribution) –Invariant sites BUT some parts of reality are problematic… –Base composition bias –The set of sites that are free to vary changes across the tree –Non-independence of sites

Likelihood versus parsimony (the Felsenstein Zone) Prior to the introduction of ML to the phylogenetics community by Joe Felsenstein, Maximum Parsimony (MP) was the most widely used method for estimating phylogenetic trees. MP chooses the tree that requires the fewest mutations to explain the data. [Figure: example trees on taxa A, B, C, D with character states A and G at the nodes]
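One standard way to count the fewest mutations a tree needs is Fitch's algorithm. The slides do not say which algorithm is meant, so treat this as an illustrative sketch. Trees are written as nested pairs and the function names are hypothetical. The alignment is the four-taxon toy alignment that appears in the bootstrap illustration in this deck.

```python
def fitch_score(tree, states):
    """Minimum number of changes for one site on a rooted binary tree.
    tree is a leaf name (str) or a pair (left, right); states maps leaf name -> base."""
    def visit(node):
        if isinstance(node, str):
            return {states[node]}, 0
        (left_set, left_cost), (right_set, right_cost) = visit(node[0]), visit(node[1])
        common = left_set & right_set
        if common:
            return common, left_cost + right_cost
        # No shared state: take the union and count one extra change.
        return left_set | right_set, left_cost + right_cost + 1
    return visit(tree)[1]

def parsimony_score(tree, alignment):
    """Sum the per-site Fitch scores over all columns of the alignment."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch_score(tree, {name: seq[site] for name, seq in alignment.items()})
        for site in range(length)
    )

alignment = {"a": "ATATAAA", "b": "ATTATAA", "c": "TAAAATA", "d": "TATAAAT"}
for tree in [(("a", "b"), ("c", "d")), (("a", "c"), ("b", "d")), (("a", "d"), ("b", "c"))]:
    print(tree, parsimony_score(tree, alignment))
# MP would pick ((a,b),(c,d)), which needs 8 changes against 9 and 10 for the others.
```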

Likelihood versus parsimony (the Felsenstein Zone) The MP criterion has been shown to be statistically inconsistent on some trees under the models of nucleotide substitution discussed previously. Likelihood is statistically consistent (given the correct model). Felsenstein (1978); Hendy & Penny (1989)

Assessing confidence It is not just of interest to get a point estimate of the phylogenetic tree. We would also like some measure of confidence in our point estimate. – Is our tree likely to change if we get more data? – How robust is our result to sampling error? The bootstrap is a useful tool for answering these sorts of questions.

The bootstrap (Felsenstein 1985) For each bootstrap sample: –Create a new alignment (of the same length as the original) by resampling the columns of the observed alignment with replacement –Construct a tree for the ‘bootstrap’ alignment. The bootstrap support for each edge is the proportion of bootstrap trees in which that edge appears.

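Observed alignment (column indices shown in the header row):
     1234567
  a  ATATAAA
  b  ATTATAA
  c  TAAAATA
  d  TATAAAT

Four bootstrap alignments (header row gives the resampled column indices):
     1224567        1334567        1234567        1244567
  a  ATTTAAA     a  AAATAAA     a  ATATAAA     a  ATTTAAA
  b  ATTATAA     b  ATTATAA     b  ATTATAA     b  ATAATAA
  c  TAAAATA     c  TAAAATA     c  TAAAATA     c  TAAAATA
  d  TAAAAAT     d  TTTAAAT     d  TATAAAT     d  TAAAAAT

[Figure: the tree built from each of the four bootstrap alignments, and a summary tree on a, b, c, d with bootstrap support 0.75 on its internal edge]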
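The resampling procedure can be sketched as follows, using the four-taxon alignment from the illustration. The tree-building step here is a deliberately crude stand-in (an assumption, not the method used in the talk): it picks whichever of the three possible splits is supported by the most site patterns, and returns None on a tie. All function names are hypothetical.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Resample columns with replacement, keeping the original alignment length."""
    length = len(next(iter(alignment.values())))
    cols = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[c] for c in cols) for name, seq in alignment.items()}

def best_quartet_split(alignment):
    """Stand-in tree builder: the split backed by most informative sites, None if tied."""
    counts = {"ab|cd": 0, "ac|bd": 0, "ad|bc": 0}
    for w, x, y, z in zip(*(alignment[k] for k in "abcd")):
        if w == x and y == z and w != y:
            counts["ab|cd"] += 1
        elif w == y and x == z and w != x:
            counts["ac|bd"] += 1
        elif w == z and x == y and w != x:
            counts["ad|bc"] += 1
    top = max(counts.values())
    winners = [split for split, n in counts.items() if n == top]
    return winners[0] if len(winners) == 1 else None

def bootstrap_support(alignment, replicates=1000, seed=0):
    """Proportion of bootstrap replicates in which each split is chosen."""
    rng = random.Random(seed)
    tally = {}
    for _ in range(replicates):
        split = best_quartet_split(bootstrap_alignment(alignment, rng))
        tally[split] = tally.get(split, 0) + 1
    return {split: n / replicates for split, n in tally.items()}

alignment = {"a": "ATATAAA", "b": "ATTATAA", "c": "TAAAATA", "d": "TATAAAT"}
print(bootstrap_support(alignment))
```

With only seven columns the support values are noisy, which is the point of the exercise: the bootstrap measures how much the estimate depends on which sites happened to be sampled.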

Example where the bootstrap is useful. Simulate data on the four-taxon tree below (JC model). Use sequence lengths of 100, 1000, and 10000.

  Sequence length     100      1000     10000
  ((a,b),(c,d))       5.7%     97%      100%
  ((a,c),(b,d))       42.8%    <5%      0
  ((a,d),(b,c))       49.8%    <5%      0

[Figure: four-taxon tree on a, b, c, d with branch lengths 0.2 and 0.01]

Example where it is not so useful. Simulate data on the two four-taxon trees below (JC model) in the proportion 55%, 45% and concatenate the sequences. Use total sequence lengths of 100, 1000, and 10000.

  Sequence length     100      1000     10000
  ((a,b),(c,d))       64%      80%      98%
  ((a,c),(b,d))       33%      20%      <5
  ((a,d),(b,c))       3%       0%       <5

[Figure: two four-taxon trees with branch lengths 0.1 and 0.05, leaves drawn in the order a, b, c, d (55% of the data) and a, c, b, d (45% of the data)]

Genome-scale phylogeny Data sets with many concatenated genes –Rokas et al, Nature 2003 (106 genes, 8 taxa) –Goremykin et al, MBE 2004 (61 genes, 14 taxa) Estimated trees have very high bootstrap support. BUT... trees are sensitive to: model used, method used, data-coding.

Case study: The Amborella Wars

Angiosperms; grasses; a New Caledonian shrub

NJ bootstrap with ML distances using a GTR + gamma model [Plot: bootstrap support (0 to 100) against alpha, the gamma shape parameter (0.1 to 0.75, running from skewed rates to equal rates); labels: Amb+Nym, Grasses, Amb, Nym]

Sensitivity to model choice Phylogenomic datasets may involve hundreds of genes for many species. These data sets create challenges for current phylogenetic methods, as different genes have different functions and hence evolve under different processes. One question is how best to model this heterogeneity to give reliable phylogenetic estimates of the species tree.

Example: Rokas et al. (2003) produced 106 gene trees for 8 yeast taxa: C. albicans, S. kluyveri, S. kudriavzevii, S. bayanus, S. cerevisiae, S. paradoxus, S. mikatae, S. castellii

Two extremes: how many parameters do we need to adequately represent the branches of all (unrooted) gene trees? Between 13 (a single consensus tree; an unrooted binary tree on 8 taxa has 2 × 8 − 3 = 13 branches) and 13 × 106 = 1378 (separate branch lengths for every gene). Too few parameters introduce bias. Too many parameters increase the variance.

Stochastic partitioning Attempts to cluster genes into classes that have evolved in a similar fashion. Each class is allowed its own set of parameters (e.g. branch lengths or model of nucleotide substitution)

Algorithm overview
1. Randomly assign the n genes to k classes.
2. Optimise parameters for each class.
3. Compute the posterior probability for each gene with the parameters from each class.
4. Move each gene into the class for which it has the highest posterior probability.
5. Go to step 2; when no genes change class, STOP.
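The loop above can be sketched on a toy stand-in model. This is an assumption made purely for illustration: each gene is summarised by a count of changed sites out of a total, each class has a single binomial parameter, and classes have equal prior weight so that the highest posterior is the highest likelihood. The real method optimises tree and substitution-model parameters per class. All names are hypothetical.

```python
import math
import random

def log_binom_lik(k, n, p):
    """Log-likelihood of k changed sites out of n at per-site probability p (constant dropped)."""
    p = min(max(p, 1e-9), 1.0 - 1e-9)
    return k * math.log(p) + (n - k) * math.log(1.0 - p)

def stochastic_partition(genes, k_classes, seed=0, max_rounds=100):
    """genes: list of (changed_sites, total_sites). Returns (assignment, class_params)."""
    rng = random.Random(seed)
    assign = [rng.randrange(k_classes) for _ in genes]            # step 1: random start
    params = [0.5] * k_classes
    for _ in range(max_rounds):
        for c in range(k_classes):                                 # step 2: fit each class
            members = [g for g, a in zip(genes, assign) if a == c]
            if members:  # an empty class keeps its previous parameter
                params[c] = sum(k for k, _ in members) / sum(n for _, n in members)
        new_assign = [                                             # steps 3-4: reassign genes
            max(range(k_classes), key=lambda c: log_binom_lik(k, n, params[c]))
            for k, n in genes
        ]
        if new_assign == assign:                                   # step 5: stop when stable
            break
        assign = new_assign
    return assign, params

genes = [(4, 100), (5, 100), (6, 100), (39, 100), (40, 100), (42, 100)]  # made-up toy data
print(stochastic_partition(genes, k_classes=2))
```

Because the starting assignment is random, different seeds can in general lead to different partitions; on this well-separated toy data the slow and fast genes end up in separate classes.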

How many classes?

Conclusions regarding stochastic partitioning. Pros –AIC/BIC give a quantitative way to choose how many parameters are needed. –Identifies groups of genes under similar constraints. Cons –Slow –Randomized algorithm, so different starting points lead to different partitions.

Brief Tour… Combinatorics of tree space Graph Theory Stochastic Models, Inference & Probability Theory Algebraic Geometry Lie groups, representation theory ….

Identifiability Matsen and Steel (2007), Figure 2: …the underlying assumption was that mixture model data on one topology can be distinguished from data evolved on an unmixed tree of another topology given enough data and the “correct” method. Here we show that this assumption can be false. For biologists our results imply that, for example, the combined data from two genes whose phylogenetic trees differ only in terms of branch lengths can perfectly fit a tree of a different topology.

Elizabeth Allman John Rhodes Algebraic geometry approach

The boundary of phylogenetics and population genetics Fisher-Wright model Phylogenetic tree

Gene trees in species phylogenies James Degnan, Noah Rosenberg

Representation theory, Lie groups, Markov invariants, closure of model classes Jeremy Sumner Peter Jarvis

http://www.maths.utas.edu.au/phylomania/phylomania2011.htm
